
www.btinternet.com/~adrian.larner/database/newint1

A New Interpretation of Data

A database paper by Adrian Larner

 

Abstract

 

 

The notion of data objects (records) modelling real world objects is examined and rejected in favour of interpretation of data objects as utterances. The weaknesses of the entity/relationship model are exposed, specifically its failure to interpret data manipulations. Next examined is the “classical” interpretation of records (including tuples in relations) as propositions in the language of the first order logic. This is shown to be too restrictive for the data that must be kept, largely because of its constraints on identity of values. A less constrained – relative – identity is proposed, and a new formal interpretation advanced and briefly examined.

 

MOTIVATION

 

 

[Gray] distinguishes classical data models (including the relational model) and semantic data models, claiming that the former are “not suitable candidates for conceptual data modelling.” [Korth] terms the classical models record-based, distinguishing them from object-based models (including the entity-relationship model and the object-oriented model), which correspond to the “semantic”.

The notion of a “model” is not well understood. We do understand how Codd (in [Codd1970]) used the term: in his relational model he used tuples as an abstraction of records; a mathematical object stood as an image of a data object. This served, firstly, to exclude implementation details from the theory; and, secondly, to impose certain desirable characteristics on its objects, as described by any data sublanguage based on his theory. These characteristics were mainly simplifying constraints that excluded, for example, implicit connections between records and repeating groups within records.

By contrast, “models” like the entity-relationship model do not primarily provide an abstraction of data objects. They use data objects - records or some abstraction of them - as themselves models, i.e. representations, of things outside the record-keeping system: entities (or entities and relationships) in the (“real”) world. Indeed, this difference in the object of modelling seems to be what distinguishes classical and semantic (record-based and object-based) models.

Modelling in the sense of [Codd1970] is clearly desirable: it both separates specification from implementation (which is necessary if we are to achieve data independence), and formalises - and simplifies - that specification. We may think of a theory like the relational theory as a model, defining abstract objects that stand for (represent, model) concrete objects (if records be taken as concrete). Or we may think of such a theory as a theory directly about those concrete objects, but framed in such an abstract, such a weak, language as to preclude all talk of implementation details. The distinction is apparently slight: to talk of abstractions of the concrete, or merely to talk abstractly of the concrete.

But the desirability of that sort of modelling does not entail the desirability of the other sort of modelling. The proposal that, for instance, each data object (“record”, say) in a database should, or does, represent some thing (“entity”, for instance) in the world, is a substantial thesis. To adopt this thesis is to claim that a database models the world in the sense of some sort of isomorphism, or - at least - picturing, mapping, imaging, or (in the mathematical sense) projection. There is a much weaker thesis that we might advance: that by understanding the data in a database, we understand something about the world. This weaker thesis is implied by the stronger, and - in any event - is surely true, desirable, and obvious. Why else should we maintain a database? But is the stronger thesis either true or desirable? It is certainly not obvious.

Consider a complex object, like a football match. We can think of many ways of holding information about such an object. We might program a number of tiny robots to re-create, on a miniature artificial pitch, the moves of the players, ball, and referee. This would clearly be a model of the match. We might film the match from above: the film would retain some isomorphism; it would be a mapping of the match, a projection of it; and we might generously call it a “model”. But what of a commentary on the match? a discussion of the match? recorded highlights? the mere score?

If we called all of these representations “models”, then by “model of the match” we would mean merely: something that conveys some information about the match. We would be abandoning the stronger thesis in favour of the weaker thesis, but perversely retaining the word, “model”, now drained of any distinctive meaning and synonymous with “account”. On the other hand, we might grant the title “model” only to the tiny robots, and perhaps to the film, and - with large reservations - to the commentary. And we might then make a claim, according to the stronger thesis, that these models were a more desirable way to record the match than any mere account.

It is difficult to tell whether the proponents of “semantic modelling” or “object-based models” do indeed make this claim about their semantic models. Do they distinguish models of the world from other accounts of the world? If so, do they then claim some general superiority of such models over mere accounts? What constitutes that superiority? And what evidence and argument is there for it? It might plausibly be suspected that they do implicitly make that distinction; that they do - not fully consciously - think that a model is superior to any other kind of account; that they have not worked out why it should be superior; and that they take it for granted.

The strong thesis - that data does or should model reality, rather than merely give an account of reality - is here denied. But to avoid misunderstanding, the following are not denied:

There are occasions on which, deliberately or by chance, data does model reality, i.e. we can contrive or observe that our data does serve as an image or picture of reality. But data that is merely an account of reality is in no way defective. It is an interesting question when data, and for that matter process, should be a model, and when it should not; but that is beyond the scope of this paper.

A number of popular techniques of systems analysis, including data analysis, start from a consideration of the reality in respect of which data is to be held and processed. Often, when using these techniques, our first step in analysis is indeed an attempt to model this reality; and we quite properly practise and teach such modelling as a guideline to analysis. There is nothing wrong, and much right, with that recommendation, as a guideline. But that is all: the idea that the eventual design of our conceptual database must or should be a model is mistaken.

The data that we keep should be interpreted. We hope for (and in relational databases have) a formal data manipulation language, which consequently works uninterpreted - that is the nature of formal systems. Nevertheless, we need an interpretation; we need to be able to convey to a user what is meant by each component (e.g. record) in our database. But what we need to convey is not necessarily a reality of which that component is a model.

In sum, what we need is to treat our data as an account of reality, and to be able to convey to a user the meaning of that account. We do not, in addition, have to make that account an image, or model, of the reality. In this paper we address the question: how then should we interpret data?

 

ENTITY MODELLING

 

 

The commonest interpretation offered is that components of a database represent entities (with variants, such as “entities or relationships between entities”). This interpretation - and it is the interpretation, not any modelling technique, that is under discussion - has been proposed (and widely accepted) for relational databases. Let us call the data component in question a record (in the sense of [Larner], to be strict). We are speaking of what would commonly be called an entity instance.

It is not at all clear, however, what the proponents of this interpretation mean by an “entity”. We may distinguish two senses of the word:

The entities of a theory, let us say theoretic entities, are the things that the theory asserts to exist, i.e. the things that there must be if the theory is to be true. In a first order theory they are the things over which we quantify (over which the variables range), or - accordingly - the things the predicates are true or false of. “Entity” is used in this sense in this paper, except where entities in the other sense are obviously under discussion (and specifically, in the remainder of this section).

An entity, following the etymology of the word, is an existent - something that exists (“in the real world”). Thus [Korth], and in very similar words [Gray], quoting others: An entity is an object that exists ...

The most popular interpretation, as existent, is - unless we stretch the concept of existence beyond all reasonable bounds - patently unsupportable. We have records representing vacancies, surely not things that exist, unless it is enough for some type of thing - say an X - to exist, merely that we sometimes say, “There is an X” (as we say, “There is a vacancy.”) But we can quite properly say, “There is an animal that the ancients spoke of, namely the unicorn; but it does not exist.” So, just because we can say, “There is an X”, it does not mean that Xs exist, or that we say they do. Sometimes, we use “there is” to convey existence; but then, in that sense, we would not say that there is an X that does not exist, or that there is a unicorn or a vacancy.

This is not to say that looking for representable existents is a mistake when we are analysing data, for we do - in various fashions - represent such existents in our data: we have records for persons, cars, buildings, organisations, and so on. This approach is, indeed, a good way to start an analysis; but it provides too meagre a diet of examples to nourish a complete analysis, let alone a comprehensive theory of data: some records do represent existents, and others do not.

At the other extreme from such non-existents as vacancies we have those (to some “modellers”) troublesome records that do not, except in a very strange sense, represent anything, invoices for example. An invoice, the doctrinaire modeller would say, represents a demand for payment. Represents? Nay it is. Imagine a world in which we still had invoices and other paperwork (or magneticwork), but there were not, in addition, any “demands for payment”. There would, as now, be reminders, and eventually summonses, court hearings, and the arrival of bailiffs. Such a world, lacking merely these highly spiritual and refined “demands for payment”, would be indistinguishable from our own. These “demands” are what appear when entity modellers are driven to their last redoubt; they are the mere shadows of the records in databases.

The fact is: some records are also “real-world” entities; they “represent” such entities only if being is taken as a mode of representation. We live in a sophisticated world: our records act directly on our (real world) lives, whether they be diary entries, invoices, cheques, certificates, or magnetic records on credit and debit cards. There is not, and need not be, anything else for them to represent.

The definition of entities as existents requires the analysis of data and, even more, the theory of data, to be an exercise in metaphysics. And definitions like that quoted above are essays in metaphysics, and bad metaphysics at that. Not that the answer is to do good metaphysics: the answer is to eschew metaphysics entirely.

But entity modelling suffers from a far more serious defect. Suppose that we could give an interpretation (as entity or relationship) to each record kept in our database (i.e. to each base, or stored, record). Now we are asked by a user: I join these PERSON records on equality of the values in their Religion columns, and I project their Surname columns; what is the interpretation of the resultant record? I.e. what entity (or relationship) is represented by the result?

The entity model, alas, includes no interpretation of data manipulations, i.e. no way to derive the interpretation of a record formed by a data manipulation, even though the interpretations of the records from which it was formed have been given. Is the two-Surname record an entity? If so, what sort of entity? - a pair of persons (what persons?) Or is it a relationship? - what sort of relationship? and between what entities?

It is, on the face of it, easy enough to interpret such a record: someone called “Smith” and someone called “Robinson” have the same faith. But that is a sentence, an utterance; not an entity, nor a relationship. Can there be a relationship - and would it be a relationship type or a mere instance - not between two specific persons, but merely between “someone called ‘Smith’” and “someone called ‘Robinson’”? Is there a - unique, distinguishable - entity that is merely someone called by a certain name, as opposed to a specific person called by that name?
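The user's manipulation can be sketched in a few lines, assuming a hypothetical PERSON table held as Python dictionaries. The sample rows and the `self_join_on_religion` and `utterance` helpers are illustrative assumptions, not part of any real system. The sketch only shows that each derived record reads naturally as a sentence.

```python
from itertools import product

# Hypothetical PERSON records; column names and values are illustrative only.
PERSON = [
    {"Surname": "Smith", "Religion": "Quaker"},
    {"Surname": "Robinson", "Religion": "Quaker"},
    {"Surname": "Jones", "Religion": "Methodist"},
]

def self_join_on_religion(rows):
    """Join PERSON with itself on equal Religion, then project both Surname columns."""
    return sorted({
        (a["Surname"], b["Surname"])
        for a, b in product(rows, rows)
        if a["Religion"] == b["Religion"]
    })

def utterance(pair):
    """Read one derived record as an English sentence."""
    first, second = pair
    return (f"Someone called '{first}' and someone called '{second}' "
            "have the same faith.")

derived = self_join_on_religion(PERSON)
for record in derived:
    print(utterance(record))
```

The projection has discarded whatever identified the two persons, which is why the reading can say only “someone called”, not which person is meant.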

The entity modeller could perhaps claim that what was represented was a relationship between two surnames. But this is to deny the (no doubt carefully made) distinction between entities and their attributes; and to abandon any idea of representation, for the surnames are not represented by the data; they are components of the data.

We can see that without any interpretation of data manipulations there is no hope of erecting an intelligent front end on an entity-relationship database. If the interpretation of a constructed record (a record resulting from a query) is not calculable from the specified interpretations of records kept in the database, there is no way for a mechanism (or for a user, without informed guesswork) to work out the manipulations that would serve to express a given interpretation.

But, even restricting ourselves to data kept in the database, and to the entity modellers’ preferred examples: what does the analyst do when a user points to a displayed record and asks, “What does it mean?” Let us be strict entity modellers: the analyst sees that the record is an EMPLOYEE record, or a CAR record. To convey to the user what it means, it is necessary to present to the user what it represents. The analyst calmly ushers in the actual employee represented by the record, or goes and fetches the actual car, and points to them. That is, taking it strictly, what the entity interpretation demands.

But, of course, that is not what is done. The analyst tells the user what the record means: “It’s the employee with number 47273” or “It’s the car with registration number F197PDU.” Again, a perfectly adequate interpretation of a record can be given by an utterance. If the record, in some sense, “represents an entity”, the user will be able to understand from the utterance which entity it is. And if the record does not represent an entity, the user will still be able to understand what the utterance means. How else do analysts communicate with users? They talk to them. So why do we not just say: the interpretation of a record consists of utterances in English, or in whatever other language we speak to our users in?

This enables us to avoid the metaphysics entirely. It may be that there arise metaphysical questions: what does reality have to be like for the analyst’s utterance (the interpretation of the record) to be true? how do English utterances latch onto the world? But if anyone is to raise these questions, it is the user. And users, having been told that their company employs a person of such-and-such name, birthdate, salary, and so on, rarely do raise the metaphysical question: what must the world be like if it contains existents with these characteristics? Nor, even, do they worry about whether the world contains existents like vacancies when told that some record keeps the information that their company has a vacancy for a cashier’s clerk.

No: the analyst, having given an English interpretation of a record for a user, need go no further. “Going further” would mean doing something to interpret the record that did not comprise communicating with the user, in English or in some other language or notation translatable into English. The limits on what we can reasonably expect an analyst to do fall short of exhibiting objects in the manner of the sages of Laputa.

 

LANGUAGE INTERPRETATION

 

 

What therefore is proposed is that data components be interpreted as utterances. If we consider an old-fashioned data storage system, we can see that such utterances could be directly recorded. A company might have all its data written in English in some sort of journal, thus: Young Smith, son of our junior partner, today applied for a position as a junior clerk. He seems a likely lad, and I offered him the post, commencing Monday next, at a wage of three shillings and four pence a week. No question what this means (or represents, or models): it means just what it says.

A more modern data processing system need record only the same sort of information. Of course, with electronic systems we can now do much more processing of this kind of data, and we can handle much greater quantities. But there is no need to claim that our records have to have some other sort of meaning.

It could, of course, be (and often is) asserted that use of such English sentences is horribly vague and ambiguous. It is true that unrestricted use of a natural language can give rise to unmanageable vagueness and ambiguity, but the steps that we can take to avoid these problems are well known. They may both be called, in different senses, formalisation:

  • For a long time now, certainly before the advent of electronic data processing, the utterances permitted in data stores were constrained to those written on forms. If a particular kind of data was to be recorded - a bill, a job application, an expenses claim - a form would be devised for that purpose. But, of course, it is perfectly possible to translate any such form into a (perhaps extended) utterance. And that utterance is no more vague or ambiguous than the form whose interpretation it is. A record in a database, such as a tuple in a relational database, is a tightly constrained electronic sample of such a form.
  • Because we wish to manipulate records, and to derive the interpretations of the resultant records, it is necessary for us to read our records as (in a different sense) a formal language. As we shall see, one formal language that has been used is that of the first order logic or first order predicate calculus (FOPC). A statement in this language can be translated into (or, we might say, merely read as) an English utterance, but an utterance so constrained as not to suffer from the unmanageable vagueness and ambiguity of unrestricted English.

 

Formal Logic Interpretation

 

 

The use of a logic-based formal language has an important advantage. Assume that we construe (i.e. interpret) each record kept in our database as a proposition - a sentence that is either true or false, but not both. (Usually, although not invariably, we would hope that only true data would be kept, and we will hereafter make this assumption.) If we now have a data manipulation language (like, for instance, the relational domain calculus) whose operations are effectively operations of the logic, we would then hope that each of those operations was an implication. When a proposition is derived by an implication we are guaranteed that - if the proposition(s) from which it was derived be true - the derived proposition is true. It would clearly be a disaster if a user or intelligent front end could derive a falsehood from the truths kept in our database.

The safe derivation of truths, of course, depends not only on the manipulation operations being implications but also on proper interpretations being made of the kept (i.e. base) records. (The formal logic works uninterpreted. That is what makes it formal. So it cannot help us with - it is blind to - the interpretations we choose.) But safe derivation should not depend on extra interpretations being made of the derived records: their interpretations should follow, by the logic, from the interpretations of the kept records.
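One way to picture this requirement is to attach a reading (an open sentence) to each kept record type and let each manipulation transform the reading along with the data. The sketch below assumes the usual correspondence of join with conjunction and of projection with existential quantification. The `Reading` class, its methods, and the EMPLOYEE and DEPARTMENT readings are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    """An open sentence together with its free variables (the columns)."""
    columns: tuple
    sentence: str

    def join(self, other):
        # Natural join is read as conjunction; shared columns are shared variables.
        extra = tuple(c for c in other.columns if c not in self.columns)
        return Reading(self.columns + extra,
                       f"({self.sentence}) and ({other.sentence})")

    def project(self, keep):
        # Projection is read as existential quantification over the dropped columns.
        dropped = [c for c in self.columns if c not in keep]
        kept = tuple(c for c in self.columns if c in keep)
        if not dropped:
            return Reading(kept, self.sentence)
        prefix = "for some " + ", ".join(dropped)
        return Reading(kept, f"{prefix}: {self.sentence}")

# Hypothetical readings supplied by the analyst for two kept record types.
EMPLOYEE = Reading(("e", "d"), "employee e works in department d")
DEPARTMENT = Reading(("d", "c"), "department d is located in city c")

derived = EMPLOYEE.join(DEPARTMENT).project(("e", "c"))
print(derived.columns)
print(derived.sentence)
```

Only the two base readings are supplied by hand; the reading of the derived record is computed from them, with no extra interpretation needed.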

It should perhaps be remarked that in proposing a formal logic we do not necessarily propose a special notation. For some purposes, special notations are very useful (both formally and informally); but (given the same interpretation) “∀” and “For each”, for example, are equally formal. It may at times be convenient (and at others inconvenient) to abbreviate “EMPLOYEE” to “E”, but neither the expanded nor the abbreviated form is one whit less or more formal than the other. Consequently, we can translate - perhaps we should say merely “expand” - any formal notation we use into rather rigid (because clear and unambiguous) English, where that is helpful.

 

Abstraction

 

 

Yet one more advantage of a formal language is that we can so constrain it (keep it so weak) as to prevent expression of implementation details. We would hope to make no reference in our language to details of access, sequence, or data representation (to avoid compromising data independence). This is where the modelling in Codd’s sense, the abstraction, comes in. The approach taken here, however, differs slightly (but for our purposes not very significantly) from Codd’s. Rather than speaking of abstract objects, tuples, instead of records, and treating the abstract as a model of the concrete, we speak (in the manner of [Larner]) directly of records, but we speak of them abstractly, by a means now to be explained.

Our approach to abstraction, based on [Geach], is the proposal that to grasp a concept of the kind that would be expressed by a count noun (e.g. “employee” or “snowflake” in contrast to mass nouns like “staff” or “snow”) it is necessary to grasp its criterion of application (what has to be true of something, x, if x is an employee) and its criterion of identity (what has to be true of x and y if x is the same employee as y). To put it in a useful mnemonic: the criterion of application tells us when we have got one; and the criterion of identity when we have got one (rather than two).

To take a common example, we may clarify some of the vagueness of meaning of “book” by giving a criterion of application, ruling out, for instance, marginal cases like graphic novels and magazines. But then we need (for data processing purposes) to avoid ambiguity by giving a criterion of identity: what are we to mean by “the same book”? We might say that we count x as the same book as y when x is the same numbered edition of the same title by the same author(s) as y. Now two copies of the same edition (even if one is a hardback and the other is a paperback) count as the same book. Criteria of identity are often ill-defined or undefined in natural languages.

Notice that this approach allows us to have two concepts (e.g. “copy” and “edition”) that have the same criterion of application, but different criteria of identity. As it happens, when doing data analysis by drawing entity diagrams (recall that this is not here deprecated) it is useful as entities are introduced to state their criteria of application and identity (they are sometimes obvious). Connected (linked, related) entities often differ in both criteria. When one entity is a classification of another, e.g. MODEL and CAR, they have the same criterion of application but different criteria of identity. When one entity is a subtype or specialisation of another, e.g. EMPLOYEE and PERSON, they have different criteria of application but the same criterion of identity.
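The difference between “copy” and “edition” can be sketched as one criterion of application paired with two different criteria of identity. Everything in the block (the shelf contents, the field names, and the `count` helper) is a hypothetical illustration, not a prescribed representation.

```python
# Hypothetical items on a shelf; the fields are illustrative only.
shelf = [
    {"title": "Emma", "author": "Austen", "edition": 1, "binding": "hardback", "kind": "book"},
    {"title": "Emma", "author": "Austen", "edition": 1, "binding": "paperback", "kind": "book"},
    {"title": "Emma", "author": "Austen", "edition": 2, "binding": "paperback", "kind": "book"},
    {"title": "Vogue", "author": None, "edition": 7, "binding": "paperback", "kind": "magazine"},
]

def is_book(x):
    """Criterion of application: tells us when we have got one."""
    return x["kind"] == "book"

def same_copy(x, y):
    """Criterion of identity for 'copy': the very same physical item."""
    return x is y

def same_edition(x, y):
    """Criterion of identity for 'edition': same numbered edition, title and author."""
    return ((x["title"], x["author"], x["edition"])
            == (y["title"], y["author"], y["edition"]))

def count(items, applies, same):
    """Count the items the concept applies to, treating 'same' items as one."""
    representatives = []
    for item in filter(applies, items):
        if not any(same(item, seen) for seen in representatives):
            representatives.append(item)
    return len(representatives)

copies = count(shelf, is_book, same_copy)
editions = count(shelf, is_book, same_edition)
print(copies, editions)
```

The same three items are counted both times; only the answer to “one or two?” changes with the criterion of identity.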

It will be observed that this use of criteria of application and identity allows the construal of “types”, “kinds”, or “sorts” without set-theoretical abstractions. Thus, if we consider individual dogs, we have a criterion of application (let us assume) and a criterion of identity: is the same dog as. We construe breeds of dogs as entities with the same criterion of application (of course, to own Fido, which is a dachshund, is to own a dachshund), but a different criterion of identity: is the same breed as. We could, of course, have other criteria of identity, such as: is the same species as.

Thus species and breeds are not extra entities; they are not additional to individual dogs. The individual dogs are, in a very straightforward sense, the only things there are (in a universe of dogs): “breed” and “species” are merely other ways of classifying dogs (and “colour of dog” and even “individual dog” are simply ways of classifying dogs as well). Contrast this with a set-theoretical approach to individual dogs, breeds, and the species. We would have to say something like: a breed is a set of individual dogs. Then we would have to decide whether to say that the species was a set of breeds (and therefore a set of sets of individual dogs) or a set of individual dogs (and therefore having breeds as subsets rather than members).

The use of different criteria of identity for classification therefore has these advantages:

  • It relieves us from these arbitrary choices of construal (species as set of individuals or as set of sets of individuals).
  • It does not require the postulation of extra, additional, abstract entities (i.e. sets, and specifically equivalence classes).
  • It enables us to add classifications to a theory without even having to consider massive re-construals. If we have a species merely in the sense of an animal or animals under a broad criterion of identity, on introduction of “breed” we do not have to consider redefining “species”. But if we had a species as a set of animals, we would have to consider whether to redefine a species as a set of breeds, and so make it a set of sets of animals. It is clearly an advantage to an analyst in interpreting data (and to a user, when new record types have been added to their database) not to have even to consider re-interpretation of the already present data.

It will be appreciated how sets serve in abstraction. By talking of the set of dogs that form a breed, rather than individual dogs, the theorist may “abstract away” individual characteristics that are of no concern (for the purposes in hand). But, of course, exactly the same effect is achieved by speaking of a breed not as a set but as something with the same criterion of application as “individual dog” (i.e. as a dog), and a broader criterion of identity.

This distinction was adumbrated above: between speaking of abstractions of concrete things (set-theoretical breeds that “model” individual dogs), in contrast to speaking abstractly (under a broad criterion of identity) of concrete things (dogs). And this is the distinction between Codd’s approach, i.e. ordered tuples - sets of attribute/values pairs - modelling records, and the approach in [Larner], where the theory pertains directly to records and a number of criteria of identity are used (“same record”, “same format”, “same value”).

It should be noted that the use of criteria of identity requires less theoretical apparatus (and is accordingly weaker) than the set-theoretical approach. It is demonstrably no stronger, because the criteria of identity can easily be defined in the set-theoretical terms (“to be the same breed as” is definable as: “to be a member of the same breed as”).
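The definability claim can be illustrated with a minimal sketch, assuming a hypothetical set-theoretical construal in which each breed is a set of individual dogs. The names in `breeds` are invented for the example.

```python
# Hypothetical set-theoretical construal: each breed is a set of individual dogs.
breeds = {
    "dachshund": {"Fido", "Waldi"},
    "collie": {"Lassie"},
}

def same_breed(x, y):
    """x is the same breed as y: both are members of one and the same breed set."""
    return any(x in members and y in members for members in breeds.values())

print(same_breed("Fido", "Waldi"))
print(same_breed("Fido", "Lassie"))
print(same_breed("Tom", "Tom"))
```

The last call concerns something that belongs to no breed set (a hypothetical cat, say): it is not the same breed as anything, itself included.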

It is sometimes useful to employ what we might term a “full” or “defining” criterion of identity: one on which the criterion of application can be defined. If the criterion of identity, “is the same breed as”, were used for cats and dogs, the concept “breed of dogs” would require both the criterion of application (“is a dog”) and that of identity (“is the same breed as”) in order to define it completely. But a criterion of identity like “is the same individual dog as”, or “is the same person as”, can be used to define its associated criterion of application: to be a dog (to be an individual dog) is to be the same individual dog as something; to be a person is to be the same person as something (someone).

It will be appreciated that criteria of identity are relative identities, i.e. dyadic predicates that are symmetric and transitive, and therefore reflexive in their field. That is, using “I” as such an identity:

∀x ∀y (xIy → yIx)

∀x ∀y ∀z ((xIy ∧ yIz) → xIz)

∀x ((∃y yIx) → xIx)

Such identities are not, in general, totally reflexive. We do not always have:

∀x xIx

Obviously, not everything is the same person as itself (no dog is), nor the same individual dog as itself (no person is). Notice that we may rephrase “x is the same person as something”, “∃y yIx”, as “x is the same person as x”, “xIx”.
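Reflexivity in the field follows in two steps: if y is the same as x, symmetry gives that x is the same as y, and transitivity applied to that pair then gives that x is the same as x. The properties can also be checked mechanically over a small finite universe. The following is a sketch under stated assumptions: the universe, the `breed_of` table, and the function names are hypothetical.

```python
from itertools import product

def is_symmetric(universe, same):
    return all(same(y, x) for x, y in product(universe, repeat=2) if same(x, y))

def is_transitive(universe, same):
    return all(same(x, z)
               for x, y, z in product(universe, repeat=3)
               if same(x, y) and same(y, z))

def is_reflexive_in_field(universe, same):
    # Every x that something is the same as must be the same as itself.
    return all(same(x, x)
               for x in universe
               if any(same(y, x) for y in universe))

def is_totally_reflexive(universe, same):
    return all(same(x, x) for x in universe)

# Hypothetical universe: three dogs and one person.
breed_of = {"Fido": "dachshund", "Waldi": "dachshund", "Lassie": "collie"}
universe = ["Fido", "Waldi", "Lassie", "Smith"]

def same_breed(x, y):
    """Relative identity I: x is the same breed as y."""
    return x in breed_of and y in breed_of and breed_of[x] == breed_of[y]

print(is_symmetric(universe, same_breed))
print(is_transitive(universe, same_breed))
print(is_reflexive_in_field(universe, same_breed))
print(is_totally_reflexive(universe, same_breed))
```

Total reflexivity fails only because the universe contains something, the person, to which “is the same breed as” does not apply at all.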

 

 


 

Copyright © 1994, 2001 Adrian Larner. The author asserts all moral rights.
