
www.btinternet.com/~adrian.larner/database/pcl06

PLATOCLAST
ON DATA

Lecture VI
Entities and Identities

 

 

Here beginneth the twenty-first verse of the second chapter of the first Epistle General of St. Date:

Base relations correspond to entities in the real world ...
 
...entities in the real world are distinguishable ...they have a unique identification of some kind.
 
Primary keys perform the unique identification function in the relational model.
 
...a primary key value that was null would be a contradiction in terms ... it would be saying that there was some entity that had no identity ...[1]

 

 

The Entity Interpretation

 

We shall take Mr Date to be saying that each record kept in a data base corresponds to something he calls “an entity” and locates “in the real world”. Somehow I don’t think the vague map reference is going to help very much. If I tell you I’ve mislaid the entity my mother gave me for Christmas, you might help me to find it, though not greatly, by telling me you’d seen it somewhere on the campus. But saying that it’s somewhere in the real world isn’t going to narrow my search significantly.

And yet, for all that, I’m not certain that Mr Date has cast his net wide enough to catch all the entities he wants. I hasten to add that this isn’t how Mr Date naturally speaks, using mediaeval dog-Latin expressions like “entity”: someone must have got at him.

We’ve looked at an interpretation of normal records, the Classical interpretation of them as propositions of the FOPC, that is, an ordinary – but very constrained – sort of true-or-false sentence. You’ll remember that these propositions are formed from non-intentional, Shakespearean predicates by inserting proper names in their places. And any deviation from those constraints wrecks our logic and takes us from truth to falsity: a path not to be travelled, a burden not to be borne.

On the other hand, we’ve seen, firstly, that we can always save the Classical interpretation by retreat to self-interpretation; and secondly that it gives us a very simple way of understanding restrictions, projections, and joins. It allows us, straightforwardly, given the interpretation of some records, to work out the interpretations of their restrictions, projections, and joins. And it guarantees that if we’ve stuck to the rules, and the records we start with are interpreted as true propositions, then the records we get also have true interpretations. As I’ve hinted, I have my doubts about the Classical interpretation, but it certainly makes sense. The interpretation of normal records we’re now going to consider is much, much more common in the data base literature – indeed it has spawned a whole, rather dodgy, model of its own – but I’m not at all certain that it makes sense. It says, as Mr Date explains (or at least repeats), that a record in a data base corresponds to, or represents, an “entity”; in the “real” world.

 

 

What are Entities?

 

What is an entity? I’ll give you a number of definitions.

1
It’s something that exists (that’s what the Latin means).
 
So to say that elephants are entities is to say that elephants exist, and to say that centaurs are not entities – are non-entities – is to say that centaurs don’t exist. That doesn’t mean: there are centaurs, but they don’t go in for existing. It means: there aren’t any centaurs.
 
And I say, for instance, that sets are not entities: there aren’t any sets. And predicates aren’t entities either, they are true or false of entities. Indeed, I say – though I don’t ask you to believe me – that anything that exists, any entity, is material: it’s made of stuff. And anything that isn’t stuff is nonsense – or at least, non-entity. That’s my ontology – my answer to the question, “What is there?”
2
The entities of a theory are the things it says exist: the things there have to be if the theory is to be true.
 
They’re the things, in a first order theory, that the variables range over; the things the predicates are true or false of.[2] In our first order theory – the TT-DD – the only entities are records.
3
Entities are – this is what a lot of data analysts say – the things that the enterprise is interested in.
 
The “enterprise” is the business whose data they’re analysing. And these are the things they want their records to represent. But what do you notice about the places in these predicates?
The enterprise is interested in ...
This record represents ...
Yes! They’re intentional, aren’t they. The enterprise could be interested in, or the record could represent, something that doesn’t exist.
 
The best data analyst I know – I’ll call him “William” for no better reason than that “William” is his name – he says that a better word is “thing”, in one of its main dictionary senses: an object of thought.[3] And William may be wise in this, as he is in much else. But surely, of all places, that in:
I am thinking of ...
is as intentional as you could possibly get.

 

 

Discovering Entities in the “Real World”

 

A quick word on data analysts (the overpaid ones call themselves “data architects”): they’re the people that do the “logical” design of data bases, and leave the real work to humble designers and programmers. And they’re always talking about “entities”, and some of them, I guess, accept this supposedly alternative data model that’s called the EAR: the Entity/Attribute/Relationship model. But I’m not going to talk about “attributes” and “relationships” today.

Sometimes – and that’s an understatement – they seem to say that you can analyse data, that is, design data bases, by discovering the entities in “the real world”. But they end up with some funny entities: not only their favourite example, PERSON, but DEMAND FOR PAYMENT (that’s what an Invoice record represents), EVENT, or VACANCY (Vacancy forsooth!). And you know they didn’t find any vacancies in the real world. Well, I suppose they found vacancies for data analysts, more’s the pity.

Now, of course, this claim about discovering their entities in the real world is not totally false. Their entities are – being charitable – William’s “things”, objects of thought. And some of our objects of thought really are entities – existents – and some aren’t. In idle moments I think of the fame that will come to me from these lectures, or sometimes of Judy. So both of these are objects of my thought: the fame is merely an object of my thought, but Judy both actually exists and is an object of my attentions ...thought.

Personally, I am quite happy to do analysis by spotting real, sense 1, entities: but I have a robust sense of reality. I’ll accept as an entity anything that I can kick, anything made of stuff. And I’ll accept an entity like a DEMAND FOR PAYMENT, but it’s not represented by an Invoice record: it jolly well is an Invoice record, and it’s just a technological limitation that we print out Invoices – demands for payment – and send them through the mail. Anyway, it makes no difference in my definition: the printed demand for payment is the same record as the magnetically encoded Invoice record. You see: we sometimes do want self-interpretation, and we should get away with it when we can (we would be mad to abandon its security). The sort of record that cries out to be interpreted in some other way – the analysts’ favourite PERSON record – is an embarrassment, as favourite persons often are.

Now, I may be wrong in my ontology – perhaps there are all sorts of non-material entities (in sense 1): sets, attributes, infallible intuitions, sakes even. But at least I place some bounds on what I’ll accept as an entity. And anyway, I don’t care if one of my records represents many entities, or many of my records represent one and the same entity. These data analysts put no bounds on what they’ll accept as entities: that’s why they mistake their “things” – their objects of thought – for sense 1 entities. But we mustn’t waste any more time on their silly theories. William doesn’t take his things to be other than objects of thought, and he’s a very sound analyst: how does he do it?

 

 

Objections to “Entities”

 

I’m not quite certain, but I have two suggestions. Firstly, the activity of data analysis is not an activity of discovery, or not entirely. We have to consider what there is (the entities, sense 1), what we want to say about what there is, how people talk in the enterprise, what happens and what might happen. But we also have to decide how we can talk about all this, what records we might have. That is, we have to decide what objects of thought to have; we discover some, and choose whether to represent them, but we invent or propose or postulate the rest.

Secondly, when we object that something is not an entity we might object on three (at least) quite different grounds. We might object scientifically or historically – factually say – as I object to unicorns and centaurs. If there were such things, they would be perfectly sensible material sorts of things. Indeed, I rather wish these strange creatures did exist. But they don’t. They just happen not to. We might object metaphysically: we might say, as I say, “There aren’t any vacancies, there aren’t any ghosts, there aren’t any virtues.” (Of course, I’m not saying that the predicate “is virtuous” applies to no-one, though it doesn’t apply to many people.) These things – unlike unicorns and centaurs – don’t qualify (by my rules) for the club of existents.

But these two sorts of objection are not very powerful: you might think there are such things.

We could, on the other hand (is that three hands?) object to a putative entity on quite other grounds, logical grounds. My objection to sets is not merely that they are not made of stuff. I would object to them even if they were. My objection is that “set” is an incoherent notion; it’s a nonsense; and likewise four-sided triangles and the null object (the one that’s not there, the one that a predicate is true of when it’s true of nothing).

 

 

Entity and Identity

 

So I want to ask: ignoring factual and metaphysical prejudices, what does an object of thought, one of William’s “things”, have to do to avoid the logical objections. And the main qualifications are: to have decent criteria of application and identity. If you claim that there are (there exist) infallible intuitions, I think your claim is totally mistaken, but to maintain it – to stop it leading you into inconsistency, to keep your story coherent – you will need only to be able (however spuriously) to say what has to be true of something if it is an infallible intuition, and what makes it the case that this is the very same infallible intuition as that. Then you’ll be able to stonewall my metaphysical abuse indefinitely (as, indeed, William does).

And this sort of consideration is, perhaps, what makes Mr Date say that “entities ... are distinguishable ... they have a unique identification of some kind”; and to claim that it would be “a contradiction in terms” to say “that there was some entity that had no identity”.

I think you can see why, however factually or metaphysically mistaken you might be, you could defend – you could coherently use – a data base whose records you interpreted as representing some very strange things. As long as you had an alternative interpretation that was clearly sound – self-interpretation springs to mind, each “INTUITION” record representing an Infallible Intuition Form (itself) – as long as you had that, you couldn’t be caught in a contradiction.

You might ask: but couldn’t I have a record called “SET” and representing itself? And of course you could – I make no objection to the word “set”. You could have a record called “SQUARE CIRCLE” if you wanted. You just couldn’t have a record, called “SET” or anything else, having the properties that sets, or square circles, are supposed to have.

 

 

Identity

 

I’m not quite certain about this, but I suspect that Mr Date and I have quite different ideas about identity. So although I say that any logically respectable entity has to have a criterion of identity, and he says that each entity has to have identity, we are really much further apart than you might think. There are two quite different senses of “identify” and “identification”. One is the sense in which a registration number identifies a car: it enables us to pick out that car. In the same way a name identifies a person. Such an identifier may be unique or non-unique within any given scope. In this sense, Mr Date is clearly wrong when he says that entities have a unique identification of some kind.

Think about unmarked ping-pong balls in a fountain: you may point to one – that one right at the top just now – but you will be hard put to pick it out again: it has no unique identification. Likewise we may know that a remote object is a double star – two stars rotating around each other – without knowing which is which. And I’m offering a prize for anyone who can tag one of the electrons in a helium atom, and thus distinguish it from the other. Not that we have to go so theoretical: think of a record showing a stock of a thousand pencils. We have PART DESCRIPTION: HB Pencil, COUNT: 1000. Obviously they are different pencils (otherwise the COUNT would be 1). Yet we don’t distinguish them; we don’t give each a unique identifier; and we don’t need to.

The second sense of “identification” concerns whether this is the same so-and-so as that. It is this sense which leads us to say: we have a thousand pencils. Each of them is not the same pencil as any of the others, whether or not we tag each with a name. Of course, these two senses of “identification” are connected: the second is a prerequisite of the first. We would be hard put to tag a pencil, say P1, with some pencil-name, and not to tag P2 with the same pencil-name, if P1 was the same pencil as P2.

Notice that this observation alone throws some doubt on Mr Date’s claim that we need primary keys to “perform the unique identification function”. Is he saying that primary keys are necessary, or sufficient, or both? to pick out (identify, sense 1) or distinguish (identify, sense 2)? data base records or the entities they are supposed to represent?

 

 

Absolute Identity

 

But that ambiguity over identity is – as I said – less important than the second difference between us over identity (sense 2). Mr Date – and, even more, Dr Codd – think that there is an identity predicate, “is the same as”, or “is the same thing as”, or “=” (equals), and that this predicate holds between each thing and itself, but not between anything and anything else. Well, I’m not certain whether that formulation is correct – or even coherent – but, if it is coherent, I think it’s wrong. And against me I have all mathematicians, and most logicians, and every data analyst and data base theorist I’ve ever met. I’ll say that they believe in absolute identity: they think that “x is the same as y” makes sense, and I don’t. I think that it is unacceptably ambiguous, unless, rather trivially, the person who says, “x is the same as y”, explains – or the context clearly shows – that they mean “the same person”, or “the same pencil”, or whatever.

So I think that when Mr Date says that “primary keys perform the unique identification function” he means that, for instance, the PARTNO in a PART record works like this:

If PART record R1 has the same PARTNO as PART record R2 then R1 is the same (absolutely) as R2.
 
If PART P1 has the same PARTNO as (or, equivalently, is represented by the same PART record as) PART P2 then P1 is the same (absolutely) as P2.
Now we know that the second of these explanations simply fails in the case of the join trap, because there we saw that two distinguishable things – the one supplied by one supplier, the other by another – nevertheless had the same PARTNO. I suspect that Mr Date would want the amendment:
If PART P1 has the same PARTNO as (or, equivalently, is represented by the same PART record as) PART P2 then the part-type of P1 is the same (absolutely) as the part-type of P2.
And this would mean that Mr Date’s ontology encompasses part-types in addition to parts (“part instances”, he would say, I expect). As you might guess, my ontology does not encompass what for brevity I’ll call “additional types”. Naturally, I allow the existence of part-types. Each part is a part-type, because it’s the same part-type as something, and it’s kickable. But these additional types are, of course, abstract objects: you can’t write with a kind of pencil, or ride a sort of bicycle, or drive a type of car, in Mr Date’s sense. You must grasp this distinction between types in my sense and Mr Date’s, between concrete or non-additional types and abstract or additional types. And here’s a test: what would happen if I took Mr Date round to Judy’s pad and she answered the door wearing literally nothing but a type of kaftan? My equanimity would be undisturbed; you’ll have to imagine his reaction.

Now, I certainly don’t wish to argue about the need for some sort of record identifier to show whether R1 is or is not the same record as R2, and, indeed, whether R1 is or is not the same record-type as R2, though it does seem to me that that is a question more of implementation than anything else. I mean, I could imagine a user wanting to know how many employees there were in a department, or whether these records pertained to the same employee. But I would guess that no user cares very much how many records are used to represent the employees in a department, or whether (and in what sense?) these records pertained to the same record of some type.

Naturally, if we could show that there was some good reason always to have exactly one stored record of some type for each – what shall I say? – entity of some type, then we could translate between “number of records” and “number of entities”, or “same record” and “same entity” of those types. But we haven’t got anywhere near showing the need for such a one-to-one correspondence, and nor has anyone else.

And we won’t get anywhere near it by this – spoken or unspoken – assumption of absolute identity. Indeed, that assumption will stop us getting there. Does my company have just one EMPLOYEE record corresponding to me? I expect so. But, if so, that record corresponds to me at the beginning of this lecture, and to me at the end of it: by no means absolutely identical, for I have aged in the interval; visibly, I dare say. Indeed, person P at time T0 just cannot be absolutely identical to person P at the later time T1: for the latter has fewer hairs, but longer whiskers, than the former – in my case.

 

 

Relative Identity

 

But what do I offer instead of absolute identity? Two things: the first is relative identity, the sort of identity expressed by “is the same such-and-such as”, where “such-and-such” is a count noun (e.g. “armadillo”; I’m ignoring mass nouns like “dust”: you can count how many armadillos you’ve got, but not how many dusts).[4]

We can allow any number of these identity predicates, say xIy, which merely need to conform to these rules:

FOR EACH x, FOR EACH y, xIy IFF yIx (Symmetry)
 
FOR EACH x, FOR EACH y, FOR EACH z, IF xIy AND yIz THEN xIz (Transitivity)
and, if they do conform to these rules then they will conform to:
FOR EACH x, IF FOR SOME y, xIy THEN xIx (Reflexivity)
Contrast this with the rule needed for absolute identity:
FOR EACH x, xIx (Total Reflexivity)
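The difference between the two reflexivity rules is easy to miss, so here is a minimal sketch in Python that checks all four rules on a small finite relation. The universe and the relation `same_armadillo` are invented purely for illustration; nothing here is part of the TT-DD.

```python
from itertools import product

def is_symmetric(rel):
    # FOR EACH x, FOR EACH y, xIy IFF yIx
    return all((y, x) in rel for (x, y) in rel)

def is_transitive(rel):
    # FOR EACH x, y, z: IF xIy AND yIz THEN xIz
    return all((x, z) in rel
               for (x, y) in rel
               for (y2, z) in rel if y == y2)

def is_reflexive(rel):
    # FOR EACH x, IF FOR SOME y, xIy THEN xIx
    return all((x, x) in rel for (x, _) in rel)

def is_totally_reflexive(rel, universe):
    # FOR EACH x, xIx
    return all((x, x) in rel for x in universe)

# Hypothetical universe: two sightings of one armadillo, and some dust.
universe = {"armadillo_monday", "armadillo_tuesday", "dust"}

# "is the same armadillo as" holds only among the armadillo sightings.
same_armadillo = set(product(["armadillo_monday", "armadillo_tuesday"], repeat=2))

print(is_symmetric(same_armadillo))                    # True
print(is_transitive(same_armadillo))                   # True
print(is_reflexive(same_armadillo))                    # True
print(is_totally_reflexive(same_armadillo, universe))  # False
```

The dust is not the same armadillo as anything, itself included, so the relation satisfies Symmetry, Transitivity, and Reflexivity while failing Total Reflexivity.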

 

 

Systemic Identity

 

The second thing I offer is, I will call it, systemic identity, because it depends on having a system: a theory. Suppose we have a theory, couched in some language, such as the TT-DD, then, if xIy is the systemic identity predicate, it holds true when all and only the predicates true of x are true of y.

Relative to the theory, within the theory, the systemic predicate works just like absolute identity. But it doesn’t have the meaning of absolute identity. Indeed, in a theory based on the FOPC with a finite number of primitive predicates, it is possible to define the systemic identity: and easy to show that it does not amount to absolute identity.[5] Consider a theory about character strings with the sole primitive predicate:

x IS THE CONCATENATION OF y AND z
And let this (which we will abbreviate to xCyz) be true when x, after uppercase translation, is the same sequence of characters as z-appended-to-y, after uppercase translation.

The systemic identity, xIy, is definable as:

FOR EACH u, FOR EACH v, (xCuv IFF yCuv) AND (uCxv IFF uCyv) AND (uCvx IFF uCvy)
And, under this definition, “A” is the same as “a”. But clearly, if absolute identity there be, “A” is not the same as “a”, and we could add a predicate to our theory to distinguish them (e.g. “contains a lower case letter”). We would get a different theory, with a different systemic identity predicate: it would contain the identity predicate xIy as a non-systemic identity (but still a relative identity – symmetric, transitive, and reflexive).
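The definition can be tried out mechanically. The following Python sketch assumes a small, hand-picked universe of strings (the theory itself would quantify over all strings, so this is only an approximation), and the function names are my own.

```python
from itertools import product

def C(x, y, z):
    # xCyz: x is the concatenation of y and z, after uppercase translation
    return x.upper() == (y + z).upper()

def systemic_same(x, y, universe):
    # FOR EACH u, FOR EACH v,
    # (xCuv IFF yCuv) AND (uCxv IFF uCyv) AND (uCvx IFF uCvy)
    return all(
        C(x, u, v) == C(y, u, v)
        and C(u, x, v) == C(u, y, v)
        and C(u, v, x) == C(u, v, y)
        for u, v in product(universe, repeat=2)
    )

# Assumed finite universe, chosen only to make the check terminate.
UNIVERSE = ["", "A", "a", "B", "AB", "ab", "aB"]

def contains_lower(s):
    # The extra predicate suggested in the text: "contains a lower case letter"
    return any(ch.islower() for ch in s)

print(systemic_same("A", "a", UNIVERSE))            # True
print(systemic_same("A", "B", UNIVERSE))            # False
print(contains_lower("A"), contains_lower("a"))     # False True
```

With concatenation as the only primitive predicate, nothing in the theory can tell "A" from "a"; once `contains_lower` is admitted as a predicate, they come apart.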

But notice that assuming systemic, rather than absolute, identity, doesn’t save Mr Date’s overbold prescription of primary keys. “P1 is the same (systemically) as P2” doesn’t follow from “P1 is a part with the same PARTNO as P2” because the system itself (the data base) contains the predicate “Supplier S supplies ...”, and this can be true of P1 and false of P2, even though they are the same part.
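A toy illustration of that last point, in Python. The part numbers, supplier codes, and the list of predicates are all hypothetical; the sketch assumes only that the data base can express both "has PARTNO ..." and "Supplier S supplies ...".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Part:
    partno: str
    supplier: str

# Two distinguishable things sharing one PARTNO (invented values).
p1 = Part(partno="P100", supplier="S1")
p2 = Part(partno="P100", supplier="S2")

# The predicates this toy "system" can express.
predicates = [
    lambda p: p.partno == "P100",    # has PARTNO P100
    lambda p: p.supplier == "S1",    # Supplier S1 supplies ...
    lambda p: p.supplier == "S2",    # Supplier S2 supplies ...
]

def same_partno(a, b):
    return a.partno == b.partno

def systemically_same(a, b):
    # True when every predicate of the system agrees on a and b
    return all(pred(a) == pred(b) for pred in predicates)

print(same_partno(p1, p2))        # True
print(systemically_same(p1, p2))  # False
```

Sharing a PARTNO is one thing; agreeing on every predicate the system contains is another.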

 

 

A Persistent Identity

 

But something very strange happens if we accept relative identity. Suppose we have a – let us say – REGISTRATION table, and its primary key is PERSONNO. Other columns might be SURNAME, BIRTHDATE, and so on. And what we say of its records is that if R1 has the same PERSONNO as R2 then R1 represents (is a registration of) the same person as R2. And suppose we find that some persons have managed to register twice, or more times: they have given different SURNAMEs, BIRTHDATEs, etc., and we have allocated them different PERSONNOs.

We could choose, when we find two registrations for the same person, to make their PERSONNOs the same value. Of course, we would have to have a different primary key: we could use a combination of columns (say, all columns), or we could add a registration number column as primary key. But what would it now mean if R1 had the same PERSONNO as R2? It would mean that R1 represented the same person as R2.
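The REGISTRATION case can be sketched in a few lines of Python. The column `regno` stands for the added registration number, and every value below is made up for the purpose; the sketch assumes the duplicate registrations have already been given one PERSONNO.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Registration:
    regno: int       # hypothetical added column, now the primary key
    personno: int    # no longer unique after the merge
    surname: str
    birthdate: str

registrations = [
    Registration(1, 501, "Smith", "1960-01-01"),
    Registration(2, 501, "Smythe", "1961-02-03"),  # same person, registered twice
    Registration(3, 502, "Jones", "1970-05-05"),
]

def is_key(rows, column):
    # A column can serve as a key only if no value repeats
    values = [getattr(r, column) for r in rows]
    return len(values) == len(set(values))

def same_person(r1, r2):
    # The meaning of PERSONNO, whether or not it is the primary key
    return r1.personno == r2.personno

r1, r2, r3 = registrations
print(is_key(registrations, "regno"))     # True
print(is_key(registrations, "personno"))  # False
print(same_person(r1, r2))                # True
print(same_person(r1, r3))                # False
```

PERSONNO has stopped being a key, yet `same_person` reads exactly as it did before.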

How odd! The meaning of the column, the identification as the same person, appears to be independent of whether or not it is a primary key. And yet, though the foundation of relational data integrity – the primary key notion – seems to shake and tremble, I begin to feel a little more secure. I’m no longer certain that my company does keep one and only one EMPLOYEE record that represents me: perhaps they keep lots of EMPLOYMENT records, some of which represent me. But if, of those records, R1 contains my EMPLOYEE NUMBER and R2 contains my EMPLOYEE NUMBER, then R1 and R2 represent the same person: me. Not just me at the start of this lecture, nor just me – as I now am – at the end of this lecture, but simply me: anything whatsoever that is the same person as your very own, Geoffrey Platoclast.

 

 

Does Each Record Represent an Entity?

 

 

(The following is Platoclast’s reply to the question: If the entity interpretation is as flawed as you claim, what gave it the plausibility that it must have if so many analysts accept it?)

Now look, I’m not a psychologist nor – perish the thought – a sociologist; though I’m just enough of a Moral Philosopher to know that, of all vices, folly is the commonest and does the most damage. But this is my guess – and it’s only a guess.

Systems analysis, and specifically data analysis, is very difficult. It’s so difficult that a goodly number of people can make a rich living even out of doing it badly. And others can make an even richer living out of inventing methods to do it (and probably calling them “methodologies”); or worse: they write software that automates those methods.

When you’re doing analysis, you have to invent some records; but what do you have to go on? The obvious answer is: the world; the “real world”, as they say. And, of course, any record you invent (unless you’re quite, quite mad) will have some relationship with something, say an entity, in the world.

But there are now two dangerous ways to carry forward that true, but not very promising, observation: each record has some relationship to an entity. One is to select an inadequate diet of examples; to say, “Yes, a PERSON record is related to a person; a CAR record is related to a car; and so on.” That is, you first ignore the tricky cases; and then, later, you’ll shoehorn them into the simple pattern.

The second danger is that, I guess quite unconsciously, you rephrase the observation: there is some relationship that each record has with some entity. It’s what we call a “quantifier shift”, and in more concrete cases you can see the fallacy. I might grant you that each person is the child of some mother; but this doesn’t follow: there is some mother that each person is child of.
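The quantifier shift can be seen failing on a tiny example. This Python sketch uses an invented family (the names are placeholders) and simply evaluates the two readings.

```python
# Hypothetical data: each child mapped to his or her mother.
mother_of = {"Ann": "Mary", "Bob": "Mary", "Cy": "Sue"}

people = set(mother_of)
mothers = set(mother_of.values())

# Each person is the child of some mother.
each_has_some_mother = all(
    any(mother_of[p] == m for m in mothers) for p in people
)

# There is some mother that each person is the child of.
some_mother_of_all = any(
    all(mother_of[p] == m for p in people) for m in mothers
)

print(each_has_some_mother)  # True
print(some_mother_of_all)    # False
```

Swapping `all` and `any` is all the shift amounts to, and the first reading can be true while the second is false.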

Of course, in the case of the relationships between records and entities, the quantifier that shifts is one that is applied to a predicate, so we’re – informally – using a second order logic. And perhaps that makes the shift more plausible. But, however plausible it is, it’s a fallacy. But, once we’ve made the shift, once we think that there is some relationship between each record and an entity, and especially if we’ve also swallowed that inadequate diet of examples, it’s very tempting to say: and this relationship we call, “represents”.

Then, with a firm grasp of higher normalisation (we’ll get to that), and a shaky hold on the concept of identity, we try to ensure that the representation is one-to-one: that each record we invent represents just one entity; and each entity we’re interested in is represented by one record. Once you’ve swallowed that yarn, you can spin out wonderfully complex theories with attributes, relationships, and goodness only knows what else.

 

See why I’m taking a long time on the foundations? But don’t hold me to this theory of why the entity interpretation is plausible, though false: I’m not an expert on stupidity.


 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.