(U03)

www.btinternet.com/~adrian.larner/database/pcl09

PLATOCLAST
ON DATA

Lecture IX
Tables Are Not Essential

 

 

Don’t think I haven’t heard what some of you have been calling this course: “Data Base Made Difficult” indeed. And you’re right in a way, except it’s not G Platoclast that makes it difficult, or – at least – only to the extent that I won’t allow interpretations that let us argue from truth to falsehood. And that’s just what happens in the Join Trap, solely as a result of using the Classical interpretation. That is intolerable, and it rules the Classical interpretation out of court.

 

 

The Classical Interpretation Reconsidered

 

Mr Date thinks, if I understand him aright, that users can somehow manage informally to avoid misinterpretations of the sort that give rise to the Join Trap. But he doesn’t offer any standard interpretation covering both structure and manipulation. I don’t think human users can avoid such misinterpretations, and I’m pretty confident that query front-ends, like inference engines, or natural language systems, will lack the common sense that might – just might – save a human user.

The Join Trap is like a loose thread in a woolly jumper. You give it a little pull, don’t you? And as you pull, the entire jumper unravels, stitch by stitch. And that’s what has happened to the interpretations used in Relational Theory, as I’ve worried at the problem. But anyway, I have now re-knitted the garment: I’ve given you a new interpretation that doesn’t give rise to the Join Trap.

Surprisingly, Dr Codd saw when he wrote his first paper that the trap was a problem. But, for some reason that’s not quite clear, he thought that it wouldn’t occur in a Relational data base.[1] We’ve seen why it does. It’s because, under CI, data values have to be proper names: we would say, proper names under the systemic identity of the system (the data base); Dr Codd would say (I guess) absolutely proper names. But they are not in practice proper names, and even when they are, a minor change to the data base can stop them being proper names.

We have seen three ways of attempting to save the Classical interpretation:

Self-interpretation, which works very nicely (and is always available), but puts the burden of significant interpretation onto the user
 
HI – the Hybrid Interpretation – really a version of our new EI, so the more cynical among you might reckon it gives us the worst of both worlds
 
The re-interpretation needed if we allow additional types.

 

 

Can Abstract Types Be Saved?

 

Now, I’m not quite certain whether I’ve been fair to the proponents of the sets that serve as additional types. I think it’s fair to say that users couldn’t cope with changes like this:

1
A Part Number that was the proper name of a part (let’s say a T-part, where “T” names our theory, our data base)
2
Adding the SUPPLY relation, and distinguishing between (let’s say) U-parts, where each U-part comes from one supplier. T-parts might then be construed as types, or sets, of U-parts (of common Part Number). And then “Part Number 17 weighs 186 grams” would no longer mean “T-part 17 weighs 186 grams” but “Some member (or Each member) of T-part 17 weighs 186 grams” (because a T-part, now construed as a set, is without weight: sets are burdensome, but not literally so).
3
Adding a STORAGE relation, and distinguishing between (let’s say) V-parts, where each V-part is stored in one warehouse. V-parts might then be construed as sets of (say) W-parts, and likewise U-parts as sets of W-parts, where each W-part comes from one supplier and is stored in one warehouse, and all W-parts in a U-part come from one supplier, and all W-parts in a V-part are stored in one warehouse. And T-parts become sets of sets of W-parts and we reinterpret as “Some member of some member of T-part 17 weighs 186 grams”, and we wonder whether T-parts should be sets of U-parts or sets of V-parts (or, back to the former interpretation, sets of W-parts).
Well, it’s obvious that isn’t going to work. But suppose we allowed only what we might call “single-level” sets: no sets of sets. And then we said: construe all names as names of sets. This is what would happen then:
1
A Part Number would be the proper name of a T-part, i.e. of a set of parts (all parts having that part number). “Part Number 17 weighs 186 grams” would, once and for all, be fixed in its meaning as “Some member (or Each member) of T-part 17 weighs 186 grams”.
2
Adding the SUPPLY relation introduces U-parts, each of which is a subset of a T-part: it’s the set of parts having the same Part Number and Supplier.
3
Adding a STORAGE relation introduces V-parts, each of which is a subset of a T-part, containing those parts having the same Part Number and Warehouse. And a W-part is the intersection of a U-part and a V-part.
Now that sounds better, but there are two interesting reasons why it sounds better. First, it is a disguised version of what I’ve proposed (with no additional types and using relative identity): my “is the same part as” (“has the same Part-Number as”) is definable as “is a member of the same T-part as”. All we’ve done is use sets as a façon de parler (because they are single level sets, and therefore the names of the sets need appear only to the right of “is a member or’): it’s just like saying, “Let’s pretend ‘cat’ isn’t a common name borne by each cat, but a proper name of the cat’s club.” So either the single level set approach is the same approach as I have proposed, but in set-speak; or it’s my approach plus the usual set-theoretical excess of machinery.

The second thing that makes single level sets seem plausible is that, in this case, we have a pretty shrewd idea of what the members of these sets are supposed to be: individual, hard little parts. But suppose we find that a part can be now in one warehouse, now in another (but it can’t be a member of two V-parts). This means that our “parts” – the members of T-parts, U-parts, and so on – have to be construed as time-slices of parts, so we really didn’t have a clear idea of what we were talking about. And a similar problem arises when we think back to our person REGISTRATION records: many records represented the same person, so we have to construe a person as a set (of whats?)

So even the single-level set approach says to the user: think of a PERSON, or PART, or SUPPLIER as a set. “A set of what?” asks the user. What do we reply? “Er, I’m not quite certain.”

And let’s hope the user doesn’t also ask for the cardinality of such a set (because that changes – or does it? – when we introduce a finer method of classification). So these sets contain a vague and shifting number of vague and shifty objects, which doesn’t sound much like what we used to call “mathematics”.

In any event, you can be certain of this: those who propose additional types as sets, whether single-level or not, can’t possibly object to my approach. (But you can’t be certain that they won’t!) Because I can define my relative identities, and common names, in the terms that they themselves use. We’ve been here before, haven’t we?[2] Let’s go somewhere else.

 

 

Tables

 

Tables, relations even: do you remember us asking whether, like records, they were essential? Do you remember our criterion of identity of “table”? Here it is (recalled from Lecture I):

T1 is the same table as T2 when T1 contains all and only the rows (records) of T2.
Think how I slipped that past you in your naive youth! What on earth was “contains” supposed to mean? Remember how forming a table in the TT-DD aggregated all the records but confused them: we could identify which records we wanted only by using the predicate that we had originally used to aggregate them. In the senses of “contains” in the TT-DD (such notions as “part”, “normal selection”, and so on), it doesn’t mean much that a table contains a record. The table containing (A:1, B:2) and (A:2, B:3) also contains the normal selections (A:l, B:3) and (A:2, B:2).

Now, you know me. I don’t want any essential structure that I can avoid; and if I invest in tables I’ll expect a massive return (I’m looking for an economical theory). And I’m certainly not going to stand for tables, or anything else, if they don’t have a respectable criterion of identity.

But what of relations, in Dr Codd’s sense? Remember that a relation is supposed to be a set, so it aggregates its members (its records) without confusion. But so what? Granted that a relation containing (A:l, B:2) and (A:2, B:3) doesn’t contain (A:l, B:3) and (A:2, B:2), there is some relation (some set) that does contain them: set theory assures us of that. Indeed, set theory assures us that there is a set containing any records formable from any values held in a relational data base, normal or non-normal, relation or non-relation. So a relation merely being a set doesn’t help. What helps?

 

 

Table Names

 

Well, what helps is that we name particular relations. It’s this one – “EMPLOYEE” say – that we’re interested in; and all and only the records that it contains are employee records. But once again, even putting aside the theoretical disasters of set theory, observe the overhead of mechanism. All we want to do is pick out some records, the employee records. We could say that they were the records each of which was called “EMPLOYEE”, or of each of which a certain predicate held true (“... is an employee record”). But the set theoretician first puts them into a set, then applies the name to the set. I think we’ll just say, shall we, that relations are essential just in this sense: we classify our records (principally by format), and give them names accordingly. We do need to know – this is essential information – not only that such and such a collection of fields constitutes a record (a kept record, say) but that it constitutes an EMPLOYEE record (or a CAR record, or whatever).

But that’s all. We do not, in addition, need to treat the aggregate of all EMPLOYEE records as essential: how can it be? We can always recreate it from our EMPLOYEE records.

And notice that we could still impose the uniform format that distinguishes relations from mere tables. We could say that if any two records had the same name then they would have the same format. (This assumes that we would query records by name.)

 

 

Are Table Names Essential?

 

Now, I used to think that these record type names – EMPLOYEE and the like – were inessential. I’m still a bit tempted to that view. Suppose that each such type of kept record had a distinct format (collection of column names): surely we could say that an EMPLOYEE record was just a kept record with the fields EMPNO, EMPNAME, and so forth. And we can always ensure distinct collections of column names. Suppose both INVOICE and PRO FORMA INVOICE records had all the same columns. We could just change the column INVNO in PRO FORMA INVOICE records to PFINVNO.

You can see, can’t you, why I’m a bit twitchy about admitting record type names: I really do want a record to comprise merely its fields, not its fields and a name like EMPLOYEE. I don’t want to have two records with precisely the same fields but two different names. And yet, I can’t quite see how to avoid it. For consider a very simple EMPLOYEE record, with just EMPNO and EMPNAME. Suppose we projected each column in turn, and then formed their Cartesian product, and stored the result. Obviously the same record would appear as an EMPLOYEE record and as one of our new records. Yet it would mean one thing as an EMPLOYEE record (that an employee with a certain number had a particular name), and another thing as a new record (that an employee had a certain number and an employee – possibly a different one – had a certain name).

Do we want to say that these are different records? Or one and the same record with two interpretations? You see the dilemma. If we choose one path, ruthlessly insisting – for excellent reasons – that a record comprises merely its fields, then we lose the relation name (the record type name), but that name – remember FATHERHOOD – is vital to the user: it summarises, in a word, the predicate that holds the values together; it says what their connection is. And this is no less true under our new Existential interpretation than it was under the Classical interpretation. But if we choose the other path, admitting relation names, then we admit another way of expressing information: not merely values in fields, and fields held together in records, but also names of records.

Imagine that we keep PERSON records. We can mark these as male or female person records by having a SEX column, with values Male and Female. Or we could rename the PERSONID column as MALEPID or FEMALEPID and have two different record formats. And now we are asking for a third mechanism: to tag some of the records “MALEPERSON” and some as “FEMALEPERSON”. Too much mechanism! Too many options!

 

 

Record Identifiers and Table Names

 

Now watch! I am about to pass unscathed between the horns of this dilemma. It is right that a record should comprise its fields and nothing else. To that we must hold fast. Consequently, if a record is to have a name, then that name must be held in, or derivable from, its fields. Moreover, we don’t even know how fine must be the classifications of records that we make by giving them names (or how many names we might wish to give to a record). So I propose to add a record identifier field to every kept record. I will assume, boldly, that there is a predicate defined on record identifiers and record names, such that if, for instance, r is a record identifier, then Names (“EMPLOYEE”, r) is true if r identifies an EMPLOYEE record, and false otherwise. I will also assume – I care not by what magic – that the system will be able to find all records of any given name.

“Typical!”, you say, “He just puts aside the difficult question.” All right, you could do it this way: to every employee record add a record identifier of the form “EMPLOYEE.n” where “n” is some numeral. That should make it easy to find the EMPLOYEE records. Alternatively, different types of form could have different record identifier columns, so that Employee forms had an EMPFORMID column, and House ownership forms an OWNFORMID. But, of course, I don’t insist on either of these methods: I’m sure all those clever DBMS designers could come up with something; that’s what they’re paid for. I’ll tell you what! You could keep all the EMPLOYEE records together in one file.[3]

Making “table names” values (or the results of functions on values) quietens my theoretician’s conscience: one horn avoided – the record still comprises its fields. What of the user? I suspect that, by and large, the user will wish to refer to records by such names as EMPLOYEE, and that is fine: the system can find all such records, and the user knows what they mean (it is summarised in the word, EMPLOYEE). The user will – usually – not wish to see, and will not see, the value of the record identifier (nor the column in which it appears). I have avoided the other horn: you may break the tense hushness with applause. Thank you Judy.

 

 

Two Levels of Record

 

Some of you are not impressed. You have seen the result of my acrobatics: I have generated two levels within my data system. Well, you all knew about the physical level, the conceptual level, and the external levels (or user views) didn’t you? The more levels the better, I say: it enables us to place problems correctly, where they can best be solved. I am going to call my two levels as follows:

The lower level, where every record has a record identifier: the form level. You can see why: it comprises forms, each form with a unique identifier from which its type (and therefore its meaning) can be derived.
 
The higher level, where a record has no record identifier but has a type name (like EMPLOYEE): the concept level.[4]
And now you see why I’ve been keeping open my options on interpretation, rectangular or merely two-dimensional tables, and so on and so forth: we just can’t decide which options are right until we know precisely at which level we are talking. But now we can see at least what the form level gives us: a totally defined, self-interpreted collection of forms. We might write them on paper. We can safely apply our simple, elegant, Classical interpretation to them. They have the absolute minimum of structure: each kept form is a normal collection of fields, and comprises merely those fields.

We have seen problems of interpretation, and we shall see more. But we need now handle these at the concept level alone.

 

 

Table Uniformity a Language Question

 

Let us think briefly about one of those options that I’ve left open: relations (rectangular tables) or merely two-dimensional tables (containing records of different format). Well, whatever their shape, we’re now saying – call it the “Picnic” theory – that, though Dr Codd made a meal of data structure, tables aren’t essential.

So all we need in our data bases are records: we can build our tables as required. And therefore the question of relations or merely two-dimensional tables is not one of how we structure our records into tables (because we don’t). Rather it is a question of how we speak about our records. Do we, in one breath, as it were, speak only of one format of record, or may we speak of records of more than one format: may we speak of (that is, query) any collection of records in our data base?

This move, from a question of structure to a question of language, does not of itself give us an answer to our question. But it does do away with a certain criticism of merely two-dimensional tables: that they are a more complex structure than relations. This they cannot be, for tables no longer constitute a structure: a table of records, as we speak of it, is not a structure at all. As we present a table to a user, it is of course a structure of rows and columns (it is also ordered by row and column); but what it represents is not a structure of records, but merely some records (which are unordered). Indeed, the table strictly represents nothing. Its rows represent records, but the ordering of the rows and columns and their neat rectangular presentation represent nothing whatsoever.

 

 

Connection Between Record Levels

 

Now I have this suspicion: that some of you think you see a self-satisfied smile cross my face. For – you imagine – I now reckon that I’ve answered the question of structure: we want merely normal records, identified as comprising just their fields. And I’ve answered the question of how to interpret those records, and how their interpretations are affected by data manipulation. So now I can stop agonising over interpretation, and get on with something else, integrity perhaps. But you’re wrong. I have found some more interpretation to agonise over.

You see, I’ve distinguisbed two levels – form and concept – and I’ve said that we can use self-interpretation at the form level, perhaps CI, and we can use EI at the concept level. But I haven’t shown how those levels are connected have I? And I haven’t interpreted that connection.

Well, let’s say that we do use CI at the form level. And suppose that an Employee form means:

Form number El23 (an employee form) contains EMPNO 45678 and SURNAME “Smith” and ...
When a user asks for EMPLOYEE records, from this record they get a record interpreted, according to EI – in this case perhaps II – as:
FOR SOME x, x is the same employee as 45678 and x is the same surperson as Smith and ...
Notice that the operation used to get the EMPLOYEE record (the record called “EMPLOYEE” by the user) from the Employee Form is Projection; and under CI a projection gives us an existential quantification; and we have in the EMPLOYEE record just such a quantification: FOR SOME x. But the projection actually gives us the form level interpretation:
FOR SOME y, y is an employee form and y contains EMPNO 45678 and SURNAME “Smith” and ...
So we can see that the user applies an informal interpretation. Of course. The user applies this interpretation; that if there is some employee form containing EMPNO 45678 and SURNAME “Smith” and so on, then there is something (someone) that is the same employee as 45678 and is the same surperson as Smith and so on. And that is just fine: what else would we expect the keeping of such an employee form to imply? And how else but informally would we expect such an implication to hold? We could hardly ask the FOPC to make sure that we didn’t keep an employee form unless it really did pertain to some employee.

So I hope the existential quantifiers that I introduced to get EI don’t look quite as arbitrary as they did. They come – by necessarily informal interpretation – from the existential quantifiers that we knew and loved in the old CI: those generated by projection.

Now I don’t claim that there will always be a straight interpretation from one existential quantification to another. Indeed, I’m pretty sure there won’t. But I think there will be a straightforward interpretation from an existential quantification to one or more (perhaps zero or more?) existential quantifications and possibly a predicate.

For instance, consider a house ownership form interpreted as:

Form number HO987 (a house ownership form) contains OWNERNO 65432 and HOUSEID S7-6YP-6.
On projection (the user asks for the projected view “OWNS”) we get:
FOR SOME z, z is a house ownership form and z contains OWNERNO 65432 and HOUSEID S7-6YP-6.
And the user interprets “FOR SOME z, z is a house ownership form” as:
FOR SOME x, FOR SOME y, x is an owner of y
That is just what we would expect the presence of a house ownership form to mean. The entire interpretation of the OWNS record becomes:
FOR SOME x, FOR SOME y, x is an owner of y and x is the same owner as 65432 and y is the same house as S7-6YP-6.

I have already mentioned how EI – especially when we consider the special cases of SI and II – seems to come close to the EAR theory. But now I want to stress how my proposals run contrary to that theory. Think back to the employee forms. I do not insist that there shall be just one form for each employee. Indeed, if employees (intelligence agents perhaps!) were allowed multiple surnames then there could be multiple forms for one employee. But it doesn’t matter: we can tell when two forms pertain to the same employee, because they have the same EMPNO. Nor do we have to agonise over what peculiar entities such forms might represent: each of them clearly represents an employee, though two of them might represent the same employee. And, of course, we don’t have to worry about whether any key they might have “represents self-identity”: every collection of columns represents some relative identity (e.g. “is the same employee as and has the same surname as”).

You might think that the weak point in this theory is that users have to interpret the existence and characteristics of a form in terms of the “real world”. Really it’s not the users’ responsibility. It’s the data analyst’s responsibility to say: when we keep such and such a form, with such and such data, we do it to show that there is such a kind or kinds of thing and they are related in such away and they are classified in such a way. But surely we have always said what it means to keep a certain sort of record, otherwise it would hardly have been worthwhile to keep any records at all.

 

SITE HOME PAGE

Now at least we have a strict format of interpretation to apply: the Existential interpretation. That limits our choices (good). It limits them to such an extent that on applying our data manipulations we cannot argue from the truths recorded in our data base to falsehoods garnered from the four winds.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture IX (rtf, Word for Windows compatible)

Platoclast on Data: Lecture X

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.