|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl09 |
|
PLATOCLAST Lecture IX |
||
|
|
||
|
|
Don’t think I haven’t heard what some of you have been calling this course: “Data Base Made Difficult” indeed. And you’re right in a way, except it’s not G Platoclast that makes it difficult, or at least only to the extent that I won’t allow interpretations that let us argue from truth to falsehood. And that’s just what happens in the Join Trap, solely as a result of using the Classical interpretation. That is intolerable, and it rules the Classical interpretation out of court. |
|
|
|
||
|
|
The Classical Interpretation Reconsidered |
|
|
|
Mr Date thinks, if I understand him aright, that users can somehow manage informally to avoid misinterpretations of the sort that give rise to the Join Trap. But he doesn’t offer any standard interpretation covering both structure and manipulation. I don’t think human users can avoid such misinterpretations, and I’m pretty confident that query front-ends, like inference engines, or natural language systems, will lack the common sense that might just might save a human user. The Join Trap is like a loose thread in a woolly jumper. You give it a little pull, don’t you? And as you pull, the entire jumper unravels, stitch by stitch. And that’s what has happened to the interpretations used in Relational Theory, as I’ve worried at the problem. But anyway, I have now re-knitted the garment: I’ve given you a new interpretation that doesn’t give rise to the Join Trap. Surprisingly, Dr Codd saw when he wrote his first paper that the trap was a problem. But, for some reason that’s not quite clear, he thought that it wouldn’t occur in a Relational data base.[1] We’ve seen why it does. It’s because, under CI, data values have to be proper names: we would say, proper names under the systemic identity of the system (the data base); Dr Codd would say (I guess) absolutely proper names. But they are not in practice proper names, and even when they are, a minor change to the data base can stop them being proper names. We have seen three ways of attempting to save the Classical interpretation:
|
|
|
|
||
|
|
Can Abstract Types Be Saved? |
|
|
|
Now, I’m not quite certain whether I’ve been fair to the proponents of the sets that serve as additional types. I think it’s fair to say that users couldn’t cope with changes like this:
The second thing that makes single level sets seem plausible is that, in this case, we have a pretty shrewd idea of what the members of these sets are supposed to be: individual, hard little parts. But suppose we find that a part can be now in one warehouse, now in another (but it can’t be a member of two V-parts). This means that our “parts” the members of T-parts, U-parts, and so on have to be construed as time-slices of parts, so we really didn’t have a clear idea of what we were talking about. And a similar problem arises when we think back to our person REGISTRATION records: many records represented the same person, so we have to construe a person as a set (of whats?) So even the single-level set approach says to the user: think of a PERSON, or PART, or SUPPLIER as a set. “A set of what?” asks the user. What do we reply? “Er, I’m not quite certain.” And let’s hope the user doesn’t also ask for the cardinality of such a set (because that changes or does it? when we introduce a finer method of classification). So these sets contain a vague and shifting number of vague and shifty objects, which doesn’t sound much like what we used to call “mathematics”. In any event, you can be certain of this: those who propose additional types as sets, whether single-level or not, can’t possibly object to my approach. (But you can’t be certain that they won’t!) Because I can define my relative identities, and common names, in the terms that they themselves use. We’ve been here before, haven’t we?[2] Let’s go somewhere else. |
|
|
|
||
|
|
Tables |
|
|
|
Tables, relations even: do you remember us asking whether, like records, they were essential? Do you remember our criterion of identity of “table”? Here it is (recalled from Lecture I):
But what of relations, in Dr Codd’s sense? Remember that a relation is supposed to be a set, so it aggregates its members (its records) without confusion. But so what? Granted that a relation containing (A:l, B:2) and (A:2, B:3) doesn’t contain (A:l, B:3) and (A:2, B:2), there is some relation (some set) that does contain them: set theory assures us of that. Indeed, set theory assures us that there is a set containing any records formable from any values held in a relational data base, normal or non-normal, relation or non-relation. So a relation merely being a set doesn’t help. What helps? |
|
|
|
||
|
|
Table Names |
|
|
|
Well, what helps is that we name particular relations. It’s this one “EMPLOYEE” say that we’re interested in; and all and only the records that it contains are employee records. But once again, even putting aside the theoretical disasters of set theory, observe the overhead of mechanism. All we want to do is pick out some records, the employee records. We could say that they were the records each of which was called “EMPLOYEE”, or of each of which a certain predicate held true (“... is an employee record”). But the set theoretician first puts them into a set, then applies the name to the set. I think well just say, shall we, that relations are essential just in this sense: we classify our records (principally by format), and give them names accordingly. We do need to know this is essential information not only that such and such a collection of fields constitutes a record (a kept record, say) but that it constitutes an EMPLOYEE record (or a CAR record, or whatever). But that’s all. We do not, in addition, need to treat the aggregate of all EMPLOYEE records as essential: how can it be? We can always recreate it from our EMPLOYEE records. And notice that we could still impose the uniform format that distinguishes relations from mere tables. We could say that if any two records had the same name then they would have the same format. (This assumes that we would query records by name.) |
|
|
|
||
|
|
Are Table Names Essential? |
|
|
|
Now, I used to think that these record type names EMPLOYEE and the like were inessential. I’m still a bit tempted to that view. Suppose that each such type of kept record had a distinct format (collection of column names): surely we could say that an EMPLOYEE record was just a kept record with the fields EMPNO, EMPNAME, and so forth. And we can always ensure distinct collections of column names. Suppose both INVOICE and PRO FORMA INVOICE records had all the same columns. We could just change the column INVNO in PRO FORMA INVOICE records to PFINVNO. You can see, can’t you, why I’m a bit twitchy about admitting record type names: I really do want a record to comprise merely its fields, not its fields and a name like EMPLOYEE. I don’t want to have two records with precisely the same fields but two different names. And yet, I can’t quite see how to avoid it. For consider a very simple EMPLOYEE record, with just EMPNO and EMPNAME. Suppose we projected each column in turn, and then formed their Cartesian product, and stored the result. Obviously the same record would appear as an EMPLOYEE record and as one of our new records. Yet it would mean one thing as an EMPLOYEE record (that an employee with a certain number had a particular name), and another thing as a new record (that an employee had a certain number and an employee possibly a different one had a certain name). Do we want to say that these are different records? Or one and the same record with two interpretations? You see the dilemma. If we choose one path, ruthlessly insisting for excellent reasons that a record comprises merely its fields, then we lose the relation name (the record type name), but that name remember FATHERHOOD is vital to the user: it summarises, in a word, the predicate that holds the values together; it says what their connection is. And this is no less true under our new Existential interpretation than it was under the Classical interpretation. But if we choose the other path, admitting relation names, then we admit another way of expressing information: not merely values in fields, and fields held together in records, but also names of records. Imagine that we keep PERSON records. We can mark these as male or female person records by having a SEX column, with values Male and Female. Or we could rename the PERSONID column as MALEPID or FEMALEPID and have two different record formats. And now we are asking for a third mechanism: to tag some of the records “MALEPERSON” and some as “FEMALEPERSON”. Too much mechanism! Too many options! |
|
|
|
||
|
|
Record Identifiers and Table Names |
|
|
|
Now watch! I am about to pass unscathed between the horns of this dilemma. It is right that a record should comprise its fields and nothing else. To that we must hold fast. Consequently, if a record is to have a name, then that name must be held in, or derivable from, its fields. Moreover, we don’t even know how fine must be the classifications of records that we make by giving them names (or how many names we might wish to give to a record). So I propose to add a record identifier field to every kept record. I will assume, boldly, that there is a predicate defined on record identifiers and record names, such that if, for instance, r is a record identifier, then Names (“EMPLOYEE”, r) is true if r identifies an EMPLOYEE record, and false otherwise. I will also assume I care not by what magic that the system will be able to find all records of any given name. “Typical!”, you say, “He just puts aside the difficult question.” All right, you could do it this way: to every employee record add a record identifier of the form “EMPLOYEE.n” where “n” is some numeral. That should make it easy to find the EMPLOYEE records. Alternatively, different types of form could have different record identifier columns, so that Employee forms had an EMPFORMID column, and House ownership forms an OWNFORMID. But, of course, I don’t insist on either of these methods: I’m sure all those clever DBMS designers could come up with something; that’s what they’re paid for. I’ll tell you what! You could keep all the EMPLOYEE records together in one file.[3] Making “table names” values (or the results of functions on values) quietens my theoretician’s conscience: one horn avoided the record still comprises its fields. What of the user? I suspect that, by and large, the user will wish to refer to records by such names as EMPLOYEE, and that is fine: the system can find all such records, and the user knows what they mean (it is summarised in the word, EMPLOYEE). The user will usually not wish to see, and will not see, the value of the record identifier (nor the column in which it appears). I have avoided the other horn: you may break the tense hushness with applause. Thank you Judy. |
|
|
|
||
|
|
Two Levels of Record |
|
|
|
Some of you are not impressed. You have seen the result of my acrobatics: I have generated two levels within my data system. Well, you all knew about the physical level, the conceptual level, and the external levels (or user views) didn’t you? The more levels the better, I say: it enables us to place problems correctly, where they can best be solved. I am going to call my two levels as follows:
We have seen problems of interpretation, and we shall see more. But we need now handle these at the concept level alone. |
|
|
|
||
|
|
Table Uniformity a Language Question |
|
|
|
Let us think briefly about one of those options that I’ve left open: relations (rectangular tables) or merely two-dimensional tables (containing records of different format). Well, whatever their shape, we’re now saying call it the “Picnic” theory that, though Dr Codd made a meal of data structure, tables aren’t essential. So all we need in our data bases are records: we can build our tables as required. And therefore the question of relations or merely two-dimensional tables is not one of how we structure our records into tables (because we don’t). Rather it is a question of how we speak about our records. Do we, in one breath, as it were, speak only of one format of record, or may we speak of records of more than one format: may we speak of (that is, query) any collection of records in our data base? This move, from a question of structure to a question of language, does not of itself give us an answer to our question. But it does do away with a certain criticism of merely two-dimensional tables: that they are a more complex structure than relations. This they cannot be, for tables no longer constitute a structure: a table of records, as we speak of it, is not a structure at all. As we present a table to a user, it is of course a structure of rows and columns (it is also ordered by row and column); but what it represents is not a structure of records, but merely some records (which are unordered). Indeed, the table strictly represents nothing. Its rows represent records, but the ordering of the rows and columns and their neat rectangular presentation represent nothing whatsoever. |
|
|
|
||
|
|
Connection Between Record Levels |
|
|
|
Now I have this suspicion: that some of you think you see a self-satisfied smile cross my face. For you imagine I now reckon that I’ve answered the question of structure: we want merely normal records, identified as comprising just their fields. And I’ve answered the question of how to interpret those records, and how their interpretations are affected by data manipulation. So now I can stop agonising over interpretation, and get on with something else, integrity perhaps. But you’re wrong. I have found some more interpretation to agonise over. You see, I’ve distinguisbed two levels form and concept and I’ve said that we can use self-interpretation at the form level, perhaps CI, and we can use EI at the concept level. But I haven’t shown how those levels are connected have I? And I haven’t interpreted that connection. Well, let’s say that we do use CI at the form level. And suppose that an Employee form means:
So I hope the existential quantifiers that I introduced to get EI don’t look quite as arbitrary as they did. They come by necessarily informal interpretation from the existential quantifiers that we knew and loved in the old CI: those generated by projection. Now I don’t claim that there will always be a straight interpretation from one existential quantification to another. Indeed, I’m pretty sure there won’t. But I think there will be a straightforward interpretation from an existential quantification to one or more (perhaps zero or more?) existential quantifications and possibly a predicate. For instance, consider a house ownership form interpreted as:
I have already mentioned how EI especially when we consider the special cases of SI and II seems to come close to the EAR theory. But now I want to stress how my proposals run contrary to that theory. Think back to the employee forms. I do not insist that there shall be just one form for each employee. Indeed, if employees (intelligence agents perhaps!) were allowed multiple surnames then there could be multiple forms for one employee. But it doesn’t matter: we can tell when two forms pertain to the same employee, because they have the same EMPNO. Nor do we have to agonise over what peculiar entities such forms might represent: each of them clearly represents an employee, though two of them might represent the same employee. And, of course, we don’t have to worry about whether any key they might have “represents self-identity”: every collection of columns represents some relative identity (e.g. “is the same employee as and has the same surname as”). You might think that the weak point in this theory is that users have to interpret the existence and characteristics of a form in terms of the “real world”. Really it’s not the users’ responsibility. It’s the data analyst’s responsibility to say: when we keep such and such a form, with such and such data, we do it to show that there is such a kind or kinds of thing and they are related in such away and they are classified in such a way. But surely we have always said what it means to keep a certain sort of record, otherwise it would hardly have been worthwhile to keep any records at all. |
|
|
|
||
|
Now at least we have a strict format of interpretation to apply: the Existential interpretation. That limits our choices (good). It limits them to such an extent that on applying our data manipulations we cannot argue from the truths recorded in our data base to falsehoods garnered from the four winds. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||