|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl04 |
|
PLATOCLAST Lecture IV |
||
|
|
||
|
|
In Lecture II we got some way into formalising our concept of formatted data, a concept that we had approached informally in Lecture I. Lecture III was a little interlude, looking at the interpretation of data: treating records as propositions written in the language of the FOPC, either in the ordinary way or in the extraordinary way of extension-talk. Today we go back to our formalisation, to consider some data manipulations, which we have not yet discussed even informally. And then we look at how these data manipulations restriction, projection, and join can be understood in terms of the propositions that records represent. |
|
|
|
||
|
|
The Theory of Two-Dimensional Data |
|
|
|
Now remember, please, that in this formalisation we are talking about records in a theory I call it the Theory of Two-Dimensional Data (say TT-DD, if you must) couched in the FOPC enhanced by four primitive predicates. This is quite different from our interpretation of records as propositions of the FOPC. First we will develop the TT-DD, which talks about records (and nothing else); then we will talk about the interpretation of records, and how that’s affected by the manipulations. One thing I ought to say about the TT-DD is that I have not even attempted to state all its postulates. That’s because I’m interested almost exclusively in its expressive power. It certainly needs more postulates, for example:
Very well. In Lecture II we attended mostly to structure. We formalised the concepts of field, record, and (roughly) table: an aggregate of records. And we formalised various criteria of identity: being the same record, being the same record-type, having the same value. |
|
|
|
||
|
|
Data Manipulations |
|
|
|
As it happens, we also touched on data manipulation: we defined addition of records (the plus function), and the normal selection predicate. Now we continue:
Suppose now that a data base contains a number of records, and we take some records (not necessarily kept records) and give a name to each, say “e” (for Employee) to one record, “d” (for Department) to another. And suppose we define E(x) x is an EMPLOYEE record as x is kept and x is the same format as e; and likewise D(y) y is a DEPARTMENT record as y is kept and y is the same format as d. Now we can define the natural join of EMPLOYEE and DEPARTMENT (perhaps they have just the DEPTNO field-type in common): the expression “C(z)” may be taken to mean “z is an EMPLOYEE/DEPARTMENT record”, i.e. a joined EMPLOYEE and DEPARTMENT record (with the same DEPTNO), and defined thus:
Notice that the record e, the paradigm employee record, might comprise all the form-fields all the domains in the EMPLOYEE record-type, possibly themselves named, s for the employee surname, b for the employee date of birth, and so on. Then a record, say z, would be a surname and date of birth projection of an employee record when FOR SOME x, E(x) and z is the same record as (s+b)\x; or perhaps, algebraically, (S,B)\E. And I think I can leave it to your ingenuity to work out how to define restrictions. |
|
|
|
||
|
|
Aggregation of Records |
|
|
|
Let me stress here that I have spoken of projection, selection, and natural join of records. If, having a predicate of records, such as “E” “is an EMPLOYEE record”, we wish to talk of the aggregate of all records that are E, then we can use schema S1 to get table(E). Now, let t be table(E). What parts of t are EMPLOYEE records? Certainly not every normal selection of t: the “sum” operation used to form t aggregates, we might say, with confusion, or to pick up an old metaphor from our first lecture vandalously. But there is a way to pick out the EMPLOYEE records: they are the parts of t that are E (or, going back to the definition of E, that are the same format as e, or as t, and are kept). So the aggregation doesn’t do us much good, because it is an aggregation with confusion, and we would like to aggregate in such a way that we had no confusion: but what does this mean? It means that we aggregate in such a way that we do not need to apply the original predicate, “E” or “is kept”, in order to pick out what shall we say? a member from the aggregate. But, if this were possible (in general), we could form the aggregate, t, and forget the original predicate: E would be eliminable, because definable as “being a member of t”. What a wonderful idea: but not a new one. Again, I grab you by the collar and remind you of the sad tale of Frege and Lord Russell.[3] But then, you might object, these aggregates serve no purpose at all. And you would be right. For a user to ask to see, or to query, the aggregate or table of EMPLOYEE records, is merely for them to ask to see or query the EMPLOYEE records. And likewise, when I have seen each of you before me, and enlightened each of you about data base theory, I do not then thank goodness have additionally to see the whole class of you, and enlighten the whole class of you about data base theory. |
|
|
|
||
|
|
Interpretation |
|
|
|
And now we turn from the TT-DD, our formal theory about records, and look at the interpretation of the relational operators. There are two things that I want to stress here, because I don’t want you to be led astray. The first is that not only is the rest of this lecture not directly concerned with the formal TT-DD, but it’s not formal at all, although I shall be using some notation, which I will explain. Notice that in the formal definitions of the TT-DD I used lots of (rather stilted) English prose: “and”, “part”, etc. But that didn’t stop the exposition being formal. I could have been even more what shall I say? vernacular, and still remained formal. By contrast, what I’m going to say next is informal (though it’s easily formalisable) despite being couched in a condensed, un-English notation. And I want you to remember now and forever, especially the mathematicians among you that Notation makyth not Formality. My second warning is that it is amazingly difficult to work out just how Dr Codd, and even the lucid Mr Date, do intend the records in relations to be interpreted. Most of the time they talk about records representing “entities”, whatever they might be. (I think I know what they might be, and Ill talk about them in another lecture.) In fairness, most other more respectably academic writers (mathematicians, you know) say even less than next to nothing about interpretation. They give me nowhere to rest my lever. Look: in his latest effusion, Dr Codd remarks at one point that:
The relation name (say “FATHERHOOD” remember “... is father of ...”, Abraham, Isaac, and that lot) stands for (i.e. abbreviates) a predicate. The column names are pronouns (“typed” pronouns they show what sort of thing the predicate pertains to, persons for instance). Actually, I’m not quite certain whether we should say that the relation name is really superfluous, just a convenience: it certainly is if no two relations have all and only the same column headings.[5] Each row of the relation each record contains values that, inserted appropriately into the predicate, form a proposition that is shown to be (let us say) asserted by its occurring in the relation. Now and here I am making an inspired guess, but it’s the only one that makes any sense to me these values are grammatically proper names. The predicate calculus (as I mentioned in Lecture II) allows only either proper names or pronouns (variables) in the places of predicates. And only names effect the change from predicate (true or false of things) to proposition (true or false simpliciter). So I think they have to be names. We have, as you remember well:
And I think I can leave it to your ingenuity to work out how to define restrictions. |
|
|
|
||
|
|
Natural Join |
|
|
|
Let us start with natural join. Suppose we join P with Q, then Rabc gets to be a row in the result, R, when:
Now do a natural join of FATHERHOOD and CHILDHOOD. We get:
And I think the answer is, in this example, that from:
And I think I can leave it to your ingenuity to work out how to define restrictions. |
|
|
|
||
|
|
Selection |
|
|
|
So much for natural join. Now consider restriction (sometimes called “selection”). Here we have what we might call a “naked” predicate, one not represented by a relation in extension. For example, let “Qy” be something like “FATHER = Jacob”. Then a restriction on a row, Pab, giving a result row, Rab, is represented by:
Of course, the restriction “FATHER = Jacob” applied to FATHERHOOD gives:
|
|
|
|
||
|
|
Cartesian Product |
|
|
|
Before we approach the airless heights of Projection, let us consider some special cases of this “&” (AND) operation. We have seen the most general case, the natural join, and we have seen the special case, restriction, when (1) one predicate is “naked”, and (2) all the columns in the naked predicate are also in the other operand. Let us go back to natural join, and consider the case when there are no common columns (b empty):
As natural join is very common, and the use of Cartesian product, restriction, and projection is rather cumbersome, you might think that it was better to have a natural join based system. And lots of very distinguished theoreticians and practitioners would agree with you. And I would try will try to change your mind. But not yet.[6] |
|
|
|
||
|
|
Intersection |
|
|
|
To go to the other extreme, assume all P and Q columns were common (i.e. a and c empty). Then we would have the intersection:
|
|
|
|
||
|
|
Extension |
|
|
|
Just one more ANDing. But first a question: why can’t the naked predicate of a restriction have any columns not in the other operand (the relation)? The answer is that if we tried to build the relation in extension corresponding to some naked predicates, it would turn out to be very large indeed. Think about “COLUMNl < COLUMN2”, when both columns contain integers: the extension is indefinitely large. But as long as all the columns are also in the relation, the result won’t be any bigger than the relation. Really, the constraint we need is that the result should be manageably small. I mean, “COLUMNl is the square of COLUMN2” would be all right if COLUMNl were in the relation: the number of rows in the result would be no more than double the number in the operand relation. With that sort of constraint in current systems usually very strict, i.e. that c is a function of b we get, with naked predicate Q, the extension operation:
|
|
|
|
||
|
|
Projection |
|
|
|
You must have had enough of ANDing, and so have I. Let’s do a Projection:
|
|
|
|
||
|
|
An Exercise in Interpretation |
|
|
|
Right, here’s our joined relation:
Work it out:
|
|
|
|
||
|
|
Other Operations |
|
|
|
Yes, I know. I haven’t dealt with Union, nor with Difference. And for good reasons. One of those reasons is that they are not definable on records, only on tables. And we still don’t know just what status we should grant to tables. I raised that question at the end of Lecture I, the question of the essentiality of tables (to use a useful word coined by Mr Date), and I’ve been hovering round it ever since. And I plan to hover round it a bit more. You see, I think there are serious problems even with the Classical interpretations we’ve looked at today, and I want to hammer them out before we get into even deeper water. So I’m afraid things are going to get much more complicated before they become completely incomprehensible. |
|
|
|
||
|
Stay with it: you need the credits, and I need the fees. My slim briefing paper, handed out only to regular attenders, covers the principal subject areas that you will be examined on. Good afternoon. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||