(U03)

www.btinternet.com/~adrian.larner/database/pcl04

PLATOCLAST
ON DATA

Lecture IV
Data Manipulation

 

 

In Lecture II we got some way into formalising our concept of formatted data, a concept that we had approached informally in Lecture I. Lecture III was a little interlude, looking at the interpretation of data: treating records as propositions written in the language of the FOPC, either in the ordinary way or in the extraordinary way of extension-talk.

Today we go back to our formalisation, to consider some data manipulations, which we have not yet discussed even informally. And then we look at how these data manipulations – restriction, projection, and join – can be understood in terms of the propositions that records represent.

 

 

The Theory of Two-Dimensional Data

 

Now remember, please, that in this formalisation we are talking about records in a theory – I call it the Theory of Two-Dimensional Data (say TT-DD, if you must) – couched in the FOPC enhanced by four primitive predicates. This is quite different from our interpretation of records as propositions of the FOPC. First we will develop the TT-DD, which talks about records (and nothing else); then we will talk about the interpretation of records, and how that’s affected by the manipulations.

One thing I ought to say about the TT-DD is that I have not even attempted to state all its postulates. That’s because I’m interested almost exclusively in its expressive power. It certainly needs more postulates, for example:

P4
FOR EACH x, FOR SOME y, y is part of x and y is normal.
 
Every record has a normal part.
Obvious, you may think. But that’s the way with formal systems, and many here present: the obvious has to be stated. But before I start giving more definitions, I ought to explain why I bother to formalise, rather than just informally making things as plain as possible. It’s because I’m a Constructionalist:[1] I don’t think we’ve really understood a theory until we’ve formalised it, or – at least – can see our way to formalising it. And because we certainly cannot compare one theory with another (for efficiency or anything else) until we have them formalised.

Very well. In Lecture II we attended mostly to structure. We formalised the concepts of field, record, and (roughly) table: an aggregate of records. And we formalised various criteria of identity: being the same record, being the same record-type, having the same value.

 

 

Data Manipulations

 

As it happens, we also touched on data manipulation: we defined addition of records (the plus function), and the normal selection predicate. Now we continue:

D16 x is a selection of y =df
x is part of y and x is the same format as y.
 
A selection is like a relational selection or restriction, to this extent: it retains all the columns, but may lose some fields.
D17 x is a proper selection of y =df
x is proper part of y and x is a selection of y.
 
A proper selection does lose some fields (one at least). Notice that if we had taken, say, “form-overlaps” as primitive, and defined “is the same format as” on it (compare D4), we could now define“x is normal” (which we take as primitive) as “FOR EACH y, y is not a proper selection of x”.
D18 x is a projection of y =df
x is part of y and FOR EACH z, if z is part of y and z form-overlaps x then z overlaps x.
 
Recall that a selection of y is, roughly, a horizontal slice of y; a projection is a vertical slice. So the projection, x, is a part of y, and it comprises every field in y of a restricted range of field-types. Let us suppose that x has field-types f, and it has all the f fields of y. Any part, z, of y that contains an f field clearly form-overlaps x so – if x is a projection (specifically, the f-projection) of y – that f field in z is part of x, and therefore z overlaps x.
D19 x is a proper projection of y =df
x is a proper part of y and x is a projection of y.
Obviously, given any record, y, and any record-type (i.e. any record, but we don’t worry about its values), x, that form-overlaps y, we can define “the x-projection of y”. It is the largest part of y that is form-part of x. (In effect, x is form-part of y: we simply ignore any field-types in x and not in y.) Here’s the definition of the x-projection of y:[2]
D20 x\y =df
THIS z: z is a projection of y and FOR EACH w if w is part of y and w is form-part of x then w is part of z.
We need the postulate:
P5
FOR EACH x, FOR EACH y, x form-overlaps y iff FOR SOME z, z is the same record as x\y.
 
There is an x-projection of y when x form-overlaps y.
For the equivalent of relational selection (restriction) we need a schematic function, where F is the predicate supposed to hold of the record being selected.
S2 x/F =df
THIS y: F(y) and y is the same record as x.
Naturally we need the schematic postulate:
P6
FOR EACH x, F(x) iff FOR SOME y, y is the same record as x/F
 
There is a record that is the F-selection of x, viz x itself, when F holds true of x.
We now formulate the equivalent of a relational natural join:
D21 x*y =df
THIS z: z is the same record as x+y and FOR EACH w, w is the same record as x\y iff w is the same record as y\x.
We need the postulate:
P7
FOR EACH x, FOR EACH y, (if x form-overlaps y then x\y is the same record as y\x) iff FOR SOME z, z is the same record as x*y
 
The natural join of x and y exists, and is x+y, when each part of x that is form-part of y is part of y, and vice versa (so the x-projection of y, if such there be, is the same record as the y-projection of x).

Suppose now that a data base contains a number of records, and we take some records (not necessarily kept records) and give a name to each, say “e” (for Employee) to one record, “d” (for Department) to another. And suppose we define E(x)x is an EMPLOYEE record – as x is kept and x is the same format as e; and likewise D(y)y is a DEPARTMENT record – as y is kept and y is the same format as d. Now we can define the natural join of EMPLOYEE and DEPARTMENT (perhaps they have just the DEPTNO field-type in common): the expression “C(z)” may be taken to mean “z is an EMPLOYEE/DEPARTMENT record”, i.e. a joined EMPLOYEE and DEPARTMENT record (with the same DEPTNO), and defined thus:

C(z) =df FOR SOME x, FOR SOME y, E(x) and D(y) and z is the same record as x*y.
Or, for those who prefer algebraic expressions, we might say simply: E*D.

Notice that the record e, the paradigm employee record, might comprise all the form-fields – all the domains – in the EMPLOYEE record-type, possibly themselves named, s for the employee surname, b for the employee date of birth, and so on. Then a record, say z, would be a surname and date of birth projection of an employee record when FOR SOME x, E(x) and z is the same record as (s+b)\x; or perhaps, algebraically, (S,B)\E.

And I think I can leave it to your ingenuity to work out how to define restrictions.

 

 

Aggregation of Records

 

Let me stress here that I have spoken of projection, selection, and natural join of records. If, having a predicate of records, such as “E” – “is an EMPLOYEE record”, we wish to talk of the aggregate of all records that are E, then we can use schema S1 to get table(E). Now, let t be table(E). What parts of t are EMPLOYEE records? Certainly not every normal selection of t: the “sum” operation used to form t aggregates, we might say, with confusion, or – to pick up an old metaphor from our first lecture – vandalously.

But there is a way to pick out the EMPLOYEE records: they are the parts of t that are E (or, going back to the definition of E, that are the same format as e, or as t, and are kept). So the aggregation doesn’t do us much good, because it is an aggregation with confusion, and we would like to aggregate in such a way that we had no confusion: but what does this mean?

It means that we aggregate in such a way that we do not need to apply the original predicate, “E” or “is kept”, in order to pick out – what shall we say? – a member from the aggregate. But, if this were possible (in general), we could form the aggregate, t, and forget the original predicate: E would be eliminable, because definable as “being a member of t”. What a wonderful idea: but not a new one. Again, I grab you by the collar and remind you of the sad tale of Frege and Lord Russell.[3] But then, you might object, these aggregates serve no purpose at all. And you would be right. For a user to ask to see, or to query, the aggregate or table of EMPLOYEE records, is merely for them to ask to see or query the EMPLOYEE records. And likewise, when I have seen each of you before me, and enlightened each of you about data base theory, I do not then – thank goodness – have additionally to see the whole class of you, and enlighten the whole class of you about data base theory.

 

 

Interpretation

 

And now we turn from the TT-DD, our formal theory about records, and look at the interpretation of the relational operators. There are two things that I want to stress here, because I don’t want you to be led astray. The first is that not only is the rest of this lecture not directly concerned with the formal TT-DD, but it’s not formal at all, although I shall be using some notation, which I will explain. Notice that in the formal definitions of the TT-DD I used lots of (rather stilted) English prose: “and”, “part”, etc. But that didn’t stop the exposition being formal. I could have been even more – what shall I say? – vernacular, and still remained formal. By contrast, what I’m going to say next is informal (though it’s easily formalisable) despite being couched in a condensed, un-English notation. And I want you to remember – now and forever, especially the mathematicians among you – that Notation makyth not Formality.

My second warning is that it is amazingly difficult to work out just how Dr Codd, and even the lucid Mr Date, do intend the records in relations to be interpreted. Most of the time they talk about records representing “entities”, whatever they might be. (I think I know what they might be, and I’ll talk about them in another lecture.) In fairness, most other – more respectably academic – writers (mathematicians, you know) say even less than next to nothing about interpretation. They give me nowhere to rest my lever. Look: in his latest effusion, Dr Codd remarks at one point that:

every ... row coupled with the name of the relation represents an assertion.
And, at another point:
The predicate in the logician’s sense that is common to all the assertions of one type is factored out and becomes the relation name.[4]
But that’s about all we have to go on, apart from what we know of the history of the notion of relation, which we went into in the last lecture. But that history at least fits with what Dr Codd says in these brief and unsystematic passages. So, boldly, this is – as I shall call it – the Classical interpretation of relations.

The relation name (say “FATHERHOOD” – remember “... is father of ...”, Abraham, Isaac, and that lot) stands for (i.e. abbreviates) a predicate. The column names are pronouns (“typed” pronouns – they show what sort of thing the predicate pertains to, persons for instance). Actually, I’m not quite certain whether we should say that the relation name is really superfluous, just a convenience: it certainly is if no two relations have all and only the same column headings.[5]

Each row of the relation – each record – contains values that, inserted appropriately into the predicate, form a proposition that is shown to be (let us say) asserted by its occurring in the relation. Now – and here I am making an inspired guess, but it’s the only one that makes any sense to me – these values are grammatically proper names. The predicate calculus (as I mentioned in Lecture II) allows only either proper names or pronouns (variables) in the places of predicates. And only names effect the change from predicate (true or false of things) to proposition (true or false simpliciter). So I think they have to be names.

We have, as you remember well:

FATHERHOOD:  FATHER    PERSON
 
             Abraham   Isaac
             Isaac     Jacob
             Jacob     Reuben
             Jacob     Dinah
It means, according to this Classical interpretation:
Abraham is father of Isaac, and
Isaac is father of Jacob, and
Jacob is father of Reuben, and
Jacob is father of Dinah.
Now, in general, I shall use expressions like Pab, Qbc, and Rabc to represent rows of a relation. The P, Q, and R each stand for a relation. The a, b, and c each stand for zero or more fields in a row, and we assume that using “Pab” and “Qbc” shows that Q doesn’t have the a columns of P, nor P the c columns of Q, but that they have the same values in their common columns (b).

And I think I can leave it to your ingenuity to work out how to define restrictions.

 

 

Natural Join

 

Let us start with natural join. Suppose we join P with Q, then Rabc gets to be a row in the result, R, when:

Rabc = Pab & Qbc.
As an example, make a copy of FATHERHOOD, renaming columns:
CHILDHOOD:   PERSON    CHILD
 
             Abraham   Isaac
             Isaac     Jacob
             Jacob     Reuben
             Jacob     Dinah
(FATHER has been renamed PERSON, and PERSON has been renamed CHILD.)

Now do a natural join of FATHERHOOD and CHILDHOOD. We get:

FATHER     PERSON    CHILD
 
Abraham    Isaac     Jacob
Isaac      Jacob     Reuben
Isaac      Jacob     Dinah
So what does the natural join mean? That is, given the meanings of the original relation, what is the meaning of the result?

And I think the answer is, in this example, that from:

...is father of ...
and
...is father of ...
we get
...is father of ..., who is father of ...
And this is what I mean by “Rabc = Pab & Qbc”. For a to be R to b with respect to c (i.e. Rabc) means that a is P to b which (or who) is Q to c (i.e. Pab & Qbc).

And I think I can leave it to your ingenuity to work out how to define restrictions.

 

 

Selection

 

So much for natural join. Now consider restriction (sometimes called “selection”). Here we have what we might call a “naked” predicate, one not represented by a relation in extension. For example, let “Qy” be something like “FATHER = Jacob”. Then a restriction on a row, Pab, giving a result row, Rab, is represented by:

Rab = Pab & Qb
And that is very like a natural join. Indeed, if c were empty, and the predicate Qbc, i.e. Qb, were represented as a relation (with a single column, FATHER, and a single row, Jacob), it would be a natural join, as defined above.

Of course, the restriction “FATHER = Jacob” applied to FATHERHOOD gives:

FATHER     PERSON
 
Jacob      Reuben
Jacob      Dinah

 

 

Cartesian Product

 

Before we approach the airless heights of Projection, let us consider some special cases of this “&” (AND) operation. We have seen the most general case, the natural join, and we have seen the special case, restriction, when (1) one predicate is “naked”, and (2) all the columns in the naked predicate are also in the other operand.

Let us go back to natural join, and consider the case when there are no common columns (b empty):

Rac = Pa & Qc
This is the Cartesian Product. Notice that the well-known relational language SQL uses Cartesian product rather than natural join: it never recognises common columns. So it would treat Pab and Qbc as, say, Pab and Qdc, giving the product, say, Sabdc. The user would specify the restriction, b = d, and then project just the a, b, and c.

As natural join is very common, and the use of Cartesian product, restriction, and projection is rather cumbersome, you might think that it was better to have a natural join based system. And lots of very distinguished theoreticians and practitioners would agree with you. And I would try – will try – to change your mind. But not yet.[6]

 

 

Intersection

 

To go to the other extreme, assume all P and Q columns were common (i.e. a and c empty). Then we would have the intersection:

Rb = Pb & Qb
I see all the mathematicians that are still awake checking out whether that’s the same as the intersection you get if you take relations to be sets of n-tuples. And it is. They nod.

 

 

Extension

 

Just one more ANDing. But first a question: why can’t the naked predicate of a restriction have any columns not in the other operand (the relation)? The answer is that if we tried to build the relation in extension corresponding to some naked predicates, it would turn out to be very large indeed. Think about “COLUMNl < COLUMN2”, when both columns contain integers: the extension is indefinitely large. But as long as all the columns are also in the relation, the result won’t be any bigger than the relation. Really, the constraint we need is that the result should be manageably small. I mean, “COLUMNl is the square of COLUMN2” would be all right if COLUMNl were in the relation: the number of rows in the result would be no more than double the number in the operand relation. With that sort of constraint – in current systems usually very strict, i.e. that c is a function of b – we get, with naked predicate Q, the extension operation:

Rabc = Pab & Qbc
If a function is demanded (overcautiously in the opinion of my confessor, the good Father Warden and myself – we blame each other for the proposal of non-functional extension),[7] we get:
Rabc = Pab & c=q(b)
It is generally assumed in many systems that such a function (q) should be total – it should return a value for any b, but I see no reason why (and likewise for non-functional extension).

 

 

Projection

 

You must have had enough of ANDing, and so have I. Let’s do a Projection:

Pab = FOR SOME z, Rabz
Well, it makes a change doesn’t it? But notice that although we existentially quantify the z columns (the c columns if we think still of Rabc), we talk of projecting over the a and b columns. And this sort of talk may have led to the assumption that the projection of a relation always has at least one column. But there is no reason why we should not existentially quantify all columns:
D = FOR SOME x, FOR SOME y, FOR SOME z, Rxyz
The relation D has no columns, it is a niladic predicate, which is to say: a proposition. It has either one row (if R had any rows) or no row (if R had no rows). These very small relations are another Father Warden wheeze.[8]

 

 

An Exercise in Interpretation

 

Right, here’s our joined relation:

FATHER     PERSON    CHILD
 
Abraham    Isaac     Jacob
Isaac      Jacob     Reuben
Isaac      Jacob     Dinah
Project away the PERSON column. That is, project over the FATHER and CHILD columns. What do you get?
FATHER     CHILD
 
Abraham    Jacob
Isaac      Reuben
Isaac      Dinah
But what does it mean? Do I hear “FATHER is grandfather of CHILD”? In contrast to you lot, I have many vices, but male chauvinism isn’t one of them.[9]

Work it out:

FATHER is father of someone (there’s the existential quantification), who is father of CHILD.
But “... is grandfather of ...” means “... is father of someone, who is parent of ” Of course, our projected relation means: “is paternal grandfather of”.

 

 

Other Operations

 

Yes, I know. I haven’t dealt with Union, nor with Difference. And for good reasons. One of those reasons is that they are not definable on records, only on tables. And we still don’t know just what status we should grant to tables. I raised that question at the end of Lecture I, the question of the essentiality of tables (to use a useful word coined by Mr Date), and I’ve been hovering round it ever since. And I plan to hover round it a bit more.

You see, I think there are serious problems even with the Classical interpretations we’ve looked at today, and I want to hammer them out before we get into even deeper water. So I’m afraid things are going to get much more complicated before they become completely incomprehensible.

 

SITE HOME PAGE

Stay with it: you need the credits, and I need the fees. My slim briefing paper, handed out only to regular attenders, covers the principal subject areas that you will be examined on. Good afternoon.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture IV (rtf, Word for Windows compatible)

Platoclast on Data: Lecture V

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.