(U03)

www.btinternet.com/~adrian.larner/database/pcl15

PLATOCLAST
ON DATA

Lecture XV
Quantification

 

 

I hope you all enjoyed the last lecture, and profited by it. What a long way the relational theorists have gone, haven’t they? Backwards. Dr Codd had this pure vision, you see. DBMSs based on a simple, sound theory. But now they clamour for Outer Joins: they haven’t defined the problems that Outer Joins are supposed to solve; they don’t have a solid theoretical understanding of Outer Joins. But they don’t care: why shouldn’t we make whatever informal, ad hoc adjustments we want? We’re all practical people aren’t we? I wonder sometimes whether they have learned anything at all from Dr Codd. Mind you, we saw just the same approach with Nulls: there even Dr Codd hasn’t learned anything from Dr Codd.

 

 

Combination as a Primitive Operator

 

I left a question or two hanging at the end of Lecture XIII: how were we to understand and handle Unions? But, looking back at our last lecture, when we dealt with Outer Joins, I think those questions got more or less answered. Now that we understand Combination – the great conjunction operation – let’s take it as primitive (Natural Join lovers might take Conjoin as primitive).

Given the Combination, r, of p and q – remember it will have columns p.X, p.Y, q.Y, and q.Z – we can easily get the Cartesian product of p and q. It is the strict (p.X, p.Y, q.Y, q.Z) Projection of r, where a strict projection is one in which all columns in the list are mandatory. I’ll use “;” for Combination, so we get:

p,q =df (p.X, p.Y, q.Y, q.Z)\p;q
The Left, Right, and Symmetrical Conjoins of p and q are obtained by restricting r with the conditions, respectively:
p.Y ]= q.Y
p.Y =[ q.Y
p. ]= q.Y | p.Y =[ q.Y
Then a rather subtle operation is needed to find the value in p.Y and/or q.Y, put it in a single Y column, and project away the original p.Y and q.Y columns. I’m coming to that.

The Inner Join of p and q is obtained by applying the restriction “p.Y = q.Y” (or “p.Y º q.Y” if we want the join on the null value as well) and then projecting away one of the Y columns.

The Union of p and q can also be defined, in two ways. Firstly, we could project r using (p.X, p.Y) | (q. Y, q.Z), i.e. two strict projections.

Secondly we could apply the restriction:

(p.Y ]= q.Y | p.Y =[ q.Y) & ¬ (p.Y = q.Y) & ¬ (p.Y ¬ = q.Y)
That’s a long winded way of saying: p.Y is absent or q.Y is absent. But notice what happened to that supposedly “relational” union of Pb and Qb. Their combination has two columns and three rows. The columns are p.Y and q.Y, and the rows are:
(p.Y: b)
(q.Y: b)
(p.Y: b), (q.Y: b)
Their Union – quite clearly a conjunction – has the same columns, and the first two of those rows. So we still have to show how we would get the relational union, which will have a single row, (Y: b).

 

 

Non-Uniform Tables and Language Issues

 

But how far have we got? Most importantly, perhaps, we’ve seen that even the relational bigots want tables containing rows of different formats. But their solution is wrong in three ways:

Outer Joins are intractable operations, and Conjoins or – even better – Combination should be used instead.
 
Nulls cause enough problems (to them, we have a solution) without being used for yet another purpose.
 
Whether Nulls or other pads are used, the disguise is transparent: rectangularity of tables (relations, imposing union compatibility) is a Procrustean bed.
We have seen how Conjoins or Combination remove from the user that difficult choice between different forms of conjunction (join or union) that might be needed in a Tuple Calculus language. And we have seen how Combination avoids the information loss that afflicts Conjoins and Outer Joins when a table is joined on all its columns (so none are left to signal the origin of the data in the joined columns).

As you may have gathered, there are two schools of thought about which join – Natural Join or Cartesian Product – should be taken as primitive. On the one side we have, we might say, the “Wardenians”, who favour Natural Join: Mr Date is now, I believe, of this persuasion; and so is Dr Codd – in spades.[1] On the other side we have the SQL reactionaries, including myself, who favour Cartesian Product.

What I’ld really like to see is a user interface that accepted Natural Joins and translated into Cartesian Product and most likely Restriction, which the user would then accept or modify.[2]

However, this strikes me as odd: there are now proposals to implement Outer Joins in SQL. But this involves a radical change to that language to make it – in effect – Natural Join based. Yet, when you think it through as we have done, it seems that handling the sort of non-rectangular tables that are to be supported (padded to look like relations) is best achieved on a base of Combination. And that is much nearer to the current SQL Cartesian product approach than are Conjoins or Outer Joins. But then, they haven’t thought it through, have they?

And we haven’t finished thinking it through, for that matter. We still have to get a way to do relational unions, and we have a little problem with projection. You will recall that by using Conjoins or Combination we enabled restrictions to be applied either before or after such a join with no effect on the result (which wasn’t the case with Outer Joins). But we would like something similar to apply to Projection. Remember that our ideal was a query language in which we:

1
Get all the data together (now using Combination).
2
Apply any restrictions.
3
Project the required columns.

 

 

Relational Union

 

We need therefore to investigate how to move any projections to, so to speak, the end of the query (just as we can move all restrictions to the second stage); but let’s tackle relational union first. We’ll go back to an old friend:

FATHERHOOD:     FATHER    PERSON
 
                Abraham   Isaac
                Isaac     Jacob
                Jacob     Reuben
                Jacob     Dinah
And here’s a relation to union with it:
MOTHERHOOD:     MOTHER    PERSON
 
                Sarah     Isaac
                Rebecca   Jacob
                Leah      Reuben
                Leah      Dinah
What we want is:
PARENTHOOD:     PARENT    PERSON
 
                Abraham   Isaac
                Sarah     Isaac
                Isaac     Jacob
                Rebecca   Jacob
                Jacob     Reuben
                Leah      Reuben
                Jacob     Dinah
                Leah      Dinah
And what we can get, using our union, defined on Combination, is:
FMTHERHOOD:     FATHER    MOTHER    PERSON
 
                Abraham             Isaac
                          Sarah     Isaac
                Isaac               Jacob
                          Rebecca   Jacob
                Jacob               Reuben
                          Leah      Reuben
                Jacob               Dinah
                          Leah      Dinah
So I think it’s now obvious that we need some Extension function on the FATHER and MOTHER columns – we might can it “OR” – to create a “FATHER-OR-MOTHER” column, which we could (not unreasonably) rename as “PARENT”. When applied to a value and an absent value this “OR” would return the value, thus:
      FATHER    MOTHER    PARENT    PERSON
 
      Abraham             Abraham   Isaac
                Sarah     Sarah     Isaac
      Isaac               Isaac     Jacob
                Rebecca   Rebecca   Jacob
      Jacob               Jacob     Reuben
                Leah      Leah      Reuben
      Jacob               Jacob     Dinah
                Leah      Leah      Dinah
And then we project PARENT and PERSON to get the required PARENTHOOD table. And only on this projection do we lose information.

Of course, I’ve cheated a little on you. Being strict about Combination would have given us two PERSON columns: FATHERHOOD.PERSON and MOTHERHOOD.PERSON. But we would then have used the same Extension and Projection: create PERSON as the OR-extension of FATHERHOOD.PERSON and MOTHERHOOD.PERSON, and project those two columns away.

Now notice two things:

I’m not now talking DML design. We might well want in our DML all sorts of slick operators to do OR-extensions and column removal in one blow.
 
This shows where the relational interpretation of union, as OR, comes from. It is conceivable – and, I think, best conceived – as an extension function on columns. The information loss comes from removing those columns, by projection.
Let me explain one reason why I like this approach. Suppose we formed (or simply kept) the natural join of FATHERHOOD and MOTHERHOOD:
FTHMTHHOOD:     FATHER    MOTHER    PERSON
 
                Abraham   Sarah     Isaac
                Isaac     Rebecca   Jacob
                Jacob     Leah      Reuben
                Jacob     Leah      Dinah
And suppose we wanted to form PARENTHOOD from that. I think an elegant approach is to treat the OR-extension not as a function but as a relationship (as many-valued). “OR (FATHER, MOTHER)” – ca11 it “PARENT” – would then return two values for each row. We would get:
      FATHER    MOTHER    PARENT    PERSON
 
      Abraham   Sarah     Abraham   Isaac
      Abraham   Sarah     Sarah     Isaac
      Isaac     Rebecca   Isaac     Jacob
      Isaac     Rebecca   Rebecca   Jacob
      Jacob     Leah      Jacob     Reuben
      Jacob     Leah      Leah      Reuben
      Jacob     Leah      Jacob     Dinah
      Jacob     Leah      Leah      Dinah
And again PARENTHOOD could be obtained by projecting the PARENT and PERSON columns. So the “OR” extension looks a useful thing to have. And if we have it, let’s use it to do ORs.

Perhaps rather wickedly I might point out another option: start with FTHMTHHOOD and simply rename the FATHER and MOTHER columns to PARENT (yes, both of them):

PRTPRTHOOD:     PARENT    PARENT    PERSON
 
                Abraham   Sarah     Isaac
                Isaac     Rebecca   Jacob
                Jacob     Leah      Reuben
                Jacob     Leah      Dinah
Only two columns now (one inscribed twice): but two values of PARENT in each of the polynormal rows. Then the operation needed to get PARENTHOOD is – what shall we call it? – Unnest? Normalise?

But the important thing is: we can still get relational union when we want it. Now let’s turn to projection.

 

 

Quantification

 

You’ll remember that at the end of our last lecture we had projected just the NAME column of ASSIGNMENT to give ONE_ASSGNT, effected a Left Conjoin of STUDENT and ONE_ASSGNT, and fiddled about to get:

NAME      DONE
 
Douglas
Judy
Judy      Yes
...       ...
Let’s suppose we had tried to postpone all projection until after restriction and combination. First we would combine all the data for our query from STUDENT and ASSIGNMENT, and apply the join restriction, then the restriction “ON_TIME =[ Yes”. The result looks like this:
SA:   STUDENT   GRADE   ASSIGN-   WEEK      ON_
      .NAME             MENT                TIME
                        .NAME
      Douglas   2/1
      Judy      2/2
      Judy      2/2     Judy      Week 1    Yes
      Judy      2/2     Judy      Week 2    Yes
      Judy      2/2     Judy      Week 3    Yes
      Judy      2/2     Judy      Week 4    Yes
Notice that the ON TIME column serves here as did the DONE column in our tinier version. We are trying to get to something like:
NAME      DILIGENT
 
Douglas   No
Judy      Yes
What do we need to do? I think that there are two things:
We want to discover whether, for each student, there is a “Yes” value in ON_TIME (or DONE), or – we might say – any value in ASSIGNMENT.NAME. We want to extend all rows (or perhaps some row) pertaining to that student with a field saying “Yes” – the student is diligent.
 
We want to get a result with one row for each student, and containing that new value.
Now I think that we should regard it as purely coincidental that the quantification, “there is a ‘Yes’”, is within the scope of a single student and that we want one row in the result for each student. In some cases we might want such a quantification within one scope, and a report at a detail level different from that scope. So I’m going to tackle these as separate problems.

What’s new about the first problem? Its solution depends, unlike an extension or a restriction, on a number of rows. But there is more than that: any projection – in a sense – depends on a number of rows. Remember our CI interpretation of Projection:

Pb = FOR SOME z, Rbz
But with a projection we are quite happy for there to be no row pertaining to a value, say b, if there is no such z. Here we are not so happy: we really do not want to lose all reference to Douglas just because he’s never got an assignment in on time. We want to convert “FOR SOME z, Rbz” into a truth value, and represent that value in a column.

 

 

Accumulations and Collection

 

And – as it happens – we can surely see coming the need to do such things as count how many assignments have been handed in on time. I hope you can see that coming. And things could get more compiex: we might start to record a mark for each assignment and then find the average mark over the term for each student. (Only kidding!)

Surely all this is pushing us towards some way of aggregating ... Aggregating what? Records (rows) or merely values in a column? “Merely values in a column”, you reply, shuddering as I do at the thought of such extra structures as aggregates of rows, and bearing in mind that we might still have recourse to polynormality.

Very well, let us give ourselves a new operation, collection (dangerous this – any other alternative welcomed). Given a relation, q, with columns including KI, Kl, ... Kn, C, the expression:

q//(KI, K2, ...Kn # C)
denotes q extended (I use “//” here for extension) with a new column, namely “(KI, K2, ...Kn # C)” – let’s call it “CC”. Assume CC to have all absent values in the first instance, then:
Each distinct (KI, K2, ...Kn) determines a scope: rows with that (KI, K2, ...Kn) are in that scope. Within each scope, the C value in each row is placed into each CC (additively, so that CC can become many-valued).
You can now see, I hope, that SA//(STUDENT.NAME # ON_TIME), if we call the extra column “DILIGENT”, is:
      STUDENT   GRADE   ASSIGN    WEEK      ON_       DILI
      .NAME             MENT                TIME      GENT
                        .NAME
      Douglas   2/1
      Judy      2/2                                   Yes
      Judy      2/2     Judy      Week 1    Yes       Yes
      Judy      2/2     Judy      Week 2    Yes       Yes
      Judy      2/2     Judy      Week 3    Yes       Yes
      Judy      2/2     Judy      Week 4    Yes       Yes
If we had used SA//(STUDENT.NAME # WEEK) the extra column would, like DILIGENT, have no value in the Douglas row, but would have four values – Week 1 through Week 4 – in each Judy row. Anyway, now we can project STUDENT.NAME and DILIGENT to get this little gem:
NAME      DILIGENT
 
Douglas
Judy      Yes
...       ...
You might think that this is enough. We don’t need “No” in Douglas’s DILIGENT field because merely not being marked diligent is enough to show you’re a slacker, if you ask me.

Of course, we might, instead of merely projecting DILIGENT, first extend using a function like “PRESENT”, so PRESENT (DILIGENT) for Judy would come out as Yes, and for Douglas as No: and we could project that answer.

If we had used WEEK instead of ON_TIME, getting the many-valued column WEEKS say, then we would need to apply a function like “PRESENT” to get a column for projection: Yes if some value was to be found in WEEKS, and No if WEEKS had an absent value. We can also imagine other functions, such as HOWMANY (WEEKS), returning 0 for Douglas and 4 for Judy.

I ought to warn you: seeing whether anything is there, mere existential quantification (such as PRESENT), is expressible within the FOPC. Counting how many is not, so we’re stretching things a bit, but our users do want to count, add up, average, and so on. Later we’ll talk about how we can consign all that to the unsavoury limbo of “data type processing”. Bear with me for now.

Notice how polynormality is now starting to emerge from the shadows: we are seeing polynormal records transiently.

The sort of expression we are using is:

HOWMANY (STUDENT.NAME # WEEK)
If we perform the collection (“#”) and then extend the relation with the new column (WEEKS), we get a polynormal relation. But we can hide this by doing the collection and the “HOWMANY” – the COUNT – in one blow.

Thus what was the question of polynormality – should we allow polynormal records at all? – has changed. Now we want to ask: how conspicuous should polynormality be?

Remember that we started with two problems: one we can now solve by the “#” operation – collection within a scope. The other was that of specifying the level of detail of the result, which turned out very easy in our example: we simply used projection. But then we got to the end of the query: we didn’t have any more collection to do. So we’ll need to look at more complex queries to let this problem emerge properly.[3] But not today.

 

 

Interpretation

 

I want to go back to that very simple table, I’ll call it “S”:

S:     NAME      DONE
 
       Douglas
       Judy
       Judy      Yes
       ...       ...
Remember that we could do this query on it:
(NAME, DILIGENT)\S//DILIGENT := (NAME # DONE)
I use “:=” to show that I’ve named the new column. We get the result:
NAME      DONE
 
Douglas
Judy      Yes
...       ...
Let’s consider what its rows mean. Judy’s row means that Judy is a student on this course and is diligent in this sense: there is some homework assignment that she returned on time. But what does Douglas’s row mean? Quite obviously it is the same row as in S. It means that Douglas is a student on this course. It doesn’t mean, surprisingly enough, that Douglas is a slacker in this sense: there is no homework assignment that he returned on time. It doesn’t mean that thing, even though we know, by looking at that row, that Douglas is a slacker.

It doesn’t follow from the information in my files that Douglas is a slacker. I’ll prove that to you. Suppose we say that the information in S constitutes a theory, which I’ll also call S. And suppose, for simplicity, that S is not the result of a query but the only file I keep. When a student registers for the course, I insert an appropriate short (NAME) row for that student; and when a student hands in their first on-time assignment I add a long (NAME and DONE) row.

Now let me add a long row for Douglas (just for the sake of argument). I get a new theory, say T. And S is a subthoery of T; so T asserts (has as a theorem) everything that S asserts. But T asserts that Douglas is no slacker but is diligent. So S can’t assert that Douglas is a slacker: if it did then T would assert that Douglas was a slacker and was not a slacker; it would be inconsistent. Clearly T is not inconsistent (merely false). So S doesn’t assert that Douglas is a slacker; it doesn’t deny that Douglas is diligent.

 

SITE HOME PAGE

In concluding that Douglas was a slacker, my friends, you went beyond the evidence. Of course, your conclusion was correct; but it wasn’t warranted by the data you derived it from.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XV (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XVI

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.