(U03)

www.btinternet.com/~adrian.larner/database/pcl14

PLATOCLAST
ON DATA

Lecture XIV
The Combination Operation

 

 

I want to go back to a problem I raised at the end of Lecture XII. Homework assignments. As you know, big brother is watching you. I’ve got a file on you; two files actually. The first is my Student Table:

STUDENT:           NAME
 
                   Douglas
                   Judy
                   ...
It means: There is someone that is a student on my course and is the same person as NAME. What a good thing you’ve all got different names.

The second is my Assignment Table:

ASSIGNMENT:        NAME      WEEK
 
                   Judy      Week 1
                   Judy      Week 2
                   Judy      Week 3
                   Judy      Week 4
                   ...      ...
It means: There is someone and there is something such that the former completed the latter on time and the former is the same person as NAME and the latter is the same homework assignment that I set for WEEK.

As you may have guessed, I would blithely ignore Dr Codd’s compatibility constraint and union these tables (please excuse the table name – I just couldn’t resist it):

STUDENT_UNION:    NAME      WEEK
 
                   Douglas
                   Judy
                   Judy      Week 1
                   Judy      Week 2
                   Judy      Week 3
                   Judy      Week 4
                   ...      ...
That’s – roughly – how I’ld collect together the information from both my tables to get that list I showed you at the end of Lecture XII.

I could get a report formatter to tidy it up, like this:

STUDENT   ASSIGNMENTS
          ON TIME
 
Douglas
 
Judy
          Week 1
          Week 2
          Week 3
          Week 4
...
I did try a strictly relational method of getting STUDENT and ASSIGNMENT together; but I found a problem, which I can sum up in one word: Douglas. What I tried first was a natural join (on NAME of course). But all that gave me was ASSIGNMENT again. You can see why: there isn’t anyone that doesn’t attend the course, and yet hands in the assignments on time (or, indeed, at all). And, if I might say so, there are not many even of those that do attend, but that’s not a problem in data base theory. In any event, those that attend the course and skip the homework get excluded by the join, and that is a problem in data base theory, because it’s them I want to hold up to rebuke.

 

 

Outer Join

 

I asked a friend who’s really into Relational Theory to give me a solution; and what he came up with was most extraordinary: an “Outer Join”. My problem, as he expressed it, was that the Natural Join – the “Inner” Join, as he called it – lost information (Douglas: that’s you, and I’d always thought you were merely a lost cause).

The Outer Join works like this: first you do the Natural Join (I suppose we had better call it the “Inner Join”). Then you spot the missing STUDENT records, and you extend each of them with a Null WEEK, and you Union them with the Inner Join result. This is what you get:

STUDENT_OUTER:    NAME      WEEK
 
                  Douglas   NULL
                  Judy      Week 1
                  Judy      Week 2
                  Judy      Week 3
                  Judy      Week 4
                  ...      ...
It’s very like STUDENT_UNION, isn’t it? Spot the differences? I did. The very first thing was that in STUDENT _UNION Judy had a row for herself, just like Douglas did. And I thought it was unfair that she lost her row and Douglas kept his, with a NULL to boot, when she’d been diligent and he’d been a slacker. Anyway, my friend said that Judy didn’t need her own row, and Douglas did, and he wasn’t in the charity business, handing out free rows to people that didn’t need them.

But then I got to worry about that NULL. Because I could see that in STUDENT _UNION Douglas’s row was just the same record as his row in STUDENT (it didn’t intersect the WEEK column at all), so it had just the same meaning as his row in STUDENT, and I knew what that meant. But what did Douglas’s row mean in STUDENT_OUTER? My friend said it meant that Douglas had handed in no week’s assignment. But my friend wasn’t worried that “no week’s assignment” not only didn’t name an assignment (that’s what makes it null), but it didn’t name anything. Anyway, I didn’t want Douglas’s row to tell me he hadn’t handed in an assignment on time: I could see that from the other rows, in none of which was he mentioned.

Naturally, I agonised a bit, and then it hit me what this NULL was. A pad. My friend wanted just what I wanted, two different formats of record in one table. But he couldn’t admit it (being an utter relational bigot), so he faked it. And then I came down on him, just as I promised you, like a ton of bricks. And he said that everyone in the relational world wanted these Outer Operations; and I said hadn’t they noticed they were asking for non-uniform tables? and he said no they hadn’t. Between you and me, there’s a technical term for that: stupidity.

 

 

Varieties of Outer Join

 

Well, I got to think about this Outer Join operation: actually there are three of them. The STUDENT and ASSIGNMENT Left Outer Join is the one we’ve used: it unions in the STUDENT rows with no ASSIGNMENT rows. The Right Outer Join would union in any ASSIGNMENT rows without STUDENT rows, but there aren’t any, so in this case it would also look just like ASSIGNMENT. I’ll fiX the data a little, by adding an expected degree grade column to STUDENT, and adding some idiot called “Darren” who does some of the homework but isn’t on the course:

STUDENT:           NAME      GRADE
 
                   Douglas   2/1
                   Judy      2/2
                   ...       ...
Sorry Judy, I’m a compulsive truth teller, and I’ve read your homework: there’s no justice in the world, is there?[1]

Anyway, here comes ASSIGNMENT, now enriched with nourishing Darren.

ASSIGNMENT:       NAME      WEEK
 
                  Judy      Week 1
                  Judy      Week 2
                  Judy      Week 3
                  Judy      Week 4
                  Darren    Week 2
                  Darren    Week 3
                  ...      ...
Now the Left Outer Join gives:
SA_LEFT:          NAME      GRADE     WEEK
 
                  Douglas   2/1       NULL
                  Judy      2/2       Week 1
                  Judy      2/2       Week 2
                  Judy      2/2       Week 3
                  Judy      2/2       Week 4
                  ...      ...       ...
And the Right Outer Join gives:
SA_RIGHT:         NAME      GRADE     WEEK
 
                  Judy      2/2       Week 1
                  Judy      2/2       Week 2
                  Judy      2/2       Week 3
                  Judy      2/2       Week 4
                  Darren    NULL      Week 2
                  Darren    NULL      Week 3
                  ...      ...       ...
And there’s a Symmetrical Outer Join that’s like the Left and Right together:
SA_BOTH:          NAME      GRADE     WEEK
 
                  Douglas   2/1       NULL
                  Judy      2/2       Week 1
                  Judy      2/2       Week 2
                  Judy      2/2       Week 3
                  Judy      2/2       Week 4
                  Darren    NULL      Week 2
                  Darren    NULL      Week 3
                  ...      ...       ...

 

 

Loss of Information on Outer Join

 

Frankly, Outer Joins worry me. Take that NAME column: sometimes its value is there because it names a student on the course (but not always, because Darren’s not such a student), and sometimes because it names a doer of homework (but not always, because Douglas is a non-doer of homework), and sometimes it names both (good old Judy). I want to look at a special case. Suppose I project ASSIGNMENT over NAME: it gives me a table of people that have handed in at least one assignment on time:

ONE_ASSGNT:       NAME
 
                  Judy
                  Darren
                  ...
When I Left Outer Join STUDENT to ONE_ASSGNT I get a table that looks the same as STUDENT. So I have lost information, and not just Darren: I would lose information even if Darren weren’t there. I’ve lost the vital information that Douglas is a slacker and Judy isn’t. But this Outer Join was introduced not for any reason of formal elegance or inescapability, but solely because it was supposed not to lose information.

You can see the problem: Natural Join was just fine (apart from the problem of Nulls: should we use “ = “ or “º”?) while it was merely Inner Join. But this Outer Natural Join, instead of “=” say, uses ... what? It uses:

For Left Outer Join, “]=”
For Right Outer Join, “=[”
For Symmetrical Outer Join, “]= OR =[”
And the sort of information that is lost is: would “=” have worked just as well? And it’s really worse than that, because to use those operators we have to assume a join with an empty row (in ASSIGNMENT for the left join, in STUDENT for the right join, and in both for the symmetrical join), and then we have to remove some of the rows from the result – Judy’s row, for instance – because they’re not “needed”, and the a!l-empty row in the case of the symmetrical join.

 

 

Intractability of Outer Join

 

But what is this I see before me? It is? Ladies and Gentlemen, I am pleased to inform you that Douglas has handed in his homework assignment for week 1. And Douglas is the first student to hand in a late assignment, which means that I shall have to redesign my ASSIGNMENT table:

ASSIGNMENT:       NAME      WEEK      ON_TIME
 
                  Douglas   Week 1    No
                  Judy      Week 1    Yes
                  Judy      Week 2    Yes
                  Judy      Week 3    Yes
                  Judy      Week 4    Yes
                  ...      ...       ...
Douglas: you’re still a slacker, as I shall now demonstrate with a query. I shall use a Tuple Calculus, like SQL, so first I’ll get all my data together with a Left Outer Join of STUDENT and my new ASSIGNMENT. Then I’ll apply the restriction condition, “ON_TIME = Yes”. Then I’ll project away the ON_TIME column. Here we go Douglas; you’ll find you’re still a slacker. First the Left Outer Join:
NAME      GRADE     WEEK      ON_TIME
 
Douglas   2/1       Week 1    No
Judy      2/2       Week 1    Yes
Judy      2/2       Week 2    Yes
Judy      2/2       Week 3    Yes
Judy      2/2       Week 4    Yes
...       ...       ...       ...
Funny! Douglas’s special little row has gone. Aha! That’s because he now has a row showing a returned assignment. Now the restriction:
NAME      GRADE     WEEK      ON_TIME
 
Judy      2/2       Week 1    Yes
Judy      2/2       Week 2    Yes
Judy      2/2       Week 3    Yes
Judy      2/2       Week 4    Yes
...       ...       ...       ...
And the projection:
NAME      GRADE     WEEK
 
Judy      2/2       Week 1
Judy      2/2       Week 2
Judy      2/2       Week 3
Judy      2/2       Week 4
...       ...       ...
Wait a minute! What happened to Douglas? I’m sorry, Douglas. You’re a slacker, but I didn’t want to lose you.

Can you see that to avoid losing Douglas I have to make sure that the restriction is applied before the join? So ASSIGNMENT gets restricted, like this:

NAME      WEEK      ON_TIME
 
Judy      Week 1    Yes
Judy      Week 2    Yes
Judy      Week 3    Yes
Judy      Week 4    Yes
...      ...       ...
And only then can I do the join. But we never had this problem with Inner Joins: it didn’t matter which sequence we did the restriction and the join in. Of course not: remember they are both “ANDs”. It doesn’t matter whether we say “p AND q AND r” or “p AND r AND q”. But Outer Joins aren’t ANDs. Pity. Yes: pity the poor user.

And what’s the problem? Remember my complaining about Judy losing her row in Outer Join (the one she had in the Union)? Well now poor Douglas has lost his just when he needed it. Let’s put them back in, and get the Union:

NAME      GRADE     WEEK      ON_TIME
 
Douglas   2/1
Douglas             Week 1    No
Judy      2/2
Judy                Week 1    Yes
Judy                Week 2    Yes
Judy                Week 3    Yes
Judy                Week 4    Yes
...       ...       ...       ...
Now the restriction: we’ll have to say “ON_TIME =[ Yes” (that will stop the short rows disappearing):
NAME      GRADE     WEEK      ON_TIME
 
Douglas   2/1
Judy      2/2
Judy                Week 1    Yes
Judy                Week 2    Yes
Judy                Week 3    Yes
Judy                Week 4    Yes
...       ...       ...       ...
We can go on to project away “ON_TIME”, but you can see we haven’t lost our dear old slacker. And we would have got just the same result if we had done the restriction first and then the union.

 

 

Conjoin

 

Do those absent GRADE values worry you? I thought so. The operation that I really want is, I’ll call it, Conjoin. The Left Conjoin of two tables, P and Q is the Union of P with the Natural Join of P and Q. In the case of STUDENT and ASSIGNMENT that gives us:

NAME      GRADE     WEEK      ON_TIME
 
Douglas   2/1
Douglas   2/1       Week 1    No
Judy      2/2
Judy      2/2       Week 1    Yes
Judy      2/2       Week 2    Yes
Judy      2/2       Week 3    Yes
Judy      2/2       Week 4    Yes
...       ...       ...       ...
Happy? The Right Conjoin is, of course, the Union of Q with the Natural Join of P and Q. And the Symmetrical Conjoin is the Union of P, Q, and the Natural Join of P and Q. And isn’t that a great conjunction operator? When it ANDs Pab and Qbc it gives us three rows: Pab and Qbc and Rabc. So now the user doesn’t have to choose between different sorts of conjunction, horizontal and vertical, so to speak. They get both: package deal, no extra charge (just extra rows). And, as all Conjoins are conjunctions, it makes no difference whether we perform restrictions before the conjoin or after it. Moreover, it doesn’t matter what sequence we conjoin in: P conjoined to Q and then conjoined to S gives the same result as conjoining Q and S and then conjoining P to them. You’ll find you can’t do that with Outer Joins either: they’re not associative in mathematical jargon. In English: they force users to make choices between the execution sequences of subtly different queries.

But now I will show you a better way. With Conjoin we have solved the problems of inadvertently losing Douglas, and of forcing subtle choices on the user, and of constraining the optimisation of query performance (because the optimiser needs to shuffle operations around), and of still allowing the user – in a Tuple Calculus language like SQL – to get all the data together, then restrict it, then project it. But we still have to handle the case when we project just the NAME in ASSIGNMENT and then do the Left Outer Join: remember we couldn’t spot the slackers in the result. (I know, Douglas: I’m going to apply “ON_TIME = Yes” to ASSIGNMENT first.) Conjoin doesn’t help us here. We have, remember:

STUDENT:        NAME
 
                Douglas
                Judy
                ...
ONE_ASSGNT:     NAME
 
                Judy
                Darren
                ...
Their Left Conjoin looks just like STUDENT. In other conjoins and left outer joins we have had extra columns in STUDENT or ASSIGNMENT or both, and in the result these have shown us whether the NAME value came from STUDENT or from ASSIGNMENT or from both. But we can’t always count on that: and this is a case in point.

 

 

Combination

 

The problem is Natural Join, of which both Outer Joins and Conjoins are species. But just as our old natural join, Inner Join, amounted to a Cartesian Product, a Restriction (using “=” or “º”), and a Projection (to remove one of the identical columns), so we might treat a Conjoin – a symmetrical conjoin – as a sort of Cartesian product, a Restriction. and a Projection. But what “sort of Cartesian product”?

The Union of P, Q, and the Cartesian Product of P and Q.
And this – I’ll call it Combination, – is The Great Conjunction Operator. Let’s combine STUDENT and ONE_ASSGNT.
STUDENT         ONE_ASSGNT
.NAME           .NAME
 
Douglas
Douglas         Judy
Douglas         Darren
Judy
Judy            Judy
Judy            Darren
                Judy
                Darren
...             ...
You see the STUDENT rows, Douglas and Judy; the ONE_ASSGNT rows, Judy and Darren; and the product rows. Now we do the Restriction: it’s:
STUDENT.NAME ]= ONE_ASSGNT.NAME OR
STUDENT.NAME =[ ONE_ASSGNT.NAME
In any concrete syntax – any DML – we would want an appropriate operator meaning “]= OR =[” so that we could use a single expression, but we won’t worry about that for the moment. We get:
STUDENT         ONE_ASSGNT
.NAME           .NAME
 
Douglas
Judy
Judy            Judy
                Judy
                Darren
...             ...
To get the asymmetry of the Left Conjoin we would use just the “STUDENT.NAME ]= ONE_ASSGNT.NAME” condition (giving the first three rows shown), and to get that of the Right Conjoin we would use “STUDENT .NAME =[ ONE_ASSGNT .NAME”. Here’s the left:
STUDENT         ONE_ASSGNT
.NAME           .NAME
 
Douglas
Judy
Judy            Judy
...             ...
And then we could use Functional Extension to get a new column, say “DONE” with value, “Yes”, defined as “STUDENT.NAME = ONE_ASSGNT.NAME” or “PRESENT (ONE_ASSGNT.NAME)”: we’ll assume this value absent when ONE_ASSGNT.NAME is absent. We get:
STUDENT         ONE_ASSGNT      DONE
.NAME           .NAME
 
Douglas
Judy
Judy            Judy            Yes
...             ...             ...
And then we could project away ONE_ASSGNT.NAME (at last we need it no more):
STUDENT         DONE
.NAME
 
Douglas
Judy
Judy            Yes
...             ...
But how do we get what we really want? That is:
NAME            DILIGENT
 
Douglas         No
Judy            Yes
...             ...

 

SITE HOME PAGE

Now there is a really interesting question.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XIV (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XV

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.