|
(U03)
|
www.btinternet.com/~adrian.larner/database/pcl14
|
|
PLATOCLAST ON DATA
Lecture XIV The Combination Operation
|
|
|
|
|
I want to go back to a problem
I raised at the end of Lecture XII.
Homework assignments.
As you know, big brother is watching you.
I’ve got a file on you; two files actually.
The first is my Student Table:
- STUDENT: NAME
- Douglas
- Judy
- ...
It means: There is someone that is a student on my course and is the same person as NAME.
What a good thing you’ve all got different names.
The second is my Assignment Table:
- ASSIGNMENT: NAME WEEK
- Judy Week 1
- Judy Week 2
- Judy Week 3
- Judy Week 4
- ... ...
It means: There is someone and there is something such that the former completed the latter on time
and the former is the same person as NAME and the latter is the same homework assignment that I set for WEEK.
As you may have guessed, I would blithely ignore Dr Codd’s compatibility constraint and union these tables
(please excuse the table name I just couldn’t resist it):
- STUDENT_UNION: NAME WEEK
- Douglas
- Judy
- Judy Week 1
- Judy Week 2
- Judy Week 3
- Judy Week 4
- ... ...
That’s roughly how Ild collect together the information from both my tables
to get that list I showed you at the end of Lecture XII.
I could get a report formatter to tidy it up, like this:
- STUDENT ASSIGNMENTS
- ON TIME
- Douglas
- Judy
- Week 1
- Week 2
- Week 3
- Week 4
- ...
I did try a strictly relational method of getting STUDENT and ASSIGNMENT together;
but I found a problem, which I can sum up in one word: Douglas.
What I tried first was a natural join (on NAME of course).
But all that gave me was ASSIGNMENT again.
You can see why: there isn’t anyone that doesn’t attend the course, and yet hands in the assignments on time
(or, indeed, at all).
And, if I might say so, there are not many even of those that do attend, but that’s not a problem in data base theory.
In any event, those that attend the course and skip the homework get excluded by the join,
and that is a problem in data base theory, because it’s them I want to hold up to rebuke.
|
|
|
|
|
|
|
|
I asked a friend who’s really into Relational Theory
to give me a solution; and what he came up with was most extraordinary:
an “Outer Join”.
My problem, as he expressed it, was that the Natural Join the “Inner” Join, as he called it
lost information (Douglas: that’s you, and I’d always thought you were merely a lost cause).
The Outer Join works like this:
first you do the Natural Join (I suppose we had better call it the “Inner Join”).
Then you spot the missing STUDENT records, and you extend each of them with a Null WEEK,
and you Union them with the Inner Join result. This is what you get:
- STUDENT_OUTER: NAME WEEK
- Douglas NULL
- Judy Week 1
- Judy Week 2
- Judy Week 3
- Judy Week 4
- ... ...
It’s very like STUDENT_UNION, isn’t it?
Spot the differences? I did.
The very first thing was that in STUDENT _UNION Judy had a row for herself, just like Douglas did.
And I thought it was unfair that she lost her row and Douglas kept his, with a NULL to boot,
when she’d been diligent and he’d been a slacker.
Anyway, my friend said that Judy didn’t need her own row,
and Douglas did, and he wasn’t in the charity business, handing out free rows to people that didn’t need them.
But then I got to worry about that NULL. Because I could see that in STUDENT _UNION
Douglas’s row was just the same record as his row in STUDENT
(it didn’t intersect the WEEK column at all),
so it had just the same meaning as his row in STUDENT, and I knew what that meant.
But what did Douglas’s row mean in STUDENT_OUTER?
My friend said it meant that Douglas had handed in no week’s assignment.
But my friend wasn’t worried that “no week’s assignment” not only didn’t name an assignment
(that’s what makes it null), but it didn’t name anything.
Anyway, I didn’t want Douglas’s row to tell me he hadn’t handed in an assignment on time:
I could see that from the other rows, in none of which was he mentioned.
Naturally, I agonised a bit, and then it hit me what this NULL was.
A pad.
My friend wanted just what I wanted, two different formats of record in one table.
But he couldn’t admit it (being an utter relational bigot), so he faked it.
And then I came down on him, just as I promised you, like a ton of bricks.
And he said that everyone in the relational world wanted these Outer Operations;
and I said hadn’t they noticed they were asking for non-uniform tables? and he said no they hadn’t.
Between you and me, there’s a technical term for that: stupidity.
|
|
|
|
|
|
|
|
Well, I got to think about this Outer Join operation:
actually there are three of them.
The STUDENT and ASSIGNMENT Left Outer Join is the one we’ve used:
it unions in the STUDENT rows with no ASSIGNMENT rows.
The Right Outer Join would union in any ASSIGNMENT rows without STUDENT rows,
but there aren’t any, so in this case it would also look just like ASSIGNMENT.
I’ll fiX the data a little, by adding an expected degree grade column to STUDENT,
and adding some idiot called “Darren” who does some of the homework but isn’t on the course:
- STUDENT: NAME GRADE
- Douglas 2/1
- Judy 2/2
- ... ...
Sorry Judy, I’m a compulsive truth teller, and I’ve read your homework: there’s no justice in the world,
is there?[1]
Anyway, here comes ASSIGNMENT, now enriched with nourishing Darren.
- ASSIGNMENT: NAME WEEK
- Judy Week 1
- Judy Week 2
- Judy Week 3
- Judy Week 4
- Darren Week 2
- Darren Week 3
- ... ...
Now the Left Outer Join gives:
- SA_LEFT: NAME GRADE WEEK
- Douglas 2/1 NULL
- Judy 2/2 Week 1
- Judy 2/2 Week 2
- Judy 2/2 Week 3
- Judy 2/2 Week 4
- ... ... ...
And the Right Outer Join gives:
- SA_RIGHT: NAME GRADE WEEK
- Judy 2/2 Week 1
- Judy 2/2 Week 2
- Judy 2/2 Week 3
- Judy 2/2 Week 4
- Darren NULL Week 2
- Darren NULL Week 3
- ... ... ...
And there’s a Symmetrical Outer Join that’s like the Left and Right together:
- SA_BOTH: NAME GRADE WEEK
- Douglas 2/1 NULL
- Judy 2/2 Week 1
- Judy 2/2 Week 2
- Judy 2/2 Week 3
- Judy 2/2 Week 4
- Darren NULL Week 2
- Darren NULL Week 3
- ... ... ...
|
|
|
|
|
|
|
|
Frankly, Outer Joins worry me.
Take that NAME column: sometimes its value is there because it names a student on the course
(but not always, because Darren’s not such a student),
and sometimes because it names a doer of homework (but not always, because Douglas is a non-doer of homework),
and sometimes it names both (good old Judy).
I want to look at a special case.
Suppose I project ASSIGNMENT over NAME:
it gives me a table of people that have handed in at least one assignment on time:
- ONE_ASSGNT: NAME
- Judy
- Darren
- ...
When I Left Outer Join STUDENT to ONE_ASSGNT I get a table that looks the same as STUDENT.
So I have lost information, and not just Darren:
I would lose information even if Darren weren’t there.
I’ve lost the vital information that Douglas is a slacker and Judy isn’t.
But this Outer Join was introduced not for any reason of formal elegance or inescapability,
but solely because it was supposed not to lose information.
You can see the problem:
Natural Join was just fine (apart from the problem of Nulls: should we use “ = “ or
“º”?)
while it was merely Inner Join.
But this Outer Natural Join, instead of “=” say, uses ... what? It uses:
- For Left Outer Join, “]=”
- For Right Outer Join, “=[”
- For Symmetrical Outer Join, “]= OR =[”
And the sort of information that is lost is: would “=” have worked just as well?
And it’s really worse than that, because to use those operators we have to assume a join with an empty row
(in ASSIGNMENT for the left join, in STUDENT for the right join, and in both for the symmetrical join),
and then we have to remove some of the rows from the result Judy’s row, for instance
because they’re not “needed”, and the a!l-empty row in the case of the symmetrical join.
|
|
|
|
|
|
|
|
But what is this I see before me?
It is?
Ladies and Gentlemen, I am pleased to inform you that Douglas has handed in his homework assignment for week 1.
And Douglas is the first student to hand in a late assignment,
which means that I shall have to redesign my ASSIGNMENT table:
- ASSIGNMENT: NAME WEEK ON_TIME
- Douglas Week 1 No
- Judy Week 1 Yes
- Judy Week 2 Yes
- Judy Week 3 Yes
- Judy Week 4 Yes
- ... ... ...
Douglas: you’re still a slacker, as I shall now demonstrate with a query.
I shall use a Tuple Calculus, like SQL, so first Ill get all my data together with a
Left Outer Join of STUDENT and my new ASSIGNMENT.
Then I’ll apply the restriction condition, ON_TIME = Yes”.
Then I’ll project away the ON_TIME column.
Here we go Douglas; you’ll find you’re still a slacker.
First the Left Outer Join:
- NAME GRADE WEEK ON_TIME
- Douglas 2/1 Week 1 No
- Judy 2/2 Week 1 Yes
- Judy 2/2 Week 2 Yes
- Judy 2/2 Week 3 Yes
- Judy 2/2 Week 4 Yes
- ... ... ... ...
Funny! Douglas’s special little row has gone.
Aha! That’s because he now has a row showing a returned assignment.
Now the restriction:
- NAME GRADE WEEK ON_TIME
- Judy 2/2 Week 1 Yes
- Judy 2/2 Week 2 Yes
- Judy 2/2 Week 3 Yes
- Judy 2/2 Week 4 Yes
- ... ... ... ...
And the projection:
- NAME GRADE WEEK
- Judy 2/2 Week 1
- Judy 2/2 Week 2
- Judy 2/2 Week 3
- Judy 2/2 Week 4
- ... ... ...
Wait a minute! What happened to Douglas? I’m sorry, Douglas.
You’re a slacker, but I didn’t want to lose you.
Can you see that to avoid losing Douglas I have to make sure that the restriction is applied
before the join?
So ASSIGNMENT gets restricted, like this:
- NAME WEEK ON_TIME
- Judy Week 1 Yes
- Judy Week 2 Yes
- Judy Week 3 Yes
- Judy Week 4 Yes
- ... ... ...
And only then can I do the join.
But we never had this problem with Inner Joins:
it didn’t matter which sequence we did the restriction and the join in.
Of course not: remember they are both ANDs”. It doesn’t matter whether
we say “p AND q AND r” or “p AND r AND q”.
But Outer Joins aren’t ANDs.
Pity.
Yes: pity the poor user.
And what’s the problem? Remember my complaining about Judy losing her row in Outer Join
(the one she had in the Union)?
Well now poor Douglas has lost his just when he needed it.
Let’s put them back in, and get the Union:
- NAME GRADE WEEK ON_TIME
- Douglas 2/1
- Douglas Week 1 No
- Judy 2/2
- Judy Week 1 Yes
- Judy Week 2 Yes
- Judy Week 3 Yes
- Judy Week 4 Yes
- ... ... ... ...
Now the restriction: we’ll have to say “ON_TIME =[ Yes” (that will stop the short rows disappearing):
- NAME GRADE WEEK ON_TIME
- Douglas 2/1
- Judy 2/2
- Judy Week 1 Yes
- Judy Week 2 Yes
- Judy Week 3 Yes
- Judy Week 4 Yes
- ... ... ... ...
We can go on to project away ON_TIME”,
but you can see we haven’t lost our dear old slacker.
And we would have got just the same result if we had done the restriction first and then the union.
|
|
|
|
|
|
|
|
Do those absent GRADE values worry you?
I thought so.
The operation that I really want is, I’ll call it, Conjoin.
The Left Conjoin of two tables, P and Q is the Union of P
with the Natural Join of P and Q.
In the case of STUDENT and ASSIGNMENT that gives us:
- NAME GRADE WEEK ON_TIME
- Douglas 2/1
- Douglas 2/1 Week 1 No
- Judy 2/2
- Judy 2/2 Week 1 Yes
- Judy 2/2 Week 2 Yes
- Judy 2/2 Week 3 Yes
- Judy 2/2 Week 4 Yes
- ... ... ... ...
Happy? The Right Conjoin is, of course, the Union of Q with the Natural Join of P and Q.
And the Symmetrical Conjoin is the Union of P, Q,
and the Natural Join of P and Q.
And isn’t that a great conjunction operator?
When it ANDs Pab and Qbc it gives us three rows:
Pab and Qbc and Rabc.
So now the user doesn’t have to choose between different sorts of conjunction, horizontal and vertical, so to speak.
They get both: package deal, no extra charge (just extra rows).
And, as all Conjoins are conjunctions,
it makes no difference whether we perform restrictions before the conjoin or after it.
Moreover, it doesn’t matter what sequence we conjoin in: P conjoined to Q
and then conjoined to S gives the same result as conjoining Q and S
and then conjoining P to them.
Youll find you can’t do that with Outer Joins either:
they’re not associative in mathematical jargon.
In English: they force users to make choices between the execution sequences of subtly different queries.
But now I will show you a better way.
With Conjoin we have solved the problems of inadvertently losing Douglas,
and of forcing subtle choices on the user, and of constraining the optimisation of query performance
(because the optimiser needs to shuffle operations around),
and of still allowing the user in a Tuple Calculus language like SQL
to get all the data together, then restrict it, then project it.
But we still have to handle the case when we project just the NAME in ASSIGNMENT
and then do the Left Outer Join:
remember we couldn’t spot the slackers in the result.
(I know, Douglas:
I’m going to apply “ON_TIME = Yes” to ASSIGNMENT first.)
Conjoin doesn’t help us here. We have, remember:
- STUDENT: NAME
- Douglas
- Judy
- ...
- ONE_ASSGNT: NAME
- Judy
- Darren
- ...
Their Left Conjoin looks just like STUDENT.
In other conjoins and left outer joins we have had extra columns in STUDENT or ASSIGNMENT or both,
and in the result these have shown us whether the NAME value came from STUDENT or from ASSIGNMENT
or from both.
But we can’t always count on that: and this is a case in point.
|
|
|
|
|
|
|
|
The problem is Natural Join,
of which both Outer Joins and Conjoins are species.
But just as our old natural join, Inner Join, amounted to a Cartesian Product, a Restriction
(using “=” or “º”),
and a Projection (to remove one of the identical columns), so we might treat a Conjoin a symmetrical
conjoin as a sort of Cartesian product, a Restriction. and a Projection.
But what “sort of Cartesian product”?
- The Union of P, Q, and the Cartesian Product of P and Q.
And this I’ll call it Combination, is The Great Conjunction Operator.
Let’s combine STUDENT and ONE_ASSGNT.
- STUDENT ONE_ASSGNT
- .NAME .NAME
- Douglas
- Douglas Judy
- Douglas Darren
- Judy
- Judy Judy
- Judy Darren
- Judy
- Darren
- ... ...
You see the STUDENT rows, Douglas and Judy; the ONE_ASSGNT rows, Judy and Darren;
and the product rows.
Now we do the Restriction: it’s:
- STUDENT.NAME ]= ONE_ASSGNT.NAME OR
- STUDENT.NAME =[ ONE_ASSGNT.NAME
In any concrete syntax any DML
we would want an appropriate operator meaning “]= OR =[”
so that we could use a single expression, but we won’t worry about that for the moment.
We get:
- STUDENT ONE_ASSGNT
- .NAME .NAME
- Douglas
- Judy
- Judy Judy
- Judy
- Darren
- ... ...
To get the asymmetry of the Left Conjoin we would use just the
“STUDENT.NAME ]= ONE_ASSGNT.NAME”
condition (giving the first three rows shown),
and to get that of the Right Conjoin we would use “STUDENT .NAME =[ ONE_ASSGNT .NAME”.
Here’s the left:
- STUDENT ONE_ASSGNT
- .NAME .NAME
- Douglas
- Judy
- Judy Judy
- ... ...
And then we could use Functional Extension to get a new column, say “DONE” with value,
“Yes”, defined as “STUDENT.NAME = ONE_ASSGNT.NAME” or “PRESENT (ONE_ASSGNT.NAME)”:
we’ll assume this value absent when ONE_ASSGNT.NAME is absent.
We get:
- STUDENT ONE_ASSGNT DONE
- .NAME .NAME
- Douglas
- Judy
- Judy Judy Yes
- ... ... ...
And then we could project away ONE_ASSGNT.NAME (at last we need it no more):
- STUDENT DONE
- .NAME
- Douglas
- Judy
- Judy Yes
- ... ...
But how do we get what we really want?
That is:
- NAME DILIGENT
- Douglas No
- Judy Yes
- ... ...
|
|
|
|
SITE HOME PAGE
|
Now there is a really interesting question.
|
|
THE DATABASE PAGE
|
|
THE DATABASE PAPERS
|
|
|
Preface & Contents
|
|
|
DOWNLOAD
|
Download
Lecture XIV (rtf, Word for Windows compatible)
|
|
Platoclast on Data: Lecture XV
|
|
|
Copyright © 1993, 2001 Adrian Larner.
The author asserts all moral rights.
|