(U03)

www.btinternet.com/~adrian.larner/database/pcl20

PLATOCLAST
ON DATA

Lecture XX
More Data Manipulation

 

 

Hierarchies

 

Think back to our last lecture, to that old fashioned batch program. You’ll remember, of course, that the hierarchical structure of its input was all in the eye of the beholder, a view as we say. It was like this:

Data on Course 1
Data on Course 1 Offering 1
Data on Course 1 Offering 2
Data on Course 1 Offering 2 Student 1
Data on Course 1 Offering 3
Data on Course 1 Offering 3 Student 1
Data on Course 1 Offering 3 Student 2
Data on Course 2
Data on Course 3
Data on Course 3 Offering 1
Data on Course 3 Offering 1 Student 1
Data on Course 4
Data on Course 4 Offering 1
Data on Course 4 Offering 1 Student 1
Course 2 is a course with no offerings, and Course 1 Offering 1 an offering with no students. You’ll notice that our program works a bit like a tuple calculus DML, whose variables range over records. Our program “ranges over” these records in a literal sense: it processes them one at a time. We could easily add record selection to our program; suppose we were interested only in second year students. The program would, in effect, apply to each record (as it came to it) the condition, “STUDENT.YEAR =[ 2”. And that’s pretty much how a language like SQL applies a restriction, a WHERE condition.

Let me just remind you of our algorithm; I’ve reworked it slightly to do Initialisation, then the Read, then Termination.

DO
   IF NO RECORD THEN
      PERFORM PROGRAM INITIALISATION
   ELSEIF COURSE RECORD THEN
      PERFORM COURSE INITIALISATION
   ELSEIF OFFERING RECORD THEN
      PERFORM OFFERING INITIALISATION
   ELSEIF STUDENT RECORD THEN
      PERFORM STUDENT INITIALISATION
   ENDIF
   READ INPUT
   IF CHANGE (STUDENTID) THEN
      PERFORM STUDENT TERMINATION
      IF CHANGE (OFFERINGID) THEN
         PERFORM OFFERING TERMINATION
         IF CHANGE (COURSEID) THEN
            PERFORM COURSE TERMINATION
            IF END OF INPUT THEN
               PERFORM PROGRAM TERMINATION
            ENDIF
         ENDIF
      ENDIF
   ENDIF
   UNTIL END OF INPUT
ENDDO
I would stress again how easily this pattern could be implemented in an automatic program generator. But it has a little infelicity, doesn’t it? Just the sort of silly little thing you would expect me to worry about. The Initialisations appear in a straightforward case structure, but the Terminations in nested IFs. Why should this be?

You can see, I hope, that the reason is: we have a separate record for each Initialisation. But we don’t have a record for each Termination: in the extreme case we do STUDENT TERMINATION when we read a new Student Record, or an Offering Record, or a Course Record, or at end of input. And, except in the first case (a Student Record) we also do OFFERING TERMINATION.

Suppose, instead of using Left Conjoins we had used Left Outer Joins. Our input would look like this:
Data on Course 1 Offering 1
Data on Course 1 Offering 2 Student 1
Data on Course 1 Offering 3 Student 1
Data on Course 1 Offering 3 Student 2
Data on Course 2
Data on Course 3 Offering 1 Student 1
Data on Course 4 Offering 1 Student 1
Then we would have to do, for instance, COURSE INITIALISATION and OFFERING INITIALISATION on the first record. And even our termination algorithm would have to change: notice that we want to drive STUDENT TERMINATION between the second and third records, although the STUDENTID doesn’t change. And a similar thing happens on offerings in the last two records shown.

 

 

Level Breaks

 

In the RPG language, which handled this sort of structure, what we used to do was define “Level Break” Indicators: we would have used L0 for Student, Ll for Offering, L2 for Course, and LR – that means “Last Record” – for end of input. RPG would then switch on L0 when the Student Id changed, Ll when the Offering Id changed, and so on. And it would cascade the level breaks: when it switched on a level break indicator it automatically switched on all the lower-numbered level break indicators (all of them for LR). So all we had to do was to say for each indicator which field change set it on, and what Termination and Initialisation processes (and output) it controlled.

I want to bring a special case to your notice – not very special, really, but very common. As you know, the COURSEID of this course is “ITRDB”. Actually it’s two fields: “IT” for the department, Information Technology, and “RDB” for the subject, Relational Database. Now suppose we wanted to level break on Department as well as on Course (perhaps we might want to calculate total student registrations for all the courses in each department). We wouldn’t have a special record to drive DEPARTMENT INITIALISATION would we?

 

 

Headlining

 

Naturally, the RPG Level Break system would work: we would set L3 on the two-letter Department Id. But notice what we would need to get such a special record: we would have to take the records that are Department projections of our input records, and union them into the input. We might call such an operation “Headline”:

The (column-list) HEADLINE of P is the UNION of P with the (column-list) PROJECTION of P.
You will appreciate that Headlining could also be used to convert a Left Outer Join (ignoring those silly nulls) into a Left Conjoin. Now I’m getting suspicious looks from some of you: I must be getting up to something, you think. And I’m certainly not up to teaching you all the delights of RPG, so it must be something else. Mind you, I would like to teach you all RPG: as well as its lovely level break structure it also had a facility called “Matching Records” which enabled you to read in two files and handle their Symmetrical Conjoin record by record.

Cast your minds back now to Lecture XV, when we were trying to use the Collection operator to set up that DILIGENT column: the one that told us which students were diligent and which ones were slackers.

Remember I said we wanted to do two things:

Discover whether, for each student, there was an on-time assignment
 
Get a result with one row for each student.
And I said: these are separate problems. Then we solved the first, and the second was trivial. It was just a projection. The reason it was trivial was that one row for each student was all we wanted; we didn’t want any other rows. But now we’ve seen when we might want one row for each such-and-such – one row for each Department – and still want to keep all our other rows: one for each Course, Offering, and Student. And the answer to that problem is again projection, but in the special form of headlining, when we keep all the old data and the projected data as well; we union them together.

Well we got to headlining by considering batch programs, but actually that’s where the headline operation itself isn’t particularly important: we can get the same effect in a level break structure. No, it’s in ordinary queries where we need headlining, and I’ll try to explain why. But first of all I want to make two points. The first is that headlining is an utterly non-relational operation: it’s explicitly designed to create a table with records of different formats. The second is that when we look at our RPG-like algorithm we find a highest level in the hierarchy, which we called PROGRAM. Does anything correspond to this is an ordinary query?

 

 

Empty-Headlining

 

Imagine that we headline a table using an empty column-list, so the projection gives us a row with no columns. Yes, I know that – following the TT-DD – there’s no such thing, no empty row. But there is at the concept level, because we can give such a row a record identifier at the form level, which is esoteric at the concept level. But it’s just present enough, so to speak, to allow a concept level row to exist with no columns. Or – if your conscience is too acute for that – pad your records. The dubious empty row is then a nice solid row, with pads in every column.

Now consider any tables, say p and q, that don’t contain an empty row. Let p+ and q+ be p and q respectively, headlined with an empty column list. We now have an interesting identity that might give you a bit more insight into the combination operation:

r = p;q IFF r+ = p+, q+
The combination of p and q (namely r) when empty-headlined (r+) is the Cartesian product of the empty-headlinings of p and q (i.e. p+ and q+). So you can see just how close our new primitive operation, combination, is to our old primitive Cartesian product. Merely headline every table with the empty column-list and we can go back to using Cartesian product.

And you will be able to see, I guess, how we can think of the Left Conjoin of p and q as an inner join of p and q+; and their Right Conjoin is of p+ and q. The Symmetrical Conjoin is an inner join of p+ and q+. I’ll let you work out what the join conditions have to be (not simple equality, I’ll tell you that).

 

 

Constraints of Uniform Format

 

Very well: you can see that if we really need headlining then it’s one more nail in the coffin of uniform, rectangular tables. So let’s look at how things are managed in a relational language, SQL. In SQL we talk about “grouping”. Let’s go back to our old friend, the assignment table.

ASSIGNMENT:       NAME      WEEK      ON_TIME
 
                  Douglas   Week 1    No
                  Judy      Week 1    Yes
                  Judy      Week 2    Yes
                  Judy      Week 3    Yes
                  Judy      Week 4    Yes
                  Darren    Week 1    No
                  Darren    Week 2    Yes
                  Darren    Week 3    Yes
                  ...       ...       ...

 

 

Grouping

 

Now I want this query:

For each student that has returned at least one assignment, list their name and the number of assignments they’ve returned.
And here it is in SQL:
SELECT NAME, COUNT(*)
  FROM ASSIGNMENT
 GROUP BY NAME
What does “GROUP BY NAME” mean? It means two things – the very things that I wanted to distinguish. It means that the values of NAME determine the scope of the collection of rows to be counted; and that one row in the result will be formed from each collection of rows, and therefore the level of detail is: one row for each value of NAME.

I’m going now to modify this a little. Firstly, I’m going to suppose that SQL gets modified to count values in a column, rather than rows. Secondly, I’m going to ignore recent weeks:

SELECT NAME, COUNT(WEEK)
  FROM ASSIGNMENT
 WHERE WEEK <= "Week 4"
 GROUP BY NAME
Notice the sequence of operations: FROM to get the records, then WHERE to apply the restriction, then GROUP BY to form the collections, then the accumulation (COUNT) is performed, then we get the degrouping into one row per NAME value, then the SELECT for projection. So suppose I want to add the condition: omit any student that has done all the assignments. This condition, “4 ¬= COUNT(WEEK)”, can’t go into the WHERE condition, because we want it applied after grouping.

This is how it’s done:

SELECT NAME, COUNT(WEEK)
  FROM ASSIGNMENT
 WHERE WEEK <= "Week 4"
 GROUP BY NAME
HAVING 4 ¬= COUNT(WEEK)
The HAVING condition is applied just before the SELECT. And now we’ll just modify the query enough to hit trouble. Instead of an assignments, we’ll count just on-time assignments.
SELECT NAME, COUNT(WEEK)
  FROM ASSIGNMENT
 WHERE WEEK <= "Week 4" AND ON_TIME = Yes
 GROUP BY NAME
HAVING 4 ¬= COUNT(WEEK)
See the problem? Yes, it’s Douglas again. We should have “Douglas zero”, but we don’t get a Douglas row at all: “ON_TIME = Yes” eliminates Douglas. The solution is obviously to headline NAME before the WHERE is applied. The WHERE condition has to become “ON_TIME =[ Yes”.

You will remember that we had a very similar problem on Outer Joins when we complicated them with restrictions, but we countered that by going to Conjoins. Well, it wasn’t merely a similar problem, it was the same problem. And Headlining is, in effect, a conjoin. It’s the Left Conjoin of the projected column(s) with the original table.

Notice that this seems to put an end to our ambition to reduce all queries to the form:

1
Get the data together
2
Apply the restrictions
3
Project the output
Now we want to project – to headline – data before applying any restriction. And we have to fit accumulations in after restrictions, but then we might have more restrictions on the accumulations. I don’t think things are quite that bad, but life is beginning to look complex again.

 

 

Nesting Queries

 

In any event; I think you can see how this constraint of uniform format gives us horrid problems. In current SQL they are overcome, to some extent, by the use of nested queries.

For instance, to list those students having exactly n assignments in on time, we could code:

SELECT NAME
  FROM ASSIGNMENT Al
 GROUP BY NAME
HAVING n =
       (SELECT COUNT(*)
          FROM ASSIGNMENT A2
         WHERE AI.NAME = A2.NAME AND ON_TIME = Yes)
Or words to that effect (I’m no SQL expert); the “A1” and “A2” mean: call ASSIGNMENTA1”, call another copy of ASSIGNMENTA2”. Funnily enough this works, because if you miss out a GROUP BY in a SELECT but use an accumulation, like COUNT, the grouping is done over the entire table: just as if an empty-headline had been performed. And n can even be zero. But isn’t it quite, quite horrible, and moving away from the tuple calculus towards the algebra. Perhaps I misled you a bit by saying SQL was tuple calculus based. Really it’s neither fish nor fowl.

I should say that there are lovers of nested queries; mostly the same people that love algebra. But I suspect that most users hate them, and they are horrible to optimise.

 

 

Relaxing the Uniform Format Constraint

 

Of course, it would be much better if we weren’t restricted to relations, then we might say something like:

SELECT NAME, COUNT(WEEK)
  FROM ASSIGNMENT
HEADLINE NAME
 WHERE ON_TIME =[ Yes
Though it would be nicer not to need an explicit “HEADLINE”. Perhaps we should allow scope and conditions in the COUNT:
SELECT NAME,
       COUNT(WEEK WHERE ON_TIME = Yes BY NAME)
  FROM ASSIGNMENT
But I’m not here to redesign SQL. I should ask you to note that the problem solved by headlining is the inverse of one we met earlier. Do you remember how we discussed both union and Cartesian product (or natural join) as different sorts of conjunction (vertical and horizontal, we might say)? And we said that we didn’t want the user to have to choose between them, but we preferred combination, especially for a tuple calculus. It allowed the tuple variable to range over Pab and Qbc (the unioned rows) and Rabc (the joined row).

Now we have some row, say Rbc, and we want the tuple variable to range over that and its projection, say Pb. Headlining is an explicit request for this to happen. But might we not handle it implicitly? Could we have a DML whose tuple variables ranged over not only every row, as we think of rows, but over every projection of every row? There’s a thought.

In SQL it could make our query look something like:

SELECT NAME, COUNT(WEEK BY NAME)
  FROM ASSIGNMENT
 WHERE ON_TIME = Yes
And, of course, it would make combination the very same operation as Cartesian product. And Conjoins the same operations as Outer Joins. A bit mind-boggling that: I’ll let you boggle for a while.

 

 

“Ideal” Query Scheme

 

Let’s think a bit more about our idealistic scheme of doing all queries by getting their data together, applying restrictions, and projecting. We’ve applied our now boggling minds to the possible need for headlining before the restrictions. We’ve never bothered to say where the extensions fit in. And we’ve just raised the problem of restrictions on the results of accumulation functions.

Well, accumulations are a special sort of extension, and it seems to me that they ought to fit in before the restrictions. Though this means that they may need to contain conditions, as in that “COUNT (WEEK WHERE ON_TIME = Yes BY NAME)”. I’m not too worried by that: conditions are only a sort of calculation with truth-valued operands. And I’m very chary about early restriction. Take that last query, giving the student names and the counts of on-time assignments. What would happen if we wanted to have the count of other assignments as well? You can see we can’t use the WHERE condition “ON_TIME = Yes”, nor “ON_TIME = No”, because we want both. So it’s probably better to say:

SELECT NAME,
       COUNT(WEEK WHERE ON_TIME = Yes BY NAME),
       COUNT(WEEK WHERE ON_TIME = No BY NAME)
  FROM ASSIGNMENT
I’m sure we ought to have such accumulations, indeed we would be hard put to avoid them if we used encapsulated accumulation functions. I must confess, I’m not at all certain that all queries could conveniently be shoe-horned into the “ideal” sequence of data gathering, extension, restriction, and projection. But it might still be an ideal worth following. We all need one of those, don’t we?... We don’t? Is Idealism dead?

 

 

Recapitulation

 

Do we have a few minutes? Good. I hope you’ve begun to see how SQL – not peculiarly SQL but any language that handles only relations, only tables of uniform format – has serious limitations.

Inner Joins, which are what we usually end up with after Cartesian Product and Restriction, just lose all reference to the unmatched operand records: we need Conjoins, or – better – Combination.

Restrictions not only test conditions – find out whether they’re true or not – but also remove rows that don’t match. And sometimes we want these rows for other purposes, or we want Projections of them to GROUP BY.

Associated with these problems, we can’t properly handle empty groups: but empty groups are the very essence of tests for non-existence or universality or zero counts.

To overcome these restrictions we have to resort to nested queries, which – I think – are difficult to understand, or we have to distort our queries to get the results we want. But I do stress: it’s not SQL that’s the problem, it’s relational. Imagine a relational system in which we could do non-functional extension. Consider a relation with a single column containing -1, 0, and 1. I want to count the number of real square roots of these, so I should be able to say (call the relation “R” and the column “N”):

SELECT N, COUNT (ROOT)
  FROM R
 WHERE ROOT := REALROOT(N)
 GROUP BY N
Here I use the “WHERE” for an extension (surprisingly natural isn’t it?) But the row with -1 can’t be extended, and – I suppose – it disappears. What we need is a combined “Headline and Extend”: extend the table, and union the original table and the extension together.

Wake up! Can’t you see how dreadful this is? Look at this query:

SELECT SNO, COUNT(PNO)
  FROM SP,P
 WHERE SP.PNO = P .PNO
   AND P.TYPE = "SCREW"
 GROUP BY SP.SNO
It doesn’t give me a list of suppliers with the number of different types of screw they supply. This has got to be wrong. We’re not talking subtleties here; we’re not thinking about really obscure queries. Suppose we want to list all employees with the number of other employees living in the same area as them. We could say:
SELECT EI.EMPNO, COUNT(E2.EMPNO)-1
  FROM EMP El, EMP E2
 WHERE E1.AREA = E2.AREA
 GROUP BY E1.EMPNO
But we couldn’t say:
SELECT EI.EMPNO, COUNT(E2.EMPNO)
  FROM EMP El, EMP E2
 WHERE El.AREA = E2.AREA
   AND El.EMPNO ¬= E2.EMPNO
 GROUP BY El.EMPNO
In other words, if I say in my relational language just what I want to say, “number of other employees living in the same area”, I get the wrong answer. Instead I have to say “one less than the number of employees living in the same area”.

Now imagine that we want the number of other employers, other than sales representatives. We want to say:

SELECT EI.EMPNO, COUNT(E2.EMPNO)
  FROM EMP El, EMP E2
 WHERE E1.AREA = E2.AREA
   AND El.EMPNO¬= E2.EMPNO
   AND E2.JOB ¬= "SALES REP"
 GROUP BY El.EMPNO
But we have to say:
SELECT EI.EMPNO, COUNT(E2.EMPNO)-1
  FROM EMP El, EMP E2
 WHERE E1.AREA = E2.AREA
   AND (E2.JOB ¬= "SALES REP"
     OR E1.EMPNO = E2.EMPNO)
 GROUP BY El.EMPNO
That is, we ask for “one less than the number of employees that (1) either are not sales representatives or are the very same employee, and (2) live in the same area”.

 

SITE HOME PAGE

O Relational bigots: pull the other one.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XX (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XXI

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.