(U03)

www.btinternet.com/~adrian.larner/database/pcl18

PLATOCLAST
ON DATA

Lecture XVIII
Recapitulation

 

 

I’m afraid that we’ve come a long way round to where I wanted us to get to. But sometimes the longest way round is the shortest way home. What I’m going to do in this lecture is to try to say where we’ve got to, and then to look back along the way, so we all understand how, and why, we arrived here.

 

 

Tables and Records

 

For a long time, at the beginning of the course, we hacked away intermittently at that curious problem of whether tables were essential. And one way to put that problem is to ask; what is the largest, the most extensive, object that we want our DML to handle? Well, for good or ill, the answer we gave was: a record. Though, of course, we recognised the utility of referring to our records by common names, “EMPLOYEE” and the like, and thereby allowing the operations specified on a record with such a name to be applied automatically to all records of that name. And that’s what gives the illusion that a DML has to be concerned with tables.

I stress; a DML doesn’t have to be concerned with tables. A queried table, if we must talk of such a thing, comprises simply the records deemed to be “contained” in it. Of course, I don’t care if a user thinks in terms of tables, or generally talks in set-speak (though I don’t recommend it); I’m just looking for something that is a member of the Simple, Safe Theory Club.

 

 

Data Manipulation

 

Having identified the record as our most extensive structure, and its interpretation as a certain form of proposition, I was then concerned to see what facilities were needed for queries, that is, implications. And we showed that the major part of the work could be done by a variety of operations on records, all – as far as the FOPC is concerned – conjunctions. As it happens, we also managed to express extensions and restrictions in a similar way. We ended this stage, you’ll remember, with, for records, the Great Conjunction Operation: Combination. Restrictions and Extensions were given somewhat different handling in the Algebra and Tuple Calculus, but we thought that users would probably think of them in a slightly different way, so we didn’t worry too much about that.

I want to remind you how our new combination operation allowed us to define not only our old friends Cartesian product and Inner Natural Join, but also those extended natural joins that we called Conjoins, and found preferable to the Outer Joins that the relational people seem to love. Moreover, we also used combination to define Union, which seemed to threaten the simplicity of our DML by having an interpretation of alternation instead of conjunction (OR rather than AND).

Projection, after a long struggle, and our new interpretation, EI, turned out to be the inverse of a conjunction. Not, of course, that we adopted EI to change the interpretation of projection: we did it primarily to avoid going from truth to falsity, and all these other things were added unto us. We examined other existential quantifications, and it seemed that they could be handled by calling encapsulated functions (such as COUNT), as long as we had a Collection operation. So we allowed ourselves such an operation, but salved our consciences by showing that any use of it could be expressed (though perhaps longwindedly) as a number of conjunctions (ultimately combinations).

And then we took the last primitive relational operator, Difference, and showed how we could handle it using no more mechanism whatsoever. And that brought us where we are now, in terms of operations on records: each of these operations is definable on conjunctions, or on the inverse of a conjunction in the case of projection.

 

 

Fields

 

But that was only how we tackled our DML, or the theory behind it, from the top; considering its most extensive object, the record. We also needed to tackle it from the bottom, considering its least extensive object. That, of course, is the field. And we now conceive a field as a token. So our DML is demarcated below by an interface to a Token Handler.

I’ll come back to the token handler, and what it comprises, but let me stress that the tokens, as far as the DML are concerned, amount to the values in fields, and are therefore interpretable as the names used in the equivalences of EI. Quite how we handle the field types (formats) I’ve left rather vague. We could have a DML that handled its records in tables, picking up column names – well, column name tokens – from record definitions. Or we could, somewhat more flexibly, include column (and therefore domain) information in each token; that is, we would have functions like “COLUMN” and “TYPE” that the DML would apply to tokens to return column and data type identifying tokens.

 

 

Handling Data Outside the Manipulation Language

 

In addition to the DML proper, I assume one or more Presentation Handlers. While the DML would produce records, or – taking them collectively – tables, as a result of its queries, the presentation handlers would process those tables in order to build character arrays, or images, or some combination of those, for presentation to a user at a VDU or on paper. A presentation handler would need access to DISPLAY functions in order to convert tokens to characters or images. It is in such a handler that we would manage, for instance, natural language support.

As part of, or in addition to, presentation handlers, we would also want Report Formatters that could restructure – strictly, I suppose, denormalise – tables. These would allow the user to do a lot of traditional things: sequencing, level breaking, and the like; and perhaps some lovely new-fangled things as well, about which you wouldn’t expect an old fogey like me to know. Notice that we might want some report formatting – “token shuffling”, we might say – before presentation (conversion to characters or images), and some more report formatting afterwards.

And, just as we want presentation handlers and report formatters for our output, we probably want all sorts of user-friendly interfaces to pick up input, and to assist the user to interpret data and formulate queries.

Now I mention all these extra bits and bobs not just to show how much I’m on top of my job but to explain why I don’t feel the need to keep stuffing more and more function into the DML. So many of the things that threaten to make DMLs complex don’t belong in DMLs at all. On the output side, for instance, people often want reports with level breaks: revenue by sales representative within office within area. These reports look like hierarchies, but don’t you be fooled, they are hierarchies. And then people argue that we should have a DML that handles hierarchies, with all their implicit connections between records. In reports, these implicit connections are shown by sequencing and indentation. But I would want to say: let the report formatter handle hierarchies, as long as the DML can give it the normal records it needs to build them.

On the input side, as I mentioned before, the Natural Join School want the user to be able to specify a join between two types of record – CUSTOMER and INVOICE for instance – and for the system to work out what column they should be joined on, Customer Number I suppose. But I don’t want that. I’m quite happy to have a front-end, an input handler, that guesses what join is needed, confirms it with the user, and then submits the query to the DML. This allows us to have a DML that simply performs the queries passed to it, without all that clever second-guessing of the user. And I’m going to maintain that position until someone publishes a decent theory of second-guessing, and proves it correct.

 

 

Token Handling

 

Back to the token handler. What do we want it to do? We’ve seen quite a lot that it should do: handling token identity, in some cases inequalities on tokens, and other highly overloaded functions. These include, perhaps, checking for nulls and pads, the THE and PROP functions, the Reduction operation, DIFF and UNDIFF, the “]” and “[” operations, and the PRESENT function. The token handler itself calls proper and improper functions (it is the interface between the DML and those functions).

Think back to our interpretation of Functional Extension, for instance:

Rabc = Pab & c º q(b)
Well, think about the components of this expression. “R” and “P” are predicate letters; they stand for predicates. And “q” is a function letter; it stands for a function. And the FOPC knows that R and P are predicates and q is a function. But it doesn’t know what predicates, or what function. By contrast, it knows not only that “&” is a truth function, but also which truth function it is: conjunction.

Now imagine that we wanted to work out whether that expression was true, taking the “=” to mean “iff”, so known to the FOPC. Well, we can imagine just passing it to a pretty clued-up implementation of the FOPC, and it would scan it, and then say to us:

Is R true of a, b, and c in that order?
Is P true of a and b in that order?
Is c the q of b?
And then, when we answered those, it would able to say “Yea” or “Nay”, as the mood took it.

Our DML, you will recall, is based on the FOPC – now merely on part of the FOPC. Our tokens are effectively the “a”, “b”, and “c” in our definition. Our encapsulated functions – all in the Token Handler as far as the DML is concerned – are like the q in our definition. What the DML does know about are its native operations, like “&”. And, just as the dictionary describes the encapsulated functions, so the data base definition describes record types like “P”, and the actual data base describes the records like Pab.

Now, imagine how complex the FOPC would become if we started adding predicates to it. Suppose, for instance, that the TT-DD, instead of being a theory built on the FOPC, were actually part of the FOPC. Then, every time we added a new predicate or function or definition or postulate we would have to make sure the FOPC was still reliable: every brick we laid would shake the foundations. I think that would be intolerable.

And I want to place the same sort of constraints on the DML, because it is simply a form of the FOPC, or of a part of that logic. So, as long as we don’t add anything to it, it is completely reliable. We don’t have to keep re-thinking our DML as we add data types: numbers, characters, part numbers, dates, text lines, signatures, images, moving images, or whatever (any more than we have to rethink it because we insert records, or define new record types). That’s really what encapsulation gives us.

Of course, we want our encapsulated functions to be reliable as well. We don’t save ourselves essential work; but we divide and conquer. We get the DML right; then we get each data type right; then we get each function and operation right. We conquer; but we don’t have to keep reconquering. And when you consider the pain you’ve all been through just listening to me conquering a simple little DML, or rather reconquering it in the footsteps of Dr Codd (and a few of my own), you may agree that getting it right and not constantly enhancing it is a good idea.

 

 

First Order Logic Subset Data Manipulation Language

 

As you may have noticed, I’ve done somewhat more (or should I say “less”?) than base my DML on the FOPC. Having required a token handler, and all those encapsulated functions reached through it, I’ve actually pushed a number of FOPC-like operations out of the DML entirely. And that’s why I say my DML is based on only part of the FOPC, specifically the conjunction of predicates (and its inverse, which we needed for projection).

Let me explain why I did this, though we haven’t as yet worked out all the implications. The operations we were concerned with were existential quantification and negation.

1
I happen to know, as you do not, a nasty little problem with the FOPC. It’s too complex. It’s so complex that there is no general, mechanical method for telling whether or not one of its formulae is inconsistent (self-contradictory), either in every universe of discourse, or even in every finite universe.
 
I should remark, by the way, that set theory is far, far worse. There we cannot even tell whether or not the theory itself is inconsistent, whereas we know the FOPC is perfectly consistent. When I say we cannot do these things, I don’t mean merely that we haven’t found out how we might do them. I mean we have proofs that they can’t be done.[1]
 
But if we get rid of negation and existential quantification, we don’t have those worries.
2
As we have seen, existential quantification is – in form – very similar to lots of other operations that involve reduction of a collection. These operations include such functions as COUNT and TOTAL that we know our users will want. So why have two mechanisms? Existential quantification, PRESENT, is such a minimal case of an accumulation function, indeed just a special condition on COUNT, “NOT 0 = COUNT ...”, that to have a special FOPC operation for it would be wildly extravagant.
 
Our other use for existential quantification was in projection, where it has now given way to inverse conjunction. Believe it; that was a surprise to me because it was an unexpected by-product of sorting out interpretation. I dare say some of you are worried because I said we could keep projection by existential quantification at the form level. We can, but we don’t have to: I’m sure you all remember HI, that dreadful compromise between CI and EI: perhaps it wasn’t so dreadful after all.
3
Negation: its disappearance from the DML into the token handler came pretty simply. It sort of strolled along negligently after existential quantification and fell down the same hole.
 
That pleased me because we’d had a bit of trouble – logical trouble – with negation, but I expect you’ve forgotten it by now. Recall that tiny table, S:
 
S:        NAME       DONE
 
          Douglas
          Judy
          Judy       Yes
          ...        ...
 
And remember how we used collection to get:
 
          NAME       DILIGENT
 
          Douglas
          Judy       Yes
          ...        ...
 
And, of course, we could have put “No” in Douglas’s row, but we didn’t. Just as well, because we had an absolutely knock-down proof that Douglas’s line just couldn’t be interpreted as “Douglas is a slacker”, meaning “Douglas has handed in no homework assignment on time.”

Before I sort out that last little problem, let me go back a long way to the concept of the efficiency of a theory. That meant, remember, the ratio of what we got out of it to what we put in. In linguistic terms – and that’s not far from a DML – what we were looking for was a minimal vocabulary that carried maximal expressive power. Now, it’s difficult to compare logics for efficiency. But for sure, the FOPC without existential quantification and negation is certainly more efficient than the full FOPC, if it has the same or greater expressive power.

We’ve seen how to express all the primitive relational operations – join, restriction, projection, union, and difference – in our DML; we’ve managed something better than outer joins, namely Conjoins, and we can manage outer joins – I’ll come to that. So it looks as if we have a DML that’s more efficient than the Relational Algebra or Calculus, or current DMLs built on them. But what is more: I have wonders in store for you. I have yet to show you how to do queries that are impossible in relational systems. Our DML, we shall find, is doubly more efficient than relational DMLs: not only does it require less than the Relational Algebra or Calculus, but it delivers more.

 

 

The “Closed Data Base” Assumption

 

Now let’s go back to that important question whether or not we can show that Douglas is a slacker. You can see where our problems come from. The records we keep say: so-and-so has returned such-and-such assignment on time. We don’t keep any records saying that they haven’t returned an assignment on time. So the conclusion that, for instance, Douglas has returned no assignment on time depends on an extra assumption: that if anyone has returned an assignment on time then it says so in our data. But that assumption – in general it’s caned the “Closed Data Base” assumption – is not itself part of our data.

And I see no reason to make it part of our data, or to think of it as a postulate of the theory that our data base represents. It’s not always true, in the general case.

Think what happens when a user specifies a query with a restriction in it. They ask for employees with, say, SALARY greater than £30,000. Obviously “each employee has a salary greater than £30,000” is not a theorem of our theory; it’s not even a proposition consistent with our theory as some of our employees, we might well assume, get much smaller salaries. They’re on the poverty line, really.

But the records the user gets as a result of the query also represent a theory: it’s a subtheory of our original theory (the one represented by the data base). And “each employee has a salary greater than £30,000” is a theorem of that theory. Of course, it’s not that one theory is true and the other false. They are both true, but “employee” in one theory names only some of the persons that “employee” names in the other.

Well, if – as we surely must – we allow users to build theories from theories in this way, merely by using restrictions, there is no reason why they shouldn’t use the facilities of the DML to build theories, query by query, containing the closed data base assumption. And that’s just how we can understand the query we need to go from the first of these tables to the second:

S:        NAME       DONE
 
          Douglas
          Judy
          Judy       Yes
          ...        ...
 
 
          NAME       DILIGENT
 
          Douglas    No
          Judy       Yes
          ...        ...
 
We might write the query like this:
(NAME, DILIGENT)\S//DILIGENT := IF PRESENT (NAME # DONE) THEN Yes ELSE No
And obviously what a record in the result tells us is whether we have recorded an assignment completed on time by the named person. And that’s OK. But let me stress two things:
It’s our interpretation that if there’s no assignment recorded, then there has been no assignment completed on time; and as long as that assumption is true we are all right. If it’s true, and we know it’s true, why should we bother whether it follows from our data or not?
 
(Of course we could say that that assumption is part of the meaning of “PRESENT”, and then our desired conclusion – that Douglas is a slacker – would follow from our data; but not by means of the FOPC alone, for “PRESENT” is no part of the FOPC.)
 
Even if we don’t make that assumption part of the meaning of “PRESENT”, what “PRESENT” means – that a certain sort of record, or a certain kind of information, is recorded – is not expressible in the FOPC. Yes, I know it’s expressible in the TT-DD, which is built on the FOPC: we introduced “is kept” just for that purpose. But we added that predicate. And in the same way we added the encapsulated function “PRESENT”.
 
(So even the conclusion that there is no on-time assignment recorded for Douglas doesn’t follow from the FOPC. It follows from the FOPC enhanced with “PRESENT”.)
I stress that I’m not saying we don’t want functions like “PRESENT”, or – in our restriction example – “greater than”. I’m saying merely that we want them in the right place: not in the DML but encapsulated away in the token handler or a function handler. Of course, their names are parts of the DML, where they work like the letters we use for schematic predicates and functions work in the FOPC. To put it starkly: the DML works just the same whether “PRESENT” means “present” or “absent” or even “3 = COUNT ...”; and whether “greater than” means “greater than” or “less than” or “is the same number modulo 4”.

But it’s not quite that stark. Though it could be; at the cost of performance. Remember that we needed some invariants specified, which tell the DML, at a certain level – where its queries are optimised – a number of important facts about functions. At that level, the DML does need to know that “greater than” is an ordering relationship, and that checking for presence of a type of record can cease when a record of that type has been found (contrast other accumulations, such as TOTAL, where all records must, be found).

 

 

Outer Join Defined

 

I think that’s about enough. No – I was going to define Outer Joins for you. To get the Outer Join of Pab and Qbc we first form the Conjoin (let’s use the Left Conjoin and define the Left Outer Join). And then we apply a restriction, namely:

PRESENT (c) OR NOT PRESENT (b # c)
PRESENT(c)” means that we have an Rabc row, and “NOT PRESENT (b # c)” means that we have a row whose b value is in no Rabc row. I can’t imagine why anyone would ever want such a funny restriction condition, but I thought I just ought to mention it. And that really is enough.

I had a student complaining after the last lecture that my theory was “too subtle”. Not so. You may, if you wish, compliment me on my subtlety; but my theory is not in the least subtle. There are no subtleties in logic. I don’t make a subtle distinction between what’s in the DML and what’s outside in the token handler and its appendages; I put a great wall around the DML, and I keep watch lest anything but conjunction enter there. Existential quantification and negation don’t subtly hover on the borderline; they go out the DML and across that borderline with Geoffrey’s big boot behind them. Subtle forsooth!

 

 

A Remark on Object Orientation

 

 

(It was objected that Professor Platoclast had given a fairly conventional account of strict data typing, but had dismissed the very same idea, viz Object Orientation, as a “whacky idea”.[2])

My dear chap, not so, not so. Let me sum up Object Orientation for you:

1
Strict Data Typing, just like what I said. I love it.
2
Type hierarchies, for instance, the rectangle type within the quadrilateral type within the polygon type within the shape type. I don’t mind it, but it’s not my business at the moment.
3
The association of each function – each sort of CALL in programming terms – with each variable of the same type as a chosen parameter of the function. These variables, along with the collection of their associated functions, are termed “objects”. I hate it.
Let me try to explain that last point. The first thing to say is that not all programming languages that are described as “Object Oriented” have this feature, so of them I have no criticism. But those that do have the feature work like this. We define encapsulated data types, in their hierarchies. Then we use the term “object” to denote any variable of a type, along with all the functions associated with that type. So a complex number object would be a complex number variable along with the REAL function, the IMAG function, the MOD function, and so on. And we say that, instead of calling REAL and passing a complex number variable as parameter, we send a message to a complex number object. The message is of type REAL and it “asks the object to return its real part”. To do this, the object drives its REAL method. Sorry about the distorted jargon.

Thus far, we have nothing but the jargon: a sort of Object Oriented façon de parler, that we can translate back into ordinary programmer-speak. But we get problems. One is that if we want to create something, for instance, to create the complex number r + si, there is no complex number object to send a message to. So we have to send a COMPLEX message, with parameters r and s to a different sort of object, the complex number class. This weird thing is supposed to be both the set of complex numbers and an object in its own right, and this might sound like gibberish but don’t you be fooled ...

The second problem is what we do about functions with more than one parameter. The parameters might be of the same type, as in subtracting complex numbers. Then we have to decide whether the function c-d should be implemented by sending a message to c telling it to subtract the parameter d from itself, or sending a message to d telling it to subtract itself from the parameter c.

Or we could have parameters of different types, as in a function like LECTURERS (d, s): it asks for a list of the lecturers that teach some course for Department d that is attended by Student s. We then have to decide whether to send a message to Department d, with Student s as parameter, or to send a message to Student s, with Department d as parameter. Actually, we have other options: we could send a message to the Course class object with d and s as parameters, or we could send such a message to the Lecturer class object. In sum, the LECTURERS function has to be a method associated with just one sort of object, despite the fact that – on the face of it – it pertains to two sorts of input variable and one sort of output variable.

Now this means, when you think about it, that not only do we have to say what function we want, but also how to get to that function. And if we change the way we get to the function – perhaps for performance reasons we want to make LECTURERS a method belonging to the Course class object rather than to each Student object – then we have to modify our program; we have to change the call, i.e. the message sending and the message itself. And this makes our program dependent on the organisation of, and the route of access to, the functions in our system.

Well, so what? So, over the weekend, go and read Mr Codd’s first paper.[3] On Monday I’ll show you what the Object Oriented have forgotten, or never learned because they know nothing about data base. And precious little about programming either.

 

 

A Remark on Polymorphic Types

 

 

(A supplementary question about “polymorphic types” was handled as follows.)

Yes, some of the Object Oriented do mutter about these so-called “polymorphic” data types; I think they got the idea from the Functional Programming crowd. As far as I can tell, a polymorphic type is something like “a set of characters” or “a list of integers”, or rather, “a set” or “a list”. Obviously if we have data types like lists of integers, lists of numbers, or even lists of lists of things, then these data types will share a lot of functions. We’ll expect some sort of systematic overloading of operators; for instance, those that find the head of a list, remove the head of a list, and so on.

So a polymorphic type is a sort of schematic type: “set of such-and-such” or “list of so-and-so”. By recognising these schematic types, we can define methods for them. And, of course, it’s more efficient to write a “head of list” function than to write a “head of integer list”, and a “head of character list”, and a “head of list list”, and so on. Perhaps.

But this gives me two worries. Firstly, these polymorphic types seem to me to be, in the main, not what you or I would call “data types”, but rather “data structures”. And I rather think that the distinction is important, and shouldn’t be blurred. It shouldn’t be blurred even if a structure is used to implement a type (as a complex number might be implemented by a two-element list). But beware: the object oriented can’t be trusted to distinguish questions of implementation from questions of specification. I’ve said some hard things about Dr Codd and Mr Date; but they can, at least, be trusted to make that distinction.

Secondly, I fear that what might be useful in some programming tasks – structures like sets, lists, queues, stacks, trees, and so on – are just dangerous when we get to data bases. I’ve spent a lot of time trying to show you how we can get by with only one sort of data structure, the normal record. I don’t want to see lots of other structures brought in under the guise of data types, polymorphic or not.

However , I do stress that, although I’m personally – indeed financially – very interested in programming, our current concern is data base. If I wander away from my redoubt to skirmish in the field of programming, it’s only because I see threats out there, and I attack only to defend.

 

SITE HOME PAGE

I don’t want to conquer the land of the programmers; I’m not imperialistic, but I suspect that they are.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XVIII (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XIX

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.