|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl18 |
|
PLATOCLAST Lecture XVIII |
||
|
|
||
|
|
I’m afraid that we’ve come a long way round to where I wanted us to get to. But sometimes the longest way round is the shortest way home. What I’m going to do in this lecture is to try to say where we’ve got to, and then to look back along the way, so we all understand how, and why, we arrived here. |
|
|
|
||
|
|
Tables and Records |
|
|
|
For a long time, at the beginning of the course, we hacked away intermittently at that curious problem of whether tables were essential. And one way to put that problem is to ask; what is the largest, the most extensive, object that we want our DML to handle? Well, for good or ill, the answer we gave was: a record. Though, of course, we recognised the utility of referring to our records by common names, “EMPLOYEE” and the like, and thereby allowing the operations specified on a record with such a name to be applied automatically to all records of that name. And that’s what gives the illusion that a DML has to be concerned with tables. I stress; a DML doesn’t have to be concerned with tables. A queried table, if we must talk of such a thing, comprises simply the records deemed to be “contained” in it. Of course, I don’t care if a user thinks in terms of tables, or generally talks in set-speak (though I don’t recommend it); I’m just looking for something that is a member of the Simple, Safe Theory Club. |
|
|
|
||
|
|
Data Manipulation |
|
|
|
Having identified the record as our most extensive structure, and its interpretation as a certain form of proposition, I was then concerned to see what facilities were needed for queries, that is, implications. And we showed that the major part of the work could be done by a variety of operations on records, all as far as the FOPC is concerned conjunctions. As it happens, we also managed to express extensions and restrictions in a similar way. We ended this stage, you’ll remember, with, for records, the Great Conjunction Operation: Combination. Restrictions and Extensions were given somewhat different handling in the Algebra and Tuple Calculus, but we thought that users would probably think of them in a slightly different way, so we didn’t worry too much about that. I want to remind you how our new combination operation allowed us to define not only our old friends Cartesian product and Inner Natural Join, but also those extended natural joins that we called Conjoins, and found preferable to the Outer Joins that the relational people seem to love. Moreover, we also used combination to define Union, which seemed to threaten the simplicity of our DML by having an interpretation of alternation instead of conjunction (OR rather than AND). Projection, after a long struggle, and our new interpretation, EI, turned out to be the inverse of a conjunction. Not, of course, that we adopted EI to change the interpretation of projection: we did it primarily to avoid going from truth to falsity, and all these other things were added unto us. We examined other existential quantifications, and it seemed that they could be handled by calling encapsulated functions (such as COUNT), as long as we had a Collection operation. So we allowed ourselves such an operation, but salved our consciences by showing that any use of it could be expressed (though perhaps longwindedly) as a number of conjunctions (ultimately combinations). And then we took the last primitive relational operator, Difference, and showed how we could handle it using no more mechanism whatsoever. And that brought us where we are now, in terms of operations on records: each of these operations is definable on conjunctions, or on the inverse of a conjunction in the case of projection. |
|
|
|
||
|
|
Fields |
|
|
|
But that was only how we tackled our DML, or the theory behind it, from the top; considering its most extensive object, the record. We also needed to tackle it from the bottom, considering its least extensive object. That, of course, is the field. And we now conceive a field as a token. So our DML is demarcated below by an interface to a Token Handler. I’ll come back to the token handler, and what it comprises, but let me stress that the tokens, as far as the DML are concerned, amount to the values in fields, and are therefore interpretable as the names used in the equivalences of EI. Quite how we handle the field types (formats) I’ve left rather vague. We could have a DML that handled its records in tables, picking up column names well, column name tokens from record definitions. Or we could, somewhat more flexibly, include column (and therefore domain) information in each token; that is, we would have functions like “COLUMN” and “TYPE” that the DML would apply to tokens to return column and data type identifying tokens. |
|
|
|
||
|
|
Handling Data Outside the Manipulation Language |
|
|
|
In addition to the DML proper, I assume one or more Presentation Handlers. While the DML would produce records, or taking them collectively tables, as a result of its queries, the presentation handlers would process those tables in order to build character arrays, or images, or some combination of those, for presentation to a user at a VDU or on paper. A presentation handler would need access to DISPLAY functions in order to convert tokens to characters or images. It is in such a handler that we would manage, for instance, natural language support. As part of, or in addition to, presentation handlers, we would also want Report Formatters that could restructure strictly, I suppose, denormalise tables. These would allow the user to do a lot of traditional things: sequencing, level breaking, and the like; and perhaps some lovely new-fangled things as well, about which you wouldn’t expect an old fogey like me to know. Notice that we might want some report formatting “token shuffling”, we might say before presentation (conversion to characters or images), and some more report formatting afterwards. And, just as we want presentation handlers and report formatters for our output, we probably want all sorts of user-friendly interfaces to pick up input, and to assist the user to interpret data and formulate queries. Now I mention all these extra bits and bobs not just to show how much I’m on top of my job but to explain why I don’t feel the need to keep stuffing more and more function into the DML. So many of the things that threaten to make DMLs complex don’t belong in DMLs at all. On the output side, for instance, people often want reports with level breaks: revenue by sales representative within office within area. These reports look like hierarchies, but don’t you be fooled, they are hierarchies. And then people argue that we should have a DML that handles hierarchies, with all their implicit connections between records. In reports, these implicit connections are shown by sequencing and indentation. But I would want to say: let the report formatter handle hierarchies, as long as the DML can give it the normal records it needs to build them. On the input side, as I mentioned before, the Natural Join School want the user to be able to specify a join between two types of record CUSTOMER and INVOICE for instance and for the system to work out what column they should be joined on, Customer Number I suppose. But I don’t want that. I’m quite happy to have a front-end, an input handler, that guesses what join is needed, confirms it with the user, and then submits the query to the DML. This allows us to have a DML that simply performs the queries passed to it, without all that clever second-guessing of the user. And I’m going to maintain that position until someone publishes a decent theory of second-guessing, and proves it correct. |
|
|
|
||
|
|
Token Handling |
|
|
|
Back to the token handler. What do we want it to do? We’ve seen quite a lot that it should do: handling token identity, in some cases inequalities on tokens, and other highly overloaded functions. These include, perhaps, checking for nulls and pads, the THE and PROP functions, the Reduction operation, DIFF and UNDIFF, the “]” and “[” operations, and the PRESENT function. The token handler itself calls proper and improper functions (it is the interface between the DML and those functions). Think back to our interpretation of Functional Extension, for instance:
Now imagine that we wanted to work out whether that expression was true, taking the “= to mean “iff, so known to the FOPC. Well, we can imagine just passing it to a pretty clued-up implementation of the FOPC, and it would scan it, and then say to us:
Our DML, you will recall, is based on the FOPC now merely on part of the FOPC. Our tokens are effectively the “a”, “b”, and “c” in our definition. Our encapsulated functions all in the Token Handler as far as the DML is concerned are like the q in our definition. What the DML does know about are its native operations, like “&”. And, just as the dictionary describes the encapsulated functions, so the data base definition describes record types like “P”, and the actual data base describes the records like Pab. Now, imagine how complex the FOPC would become if we started adding predicates to it. Suppose, for instance, that the TT-DD, instead of being a theory built on the FOPC, were actually part of the FOPC. Then, every time we added a new predicate or function or definition or postulate we would have to make sure the FOPC was still reliable: every brick we laid would shake the foundations. I think that would be intolerable. And I want to place the same sort of constraints on the DML, because it is simply a form of the FOPC, or of a part of that logic. So, as long as we don’t add anything to it, it is completely reliable. We don’t have to keep re-thinking our DML as we add data types: numbers, characters, part numbers, dates, text lines, signatures, images, moving images, or whatever (any more than we have to rethink it because we insert records, or define new record types). That’s really what encapsulation gives us. Of course, we want our encapsulated functions to be reliable as well. We don’t save ourselves essential work; but we divide and conquer. We get the DML right; then we get each data type right; then we get each function and operation right. We conquer; but we don’t have to keep reconquering. And when you consider the pain you’ve all been through just listening to me conquering a simple little DML, or rather reconquering it in the footsteps of Dr Codd (and a few of my own), you may agree that getting it right and not constantly enhancing it is a good idea. |
|
|
|
||
|
|
First Order Logic Subset Data Manipulation Language |
|
|
|
As you may have noticed, I’ve done somewhat more (or should I say “less”?) than base my DML on the FOPC. Having required a token handler, and all those encapsulated functions reached through it, I’ve actually pushed a number of FOPC-like operations out of the DML entirely. And that’s why I say my DML is based on only part of the FOPC, specifically the conjunction of predicates (and its inverse, which we needed for projection). Let me explain why I did this, though we haven’t as yet worked out all the implications. The operations we were concerned with were existential quantification and negation.
Before I sort out that last little problem, let me go back a long way to the concept of the efficiency of a theory. That meant, remember, the ratio of what we got out of it to what we put in. In linguistic terms and that’s not far from a DML what we were looking for was a minimal vocabulary that carried maximal expressive power. Now, it’s difficult to compare logics for efficiency. But for sure, the FOPC without existential quantification and negation is certainly more efficient than the full FOPC, if it has the same or greater expressive power. We’ve seen how to express all the primitive relational operations join, restriction, projection, union, and difference in our DML; we’ve managed something better than outer joins, namely Conjoins, and we can manage outer joins I’ll come to that. So it looks as if we have a DML that’s more efficient than the Relational Algebra or Calculus, or current DMLs built on them. But what is more: I have wonders in store for you. I have yet to show you how to do queries that are impossible in relational systems. Our DML, we shall find, is doubly more efficient than relational DMLs: not only does it require less than the Relational Algebra or Calculus, but it delivers more. |
|
|
|
||
|
|
The Closed Data Base Assumption |
|
|
|
Now let’s go back to that important question whether or not we can show that Douglas is a slacker. You can see where our problems come from. The records we keep say: so-and-so has returned such-and-such assignment on time. We don’t keep any records saying that they haven’t returned an assignment on time. So the conclusion that, for instance, Douglas has returned no assignment on time depends on an extra assumption: that if anyone has returned an assignment on time then it says so in our data. But that assumption in general it’s caned the “Closed Data Base” assumption is not itself part of our data. And I see no reason to make it part of our data, or to think of it as a postulate of the theory that our data base represents. It’s not always true, in the general case. Think what happens when a user specifies a query with a restriction in it. They ask for employees with, say, SALARY greater than £30,000. Obviously “each employee has a salary greater than £30,000” is not a theorem of our theory; it’s not even a proposition consistent with our theory as some of our employees, we might well assume, get much smaller salaries. They’re on the poverty line, really. But the records the user gets as a result of the query also represent a theory: it’s a subtheory of our original theory (the one represented by the data base). And “each employee has a salary greater than £30,000” is a theorem of that theory. Of course, it’s not that one theory is true and the other false. They are both true, but “employee” in one theory names only some of the persons that “employee” names in the other. Well, if as we surely must we allow users to build theories from theories in this way, merely by using restrictions, there is no reason why they shouldn’t use the facilities of the DML to build theories, query by query, containing the closed data base assumption. And that’s just how we can understand the query we need to go from the first of these tables to the second:
But it’s not quite that stark. Though it could be; at the cost of performance. Remember that we needed some invariants specified, which tell the DML, at a certain level where its queries are optimised a number of important facts about functions. At that level, the DML does need to know that “greater than is an ordering relationship, and that checking for presence of a type of record can cease when a record of that type has been found (contrast other accumulations, such as TOTAL, where all records must, be found). |
|
|
|
||
|
|
Outer Join Defined |
|
|
|
I think that’s about enough. No I was going to define Outer Joins for you. To get the Outer Join of Pab and Qbc we first form the Conjoin (let’s use the Left Conjoin and define the Left Outer Join). And then we apply a restriction, namely:
I had a student complaining after the last lecture that my theory was “too subtle”. Not so. You may, if you wish, compliment me on my subtlety; but my theory is not in the least subtle. There are no subtleties in logic. I don’t make a subtle distinction between what’s in the DML and what’s outside in the token handler and its appendages; I put a great wall around the DML, and I keep watch lest anything but conjunction enter there. Existential quantification and negation don’t subtly hover on the borderline; they go out the DML and across that borderline with Geoffrey’s big boot behind them. Subtle forsooth! |
|
|
|
||
|
|
A Remark on Object Orientation |
|
|
||
|
|
(It was objected that Professor Platoclast had given a fairly conventional account of strict data typing, but had dismissed the very same idea, viz Object Orientation, as a “whacky idea”.[2]) My dear chap, not so, not so. Let me sum up Object Orientation for you:
Thus far, we have nothing but the jargon: a sort of Object Oriented façon de parler, that we can translate back into ordinary programmer-speak. But we get problems. One is that if we want to create something, for instance, to create the complex number r + si, there is no complex number object to send a message to. So we have to send a COMPLEX message, with parameters r and s to a different sort of object, the complex number class. This weird thing is supposed to be both the set of complex numbers and an object in its own right, and this might sound like gibberish but don’t you be fooled ... The second problem is what we do about functions with more than one parameter. The parameters might be of the same type, as in subtracting complex numbers. Then we have to decide whether the function c-d should be implemented by sending a message to c telling it to subtract the parameter d from itself, or sending a message to d telling it to subtract itself from the parameter c. Or we could have parameters of different types, as in a function like LECTURERS (d, s): it asks for a list of the lecturers that teach some course for Department d that is attended by Student s. We then have to decide whether to send a message to Department d, with Student s as parameter, or to send a message to Student s, with Department d as parameter. Actually, we have other options: we could send a message to the Course class object with d and s as parameters, or we could send such a message to the Lecturer class object. In sum, the LECTURERS function has to be a method associated with just one sort of object, despite the fact that on the face of it it pertains to two sorts of input variable and one sort of output variable. Now this means, when you think about it, that not only do we have to say what function we want, but also how to get to that function. And if we change the way we get to the function perhaps for performance reasons we want to make LECTURERS a method belonging to the Course class object rather than to each Student object then we have to modify our program; we have to change the call, i.e. the message sending and the message itself. And this makes our program dependent on the organisation of, and the route of access to, the functions in our system. Well, so what? So, over the weekend, go and read Mr Codd’s first paper.[3] On Monday I’ll show you what the Object Oriented have forgotten, or never learned because they know nothing about data base. And precious little about programming either. |
|
|
|
||
|
|
||
|
|
||
|
|
(A supplementary question about “polymorphic types” was handled as follows.) Yes, some of the Object Oriented do mutter about these so-called “polymorphic” data types; I think they got the idea from the Functional Programming crowd. As far as I can tell, a polymorphic type is something like “a set of characters” or “a list of integers”, or rather, “a set” or “a list”. Obviously if we have data types like lists of integers, lists of numbers, or even lists of lists of things, then these data types will share a lot of functions. We’ll expect some sort of systematic overloading of operators; for instance, those that find the head of a list, remove the head of a list, and so on. So a polymorphic type is a sort of schematic type: “set of such-and-such” or “list of so-and-so”. By recognising these schematic types, we can define methods for them. And, of course, it’s more efficient to write a “head of list” function than to write a “head of integer list”, and a “head of character list”, and a “head of list list”, and so on. Perhaps. But this gives me two worries. Firstly, these polymorphic types seem to me to be, in the main, not what you or I would call “data types”, but rather “data structures”. And I rather think that the distinction is important, and shouldn’t be blurred. It shouldn’t be blurred even if a structure is used to implement a type (as a complex number might be implemented by a two-element list). But beware: the object oriented can’t be trusted to distinguish questions of implementation from questions of specification. I’ve said some hard things about Dr Codd and Mr Date; but they can, at least, be trusted to make that distinction. Secondly, I fear that what might be useful in some programming tasks structures like sets, lists, queues, stacks, trees, and so on are just dangerous when we get to data bases. I’ve spent a lot of time trying to show you how we can get by with only one sort of data structure, the normal record. I don’t want to see lots of other structures brought in under the guise of data types, polymorphic or not. However , I do stress that, although I’m personally indeed financially very interested in programming, our current concern is data base. If I wander away from my redoubt to skirmish in the field of programming, it’s only because I see threats out there, and I attack only to defend. |
|
|
|
||
|
I don’t want to conquer the land of the programmers; I’m not imperialistic, but I suspect that they are. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||