A DATABASE BOOK REVIEW

 

www.btinternet.com/~adrian.larner/review/date7

Three Books by CJ Date
An Introduction to Database Systems, 7th (25th Anniversary) Edition,
ISBN 0-201-38590-2, 938 pp

Relational Database Writings 1994-1997,
ISBN 0-201-39814-1, 586 pp, with contributions by Hugh Darwen and David McGoveran

Foundation for Object/Relational Databases The Third Manifesto,
ISBN 0-201-30978-5, 496 pp, CJ Date and Hugh Darwen

all published by Addison Wesley Longman, Inc

An Extended Review by Adrian Larner for the Computer Journal

 

 

The new edition of Date’s bestselling Introduction is better than ever for standing on to reach high shelves. The Relational Model part is drastically revised, and Date and McGoveran’s views on views incorporated. There are new chapters on Decision Support, Temporal Databases, and Type Inheritance, the last of these, along with the Object Orientation (OO) part, having been rethought after the Third Manifesto book (FFORD/TTM). The old discussions of storage structures and access methods have gone, and likewise DB2, but Introduction is still far too big. Why not resurrect the old Part 2 – call it “Introduction to Data Storage and Access” – to cover Concurrency, Recovery, Optimisation, Distributed Databases, and Decision Support, getting rid of 200 pages, and reinforcing the message that relational theory is about specification, not implementation? The quality is as good as ever, both in content and presentation: this is still, by far, the best text for professionals and final year computing students. What can a conscientious critical reviewer do? Nit-pick.

The latest Writings book comprises 37 reprints of Date’s Database Programming and Design column under the (true) heading “Theory is Practical!” four chapters on relational DB management; a large section on “Missing Information”; and five chapters and an appendix (not to be missed!) on OO, mostly. It is a fun read, with some delicious polemic, like Darwen – who has occasionally paused from writing software to speak a few truths in jest – denying the libel of being a “research scientist”.

FFORD/TTM, the latest version of The Third Manifesto, is supposedly an attempt to integrate the relational model and OO; a rapprochement, we might say, like Come into my parlour said the spider to the fly. Date likes Abstract Data Types (ADTs), albeit not under that name. They pre-date OO, and are the most valuable true-but-not-new element of the OO package deal. Date – rightly – wants to cherry-pick them (as relational domains), along with genuine subtyping (impossible among mutable objects). As for the rest of the package, the new-but-not-true elements, we may cast it into the flames. But Date says it in a nice way here, which is why Writings is more fun. His prescriptions and proscriptions labelled “OO” designedly do not even pertain to OO, apart from the proscription of object identifiers.

Let us carp. FFORD/TTM is not an easy read. Despite long polishing, it is rough at the edges. For instance, A variable is a holder for an encoding of a value. No: it is merely a name and a value that the name transiently designates; holding and encoding are implementation concepts. However, the book repays a second reading; but fit Introduction in between, and relax with Writings when it all gets too heavy.

Integrity is important; and much neglected, but not by Date. Introduction lists the components of the relational model – Structure, Integrity, Manipulation – thus, so that Integrity does not come last. Date’s new classification of constraints is argued unconvincingly in Writings. The distinction between relation variable (“relvar”, horrid word) constraints and database constraints probably will not stand up to criticism. The former pertain to one relvar and are checked after each insert, delete, or update; the latter apply to more than one relvar and are checked at commit time. But a slight change to a DB specification could easily move a constraint from the one category to the other without affecting its required time of check. There certainly are tuple constraints, always immediately checkable; but the need to check some relvar constraints at commit is shown by Date’s own example of an insert to and delete from a relvar of fixed cardinality.

In the Temporal Data chapter (based on Darwen’s work) Date is too ready to extend the relational algebra to handle problems of time. Users prefer to see how things are over maximal continuous (“coalesced”) intervals, but such data is awkward to update. The same information can be shown by multiple records, each showing the state in a chosen minimal (“unfolded”) interval (such as an hour). Date proposes extended operations on coalesced intervals. Better to keep only unfolded intervals in base relations, apply the usual kinds of updates there, and show coalesced intervals only in read-only views. But Date remains terribly keen to update through views, even at the cost of a more complex algebra.

Once more, The Information Principle is highlighted: in a relational DB all information is – Date says – conveyed by values in column positions (fields in records, say). This is strictly impossible: at a minimum, that a given proposition is asserted has to be conveyed by a record interpreted as that proposition being “basic” (kept in a base relvar). But the relational theory goes beyond that minimum: the names of base relvars convey information, namely the open sentences that, when completed by the values in a row, express the proposition conveyed by the row. Date wishes this were not so. He advises against having two base relvars with the same header. He even says that it is the header of a relation that denotes an open sentence; this is clearly false: differently interpreted relations with the same header are unavoidable (any relation and a restriction of it, for instance). He thinks we should so design our DBs that we need not specify the relvar on insertion of a row, which means that all relvars become inessential (to use a useful coinage of his own); and so do their supposed values, relations. The outcome would be a DB comprising mere “base records”, their associated open sentences being conveyed by the headers of one or more of their fields. But perhaps that sticks in the throat: the relational model without relations! We would even have to say – pace Plato – that a user asking for “the Customer relation” really only wanted to see all of the Customer records. Date’s position is untenable, but half-way houses usually are.

Are those open sentences Predicates? Historically, yes, “relation” being another name for predicate. But the only things the logic allows to go in the places of predicates, so closing them and making propositions, are proper names (which name exactly one distinguishable thing). Darwen explains all this clearly in Writings, explicitly requiring proper names as values. The resultant inescapable formal underwriting of connection traps is not described, however, but your now repentant reviewer’s contribution to leading Darwen up this garden path is generously acknowledged. Taking Date’s own example of a connection trap, if “Smith supplies ...” and “... are used in the Manhattan project” are predicates, and we assert that Smith supplies monkey wrenches and that monkey wrenches are used in the Manhattan project, then – despite our intuitions – “monkey wrenches” has to be a proper name, and the predicate logic forces the conclusion: There is something that Smith supplies and that is used in the Manhattan project. Such traps are formally unavoidable on this interpretation, and only our common sense saves us from them, despite logic. We shall just have to use a different kind of open sentence and not construe values as proper names, but as something else: common names, perhaps, like “monkey wrench”. Date uncharacteristically muddies the waters on this issue by saying that types are to relations as nouns are to sentences. Not so: Values are to tuples-in-relvars as proper names are to sentences.

Introduction contains a brief chapter on Logic-Based Databases, a good introduction to the proof-theoretic view of DB queries. It is marred by the presence of what is called a “chasm trap”: If Anne is mother of Betty and Betty is mother of Celia then Anne is grandmother of Celia. Given only that conditional, it is an error to say, We now ... prove that Anne is the grandmother of Celia – i.e. we show how to answer the query ‘Is Anne Celia’s grandmother?’ The two questions are quite different in kind (a proof and a decision). If we know of someone (Betty) that is daughter of Anne and mother of Celia, we have the proof; but we might not know of anyone like that, and still either be able to prove grandmotherhood (if Brian is son of Anne and father of Celia), or not be able to decide (if we have no data on sonhood or fatherhood). While such chasm traps are user errors, however, all these logic-based systems suffer from the sort of connection trap (the ill-named “fan trap”) described above. Try your monkey wrenches on Datalog and see!

Until this problem of interpretation is resolved, there can be no solution to the problem of Missing Information. Introduction has one chapter and Writings six chapters on the subject, the latter principally about – against – many-valued logics. Useful and fascinating as these logics are in their place, their more-than-two values cannot be interpreted as truth values. Firstly, there are only two truth values, true and false; and secondly, as proved by Gödel, modal concepts, like “being known to be true”, cannot be formalised in a finitely many valued logic. So there, one might hope, is an end to it; and in considerably less than six chapters. However, as long as the two-valued predicate logic is used to interpret records in relational DBs in the way that Date proposes, there can be no satisfactory way to interpret “missing values”. That use of the logic leaves nothing yet uninterpreted that could be exploited to that end. So many-valued logic will not work, and nor will anything else.

Date has thought a lot about Views. It is good to see a clear statement that an ANSI/SPARC external schema is not (in general) a relational view but a collection of such views. So it is, alas. If we avoid the botch of “Outer Joins”, a program processing such views, sales within sales rep. for instance, has to perform old-fashioned matching record logic to get their data together properly, a feature known as “data access dependence”. The underlying cause is that the records of the single external schema (which the program would like as one file) are of different types; they are not “tuple homogeneous” as rows in a single relation are, so must go in more than one relation. A worse shame if relations really are inessential.

We now have, it seems, clarity (at some cost) on the question of First Normal Form. A field in a record might be specified as atomic – “scalar” is Date’s preferred term – or not. Say that a record all of whose fields are scalar is simple. A record might contain no more than one field of any given attribute, or not. Say that a record comprising only one field of each of its attributes is flat. Now, simplicity and flatness are distinct but connected notions. If a record comprises the three fields A:1, B:1, and B:2 (the value 1 of attribute A and the values 1 and 2 of attribute B) it is simple but not flat. The semantically equivalent record with the two fields A:1 and BSET:{1,2} is flat but not simple. All tuples in all relations are flat. In his 1970 paper, Codd additionally proposed that his new relational model should be of relations containing only simple tuples. These he called “normalised” or “in first normal form”, as Date acknowledges in an historical footnote. Codd observed that – except in the rare case of data that could not be normalised, and that he chose to ignore in practice – this proposal allowed the predicate logic to be used (lightly sugared) as a data manipulation language. (The reason is that only names can fill the places in predicates and names are atomic: they have no significant components.) Date now uses “first normal form relation” to mean “relation”: containing flat but not necessarily simple tuples. So, in this new terminology, all relations are by definition in first normal form. It is not a wise change of meaning, but it is clear (now; in Writings there is certainly confusion on the point). Introductionadmits one kind of non-simple record: a tuple that contains values that are themselves relations, a horrid mistake stated with lucidity. FFORD/TTM unnecessarily also admits values that are tuples, but values that are non-empty relations with an empty primary key would serve as well.

Date has developed a bit of a thing about naming. His Data Types do not exist until they are named. Let us take any dimensioned data type, such as Length or Weight. If we calculate the variance of some weights, what is the data type of that variance? Weight squared, you might say. But Date insists that you must have anticipated the need for this type and have named it, otherwise squaring a weight to calculate a variance will be impossible. Third and fourth moments of distributions will require weight cubed and weight to the fourth power; and so on. It must be a mistake to demand their prior specification and naming. A similar argument applies to rates, like grams per second, so we have another problem of, but not restricted to, temporal DBs. Again, Date is clear, even when he is wrong. He says that types will form a closed set. He also says that there must be a truth value type if comparisons are to be legal expressions. In one sense of “expression” – title of a value – this is trivially true; but it is perfectly sensible to have a language in which comparisons are expressed but which lacks a data type of truth value.

The predicate logic has the characteristic of Monotonicity: if we have a sound argument – starting from true premises and the conclusion validly following from them – then no additional premises can make the argument unsound and its conclusion false. This is a good idea: it is safe. Proofs of programs (formal or informal) need a monotonic logic, so we can show the program works and be confident it will continue to work even if other programs are run. Date sometimes favours the non-monotonic: he continues to espouse the Closed World Assumption. Informally and plausibly it says that if my DB does not say that so-and-so is an employee of mine, then so-and-so is not an employee of mine; and less plausibly if we substitute “prospective customer” for “employee”.

When he considers single inheritance data Subtyping, he defines the concept of “Most Specific Type” (MST). If we have the types Ellipse and Circle, an ellipse with equal semi-axes has MST Circle, and ellipses with semi-axes 1cm and 2cm, or 2cm and 3cm, have MST Ellipse. But if we introduce a new subtype of Ellipse (and supertype of Circle), IntegerEllipse say (with semi-axes in integer ratio), the MST of the ellipse with semi-axes 1cm and 2cm becomes IntegerEllipse. This would not matter if the notion of MST were merely metalinguistic (used only to talk about our programming languages), but Date proposes that it should be part of the language. He describes an IS_MS_Ellipse operator that would return true for the ellipse with semi-axes 1cm and 2cm; until we introduced IntegerEllipse, and then it would return false. Ouch! – a cry for the salve of monotonicity. Charitably, we should notice that between FFORD/TTM and the new edition of Introduction, Date has changed his approach to subtyping. In FFORD/TTM his over-complex approach is based on variables and is intended to be extrapolated to values (named or titled by expressions); an appendix discusses the alternative approach of “specialisation by constraint”, which he now proposes in Introduction. Perhaps the proposals of named types, messy functions like the MST, and pseudo-variables applicable to representations of values are all mere hangovers of the first, ill-chosen, approach. Why should all subtypes have to be named? Given a data type and a predicate applicable to values of that type, we automatically have a subtype, unnamed perhaps but titled: “such-and-such data type of which such-and-such predicate holds true.” We may choose to name it, but naming it does not bring it into existence. The concept of MST then becomes trivial and useless: the MST of a value is the subtype that contains that value alone. Date says – oddly – that if B is a subtype of A then B has operators ... of its own that do not apply to A. “Radius” would be such an operator. But this is merely a matter of definition: we could define a “geometric mean of the semi-axes” (say MSA) operator for an ellipse, and then Radius(c) would amount merely to “MSA(c) and also check that c is a circle”. In an obvious sense, the two exclusive and exhaustive subtypes, Circle and NonCircularEllipse, cannot each have operators that Ellipse does not have.

In any event, subtyping by constraint excludes much of the “Subtyping” found in OO systems: Date’s example is ColouredCircle as “subtype” of Circle. This he rejects, but leaves the connection between the two types mysterious. Surely they are instance and type. The test is easy: is everything true of a circle true of a coloured circle? Yes, one would think. But subtyping needs a further test: is it possible for two coloured circles to be one and the same circle? If not then we have subtyping, otherwise instance and type. A notorious example: Copy (of a book) is not a subtype of Edition, even though each copy is an edition, because two copies can be the same edition. It is perhaps because objects in OO systems cannot be subtyped that the only half-plausible examples of object “subtyping” are actually of instancing a type (Stack as “subtype” of Bag, for example).

Date does, of course, appreciate that only values can be subtyped; variables – and therefore mutable objects – cannot be subtyped; but he uses his “THE_” operators (THE_Radius, THE_Centre, and the like), which designate components of a representation, not merely as functions but also as Pseudo-variables. If we represent rectangles as ShorterSide and LongerSide, and also as ShorterSide and Diagonal, we have to rename one “ShorterSide”, so we have, say, LittlerSide and Diagonal. Then we get two operators, THE_ShorterSide and THE_LittlerSide which always return the same value. Their only difference is that an assignment to the pseudo-variable, THE_ShorterSide (r), leaves the value of THE_LongerSide (r) unchanged, whereas assignment to THE_LittlerSide (r) leaves the value of THE_Diagonal (r) unchanged. What a horrid kind of assignment it is that implicitly specifies some variable not to be modified; except, of course, in the Dijkstran case, of modifying no other variable, which makes program proofs run sweeter (and, incidentally, makes one a bit chary of Date’s construal of views as “virtual relvars”). Quite why we should want these THE_ pseudo-variables in a relational DB system is unclear: there are no scalar variables in the DB to apply them to. Could it be that an updatable view might contain a column defined with THE_ShorterSide, or with THE_LittlerSide? Now there’s subtle.

Both Introduction and a paper in Writings describe the rapidly-getting-hoary Optimisation problem of Stonebraker’s Rectangles, but fail to draw the right conclusion: optimisers should conjoin the invariants of pertinent data types (specifically of total orderings) and tuple integrity rules to restriction conditions, exploit transitivities and other characteristics of predicates, and evaluate all clauses as early as possible (preferably before accessing data).

 

 

 

 

 

 

 

Not (yet?) in Introduction, but proposed in FFORD/TTM, is a new Relational Algebra, which has a single (AND) operator for restriction, extension, natural join, and intersection, because they are all conjunctions in the predicate logic; and a new union (OR) operator that does not require the traditional compatibility. A negation (NOT) operator works as follows: Form a Cartesian product of the domains of the columns of the relation being negated, each domain being expressed as a (probably big) single-column relation. The NOT of the original relation is its difference from the “product-of-domains” relation. (Date would probably be wary of making relations sound anything like domains.) “R1 OR R2” is then definable – predictably enough – as “NOT ((NOT R1) AND (NOT R2))”. It seems that we have theoretical propriety, but dubious practicality; we have a new union that still will not let us get those sales rep. and sale records together sensibly. All right: theory is practical; but it has to be the right theory. Complexity – probably approaching computational completeness – is added to the algebra by a transitive closure operation, alas uninterpretable in the predicate logic. Interesting; but it is a curse to live in interesting times.

(Your reviewer is indebted to Chris Date and Hugh Darwen for a number of discussions about some of the issues mentioned in this review.)

 

SITE HOME PAGE

 

THE DB REVIEWS

Another Review ...

 

Copyright © 1999, 2001 Adrian Larner. The author asserts all moral rights.

The decorative image of a key (cc004239.gif) used on this page was obtained from IMSI's MasterClips/MasterPhotos© Collection, 1895 Francisco Blvd East, San Rafael, CA 94901-5506, USA.