|
|
The new edition of Dates
bestselling Introduction is
better than ever for standing on to reach high shelves.
The Relational Model part is drastically revised, and Date and McGoverans
views on views incorporated.
There are new chapters on Decision Support, Temporal Databases, and Type Inheritance,
the last of these, along with the Object Orientation (OO) part,
having been rethought after the Third Manifesto book
(FFORD/TTM).
The old discussions of storage structures and access methods have gone, and likewise DB2,
but Introduction is still far too big.
Why not resurrect the old Part 2 call it Introduction to Data Storage and Access
to cover Concurrency, Recovery, Optimisation, Distributed Databases, and Decision Support,
getting rid of 200 pages, and reinforcing the message that relational theory is about specification, not implementation?
The quality is as good as ever, both in content and presentation:
this is still, by far, the best text for professionals and final year computing students.
What can a conscientious critical reviewer do?
Nit-pick.
The latest Writings
book comprises
37 reprints of Dates Database Programming and Design column under the (true) heading
Theory is Practical!
four chapters on relational DB management;
a large section on Missing Information;
and five chapters and an appendix (not to be missed!) on OO, mostly.
It is a fun read, with some delicious polemic, like Darwen
who has occasionally paused from writing software to speak a few truths in jest
denying the libel of being a research scientist.
FFORD/TTM,
the latest version of The Third Manifesto,
is supposedly an attempt to integrate the relational model and OO;
a rapprochement, we might say, like Come into my parlour said the spider to the fly.
Date likes Abstract Data Types (ADTs), albeit not under that name.
They pre-date OO,
and are the most valuable true-but-not-new element of the OO package deal.
Date rightly wants to cherry-pick them (as relational domains),
along with genuine subtyping (impossible among mutable objects).
As for the rest of the package, the new-but-not-true elements, we may cast it into the flames.
But Date says it in a nice way here, which is why Writings is more fun.
His prescriptions and proscriptions labelled OO designedly do not even pertain to OO,
apart from the proscription of object identifiers.
Let us carp.
FFORD/TTM is not an easy read.
Despite long polishing, it is rough at the edges.
For instance, A variable is a holder for an encoding of a value.
No: it is merely a name and a value that the name transiently designates;
holding and encoding are implementation concepts.
However, the book repays a second reading;
but fit Introduction in between, and relax with Writings when it all gets too heavy.
Integrity is important;
and much neglected, but not by Date.
Introduction lists the components of the relational model Structure, Integrity, Manipulation
thus, so that Integrity does not come last.
Dates new classification of constraints is argued unconvincingly in Writings.
The distinction between relation variable (relvar, horrid word)
constraints and database constraints probably will not stand up to criticism.
The former pertain to one relvar and are checked after each insert, delete, or update;
the latter apply to more than one relvar and are checked at commit time.
But a slight change to a DB specification could easily move a constraint
from the one category to the other without affecting its required time of check.
There certainly are tuple constraints, always immediately checkable;
but the need to check some relvar constraints at commit
is shown by Dates own example of an insert to and delete from a relvar of fixed cardinality.
In the Temporal Data
chapter (based on Darwens work)
Date is too ready to extend the relational algebra to handle problems of time.
Users prefer to see how things are over maximal continuous (coalesced) intervals,
but such data is awkward to update.
The same information can be shown by multiple records,
each showing the state in a chosen minimal (unfolded) interval (such as an hour).
Date proposes extended operations on coalesced intervals.
Better to keep only unfolded intervals in base relations,
apply the usual kinds of updates there, and show coalesced intervals only in read-only views.
But Date remains terribly keen to update through views, even at the cost of a more complex algebra.
Once more,
The Information Principle is highlighted:
in a relational DB all information is Date says
conveyed by values in column positions (fields in records, say).
This is strictly impossible:
at a minimum, that a given proposition is asserted has to be conveyed
by a record interpreted as that proposition being basic (kept in a base relvar).
But the relational theory goes beyond that minimum:
the names of base relvars convey information,
namely the open sentences that, when completed by the values in a row,
express the proposition conveyed by the row.
Date wishes this were not so.
He advises against having two base relvars with the same header.
He even says that it is the header of a relation that denotes an open sentence;
this is clearly false: differently interpreted relations with the same header are unavoidable
(any relation and a restriction of it, for instance).
He thinks we should so design our DBs
that we need not specify the relvar on insertion of a row,
which means that all relvars become inessential (to use a useful coinage of his own);
and so do their supposed values, relations.
The outcome would be a DB comprising mere base records,
their associated open sentences being conveyed by the headers of one or more of their fields.
But perhaps that sticks in the throat: the relational model without relations!
We would even have to say pace Plato
that a user asking for the Customer relation
really only wanted to see all of the Customer records.
Dates position is untenable, but half-way houses usually are.
Are those
open sentences Predicates?
Historically, yes, relation being another name for predicate.
But the only things the logic allows to go in the places of predicates,
so closing them and making propositions, are proper names
(which name exactly one distinguishable thing).
Darwen explains all this clearly in Writings, explicitly requiring proper names as values.
The resultant inescapable formal underwriting of connection traps is not described,
however, but your now repentant reviewers contribution to leading Darwen up this garden path
is generously acknowledged.
Taking Dates own example of a connection trap,
if Smith supplies ... and ... are used in the Manhattan project
are predicates, and we assert that Smith supplies monkey wrenches
and that monkey wrenches are used in the Manhattan project,
then despite our intuitions monkey wrenches has to be a proper name,
and the predicate logic forces the conclusion:
There is something that Smith supplies and that is used in the Manhattan project.
Such traps are formally unavoidable on this interpretation,
and only our common sense saves us from them, despite logic.
We shall just have to use a different kind of open sentence and not construe values as proper names,
but as something else: common names, perhaps, like monkey wrench.
Date uncharacteristically muddies the waters on this issue
by saying that types are to relations as nouns are to sentences.
Not so: Values are to tuples-in-relvars as proper names are to sentences.
Introduction contains a brief chapter
on Logic-Based Databases,
a good introduction to the proof-theoretic view of DB queries.
It is marred by the presence of what is called a chasm trap:
If Anne is mother of Betty and Betty is mother of Celia then
Anne is grandmother of Celia.
Given only that conditional, it is an error to say,
We now ... prove that Anne is the grandmother of Celia i.e. we show how to answer the query
Is Anne Celias grandmother?
The two questions are quite different in kind (a proof and a decision).
If we know of someone (Betty) that is daughter of Anne and mother of Celia,
we have the proof; but we might not know of anyone like that,
and still either be able to prove grandmotherhood (if Brian is son of Anne and father of Celia),
or not be able to decide (if we have no data on sonhood or fatherhood).
While such chasm traps are user errors, however,
all these logic-based systems suffer from the sort of connection trap (the ill-named fan trap)
described above.
Try your monkey wrenches on Datalog and see!
Until this problem of interpretation is resolved,
there can be no solution to the problem of Missing Information.
Introduction has one chapter and Writings six chapters on the subject,
the latter principally about against many-valued logics.
Useful and fascinating as these logics are in their place,
their more-than-two values cannot be interpreted as truth values.
Firstly, there are only two truth values, true and false; and secondly, as proved by Gödel,
modal concepts, like being known to be true,
cannot be formalised in a finitely many valued logic.
So there, one might hope, is an end to it; and in considerably less than six chapters.
However, as long as the two-valued predicate logic is used to interpret records
in relational DBs in the way that Date proposes,
there can be no satisfactory way to interpret missing values.
That use of the logic leaves nothing yet uninterpreted that could be exploited to that end.
So many-valued logic will not work, and nor will anything else.
Date has thought a lot
about Views.
It is good to see a clear statement that an ANSI/SPARC external schema
is not (in general) a relational view but a collection of such views.
So it is, alas. If we avoid the botch of Outer Joins,
a program processing such views, sales within sales rep. for instance,
has to perform old-fashioned matching record logic to get their data together properly,
a feature known as data access dependence.
The underlying cause is that the records of the single external schema
(which the program would like as one file) are of different types;
they are not tuple homogeneous as rows in a single relation are,
so must go in more than one relation.
A worse shame if relations really are inessential.
We now have, it seems,
clarity (at some cost) on the question of First Normal Form.
A field in a record might be specified as atomic scalar is Dates preferred term or not.
Say that a record all of whose fields are scalar is simple.
A record might contain no more than one field of any given attribute, or not.
Say that a record comprising only one field of each of its attributes is flat.
Now, simplicity and flatness are distinct but connected notions.
If a record comprises the three fields A:1, B:1, and B:2 (the value 1 of attribute A and the values 1 and 2 of attribute B)
it is simple but not flat.
The semantically equivalent record with the two fields A:1 and BSET:{1,2} is flat but not simple.
All tuples in all relations are flat.
In his 1970 paper,
Codd additionally proposed that his new relational model should be of relations containing only simple tuples.
These he called normalised or in first normal form,
as Date acknowledges in an historical footnote.
Codd observed that except in the rare case of data that could not be normalised,
and that he chose to ignore in practice this proposal allowed the predicate logic to be used
(lightly sugared) as a data manipulation language.
(The reason is that only names can fill the places in predicates and names are atomic:
they have no significant components.)
Date now uses first normal form relation to mean relation:
containing flat but not necessarily simple tuples.
So, in this new terminology, all relations are by definition in first normal form.
It is not a wise change of meaning, but it is clear (now; in Writings there is certainly confusion on the point).
Introductionadmits one kind of non-simple record:
a tuple that contains values that are themselves relations,
a horrid mistake stated with lucidity.
FFORD/TTM unnecessarily also admits values that are tuples,
but values that are non-empty relations with an empty primary key would serve as well.
Date has developed a bit of a thing about naming.
His Data Types do not exist until they are named.
Let us take any dimensioned data type, such as Length or Weight.
If we calculate the variance of some weights, what is the data type of that variance?
Weight squared, you might say.
But Date insists that you must have anticipated the need for this type and have named it,
otherwise squaring a weight to calculate a variance will be impossible.
Third and fourth moments of distributions will require weight cubed and weight to the fourth power; and so on.
It must be a mistake to demand their prior specification and naming.
A similar argument applies to rates, like grams per second,
so we have another problem of, but not restricted to, temporal DBs.
Again, Date is clear, even when he is wrong.
He says that types will form a closed set.
He also says that there must be a truth value type if comparisons are to be legal expressions.
In one sense of expression title of a value this is trivially true;
but it is perfectly sensible to have a language in which comparisons are expressed
but which lacks a data type of truth value.
The predicate logic has
the characteristic of Monotonicity:
if we have a sound argument starting from true premises and the conclusion validly following from them
then no additional premises can make the argument unsound and its conclusion false.
This is a good idea: it is safe.
Proofs of programs (formal or informal) need a monotonic logic,
so we can show the program works and be confident it will continue to work even if other programs are run.
Date sometimes favours the non-monotonic:
he continues to espouse the Closed World Assumption.
Informally and plausibly it says that if my DB does not say that so-and-so is an employee of mine,
then so-and-so is not an employee of mine;
and less plausibly if we substitute prospective customer for employee.
When he considers single inheritance
data Subtyping,
he defines the concept of Most Specific Type (MST).
If we have the types Ellipse and Circle,
an ellipse with equal semi-axes has MST Circle,
and ellipses with semi-axes 1cm and 2cm, or 2cm and 3cm, have MST Ellipse.
But if we introduce a new subtype of Ellipse (and supertype of Circle),
IntegerEllipse say (with semi-axes in integer ratio),
the MST of the ellipse with semi-axes 1cm and 2cm becomes IntegerEllipse.
This would not matter if the notion of MST were merely metalinguistic
(used only to talk about our programming languages),
but Date proposes that it should be part of the language.
He describes an IS_MS_Ellipse operator that would return true for the ellipse with semi-axes 1cm and 2cm;
until we introduced IntegerEllipse, and then it would return false.
Ouch! a cry for the salve of monotonicity.
Charitably, we should notice that between FFORD/TTM and
the new edition of Introduction,
Date has changed his approach to subtyping.
In FFORD/TTM his over-complex approach
is based on variables and is intended to be extrapolated to values (named or titled by expressions);
an appendix discusses the alternative approach of specialisation by constraint,
which he now proposes in Introduction.
Perhaps the proposals of named types, messy functions like the MST,
and pseudo-variables applicable to representations of values are all mere hangovers of the first, ill-chosen, approach.
Why should all subtypes have to be named?
Given a data type and a predicate applicable to values of that type,
we automatically have a subtype, unnamed perhaps but titled:
such-and-such data type of which such-and-such predicate holds true.
We may choose to name it, but naming it does not bring it into existence.
The concept of MST then becomes trivial and useless:
the MST of a value is the subtype that contains that value alone.
Date says oddly that if B is
a subtype of A then B has operators ... of its own that do not apply to A.
Radius would be such an operator.
But this is merely a matter of definition:
we could define a geometric mean of the semi-axes (say MSA) operator for an ellipse,
and then Radius(c) would amount merely to MSA(c) and also check that c is a circle.
In an obvious sense, the two exclusive and exhaustive subtypes,
Circle and NonCircularEllipse, cannot each have operators that Ellipse does not have.
In any event, subtyping by constraint
excludes much of the Subtyping found in OO systems:
Dates example is ColouredCircle as subtype of Circle.
This he rejects, but leaves the connection between the two types mysterious.
Surely they are instance and type.
The test is easy: is everything true of a circle true of a coloured circle?
Yes, one would think.
But subtyping needs a further test: is it possible for two coloured circles to be one and the same circle?
If not then we have subtyping, otherwise instance and type.
A notorious example:
Copy (of a book) is not a subtype of Edition, even though each copy is an edition,
because two copies can be the same edition.
It is perhaps because objects in OO systems cannot be subtyped
that the only half-plausible examples of object subtyping are actually of instancing a type
(Stack as subtype of Bag, for example).
Date does, of course, appreciate that only values can be subtyped;
variables and therefore mutable objects cannot be subtyped;
but he uses his THE_ operators (THE_Radius, THE_Centre, and the like),
which designate components of a representation,
not merely as functions but also as Pseudo-variables.
If we represent rectangles as ShorterSide and LongerSide, and also as ShorterSide and Diagonal,
we have to rename one ShorterSide, so we have, say, LittlerSide and Diagonal.
Then we get two operators, THE_ShorterSide and THE_LittlerSide which always return the same value.
Their only difference is that an assignment to the pseudo-variable, THE_ShorterSide (r),
leaves the value of THE_LongerSide (r) unchanged,
whereas assignment to THE_LittlerSide (r) leaves the value of THE_Diagonal (r) unchanged.
What a horrid kind of assignment it is that implicitly specifies some variable not to be modified;
except, of course, in the Dijkstran case, of modifying no other variable,
which makes program proofs run sweeter
(and, incidentally, makes one a bit chary of Dates construal of views as virtual relvars).
Quite why we should want these THE_ pseudo-variables in a relational DB system is unclear:
there are no scalar variables in the DB to apply them to.
Could it be that an updatable view might contain a column defined with THE_ShorterSide,
or with THE_LittlerSide?
Now theres subtle.
Both Introduction and
a paper in Writings describe the rapidly-getting-hoary Optimisation
problem of Stonebrakers Rectangles, but fail to draw the right conclusion:
optimisers should conjoin the invariants of pertinent data types (specifically of total orderings)
and tuple integrity rules to restriction conditions,
exploit transitivities and other characteristics of predicates,
and evaluate all clauses as early as possible (preferably before accessing data).
|