(U03)

www.btinternet.com/~adrian.larner/database/pcl10

PLATOCLAST
ON DATA

Lecture X
Handling Non-normal Records

 

 

I’m going to begin today by saying a few words about quantification. You may have thought it odd that although I insist on relative identities, always saying “the same such-and-such” rather than merely “the same”, I still quantify with “There is something” (FOR SOME x) or “Everything is such that” (FOR EACH x). Don’t we really want to say “There is some person” or “Every employee is such that”?

 

 

Typed and Untyped Quantification

 

If you do want to say that sort of thing, then it’s OK. We can translate it back into indefinite quantification.

FOR SOME PERSON x, F(x)” comes out as:
 
FOR SOME x, x is a person AND F(x); or
FOR SOME x, x is the same person as x AND F(x).

FOR EACH PERSON x, F(x)” comes out as:
FOR EACH x, IF x is a person THEN FOR SOME y, x is the same person as y AND F(y).
But notice that the universal quantification is rather complex, and also that such quantifications lead to what. seem mysterious results. For instance, assuming a universe of inscriptions, let “F” hold true of all and only the inscriptions in a certain book. Then its negation, let us say, “¬F” holds true of all other inscriptions. It is then the case, I expect, that each alphabetic letter is F (i.e. something that is the same alphabetic letter as some inscription in the book), and that each alphabetic letter is ¬F (i.e. something that is the same alphabetic letter as some inscription not in the book).

Now normally we would say that either Fa or ¬Fa, but not both. However, this depends on “a” being a proper name (under our systemic criterion of identity), and likewise we would say that FOR EACH x, NOT (Fx AND ¬Fx), but that depends on x ranging over all distinguishable entities of our system, and therefore being replaceable (in any particular case) by a proper name.

My goodness! Don’t look so worried. You’re not worried that “Something is an inscription in the book” and “Something is not an inscription in the book” both come out true. Why shouldn’t they? No reason at all. Notice that, whatever “a” is, we always have:

NOT (Fa AND NOT Fa)
But we don’t always have:
NOT (Fa AND ¬Fa)
So we don’t always have:
(NOT Fa) IF AND ONLY IF (¬Fa)
Propositional negation doesn’t always have the same effect as predicate negation. Indeed, their equivalence depends on “a” being a proper name (or, using x for “a” above, on x ranging over the distinguishable entities of the system). You’re not shocked by “Something is not F” being true while “It is false that something is F” is false, because “something” is obviously nothing like a proper name. You’re only shocked by “The letter ‘A’ is not F” being true while “It is false that the letter ‘A’ is F” is false, because you take “the letter ‘A’” to be a proper name. But it isn’t: it’s a common name.

If it weren’t for all that delicate placement of English definite and indefinite articles, if you spoke Latin say (much hopes ), you would happily affirm “Dog doesn’t wag tail” and deny “It is false that dog wags tail”. Of course dog wags tail and dog doesn’t wag tail. Thank you Douglas For those who didn’t hear, Douglas offers the explanation: different dogs. Quite so.[1]

 

 

Polynormal Records

 

Now, that little introduction was another Platoclast “two birds with one stone” attempt: to explain why I’m not mug enough to abandon indefinite quantification, but also to show that when we get what looks like a name – and here I’m thinking particularly of a column name – and it names several things, or perhaps nothing, then we have to be a bit careful and introduce some clearly stated quantification to avoid confusion between propositional and predicate negation. And here’s a case in point.

You’ll remember that in order to get minimal structure, i.e. just one way to hold fields together, we treated a record as comprising its fields (and nothing else – no pointers, indexes, etc.) Such a record was, however, somewhat more complex than a normal record: it could have multiple values of the same field-type. We used the example:

A:l, B:2, B:5
Let us call such a record “polynormal” (where normal records are also regarded as polynormal). Well, we are faced with a problem: should we have only normal records, as Dr Codd proposed, or should we allow ourselves polynormal records? How can we answer this question, which certainly seems to be a question about structure? Contrast our question about relations and tables.

We can look – as we have looked at inordinate length – at the interpretation of records. The Entity interpretation gives us no assistance: but that interpretation, in any event, gives us no clue how to interpret operations like projection or join. Presumably any join of, say, two person records, represents an entity: but what sort of entity? And how can we tell which one of that sort?

The Classical interpretation looked more promising, indeed it seems entirely satisfying under self-interpretation. And it is the Classical interpretation that makes the proposal of normality so persuasive. A normal record type represents a predicate, and a normal record of that type represents a proposition formed by inserting a name in each place of that predicate. This close fit between the FOPC and normal records is the most powerful argument for accepting Dr Codd’s proposal.

Yet we have, except perhaps for self-interpretation, rejected the Classical interpretation. The names (values) found in records are no longer the proper names demanded by the FOPC: we still interpret in terms of the FOPC, but not in the intimate, one-for-one way that provided an obvious argument for normality. So the argument for polynormality looks a little stronger; that for normality a little weaker.

And, of course, when we consider the TT-DD, which is – I think, anyway – a pretty natural formalisation, our records are just polynormal records. And there is no more reason to impose normality than to impose value-normality.

 

 

The Postulate of Two-Dimensionality

 

Incidentally, you may well be wondering why I call it a Theory of Two-Dimensional Data, when its records don’t exactly fit neatly in a table. It’s because of another postulate I’m going to make, and what follows from it.

P8
FOR EACH x, FOR EACH y, if x is a field and y is a field then if x is the same format as y and x is the same value as y then x is the same record as y.
 
The identity of a field is determined by the identities of its format (field-type) and its value: format and value are the two dimensions.
Given a record (a polynormal record) we can draw a matrix with a row for each field format and a column for each value. Each cell (row/column intersection) can then be ticked or not accordingly as the value is found in a field of that format. Thus the record, A:l, B:2, B:5 looks like this:

The format of a record is, in effect, the projection (in the mathematical sense) of such a matrix onto its vertical axis; the value is the projection onto its horizontal axis. A normal record has just one tick in each row, a value-normal record has one tick in each column. We’re certainly not tempted to enforce value-normality, so it seems a shame (don’t you think?) to destroy such an elegant symmetry. I’m sure Dr Codd wouldn’t approve: he invented normal records to avoid “repeating groups”, and now we’re thinking of allowing – well, not repeating groups – repeating fields.

But this is important; it’s the very soul and essence of formatted data: a field comprises a type and value, a question and answer. Fl is the same field as F2 when Fl is the same answer to the same question as F2. And all we do is add the definition that a record comprises its fields, and we get polynormality. It’s that simple.

 

 

Restriction Conditions and the Quantifying Operators

 

Let’s think for a moment about what complexities might arise if we used polynormal records. I’m not at all certain, by the way, that we should use polynormal records, but I have another (ulterior) motive for this exercise as it happens. I think you can see that we’ll have no problems with projection or Cartesian product. But restriction conditions are going to get a little complex. In a restriction on a normal record we can just use a column name, say C, to pick out a value, and then compare it – I’ll use “@” as a schematic operator, it might be “=” or “<” or whatever. We could compare the column to a literal, or to another column, say D. So the condition would look like:

C@D
Pretty simple. But now C and D might comprise any number of values: I’ll assume not merely one or more, but zero or more.

Consider this case first, that for each value in D, there is a value in C that is “@” to it; we write this (I pronounce it “C left at D”, though my “left” is a right brancket!):

C]@D
 
FOR EACH d, FOR SOME c, c @ d
And conversely (“C right at D”):
C@[D
 
FOR EACH c, FOR SOME d, c @ d
And now consider the “overlap” case: some value in C is “@” to some value in D; we write (“C around at D”):
C[@]D
 
FOR SOME c, FOR SOME d, c @ d
We might combine these in various ways (first, “C about at D”):
C]@[D =df
C]@D & C@[D
 
(FOR EACH d, FOR SOME c, c @ d) AND (FOR EACH c, FOR SOME d, c @ d)
C@D =df
C]@D & C[@]D & C@[D
 
(FOR EACH d, FOR SOME c, c @ d) AND (FOR SOME c, FOR SOME d, c @ d) AND (FOR EACH c, FOR SOME d, c @ d)
I’ll try to bring all of them together in a diagram, but first an example of what I’ll call a “minimal matrix”. This little matrix:

means we have one C value (one row), and two D values (two columns), and the C value is “@” to just one of the D values. So that makes:

C]@D false
C@[D true
C[@]D true
So “C]@[D” is false, and so is “C@D”. The matrix is minimal because if we made it any smaller, at least one of the three truth values listed would change.

And here’s my diagram:

In each quarter, the upper (top left) matrix is for when “C[@]D” is false, and the lower (bottom right) matrix is for when “C[@]D” is true. Notice the empty matrices: the vertical when we have no D values (so “C]@D” is trivially true), the horizontal when we have no C values (so “C@[D” is trivially true), and the origin alone, when we have neither C nor D values.

Remember the combined definitions:

C]@[D =df C]@D & C@[D
 
That’s true in the bottom right quarter only.
 
C@D =df C]@D & C[@]D & C@[D
 
That’s true in the bottom right quarter, lower matrix.
You can think of these “]” and “[” modifiers like this: apply the “@” between each C and each D value. That gives you a matrix of values (true when c @ d, or false when NOT c @ d), and the diagram shows minimal examples of such matrices.
C[@]D says there is some true cell in the matrix.
 
C]@D says there is a true cell for each D (in every column).
 
C@[D says there is a true cell for each C (in every row).
 
C]@[D says there is a true cell in each row and each column.
 
C@D says there is a true cell in each row, and in each column, and the matrix is not empty.
And now consider a special case, when “@” is “=”.
C[=]D says that C and D overlap (have a common value).
 
C]=D says that D is either empty or part of C.
 
C=[D says that C is either empty or part of D.
 
C]=[D says that C and D have all the same values.
 
C=D says that C and D are not empty and have all the same values.
Well, I think you’ll agree that that’s pretty complex; but it’s probably not so complex as to be unmanageable if we found some good reason to go to polynormal rather than normal records. Anyway, as I said, I have an ulterior motive: not for nothing did I assume that there might be no value in the columns.

 

 

Restrictions on Present and Absent Fields

 

So let’s suppose we didn’t allow polynormal records, just normal records, but we allowed that there might be no value in some column of a record. It would mean: the record didn’t contain that type of field, not that it had – by some foundation shattering magic – an “empty” field, a question without an answer: we wouldn’t countenance that.

So we’re going to exploit the “]” and “[” operators to handle restriction conditions of the form “C@D”, where “C” or “D” might denote either one value or no value. This gives us a somewhat simpler diagram than we had before.

Naturally enough, all the minimal matrices with a side greater than 1 have been removed, giving us only five matrices. Consequently, the lower matrix in the bottom right corner gives us what are now equivalent expressions (we would naturally use the first):

C@D
 
C[@]D
C@D” holds true when both C and D are present and C is “@” to D. Either matrix in the bottom right gives us the expression:
C]@[D
C]@[D” holds true when either C@D or neither C nor D is present.

Notice also that “C]@D” holds true when either C@D or D is not present, and “C@[D” when either C@D or C is not present.

So let’s consider again the special case, when “@” is “=”.

C=D says that C and D have the same value.
 
C]=[D says that C and D are the same, in the sense: they either have the same value or neither has a value.
 
C]=D says that D either has no value or has the same value as C.
 
C=[D says that C either has no value or has the same value as D.

 

 

Completed Identities and Cryptic and Esoteric Data

 

What on earth is old Platoclast up to? That’s what you ask yourselves isn’t it? Not satisfied with pursuing interpretation of normal records to the bitter end, I indulge in sundry digressions and animadversions. I step aside from the path to examine exotic flowers: remember cryptic values that identify (show sameness) without being usable to pick out records; remember taking those column criteria of identity and defining their completions, just in case a such-and-such column might contain a value that wasn’t the same such-and-such as itself; and now we’ve considered comparing values that might be multiple, or even absent. Believe me, it’s all in a good cause: several good causes actually.

But I’ve just about done with collecting these exotics: a few more words and they’ll be ready to come into their own. Firstly, you will have appreciated that completed identities, where for column values C and D we say “CºD” rather than “C=D”, can be quite easily and obviously combined with our “]” and “[” operators. The ideas they express are orthogonal, even – especially – when we consider only zero or one value. We use the “]” and “[” to talk about possibly absent values; but “º” isn’t concerned with absent values, rather with present values that aren’t the same such-and-such as anything.

Secondly, you may be wondering where those record identifiers, present only at the form level, fit in when we get to the concept level. Are they cryptic? or absent? or what? Well, they’re not cryptic (though they might be cryptic at the form level). At the concept level they don’t appear at all, and they can’t be referenced by their column name, because that’s been projected away. In implementation – and I stress that this is an implementation, and invisible to the user – we might keep them attached to their records. And then I would call them esoteric: they are manifested as record type names (like “EMPLOYEE”), but in no other way. Notice that keeping esoteric record identifiers would greatly assist update of the data base: when records (in a view) are updated, it rather helps to know which kept records need to be modified.

Cryptic data is quite different: it occupies a column, and it can be compared (for equality at least, and possibly for ordering as well). It just can’t be clearly seen.

 

 

SQL Introduced

 

As we have a few minutes left, let’s look at an example of a relational query language: the most popular of them all, SQL. I want to look at this now because we shall be using it for examples before long, and because it has an oddity: it allows duplicate records in relations, and this question of duplicate records pertains directly to that of esoteric columns (columns projected away, but retained in implementation, for implementation reasons).

I’m going to look only at a tiny subset of SQL, a part of its query statement, the SQL SELECT. It looks like this:

SELECT columnl, column2, ...
   FROM relationl, relation2, ...
  WHERE condition
It means:
1
Form the Cartesian Product of relationl, relation2, ...
2
Restrict the resulting rows by applying condition to each.
3
Project columnl, column2, ... to give the result.
The strange thing is that if the projection removes all the keys of the result, the result may contain more than one row comprising exactly the same fields. Obviously, some sort of esoteric record identifier is kept with each row. Indeed, not only does the user see duplicate rows, and the SQL row counter function (COUNT) count duplicate rows, but (under certain circumstances) fields can be updated through these duplicate rows (update through one row modifying one stored record, and update through another, duplicate, row modifying another).

If you think about it, you’ll find that this is not a good idea: we may wish to keep esoteric keys in implementation, but we certainly don’t want to do anything to modify our criterion of identity of records at any level. At least, the only thing we would allow is – at the concept level – the record type name (which is, admittedly, derived from the now esoteric key); but that means that the only duplicates we would allow are records with different type names (but meeting our criterion of identity). That is to say: records in different tables as perceived by a user at the concept level. (And even the Relational bigots allow those: records similar except that they occur in different relations. This is what demonstrates the essentiality of relations.) But that still doesn’t allow duplicate records in the same table.

Obviously, if we want rows that are different but look the same to a user, what we need are cryptic values. A cryptic value occupies a column: but the user can’t tell – not just by looking – whether or not a cryptic field in one record has the same value as that in another, though they could be compared in a restriction condition (an SQL WHERE, if SQL allowed cryptic values, which it doesn’t).

But I don’t know: perhaps we ought to say that SQL does provide some cryptic value support, and that’s why we seem to have duplicate rows. We would have to say that the “ALL” option in SQL, the default for a SELECT, meant: keep the cryptic key. And the “DISTINCT” option, which removes duplicates, would mean: project away the cryptic key value. So we ought to be very careful what we say, especially what we say about modifying SQL: should we (as the relational bigots say) get rid of these duplicate rows? or should we extend the cryptic value support?

Do notice the elegance of SQL from the users’ viewpoint: they don’t have to work out some series of restrictions, projections, and joins, to express a query . They just get all the queried data together in a Cartesian product (FROM), then apply all the restrictions (WHERE), and then project (SELECT) the required columns. At least, I think that’s what the SQL keywords mean; but perhaps FROM indicates the projection of the preceding columns (the ones after SELECT), and SELECT doesn’t mean anything. Anyway, that makes no difference to what the whole thing means: the projection of the restriction of the Cartesian product, which is very neat. Any query comprising restrictions, projections, and joins can be expressed in that way.

But, of course, it’s not necessarily performed in that sequence: it’s performed in a much more efficient sequence because these relational operations can be shuffled around each other quite harmlessly. For instance, a restriction on one of two joined tables can be applied before or after the join; any columns not mentioned anywhere in the query can be projected away at any point.

 

SITE HOME PAGE

More important than all that: the user has absolutely no control over the sequence of operations. The user just can’t say “Do this restriction before that projection”, or “Do this join before that join”. And remember: what the user can’t say, the user can’t get wrong.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture X (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XI

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.