(U03)

www.btinternet.com/~adrian.larner/database/pcl01

PLATOCLAST
ON DATA

Lecture I
Data Structure

 

 

Good afternoon, and welcome to Professor Geoffrey Platoclast’s lectures on Relational Data Bases[1]. We begin our examination of Relational Theory, or, as its proponents call it, the Relational Model of data, by seeing what Mr Date says about it in Part III of his splendid Introduction[2]. The model, he says, is concerned with the structure, the manipulation, and the integrity of data. What I’m going to say about these aspects of Relational Theory is that Dr Codd, who first proposed it[3], was half right about structure, almost entirely right about manipulation, and mostly wrong about integrity. But as everyone else was not merely very wrong, but – which is worse – totally confused, about data, Dr Codd’s achievement is very great indeed: comparable to Aristotle’s achievement in logic, or Lady Lovelace’s in programming[4].

 

 

The Definition of Formatted Data

 

Mr Date – who’s almost never confused, even when he’s wrong – begins by explaining some of the technical terms of the theory. Well, it’s always a good idea to start by saying what you’re talking about. Data – formatted data, as we know it in the data processing (DP) industry – consists of what are loosely termed “records” and are, though nowadays written on magnetic media, what in olden days (as my nephews and nieces say) were called “forms”, and written on paper. What you do with a form, to make it hold any information at all, is to fill in the particulars on it. And a particular – quaint old word – is, roughly speaking, a question whose answer is filled in. The technical term in DP is: a field. What we fill in, the answer, is: a value. We have no good word for the question, though we sometimes call that a “field” as well, and then we try to avoid the ambiguity by talking of “types” of field (questions) and “instances” (particulars, or questions and answers together).

Well, here we are still hovering on the threshold of data theory and already we’re finding ambiguities, and probably vaguenesses too. We need to clear them up, and of course Mr Date does clear them up, in his own inimitable way. But we need – even before that – to understand what is needed to clear them up: to equip ourselves, so to speak, with some conceptual tools that enable us to think and speak clearly.

 

 

Vagueness and Criteria of Application

 

Let’s step back from the problems of DP terms of art and consider the more general question of grasping a concept in the face of vagueness and ambiguity. A concept is vague when we’re not quite certain, or under some circumstances might not be quite certain, whether to apply it or not. As it happens, all concepts are vague, but some are vaguer than others. Take the concept “book”: there are certainly some things that we should unhesitatingly call “books”. But then there are other things, graphic novels (comic books), pocket diaries and organisers, magazines, talking books, and the like, that we may not be quite certain whether to call “books” or not. Or consider a book being slowly destroyed by fire. To begin with, it is clearly a book, and at the end it is clearly not a book but a pile of ashes. But we’ld be hard put – though it does not much matter – to say at what point it ceased to be a book. Fortunately we never have to clear up all vagueness, merely to reduce vagueness to a level where it does not get in our way. And the thing to clear up vagueness is a criterion of application for the concept. All this means is that we have, in one sense, grasped the concept “book” if we know what it means to be a book, that is, what has to be true of something if it is a book (and false of it if it is not a book). Sometimes, and particularly for technical terms, we do not have to discover a criterion of application: we can choose the criterion we want and use it to define the concept. (Scientists do this all the time.) In contrast, children learning to speak a language have to find out (by rather mysterious means) the criteria of application current in their society.

 

 

Ambiguity and Criteria of Identity

 

Let us now glance back at the ambiguity we have already mentioned. When we look at records, or filled-in forms (we shall ignore blank forms), we see all those particulars, or fields. Now, the entry “SURNAME: Smith” on one form (we shall use this “QUESTION: Answer” or “FIELD TYPE: Value” format to express fields) is obviously a different field from the entry “BIRTHDATE: 12th January 1944” on the same, or another, form. But is the entry “SURNAME: Smith” a different field from “SURNAME: Jones” on another form? And here we want to say: in one sense, yes, and in another sense (they are both the SURNAME field), no.

Now this is not in the least a question of vagueness. We might have – I hope we do have – a criterion of application of “field” sufficient to say that “SURNAME: Smith” is a field and so is “SURNAME: Jones”. But this does not help when we want to answer the question: is this the same field as that? And we get exactly the same problem in our book example. We may have specified our criterion of application, ruling out (for instance) magazines, but including graphic novels. But we could still be given pause by the question: is my copy of “Persuasion” the same book as your copy of “Persuasion”?

 

 

Fields

 

To clear up such ambiguities we need (and, again, we often choose rather than discover) a criterion of identity to tell us whether or not this is the same such-and-such[5] as that. So let us say that our criterion of identity of “field” is, if Fl is a field and F2 is a field:

Fl is the same field as F2 when Fl comprises the same question and the same answer (value) as F2.
Notice that, pedantically, we do not say “if Fl and F2 are fields” because this would imply that they were different fields: a bit silly if we are about to give a criterion that tells us whether they are different fields or not.

Our criterion is that of “instances” of fields. The type criterion is, if Fl is a field and F2 is a field:

Fl is the same type of field (or “the same field-type”) as F2 when Fl has the same question as F2.
And this means that our “instance” criterion could be written as:
Fl is the same field as F2 when Fl is the same field-type as F2 and Fl has the same value as F2.
It also means – because “field” (instance) and “field-type” have the same criterion of application – that field-types are not something extra: there are no field-types in addition to the fields. Indeed, every field is an instance (a field) and is a type (a field-type).

Likewise, if we decide on some criteria of identity for books – we distinguish them by talking about copies, editions, and so on – that does not mean that there are some extra things. When we go to a shop and buy a copy of a book we do not additionally have to buy an edition: to buy a copy is to buy an edition, even though we can, in buying different copies, both buy the same edition.

That little digression has, I hope, not merely clarified some of our terms but – more importantly – equipped us with some conceptual tools, criteria of application (When have we got one?) and criteria of identity (When have we got one?), that will be very useful later.

 

 

Records

 

Now we come to one of the most important of Dr Codd’s proposals, and to two of Mr Date’s important properties of relations[7]: no duplicate rows and unordered rows. Dr Codd’s proposal[8], though he did not put it in this way, is for a certain criterion of identity for records, if R1 is a record and R2 is a record:

Rl is the same record as R2 when Rl comprises all and only the fields of R2.

Now suppose for a moment that in this criterion we took the word “field” to be ambiguous (as it was before we chose the criteria of identity for field and field-type). Then “record” would be ambiguous between record (instance) and record-type, in just the same way as “field” was. But now, of course, we can again use this valuable technique of specifying a criterion of identity to say what we mean by record-type (or record format):

Rl is the same record-type as R2 when Rl comprises all and only the field-types of R2.
Again, record-types are not something over and above records: anything that is a record is a record-type. The concepts have the same criterion of application.

But what is that criterion of application? Well, as it happens, by what looks like magic – or trickery, depending on how cynical you are – we can use the criterion of identity of “record” to get its criterion of application. This is how it is done:

R is a record when there is something that is the same record as R[9].
But this means (when we unwrap “is the same record as” according to our criterion of identity):
R is a record when there is something that comprises all and only the fields of R.
But, when there is something that comprises all and only the fields of R, then certainly R comprises all and only the fields of R, so:
R is a record when R comprises all and only the fields of R.
And you may think, with some justification, that this criterion of application is not only dubiously derived (though it is actually quite properly derived) but also quite trivial. And in a way, it is quite trivial. But we used to think it was wrong, until Dr Codd persuaded us that it was right. So many of the results of genius are trivial: after they are discovered.

Think back to some of the methods of data organisation that we used before we had Relational systems. There were sequential files in which the same record (according to our criterion of identity) might appear twice: let us say, two sequential records might be one and the same record; they might contain all and only the same fields.

 

 

Records are Essential

 

Think also of why records are important. Suppose you kept a number of records as paper forms. At the very simplest they might each comprise two fields, SURNAME and GIVEN NAME, so two specimens might be:

SURNAME: Smith, GIVEN NAME: William
SURNAME: Jones, GIVEN NAME: Robert
And suppose a vandal cut up your records, but did not destroy their fields. You would then have the four records (not four new records!):
GIVEN NAME: William
GIVEN NAME: Robert
SURNAME: Jones
SURNAME: Smith
As you see, you have lost information (was it Robert Jones and William Smith, or William Jones and Robert Smith?) The records we keep contain information over and above that in their fields: information about which fields go with which. The record is, as Mr Date calls it, essential[10]. Records keep information together: they relate one item of information with another.

Now consider the records of a complex, non-relational data base. Perhaps they (and therefore their fields) are connected to each other by pointers, or perhaps some fields are implicitly connected by the way indexes are implemented. Indeed, the “records” in such data bases are not records as we have defined them: they do not comprise all and only their fields. If a vandal left all the records in such a data base intact, in the sense of keeping together all the fields in each record, yet severed all the other (implicit) connections (pointers, indexes, and the like), information would be lost. In such a system, the implicit connections are essential.

 

 

Other Data Structures are Inessential

 

So, in saying that a record comprises its fields, and nothing else, we (following Dr Codd) exclude all implicit data connections. In a Relational system, the only way in which fields are held together is by occurring together in the same data base record. Well, almost the only way[11].

But why should we insist on having only one way to hold fields together? I’m afraid the answer (well, one important answer) is both cynical and true. We will probably make mistakes in implementing any method of holding fields together, and users will make mistakes in exploiting that method. So the fewer methods there are, the more likely we, and the users, are to get them right. And one method is the fewest we can get away with.

Of course, it also makes things simpler for the data base designer when choosing what method to use to connect fields; and for the user in guessing which of those methods the data base designer chose (and then wondering why). And you can see that the decision between one method and more than one is far more important than the choice between, say, three methods or four (at least when you calculate the percentage savings on time and effort of choosing).

 

 

No Duplicate Records

 

And now we can easily understand Mr Date’s two important properties. A relation is a collection of records. (We shall have more to say about relations later, but that will serve for the moment.) Of course, given our criterion of identity of records, it could not contain duplicate records: for R1 to be a duplicate (that is, a copy) of R2, R2 would have to have all and only the fields of R1. But if R2 has all and only the fields of Rl, then it is not a duplicate of R1: it is the same record[12] as R1.

Naturally, if our criterion of identity of “record” were different, so that two records could have the same fields but different record numbers, or pointers connecting them to different places, or different index entries, then two records could have the same fields (and we would get duplicates). But Relational records comprise their fields: they do not have other bells and whistles, and that is why they cannot be duplicated.

 

 

Records are Unordered

 

And likewise we have Mr Date’s second property: no ordering. This does not mean that our records are simply unordered. We could have a “SEQUENCE NUMBER” field in each record, and order them on SEQUENCE NUMBER. It means that, ignoring their fields, our records are unordered (contrast that with a sequential file in which there is an ordering possibly unrelated to any fields).

You see, we might say: these records are ordered in such-and-such sequence. But this would be equivalent to numbering them, 1, 2, 3, and so on. And then a record would not comprise only its fields, because it would have a number as well. Such a number would be another sort of implicit connection between records (any record but the last in a file would have a “next” connection to another record), and if this connection were cut information would be lost. So we would not have only one way to connect fields.

Very well then, our – apparently trivial – criteria (of identity and application) for “record” mean, in effect, that the record is, using Mr Date’s term, the only essential connection between fields. This does not mean that there can be no ordering, pointers, or indexes in a Relational data base. It means merely that we would lose no information (though we might be put to inconvenience) if a vandal destroyed all our ordering, pointers, and indexes: as long as the records were left intact. This is what is meant by ordering, pointers, and indexes being inessential.

 

 

Tables

 

We have come a long way without seriously considering what you might take to be an, even the, important concept in Relational Theory: the relation. As you will have read, Mr Date describes a relation, informally, as a table. And obviously it is very like a file or data set: a collection of records. Let us, for the moment, say “table”. And naturally (now, I hope) we will want to say what criteria we are going to choose for “table”.

 

 

Columns and Rows

 

One very obvious thing to say is that a table is two-dimensional. We have already met one dimension, that of fields within records. In our tables this will be the “column” dimension: a column in a table corresponds to a field-type. Imagine that we have some records. We could write them out as rows, one under another (as we have seen, it would not matter what row-order we wrote them in). Then we would, if necessary, shuffle the fields about within each record so as to line up all fields of the same field-type in the same column of the table. Notice that this shuffling cannot produce a different record, according to our criterion of identity of records. That, very quickly, covers Mr Date’s third important property: columns are unordered.

The second dimension is that of the rows, which are simply the records. Notice that the columns of the table are, we might say, indexed on column names, which are simply (names of) our field-types. I’m using “indexed on” here to mean simply: we can use a column name to pick out a column. But what are the rows indexed on? What can we use to pick out a row? As we know, we cannot have any extra index: we could not name or number the rows (because the rows are records; remember the criteria of “record”). The only things we can use to index the rows are: the rows themselves, the collection of fields in each. But that is all right, because each row comprises a distinct collection of fields: if Rl comprised the same collection of fields as R2, Rl would be the same row (record) as R2.

Sometimes we find that (or we actually ensure that) we could index (uniquely identify) the rows of a table without taking all columns (all field-types) in the rows into account. For example, we might have a table of EMPLOYEE records, and we might ensure that each record had a uniquely valued EMPLOYEE NUMBER field. Then this single field (rather than the whole collection of fields) could be used to index the rows in the table. Such a field is termed a key, as are, collectively, any number of fields that serve for unique row indexing. As we have seen, we always have a key, even if the only key we have comprises all the columns of the table.

And now we can easily state the criterion of identity of “table”:

T1 is the same table as T2 when T1 contains all and only the rows[13] (records) of T2.

Well, we have not quite reached the concept of “relation” yet. To get there we need one more step, a refinement of the criterion of application. But this step is one that I rather think we should not take, and one of the points where Dr Codd went wrong. So before we take that step, we will take another, which Dr Codd also took, and which he was probably right – and certainly inspired – to have taken.

 

 

Polynormal Records

 

Think about this record:

A:1, B:2, B:5
And before you object that there is something wrong with it, check back to the criteria of “record”: you will find that it is perfectly in order, despite having two values (2 and 5) within the same field-type (B). What would happen if we tried to put such a record into a table? Somehow we have to line up two fields within the same column, which could be done:
A       B
 
1       2
        5
We have to think of that as one row. We could put it like this:
A      -B-
 
1      2 5
Or like this:
A      B B
 
1      2 5
No, it’s not three columns; it’s two columns but one of them is inscribed twice.

We could even put it like this:

A       B
 
1     (2,5)
And we could write the record as:
A:1, B:(2,5)

 

 

Normal Records

 

Dr Codd thought this was unwise, however it was written. “Use two records,” he would have said:

A:1, B:2
A:1, B:5

And Dr Codd was right, I reckon – but why? As we’ve seen, by using tables we deal in two dimensions, columns and rows. That is, we deal in at least two dimensions. But if we allow a record to contain different values in the same field-type (that is, different fields of the same type) we find ourselves dealing in more than two dimensions, we might say three in this simple example (well, two and a little widgy bit, I’ld say), but there is no limit to the amount of free rein we could give ourselves. [14]

And the simplest explanation of Dr Codd’s proposal is that the fewer dimensions we have, the better: it makes the system easier for the user to understand; it makes it easier to get the system to perform well. Even looking at the short distance we have now travelled, we have found problems in tabulating a record with two fields of the same type. And of course, cynical but true, we will make mistakes in implementing, and the user in using, each dimension. So the fewer dimensions we have ... but we’ve been here before[15].

And yet, and yet! Isn’t it annoying that our criterion of identity of records, records without implicit connections, allows just slightly more than Dr Codd rightly proposed. I hate that. I lie awake nights worrying about it.

Mr Date expresses the restriction as: all attribute (field) values are atomic. But he sells the pass – not quite, he offers it for sale – by allowing what he calls “composite attributes” (the unfortunate term “attribute” means, roughly, “column”). What he has in mind are fields like:

BIRTHDATE: 12th January 1944
What I want to say here is: I don’t mind what criteria you use for “value”, as long as you have some. If, according to your criteria, 12th January 1944 is a value, and 12th and January and 1944 are not, then this field is just fine. If, according to your criteria, 12th, or January, or 1944 – or for that matter “2th Janu” or “ary 194” – are values of BIRTHDATE then a record containing BIRTHDATE: 12th January 1944 is not normal: it has multiple fields of the same field-type. And if you say that 12th, January, and 1944 are values, but not of BIRTHDATE, then either you must say that they do not occur in a record containing BIRTHDATE: 12th January 1944, or, if you say they do occur, you must put them into an appropriate field (otherwise they are answers without questions, which is just not allowed), thus:
BIRTHDATE DAY: 12th
BIRTHDATE MONTH: January
BIRTHDATE YEAR: 1944
Incidentally, you should not flinch from denying that, say, January occurs in 12th January 1944. After all, the word “end” does not occur in “trendy”: it’s just an orthographic accident. 12th January 1944 is the same date as 44.012, and clearly January does not occur in 44.012. Also, if January occurs in 12th January 1944, then it appears in The day after 31st January 1944, i.e. in 1st February 1944.[16]

This then is a normal, or “first normal form” (1NF) record: a record that contains one and only one value in each of its fields. Another way to put this is as follows:

Let us say that R1 is part of R2 when R2 contains all the fields of R1. Then R1 is the same record as R2 when Rl is part of R2 and R2 is part of R1 (so “part”, in this sense, includes “whole”).
 
Let us say that Rl is a proper part of R2 when R1 is a part of R2 but R2 is not a part of R1 (so “proper part” is like “part” but does not include “whole”).
 
And now we can say that R is normal if no proper part of R is the same record-type as R.
You can see how this works. A record comprising “A:1, B:2” has only two proper parts, “A:l” and “B:2”, so to get a proper part we have to lose a field-type (either the B or the A respectively), and then we do not get a record of the same record type. If however, as in “A:1, B:2, B:5”, we can lose a field (like B:5) and not lose a field-type (we then get the proper part “A:1, B:2”), we have a proper part that is the same record-type (it has all and only the same field-types, viz A and B).

It should be stressed again that we never get anything less than a field: we never deal with questions without answers, or answers without questions (otherwise we might say that “A:1, B:” was a proper part of “A:1, B:2”). Indeed, having got to this point we could anticipate a little more of the formalisation coming up in our next lecture by noting that, although we started with fields in describing what records are, we could go the other way and define “field” on “record”, if R is a record:

R is a field when R has no proper part.
That is, a field is the smallest – atomic – part of a record.

We have now covered all four of MrDate’s important properties of relations, and we have seen that tables – as far as we have got – are two-dimensional collections of records, and these records ought to be normal. Usually, though not always, “relation” is taken to imply normality[17]. Dr Codd spoke of normal relations, in contrast to our use here of “normal record[18]”, but a normal relation is just a relation containing normal records.

 

 

Tables of Uniform Format

 

But, as I mentioned, there is another property of relations that Dr Codd proposed, and we ought to consider. Mr Date does not discuss this property (though it does follow from his definition of “relation”[19]). For that matter, Dr Codd did not discuss it either: he rather took it for granted. What do we mean by saying that our tables are two-dimensional? We mean that we can pick out any field in a table by giving its row and column indices (saying which row and column it is in). And by saying our tables are normal, we mean that we can pick out any field uniquely.

Dr Codd proposed a bit more than this: that by giving any row and column in a table – we can now say, “relation” – we should pick out some field. And this means that every record (row) in a relation is the same record-type. Or, to put it in yet another way: every row in a relation intersects every column in it. Look – we could put these two records into a table:

A:1, B:2
A:3, C:5
It would look like this:
A       B       C
 
1       2
3               5
The first row shown does not intersect the C column, and the second does not intersect the B column. This table is not a relation (and I say, no worse for that). And if you are thinking, “He’s talking about Nulls,” think again! We will come to Nulls later, but this is not them.

 

 

Are Tables Essential?

 

There is however a better way to put the difference between Dr Codd and me on this subject of “uniform format”. We have said: should all tables be relations, that is, contain only one record type? But the better way is this, to use Mr Date’s term again: should tables (whether relations or not) be essential? Or treated as if they were essential? I think not.

Remember that devious vandal. Suppose we have a Relational data base, or – let us say – a Tabular data base: one with records as we have defined them (so no implicit connections between fields), the records comprising fields, and the records kept in tables (relations or not). And now let the vandal destroy the tables, but leave the fields and the records intact. Have we lost any information? If not, then tables are not essential, and should not be treated as if they were.

 

SITE HOME PAGE

What is essential is that you start arriving on time for these lectures.... Oh, sorry - it’s five past to five to, is it?... Right: see you for some serious work at the crack of five past nine, tomorrow.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture I (rtf, Word for Windows compatible)

Platoclast on Data: Lecture II

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.