|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl01 |
|
PLATOCLAST Lecture I |
||
|
|
||
|
|
Good afternoon, and welcome to Professor Geoffrey Platoclast’s lectures on Relational Data Bases[1]. We begin our examination of Relational Theory, or, as its proponents call it, the Relational Model of data, by seeing what Mr Date says about it in Part III of his splendid Introduction[2]. The model, he says, is concerned with the structure, the manipulation, and the integrity of data. What I’m going to say about these aspects of Relational Theory is that Dr Codd, who first proposed it[3], was half right about structure, almost entirely right about manipulation, and mostly wrong about integrity. But as everyone else was not merely very wrong, but which is worse totally confused, about data, Dr Codd’s achievement is very great indeed: comparable to Aristotle’s achievement in logic, or Lady Lovelace’s in programming[4]. |
|
|
|
||
|
|
The Definition of Formatted Data |
|
|
|
Mr Date who’s almost never confused, even when he’s wrong begins by explaining some of the technical terms of the theory. Well, it’s always a good idea to start by saying what you’re talking about. Data formatted data, as we know it in the data processing (DP) industry consists of what are loosely termed “records” and are, though nowadays written on magnetic media, what in olden days (as my nephews and nieces say) were called “forms”, and written on paper. What you do with a form, to make it hold any information at all, is to fill in the particulars on it. And a particular quaint old word is, roughly speaking, a question whose answer is filled in. The technical term in DP is: a field. What we fill in, the answer, is: a value. We have no good word for the question, though we sometimes call that a “field” as well, and then we try to avoid the ambiguity by talking of “types” of field (questions) and “instances” (particulars, or questions and answers together). Well, here we are still hovering on the threshold of data theory and already we’re finding ambiguities, and probably vaguenesses too. We need to clear them up, and of course Mr Date does clear them up, in his own inimitable way. But we need even before that to understand what is needed to clear them up: to equip ourselves, so to speak, with some conceptual tools that enable us to think and speak clearly. |
|
|
|
||
|
|
Vagueness and Criteria of Application |
|
|
|
Let’s step back from the problems of DP terms of art and consider the more general question of grasping a concept in the face of vagueness and ambiguity. A concept is vague when we’re not quite certain, or under some circumstances might not be quite certain, whether to apply it or not. As it happens, all concepts are vague, but some are vaguer than others. Take the concept “book”: there are certainly some things that we should unhesitatingly call “books”. But then there are other things, graphic novels (comic books), pocket diaries and organisers, magazines, talking books, and the like, that we may not be quite certain whether to call “books” or not. Or consider a book being slowly destroyed by fire. To begin with, it is clearly a book, and at the end it is clearly not a book but a pile of ashes. But we’ld be hard put though it does not much matter to say at what point it ceased to be a book. Fortunately we never have to clear up all vagueness, merely to reduce vagueness to a level where it does not get in our way. And the thing to clear up vagueness is a criterion of application for the concept. All this means is that we have, in one sense, grasped the concept “book” if we know what it means to be a book, that is, what has to be true of something if it is a book (and false of it if it is not a book). Sometimes, and particularly for technical terms, we do not have to discover a criterion of application: we can choose the criterion we want and use it to define the concept. (Scientists do this all the time.) In contrast, children learning to speak a language have to find out (by rather mysterious means) the criteria of application current in their society. |
|
|
|
||
|
|
Ambiguity and Criteria of Identity |
|
|
|
Let us now glance back at the ambiguity we have already mentioned. When we look at records, or filled-in forms (we shall ignore blank forms), we see all those particulars, or fields. Now, the entry “SURNAME: Smith” on one form (we shall use this “QUESTION: Answer” or “FIELD TYPE: Value” format to express fields) is obviously a different field from the entry “BIRTHDATE: 12th January 1944” on the same, or another, form. But is the entry “SURNAME: Smith” a different field from “SURNAME: Jones” on another form? And here we want to say: in one sense, yes, and in another sense (they are both the SURNAME field), no. Now this is not in the least a question of vagueness. We might have I hope we do have a criterion of application of “field” sufficient to say that “SURNAME: Smith” is a field and so is “SURNAME: Jones”. But this does not help when we want to answer the question: is this the same field as that? And we get exactly the same problem in our book example. We may have specified our criterion of application, ruling out (for instance) magazines, but including graphic novels. But we could still be given pause by the question: is my copy of “Persuasion” the same book as your copy of “Persuasion”? |
|
|
|
||
|
|
Fields |
|
|
|
To clear up such ambiguities we need (and, again, we often choose rather than discover) a criterion of identity to tell us whether or not this is the same such-and-such[5] as that. So let us say that our criterion of identity of “field” is, if Fl is a field and F2 is a field:
Our criterion is that of “instances” of fields. The type criterion is, if Fl is a field and F2 is a field:
Likewise, if we decide on some criteria of identity for books we distinguish them by talking about copies, editions, and so on that does not mean that there are some extra things. When we go to a shop and buy a copy of a book we do not additionally have to buy an edition: to buy a copy is to buy an edition, even though we can, in buying different copies, both buy the same edition. That little digression has, I hope, not merely clarified some of our terms but more importantly equipped us with some conceptual tools, criteria of application (When have we got one?) and criteria of identity (When have we got one?), that will be very useful later. |
|
|
|
||
|
|
Records |
|
|
|
Now we come to one of the most important of Dr Codd’s proposals, and to two of Mr Date’s important properties of relations[7]: no duplicate rows and unordered rows. Dr Codd’s proposal[8], though he did not put it in this way, is for a certain criterion of identity for records, if R1 is a record and R2 is a record:
Now suppose for a moment that in this criterion we took the word “field” to be ambiguous (as it was before we chose the criteria of identity for field and field-type). Then “record” would be ambiguous between record (instance) and record-type, in just the same way as “field” was. But now, of course, we can again use this valuable technique of specifying a criterion of identity to say what we mean by record-type (or record format):
But what is that criterion of application? Well, as it happens, by what looks like magic or trickery, depending on how cynical you are we can use the criterion of identity of “record” to get its criterion of application. This is how it is done: But this means (when we unwrap “is the same record as” according to our criterion of identity):
Think back to some of the methods of data organisation that we used before we had Relational systems. There were sequential files in which the same record (according to our criterion of identity) might appear twice: let us say, two sequential records might be one and the same record; they might contain all and only the same fields. |
|
|
|
||
|
|
Records are Essential |
|
|
|
Think also of why records are important. Suppose you kept a number of records as paper forms. At the very simplest they might each comprise two fields, SURNAME and GIVEN NAME, so two specimens might be:
Now consider the records of a complex, non-relational data base. Perhaps they (and therefore their fields) are connected to each other by pointers, or perhaps some fields are implicitly connected by the way indexes are implemented. Indeed, the “records” in such data bases are not records as we have defined them: they do not comprise all and only their fields. If a vandal left all the records in such a data base intact, in the sense of keeping together all the fields in each record, yet severed all the other (implicit) connections (pointers, indexes, and the like), information would be lost. In such a system, the implicit connections are essential. |
|
|
|
||
|
|
Other Data Structures are Inessential |
|
|
|
So, in saying that a record comprises its fields, and nothing else, we (following Dr Codd) exclude all implicit data connections. In a Relational system, the only way in which fields are held together is by occurring together in the same data base record. Well, almost the only way[11]. But why should we insist on having only one way to hold fields together? I’m afraid the answer (well, one important answer) is both cynical and true. We will probably make mistakes in implementing any method of holding fields together, and users will make mistakes in exploiting that method. So the fewer methods there are, the more likely we, and the users, are to get them right. And one method is the fewest we can get away with. Of course, it also makes things simpler for the data base designer when choosing what method to use to connect fields; and for the user in guessing which of those methods the data base designer chose (and then wondering why). And you can see that the decision between one method and more than one is far more important than the choice between, say, three methods or four (at least when you calculate the percentage savings on time and effort of choosing). |
|
|
|
||
|
|
No Duplicate Records |
|
|
|
And now we can easily understand Mr Date’s two important properties. A relation is a collection of records. (We shall have more to say about relations later, but that will serve for the moment.) Of course, given our criterion of identity of records, it could not contain duplicate records: for R1 to be a duplicate (that is, a copy) of R2, R2 would have to have all and only the fields of R1. But if R2 has all and only the fields of Rl, then it is not a duplicate of R1: it is the same record[12] as R1. Naturally, if our criterion of identity of “record” were different, so that two records could have the same fields but different record numbers, or pointers connecting them to different places, or different index entries, then two records could have the same fields (and we would get duplicates). But Relational records comprise their fields: they do not have other bells and whistles, and that is why they cannot be duplicated. |
|
|
|
||
|
|
Records are Unordered |
|
|
|
And likewise we have Mr Date’s second property: no ordering. This does not mean that our records are simply unordered. We could have a “SEQUENCE NUMBER” field in each record, and order them on SEQUENCE NUMBER. It means that, ignoring their fields, our records are unordered (contrast that with a sequential file in which there is an ordering possibly unrelated to any fields). You see, we might say: these records are ordered in such-and-such sequence. But this would be equivalent to numbering them, 1, 2, 3, and so on. And then a record would not comprise only its fields, because it would have a number as well. Such a number would be another sort of implicit connection between records (any record but the last in a file would have a “next” connection to another record), and if this connection were cut information would be lost. So we would not have only one way to connect fields. Very well then, our apparently trivial criteria (of identity and application) for “record” mean, in effect, that the record is, using Mr Date’s term, the only essential connection between fields. This does not mean that there can be no ordering, pointers, or indexes in a Relational data base. It means merely that we would lose no information (though we might be put to inconvenience) if a vandal destroyed all our ordering, pointers, and indexes: as long as the records were left intact. This is what is meant by ordering, pointers, and indexes being inessential. |
|
|
|
||
|
|
Tables |
|
|
|
We have come a long way without seriously considering what you might take to be an, even the, important concept in Relational Theory: the relation. As you will have read, Mr Date describes a relation, informally, as a table. And obviously it is very like a file or data set: a collection of records. Let us, for the moment, say “table”. And naturally (now, I hope) we will want to say what criteria we are going to choose for “table”. |
|
|
|
||
|
|
Columns and Rows |
|
|
|
One very obvious thing to say is that a table is two-dimensional. We have already met one dimension, that of fields within records. In our tables this will be the “column” dimension: a column in a table corresponds to a field-type. Imagine that we have some records. We could write them out as rows, one under another (as we have seen, it would not matter what row-order we wrote them in). Then we would, if necessary, shuffle the fields about within each record so as to line up all fields of the same field-type in the same column of the table. Notice that this shuffling cannot produce a different record, according to our criterion of identity of records. That, very quickly, covers Mr Date’s third important property: columns are unordered. The second dimension is that of the rows, which are simply the records. Notice that the columns of the table are, we might say, indexed on column names, which are simply (names of) our field-types. I’m using “indexed on” here to mean simply: we can use a column name to pick out a column. But what are the rows indexed on? What can we use to pick out a row? As we know, we cannot have any extra index: we could not name or number the rows (because the rows are records; remember the criteria of “record”). The only things we can use to index the rows are: the rows themselves, the collection of fields in each. But that is all right, because each row comprises a distinct collection of fields: if Rl comprised the same collection of fields as R2, Rl would be the same row (record) as R2. Sometimes we find that (or we actually ensure that) we could index (uniquely identify) the rows of a table without taking all columns (all field-types) in the rows into account. For example, we might have a table of EMPLOYEE records, and we might ensure that each record had a uniquely valued EMPLOYEE NUMBER field. Then this single field (rather than the whole collection of fields) could be used to index the rows in the table. Such a field is termed a key, as are, collectively, any number of fields that serve for unique row indexing. As we have seen, we always have a key, even if the only key we have comprises all the columns of the table. And now we can easily state the criterion of identity of “table”:
Well, we have not quite reached the concept of “relation” yet. To get there we need one more step, a refinement of the criterion of application. But this step is one that I rather think we should not take, and one of the points where Dr Codd went wrong. So before we take that step, we will take another, which Dr Codd also took, and which he was probably right and certainly inspired to have taken. |
|
|
|
||
|
|
Polynormal Records |
|
|
|
Think about this record:
We could even put it like this:
|
|
|
|
||
|
|
Normal Records |
|
|
|
Dr Codd thought this was unwise, however it was written. “Use two records,” he would have said:
And Dr Codd was right, I reckon but why? As we’ve seen, by using tables we deal in two dimensions, columns and rows. That is, we deal in at least two dimensions. But if we allow a record to contain different values in the same field-type (that is, different fields of the same type) we find ourselves dealing in more than two dimensions, we might say three in this simple example (well, two and a little widgy bit, I’ld say), but there is no limit to the amount of free rein we could give ourselves. [14] And the simplest explanation of Dr Codd’s proposal is that the fewer dimensions we have, the better: it makes the system easier for the user to understand; it makes it easier to get the system to perform well. Even looking at the short distance we have now travelled, we have found problems in tabulating a record with two fields of the same type. And of course, cynical but true, we will make mistakes in implementing, and the user in using, each dimension. So the fewer dimensions we have ... but we’ve been here before[15]. And yet, and yet! Isn’t it annoying that our criterion of identity of records, records without implicit connections, allows just slightly more than Dr Codd rightly proposed. I hate that. I lie awake nights worrying about it. Mr Date expresses the restriction as: all attribute (field) values are atomic. But he sells the pass not quite, he offers it for sale by allowing what he calls “composite attributes” (the unfortunate term “attribute” means, roughly, “column”). What he has in mind are fields like:
This then is a normal, or “first normal form” (1NF) record: a record that contains one and only one value in each of its fields. Another way to put this is as follows:
It should be stressed again that we never get anything less than a field: we never deal with questions without answers, or answers without questions (otherwise we might say that “A:1, B:” was a proper part of “A:1, B:2”). Indeed, having got to this point we could anticipate a little more of the formalisation coming up in our next lecture by noting that, although we started with fields in describing what records are, we could go the other way and define “field” on “record”, if R is a record:
We have now covered all four of MrDate’s important properties of relations, and we have seen that tables as far as we have got are two-dimensional collections of records, and these records ought to be normal. Usually, though not always, “relation” is taken to imply normality[17]. Dr Codd spoke of normal relations, in contrast to our use here of “normal record[18]”, but a normal relation is just a relation containing normal records. |
|
|
|
||
|
|
Tables of Uniform Format |
|
|
|
But, as I mentioned, there is another property of relations that Dr Codd proposed, and we ought to consider. Mr Date does not discuss this property (though it does follow from his definition of “relation”[19]). For that matter, Dr Codd did not discuss it either: he rather took it for granted. What do we mean by saying that our tables are two-dimensional? We mean that we can pick out any field in a table by giving its row and column indices (saying which row and column it is in). And by saying our tables are normal, we mean that we can pick out any field uniquely. Dr Codd proposed a bit more than this: that by giving any row and column in a table we can now say, “relation” we should pick out some field. And this means that every record (row) in a relation is the same record-type. Or, to put it in yet another way: every row in a relation intersects every column in it. Look we could put these two records into a table:
|
|
|
|
||
|
|
Are Tables Essential? |
|
|
|
There is however a better way to put the difference between Dr Codd and me on this subject of “uniform format”. We have said: should all tables be relations, that is, contain only one record type? But the better way is this, to use Mr Date’s term again: should tables (whether relations or not) be essential? Or treated as if they were essential? I think not. Remember that devious vandal. Suppose we have a Relational data base, or let us say a Tabular data base: one with records as we have defined them (so no implicit connections between fields), the records comprising fields, and the records kept in tables (relations or not). And now let the vandal destroy the tables, but leave the fields and the records intact. Have we lost any information? If not, then tables are not essential, and should not be treated as if they were. |
|
|
|
||
|
What is essential is that you start arriving on time for these lectures.... Oh, sorry - it’s five past to five to, is it?... Right: see you for some serious work at the crack of five past nine, tomorrow. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||