|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl22 |
|
PLATOCLAST Lecture XXII |
||
|
|
||
|
|
I remember once, in my youth, designing a data base some of whose records were keyed on a six-letter code. The codes, in one case, were abbreviated names of physicians. The other data in the records contained the full name, address, and so on. I thought it was quite a good design until a user asked to change a code. Well, they couldn’t. They could change anything else, but not a code. They could insert a new record with their new code and all the other data from the old record, but if they then deleted the old record they would have lots of records in their data base that effectively pointed to the deleted record, because they contained the old code (as a foreign key). Let’s put the problem, and therefore the solution, very crisply: I was trying to use the code as a key, or identifier, in both senses of “identification. I was trying to use it to show that this record pertained to the same physician as that; and to pick out a particular physician. And the user wanted to retain the code for the first sort of identification (sameness), but abandon it for the second (picking out). And as I’ve got older, I’ve seen the same problem time and time again: Car Registration Numbers that get transferred from car to car, Personnel Numbers that have to have an extra digit added. And the only answer is: don’t do it. Never, never, never use the same field for both purposes. |
|
|
|
||
|
|
Cryptic Identifiers |
|
|
|
Well, that’s easily said. But suppose you try it. You identify invoices (in the sense: sameness) by a twelve digit number, that you generate automatically. And you show that number to your users at the screen, and perhaps even print it on the paper invoice. You’re in trouble. Someone, somewhere, will start to use your identifier to pick out an invoice, to say which one they’re talking about. And then, when you have to correct an invoice, and you send the corrected version, you either put the same identifier on it and the users complain, or you put a different identifier on it and the users complain. And in either case you probably wreck your internal controls. But there is an answer: cryptic identifiers. Just never show anyone what your system uses to indicate sameness. Let them ask, by all means, that this Order record should be joined to that Customer record on Customer Id. Let them demand that they pertain to the same Customer. But don’t let them know how you know. When you think about it, it’s just data encapsulation, isn’t it? You’re making Customer Id a token (and we never show the users tokens). If Customer Id is not “represented by” a token, but actually is a token, what data type is represented by such a token? And, of course, the answer is: a Customer. So what should the column name be, if it contains Customer Ids, tokens that represent Customers, whose column criterion is: is the same Customer? No, it shouldn’t be “CUSTOMER ID”. It should be “CUSTOMER”. |
|
|
|
||
|
|
Ordered Records |
|
|
|
Another little problem: how can we keep data of very volatile ordering in a relational data base? Suppose we want to keep a document as lines of text. Obviously a one column relation, with column TEXTLINE, isn’t going to be much use. The whole point is that there is a first line of text, a second line of text, and so on. Now most relational bigots pass this off: have two columns, LINENO and TEXTLINE, they say. But a key question what is the data type of LINENO to be? Suppose we choose INTEGER, and we number our rows 1, 2, 3, and so on. We’ll have to update, on average, half our records to insert a new record. And so we get schemes like numbering the rows 10, 20, 30, and so on, and then it takes a mass insert a hundred rows say to really mess up the system. Now there are actually two problems wrapped up together here. The first is to find a data type that is densely ordered, so that between any two of its values we can always find another. Rational numbers will do, or Dewey decimals. That’s the easy bit: let’s just call them Line Numbers. The second problem is that the user really doesn’t want to be troubled by them. Actually, a user would not update such a table directly through the DML, but through a word processor. But even the word processor wouldn’t want to be troubled by line numbers (I mean, its programmer wouldn’t). Or no more than saying: I just read the line with this number (you passed me the number and the text); now delete it, or now replace it with this text, or now insert these text lines after it. You can see that what we want is a cryptic line number, with its own data type handler. That’s how we handle the cases that arise when we really do have sequenced data, with no obvious sequence column. At times Dr Codd has been a bit unguarded in the way he’s described connections between records. Yes: such connections must be in data, in the sense that they are achieved by comparing values in columns. But the compared values don’t have to be overt, though they do have to be explicit: they don’t have to be visible to the user, but the user must be able to “talk about them”, that is, to compare them for equality or ordering. |
|
|
|
||
|
|
Physical Views |
|
|
|
Now to another problem: I found this one when I was idly flipping through some of the papers in the literature. Not something I do very often; too depressing. This problem is one of the problems of time; it’s about a diary, or schedule, relation.
Notice the “exclusively engaged” in the interpretation. It means that for each PERSON there may not be two records one of whose FROMTIME-TOTIME periods overlaps another. (Notice that, for convenience, the TOTIME of one record may be the FROMTIME of another, because the “between” is inclusive with respect to FROMTIME, but exclusive with respect to TOTIME.) This is the problem. Suppose we have one of these instances of DIARY (assume all times on the same day):
The author of the article, who shall be anonymous for shame, was all for extending relational algebra, or calculus, or theory for all I know, with special time operators, etc. etc.[1] And I thought to myself: this won’t do. So this is my answer: DIARY looks to me like a view. It’s a view of:
We have the DIARY record:
Aha, you say. But SLOT is a huge table, with in a certain sense redundant data. Data, at least, that could be much more economically expressed. Yes indeed. As a matter of fact a nice economical way to express SLOT is as shown in DIARY. DIARY actually contains all the information that SLOT contains. I think DIARY might well contain very good store level records. And SLOT might well contain very good form or concept level records. Then, physically, SLOT would be a view of DIARY, i.e. constructible from DIARY. But, logically, DIARY is a view of SLOT. And typically a user would want to look at DIARY but update SLOT (most likely through a diary handler interface). And now you can see at least one practical reason why I wanted to shake free from the absurd identification of stored records and kept records, physically and logically basic records. Imagine trying to treat SLOT as a logical view of DIARY; contemplate the difficulty of writing a query (a view definition) to derive SLOT from DIARY; consider the problem of showing that DIARY was up datable through that view. Ghastly problems! Of course, there is a way of creating SLOT from DIARY. Perhaps there is even a way using a relational DML. But it doesn’t matter even if there isn’t, because logically we are under no obligation to define, using the DML, a derivation of basic records from the views built on them. The obligation is the other way round, to define the views on the basic records, DIARY on SLOT, and that’s easy enough. But suppose the user wants to look at SLOT? I have two answers: firstly, we could just say no. The user can update SLOT but not look at it. (We don’t have to derive SLOT in order to update it: physically, we can update DIARY directly.) Secondly, if you won’t stand for that, we can as I said easily create SLOT from DIARY. Using the DML? Not necessarily, probably using a quick and slick program. Cheating? Not at all: we can implement logically basic data any way we like, as long as it looks to the user like updatable normal records. Just for once I want to stress a paradox. DIARY really is a view; it has a non-trivial view definition. Yet, apart from DIARY itself, there is no data actually stored in the data base from which it could be derived. SLOT by contrast is not a view; it has no definition. It is not stored in the data base; it is derived from DIARY (but not using the DML, and the derivation is nowhere openly defined). And DIARY really is a view of SLOT itself. You might ask, of course, what time period I was planning to cover with SLOT records. Well, lots of them have “free” in TASK, and their physical storage is fairly economical. So I was thinking in terms of from now until the crack of doom. I remind you again that just because a table is big doesn’t mean we can’t keep it in our data base. |
|
|
|
||
|
|
Limits of First Order Logic |
|
|
|
So far we’ve had a look at some fairly hard data base problems. But now I want to turn to something a little harder, impossible really. Ive spent quite some time in these lectures knocking set theory , complaining amongst other things that it’s too complex and powerful. Yet I’ve also given translations out of set-speak. So you may be wondering whether there is anything we can say in set-speak that we can’t manage to say one way or another in normal speak; by which I mean the FOPC with a decent modicum of extralogical predicates. And there are indeed things we can’t say. And one thing we can’t say is, in a word, “and so on”. Let’s start with numbers. How in the FOPC can we say, “There are no Fs”? Here it is, reading “=F” as “is the same F as”.
|
|
|
|
||
|
|
The Problem of the Ancestral |
|
|
|
Now here’s another little gem. Suppose we have the predicate parent” as primitive, that is, so-and-so is parent of such-and-such. Intuitively, we would think we could define “ancestor”. After all, x is ancestor of y when:
Some people, mathematicians who are without conscience, allow what they call a “recursive definition”:
But grant me set theory and see what I can say:
You might be tempted to think that we could tackle this problem backwards, by trying to define “parent”, taking “ancestor” as primitive. And that works in certain special cases, when this holds true:
|
|
|
|
||
|
|
Bills of Material |
|
|
|
Well, so what? Consider the predicate, “is an immediate component or. Its ancestral is “is a component or. Consider the relation:
That is the heart of a common problem of manufacturers and quantity surveyors: the production of a Bill of Materials. It is so difficult in a DML based on the FOPC, that we might well regard it as impossible. Here’s a solution; in my next lecture I’ll give you two more. You’ll have noticed in Frege’s definition the clause, “NOT x=y”, which is there to stop anyone being reckoned an ancestor of themself.[3] But we could relax that, and in fact we say that an improper ancestor of x is either x or one of x’s ancestors. If you start with yourself and draw your family tree, going upwards to your parents, your grandparents, and so on, then everyone in the tree is one of your improper ancestors. Now consider this table of kept records, that is, of logically basic records:
The first, PP, records the “parent” relationship:
You will have guessed, of course, that we don’t store IPP: it’s full of redundant data. We store the view, PP. I should remark that the derivation of IPP from PP is rather interesting. Naturally IPP is not definable on PP using our DML, but it is what we might call “incrementally derivable”. This means that if PP is updated, the required update of IPP is definable on the update of PP. And this means that we could start with PP and IPP both empty, and we could insert one row at a time into PP, and each insert into PP would require one or more inserts into IPP. So any instance of PP can be constructed, and the appropriate instance of IPP constructed. And the only reason we can’t define IPP on PP using our DML is that we can’t, in general, say in advance how many inserts into PP we might do. We could define IPP for a PP with one row, or a PP with two rows, or a PP with three rows, but not, of course: and so on. However, it is perfectly simple to build IPP, and indeed AP, from PP. We can write a program to do it, but we can’t define the function of that program as an expression of the FOPC, or of our DML. But that doesn’t matter: we are under an obligation to define each view on kept records by means of our DML and its encapsulated functions. But we are under no such obligation to define any table of kept records on a view of that, or any other, table: that would be absurd. Kept record types, after all, are the primitive, formally undefined, predicates of the theory that is our data base. We have to be able to keep them, somehow, even if it takes a program to do it. We certainly don’t have to define them. |
|
|
|
||
|
|
What is Logically Basic? |
|
|
|
And yet, and yet: I think that I haven’t made enough distinctions. You see, it’s true that the notion of “base record” was ambiguous between physically base, or stored; and logically base, or kept. But “logically base record” is also ambiguous. It could mean: record of a type that is a formally undefined predicate. Or it could mean: record that is a postulate of the theory that is our data base. |
|
|
|
||
|
Can you see any reason why each postulate of a theory should be formed from a primitive rather than from a defined predicate of that theory? If you can: please tell me, because I can’t. I’ll worry about it anyway. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||