(U03)

www.btinternet.com/~adrian.larner/database/pcl22

PLATOCLAST
ON DATA

Lecture XXII
Some Data Base Problems

 

 

I remember once, in my youth, designing a data base some of whose records were keyed on a six-letter code. The codes, in one case, were abbreviated names of physicians. The other data in the records contained the full name, address, and so on.

I thought it was quite a good design until a user asked to change a code. Well, they couldn’t. They could change anything else, but not a code. They could insert a new record with their new code and all the other data from the old record, but if they then deleted the old record they would have lots of records in their data base that effectively pointed to the deleted record, because they contained the old code (as a foreign key).

Let’s put the problem, and therefore the solution, very crisply: I was trying to use the code as a key, or identifier, in both senses of “identification”. I was trying to use it to show that this record pertained to the same physician as that; and to pick out a particular physician. And the user wanted to retain the code for the first sort of identification (sameness), but abandon it for the second (picking out).

And as I’ve got older, I’ve seen the same problem time and time again: Car Registration Numbers that get transferred from car to car, Personnel Numbers that have to have an extra digit added. And the only answer is: don’t do it. Never, never, never use the same field for both purposes.

 

 

Cryptic Identifiers

 

Well, that’s easily said. But suppose you try it. You identify invoices (in the sense: sameness) by a twelve digit number, that you generate automatically. And you show that number to your users at the screen, and perhaps even print it on the paper invoice. You’re in trouble. Someone, somewhere, will start to use your identifier to pick out an invoice, to say which one they’re talking about. And then, when you have to correct an invoice, and you send the corrected version, you either put the same identifier on it and the users complain, or you put a different identifier on it and the users complain. And in either case you probably wreck your internal controls.

But there is an answer: cryptic identifiers. Just never show anyone what your system uses to indicate sameness. Let them ask, by all means, that this Order record should be joined to that Customer record on Customer Id. Let them demand that they pertain to the same Customer. But don’t let them know how you know.

When you think about it, it’s just data encapsulation, isn’t it? You’re making Customer Id a token (and we never show the users tokens). If Customer Id is not “represented by” a token, but actually is a token, what data type is represented by such a token? And, of course, the answer is: a Customer. So what should the column name be, if it contains Customer Ids, tokens that represent Customers, whose column criterion is: is the same Customer? No, it shouldn’t be “CUSTOMER ID”. It should be “CUSTOMER”.

 

 

Ordered Records

 

Another little problem: how can we keep data of very volatile ordering in a relational data base? Suppose we want to keep a document as lines of text. Obviously a one column relation, with column TEXTLINE, isn’t going to be much use. The whole point is that there is a first line of text, a second line of text, and so on.

Now most relational bigots pass this off: have two columns, LINENO and TEXTLINE, they say. But – a key question – what is the data type of LINENO to be? Suppose we choose INTEGER, and we number our rows 1, 2, 3, and so on. We’ll have to update, on average, half our records to insert a new record.

And so we get schemes like numbering the rows 10, 20, 30, and so on, and then it takes a mass insert – a hundred rows say – to really mess up the system. Now there are actually two problems wrapped up together here. The first is to find a data type that is densely ordered, so that between any two of its values we can always find another. Rational numbers will do, or Dewey decimals. That’s the easy bit: let’s just call them Line Numbers.

The second problem is that the user really doesn’t want to be troubled by them. Actually, a user would not update such a table directly through the DML, but through a word processor. But even the word processor wouldn’t want to be troubled by line numbers (I mean, its programmer wouldn’t). Or no more than saying: I just read the line with this number (you passed me the number and the text); now delete it, or now replace it with this text, or now insert these text lines after it.

You can see that what we want is a cryptic line number, with its own data type handler. That’s how we handle the cases that arise when we really do have sequenced data, with no obvious sequence column.

At times Dr Codd has been a bit unguarded in the way he’s described connections between records. Yes: such connections must be in data, in the sense that they are achieved by comparing values in columns. But the compared values don’t have to be overt, though they do have to be explicit: they don’t have to be visible to the user, but the user must be able to “talk about them”, that is, to compare them for equality or ordering.

 

 

Physical Views

 

Now to another problem: I found this one when I was idly flipping through some of the papers in the literature. Not something I do very often; too depressing. This problem is one of the problems of time; it’s about a diary, or schedule, relation.

DIARY:    PERSON  TASK    FROMTIME  TOTIME
It means:
PERSON is exclusively engaged on TASK from FROMTIME (inclusively) to TOTIME (exclusively).
The times, FROMTIME and TOTIME, are actually smallest schedulable periods, say of five minutes. An example would be:
19th August 1992, 7.00 am
It means: the five minute period 0700 to 0705 on 19th August 1992.

Notice the “exclusively engaged” in the interpretation. It means that for each PERSON there may not be two records one of whose FROMTIME-TOTIME periods overlaps another. (Notice that, for convenience, the TOTIME of one record may be the FROMTIME of another, because the “between” is inclusive with respect to FROMTIME, but exclusive with respect to TOTIME.)

This is the problem. Suppose we have one of these instances of DIARY (assume all times on the same day):

1  DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t2      10:00     12:00
 
2  DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t1      10:00     10:45
             p1      t4      11:15     12:00
 
3  DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t3      10:45     11:15
And suppose we want to allocate this period to task t4.
   DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t4      10:30     11:30
 
What we would like (according to the previous instances) are:
1  DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t2      10:00     10:30
 
             p1      t4      10:30     11:30
 
             p1      t2      11:30     12:00
 
2  DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t1      10:00     10:30
             p1      t4      10:30     12:00
 
3  DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p1      t3      10:30     11:30
The problem is: how can we treat this as a simple record insert? It seems as if we have a very complex update for a very simple task. Just try writing it in SQL.

The author of the article, who shall be anonymous for shame, was all for extending relational algebra, or calculus, or theory for all I know, with special time operators, etc. etc.[1] And I thought to myself: this won’t do. So this is my answer: DIARY looks to me like a view. It’s a view of:

   SLOT:     PERSON  TASK    SLOTTIME
It means, roughly:
PERSON is exclusively engaged on TASK during SLOTTIME.
A SLOTTIME is, of course, a smallest schedulable period (five minutes). I allow a null value, “free”, in TASK (for an unscheduled slot). You can see how to build the view (it can be done in a relational DML).[2]

We have the DIARY record:

   DIARY:    PERSON  TASK    FROMTIME  TOTIME
             p       t       fs        ts
when this holds true:
For person p and task t, for each slot from fs up to but not including ts, but not for the slot just before fs and not for the slot ts, there is in SLOT a record for that slot.
And now it’s easy to update SLOT, as required. We simply update with the chosen task, for the chosen person, every SLOT record from the 10:30 to the 11:25 record. And DIARY , being merely a view, changes accordingly. What could be easier?

Aha, you say. But SLOT is a huge table, with – in a certain sense – redundant data. Data, at least, that could be much more economically expressed. Yes indeed. As a matter of fact a nice economical way to express SLOT is as shown in DIARY. DIARY actually contains all the information that SLOT contains.

I think DIARY might well contain very good store level records. And SLOT might well contain very good form or concept level records. Then, physically, SLOT would be a view of DIARY, i.e. constructible from DIARY. But, logically, DIARY is a view of SLOT. And typically a user would want to look at DIARY but update SLOT (most likely through a diary handler interface).

And now you can see at least one practical reason why I wanted to shake free from the absurd identification of stored records and kept records, physically and logically basic records. Imagine trying to treat SLOT as a logical view of DIARY; contemplate the difficulty of writing a query (a view definition) to derive SLOT from DIARY; consider the problem of showing that DIARY was up datable through that view. Ghastly problems!

Of course, there is a way of creating SLOT from DIARY. Perhaps there is even a way using a relational DML. But it doesn’t matter even if there isn’t, because – logically – we are under no obligation to define, using the DML, a derivation of basic records from the views built on them. The obligation is the other way round, to define the views on the basic records, DIARY on SLOT, and that’s easy enough.

But suppose the user wants to look at SLOT? I have two answers: firstly, we could just say no. The user can update SLOT but not look at it. (We don’t have to derive SLOT in order to update it: physically, we can update DIARY directly.)

Secondly, if you won’t stand for that, we can – as I said – easily create SLOT from DIARY. Using the DML? Not necessarily, probably using a quick and slick program. Cheating? Not at all: we can implement logically basic data any way we like, as long as it looks to the user like updatable normal records.

Just for once I want to stress a paradox. DIARY really is a view; it has a non-trivial view definition. Yet, apart from DIARY itself, there is no data actually stored in the data base from which it could be derived. SLOT by contrast is not a view; it has no definition. It is not stored in the data base; it is derived from DIARY (but not using the DML, and the derivation is nowhere openly defined). And DIARY really is a view of SLOT itself.

You might ask, of course, what time period I was planning to cover with SLOT records. Well, lots of them have “free” in TASK, and their physical storage is fairly economical. So I was thinking in terms of from now until the crack of doom. I remind you again that just because a table is big doesn’t mean we can’t keep it in our data base.

 

 

Limits of First Order Logic

 

So far we’ve had a look at some fairly hard data base problems. But now I want to turn to something a little harder, impossible really. I’ve spent quite some time in these lectures knocking set theory , complaining – amongst other things – that it’s too complex and powerful. Yet I’ve also given translations out of set-speak. So you may be wondering whether there is anything we can say in set-speak that we can’t manage to say one way or another in normal speak; by which I mean the FOPC with a decent modicum of extralogical predicates. And there are indeed things we can’t say. And one thing we can’t say is, in a word, “and so on”.

Let’s start with numbers. How in the FOPC can we say, “There are no Fs”? Here it is, reading “=F” as “is the same F as”.

NOT FOR SOME x, x =F x
And “There is just one F”?
FOR SOME x, x =F x AND FOR EACH y IF y =F y THEN x =F y
And “There are just two Fs”?
FOR SOME x, x =F x AND FOR SOME y, y> =F y AND NOT x =F y AND FOR EACH z IF z =F z THEN x =F z OR y =F z
Well, you get the hang of it. So now allow me to abbreviate “There are just n Fs” to:
FOR n, FOR EACH w IF w =F w THEN w =F ONE OF n
And “There are just n Gs” to:
FOR n, FOR EACH w IF w =G w THEN w =G ONE OF n
And now I want to say “There are exactly as many Fs as there are Gs”:
(NOT FOR SOME w, w =F w AND
NOT FOR SOME w, w =G w) OR
(FOR 1, FOR EACH w IF w =F w THEN w =F ONE OF 1 AND
FOR 1, FOR EACH w IF w =G w THEN w =G ONE OF 1) OR
(FOR 2, FOR EACH w IF w =F w THEN w =F ONE OF 2 AND
FOR 2, FOR EACH w IF w =G w THEN w =G ONE OF 2) OR
(FOR 3, FOR EACH w IF w =F w THEN w =F ONE OF 3 AND
FOR 3, FOR EACH w IF w =G w THEN w =G ONE OF 3) OR
and so on.
Did you spot the informal part there? “And so on”. So we can’t – as far as anyone knows – say “There are just as many Fs as there are Gs” in the FOPC. Unless we have a limit. If we knew, for instance, that there were no more than 100 Fs then we would be all right (it’s a long inscription, but it does finish). But if we don’t know a limit, we’re lost.

 

 

The Problem of the Ancestral

 

Now here’s another little gem. Suppose we have the predicate “parent” as primitive, that is, so-and-so is parent of such-and-such. Intuitively, we would think we could define “ancestor”. After all, x is ancestor of y when:

x is parent of y OR
FOR SOME z, x is parent of z AND (z is parent of y OR
FOR SOME w, z is parent of w AND (w is parent of y OR
and so on ... ))
Quite. In this specific case, we can complete the definition, given a conservatively large estimate of the number of generations from the Creation to the Last Judgement. But in the general case – it’s called “the definition of the ancestral” – we can’t. The ancestral of a dyadic predicate is that predicate which bears to it the relationship that “ancestor” bears to “parent”.

Some people, mathematicians who are without conscience, allow what they call a “recursive definition”:

x is ancestor of y =df
x is parent of y OR
FOR SOME z, x is parent of z AND z is ancestor of y
You may worry that the definition of “ancestor” itself contains “ancestor”. That, the mathematicians say, is what is meant by “recursive”. That, I say, is what is meant by “cheating”.

But grant me set theory and see what I can say:

x is ancestor of y =df
NOT x=y AND
FOR EACH s, IF y is a member of s AND
FOR EACH z, FOR EACH w, IF z is a member of s
AND w is parent of z THEN w is a member of s
THEN x is a member of s
I’ll try and put that in English. Let a parent set be a set that contains all the parents of its members. Now we can say that x is ancestor of y if, not being the same as y, it is a member of every parent set that y is a member of. Dear old Frege invented that: it’s clever, isn’t it. But we won’t stand for it.

You might be tempted to think that we could tackle this problem backwards, by trying to define “parent”, taking “ancestor” as primitive. And that works in certain special cases, when this holds true:

x is parent of y IF AND ONLY IF x is ancestor of y AND
FOR EACH z, NOT (x is ancestor of z AND z is ancestor of y)
If that’s true – and it wouldn’t be true if, for instance, someone was born of parent/child incest – then “IF AND ONLY IF” could be replaced by “=df”.

 

 

Bills of Material

 

Well, so what? Consider the predicate, “is an immediate component or”. Its ancestral is “is a component or”. Consider the relation:

COMPONENTHOOD:  CONTAINING_PART  CONTAINED_PART
meaning:
CONTAINED_PART is an immediate component of CONTAINING_ PART
The problem is to build a similarly formatted relation meaning: CONTAINED_PART is a component of CONTAINING_PART .

That is the heart of a common problem of manufacturers and quantity surveyors: the production of a Bill of Materials. It is so difficult in a DML based on the FOPC, that we might well regard it as impossible. Here’s a solution; in my next lecture I’ll give you two more.

You’ll have noticed in Frege’s definition the clause, “NOT x=y”, which is there to stop anyone being reckoned an ancestor of themself.[3] But we could relax that, and in fact we say that an improper ancestor of x is either x or one of x’s ancestors. If you start with yourself and draw your family tree, going upwards to your parents, your grandparents, and so on, then everyone in the tree is one of your improper ancestors.

Now consider this table of kept records, that is, of logically basic records:

IPP:   IMPARANC  PARENT  PERSON
It means:
IMPARANC is improper ancestor of PARENT, who is parent of PERSON.
It has two interesting views:
PP:   PARENT   PERSON
 
AP:   IMPARANC  PERSON
They’re both just projections.

The first, PP, records the “parent” relationship:

PARENT is parent of PERSON
The second, AP, records the “ancestor” relationship, that is, the proper ancestor relationship:
IMPARANC is ancestor of PERSON
A proper ancestor is an improper ancestor of a parent: think about it.

You will have guessed, of course, that we don’t store IPP: it’s full of redundant data. We store the view, PP. I should remark that the derivation of IPP from PP is rather interesting. Naturally IPP is not definable on PP using our DML, but it is what we might call “incrementally derivable”. This means that if PP is updated, the required update of IPP is definable on the update of PP.

And this means that we could start with PP and IPP both empty, and we could insert one row at a time into PP, and each insert into PP would require one or more inserts into IPP. So any instance of PP can be constructed, and the appropriate instance of IPP constructed. And the only reason we can’t define IPP on PP using our DML is that we can’t, in general, say in advance how many inserts into PP we might do. We could define IPP for a PP with one row, or a PP with two rows, or a PP with three rows, but not, of course: and so on.

However, it is perfectly simple to build IPP, and indeed AP, from PP. We can write a program to do it, but we can’t define the function of that program as an expression of the FOPC, or of our DML. But that doesn’t matter: we are under an obligation to define each view on kept records by means of our DML and its encapsulated functions. But we are under no such obligation to define any table of kept records on a view of that, or any other, table: that would be absurd. Kept record types, after all, are the primitive, formally undefined, predicates of the theory that is our data base. We have to be able to keep them, somehow, even if it takes a program to do it. We certainly don’t have to define them.

 

 

What is Logically Basic?

 

And yet, and yet: I think that I haven’t made enough distinctions. You see, it’s true that the notion of “base record” was ambiguous between physically base, or stored; and logically base, or kept. But “logically base record” is also ambiguous. It could mean: record of a type that is a formally undefined predicate. Or it could mean: record that is a postulate of the theory that is our data base.

 

SITE HOME PAGE

Can you see any reason why each postulate of a theory should be formed from a primitive rather than from a defined predicate of that theory? If you can: please tell me, because I can’t. I’ll worry about it anyway.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XXII (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XXIII

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.