(U03)

www.btinternet.com/~adrian.larner/database/pcl12

PLATOCLAST
ON DATA

Lecture XII
Integrity, Domains, Defaults

 

 

From now on, I shall use the term “Null” to mean an Improper Value, a value of which its column criterion of identity does not hold true. I shall ignore the useless theory of Semantic Nulls.

I have already said, remember, that part of the work of a data analyst is to specify the criterion of identity of each column. The original intention of this was to clarify the meaning of the column. Then we saw that this meaning, this criterion, was an important part of the Existential interpretation of records. And now we see that it also serves to distinguish the proper values from the Nulls.

We briefly considered the possibility of using completed identities, instead of ordinary column criteria, in EI. And now we can see that when we specify the interpretation of a column in a type of record (in giving the interpretation of that record), if we use the completed identity, “º” we admit Nulls in the column, and if we use the column criterion, “=”, we don’t admit Nulls.

But remember that whether a value is null or not is, firstly, relative to its column, and secondly, determined by the data analyst, who chooses the column criterion of identity. So you can, if you please, follow Father Warden’s advice – Eschew Nulls![1] You just have to specify the appropriate column criteria. Thus what was defined as “Religion” (criterion: “has the same faith as”) can be redefined as “Information pertinent to Religion” (criterion: “has the same information about religion recorded as”). So what was a null value, “Withheld” say, becomes non-null. If my religion is withheld, and so is yours, it does not follow that we have the same faith; but we do have the same religion information recorded.[2]

But you may find that such redefinition gives your users a few problems, because they might join records on a column and make the natural (but wrong) assumption about the meaning of equality. Users might assume that equality on a BIRTHDATE column meant “has the same date of birth as”, that is, they might use the criterion you would have used if the good Father hadn’t persuaded you to eschew nulls.

 

 

Entity Integrity

 

At last, we are able to understand, to some extent, Dr Codd’s first integrity rule, the rule of so-called Entity Integrity: No component of the primary key of a base relation is allowed to accept nulls.[3]

Let’s ignore the “component” bit, and think only of simple – single column – keys. “Base” means (it’s ill-defined) something like “kept”. So we can translate:

No primary key of a kept record is to admit a value of which its criterion of identity does not hold true.
And, though we have not really made sense of “primary key” yet, this seems a very strange proposal. Suppose we did design a data base in which some primary key admitted nulls, and Dr Codd complained. We could meet his objection by doing no more than redefining the criterion of identity of the primary key column, leaving everything else as it was.

But what would have changed? Where previously we had said, for instance, “KºJ” (when referring to the primary key column) we could (but need not) now say “K=J” (we need not because “ = “ and “º” are the same identity if no value is improper). Likewise, if we call our previously null value “N”, then where we had said “K=J” we would now have to say “K=J AND NOT K=N”. So we have some fiddling with the notation of restriction conditions, and we had better tell our users the bad news, but what else has changed? Nothing. Unless, of course, we are using a DML based on natural join: one that doesn’t give us explicit control over the equality conditions of joins, but makes up its own tiny robotic mind instead.

Obviously this entity integrity rule is founded in confused or mistaken theories: of Semantic Nulls? of Absolute Identity? of the Entity interpretation? Probably some combination of these, along with the use of Natural Joins on these Primary Keys, and consequent confusion about the appropriate equality for such joins, “=” “ or “º”.

I think we’ll just abandon this integrity rule. Then we won’t have to worry, at least for the moment, about what “primary key” means, what “base relation” means, or why “base” records should have different constraints from any other records.

 

 

Referential Integrity

 

While we’re at it, we’ll have a look at Dr Codd’s other integrity rule. Here it is – Referential Integrity – in Mr Date’s succinct formulation: The database must not contain any unmatched foriegn key values.FONT COLOR="#000000">[4]

A “foreign key” is (I simplify) a column whose value in each record must be either null or equal to the value in the primary key of some type of base record. Mr Date assumes here (I think) that nulls are not values, so the rule really amounts to:

The database must not contain any unmatched non-null foriegn key values.
And all I want to say is: consider a column each of whose non-null values either is or is not equal to the value in the primary key of a base record. Such a column either is a foreign key, in the sense that each non-null value is (we ensure) equal to such a value. Or it isn’t. If it is, then it does satisfy the rule; and if it isn’t, it still does (trivially).

So merely being a foreign key ensures that the integrity rule is kept; and not being a foreign key also ensures that the rule is kept. The rule is trivial, and we can ignore it, because it amounts to: each foreign key is a foreign key. I’ll believe that.

Again, we’ll ignore all the stuff about primary keys, distinguishing nulls from non-nulls, and so on. That’s all tied up with the same confusions that beset the first integrity rule.

So both Dr Codd’s integrity rules bite the dust, which is quite right as they were founded on at least three gross confusions: about absolute identity, about the Entity interpretation, and about the concept of “Null”. It seems to me that the whole business of integrity-which, luckily, hasn’t got very far-will have to be rebuilt from the bottom up. So I don’t regret razing the current structure to the ground.

 

 

Domains and Nulls

 

Let’s now turn to another area that Mr Date, at least, regards as a question of integrity. The rule in question is: that each value in a column be drawn from the domain of the column. Naturally, we would want to distinguish those values that are allowed in a column (those that pass the edit check) from those of which the column criterion of identity holds true, that is, between allowed values and proper values. Of course, all proper values are allowed, but not all allowed values are proper.

Now, because neither Dr Codd nor Mr Date have exploited the concept of column criterion of identity, they are hard put to make this distinction. So sometimes they want “domain” to mean “set of allowed values”, and sometimes “set of proper values”. Thus they might assume that a BIRTHDATE column had a domain of dates (proper values), but then they also want the null value, “Unknown”. Of course, they could – and should – extend the domain to include “Unknown”. But then the domain wouldn’t be a domain of dates.

So you can see why Dr Codd has been blown hither and thither, and has ended up saying: “Unknown” isn’t a value. Whereas we would say: of course “Unknown” is a value (and, perhaps, allowed in BIRTHDATE) but “Unknown” is not a date (so not proper to BIRTHDATE). But you see his dilemma: if he says that “Unknown” is a value of BIRTHDATE and he says that BIRTHDATE has a domain of dates, then he ends up saying that “Unknown” is a date. And if he denies that “Unknown” is a date, then he can’t admit it as a value of BIRTHDATE. So he says: it is allowed in BIRTHDATE, but is not a value; it is a Mark.

And then he starts to elaborate a weird theory of Marks, requiring multiple truth values, with an unsound logic, and the whole enterprise is obviously quite unsupportable. And, indeed, Mr Date doesn’t support it, because he can see that it must be wrong. But he can’t see what’s wrong with it.

 

 

Default Values

 

However, Mr Date has his own theory. And a very exciting theory it is too: it’s all about default values. They work like this: when a user inserts a row in a data base, but doesn’t specify the values to go in all its columns, any unspecified columns take on a default value. The default value they take is specified as part of the definition of the column. Not every column has to have a default, but if it doesn’t then the user must provide a value when inserting the row.[5]

Well, there’s a lot more to it than that. And, as I said, it’s a jolly good idea. Actually, we’ve seen it already in a very limited case: when you fill in a paper form, and “leave out an answer”, you get the default “blank”. But – even in this case – it is obvious that the concepts of default value and null value are orthogonal. In some columns a blank default is a proper value, e.g. Tick if over 70 years of age: two persons with “blank” are of the same age-range. In other columns a blank default is a null, e.g. SECOND GIVEN NAME: two persons with a blank default do not have the same second name.

And likewise some nulls are defaults and some aren’t; “Home” might not be the desired default work location, and 9 is probably not the default household size. Even more complex: “Baby” is the null default for the patient name only when the patient in question is a newborn; otherwise perhaps there is no default, though there might be a null (blank for instance: it’s only on paper forms that “blank” is, so to speak, the default default).

Indeed, consider a table of pupil records kept by a piano teacher who teaches for the grade examinations. A pupil record might contain a column for the grade currently in preparation. The desired default, on admission of a new pupil, is Grade I. Not only is this not a null value, it might even be the primary key of another table.

Suppose we have a column, C, with default value D. Do we want to say that C=D when C contains D? In that case, D must be proper to C, so not a null at all. And that’s fine. It’s an excellent idea. But it doesn’t come within a mile of tackling the problem of nulls.

But Mr Date suspects that his approach is not the final word (and he’s certain that Dr Codd’s isn’t); he says: I believe the whole area of missing information (that’s null values to you and me) stands in need of considerable further research.[6] Well, he’s nothing if not modest (which is more than you can say of some data theorists; I name no names[7]).

 

 

Operations

 

Well, what more do we need to say to finish off this problem of nulls? Of course, we have language issues, very similar to those we found with absent values. In restriction conditions, as we have seen, we can use “]” and “[” for absent values, but, say, “@” and “@” (and specifically “=” and “º”) to handle null values. Of course, we have to settle problems about where nulls come, or absent values are deemed to come, in orderings. Expression evaluation gives similar problems (possibly with different solutions) when, say, we evaluate “C+5” and C is absent or C is null (not the same number as itself).[8]

I do want to introduce a couple of general purpose operators. I want them to convert between absent values and null values. The first of these I will call, I hope for obvious reasons, “THE”. Let C comprise zero or more values. If C contains only one non-null value, then THE(C) contains just that value. Otherwise THE(C) contains the null value. So “THE” converts from an absent value (or multiple values) to the null value.

My second operator I call “PROP” (for “proper”). Let C comprise zero or more values. PROP(C) comprises all the non-null values of C (so if C contains no null value then PROP(C) is just C). “PROP” converts from a null value to an absent value.

 

 

Modalities

 

As you may have noticed, I haven’t described enough machinery to distinguish more that one null value in a column. Deliberately so. Of course, we could have more than one null value: we could have “None”, “Other”, “Unknown”, “Unknown and possibly None”, “Withheld”, and “Withheld and possibly None” under RELIGION. But if we choose this option we shall need an extra primitive operator for the column. We know that we need an “is of the same faith as” operator, say “=”. And, taking this as primitive, we can define its completion, “º”. But that distinguishes only one null value, so that Other º Withheld, for instance.

My recommendation is to allow only one null value per column. This keeps things simple, in theory and for the user. Consider the advice we would give on filling in the column: enter the person’s religion code; if none, leave blank.

How then would I handle these multiple values? Think back to what we said about missing information: recording that it is missing, and why it is missing. Using a single null – blank say – allows us to show that the information (the proper information, the religion) is missing. To say why it is missing – because the person has no religion, can’t say, won’t say, whatever – we could use another column, a modality column. And we really ought to use another column.

We know what trouble it causes to use one column for two different things (if you don’t know, you should). For instance, a SUPPLIER column in a Part table: enter Supplier Number, or – if no supplier because made inhouse – identifier of line on which constructed. Oh dear me, no. And likewise I would deprecate: enter Religion, or – if no religion specifiable – enter explanation why not.

In any event, as we saw when we briefly considered modalities – meaning, roughly, comments on the values in fields – we might wish to make a comment (which might be “OK” or not) whether or not the field contained a proper value. Even quite proper values can be “Estimated”, “Suspected False”, “Learned by Hearsay”, and so on. You’ll find if you take out a life insurance policy that they will record your date of birth to calculate the premium, but they won’t pay anything out until they’ve actually seen your birth certificate. So they must keep the modality of BIRTHDATE, merely provided or provided and confirmed, as well as its value. And, as I said, the null value may be just fine (not missing at all), so have an “OK” modality: no second given name, for example.

Indeed, I suspect that confusion between the orthogonal concepts of proper/improper values and OK/non-OK modalities has made the problem of nulls even harder to solve than it should have been. Dr Codd’s third and fourth “truth values” (and surely more to come) are ill-aimed attempts to capture different modalities. But as soon as we allocate (where required) a modality column, these problems are solved in a simple, and now boringly familiar, way (the way: as values in columns). The requisite values – certainly not extra truth values – are perfectly ordinary, proper values in a modality column. Rather wickedly, I would be tempted to make “OK” the improper value in modality columns; and I wouldn’t reckon that we ever needed modalities of modalities.

I must admit that modalities could get interesting. For instance, if the value of C is proper, but conjectural, and that of D is proper, but suspected false, then the value of C+D is proper, but what? conjectural-and-suspect? I suppose so. And this seems to require that the permitted values in a modality column be closed over conjunction (“-and-“), in the sense that a conjunction of modalities is, so to speak, no less doubtful than either of its conjuncts. On the other hand, imagine doing a join (using “º”) on some columns. How might the modality of a joined column be derived from its original two modalities, say again “conjectural” and “suspect”? Well, the join might just make the modality less doubtful (or strictly, no more doubtful): conjectural-or-suspect, say. So it looks as if modalities have to be closed over alternation (“-or-”) as well. And that creates a very interesting lattice of modalities. A lattice is a graph of a partial ordering, I should add for the non-mathematicians. But I think I’ll assign all that to the limbo of data type processing, along with operations on absent and improper values.

But I will defend myself against this charge: that introducing modality columns makes things complex. The question is whether the user does or does not wish to process, and in what ways they wish to process, these values – Unknown, Withheld, Conjectural, and so forth. If they don’t want to keep or process them, then that’s fine: they don’t need modality columns. If they do want to keep them, and perhaps process them, then they have to go somewhere, and be processed somehow. I suspect they are better in a column where they do belong (where they are proper) than in one where they aren’t.

 

 

Relatively Proper Values

 

I want to raise one objection to my own theory. I have said that we check for nulls by asking: do you (the user) say that this is the same such-and-such as that? Suppose the user replies: well, sometimes I do, and sometimes I don’t. Think back to the Customer table with INCOME High, Medium, Low, or Unknown; remember this value was used to join with the Treatment records. Suppose the user – a bit of a sophisticated user – says: “No, we don’t always say that two customers with the same INCOME value, Unknown, have the same income (i.e. income range). In effect we say that for the purposes of Treatment they get the same credit limit and the same sort of dunning letter. But if we were pairing customers with the same income, perhaps for test-selling, sending mailshots to one in a pair but not to the other, then we wouldn’t want two of unknown income to be paired.”

The solution is, I think, obvious in this case. We make “Unknown” a null: the user doesn’t always reckon it proper. When we join with Treatment records we use “º” to pick up the Unknowns. When we join for pairing, we use “=” to exclude the Unknowns.

Notice that the restriction to one null value puts limits on this approach. But the more important question is: does this show that Dr Codd was right after all about that first integrity rule? No it doesn’t. Suppose we join to a record on a key column that may not – as Dr Codd would insist – contain a null. Then it doesn’t matter whether we use “=” or “º”. And if we join on a key column that may contain a null, then we will probably use “º” anyway, and that will be fine. But see what I mean about what’s natural when we do a natural join?

As it happens, we often want to join on an identity that differs from the criterial identity (“=”) of a column. For example, we might have non-negative integer columns C and D that we wish to join not on “C=D” but on “C=D and “NOT C=0” (and, of course, NOT D = 0). We could call that “positive identity”: C is the same positive integer as D. But that doesn’t mean we have to make zero a null value, though we could say that zero is improper with respect to this particular join condition. So what? Every odd number would be improper with respect to the join condition “C is the same even number as D”. Actually I rather like this formulation. Let’s say that a null is a value that is improper with respect to its column’s criterial identity.

This is why we can, in theory, avoid nulls altogether: we can always put the exclusions into the join condition. The advantages of picking out one special identity – the criterial identity – are that it captures the intended meaning of the column, and therefore that it will probably be the most frequently required identity. And in many cases, it will be the easiest identity for users to understand. That’s why, in a non-negative integer column (a Count column), the column criterion should be “is the same non-negative integer as”: users understand it. Make the criterion exclude zero, or make it include (say) “undefined”, and we merely confuse. After all, our users come to our systems knowing what counting numbers are: they do include zero, and they don’t include “undefined”. Nevertheless, our users will sometimes want to exclude zero; and will sometimes be obliged to accept “undefined” (on a zero-divide, say).

Now please notice that everything I’ve said about nulls, in contrast to absent values, comes under the statement: I’m talking about records at the concept level. There are no nulls at the form level.

I really think that that’s all I want to say about nulls. Interpretation? Well, I’ve done it all already, haven’t I? It’s just a question of whether we admit nulls, by using “º” in the EI column equivalences, or use “=” and don’t admit nulls.

 

 

Padding

 

I now want to introduce a new concept, applicable only at the form level, and only when we perform a rather strange operation on forms. Suppose that we have a table containing records of different formats. We can, quite easily, create a relation from it – at the form level – by padding each row with all the columns absent from it, and giving each of them a special value, which I will call a pad.

If “T” names each of the records in a table, then PAD(T) will be a title of each of the records in the relation formed, as described, by padding each of the T rows. This gives us the new pad value at the forms level (not the same as blank). Notice that “PAD” works rather like “THE” on absent values, but instead of a null it delivers a pad. Interpretation? None: a padded form has precisely the same meaning as it had unpadded.

I know what you’re asking yourselves: What’s he up to now? And, fairly obviously, I’m continuing to prepare my defence of non-uniform tables. The first thing I did to that end, you’ll recall, was to show that the rectangularity of tables was not a question of structure but of language. What I’ve done by allowing padding is to avoid any criticism pertaining to the form level: if you don’t like non-uniform tables at that level, you can just pad them and be done with it. This means that the question of rectangularity pertains purely to the language we use about records at the concept level.

Please note that just as there are no nulls at the the form level, there are no pads at the concept level. But for different reasons: the values that are null at the concept level are still present, but not null, at the form level; the pads at the form level aren’t there at all at the concept level.

Now you might ask: do believers in relations, unlike the non-rectangularist heretic Platoclast, ever use anything like pads? If they do, you can be sure I’ll be down on them like a ton of bricks. Because padding is, let’s admit it, a way of disguising a non-uniform table as a relation. And if they pad them to disguise them then they must have them.

I will now briefly turn to another subject: homework assignments. Red faces, I see. Here is a list of all the students in this class with the assignments they have handed in on time:

STUDENT   ASSIGNMENT
 
Douglas
Judy      Week 1
Judy      Week 2
Judy      Week 3
...       ...
Think about that Douglas. I reckon there’s a lesson to be learned.

 

 

Afterthoughts on Nulls

 

 

( Following this lecture, Professor Platoclast was pressed on the question whether he really had said all he wanted to about nulls.)

Yes, I know what you mean: the solution seems too neat for the problem, too final, too slick somehow. I suspect that what I’ve done is to solve the problem to what we might call a “design point”. It’s satisfactory, but we might have gone further. But I guess if we went any further it would get too complex to handle.

Here’s one approach. Dr Codd has obviously latched on to something with his two sorts of marks. Usually – but not always – when we have a null, then either there is a right value, but it’s not to hand, or there really isn’t a value. So we have the unknowns, withhelds, and so on; and we have the nones and the not applicables.

Let’s think about the first sort. One thing we might contemplate is having common names for values, common with respect to the column criterion. I mean, suppose the proper values of a column were 1, 2, 3, and 4. We normally expect a value to be a proper name, say “One”. It’s proper in this sense, using “is the same four-element” as the column criterion:

IF x is called “One” and y is called “One” THEN x is the same four-element as y.
But we might give the following common names: “Odd” to 1 and 3; “Even” to 2 and 4; “Little” to 1 and 2; “Big” to 3 and 4; “Peripheral” to 1 and 4; “Central” to 2 and 3; “Littlish” to 1, 2, and 3; “Biggish” to 2, 3, and 4; “Untwoish” to 1, 3, and 4; “Unthreeish” to 1, 2, and 4; and, of course, “Four-Element” to 1,2, 3, and 4. Then we could use one of these names for any “unknown”: “Four-Element” when totally unknown, or one of the others if we had some information – perhaps we might know it was 2 or 3, so we’ld call it “Central”.

This sounds tempting, but I don’t think it gives us a design point. Suppose we had some particular fields: suppose we knew that F1 was Little, and F2 was 2 + FI , then F2 would be Big. And likewise we might know that F3 was Little and F4 was 2 + F3, and therefore Big. But just remembering that F1 and F3 were Little and F2 and F4 Big wouldn’t be enough. Because when it came to calculating F2-F1 we might know the answer was 2. But F4-F1 would have to give the answer “Littlish”. So we couldn’t have general rules like Big-Little = Littlish.

In fact we would want to call F1’s value, say, “Little1” and F3’s value “Little3”, and these are more like “unknowns”, and the whole thing gets very messy. We can’t really keep the number of values down, even to 1 less than 2n, which is the number of names we’ld need for n proper names.

And then things get even more complex when we admit the “none”: do we have to give a pseudo-name, an “empty name”, to that? Then we have to handle such an empty name in the FOPC (not easily done). And we could get nulls like “either none or Peripheral”, or – a new extreme of ignorance – “either none or Four-Element”. That’s very like an absent field. I mean, what can you say about a person’s second given name if you haven’t a clue: you don’t even know whether they’ve got one. Though perhaps things would be worse if, for instance, you knew that the name (if any) wasn’t “Leopold”, but that was all you knew.

Of course, if we had a particular case where we wanted that sort of detailed information on a small domain, saying “We know it’s one of these values”, or “It’s not one of these values”, then we could design a special type of record, with one column for each possible value, and mark each column as excluded or not excluded. It would be, in effect, like making “Big”, “Biggish”, “Central”, and so on, into proper names. You’ld perhaps need one more column for “none”.

Or, of course, you could keep multiple rows, recording all the possible values instead of just one row with the actual value. But I don’t think any of this is good enough for a general solution, and things would get very complex in big domains.

You see, solving the problem of nulls doesn’t mean giving a best solution in the sense of being able to record, and never losing, any information. That would need a very powerful and complex theory. The problem is more subtle than that: we have to solve it in a simple theory, without losing too much information, and without requiring special design too often. Even a method that used “unknowns” might not handle all the subtleties of ignorance and partial knowledge. Imagine that all we knew of a person’s second given name (if any) was that if they had one and it didn’t contain a “P” then either it was “Sam” or it did contain, but didn’t begin with, a “Q”.

 

SITE HOME PAGE

My claim is that the method I’ve proposed is sound. It doesn’t ever lead from truth to falsehood. And, of course, it’s quite clean and economical: it fits within the ordinary FOPC. In other words, it’s OK as far as it goes, and it doesn’t (thank goodness) go very far.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XII (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XIII

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.