|
(U01) |
www.btinternet.com/adrian.larner/database/snulls |
|
Nulls A Socratic Enquiry A database paper by Adrian Larner |
||
|
|
||
Abstract |
|
|
|
|
An attempt is made to solve the major problems caused by nulls. The question, “What is a null?”, is posed, and the answer given is that: nulls are values (or, at least, there is no advantage to be gained by denying that nulls are values); it is not their meanings (e.g. unknown) that make them null; to define a column it is necessary to associate with it a criterion of identity that is the interpretation of “=” applied to its intended (“proper”) values; and a null is a value permitted in a column but not, under that criterion of identity, equal to anything. It is shown how nulls, under this definition, affect Natural Joins and primary keys (Entity Integrity). The handling of multiple nulls is discussed, as are (briefly and dismissively) default values, many valued logics, and nulls in Outer Joins. |
|
|
|
||
|
|
MOTIVATION |
|
|
|
There is no generally, nor even widely, agreed solution to the problem of nulls in relational data bases. But that may be because there is no agreement on what the problem is, except in rather vague terms. And that may be because we do not know what nulls are; we do not have an acceptable definition of “null”. These questions are now addressed, backwards, taking the approach favoured by mediaeval scholastics asking the Socratic question, as reflected in Plato’s accounts of Socrates’ disputations. What is the nature of a null? |
|
|
|
||
|
|
WHAT ARE NULLS? |
|
|
See EF Codd: Extending the Relational Database Model to Capture More Meaning, ACM TODS 4, No 4 (December 1979). Note that SQL nulls are not values. For a fair but critical discussion see CJ Date: An Introduction to Database Systems, Volume 1, Fifth Edition, Addison-Wesley (1990). Codd’s more recent and elaborated proposals are described in chapters 8 and 9 of EF Codd: The Relational Model for Database Management Version 2 (“RMV2”), Addison-Wesley (1990). |
One school of thought Codd’s holds that nulls are not values, but something else: “marks” that are applied to cells (row/column intersections) that do not contain values. Does this proposal of marked nulls get us any nearer to a solution, or even an understanding? (We come back later to the question why we might even imagine that nulls were not values.) |
|
|
|
||
Are Nulls Values? |
|
|
|
The author is indebted to
Chris Date for this observation. |
Suppose that we have a relation, some of whose columns may contain marked nulls, for example a PERSON table with a Surname column admitting “no surname” (there are persons with no surname), a Birthdate column admitting “unknown date”, and a Religion column admitting “withheld” and “none”. Now, we can construct another relation, say NEO_PERSON: for each column in PERSON (e.g. Surname), we have an equivalent column in NEO_PERSON (Neo-Surname), for each row in PERSON we have a row in NEO_PERSON, and for each (non-null) value in such a column as Surname (e.g. “Smith”), we have a value in the equivalent column (“Neo-Smith”). For each marked null (e.g. “None”) in the column, we have a value (“Null-None”). Values not drawn from character domains may be represented by character equivalents (rather than prefixing “Neo”), e.g. the number 7 and the date 07/03/1950 might appear in NEO_PERSON as the numeral “7” and the date inscription “7th March 1950”. Nulls in such columns would, again, appear as “Null-Unknown”, “Null-None”, etc. PERSON and NEO_PERSON are transparently isomorphic and inter-translatable. NEO_PERSON differs from PERSON in that:
|
|
|
|
|
|
|
|
||
The Meanings of Improper Values |
|
|
|
|
Intuitively, improper values are values like “unknown”, “none”, “not applicable”; or values somehow associated with “missing data”. Not that this gets us much further: a person that has no second given name does not (as a general rule) have anything missing, any more than does a person that has no fourth given name. We might be tempted to say that a person with a fourth given name has something superfluous. (For which smallest n an nth given name is not missing-if-absent but superfluous-if-present is not a question on which we are all likely to agree.) Let us, however, consider this hypothesis (which seems to be commonly accepted, although rarely stated):
And the hypothesis does fail. There was recently a newspaper report on a large flawless cut diamond. Its colour was described as “D-colour”, which means completely colourless; i.e. having the value “None” in the column “Colour”. Obviously we could have a DIAMOND table and a DIAMOND_COLOUR table, Colour being a foreign key in the former and the primary key of the latter. In no way would “D-colour” or “None” cause any problem as a value of that primary key. A retail outlet classifies its customers by Income Range: High, Medium, Low, and Unknown. The salespersons simply estimate the income range of each customer (they can tell a besuited executive that parks a new company car outside the shop from an impecunious student). Sometimes they are uncertain, and put “Unknown” in the CUSTOMER record. The income range is used only to determine the Credit Limit and the Dunning Letter Style. So (for third normal form) we need a TREATMENT relation, with IncomeRange as its primary key:
There are lessons to be drawn from this sort of example; and the first lesson, yet the most easily ignored, is that the hypothesis meaning is what makes a value improper fails. Granted it is intuitive; it is false. Whatever makes a null, whatever makes a value improper; it is not its meaning. We should rejoice at the fall of this hypothesis. If we want a data theory and data manipulation languages that are formal, we want ipso facto such a theory and languages that work uninterpreted. The relational algebra or SQL do not depend for their effectiveness on their inherent knowledge that in an expression like “RELIGION = 'Sikhism'” the word “RELIGION” means religion or that Sikhism is a religion. Formally, nothing stops a data analyst from using “RELIGION” to mean birthdate and “Sikhism” to mean 07/03/1950; it would be inconvenient, but not formally incorrect. A very small number of terms have an interpretation known to (inherent in) the theory or language: “=”, “AND”, etc. But these are terms of the formal logic itself: we can formalise identity and truth functions. We cannot formalise an open-ended set of ill-defined meanings including “unknown”, “none”, “withheld”, “not applicable”, and so on. And if we could, we would have a very complex logic. No: we shall be much better off if improper values are like proper values to the extent that their interpretations (their meanings) are not recognised by our data theory or data manipulation language. |
|
|
|
||
An Anecdote of Slipshod Analysis |
|
|
|
|
A data analyst defines a data base containing a PERSON relation, with the sorts of columns we have considered, above. A user asks the analyst: suppose I first restrict the PERSON relation to those rows that do not have an improper value in Religion, and then I do an equijoin of the restricted relation on the Religion column say with itself. Now I have a relation each of whose rows says something about two persons, presumably something about their religion. What does it say?The analyst merely shrugs: I have no idea. Clearly this is a slipshod analysis; an incompletely defined data base. The conclusion is inescapable: a relation has not been defined until an interpretation has been given of the meaning of “=”, “is the same proper value as”, for each column. Perhaps (it is the analyst’s decision) the answer should be: the two persons are of the same faith. By and large, we would expect the column name to give a good indication of the meaning of this identity, which we may term the column criterion of identity. Notice that we must now say: the interpretation of “=” is not constant. It depends on the column. It is indeed the criterion of identity of the column, as defined by the analyst. |
|
|
|
||
Proper and Improper Values |
|
|
|
|
Now we assume that an analyst has defined the Birthdate column in the PERSON relation; so we know its criterion of identity. Joining PERSON records on their Birthdate columns gives us records pertaining to persons born on the same day. But, at some stage, the analyst discovers that the birthdates of some persons, of whom records must be kept, are unknown. Two choices are available:
|
|
|
|
||
What a Null Is |
|
|
|
|
And now we have a definition of “is null” or “is an improper value”. A value, v, is improper in a column when, “=” being the proper identity of the column (and interpreted as the column criterion of identity, as defined by the data analyst), NOT v=v, although of course vºv. Another, less formal, way to make the distinction between improper and proper values is this:
|
|
|
|
||
|
|
THE PROBLEM OF NULLS |
|
|
|
Having achieved a definition of “null” a value is null if improper in its column it is not immediately obvious what problem they might cause. Users might confuse the two identities, “=” and “º”, but there are far worse things that users confuse. Why should nulls be such a problem? Suppose we had some operation in our data manipulation language that involved an implicit equality operation. In SQL we specify an equijoin by defining a Cartesian product (FROM PERSON P1, PERSON P2), and an explicit join condition (WHERE P1.SURNAME = P2.SURNAME, or in an extension to the language that we might now recommend WHERE P1.SURNAME º P2.SURNAME). And then we would usually remove one SURNAME column by projection (SELECT). But suppose, instead, we used a Natural Join operation the Cartesian product, identity restriction, and projection all in one so that we never stated the identity, but it was implicit (e.g. in commonly named columns). In that case, which identity would be implicit: “=” or “º”? We might well ask. And now suppose that not only did we use Natural Join for data manipulation, but we took it, along with Projection, as the basis for data base design. We might attempt to avoid update anomalies by never having a base (stored) relation that could be decomposed by Projection and recomposed, without loss, by Natural Join. We would find ourselves with base relations that frequently needed Natural Joins on their primary keys. If we did design our data in that way, we might reasonably conjecture that improper values would give us problems. Sometimes we would want our joins to use “=”, and sometimes “º”, and we would have no way to specify which. Implicit comparisons are such a bother. |
|
|
|
||
Entity Integrity |
|
|
|
The metaphysics can be found in Chapter 12 of CJ Date, op. cit. |
One very crude solution, which we might consider but would surely reject, is never to allow improper values in primary keys. We can always achieve this by redefining their column criteria of identity; and without improper values, “=” and “º” become the same comparison, so we draw the sting of implicit identity. (Whether one prefers this “logical” basis for the Entity Integrity rule, or the metaphysical basis each entity having the attribute of self-identity, represented by the primary key column may be merely a matter of philosophical taste.) |
|
|
|
In any event, if the problem of nulls does lie in Natural Join, we can be very confident of our definition of nulls as improper values, not meeting the criteria of identity of their columns. That, surely, is almost exactly to define a null as: something that will give you a problem on a Natural Join. Trivial; but that is logic for you. There is, however, yet another approach to entity integrity. As every column has to have a criterion of identity, a primary key column has to have such a criterion. We would expect the criterion of identity of the Person-Id column to be: is the same person as. But we would also expect (it is, near enough, the same expectation) that the same criterion would apply to each row of the PERSON relation: one row for one person. Suppose we had an improper value in Person-Id. We would have a row that did not represent something that was the same person as itself, i.e. did not represent a person. In that case, the relation would appear to be ill-named. To go to the dogs once more, if we had a PURE_BREED relation, and one of its rows had the improper value of Breed Name as primary key value, we might reasonably rename the relation as BREED, and change the criterion of identity of Breed Name so that the improper “Mixed Breed” became proper. (The intuitively correct PURE_BREED table would then be a restricted view of BREED, lacking just one of its rows.) But such nice choice of relation names, important as it may be for usability, makes no formal difference. Of course we may have an improper value in a primary key, and if we join using “º” we shall have no trouble. And if we join using “=”, when that is what is wanted, we shall still have no trouble. We shall have trouble only if we do a Natural Join without the faintest idea of whether “=” or “º” is “natural”; or worse perhaps use a syntactically explicit equality in a restriction condition without knowing which of the two meanings it should bear. Suppose we decided to compare methods of marketing on our customers with Income Ranges High, Medium, or Low (comparing pairs of customers of the same income range subject to different marketing methods). We might want to avoid pairing customers of Income Range Unknown: now we do not want to say, “They are of the same Income Range.” (We said that before because we meant merely: we treat them the same way for credit.) Very well: make Unknown improper; join for marketing comparisons using “=”; and for credit treatment using “º”. Now we have, it seems, a null in the primary key of TREATMENT. And it makes not one iota of difference; as long as we maintain explicit control over our join equalities. |
|
|
|
||
On Some Solutions |
|
|
|
|
||
Multiple Nulls |
|
|
|
|
If we allow only one kind of null in each column, as we have seen, we gain the advantage of being able to define it formally: that value permitted in the column of which the column identity (=) does not hold true. What then of the proposal to allow two or more kinds of null in a column? Why should we not allow “None”, “Withheld”, “Unknown”, and “Other” in the Religion column? (Notice that we now have a conclusive test that shows each of these to be null, given the column criterion, “is of the same faith as”.)
If we employ someone of unknown birthdate, we might need to enter a conjectural date for purposes of retirement and pension calculation. We could still need a modality field value (proper value) “unknown”. Life insurance companies keep the birthdates of their customers, but each birthdate needs a modality: merely used for premium calculation purposes, or confirmed (on receipt of a copy birth certificate). Some applications handle what might be false data (e.g. hospital patients include drug addicts, some of whom give false names); such data needs a modality column to flag it as accepted or possibly false. So the concept of modality comments on the status or reliability of data in some other column is entirely orthogonal to the concept of null (improper value). If modalities are needed, they can be treated like any other data: as values in columns. If (as is often the case) there is “only one kind of null” for a column, the improper value functionally determines the modality, so no modality column is needed. |
|
|
|
||
Default Values |
|
|
|
See Chapter 15 of CJ Date: Relational Database Selected Writings, Addison-Wesley (1986), and criticism in Codd’s RMV2. |
Date has proposed the use of default values for nulls. The proposal is wise; default values would be very useful. However, the concept of default, like that of modality, is orthogonal to the concept of “null”. It is left as an exercise for the reader to describe: a non-null (proper) default, a null non-default. Added Note: Date has since clarified what he means by default, or rather what he does not mean, viz default. What he does mean, as far as I can understand it, is: null, but not treated as nulls are treated in SQL, which uses ... |
|
|
|
||
Many Valued Logic |
|
|
|
There are only two truth values. Many (i.e. more than two) valued logics have their uses, but their extra values cannot be truth values. In the same way, a numeric keypad can have buttons for only ten decimal digits 0 to 9; a telephone keypad may be many (more than ten) keyed. But the extra keys cannot have extra decimal digits. See sections 2.5 and 5.3 of PT Geach: Logic Matters, Blackwell (1972). |
If a comparison involves an equality, and we are unsure which equality is intended “=” or “º” we might get any of three results: True if “=” (and therefore “º”) gives true; False if “º” (and therefore “=”) gives false; and Confusion if “=” gives false and “º” gives true. The solution is to use explicit equalities and therefore to be sure which is intended. Introducing “Confusion” as another “truth value” is patently the wrong answer. |
|
|
|
||
Domains |
|
|
|
|
The proposal to treat nulls as something other than values probably arose because of two conflicting notions of “domain”. Are we to understand the domain of a column as the collection of its permitted values? or as the collection of its proper values? If we try to say both (almost certainly because we have not explicitly made the distinction) we end up with nulls permitted, but not in the domain: so they must be permitted not-values, marks (with added confusion caused by inadvertent introduction of the orthogonal concept of modality). If we say that the domain is the collection of permitted values, then we may define the “proper domain” as the collection of proper values. “Nulls allowed” then means that the proper domain is a proper subset of the domain; and “Nulls not allowed” means that the domain and the proper domain coincide. There is then no reason to think of nulls as anything other than values. |
|
|
|
||
Missing Values |
|
|
|
|
It is unlikely that any enlightenment is to be found by thinking of nulls as “missing values”. Firstly, we are no more likely to be able to formalise values that are missing, in the sense of not being in the database at all, than to formalise values that should be in the database but are incorrectly recorded. Secondly, if there is one sort of data that we do not need to formalise, it is the data that we do not keep. Users know that what they are querying in a database or a paper system is nothing but what is recorded. They understand the tenuity of its connection with reality. But it is, of course, very easy to find examples of improper values that informally speaking stand for “missing values”; and fairly easy to find examples that do not. |
|
|
|
||
Nulls in Outer Joins |
|
|
|
|
The “nulls” generated by Outer Join operations are unrelated to the nulls discussed in this paper. They are mere pads; a transparent ad hoc adjustment to overcome the constraint of Union Compatibility (that all the rows in a relation have one and the same format). Nulls improper values will not serve this purpose; and nor will any other values. The “short rows” in an Outer Join simply do not intersect all its columns (they are the same rows as were found in one of the operand relations that lacked those columns), and so there are no values at all in the absent cells (where the rows do not intersect the columns), neither proper nor improper. The only acceptable solutions are to reject Outer Joins or to abandon Union Compatibility. To define a value to put in a cell, to mean that the value is not there at all, and nor is the cell: this is vanity. |
|
|
|
||
|
|
CONCLUSION |
|
|
|
In order to define a relation it is necessary to define the criterion of (proper) identity of each of its columns. If we represent this identity by “=”, we can define the value identity of the column, “º”, which allows the column to have one improper value. We define “null” as improper in its column. The nulls thus defined are clearly and precisely those values that give rise to the recognised problems of nulls, among which the first and foremost is the problem of Natural Joins. This problem arises from the use an implicit, and therefore ambiguous, identity. This definition of null enables us to explain not only the problem of Natural Join, with its implications for data design, and the Entity Integrity rule (without recourse to dubious metaphysics), but also:
A brief investigation of multiple kinds of null shows that they can be treated (where necessary) as modalities, i.e. values in one column that are comments on another. This concept of modality is orthogonal to that of null: a modality may be associated with either a proper or an improper value. In sum, it is proposed that the definition of “null” as “improper value” solves, with minimal theoretical adjustment, the major problems of nulls, and explains much about alternative unacceptable solutions. More importantly, it is proposed that any alternative treatment of nulls should (1) be at least as satisfactory at explaining and solving the problems of nulls, (2) require no more complex theoretical adjustment, and (3) include, and preferably start with, an answer to the Socratic question: What is a null? |
|
|
|
||
|
|
||
|
|
||
|
|
||
|
|
Download Nulls A Socratic Enquiry in Restricted Text Format (rtf, Word for Windows compatible) |
|
|
|
||
|
Copyright © 1994, 2001 Adrian Larner. The author asserts all moral rights. |
||
|
The decorative image of a key (cc004239.gif) used on this page was obtained from IMSI's MasterClips/MasterPhotos© Collection, 1895 Francisco Blvd East, San Rafael, CA 94901-5506, USA. |
||