|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl24 |
|
PLATOCLAST Lecture XXIV |
||
|
|
||
|
|
Primary Keys |
|
|
|
One of the major shocks and disappointments I received when struggling to understand relational theory was the notion of primary key. Firstly, we had what appeared to be a very complex system, based on set theory; and yet this extra primitive notion had been added. Moreover, the choice of primary key for a relation seemed purely arbitrary, in theory. Naturally, in practice, the choice of primary key was usually obvious (though, I suspect, usually obvious and wrong as well). Secondly, the concept appeared to be used to solve too many problems. It was to be part of the mechanism for uniquely identifying picking out an instance of a value (by giving the relation, the primary key value, and the column name). It was also supposed to serve for some sort of major I hesitate to say “privileged” connection between records. It was also intended to represent the self-identity (sameness) of the so-called entities supposedly represented by records. As we’ve already seen, we can pick out records (should we wish to do that) by record identifiers. We have seen that the rule covering these major connections the Referential Integrity rule is actually trivial. And we have said, firstly, that there is no such relationship as “being the same entity” absolute identity; and secondly, that every column is associated with some criterion of relative identity. I conclude, therefore, that thank goodness the notion of “primary key” was just a horrible mistake. Now I’m pretty certain that the relational bigots would pick on my record identifiers and claim them as primary keys. They certainly do serve to identify records (in the sense of sameness); they do serve to indicate a certain sort of identity (in the sense of sameness), namely identity of records. And they might serve, I guess, to make some connections between records. For instance, where the connection really is a matter of making an association with a form, they could be so used. But users might not want to associate, say, a person form and a house form; they might want to associate a person with a house, so should use a Person Id and a House Id. I just don’t see why they should always have to worry about how many forms it takes us to keep information on a person or a house, and which of those forms should be associated. But, as we’ve seen, this whole area is confused in relational theory, because of the unthinking acceptance of absolute identity, and the unfounded assumption of one-to-one relationships between records and entities. All this talk about one-to-one relationships, unique identifiers, and the like has to be sharpened up: one what to one what? unique under what criterion of identity? And when we do sharpen it up, the notion of primary key vanishes away like the dew in the morn. |
|
|
|
||
|
|
Identifiers and Keys |
|
|
|
Let us, however, turn our minds to the notion of key. We haven’t had a TT-DD schema for some while, so here’s one:
And that’s about as far as I can get in the TT-DD towards formalising the concepts of “key” and “functional determination”. I used to be worried that the relational people seemed to get much further, until it dawned that they weren’t doing it formally, merely informally in notation. I warned you about that, didn’t I? |
|
|
|
||
|
|
Candidate Keys |
|
|
|
Let’s suppose you find relational theorists saying things like “EMPNO is a candidate key of the EMPLOYEE relation” or “EMPNO functionally determines EMPNAME”. What does it mean? Well, one thing it means is that EMPNO is a key in that TT-DD sense of the EMPLOYEE record type, or is a key of the (EMPNO, EMPNAME) projection of it. But it means a bit more than that, which we can’t formalise in the TT-DD (nor they in the FOPC or set theory): that EMPNO is asserted to be such a key. This is easily seen. Suppose EMPNAMEs were unique: every employee had a different name. Then we would find that EMPNAME was a key of EMPLOYEE, or of its (EMPNO, EMPNAME) projection. By chance, as it were. But it wouldn’t be a candidate key because it wouldn’t be asserted to be a key. Notice that “It is asserted that” is a propositional operator, like NOT (“It is not the case that”). But it’s not a truth-functional operator because we can’t tell whether “It is asserted that p” is true or false merely by knowing whether “p” is true or false. We can do that for “NOT”; if “p” is true then “NOT p” is false, and vice versa. However, the only propositional operators in the FOPC, or set theory for that matter, are truth-functional: NOT, AND, OR, and the like. So “candidate key” and “functionally determines” are, in the senses used, not formalisable. |
|
|
|
||
|
|
Keys in Optimisation |
|
|
|
Now, I’m going to have a look in a bit at where these functional determinacies come from, but I want to say now that there’s certainly one area where some people think they’re important; but they’re not important at all, though keys in the TT-DD sense are. Actually, I suspect that this holds in most areas, but the area I’m particularly referring to is query optimisation. I won’t go into it in detail ask the good Father Warden if you want to understand it but think about when you’ve done a projection and you need to check whether you’ve got any duplicate records, which you often have. Now it costs you to check for duplicates, but it may cost you not to check early enough if you do have duplicates. And one thing that helps is: if there is at least one key in the projection, that is, one collection of columns that keys all the columns, then you won’t have any duplicates (if you didn’t start with any!) But and here’s the rub it doesn’t matter whether the key is a candidate key or not: it doesn’t matter whether it’s asserted that it’s a key. If it is asserted, in your data base definition, then it’s just fine, because the DBMS will have ensured that it’s true. But you may have found out by other methods that it’s true. And that’s what counts: being a key, not being asserted to be a key. Although it’s unwise, you see, to drink from a bottle asserted to contain poison, that’s not what causes the trouble. It’s drinking from a bottle that actually contains poison, however you may have discovered or even not that it does contain poison. And so it is with optimisation. |
|
|
|
||
|
|
Keys in Data Base Design |
|
|
|
Now let’s turn our attention to another use of keys. People use them in data base design. We’ve touched on this already. Suppose we want to hold this sort of data:
But notice that if you really do want to impose the constraint that MAKE keys (MAKE, BODYFRAME) then you do. You will assert it in the data base schema, and you will want your DBMS to keep it true. And one possible physical implementation that will keep it true is to have store level records like those shown. But does that have anything at all to do with logical data base design? I don’t know, but it’s not obvious, is it? Nor, as it happens, does it matter even at the store level that MAKE should be a candidate key of (MAKE, BODYFRAME). Suppose this were true: we thought, to the best of our knowledge, that MAKE did key (MAKE, BODYFRAME), but we didn’t care, and we weren’t certain. Couldn’t we say this, at the store level:
Well, at the logical level, no change to the form or concept level. But we’ld send a message down to the physical level to say:
Of all my proposals, I suspect that this relegation of relational data base design to the physical level will annoy the relational bigots most intensely. Good. But I have a second, even better reason for it. |
|
|
|
||
|
|
The Join Trap |
|
|
|
Let me assume that MAKE really is a candidate key of CARMAKE; for all time” as they say. Now I’m going to add another column to CARMAKE, so we have:
You see, it didn’t cost us nothing to avoid the join trap: we avoided some invalid, unreliable implications. But some invalid implications lead us to the truth. Remember: all that we ask from implication is that it doesn’t lead from truth to falsity. How do we sort this out? Like this:
Look at it this way: there are lots of reasons why one collection of columns might key another. When we think purely in terms of records at the store level these reasons really don’t matter. So decomposing records by such relationships, knowing that they can be restored by joins on keys, is just fine. You see, at the store level, all we have to do is to keep data: that’s what a store is for. You put things in, and you want to go back later and take them out. And that’s all. But at the logical level our records have interpretations, and the manipulations we perform on them are implications, so we can’t arbitrarily rip them apart and use the manipulations to put them together. In terms specific to EI, when we split a record into two projections we get two interpretations with different existentially quantified variables: that is a loss of information. I agree that it may not matter. If all the relevant users know that any two cars of the same make have the same bodyframe material, then they can add that knowledge to what the data base tells them. But I’m not in the business of devising data bases that work only if users have lots of supplementary information in their heads. I suspect we may have to support, before very long, unbelievably stupid front-end inference engines. And theyll think that if Peter is an apostle and the apostles are holy and the apostles are exemplified by James and the apostles are 12, then Peter is not only holy but also exemplified by James, and 12 to boot. Now you may be tempted to say that CARMAKE should be interpreted as:
But also think carefully: what, if anything, is wrong with the interpretation I’ve given? It requires a separation between physical and logical levels? There isn’t a one-to-one relationship between them? So logical data base design and physical data base design become very different exercises? What would you count as goodness then? |
|
|
|
||
|
|
Origins of Functional Determination |
|
|
|
You can imagine that I’m a lot less interested in functional determinacies than the relational people are. Though, of course, Ild be interested in key relationships if I were into physical data base design, and particularly optimisation, so I’ve generously thrown out the odd suggestion like defeasible determinacies as I’ve hacked my way through their forest. But if you are interested in functional determinacies, you ought to ask yourself where they come from. Well, ignoring the accidental ones the unasserted, coincidental keying relationships we have some that derive from business rules. For example, we might have a policy that each employee was assigned a single visual display unit to work at. That would make Employee Number determine VDU Id. And then weld see people designing a data base in which VDU Id was found in the EMPLOYEE record: a great idea, at the store level; it does wonders for performance. But it’s not the job of the logical data base designer to make their design dependent on business rules that might change tomorrow. Then, even worse, there are other enterprise’s business rules: the determination of BODYFRAME by MAKE might be an example. Again, these could change: OK to exploit them, perhaps as defeasible, at the store level. Sometimes the other enterprise is the government, central or local, with all their laws and regulations. Do you think anyone has ever been caught on that elegant, utterly Wardenesque functional determination of VAT rate by the empty collection of columns, i.e. there is just one VAT rate? You betcha! Vanity of vanities, all is vanity! Even government regulations pass away. That’s an absurd understatement if you happen to be working in the local government sector. We have social conventions: everyone has just one surname. I wouldn’t bet on it. And then we get to hard reality. Hardish. Every car has one steering wheel? I believe there are prototype cars that you can drive in either direction, like some of those old trams. And some cars I learnt to drive in one have two brake pedals and two clutch pedals, and sometimes I wished my instructor could override my steering as well (so did she). But surely every person has one head! I don’t want to get gruesome, but no. So what determinacies are reliable enough to use in data base design? If business rules, laws, social conventions, and even what we took for reality, all let us down, what can we rely on? What can we rely on when all else fails? Logic. Logic alone will never lead us astray. Some designers like to throw in a bit of commonsense as well. I do myself; but I don’t recommend it. |
|
|
|
||
|
|
Duplicate Rows |
|
|
||
|
|
(Professor Platoclast was asked whether his “de-emphasising” of primary, and even candidate, keys meant that he admitted duplicate rows.) “De-emphasising”! I couldn’t manage anything as subtle as de-emphasis; I don’t de-emphasise primary keys, I reject them. Candidate keys I admit informally, and I’ld like to formalise them but it needs a modal logic: a logic with functions that aren’t truth functions. As for duplicate rows, that’s a surprisingly difficult problem, but largely because discussions of it are so confused. |
|
|
|
||
|
|
Duplicate Needs Definition |
|
|
|
First of all, we have to be able to say what duplicate rows are. It seems obvious to me that if Rl is a duplicate of R2 then we need two criteria of identity for rows. Suppose we had only one criterion: then Rl might be the same row as R2 by that criterion, that is, “Rl” and “R2” would merely be different names for one and the same row; but if there’s only one row, with no matter how many names, then it’s not a duplicate. On the other hand, Rl might be a different row from R2 by that single criterion. So we would have two rows, but they wouldn’t be duplicates; they’ld just be two rows. Obviously, therefore, we need two criteria of identity, so that we can say Rl is a duplicate of R2 when it’s the same by one criterion, but not the same by the other. But those who don’t want duplicate rows can’t just say: if Rl is the same as R2 by some criterion then it’s the same by any criterion. That’s obviously absurd, because any two rows are the same by some criterion, for example by “... is a row, and so is ...” And I think what those opposed to duplicate rows want to say is, using our criterion of identity for rows (comprising all and only the same fields), that if Rl is the same by that criterion as R2 then Rl is the same by any criterion as R2. Now even that is pretty doubtful. For let Rl be a row in our data base, and let us read in that row and display it on a screen, and let us call the displayed row R2. I think I want to say that R2 is the same row as Rl, otherwise there’s something wrong with our application program: we asked it to display a row and it displayed some other row. Yet clearly Rl is not the same as R2 by any criterion: “is displayed on a screen and “was written by a display program are true of R2 but false of Rl, whereas “was read by a display program is true of Rl but false of R2. I think you can probably see why I’m not too worried about duplicate rows. Suppose, whatever it means, that Rl is a duplicate of R2, and that Rl is in our data base and that R2 is in our data base. Now, either Rl is the same row as R2 or it isn’t, by our criterion of identity of rows. If it isn’t, then we have two rows; if it is, then Rl is the same row as R2. But really that’s all I’m interested in: is Rl, or is it not, the same row as R2? And whatever the answer might be l don’t care whether they’re duplicates or not, in any sense of “duplicate”. As I said, you can get duplicate rows only if you have two criteria of identity for rows; but I’m interested in only one such criterion, so the problem doesn’t worry me. (The Professor was not allowed to get away with this response. Date had written a paper attacking duplicate rows;[2] SQL allowed duplicate rows; surely the problem could not be so easily dismissed.)Very well; I admit the problem seems important to Mr Date, but he doesn’t have the theoretical apparatus to address the problem. He never gets round to mentioning the criteria of identity he’s using. And his attack is based on set theory, which he considers simple and sound, and I reckon to be complex and unsound. His argument starts with a claim that individual objects must be identifiable (distinguished from other objects). If that were so, there would be a criterion of identity for individual objects: “is the same object as”. And that would mean that you could tell me, for instance, how many objects there are on this desk. But you can’t: you can tell me how many books, or how many sheets of paper, or how many sticks of chalk; but not how many objects, or things. Then Mr Date argues that if we have duplicate rows in a relation we have to distinguish them by position. Actually he says that we have to distinguish them somehow, not necessarily by position, and then he assumes position. But, as far as I can tell, people who want duplicate rows (in SQL, say) don’t want them distinguished by position. |
|
|
|
||
|
|
Interpretation of Duplicate Rows |
|
|
|
And then he says ... what does he say? Ah! He then goes on to the question of meaning, and this is where things go hopelessly wrong, because he raises the question but doesn’t answer it. He gives an example with duplicate rows, but doesn’t tell us what it means. And this is the difficult thing in arguing against duplicate rows. We either have to say: duplicate rows have no meaning, or we have to give a meaning. If we say they have no meaning, ... Well, there’s an end to it. But it’s not very convincing, because people wouldn’t use duplicate rows if they had no meaning. On the other hand, if we give a meaning to duplicate rows, and possibly even translate each table with duplicate rows into a table without duplicate rows, then our last state is worse than our first. We have actually shown how to handle duplicate rows: we can translate them into non-duplicates, manipulate them, and translate them back. Then it seems absurd to criticise duplicate rows so violently. Mr Date actually states a query on a table with duplicate rows and formalises that query in several different ways, with different results. But the trouble is: we can’t tell which, if any, of his formalisations are correct because he hasn’t given us any interpretation of his table. He claims that the duplicate rows give rise to ambiguity in the result; but the ambiguity comes merely from the failure to specify the meaning of the table. |
|
|
|
||
|
|
A Case for Duplicate Rows |
|
|
|
Let me try to build a case for duplicate rows, and you’ll see just how difficult it is to argue against them. Suppose I sell what are called big ticket” items: beds, washing machines, that sort of thing. I might want to display a table of what I have in stock: I’ll show you a part of that table, highly simplified:
But is this system correct in the DML? Or does the DML distinguish only one row? Clearly, the DML distinguishes two rows: the table shows that. Then the two rows must differ in the values of at least one field? Quite so. But the table doesn’t show that they differ in at least one field: yes it does, or there wouldn’t be two rows shown. The table doesn’t show in which field they differ. True. There are a number of possibilities. There might be a third field. not shown at all. This could be a cryptic field. I’ll put it in:
But there might not be an extra column. Perhaps the ITEM column contains two different values, each of which appears similar to the other. Again. we would have two rows that were the same overt row. We could say: ITEM is a partially cryptic field. Look. you would allow this wouldn’t you:
|
|
|
|
||
|
|
Arbitrary Choice Among Duplicates |
|
|
|
Now you might complain: but how can I update one of these rows (or delete one of them), without affecting the other. That’s easy enough in a program, which looks at one row at a time. But how about the poor user? Well, it’s a matter of language design, but why shouldn’t we allow a user to select ARBITRARY, or select A. We might already allow, I think, something like:
But how could a user pick out a particular row? How distinguish between two items that appear the same? Well, they couldn’t: this appearance of “duplicate rows” is used only when the user can’t, and doesn’t need to, distinguish. Indeed, in some cases, the user mustn’t distinguish: for perishable items, we might want to force the system to extract them from the store first-in-first-out, but to hide that rule from the user and to enforce it unseen. But notice: I have defended what you might choose to call “duplicate rows” by telling you what they mean. Indeed, I give you a translation into overtly different rows, or a clearly acceptable format, like:
|
|
|
|
||
|
|
Duplicate Rows in SQL |
|
|
|
SQL?... No, of course SQL doesn’t implement Relational Theory without cryptic fields, but then I presume it wasn’t intended to. They wanted the sort of duplicate rows that SQL has. But that doesn’t mean it ought to be changed to implement Relational Theory without cryptic fields. Perhaps it ought to be changed to implement Relational Theory with cryptic fields; or even Platoclast’s Picnic Theory with cryptic fields. The main thing is to work out what we’re doing and where we want to go, before we start rebuilding the vehicle. |
|
|
|
||
|
Go home. You’ve had the taxpayers’ money’s worth for today. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||