|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl25 |
|
PLATOCLAST Lecture XXV |
||
|
|
||
|
|
As you’ll recall, by the time we’d finished with Dr Codd’s two integrity rules, we didn’t have much left in the way of integrity. So I guess we ought to take another look at that area. And I’ve also been pretty rude about designing on the basis of functional determinacies, because so many of them are unreliable. Now, I don’t want you to go away thinking that old Platoclast cares nothing for integrity or design. But the best I can manage is, I hope, to make you think that I have a few ideas in those areas, which I do think important; and perhaps that there might be some useful leads into further research. By the way, I want to tackle Integrity and Design together because it’s really the design of a data base that integrity pertains to. I mean, you can put lots of false records into a data base; and any false record, in a sense, destroys integrity. But we don’t use “integrity” in that sense: we use it to mean conformance with its design. Let’s first ask ourselves at precisely what level we want integrity: is it at a physical level, the store level, say? or at the form level? or the concept level? I think the answer is: at the form level. Though we shall certainly want to take into account our interpretations at the concept level. I don’t think that we need to worry too much about anything like Entity Integrity. Or rather, we have already spent a lot of time on those worries: we concluded that we wanted a criterion of identity for every column, and where we chose to admit null values a criterion of completed identity. |
|
|
|
||
|
|
Referential Integrity |
|
|
|
But there surely is something behind the idea of Referential Identity, even if the concept of Foreign Key doesn’t quite latch on to it, and the rule of Referential Integrity turns out, as it did, to be trivial. Let’s think of the sort of example where we do seem to need something like referential integrity. Imagine that we make some products, and some of those products have options. We might make cars, for instance, and the options of a car might include its body colour, its upholstery, and so on. Notice that an option is not, say, “beige leather”: it is “beige leather upholstery of such-and-such car”. We don’t sell the option except with a particular car. Now, very often referential integrity is discussed in terms of deletion; for instance, if we delete the car record then we delete all its options. I want to say two things about this:
|
|
|
|
||
|
|
Existence Dependence and Simple Dependence |
|
|
|
When Mr Date discusses referential integrity, he describes three ways to achieve it on deletes: Cascade, Restriction, and Nullification.
But there’s a difference: does an Option merely need a Product to hang on to? or is there a Product that it needs to hang on to? Does an Option, in other words, merely lust for, or is it in love with, a Product? I’m sure you’ll appreciate the difference: is there someone you need to satisfy your passion, as there is when you are in love? or do you merely need someone to satisfy your passion? Cascade delete is the lover’s delete: if I cannot have you, I shall die. Restriction delete just means: I won’t let you go until I’ve found a replacement. We could write that in the FOPC it’s a question of where the quantifiers go but you’re more likely to remember the distinction the way I’ve described it. Please don’t think I belittle the difference of lust and love; far from it. It may be “only” a quantifier shift, but that’s a logical difference, and as Wittgenstein said, in a rare moment of lucidity all logical differences are big differences. Remember that. I do stress however that we’re talking of records, not of things in the wider world. Two such things might be related in a way that made Cascade delete appropriate, but their records might have Restriction deletion; or vice versa. But what is the difference between the relationships: the difference between one sort of dependency and the other? Let’s call one “Existence Dependency” the lover’s dependency. Its formal expression starts with an existential quantifier; there is someone ... And the other we’ll call “Simple Dependency”, because simply anyone will do (of a certain type, no doubt). Well, think about that Option Record. Surely, in some sense, it’s part of the Product Record, isn’t it? Or rather, the Product Record and its Option Records together constitute a meaningful record, and the Option Record lacks any independent meaning. A fine classical example is the smile on the face of the Cheshire Cat: a cat without a smile, as Alice remarked, is common enough; but a smile without a cat?... We’re getting close. But we’re not there yet. I want to call upon an old friend: a criterion of identity. Let s1 be a cat’s smile and let s2 be a cat’s smile. What criterion must we apply to tell whether s1 is the same smile as s2? I think you can see that one necessary condition is that s1 is on the same cat as s2. And, in general, I want to say that when there is an existence dependency, say D”, of a B on an A, then b1 is the same B as b2” implies that for some a1, for some a2, D(b1,a1) and D(b2,a2), and a1 is the same A as a2. A simple dependency merely demands that for each B, b, there is some A, a, such that b depends on a. So we could say that something is existence dependent when its very identity hangs on the thing it is dependent on. And that’s why we get the Cascade delete: no entity without identity. An Option Record without its Product Record would not be the same Option Record as itself: so it wouldn’t be an Option Record at all. Of course, we have a choice: we could make it some other kind of record, but if we don’t then we must delete it (that’s a logical must”). |
|
|
|
||
|
|
Normal amd Non-Normal Records |
|
|
|
You’ll appreciate that we go through this philosophical agony because we demand normal records: isn’t it rather obvious why, given a record containing repeating groups, if we deleted the record we would also delete the groups? Never mind; if normality is worth having, it’s worth a little agony. Not everyone would agree. It is to me at least perfectly obvious that many of the forms we want to record are not normal: Orders, Invoices, Inventories of a Part at various warehouses, Bills of Material indeed. Nevertheless, each of these can be inscribed as a collection of normal records, and therefore handled by a very simple FOPC based DML. We are faced with a stark choice: we can either represent each such “complex object” as a single, non-normal record, or as multiple normal records. Oddly enough, if we take the first option, the non-normal record, we can still use relations; not of course normal relations, but nevertheless tables containing records of uniform format. It’s just that some of their columns contain values that are themselves relations; and some of their columns might contain such values upon their backs to bite em.[1] Or, loyal to Dr Codd, we can insist on using multiple normal records. But these turn out, almost inevitably, to have different formats, typically a header format (PRODUCT) and a detail format (OPTION). And I want to say again that, in imposing uniform format, and therefore sundering for ever these complex objects into their constituent parts, Dr Codd has placed upon our backs a burden greater than we can bear. And worse: those that now turn to non-normal records can, with at least mathematico-linguistic right on their side, keep the honoured name of “Relational Theory”; while we, who regard normal records as the baby and relations as the bathwater, must appear as traitors to the cause. Though, of course, we still use “relations in the logical sense: polyadic predicates; record types. |
|
|
|
||
|
|
Integrity Constraints on Views |
|
|
|
Now, think about those cascade deletes, say Product cascading to Option. It seems to me that they are needed because of an integrity constraint on a table comprising a number of records, and ought to be specified on such a table. Suppose we formed a Right Outer Join of Product and Option Records (joining on the Product Id field). Now we could say (though this is just one way to put it) that that table had no absent fields. And that is the integrity constraint: you will see that it is breached only if there is an Option without a Product. Ah well, perhaps there is a use for outer joins after all. That constraint holds, of course, for Simple Dependency as well as Existence Dependency: the difference between them is that in the latter case we are not able to modify the Product Id value in an Option record. I’m not going any further into the integrity question: it’s too huge. There are masses of different constraints that we might want to place on our data. |
|
|
|
||
|
|
Dependence and Functional Determination |
|
|
|
Rather, I want to explore some of the concepts that we’ve come across. Let me start by pointing out that the dependence of one criterion of identity on another, like that of Option on that of Product, is not something new to us. We’ve thought about how the token identity of column values determines their identity according to their column criterion. We’ve seen how one column criterion, like “is the same car as”, determines others, like “is the same make as”. Indeed, this is nothing more than functional determination. It will have occurred to you, naturally, that functional determination must be bound up with criteria of identity; because the very concept of “function “ is. What is a function? A many-to-one relationship: so whether a relationship is or is not a function turns on the question of what criterion of identity we use to count the “one”. But, as you’ll recall, I have tried quite hard to play down functional determinacies as unreliable in logical design. So how can I be confident that Options are existence dependent on Products (because that means: an Option functionally determines a Product)? I’m afraid that the answer is quite trivial. Suppose by “Feature” we mean something like beige upholstery or navy blue body colour. Then we obviously have a many-to-many relationship between Products and Features: a car may have many Features; and beige leather upholstery may be a Feature of many cars, though I wouldn’t drive one. What then is an Option? It is, so to speak, the attachment or application of one Feature to one Product: that’s the way we define it. So it’s a logical truth that an Option functionally determines a Product. It means that an application of one feature to one product is an application to one product. That should be safe: but I told you we could depend on logic. |
|
|
|
||
|
|
Independent Entities |
|
|
|
Some data analysts my friend, William, is one of them think that existence dependency is the most important characteristic of relationships between records (they would say “entities”); far more important than their order (one-to-one, many-to-one, and so forth). Such analysts try to discover (or invent) independent entities, that is, entities that are not existence dependent on any other entity; then they look for entities dependent on them; and so on. I think they’re probably right. Now put that together with this observation. If some of the entities we deal in, values in columns, have criteria of identity that depend on the criteria of identity of values in other columns, then we too should be able to discover in our sense independent entities. I suspect, you see, that our in.dependent entities might turn out to be very like those of analysts like William. And now suppose that we take this approach, and we decide that CAR is one of our entity types: a record type, and the data type of a column in that sort of record. We might call the column “CAR ID”, but remember that when we discussed tokens and data types we were a bit tempted to call it simply “CAR”. And when we discussed some data base problems we were a bit tempted to make CAR values cryptic. |
|
|
|
||
|
|
Columns as Classifications of Entities |
|
|
|
But what other columns might we want in a CAR record? Perhaps COLOUR, MAKE, BODYFRAME, and so forth. These are all. apparently, classifications of cars. That means perhaps that we can say it in different ways “is the same car as” is narrower than “is the same car-make as”, that CAR functionally determines MAKE. But we need to be rather careful. There are, as far as I can see, at least three significantly different sorts of classifications. Let’s start with MAKE. Here we can say that “is the same car as” is narrower than “is the same make as” (if we assume that every car has a make, otherwise we’ll use the completed identity, “is the same make-wise as”, to allow for cars with no recognised make). That means that if c1 has a different make from c2 then c1 is not the same car as c2. And therefore that a car can’t change its make. So I suppose we could say I wouldn’t, and I’ll come to that later that CAR is existence dependent on MAKE. By contrast, consider COLOUR. Clearly, even if at any time a car has just one colour, there is no existence dependency. So c1 could have a different colour from c2 yet c1 coulld be the same car as c2. We have a simple dependency. At any time we do have the functional determinacy, but we do not have that determinacy over time. You can see how this relates to Restriction delete. Suppose we had separate COLOUR records, and we wanted to delete the one pertaining to a particular CAR record. We wouldn’t want, I guess, a Cascade delete: we would still want the CAR record. So before deleting, say, the (COLOUR: Scarlet) record, we might change all CAR records with that value to show (COLOUR: Red): then we could delete the COLOUR record. We might be tempted to say and I am tempted to say it that the CAR record, the logical record, shouldn’t contain fields like COLOUR. Really we should have a separate CAR COLOUR record with a CAR (or CAR ID) field and a COLOUR field. You can see that we would certainly need that sort of record, perhaps with other fields in it, if we wanted to represent cars that had more than one colour, either serially or concurrently. I hasten to add that even if we designed our logical records in that way, while we permitted a car to have only one colour at one time there is no reason why our physical store level records (and some of our views as well) shouldn’t have the COLOUR field in the CAR record. But I want to be very strict about logical record design. Even if putting COLOUR in the CAR record would give us better performance, and facilitate the business rule that we record only one colour for each car at any time; and even if our users (in external views) always want to see COLOUR along with CAR; still I wouldn’t let that affect my logical design. And I don’t have to: the physical and the external are not the same level as the logical, and even when physical and external views correspond with each other it doesn’t mean that the logical view has to correspond with both. We can get performance and methods of enforcing integrity rules at the physical level, and we can provide the views the user wants, without distorting logical design. Remember the DIARY data base: the records could well be physical records, as well as records in the user view. But the only thing that preserved the simplicity of our DML was a totally different format of logical record.[2] |
|
|
|
||
|
|
Existence Dependence and Classification |
|
|
|
Let’s think about MAKE again: I showed how CAR was, in a sense, existence dependent on MAKE; how CAR time-independently determined MAKE. Incidentally, proper existence dependency isn’t just a matter of time, though time is probably the most important destroyer of existence dependency (as with COLOUR). We also have to consider place, and role. Am I an academic type? Well I am here and in the role of Professor, but not back at the office and in the role of Senior Systems Analyst. What would you say if I suggested that a car could be one make at one time and another at another time? At least in theory you could imagine the very gradual refashioning of, say, a Silver Cloud, into a 2CV. I pause a while for those with a weak imagination You have two choices, don’t you? You could agree with me that even MAKE could change over time, so we should treat it like COLOUR. Or you could say: no, a car’s MAKE is the make it was, so to speak, born with. But now I want to say: consider keeping not merely a CAR record, but also the history of that car. We would want records for each service, each accident, each respray, and so on: we might say, event records. But equally obviously each such event record would be existence dependent on the CAR record. And obviously we might have incomplete histories of some cars, cars imported when already several years old perhaps. Those histories would lack one quite important event: manufacture. In the same way we might have PERSON records without event records detailing their births. In fact, there’s nothing very special from a record keeping view about what we might call the “Initiation event of an entity, the first event in its life history. This is one of those cases where record keeping is very different from the reality about which records are kept: cars are indeed existent dependent on the event of their manufacture, and persons on the event of their birth. But, in the way I design records, initiation event records: manufacture records of cars or birth records of persons are dependent on the car records or person records to which they pertain. So think again about MAKE: I can oblige you to concede, you see, that a CAR is existent dependent on its make only because the MAKE is a classification of its initiation event, its manufacture. But if we kept a manufacture record we wouldn’t make the CAR record dependent on it; we would make it dependent on the CAR record. And that’s one reason why I don’t want to say that CAR is really existence dependent on MAKE, even if I admit that MAKE is let us say an invariant classification of CAR. You can see that I’m trying to convince you that things like cars (or rather car records, or values of the CAR data type) should be treated as independent: not existence dependent on anything. In other words, I’m saying that just because “c1 is the same car as c2” implies “c1 is the same make as c2” we shouldn’t think that the former criterion is defined on the latter. MAKE is, or we normally want to treat it as, a mere classification of cars. And I don’t want to say that things are existence dependent on their classifications. I mean, we could pretty arbitrarily classify all students as either diligent or slackers, and say that if s1 is the same student as s2 then s1 has the same diligence as s2. But we wouldn’t want to say that students were existence dependent on diligences, would we? So we’ve established two sorts of classification: variant classifications, like the COLOUR of a CAR, which we are tempted to exclude from the logical record; and invariant classifications, like MAKE. But I’ve said that I certainly wouldn’t want all invariant classifications to be treated as existence dependencies. And I’ve tried to give some reasons why: because, in some cases including that of MAKE the invariance was caused by the classification pertaining to an initiation event, and the record of that event (if there were one) would be dependent on the record of the thing initiated, not the other way round. Also, because classifications can be pretty arbitrary, like the diligence classification of a student. Let’s consider the other extreme. Think of one of our Option Records, containing a Product Id. Fairly obviously, we could say that Options could be classified by Product. Yet that invariant classification is one that we would want to treat as an existence dependency. |
|
|
|
||
|
|
Independence and Existential Interpretations |
|
|
|
Now, I’m not absolutely certain about this, but I think that the difference between existence dependency and mere invariant classification is brought out by our interpretation of records, EI. I suspect that a genuinely independent record has that special case interpretation that I called II, for Independent Interpretation.[3] It comprises merely a single quantified variable, and some classifications:
By contrast, in the case of the MAKE of a car, I at least want to say that the make is merely a classification: there isn’t something over and above actual cars that is their make, any more than there is something over and above animals that is their species, or something over and above students that is their diligence. But you can see how treating classifications species, types, kinds, and sorts as extra things, albeit abstract things, does rather muddy the waters when we would like to draw a contrast with the cases that do involve extra things: Products as well as Options. Now were getting very close indeed to the EAR theory, and to something like Primary Keys. Suppose we have an independent record, with an II interpretation. And suppose that we have a column whose criterion of identity is narrower well, no broader than any other column. In the case of a CAR record, that column would be, say, CAR ID or CAR. Let’s assume it’s CAR, and let’s assume it’s cryptic. Now we can say what we mean when we call the other columns Attributes”: we mean that they are mere classifications. And we can clearly distinguish Entities the things supposedly represented (in the case of CAR records, actually represented) by records with an II interpretation from Relationships. The relationships are perhaps what are represented by records with an interpretation other than II, or better other than SI: the records whose interpretation involves more than one existentially quantified variable. So perhaps, after all, there is something in the EAR theory. But notice that we get there only by abandoning all the horrible, vague senses of “entity”: our entities are now entities of a theory (the second sense we defined). They are the things over which we quantify in EI; the things asserted to exist by the theory that is our data base. But I must stress this: there is no guarantee that these things, these entities, will be the sorts of things that we would want as entities in another sense, things like cars and persons. Think back to those Person Registration records where we found that we needed more than one record for each person, because several records might pertain to the same person, under different aliases. Those records would not have had an II interpretation: almost certainly we would have wanted a variable for the person and a variable for the what shall we say? recognition or presentation of the person. The interpretation, of the form EI would have looked like:
So, at some level in our data analysis and design, we need to recognise persons. Could we say that we have a person record at the concept level, and registration records exist merely at the form level? Tempting, but no. You see, we might well have different forms pertaining to the same presentation (perhaps we keep our forms in different locations). So we actually need a third level, above the concept level. But this third level is not another record level. When we start analysing we begin with obvious entities like persons and cars: entities in the informal sense of “things the enterprise is interested in”, kinds of objects of thought. And our system will pertain to such things at all levels. In more specific terms, at every level we shall be concerned with whether or not this record pertains to the same person as, or the same car as, that record. In other words, the entities we informally and intuitively find at the third level I’ll call it the “map” level are not necessarily independent entities at any lower level, but they are very important attributes, that is, classifications. I take the term “map” from my friend, William, who believes that the very first step in systems analysis even before the statement of requirements is to survey, or map, the entire application area, as understood by the enterprise whose applications are being analysed. |
|
|
|
||
|
I expect he’s right (again). |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||