|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl27 |
|
PLATOCLAST Lecture XXVII |
||
|
|
||
|
|
No, it’s not only from sheer perversity that dear old Platoclast lectures on first normal form in the penultimate week of the course. It is that, of course, but I wanted to show you what could be built, with a very simple theory, on the assumption of first normal form normal records before I delivered my final defence of it. I wanted to show you its value; to convince you it’s worth defending. Because it’s under threat. People think they can get more power that constant temptation if they abandon it. |
|
|
|
||
|
|
Polynormality |
|
|
|
Yet it’s a real problem. If you’ve played with Father Warden’s Functional Determinacy data base, you’ll appreciate how convenient it would be to allow polynormal records. The problem is this, isn’t it? If we have “(Kl, K2) ® (Cl, C2)”, let’s say, “Kl and K2 together determine Cl and C2”, then we can reduce quite happily to “(Kl, K2) ® Cl” and “(Kl, K2) ® C2”. But we can’t reduce, say, “(Kl, K2) ® Cl” to “Kl ® Cl” and “K2 ® Cl”. Because it’s Kl and K2 together that are the key (the determinant). If we had, for instance, “(Kl, K2) ® Cl” and “(Kl, K3) ® Cl”, and we reduced them to “Kl ® Cl” and “K2 ® Cl” and “K3 ® Cl”, we just wouldn’t know what determined Cl: Kl or (K2, K3) or (Kl, K2, K3)? What we might call the “orthodox” answer, and I shall argue that it is literally the orthodox answer, the straight-thinking answer, is that:
But it is tempting, isn’t it, to use:
|
|
|
|
||
|
|
Non-Normality |
|
|
|
Suppose I concede that we can keep the polynormal FD table in our data base (remembering that I’ve already conceded something very like it being transiently formed: you need only to take the first normal form table and collect KEYCOL within KEY and COL). I know what you’ld want next. Think of an ordering specification for a query, such as:
|
|
|
|
||
|
|
Limits of Complexity |
|
|
|
But beyond this very specific question, “How far will you go?”, is a more critical one: Is there anywhere at all that youll stop? What, after all, are we trying to do? We are trying to express a more or less complex object in a record. And if we stop at normal records, or polynormal records, or records whose values may be ordered lists, or ... wherever we stop, there will be some complex object that we want to express, and which won’t fit. Well need to represent it by multiple records. We could choose not to stop: we could have (as some people want) relations whose values can be relations; and their values can be relations; for ever. And some people would, no doubt, allow ordered lists of values as well, and, for all I know, trees, lattices, hypergraphs, and any mathematical jiggery pokery you care to come up with. That won’t do for me: I’m a stopper. So, if I have to stop, Ill stop at something simple: normal records. Even my talk of transiently polynormal records is, you will recall, construable as schematic specification of normal records. But remember: not only are normal records simple, there is nothing simpler. That is, they give minimal organisation dependence. So any more complex structure gives non-minimal organisation dependence. And that’s why Ill stick to normal records. |
|
|
|
||
|
|
Deviations from Normality |
|
|
|
Consequently, to represent many complex objects (even an Order with its Order Lines), I’ll need multiple normal records. That’s fine by me: I do allow them all to be queried together in the same table. |
|
|
|
||
|
|
Composite Columns |
|
|
|
Now there are folk about that appear to agree with me about first normal form, yet they are wolves in sheep’s clothing. There are three fiddles currently on the market. The first is what we might call the “practical man’s fiddle”. This allows a little minor adjustment for what are called “composite domains”, like dates comprising Year, Month, and Day. They want to allow them. I’ve already dealt with this fiddle in an early lecture, and I think you can see how it ought to be avoided by having a data type of “DATE” with “YEAR”, “MONTH”, and “DAY” functions. Years, months, and days aren’t really components of dates: it’s just that the presentations, or DISPLAY forms, of years, months, and days (what the user sees) are components of some DISPLAY forms of dates. That doesn’t matter. It’s just like the DISPLAY forms of its REAL and IMAGINARY are components of one DISPLAY form of a COMPLEX NUMBER. |
|
|
|
||
|
|
Compound Values |
|
|
|
The next fiddle is the “cheeky boy’s fiddle”. Take the FD table, and write it as:
But you know, and I know, there’s going to be some funny business that somehow discovers that Kl, K2, and so on are in some mysterious sense “in” this single value called “{Kl, K2, ... Km}”. In fact, it’s not a name at all, but a title. And the easy demonstration is that, if the cheeky boy insists it’s a name, then you insist on the system working with any other name (the Shakespearean test!) Can we substitute, say, “k” for “{Kl, K2, ...Km}”? If we can’t, if that stops the system working, then it wasn’t a name but in effect a compound value, not allowed in a first normal form record. Remember that these “values” visible to the user are the display forms of tokens. The system has to work without its presentation handler, without the DISPLAY functions, with no effect other than it being impossible for a human user to see the result. |
|
|
|
||
|
|
Encapsulated Data Structures |
|
|
|
But the third fiddle is the subtlest, the “fifth columnist’s fiddle”. It really does look like this
You see what a subtle fiddle this is. It exploits data type processing to encapsulate (in this case) a set other aggregates could be managed as well as an atomic value. What can we say? I suspect, and regret, that formally this is impeccable. I wish there were a way to prove that any such data type was merely an aggregate handler, and then we could formally exclude it. But I don’t think there is. But it doesn’t matter too much that there is no formal test. Suppose we are right about first normal form: that any other form of aggregation or connection of values gives rise to organisation dependence. Then this cunningly encapsulated aggregation gives rise to organisation dependence. Of course it does: instead of just one sort of INSERT (for a normal record) we now have two, because we have “INTO” as well; instead of just one sort of DELETE we now have two, because we have “FROM” as well. And we have two sorts of query: we have the ordinary one that shows us “k”, and the extra function “IN” to show us the elements of k. So those that use the fifth columnist’s fiddle bring down coals of fire on their own heads. Or, at least, on the heads of their users. And the test of virtue is simple, though it’s an informal test that relies on the individual conscience of each designer: is the sole purpose of this data type, COLSET, to aggregate or structure logically atomic data, that is, atomic data that the user might want to see? If so, then it’s a fiddle. Notice that in the case of COMPLEX NUMBER this test fails: encapsulating complex numbers is not a fiddle. Although a user might conceivably think of complex numbers as aggregates of a real and an imaginary, and they might be presented as areal and an imaginary part, the objective of encapsulation is not to aggregate those parts. Indeed, in deciding to encapsulate we do not decide to implement as an aggregate. You may think that it’s difficult to judge. But it’s no more difficult than any other moral judgement: you must be strict with your own decisions, and where there is reasonable doubt charitable with those of others. One needs to consider the operations performed on an encapsulated data type. With complex numbers we need to add, subtract, and multiply them. You can see that an explicit representation as real and imaginary parts would require explicit analysis of those functions, and implementation of them at that explicit level would give us representation dependence. It would make it more difficult to go to an alternative representation. In the case of COLSETs, however, an explicit representation of their members in multiple normal records would merely remove all the defined functions, such as IN, INTO, and FROM. It would introduce no representation dependence. And that shows that their encapsulation was purely for the purpose of aggregation. So we have a pretty clear cut case of fiddling: but I don’t see any general test to detect that sort of fiddling. We just have conscientiously to walk the thin line between representation dependence and organisation dependence. |
|
|
|
||
|
|
Adequacy of Normal Records |
|
|
|
You will hear it said that there are complex objects that can’t be represented in relational data bases. But you need to ask: does this mean, “in a single normal record”? or “in current relational data bases”? or “in a single relation” (because of the uniform format restriction)? Let me point out that any physical record written to a disk can be construed as a normal record. So if data can be kept on disk then there is some way of treating it as comprising normal records. For that matter, if any data can be represented as a graph comprising nodes that are normal records (possibly merely fields) and arcs, possibly labelled, connecting those nodes, then it can be represented as normal records. And again, if you can just write down in English what data you hold, perhaps at rather inordinate length, then it can be represented as normal records. So fear not! All data can be represented as normal records, and therefore with minimal organisation dependence: don’t be fooled by any claim to the contrary. |
|
|
|
||
|
|
Non-Normal Records |
|
|
|
As for the proposal of Non-normal relations, in the sense of relations admitting values that are relations, it is perfectly simple to normalise them (indeed most proposals of non-normal relations provide an UNNEST operator just for that purpose). The fully normalised result may comprise rows of different format. Any query that could be performed on the non-normal relation can be specified on the normalised table, and the appearance of the non-normal relation can be generated by pretty straightforward report formatting. When I say that any query can be performed, I mean, of course, “in effect”: many of the queries on non-normal relations comprise NESTs and UNNESTs that as far as I can see serve merely to overcome the internal problems that they have generated for themselves. In a normalised table, these queries tend to amount to the identity manipulation: leave everything just as it is. But we mustn’t be too harsh: the proponents of non-normal relations are simply doing openly what the fifth column fiddlers are doing covertly: the latter are knaves, the former merely mathematicians. You can understand the temptation. We have said that, in some sense, a Product and its Options comprise a single non-normal record. It seems natural to have such records, each comprising the data for one product and the set, or table, of data comprising each of the options. |
|
|
|
||
|
|
Data Structure Dependence |
|
|
|
Now, my friend William has a hypothesis about existence dependent records like Options. He reckons that any existence dependent record type is dependent on two other record types. That’s a bold hypothesis, and I think it’s wrong: but in many cases it’s right. So it’s ever such a good guideline for a systems analyst. And we might be tempted to say that that’s the way with Options: they’re dependent on Products. but also on Features. Indeed we might have a non-normal Feature and Options record, comprising the data for one feature and the table of data comprising each of the options. But. if we’re thinking at the concept level, why should we have to make this choice between keeping Options within Products and keeping Options within Features? Isn’t that precisely to ask: which of these alternative organisations should we have in our data base? So we would be considering the question: in what precise way should we make our data base organisation dependent? Well. that seems absurd to me, because Dr Codd has taught us that we don’t want organisation dependence in the data base. We do want structured data for users. and even more for batch programs. But that’s fine. We can do the Left Conjoin of Products (or Features) with Options. and order by Product (or Feature) and Option. to get Options within Products (or Features). What does it look like. nicely ordered?
|
|
|
|
||
|
|
Structured Output |
|
|
|
And then we can. if we want. do some report formatting:
My guess is that this appearance at the user interface has been misinterpreted as the appearance of a non-normal relation. But it can equally be construed. as I’ve shown. as the non-normal appearance of a table of normal records. And then there has been level confusion: it’s been assumed that because the data looks, or even is, structured into aggregates at the user interface. it also has to be structured in that way within the data base, at the concept level. You see what a multitude of ills follow from confusion: of appearance and reality, and of the level of data. |
|
|
|
||
|
|
Simplicity and Power |
|
|
|
So you see why I’ve been working so hard and so have you to simplify our theory of data. and to elucidate all those levels store. form. concept. map. and external view. It’s to achieve clarity. to help us think straight. And nothing aids clarity like economy of means: using nothing but normal records to structure and aggregate fields, using nothing but the conjunctive part of the FOPC as our DML, using nothing but ordering to structure external views. Once we have exploited economy of means to get our heads in order. we get extra bonuses when we come to implement a DBMS, or design a data base. It’s like using a programming language that is strictly structured (no GOTOs ), and can be compiled in one pass. The greatest gain is that its syntax is so simple that even I can understand it; but it also turns out to be extremely simple to implement. And yet, and yet, we are always tempted to increase power, and therefore complexity. This is why I have worked so hard to demonstrate to you just how much power we can get without resorting to complex theories and data structures. And I say unto you: Blessed are the weak, for though they shall not inherit the earth, when the powerful inherit it, and make a total mess of it, the weak shall be able to say: I told you so. I understand that some of you plan to be involved in what is quaintly called “Industrial Action” instead of attending our next lecture. Well, I’m not going to stand up here and talk to half a dozen blacklegs. But if anyone has any burning questions, and cares to drop in at my pad around two o’clock next Monday, we’ll have a little symposium: a question and answer session. a good argument perhaps. So I shall avoid lecturing, but claim my fees nevertheless. Of course, Judy, youll come earlier, have a snack lunch, and give me a hand with the coffees. You what? ... You have a luncheon date! Parents visiting I suppose? |
|
|
|
||
|
|
Character Strings |
|
|
||
|
|
(In a discussion shortly after this lecture Professor Platoclast discussed “Character String” as a data type.) Yes, indeed, I see the problem. We often think of “Character String” as a data type; I’ve used it myself. But it’s largely a hangover from old-fashioned programming languages. What sort of column might contain a character string? A NAME column: well, I’ld want to say that a NAME column should contain a value of type NAME. That’s because, although we sometimes express names as character strings, they’re not character strings. Remember that names, like any words, were spoken long before they were written. A name could be written in different ways, so that one inscription of a name might be a different inscription from another, and yet be the same name.. Like Mao Tse Tung and Mao Zedong. Or we could have two different names written the same way, such as the English name “Lee” and the Chinese name “Lee”. So I’ld want to say: a character string is just one way to implement a name; or to display a name, or anything else, in a presentation manager. Pure character strings are pretty rare: I mean, columns that really do need character string values. But suppose we had one. Then l’1d want to ask: what functions do we want on it? Let’s suppose we had a table whose records represented positions in that word game where we change one word to another by modifying a letter at a time: you know, changing POST to MAIL. Our records might be POST, MOST, MAST, MALT , MALL, and MAIL. And in this case I would want to say: let’s not use character strings because we mostly want to work with characters. I’ld want a record to contain a word identifier, a letter position (1, 2, 3, and so on), and a letter. On the other hand, if we just wanted to keep lists of written words perhaps each record would represent a dictionary entry then character string would be all right, really, “written word” or “inscription”. But, you might say, wouldn’t we be interested in parts of words? Sometimes, when we wanted to know about prefixes for instance. But then, if we’re really thinking about words, and not character strings, we would want to say that “antacid” contained the prefix “anti” but “antiquary” didn’t contain that prefix. So character strings wouldn’t do. So I want to say: if what we really want, most of all, is a string of characters, a structure of characters, then we shouldn’t have a character string. That would breach first normal form. But if we don’t want that, by and large, then it’s all right to have a character string. |
|
|
|
||
|
It’s one of those things you can have only if you don’t want it. But that’s because, when you’ve got it, it’s really something else. Is that clear? |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||