(U03)

www.btinternet.com/~adrian.larner/database/pcl26

PLATOCLAST
ON DATA

Lecture XXVI
More on Design

 

 

Four Levels of Design

 

If anyone was hoping I would give them a complete exposition of data design, I’m afraid they’ll be disappointed. Indeed, my message is: it’s far more complex than you’ve been led to think. We don’t just have logical and physical design. We have:

1
Mapping, in which we identify the “entities” of interest to the enterprise being analysed
 
2
Conceptualisation, in which we work out what concept level record types are needed to capture the mapping (though entities at the map level may become attributes at the concept level)
 
3
Forms design, in which we define the forms needed to keep our concept level records
 
4
Top level physical design, in which we define the store level records needed to hold the information in those forms.
The lower levels of physical design are principally the responsibility of the DBMS developer, though the data base administrator will usually be given choices of design where different implementation options are available.

Not only do we have to specify all these levels, we have to specify the relationships between them, that is, informally how the concept level records capture the map entities, formally how the form level records imply the concept level records, and formally how the form level records are derived from the store level records.

Now, I don’t say that this framework – this four-level analysis – is the correct approach, or even a correct approach. But it’s the best I can manage at the moment. However, if the correct approach is even something like this, then you can see that it’s completely fruitless for us to try to investigate methods of data design until we know what level we are talking about. And you can see the sort of chaos that would result, and in some cases has resulted, from, for instance, applying the map level concept of “entity” (something the enterprise is interested in) at the form level, or functional determinacies at the map level.

But don’t look so worried! Of course much is unknown, little is understood. What else is going to keep you all in work, either researching away at data theory in these hallowed halls, or turning an honest penny designing real systems, until the shadow of comfortable retirement falls across your path, as it does across mine?

There is a time, you know, in any discipline, or any area of research, to nail things down. A time to produce simple, formal theories; to consolidate advances; to establish a firm foundation for future construction. And there is a time to tear things down, to open up a space for research, to cast off the fallacies and artificial constraints that bind us. A time to build and a time to destroy, as the preacher says.[1]

Well, as far as basic theory goes, I’ve tried to build: to make a firm foundation, a formal system with an interpretation of its structures and manipulations. And I’m sure that we can construct a superstructure on that foundation, but only if we clear away what’s there already. To some extent, where I’ve destroyed, I’ve also rebuilt: I’ve shown how to do without essential tables, how to handle nulls, how to exclude quantification and negation from the DML and handle them in data type processing. In some cases, I’ve merely given a framework for rebuilding: I haven’t defined the concrete syntax of a DML, I’ve merely pointed the way to handling views, or to processing bills of material. And in other cases, like Integrity and Design, I’ve mostly destroyed, but given you enough to show that there’s hope of reconstruction.

 

 

Problems of Design

 

There is hope, but the path is hard. Let me discuss a few problems.

Suppose we discover at the map level some sort of integrity constraint: perhaps something very simple, such as:

A Car Ownership is existence dependent on a Person and a Car.
Then we design our concept level records, our form level records, and our store level records. We might find that at the concept level we have multiple records for one person, and multiple records for one car. We might find that at the form level we have records that need to be projected (certainly), and perhaps restricted and joined, perhaps even unioned, to give us those concept level records. Then, to express that integrity constraint (not yet to impose it, but merely to express it) might require quite a complexly expressed view of our form level records.

So we will almost certainly find that we have to express integrity constraints on views (as I’ve hinted already). But then we have to impose each constraint: we have to work out how to maintain it, for example, by Cascade or Restriction deletes. And then we have to work out what this implies at the store level: what rules have to be stated there to implement the constraint maintenance at the form level. I think you can see that trying to handle all this in a generalised way, trying to get some universal replacement for Dr Codd’s two integrity rules, is going to need a lot more work. But I suspect that imposing integrity rules on views will let us handle them by similar mechanisms to those used for updating through views, which doesn’t help you much as I have given no more than a hint of a framework for doing that. But it will probably be better than struggling to impose inter-relational constraints on base relations (whatever base relations are kept or stored).

 

 

Analysing “Part” Entities

 

Let me give an illustration of the oddities that happen as we move from one design level to another, and you“ll see just how delicate the whole game becomes; how much everything depends on what level we’re at. Think about the concept of “part”: we have already seen the role it plays in the TT-DD; we have thought about existence dependencies in terms of parthood, an Option record being part of a complex, non-normal record comprising a Product and all its Options; and we have seen the problems, and some of the solutions, to Bills of Material, where “part” appears as “component”.

As it happens, when we analyse enterprises at the map level we often find “part” or “bill of material” structures. For example, one time period may contain another, one organisation – a company, branch, or department – may contain another, one location may contain another. Let’s take the last case: location. I would never, myself, regard a location as an entity. People who talk about “locations” usually mean geographical areas, or buildings, or parts of buildings. The concept of “location”, when properly analysed, tends to turn out to pertain to access, physical or electronic, or whatever. A building is not really a location, but it is located.

In any event, locations – be they entities or attributes – tend to have this “part” or “containment” structure: access to a location involves, in some way, access to a location of which it is part. Thus, physical access to this lecture room involves access to the Information Technology building; access to the building involves access to the Stratford University campus; and access to that involves access to this wonderful industrial conurbation in which it is located, though not, as I fondly imagined when I took this job, to beautiful, tourist-haunted Stratford upon Avon, which is twenty odd miles away. Funny that.

Now, we saw in the case of Products and Options, that the parthood relationship involved existence dependency. And, in away, so does all parthood. If, perish the thought, this building, or the entire campus, were destroyed, then this room would be destroyed with it. But let us move to the concept level. If we had a record for the campus, and a record for the building, would we want a Cascade delete from one to the other? Not at all! For suppose the University were closed down, and its component parts sold off. This building might become a private nursery school. We wouldn’t, perhaps, want to keep the campus record: it would pertain to a fairly arbitrary area. But we would want to keep the building record.

I want to point out that locations don’t give us simple hierarchies. For instance, we might have towns A, B, and C. A and B could be in the same county, C in another. Yet A could be in one television area, and Band C in another. And A and C could be in one postal area, and B in another. So we really do have something like a bill of material structure, a partial ordering. And we might have to recognise some strange locations, such as the intersection of a county and a television area.

So having a location structure means having some locations that we are interested in, and some that we aren’t interested in, but which have to be there because they’re intersections of other locations. And it means having an “is immediately contained in” relationship, and records to represent that relationship. Although a location is – at the map level – existence dependent on its containing locations, a location record isn’t existence dependent on its immediately containing location records. But immediate location containment records are existence dependent on both their associated location records.

Typically, a location will have a reference, but we might have many kinds of references: geographical names, postcodes, building names, and the like. So we might discover that two location forms – say one for a local authority area, perhaps a county, and one for an ecclesiastical area, a diocese – pertained to the same location. Then the existence of that “location“ would depend on there being at least one form for it (and possibly more than one). Except that if we deleted all those forms and inserted others (perhaps on the revision of boundaries), we might find that we still wanted that location because it was also, or had become, the intersection of two other locations.[2]

 

 

Composition Deletion

 

I won’t even attempt to describe how we might represent all this at the store level. But consider the interesting problem of specifying the delete rule for immediate location containment records. If we delete a location record, say L, what do we do to its containment records? The ones that say x immediately contains L, or L immediately contains x?

I think you’ll find that neither Cascade nor Restriction works. What we need is a “Composition Rule”, something like:

FOR EACH x, FOR EACH y, such that x immediately contains L and L immediately contains y, insert a record showing that x immediately contains y.

After that, we could delete the records pertaining to L. I simplify: we have to consider other possibilities, like the deletion of location records that represented mere intersections of two locations when one of those locations was deleted. Of course, we have to place this rule at the right level: is it driven by deletion of the county, deletion of ilie diocese, or both? Is it at the concept level or the form level? And how do we represent it at the store level? Have I convinced you that design and integrity can be very complex indeed?

 

 

Query Results Affected by Functional Determination

 

Now let’s go back to Functional Determination. You’ll remember that I objected to using functional determinacies in logical design, at least those functional determinacies that derive from something as volatile as business rules, legislation, or reality. Nevertheless, we do have to have somewhere to express such determinacies, and some way to express them. I guess the place to express them is in the appropriate view definitions: they are integrity constraints. I’m more concerned about how to express them.

Suppose we have a business rule that says; each employee shall have only one manager. I’ve said that I wouldn’t exploit that at the concept level; I wouldn’t risk a design having MANAGER_ID in the EMPLOYEE record, just in case the rule changed. So let’s suppose we have a special REPORTING record type, with MANAGER_ID and EMPLOYEE_ID fields. We could, just transiently, while we did want the rule to apply, impose the constraint that EMPLOYEE_ID was a key of REPORTING.

Now, as I’ve said, I wouldn’t allow that rule to influence my logical data base design. And I wouldn’t want users, or programs, to assume it in their queries. In other words, I wouldn’t want it fixed in the interpretation of records, for example, saying that a REPORTING record meant: EMPLOYEE_ID has the manager, MANAGER_ID. I would want it merely to mean: EMPLOYEE_ID has a manager, MANAGER_ID.

But suppose now that a user put the query, “How many managers does such-and-such employee have?”, or “Who are the managers of such-and-such employee?” I would then want the DBMS to answer: Just one, or just this one, but that’s because of the business rule that each employee has only one manager.

A similar need arises in queries that contain logically true or false conditions (usually coded by accident). For example, if a user wanted to list all employees with birthdates on or after 1/1/67 and before 1/1/68, but inadvertently said “on or after 1/1/68 and before 1/1/67”, or, “on or after 1/1/67 or before 1/1/68”. Then I’ld want the DBMS to say: it’s logically true that there aren’t any, or it’s logically true that all employees are like that.

 

 

Assertions

 

So I really do want functional determinacies and other integrity constraints, whatever their origin, to be handled by the DBMS. But this means, as I’ve said, that we would need some sort of logic of assertion, so that we could say; it is asserted by a business rule that ..., or it is asserted by logic that .... And, as I’ve said, this can’t be done in the FOPC, and adding it to the FOPC would give us a horribly complex logic. In any event, logics of assertion, which are one sort of modal logic, are ill-understood: I have them on my “to do” list. But I ask myself: what can we do in our theory in advance of getting the right logic of assertion? And my answer is: make room in the logic we’ve got. Perhaps, if instead of using the full FOPC we used only some part of it, possibly just the conjunctive part, getting rid of quantification and negation, we might even be able to slot in our modal assertion operators without creating an over-complex system.

So we leave a little space for expansion, just as we did by not insisting on other than transiently polynormal records, and – in the meantime – we can enjoy the simplicity of our current theory. I think that’s important. One of the major problems of current relational theory is that it’s right up to its boundaries. There’s no space within the theory for solving new problems: so extra mechanisms, whether collection within scope (GROUPing), or Outer Joins, or solutions to Bills of Materials or problems of time, all tend to get specified by unsystematic, ad hoc, extensions to the theory.[3]

 

 

A Miscellany of Problems

 

I’m now going to discuss some little problems raised by my young friend, Douglas, who’s actually been exercising his mighty brain on data base theory.

 

 

Design of One-One Relationships

 

His first problem is of one-to-one relationships, two-way functional determinacies we might say. Douglas has spotted two things: firstly that my method of logical data analysis might lead to different types of records with common keys, and secondly that these sorts of relationship are supposed to give design problems, according to Mr Date, anyway.[4]

I’ll take Mr Date’s example: we have one type of record with Supplier Number, Name, and City; and another with Supplier Number and Status. Supplier Number is the primary key of each type, and is a foreign key that links to the other. So they both have to be present, and there’s no way to insert one and then the other, unless we postpone the integrity check. Mr Date has another example with an Invoice record that has to contain a Shipment Number and a Shipment Record that has to contain an Invoice Number: Shipments and Invoices are one-one related.

Now I like these problems, because solving them is easy as falling off a log. I’ll tackle the second first. I define a Pseudo-Invoice record, just like Mr Date’s Invoice record but without the Shipment Number; and a Pseudo-Shipment record, just like the Shipment record but without an Invoice number. And I define an ISLink record with Invoice Number and Shipment Number columns, each of them a key. And then I can say: Mr Date’s Invoice Record is a view of the natural join of Pseudo-Invoice and ISLink; and his Shipment record is a view of the natural join of Pseudo-Shipment and ISLink. Clean and beautiful: and, if we ever relax the rule and make the relationship of Invoices and Shipments one-to-many; or many-to-one, or many-to-many, we have no problems.

We could use the same trick for the two different Supplier records, but a record with two occurrences of the same Supplier Number seems a bit odd. What we do there is have the two records – call them Pseudo-Supplier and Pseudo-SStatus – and then a Supplier record, one with Number, Name, and City, or one with Number and Status, is the appropriate projection of the natural join of a Pseudo-Supplier and a Pseudo-SStatus record.

I really do want to stress that we mustn’t get led astray by all this “entitythink”. Take an entity like “employee”. We imagine that because we have such entities, we ought to have employee records, and those records ought to be logically basic, and they ought to be stored. Oh dear, oh dear. We could have – probably should have – a logically basic PERSON record, and a logically basic COMPANY record, and a logically basic EMPLOYMENT CONTRACT record existence dependent on them. And then an EMPLOYEE record is likely to be a projection of the natural join of a PERSON record and an EMPLOYMENT CONTRACT record where the COMPANY field in the latter is the COMPANY ID of the COMPANY record pertaining to our own company.

And that raises the possibility that one employee might have more than one EMPLOYEE record: if they have more than one contract of employment with us, for instance. Well, we need to contemplate that possibility. But notice that, even then, we probably wouldn’t say that one person was each of two or more employees. The criterion of identity of “employee” would be: having one and the same PERSON ID. Whereas the criterion of application would be: having at least one contract of employment with us.

I must stop discussing data base design: it’s a life’s work; I can’t fit it in at the end of a course on data base theory.

 

 

Counts and Tallies

 

Douglas’s second question: did I really say, when I was discussing duplicate records, that I had doubts about count fields? Yes I did. The problem I have – I’m not certain whether I’ll even convince you it is a problem – is that counts (but not all numbers) really do seem to me to be ambiguous abbreviations of tallies. And tallies give us polynormal records. So I worry about even transiently polynormal records that contain collections of counts: wouldn’t they be collections of collections of tallies?

We had a choice, remember, if we could bear to contemplate those apparently duplicate records, between:

BTI:     CLASS   ITEM
 
         Bed     Std Dbl
         Bed     Std Dbl
and
BTI:     CLASS   ITEM    CT
 
         Bed     Std Dbl  2
Consider the latter: 2 whats? I mean: the count is 2 under what criterion of identity? Let’s say: is the same item as. So, somehow, this criterion of identity is important to us, but we use it for nothing but counting. We don’t use it (except to that extent) to distinguish one item from another.

Suppose we get one more of that sort of item. I suppose there isn’t anything obviously wrong with going to:

BTI:     CLASS   ITEM    CT
 
         Bed     Std Dbl  2
         Bed     Std Dbl  1
I mean: there’s nothing logically wrong. Yet, if we got two more rather than one more we logically couldn’t go to:
BTI:     CLASS   ITEM    CT
 
         Bed     Std Dbl  2
         Bed     Std Dbl  2
No duplicate records, you see. If we did use tallies, we would be all right (two tallies might look alike – have the same number of tally marks – but they are different tallies). Well, you’ll say, we ought to add to the count field rather than insert a record. Yes, but isn’t that reminiscent of that DIARY table: when a period is allocated to a task, and the subsequent period is then allocated to the same task, we extend the period rather than inserting a record. But – as we saw – DIARY was best conceived as a view. And I suspect that records with count fields are best conceived as views (then the ambiguity of counts is easily explainable, and acceptable).

In fact, the table with the count is very like a view of the table with apparently duplicate records. And perhaps that’s how we should think about our data base design. Or perhaps we should actually keep tallies, and therefore polynormal records.

But then, should the underlying records admit – as it seems – duplicates? Obviously, those who allow count fields in basic records can’t object to any and every use of criteria of identity merely to distinguish things sufficiently to allow counting. So how can they object to this use of a criterion of identity to distinguish rows just enough to be able to count them?

I don’t know; I really don’t know. But I can see why I don’t know: we haven’t got the problem formulated clearly enough.

 

 

Nested Qeuries

 

So, to Douglas’s third and final question, which is about nested queries. Douglas likes nested queries; and I don’t. But he has come up with a very neat argument for them, and I don’t know who he got it from: Father Warden probably. It’s an argument based on good language design practice: I’ll use SQL as the language.

As you know, SQL uses table names in its FROM clause: we can say “SELECT ... FROM P, Q”, where “P” and “Q” are table names. Now Douglas argues this way: he says that if a language expression allows numbers, as in “SQUARE(5)” then instead of the name of the number 5, we should be able to use an expression for a number, like “3+2”, or “18-n”, if “n” were a numeric variable. And, in general, if a language allows in some context the name of a sort of thing, then it ought equally to allow a title for that sort of thing, which is to say, an expression that evaluates to that sort of thing.

And, he concludes, with apparently impeccable logic, a nested query is merely an expression that evaluates to the sort of thing that a relation name names: a table. So, how could I have the gall to exclude nested queries?

Well, I must say, this is a very neat argument, and if we were talking about the algebra it would be an absolutely knockdown argument; but we’re not. As it happens, we could remove the SQL FROM clause entirely. Suppose we have a query that gathers together records from relations P1, P2, and so on. Then we could – in all the other clauses of the SELECT – say “P1.COLA”, “P2.COLB”, and so on; or, if we had to use two or more copies of a relation, say “1ST_P1.COL”, “2ND_P1.COL”, and so on. And, of course, it makes no difference whether we speak of current SQL where records are “gathered” by Cartesian product, or Platoclastian SQL, where they’re gathered by Combination.

I can see disadvantages to this approach: it’s very wordy, for one thing. But then I want to say: factoring out these so-called table names – record type names – into the FROM clause is no more than a convenience to reduce the wordiness. And I want to say: these so-called table names are really variabies taking tuples as values, that is, ranging over tuples. So you could no more replace one of them with an expression for a table than you could replace “x” by an expression for a person in: FOR EACH x, x has a mother.

But I don’t think that I’ve got a very good case. I mean, Douglas might say – quite properly – that the relation names aren’t merely variables, like “x,” untyped variables. They are actually common names: “P1” names any row of a certain type; “EMPLOYEE” names any employee record. Consequently a “table-valued expression”, like “SELECT P1.EMPNAME WHERE P1.SALARY > 30000”, names, or at least titles, any row of a certain projection of a certain restriction of a record of a certain type.

And I’m rather inclined to let it go at that, and allow young Douglas to try coding queries with expressions like:

WHERE (SELECT P1.EMPNAME WHERE P1.SALARY > 30000).EMPNAME = "Smith"
Douglas doesn’t look very happy with my answer; and neither am I. But we’re getting near to the end of our course. (Did I hear, “Thank goodness”?) And I don’t want you to think that all the problems have been solved; I don’t want you to think that there’s a fount of all data base theory knowledge. Not even me (he says modestly).

 

 

A Problem with Normal Records

 

In my next lecture I’m going to tackle a topic that you thought we’d already beaten to death: normality, or first normal form. This is the sample problem that I want you to think about, a problem in meta-data, as it happens: data about data. I got it from Father Warden.

He wants to manipulate functional determinacies, so he wants to keep them in a data base. A statement of such a determinacy looks like this:

(K1, K2, ...Km) keys (K1, K2, ...Km, C1, C2, ...Cm)
And we can simplify to:
(K1, K2, ...Km) ® (C1, C2, ...Cn)
And even to:
(K1, K2, ...Km) ® C1
(K1, K2, ...Km) ® C2
...
(K1, K2, ...Km) ® Cn
But what we’re left with aren’t exactly normal records; they are, so to speak, propositions formed from an indefinitely polyadic predicate. And the question is: how should we represent such statements in a data base?

 

SITE HOME PAGE

We could, as it happens, use polynormal records with a many-valued “K” column (the order of the Kj doesn’t matter); that is, we could exploit non-transient polynormality. But should we do that?

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XXVI (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XXVII

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.