|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl16 |
|
PLATOCLAST Lecture XVI |
||
|
|
||
|
|
I had prepared a lecture for you today on Negation, and how to handle our last primitive relational operator, Difference. But over the weekend I was sitting chatting with Judy in the Angel, drinking Gin-and-Tonic and Brown Ale (yes, respectively), when in wanders Douglas. And, of course, we get to talking data base theory, which is pretty highbrow for the Public at the Angel. Anyway, Douglas says that several times recently I’ve been getting by with handwaving in the general direction of “data type processing”, and isn’t it about time I came clean? A few nods. OK then, you get the lecture I hadn’t prepared. |
|
|
|
||
|
|
Data Independence |
|
|
|
I guess I hope a lot of you have read Dr Codd’s 1970 paper,[1] and you may have wondered why, although he talks a lot about something called “Data Independence”, I haven’t even mentioned the subject. Well, Ill start now. The first thing to get out of the way is Representation Independence. As you know, in a data base, or in a program in some language, we can represent the same data in different ways. For instance, we can represent numbers in fixed point or floating point, and base two or base ten. Representation Independence means that our DML doesn’t rely on any particular representation. So a query, or a program with such a query imbedded in it, that referred to some numeric fields, wouldn’t have to be changed just because we changed the data representation from decimal to binary say. Obviously Representation Independence is a good thing: programs and queries don’t have to be rewritten, and users don’t get troubled with silly distinctions. Do I care whether my salary is recorded in binary or decimal? As long as it’s not what do you call it? smallint. I should add that some people confuse data representations with data types, and some programming languages and DMLs do as well. But that’s just a silly mistake, thinking that decimal numbers are a different type of data from binary numbers. Of course, character strings and numbers really are different data types: they’re different even if they have the same representation (numbers represented using the characters “0” to “9”, minus, and decimal point, for instance). |
|
|
|
||
|
|
Data Encapsulation |
|
|
|
How do we get Representation Independence? We get it by what’s called “Data Type Encapsulation”. This is the way encapsulation works. The user, or the programmer, and the DML, and even in some cases the programming language, know that, for instance, “Complex Number” is a data type. But they do not know how complex numbers values of that data type are represented. It’s very like them being what I’ve referred to as “cryptic”. But what do they know about these complex numbers? Well they know the names of some functions that apply to them. For instance, they might know that there are these functions:
There’s another sort of thing a user, or a DML, might need to know: some facts about complex numbers, like whether they’re ordered or not (they’re not), or which of them is the identity for certain operations (such as addition): I’ll call these facts “invariants, constant truths if you like. The user needs to know these things in order to understand what complex numbers are, the DML needs them to optimise the performance of queries. So what’s hidden? The representation: are complex numbers represented as Real and Imaginary pairs, or as vectors (modulus and direction), or some in one way and some in another, or all in both? Representation can be important for performance, but performance requirements change (even if we get .them right in the first place); so we want to be able to change representation for better performance without having to modify all our programs, queries, and users’ heads. Most of the work done on representation independence and data encapsulation is in the area of program language design, rather than query design. So it doesn’t belong in my course: I don’t want to steal the thunder of your other lecturers, and anyway that’s not what I’m paid for. But it overlaps data base theory, and because the data base theoreticians haven’t paid it enough attention, all those programming people with their whacky ideas the latest is called “Object Orientation have started invading our pure and elegant world (well, fairly pure and not totally scruffy). |
|
|
|
||
|
|
Token Processing |
|
|
|
The key to data encapsulation is what I’ll call tokens”. Imagine that all the values in a record are represented as tokens, where a token comprises a length and a chunk of data of that length. The length might be explicit, so that the token physically comprised a length prefix and a data suffix (of that length); or some columns might be implicitly associated with a specific length, for instance, all values in EMPNO might be data-only tokens of length 4 bytes. What can the DML do with tokens? The purest answer is: nothing but pass them to, or receive them from, the functions that process them. And, of course, shuffle them around: move and copy them. But we could relax that a little: we could allow the DML to compare tokens for identity (two tokens are identical when they are of the same length and their token data is of the same bit configuration). And but this is an implementation detail the DML might sometimes be required to prefix an explicit length to a fixed length token to pass it to a function, or remove an explicit length on return of a token. How would this work? Well, the DML includes what we might term “native” and “foreign “ operators. Its native operators the operators it understands and processes for itself would include, say, “;” for Combination, “/” for restriction, “//” for extension, “\” for projection, “#” for collection, and so on. Its foreign operators which it would recognise, but not understand (not be able to process} would include, for instance, “LEFT”. When the DML found “LEFT” in an expression, it would look it up in a dictionary, which might say:
So the DML can work out whether the LEFT is a character string LEFT or a bit string LEFT (technically, the term “LEFT” is said to be overloaded when such an ambiguity has to be resolved). Suppose it’s a character string left. Then the dictionary entry says: to get the result of the operation call the function “CHARLEFT”, pass the operand tokens, and receive the result token (which will be of data type character string). Notice that the DML has no idea whether the first operand token is the (disk or storage) address of the character string, or the character string itself. Indeed, apart from knowing the type name, charstring”, it doesn’t know what a character string is. You can see how this approach makes the DML very open-ended. It is easy to add extra data types, and operations defined on those data types. I should mention again that there will have to be invariant information (probably in the dictionary). For instance, suppose that LEFT (c, n) returns the leftmost n characters of the character string c. If an index is kept on a column of type charstring then it might be important to the optimiser to know this:
Notice that token identity is a useful performance option. We could simply put “=” and “º” in the dictionary (they would be very overloaded), and the DML could find and call the appropriate comparison routine. It would return a truth value: that’s a data type that, at least in this case, would not be encapsulated from the DML. In a restriction, “true” will cause a row to be retained, “false” will cause a row to be dropped, so the DML must be able to tell which is which. However, we would certainly expect that “t = t” would be true for data denoted by any token, “t”. And we might (perhaps only for some columns, or some data types) ensure that two different tokens never denoted identical data; otherwise the DML first compares for token identity, and only if that fails does it call the identity routine of the appropriate data type. As I say, we don’t have to allow direct token identity comparison, but it helps. What implications does this “encapsulated data type” approach have? When we design a data base we need to specify the data type (the domain) of each column. It might be an entirely new data type, such as “Homework Assignment”, or a fairly standard data type, such as “Month”. We need to develop a representation for it, or choose one from what we already have available. We will probably need to define, and possibly develop, some functions on our various data types. We may need to specify some invariants. Much of this is recorded in the data type dictionary that the DML has to reference. In practice, we would expect a DML to come with a starter dictionary and associated functions for numbers, character strings, and so on. Right, let’s look at all that stuff I dismissed to “data type processing”. Intuitively, data type processing has to cover the ordinary functions we want on data types: Addition, Subtraction, Multiplication, and so on for numeric data types; Concatenation, Substringing, and so on for string data types; equality and inequality for (almost) all data types, and greater and less for ordered data types. My three extra burdens are:
|
|
|
|
||
|
|
Handling Nulls and Absent Values |
|
|
|
To handle null or absent values we need some sort of improper data type (that’s easy enough, in a way) and some improper functions (which is where it gets a little harder). We would expect an improper data type to include a proper data type, in the sense that the type “Possibly null integer” would include “integer”. And we would expect an improper function to invoke a proper function. Thus “improper addition” of two possibly null integers would probably check whether both its operands were proper, and if so pass them on to proper addition. So in handling nulls and absent values, we will just assume that functions on proper, present values are performed in the usual way. The choices we have in handling nulls seem to come down to:
Personally I prefer the second approach: “suppressed nullity”. This maximises information, but it’s far less safe. The principle is, and I state it roughly and informally because it requires careful implementation for each operand position in each function:
Another guideline but only a guideline is:
Well, I’ve given you these few guidelines; by no means a complete or final answer. But I’m sure that the form of implementation is correct: in data type handling, not in the DML. Notice that the introduction of improper data types and improper functions gives us a chance (as with =”) to make specific provision for nulls, rather than struggling to work out how functions originally defined on proper values should be force-fitted to handle nulls. Now let’s consider absent values. Remember that we’re thinking principally about expressions in restriction conditions and extensions. It seems to me that we should go for “infectious absence”. Take an extension like Calculate PRICE as QUANTITY times UNIT _PRICE”. I think if a row has no QUANTITY value, then it should get no PRICE extension. Here’s an argument based on an invariant. Suppose we have the restriction condition “A = B+C”, and C is absent. Well we have the invariant that A = B+C if and only if A-B = C. So, as the latter amounts to equating a present to an absent value, so should the former. And that means B+C should be absent (infectious absence). We may presume the user had the option to say “A ]= B+C” if that’s what they wanted. Notice that if we use infectious absence and suppressed nullity the user has the option of using the “THE” function to make an absent value null (and get suppression), or the “PROP” function to make a null value absent and get infectiousness. Pretty. I like that. There are some rather fine considerations with respect to tokens for nulls and absent values. An absent value should, I suppose, lack a token. Almost by definition, if it has a token then the table has been padded, and in that case I think the DML should recognise pad tokens. I’m less certain about tokens for nulls. If more than one kind of null is allowed in a column (not recommended!) then I’m pretty confident that the DML shouldn’t recognise null tokens. Nulls would then be fully encapsulated. If, however, only one sort of null is allowed in any column, then there might be a case to allow the DML to recognise the null token. I suspect it doesn’t make much difference: assuming a competent DML and data type processing design, it merely amounts to labelling some well demarcated null-handling module(s) as part of the DML or part of the data type processing. On the whole, I would go for a single, standard null token known to the DML (it facilitates implementation of “THE” and “PROP”). I pause here to agonise a bit. I’m not at all certain about moving identity wholly or partially into the DML. The more overloaded a function gets, the more tempted we are to import it into the DML: but that may be because we are constantly misled by implementation considerations. Of course, if we process tokens in some sensible way, we may well get to implement identities in a single module. It’s not obvious that that module should be regarded as part of the DML. |
|
|
|
||
|
|
Accumulation by Reduction of Collections |
|
|
|
Let’s think about accumulation functions: they’re the ones that apply to a collection and return a single value. We’ve thought a bit about “PRESENT” and “COUNT”. Others might be “MAXIMUM”, “MINIMUM”, “TOTAL” or “SUM”, and “AVERAGE”. In general, they work like this (I’m going to give you an algorithm as a special treat):
The default set in “result” will depend on fn (in practice a special call to fn might be needed to find it, or it might be the identity value of fn as defined in an invariant in the dictionary). Think now about the accumulation functions we have mentioned:
Are you tempted to try and treat PRESENT and COUNT as parts of the DML? I think we should resist that temptation, which arises as for identity from their being very overloaded. They apply to every data type, don’t they? This is going to take more than one lecture! I haven’t got round to modalities yet. And the other thing I haven’t discussed is what we do when we want first to collect some values in a column and then, say, TOTAL them. You see, we might have four rows with the values 1,2, 2, and 3. Our collection, it seems, would comprise 1, 2, and 3, with a TOTAL of 6. Not very satisfactory. In any event, my thanks and perhaps yours to Douglas for showing such interest and enthusiasm. But perhaps next time you want me to change course at your whim, Douglas my friend, you could buy the drinks. And you might even do me the favour of staying behind briefly to explain the unexpected disappearance of yourself and Judy during my temporary absence (admittedly somewhat extended) from the barroom of the Angel. Yes, thank you; I got home quite without incident. |
|
|
|
||
|
|
On Some Fallacies |
|
|
||
|
|
(Professor Platoclast here responds to a question about the mistakes that strict data typing avoids.) What I’m asked is whether it is an advantage that strict data typing avoids such horrors as multiplying two currency fields, or I suppose to take an example from Mr Date, adding two date-and-time fields.[2] Well, strict data typing does protect against horrors, like asking whether a currency amount is greater than a name. But the examples I started by quoting aren’t horrors at all, and data typing doesn’t prevent them, thank goodness. Suppose we have a number of currency amounts: obviously we can find their mean by adding them all together and dividing by their count, and the answer is a currency amount: five pounds three shillings and fourpence, say.[3] But equally obviously it seems to me we can find their variance by adding all their squares together, dividing by their count, and subtracting the square of their mean. The answer would be a square currency amount: of course, because its positive square root would be the standard deviation, which is a currency amount. So there just isn’t anything wrong with multiplying currency amounts together: I guess the confusion is caused by the result not being itself a currency amount, but so what? There’s nothing wrong in multiplying feet by feet and getting square feet, even though square feet aren’t feet. In much the same way, Mr Date argues that we have to treat dates and intervals as of the same type, because if D1 and D2 are dates then (D1+D2) doesn’t make sense, whereas (D1+D2)/2 does make sense. Now, surely, you see the fallacy: (D1+D2) makes perfectly good sense, but it’s not a date. It’s a what shall we say? date-pair, that is, d is the same date-pair as e if there are dates, D1, D2, D3, and D4, such that d is D1 with D2 (i.e. D1+D2) and e is D3 with D4 (i.e. D3+D4) and D1-D3 is the same signed interval as D4-D2.[4] |
|
|
|
||
|
Having knocked down that argument, we can go back to Mr Date’s initial observation that “at first sight it might appear that intervals and dates are fundamentally different kinds of objects”. It might appear so, but don’t you be fooled: it does appear so, and it jolly well is so. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||