(U03)

www.btinternet.com/~adrian.larner/database/pcl16

PLATOCLAST
ON DATA

Lecture XVI
Datatypes

 

 

I had prepared a lecture for you today on Negation, and how to handle our last primitive relational operator, Difference. But over the weekend I was sitting chatting with Judy in the Angel, drinking Gin-and-Tonic and Brown Ale (yes, respectively), when in wanders Douglas. And, of course, we get to talking data base theory, which is pretty highbrow for the Public at the Angel. Anyway, Douglas says that several times recently I’ve been getting by with handwaving in the general direction of “data type processing”, and isn’t it about time I came clean? A few nods. OK then, you get the lecture I hadn’t prepared.

 

 

Data Independence

 

I guess – I hope – a lot of you have read Dr Codd’s 1970 paper,[1] and you may have wondered why, although he talks a lot about something called “Data Independence”, I haven’t even mentioned the subject. Well, I’ll start now. The first thing to get out of the way is Representation Independence. As you know, in a data base, or in a program in some language, we can represent the same data in different ways. For instance, we can represent numbers in fixed point or floating point, and base two or base ten. Representation Independence means that our DML doesn’t rely on any particular representation. So a query, or a program with such a query imbedded in it, that referred to some numeric fields, wouldn’t have to be changed just because we changed the data representation – from decimal to binary say.

Obviously Representation Independence is a good thing: programs and queries don’t have to be rewritten, and users don’t get troubled with silly distinctions. Do I care whether my salary is recorded in binary or decimal? As long as it’s not – what do you call it? – smallint.

I should add that some people confuse data representations with data types, and some programming languages and DMLs do as well. But that’s just a silly mistake, thinking that decimal numbers are a different type of data from binary numbers. Of course, character strings and numbers really are different data types: they’re different even if they have the same representation (numbers represented using the characters “0” to “9”, minus, and decimal point, for instance).

 

 

Data Encapsulation

 

How do we get Representation Independence? We get it by what’s called “Data Type Encapsulation”. This is the way encapsulation works. The user, or the programmer, and the DML, and even – in some cases – the programming language, know that, for instance, “Complex Number” is a data type. But they do not know how complex numbers – values of that data type – are represented. It’s very like them being what I’ve referred to as “cryptic”.

But what do they know about these complex numbers? Well they know the names of some functions that apply to them. For instance, they might know that there are these functions:

REAL(c), returning a real number
IMAG(c), returning a real number
MOD(c), returning a real number
And they might have functions that create complex numbers, such as COMPLEX(r,s), which from real numbers r and s returns the complex number r+si. And any DML front end report generator will need a function like DISPLAY(c) to get a character representation of the complex number c.

There’s another sort of thing a user, or a DML, might need to know: some facts about complex numbers, like whether they’re ordered or not (they’re not), or which of them is the identity for certain operations (such as addition): I’ll call these facts “invariants”, constant truths if you like. The user needs to know these things in order to understand what complex numbers are, the DML needs them to optimise the performance of queries.

So what’s hidden? The representation: are complex numbers represented as Real and Imaginary pairs, or as vectors (modulus and direction), or some in one way and some in another, or all in both? Representation can be important for performance, but performance requirements change (even if we get .them right in the first place); so we want to be able to change representation for better performance without having to modify all our programs, queries, and users’ heads.

Most of the work done on representation independence and data encapsulation is in the area of program language design, rather than query design. So it doesn’t belong in my course: I don’t want to steal the thunder of your other lecturers, and anyway that’s not what I’m paid for. But it overlaps data base theory, and because the data base theoreticians haven’t paid it enough attention, all those programming people with their whacky ideas – the latest is called “Object Orientation” – have started invading our pure and elegant world (well, fairly pure and not totally scruffy).

 

 

Token Processing

 

The key to data encapsulation is what I’ll call “tokens”. Imagine that all the values in a record are represented as tokens, where a token comprises a length and a chunk of data of that length. The length might be explicit, so that the token physically comprised a length prefix and a data suffix (of that length); or some columns might be implicitly associated with a specific length, for instance, all values in EMPNO might be data-only tokens of length 4 bytes.

What can the DML do with tokens? The purest answer is: nothing but pass them to, or receive them from, the functions that process them. And, of course, shuffle them around: move and copy them. But we could relax that a little: we could allow the DML to compare tokens for identity (two tokens are identical when they are of the same length and their token data is of the same bit configuration). And – but this is an implementation detail – the DML might sometimes be required to prefix an explicit length to a fixed length token to pass it to a function, or remove an explicit length on return of a token.

How would this work? Well, the DML includes what we might term “native” and “foreign “ operators. Its native operators – the operators it understands and processes for itself – would include, say, “;” for Combination, “/” for restriction, “//” for extension, “\” for projection, “#” for collection, and so on. Its foreign operators – which it would recognise, but not understand (not be able to process} – would include, for instance, “LEFT”. When the DML found “LEFT” in an expression, it would look it up in a dictionary, which might say:

charstring : = LEFT (charstring, count)
CHARLEFT
 
bitstring : = LEFT (bitstring, count)
BITLEFT
It means: “LEFT” must occur followed by a parenthesised character or bit string and non-negative integer pair. If this is not so, it is a syntax error. The DML knows the data type name (e.g. “charstring”) of each column in the data base (roughly, it’s the domain of the column). So it can work out, from the dictionary, the data types of expressions. Thus it knows that LEFT of a bit string column and a non-negative integer column (or literal, perhaps) is (i.e. returns) a bit string.

So the DML can work out whether the LEFT is a character string LEFT or a bit string LEFT (technically, the term “LEFT” is said to be overloaded when such an ambiguity has to be resolved). Suppose it’s a character string left. Then the dictionary entry says: to get the result of the operation call the function “CHARLEFT”, pass the operand tokens, and receive the result token (which will be of data type character string).

Notice that the DML has no idea whether the first operand token is the (disk or storage) address of the character string, or the character string itself. Indeed, apart from knowing the type name, “charstring”, it doesn’t know what a character string is.

You can see how this approach makes the DML very open-ended. It is easy to add extra data types, and operations defined on those data types. I should mention again that there will have to be invariant information (probably in the dictionary). For instance, suppose that LEFT (c, n) returns the leftmost n characters of the character string c. If an index is kept on a column of type charstring then it might be important to the optimiser to know this:

LEFT (c, n) does not follow c in alphabetic (i.e. index) order.
Such information is essential to a sound choice of access: shall the index be used or not, given for example that the query has a restriction condition like “col is between LEFT (c, 3) and c”.

Notice that token identity is a useful performance option. We could simply put “=” and “º” in the dictionary (they would be very overloaded), and the DML could find and call the appropriate comparison routine. It would return a truth value: that’s a data type that, at least in this case, would not be encapsulated from the DML. In a restriction, “true” will cause a row to be retained, “false” will cause a row to be dropped, so the DML must be able to tell which is which.

However, we would certainly expect that “t = t” would be true for data denoted by any token, “t”. And we might (perhaps only for some columns, or some data types) ensure that two different tokens never denoted identical data; otherwise the DML first compares for token identity, and only if that fails does it call the identity routine of the appropriate data type. As I say, we don’t have to allow direct token identity comparison, but it helps.

What implications does this “encapsulated data type” approach have? When we design a data base we need to specify the data type (the domain) of each column. It might be an entirely new data type, such as “Homework Assignment”, or a fairly standard data type, such as “Month”. We need to develop a representation for it, or choose one from what we already have available. We will probably need to define, and possibly develop, some functions on our various data types. We may need to specify some invariants. Much of this is recorded in the data type dictionary that the DML has to reference. In practice, we would expect a DML to come with a starter dictionary and associated functions – for numbers, character strings, and so on.

Right, let’s look at all that stuff I dismissed to “data type processing”. Intuitively, data type processing has to cover the ordinary functions we want on data types: Addition, Subtraction, Multiplication, and so on for numeric data types; Concatenation, Substringing, and so on for string data types; equality and inequality for (almost) all data types, and greater and less for ordered data types. My three extra burdens are:

Null and absent values
Operations on multiple values (after collection)
Modalities
I’ll use the expression “Proper Data Type” to mean a data type, such as “integer”, or “charstring”: an ordinary data type, devoid of nulls. And likewise I’ll use “Proper Function” to mean a function that takes input operands of proper data types. But even a proper function might return an improper value (or return no value, return an absent value): arithmetic division does that when the divisor is zero.

 

 

Handling Nulls and Absent Values

 

To handle null or absent values we need some sort of improper data type (that’s easy enough, in a way) and some improper functions (which is where it gets a little harder). We would expect an improper data type to include a proper data type, in the sense that the type “Possibly null integer” would include “integer”. And we would expect an improper function to invoke a proper function. Thus “improper addition” of two possibly null integers would probably check whether both its operands were proper, and if so pass them on to proper addition. So in handling nulls and absent values, we will just assume that functions on proper, present values are performed in the usual way.

The choices we have in handling nulls seem to come down to:

Stop the query when we find it’s asked for a function on a null.
 
Carry on, but ensure the user knows.
 
Carry on regardless.
I don’t want to discuss either stopping or signalling that we’ve hit a null. How to do those things is a fine decision in query language design. I want to discuss how to carry on. There are two major schools of thought here, and perhaps our DBMS needs to allow for either, and perhaps the decision has to be under the control of the user. In any event, the first approach (which we see in current SQL) is what I’ll call “infectious nullity”. If an input operand of a function is null, the function returns null. Here, of course, we’re talking about functions not defined on nulls, like addition or concatenation. If we define a function on nulls (for instance the “=” identity) then we’ve said how we want it to work. The advantage of infectious nullity is safety: the user might get a result absolutely packed with nulls, and returning no useful information at all. But at least they won’t be misled. Infectious nullity minimises information.

Personally I prefer the second approach: “suppressed nullity”. This maximises information, but it’s far less safe. The principle is, and I state it roughly and informally because it requires careful implementation for each operand position in each function:

Maximise information, where possible by treating a null as the identity value of the function being performed (more strictly, the identity value at the operand position).
Thus, for instance, if k is an integer and n is null, k+n returns k: n is treated as zero, which is the identity value of addition (for each x, x=x+0). Likewise multiplying by a null is equivalent to multiplying by the number 1 (the identity value of multiplication). Things get harder for subtraction and division: they have right (second operand) identities of zero and one respectively, but no left identities: I would use zero and one regardless. However, all these operations should return null when both operands are null.

Another guideline – but only a guideline – is:

Treat nulls in such a way that they don’t violate the invariants of the data type.
This can be important in that difficult area of ordering. Where (if anywhere) should nulls come when, for instance, we sort the data? Real numbers, you will surely recall, have a total ordering: given any two numbers, one of them is greater than the other. So where does a null fit in? My recommendation is: make the null “the greatest negative number”, so it’s less than zero but greater than any negative number. That makes the mathematicians wince: there is no such number. Quite.

Well, I’ve given you these few guidelines; by no means a complete or final answer. But I’m sure that the form of implementation is correct: in data type handling, not in the DML. Notice that the introduction of improper data types and improper functions gives us a chance (as with “=”) to make specific provision for nulls, rather than struggling to work out how functions originally defined on proper values should be force-fitted to handle nulls.

Now let’s consider absent values. Remember that we’re thinking principally about expressions in restriction conditions and extensions. It seems to me that we should go for “infectious absence”. Take an extension like “Calculate PRICE as QUANTITY times UNIT _PRICE”. I think if a row has no QUANTITY value, then it should get no PRICE extension.

Here’s an argument based on an invariant. Suppose we have the restriction condition “A = B+C”, and C is absent. Well we have the invariant that A = B+C if and only if A-B = C. So, as the latter amounts to equating a present to an absent value, so should the former. And that means B+C should be absent (infectious absence). We may presume the user had the option to say “A ]= B+C” if that’s what they wanted.

Notice that if we use infectious absence and suppressed nullity the user has the option of using the “THE” function to make an absent value null (and get suppression), or the “PROP” function to make a null value absent and get infectiousness. Pretty. I like that.

There are some rather fine considerations with respect to tokens for nulls and absent values. An absent value should, I suppose, lack a token. Almost by definition, if it has a token then the table has been padded, and in that case I think the DML should recognise pad tokens. I’m less certain about tokens for nulls. If more than one kind of null is allowed in a column (not recommended!) then I’m pretty confident that the DML shouldn’t recognise null tokens. Nulls would then be fully encapsulated. If, however, only one sort of null is allowed in any column, then there might be a case to allow the DML to recognise the null token. I suspect it doesn’t make much difference: assuming a competent DML and data type processing design, it merely amounts to labelling some well demarcated null-handling module(s) as part of the DML or part of the data type processing. On the whole, I would go for a single, standard null token known to the DML (it facilitates implementation of “THE” and “PROP”).

I pause here to agonise a bit. I’m not at all certain about moving identity wholly or partially into the DML. The more overloaded a function gets, the more tempted we are to import it into the DML: but that may be because we are constantly misled by implementation considerations. Of course, if we process tokens in some sensible way, we may well get to implement identities in a single module. It’s not obvious that that module should be regarded as part of the DML.

 

 

Accumulation by Reduction of Collections

 

Let’s think about accumulation functions: they’re the ones that apply to a collection and return a single value. We’ve thought a bit about “PRESENT” and “COUNT”. Others might be “MAXIMUM”, “MINIMUM”, “TOTAL” or “SUM”, and “AVERAGE”. In general, they work like this (I’m going to give you an algorithm as a special treat):

stop := no
result := default
work := null
 
DO FOR EACH Value, v, in the Collection
   CALL fn REFER (result, v)
      SET (result, stop) RESET (work)
 
WHILE NOT stop
The algorithm repeatedly calls the function, fn, passing as input the previous or default “result” and a value, “v”, from the collection; the output goes back in “result”. Optionally, fn may force termination by setting “stop” to “yes”. An initially null “work” variable is provided for fn if needed.

The default set in “result” will depend on fn (in practice a special call to fn might be needed to find it, or it might be the identity value of fn as defined in an invariant in the dictionary).

Think now about the accumulation functions we have mentioned:

PRESENT has a fn that simply sets “yes” in “result” (default “no”), and also sets “stop” to “yes”.
 
COUNT has the fn Increment: it adds 1 to “result” (default 0).
 
MAXIMUM has the fn, say, Max. On first entry, with “result” default null, “result” is set to “v”; thereafter “result” is set to the greater of “result” and “v”.
 
MINIMUM you can work out for yourselves now.
 
TOTAL (SUM) has the fn Add: it adds “v” to “result” (default 0).
 
AVERAGE, as I think of it, works not on numbers but on ordered pairs of numbers, like 5/2, let’s say s/c. Its default “result” is 0/0 (which will give null as soon as we attempt to convert it to a single number). Its fn is a Pair Add. Given inputs (“result” and “v”) of s/c and t/d it returns (to “result”) (s+t)/(c+d). Any input operand that is not an ordered pair but a single number is first coerced: v becomes v/1.
Notice that with this algorititm, or any equivalent operation, we have the option of a language in which we admit these accumulation functions, PRESENT, COUNT, and so on, or we could have a general reduction operation and admit only the functions represented by fn. Thus instead of saying, for instance:
COUNT (values)
MAXIMUM (values)
TOTAL (values)
we would say perhaps (using “\\” for “reduction”):
increment \\ (values )
max \\ (values)
+ \\ (values)
Well, please yourselves. But at least you see what I mean by consigning these accumulation functions to data type processing, and perhaps you even see to some extent how that might be done.

Are you tempted to try and treat PRESENT and COUNT as parts of the DML? I think we should resist that temptation, which arises – as for identity – from their being very overloaded. They apply to every data type, don’t they?

This is going to take more than one lecture! I haven’t got round to modalities yet. And the other thing I haven’t discussed is what we do when we want first to collect some values in a column and then, say, TOTAL them. You see, we might have four rows with the values 1,2, 2, and 3. Our collection, it seems, would comprise 1, 2, and 3, with a TOTAL of 6. Not very satisfactory.

In any event, my thanks – and perhaps yours – to Douglas for showing such interest and enthusiasm. But perhaps next time you want me to change course at your whim, Douglas my friend, you could buy the drinks. And you might even do me the favour of staying behind briefly to explain the unexpected disappearance of yourself and Judy during my temporary absence (admittedly somewhat extended) from the barroom of the Angel. Yes, thank you; I got home quite without incident.

 

 

On Some Fallacies

 

 

(Professor Platoclast here responds to a question about the mistakes that strict data typing avoids.)

What I’m asked is whether it is an advantage that strict data typing avoids such horrors as multiplying two currency fields, or – I suppose – to take an example from Mr Date, adding two date-and-time fields.[2] Well, strict data typing does protect against horrors, like asking whether a currency amount is greater than a name. But the examples I started by quoting aren’t horrors at all, and data typing doesn’t prevent them, thank goodness.

Suppose we have a number of currency amounts: obviously we can find their mean by adding them all together and dividing by their count, and the answer is a currency amount: five pounds three shillings and fourpence, say.[3] But equally obviously – it seems to me – we can find their variance by adding all their squares together, dividing by their count, and subtracting the square of their mean. The answer would be a square currency amount: of course, because its positive square root would be the standard deviation, which is a currency amount. So there just isn’t anything wrong with multiplying currency amounts together: I guess the confusion is caused by the result not being itself a currency amount, but so what? There’s nothing wrong in multiplying feet by feet and getting square feet, even though square feet aren’t feet.

In much the same way, Mr Date argues that we have to treat dates and intervals as of the same type, because if D1 and D2 are dates then (D1+D2) doesn’t make sense, whereas (D1+D2)/2 does make sense. Now, surely, you see the fallacy: (D1+D2) makes perfectly good sense, but it’s not a date. It’s a – what shall we say? – date-pair, that is, d is the same date-pair as e if there are dates, D1, D2, D3, and D4, such that d is D1 with D2 (i.e. D1+D2) and e is D3 with D4 (i.e. D3+D4) and D1-D3 is the same signed interval as D4-D2.[4]

 

SITE HOME PAGE

Having knocked down that argument, we can go back to Mr Date’s initial observation that “at first sight it might appear that intervals and dates are fundamentally different kinds of objects”. It might appear so, but don’t you be fooled: it does appear so, and it jolly well is so.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XVI (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XVII

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.