(U03)

www.btinternet.com/~adrian.larner/database/pcl17

PLATOCLAST
ON DATA

Lecture XVII
Collection and Negation

 

 

I really am going to get round to the nasty subject of negation, and specifically to the Difference operation. But this data type handling has opened a can of pretty horrible worms and I’d better tidy up before we proceed on our way.

 

 

Token Identity

 

I want to start by reviving an old question. You will. I sincerely hope. recall the distinction we made between criterial identity, “=”, and complete identity, “º”. We decided that if we could make do with just one null value then we could define complete identity on criterial identity:

xºy =df x = y | (¬(x = x)& ¬(y = y))
Of course. our complete identity was not intended to be absolute, or even systemic, identity. It was identity relative to the allowed values of a column, in contrast to criterial identity, which was relative to the proper values of the column.

We briefly considered. and provisionally, and rightly, rejected treating “=” and “º” both as primitive operators, because we didn’t want multiple nulls. However, I now want to reconsider that option. because we have come across the concept of token identity, and this might be all too suspiciously like complete identity (within a column). I still don’t want more than one sort of null, I hasten to add.

OK. Suppose we take “º” to signify token identity, and we don’t define it on “=”. Naturally we want token identity to be narrower – well. no broader – than criterial identity, that is:

IF xºy THEN x = y | (¬(x = x) & ¬(y = y)
Token identity therefore implies what º used to mean, completed criterial identity. To ensure that we have a single null, we now have to postulate that if x is null and y is null then x is token identical to y:
IF ¬(x = x) & ¬(y = y) THEN x º y
So what possibility does this open up that wasn’t there before? It’s this: that we might have x = x and x = y (so, of course, “y = y”), but not x º y. Of course. if we think merely in terms of implementation then this might well happen: we could get two tokens for one and the same value. But so what? We certainly wouldn’t want the user to know about that; nor the DML (in contrast to the particular implementation of the DML).

 

 

Identities of Multidimensional Data

 

No; what I’m thinking about is something like this. Suppose we have a table referring to products that have a length recorded, naturally enough in a LENGTH column. And suppose we have two ranges of products, metric and traditional, the former with lengths in centimetres, millimetres, etc. and the latter with lengths in inches, thous, etc.[1] Well, let x be the length value 1 inch and y the length value 2.54 centimetres. It would, I think, be not unreasonable to say that x was the same length as y. I have come across some data base theorists who would deny this, but not – I guess – in the presence of an engineer, or even of a builders’ merchant.

Yet we might want a function, say “METRIC”, such that METRIC(x) returned No, and METRIC(y) returned Yes. So although we wish to say that x=y (they are the same length), we don’t wish to say that xºy. They have to have different tokens if “METRIC” is to distinguish between them.

We’ve already discussed, at some length, two different inscriptions being the same alphabetic letter. And for the mathematicians, a similar example is rational numbers. A rational number is a pair of numbers. But the distinct pairs 6/4 and 3/2 are one and the same rational number. And for the theologians, the Father, Son, and Holy Spirit are distinct persons, but one and the same God.[2]

You want an example in data processing? Look at this:

V0 := 3
V1 := 3
When those two instructions have been performed, V0 and V1 are the same number, but they remain distinct variables. And so 2 + V0 is the same number as 2 + V1, but ADDR(V0) is not the same pointer value as ADDR(V1).

Now, of course, going back to the LENGTH example, there is always a way out. We could call the column “LENGTH INSCRIPTION”, and then we could have the function “LENGTH”, so that x and y were different length inscriptions but LENGTH(x) and LENGTH(y) were the same length. But which is the more convenient for our users? Remember that that is what determines our choice of a column criterion of identity: when do the users, in most instances, or most naturally, say that this is the same such-and-such as that?

So I open this possibility to you. And I remind you once again of the importance of representation independence. It is, to me, a cause of wonder that my generation of DP professionals have passed through decimalisation of our coinage, and partial metrication of our measures, and still we cling to data types like “length in inches” and “temperature in Celsius”. Perhaps it’s a conspiracy among programmers to keep themselves in work rewriting their representation-dependent programs.

I ought to mention the other advantage of encapsulated data types: they stop us making silly mistakes like adding lengths to weights. But we overcome that advantage by having data types like “length in inches” and “weight in grams”: they are the same data type, numbers. So we can add them together.

Did you like that little joke: overcoming advantages? I must use it again: current advances in data base have overcome most of the advantages of Dr Codd’s original theory. Where was I?

Ah, yes. Modalities. What we are really starting to touch on is what we might call many-dimensioned data. We started with two-dimensional data, the dimensions being type and value, but now we’re beginning to see that perhaps formatted data is data with at least two dimensions. Those lengths appeared to have type, numeric value, and unit. And we could even contemplate four dimensions. We could divide “unit” into “system” and, in the physicist’s sense, “dimensions”. “System” would mean, say Metre, Kilogram, Second; or Foot, Pound, Second. “Dimensions”, accordingly, would be expressible as the indices of length, mass, and time. I believe physicists now allow other dimension triples, though I can’t remember what they are. I think energy and momentum come in somewhere, but then the system has to change accordingly with Joules or calories or perhaps ergs for energy, and goodness knows what for momentum.

But we still want to hide all this dimensionality to a large extent. We want the system to protect the user, or the programmer, from adding a viscosity to a velocity, but to allow addition of an area in Hectares to an area in Rods.[3] And we don’t want to force users to go through elaborate conversion exercises to compare two measures of energy: we want an “is the same energy” column criterion.

 

 

Modality as a Data Dimension

 

Now think about modalities: they are an extra dimension, aren’t they. Perhaps – I say only perhaps – we should hide them away with the data to which they pertain. Of course we solved an important problem by giving them their own column, but it was a bit like expressing lengths in two columns: a numeric column, say MEASURE, and a unit column, UNIT _OF _MEASURE. It’s OK. It’s clean. But is it what the user wants? If not, but only if we can keep it clean, we should try another solution.

With lengths, as we’ve seen, the secret is to have token identity as primitive, and to provide (as part of a data type) the extra functions that the user might sometimes need, like “METRIC” to check on units. And we might take the same approach with modalities. That is, keep both value and modality encapsulated together in a single column, and provide the functions sometimes needed to handle them separately.

Now I say this because of the intimate, and implicit, relationship between a value and its modality. As we have seen, when we add, say, values V1 and V2 giving V3 = V1 + V2, we want the pertinent modalities, say M1 and M2, to be ANDed, giving M3 = M l-and-M2. But – out of sheer convenience for the user – we don’t want the operation on modalities to be made explicit: we don’t want the user to have to ask for it.

So the only way to make it work is to go back to a single column, each of whose tokens denotes a value proper and its associated modality. Let such tokens be t1 and t2. I suspect that the desired column identity, “t1 = t2”, is “VALUE(tl) = VALUE(t2)” or, perhaps, “VALUE(tl) º VALUE(t2)”. So this is rather different from the LENGTH example, where the column criterion of identity depended on both encapsulated components, the measure and the unit of measure. And that makes me a little uneasy about putting modalities in the same column as values.

On the other hand, what could be more horrible than a DML with side-effects? Do we really want an operation applied to one column (the value) to drive an operation on some other column (the modality)? Certainly not! Better to oblige the user to state any operation required on modality columns. On balance, I’ll go for a single column of an encapsulated data type from which both value and modality can be derived.

 

 

Collections

 

Now let’s move on to that nasty little problem with collections. We have, in about the simplest case:

R:     X      Y      Z
 
       x1     y1     1
       x1     y2     1
       ...    ...    ...
I show just the (X: x1) rows. We ask for:
R // SZ := TOTAL (X # Z)
Obviously we want the sum of the Z values within the scope of each X value:
X      Y      Z     SZ
 
x1     y1     1     2
x1     y2     1     2
...    ...    ...   ...
But it seems that we get this (seems! nay, it is):
X      Y      Z     SZ
 
x1     y1     1     1
x1     y2     1     1
...    ...    ...   ...
And, of course, we would have the same problem with COUNT (unless we really did want to count the distinct Z values).

 

 

Differentiation

 

Now it’s not for nothing that I’ve softened you up with all that talk of LENGTH and value/modality columns. Just consider, we might have in a value/modality column two tokens, tl and t2, such that tl = t2 (they have the same value) and yet not tl º t2 (they are different tokens ). So, if only, in our example, the Z entries, each shown as “1”, had different modalities, the TOTAL would work.: (X # Z) would contain both of them, each with the value 1, and their TOTAL would be 2.

Of course, different modalities aren’t necessary: different colours, different smells, different anythings would do. Naturally, we get no joy from our initial definitions: both 1’s are certainly the same field, the same value in the same column, (Z: 1). What a shame: but, in a very obvious sense they do have a difference. One of them is in a record (row), with (Y : y1), and the other in a different record, with (Y : y2). And we can see that wherever this problem arises, the apparently identical fields are always in different records.

So, let us have a function, DIFF, that works in this way: DZ = DIFF (Z, Y) returns the value of Z differentiated by Y. This means that if dz1 = DIFF (z1, yl) then dzl is the same Z as as z1, but if dz2 = DIFF (z2, y2) then even if zl is the same Z as z2, as long as NOT yl = y2, dzl is not the same DZ as dz2. You will appreciate that DIFF is a very overloaded function, and could be a token-processing function. But now we can get what we wanted:

R // SZ := TOTAL (X # DIFF (Z, Y))
A slight change to the TOTAL function is required: it must accept differentiated Zs, but we could achieve this in our general reduction algorithm, which would extract the undifferentiated value and pass it to the relevant fn:
stop := no
result := default
work := null
 
DO FOR EACH Value, v, in the Collection
 
   CALL fn REFER (result, UNDIFF (v))
      SET (result, stop) RESET (work)
 
   WHILE NOT stop
All I’ve done is introduce another overloaded operator, UNDIFF, with the invariant:
FOR EACH x, FOR EACH y, x = UNDIFF ( DIFF (x, ))
Again, I want to stress that I’m not talking DML design. I’m trying to get us all thinking clearly about what we need, and where. Only when we’ve got that worked out can we start to design a pretty language for the user to express it in. We’re thinking about what the it is, which the user needs to express.

So what have we got now? It seems to me that we have a DML proper: a record processor that has native operations like Combination, Restriction, Extension, and Projection. And beneath that (called by it) we have a field processor – a token processor – that handles some very overloaded functions. And beneath that we have improper function processors. And beneath them we have proper function processors.

Well, I like that: layer upon layer. So instead of handwaving in the general direction of data type processing I can now solve problems by placing them neatly into the appropriate layer.

 

 

Collections as Joins

 

But what really bugs me is that I’ve had to introduce that “collection within scope” operator (“#”). Let’s have a look at a bit more of that table, R, perhaps all of it:

R:     X      Y      Z
 
       x1     y1     1
       x1     y2     1
       x2     y1     2
       x3     y1     3
       x3     y2     1
       x3     y3     1
Now – believe me, or I’ll set it as a homework assignment – we can contrive a number of natural joins on the X column to give us:
X      Y1     Z1     Y2     Z2     Y3     Z3
 
x1     y1     1      y2     1
x1     y2     1      y1     1
x2     y1     2
x3     y1     3      y2     1      y3     1
x3     y1     3      y3     1      y2     1
x3     y2     1      y1     3      y3     1
x3     y2     1      y3     1      y1     3
x3     y3     1      y1     3      y2     1
x3     y3     1      y2     1      y1     3
Got that? for each X there’s a row with each full permutation of Y and Z values. We project away the Y’s (that’s UNDIFF – think about it!)
X      Z1     Z2     Z3
 
x1     1      1
x1     1      1
x2     2
x3     3      1      1
x3     3      1      1
x3     1      3      1
x3     1      1      3
x3     1      3      1
x3     1      1      3
And you can see that our TOTAL is just Zl + Z2 + Z3. Actually we ought still to say it’s the “plus” reduction of Zl, Z2, and Z3. That’s because their sum comes out absent in the shorter rows if we say absence is infectious. But it’s not infectious in reduction (as our algorithm shows).

But all we have now done, until we actually do the addition, is in the FOPC. So you can see, I have a way to explain any collection within scope in terms of the FOPC: it’s a number of joins, and therefore a number of conjunctions (ANDings). But I don’t have a way to explain every collection within scope.

I’ll say that again, a bit more formally:

For each collection within scope, there is a sequence of joins which have the same effect (and could therefore be used as a definition of that collection in terms of the FOPC).
But this is false:
There is a sequence of joins which have the same effect as each collection within scope.
As you can see: we don’t know how many joins we have to do until we see the actual data. It’s not predictable in advance. And that’s why (at least for the moment) it’s better for us to stay with the collection operation. That operation itself is not definable in the FOPC; nor even is each specification of it in a query; but each use of it in the performance of a query is so definable.[4]

So collection within scope doesn’t have to weigh too heavily on my conscience. Especially as I have used it to ensure that my DML is so simple that it’s based on only part of the FOPC; a part that doesn’t include existential quantification. I achieve the effect of such quantification by collection (and therefore, in any instance of a query, by conjunction) and that little encapsulated function, PRESENT. Indeed, I get more than mere existential quantification: I get even more powerful functions, like COUNT and TOTAL (way outside the FOPC) by the same mechanism, but with different encapsulated functions.

You might remember that those “]” and “[” operators were also interpretable as quantifications. But they occur in restriction conditions, and they too can be pushed down a level from the DML into the token processor.

 

 

Difference

 

And now, you can bet, I’ve got together all the mechanism that I need to handle negation, and the difference operation, without admitting either the “NOT” or the” AND NOT” operator into the part of the FOPC that I base my DML on. Except in restriction conditions, of course: but we won’t worry about them. It’s negations of records, or tables, that I plan to avoid.

This is the traditional definition of difference:

Rb = Pb AND NOT FOR SOME y, y = b AND Qy
Or we could put that another way:
Rb = Pb AND FOR EACH y, IF y = b THEN NOT Qy
Or – apparently simpler – we could use the variable y throughout, rather than the name “b”:
Ry = Py AND NOT Qy
Well, you can see it coming, can’t you? “NOT FOR SOME y” is going to end up as a restriction condition involving a collection. We combine the tables, p and q, and apply the restriction condition “p.Y ]º q.Y”. Here are p and q:
p:   Y         q:   Y
 
      1               1
      2               3
Here’s the combination, p;q:
p.Y      q.Y
 
 1
 1         1
 1         3
 2
 2         1
 2         3
            1
            3
And here it is restricted, p;q/p.Y]ºq.Y
p.Y      q.Y
 
 1
 1         1
 2
And now the collection, restriction, and projection:
p.Y\p;q/p.Y]ºq.Y/NOT PRESENT (p.Y # q. Y)
And that gives:
p.Y
 
 2
Of course, just as we can say “0 ¬= COUNT” instead of “ PRESENT”, we could say “0 = COUNT” instead of “NOT PRESENT”, if we’re really desperate to get rid of the “NOT”. But, as I said, I don’t worry about it in restriction conditions: I just don’t want it applied to rows.

And there – more or less – we have it. But we’ve covered a lot of country since we started with all those conjunction operations, discovered Combination (the Great Conjunction Operation), and swept through Outer Joins and Unions; so I’ll say it again in the next lecture.

 

 

A Clarification

 

 

(Following this lecture, Professor Platoclast was asked to clarify his paradoxical justification of collection: was it, or was it not, definable on the FOPC-based operators of the DML he was proposing?)

You are right, my boy, it’s a quantifier shift. Let’s use “x” as the expression of a query. (We let t take a table or collection of tables as its value.) Now, what we’ve always assumed (unthinkingly) is this:

IF x is properly expressible within our DML THEN FOR SOME y, y is expressed in our DML AND FOR EACH t, x applied to t gives the same result as y applied to t.
This means that x is translatable into y, so – in effect – x is expressed in a mere syntactic variant of our DML.

But what I’m proposing is to relax this condition of “proper expressibility” to:

IF x is properly expressible within our DML THEN FOR EACH t, FOR SOME y, y is expressed in our DML AND x applied to t gives the same result as y applied to t.
And there you see the quantifier shift. What we’re doing is, so to speak, waiting until we’ve seen the operands of the query, the actual table or tables, before we decide which expression in our DML to use. So x doesn’t have to be translatable into our DML; at least, not until we’re just about to execute it.

Let me give a very simple analogy. Suppose we have a language with names for switches, say “S1”, “S2”, and so on, and two operations, SETON(s) and SETOFF(s). Now, is “FLIP(s)” properly expressible? I.e. “If s is on, SETOFF(s), otherwise SETON(s).” In a sense, no: our language doesn’t include “If’, “on”, or “otherwise”. Yet, “FLIP(s)” is properly expressible in this sense: that if we first substitute the actual switch name giving, say, “FLIP(S1)”, then there is a translation (whether we know it or not – that’s a different matter) to “SETON(S1)” or “SETOFF(S1)” accordingly as S1 is off or on.

We might call “FLIP(s)” a schematic operation: it stands ambiguously for either of our actual operations. Remember we had something like that in the TT-DD, when we used “F” as a schematic predicate – we used it in the schema, or schematic definition, for Table(F). Then, later, we replaced it with an actual predicate.

So you could say that I’m proposing to admit some schematic queries, and a query with a collection operation is one sort of schematic query. But – please – a fine philosophical point, a schematic query is not a sort of query; “schematic” isn’t that sort of adjective. It’s an adjective like “putative” or “so-called” or “imaginary”: we don’t count putative employees among our employees, nor imaginary experiences within our reportage.

Using schematics as schematics is fine; using them seriously is dangerous. Remember what happened when Russell used, instead of an Axiom, a schematic axiom: the Axiom Schema of Abstraction.

 

SITE HOME PAGE

A schema has a sort of promissory note attached: “I promise to convert this schematic so-and-so to a genuine so-and-so before use.” If you can’t sign the note with a clear conscience, then keep your grubby paws off it entirely.

THE DATABASE PAGE

THE DATABASE PAPERS

 

Preface & Contents

 

DOWNLOAD

Download Lecture XVII (rtf, Word for Windows compatible)

Platoclast on Data: Lecture XVIII

 

Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights.