|
(U03) |
www.btinternet.com/~adrian.larner/database/pcl17 |
|
PLATOCLAST Lecture XVII |
||
|
|
||
|
|
I really am going to get round to the nasty subject of negation, and specifically to the Difference operation. But this data type handling has opened a can of pretty horrible worms and I’d better tidy up before we proceed on our way. |
|
|
|
||
|
|
Token Identity |
|
|
|
I want to start by reviving an old question. You will. I sincerely hope. recall the distinction we made between criterial identity, “=”, and complete identity, “º”. We decided that if we could make do with just one null value then we could define complete identity on criterial identity:
We briefly considered. and provisionally, and rightly, rejected treating “=” and “º” both as primitive operators, because we didn’t want multiple nulls. However, I now want to reconsider that option. because we have come across the concept of token identity, and this might be all too suspiciously like complete identity (within a column). I still don’t want more than one sort of null, I hasten to add. OK. Suppose we take “º” to signify token identity, and we don’t define it on “=”. Naturally we want token identity to be narrower well. no broader than criterial identity, that is:
|
|
|
|
||
|
|
Identities of Multidimensional Data |
|
|
|
No; what I’m thinking about is something like this. Suppose we have a table referring to products that have a length recorded, naturally enough in a LENGTH column. And suppose we have two ranges of products, metric and traditional, the former with lengths in centimetres, millimetres, etc. and the latter with lengths in inches, thous, etc.[1] Well, let x be the length value 1 inch and y the length value 2.54 centimetres. It would, I think, be not unreasonable to say that x was the same length as y. I have come across some data base theorists who would deny this, but not I guess in the presence of an engineer, or even of a builders’ merchant. Yet we might want a function, say “METRIC”, such that METRIC(x) returned No, and METRIC(y) returned Yes. So although we wish to say that x=y (they are the same length), we don’t wish to say that xºy. They have to have different tokens if “METRIC” is to distinguish between them. We’ve already discussed, at some length, two different inscriptions being the same alphabetic letter. And for the mathematicians, a similar example is rational numbers. A rational number is a pair of numbers. But the distinct pairs 6/4 and 3/2 are one and the same rational number. And for the theologians, the Father, Son, and Holy Spirit are distinct persons, but one and the same God.[2] You want an example in data processing? Look at this:
Now, of course, going back to the LENGTH example, there is always a way out. We could call the column “LENGTH INSCRIPTION”, and then we could have the function “LENGTH”, so that x and y were different length inscriptions but LENGTH(x) and LENGTH(y) were the same length. But which is the more convenient for our users? Remember that that is what determines our choice of a column criterion of identity: when do the users, in most instances, or most naturally, say that this is the same such-and-such as that? So I open this possibility to you. And I remind you once again of the importance of representation independence. It is, to me, a cause of wonder that my generation of DP professionals have passed through decimalisation of our coinage, and partial metrication of our measures, and still we cling to data types like “length in inches” and “temperature in Celsius”. Perhaps it’s a conspiracy among programmers to keep themselves in work rewriting their representation-dependent programs. I ought to mention the other advantage of encapsulated data types: they stop us making silly mistakes like adding lengths to weights. But we overcome that advantage by having data types like “length in inches” and “weight in grams”: they are the same data type, numbers. So we can add them together. Did you like that little joke: overcoming advantages? I must use it again: current advances in data base have overcome most of the advantages of Dr Codd’s original theory. Where was I? Ah, yes. Modalities. What we are really starting to touch on is what we might call many-dimensioned data. We started with two-dimensional data, the dimensions being type and value, but now we’re beginning to see that perhaps formatted data is data with at least two dimensions. Those lengths appeared to have type, numeric value, and unit. And we could even contemplate four dimensions. We could divide “unit” into “system” and, in the physicist’s sense, “dimensions”. “System” would mean, say Metre, Kilogram, Second; or Foot, Pound, Second. “Dimensions”, accordingly, would be expressible as the indices of length, mass, and time. I believe physicists now allow other dimension triples, though I can’t remember what they are. I think energy and momentum come in somewhere, but then the system has to change accordingly with Joules or calories or perhaps ergs for energy, and goodness knows what for momentum. But we still want to hide all this dimensionality to a large extent. We want the system to protect the user, or the programmer, from adding a viscosity to a velocity, but to allow addition of an area in Hectares to an area in Rods.[3] And we don’t want to force users to go through elaborate conversion exercises to compare two measures of energy: we want an “is the same energy” column criterion. |
|
|
|
||
|
|
Modality as a Data Dimension |
|
|
|
Now think about modalities: they are an extra dimension, aren’t they. Perhaps I say only perhaps we should hide them away with the data to which they pertain. Of course we solved an important problem by giving them their own column, but it was a bit like expressing lengths in two columns: a numeric column, say MEASURE, and a unit column, UNIT _OF _MEASURE. It’s OK. It’s clean. But is it what the user wants? If not, but only if we can keep it clean, we should try another solution. With lengths, as we’ve seen, the secret is to have token identity as primitive, and to provide (as part of a data type) the extra functions that the user might sometimes need, like “METRIC” to check on units. And we might take the same approach with modalities. That is, keep both value and modality encapsulated together in a single column, and provide the functions sometimes needed to handle them separately. Now I say this because of the intimate, and implicit, relationship between a value and its modality. As we have seen, when we add, say, values V1 and V2 giving V3 = V1 + V2, we want the pertinent modalities, say M1 and M2, to be ANDed, giving M3 = M l-and-M2. But out of sheer convenience for the user we don’t want the operation on modalities to be made explicit: we don’t want the user to have to ask for it. So the only way to make it work is to go back to a single column, each of whose tokens denotes a value proper and its associated modality. Let such tokens be t1 and t2. I suspect that the desired column identity, “t1 = t2”, is “VALUE(tl) = VALUE(t2)” or, perhaps, “VALUE(tl) º VALUE(t2)”. So this is rather different from the LENGTH example, where the column criterion of identity depended on both encapsulated components, the measure and the unit of measure. And that makes me a little uneasy about putting modalities in the same column as values. On the other hand, what could be more horrible than a DML with side-effects? Do we really want an operation applied to one column (the value) to drive an operation on some other column (the modality)? Certainly not! Better to oblige the user to state any operation required on modality columns. On balance, I’ll go for a single column of an encapsulated data type from which both value and modality can be derived. |
|
|
|
||
|
|
Collections |
|
|
|
Now let’s move on to that nasty little problem with collections. We have, in about the simplest case:
|
|
|
|
||
|
|
Differentiation |
|
|
|
Now it’s not for nothing that I’ve softened you up with all that talk of LENGTH and value/modality columns. Just consider, we might have in a value/modality column two tokens, tl and t2, such that tl = t2 (they have the same value) and yet not tl º t2 (they are different tokens ). So, if only, in our example, the Z entries, each shown as “1, had different modalities, the TOTAL would work.: (X # Z) would contain both of them, each with the value 1, and their TOTAL would be 2. Of course, different modalities aren’t necessary: different colours, different smells, different anythings would do. Naturally, we get no joy from our initial definitions: both 1s are certainly the same field, the same value in the same column, (Z: 1). What a shame: but, in a very obvious sense they do have a difference. One of them is in a record (row), with (Y : y1), and the other in a different record, with (Y : y2). And we can see that wherever this problem arises, the apparently identical fields are always in different records. So, let us have a function, DIFF, that works in this way: DZ = DIFF (Z, Y) returns the value of Z differentiated by Y. This means that if dz1 = DIFF (z1, yl) then dzl is the same Z as as z1, but if dz2 = DIFF (z2, y2) then even if zl is the same Z as z2, as long as NOT yl = y2, dzl is not the same DZ as dz2. You will appreciate that DIFF is a very overloaded function, and could be a token-processing function. But now we can get what we wanted:
So what have we got now? It seems to me that we have a DML proper: a record processor that has native operations like Combination, Restriction, Extension, and Projection. And beneath that (called by it) we have a field processor a token processor that handles some very overloaded functions. And beneath that we have improper function processors. And beneath them we have proper function processors. Well, I like that: layer upon layer. So instead of handwaving in the general direction of data type processing I can now solve problems by placing them neatly into the appropriate layer. |
|
|
|
||
|
|
Collections as Joins |
|
|
|
But what really bugs me is that I’ve had to introduce that “collection within scope” operator (“#”). Let’s have a look at a bit more of that table, R, perhaps all of it:
But all we have now done, until we actually do the addition, is in the FOPC. So you can see, I have a way to explain any collection within scope in terms of the FOPC: it’s a number of joins, and therefore a number of conjunctions (ANDings). But I don’t have a way to explain every collection within scope. I’ll say that again, a bit more formally:
So collection within scope doesn’t have to weigh too heavily on my conscience. Especially as I have used it to ensure that my DML is so simple that it’s based on only part of the FOPC; a part that doesn’t include existential quantification. I achieve the effect of such quantification by collection (and therefore, in any instance of a query, by conjunction) and that little encapsulated function, PRESENT. Indeed, I get more than mere existential quantification: I get even more powerful functions, like COUNT and TOTAL (way outside the FOPC) by the same mechanism, but with different encapsulated functions. You might remember that those “]” and “[” operators were also interpretable as quantifications. But they occur in restriction conditions, and they too can be pushed down a level from the DML into the token processor. |
|
|
|
||
|
|
Difference |
|
|
|
And now, you can bet, I’ve got together all the mechanism that I need to handle negation, and the difference operation, without admitting either the “NOT” or the” AND NOT” operator into the part of the FOPC that I base my DML on. Except in restriction conditions, of course: but we won’t worry about them. It’s negations of records, or tables, that I plan to avoid. This is the traditional definition of difference:
And there more or less we have it. But we’ve covered a lot of country since we started with all those conjunction operations, discovered Combination (the Great Conjunction Operation), and swept through Outer Joins and Unions; so Ill say it again in the next lecture. |
|
|
|
||
|
|
A Clarification |
|
|
||
|
|
(Following this lecture, Professor Platoclast was asked to clarify his paradoxical justification of collection: was it, or was it not, definable on the FOPC-based operators of the DML he was proposing?) You are right, my boy, it’s a quantifier shift. Let’s use “x” as the expression of a query. (We let t take a table or collection of tables as its value.) Now, what we’ve always assumed (unthinkingly) is this:
But what I’m proposing is to relax this condition of “proper expressibility” to:
Let me give a very simple analogy. Suppose we have a language with names for switches, say “S1”, “S2”, and so on, and two operations, SETON(s) and SETOFF(s). Now, is “FLIP(s)” properly expressible? I.e. “If s is on, SETOFF(s), otherwise SETON(s).” In a sense, no: our language doesn’t include “If’, “on”, or “otherwise”. Yet, “FLIP(s)” is properly expressible in this sense: that if we first substitute the actual switch name giving, say, “FLIP(S1)”, then there is a translation (whether we know it or not that’s a different matter) to “SETON(S1)” or “SETOFF(S1)” accordingly as S1 is off or on. We might call “FLIP(s)” a schematic operation: it stands ambiguously for either of our actual operations. Remember we had something like that in the TT-DD, when we used “F” as a schematic predicate we used it in the schema, or schematic definition, for Table(F). Then, later, we replaced it with an actual predicate. So you could say that I’m proposing to admit some schematic queries, and a query with a collection operation is one sort of schematic query. But please a fine philosophical point, a schematic query is not a sort of query; “schematic” isn’t that sort of adjective. It’s an adjective like “putative” or “so-called” or “imaginary”: we don’t count putative employees among our employees, nor imaginary experiences within our reportage. Using schematics as schematics is fine; using them seriously is dangerous. Remember what happened when Russell used, instead of an Axiom, a schematic axiom: the Axiom Schema of Abstraction. |
|
|
|
||
|
A schema has a sort of promissory note attached: “I promise to convert this schematic so-and-so to a genuine so-and-so before use.” If you can’t sign the note with a clear conscience, then keep your grubby paws off it entirely. |
|
|
|
|
|
|
|
||
|
Copyright © 1993, 2001 Adrian Larner. The author asserts all moral rights. |
||