
www.btinternet.com/~adrian.larner/database/newint3

A New Interpretation of Data

(continued)

A database paper by Adrian Larner

 

THE “EPI” INTERPRETATION

 

 

Let us abandon all hope of being able to interpret the equality used in restriction conditions and natural joins as absolute, or even systemic, identity. (What, in any event, was the supposed advantage of that interpretation?) Our equality on persons (CHILD and FATHER) required “is the same person as”; and our equality on parts (supplied and used) required “is the same part as”. We do need a careful definition of “part”. Specifically, we need its criterion of identity, “same part”, to be carefully elaborated. Perhaps: x is the same part as y when they are of the same material, shape, size, and specified tolerances, but irrespective of their origin, destination, or current location.

Now consider a CAR record, with columns Car-Id, Model, Make, Colour, and so on. We will need a criterion of identity associated with each column, so that we can interpret an equality on Car-Id, on Model, on Make, on Colour, ... But a moment’s consideration will show that, of course, we need such a criterion. Would the data analyst’s work be done if we did not know what it meant to join (say) two CAR records on Make?

We would expect that if two such records were compared on these fields:

If identical on Car-Id, they would be identical on Model and on Colour.
 
If identical on Model, they would be identical on Make.
Or, we might say, Car-Id classification is finer than Model classification and than Colour classification; Model classification is finer than Make classification. Or even: Car-Id determines Model and Colour; Model determines Make.
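The "finer than" relation described above can be checked mechanically over a set of records. The following is a minimal sketch only: the helper name `is_finer` and every record value other than F197PDU, 740, Volvo and Silver are made up for illustration.

```python
from typing import Dict, List

# Hypothetical CAR records; only the first row uses values from the text.
CARS: List[Dict[str, str]] = [
    {"Car-Id": "F197PDU", "Model": "740", "Make": "Volvo", "Colour": "Silver"},
    {"Car-Id": "G222ABC", "Model": "740", "Make": "Volvo", "Colour": "Red"},
    {"Car-Id": "H333XYZ", "Model": "Golf", "Make": "Volkswagen", "Colour": "Red"},
]


def is_finer(records: List[Dict[str, str]], fine: str, coarse: str) -> bool:
    """True if any two records that agree on `fine` also agree on `coarse`."""
    seen: Dict[str, str] = {}
    for rec in records:
        # setdefault returns the `coarse` value first seen for this `fine` value.
        if seen.setdefault(rec[fine], rec[coarse]) != rec[coarse]:
            return False
    return True


print(is_finer(CARS, "Car-Id", "Model"))  # True
print(is_finer(CARS, "Model", "Make"))    # True
print(is_finer(CARS, "Colour", "Make"))   # False: Red occurs with two makes
```

Such a check can only refute a claimed determination on the records to hand; it cannot establish that the determination holds of the column identities themselves.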

But of what are Car-Id, Model, Make, and Colour classifications? What is it that their criteria of identity – let us say, their column identities – classify? It is one and the same thing, in the sense: falling under the same criterion of application. Thus Car-Id, or “is the same car as”, puts this sort of thing into classes, each of which contains one (individual) car. And that is the sort of thing – cars – that Model, or “is the same model as”, puts into classes, each of which contains one model of car. Likewise Colour; likewise Make.

So we appear to move back from the classical interpretation, whose entities were the things named by values – cars, models, colours, makes – towards the entity interpretation, at least to the extent that one record represents things falling under just one criterion of application. But those things fall under different criteria of identity, one for each column (we might say, for each field, for each attribute). Instead of saying (somewhat vaguely) that an attribute is a property, we can say that an attribute is the sort of property by which things can be classified.

What more do we need to say about the interpretation of our CAR record? – that there is something that is thus variously classified:

∃x  x is the same car as ..., AND x is the same model as ..., AND x is the same make as ..., AND x is the same colour as ..., AND ...
To interpret a particular record we need values, say “F197PDU”, “740”, “Volvo”, “Silver”, etc. These are names, given under their column identities. Thus:
“F197PDU” names something, and indeed anything that is the same car as it.
 
“740” names something, and anything that is the same model of car as it (i.e. that is a car, and is the same model as it).
 
“Volvo” names something, and anything that is the same make of car as it (i.e. that is a car, and is the same make as it).
 
“Silver” names something, and anything that is the same colour of car as it (i.e. that is a car, and is the same colour as it).
Thus, in general, this sort of record has an interpretation of the form:
∃x  x =C(1) v1  ∧  x =C(2) v2  ∧  ...  ∧  x =C(r) vr
 
where each vj is a name given (to x) under the column identity “=C(j)”.
But not all records can have such a simple interpretation. The records that do have such an interpretation are, roughly, those that we would expect (as shown on an entity diagram) not to be dependent on any other record, i.e. insertible in the database irrespective of what other records are present, without foreign keys, not subject to referential integrity (although perhaps providing referential integrity for other records).

Suppose we had such an independent CAR record; and an independent PERSON record. We might also have a CAR OWNERSHIP record, with columns Car-Id, Person-Id, and perhaps Period-of-Ownership, interpreted in one of the following ways:

∃x ∃y  x is owned by y  AND  x is the same car as ...  AND  y is the same person as ...
 
∃x ∃y ∃z  z is an ownership of x by y  AND  x is the same car as ...  AND  y is the same person as ...  AND  z is the same period of ownership as ...
Notice that to express the relationship between the car and the person we may introduce the ownership entity; but we must introduce it if it has any attribute (any identity sentence). Again, we can see a parallel in entity diagramming: to associate two entities in such a diagram we may introduce either a (typically many-to-many) relationship, or a new entity dependent on those two entities (typically many-to-one to each of them).

We now have the general form of the proposed “EPI” interpretations: Existential quantifications, Predicate, Identities. Thus, for a record (relation) type, R, with r columns:

(E)
∃x1 ∃x2 ... ∃xn
(P)
R(x1, x2, ..., xn)
(I)
∧  x1 =C(1) v1  ∧  x2 =C(2) v2  ∧  ...  ∧  xr =C(r) vr
 
where each xj is one of the xi.
Notice that the classical interpretation (subject to reservations about the interpretation of the identities) can be thought of as a special case of the EPI interpretation in which n = r and the variable in the j-th identity is the j-th quantified variable, so that each column has a variable of its own.
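One way to hold the three components of an EPI interpretation as data is sketched below. The class names `Identity` and `EPI`, the `render` method, and the Person-Id "P042" are assumptions made for illustration; they are not part of the proposal itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Identity:
    var: str        # the quantified variable the name is given to
    criterion: str  # the column identity, e.g. "car" for "is the same car as"
    value: str      # the name held in the column


@dataclass(frozen=True)
class EPI:
    variables: Tuple[str, ...]        # (E) existentially quantified variables
    predicate: Tuple[str, ...]        # (P) predicate atoms, kept as plain text
    identities: Tuple[Identity, ...]  # (I) one identity sentence per column

    def render(self) -> str:
        quantifiers = " ".join(f"EXISTS {v}" for v in self.variables)
        identities = [f"{i.var} is the same {i.criterion} as {i.value}"
                      for i in self.identities]
        return quantifiers + ": " + " AND ".join(list(self.predicate) + identities)


# An independent record: one entity (n = 1), four columns (r = 4), no predicate.
car = EPI(("x",), (), (
    Identity("x", "car", "F197PDU"),
    Identity("x", "model", "740"),
    Identity("x", "make", "Volvo"),
    Identity("x", "colour", "Silver"),
))

# A dependent record: two entities (n = 2), two columns (r = 2), one predicate.
# "P042" is a made-up Person-Id.
ownership = EPI(("x", "y"), ("x is owned by y",), (
    Identity("x", "car", "F197PDU"),
    Identity("y", "person", "P042"),
))

print(car.render())
print(ownership.render())
```

The CAR record shows n differing from r: one quantified variable carries four identities, which the classical special case (n = r) does not allow.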

 

Data Manipulation Re-interpreted

 

 

Suppose that we have a relation, R, with the EPI interpretation given above, and another with the interpretation:

(E)
∃xn+1 ∃xn+2 ... ∃xn+m
(P)
Q(xn+1, xn+2, ..., xn+m)
(I)
∧  xr+1 =C(r+1) vr+1  ∧  xr+2 =C(r+2) vr+2  ∧  ...  ∧  xr+q =C(r+q) vr+q
 
where each xr+j is one of the xn+i.
The interpretation of the Cartesian Product of these two relations is the conjunction (ANDing) of their interpretations. We may rearrange the E, P, and I components to express it in canonical form:
(E)
∃x1 ∃x2 ... ∃xn+m
(P)
R(x1, x2, ..., xn) ∧ Q(xn+1, xn+2, ..., xn+m)
(I)
∧  x1 =C(1) v1  ∧  x2 =C(2) v2  ∧  ...  ∧  xr+q =C(r+q) vr+q
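The product rule can be sketched in a few lines. The dictionary layout ("E", "P", "I"), the column names, and the variable name `w` are assumptions made for this illustration; the supplier, part and project values are those of the supplied-and-used parts example.

```python
from typing import Dict

# An interpretation is held as three lists: "E" (quantified variables),
# "P" (predicate atoms as text) and "I" (identities as
# (column, variable, criterion, value) tuples). This layout is an assumption.
Interp = Dict[str, list]

supplies: Interp = {
    "E": ["x", "y"],
    "P": ["x supplies y"],
    "I": [("Supplier", "x", "supplier", "S1"), ("Part", "y", "part", "P1")],
}
uses: Interp = {
    "E": ["w", "z"],
    "P": ["w is used in z"],
    "I": [("Part-Used", "w", "part", "P1"), ("Project", "z", "project", "J1")],
}


def product(a: Interp, b: Interp) -> Interp:
    """Cartesian product: conjoin the two interpretations part by part."""
    if set(a["E"]) & set(b["E"]):
        raise ValueError("rename the variables apart before taking a product")
    return {key: a[key] + b[key] for key in ("E", "P", "I")}


both = product(supplies, uses)
print(both["E"])       # ['x', 'y', 'w', 'z']
print(both["P"])       # ['x supplies y', 'w is used in z']
print(len(both["I"]))  # 4
```

Nothing is computed beyond concatenation: the quantifiers, predicates and identities of the two records are simply gathered into the canonical E, P, I order.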

A restriction is also interpreted as a conjunction. The relation R restricted by the condition

Q(xr+1, xr+2, ..., xr+q)
 
(where each xr+j is one of the xi)
is interpreted as:
(E)
∃x1 ∃x2 ... ∃xn
(P)
R(x1, x2, ..., xn) ∧ Q(xr+1, xr+2, ..., xr+q)
(I)
∧  x1 =C(1) v1  ∧  x2 =C(2) v2  ∧  ...  ∧  xr =C(r) vr
 
where each xj is one of the xi.
That is, the restriction condition is simply conjoined to the predicate. To obtain a join, we perform a Cartesian product and a restriction. The commonest form of such a join is an equijoin on two columns having the same criterion of identity. (So column identities are determined by their domain; or – we might equally say – two columns are of the same domain if they have the same column identity: the identity determines the domain.)

Applying to R the equality condition,

xr+1  =A  xr+2
 
(or, equivalently, xr+1 =B xr+2)
 
where each of A and B is one of the C(k) such that the interpretation of R contains the identities xr+1 =A va and xr+2 =B vb,
we obtain a relation with the interpretation:
(E)
∃x1 ∃x2 ... ∃xn
(P)
R(x1, x2, ..., xn) ∧ xr+1 =A xr+2
(I)
∧  x1 =C(1) v1  ∧  x2 =C(2) v2  ∧  ...  ∧  xr =C(r) vr
This leaves us with a superfluous identity. We have:
xr+1 =A xr+2 in the predicate and
xr+1 =A va and xr+2 =B vb among the identities.
And any pair of these three implies the third (because “=A” is the same identity as “=B”, and application of the restriction condition ensures that va =A vb).
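The effect of the equality condition can be sketched as follows. The function name `restrict_equal`, the column names, and the use of plain `==` on stored names as a stand-in for "va =A vb" are all assumptions of this sketch, not part of the proposal.

```python
from typing import Dict, Optional

# "E": quantified variables, "P": predicate atoms as text,
# "I": identities as (column, variable, criterion, value) tuples.
Interp = Dict[str, list]

# The product of a "supplies" record and a "uses" record, as a starting point.
product_interp: Interp = {
    "E": ["x", "y", "w", "z"],
    "P": ["x supplies y", "w is used in z"],
    "I": [
        ("Supplier", "x", "supplier", "S1"),
        ("Part", "y", "part", "P1"),
        ("Part-Used", "w", "part", "P1"),
        ("Project", "z", "project", "J1"),
    ],
}


def restrict_equal(interp: Interp, col_a: str, col_b: str) -> Optional[Interp]:
    """Apply the equality condition between two columns of one interpretation.

    Returns None when the record fails the condition. Comparing the stored
    names with == stands in for "va =A vb"; that is a simplifying assumption.
    """
    by_column = {col: (var, crit, val) for col, var, crit, val in interp["I"]}
    var_a, crit_a, val_a = by_column[col_a]
    var_b, crit_b, val_b = by_column[col_b]
    if crit_a != crit_b:
        raise ValueError("columns do not share a criterion of identity")
    if val_a != val_b:
        return None
    # Only the predicate grows; quantifiers and identities are untouched.
    atom = f"{var_a} is the same {crit_a} as {var_b}"
    return {"E": list(interp["E"]), "P": interp["P"] + [atom], "I": list(interp["I"])}


joined = restrict_equal(product_interp, "Part", "Part-Used")
print(joined["P"][-1])                      # y is the same part as w
print(joined["E"] == product_interp["E"])   # True
print(joined["I"] == product_interp["I"])   # True
```

Both part identities survive in "I" alongside the new atom in "P", which is the redundancy noted above: any two of the three imply the third.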

The interpretation of projection differs markedly from that of the classical interpretation: we merely drop the identities associated with the columns that are not projected. Notice that implication is assured (truth functionally, indeed: the rule is, from p ∧ q to derive p). Thus, with column list (C(p1), C(p2), ..., C(pk)) applied to R, assuming p1 < p2 < ... < pk ≤ r, we obtain a relation with the interpretation:

(E)
∃x1 ∃x2 ... ∃xn
(P)
R(x1, x2, ..., xn)
(I)
∧  xp1 =C(p1) vp1  ∧  xp2 =C(p2) vp2  ∧  ...  ∧  xpk =C(pk) vpk
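Projection, on this reading, touches only the identities. A minimal sketch, assuming the same illustrative layout of quantified variables ("E"), predicate atoms ("P") and identities ("I"); the function name `project` and the column names are hypothetical:

```python
from typing import Dict, Iterable

# "E": quantified variables, "P": predicate atoms as text,
# "I": identities as (column, variable, criterion, value) tuples.
Interp = Dict[str, list]

joined: Interp = {
    "E": ["x", "y", "w", "z"],
    "P": ["x supplies y", "w is used in z", "y is the same part as w"],
    "I": [
        ("Supplier", "x", "supplier", "S1"),
        ("Part", "y", "part", "P1"),
        ("Part-Used", "w", "part", "P1"),
        ("Project", "z", "project", "J1"),
    ],
}


def project(interp: Interp, keep: Iterable[str]) -> Interp:
    """Keep the identities of the listed columns; leave E and P exactly as they were."""
    wanted = set(keep)
    return {
        "E": list(interp["E"]),
        "P": list(interp["P"]),
        "I": [ident for ident in interp["I"] if ident[0] in wanted],
    }


composition = project(joined, ["Supplier", "Project"])
print(composition["E"])  # ['x', 'y', 'w', 'z']
print(composition["I"])  # [('Supplier', 'x', 'supplier', 'S1'), ('Project', 'z', 'project', 'J1')]
```

All four quantified variables remain after projection, in contrast to the classical interpretation, where projection introduces existential quantifiers.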
 

We may remark on a number of characteristics of the EPI interpretation, which – on the face of it – looks like something of a compromise between the classical and entity interpretations:

The existential quantifications of the EPI interpretations of records are given for each kept (base) record type. Manipulations neither remove nor add a quantifier (existential or universal). By contrast, existential quantifiers are introduced by projection in the classical interpretation. Intuitively: manipulations add no ontological commitment; what we have not asserted to exist by means of our kept records, we do not (later) assert to exist when we define views or queries.
 
Data manipulations both combine and add to the predicate.
 
No identity is ever added, although data manipulations can combine identities; projection removes identities (that is all it does).
 
Like an entity interpretation, and in contrast to a classical interpretation, an EPI interpretation of a record refers to (asserts the existence of) a number (n) of entities, which is not necessarily the same number as the number (r) of columns.
 
The predicate in an EPI interpretation of a record (when present) is closer to a “relationship”, in the entity/relationship sense, than to the predicate represented by a relation, in the classical interpretation. Most of the meaning of the classical predicate is borne by the identities in an EPI interpretation.
 
The “entities” of the EPI theory are theoretic entities: those things asserted to exist in the “E” – existential quantification – part of the interpretations of kept records.

 

The Join Trap Re-examined

 

 

We have, according to the EPI interpretation, two records, construed as:

∃x ∃y  x supplies y  ∧  x is the same supplier as S1  ∧  y is the same part as P1
 
∃y′ ∃z  y′ is used in z  ∧  y′ is the same part as P1  ∧  z is the same project as J1
Their equijoin is, accordingly:
∃x ∃y ∃y′ ∃z  x supplies y  ∧  y′ is used in z  ∧  y is the same part as y′  ∧  x is the same supplier as S1  ∧  y is the same part as P1  ∧  y′ is the same part as P1  ∧  z is the same project as J1
And their composition (the supplier and project projection of the above) is:
∃x ∃y ∃y′ ∃z  x supplies y  ∧  y′ is used in z  ∧  y is the same part as y′  ∧  x is the same supplier as S1  ∧  z is the same project as J1
All we may conclude, therefore, is that S1 supplies something (y) and something (y′) is used in J1; and these things are one and the same part. And this is correct. There is no way to conclude that there is something both supplied by S1 and used in J1. There is, of course, a perfectly straightforward sense (indeed, that given by the EPI interpretation) in which we could say that there is some part both supplied by S1 and used in J1; but, in that sense, there is no implication of “Something is both supplied by S1 and used in J1”. This should cause us no surprise: there is some flower that both grows in my garden and inspired the poet Wordsworth, namely the daffodil. But that does not imply that there is something that the poet saw on his lonely walk and that I see from my window.

In the car, model, material, and test-method example, the EPI interpretation avoids the conclusion that F197PDU has been tested to destruction by a dummy, and the conclusion that it has a steel frame. It does reach the conclusion that some car that is the same model as F197PDU has a steel frame. So we might do any of the following:
Add supplementary information: if some car of a given model has a frame of a certain material, then each car of that model has a frame of that material.
 
Leave it to the user to supply that supplementary information.
 
Put the Material column in the CAR record.
The implications of the last proposal are complex, and beyond the scope of this paper. They require consideration of the “level” at which higher normalisation should be applied: if applied at the physical (internal) level, such normalisation would avoid redundancy and inconsistency of data; why should it then also be applied at the conceptual level? And they raise questions about what “lossiness” of a join is: inability to reconstruct the records, or to reconstruct the records with their original interpretations?
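The distinction that dissolves the join trap, between being the same part and being one and the same thing, can be made concrete in a short sketch. The `Item` class and its `serial` field are hypothetical devices for telling individual things apart; only S1, P1 and J1 come from the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    serial: int  # hypothetical: distinguishes one individual thing from another
    part: str    # the name given under the column identity "is the same part as"


def same_part(a: Item, b: Item) -> bool:
    """The relative identity used by the join: a and b are the same part."""
    return a.part == b.part


supplied_by_s1: List[Item] = [Item(serial=1, part="P1")]
used_in_j1: List[Item] = [Item(serial=2, part="P1")]

# What the EPI interpretation of the composition asserts.
some_part_in_common = any(
    same_part(a, b) for a in supplied_by_s1 for b in used_in_j1
)
# What the join trap wrongly concludes.
some_thing_in_common = any(a == b for a in supplied_by_s1 for b in used_in_j1)

print(some_part_in_common)   # True
print(some_thing_in_common)  # False
```

The first result corresponds to "some part is both supplied by S1 and used in J1"; the second shows that this does not yield "something is both supplied by S1 and used in J1".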

 

Tests of the EPI Interpretation

 

 

The EPI interpretation is proposed as a very general hypothesis (as indeed were the entity and classical interpretations), and therefore it can be tested – perhaps refuted – but no evidence and argument could establish it. We know that both the entity and classical interpretations have failed to point the way to any satisfactory handling of nulls, and it may be because they use the concept of absolute identity. If N is a null, it is difficult to see how to avoid making “N = N” true if “=” is interpreted as absolute identity.

In order to account for nulls, the EPI interpretation is given a minimal extension: a column is allowed to have a value that is not in the field of its criterion of identity. It is proposed that this be taken as the meaning of “null”:

If C is a column with criterion of identity represented by “=”, then “v is null in C” means: ¬(v = v), i.e. ¬∃w (v = w)
This move allows a column identity to be reflexive, but not totally reflexive, in its column. But if such an identity is not totally reflexive within its column, we may formally define (and so obtain without extra theoretical apparatus) the completion of the identity within the column:
v ≡ w  =df  (v = w) ∨ (¬(v = v) ∧ ¬(w = w))
We now have two identities, the primitive column identity, “=”, representing the criterion of identity of the column, and the defined completed (totally reflexive) identity, “≡”. For any value, v, that is permitted in the column, v ≡ v. But only for a proper value – one identical to something under the column identity – does it hold that v = v.

The adjustment needed to the EPI interpretation of a record is therefore: for each identity, say “=”, of a column in which nulls are permitted, for the identity sentence,

xj = vj
substitute
xj ≡ vj
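The definitions of null and of the completed identity translate directly into code. In this sketch the `NULL` marker and the function names are assumptions; the column identity is modelled, for simplicity, as ordinary equality on every value other than `NULL`.

```python
NULL = object()  # hypothetical marker for a value outside the field of the column identity


def column_eq(v, w) -> bool:
    """The primitive column identity: it never holds of a null, not even of itself."""
    return v is not NULL and w is not NULL and v == w


def is_null(v) -> bool:
    """A value is null in the column when it is not identical to itself."""
    return not column_eq(v, v)


def completed_eq(v, w) -> bool:
    """The completed identity: v = w, or else both v and w are null."""
    return column_eq(v, w) or (is_null(v) and is_null(w))


print(column_eq("P1", "P1"), column_eq(NULL, NULL))        # True False
print(completed_eq("P1", "P1"), completed_eq(NULL, NULL))  # True True
print(completed_eq("P1", NULL))                            # False
```

The completed identity is totally reflexive over everything the column may hold, while the primitive identity remains reflexive only over proper values.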
Again, the implications – for data manipulation languages and for data design and integrity – go beyond the scope of this paper. But it is by investigating such implications (of the treatment of nulls and of natural joins in data design), along with the implications for data analysis and design techniques and for machine-assisted query definition, that the EPI interpretation proposal can be evaluated.

We should, to conclude, now consider again the problem of intentionality in relation to the EPI interpretation. It is to some extent relieved because we merely translate to the word, say, “vacancy”. We leave it to the user to connect that word to the world (whether as the name of an existent or not). Or, we might use self-interpretation on VACANCY records, so that the vacancy was nothing but the record itself (nothing in the EPI interpretation either requires or excludes the use of self-interpretation). But these are sly moves: suppose that the term “vacancy” is intended by both data analyst and user as the name of an entity. What can we then say?

In this case, we certainly do interpret our records as saying that there exists something that is a vacancy (∃x x is the same vacancy as ...). But our interpretation is completely unmetaphysical. If a vacancy is to be an entity of our system, it need be only a theoretic entity: what do we demand of it for logical respectability? That the analyst provide (and the user understand) its criteria of application and identity: do they know when they have a vacancy? Do they know what counts as one vacancy – can they identify and distinguish vacancies? That criterion of identity is the interpretation of “=” in the primary key column of VACANCY.

And that really is all that is needed. Having defined “vacancy” (or “worshipped deity”, or any apparently intentional object) in this way, there is nothing more to say. The real world may or may not contain vacancies, Carthaginian gods, or all manner of weird and wonderful wights, but the FOPC demands only logical respectability of them. Within that constraint, it can live with any metaphysics.

 

CONCLUSIONS

 

 

Despite the impressions sometimes given, the relational theory of data does include an interpretation of its records. The commonly proposed entity(/relationship) interpretation is ill-defined, and – in respect of data manipulations – undefined. By contrast, the classical interpretation – in terms of the FOPC – is well, indeed formally, defined. Its constraints, however, are such as not to permit the wide range of values that we wish to store in our data bases. This is shown both by theoretical considerations, particularly of intentionality, and by the occurrence of join traps. Self-interpretation – a mode of interpretation that makes no reference external to the record-keeping system – provides an only partially successful answer.

An investigation of various concepts of identity reveals where much of the problem lies, for most joins depend upon an identity, and identity underlies the giving of names; and it is constraints on names that cause much of the problem of the classical interpretation. The EPI interpretation is proposed, in which the identities used are all relative identities, and the values accordingly common names, in contrast to the proper names demanded by the classical interpretation; and by this means the join traps are avoided. The EPI interpretation promises insights into the analysis and design of data, and – with a minimum of extra theoretical apparatus – a method of conceptualising nulls. Further investigations in these areas, and in machine-assisted query definition, are needed for a fuller evaluation of the proposal.

 

 


 

Copyright © 1994, 2001 Adrian Larner. The author asserts all moral rights.


 


 

 
