www.btinternet.com/~adrian.larner/database/newmod1

A New Model of Data

A database paper by Adrian Larner

 

Abstract

 

 

A new model of data (excluding “real world” interpretation) is proposed. Informally it is close to the relational model, but – it is claimed – formally simpler, more powerful in expression, and conducive to greater data structure independence. The model amounts to a theory of records (effectively tuples) at a single level of aggregation (in contrast to the two or three levels of the relational theory). After the formal development of the theory a data manipulation language based on it is described, including a number of “outer” operations more tractable than relational outer joins. A treatment of “kept” (base, prime) data is described, which avoids some of the (admittedly slight) complexities of the relational approach. In any first order theory (lacking sets) the definition of ancestral predicates (transitive closures) can present problems: two specific examples are solved, one of them being that of bills of material. The theory permits records with missing and multiple values (non-uniform and non-flat records): justifications are advanced, and operators to apply restriction conditions to such records are defined.

 

MOTIVATION

 

 

What is perhaps the distinguishing aim of databases, in contrast to other means of storage of persistent data, is data independence, i.e. the insulation of users – in the broad sense of application programs and human users – from changes in stored data caused either by new requirements of other users (logical data independence) or by new storage hardware or system software (physical data independence). The achievement of data independence is explicitly the major motivation of Codd’s seminal paper [Codd1970], and was indeed one of the principal aims of IBM’s hierarchical database management system, IMS, to which that paper was to some extent a reaction.

Data independence may be subdivided into:

Data Representation Independence:
An integer field might be changed from binary to decimal; a complex number field might be changed from x and y co-ordinates to modulus and angle; a month field might be changed from 1 through 12 to “i” through “xii”.
Data Access Independence:
Any one of the following might be substituted for any other, or for a variant of itself: hashing, indexed search, exhaustive sequential search, chaining through pointers.
Data Organisation Independence:
We distinguish Data Sequence (or ordering) Independence and Data Structure Independence.

The solutions to the problems of representation, access, and sequence dependence are radical, and brutally simple. We talk of our data, using a data manipulation language (DML), at a conceptual level so abstract, i.e. so weak, that it cannot distinguish one representation, or means of access, or sequence, from another. We simply do not permit ourselves to specify the representation, access, or sequence of stored data: at this level. There is, of course, a level at which these data characteristics must be implemented.

The implementation of data representations is best done within encapsulated data types: the programming theorists have solved our problem for us. The implementation of data access is in part a question of the database management system (DBMS) providing a wide selection of means of access (ideally much wider than current relational DBMSs provide), and in part a question of sound selection from among those means of access – by the data base administrator (tuning) and the DBMS itself (optimisation).

Sequencing is needed in external (user) views of data: the sequence required should be explicitly definable in the specification of those views (and, for human users, in the definition of ad hoc queries). This facility is available in current relational DBMSs, mainly by means of the SQL SELECT ... ORDER BY clause.

 

Data Structure

 

 

Data structure dependence presents a different problem. User views are often required to be highly structured, and we wish to avoid any impact on other users caused by the complexities of, or changes to, such structures. Yet we cannot eliminate all structure from our stored data in the same way that we can eliminate all sequence. A database without any structure would be a mere collection of atomic data items: fields. We might have a field, (SURNAME: Smith), i.e. of field type (attribute) SURNAME and value “Smith”, and another field, (GIVEN_NAME: William), but we would have no means of showing that they were associated with each other, but not associated with some other GIVEN_NAME or SURNAME field respectively. To make such an association we need something that amounts to: (SURNAME: Smith) and (GIVEN_NAME: William) together comprise a kept record (i.e. a “stored” or “base” record).

Very commonly, user views require a hierarchical structure: such structures are exploited by programmers when they design their programs, consciously and explicitly if they employ the method of Data Structured Design [Jackson]. Human users also, unless they have reconciled themselves to purely relational data views, commonly demand hierarchical structures, e.g. Students within Classes within Courses, allowing for Courses lacking Classes and Classes lacking Students. The use of outer joins – extended relational views – is an attempt (not entirely successful) to support such structures. Most programming against relational data bases, however, still involves the construction of the required data view within the program. Thus, to report on Students within Classes within Courses we would typically use three views (i.e. three relations): read a Course view; for each Course record read a Class view to obtain the Classes within the Course; and for each Class record read a Student view to obtain the Students within the Class. Clearly, this sharing of the data structuring task, between the DBMS (which should take entire responsibility for it) and the program, is undesirable. It increases program complexity and consequently decreases programming productivity and quality. The best that can be said of it is that it is perhaps not too high a price to pay for the advantages of relational databases.

We can say, therefore, that:

The relational model of data (ignoring the extended model with outer joins) is incapable of handling the (mostly hierarchical) external views needed by users. This is in large part because of the constraint of Union Compatibility: the views require records of different formats, but a relation is constrained to contain records of only one format. (Outer joins create relations with records of essentially different formats, disguised – albeit rather transparently – by padding with nulls.)
 
Data structure independence cannot be achieved by excluding all structure (i.e. by excluding all reference to structure from the DML). Yet we would hope not to have to specify complex structure. Our aim should therefore be to minimise structure: to admit into our theory the fewest, and only the simplest, notions of structure.
 
The complex structures needed in user views should be provided, if possible, by a mechanism similar to that used for sequencing; we might say, “last minute” structuring. We shall, as will be seen, achieve such structuring by the sequencing mechanism and the DML.

 

The Proposed Data Model

 

See A New Interpretation of Data
 

It is claimed that the data model here proposed is an improvement on the relational data model. The principal criterion used in claiming this improvement is that of efficiency (metaphorically, output/input ratio): it has a simpler basis than the relational model (less input) and gives greater power of expression (more output); see [Goodman]. Using Date’s valuable concept of essentiality, it has fewer and simpler essential constructs than does the relational model, yet it can express all that (indeed, more than) the relational model can express; see [Date1990].

Following [Goodman], the simplicity is achieved by founding the theory (the model) on the first order predicate calculus (FOPC), without set theory, enhanced by four primitive (formally undefined) predicates, one that is dyadic and symmetrical, and three that are monadic. It should be stressed that this theory – the Theory of Two-Dimensional Data (TT-DD) – pertains solely to records; how such records are to be interpreted is quite another question, not addressed here. It will, however, be appreciated that any interpretation applicable to a relational system would be applicable to a slightly constrained TT-DD system. There are, of course, problems in the interpretation of relational systems; but they are outside the scope of this paper.
 

 

As in the relational model, we distinguish the language of the theory (i.e. of the model) from the DML. The language of the relational theory is based on set theory, a relation being a set of sets of ordered pairs; relational DMLs, by contrast, are based on – and pure relational DMLs simply are – the FOPC, more or less sugared. We first seek simplicity (and therefore efficiency) in our theory.

 

THE THEORY OF TWO-DIMENSIONAL DATA – RECORDS

 

 

The basis of this theory is the FOPC; its universe of discourse comprises only records (in a database), in a sense of “record” now to be explained. All data, it is here proposed, comprises fields, which are atomic data items. A record is an aggregate of fields, and is, we shall say, constituted by its fields. This means that:

If p is an aggregate of fields, and q is an aggregate of all and only the same fields, then p is one and the same record as q: we say, “pIq”.

 

Aggregation

 

 

Clearly, we need some notion of aggregation, and we eschew the set-theoretical concepts used in the relational model. Instead, we use the mereological concept of “part”. Intuitively, a handle is part of a cup, and a cup is part of a cup-and-saucer, and a cup handle and saucer together (i.e. considered together, not “stuck together”) are part of a cup, saucer, and plate together. The primitive we use is “overlaps” (O): that a record overlaps a record is expressed by “xOy”. Of course, the variables “x”, “y”, etc. range over records alone, but a record is construed as any aggregate of fields. Following [Goodman] throughout this section, we postulate:

P1
xO«  $"w  wOz ®  wOx Ù wOy
 
The record, x, and the record, y, overlap if and only if there is some record, z (we imagine that z is a record, perhaps a single field, that is part of x and is part of y), such that whatever overlaps z overlaps x and overlaps y.

NOTE: In the FOPC notation, quantifiers – along with the descriptor (¢) and “F” (see S1, below) – have widest possible scope; the other operators are, in decreasing binding strength, functions (Ç, È – see D5, D6 below), predicates, negation (¬), conjunction (Ù), inclusive alternation (Ú), the conditional (®), and the biconditional («).

We now use the primitive “overlaps” formally to define “is part of”, “is a proper part of”, “is the same record as”, and “is a field”.

D1
xÍy  =df  "z  zOx  ®  zOy
 
x is part of y (we use the roughly equivalent set theoretical notation, as we have no other use for it) when each record that overlaps x, overlaps y. On this definition, “is part of” is totally reflexive: "x xÍx.

Perhaps closer to our intuitive notion of “parthood” is the irreflexive predicate, “is a proper part of”:

D2
xÌy  =df  xÍy Ù ¬yÍx
 
x is a proper part of y when x is part of y but y is not part of x.
D3
xIy  =df  xÍy Ù yÍx
 
x is the same record as y when x is part of y and y is part of x. [Goodman] gives the equivalent definition, “"z  zO« zOy”: when each record either overlaps both x and y or overlaps neither x nor y.
D4
Ëx  =df  "y  ¬yÌx
 
x is a field when no record is a proper part of x.

It is convenient to define two functions of records, rather than the equivalent triadic predicates – just as it is easier to use the function, “a+b”, in arithmetic, than to use the predicate, “c is the sum of a and b”. If two records overlap, their overlap (Ç) is the record that exactly comprises their common parts. Unconditionally, the aggregate (È) of two records is the record that exactly comprises all their parts.

D5
xÇy  =df  ¢"w  wÍ«  wÍÙ  wÍy
 
The overlap of x and y is the largest record, z, such that each record that is part of z, is part of x and is part of y. We postulate that records have an overlap if and only if they overlap:
P2
xO«   $z  z I xÇy
D6
xÈy  =df  ¢"w  wO«  wOÚ  wOy
 
The aggregate of x and y is this record, z: all and only the records that overlap z either overlap x or overlap y. We postulate that any records have an aggregate:
P3
$z  z I xÈy

We introduce a schematic definition. Suppose that we have some predicate, say F. The aggregate of all the records that are F (or “that are Fs”, or “of which ‘F’ holds true”) is given by:

S1
FxFx  =df  ¢"z  yO«  $Fx Ù xOz
 
The file of Fs (or “of F records”) is this record, y: each record, z, overlaps y if and only if some F record overlaps z. We might form the file of F records by taking an F record, aggregating another F record with it, aggregating another F record with the result of the last aggregation, and so on until all F records had been aggregated: the final result would be the file of Fs, FxFx.

Notice that our method of aggregation allows us only one level of aggregation of fields. This “file” – the file of F records – is a record; it is merely the aggregate of all the fields that are part of an F record. It contains (i.e. has as part), probably, all sorts of records that are not F records (but rather: aggregates of parts of F records). The aggregation mechanism is much weaker than that of set theory (and of relational theory).

We now exploit schematic definition, S1. First we define a redundant predicate, “is a record” (“redundant” because each thing in our universe of discourse is a record).

D7
Ox  =df  $y  yOx
 
x is a record when something overlaps it. (The redundancy is ensured because “xOx” is implied by P1.)

We now substitute “O” for “F” in S1:

D8
FxOx  =df  ¢"z  yO«  $x  Ox Ù xOz
 
This, “the file of records”, simplifies to “¢y  "x  xOy”, and may be understood as: the database, the record that encompasses all records. Recalling the weakness of our method of aggregation, the database is precisely the mere aggregate of fields that we have already observed not to have sufficient structure for our needs.

We define “x is an atom of y”, i.e. x is a field, and is part of y:

D9
xÎy  =df  Ëx Ù xÍy
 
Not the epsilon of set membership!

Now we can substitute “is an atom of y” for F in S1, and postulate that the defined aggregate, the file of atoms of y, is the same record as y: a record is the mere aggregate of its fields.

P4
I  Fx xÎy
 
This postulate, along with definition D3, formalises the requirement of [Codd1970] that there should be no implicit connections between records: a record is entirely constituted by its parts, and those parts break down exhaustively into fields.

 

Kept (“Base”, “Stored”) Records

 

 

The monadic primitive, “is kept” (K), is introduced to mark those records in the database that we would normally regard as “base”, or – in the most simplistic implementation – stored. A kept record is, in relational terms, a tuple in a base relation. Substituting “is kept” in S1 we get:

D10
FxKx  =df  ¢"z  yO«  $Kx Ù xOz
 
The file of kept records – we might say, the primary database – is the aggregate of all the records kept in the database.

But what fields might there be, in some sense, “in the database”, but not kept (stored) in it? – the fields that are, in relational terms, unused values in domains.

 

 

Return to the start of A New Model of Data.
 

 

Continue reading A New Model of Data
with the section on DATA TYPES AND VALUES
.

 

 

SITE HOME PAGE

Skip to the section, Towards a Data Manipulation Language.
Skip to the section, Kept Records and Forms.
Skip to the section, Absent and Many-Valued Field Types.

 

THE DATABASE PAGE

 

THE DATABASE PAPERS

 

DOWNLOAD

Download A New Model of Data in Restricted Text Format (rtf, Word for Windows compatible)

Another database paper ...

 

Copyright © 1994, 2001 Adrian Larner. The author asserts all moral rights.

The decorative image of a key (cc004239.gif) used on this page was obtained from IMSI's MasterClips/MasterPhotos© Collection, 1895 Francisco Blvd East, San Rafael, CA 94901-5506, USA.