|
|
www.btinternet.com/~adrian.larner/database/newmod1 |
|
A New Model of Data A database paper by Adrian Larner |
||
|
|
||
Abstract |
|
|
|
|
A new model of data (excluding real world interpretation) is proposed. Informally it is close to the relational model, but it is claimed formally simpler, more powerful in expression, and conducive to greater data structure independence. The model amounts to a theory of records (effectively tuples) at a single level of aggregation (in contrast to the two or three levels of the relational theory). After the formal development of the theory a data manipulation language based on it is described, including a number of outer operations more tractable than relational outer joins. A treatment of kept (base, prime) data is described, which avoids some of the (admittedly slight) complexities of the relational approach. In any first order theory (lacking sets) the definition of ancestral predicates (transitive closures) can present problems: two specific examples are solved, one of them being that of bills of material. The theory permits records with missing and multiple values (non-uniform and non-flat records): justifications are advanced, and operators to apply restriction conditions to such records are defined. |
|
|
|
||
MOTIVATION |
|
|
|
|
What is perhaps the distinguishing aim of databases, in contrast to other means of storage of persistent data, is data independence, i.e. the insulation of users in the broad sense of application programs and human users from changes in stored data caused either by new requirements of other users (logical data independence) or by new storage hardware or system software (physical data independence). The achievement of data independence is explicitly the major motivation of Codds seminal paper [Codd1970], and was indeed one of the principal aims of IBMs hierarchical database management system, IMS, to which that paper was to some extent a reaction. Data independence may be subdivided into:
The solutions to the problems of representation, access, and sequence dependence are radical, and brutally simple. We talk of our data, using a data manipulation language (DML), at a conceptual level so abstract, i.e. so weak, that it cannot distinguish one representation, or means of access, or sequence, from another. We simply do not permit ourselves to specify the representation, access, or sequence of stored data: at this level. There is, of course, a level at which these data characteristics must be implemented. The implementation of data representations is best done within encapsulated data types: the programming theorists have solved our problem for us. The implementation of data access is in part a question of the database management system (DBMS) providing a wide selection of means of access (ideally much wider than current relational DBMSs provide), and in part a question of sound selection from among those means of access by the data base administrator (tuning) and the DBMS itself (optimisation). Sequencing is needed in external (user) views of data: the sequence required should be explicitly definable in the specification of those views (and, for human users, in the definition of ad hoc queries). This facility is available in current relational DBMSs, mainly by means of the SQL SELECT ... ORDER BY clause. |
|
|
|
||
Data Structure |
|
|
|
|
Data structure dependence presents a different problem. User views are often required to be highly structured, and we wish to avoid any impact on other users caused by the complexities of, or changes to, such structures. Yet we cannot eliminate all structure from our stored data in the same way that we can eliminate all sequence. A database without any structure would be a mere collection of atomic data items: fields. We might have a field, (SURNAME: Smith), i.e. of field type (attribute) SURNAME and value Smith, and another field, (GIVEN_NAME: William), but we would have no means of showing that they were associated with each other, but not associated with some other GIVEN_NAME or SURNAME field respectively. To make such an association we need something that amounts to: (SURNAME: Smith) and (GIVEN_NAME: William) together comprise a kept record (i.e. a stored or base record). Very commonly, user views require a hierarchical structure: such structures are exploited by programmers when they design their programs, consciously and explicitly if they employ the method of Data Structured Design [Jackson]. Human users also, unless they have reconciled themselves to purely relational data views, commonly demand hierarchical structures, e.g. Students within Classes within Courses, allowing for Courses lacking Classes and Classes lacking Students. The use of outer joins extended relational views is an attempt (not entirely successful) to support such structures. Most programming against relational data bases, however, still involves the construction of the required data view within the program. Thus, to report on Students within Classes within Courses we would typically use three views (i.e. three relations): read a Course view; for each Course record read a Class view to obtain the Classes within the Course; and for each Class record read a Student view to obtain the Students within the Class. Clearly, this sharing of the data structuring task, between the DBMS (which should take entire responsibility for it) and the program, is undesirable. It increases program complexity and consequently decreases programming productivity and quality. The best that can be said of it is that it is perhaps not too high a price to pay for the advantages of relational databases. We can say, therefore, that:
|
|
|
|
||
The Proposed Data Model |
|
|
|
It is claimed that the data model here proposed is an improvement on the relational data model. The principal criterion used in claiming this improvement is that of efficiency (metaphorically, output/input ratio): it has a simpler basis than the relational model (less input) and gives greater power of expression (more output); see [Goodman]. Using Dates valuable concept of essentiality, it has fewer and simpler essential constructs than does the relational model, yet it can express all that (indeed, more than) the relational model can express; see [Date1990]. Following [Goodman],
the simplicity is achieved by founding the theory (the model)
on the first order predicate calculus (FOPC), without set theory,
enhanced by four primitive (formally undefined) predicates,
one that is dyadic and symmetrical, and three that are monadic.
It should be stressed that this theory
the Theory of Two-Dimensional Data (TT-DD)
pertains solely to records;
how such records are to be interpreted is quite another question, not addressed here.
It will, however, be appreciated that any interpretation applicable to a relational system
would be applicable to a slightly constrained TT-DD system.
There are, of course, problems in the interpretation of relational systems;
but they are outside the scope of this paper. |
||
|
|
As in the relational model, we distinguish the language of the theory (i.e. of the model) from the DML. The language of the relational theory is based on set theory, a relation being a set of sets of ordered pairs; relational DMLs, by contrast, are based on and pure relational DMLs simply are the FOPC, more or less sugared. We first seek simplicity (and therefore efficiency) in our theory. |
|
|
|
||
THE THEORY OF TWO-DIMENSIONAL DATA RECORDS |
|
|
|
|
The basis of this theory is the FOPC;
its universe of discourse comprises only records (in a database),
in a sense of record now to be explained.
All data, it is here proposed, comprises fields,
which are atomic data items.
A record is an aggregate of fields, and is, we shall say, constituted by its fields.
This means that:
|
|
|
|
||
Aggregation |
|
|
|
|
Clearly, we need some notion of aggregation,
and we eschew the set-theoretical concepts used in the relational model.
Instead, we use the mereological concept of part.
Intuitively, a handle is part of a cup, and a cup is part of a cup-and-saucer,
and a cup handle and saucer together (i.e. considered together, not stuck together)
are part of a cup, saucer, and plate together.
The primitive we use is overlaps (O):
that a record overlaps a record is expressed by
xOy.
Of course, the variables x,
y, etc.
range over records alone, but a record is construed as any aggregate of fields.
Following [Goodman] throughout this section, we postulate:
NOTE: In the FOPC notation, quantifiers along with the descriptor (¢) and F (see S1, below) have widest possible scope; the other operators are, in decreasing binding strength, functions (Ç, È see D5, D6 below), predicates, negation (¬), conjunction (Ù), inclusive alternation (Ú), the conditional (®), and the biconditional («). We now use the primitive overlaps
formally to define is part of, is a proper part of,
is the same record as, and is a field.
Perhaps closer to our intuitive notion
of parthood is the irreflexive predicate, is a proper part of:
It is convenient to define two functions of records,
rather than the equivalent triadic predicates
just as it is easier to use the function,
a+b, in arithmetic, than to use the predicate,
c is the sum of a and b.
If two records overlap, their overlap (Ç)
is the record that exactly comprises their common parts.
Unconditionally, the aggregate (È)
of two records is the record that exactly comprises all their parts.
We introduce a schematic definition.
Suppose that we have some predicate, say F.
The aggregate of all the records that are F
(or that are Fs,
or of which F holds true)
is given by:
Notice that our method of aggregation allows us only one level of aggregation of fields. This file the file of F records is a record; it is merely the aggregate of all the fields that are part of an F record. It contains (i.e. has as part), probably, all sorts of records that are not F records (but rather: aggregates of parts of F records). The aggregation mechanism is much weaker than that of set theory (and of relational theory). We now exploit schematic definition,
S1.
First we define a redundant predicate,
is a record
(redundant because each thing in our universe of discourse is a record).
We now substitute
O
for F in S1:
We define
x is an atom of y,
i.e. x is a field,
and is part of y:
Now we can substitute
is an atom of y
for F in S1,
and postulate that the defined aggregate,
the file of atoms of y,
is the same record as y:
a record is the mere aggregate of its fields.
|
|
|
|
||
Kept (Base, Stored) Records |
|
|
|
|
The monadic primitive,
is kept (K),
is introduced to mark those records in the database that we would normally regard as base,
or in the most simplistic implementation stored.
A kept record is, in relational terms, a tuple in a base relation.
Substituting is kept in S1 we get:
But what fields might there be, in some sense, in the database, but not kept (stored) in it? the fields that are, in relational terms, unused values in domains. |
|
|
|
||
|
|
||
|
|
Continue reading A New Model of Data |
|
|
|
||
|
|
Skip to the section, Towards a Data Manipulation Language. |
|
|
|
||
|
|
||
|
|
Download A New Model of Data in Restricted Text Format (rtf, Word for Windows compatible) |
|
|
|
||
|
Copyright © 1994, 2001 Adrian Larner. The author asserts all moral rights. |
||
|
The decorative image of a key (cc004239.gif) used on this page was obtained from IMSI's MasterClips/MasterPhotos© Collection, 1895 Francisco Blvd East, San Rafael, CA 94901-5506, USA. |
||