|
(U03)
|
www.btinternet.com/~adrian.larner/database/pcl02
|
|
PLATOCLAST ON DATA
Lecture II Data Formalised
|
|
|
|
|
In the last lecture we twice came to decisions
on what was the simplest approach to a problem:
how many ways to keep fields together, just using records or also using pointers and the like?
how many values to allow in one field-type in one record?
Now sometimes, with a lot of luck, or a lot of experience, or both,
we can just see what the right, simplest, answer is.
But this doesn’t always work, and one way to get a better measure of simplicity is to formalise our theory.
There is nothing magic about formalisation.
All it means is that we make a large part of our theory mechanical,
and we can easily using a computer if necessary check that that part is correct.
Then all the mistakes will be in the other part.
They are just easier to find, you see, if you don’t have to search such a large area.
|
|
|
|
|
|
|
|
What is more to the point,
for our present purposes, if two theories have a common mechanical part, a logic,
we can discover which of them is simpler by comparing the parts we have added in,
the extralogical parts of the theories.
We can compare what we can express in one that we cannot express in the other.
The more desirable theory is the one that requires less to be added, or allows us to express more, or both.
We could call this the more
efficient, or economical, theory[1]:
it has the higher output/input ratio.
It is convenient to begin formalisation by taking a ready-made (and already proven) mechanism,
and we shall take a well-known logic, the First Order Predicate Calculus (FOPC).
This logic is most intimately connected with Relational Theory,
indeed Relational Data Manipulation Languages (DMLs) are simply this logic,
more or less heavily disguised (and sometimes enhanced, or even dis-enhanced, by some extra bells and whistles).
So this little exercise in using the FOPC is designed to kill two birds with one stone:
to familiarise ourselves with the logic and to formulate a theory of formatted data.
|
|
|
|
|
|
|
|
We now begin to define our extra-logical components:
1 Some primitive, or formally undefined, terms
Although these terms are formally undefined, they are not meaningless.
Indeed, we need to grasp their meanings very securely.
Actually we are going to have only four primitive terms, and they are all about records.
“Record” will be not a primitive but a defined term;
records are the only things we are going to talk about in our theory,
but in the rather extended sense that any collection of fields will constitute a record.
So, as we shall see, the smallest sort of record (as we hinted in the last lecture) will be a single field,
and the largest, which encompasses our entire “universe of discourse”
everything our theory talks about
will be the entire data base.
The four primitive terms are:
- Overlaps[2]
What does this mean?
One record will be said to overlap another when they have a common part,
i.e. when they share at least one field.
Obviously, to use “overlaps” we need to mention two records,
or rather we need two record mentions.
I mean, we could say: Rl overlaps Rl, which would, of course, be true.
(Well not “of course”; I’m telling you that there is no “empty record”.)
But to use any of the other three primitive terms we need to mention only one record.
- Is normal You know what that means.
- Is value-normal You don’t know what that means, so I’ll have to tell you.
Take any record, say “A:1, B:2, B:5”. (It is not normal.)
Swap the question and answer in each field; do this mechanically, and do not try to think what it means:
“1:A, 2:B, 5:B”.
So the questions become answers, and the answers become questions.
And now it is normal. Any record that can be made normal in this extraordinary way is value-normal.
Of course, we can have a normal record that is not value-normal, “A:l, B:1” for instance;
and we can have records that are both normal and value-normal (any field, for instance);
and we can have records that are neither normal nor value-normal (the entire data base at a guess).
- Is kept Even I do not know what this means,
but let us assume for the moment that if a record is kept then it is stored in the data base
(written as a complete record in the data base).
Actually, all we need to assume is that some records are recognisably marked as kept.
2. After the primitive terms, some definitions
Strictly, we do not add definitions.
We introduce some terms that will be used as
abbreviations
for already available expressions[3].
The only way we can go wrong here is in misinterpreting the new terms
(that is, in our informal understanding of them differing from the meaning they are formally defined to have).
To distinguish the definitions, and mark them as part of our formal system,
we give them numbers beginning with “D”, thus:
- D1 x is a record =df
- FOR SOME y, y overlaps x.
“x” is used here as a
pronoun[4],
like “it”, and so is “y”.
But they are, so
to speak, not necessarily the same “it”.
We use them to avoid that terrible problem in English that there are not enough pronouns to go round:
Bob hit Tom and he kicked Joe and he slapped him and he tripped him up.
“FOR SOME” is a bit of the FOPC. “FOR SOME y” means
“there is something such that”, or “there exists something that”.
And “=df” shows that what is on the left is defined as,
i.e. is a mere abbreviation for, what is on the right.
So the whole definition means:
- When we write “... is a record”, we mean:
“There is something such that it overlaps ...”.
Or more simply when we say of something that it is a record we say that something overlaps it.
Actually, whatever thing we might choose has something overlapping it,
so this definition does nothing but signal: we are talking in this theory only about records (in a wide sense).
3. After primitives and definitions, postulates or axioms
Postulates or axioms are sentences that we simply assert in our theory.
We can then use the logic (that is what it is really good for) to derive other true sentences from our postulates
(that is the safe, error free, part the dangerous part is making the postulates in the first place).
Here is a postulate that we need:
- P1
- FOR EACH x, FOR EACH y, x overlaps y iff FOR SOME z,
FOR EACH w, if w overlaps z then
both w overlaps x and w overlaps y.
Let us unpick this a little.
“FOR EACH x” means “Each thing, say x, is such that”.
“iff” means “if and only if”.
So this says:
- Given any record(s), say x and y
(x and y might be two records or one and the same record),
under what condition do they overlap?
This is the condition: there is some record, say z
(think of z as a common part of x and y, if they have a common part),
and z has this property
anything (w) that overlaps it also overlaps x and overlaps y.
So without actually using the term “part” (because it is not yet formally defined!)
this rather cunning postulate says that two records overlap when they have a common part.
And that is what we wanted “overlaps” to mean.
|
|
|
|
|
|
|
|
So there we have our method of formalisation:
the FOPC, our primitives (four in this theory),
some definitions (one down and lots to come), and some (not many) postulates,
and we are past the worst of those.
We ought briefly to consider the FOPC itself. Its vocabulary consists of:
- 1. The variables (pronouns),
- “x”, “y”, and so on.
These always (in our theory) stand for (technically “range over”) records.
But they do not stand for anything like
“is normal”.[5]
- 2. The quantifiers,
- “FOR SOME”, called the existential quantifier,
because it can be read as “There is something” or “There exists”,
and “FOR EACH[6]”,
called the universal quantifier,
because it can be read “Each thing in the universe”
not the whole universe, just our “universe of discourse”, what we are talking about.
- 3. Predicates,
- such as the monadic (one place) “... is normal”,
and the dyadic (two place) “... overlaps ....”
These are true or false of records, or in the case of dyadic predicates of records taken pairwise.
We can have predicates of three, four, or more places
(which are true of things taken triplewise, quadruplewise, or generally n-tuplewise).
Just occasionally we get a predicate true of everything in the universe of discourse: we have seen one,
“is a record”.
So (as is mechanically provable from Dl and P1):
FOR EACH x, x is a record.
- Sometimes, to make a general point, we use a schematic predicate,
just an upper case letter to which we have not given any meaning.
For example, FOR EACH x, F(x) or not F(x):
each thing (record) either is an F or is not an F
(e.g. is a field or is not a field, is normal or is not normal, etc.)
- 4. Names, though these are optional,
- say “a”, “b”, and so on.
Each name names one and only one thing that the variables range over,
and that the predicates are true or false of.
That is, these names are proper names,
not common names of the sort that we usually give to records (“EMPLOYEE” and the like);
and they are not “empty names” they name something.
- By filling the places in a predicate with a name or names we get a proposition.
A proposition is a sentence that is true or false,
and a proposition formed in this way is true or false
accordingly as the predicate is true or false of the things named.
As we have already seen, there are other ways of forming propositions
the postulate, P1, for instance
like filling a predicate’s place with a variable and then applying a quantifier to the variable.
- 5. Lastly, the truth functions that connect sentences:
- the monadic “not” for
negation (read as “it is false that”), and the dyadics: “(both ...) and”,
“(either ...) or” (always used inclusively),
“if ... then” (sometimes written “only if”), “if”,
and “iff” (if and only if).
In your average lecture on logic we’ld go on to show how the logic did implications;
how it derived one proposition from another, how it proved theorems.
But, as I’ve said, we’re mainly interested in what we can say in our theory,
that is, its power of expression, so we won’t get into implication.
|
|
|
|
|
|
|
|
So let’s see just what we can express in our theory.
First we need definitions of some of the terms concerned with parts and aggregates of records.
- D2 x is part of y =df
- FOR EACH z, if z overlaps x then z overlaps y.
- That is, x is part of y when
any record that overlaps x also overlaps y.
Remember that each record is part of itself, but not a proper part
- D3 x is a proper part of y =df
- x is part of y and not y is part of x.
And now we can define the criterion of identity of records:
- D4 x is the same record as y =df
- x is part of y and y is part of x.
And we can define an atomic record, or field:
- D5 x is a field =df
- FOR EACH y, not y is a proper part of x.
- A field is a record that has no part except for itself, no proper part.
How then shall we handle “questions and answers”, types and values? We are coming to that.
It is convenient to formalise some concepts as functions rather than predicates
(just as in mathematics it is more convenient to say “y-z”, using the “minus” function,
than to have to use the triadic predicate “... is the difference of ... and ...”).
- D6 xy =df
- THIS z: FOR EACH w, w is part of z
iff both w is part of x and w is part of y.
- The overlap, or product, of two records is this record:
the one that exactly comprises all their common parts.
The use of the product, “xy”, requires a postulate,
because not all pairs of records have a product (not all pairs of records overlap).
- P2
- FOR EACH x, FOR EACH y, x overlaps y iff FOR SOME z,
z is the same record as xy.
- This postulate says: records have a product when they overlap.
- D7 x+y =df
- THIS z: FOR EACH w, w overlaps z
iff either w overlaps x or w overlaps y.
- The sum of two records is this record: the one that exactly comprises all their parts.
- P3
- FOR EACH x, FOR EACH y, FOR .SOME z,
z is the same record as x+y.
- This postulate says: any pair of records has a sum.
Notice where our understanding of “record” has got to.
Any part of a record is a record.
Any sum of records is a record.
It means: if our universe (say, our data base) comprises n fields
then it contains one less than 2n records.
(One less? Yes: there is no empty record.)
This is, you might think, like the work of the vandal:
instead of cutting all our records into fields, the vandal has stuck the fields together into all possible combinations.
But what this shows is that some predicate other than the ones defined on “overlaps” will be needed
to distinguish the desired records from the undesired well, not undesired but incidental records.
And we have “is kept” waiting in the wings for just this purpose.
But it is worthwhile asking: what in Relational Theory distinguishes the “kept” records from the incidental records?
After all, relational records are just sets of fields, relations just sets of records.
And these sets exist (according to Set Theory), just as all our incidental sums of parts of records exist;
so something more than mere existence is needed.
Actually, I’m not quite certain what does the trick in Relational Theory,
but I have a hunch: and it’s much more complex than a single monadic predicate.
As it happens, set theories are far more “generative” than the theory used here.
We start with n fields and get nearly 2n records:
but that is all we get; we do stop there.
A set theory can start with no fields (or anything that is, nothing else)
and generate the empty set, the set containing the empty set, the set containing that, and so on: without limit.
Compared with which, 2n (1 less, even) is very modest indeed.
Now we come to some methods of aggregation, with a schematic definition.
Suppose we were equipped with some predicate, say “F”:
perhaps, for example, “F(x)” might mean “x is an EMPLOYEE record”.
To get something like (but not very like) a relation (the EMPLOYEE relation)
the aggregate of all the records that are F we would use this schematic definition:
- S1 Table (F) =df
- THIS x: FOR EACH y, x overlaps y
iff FOR SOME z, both F(z) and z overlaps y.
- The table of F is this record:
the one that exactly comprises the records that are F.
We can imagine the table of the EMPLOYEE records being built as follows:
take any EMPLOYEE record, form its sum with another EMPLOYEE record,
form the sum of the result with another EMPLOYEE record, and so on.
The cumulative sum is the table of EMPLOYEE
records[7].
- Notice that not every predicate has a table: but almost every predicate does.
If a predicate is empty (for example “not ...is a record”) then it has no table.
At the moment, we do not have many predicates to substitute for “F” in that schema.
But here is one: “is a record”.
Table(is a record) comprises the entire universe (of records!).
We might call it:
- D8 Database =df
- THIS x: FOR EACH y, x overlaps y.
|
|
|
|
|
|
|
|
I must now ask you
to bear with a rather strange and complex definition.
I will tell you what I am trying to do:
I am trying to get an “is the same record-type as” predicate.
Of course, we could introduce such a predicate as a primitive,
but remember what we said about a preference for simple systems.
I don’t want to use another dyadic primitive if I can avoid it,
because it would be simpler to use a monadic primitive
(and even simpler to use no extra primitive, but I can’t see how to get away with that).
It doesn’t matter how many defined predicates we add: they’re just abbreviations.
Recall what a “selection” or “restriction” is in relational terms:
it consists of some horizontal slices of a table (to put it very crudely). Now consider this definition:
- D9 x is a normal selection of y =df
- x is normal and x is part of y and FOR EACH z,
if z is normal and z is part of y then x
is not a proper part of z.
- Think of y as a table containing lots of records.
In our usage y is a record of course, but almost certainly not a normal record.
We are going to take some parts of it: some of them will certainly be normal (any single field would be, for instance).
Think of any normal part of y.
Of course, all its parts will be normal.
But we could think of a special sort of normal part of y,
the sort that is the same record-type as y,
that is, it has all the field-types (columns) of y.
That is what a normal selection is.
And you can see that if, still being a part of y, it contained as much as one extra field,
it would not be normal
(because that field would have to be the same field-type as a field already present in it).
- But this is just what D9 says: x is just such a part, a normal selection,
when it is normal, and it is part of y,
and there isn’t any other part of y that contains x,
unless that other part isn’t normal. Well, so what?
- DlO x is the same format as y =df
- FOR EACH z, if z is a normal selection of x
or z is a normal selection of y then z is a normal selection of x +y.
- Now think about two records, x and y,
and imagine they are the same record-type (the same format).
Obviously their sum, x+y, is also the same record type.
And, as we know, any normal selection of x, or of y,
will be the same record-type, and it will be a normal selection of x+y.
- But suppose that x and y are different record-types.
Then one will lack a field-type in the other:
let us say x lacks one of y’s field-types,
and let y contain a certain field of that type, and call that field “f”.
So we take a normal selection of x.
But it will not be a normal selection of x+y:
because we could add f to it without stopping it from being normal,
but still keeping it a part of x+y.
So these two rather convoluted definitions allow us to define
“is the same format as”.
And now we can use that to define “form-overlaps”
(two records form-overlap if their formats overlap, that is, if they have any common field- type):
- D11 x form-overlaps y =df
- FOR SOME z, z is the same format as x and z overlaps y.
- Suppose x and y have a field type in common.
Take any field of that type that. is found in y,
and form the sum of it with x,
i.e. add it to x, giving z.
Now z is obviously the same format as x
(because the added field was of a type already in x), and it overlaps y.
Now you can go back to Dl, D2, and D3 and see how we used “overlaps”
to define “is a record”, “is part of”, and “is a proper part of”;
and you can substitute “form-overlaps” for “overlaps”
and write definitions DFl, DF2, and DF3 to define “is a format”
(another redundant predicate: every record is a format, i.e. is a record-type),
“is form-part of”, and “is a proper form-part of”.
And likewise D5 can be used to create DF5, the definition of “is a form- field”,
i.e. a field-type.[8]
There isn’t a DF4, because D4 was the definition of “is the same record as”,
and we already have (at DlO) the definition of the equivalent “form” predicate: is the same format as.
|
|
|
|
|
|
|
|
Well, that is how we manage to express common format,
and you might guess that the next task is to express common value:
to say of two fields (usually) that though they may be of different field-type (different column name)
they have the same value.
So we will want an “is the same value as” predicate, true of two records that contain all and only the same values.
But now we know how to get such a dyadic predicate defined, rather than introduced as a primitive.
It needs our mysterious monadic primitive predicate: is value-normal.
So we could define (D12) x is a value-normal selection of y, just like D9,
but with “value-normal” instead of “normal”; (D13) x is the same value as y,
just like D10, but with “value-normal selection” instead of “normal selection”;
and (D14) x val-overlaps y, just like D11,
but with “same value” instead of
“same format”.[9]
And then you could define DVI to DV3, and DV5: “is a value”
(of a record) “is val-part of”, “is a proper val-part of”,
and “is a val-field”
(the value of a field).[10]
|
|
|
|
|
|
|
|
Just to round things off, for this lecture,
we should recall that other primitive predicate: is kept.
A record is kept when it is (roughly) what we would normally call a “data base record”,
a stored or base record.
But notice that our data base does not have to comprise only kept records (or parts of kept records).
But if it does not, then some part of it does. And it’s this part: Table(is kept).
We can go back to schema S1, and substitute “is kept” for “F”
to define the aggregate of all our stored records:
- D15 Primary Database =df
- THIS x: FOR EACH y, x overlaps y
iff FOR SOME z, both z is kept and z overlaps y.
But what else might there be in our data base, apart from stored records?
Well, suppose that f is a form-field (see if you remembered to write it down definition DF5).
And suppose that by “F(x)” we mean:
x is the same format as f.
Then Table (F) would comprise all the fields of the same field-type as f,
i.e. all the values of the field-type of f.
That sounds very like the domain of f, and, as we know,
domains are in some strange way part of our data bases, but not exactly kept
in them.[11]
|
|
|
|
|
Implication
|
|
|
|
|
(After Lecture II,
Professor Platoclast was asked to expand on his remark that the function of a logic,
and the FOPC in particular, was to derive propositions from propositions,
i.e. to show what were the theorems of a theory, the propositions implied by its postulates.)
Yes, I didn’t go into that aspect, the FOPC as an implication mechanism,
because I was more interested in investigating the power of expression of the theory.
But, of course, I can show you how the FOPC serves to derive propositions from propositions,
when one or more propositions imply others.
One thing we know about the FOPC is that it’s valid:
given only true propositions, it will (mechanically) derive only true propositions, never false ones.
The propositions we start from are our postulates:
any proposition that is either a postulate or implied by the postulates
(or implied by propositions implied by the postulates, etc.) is a theorem of our theory.
Here’s an exercise in implication:
let’s prove that the predicate, “is a record”, is redundant;
that is, that everything (in our universe of discourse) is a record.
So what we have to prove is the theorem:
- T1
- FOR EACH x, x is a record.
As we said, “is a record” is defined,
so is merely an abbreviation for what is on the right hand side of its definition.
Therefore we look at its definition, Dl, and replace “is a record” in T1, getting:
- Tld
- FOR EACH x, FOR SOME y, y overlaps x.
Notice that T1 and Tld are one and the same proposition.
To prove Tld (i.e. to prove T1) we start with P1.
P1 is a postulate: we simply assert our theory asserts that it is true.
Now P1 says:
- FOR EACH x, FOR EACH y,
x overlaps y iff FOR SOME z, FOR EACH w,
if w overlaps z then both w overlaps x and w overlaps y.
I.e. it says that for each thing, x, that we mght choose,
there’s a predicate that holds true of that thing, x, and of anything, y.
The predicate is: that x overlaps y iff ... etc.
The FOPC lets us argue, quite rightly, that if this predicate holds true of x
and anything (y) then it holds true particularly of x and x.
So P1 implies the lemma (that means: the theorem, which we use as a step in our argument):
- L1
- FOR EACH x, x overlaps x
iff FOR SOME z, FOR EACH w,
if w overlaps z then both w overlaps x and w overlaps x.
And, of course, the FOPC allows us to reduce
“both w overlaps x and w overlaps x” to
“w overlaps x”
(saying it twice is just like saying it once).
Remember that “iff” means “if and only if”:
the FOPC allows us to use just the “if” or just the “only if”. We will use just the “if”, so LI implies:
- L2
- FOR EACH x, x overlaps x
if FOR SOME z, FOR EACH w, if w overlaps z
then w overlaps x.
which we could equally express as:
- FOR EACH x, if FOR SOME z, FOR EACH w,
if w overlaps z then w overlaps x, then x overlaps x.
So (whatever x is) x overlaps x
if there is something (z) such that FOR EACH w, ... etc.
And the FOPC lets us argue that because x overlaps x
if something (z) is such that FOR EACH w, ... etc.,
then specifically x overlaps x if x
is such that FOR EACH w, ... etc.
So our z gets replaced by x, and L2 implies:
- L3
- FOR EACH x, if FOR EACH w, if w overlaps x
then w overlaps x, then x overlaps x.
But the FOPC recognises that “if w overlaps x then w overlaps x” is true of
anything:
surely, though not everything overlaps x,
everything overlaps x if it overlaps x.
So as x overlaps x is true if
“overlapping x if overlapping x”
is true of everything (which the FOPC knows it is), the FOPC lets us conclude that:
- L4
- FOR EACH x, x overlaps x.
But if x overlaps x the FOPC lets us conclude that
something overlaps x, so L4 implies:
- FOR EACH x, FOR SOME y, y overlaps x.
And that is just the theorem (T1 or Tld) that were trying to prove.
So this may have convinced you that T1 really does follow from P1, given the definition Dl;
or it may merely have convinced you that mechanising our logic (our implicational mechanism),
rather than taking the pains to work out all the implications for ourselves, is a jolly good idea.
You don’t have to remember all that.
But do remember the importance of valid implication: never going from truth to falsity.
Suppose that the data in our data base represents some true propositions,
and suppose that we have a query language that is really a logic,
an implicational mechanism that never goes from truth to falsity.
|
|
|
|
SITE HOME PAGE
|
You can see that the result of any query will be some data that represents true propositions as well.
Guaranteed.
I could live with that.
|
|
THE DATABASE PAGE
|
|
THE DATABASE PAPERS
|
|
|
Preface & Contents
|
|
|
DOWNLOAD
|
Download
Lecture II (rtf, Word for Windows compatible)
|
|
Platoclast on Data: Lecture III
|
|
|
Copyright © 1993, 2001 Adrian Larner.
The author asserts all moral rights.
|