|
|
|
|
|
Student Publications
Author: Kinan M Al Haffar
Title:
A Model Theory for Generic Schema
Management Models
Area:
Country :
Profile:
Program:
Available for Download:
Yes
Sharing knowledge is a vital component in
the growth and advancement of our society
in a sustainable and responsible way. Through
Open Access, AIU and other leading institutions
through out the world are tearing down the
barriers to access and use research literature.
Our
organization is interested in the dissemination
of advances in scientific research fundamental
to the proper operation of a modern society,
in terms of community awareness, empowerment,
health and wellness, sustainable development,
economic advancement, and optimal functioning
of health, education and other vital services.
AIU’s mission
and vision is consistent with
the vision expressed in the Budapest Open
Access Initiative and Berlin Declaration
on Open Access to Knowledge in the Sciences
and Humanities. Do you have something you
would like to share, or just a question
or comment? We would be happy to hear from
you, please use the Request Info link below.
For more infromation on the AIU's Open Access
Initiative, click
here.
|
|
|
|
|
|
Abstract
The core of a model theory for
generic schema management is
developed. This theory has two
distinctive
features: it applies to a variety of
categories of schemas, and it
applies to transformations of both
the schema
structure and its integrity
constraints. A subtle problem of
schema integration is considered in
its general form,
not bound to any particular category
of schemas. The proposed solution,
as well as the overall theory, is
based
entirely on schema morphisms that
carry both structural and semantic
properties. Duality results that
apply to the
two levels (i.e., the schema and the
data levels) are established. These
results lead to the main
contribution of
this paper: a formal schema and data
management framework for generic
schema management. Implications of
this theory are established that
apply to integrity problems in
schema integration. The theory is
illustrated by a
particular category of schemas with
object-oriented features along with
typical database integrity
constraints.
1 Introduction
This paper presents the core results
of a model theory for generic schema
management, by which we mean schema
and
database transformation capabilities
that are independent of a particular
data model. Such transformations
require major
database programming tasks, such as
integrating source schemas when
building a data warehouse or
integrating different user
views into an overall database
schema. In spite of nontrivial
typing issues created by such
transformations, database
programming and other relevant
paradigms have been primarily suited
to dealing with structural aspects
of those
transformations. A major challenge
is in properly addressing semantics:
the integrity constraints associated
with database
schemas.
A second major challenge is in
developing such a model theory that
is applicable to a variety of data
models, such as the
relational, object-oriented, and XML
models [23]. This is challenging
because schemas and their underlying
databases are
very different in these three major
categories of models as are
languages and their underlying
logics used for expressing the
integrity constraints.
On the pragmatic side, generic
schema management operations and
tools have been considered in [4],
and more specific
system implications in [5, 20].
These papers argue that many
difficult and expensive database
programming problems involve
the manipulation of mappings between
schemas. Examples are populating a
data warehouse from data sources,
exposing a set
of data sources as an integrated
schema, generating a web site
wrapper, generating an
object-oriented wrapper for
relational
data, and mapping an XML schema to a
relational schema. Despite many
commonalities of these schema
management
problems, tools and languages are
typically engineered for only one
problem area. A more attractive
approach would be to
build generic schema management
tools and languages that would apply
to all of these problems, with only
some
customization required for the data
model and problem at hand. The
formal open problem is to find a
suitable model theory
that would be able to handle this
generality.
Our proposed model theory has a
categorical flavor, manifested in
the use of arrows that appear at two
levels. At the meta
level the arrows represent schema
transformations. For example, an
arrow could map a data source schema
to a data
warehouse schema, or two arrows
could map each of two data source
schemas into a mediated schema.
These transformations
are defined as schema morphisms that
map the structural properties and
integrity constraints of a schema.
At the data level the
arrows are data transformations
specified as database morphisms.
Database morphisms map the actual
data sets in a manner
that is compatible with the
operations available on those sets
and that preserves the schema's
integrity constraints.
There are several implications of
this arrow-theoretic approach. The
first is that it leads to a very
general view of the
schema integration problem expressed
entirely in terms of arrows. The
generality is accomplished by using
a particular
categorical construction which
applies to a variety of categories
of schemas [10, 12, 17]. The
specific nature of arrows used in
this construction is determined by
the category
1
of schemas that defines the kind of
structural and integrity
transformations that the arrows
actually represent.
The second implication is in the
formal results that relate
properties of arrows at the meta
level and the data
level. These results reveal a subtle
duality manifested in the reversal
of the corresponding arrows between
the two
levels. They also provide guidance
for how to define schema
transformations appropriately, in
particular, so that
the arrows preserve integrity
constraints.
The third, most important
implication is a formal framework
for generic schema and data
management which
applies to a variety of data models.
This is really the main contribution
of this paper. This formal framework
includes and relates the two levels
(schema and data transformations)
and captures both structural
properties and
integrity constraints. A key
component of this theory is a strong
integrity requirement on the
permissible
structural schema transformations so
that they preserve the integrity
constraints. This model theory
relies on
earlier results on a general model
theory for a variety of programming
paradigms and of associated logic
paradigms [9, 10].
We apply the proposed formal theory
to two situations. First, we apply
it to the schema integration problem
where we prove results that are
generic across data models and
include a proper treatment of the
integrity
constraints. Second, we use it as a
pattern for defining a data model,
one based on abstract data types,
which
includes object-oriented features
and typical database integrity
constraints.
The paper is organized as follows.
Section 2 starts with basic
definitions, followed by a running
example of a
particular category of
object-oriented schemas. The basic
categorical definitions are given in
Section 3,
particularly the use of morphisms to
represent schema mappings. Sections
4 through 6 are the core of the
paper.
Section 4 shows how to express the
schema integration problem using
morphisms. Section 5 introduces the
notion
of a generic schema transformation
framework, specifies the condition
for such a framework to preserve
database
integrity and proves the
implications on schema integration.
Section 6 shows how the proposed
model theory
applies to the category of schemas
of the running example. Section 7
discusses related work and
concludes.
2 Schemas
A database schema consists of two
components: a schema signature and
the associated integrity
constraints. A
schema signature specifies
structural and operational features
of a database schema. Its typical
components are
signatures for data types and their
operations and signatures for
database sets (relations,
collections, etc.),
illustrated in Example 1. Note that
type signatures for collections
determine the signatures for
operations on
collections. Signatures for
integrity constraints are logical
expressions whose form is determined
by the choice of
a particular logic and syntax of the
constraint language. A feature of
the definitions below is that they
are
sufficiently general to apply to a
variety of schema signatures and
associated constraint languages.
Definition 1 (Schemas) A
database schema Sch is a pair Sch =
(Sig, E) where Sig is a schema
signature and E is
a set of integrity constraints
expressed as sentences (formulae of
a chosen logic with all variables
quantified).
This paper uses examples based on a
category of schemas of an
object-oriented style. Schemas of
this
category have user-defined abstract
data types specified as Java
interfaces. A schema signature
consists of the
specification of those types and of
collection objects which represent
the actual database. The integrity
constraints
are specified in Horn clause logic.
We use only the most typical
database constraints: those
expressing key
dependencies and referential
integrity constraints.
Example 1 (Sample schema)
schema Publishers { interface
Publisher { String publisherldO ;
String name();
String locationO ;
Set<Publication> publications();
}
interface Publication {
String publicationldO ;
String title();
int year();
Publisher publisherO ;
2
Set<String> keywords();
}
Collection<Publisher> dbPublishers;
Collection<Publication> dbPubs;
Constraints:
Publication X,Y; Publisher Z ,W;
dbPubs.member(X)
:- dbPublishers.member(Z),
Z.publications.member(X);
dbPublishers.member(Z) :-
dbPubs.member(X),
X.publisher.equals(Z);
Z.equals(W) :-
dbPublishers.member(Z),
dbPublishers.member(W),
Z.publisherldO .equals
(W.publisherldO ); X.equals(Y) :-
dbPubs .member (X) ,
dbPubs.member(Y), X.publicationldO .
equals (Y.publicationldO ); }
As Java interfaces lack any kind of
general-purpose constraint
specifications, such constraints are
omitted
from our examples. However, this
omission is by no means a limitation
of our approach (see [1, 2]). Note
that
equals is a method of the Java root
class Object which is
intended to be overridden to provide
a meaning of
equality specific to a particular
data type. If this is the standard
notion of equality then the logic
paradigm
becomes Horn clause logic with
equality as in [11].
A database is a model for a schema.
Given a schema signature Sig,
the collection of all databases that
conform
to Sig (implement Sig) is
denoted by Db(Sig). A
database d that implements
Sig would thus have to implement
the
signatures of data types in Sig
as sets and the operation
signatures as functions. Signatures
for database sets
would also have to be interpreted as
sets, bags etc.
If a database d that conforms
to a schema signature Sig
satisfies the integrity constraints
E, we say that d is
consistent with respect to the
schema (Sig,E). This is
expressed by the satisfaction
relation, denoted |=, between
databases and the sets of sentences
(i.e., integrity constraints) that
they satisfy.
Definition 2 (Database
consistency) A database d is
consistent with respect to the
schema (Sig,E) iff d belongs
to Db(Sig) and d \= e for all e
� E.
3 Schema Morphisms
In this approach the schema
signatures for a given data model
are required to constitute a
category, denoted Sig.
The same applies to the schemas.
Schema transformations within a
particular category of schemas are
viewed as
morphisms of that category. This
approach allows us to talk about
schemas without specifying their
data model
(i.e. category). Formally, a
category C consists of the
following [17]:
� A collection of objects and a
collection of arrows (morphisms).
� Each arrow has its domain object
and its codomain object.
� If / : X -->� Y and
g : Y -->� Z are
arrows of C, then / and g are
composed into an arrow gf : X
-->� Z.
� The composition of arrows is
associative, i.e., if / : X -->
Y, g : Y ^ Z and h : Z
-->� W are arrows of C,
then h(gf) = (hg)f.
� Each object X of C is
equipped with the identity arrow
lx with the property that for
any arrow f:X^Y,lY f =
f
and flx = f.
The collection of all databases
Db(Sig) conforming to a given
schema signature Sig is
required to constitute a
category, with database morphisms
that satisfy the above categorical
requirements.
Schema morphisms are defined as
mappings of schema signatures that
preserve the integrity constraints.
Definition 3 (Schema
morphisms) A morphism of schemas
Schi = (Sigi,Ei) and Sch2 =
(Sig2,E2) consists of a
morphism (j> : Sigi --> Sig2 of
schema signatures which extends to a
mapping of constraints, such that
for all
databases d in Db(Sig2), if d \=
e 2 for all e 2 � E2 then d
\= <j)(ei) for all ei � E\.
An equivalent condition is that
<j>(ei) is in the closure of
E2 for all e\ � E\.
A particular case is (j>(e\)
� E2.
Definition 3 implies that schemas
and their morphisms constitute a
category, which we denote by Sch.
The last
condition in Definition 3
differentiates this work from many
others. It requires that the
integrity constraints of the
source schema are transformed into
constraints that are consistent with
those of the target. This is
expressed in a
model-oriented fashion: a database
that satisfies the constraints of
the target schema also satisfies the
transformed
constraints of the source schema (as
they appear in the target).
3
In Definition 1, distinguishing the
notions of schema signature and
schema (i.e., a schema signature
plus its
integrity constraints) leads to two
notions of schema equivalence. The
first one is just structural, based
on schema
signature morphisms. The second is
semantic, requiring structural
equivalence plus the semantic
equivalence
expressed in terms of integrity
constraints.
Definition 4 (Schema
equivalence) Let Schi and Schz be
two schemas
� Schi and Sch-z are structurally
equivalent if there exists a pair of
schema signature morphisms f : Sigi
--> Sig-2 and g
: Sig-2 --> Sigi such that fg =
lsig2 and gf =^sigx
� Schi and Sch-z are equivalent
if the above condition is satisfied
for schema morphisms f and g.
The category of all schemas for a
given data model is required to
include the initial schema Scho
that contains
basic features implicitly available
in any other schema (for example,
predefined standard types such as
Boolean,
integer and string, and their
associated constraints). Scho
typically does not contain
signatures for database sets
but does contain type signatures for
the required collection types. Any
other schema Schx implicitly
extends Scho
by a unique schema morphism Scho
-->� Schx- The uniqueness
requirement means that there is one
standard way
of incorporating Scho into
Schx which makes Scho a
subschema of Schx-
The notion of a subschema Schi
of Schi may be expressed
by the categorical requirement that
a monic arrow
(schema morphism) m : Schi
-->� Schi exists. This means
that given schema morphisms / :
Schx -->� Schi and g
: Schx -->� Schi, mf = mg
implies / = g [17]. In
familiar cases, monic arrows are
injections.
4 Schema Integration
A particularly important problem in
schema management is schema
integration [6, 21, 22]. Here, we
are given
two (or more) schemas and are asked
to produce an integrated schema that
can represent the information
content
of the given schemas. In one version
of the problem, which we treat here,
the integrated schema is populated
with
data and the given schemas are
defined as views of the integrated
schema.
All published schema integration
results we know of are tied to a
particular data model (i.e., to a
particular
category of schemas). By contrast,
our approach applies to different
categories of schemas. Moreover, it
addresses the subtle problem of
merging the integrity constraints of
two schemas. These two distinctive
features
are accomplished by expressing the
idea of schema integration in terms
of morphisms of schemas. This way
both
structural and semantic conditions
are taken into account.
Definition 5 (Schema
integration) An integration Schi2 of
schemas Schi and Schi is defined by
the following
commutative diagram of schema
morphisms:
Schm --> Schi
hi
P{
Schi --> Sch\2
Two schemas are integrated over
their matching part Schm. Two
schemas always have a matching part:
the
initial schema Scho. The
matching part Schm is
typically a subschema of both
Schi and Schi-The
commutativity
of the above diagram asserts that
the two composite schema morphisms
pcf>i and qfa are
identical. This means
that Schm appears the same in
Schn whichever path (via
Schi or Schi) is taken.
Example 2 (Schema
integration) We show the integration
Schi2 of Schi (in Example 1) with
another schema,
Schz, such that Schi and Schi can
be defined as views of Schn- First,
we define Schi as follows:
schema
Authors
{
interface
Author
{
String authorldO ;
String name();
Date dateOfBirthO ;
Collection<Book> books(); }
interface Book { String
publicationlDO ;
4
String title();
int year();
String publisherO ;
Set<String> keywords();
float price();
Collection<Author> authors();
}
Collection<Author> dbAuthors;
Collection<Book> dbBooks;
Constraints:
Author X,Y; Book Z,W;
dbBooks.member(Z) :-
dbAuthors.member(X),
X.books.member(Z);
dbAuthors.member(X) :-
dbBooks.member(Z),
Z.authors.member(X);
X.equals (Y) :- dbAuthors.member(X),
dbAuthors.member(Y), X.authorldO
.equals(Y.authorldQ );
Z.equals(W) :- dbBooks.member(Z),
dbBooks.member(W),
Z.ISBN().equals(W.ISBN())
}
A schema Schn that integrates
Schi and Schz is given below:
schema Publications {
import Publishers.Publisher;
import Publishers.Publication;
import Authors.Author;
interface Book extends Publication
// integrates Book and Publication
in Schl2
{ float price();
Collection<Author> authors(); }
Collection<Publisher> dbPublishers;
Collection<Publication> dbPubs;
Collection<Author> dbAuthors;
Collection<Book> dbBooks;
Constraints:
import Publishers.Constraints;
import
Authors.Constraints; Publication X;
dbPubs.member(X) :-
dbBooks.member((Book)X) // an
additional constraint beyond Schl
and Sch2 }
Among all integrations Schn
of schemas S\ and 52, the
schema-join of Schi and 5c/i2
denoted Sch\ * 5c/i2 , if
it exists, has a distinctive
property: it represents the minimal
integration of Schi and
Sch'i- This notion is specified
below entirely in terms of schema
morphisms by a categorical
construction called a pushout
[17].
Definition 6 (Schema join)
Schema-join Sch\*Sch2 of schemas
Schi and Schi is defined by the
following
commutative diagram of schema
morphisms:
Schm -->
Schi
hi
h[
Schi --> Schi *
Schi
with the following property:
Given any integration Schn of
schemas Schi and Schi as defined in
Definition 5
above, there exists a unique
schema morphism <f> : Schi *
Schi -->� 5c/ii2 such that
<f>k = q and 4>h = p.
A specific construction of
schema-join is presented in Section
6. The fact that schema integration
and schema-
join are expressed entirely in terms
of arrows representing schema
morphisms has two distinctive
implications: (i)
both structural and database
integrity properties are integrated,
and (ii) the notion of schema
integration is data
model independent. Specific notions
of schema integration are obtained
by choosing a particular category of
schemas which includes its schema
morphisms.
We believe that the notion of least
upper bound in the schema lattice of
[7] and the schema + operator of
[19]
can be modeled precisely as schema
join. Our category-theoretic
treatment generalizes [7] by
considering the
instance level, not just the schema
level and generalizes both models by
making it applicable to a variety of
schema categories, including such
features as methods and constraints.
5
5 Generic Schema Transformation
Framework
The categorical approach developed
so far leads to a very general
notion of a data model. This model
has two
levels. The meta level consists of
database schemas and their
transformations expressed as schema
morphisms.
This transformation-based view is
quite different from the standard
notions of a data model. At the
instance level
these transformations operate on
databases that conform to schemas at
the meta level.
The notion of database integrity
plays a fundamental role in this
formal framework. The acceptable
transformations at both levels are
required to satisfy the integrity
constraints. This integrity
requirement is
expressed as a condition that
involves acceptable schema signature
transformations and the satisfaction
relation
between databases and the integrity
constraints.
The framework is proposed as a
formal pattern for defining data
models such that schema management
and
the associated database
transformations have well-defined
formal meanings. As such, it offers
interesting
observations on the relationships
between the two levels which are
sometimes quite different from the
usual
views of schema and data
transformations. An example is the
reversal of the directions of
transformations at the
two levels. Schema management
operations such as schema
integration have particularly
desirable semantic
properties when the conditions for
this formal framework are satisfied.
The core of this model theory
requires just one more categorical
notion: a morphism of categories.
Given
categories C and B, a functor F :
C --> B consists of two
related functions with the following
properties [17]:
� The object function which assigns
to each object X of C an
object FX of B.
� The arrow function which assigns
to each arrow h : X --> Y
of C an arrow Fh : FX
-- FY of B.
� F(lx) = IFX and F(gh) =
F(g)F(h), the latter whenever
the composite gh is defined.
Two functors play a crucial role in
this theory. The first one is Sen
: Sign -->� Set
where Set denotes the
category of sets. If Sig is a
schema signature, then Sen(Sig)
is the set of all well-formed
sentences (integrity
constraints) over Sig. Many
constraint languages, hence many
Sen functors, are possible for a
given data model
(i.e., which implies a choice of
Sign). Sen is determined
by the choice of logic that
specifies the syntax of
sentences of the constraint
language. This syntax is defined
starting with the features of a
schema signature. The
schema signature typically
determines the terms of the
constraint language, and the logic
determines the formulae
based on those terms. This is why
Sen maps a schema signature into
a set of sentences over that
signature. Sen
also maps a schema signature
morphism to a function that
transforms the sets of sentences.
The second functor is Db :
Sign -- Catop. For
each signature Sig, Db(Sig)
is the category of Sig
databases,
together with their morphisms. These
database morphisms represent data
transformations, which correspond to
the schema transformations
represented by arrows in Sign.
Cat denotes the category of
categories. Objects of Cat
are categories and arrows of Cat
are functors. Catop
differs from Cat only to the
extent that the direction of its
arrows is reversed. This reversal of
the direction of arrows that happens
going from schemas to their
databases is
one of the subtle and characteristic
features of this model theory.
Recall that databases in the
category Db(Sig) are not
necessarily consistent with respect
to a set of integrity
constraints E. The notion of
database integrity is captured by
the satisfaction relation |= between
databases in
Db(Sig) and sets of sentences
over Sig.
We now have the machinery to define
the notion of data model described
in the beginning of this section. To
emphasize the transformation-based
view relative to classical data
models, we give it a new name.
Definition 7 (Schema
transformation framework) A schema
transformation framework consists
of:
� A category of schema signatures
Sign equipped with the
initial object. This category
consists of objects
representing schema signatures
together with their morphisms.
� A functor Sen : Sign
-- Set. Sen(Sig) is a set
of sentences over the schema
signature Sig.
� A functor Db : Sign
--> Catop. For each
signature Sig, Db(Sig) is the
category of Sig databases, together
with their morphisms.
� For each signature Sig, a
relation \=sig Q | Db(Sig)
| x Sen(Sig) called the
satisfaction relation. \ Db(Sig) \
denotes the set of objects of the
category Db(Sig).
� For each schema signature
morphism (j> : Sig A -->
SigB, the following Integrity
Requirement holds for each
SigB database dB and each
sentence e � Sen(SigA):
dB \=si9B Sen(4>)(e) iff
Db(<j>)(dB) \=si9A e.
6
The above definition is based on
[9]. Its relationships are
represented by the following
diagram:
Db(SigA) --> Sen(SigA)
Db^)]
|5en(0)
Db(SigB) --> Sen(SigB)
Note the reversal of the direction
of the arrow Db(<j>) relative
to Sen(<j>). Sen((f>) maps
each constraint in Sen(SigA)
to a
constraint in Sen(SigB). By
contrast, Db{4>) maps each
database in Db(SigB) to a
database in Db(SigA)- If we
think of <f> as
a mapping from a logical database
schema SigA into a physical
schema SigB, then Db{4>)
tells how to materialize a view
in
Db(SigA) from a database in
Db{Sigs)-
To see why this reversal of arrows
happens, consider an injective
schema morphism </> : Sch\ -->
Sch^ which makes
Sch\ a subschema of Sch^ .
Schi just extends the signatures
and the constraints of Sch\
(we assume a monotonic logic such as
Horn clause logic). A database that
is consistent with respect to
Schi is also (by projection)
consistent with Sch\. The
other
way around does not hold.
The Integrity Requirement
puts a very strong semantic
restriction on the permissible
schema signature transformations
(schema morphisms). It only allows
ones that preserve the validity of
constraints. That is, suppose 4>
maps constraint BA to
es
(more precisely, es =
(Db)(<j))(eA ) for BA G
Sen(SigA))- Then the
Integrity Requirement says that
es is valid in a database
(1B iff &A is valid in
the SigA database that
corresponds to ds (i.e., in
Db(<j>)(d,B))-
The Integrity Requirement
applies directly to the problem of
schema integration in Definition 6.
It ensures that each valid
database of the integrated schema
corresponds to valid databases of
the schemas to be integrated. This
point is made precise in
the following theorem.
Theorem 1 (Materializing
subschemas) Suppose Sch\ and Sch^
are schemas that are integrated into
Schi2 relative to some
schema transformation framework.
Given a consistent database d in
Db{Sch\2), it is possible to
construct databases d\ and
G?2 that are consistent relative
to schema Sch\ and schema Schi
respectively.
Proof We have schema
morphisms </>i : Sch\ -->�
Sch\2 and </>2 : Schz
-->� Sch\2 where Sch\ =
(Sigi,E\) and 5c/i2 =
(Sigz,E'i)- We construct
database d\ as Db((f>i)(d)
and database di as
Db(<f>2)(d). Since d is a
consistent database and </>i
and </>2 are schema morphisms,
Definition 3 implies:
� d |=sjgi2 (Sen)(cf>i)(e)
for all eG-Bi
� d |=sjgi2 (Sen)((/)2
)(e) for all e � E2.
We can now complete the proof by
applying the Integrity Requirement
to the above two lines, yielding:
� Db{<f>i){d) \=si91 e for
all e�fii
� Db(<j>2)(d) \=si92 e for
all e G E2.
In essence, databases conforming to
Sch\ and Schz are
materialized views of Sch\2-
Thus, the above theorem relates the
consistency of databases conforming
to the integrated schema to those of
the materialized views. This is an
unusual
perspective in that views typically
do not have integrity constraints.
Note that in order to make this
proof possible both the reversal of
arrows and the Integrity Requirement
are essential.
Corollary 1 (Schema-joins)
Let Schri be an integration of
schemas Sch\ and Sch,2 relative to a
schema transformation
framework. Let d be a consistent
database in the category Db{Sig\2).
If Sch\ * Schz = (Sigi*2,
�-1*2 ) exists, then there is a
canonical way to construct a
consistent database d* in the
category Db(Sigi*2)-
Proof By Definition 6, there
is a unique <j> : Sch\ *
Schz --> Sch\2- d* is
constructed as Db(cf))(d).
Since d is consistent, by
Theorem 1 so is d*.
Corollary 1 says that for any
consistent database of an integrated
schema, there exists a corresponding
(minimal)
consistent database of the
schema-join of the two schemas that
were integrated. The canonical arrow
Db(<f>) is a recipe for
constructing that (minimal)
consistent database.
7
6 An Application of the Schema
Transformation Framework
One role of the generic schema
transformation framework is to serve
as a formal generic pattern for
defining new data models,
with well defined meanings for
schema and data transformations.
This section gives a detailed
description of how one
constructs such a data model: a
category of schemas and their
databases for Examples 1 and 2,
which we call Objc (00 with
Constraints).
A schema of Objc consists of
a collection of sorts (type names)
some of which are predefined in the
initial schema Scho
and which in particular must include
the sort Boolean along with the
standard axioms. The others are
abstract data types
defined in the schema itself, each
of which is a set of method
signatures specified as a Java
interface. The is-a (inheritance)
relationship thus amounts to the
subset relation which agrees with
the rules of the Java type system. A
schema contains
collection objects which are the
actual database sets. To specify
their type, a parametric
Collection < T > type is used
and
instantiated with the required
element type selected among the
abstract data types defined in the
schema.
Definition 8 (Objc schema
signatures) A schema signature
consists of:
� A finite set of sorts S, which
includes Boolean.
� A finite set of interfaces A
(abstract data types) such that A C
S .
� An interface Ak is a set of
method signatures of the form C m{C\
xi, Ci X2,... ,Cn x n ) where C
� S and Ci � S for i =
1,2,... ,n. (m is the method
name, C is the return type, and C\
is the type of parameter x\.)
� If the interface A ^ extends
the interface A\ (inheritance) then
A\ C A k-
� A finite set of collection
signatures {Collection < Aj >:
X,} where Aj � A and
Collection < Aj >� S.
To specify the type of data sets of
a schema, parametric collection
types are required. The issue of
parametric types is a
major one by itself. It will be
elaborated only to a limited extent
here by showing that a specific
collection type (i.e., for a
specific element type) is obtained
from a parametric type by the same
pushout construction [10] that we
used for schema
integration.
Definition 9 fObjc
collection types) An instantiated
collection interface Collection < A
> is defined by the
following pushout diagram
T -->� Collection < T >
i
H
A -^ Collection < A >
In the above definition a parametric
interface Collection < T > is
viewed as a morphism T -->
Collection < T >. The
substitution T --> A
and instantiation Collection < T
>--^ Collection < A > are
also morphisms. The morphism A
-->�
Collection < A > is obtained
from T --> Collection < T >
and the substitution T -->
A.
The above construction generalizes
the usual views of instantiated
parametric types. If a pushout is
based on schema
morphisms (which apply to both
structural properties and integrity
constraints), not just on signature
morphisms, then the
notion of instantiation of
parametric types becomes semantic in
nature.
Constraints are sentences expressed
in Horn clause logic. Terms include
method invocations and thus appear
in the object-
oriented form. Atoms are invocations
of Boolean methods. This limited
logic is sufficiently expressive to
capture most typical
database constraints, such as key
and inclusion dependencies.
Definition 10 fObjc
constraints) For a given schema
signature Sig with sorts S:
� A collection object of type
Collection < A ^ > is a term of type
Collection < A ^ >.
� A variable X of type A ^ where
A k � S is a term of type A ^ .
� If a is a term of type A k ,
are terms of respective types A\,
A-z,..., An, C m(Ai, A-z,..., An) is
a method
of the interface Ak. of Sig, then
a.m(ai,a 2, , a n ) is a term
of type C.
� With the above definitions, an
atom is of the form a.m{a\, 02,
� � �, a n ) where m 's result
type is Boolean.
� A constraint is of the form p
<-- pi,p2, ,Pn where p,pi,P2,
,pn are atoms.
Permissible schema signature
transformations of Objc are
defined below as schema signature
morphisms. Their core is a
mapping of sorts, extended to
signatures of methods and collection
objects. This mapping is the
identity on the inital schema's
predefined sorts, so these sorts
have the same interpretation in all
Objc schemas. Note that
Collection < T > is not intended
to
have methods with arguments of type
Collection < T >, so that
Collection < A ^ > would be a
structural subtype of Collection
< A\ > if A/- is a
subtype of Ai.
8
Definition 11 fObjc schema
signature morphisms) A morphism of
schema signatures </> : Sigi -->
Sig-z consists of the
following:
� An injective mapping of sorts
<f> : S\ --> S2 such that
-- (f>(s) = s for s So where So
are the sorts of the initial schema
Scho-
-- (^(Collection < Ai >) =
Collection < <f>(Ai) >
� <f> applies to interfaces as
follows: <f>(C m(Ci, C2, , Cn)) =
4>{C) (j>(m)((j)(Ci), <f>(C-2),
� � �, (j>(Cn))
� <f> maps collection objects of
Sigi into collection objects of
Sig-2 such that ^(Collection < Aj
>: Xj) =
<f>(Collection < Aj >) :
4>(Xj).
The above definition is intended to
allow a mapping of an abstract data
type to its extension with possible
renaming and
possibly enforcing structural
subtyping conditions.
The notion of schema signature
morphism is extended below to a
schema morphism. The extension maps
the integrity
constraints in accordance with the
mapping of the sorts in a schema
signature morphism.
Definition 12 fObjc schema
morphisms) A morphism of schemas cf>
: Schi -- Sdi2 is a schema
signature morphism <f>
: Sigi --> Sig2 such that
(j> extends to a function E\ --> E2
as follows:
� <j)(xs ) = X0(3)
1, 0,2-, � � � ; 0,n )) =
cf>(a).cf>(m)(cf>(ai),...
,cf>(a,,)), where m may be a Boolean
method.
� (j > (p < - p i , P 2 , . . . ,
P n ) = 4>(P)
^4>(Pl),4>(P2),---A(Pn)
Given Definitions 10 and 11, the
definition of the Sen functor
is immediate and so is the
verification of its functorial
properties.
Definition 13 fObjc Sen
functor) Define Sen : Sign
--> Set as
follows:
� Sen(Sig) is a set of sentences
of the form {p <-- pi,p2,
,Pn} where p,pi,P2, ,pn are atoms
according to
Definition 10.
� Given a schema signature
morphism (j> : Sigi --> Sig2,
Sen(cf>)({P ^- pi,P2, ,Pn}) =
{4>(P)
<-<t>(j>l),(p(p2),...,<l>(pn)}-
Proposition 1 fObjc Sen
functor) Then Sen is a functor
Sign -->� Set.
Proof Sen obviously
preserves the identity schema
signature morphism, i.e. Sen(lsig
) = ^sen(Sig) and the
composition of
schema signature morphisms, i.e.,
Sen(<f>\<j)2) = Sen{4>i)Sen{4>2)-
A database for a given Objc
schema includes a domain for each
sort in the schema. A domain is a
set equipped with
functions representing the
interpretation of the method
signatures. A database also includes
the actual data sets, one for each
collection object signature in the
schema. By Definition 2, a database
for a given schema must be
consistent.
Definition 14 ('Objc
databases) A database d for a given
schema Sch = (Sig,E) consists of the
following:
� A collection of sets (domains)
{Ds \ s � S} where S is the set of
sorts of Sig. S includes Boolean,
whose domain is true
and false.
� For each method m of an
interface A ^ with the signature C
m(C\ x\, C2 X2, ,Cn x n ) a function
fm : Dw --> Dc
where Dw = DAk x Dc\ x
. .. x Dcn with w = AkCi ... C,,.
� For each collection variable of
type Collection < A k > a set
with elements from the domain Dk.
� If Ai <Z Au then there exists a
function (projection) Dk --> Di and
thus also a function Collection < Dk
>-->
Collection < Di >.
� Let e be a constraint with
variables X such that Xs C X
is a set of variables in X of sort
s. If 8 is a substitution of
variables in e (i.e. a family of
functions Xs --> Ds ) then e < 9
> evaluates to true.
Although collections in the above
definition are interpreted as sets,
in general they can be bags.
To complete the above definition, we
must specify its morphisms of the
category of databases: a family of
functions, one
per database domain, that map the
actual data sets as well. These maps
are required to satisfy two
conditions: a standard
algebraic condition that applies to
operations on the original domains
and their images, and a condition
that applies to the
integrity constraints. As we are
talking here about the category of
databases for a schema, not just for
a schema signature,
database morphisms are required to
map each consistent database into
another consistent database.
9
Definition 15 (Database
morphisms for Objc
schemas) Let d and d' be databases
consistent with respect to a schema
(Sig, E).
Then a database morphism h : D
--> D' is a family of
functions h s : Ds -- D's for
s G S where S stands for the
set of sorts
of Sig.
� For each method m of interface
A k with signature C m(Ci x\, C2
X2,... ,Cn x n ) and a G Dw,
h s{fm{a)) =
f m (h w(a)) holds, where h w is
a product of functions h A k x
ftci x ... x hcn when w = A
^CiC-2 ... Cn, as
in the following diagram:
/m
Dw --> Dc
h wl
|/ic
D'w ^ D'c
� Suppose e G E has
variables X. If 6S : Xs -- Ds
is a substitution of variables for X
and e < 8 > evaluates
to true, then so does e <
9' >, where 8' is constructed as the
composition of 8S and h s .
Definition 16 (Db functor
for Objc schemas)
Define Db as follows:
Db(Sig) is a category with
objects and arrows defined in 14 and
15. Given <f> : Sigi --> Sig2,
Db((j>) is defined as:
� If (fe is a Sig2
database with domains {Ds | s G
S2} then Db((j>)(d2) is a Sigi
database with domains {Ds I s G
(f>(Si)}. The same applies to the
collection objects.
� A morphism h : e?2 -->
d'2 of Sig2 databases (a family of
functions {h s \ s G �2},) is
mapped into a morphism
Db{<f>){h) of Sigi databases by
restricting h to the family of
functions {h s | s G
(f>(Si)}.
Proposition 2 specifies how the
Db functor is constructed and
how its functorial properties are
verified. Note that D s and
hs
now correspond to the sort 4>~1
(s) � Si.
Proposition 2 (Db functor
for Objc schemas) Db
in Definition 16 is a functor
Sign --> Catop.
Proof A schema signature
morphism <f> : Sigi -->�
5**32 according to Definition 11
maps to a database morphism
Db((f>)
: Db{Sig2) --> Db(Sigi)
according to the above
construction and Definition 15. The
family Db(<f>)(d2) -->�
Db{4>)(d'2) is
a Sigi database morphism,
because h is a morphism of
Sig2 databases. Two properties
are essential here: the injective
mapping
of sorts required in Definition 11
of schema signature morphisms and
mapping of collections as specified
in Definition 14.
The developments presented so far in
this section lead to two theorems.
The first one asserts that the
paradigm satisfies
the conditions of Definition 7 and
thus is a schema transformation
framework. It is based on the
already established functors
Sen and Db.
Theorem 2 (A schema
transformation framework)
� Let the category of schema
signatures Objc be
defined according to Definitions 8
and 11.
� Define the functor Sen :
Sign -->� Set
according to Proposition 1.
� Define the functor Db :
Sign --> Catop
according to the Proposition 2 so
that the category Db(Sig) is defined
according to Definitions 14 and
15.
� The satisfaction relation is
also specified in Definitions 14 and
15.
Under the above conditions the
Integrity Requirement holds.
Proof We have to prove that
each schema signature morphism </> :
Sig A --> SigB in Definition
11 satisfies the Integrity
Requirement, that is, ds
\=Sig B Sen{<j))(e) iff
Db(<f>)(dB) \=Sig A e holds for
each sentence e � Sen(SigA)-
Given e � Sen(SigA), Sen(cf))(e)
is defined by Proposition 1. For
a given object ds of the category
Db(SigB),
Db(<j>)(d,B) is defined in
Proposition 2. The proof amounts to:
dB \=Sig B (4>(P) <-
<MPl),'MP2),---,</>(>n)) iff
Db((f>)(dB) \=SigA (P <~ Pi,
Pi, , Pn)
The second theorem establishes the
existence of the schema-join of any
two schemas of this schema
transformation
framework. This result provides a
standard technique for integrating
two Objc schemas. This
integration is both structural and
semantic (i.e. the integrity
constraints are properly integrated)
because it is based entirely on
schema morphisms according to
the pushout construction in
Definition 6.
Theorem 3 The schema-join of any
two Objc schemas over
the initial schema Scho exists.
10
Proof Given schemas Schi
and Schz the integrated
schema Schi * Sch^ is
defined as follows: (i)
Ssch!*Sch2 =
<5iU52 (sorts), (ii) ASch1*Sch 2
= A Scht U ASch2
(interfaces), (in) CSch1*Sch 2 =
Csch^Csch* (collections), and
(iv) Esch!*Sch2 = Escht U
ESch 2 (constraints).
With this construction Schi
and Sch^ are subschemas of
Schi * Sch^ since we have
two injective schema
morphisms <f>i : Schi --> Schi
* Schz and <f>2 '
Sch,2 -->� Schi *
Sch.2-
Suppose that we are given p :
Schi --> S12 and q : Sch,2
--> Si 2 . The required schema
morphism <j> : Schi
* Scti2 --> Sch\2 is defined
on the sorts as follows: <f>(s) =
p(s) for s � Si and
<j>(s) = q(s) for SGS2.
Note that <f>i(s) = s for
s � So and <fo(s) = s for
s � So- So ^ is well defined for s �
Si fl S2.
As interfaces, collections and
constraints of Schi * Sc/12
are defined as disjoint unions of
the corresponding
components of Schi and
Sch2 , we have <f>h = p
and <j)k = q.
Note that with the choice of a
different logic the above result
would not necessarily hold. The
union of
constraints could lead to a set of
sentences for which there is no
database (particularly in a given
category) that
satisfies the constraints
(represents a model for the
constraints) <j>i{Ei) and
^2(^2) (see Definition 3).
7 Conclusions and Related Work
The main contribution of this paper
is a model theory for generic schema
management. While the ideas on
generic schema management proposed
so far have been mostly informal [4]
and pragmatic [5], this paper shows
that a formal framework for such
generic tools does indeed exist.
A distinctive feature of the
presented paradigm is that it
applies to a variety of categories
of schemas and to a
variety of logic bases for
expressing database integrity
constraints. This level of
generality is accomplished by
making use of a categorical model
theory called institutions,
proposed for programming language
paradigms [9].
We believe ours is the first
application of this theory to
database models and languages. An
earlier attempt in [15]
is similar in spirit, but is less
general. It also does not address
the logic basis and preservation of
integrity
constraints and lacks provable
results.
The core of this model theory is a
general, transformation-oriented
definition of a data model. This
generic
schema management framework serves
as a pattern for constructing data
models in such a way that schema
transformations as well as the
associated database transformations
have well-defined meaning. As a
rule, most
well known data models have not been
constructed in such a spirit.
Furthermore, this framework is
intended to be
applied to data models that are
still not completely or formally
established, such as XML. A further
distinctive
feature of this framework is that it
has a strong database integrity
requirement as its possibly most
fundamental
component.
This paper also shows how to address
some well-known problems such as
schema equivalence [13, 14] and
schema integration [6] at this level
of generality. The results are
independent of a particular category
of schemas
and apply to both structural
properties and integrity
constraints. The requirement for
proper handling of the
integrity constraints in those
problems is one of the contributions
of this paper. Furthermore, this
approach is
independent of a particular logic
paradigm used as a basis for the
constraint language.
Contrary to the published work on
schema and data transformations
(schema integration in particular)
in
which transformations at the meta
and data levels have the same
direction [19], this paper reveals a
subtlety not
considered in the relevant papers.
The subtlety is manifested in the
reversal of the transformation
arrows at the
two levels. This observation is an
important component of the presented
model theory, having both pragmatic
and
mathematical significance.
In this paper one option is that
schema transformations must satisfy
particular structural subtyping
conditions.
Such a strict discipline has its
place in programming languages, but
it is often not satisfied in schema
transformations. Furthermore, if the
abstract data types are equipped
with constraints, semantic
compatibility
notions such as behavioral subtyping
[16] become a major issue. Model
theoretic implications of behavioral
compatibility issues in mapping
abstract data types equipped with
constraints are given in
[!]
In a model in which inheritance is
identified with subtyping (as in
Java) a single partial order of
sorts
is required. This introduces further
subtleties in schema transformations
and schema integration. Models
with different types of ordering of
sorts are elaborated in [1] and [3].
The relevant results on pushouts of
algebraic specifications are given
in [12].
11
In this paper we do not consider the
problem of mapping one schema
transformation framework (for
example, relational)
into another (for example,
object-oriented). This situation is
captured by the notion of a morphism
of schema transformation
frameworks (following [9, 15]), and
is a topic of work in progress.
The formal paradigm presented in
this paper applies to appropriately
simplified XML schemas [23]. The
reasons are: the
nature of the type system of the XML
Schema, the kind of the integrity
constraints (keys and referential
integrity constraints)
expressible in XML schemas (also
considered in [8]), and the
existence of the initial schema
(more precisely, a name space).
Space limitations do not allow
further elaboration of these
important implications of the
development presented in this paper.
However, application of this model
theory to the XML data model is a
major topic of future research.
This paper is not addressing
explicitly modeling of database
dynamics. But this is by no means a
limitation of the
paradigm. Database dynamics is taken
into account by choosing the Sen
functor for a suitable temporal
logic. A development
of such a paradigm is given in [1,
2, 3].
References
[I] Alagic, S.: Semantics of
temporal classes, Information and
Computation, 163, pp. 60 - 102,
2000.
[2] Alagic, S.: Constrained
matching is type safe, Proceedings
of the Sixth Int. Workshop on
Database Programming
Languages, LNCS 1369,
Springer-Verlag, pp. 78-96, 1998.
[3] Alagic, S. and M. Alagic,
Order-sorted model theory for
temporal executable specifications,
Theoretical Computer
Science 179, pp. 273-299,
1997.
[4] Bernstein, P. A. , A. Halevy, R.
A. Pottinger: A vision for
management of complex models, A
CM SIGMOD Record 29(4),
2000.
[5] Bernstein, P.A. and E. Rahm:
Data warehouse scenarios for model
management. ER 2000, LNCS 1920,
Springer-Verlag,
pp. 1-15, 2000.
[6] Batini, C, M. Lenzerini, and S.
B. Navathe: A comparative analysis
of methodologies for database schema
integration,
ACM Computing Surveys 18(4),
pp. 323-364, 1986.
[7] Buneman, P., S. Davidson, and A.
Kosky: Theoretical aspects of schema
merging. EDBT 1992, pp. 152-167.
[8] Fan, W. and L. Libkin, On XML
integrity constraints in the
presence of DTDs, Proc. of ACM
PODS Conf., 2001.
[9] Goguen, J. and R. Burstall,
Institutions: Abstract model theory
for specification and programming,
Journal of the ACM,
39, No. 1, pp. 92-146, 1992.
[10] Goguen, J.: Types as theories,
in: G. M. Reed, A. W. Roscoe and R.
F. Wachter, Topology and Category
Theory in
Computer Science, pp.
357-390, Clarendon Press, Oxford,
1991.
[II] Goguen, J. and J. Meseguer,
Order-sorted algebra I: Equational
deduction for multiple inheritance,
overload
ing, exceptions and partial
operations, Theoretical Computer
Science 105, pp. 217-273, 1992.
[12] Haxthausen, A.E. and F. Nickl:
Pushouts of order-sorted algebraic
specifications, in M. Wirsing and M.
Nivat,
eds., Algebraic Methodology and
Software Tech. (AMAST 96), LNCS
1101, Springer-Verlag, pp.
132-147, 1996. [13]
Hull, R.: Relative information
capacity of simple relational
database schemata, SI AM Journal
of Computing,
Vol. 15, No. 3, pp.856 -886,
1986. [14] Hull, R.: Managing
semantic heterogeneity in databases:
A theoretical
perspective, PODS 1997, 51-61
[15] Kalinichenko, L.A.: Methods and
tools for equivalent data model
mapping
construction, Proc. of EDBT,
LNCS 416, Springer-Verlag,
pp. 92-119, 1990. [16] Liskov, B.
and J. M. Wing: A behavioral notion
of
subtyping, ACM TOPLAS 16, pp.
1811 1841. [17] Mac Lane, S.:
Categories for a Working
Mathematician,
Springer, 1998.
[18] Madhavan, J., P. A. Bernstein,
E. Rahm: Generic schema matching
with Cupid, VLDB 01, to
appear. [19] Miller, R. J.,
Y. E. Ioannidis, R. Ramakrishnan,
Schema equivalence in heterogeneous
systems: Bridging
theory and practice, Information
Systems Vol. 19, No. 1, pp.
3-31, 1994. [20] Rahm, E. and P.A.
Bernstein: On matching
schemas automatically. MSR Tech.
Report MSR-TR-2001-17,
2001. [21] Spaccapietra, S., C.
Parent, Y. Dupont: Model independent
assertions for integration of
heterogeneous
schemas, VLDB Journal 1, pp.
81-126, 1992. [22] Spaccapietra, S.
and C. Parent: View integration: A
step forward in
solving structural conflicts,
IEEE TKDE
6(2), pp. 258 - 274, 1994.
[23] W3C: XML Schema,
http://www.w3c.org/XML/schema,
2001.
12
|
|
|
dd |
|
|
|
|
|
|
|
|
|
|