A model theory for generic schema management models - Kinan M Al Haffar, Student Publications, Research, Projects, Investigation and Academic Work presented by Atlantic International University students, faculty and other contributors. Bachelor, master, doctoral degree programs by distance learning, online, correspondence, or

Student Publications

Author: Kinan M Al Haffar
Title: A Model Theory for Generic Schema Management Models
Area:
Country :
Profile:
Program:
Available for Download: Yes

Sharing knowledge is a vital component in the growth and advancement of our society in a sustainable and responsible way. Through Open Access, AIU and other leading institutions through out the world are tearing down the barriers to access and use research literature. Our organization is interested in the dissemination of advances in scientific research fundamental to the proper operation of a modern society, in terms of community awareness, empowerment, health and wellness, sustainable development, economic advancement, and optimal functioning of health, education and other vital services. AIU’s mission and vision is consistent with the vision expressed in the Budapest Open Access Initiative and Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities. Do you have something you would like to share, or just a question or comment? We would be happy to hear from you, please use the Request Info link below.

For more infromation on the AIU's Open Access Initiative, click here.


	Abstract The core of a model theory for generic schema management is developed. This theory has two distinctive features: it applies to a variety of categories of schemas, and it applies to transformations of both the schema structure and its integrity constraints. A subtle problem of schema integration is considered in its general form, not bound to any particular category of schemas. The proposed solution, as well as the overall theory, is based entirely on schema morphisms that carry both structural and semantic properties. Duality results that apply to the two levels (i.e., the schema and the data levels) are established. These results lead to the main contribution of this paper: a formal schema and data management framework for generic schema management. Implications of this theory are established that apply to integrity problems in schema integration. The theory is illustrated by a particular category of schemas with object-oriented features along with typical database integrity constraints. 1 Introduction This paper presents the core results of a model theory for generic schema management, by which we mean schema and database transformation capabilities that are independent of a particular data model. Such transformations require major database programming tasks, such as integrating source schemas when building a data warehouse or integrating different user views into an overall database schema. In spite of nontrivial typing issues created by such transformations, database programming and other relevant paradigms have been primarily suited to dealing with structural aspects of those transformations. A major challenge is in properly addressing semantics: the integrity constraints associated with database schemas. A second major challenge is in developing such a model theory that is applicable to a variety of data models, such as the relational, object-oriented, and XML models [23]. This is challenging because schemas and their underlying databases are very different in these three major categories of models as are languages and their underlying logics used for expressing the integrity constraints. On the pragmatic side, generic schema management operations and tools have been considered in [4], and more specific system implications in [5, 20]. These papers argue that many difficult and expensive database programming problems involve the manipulation of mappings between schemas. Examples are populating a data warehouse from data sources, exposing a set of data sources as an integrated schema, generating a web site wrapper, generating an object-oriented wrapper for relational data, and mapping an XML schema to a relational schema. Despite many commonalities of these schema management problems, tools and languages are typically engineered for only one problem area. A more attractive approach would be to build generic schema management tools and languages that would apply to all of these problems, with only some customization required for the data model and problem at hand. The formal open problem is to find a suitable model theory that would be able to handle this generality. Our proposed model theory has a categorical flavor, manifested in the use of arrows that appear at two levels. At the meta level the arrows represent schema transformations. For example, an arrow could map a data source schema to a data warehouse schema, or two arrows could map each of two data source schemas into a mediated schema. These transformations are defined as schema morphisms that map the structural properties and integrity constraints of a schema. At the data level the arrows are data transformations specified as database morphisms. Database morphisms map the actual data sets in a manner that is compatible with the operations available on those sets and that preserves the schema's integrity constraints. There are several implications of this arrow-theoretic approach. The first is that it leads to a very general view of the schema integration problem expressed entirely in terms of arrows. The generality is accomplished by using a particular categorical construction which applies to a variety of categories of schemas [10, 12, 17]. The specific nature of arrows used in this construction is determined by the category 1 of schemas that defines the kind of structural and integrity transformations that the arrows actually represent. The second implication is in the formal results that relate properties of arrows at the meta level and the data level. These results reveal a subtle duality manifested in the reversal of the corresponding arrows between the two levels. They also provide guidance for how to define schema transformations appropriately, in particular, so that the arrows preserve integrity constraints. The third, most important implication is a formal framework for generic schema and data management which applies to a variety of data models. This is really the main contribution of this paper. This formal framework includes and relates the two levels (schema and data transformations) and captures both structural properties and integrity constraints. A key component of this theory is a strong integrity requirement on the permissible structural schema transformations so that they preserve the integrity constraints. This model theory relies on earlier results on a general model theory for a variety of programming paradigms and of associated logic paradigms [9, 10]. We apply the proposed formal theory to two situations. First, we apply it to the schema integration problem where we prove results that are generic across data models and include a proper treatment of the integrity constraints. Second, we use it as a pattern for defining a data model, one based on abstract data types, which includes object-oriented features and typical database integrity constraints. The paper is organized as follows. Section 2 starts with basic definitions, followed by a running example of a particular category of object-oriented schemas. The basic categorical definitions are given in Section 3, particularly the use of morphisms to represent schema mappings. Sections 4 through 6 are the core of the paper. Section 4 shows how to express the schema integration problem using morphisms. Section 5 introduces the notion of a generic schema transformation framework, specifies the condition for such a framework to preserve database integrity and proves the implications on schema integration. Section 6 shows how the proposed model theory applies to the category of schemas of the running example. Section 7 discusses related work and concludes. 2 Schemas A database schema consists of two components: a schema signature and the associated integrity constraints. A schema signature specifies structural and operational features of a database schema. Its typical components are signatures for data types and their operations and signatures for database sets (relations, collections, etc.), illustrated in Example 1. Note that type signatures for collections determine the signatures for operations on collections. Signatures for integrity constraints are logical expressions whose form is determined by the choice of a particular logic and syntax of the constraint language. A feature of the definitions below is that they are sufficiently general to apply to a variety of schema signatures and associated constraint languages. Definition 1 (Schemas) A database schema Sch is a pair Sch = (Sig, E) where Sig is a schema signature and E is a set of integrity constraints expressed as sentences (formulae of a chosen logic with all variables quantified). This paper uses examples based on a category of schemas of an object-oriented style. Schemas of this category have user-defined abstract data types specified as Java interfaces. A schema signature consists of the specification of those types and of collection objects which represent the actual database. The integrity constraints are specified in Horn clause logic. We use only the most typical database constraints: those expressing key dependencies and referential integrity constraints. Example 1 (Sample schema) schema Publishers { interface Publisher { String publisherldO ; String name(); String locationO ; Set<Publication> publications(); } interface Publication { String publicationldO ; String title(); int year(); Publisher publisherO ; 2 Set<String> keywords(); } Collection<Publisher> dbPublishers; Collection<Publication> dbPubs; Constraints: Publication X,Y; Publisher Z ,W; dbPubs.member(X) :- dbPublishers.member(Z), Z.publications.member(X); dbPublishers.member(Z) :- dbPubs.member(X), X.publisher.equals(Z); Z.equals(W) :- dbPublishers.member(Z), dbPublishers.member(W), Z.publisherldO .equals (W.publisherldO ); X.equals(Y) :- dbPubs .member (X) , dbPubs.member(Y), X.publicationldO . equals (Y.publicationldO ); } As Java interfaces lack any kind of general-purpose constraint specifications, such constraints are omitted from our examples. However, this omission is by no means a limitation of our approach (see [1, 2]). Note that equals is a method of the Java root class Object which is intended to be overridden to provide a meaning of equality specific to a particular data type. If this is the standard notion of equality then the logic paradigm becomes Horn clause logic with equality as in [11]. A database is a model for a schema. Given a schema signature Sig, the collection of all databases that conform to Sig (implement Sig) is denoted by Db(Sig). A database d that implements Sig would thus have to implement the signatures of data types in Sig as sets and the operation signatures as functions. Signatures for database sets would also have to be interpreted as sets, bags etc. If a database d that conforms to a schema signature Sig satisfies the integrity constraints E, we say that d is consistent with respect to the schema (Sig,E). This is expressed by the satisfaction relation, denoted \|=, between databases and the sets of sentences (i.e., integrity constraints) that they satisfy. Definition 2 (Database consistency) A database d is consistent with respect to the schema (Sig,E) iff d belongs to Db(Sig) and d \= e for all e � E. 3 Schema Morphisms In this approach the schema signatures for a given data model are required to constitute a category, denoted Sig. The same applies to the schemas. Schema transformations within a particular category of schemas are viewed as morphisms of that category. This approach allows us to talk about schemas without specifying their data model (i.e. category). Formally, a category C consists of the following [17]: � A collection of objects and a collection of arrows (morphisms). � Each arrow has its domain object and its codomain object. � If / : X -->� Y and g : Y -->� Z are arrows of C, then / and g are composed into an arrow gf : X -->� Z. � The composition of arrows is associative, i.e., if / : X --> Y, g : Y ^ Z and h : Z -->� W are arrows of C, then h(gf) = (hg)f. � Each object X of C is equipped with the identity arrow lx with the property that for any arrow f:X^Y,lY f = f and flx = f. The collection of all databases Db(Sig) conforming to a given schema signature Sig is required to constitute a category, with database morphisms that satisfy the above categorical requirements. Schema morphisms are defined as mappings of schema signatures that preserve the integrity constraints. Definition 3 (Schema morphisms) A morphism of schemas Schi = (Sigi,Ei) and Sch2 = (Sig2,E2) consists of a morphism (j> : Sigi --> Sig2 of schema signatures which extends to a mapping of constraints, such that for all databases d in Db(Sig2), if d \= e 2 for all e 2 � E2 then d \= <j)(ei) for all ei � E\. An equivalent condition is that <j>(ei) is in the closure of E2 for all e\ � E\. A particular case is (j>(e\) � E2. Definition 3 implies that schemas and their morphisms constitute a category, which we denote by Sch. The last condition in Definition 3 differentiates this work from many others. It requires that the integrity constraints of the source schema are transformed into constraints that are consistent with those of the target. This is expressed in a model-oriented fashion: a database that satisfies the constraints of the target schema also satisfies the transformed constraints of the source schema (as they appear in the target). 3 In Definition 1, distinguishing the notions of schema signature and schema (i.e., a schema signature plus its integrity constraints) leads to two notions of schema equivalence. The first one is just structural, based on schema signature morphisms. The second is semantic, requiring structural equivalence plus the semantic equivalence expressed in terms of integrity constraints. Definition 4 (Schema equivalence) Let Schi and Schz be two schemas � Schi and Sch-z are structurally equivalent if there exists a pair of schema signature morphisms f : Sigi --> Sig-2 and g : Sig-2 --> Sigi such that fg = lsig2 and gf =^sigx � Schi and Sch-z are equivalent if the above condition is satisfied for schema morphisms f and g. The category of all schemas for a given data model is required to include the initial schema Scho that contains basic features implicitly available in any other schema (for example, predefined standard types such as Boolean, integer and string, and their associated constraints). Scho typically does not contain signatures for database sets but does contain type signatures for the required collection types. Any other schema Schx implicitly extends Scho by a unique schema morphism Scho -->� Schx- The uniqueness requirement means that there is one standard way of incorporating Scho into Schx which makes Scho a subschema of Schx- The notion of a subschema Schi of Schi may be expressed by the categorical requirement that a monic arrow (schema morphism) m : Schi -->� Schi exists. This means that given schema morphisms / : Schx -->� Schi and g : Schx -->� Schi, mf = mg implies / = g [17]. In familiar cases, monic arrows are injections. 4 Schema Integration A particularly important problem in schema management is schema integration [6, 21, 22]. Here, we are given two (or more) schemas and are asked to produce an integrated schema that can represent the information content of the given schemas. In one version of the problem, which we treat here, the integrated schema is populated with data and the given schemas are defined as views of the integrated schema. All published schema integration results we know of are tied to a particular data model (i.e., to a particular category of schemas). By contrast, our approach applies to different categories of schemas. Moreover, it addresses the subtle problem of merging the integrity constraints of two schemas. These two distinctive features are accomplished by expressing the idea of schema integration in terms of morphisms of schemas. This way both structural and semantic conditions are taken into account. Definition 5 (Schema integration) An integration Schi2 of schemas Schi and Schi is defined by the following commutative diagram of schema morphisms: Schm --> Schi hi P{ Schi --> Sch\2 Two schemas are integrated over their matching part Schm. Two schemas always have a matching part: the initial schema Scho. The matching part Schm is typically a subschema of both Schi and Schi-The commutativity of the above diagram asserts that the two composite schema morphisms pcf>i and qfa are identical. This means that Schm appears the same in Schn whichever path (via Schi or Schi) is taken. Example 2 (Schema integration) We show the integration Schi2 of Schi (in Example 1) with another schema, Schz, such that Schi and Schi can be defined as views of Schn- First, we define Schi as follows: schema Authors { interface Author { String authorldO ; String name(); Date dateOfBirthO ; Collection<Book> books(); } interface Book { String publicationlDO ; 4 String title(); int year(); String publisherO ; Set<String> keywords(); float price(); Collection<Author> authors(); } Collection<Author> dbAuthors; Collection<Book> dbBooks; Constraints: Author X,Y; Book Z,W; dbBooks.member(Z) :- dbAuthors.member(X), X.books.member(Z); dbAuthors.member(X) :- dbBooks.member(Z), Z.authors.member(X); X.equals (Y) :- dbAuthors.member(X), dbAuthors.member(Y), X.authorldO .equals(Y.authorldQ ); Z.equals(W) :- dbBooks.member(Z), dbBooks.member(W), Z.ISBN().equals(W.ISBN()) } A schema Schn that integrates Schi and Schz is given below: schema Publications { import Publishers.Publisher; import Publishers.Publication; import Authors.Author; interface Book extends Publication // integrates Book and Publication in Schl2 { float price(); Collection<Author> authors(); } Collection<Publisher> dbPublishers; Collection<Publication> dbPubs; Collection<Author> dbAuthors; Collection<Book> dbBooks; Constraints: import Publishers.Constraints; import Authors.Constraints; Publication X; dbPubs.member(X) :- dbBooks.member((Book)X) // an additional constraint beyond Schl and Sch2 } Among all integrations Schn of schemas S\ and 52, the schema-join of Schi and 5c/i2 denoted Sch\ * 5c/i2 , if it exists, has a distinctive property: it represents the minimal integration of Schi and Sch'i- This notion is specified below entirely in terms of schema morphisms by a categorical construction called a pushout [17]. Definition 6 (Schema join) Schema-join Sch\Sch2 of schemas Schi and Schi is defined by the following* commutative diagram of schema morphisms: Schm --> Schi hi h[ Schi --> Schi * Schi with the following property: Given any integration Schn of schemas Schi and Schi as defined in Definition 5 above, there exists a unique schema morphism <f> : Schi * Schi -->� 5c/ii2 such that <f>k = q and 4>h = p. A specific construction of schema-join is presented in Section 6. The fact that schema integration and schema- join are expressed entirely in terms of arrows representing schema morphisms has two distinctive implications: (i) both structural and database integrity properties are integrated, and (ii) the notion of schema integration is data model independent. Specific notions of schema integration are obtained by choosing a particular category of schemas which includes its schema morphisms. We believe that the notion of least upper bound in the schema lattice of [7] and the schema + operator of [19] can be modeled precisely as schema join. Our category-theoretic treatment generalizes [7] by considering the instance level, not just the schema level and generalizes both models by making it applicable to a variety of schema categories, including such features as methods and constraints. 5 5 Generic Schema Transformation Framework The categorical approach developed so far leads to a very general notion of a data model. This model has two levels. The meta level consists of database schemas and their transformations expressed as schema morphisms. This transformation-based view is quite different from the standard notions of a data model. At the instance level these transformations operate on databases that conform to schemas at the meta level. The notion of database integrity plays a fundamental role in this formal framework. The acceptable transformations at both levels are required to satisfy the integrity constraints. This integrity requirement is expressed as a condition that involves acceptable schema signature transformations and the satisfaction relation between databases and the integrity constraints. The framework is proposed as a formal pattern for defining data models such that schema management and the associated database transformations have well-defined formal meanings. As such, it offers interesting observations on the relationships between the two levels which are sometimes quite different from the usual views of schema and data transformations. An example is the reversal of the directions of transformations at the two levels. Schema management operations such as schema integration have particularly desirable semantic properties when the conditions for this formal framework are satisfied. The core of this model theory requires just one more categorical notion: a morphism of categories. Given categories C and B, a functor F : C --> B consists of two related functions with the following properties [17]: � The object function which assigns to each object X of C an object FX of B. � The arrow function which assigns to each arrow h : X --> Y of C an arrow Fh : FX -- FY of B. � F(lx) = IFX and F(gh) = F(g)F(h), the latter whenever the composite gh is defined. Two functors play a crucial role in this theory. The first one is Sen : Sign -->� Set where Set denotes the category of sets. If Sig is a schema signature, then Sen(Sig) is the set of all well-formed sentences (integrity constraints) over Sig. Many constraint languages, hence many Sen functors, are possible for a given data model (i.e., which implies a choice of Sign). Sen is determined by the choice of logic that specifies the syntax of sentences of the constraint language. This syntax is defined starting with the features of a schema signature. The schema signature typically determines the terms of the constraint language, and the logic determines the formulae based on those terms. This is why Sen maps a schema signature into a set of sentences over that signature. Sen also maps a schema signature morphism to a function that transforms the sets of sentences. The second functor is Db : Sign -- Catop. For each signature Sig, Db(Sig) is the category of Sig databases, together with their morphisms. These database morphisms represent data transformations, which correspond to the schema transformations represented by arrows in Sign. Cat denotes the category of categories. Objects of Cat are categories and arrows of Cat are functors. Catop differs from Cat only to the extent that the direction of its arrows is reversed. This reversal of the direction of arrows that happens going from schemas to their databases is one of the subtle and characteristic features of this model theory. Recall that databases in the category Db(Sig) are not necessarily consistent with respect to a set of integrity constraints E. The notion of database integrity is captured by the satisfaction relation \|= between databases in Db(Sig) and sets of sentences over Sig. We now have the machinery to define the notion of data model described in the beginning of this section. To emphasize the transformation-based view relative to classical data models, we give it a new name. Definition 7 (Schema transformation framework) A schema transformation framework consists of: � A category of schema signatures Sign equipped with the initial object. This category consists of objects representing schema signatures together with their morphisms. � A functor Sen : Sign -- Set. Sen(Sig) is a set of sentences over the schema signature Sig. � A functor Db : Sign --> Catop. For each signature Sig, Db(Sig) is the category of Sig databases, together with their morphisms. � For each signature Sig, a relation \=sig Q \| Db(Sig) \| x Sen(Sig) called the satisfaction relation. \ Db(Sig) \ denotes the set of objects of the category Db(Sig). � For each schema signature morphism (j> : Sig A --> SigB, the following Integrity Requirement holds for each SigB database dB and each sentence e � Sen(SigA): dB \=si9B Sen(4>)(e) iff Db(<j>)(dB) \=si9A e. 6 The above definition is based on [9]. Its relationships are represented by the following diagram: Db(SigA) --> Sen(SigA) Db^)] \|5en(0) Db(SigB) --> Sen(SigB) Note the reversal of the direction of the arrow Db(<j>) relative to Sen(<j>). Sen((f>) maps each constraint in Sen(SigA) to a constraint in Sen(SigB). By contrast, Db{4>) maps each database in Db(SigB) to a database in Db(SigA)- If we think of <f> as a mapping from a logical database schema SigA into a physical schema SigB, then Db{4>) tells how to materialize a view in Db(SigA) from a database in Db{Sigs)- To see why this reversal of arrows happens, consider an injective schema morphism </> : Sch\ --> Sch^ which makes Sch\ a subschema of Sch^ . Schi just extends the signatures and the constraints of Sch\ (we assume a monotonic logic such as Horn clause logic). A database that is consistent with respect to Schi is also (by projection) consistent with Sch\. The other way around does not hold. The Integrity Requirement puts a very strong semantic restriction on the permissible schema signature transformations (schema morphisms). It only allows ones that preserve the validity of constraints. That is, suppose 4> maps constraint BA to es (more precisely, es = (Db)(<j))(eA ) for BA G Sen(SigA))- Then the Integrity Requirement says that es is valid in a database (1B iff &A is valid in the SigA database that corresponds to ds (i.e., in Db(<j>)(d,B))- The Integrity Requirement applies directly to the problem of schema integration in Definition 6. It ensures that each valid database of the integrated schema corresponds to valid databases of the schemas to be integrated. This point is made precise in the following theorem. Theorem 1 (Materializing subschemas) Suppose Sch\ and Sch^ are schemas that are integrated into Schi2 relative to some schema transformation framework. Given a consistent database d in Db{Sch\2), it is possible to construct databases d\ and G?2 that are consistent relative to schema Sch\ and schema Schi respectively. Proof We have schema morphisms </>i : Sch\ -->� Sch\2 and </>2 : Schz -->� Sch\2 where Sch\ = (Sigi,E\) and 5c/i2 = (Sigz,E'i)- We construct database d\ as Db((f>i)(d) and database di as Db(<f>2)(d). Since d is a consistent database and </>i and </>2 are schema morphisms, Definition 3 implies: � d \|=sjgi2 (Sen)(cf>i)(e) for all eG-Bi � d \|=sjgi2 (Sen)((/)2 )(e) for all e � E2. We can now complete the proof by applying the Integrity Requirement to the above two lines, yielding: � Db{<f>i){d) \=si91 e for all e�fii � Db(<j>2)(d) \=si92 e for all e G E2. In essence, databases conforming to Sch\ and Schz are materialized views of Sch\2- Thus, the above theorem relates the consistency of databases conforming to the integrated schema to those of the materialized views. This is an unusual perspective in that views typically do not have integrity constraints. Note that in order to make this proof possible both the reversal of arrows and the Integrity Requirement are essential. Corollary 1 (Schema-joins) Let Schri be an integration of schemas Sch\ and Sch,2 relative to a schema transformation framework. Let d be a consistent database in the category Db{Sig\2). If Sch\ * Schz = (Sigi2,* �-12 ) exists, then there is a* canonical way to construct a consistent database d in the category Db(Sigi2)- Proof By Definition 6, there is a unique <j> : Sch\ * Schz --> Sch\2- d* is constructed as Db(cf))(d). Since d is consistent, by Theorem 1 so is d. Corollary 1 says that for any consistent database of an integrated schema, there exists a corresponding (minimal) consistent database of the schema-join of the two schemas that were integrated. The canonical arrow Db(<f>)* is a recipe for constructing that (minimal) consistent database. 7 6 An Application of the Schema Transformation Framework One role of the generic schema transformation framework is to serve as a formal generic pattern for defining new data models, with well defined meanings for schema and data transformations. This section gives a detailed description of how one constructs such a data model: a category of schemas and their databases for Examples 1 and 2, which we call Objc (00 with Constraints). A schema of Objc consists of a collection of sorts (type names) some of which are predefined in the initial schema Scho and which in particular must include the sort Boolean along with the standard axioms. The others are abstract data types defined in the schema itself, each of which is a set of method signatures specified as a Java interface. The is-a (inheritance) relationship thus amounts to the subset relation which agrees with the rules of the Java type system. A schema contains collection objects which are the actual database sets. To specify their type, a parametric Collection < T > type is used and instantiated with the required element type selected among the abstract data types defined in the schema. Definition 8 (Objc schema signatures) A schema signature consists of: � A finite set of sorts S, which includes Boolean. � A finite set of interfaces A (abstract data types) such that A C S . � An interface Ak is a set of method signatures of the form C m{C\ xi, Ci X2,... ,Cn x n ) where C � S and Ci � S for i = 1,2,... ,n. (m is the method name, C is the return type, and C\ is the type of parameter x\.) � If the interface A ^ extends the interface A\ (inheritance) then A\ C A k- � A finite set of collection signatures {Collection < Aj >: X,} where Aj � A and Collection < Aj >� S. To specify the type of data sets of a schema, parametric collection types are required. The issue of parametric types is a major one by itself. It will be elaborated only to a limited extent here by showing that a specific collection type (i.e., for a specific element type) is obtained from a parametric type by the same pushout construction [10] that we used for schema integration. Definition 9 fObjc collection types) An instantiated collection interface Collection < A > is defined by the following pushout diagram T -->� Collection < T > i H A -^ Collection < A > In the above definition a parametric interface Collection < T > is viewed as a morphism T --> Collection < T >. The substitution T --> A and instantiation Collection < T >--^ Collection < A > are also morphisms. The morphism A -->� Collection < A > is obtained from T --> Collection < T > and the substitution T --> A. The above construction generalizes the usual views of instantiated parametric types. If a pushout is based on schema morphisms (which apply to both structural properties and integrity constraints), not just on signature morphisms, then the notion of instantiation of parametric types becomes semantic in nature. Constraints are sentences expressed in Horn clause logic. Terms include method invocations and thus appear in the object- oriented form. Atoms are invocations of Boolean methods. This limited logic is sufficiently expressive to capture most typical database constraints, such as key and inclusion dependencies. Definition 10 fObjc constraints) For a given schema signature Sig with sorts S: � A collection object of type Collection < A ^ > is a term of type Collection < A ^ >. � A variable X of type A ^ where A k � S is a term of type A ^ . � If a is a term of type A k , are terms of respective types A\, A-z,..., An, C m(Ai, A-z,..., An) is a method of the interface Ak. of Sig, then a.m(ai,a 2, , a n ) is a term of type C. � With the above definitions, an atom is of the form a.m{a\, 02, � � �, a n ) where m 's result type is Boolean. � A constraint is of the form p <-- pi,p2, ,Pn where p,pi,P2, ,pn are atoms. Permissible schema signature transformations of Objc are defined below as schema signature morphisms. Their core is a mapping of sorts, extended to signatures of methods and collection objects. This mapping is the identity on the inital schema's predefined sorts, so these sorts have the same interpretation in all Objc schemas. Note that Collection < T > is not intended to have methods with arguments of type Collection < T >, so that Collection < A ^ > would be a structural subtype of Collection < A\ > if A/- is a subtype of Ai. 8 Definition 11 fObjc schema signature morphisms) A morphism of schema signatures </> : Sigi --> Sig-z consists of the following: � An injective mapping of sorts <f> : S\ --> S2 such that -- (f>(s) = s for s So where So are the sorts of the initial schema Scho- -- (^(Collection < Ai >) = Collection < <f>(Ai) > � <f> applies to interfaces as follows: <f>(C m(Ci, C2, , Cn)) = 4>{C) (j>(m)((j)(Ci), <f>(C-2), � � �, (j>(Cn)) � <f> maps collection objects of Sigi into collection objects of Sig-2 such that ^(Collection < Aj >: Xj) = <f>(Collection < Aj >) : 4>(Xj). The above definition is intended to allow a mapping of an abstract data type to its extension with possible renaming and possibly enforcing structural subtyping conditions. The notion of schema signature morphism is extended below to a schema morphism. The extension maps the integrity constraints in accordance with the mapping of the sorts in a schema signature morphism. Definition 12 fObjc schema morphisms) A morphism of schemas cf> : Schi -- Sdi2 is a schema signature morphism <f> : Sigi --> Sig2 such that (j> extends to a function E\ --> E2 as follows: � <j)(xs ) = X0(3) 1, 0,2-, � � � ; 0,n )) = cf>(a).cf>(m)(cf>(ai),... ,cf>(a,,)), where m may be a Boolean method. � (j > (p < - p i , P 2 , . . . , P n ) = 4>(P) ^4>(Pl),4>(P2),---A(Pn) Given Definitions 10 and 11, the definition of the Sen functor is immediate and so is the verification of its functorial properties. Definition 13 fObjc Sen functor) Define Sen : Sign --> Set as follows: � Sen(Sig) is a set of sentences of the form {p <-- pi,p2, ,Pn} where p,pi,P2, ,pn are atoms according to Definition 10. � Given a schema signature morphism (j> : Sigi --> Sig2, Sen(cf>)({P ^- pi,P2, ,Pn}) = {4>(P) <-<t>(j>l),(p(p2),...,<l>(pn)}- Proposition 1 fObjc Sen functor) Then Sen is a functor Sign -->� Set. Proof Sen obviously preserves the identity schema signature morphism, i.e. Sen(lsig ) = ^sen(Sig) and the composition of schema signature morphisms, i.e., Sen(<f>\<j)2) = Sen{4>i)Sen{4>2)- A database for a given Objc schema includes a domain for each sort in the schema. A domain is a set equipped with functions representing the interpretation of the method signatures. A database also includes the actual data sets, one for each collection object signature in the schema. By Definition 2, a database for a given schema must be consistent. Definition 14 ('Objc databases) A database d for a given schema Sch = (Sig,E) consists of the following: � A collection of sets (domains) {Ds \ s � S} where S is the set of sorts of Sig. S includes Boolean, whose domain is true and false. � For each method m of an interface A ^ with the signature C m(C\ x\, C2 X2, ,Cn x n ) a function fm : Dw --> Dc where Dw = DAk x Dc\ x . .. x Dcn with w = AkCi ... C,,. � For each collection variable of type Collection < A k > a set with elements from the domain Dk. � If Ai <Z Au then there exists a function (projection) Dk --> Di and thus also a function Collection < Dk >--> Collection < Di >. � Let e be a constraint with variables X such that Xs C X is a set of variables in X of sort s. If 8 is a substitution of variables in e (i.e. a family of functions Xs --> Ds ) then e < 9 > evaluates to true. Although collections in the above definition are interpreted as sets, in general they can be bags. To complete the above definition, we must specify its morphisms of the category of databases: a family of functions, one per database domain, that map the actual data sets as well. These maps are required to satisfy two conditions: a standard algebraic condition that applies to operations on the original domains and their images, and a condition that applies to the integrity constraints. As we are talking here about the category of databases for a schema, not just for a schema signature, database morphisms are required to map each consistent database into another consistent database. 9 Definition 15 (Database morphisms for Objc schemas) Let d and d' be databases consistent with respect to a schema (Sig, E). Then a database morphism h : D --> D' is a family of functions h s : Ds -- D's for s G S where S stands for the set of sorts of Sig. � For each method m of interface A k with signature C m(Ci x\, C2 X2,... ,Cn x n ) and a G Dw, h s{fm{a)) = f m (h w(a)) holds, where h w is a product of functions h A k x ftci x ... x hcn when w = A ^CiC-2 ... Cn, as in the following diagram: /m Dw --> Dc h wl \|/ic D'w ^ D'c � Suppose e G E has variables X. If 6S : Xs -- Ds is a substitution of variables for X and e < 8 > evaluates to true, then so does e < 9' >, where 8' is constructed as the composition of 8S and h s . Definition 16 (Db functor for Objc schemas) Define Db as follows: Db(Sig) is a category with objects and arrows defined in 14 and 15. Given <f> : Sigi --> Sig2, Db((j>) is defined as: � If (fe is a Sig2 database with domains {Ds \| s G S2} then Db((j>)(d2) is a Sigi database with domains {Ds I s G (f>(Si)}. The same applies to the collection objects. � A morphism h : e?2 --> d'2 of Sig2 databases (a family of functions {h s \ s G �2},) is mapped into a morphism Db{<f>){h) of Sigi databases by restricting h to the family of functions {h s \| s G (f>(Si)}. Proposition 2 specifies how the Db functor is constructed and how its functorial properties are verified. Note that D s and hs now correspond to the sort 4>~1 (s) � Si. Proposition 2 (Db functor for Objc schemas) Db in Definition 16 is a functor Sign --> Catop. Proof A schema signature morphism <f> : Sigi -->� 5*32 according to Definition 11 maps to a database morphism Db((f>)* : Db{Sig2) --> Db(Sigi) according to the above construction and Definition 15. The family Db(<f>)(d2) -->� Db{4>)(d'2) is a Sigi database morphism, because h is a morphism of Sig2 databases. Two properties are essential here: the injective mapping of sorts required in Definition 11 of schema signature morphisms and mapping of collections as specified in Definition 14. The developments presented so far in this section lead to two theorems. The first one asserts that the paradigm satisfies the conditions of Definition 7 and thus is a schema transformation framework. It is based on the already established functors Sen and Db. Theorem 2 (A schema transformation framework) � Let the category of schema signatures Objc be defined according to Definitions 8 and 11. � Define the functor Sen : Sign -->� Set according to Proposition 1. � Define the functor Db : Sign --> Catop according to the Proposition 2 so that the category Db(Sig) is defined according to Definitions 14 and 15. � The satisfaction relation is also specified in Definitions 14 and 15. Under the above conditions the Integrity Requirement holds. Proof We have to prove that each schema signature morphism </> : Sig A --> SigB in Definition 11 satisfies the Integrity Requirement, that is, ds \=Sig B Sen{<j))(e) iff Db(<f>)(dB) \=Sig A e holds for each sentence e � Sen(SigA)- Given e � Sen(SigA), Sen(cf))(e) is defined by Proposition 1. For a given object ds of the category Db(SigB), Db(<j>)(d,B) is defined in Proposition 2. The proof amounts to: dB \=Sig B (4>(P) <- <MPl),'MP2),---,</>(>n)) iff Db((f>)(dB) \=SigA (P <~ Pi, Pi, , Pn) The second theorem establishes the existence of the schema-join of any two schemas of this schema transformation framework. This result provides a standard technique for integrating two Objc schemas. This integration is both structural and semantic (i.e. the integrity constraints are properly integrated) because it is based entirely on schema morphisms according to the pushout construction in Definition 6. Theorem 3 The schema-join of any two Objc schemas over the initial schema Scho exists. 10 Proof Given schemas Schi and Schz the integrated schema Schi * Sch^ is defined as follows: (i) Ssch!Sch2* = <5iU52 (sorts), (ii) ASch1Sch 2 = A Scht* U ASch2 (interfaces), (in) CSch1Sch 2 = Csch^Csch (collections), and (iv) Esch!Sch2 = Escht* U ESch 2 (constraints). With this construction Schi and Sch^ are subschemas of Schi * Sch^ since we have two injective schema morphisms <f>i : Schi --> Schi * Schz and <f>2 ' Sch,2 -->� Schi * Sch.2- Suppose that we are given p : Schi --> S12 and q : Sch,2 --> Si 2 . The required schema morphism <j> : Schi * Scti2 --> Sch\2 is defined on the sorts as follows: <f>(s) = p(s) for s � Si and <j>(s) = q(s) for SGS2. Note that <f>i(s) = s for s � So and <fo(s) = s for s � So- So ^ is well defined for s � Si fl S2. As interfaces, collections and constraints of Schi * Sc/12 are defined as disjoint unions of the corresponding components of Schi and Sch2 , we have <f>h = p and <j)k = q. Note that with the choice of a different logic the above result would not necessarily hold. The union of constraints could lead to a set of sentences for which there is no database (particularly in a given category) that satisfies the constraints (represents a model for the constraints) <j>i{Ei) and ^2(^2) (see Definition 3). 7 Conclusions and Related Work The main contribution of this paper is a model theory for generic schema management. While the ideas on generic schema management proposed so far have been mostly informal [4] and pragmatic [5], this paper shows that a formal framework for such generic tools does indeed exist. A distinctive feature of the presented paradigm is that it applies to a variety of categories of schemas and to a variety of logic bases for expressing database integrity constraints. This level of generality is accomplished by making use of a categorical model theory called institutions, proposed for programming language paradigms [9]. We believe ours is the first application of this theory to database models and languages. An earlier attempt in [15] is similar in spirit, but is less general. It also does not address the logic basis and preservation of integrity constraints and lacks provable results. The core of this model theory is a general, transformation-oriented definition of a data model. This generic schema management framework serves as a pattern for constructing data models in such a way that schema transformations as well as the associated database transformations have well-defined meaning. As a rule, most well known data models have not been constructed in such a spirit. Furthermore, this framework is intended to be applied to data models that are still not completely or formally established, such as XML. A further distinctive feature of this framework is that it has a strong database integrity requirement as its possibly most fundamental component. This paper also shows how to address some well-known problems such as schema equivalence [13, 14] and schema integration [6] at this level of generality. The results are independent of a particular category of schemas and apply to both structural properties and integrity constraints. The requirement for proper handling of the integrity constraints in those problems is one of the contributions of this paper. Furthermore, this approach is independent of a particular logic paradigm used as a basis for the constraint language. Contrary to the published work on schema and data transformations (schema integration in particular) in which transformations at the meta and data levels have the same direction [19], this paper reveals a subtlety not considered in the relevant papers. The subtlety is manifested in the reversal of the transformation arrows at the two levels. This observation is an important component of the presented model theory, having both pragmatic and mathematical significance. In this paper one option is that schema transformations must satisfy particular structural subtyping conditions. Such a strict discipline has its place in programming languages, but it is often not satisfied in schema transformations. Furthermore, if the abstract data types are equipped with constraints, semantic compatibility notions such as behavioral subtyping [16] become a major issue. Model theoretic implications of behavioral compatibility issues in mapping abstract data types equipped with constraints are given in [!] In a model in which inheritance is identified with subtyping (as in Java) a single partial order of sorts is required. This introduces further subtleties in schema transformations and schema integration. Models with different types of ordering of sorts are elaborated in [1] and [3]. The relevant results on pushouts of algebraic specifications are given in [12]. 11 In this paper we do not consider the problem of mapping one schema transformation framework (for example, relational) into another (for example, object-oriented). This situation is captured by the notion of a morphism of schema transformation frameworks (following [9, 15]), and is a topic of work in progress. The formal paradigm presented in this paper applies to appropriately simplified XML schemas [23]. The reasons are: the nature of the type system of the XML Schema, the kind of the integrity constraints (keys and referential integrity constraints) expressible in XML schemas (also considered in [8]), and the existence of the initial schema (more precisely, a name space). Space limitations do not allow further elaboration of these important implications of the development presented in this paper. However, application of this model theory to the XML data model is a major topic of future research. This paper is not addressing explicitly modeling of database dynamics. But this is by no means a limitation of the paradigm. Database dynamics is taken into account by choosing the Sen functor for a suitable temporal logic. A development of such a paradigm is given in [1, 2, 3]. References [I] Alagic, S.: Semantics of temporal classes, Information and Computation, 163, pp. 60 - 102, 2000. [2] Alagic, S.: Constrained matching is type safe, Proceedings of the Sixth Int. Workshop on Database Programming Languages, LNCS 1369, Springer-Verlag, pp. 78-96, 1998. [3] Alagic, S. and M. Alagic, Order-sorted model theory for temporal executable specifications, Theoretical Computer Science 179, pp. 273-299, 1997. [4] Bernstein, P. A. , A. Halevy, R. A. Pottinger: A vision for management of complex models, A CM SIGMOD Record 29(4), 2000. [5] Bernstein, P.A. and E. Rahm: Data warehouse scenarios for model management. ER 2000, LNCS 1920, Springer-Verlag, pp. 1-15, 2000. [6] Batini, C, M. Lenzerini, and S. B. Navathe: A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18(4), pp. 323-364, 1986. [7] Buneman, P., S. Davidson, and A. Kosky: Theoretical aspects of schema merging. EDBT 1992, pp. 152-167. [8] Fan, W. and L. Libkin, On XML integrity constraints in the presence of DTDs, Proc. of ACM PODS Conf., 2001. [9] Goguen, J. and R. Burstall, Institutions: Abstract model theory for specification and programming, Journal of the ACM, 39, No. 1, pp. 92-146, 1992. [10] Goguen, J.: Types as theories, in: G. M. Reed, A. W. Roscoe and R. F. Wachter, Topology and Category Theory in Computer Science, pp. 357-390, Clarendon Press, Oxford, 1991. [II] Goguen, J. and J. Meseguer, Order-sorted algebra I: Equational deduction for multiple inheritance, overload ing, exceptions and partial operations, Theoretical Computer Science 105, pp. 217-273, 1992. [12] Haxthausen, A.E. and F. Nickl: Pushouts of order-sorted algebraic specifications, in M. Wirsing and M. Nivat, eds., Algebraic Methodology and Software Tech. (AMAST 96), LNCS 1101, Springer-Verlag, pp. 132-147, 1996. [13] Hull, R.: Relative information capacity of simple relational database schemata, SI AM Journal of Computing, Vol. 15, No. 3, pp.856 -886, 1986. [14] Hull, R.: Managing semantic heterogeneity in databases: A theoretical perspective, PODS 1997, 51-61 [15] Kalinichenko, L.A.: Methods and tools for equivalent data model mapping construction, Proc. of EDBT, LNCS 416, Springer-Verlag, pp. 92-119, 1990. [16] Liskov, B. and J. M. Wing: A behavioral notion of subtyping, ACM TOPLAS 16, pp. 1811 1841. [17] Mac Lane, S.: Categories for a Working Mathematician, Springer, 1998. [18] Madhavan, J., P. A. Bernstein, E. Rahm: Generic schema matching with Cupid, VLDB 01, to appear. [19] Miller, R. J., Y. E. Ioannidis, R. Ramakrishnan, Schema equivalence in heterogeneous systems: Bridging theory and practice, Information Systems Vol. 19, No. 1, pp. 3-31, 1994. [20] Rahm, E. and P.A. Bernstein: On matching schemas automatically. MSR Tech. Report MSR-TR-2001-17, 2001. [21] Spaccapietra, S., C. Parent, Y. Dupont: Model independent assertions for integration of heterogeneous schemas, VLDB Journal 1, pp. 81-126, 1992. [22] Spaccapietra, S. and C. Parent: View integration: A step forward in solving structural conflicts, IEEE TKDE 6(2), pp. 258 - 274, 1994. [23] W3C: XML Schema, http://www.w3c.org/XML/schema, 2001. 12
	dd