Conference PaperPDF Available

Clio grows up: From research prototype to industrial tool

Authors:

Abstract and Figures

Clio, the IBM Research system for expressing declarative schema mappings, has progressed in the past few years from a research prototype into a technology that is behind some of IBM's mapping technology. Clio provides a declarative way of specifying schema mappings between either XML or relational schemas. Mappings are compiled into an abstract query graph representation that captures the transformation semantics of the mappings. The query graph can then be serialized into different query languages, depending on the kind of schemas and systems involved in the mapping. Clio currently produces XQuery, XSLT, SQL, and SQL/XML queries. In this paper, we revisit the architecture and algorithms behind Clio. We then discuss some implementation issues, optimizations needed for scalability, and general lessons learned in the road towards creating an industrial-strength tool.
Content may be subject to copyright.
Clio Grows Up: From Research Prototype to Industrial Tool
Laura M. Haas
IBM Silicon Valley Labs
laura@almaden.ibm.com
Mauricio A. Hern´andez
IBM Almaden Research Center
mauricio@almaden.ibm.com
Howard Ho
IBM Almaden Research Center
ho@almaden.ibm.com
Lucian Popa
IBM Almaden Research Center
lucian@almaden.ibm.com
Mary Roth
IBM Silicon Valley Labs
torkroth@us.ibm.com
ABSTRACT
Clio, the IBM Research system for expressing declarative schema
mappings, has progressed in the past few years from a research pro-
totype into a technology that is behind some of IBM’s mapping tech-
nology. Clio provides a declarative way of specifying schema map-
pings between either XML or relational schemas. Mappings are
compiled into an abstract query graph representation that captures
the transformation semantics of the mappings. The query graph can
then be serialized into different query languages, depending on the
kind of schemas and systems involved in the mapping. Clio currently
produces XQuery, XSLT, SQL, and SQL/XML queries. In this pa-
per, we revisit the architecture and algorithms behind Clio. We then
discuss some implementation issues, optimizations needed for scal-
ability, and general lessons learned in the road towards creating an
industrial-strength tool.
1. INTRODUCTION
Mappings between different representations of data are fundamen-
tal for applications that require data interoperability, that is, integra-
tion and exchange of data residing at multiple sites, in different for-
mats (or schemas), and even under different data models (such as
relational or XML). To provide interoperability, information integra-
tion systems must be able to understand and translate between the
various ways in which data is structured. With the advent of the
flexible XML format, the abundance of different schemas describing
similar or related data has proliferated even more.
Wecan distinguish between two main forms of data interoperabil-
ity. Data exchange (or data translation) is the task of restructur-
ing data from a source format (or schema) into a target format (or
schema). This is not a new problem; the first systems supporting
the restructuring and translation of data were built several decades
ago. An early such system was EXPRESS [9], which performed
data exchange between hierarchical schemas. However, the need for
systems supporting data exchange has persisted and, in fact, grew
larger over the years. Data exchange requirements appear in the ETL
(extract-transform-load) workflows, used to populate a data ware-
house from a set of data sources, in XML messaging, in schema evo-
lution (when migrating data from an old versionto a new version), in
database restructuring, etc. A second form of data interoperability is
data integration (or federation), which means the ability to query a
set of heterogeneous data sources via a virtual unified target schema.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage, and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD 2005 June 14-16, 2005, Baltimore, Maryland, USA.
Copyright 2005 ACM 1-59593-060-4/05/06 $5.00.
There is no need to materialize a target instance in this case; instead,
the emphasis is on query processing.
In both cases, of data exchange and data integration, relationships
or mappings must first be established between the source schema(s)
and the target schema. There are two complementary levels at which
mappings between schemas can be established, and both pose chal-
lenges. The first is mostly a syntactic one: by employing schema
matching techniques, a set of uninterpreted correspondences between
elements (terms, field names, etc.) in two different schemas are es-
tablished. A significant body of work on schema matching algo-
rithms has been developed (see [8] for a survey). The second level
at which mappings can be established is a more operational one, that
can relate instances over the source schema(s) with instances over the
target schema. Establishing an “operational” mapping is necessary if
one needs to move actual data from a source to a target, or to answer
queries. Such “operational” mappings can be seen as interpretations
(with runtime consequences) of the correspondences that result after
schema matching. It is this second view on schema mappings that we
focus on here, and we will use the term schema mappings to mean
“operational” mappings rather than correspondences.
Schema mappings in various logical forms have been used for
query answering in data integration (see [4] for a survey). Schema
mappings expressed as constraints (source-to-target tgds) have been
used to formalize data exchange between relational schemas [3]. Sim-
ilar schema mappings appear as an important component of the model
management framework of Bernstein et al [1, 5].
The system that we are about to describe, Clio, is the first to ad-
dress the problem of semi-automatic generation ofschema mappings
as well as the subsequent problem of using the schema mappings
for the actual runtime (e.g., how to generate an XQuery or XSLT
transformation from the schema mappings to implement the data ex-
change). Originally designed for the relational case [6], Clio has then
evolved into a full-fledged system [7] for generating mappings and
transformations between hierarchical XML schemas. In this paper,
we describe our further experience in building an industrial-strength
Clio; in particular, we focus on the practical challenges encountered
in: (1) the scalable semi-automatic generation of mappings between
schemas with large degree of complexity, and (2) the subsequent use
of these schema mappings to accomplish efficient and functional data
exchange, via query generation.
Clio system architecture. A pictorial view that shows how Clio is
structured into its main components is shown in Figure 1. At the core
of the system are the mapping generation component and the query
generation component. The mapping generation component takes as
input correspondences between the source and the target schemas and
generates a schema mapping consisting of a set of logical mappings
that provide an interpretation of the given correspondences. The logi-
cal mappings are declarative assertions (expressed as source-to-target
constraints, to be described shortly). They can be viewed as abstrac-
tions (or requirements) for the more complex physical transforma-
tions (e.g., SQL, XQuery or XSLT scripts) that operate at the data
Schema mapping
Source
schema S Target
schema T
Correspondences
“conforms to”
data
Query
Generation
Executable transformation
(SQL/XQuery/XSLT/…)
“conforms to”
• XML Schema
• Relational
Mapping
Generation
SQL XQuery XSLT
Schema
Matching
GUI
• XML Schema
• Relational
Figure 1: Clio architecture
transformation runtime. Such abstractions are easier to understand
and to reason about, and are independent of any physical transforma-
tion language. Nonetheless, they capture most of the information that
is needed to generate the physical artifacts automatically. The query
generation component has then the role to convert a set of logical
mappings into an executable transformation script. Query generation
consists of a generic module, independent of a particular execution
language, and of a number of pluggable components that are specific
to each execution language: SQL, SQL/XML, XQuery and XSLT.
At any time during the design, a user can interact with the system
through a GUI component. The user can view, add and remove cor-
respondences between the schemas, can attach transformation func-
tions to such correspondences, can inspect and edit (in a controlled
way) the generated logical mappings, and can inspect (without edit-
ing) the generated transformation script. Finally, correspondences
can also be generated via an optional schema matching component.
Clio can interact with any schema matcher and also has its own built-
in schema matching algorithm. However, in this paper, we will focus
on the rest of the components, which are specific to Clio.
2. MAPPING AND QUERY GENERATION
Figure 2 illustrates an actual Clio screenshot showing portions of
two gene expression schemas and correspondences between these
schemas. We will use this as a running example to illustrate the
concepts, the algorithms, as well as the challenges that we faced in
designing the mapping and query generation components.
The left-hand (source) schema, GENEX, is a relational schema for
gene expression (microarray) data stored in the GeneX database1. It
is a schema of medium complexity; it consists of 63 tables that are
inter-related via key and foreign key constraints(there are 47 foreign
key constraints that encode the relationships between tables). The
right-hand (target) schema is an XML schema (GeneXML, formerly
GEML) intended for the exchange (with other databases) of gene
expression data. In addition to a structural change to a hierarchical
format, this schema also presents changes in the concrete elements
and attributes (some of the source GENEX elements do not have cor-
respondences in GeneXML and vice-versa). The XML schema is
24KB in size, including 286 elements and attributes. The number
of foreign keys (keyref) is much reduced compared to the relational
GENEX schema, since many of the original foreign keys are now
replaced by hierarchical parent-child relationships.
For illustration purposes, in this section we only describe the process
of mapping data about experiment sets and about the factors and
treatments that were applied to each experiment set. Thus, we ignore
a large portion of the two schemas and focus only on the relevant
tables and XML structures (shown in Figure 2). To generate a com-
plete transformation, the remainders of the schemas (e.g. samples,
array measurements, citations, software, etc.) will also have to be
1http://sourceforge.net/projects/genex
Figure 2: Schemas and correspondences in Clio.
matched; additional mappings will then be generated and the final
result will be materialized into a complete transformation script. In
Section 3 we give further details on the complexity and challenges of
generating a large-scale transformation.
Schemas in Clio The first step in Clio is loading and transforming the
schemas into an internal representation. The model that we adopt for
this purpose is a nested relational model, that is suitable for describ-
ing relational tables as well as the XML hierarchy. In its basic form,
the nested relational model consists of several fundamental types:
set, record and choice types, together with atomic (primitive) types.
For example, GEML in Figure 2 is represented as a record type and its
repeatable element exp set is represented via a set type. In addi-
tion to this basic nested relational model, Clio’s implementation of a
schema loader includes a series of add-on features to capture the in-
tricacies of XML Schema: data fields to remember whether a record
component corresponds to an element or an attribute and whether
the component is optional ornillable, various encodings to deal with
mixed content elements, derived types, type variables (used, in par-
ticular, to encode recursive types), etc.
2.1 Mapping Generation
The main idea of the algorithm [7] is to put correspondences in
groups (not necessarily disjoint) based on how their end-points (source,
respectively, target) relate to each other in the schemas. Thus, for any
two related elements in the source schema, for which there exist cor-
respondences into two related elements in the target schema, there
will be a group (and, subsequently, a mapping) that includes those
correspondences (and possibly more). As an example, FACTOR NAME
and BIOLOGY DESC are related in the source because there is a for-
eign key that links the EXPERIMENTFACTORtable to the EXPERIM-
ENTSET table); furthermore, biology desc and factor name
are related in the target because the latter is a child of the former.
Hence, there will be a mapping that maps related instances of BIOLO-
GY DESC and FACTOR NAME (and possibly more) into related in-
stances of biology descand factor name (and possibly more).
Generation of tableaux The first step of the algorithm is to generate
all the basic ways in which elements relate to each other within one
schema, based on the schema structure and constraints. This genera-
tion step (described in [7] in detail) considers each set-type element
in a nested schema as a direct generalization of the concept of a table
in a relational schema; it then joins to each such set-type element all
S1= ExpSet
S2= ExpFactor ExpSet
S3= TreatmentLevel ExpSet
S4= TL_FactorValue TreatmentLevel ExpSet
ExpFactor ExpSet
T1= GEML/exp_set
T2= GEML/exp_set/…/exp_factor
T3= GEML/exp_set/…/treatment
T4= GEML/exp_set/ { …/treatment/…/treat_factor …/exp_factor }
Figure 3: Source and target tableaux (informal notation).
the other set-type elements that can be reached by following foreign
key (keyref) constraints (a process called the chase). The result is a
set of tableaux2, one set in each schema.
In Figure 3 we show several of the source and target tableaux
that are generated for our example. (For brevity, we sometimes use
ExpSet instead of EXPERIMENTSET; similar abbreviations are also
used for the other tables.) For the source schema, ExpSet forms a
tableau by itself, because ExpSet is a top-level table (there are no
outgoing foreign keys). In contrast, ExpFactor does not form a
tableau by itself but needs ExpSet into which it has a foreign key. A
more complicated tableau is S4that is constructed for TL FactorVa-
lue. Each factor value is associated to one treatment level (thus, the
foreign key into TreatmentLevel) and each treatment level cor-
responds to one experiment set (thus, the foreign key into ExpSet).
However, a factor value is also an experiment factor (thus, the foreign
key into ExpFactor), and each experiment factor is associated to
an experiment set (hence, the second occurrence of ExpSet).
The above tableau S4illustrates the complexity that can arise even
when a relatively small number of tables is involved. An additional
constraint (which is true for the actual data) could be used to infer
that the two occurrences of ExpSet correspond to the same exper-
iment set instance. Clio’s chasing engine includes such an inference
mechanism. However, such a constraint is hard to extract in practice
(it is not a key, but a complicated dependency). The price to pay for
not having such a constraint will be further ambiguity that remains to
be solved during mapping generation (tobe described shortly).
For the target schema, the tableauxare more complex, due to nest-
ing and also to the fact that predicates (such as join conditions) can
have a context. While tableaux T1, T2and T3are straight paths to
set-type elements (e.g., exp factor), the tableau T4, intended to
denote the collection of treatment factors treat factor, also con-
tains a join with exp factor. Moreover, the join is relative to a
given instance of exp set. The reason for this is the existence of
a keyref constraint that associates every treat factor element
with an exp factor element, within the same exp set instance.
Such keyref constraints, part of the XML Schema specification, can
be easily specified by putting the constraint on the correct element in
the hierarchy (exp set instead of the root, for this example).
To represent tableaux such as T4and S4unambiguously, Clio uses
an internal notation based on: (1) generators, which are used to bind
variables to individual elements in sets, and (2) conditions. Path ex-
pressions, which can be absolute or relative to the bound variables,
can appear in both the generators and the conditions. As an example,
the internal form of T4is shown below.
Generators:
t0GEML/exp set, t1t0/exp set header/treatment list/treatment,
t2t1/treat factor list/treat factor,
t3t0/exp set header/exp factor list/exp factor ;
Conditions:
t2/factor id = t3/id
The use of collection-bound variables (rather than that of arbi-
trarily bound variables) has the advantage that the resulting notation
2also called logical relations in [7]. However, the term tableaux,
used in classical papers on the chase, was the one widely used during
the development of Clio. Hence, we stand by it here.
1. m1: s0/ExpFactor, s1/ExpSet where s0/es_fk = s1/es_pk
2. t0/GEML/exp_set, t1t0/exp_set_header/exp_factor_list/exp_factor
3. such that s1/LOCAL_ACCESSION = t0/@local_accession
4. s1/NAME = t0/@name s1/ES_PK = t0/@id
5. s1/RELEASE_DATE = t0/@release_date
6. s1/ANALYSIS_DESC = t0/exp_set_header/analysis_desc
1. m3: s0/TL_FactorValue, s1/TreatmentLevel, s2/ExpSet,
2. s3/ExpFactor, s4/ExpSet,
3. where s0/tl_fk = s1/tl_pk s1/es_fk = s2/es_pk
4. s0/ef_fk = s3/ef_pk s3/es_fk = s4/es_pk
5. t0/GEML/exp_set, t1t0/exp_set_header/treatment_list/treatment,
6. t2t1/treat_factor_list/treat_factor, t3t0/exp_set_header/exp_factor_list/exp_factor
7. where t2/factor_id = t3/id
8. such that
Figure 4: Two logical mappings.
maps easily to efficient iteration patterns (e.g., the from clause of
SQL, or the for loops of XQuery/XSLT).
Generation of logical mappings The second step of the mapping
generation algorithm is the generation of logical mappings. (Recall
that a schema mapping is a set of logical mappings.) The basic algo-
rithm in [7] pairs all the existing tableaux in the source with all the
existing tableaux in the target, and then finds the correspondences
that are covered by each pair. If there are such correspondences, then
the given pair of tableaux is a candidate of a logical mapping. (In
Section 3 we describe an additional filtering that takes place before
such a candidate is actually output to a user.)
In Figure 4 we show two of the resulting logical mappings, for our
example. The mapping m1is obtained from the pair (S2, T2), which
is covered by 10 correspondences. The source tableau is encoded in
the clause (with its associated where clause that stores the condi-
tions). A similar encoding happens for the target tableau, except that
an clause is used instead of . Finally, the such that clause encodes
all the correspondences between the source and the target that are
covered. (Only five of them are shown in the figure.)
A more complex mapping is m3that is obtained from the pair
(S4, T4), which is covered by all thecorrespondences shown in Fig-
ure 2. An additional complication in generating this mapping arises
from the fact that the correspondences that map EXPERIMENTSET
columns to exp set elements/attributes have multiple interpreta-
tions, because each correspondence can match either the first occur-
rence or the second occurrence of ExpSet in S4. This ambiguity
is resolved by generating all possible interpretations (e.g., all these
correspondences match the first occurrence of ExpSet, or all match
the second occurrence, or some match the first occurrence and some
match the second occurrence). A user would then have to go through
all these choices and select the desired semantics. The default choice
that we provide is the one in which all the ambiguous correspon-
dences match the first choice (e.g., the first occurrence of ExpSet).
For this example, all the different interpretations are in fact equiv-
alent, since the two occurrences of ExpSet represent the same in-
stance, due to the constraint discussed earlier.
In Clio, the tableaux that are constructed based on chasing form
only the basic (default) way of constructing mappings. Users have
the option of creating additional mappings through the mapping edi-
tor. In each schema, a new tableau can be specified by selecting the
needed collections and then creating predicates on those collections.
Each such tableau can then participate in the algorithm for mapping
generation. In other words, each tableau will be paired with all the
tableaux in the opposite schema to reach all possible mappings, based
on covered correspondences.
Mapping language To the language that we have just described, we
must add an additional construct: Skolem functions. These functions
can explicitly represent target elements for which no source value is
given. For example, the mapping m1of Figure 4 will not specify
a value for the @id attribute under exp factor (because there is
no correspondence to map into @id). To create a unique value for
this attribute, which is required by the target schema, a Skolem func-
1. for $x0 in $doc0/GENEX/EXPERIMENTFACTOR,
2. $x1 in $doc0/GENEX/EXPERIMENTSET
3. where
4. $x1/ES_PK/text() = $x0/ES_FK/text()
5. return
6. <exp_set>
7. { attribute id { $x1/ES_PK } }
8. { attribute name { $x1/NAME } }
9. { attribute local_accession { $x1/LOCAL_ACCESSION } }
10. { attribute release_date { $x1/RELEASE_DATE } }
11. <exp_set_header>
12. <biology_desc> { $x1/BIOLOGY_DESC/text() } </biology_desc>
13. <analysis_desc> { $x1/ANALYSIS_DESC/text() } </analysis_desc>
14. <exp_factors_list> {
15 for $x0L1 IN $doc0/GENEX/EXPERIMENTFACTOR,
16. $x1L1 IN $doc0/GENEX/EXPERIMENTSET
17. where
18. $x1L1/ES_PK/text() = $x0L1/ES_FK/text() and
19. $x1/BIOLOGY_DESC/text() = $x1L1/BIOLOGY_DESC/text() and
20. $x1/ANALYSIS_DESC/text() = $x1L1/ANALYSIS_DESC/text() and
21. $x1/NAME/text() = $x1L1/NAME/text() and
22. $x1/LOCAL_ACCESSION/text() = $x1L1/LOCAL_ACCESSION/text() and
23. $x1/RELEASE_DATE/text() = $x1L1/RELEASE_DATE/text()
24. return
25. <exp_factor>
26. { attribute factor_name { $x0L1/FACTOR_NAME } }
27. { attribute factor_units { $x0L1/FACTOR_UNITS } }
28. { attribute major_category { $x0L1/MAJOR_CATEGORY } }
29. { attribute minor_category { $x0L1/MINOR_CATEGORY } }
30. <id>
{"SK6(",$x1L1/BIOLOGY_DESC/text(),$x1L1/ANALYSIS_DESC/text(),…}</id>
31. </exp_factor>
32. } </exp_factors_list>
33. </exp_set_header>
34. </exp_set>
Figure 5: XQuery fragment for one of the logical mappings.
tion can be generated (and the siblings and ancestors of @id, such as
factor name and biology desc, which contain a source value,
will appear as arguments of the function). In general, the gener-
ation of Skolem functions could be postponed until query genera-
tion (see Section 2.2) and the schema mapping language itself could
avoid Skolem functions. However, to be able to express mappings
that arise from mapping composition (e.g., a mapping that is equiva-
lent to the sequence of two consecutive schema mappings), functions
must be part of the language [2]. Clio has been recently extended
to support mapping composition, an important feature of metadata
management. Hence, the schema mapping language used in Clio in-
cludes Skolem functions; the resulting language is a nested relational
variation of the language of second-order tgds of Faginet al [2].
2.2 Query Generation
Our basic query generation operates as follows. Each logical map-
ping is compiled into a query graph that encodes how each target
element/attribute is populated from the source-side data. For each
logical mapping, query generators walk the relevant part of the target
schema and create the necessary join and grouping conditions. The
query graph also includes information on what source data-values
each target element/attribute depends on. This is used to generate
unique values for required target elements as well as for grouping.
In Figure 5, we show the XQuery fragment that produces the target
<exp set> elements as prescribed by logical mapping m13. The
query graph encodes that an <exp set> element will be generated
for every tuple produced by the join of the source tables EXPERIMENT-
FACTOR and EXPERIMENTSET (see lines 1–2 in m1). Lines 1–4 in
Figure 5 implement this join. Lines 7–10 output the attributes within
exp set and implement lines 3–5 of m1. Then, we need to pro-
duce the repeatable exp factor elements. The query graph pre-
scribes two things about exp factor: 1) that it will be generated
for every tuple that results from the join of EXPERIMENTFACTOR
and EXPERIMENTSET (lines 15–18 in the query – they are the same
3The complete XQuery is the concatenation of all the fragments that
correspond to all the logical mappings.
as 1–4 except for the variable renaming), and, 2) since it appears
nested within exp set, that such tuples join with the current tuple
from the outer part of the query to create the proper grouping (lines
19–23 – requiring that the values for exp set in the inner tuple be
the same as in the outer tuple). The grouping condition in lines 19–23
does not appear anywhere in logical mapping m1. This is computed
in the query graph when the shape of the target schema is taken into
consideration. Finally, lines 25–31 produce an actual exp factor
element. Of particular interest, line 30 creates a value for the id el-
ement. No correspondence exists for the id element and, thus, there
is no value for it in m1. However, since id is a required target ele-
ment, the query generator produces a Skolem value for id based on
the dependency information stored in the query graph.
3. PRACTICAL CHALLENGES
In this section, we present several of the implementation and op-
timization techniques that we developed in order to address some of
the many challenging issues in mapping and query generation.
3.1 Scalable Incremental Mapping Generation
One of the features of the basic mapping generation algorithm
is that it enumerates a priori all possible “skeletons” of mappings,
that is, pairs of source and target tableaux. In a second phase (in-
sertion/deletion of correspondences), mappings are generated, based
on the precomputed tableaux, as correspondences are added in or
removed. This second phase must be incremental in that, after the
insertion of a correspondence (or of a batch of correspondences) or
after the removal of a correspondence, a new mapping state must be
efficiently computed based on the previous mapping state. This is an
important requirement since one of the main uses of Clio is interac-
tive mapping generation and editing. To achieve this, the main data
structure in Clio (the mapping state) is the list of skeletons, where
a skeleton is a pair of tableaux (source and target) together with all
the inserted correspondences that match the given pair of tableaux.
When correspondences are inserted or deleted, the relevant skeletons
are updated. Furthermore, at any given state, only a subset of all
the skeletons is used to generate mappings. The other skeletons are
deemed redundant (although they could become non-redundant as
more correspondences are added4). This redundancy check, which
we describe next, can significantly reduce the amount of irrelevant
mappings that a user has to go through.
Redundancy check A tableau T1is a sub-tableau of a tableau T2
if there is a one-to-one mapping of the variables of T1into the vari-
ables of T2, so that all the generators and all the conditions of T1
become respective subsets of the generators and conditions of T2. A
pair (Si, Tj)of source and target tableaux is a sub-skeleton of a sim-
ilar pair (S
i, T
j)if Siis a sub-tableau of S
iand Tjis a sub-tableau
of T
j. The sub-tableaux relationships in each of the two schemas
as well as the resulting sub-skeleton relationship are also precom-
puted in the first phase, to speed up the subsequent processing that
occurs in the second phase. When correspondences are added, in
the second phase, if (Si, Tj)is a sub-skeleton of (S
i, T
j)and the
set Cof correspondences covered by (S
i, T
j)is the same as the set
of correspondences covered by (Si, Tj), then the mapping based on
(S
i, T
j)and Cis redundant. Intuitively, we do not want to use large
tableaux unless more correspondences will be covered. For exam-
ple, suppose that we have inserted only two correspondences, from
BIOLOGY DESC and ANALYSIS DESC of EXPERIMENTSET to
biology desc and analysis desc under exp set. These
two correspondences match on the skeleton (S2, T2)where S2and
T2are the tableaux in Figure 3 involving experiment sets and ex-
periment factors. However, this skeleton and the resulting mapping
are redundant, because there is a sub-skeleton (S1, T1)that covers
4And vice-versa, non-redundant skeletons can become redundant as
correspondences are removed.
the same correspondences, where S1and T1are the tableaux in Fig-
ure 3 involving experiment sets. Until a correspondence that maps
EXPERIMENTFACTOR is added, there is no need to generate a logi-
cal mapping that involves EXPERIMENTFACTOR.
The hybrid algorithm The separation of mapping generation into
the two phases (precomputation of tableaux and then insertion of cor-
respondences) has one major advantage: when correspondences are
added, no time is spent on finding associations between the elements
being mapped. All the basic associations are already computed and
encoded in the tableaux; the correspondences are just matched on
the relevant skeletons, thus speeding up the user addition of corre-
spondences in the GUI. There is also a downside: when schemas are
large, the number of tableaux can also become large. The number of
skeletons (which is the product of the numbers of source and, respec-
tively, target tableaux) is even larger. A significant amount of time
is then spent in the preprocessing phase to compute the tableaux and
skeletons as well as the sub-tableaux and sub-skeleton relationships.
A large amount of memory may be needed to hold all the data struc-
tures. Moreover, some of the skeletons may not be needed (or at least
not until some correspondence is added that matches them).
A more scalable solution, which significantly speeds up the initial
process of loading schemas and precomputing tableaux, without sig-
nificantly slowing down the interactive process of adding and remov-
ing correspondences, is the hybrid algorithm. The main idea behind
this algorithm is to precompute only a bounded number of source
tableaux and target tableaux (up to a certain threshold, such as 100
tableaux in each schema). We give priority to the top tableaux (i.e.,
tableaux that include top-level set-type elements without including
the more deeply nested set-type element). When a user interacts with
the Clio GUI, she would usually start by browsing the schemas from
the top and, hence, by adding top-level correspondences, that will
match the precomputed skeletons.
However, correspondences between elements that are deeper in the
schema trees may fail to match on any ofthe precomputed skeletons.
We then generate, on the fly, the needed tableaux based on the end-
points of such a correspondence. Essentially, we generate a source
tableau that includes all the set-type elements that are ancestors of the
source element in the correspondence (and similarly for the target).
The tableaux are then closed under the chase, thus including all the
other schema elements that are associated via foreignkeys. Next, the
data structures holding the sub-tableaux and the sub-skeleton rela-
tionships are updated, incrementally, now that the tableaux and a new
skeleton have been added. The new correspondence will then match
the newly generated skeleton, and a new mapping can be generated.
Overall, the performance of adding a correspondence takes a hit, but
the amount of tableaux computation is limited locally (surrounding
the end-points) and is usually quite acceptable. As a drawback, the
algorithm may lose its completeness, in that there may be certain
associations between schema elements that will no longer be consid-
ered (they would appear if we were to compute all the tableaux as in
the basic algorithm). Still, this is a small price to pay, compared to
the ability to load and map (at least partially) two complex schemas.
Performance evaluation: mapping MAGE-ML We now give an
idea of the effectiveness of the hybrid generation algorithm by illus-
trating it on a mapping scenario that is close to worst case in practice.
We load the same complex XML schema on both sides and experi-
ment with the creation of the identity mapping. The schema that we
consider is the MAGE-ML schema 5, intended to provide a standard
for the representation of microarray expression data that would facil-
itate the exchange of microarray information between different data
systems. MAGE-ML is a complex XML schema: it features many
recursive types, contains 422 complex type definitions and1515 ele-
ments and attributes, and is 172KB in size.
We perform two experiments. In the first one, we control the
nesting level of the precomputed tableaux (maximum 6 nested lev-
5http://www.mged.org/Workgroups/MAGE/mage.html
els of sets), but we set no limit on the total number of precomputed
tableaux, when loading a schema. This experiment gives us a lower
bound estimation on the amount of time and memory that the basic
algorithm (that precomputes all the tableaux) requires. In the second
experiment, we control the nesting level of the precomputed tableaux
(also, maximum 6) and the total number of precomputed tableaux
(maximum 110 per schema). This experiment shows the actual im-
provement that the hybrid algorithm achieves.
For the first experiment, we looked at the time to load the MAGE-
ML schema on one side only (as source). This includes precomput-
ing the tableaux as well as the sub-tableaux relationship. (The time
to compile, in its entirety, the types of MAGE-ML, into the nested re-
lational model poses no problem; it is less than 1 second.) We were
able to precompute all (1030) the tableaux that obey the nesting level
limit, in about 2.6 seconds. However, computing thesub-tableaux re-
lationship (checking all pairs of the 1030 tableaux for the sub-tableau
relationship) takes 74 seconds. The total amount of memory to hold
the necessary data structures is 335MB. Finally, loading the MAGE-
ML schema on the target side causes the system to run out of memory
(512MB were allocated).
For the second experiment, we also start by loading the MAGE-
ML schema on the source side. The precomputation of the tableaux
(116 now) takes 0.5 seconds. Computing the sub-tableaux relation-
ship (checking all pairs of the 116 tableaux to record if one is asub-
tableau of another) takes 0.7 seconds. The total amount of mem-
ory to hold the necessary data structures is 163MB. We were then
able to load the MAGE-ML schema on the target side in time that is
similar to that of loading the schema on the source side. The sub-
sequent computation of the sub-skeleton relationship (checking all
pairs of the 13456 = 116 x 116 skeletons to record whether one is a
sub-skeleton of another) takes 35 seconds. The amount of memory
needed to hold everything at this point is 251MB. We then measured
the performance of adding correspondences. Adding a correspon-
dence for which there are precomputed matching skeletons is 0.2
seconds. (This includes the time to match and the time to recompute
the affected logical mappings.) Removing a correspondence requires
less time. Adding a correspondence for which no precomputed skele-
ton matches and for which new tableaux and a new skeleton must be
computed on the fly takes about 1.3 seconds. (This also includes the
time to incrementally recompute the sub-tableaux and sub-skeleton
relationships.) Overall, we found the performance of the hybrid al-
gorithm to be quite acceptable and we were able to easily generate
30 of the (many!) logical mappings. Generating the executable script
(query) that implements the logical mappings takes, additionally, a
few seconds. To give a feel of the complexity of the MAGE-ML
transformation, the executable script corresponding to the 30 logical
mappings is 25KB (in XQuery) and 131KB (in XSLT).
The Clio implementation is in Java and the experiments were run
on a 1600MHz Pentium processor with 1GB main memory.
3.2 Query Generation: Deep Union
There are two major drawbacks that the query generation algo-
rithm described in Section 2.2 suffers from: there is no duplicate
removal within and among query fragments, and there is no group-
ing of data among query fragments. For instance, suppose we are
trying to create a list of orders with a list of items nested inside. As-
sume the input data comes from a simple relational table Orders
(OrderID,ItemID). If the input data looks like {(o1,i1), (o1,i2)},
our nested query solution produces the following output data: {(o1,(i1,
i2)), (o1,(i1,i2))}. The grouping of items within each order is what
the user expected, but users may reasonably expect that only one in-
stance of o1appears in the result. Even if we eliminate duplicates
from the result of one query fragment, our mapping could result in
multiple query fragments, each producing duplicates or extra infor-
mation that needs to be merged with the result of another fragment.
For example, assume that a second query fragment produces a tuple
WITH
ExpSetFlat AS -- Q1: The join and union of the relational data for ExpSet
(SELECT DISTINCT
x1.BIOLOGY_DESC AS exp_set_exp_set_header_biology_desc,
x1.ANALYSIS_DESC AS exp_set_exp_set_header_analysis_desc,
x0.ES_FK AS exp_set_id,
VARCHAR('Sk_GEML_2(' || x1.BIOLOGY_DESC || x1.ANALYSIS_DESC || … || ')') AS ClioSet0,
VARCHAR('Sk_GEML_3(' || x1.BIOLOGY_DESC || x1.ANALYSIS_DESC || … || ')') AS ClioSet1
FROM GENEX.EXPERIMENTFACTOR x0, GENEX.EXPERIMENTSET x1
WHERE x0.ES_FK = x1.ES_PK
UNION
SELECT DISTINCT
x1.BIOLOGY_DESC AS exp_set_exp_set_header_biology_desc,
x1.ANALYSIS_DESC AS exp_set_exp_set_header_analysis_desc,
x0.ES_FK AS exp_set_id,
VARCHAR('Sk_GEML_2(' || x1.BIOLOGY_DESC || x1.ANALYSIS_DESC || … || ')') AS ClioSet0,
VARCHAR('Sk_GEML_3(' || x1.BIOLOGY_DESC || x1.ANALYSIS_DESC || … || ')') AS ClioSet1
FROM GENEX.TREATMENTLEVEL x0, GENEX.EXPERIMENTSET x1
WHERE x0.ES_FK = x1.ES_PK),
ExpFactorFlat AS -- Q2: The join of relational data for ExpFactor
(SELECT DISTINCT
VARCHAR('SK29(' || x0.FACTOR_NAME || … || ')') AS exp_factor_id,
x0.FACTOR_NAME AS exp_factor_factor_name,
VARCHAR('Sk_GEML_2(' || x1.BIOLOGY_DESC || x1.ANALYSIS_DESC || … || ')') AS InSet
FROM GENEX.EXPERIMENTFACTOR x0, GENEX.EXPERIMENTSET x1
WHERE x0.ES_FK = x1.ES_PK),
TreatmentLevelFlat AS -- Q3: The join of relational data for TreatmentLevel
(SELECT DISTINCT
VARCHAR('SK109(' || x0.NAME || ',' … || ')') AS treatment_id,
x0.NAME AS treatment_treatment_name,
VARCHAR('Sk_GEML_4 (' || 'SK110(' || x0.NAME || x1.RELEASE_DATE || … || ')') AS ClioSet0,
VARCHAR('Sk_GEML_3(' || x1.BIOLOGY_DESC || x1.ANALYSIS_DESC || … || ')') AS InSet
FROM GENEX.TREATMENTLEVEL x0, GENEX.EXPERIMENTSET x1
WHERE x0.ES_FK = x1.ES_PK),
Figure 6: SQL/XML script, relational part.
{(o1,(i3))}. We would expect this tuple to be merged with the previ-
ous result and produce only one tuple for o1with three items nested
inside. We call this special union operation deep-union.
We illustrate the algorithm by showing the generated query in the
case of SQL/XML, a recent industrial standard that extends SQL with
XML construction capabilities. For the example, we assume that we
are only interested in generating the transformation for two of our
logical mappings: m1(mapping experiment sets with their associ-
ated experiment factors) and m2(which is similar to m1and maps
experiment sets with their associated treatment levels).
The generated SQL/XML script can be separated in two parts. The
first part (shown in Figure 6) generates a flat representation of the out-
put in which a collection of tuples is represented by a system gener-
ated ID, and each tuple contains the ID of the collection it is supposed
to belong to. The purpose of the second part (Figure 7) is to recon-
struct the hierarchical structure of the target by joining tuples based
on their IDs (i.e., joining parent collections with the corresponding
children elements based on IDs). The final result is free of duplicates
and merged according to the deep union semantics.
Briefly, Q1 joins EXPERIMENTFACTORwith EXPERIMENTSET
and will be used to populate the atomic components at the exp set
level in the target (thus, not including the atomic data that goes under
exp factor and treatment). Two set-IDs are generated in Q1
(under the columns ClioSet0 and ClioSet1), for each different
set of values that populate the atomic components at the exp set
level. The first one, ClioSet0, will be used to group exp factor
elements under exp set, while ClioSet1 will be used to group
treatment elements. The values for the set-IDs are generated as
strings, by using two distinct Skolem functions that depend on all the
atomic data at the exp set level. The atomic data for exp factor
and treatment are created by Q2 and Q3, respectively. In both
cases, a set-ID (named InSet) is created to capture what experi-
ment set the data belongs to. The main idea here is that, as long as
the values that go into the atomic components at the exp set level
are the same, the InSet set-ID will match the set-ID stored under
ClioSet0 (in the case of Q2) or ClioSet1 (in the case of Q3).
On a different note, we remark that all queries that appear in the first
half of the script (e.g., Q1, Q2, and Q3) use the DISTINCT clause
ExpFactorXML AS -- Q4: Add XML tags to the data from Q2.
(SELECT
x0.InSet AS InSet,
xml2clob(
xmlelement(name "exp_factor",
xmlattribute(name "factor_name", x0.exp_factor_factor_name),
xmlelement(name "id", x0.exp_factor_id),
)) AS XML
FROM ExpFactorFlat x0),
TreatmentLevelXML AS -- Q5: Add XML tags to the data from Q3.
(SELECT
x0.InSet AS InSet,
xml2clob(
xmlelement(name "treatment",
xmlattribute(name "id", x0.treatment_id),
xmlattribute(name "treatment_name", x0.treatment_treatment_name)
)) AS XML
FROM TreatmentLevelFlat x0),
-- Q6: Combines the results of Q1, Q4, and Q5 into one XML document.
SELECT xml2clob(xmlelement (name "exp_set",
xmlattribute (name "id", x0.exp_set_id),
xmlelement (name "exp_set_header",
xmlelement (name "biology_desc", x0.exp_set_exp_set_header_biology_desc),
xmlelement (name "analysis_desc", x0.exp_set_exp_set_header_analysis_desc),
xmlelement (name "exp_factors_list",
(SELECT xmlagg (x1.XML)
FROM ExpFactorXML x1
WHERE x1.InSet = x0.ClioSet0)),
xmlelement (name "treatment_list",
(SELECT xmlagg (x1.XML)
FROM TreatmentLevelXML x1
WHERE x1.InSet = x0.ClioSet1)))
)) AS XML
FROM ExpSetFlat x0
Figure 7: SQL/XML script continued, XML construction part.
to remove duplicate values.
In the second half of the script, the query fragments Q4 and Q5
perform the appropriate XML tagging of the results of Q2 and Q3.
Finally, Q6 tags the exp set result of Q1 and, additionally, joins
with Q4 and Q5 using the created set-IDs, in order to nest all the
corresponding exp factor and treatment elements.
4. REMAINING CHALLENGES
There remain a number of open issues regarding scalability and
expressiveness of mappings. Complex mappings sometimes need a
more expressive correspondence selection mechanism than that sup-
ported by Clio. For instance, deciding which group of correspon-
dences to use in a logical mapping may be based on a set of pred-
icates. We are also exploring the need for logical mappings that
nest other logical mappings inside. Finally, we are studying mapping
adaptation issues that arise when sourceand target schemas change.
5. REFERENCES
[1] P. Bernstein. Applying Model Management to Classical Meta Data
Problems. In CIDR, 2003.
[2] R. Fagin, P. Kolaitis, L. Popa, and W.-C. Tan. Composing Schema
Mappings: Second-Order Dependencies to the Rescue. In PODS, 2004.
[3] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange:
Semantics and Query Answering. In ICDT, 2003.
[4] M. Lenzerini. Data Integration: A Theoretical Perspective. In PODS,
2002.
[5] S. Melnik, P.A. Bernstein, A. Halevy, andE. Rahm. Supporting
Executable Mappings in Model Management. In SIGMOD, 2005.
[6] R. J. Miller, L. M. Haas, and M.A. Hern´andez. Schema Mapping as
Query Discovery. In VLDB, 2000.
[7] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hern´andez, and R. Fagin.
Translating Web Data. In VLDB, 2002.
[8] E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic
Schema Matching. The VLDB Journal, 10(4):334–350, 2001.
[9] N. C. Shu, B. C. Housel, R. W. Taylor, S. P. Ghosh, and V. Y. Lum.
EXPRESS: A Data EXtraction, Processing, and REStructuring System.
TODS, 2(2):134–174, 1977.
... There are many techniques for schema matching ranging from comparison of column names by string similarity to machine learning algorithms; for an overview, see [5]. In the example from Section 2.1, two correspondences that are easy to automatically find are (N1, N) and (N2, N) and tools such as Clio [14] can create the ED from Section 2.1 from these two correspondences. ...
... Many software products solve model management problems [4], including ETL (Extract, Transform, Load) tools [5], which extract data from separate databases, apply user-specified transformations, and then load the result into a target system such as a data warehouse; query mediators [5], which answer queries about a "virtual" integrated database by combining queries about separate source databases; and visual schema mapping tools [14] which allow users to create schema mappings by visually connecting related schema elements with lines, as shown in Figure 2. ...
... In [2] and [13] the authors give an "institution theoretic" meta semantics to some of the above operations by defining a schema mapping S → T to be a morphism in a suitable category of schemas; CQL's semantics is an instantiation of this meta semantics. [14] Our approach to model management is based on the algebraic approach to databases, data migration, and data integration we describe in [19] and [20]. Those works, and hence this work, extend a particular category-theoretic data model that originated in the late 1990s [11] and was later extended in [21] and [23] and implemented in CQL (http://categoricaldata.net). ...
Preprint
We survey the field of model management and describe a new model management approach based on algebraic specification.
... The composite schema matching is performed method independently collaboratively and combined on the result [9]. The Clio [13]- [16], Cupid [17], SYM [18], as well as [19] are examples of hybrid schema matching. Meanwhile, composite combination methods are used in SemInt ( [20]- [21]), LSD [9], COMA [22], COMA++ [23], COMA 3.0 ( [24]- [25]), IMAP [26], Protoplasm ( [27]- [30]), Falcon-AO ( [31]- [32]), and ASMOV [33]. ...
... Our interest in left Kan extensions comes from their use in data migration [33,34,37], where C and D represent database schemas, F represents a "schema mapping" [18] defining a translation from schema C to D, and I represents an input C-database (often called an instance) that we wish to migrate to schema D. Our implementation of the fastest left Kan algorithm we knew of from existing literature [9] was impractical for large input instances, yet it bore a striking operational resemblance to an algorithm from relational database theory known as the chase [13], which is also used to solve data migration problems, and for which efficient implementations are known [7]. The chase takes an input instance I and a set of formulae F in a subset of first-order logic known to logicians as existential Horn logic [13], to category theorists as regular logic [29], to database theorists as datalog-E, and/or embedded dependencies [13], and to topologists as lifting problems [36], and constructs an F -model chase F (I) that is "universal" among other such "F-repairs" of I. ...
Article
Full-text available
We show how computation of left Kan extensions can be reduced to computation of free models of cartesian (finite-limit) theories. We discuss how the standard and parallel chase compute weakly free models of regular theories and free models of cartesian theories and compare the concept of “free model” with a similar concept from database theory known as “universal model”. We prove that, as algorithms for computing finite-free models of cartesian theories, the standard and parallel chase are complete under fairness assumptions. Finally, we describe an optimized implementation of the parallel chase specialized to left Kan extensions that achieves an order of magnitude improvement in our performance benchmarks compared to the next fastest left Kan extension algorithm we are aware of.
... Our interest in left Kan extensions comes from their use in data migration [33,37,34], where C and D represent database schemas, F represents a "schema mapping" [18] defining a translation from schema C to D, and I represents an input C-database (often called an instance) that we wish to migrate to schema D. Our implementation of the fastest left Kan algorithm we knew of from existing literature [9] was impractical for large input instances, yet it bore a striking operational resemblance to an algorithm from relational database theory known as the chase [13], which is also used to solve data migration problems, and for which efficient implementations are known [7]. The chase takes an input instance I and a set of formulae F in a subset of first-order logic known to logicians as existential Horn logic [13], to category theorists as regular logic [29], to database theorists as datalog-E and/or embedded dependencies [13], and to topologists as lifting problems [36], and constructs an F-model chase F (I) that is "universal" among other such "F-repairs" of I. ...
Preprint
We show how computation of left Kan extensions can be reduced to computation of free models of cartesian (finite-limit) theories. We discuss how the standard and parallel chase compute weakly free models of regular theories and free models of cartesian theories, and compare the concept of "free model" with a similar concept from database theory known as "universal model". We prove that, as algorithms for computing finite free models of cartesian theories, the standard and parallel chase are complete under fairness assumptions. Finally, we describe an optimized implementation of the parallel chase specialized to left Kan extensions that achieves an order of magnitude improvement in our performance benchmarks compared to the next fastest left Kan extension algorithm we are aware of.
Chapter
Full-text available
Data Spaces form a network for sovereign data sharing. In this chapter, we explore the implications that the IDS reference architecture will have on typical scenarios of federated data integration and question answering processes. After a classification of data integration scenarios and their special requirements, we first present a workflow-based solution for integrated data materialization that has been used in several IDS use cases. We then discuss some limitations of such approaches and propose an additional approach based on logic formalisms and machine learning methods that promise to reduce data traffic, security, and privacy risks while helping users to select more meaningful data sources.
Article
Full-text available
Schema matching is critical problem within many applications to integration of data/information, to achieve interoperability, and other cases caused by schematic heterogeneity. Schema matching evolved from manual way on a specific domain, leading to a new models and methods that are semi-automatic and more general, so it is able to effectively direct the user within generate a mapping among elements of two the schema or ontologies better. This paper is a summary of literature review on models and prototypes on schema matching within the last 25 years to describe the progress of and research chalenge and opportunities on a new models, methods, and/or prototypes.
Article
Full-text available
Schema matching is an important process in the Enterprise Information Integration (EII) which is at the level of the back end to solve the problems due to the schematic heterogeneity. This paper is a summary of preliminary result work of the model development stage as part of research on the development of models and prototype of hybrid schema matching that combines two methods, namely constraint-based and instance-based. The discussion includes a general description of the proposed models and the development of models, start from requirement analysis, data type conversion, matching mechanism, database support, constraints and instance extraction, matching and compute the similarity, preliminary result, user verification, verified result, dataset for testing, as well as the performance measurement. Based on result experiment on 36 datasets of heterogeneous RDBMS, it obtained the highest P value is 100.00% while the lowest is 71.43%; The highest R value is 100.00% while the lowest is 75.00%; and F-Measure highest value is 100.00% while the lowest is 81.48%. Unsuccessful matching on the model still happens, including use of an id attribute with data type as autoincrement; using codes that are defined in the same way but different meanings; and if encountered in common instance with the same definition but different meaning.
Chapter
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions. Poor data across businesses and the U.S. government are reported to cost trillions of dollars a year. Multiple surveys show that dirty data is the most common barrier faced by data scientists. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rife with deep theoretical and engineering problems. This book is about data cleaning, which is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, we give an overview of the endto- end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views. Specifically, we cover four of the most common and important data cleaning tasks, namely, outlier detection, data transformation, error repair (including imputing missing values), and data deduplication. Furthermore, due to the increasing popularity and applicability of machine learning techniques, we include a chapter that specifically explores how machine learning techniques are used for data cleaning, and how data cleaning is used to improve machine learning models. This book is intended to serve as a useful reference for researchers and practitioners who are interested in the area of data quality and data cleaning. It can also be used as a textbook for a graduate course. Although we aim at covering state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active field of research and therefore provide future directions of research whenever appropriate.
Book
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions. Poor data across businesses and the U.S. government are reported to cost trillions of dollars a year. Multiple surveys show that dirty data is the most common barrier faced by data scientists. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and is rife with deep theoretical and engineering problems. This book is about data cleaning, which is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, we give an overview of the endto- end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views. Specifically, we cover four of the most common and important data cleaning tasks, namely, outlier detection, data transformation, error repair (including imputing missing values), and data deduplication. Furthermore, due to the increasing popularity and applicability of machine learning techniques, we include a chapter that specifically explores how machine learning techniques are used for data cleaning, and how data cleaning is used to improve machine learning models. This book is intended to serve as a useful reference for researchers and practitioners who are interested in the area of data quality and data cleaning. It can also be used as a textbook for a graduate course. Although we aim at covering state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active field of research and therefore provide future directions of research whenever appropriate.
Conference Paper
Full-text available
To enable modern data intensive applications including data warehousing, global information systems and electronic commerce, we must solve the schema mapping problem in which a source (legacy) database is mapped into a different, but fixed, target schema. Schema mapping involves the discovery of a query or set of queries that transform the source data into the new structure. We introduce an interactive mapping creation paradigm based on value correspondences that show how a value of a target attribute can be created from a set of values of source attributes. We describe the use of the value correspondence framework in Clio, a prototype tool for semi-automated schema mapping, and present an algorithm for query derivation from an evolving set of value correspondences.
Conference Paper
Full-text available
Model management is an approach to simplify the programming of metadata-intensive applications. It offers developers powerful operators, such as Compose, Diff, and Merge, that are applied to models, such as database schemas or interface specifications, and to mappings between models. Prior model management solutions focused on a simple class of mappings that do not have executable semantics. Yet many metadata applications require that mappings be executable, expressed in SQL, XSLT, or other data transformation languages.In this paper, we develop a semantics for model-management operators that allows applying the operators to executable mappings. Our semantics captures previously-proposed desiderata and is language-independent: the effect of the operators is expressed in terms of what they do to the instances of models and mappings. We describe an implemented prototype in which mappings are represented as dependencies between relational schemas, and discuss algebraic optimization of model-management scripts.
Article
Full-text available
Schema matching is a basic problem in many database application domains, such as data integration, E- business, data warehousing, and semanticquery proc essing. In current implementations, schema matching is typically per- formed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match op- eration for specific application domains. We present a taxon- omy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distin- guish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some pre- vious match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different ap- proaches to schema matching, when developing a new match algorithm, and when implementing a schema matching com- ponent.
Conference Paper
Full-text available
Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents on overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
Article
Full-text available
A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a di#erent schema (the target schema). Schema mappings play a key role in numerous areas of database systems, including database design, information integration, and model management. A fundamental problem in this context is composing schema mappings: given two successive schema mappings, derive a schema mapping between the source schema of the first and the target schema of the second that has the same e#ect as applying successively the two schema mappings.
Conference Paper
A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). A fundamental problem is composing schema mappings: given two successive schema mappings, derive a schema mapping between the source schema of the first and the target schema of the second that has the same effect as applying successively the two schema mappings.In this article, we give a rigorous semantics to the composition of schema mappings and investigate the definability and computational complexity of the composition of two schema mappings. We first study the important case of schema mappings in which the specification is given by a finite set of source-to-target tuple-generating dependencies (source-to-target tgds). We show that the composition of a finite set of full source-to-target tgds with a finite set of tgds is always definable by a finite set of source-to-target tgds, but the composition of a finite set of source-to-target tgds with a finite set of full source-to-target tgds may not be definable by any set (finite or infinite) of source-to-target tgds; furthermore, it may not be definable by any formula of least fixed-point logic, and the associated composition query may be NP-complete. After this, we introduce a class of existential second-order formulas with function symbols and equalities, which we call second-order tgds , and make a case that they are the “right” language for composing schema mappings. Specifically, we show that second-order tgds form the smallest class (up to logical equivalence) that contains every source-to-target tgd and is closed under conjunction and composition. Allowing equalities in second-order tgds turns out to be of the essence, even though the “obvious” way to define second-order tgds does not require equalities. We show that second-order tgds without equalities are not sufficiently expressive to define the composition of finite sets of source-to-target tgds. Finally, we show that second-order tgds possess good properties for data exchange and query answering: the chase procedure can be extended to second-order tgds so that it produces polynomial-time computable universal solutions in data exchange settings specified by second-order tgds.
Article
EXPRESS is an experimental prototype data translation system which can access a wide variety of data and restructure it for new uses. The system is driven by two very high level nonprocedural languages: DEFINE for data description and CONVERT for data restructuring. Program generation and cooperating process techniques are used to achieve efficient operation. This paper describes the design and implementation of EXPRESS. DEFINE and CONVERT are summarized and the implementation architecture presented. The DEFINE description is compiled into a customized PL/1 program for accessing source data. The restructuring specified in CONVERT is compiled into a set of customized PL/1 procedures to derive multiple target files from multiple input files. Job steps and job control statements are generated automatically. During execution, the generated procedures run under control of a process supervisor, which coordinates buffer management and handles file allocation, deallocation, and all input/output requests. The architecture of EXPRESS allows efficiency in execution by avoiding unnecessary secondary storage references while at the same time allowing the individual procedures to be independent of each other. Its modular structure permits the system to be extended or transferred to another environment easily.
Article
We present a novel framework for mapping between any combination of XML and relational schemas, in which a high-level, userspecified mapping is translated into semantically meaningful queries that transform source data into the target representation. Our approach works in two phases. In the first phase, the high-level mapping, expressed as a set of inter-schema correspondences, is converted into a set of mappings that capture the design choices made in the source and target schemas (including their hierarchical organization as well as their nested referential constraints).
Article
Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. In this paper, we address foundational and algorithmic issues related to the semantics of data exchange and to query answering in the context of data exchange. These issues arise because, given a source instance, there may be many target instances that satisfy the constraints of the data exchange problem. We give an algebraic specification that selects, among all solutions to the data exchange problem, a special class of solutions that we call universal. A universal solution has no more and no less data than required for data exchange and it represents the entire space of possible solutions. We then identify fairly general, and practical, conditions that guarantee the existence of a universal solution and yield algorithms to compute a canonical universal solution efficiently. We adopt the notion of "certain answers" in indefinite databases for the semantics for query answering in data exchange. We investigate the computational complexity of computing the certain answers in this context and also study the problem of computing the certain answers of target queries by simply evaluating them on a canonical universal solution.