Laconic schema mappings: Computing the core with sql queries
ABSTRACT A schema mapping is a declarative specification of the relationship between instances of a source schema and a target schema. The data exchange (or data translation) problem asks: given an instance over the source schema, materialize an instance (or solution) over the target schema that satisfies the schema mapping. In general, a given source instance may have numerous different solutions. Among all the solutions, universal solutions and core universal so-lutions have been singled out and extensively studied. A universal solution is a most general one and also represents the entire space of solutions, while a core universal solution is the smallest univer-sal solution and is unique up to isomorphism (hence, we can talk about the core). The problem of designing efficient algorithms for computing the core has attracted considerable attention in recent years. In this paper, we present a method for directly computing the core by SQL queries, when schema mappings are specified by source-to-target tuple-generating dependencies (s-t tgds). Unlike prior meth-ods that, given a source instance, first compute a target instance and then recursively minimize that instance to the core, our method avoids the construction of such intermediate instances. This is done by rewriting the schema mapping into a laconic schema mapping that is specified by first-order s-t tgds with a linear order in the ac-tive domain of the source instances. A laconic schema mapping has the property that a "direct translation" of the source instance according to the laconic schema mapping produces the core. Fur-thermore, a laconic schema mapping can be easily translated into SQL, hence it can be optimized and executed by a database system to produce the core. We also show that our results are optimal: the use of the linear order is inevitable and, in general, schema map-pings with constraints over the target schema cannot be rewritten to a laconic schema mapping.
-
Citations (0)
-
Cited In (0)
Page 1
Laconic Schema Mappings:
Computing the Core with SQL Queries
Balder ten Cate
INRIA and ENS Cachan
balder.tencate@inria.fr
Laura Chiticariu
IBM Almaden
chiti@almaden.ibm.com
Phokion Kolaitis
UC Santa Cruz and
IBM Almaden
kolaitis@cs.ucsc.edu
Wang-Chiew Tan
UC Santa Cruz
wctan@cs.ucsc.edu
ABSTRACT
A schema mapping is a declarative specification of the relationship
between instances of a source schema and a target schema. The
data exchange (or data translation) problem asks: given an instance
over the source schema, materialize an instance (or solution) over
the target schema that satisfies the schema mapping. In general,
a given source instance may have numerous different solutions.
Among all the solutions, universal solutions and core universal so-
lutions have been singled out and extensively studied. A universal
solution is a most general one and also represents the entire space
of solutions, while a core universal solution is the smallest univer-
sal solution and is unique up to isomorphism (hence, we can talk
about the core).
The problem of designing efficient algorithms for computing the
core has attracted considerable attention in recent years. In this
paper, we present a method for directly computing the core by
SQL queries, when schema mappings are specified by source-to-
target tuple-generating dependencies (s-t tgds). Unlike prior meth-
ods that, given a source instance, first compute a target instance
and then recursively minimize that instance to the core, our method
avoids the construction of such intermediate instances. Thisis done
by rewriting the schema mapping into a laconic schema mapping
that is specified by first-order s-t tgds with a linear order in the ac-
tive domain of the source instances. A laconic schema mapping
has the property that a “direct translation” of the source instance
according to the laconic schema mapping produces the core. Fur-
thermore, a laconic schema mapping can be easily translated into
SQL, hence it can be optimized and executed by a database system
to produce the core. We also show that our results are optimal: the
use of the linear order is inevitable and, in general, schema map-
pings with constraints over the target schema cannot be rewritten to
a laconic schema mapping.
1.INTRODUCTION
A schema mapping specifies the relationship between instances
of a source schema and a target schema. The data exchange (aka
data translation) problem asks: given a source instance, transform
it into a target instance so the schema mapping is satisfied. Such
Permission tocopy without feeall orpart ofthis material isgranted provided
that the copies are not made or distributed for direct commercial advantage,
theVLDBcopyrightnoticeandthetitleofthepublicationanditsdateappear,
and notice is given that copying is by permission of the Very Large Data
Base Endowment. To copy otherwise, or to republish, to post on servers
or to redistribute to lists, requires a fee and/or special permission from the
publisher, ACM.
VLDB ‘09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.
a target instance is called a solution for the given source instance.
Data translation underlies numerous data inter-operability applica-
tions and has been the subject of research that dates back to more
than thirty-years ago [14]. In the past, schema mappings were ex-
pressed procedurally, as a query that can be directly executed to
compute a solution. In recent years, systems such as Clio [10] and
HePToX [1] adopt a declarative logical formalism for specifying
schema mappings.
More formally, a schema mapping is a triple M = (S,T,Σ),
where S is the source schema, T is the target schema, and Σ is the
specification of the relationship between S and T. In recent data
exchange literature, Σ is given by a finite set of source-to-target
tuple generating dependencies (s-t tgds), target tgds, and target
equality-generating dependencies (egds). Intuitively, the s-t tgds
of a schema mapping dictate the existence of certain facts in a solu-
tion for a given source instance. The target tgds and target egds are
constraints over the target schema T that further “shape” the facts
dictated by the s-t tgds for a given source instance. Target tgds and
target egds contain as special cases such important dependencies as
inclusion dependencies and functional dependencies, respectively.
In general, a given source instance may have no solutions, since
it may not be possible to materialize a solution that satisfies a target
egd. On the other hand, a given source instance may have a multi-
tude of solutions. Intuitively, this is so because, while the s-t tgds
of a schema mapping dictate the existence of certain facts in a so-
lution of a given source instance, they do not spell out what should
not be in a solution for the given source instance. Furthermore, s-t
tgds maynot specifyhow certainattributesof arelation inthetarget
schema should be populated with values from the given source in-
stance. As a consequence, there are numerous ways to materialize
these unknown values.
Prior research [5] has shown that, among all solutions of a given
source instance, the universal solutions are the preferred ones be-
cause they are the most general and also encapsulate the entire
space of solutions. Furthermore, it was shown in[7] that the core of
a universal solution is the smallest universal solution and is unique
up to isomorphism. Henceforth, we shall talk about the core uni-
versal solution or, simply, the core. In addition to being the small-
est universal solution, the core possesses certain other good proper-
ties. Specifically, among all universal solutions, thecore returns the
most conservative answerson conjunctive querieswithinequalities.
In other words, the result of evaluating a conjunctive query with in-
equalities Q over the core is contained in the result of evaluating
Q over any univeral solution. Furthermore, in a precise sense, the
core is the solution that satisfies the most embedded dependencies.
Earlier Work on Computing the Core For schema mappings
specified by s-t tgds, Clio and HePToX compute universal solu-
tions by first compiling the schema mapping into a script in an ex-
Page 2
ecutable language, such as SQL, which can then be executed on a
given source instance to construct a universal solution for that in-
stance. Such a framework for computing a universal solution has
several advantages. In particular, this framework is able to push
the computation process into off-the-shelf transformation engines,
such as relational database management systems.
In general, the universal solution produced by the above method
is not the core. In view of the desirable properties of the core (espe-
cially, being the smallest universal solution), one would liketohave
amethod tocompute thecoreusingSQLqueries. Itshould benoted
that computing the core of an arbitrary database instance is an NP-
hard problem [2]. It was shown, however, that for broad classes of
schema mappings, there are polynomial-time algorithms for com-
puting the core of universal solutions. Indeed, for schema map-
pings specified by s-t tgds and target egds (in particular, for schema
mappings specified by s-t tgds), two different polynomial-time al-
gorithms for computing the core of universal solutions were given
in [7]; the first is a greedy algorithm, while the second is a blocks
algorithm. Furthermore, for schema mappings specified by s-t tgds,
target egds, and target tgds obeying the weak acyclicity condition,
a polynomial-time algorithm for computing the core based on a so-
phisticated extension of the blocks algorithm was given in [9].
Incontrast tothedirect method of computing auniversal solution
using SQL queries, all the above algorithms for computing the core
are recursive algorithms that rely on extra post-processing steps on
an intermediate target instance. Specifically, they are based on the
following generic methodology: first, compute an intermediate tar-
get instance for the given source instance; second, recursively min-
imize the intermediate target instance until the core is obtained.
In more concrete terms, for a schema mapping M specified by
s-t tgds and target egds, the greedy algorithm, given a source in-
stance I, first computes a target instance J that is universal for
I. After this, the greedy algorithm will repeatedly remove tuples
from J one at a time, as long as the s-t tgds of M are satisfied.
When no more tuples can be removed, the resulting instance is the
core of the universal solutions for I. The blocks algorithm for a
schema mapping M specified by s-t tgds and target egds begins
by computing a target instance J that is a universal solution for I
with respect to the s-t tgds of M. After this, the blocks algorithm
computes a sequence of intermediate instances such that the next
instance is both a proper subinstance and a homomorphic image
of the preceding instance via a homomorphism that is the identity
everywhere except for a block (a connected component) of nulls.
When no proper subinstance of the current instance is the homo-
morphic image of the current instance via such a homomorphism,
then the current instance is the core. Both the greedy algorithm and
the blocks algorithm terminate after a number of iterations that, in
general, depends on the size of the given source instance. Thus,
both these algorithms are inherently recursive and do not give rise
to a computation of the core via SQL queries.
Summary of Contributions In this paper, we address the follow-
ing question: Can the core be computed using SQL queries? In
other words, can one leverage off-the-shelf relational database sys-
tems and compute the core while adhering to the framework of sys-
tems such as Clio and HePToX?
This question was first addressed by one of the authors of this
paper in [3], where it was shown that for schema mappings spec-
ified by a syntactically restricted (and rather limited) class of s-t
tgds, the core can be computed using SQL queries (see Section
5.1). Here, we present a method that applies to every schema map-
ping specified by s-t tgds and makes it possible to compute the
core using SQL queries. Unlike the aforementioned prior meth-
ods that, given a source instance, first compute a target instance
Σ : PTStud(x,y) → ∃z Advised(y,z)
GradStud(x,y) → ∃z (Advised(y,z) Æ Workswith(y,z))
S PTStud(age, name)
GradStud(age, name)
T Advised(sname,facid)
Workswith(sname,facid)
agename
32John
30Ann
GradStud
age
PTStud
snamefacid
John
N1
Ann
N2
Bob
N3
Ann
N4
Advised
snamefacid
John
N1
Bob
N3
Ann
N4
Advised
sname facid
John
N1
Bob
N3
Ann
N4
Cathy
N1
Advised
name
27Bob
30Ann
snamefacid
Bob
N3
Ann
N4
Workswith
snamefacid
Bob
N3
Ann
N4
Workswith
snamefacid
Bob
N3
Ann
N4
Workswith
Source instance I
Universal solution Ju
The core Jc
A solution J
Figure 1: An example of a schema mapping M = (S,T,Σ), a
source instance I, a solution J for I, a universal solution Jufor
I, and the core Jcfor I.
and then recursively minimize that instance into a core, our method
avoids the construction of such intermediate instances. Instead, we
first rewrite a schema mapping specified by s-t tgds into a logically
equivalent laconic schema mapping that is specified by first-order
s-t tgds with a linear order in the active domain of the source in-
stance. A laconic schema mapping has the property that a “direct
translation” of the source instance according to the laconic schema
mapping produces the core. Furthermore, a laconic schema map-
ping can be easily translated into SQL, hence it can be optimized
and executed by adatabase systemtoproduce thecore. Our method
of computing the core can be easily integrated into existing data ex-
change systems by adding a module that rewrites the schema map-
ping and slightly extending the existing SQL translation module to
handle first-order s-t tgds with a linear order. We also show that our
results are optimal; the linear order that may arise in the rewrite is
necessary and our method cannot be extended to schema mappings
that involve constraints over the target schema.
In [12], similar results were independently obtained for a re-
stricted class of s-t tgds, and empirical data is provided showing
that their method outperforms existing approaches to computing
core universal solutions.
Paper Outline In the next section, we recall basic notions and
properties of schema mappings. Section 3 explains how the canon-
ical universal solution of a source instance with respect to a schema
mapping specified by first-order s-t tgds can be obtained using SQL
queries. In Section 4, we introduce the notion of laconic schema
mappings, and present our algorithm for transforming a schema
mapping specified by first-order s-t tgds into a logically equivalent
laconic schema mapping specified by first-order s-t tgds, assum-
ing a linear order on the active domain of the source instance. In
Sections 5 and 6, we demonstrate the optimality of our algorithm.
2.BACKGROUND AND NOTATION
We present the necessary background and results related to the
core; we also illustrate various concepts by means of an example.
Instances and homomorphisms We assume that there is an infi-
nite set Const of constant values and an infinite set Vars of null
values that is disjoint from Const. We further assume that we have
a fixed linear order < on the set Const of all constant values. We
consider source instances to have values from Const and target in-
Page 3
stances to havevalues from Const∪Vars. Weuse dom(I)to denote
the set of values that occur in facts in the instance I. A homomor-
phism h : I → J, with I,J instances of the same schema, is a
function h : Const ∪ Vars → Const ∪ Vars with h(a) = a for all
a ∈ Const, such that for all relations R and all tuples of (constant
or null) values (v1,...,vn) ∈ RI, (h(v1),...,h(vn)) ∈ RJ. In-
stances I,J arehomomorphically equivalent if thereare homomor-
phisms h : I → J and h′: J → I. An isomorphism h : I∼= J is
a homomorphism that is a bijection between dom(I) and dom(J)
and that preserves truth of atomic formulas in both directions. In-
tuitively, nulls act as placeholders for actual (constant) values, and
a homomorphism from I to J captures the fact that J “contains
more, or at least as much information” as I.
Query languages We will denote by CQ, UCQ, and FO the sets of
conjunctive queries, unions of conjunctive queries, and first-order
queries, respectively. The sets CQ<, UCQ<, and FO<are defined
similarly, except that thequeries mayrefer tothelinear order. Thus,
unless indicated explicitly, it is assumed that queries do not refer to
the linear order. For a query q and an instance I, we denote by
q(I) the answers of q in I; furthermore, we denote by q(I)↓ the
ground answers of q, i.e., q(I)↓= q(I)∩Constkfor k the arity of
q. In other words, q(I)↓contains the tuples from q(I) that consist
entirely of constants.
Schemamappings, solutions,universal solutionsAschema map-
ping is a triple M = (S,T,Σ), where S and T are the disjoint
source and target schemas respectively, and Σ is a finite set of sen-
tences of some logical language over the schema S∪T. From a se-
mantic point of view, a schema mapping can be identified with the
set of all pairs (I,J) such that I is a source instance, J is a target
instance and (I,J) satisfies Σ (which we denote by (I,J) |= Σ).
Two schema mappings, M = (S,T,Σ) and M′= (S,T,Σ′), are
logically equivalent if Σ and Σ′are logically equivalent, i.e., they
are satisfied by the same pairs of instances. Given a schema map-
ping M = (S,T,Σ) and a source instance I, a solution for I with
respect to M is a target instance J such that (I,J) satisfies Σ. We
denote the set of solutions for I with respect to M by SolM(I),
or simply Sol(I) when the schema mapping is clear from the con-
text. A universal solution for a source instance I with respect to
a schema mapping M is a solution J ∈ SolM(I) such that, for
every J′∈ SolM(I), there is a homomorphism from J to J′.
We will consider the following logical languages for specify-
ing schema mappings. A source-to-target tuple generating depen-
dency or, in short, an s-t tgd, is a first-order sentence of the form
∀x(φ(x) → ∃yψ(x,y)), where φ(x) is a conjunction of atomic
formulas over S, and ψ(x,y) is a conjunction of atomic formulas
over T, such that each variable in x occurs in φ(x). A LAV s-t tgd
is a s-t tgd in which φ is a single atomic formula. A full s-t tgd
is a s-t tgd in which there are no existentially quantified variables
(i.e., y is the empty set). The class of first-order s-t tgds (FO s-t
tgds) generalizes s-t tgds by allowing the antecedent φ(x) to be an
arbitrary FO-formula over S. The class of FO<s-t tgds is defined
similarly, allowing also comparisons of the form xi < xj to be
used in the antecedent. In what follows and in order to simplify
notation, we will typically drop the outermost universal quantifiers
when writing s-t tgds, FO s-t tgds, or FO<s-t tgds.
It is common in data exchange to consider schema mappings
specified using also target constraints in the form of target tgds and
target egds. We will only discuss such target constraints in detail
in Section 6 and therefore postpone the relevant definitions to that
section.
Example 2.1 An example of a schema mapping is depicted in Fig-
ure 1. Both s-t tgds in that figure are LAV s-t tgds. Given the
source instance I, three solutions (J, Ju, and Jc) for I are shown.
The values N1, ..., N4 in the solutions are nulls from Vars. All
other values are from Const. Even though J is a solution for I, the
solution J contains an unnecessary tuple, namely, (Cathy, N1). In
other words, the result of removing (Cathy, N1) from J, which is
Ju, is still a solution. In fact, Juis a universal solution. Intuitively,
Ju is the most general solution because it does not make unneces-
sary assumptions about the existence of other tuples or values in
place of the nulls. The solution Jc is also a universal solution (we
shall explain what ismeant by “the core” shortly). There isa homo-
morphism h : Ju → J, where h(v) = v for every v ∈ dom(Ju).
There is also a homomorphism h′: Ju → Jc, where h′(N2) = N4
and h(v) = v for every v ∈ dom(Ju) and v ?= N2. Clearly, since
Jc ⊆ Ju, there is also a homomorphism from Jcto Ju.
2.1The Core
Asstatedin theIntroduction, asource instance may have amulti-
tude of solutions. Among all solutions, the universal solutions have
been singled out as the preferred ones because they are the most
general ones (i.e., for a given source instance I, a universal solution
for I has a homomorphism into any solution for I). Among theuni-
versal solutions, the core universal solution plays a special role. A
targetinstance J issaidtobeacoreif thereisnoproper subinstance
J′⊆ J and homomorphism h : J → J′. An equivalent definition
in terms of retractions isas follows: A subinstance J′⊆ J is called
a retract of J if there is a homomorphism h : J → J′such that for
all a ∈ dom(J′), h(a) = a. The corresponding homomorphism h
iscalled a retraction. A retract isproper if it isa proper subinstance
of the original instance. A core of a target instance J is a retract
of J that has itself no proper retracts. Every (finite) target instance
has a unique core, up to isomorphism. Moreover, two instances are
homomorphically equivalent if and only if they have isomorphic
cores. It follows that, for every schema mapping M, every source
instance has at most one core universal solution up to isomorphism.
Indeed, if the schema mapping M is specified by FO s-t tgds then
each source instance has exactly one core universal solution up to
isomorphism [7]. We will therefore freely speak of the core.
Example 2.2 Referring back to the schema mapping in Figure 1,
the solution Jc is the core for the source instance I. Intuitively,
every tuple in Jc must exist in order for (I,Jc) to satisfy Σ. So
there are no redundant tuples in Jc.
In addition to being the smallest universal solution, the core has
certain other desirable properties. Specifically, the core also re-
turns the most conservative answers on conjunctive queries with
inequalities: if Q is a conjunctive query with inequalities, and J
is the core universal solution for a given source instance I, then
Q(J)↓ is contained in Q(J′)↓ for every universal solution J′of
I. Furthermore, in a precise sense, the core is the universal solu-
tion that satisfies the most embedded dependencies. Indeed, it is
easy to show that if a (arbitrary) tgd or egd holds in a universal
solution, then it must also hold in the core. To make this precise,
let a disjunctive embedded dependency be a first-order sentence of
the from ∀x(φ(x) →W
tions of atomic formulas over the target schema T and/or equalities.
Then the following result holds.
i∃yi.ψi(x,yi)), where φ,ψiare conjunc-
Theorem 2.1 LetMbeaschemamapping, I beasource instance,
let J the core universal solution of I, and let J′be any other uni-
versal solution of I, i.e., one that is not a core. Then
• Every disjunctive embedded dependency true in J′is true in J,
• Some disjunctive embedded dependency true in J is false in J′.
Page 4
The naive chase procedure
Input: A schema mapping M = (S,T,Σ) and a source instance I
where Σ is a finite set of FO<s-t tgds
Output: A universal solution J for I w.r.t. M
J := ∅;
for all ∀x(φ(x) → ∃y.ψ(x,y)) ∈ Σ do
for all tuples of constants a such that I |= φ(a) do
for each yi ∈ y, pick a fresh null value Nifor yi.
add the facts in ψ(a,N) to J.
end for
end for;
return J
Figure 2: Naive chase method for computing universal solu-
tions.
The proof is omitted for lack of space.
Concerning the complexity of computing the core, we have:
Theorem 2.2 ([7]) Let M be a schema mapping specified by FO<
s-t tgds. There is a polynomial-time algorithm such that, given a
source instance I, the algorithm returns the core universal solution
for I.
Strictly speaking, this result was shown in [7] only for schema
mappings specified by s-t tgds and target egds. However, the same
argument applies for schema mappings specified by FO<s-t tgds
(and target egds).
Although the data complexity of computing core solutions is
polynomial time, the degree of the polynomial depends on the
schema mapping in question. Indeed, it was shown in [9] that com-
puting core universal solutions for schema mappings specified by
s-t tgds is fixed parameter intractable, where the parameter is the
maximum number of variables in occurring in each s-t tgd.
3.USINGSQLTOCOMPUTEUNIVERSAL
SOLUTIONS
In this section, we define canonical universal solutions, and we
describe how FO<s-t tgds can be compiled into SQL queries that,
when executed against anysource instance I,produce thecanonical
universal solution for I. As we will see in Section 4, when this
method is applied to laconic schema mappings, the SQL queries
obtained produce the core universal solution of I.
In [5], it was shown that, for schema mappings specified by s-t
tgds, the chase procedure can be used to compute a universal so-
lution for a given source instance. In fact, the same holds true for
schema mappings specified by FO<s-t tgds. Figure 2 describes
a variant of the chase procedure known as the naive chase. For a
source instance I and schema mapping M specified by FO<s-t
tgds, the result of applying the naive chase is called a canonical
universal solution of I with respect to M. Observe that the result
of the naive chase is unique up to isomorphism, since it depends
only on the exact choice of fresh nulls. Also note that, even if two
schema mappings are logically equivalent, they may assign differ-
ent canonical universal solutions to a given source instance.
Example 3.1 The naive chase procedure on the schema mapping
M and source instance I of Figure 1 produces the universal so-
lution Ju shown in the same figure. Intuitively, the first s-t tgd in
Σ on the PTStud relation caused the first two facts of Advised
relation to be created. The second s-t tgd in Σ on the GradStud
relation caused the last two facts of the Advised relation and all
facts of Workswith relation to be created.
It is easy to see that, for schema mappings specified by FO<s-
t tgds, the naive chase procedure can be implemented using SQL
queries. In fact, Clio follows this approach [8]. We illustrate the
approach by returning to our running example of the schema map-
ping in Figure 1.
The first step is to Skolemize each s-t tgd in Σ. By this, we
mean replacing each existentially-quantified variable with a func-
tion term f(x), where f is a fresh function symbol of appropriate
arity and x denotes the set of all universally-quantified variables in
the tgd. For example, after this step on Σ, we get:
PTStud(x,y) → Advised(y,f(x,y))
GradStud(x,y) → Advised(y,g(x,y)) ∧ Workswith(y,g(x,y))
These dependencies are logically equivalent to the following de-
pendencies with a single relational atom in the right-hand-side:
PTStud(x,y) → Advised(y,f(x,y))
GradStud(x,y) → Advised(y,g(x,y))
GradStud(x,y) → Workswith(y,g(x,y))
Next, for each target relation R we collect the dependencies that
containR intheright-hand-side, andweinterprettheseasconstitut-
ing a definition of R. In this way, we get the following definitions
of Advised and Workswith.
Advised := {(y,f(x,y)) | PTStud(x,y)}∪
{(y,g(x,y)) | GradStud(x,y)}
Workswith := {(y,g(x,y)) | GradStud(x,y)}
In general, the definition of a k-ary target relation R ∈ T will be
of the shape:
R
:={(t1(x),...,tk(x)) | φ(x))} ∪ ··· ∪
{(t′
1(x′),...,t′
k(x)) | φ′(x′)}
(1)
where t1,...,tk,...,t′
x1, or functions over terms, such as f(x1,x2)), φ,...,φ′are first-
order queries over the source schema. Each φ,...,φ′corresponds
to a SQL query, and the union of these SQL queries is a query that
when executed on any source instance I will compute the canonical
universal solution for I. To continue with our running example,
the following SQL queries may be generated for Advised and
Workswith:
Advised:
select distinct name,
concat(“f(”,age,name,“)”)
from PTStud
union
select distinct name,
concat(“g(”,age,name,“)”)
from GradStud
1,...,t′
kare terms (i.e., variables such as
Workswith:
select distinct name,
concat(“g(”,age,name,“)”)
from GradStud
Evaluating the SQL query associated to Advised on the source
instance I in Figure 1 yields the tuples { (John, f(32,John)), (Ann,
f(30,Ann)), (Bob, g(27,Bob)), (Ann, g(30,Ann)) }.
f(32,John), f(30,Ann), g(27,Bob), and g(30,Ann) correspond, re-
spectively, to the nulls N1,N2,N3,N4in Juof Figure 1.
The general idea behind the construction of the SQL queries
should be clear from the example. The translation assumes the ex-
istence of a concat function that returns the concatenation of all
its arguments. Intuitively, the result of the concat function repre-
sents a null.
Note that, in this example, the resulting SQL queries are unions
of select-project-join queries (i.e., unions of conjunctive queries)
augmented with the use of the concat function. In particular,
they do not contain any GROUP BY clauses or any aggregate func-
tions. In the case of schema mappings specified by FO s-t tgds, the
The terms
Page 5
(a)
(a′)
P(x) → ∃yz.R(x,y) ∧ R(x,z)
P(x) → ∃y.R(x,y)
(b)
P(x) → ∃y.R(x,y)
P(x) → R(x,x)
P(x) → R(x,x)
(b′)
(c)
R(x,y) → S(x,y)
P(x) → ∃y.S(x,y)
R(x,y) → S(x,y)
P(x) ∧ ¬∃y.R(x,y) → ∃y.S(x,y)
(c′)
(d)
R(x,y) → ∃z.S(x,y,z)
R(x,x) → S(x,x,x)
R(x,y) ∧ x ?= y → ∃z.S(x,y,z)
R(x,x) → S(x,x,x)
(d′)
(e)
(e′)
R(x,y) → ∃z.(S(x,z) ∧ S(y,z))
(R(x,y) ∨ R(y,x)) ∧ x ≤ y → ∃z.(S(x,z) ∧ S(y,z))
(f)PTStud(x,y) → ∃z.Advised(y,z)
GradStud(x,y) → ∃z.(Advised(y,z) ∧ Workswith(y,z))
PTStud(x,y) ∧ ¬∃u.GradStud(u,y) → ∃z.Advised(y,z)
GradStud(x,y) → ∃z.(Advised(y,z) ∧ Workswith(y,z))
(f′)
Figure 3: Examples of non-laconic schema mappings (a-f) and
their laconic equivalents (a′-f′).
same approach will give rise to SQL queries that use the difference
(EXCEPT) operator but still do not contain any GROUP BY clauses
or anyaggregate functions. Finally, inthecase of schema mappings
specified by FO<s-t tgds, the resulting SQL queries require the
use of comparisons of the form x < y in the WHERE clause, but no
further constructs. Also note that the translation from schema map-
pings to SQL queries computing the canonical universal solution is
polynomial.
To summarize, we have explained how, for schema mappings
specified by FO<s-t tgds, canonical universal solutions can be
obtained using SQL queries that do not contain any GROUP BY
clauses or any aggregate functions, i.e., that belong to the (pure)
relational calculus fragment of SQL, except for the use of string
concatenation.
4.LACONIC SCHEMA MAPPINGS
In this section, we present an algorithm for tranforming any
schema mapping M specified by FO<s-t tgds into a logically
equivalent one M′, such that the naive chase procedure applied to
M′and to a source instance I produces the core universal solution
for I and M. In particular, this shows that, for schema mappings
specified by FO<s-t tgds, the core universal solution can be com-
puted using SQL queries.
Definition 4.1 A schema mapping is laconic if for every source
instance I, the canonical universal solution of I with respect to M
is the core universal solution of I with respect to M.
Note that the definition only makes sense for schema mappings
specified by FO<s-t tgds, because we have defined the notion of a
canonical universal solution only for such schema mappings.
Examplesof laconicandnon-laconic schemamappings aregiven
in Figure 3. For Example 3(d), the canonical universal solution
of the source instance I = {R(a,a)} is {S(a,a,N),S(a,a,a)},
which is not the core universal solution of I. Clearly, one should
only “translate” according to the first s-t tgd in Example 3(d) if
x ?= y, which explains the antecedent of the first tgd in Example
3(d′). It is easy to see that every schema mapping specified by full
s-t tgds only (i.e., s-t tgds without existential quantifiers) is laconic.
Indeed, in this case, the canonical universal solution does not con-
tain any nulls, and hence is guaranteed to be the core. Thus, being
specified by full s-t tgds is a sufficient condition for laconicity, al-
though a rather uninteresting one. The following provides us with
a necessary condition, which explains why the schema mapping in
Figure 3(a) is not laconic. Given an s-t tgd ∀x(φ → ∃y.ψ), by
the canonical instance of ψ, we will mean the (unordered) target
instance whose facts are the conjuncts of ψ, where the x variables
are treated as constants and the y variables as nulls.
Proposition 4.1 If a schema mapping (S,T,Σ) specified by s-t
tgds is laconic, then for each s-t tgd ∀x(φ → ∃y.ψ) ∈ Σst, the
canonical instance of ψ is a core.
The proof is omitted for lack of space.
In the case of schema mapping (e) in Figure 3, the linear order is
used in order to obtain a logically equivalent laconic schema map-
ping (e′). Note that the schema mapping (e′) is order-invariant in
the sense that the set of solutions of a source instance I does not
depend on the interpretation of the < relation in I, as long as it is
a linear order. Still, the use of the linear order cannot be avoided,
as we will show in Section 5.1. What is really going on, in this
example, is that the right hand side of (e) has a non-trivial auto-
morphism (viz. the map sending x to y and vice versa), and the
conjunct x ≤ y in the antecedent of (e′) plays, intuitively, the role
of a tie-breaker, cf. Section 4.1.3.
Testing whether a given schema mapping is laconic is not a
tractable problem:
Proposition 4.2 Testing laconicity of schema mappings specified
by FO s-t tgds is undecidable. It is coNP-hard already for schema
mappings specified by LAV s-t tgds.
In fact, testing laconicity of schema mappings specified by s-t
tgds is coNP-complete. We omit the proof for lack of space.
4.1Making schema mappings laconic
In this section, we present a procedure for transforming any
schema mapping M specified by FO<s-t tgds into a logically
equivalent laconic schema mapping M′specified by FO<s-t tgds.
The laconic schema mapping can then be translated into SQL
queries, as described in Section 3, which when executed on any
source instance will produce the core universal solution.
To simplify the notation, throughout this section, we assume a
fixed input schema mapping M = (S,T,Σ), with Σ a finite set
of FO<s-t tgds. Moreover, we will assume that the FO<s-t tgds
∀x(φ → ∃y.ψ) ∈ Σ are non-decomposable [7], meaning that the
fact graph of ∃y.ψ(x,y), where the facts are the conjuncts of ψ
and two facts are connected if they have an existentially quantified
variable in common, is connected. This assumption is harmless:
every FO<s-t tgd can be decomposed into a logically equivalent
finite set of non-decomposable FO<s-t tgds (with identical left-
hand-sides, one for each connected component of the fact graph) in
polynomial time.
The outline of the procedure for making schema mappings la-
conic is as follows (the items correspond to subsections of the
present section):
(1) Construct a finite list fact block types: these are descriptions of
potential “patterns” of tuples in the core. (See Section 4.1.1.)
(2) Compute for each of thefact block types a precondition: afirst-
order formula over the source schema that tells exactly when