On the Expressive Power of the Relational Algebra on Finite Sets of Relation Pairs

George H.L. Fletcher, Marc Gyssens, Jan Paredaens, and Dirk Van Gucht
Abstract. We give a language-independent characterization of the expressive power of the relational algebra on finite sets of source-target relation instance pairs. The associated decision problem is shown to be co-graph-isomorphism-hard and in coNP. The main result is also applied in providing a new characterization of the generic relational queries.
Index Terms. Query languages, relational algebra, data mapping, data integration, definability, expressibility, BP completeness, graph isomorphism, genericity, monotonicity.
I. INTRODUCTION
We investigate a generalization of the classic result of Bancilhon and Paredaens on the expressive power of the relational algebra [1], [3], [10] concerning the following decision problem:
BP-PAIR. Given a pair of relations $(s, t)$, with $s$ non-empty or $t$ of positive arity, does there exist a relational algebra expression $E$ such that $E(s) = t$?
Bancilhon and Paredaens established that BP-PAIR is equivalent to the problem of determining whether or not (1) every atom occurring in $t$ also occurs in $s$, and (2) every automorphism of $s$ is also an automorphism of $t$. To date, the complexity of BP-PAIR has not been established.
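Conditions (1) and (2) can be tested directly on small instances. The following Python sketch is illustrative only and is not part of the original development. It assumes relations are encoded as sets of tuples of strings, the names `adom`, `image`, and `bp_pair` are ours, and it enumerates every permutation of the active domain of $s$, so it runs in exponential time and says nothing about the actual complexity of BP-PAIR.

```python
from itertools import permutations


def adom(r):
    """Active domain: the set of atoms occurring in relation r."""
    return {a for t in r for a in t}


def image(phi, r):
    """Image of relation r under the atom mapping phi (a dict)."""
    return {tuple(phi[a] for a in t) for t in r}


def bp_pair(s, t):
    """Brute-force test of conditions (1) and (2) for a single pair (s, t)."""
    atoms = sorted(adom(s))
    # Condition (1): every atom of t also occurs in s.
    if not adom(t) <= set(atoms):
        return False
    # Condition (2): it suffices to look at permutations of adom(s), since an
    # automorphism of s restricts to a permutation of adom(s) and, by (1),
    # t mentions no other atoms.
    for perm in permutations(atoms):
        phi = dict(zip(atoms, perm))
        if image(phi, s) == s and image(phi, t) != t:
            return False
    return True
```

For instance, with `s = {("a", "a"), ("b", "b")}` the call `bp_pair(s, {("a", "b")})` fails condition (2), because swapping `a` and `b` preserves `s` but not the target.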
George Fletcher is with Washington State University, Vancouver. e-mail: fletcher@vancouver.wsu.edu
Marc Gyssens is with Hasselt University and the Transnational University of Limburg. e-mail: marc.gyssens@uhasselt.be
Jan Paredaens is with the University of Antwerp. e-mail: jan.paredaens@ua.ac.be
Dirk Van Gucht is with Indiana University, Bloomington. e-mail: vgucht@cs.indiana.edu
Example 1: Consider the following pairs of source/target instances.
s1      s2      s3
a a     b b     a a
b b     c c     b b
                a b
t1      t2      t3
a a     b b     (empty)
b b     c c
Clearly, each pair $(s_i, t_i)$ satisfies BP-PAIR conditions (1) and (2), and hence, for each $i = 1, 2, 3$, there exists a relational algebra expression $E_i$ such that $E_i(s_i) = t_i$.
It is also the case that there exists a single expression $E$ such that $E(s_i) = t_i$, for each $i = 1, 2, 3$; for example, the expression $s - (s \times \pi_{\langle\rangle}(\sigma_{1 \neq 2}(s)))$ behaves properly on each source instance. Suppose that $t_2$ also has tuple $\langle c, b \rangle$. In this case $(s_2, t_2)$ violates condition (2), and hence there no longer exists an expression $E_2$ such that $E_2(s_2) = t_2$ (and consequently, there also no longer exists a single expression for mapping all pairs). What if we were to additionally add tuple $\langle b, c \rangle$ to $t_2$? In this case $(s_2, t_2)$ again satisfies both (1) and (2), and hence there exists an expression $E_2$ such that $E_2(s_2) = t_2$. Unfortunately, in this case there still does not exist a general expression $E$ which behaves properly on each $(s_i, t_i)$. This does not follow, however, from either condition (1) or (2). What is it about this set of instances that makes it unmappable? Is it possible to characterize the sets that are mappable?
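The behavior of the single expression from Example 1 can be checked mechanically. The sketch below is only an illustration under stated assumptions: relations are Python sets of tuples, the function name `example_expression` is ours, and the instances are our reading of the example's tables (in particular, $t_3$ is taken to be empty).

```python
def example_expression(s):
    """Evaluate s - (s x pi_<>(sigma_{1 != 2}(s))) on a binary relation s."""
    off_diagonal = {t for t in s if t[0] != t[1]}   # sigma_{1 != 2}(s)
    flag = {()} if off_diagonal else set()          # projection on no columns
    blocked = {t + u for t in s for u in flag}      # s x flag: s itself, or empty
    return s - blocked


# Instances as we read them from Example 1 (an assumption, see above).
s1 = {("a", "a"), ("b", "b")}
s2 = {("b", "b"), ("c", "c")}
s3 = {("a", "a"), ("b", "b"), ("a", "b")}
t1, t2, t3 = set(s1), set(s2), set()
```

The expression returns its input unchanged when every tuple lies on the diagonal and returns the empty relation otherwise, which is why one expression can serve all three pairs.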
A. The Problem
Towards resolving such questions about the expressive power of the relational algebra on sets of source/target instance pairs, in this note we introduce and study the following generalized decision problem:
BP-PAIRS. Given a set of pairs of relations $\{(s_1, t_1), \ldots, (s_k, t_k)\}$, $k \geq 1$, with each $s_i$ of arity $m \geq 0$ and each $t_i$ of arity $n \geq 0$, does there exist a relational algebra expression $E$ such that $E(s_i) = t_i$ for $i = 1, \ldots, k$?
Note that BP-PAIRS allows empty source relations. It is clear that the classic BP-PAIR problem reduces to a strict special case of the generalized BP-PAIRS problem (namely, where $k = 1$, the source relation is non-empty, and $n \geq 1$).
B. Practical Significance
The present investigation was motivated by practical query discovery problems arising in the context of recent research on data integration, extraction, and exchange. In each of these domains, a crucial problem is the instance-driven discovery of mapping queries between autonomous data sources. In the context of data integration, recent research has explored the use of corresponding example instances of source and target schemas in the derivation of appropriate source-to-target data mapping queries [2], [4]. In the context of data extraction, an extensive line of research has explored the use of example instances to derive “wrapper” queries for extraction of relevant information from data sources (e.g., [6]). In the context of data exchange, an important issue is the discovery of source-to-target dependencies for translation of instances of a source schema into appropriate instances of a target schema (cf. [9]). Important issues in each of these contexts are to characterize the goodness of sets of examples for query discovery and to understand the complexity of such derivations. The ubiquity of such instance-based reasoning in a variety of query discovery tasks led us to the present study of BP-PAIRS.
C. Summary of Results
In this note we first give an exact language-independent characterization of when a solution to a BP-PAIRS instance exists and show how to construct an appropriate mapping expression $E$ when this is the case. Next, we establish that BP-PAIRS is co-graph-isomorphism-hard and in coNP. We then use these results to give a new characterization of the generic relational queries. We close by indicating topics for further investigation.
II. PRELIMINARY NOTIONS
In this section we give basic definitions and notation used in this note.
Definition 1: A relation $r$ of arity $n \in \mathbb{N}$ is a finite subset of the $n$-fold Cartesian product of an enumerably infinite domain $D$ of uninterpreted atoms: $r \subseteq D^n$. The active domain of $r$ is the set of atoms occurring in $r$, denoted as $\mathrm{adom}(r) = \bigcup_{i=1}^{n} \{a_i \mid \langle a_1, \ldots, a_i, \ldots, a_n \rangle \in r\}$.
Notice that there are only two 0-ary relations: the empty relation $\{\}$ and the relation with the empty tuple $\{\langle\rangle\}$. These are often used to encode false and true, respectively, as relations. In this way, boolean queries can be embedded in the relational model.
Definition 2: An isomorphism $\varphi$ from a relation $r$ of arity $n$ to a relation $s$ of arity $n$ is a permutation of $D$ such that, for all $a_1, \ldots, a_n \in D$, it is the case that $\langle a_1, \ldots, a_n \rangle \in r$ if and only if $\langle \varphi(a_1), \ldots, \varphi(a_n) \rangle \in s$.
Definition 3: An automorphism $\varphi$ of a relation $r$ of arity $n$ is an isomorphism from $r$ to itself, i.e., for all $a_1, \ldots, a_n \in D$, it is the case that $\langle a_1, \ldots, a_n \rangle \in r$ if and only if $\langle \varphi(a_1), \ldots, \varphi(a_n) \rangle \in r$. The set of automorphisms of $r$ is denoted $\mathrm{Aut}(r)$.
Notice that the restriction of an automorphism of $r$ to $\mathrm{adom}(r)$ is necessarily a permutation of $\mathrm{adom}(r)$.
Definition 4: A BP-set is a finite set $\{(s_1, t_1), \ldots, (s_k, t_k)\}$ of $k \geq 1$ pairs of relations, such that, for $i = 1, \ldots, k$, $s_i$ is of arity $m \geq 0$, and $t_i$ is of arity $n \geq 0$.
We follow Paredaens’ presentation of the relational algebra [10], extended with a constant operator unit. In what follows, for a tuple $t = \langle a_1, \ldots, a_n \rangle \in D^n$ we denote by $t[i]$ the $i$th component of $t$, i.e., $t[i] = a_i$ for $1 \leq i \leq n$.
Definition 5: Let $r$ and $s$ be relations of arity $m$ and $n$, respectively. The relational algebra is the set of well-formed expressions containing relation names and closed under the following seven operations on relations.
The product of $r$ and $s$ is the relation $r \times s = \{\langle a_1, \ldots, a_m, b_1, \ldots, b_n \rangle \mid \langle a_1, \ldots, a_m \rangle \in r \text{ and } \langle b_1, \ldots, b_n \rangle \in s\}$.
The union of $r$ and $s$ is the relation $r \cup s = \{t \mid t \in r \text{ or } t \in s\}$, which is only defined when $m = n$.
The difference of $r$ and $s$ is the relation $r - s = \{t \mid t \in r \text{ and } t \notin s\}$, which is only defined when $m = n$.
The projection of $r$ on $\langle j_1, \ldots, j_\ell \rangle$ ($\ell \geq 0$ and, for $i = 1, \ldots, \ell$, $1 \leq j_i \leq m$) is the relation $\pi_{\langle j_1, \ldots, j_\ell \rangle}(r) = \{\langle t[j_1], \ldots, t[j_\ell] \rangle \mid t \in r\}$.
The equality selection of $r$ on $i$ and $j$ (for $1 \leq i, j \leq m$) is the relation $\sigma_{i=j}(r) = \{t \mid t \in r \text{ and } t[i] = t[j]\}$.
The inequality selection of $r$ on $i$ and $j$ (for $1 \leq i, j \leq m$) is the relation $\sigma_{i \neq j}(r) = \{t \mid t \in r \text{ and } t[i] \neq t[j]\}$.
The unit of $r$ is the relation $\mathrm{unit}(r) = \{\langle\rangle\}$.
If $E$ is an expression over relation names $R_1, \ldots, R_k$, then $E(r_1, \ldots, r_k)$ denotes the relation which results from the evaluation of $E$ with each $R_i$ bound to relation $r_i$, for $1 \leq i \leq k$.
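The seven operations of Definition 5 translate almost literally into code. The following Python sketch is an illustration under our own assumptions: relations are sets of tuples, column indices are 1-based as in the definition, and all function names are ours. Because an empty Python set carries no arity, the arity side conditions on union and difference are not enforced here.

```python
def product(r, s):
    """r x s: concatenate every tuple of r with every tuple of s."""
    return {t + u for t in r for u in s}


def union(r, s):
    """r union s (the definition requires equal arities)."""
    return r | s


def difference(r, s):
    """r - s (the definition requires equal arities)."""
    return r - s


def project(r, columns):
    """pi_<j1,...,jl>(r), with 1-based column indices; columns may be empty."""
    return {tuple(t[j - 1] for j in columns) for t in r}


def select_eq(r, i, j):
    """sigma_{i=j}(r): keep tuples whose i-th and j-th components agree."""
    return {t for t in r if t[i - 1] == t[j - 1]}


def select_neq(r, i, j):
    """sigma_{i != j}(r): keep tuples whose i-th and j-th components differ."""
    return {t for t in r if t[i - 1] != t[j - 1]}


def unit(r):
    """unit(r): always the relation containing only the empty tuple."""
    return {()}
```

Note that `project(r, ())` yields `{()}` exactly when `r` is non-empty, so projecting on no columns acts as a non-emptiness test, while `unit(r)` yields `{()}` even for empty `r`.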
Finally, we give a standard semantic notion for relational mappings.
Definition 6: A mapping $Q$ from relations of some arity $m \geq 0$ to relations of some arity $n \geq 0$ is generic if, for each relation $r$ of arity $m$ and each permutation $\varphi$ of $D$, it is the case that $\varphi(Q(r)) = Q(\varphi(r))$.
III. RESOLVING THE BP-PAIRS PROBLEM
In Example 1, we claimed that the addition of tuples $\langle b, c \rangle$ and $\langle c, b \rangle$ to $t_2$ would make the BP-set $\{(s_1, t_1), (s_2, t_2), (s_3, t_3)\}$ an invalid instance of BP-PAIRS, despite the fact that each pair in the set is a valid instance of BP-PAIR. We now prove this. In particular, we now present the main result, a language-independent characterization of the expressive power of the relational algebra on BP-sets.
Theorem 1: Let $\{(s_1, t_1), \ldots, (s_k, t_k)\}$ be a BP-set. The following statements are equivalent:
1) There is a relational algebra expression $E$ such that, for $i = 1, \ldots, k$, $E(s_i) = t_i$.
2) It holds that
a) for $i = 1, \ldots, k$, $\mathrm{adom}(t_i) \subseteq \mathrm{adom}(s_i)$; and
b) if $\varphi$ is an isomorphism from $s_i$ to $s_j$ ($1 \leq i, j \leq k$), then it is also an isomorphism from $t_i$ to $t_j$.
Proof: (1 $\Rightarrow$ 2) This implication follows immediately, since relational algebra queries are both domain-preserving and generic [1], [3], [10].
(2 $\Rightarrow$ 1) Let $m$ and $n$ be the arity of each $s_i$ and $t_i$, respectively, for $1 \leq i \leq k$.
First, we observe that, for each $i$, the pair $(s_i, t_i)$ is an instance of the classic BP-PAIR problem. By putting $i = j$, condition (2b) implies that each automorphism of $s_i$ is also an automorphism of $t_i$. Hence, there exists a relational algebra expression $E_i$ taking one $m$-ary relation as argument and returning an $n$-ary relation as result such that $E_i(s_i) = t_i$. We must note here that the border cases $m = 0$ and/or $n = 0$ were not explicitly considered in the original proof of Paredaens [10]. However, the expression $E_i(r) = r - r$ will do in the case that $t_i = \{\}$, irrespective of $m$, and the expression $E_i(r) = \mathrm{unit}(r)$ will do in the case that $t_i = \{\langle\rangle\}$, irrespective of $m$.
Next, we note that for each $s_i$ there is a relational algebra expression $F_i$ such that a relation $r$ is isomorphic with $s_i$ if and only if $F_i(r) \neq \emptyset$. This fact was already shown by Bancilhon for the case when $s_i \neq \{\}$ [1]. In the case when $s_i = \{\}$, the boolean expression $F_i(r) = \mathrm{unit}(r) - \pi_{\langle\rangle}(r)$ identifies relations isomorphic with $s_i$.
Combining these results, the following relational algebra expression fulfills (1):
$$E(r) = \bigcup_{1 \leq i \leq k} \pi_{\langle 1, \ldots, n \rangle}(E_i(r) \times F_i(r)).$$
Indeed, for each $j$, $F_i(s_j)$ is non-empty exactly when $s_i$ and $s_j$ are isomorphic, in which case genericity of $E_i$ together with condition (2b) gives $E_i(s_j) = t_j$; since $F_j(s_j) \neq \emptyset$, the union evaluates to $t_j$.
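Condition (2) of Theorem 1 can be checked by exhaustive search on small BP-sets. The sketch below is a brute-force illustration, not the construction used in the proof. It assumes relations are sets of tuples of strings, the helper names are ours, and it relies on the observation that an isomorphism from $s_i$ to $s_j$ is determined, as far as $s_i$ and $t_i$ are concerned, by a bijection from $\mathrm{adom}(s_i)$ to $\mathrm{adom}(s_j)$ once (2a) holds. The sample data is our reading of Example 1.

```python
from itertools import permutations


def adom(r):
    """Active domain of relation r."""
    return {a for t in r for a in t}


def image(phi, r):
    """Image of relation r under the atom mapping phi (a dict)."""
    return {tuple(phi[a] for a in t) for t in r}


def isomorphisms(r, s):
    """Yield every bijection adom(r) -> adom(s) that maps r onto s."""
    src, dst = sorted(adom(r)), sorted(adom(s))
    if len(src) != len(dst):
        return
    for perm in permutations(dst):
        phi = dict(zip(src, perm))
        if image(phi, r) == s:
            yield phi


def bp_pairs(bp_set):
    """Brute-force test of conditions (2a) and (2b) of Theorem 1."""
    if any(not adom(t) <= adom(s) for s, t in bp_set):      # (2a)
        return False
    for s_i, t_i in bp_set:                                  # (2b)
        for s_j, t_j in bp_set:
            for phi in isomorphisms(s_i, s_j):
                if image(phi, t_i) != t_j:
                    return False
    return True


# Example 1 as we read it, and the variant with <b,c> and <c,b> added to t2.
s1 = {("a", "a"), ("b", "b")}
s2 = {("b", "b"), ("c", "c")}
s3 = {("a", "a"), ("b", "b"), ("a", "b")}
original = [(s1, set(s1)), (s2, set(s2)), (s3, set())]
variant = [(s1, set(s1)), (s2, s2 | {("b", "c"), ("c", "b")}), (s3, set())]
```

On this data the variant fails because the map sending `a` to `b` and `b` to `c` is an isomorphism from `s1` to `s2` that does not carry `t1` onto the enlarged `t2`.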
It is interesting to note that the proof of Theorem 1 provides an explicit PSPACE construction of an appropriate mapping expression, as is the case for the proof of Paredaens [10] of the BP-PAIR result.
At this point, however, we must emphasize that there is a fundamental difference between the classic BP-PAIR result and the BP-PAIRS result. The proof of Paredaens [10] of the BP-PAIR result reveals that the difference operator is not used in the construction of the required relational algebra expression in the case that $m, n \geq 1$ and both source and target are nonempty. The expressions constructed are thus monotone in the sense that $r_1 \subseteq r_2$ implies $E(r_1) \subseteq E(r_2)$.
In the expressions $F_i$ ($1 \leq i \leq k$) in the proof of Theorem 1, the difference operator is used. This is not an incidental effect of the particular construction used, even in the case that $m, n \geq 1$ and both source and target are nonempty. Indeed, solutions to BP-PAIRS make essential use of the difference operator, since BP-sets can capture nonmonotone query behavior (because $k$ can be greater than 1), and the relational algebra expressions without difference are always monotone (see footnote 1).
Example 2: Consider the following pairs of source/target instances.
s1      s2
a a     a a
        b b
t1      t2
a a     a b
        b a
The BP-set $\{(s_1, t_1), (s_2, t_2)\}$ satisfies condition (2) of Theorem 1. Hence, there exists a relational algebra expression $E$ such that $E(s_1) = t_1$ and $E(s_2) = t_2$. Obviously, $E$ cannot be monotone (since $s_1 \subseteq s_2$ while $t_1 \not\subseteq t_2$), and therefore must contain the difference operator. This is also the case for the BP-set of Example 1.
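Whether a BP-set forces nonmonotone behavior can be read off directly from the instances. The following Python sketch is a small illustration under our own assumptions: relations are sets of tuples, the function name `monotonicity_violations` is ours, and the data is our reading of the tables in Example 2.

```python
def monotonicity_violations(bp_set):
    """Index pairs (i, j) with s_i a subset of s_j but t_i not a subset of t_j."""
    return [(i, j)
            for i, (s_i, t_i) in enumerate(bp_set)
            for j, (s_j, t_j) in enumerate(bp_set)
            if s_i <= s_j and not t_i <= t_j]


# Example 2 as we read it.
example2 = [
    ({("a", "a")}, {("a", "a")}),
    ({("a", "a"), ("b", "b")}, {("a", "b"), ("b", "a")}),
]
```

Any pair reported by this function witnesses that no monotone expression, and hence no difference-free expression, can map all sources to their targets.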
IV. COMPLEXITY OF THE BP-PAIR AND BP-PAIRS PROBLEMS
We next relate the complexity of BP-PAIR and BP-PAIRS to several well known graph decision problems. First, we present some terminology.
Definition 7: A graph $G$ is a binary relation $\mathcal{E}$ over a finite domain $V \subset D$. We write $G = (V, \mathcal{E})$, where $V$ is called the set of vertices and $\mathcal{E} \subseteq V \times V$ is called the set of edges.
Definition 8: Two graph decision problems.
Subgraph Isomorphism (SubGI): given two graphs $G_1$ and $G_2$, is $G_1$ isomorphic to a subgraph of $G_2$?
Graph Isomorphism (GI): given two graphs $G_1$ and $G_2$, are they isomorphic?
SubGI is a typical NP-complete problem [8]. Clearly GI is also in NP; it is unknown, however, whether GI is in P, is NP-complete, or neither [8].
We immediately observe the following.
Lemma 1: BP-PAIRS is in coNP.
Proof: Recall that coNP is the class of problems which have polynomial time disqualifications (for example, see [8]). Condition (2a) of Theorem 1 can be checked directly in polynomial time. For condition (2b), given a BP-set $\{(s_1, t_1), \ldots, (s_k, t_k)\}$, guess an $i$, a $j$, and an isomorphism $\varphi$ from $s_i$ to $s_j$, and check in polynomial time whether or not $\varphi$ is also an isomorphism from $t_i$ to $t_j$. If not, then, using the characterization of BP-PAIRS (Theorem 1), reject $\{(s_1, t_1), \ldots, (s_k, t_k)\}$.
Footnote 1: For the same reason, one cannot simply reduce the BP-PAIRS problem for a BP-set $\{(s_1, t_1), \ldots, (s_k, t_k)\}$ to the BP-PAIR problem for the pair $(s_1 \times \cdots \times s_k, t_1 \times \cdots \times t_k)$, even in the case that $m, n \geq 1$ and all relations under consideration are nonempty.
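The polynomial-time disqualification check in the proof of Lemma 1 can be sketched as follows. This is an illustration under stated assumptions: relations are sets of tuples, the guessed isomorphism is given as a dict on $\mathrm{adom}(s_i)$ (any such injective map extends to a permutation of $D$), condition (2a) is assumed to have been verified beforehand, and the names `image` and `is_disqualification` are ours.

```python
def adom(r):
    """Active domain of relation r."""
    return {a for t in r for a in t}


def image(phi, r):
    """Image of relation r under the atom mapping phi (a dict)."""
    return {tuple(phi[a] for a in t) for t in r}


def is_disqualification(bp_set, i, j, phi):
    """True iff (i, j, phi) witnesses a violation of condition (2b).

    Assumes condition (2a) holds, so every atom of t_i is a key of phi.
    """
    s_i, t_i = bp_set[i]
    s_j, t_j = bp_set[j]
    # phi must be an injective map defined exactly on adom(s_i).
    if set(phi) != adom(s_i) or len(set(phi.values())) != len(phi):
        return False
    # phi must map s_i onto s_j ...
    if image(phi, s_i) != s_j:
        return False
    # ... while failing to map t_i onto t_j.
    return image(phi, t_i) != t_j
```

Each step touches every tuple a constant number of times, so verifying a guessed certificate takes time polynomial in the size of the BP-set, even though finding one may require search.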
Definition 9: Given a relation $r$ and atom $v \in D$, define $rv = r \times \{\langle v \rangle\}$.
We denote by $\overline{P}$ the complement of decision problem $P$, and by $P \leq^p_m P'$ that $P$ polynomial time many-one reduces to problem $P'$ [8].
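We can now show the main result of this section.
Theorem 2: $\overline{\mathrm{GI}} \leq^p_m \text{BP-PAIR} \leq^p_m \text{BP-PAIRS} \leq^p_m \overline{\mathrm{SubGI}}$.
Proof: We establish the first reduction by exhibiting a polynomial time many-one reduction $f$ from $\overline{\mathrm{GI}}$ to BP-PAIR. Let $G_1 = (V_1, \mathcal{E}_1)$ and $G_2 = (V_2, \mathcal{E}_2)$ be a pair of graphs, and assume, without loss of generality, that $V_1 \cap V_2 = \emptyset$. If $\mathcal{E}_1$ or $\mathcal{E}_2$ is non-empty, define $f(G_1, G_2) = (\mathcal{E}_1 v_1 \cup \mathcal{E}_2 v_2, \{\langle v_1 \rangle\})$, where $v_1$ and $v_2$ are two different elements of $D - (V_1 \cup V_2)$. Otherwise, define $f(G_1, G_2) = (\{\langle u \rangle, \langle v \rangle\}, \{\langle u \rangle\})$ where $u$ and $v$ are two different elements of $D$. Clearly $f$ is polynomial time computable. If $(G_1, G_2) \in \mathrm{GI}$ and $\mathcal{E}_1$ or $\mathcal{E}_2$ is not empty, then there exists an isomorphism $\varphi$ from $G_1$ to $G_2$. If we extend $\varphi$ to a permutation of $D$ such that $\varphi(v_1) = v_2$ (mapping $V_2$ back to $V_1$ by the inverse of the original isomorphism, and $v_2$ to $v_1$), then $\varphi$ is an automorphism of $\mathcal{E}_1 v_1 \cup \mathcal{E}_2 v_2$. But then, by Theorem 1, $f(G_1, G_2) \notin \text{BP-PAIR}$, since $\varphi$ is not an automorphism of $\{\langle v_1 \rangle\}$. Now, if $(G_1, G_2) \notin \mathrm{GI}$, then for each $\varphi \in \mathrm{Aut}(\mathcal{E}_1 v_1 \cup \mathcal{E}_2 v_2)$ it clearly must be the case that $\varphi(v_1) = v_1$ and $\varphi(v_2) = v_2$. By Theorem 1, it follows immediately that $f(G_1, G_2) \in \text{BP-PAIR}$. Finally, if $\mathcal{E}_1$ and $\mathcal{E}_2$ are both empty, then clearly $(G_1, G_2) \in \mathrm{GI}$ if and only if $f(G_1, G_2) \notin \text{BP-PAIR}$.
The second reduction follows directly from the definition of BP-PAIRS. The third reduction follows from Lemma 1 since SubGI is NP-complete.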
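The first reduction in the proof of Theorem 2 is easy to write down. The Python sketch below is illustrative only and makes several assumptions of its own: graphs are pairs (vertex set, edge set) over string atoms, fresh atoms are generated with a hypothetical naming scheme `_v0`, `_v1`, ..., and a brute-force `bp_pair` test (exponential time, our own helper) is included only so that the effect of the reduction can be observed on tiny graphs.

```python
from itertools import count, permutations


def adom(r):
    """Active domain of relation r."""
    return {a for t in r for a in t}


def image(phi, r):
    """Image of relation r under the atom mapping phi (a dict)."""
    return {tuple(phi[a] for a in t) for t in r}


def bp_pair(s, t):
    """Brute-force test of the two BP-PAIR conditions (for tiny inputs only)."""
    atoms = sorted(adom(s))
    if not adom(t) <= set(atoms):
        return False
    for perm in permutations(atoms):
        phi = dict(zip(atoms, perm))
        if image(phi, s) == s and image(phi, t) != t:
            return False
    return True


def fresh_atoms(used, how_many):
    """Return how_many (>= 1) distinct strings that do not occur in used."""
    fresh = []
    for n in count():
        name = f"_v{n}"            # hypothetical naming scheme for new atoms
        if name not in used:
            fresh.append(name)
        if len(fresh) == how_many:
            return fresh


def attach(r, v):
    """r v = r x {<v>}: append atom v to every tuple of r (Definition 9)."""
    return {t + (v,) for t in r}


def f(g1, g2):
    """Map a pair of graphs to a (source, target) pair of relations."""
    (V1, E1), (V2, E2) = g1, g2
    assert not (V1 & V2), "vertex sets are assumed disjoint"
    v1, v2 = fresh_atoms(V1 | V2, 2)
    if E1 or E2:
        return attach(E1, v1) | attach(E2, v2), {(v1,)}
    # Both edge sets empty: any two distinct atoms play the roles of u and v.
    return {(v1,), (v2,)}, {(v1,)}
```

On two single-edge graphs the resulting pair fails the BP-PAIR test, while on a single edge versus a self-loop it passes, matching the direction of the reduction.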
V. AN OBSERVATION ON GENERIC QUERIES
As an application of Theorem 1, we have the following novel characterization of the generic relational queries.
Theorem 3: Let $Q$ be a mapping from relations of arity $m \geq 0$ to relations of arity $n \geq 0$. Then the following statements are equivalent:
1) $Q$ is generic.
2) For any finite set $\mathcal{R}$ of relations of arity $m$, there is a relational algebra expression $E_{\mathcal{R}}$ such that, for every $r \in \mathcal{R}$, $E_{\mathcal{R}}(r) = Q(r)$.
3) For any pair $\mathcal{R} = \{r_1, r_2\}$ of relations of arity $m$, there is a relational algebra expression $E_{\mathcal{R}}$ such that, for $i = 1, 2$, $E_{\mathcal{R}}(r_i) = Q(r_i)$.
Proof: (1 $\Rightarrow$ 2) Let $\mathcal{R} = \{r_1, \ldots, r_k\}$. Consider the pairs $(r_i, Q(r_i))$, $i = 1, \ldots, k$. Suppose that, for $i, j = 1, \ldots, k$, $\varphi$ is an isomorphism from $r_i$ to $r_j$. Extend $\varphi$ in an arbitrary way to a permutation of $D$. Since $Q$ is generic, $\varphi(Q(r_i)) = Q(\varphi(r_i)) = Q(r_j)$. Hence, $\varphi$ is an isomorphism from $Q(r_i)$ to $Q(r_j)$. Since it is also the case that $\mathrm{adom}(Q(r_i)) \subseteq \mathrm{adom}(r_i)$, we have from Theorem 1 that there exists a relational algebra expression $E_{\mathcal{R}}$ such that, for every $r \in \mathcal{R}$, $E_{\mathcal{R}}(r) = Q(r)$.
(2 $\Rightarrow$ 3) Obvious.
(3 $\Rightarrow$ 1) Let $r$ be a relation of arity $m$ and $\varphi$ be a permutation on $D$. Let $\mathcal{R} = \{r, \varphi(r)\}$. By assumption, there exists a relational algebra expression $E_{\mathcal{R}}$ such that $E_{\mathcal{R}}(r) = Q(r)$ and $E_{\mathcal{R}}(\varphi(r)) = Q(\varphi(r))$. Since the relational algebra is generic, we have that $\varphi(Q(r)) = \varphi(E_{\mathcal{R}}(r)) = E_{\mathcal{R}}(\varphi(r)) = Q(\varphi(r))$.
Theorem 3 highlights once more the fundamental difference between the classic BP-PAIR case and the BP-PAIRS case. Not only does the proof of Theorem 3 heavily rely on the fact that $|\mathcal{R}| > 1$, but furthermore, without this condition the result simply does not hold. To see this, let $a \in D$. Consider the mapping $Q$ for which $Q(\{\langle a \rangle\}) = \{\langle a \rangle\}$ and $Q(r) = \emptyset$ for $r \neq \{\langle a \rangle\}$. Clearly, $Q$ is computable, but not generic. To see this, choose $b \in D$ such that $a \neq b$. Consider the permutation of $D$ that swaps $a$ and $b$ and fixes all other elements of $D$. While this permutation is an isomorphism from $\{\langle a \rangle\}$ to $\{\langle b \rangle\}$, it is not an isomorphism from $Q(\{\langle a \rangle\}) = \{\langle a \rangle\}$ to $Q(\{\langle b \rangle\}) = \emptyset$. Nevertheless, $Q$ satisfies statement (2) of Theorem 3 for $|\mathcal{R}| = 1$.
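The counterexample above can be replayed concretely. The following Python sketch is a minimal illustration under our own assumptions: relations are sets of tuples, the atoms $a$ and $b$ are the strings `"a"` and `"b"`, and a permutation of $D$ is represented by a dict listing only the atoms it moves.

```python
def Q(r):
    """The mapping from the text: {<a>} goes to {<a>}, everything else to empty."""
    return {("a",)} if r == {("a",)} else set()


def image(phi, r):
    """Image of r under a permutation given by the atoms it moves."""
    return {tuple(phi.get(x, x) for x in t) for t in r}


swap = {"a": "b", "b": "a"}        # swaps a and b, fixes every other atom
r = {("a",)}

left = image(swap, Q(r))           # phi(Q(r))
right = Q(image(swap, r))          # Q(phi(r))
```

Here `left` is `{("b",)}` while `right` is empty, so the genericity equation of Definition 6 fails for this permutation.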
VI. FINAL REMARKS
All of the results established above also hold for the natural generalization of BP-PAIRS to the nested relational model, following Gyssens et al. [12]. It may also prove fruitful to investigate similar generalizations of instance-driven query discovery for graph [5] and XML [7] data. We close by noting several further open questions which naturally arise from the present investigation.
Recently, results have been established on the complexity of repairing data mapping expressions for several logical languages [11]. In the context of reasoning about BP-sets, one can dually consider repairing instances for data mapping discovery.
    Suppose a BP-set only satisfies condition (2a) of Theorem 1. What is the minimal number of tuple additions and/or deletions required to “repair” the set such that it also satisfies condition (2b)? For some $k \geq 0$, can the set be repaired with at most $k$ such updates?
    Suppose a BP-set $\{(s_1, t_1), \ldots, (s_k, t_k)\}$ fails to satisfy condition (2a) of Theorem 1. Can one find a renaming of the atoms in $\bigcup_{1 \leq i \leq k} \mathrm{adom}(s_i)$ with atoms in $\bigcup_{1 \leq i \leq k} \mathrm{adom}(t_i)$ such that the set satisfies both conditions (2a) and (2b)? In other words, for the given BP-set, does there exist a relational algebra expression $E$ and binary relation $\tau \subseteq \bigcup_{1 \leq i \leq k} \mathrm{adom}(s_i) \times \bigcup_{1 \leq i \leq k} \mathrm{adom}(t_i)$ such that $E(s_i, \tau) = t_i$, for each $1 \leq i \leq k$? Note that BP-PAIRS is just the special case of this problem where $\tau$ is restricted to subsets of the identity relation on $\bigcup_{1 \leq i \leq k} \mathrm{adom}(s_i)$.
Are there natural characterizations and practical algorithmic solutions for such instance repair problems?
A BP-set can be thought of as a “sample” or finite “trace” of an infinite query. Although Theorem 1 provides an explicit means to construct an appropriate mapping query when possible, there is no guarantee that in practice this query is a “desirable,” “interesting,” or the “best” mapping for a given context. For example, consider again the BP-set of Example 1. In this case, the mapping expression constructed using Theorem 1 consists of a union of expressions, each of which consists of cross products and unions of sizeable subexpressions [10]. In contrast, we saw in Example 1 that the succinct expression $s - (s \times \pi_{\langle\rangle}(\sigma_{1 \neq 2}(s)))$ is sufficient. Consequently, towards applications of Theorem 1 it is important to develop meaningful notions of query interestingness and goodness-of-fit.
REFERENCES
[1] F. Bancilhon. On the Completeness of Query Languages for Relational Data Bases. Proc. MFCS, Springer LNCS 64, pp. 112–123, Zakopane, Poland, 1978.
[2] A. Bilke and F. Naumann. Schema Matching using Duplicates.
Proc. IEEE ICDE, pp. 69–80, Tokyo, 2005.
[3] A.K. Chandra and D. Harel. Computable Queries for Relational
Data Bases. J. Comput. Syst. Sci. 21(2):156–178, 1980.
[4] G.H.L. Fletcher and C.M. Wyss. Data Mapping as Search. Proc.
EDBT, Springer LNCS 3896, pp. 95–111, Munich, 2006.
[5] M. Gemis, J. Paredaens, P. Peelman, and J. Van den Bussche.
Expressiveness and Complexity of Generic Graph Machines.
Theory Comput. Syst. 31(3):231–249, 1998.
[6] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, and S. Flesca.
The Lixto Data Extraction Project—Back and Forth between
Theory and Practice. Proc. ACM PODS, pp. 1–12, Paris, 2004.
[7] M. Gyssens, J. Paredaens, D. Van Gucht, and G.H.L. Fletcher. Structural Characterizations of the Semantics of XPath as Navigation Tool on a Document. Proc. ACM PODS, pp. 318–327, Chicago, 2006.
[8] J. Köbler, U. Schöning, and J. Torán. The Graph Isomorphism Problem: Its Structural Complexity. Birkhäuser, Boston, 1993.
[9] P.G. Kolaitis. Schema Mappings, Data Exchange, and Metadata
Management. Proc. ACM PODS, pp. 61–75, Baltimore, 2005.
[10] J. Paredaens. On the Expressive Power of the Relational Algebra. Information Processing Letters 7(2):107–111, 1978.
[11] P. Senellart and G. Gottlob. On the Complexity of Deriving
Schema Mappings from Database Instances. Proc. ACM PODS,
pp. 23–32, Vancouver, Canada, 2008.
[12] M. Gyssens, J. Paredaens, and D. Van Gucht. A Uniform
Approach Toward Handling Atomic and Structured Information
in the Nested Relational Database Model. Journal of the ACM
36(4):790–825, 1989.
... Then we need to check if the query language is capable of defining the movieLink relation -this is the definability problem. Using example instances of source and target schemas for deriving appropriate sourceto-target mappings have been explored in relational databases [11,13,10,2]. Research on schema mappings for graph databases has started [7,5], though data values and extraction from example graphs have not been considered till now to the best of our knowledge. Example instances have also been used to derive "wrapper" queries for extraction of relevant information from data sources [12]. ...
... For the lower bounds, we identify how small data graphs can count exponentially large numbers using data values, which otherwise require exponentially large graphs. Related work Apart from derivation of mappings [11,13,10,2], studies have also been made of using data examples to illustrate the semantics of schema mappings [1]. In [8], the problem of deriving schema mappings from data examples is studied from the perspective of algorithmic learning theory. ...
Article
Full-text available
Designing query languages for graph structured data is an active field of research. Evaluating a query on a graph results in a relation on the set of its nodes. In other words, a query is a mechanism for defining relations on a graph. Some relations may not be definable by any query in a given language. This leads to the following question: given a query language and a relation, does there exist a query in the given language that defines the given relation? This is called the definability problem. When the given query language is standard regular expressions, the definability problem is known to be Pspace-complete. The model of graphs can be extended by labeling nodes with values from an infinite domain. These labels induce a partition on the set of nodes: two nodes are equivalent if they are labeled by the same value. Query languages can also be extended to make use of this equivalence. Two such extensions are Regular Expressions with Memory (REM) and Regular Expressions with Equality (REE). In this paper, we study the complexity of the definability problem in this extended model when the query language is either REM or REE. We show that the definability problem is Expspace-complete when the query language is REM, and it is Pspace-complete when the query language is REE. In addition, when the query language is a union of conjunctive queries based on REM or REE, we show coNP-completeness.
... Other work has extended the view synthesis problem to an iterative one, with the user being asked to confirm the presence or absence of tuples one at a time in order to learn an appropriate user query for various settings [8,10,7,9,1]. Earlier work studied the problem of checking if there exists a view definition without synthesizing it [14]. Another related direction is that of synthesizing a view given multiple pairs of database instances, introduced in the context of data integration as a problem of learning schema mappings from data examples [11,16,2]. ...
Article
This paper addresses the Data-Diff problem: given a dataset and a subsequent version of the dataset, find the shortest sequence of operations that transforms the dataset to the subsequent version, under a restricted family of operations. We consider operations similar to SQL UPDATE, each with a condition (WHERE) that matches a subset of tuples and a modifier (SET) that makes changes to those matched tuples. We characterize the problem based on different constraints on the attributes and the allowed conditions and modifiers, providing complexity classification and algorithms in each case.
... Thus, the results of [6,21] are not applicable in our setting. Placing our work in the wider framework of the expressive power of relational algebra can give rise to interesting future work, e.g., extensions of the style of [14] to the work of [6,21] can be considered for the problem we investigate. ...
Conference Paper
This paper investigates the problem of reverse engineering, i.e., learning, select-project-join (SPJ) queries from a user-provided example set, containing positive and negative tuples. The goal is then to determine whether there exists a query returning all the positive tuples, but none of the negative tuples, and furthermore, to find such a query, if it exists. These are called the satisfiability and learning problems, respectively. The ability to solve these problems is an important step in simplifying the querying process for non-expert users. This paper thoroughly investigates the satisfiability and learning problems in a variety of settings. In particular, we consider several classes of queries, which allow different combinations of the operators select, project and join. In addition, we compare the complexity of satisfiability and learning, when the query is, or is not, of bounded size. We note that bounded-size queries are of particular interest, as they can be used to avoid over-fitting (i.e., tailoring a query precisely to only the seen examples). In order to fully understand the underlying factors which make satisfiability and learning (in)tractable, we consider different components of the problem, namely, the size of a query to be learned, the size of the schema and the number of examples. We study the complexity of our problems, when considering these as part of the input, as constants or as parameters (i.e., as in parameterized complexity analysis). Depending on the setting, the complexity of satisfiability and learning can vary significantly. Among other results, our analysis also provides new problems that are complete for W[3], for which few natural problems are known. Finally, by considering a variety of settings, we derive insight on how the different facets of our problem interplay with the size of the database, thereby providing the theoretical foundations necessary for a future implementation of query learning from examples.
... Their research led to the notion of BP-completeness. Their results were later extended to the nested relational model [Van Gucht 1987] and to sequences of input-output pairs [Fletcher et al. 2009]. Learning and definability have in common the fact that they look for a query consistent with a set of examples. ...
Article
Full-text available
We investigate the problem of learning join queries from user examples. The user is presented with a set of candidate tuples and is asked to label them as positive or negative examples, depending on whether or not she would like the tuples as part of the join result. The goal is to quickly infer an arbitrary n-ary join predicate across an arbitrary number m of relations while keeping the number of user interactions as minimal as possible. We assume no prior knowledge of the integrity constraints across the involved relations. Inferring the join predicate across multiple relations when the referential constraints are unknown may occur in several applications, such as data integration, reverse engineering of database queries, and schema inference. In such scenarios, the number of tuples involved in the join is typically large. We introduce a set of strategies that let us inspect the search space and aggressively prune what we call uninformative tuples, and we directly present to the user the informative ones that is, those that allow the user to quickly find the goal query she has in mind. In this article, we focus on the inference of joins with equality predicates and also allow disjunctive join predicates and projection in the queries. We precisely characterize the frontier between tractability and intractability for the following problems of interest in these settings: consistency checking, learnability, and deciding the informativeness of a tuple. Next, we propose several strategies for presenting tuples to the user in a given order that allows minimization of the number of interactions. We show the efficiency of our approach through an experimental study on both benchmark and synthetic datasets.
... Their research led to the notion of BP-completeness. Their results were later extended to the nested relational model [VG87] and to sequences of input-output pairs [FGPVG09]. Learning and definability have in common the fact that they look for a query consistent with a set of examples. The difference is that learning allows the query to select or not the tuples that are not explicitly labeled as positive or negative examples while definability requires the query to select nothing else than the set of positive examples (i.e., all the other tuples are implicitly negative). ...
Article
Specifying a database query using a formal query language is typically a challenging task for non-expert users. In the context of big data, this problem becomes even harder because it requires the users to deal with database instances of large size and hence difficult to visualize. Such instances usually lack a schema to help the users specify their queries, or have an incomplete schema as they come from disparate data sources. In this thesis, we address the problem of query specification for non-expert users. We identify two possible approaches for tackling this problem: learning queries from examples and translating the data in a format that the user finds easier to query. Our contributions are aligned with these two complementary directions and span over three of the most popular data models: XML, relational, and graph. This thesis consists of two parts, dedicated to (i) schema definition and translation, and to (ii) learning schemas and queries. In the first part, we define schema formalisms for unordered XML and we analyze their computational properties; we also study the complexity of the data exchange problem in the setting of a relational source and a graph target database. In the second part, we investigate the problem of learning from examples the schemas for unordered XML proposed in the first part, as well as relational join queries and path queries on graph databases. The interactive scenario that we propose for these two classes of queries is immediately applicable to assisting non-expert users in the process of query specification.
... Their research led to the notion of BP-completeness. Their results were later extended to the nested relational model [41] and to sequences of inputoutput pairs [18]. A related problem, recently studied by Tran et al. [39], is the query by output problem: given a database instance and the output of some query, their goal is to construct an instance equivalent query to the initial one. ...
Article
Web applications store their data within various database models, such as relational, semi-structured, and graph data models to name a few. We study learning algorithms for queries for the above mentioned models. As a further goal, we aim to apply the results to learning cross-model database mappings, which can also be seen as queries across different schemas.
Article
A fitting algorithm for conjunctive queries (CQs) is an algorithm that takes as input a collection of data examples and outputs a CQ that fits the examples. In this column, we propose a set of desirable properties of such algorithms and use this as a guide for surveying results from the authors' recent papers published in PODS 2023, IJCAI 2023, and Inf. Proc. Letters 2024. In particular, we explain and compare several concrete fitting algorithms, and we discuss complexity and size bounds for constructing fitting CQs with desirable properties.
Article
We study the definability problem for first-order logic, denoted by FO-Def. The input of FO-Def is a relational database instance I and a relation R; the question to answer is whether there exists a first-order query Q (or, equivalently, a relational algebra expression Q) such that Q evaluated on I gives R as an answer. Although the study of FO-Def dates back to 1978, when the decidability of this problem was shown, the exact complexity of FO-Def remains as a fundamental open problem. In this article, we provide a polynomial-time algorithm for solving FO-Def that uses calls to a graph-isomorphism subroutine (or oracle). As a consequence, the first-order definability problem is found to be complete for the class GI of all problems that are polynomial-time Turing reducible to the graph isomorphism problem, thus closing the open question about the exact complexity of this problem. The technique used is also applied to a generalized version of the problem that accepts a finite set of relation pairs, and whose exact complexity was also open; this version is also found to be GI-complete.
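For the single-pair case, the Bancilhon and Paredaens criterion stated in the introduction (every atom of t occurs in s, and every automorphism of s is an automorphism of t) can be checked by brute force. The sketch below is not the polynomial-time oracle algorithm of the abstract above; it enumerates all permutations of the atoms of s and is therefore exponential. The function names and the encoding of relations as sets of tuples are assumptions, and the sketch presumes the precondition that s is non-empty or t has positive arity.

```python
from itertools import permutations


def atoms(rel):
    """Set of atomic values occurring in a relation (a set of tuples)."""
    return {a for tup in rel for a in tup}


def apply_perm(perm, rel):
    """Image of a relation under a permutation given as a dict."""
    return {tuple(perm[a] for a in tup) for tup in rel}


def bp_definable(s, t):
    """Brute-force check of the two conditions for the pair (s, t)."""
    dom = list(atoms(s))
    # Condition (1): every atom occurring in t also occurs in s.
    if not atoms(t) <= set(dom):
        return False
    # Condition (2): every automorphism of s is an automorphism of t.
    for image in permutations(dom):
        perm = dict(zip(dom, image))
        if apply_perm(perm, s) == s and apply_perm(perm, t) != t:
            return False
    return True


# Hypothetical instances: a symmetric binary relation and candidate targets.
s = {(1, 2), (2, 1)}
r_sym = bp_definable(s, {(1,), (2,)})   # swapping 1 and 2 preserves t
r_asym = bp_definable(s, {(1,)})        # swapping 1 and 2 changes t
r_new = bp_definable(s, {(3,)})         # atom 3 does not occur in s
```

Enumerating permutations keeps the sketch short and makes the two conditions explicit; it is only practical for very small instances.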
Conference Paper
Designing query languages for graph structured data is an active field of research. Evaluating a query on a graph results in a relation on the set of its nodes. In other words, a query is a mechanism for defining relations on a graph. Some relations may not be definable by any query in a given language. This leads to the following question: given a graph, a query language and a relation on the graph, does there exist a query in the language that defines the relation? This is called the definability problem. When the given query language is standard regular expressions, the definability problem is known to be PSPACE-complete. The model of graphs can be extended by labeling nodes with values from an infinite domain. These labels induce a partition on the set of nodes: two nodes are equivalent if they are labeled by the same value. Query languages can also be extended to make use of this equivalence. Two such extensions are Regular Expressions with Memory (REM) and Regular Expressions with Equality (REE). In this paper, we study the complexity of the definability problem in this extended model when the query language is either REM or REE. We show that the definability problem is EXPSPACE-complete when the query language is REM, and it is PSPACE-complete when the query language is REE. In addition, when the query language is a union of conjunctive queries based on REM or REE, we show CoNP-completeness.
Chapter
Data integration remains a perennially difficult task. The need to access, integrate and make sense of large amounts of data has, in fact, become more pronounced in recent years. There are now many publicly available sources of data that can provide valuable information in various domains. Concrete examples of public data sources include: bibliographic repositories (DBLP, Cora, Citeseer), online movie databases (IMDB), knowledge bases (Wikipedia, DBpedia, Freebase), social media data (Facebook and Twitter, blogs). Additionally, a number of more specialized public data repositories are starting to play an increasingly important role. These repositories include, for example, the U.S. federal government data, congress and census data, as well as financial reports archived by the U.S. Securities and Exchange Commission (SEC).
Article
The Generic Graph Machine (GGM) model is a Turing machine-like model for expressing generic computations working directly on graph structures. In this paper we present a number of observations concerning the expressiveness and complexity of GGMs. Our results comprise the following: (i) an intrinsic characterization of the pairs of graphs that are an input-output pair of some GGM; (ii) a comparison between GGM complexity and TM complexity; and (iii) a detailed discussion on the connections between the GGM model and other generic computation models considered in the literature, in particular the generic complexity classes of Abiteboul and Vianu, and the Database Method Schemes of Denninghoff and Vianu.
Conference Paper
We introduce a theoretical framework for discovering relationships between two database instances over distinct and unknown schemata. This framework is grounded in the context of data exchange. We formalize the problem of understanding the relationship between two instances as that of obtaining a schema mapping so that a minimum repair of this mapping provides a perfect description of the target instance given the source instance. We show that this definition yields "intuitive" results when applied on database instances derived from each other by basic operations. We study the complexity of decision problems related to this optimality notion in the context of different logical languages and show that, even in very restricted cases, the problem is of high complexity.
Conference Paper
Given a document D in the form of an unordered labeled tree, we study the expressibility on D of various fragments of XPath, the core navigational language on XML documents. We give characterizations, in terms of the structure of D, for when a binary relation on its nodes is definable by an XPath expression in these fragments. Since each pair of nodes in such a relation represents a unique path in D, our results therefore capture the sets of paths in D definable in XPath. We refer to this perspective on the semantics of XPath as the "global view." In contrast with this global view, there is also a "local view" where one is interested in the nodes to which one can navigate starting from a particular node in the document. In this view, we characterize when a set of nodes in D can be defined as the result of applying an XPath expression to a given node of D. All these definability results, both in the global and the local view, are obtained by using a robust two-step methodology, which consists of first characterizing when two nodes cannot be distinguished by an expression in the respective fragments of XPath, and then bootstrapping these characterizations to the desired results.
Conference Paper
Schema mappings are high-level specifications that describe the relationship between database schemas. Schema mappings are prominent in several different areas of database management, including database design, information integration, data exchange, metadata management, and peer-to-peer data management systems. Our main aim in this paper is to present an overview of recent advances in data exchange and metadata management, where the schema mappings are between relational schemas. In addition, we highlight some research issues and directions for future work.
Conference Paper
We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative which is used as the internal wrapper language in the Lixto system. These results include both a characterization of the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inseminated by our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.
Conference Paper
In this paper, we describe and situate the TUPELO system for data mapping in relational databases. Automating the discovery of mappings between structured data sources is a long-standing and important problem in data management. Starting from user-provided example instances of the source and target schemas, TUPELO approaches mapping discovery as search within the transformation space of these instances based on a set of mapping operators. TUPELO mapping expressions incorporate not only data-metadata transformations, but also simple and complex semantic transformations, resulting in significantly wider applicability than previous systems. Extensive empirical validation of TUPELO, both on synthetic and real-world datasets, indicates that the approach is both viable and effective.
Article
The algebras and query languages for nested relations defined thus far do not allow us to “flatten” a relation scheme by disregarding the internal representation of data. In real life, however, the degree in which the structure of certain information, such as addresses, phone numbers, etc., is taken into account depends on the particular application and may even vary in time. Therefore, an algebra is proposed that does allow us to simplify relations by disregarding the internal structure of a certain class of information. This algebra is based on a careful manipulation of attribute names. Furthermore, the key operator in this algebra, called “copying”, allows us to deal with various other common queries in a very uniform manner, provided these queries are interpreted as operations on classes of semantically equivalent relations rather than individual relations. Finally, it is shown that the proposed algebra is complete in the sense of Bancilhon and Paredaens.