On the Expressive Power of the Relational Algebra
on Finite Sets of Relation Pairs
George H.L. Fletcher, Marc Gyssens, Jan Paredaens, and Dirk Van Gucht
Abstract—We give a language-independent characteri-
zation of the expressive power of the relational algebra
on finite sets of source-target relation instance pairs. The
associated decision problem is shown to be co-graph-
isomorphism hard and in coNP. The main result is also
applied in providing a new characterization of the generic
relational queries.
Index Terms—Query languages, relational algebra, data
mapping, data integration, definability, expressibility, BP
completeness, graph isomorphism, genericity, monotonic-
ity.
I. INTRODUCTION
We investigate a generalization of the classic
result of Bancilhon and Paredaens on the
expressive power of the relational algebra [1], [3],
[10] concerning the following decision problem:
BP-PAIR. Given a pair of relations $(s, t)$,
with $s$ non-empty or $t$ of positive arity,
does there exist a relational algebra
expression $E$ such that $E(s) = t$?
Bancilhon and Paredaens established that BP-PAIR
is equivalent to the problem of determining whether
or not (1) every atom occurring in $t$ also occurs
in $s$, and (2) every automorphism of $s$ is also
an automorphism of $t$. To date, the complexity of
BP-PAIR has not been established.
George Fletcher is with Washington State University, Vancouver.
e-mail: fletcher@vancouver.wsu.edu
Marc Gyssens is with Hasselt University and the Transnational
University of Limburg. e-mail: marc.gyssens@uhasselt.be
Jan Paredaens is with the University of Antwerp. e-mail:
jan.paredaens@ua.ac.be
Dirk Van Gucht is with Indiana University, Bloomington. e-mail:
vgucht@cs.indiana.edu

Example 1: Consider the following pairs of
source/target instances.

  s_1      s_2      s_3
  a  a     b  b     a  a
  b  b     c  c     b  b
                    a  b

  t_1      t_2      t_3
  a  a     b  b     (empty)
  b  b     c  c
Clearly, each pair $(s_i, t_i)$ satisfies BP-PAIR
conditions (1) and (2), and hence, for each $i = 1, 2, 3$,
there exists a relational algebra expression $E_i$ such
that $E_i(s_i) = t_i$.
It is also the case that there exists a single
expression $E$ such that $E(s_i) = t_i$, for each $i = 1, 2, 3$;
for example, the expression
$s - (s \times \pi_{\langle\rangle}(\sigma_{1 \neq 2}(s)))$
behaves properly on each source instance. Suppose
that $t_2$ also has tuple $\langle c, b \rangle$. In this case $(s_2, t_2)$
violates condition (2), and hence there no longer
exists an expression $E_2$ such that $E_2(s_2) = t_2$ (and
consequently, there also no longer exists a single
expression for mapping all pairs). What if we were
to additionally add tuple $\langle b, c \rangle$ to $t_2$? In this case
$(s_2, t_2)$ again satisfies both (1) and (2), and hence
there exists an expression $E_2$ such that $E_2(s_2) = t_2$.
Unfortunately, in this case there still does not exist
a general expression $E$ which behaves properly on
each $(s_i, t_i)$. This does not follow, however, from
either condition (1) or (2). What is it about this set
of instances that makes it unmappable? Is it possible
to characterize the sets that are mappable?
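The single-pair checks discussed in Example 1 can be replayed by brute force. The following is a minimal sketch, not part of the original note: it assumes relations are represented as Python sets of tuples, the helper names (`adom`, `image`, `automorphisms`, `bp_pair`) are hypothetical, and the enumeration of all permutations of the active domain is exponential, so it is only meant for tiny instances.

```python
from itertools import permutations

def adom(r):
    """Active domain: the set of atoms occurring in relation r."""
    return {a for t in r for a in t}

def image(phi, r):
    """Apply the atom mapping phi (a dict) to every tuple of r."""
    return {tuple(phi[a] for a in t) for t in r}

def automorphisms(r):
    """Yield each permutation of adom(r) that maps r onto itself."""
    atoms = list(adom(r))
    for perm in permutations(atoms):
        phi = dict(zip(atoms, perm))
        if image(phi, r) == r:
            yield phi

def bp_pair(s, t):
    """Exhaustive test of BP-PAIR conditions (1) and (2)."""
    if not adom(t) <= adom(s):          # condition (1)
        return False
    # condition (2): every automorphism of s also preserves t
    return all(image(phi, t) == t for phi in automorphisms(s))

# The pair (s_2, t_2) of Example 1 and its two modifications.
s2 = {("b", "b"), ("c", "c")}
print(bp_pair(s2, s2))                              # True
print(bp_pair(s2, s2 | {("c", "b")}))               # False
print(bp_pair(s2, s2 | {("c", "b"), ("b", "c")}))   # True
```

Restricting attention to permutations of the active domain is sufficient here because condition (1) forces every atom of the target to occur in the source.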
A. The Problem
Towards resolving such questions about the
expressive power of the relational algebra on sets of
source/target instance pairs, in this note we
introduce and study the following generalized decision
problem:
BP-PAIRS. Given a set of pairs of relations
$\{(s_1, t_1), \ldots, (s_k, t_k)\}$, $k \geq 1$, with
each $s_i$ of arity $m \geq 0$ and each $t_i$ of
arity $n \geq 0$, does there exist a relational
algebra expression $E$ such that $E(s_i) = t_i$
for $i = 1, \ldots, k$?
Note that BP-PAIRS allows empty source relations.
It is clear that the classic BP-PAIR problem reduces
to a strict special case of the generalized BP-PAIRS
problem (namely, where $k = 1$, the source relation
is non-empty, and $n \geq 1$).
B. Practical Significance
The present investigation was motivated by prac-
tical query discovery problems arising in the context
of recent research on data integration, extraction,
and exchange. In each of these domains, a crucial
problem is the instance-driven discovery of mapping
queries between autonomous data sources. In the
context of data integration, recent research has ex-
plored the use of corresponding example instances
of source and target schemas in the derivation of ap-
propriate source-to-target data mapping queries [2],
[4]. In the context of data extraction, an extensive
line of research has explored the use of example
instances to derive “wrapper” queries for extraction
of relevant information from data sources (e.g., [6]).
In the context of data exchange, an important issue
is the discovery of source-to-target dependencies for
translation of instances of a source schema into
appropriate instances of a target schema (cf. [9]).
Important issues in each of these contexts are to
characterize the goodness of sets of examples for
query discovery and to understand the complexity
of such derivations. The ubiquity of such instance-
based reasoning in a variety of query discovery tasks
led us to the present study of BP-PAIRS.
C. Summary of Results
In this note we first give an exact
language-independent characterization of when a solution to
a BP-PAIRS instance exists and show how to
construct an appropriate mapping expression $E$ when
this is the case. Next, we establish that BP-PAIRS
is co-graph-isomorphism-hard and in coNP. We then
use these results to give a new characterization of
the generic relational queries. We close by
indicating topics for further investigation.
II. PRELIMINARY NOTIONS
In this section we give basic definitions and
notation used in this note.
Definition 1: A relation $r$ of arity $n \in \mathbb{N}$
is a finite subset of the $n$-fold Cartesian product of an
infinitely enumerable domain $D$ of uninterpreted
atoms: $r \subset D^n$. The active domain of $r$ is the
set of atoms occurring in $r$, denoted as
$\mathrm{adom}(r) = \bigcup_{i=1}^{n} \{a_i \mid \langle a_1, \ldots, a_i, \ldots, a_n \rangle \in r\}$.
Notice that there are only two 0-ary relations: the
empty relation $\{\}$ and the relation with the empty
tuple $\{\langle\rangle\}$. These are often used to encode false and
true, respectively, as relations. In this way, boolean
queries can be embedded in the relational model.
Definition 2: An isomorphism $\varphi$ from a
relation $r$ of arity $n$ to a relation $s$ of arity $n$ is a
permutation of $D$ such that, for all $a_1, \ldots, a_n \in D$,
it is the case that $\langle a_1, \ldots, a_n \rangle \in r$ if and only if
$\langle \varphi(a_1), \ldots, \varphi(a_n) \rangle \in s$.
Definition 3: An automorphism $\varphi$ of a relation $r$
of arity $n$ is an isomorphism from $r$ to itself, i.e., for
all $a_1, \ldots, a_n \in D$, it is the case that
$\langle a_1, \ldots, a_n \rangle \in r$ if and only if
$\langle \varphi(a_1), \ldots, \varphi(a_n) \rangle \in r$. The set of
automorphisms of $r$ is denoted $\mathrm{Aut}(r)$.
Notice that the restriction of an automorphism
of $r$ to $\mathrm{adom}(r)$ is necessarily a permutation
of $\mathrm{adom}(r)$.
Definition 4: A BP-set is a finite set
$\{(s_1, t_1), \ldots, (s_k, t_k)\}$ of $k \geq 1$ pairs of relations,
such that, for $i = 1, \ldots, k$, $s_i$ is of arity $m \geq 0$,
and $t_i$ is of arity $n \geq 0$.
We follow Paredaens' presentation of the
relational algebra [10], extended with a constant
operator unit. In what follows, for a tuple
$t = \langle a_1, \ldots, a_n \rangle \in D^n$ we denote by $t[i]$ the $i$th
component of $t$, i.e., $t[i] = a_i$ for $1 \leq i \leq n$.
Definition 5: Let $r$ and $s$ be relations of arity $m$
and $n$, respectively. The relational algebra is the
set of well-formed expressions containing relation
names and closed under the following seven
operations on relations.
• The product of $r$ and $s$ is the relation
$r \times s = \{\langle a_1, \ldots, a_m, b_1, \ldots, b_n \rangle \mid \langle a_1, \ldots, a_m \rangle \in r \text{ and } \langle b_1, \ldots, b_n \rangle \in s\}$.
• The union of $r$ and $s$ is the relation
$r \cup s = \{t \mid t \in r \text{ or } t \in s\}$, which is only defined
when $m = n$.
• The difference of $r$ and $s$ is the relation
$r - s = \{t \mid t \in r \text{ and } t \notin s\}$, which is only defined
when $m = n$.
• The projection of $r$ on $\langle j_1, \ldots, j_\ell \rangle$ ($\ell \geq 0$ and,
for $i = 1, \ldots, \ell$, $1 \leq j_i \leq m$) is the relation
$\pi_{\langle j_1, \ldots, j_\ell \rangle}(r) = \{\langle t[j_1], \ldots, t[j_\ell] \rangle \mid t \in r\}$.
• The equality selection of $r$ on $i$ and $j$ (for
$1 \leq i, j \leq m$) is the relation
$\sigma_{i = j}(r) = \{t \mid t \in r \text{ and } t[i] = t[j]\}$.
• The inequality selection of $r$ on $i$ and $j$ (for
$1 \leq i, j \leq m$) is the relation
$\sigma_{i \neq j}(r) = \{t \mid t \in r \text{ and } t[i] \neq t[j]\}$.
• The unit of $r$ is the relation $\mathrm{unit}(r) = \{\langle\rangle\}$.
If $E$ is an expression over relation names
$R_1, \ldots, R_k$, then $E(r_1, \ldots, r_k)$ denotes the relation
which results from the evaluation of $E$ with each
$R_i$ bound to relation $r_i$, for $1 \leq i \leq k$.
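To make the operators of Definition 5 concrete, here is a minimal sketch in Python. It assumes relations are sets of tuples and uses 1-based column indices as in the definition; the function names are hypothetical choices for illustration. Union and difference are not given their own functions because Python's set operators `|` and `-` already behave as required on relations of equal arity. The last function evaluates the expression quoted in Example 1.

```python
def product(r, s):
    """r x s: concatenate every tuple of r with every tuple of s."""
    return {a + b for a in r for b in s}

def project(r, cols):
    """Projection on 1-based column indices; cols may be empty."""
    return {tuple(t[j - 1] for j in cols) for t in r}

def select_eq(r, i, j):
    """Keep the tuples whose i-th and j-th components are equal."""
    return {t for t in r if t[i - 1] == t[j - 1]}

def select_neq(r, i, j):
    """Keep the tuples whose i-th and j-th components differ."""
    return {t for t in r if t[i - 1] != t[j - 1]}

def unit(r):
    """Constant operator: always the relation holding the empty tuple."""
    return {()}

def example_expression(s):
    """s - (s x pi_<>(sigma_{1 != 2}(s))), the expression of Example 1."""
    return s - product(s, project(select_neq(s, 1, 2), []))

s1 = {("a", "a"), ("b", "b")}
s3 = {("a", "a"), ("b", "b"), ("a", "b")}
print(example_expression(s1) == s1)   # True: no tuple with unequal components
print(example_expression(s3))         # set(): <a, b> triggers the subtraction
```

The projection on the empty column list yields the 0-ary relation with the empty tuple exactly when its argument is non-empty, which is how the expression tests for the presence of a tuple with two different components.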
Finally, we give a standard semantic notion for
relational mappings.
Definition 6: A mapping $Q$ from relations of
some arity $m \geq 0$ to relations of some arity
$n \geq 0$ is generic if, for each relation $r$ of arity
$m$ and each permutation $\varphi$ of $D$, it is the case that
$\varphi(Q(r)) = Q(\varphi(r))$.
III. RESOLVING THE BP-PAIRS PROBLEM
In Example 1, we claimed that the addition of
tuples $\langle b, c \rangle$ and $\langle c, b \rangle$ to $t_2$ would make the
BP-set $\{(s_1, t_1), (s_2, t_2), (s_3, t_3)\}$ an invalid instance of
BP-PAIRS, despite the fact that each pair in the
set is a valid instance of BP-PAIR. We now prove
this. In particular, we now present the main result,
a language-independent characterization of the
expressive power of the relational algebra on BP-sets.
Theorem 1: Let $\{(s_1, t_1), \ldots, (s_k, t_k)\}$ be a
BP-set. The following statements are equivalent:
1) There is a relational algebra expression $E$
such that, for $i = 1, \ldots, k$, $E(s_i) = t_i$.
2) It holds that
a) for $i = 1, \ldots, k$, $\mathrm{adom}(t_i) \subseteq \mathrm{adom}(s_i)$;
and
b) if $\varphi$ is an isomorphism from $s_i$ to $s_j$
($1 \leq i, j \leq k$), then it is also an isomorphism
from $t_i$ to $t_j$.
Proof: $(1 \Rightarrow 2)$ This implication follows
immediately, since relational algebra queries are both
domain-preserving and generic [1], [3], [10].
$(2 \Rightarrow 1)$ Let $m$ and $n$ be the arity of each $s_i$ and
$t_i$, respectively, for $1 \leq i \leq k$.
First, we observe that, for each $i$, the pair $(s_i, t_i)$
is an instance of the classic BP-PAIR problem.
By putting $i = j$, condition (2b) implies that
each automorphism of $s_i$ is also an automorphism
of $t_i$. Hence, there exists a relational algebra
expression $E_i$ taking one $m$-ary relation as argument
and returning an $n$-ary relation as result such that
$E_i(s_i) = t_i$. We must note here that the border cases
$m = 0$ and/or $n = 0$ were not explicitly considered
in the original proof of Paredaens [10]. However,
the expression $E_i(r) = r - r$ will do in the case
that $t_i = \{\}$, irrespective of $m$, and the expression
$E_i(r) = \mathrm{unit}(r)$ will do in the case that $t_i = \{\langle\rangle\}$,
irrespective of $m$.
Next, we note that for each $s_i$ there is a relational
algebra expression $F_i$ such that a relation $r$ is
isomorphic with $s_i$ if and only if $F_i(r) \neq \emptyset$. This
fact was already shown by Bancilhon for the case
when $s_i \neq \{\}$ [1]. In the case when $s_i = \{\}$,
the boolean expression $F_i(r) = \mathrm{unit}(r) - \pi_{\langle\rangle}(r)$
identifies relations isomorphic with $s_i$.
Combining these results, the following relational
algebra expression fulfills (1):
\[ E(r) = \bigcup_{1 \leq i \leq k} \pi_{\langle 1, \ldots, n \rangle}(E_i(r) \times F_i(r)). \]
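Condition (2) of Theorem 1 can be tested exhaustively on small BP-sets. The sketch below is an illustration under stated assumptions rather than the construction of the proof: relations are Python sets of tuples, a BP-set is a list of (source, target) pairs, the function names are hypothetical, and isomorphisms are enumerated as bijections between active domains (which suffices once (2a) holds, since such a bijection always extends to a permutation of the infinite domain). The running time is exponential in the size of the active domains.

```python
from itertools import permutations

def adom(r):
    """Active domain of relation r."""
    return {a for t in r for a in t}

def image(phi, r):
    """Apply the atom mapping phi (a dict) to every tuple of r."""
    return {tuple(phi[a] for a in t) for t in r}

def isomorphisms(r, s):
    """Yield each bijection adom(r) -> adom(s) that maps r onto s."""
    src, dst = list(adom(r)), list(adom(s))
    if len(src) != len(dst):
        return
    for perm in permutations(dst):
        phi = dict(zip(src, perm))
        if image(phi, r) == s:
            yield phi

def is_mappable(bp_set):
    """Exhaustive check of conditions (2a) and (2b) of Theorem 1."""
    if any(not adom(t) <= adom(s) for s, t in bp_set):        # (2a)
        return False
    for si, ti in bp_set:                                     # (2b)
        for sj, tj in bp_set:
            if any(image(phi, ti) != tj for phi in isomorphisms(si, sj)):
                return False
    return True

# The BP-set of Example 1, and the variant with <b, c> and <c, b> added to t_2.
s1 = t1 = {("a", "a"), ("b", "b")}
s2 = t2 = {("b", "b"), ("c", "c")}
s3, t3 = {("a", "a"), ("b", "b"), ("a", "b")}, set()
t2_mod = t2 | {("b", "c"), ("c", "b")}
print(is_mappable([(s1, t1), (s2, t2), (s3, t3)]))       # True
print(is_mappable([(s2, t2_mod)]))                       # True
print(is_mappable([(s1, t1), (s2, t2_mod), (s3, t3)]))   # False
```

In the last call the check fails because the map sending a to b and b to c is an isomorphism from $s_1$ to $s_2$ but does not carry $t_1$ onto the enlarged $t_2$.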
It is interesting to note that the proof of Theorem
1 provides an explicit PSPACE construction of an
appropriate mapping expression, as is the case for
the proof of Paredaens [10] of the BP-PAIR result.
At this point, however, we must emphasize that
there is a fundamental difference between the classic
BP-PAIR result and the BP-PAIRS result. The
proof of Paredaens [10] of the BP-PAIR result
reveals that the difference operator is not used in
the construction of the required relational algebra
expression in the case that $m, n \geq 1$ and both
source and target are nonempty. The expressions
constructed are thus monotone in the sense that
$r_1 \subseteq r_2$ implies $E(r_1) \subseteq E(r_2)$.
In the expressions $F_i$ ($1 \leq i \leq k$) in the proof of
Theorem 1, the difference operator is used. This is
not an incidental effect of the particular construction
used, even in the case that $m, n \geq 1$ and both
source and target are nonempty. Indeed, solutions
to BP-PAIRS make essential use of the difference
operator since BP-sets can capture nonmonotone
query behavior (since $k$ can be greater than 1), and
the relational algebra expressions without difference
are always monotone.¹
Example 2: Consider the following pairs of
source/target instances.

  s_1      s_2
  a  a     a  a
           b  b

  t_1      t_2
  a  a     a  b
           b  a

The BP-set $\{(s_1, t_1), (s_2, t_2)\}$ satisfies condition (2)
of Theorem 1. Hence, there exists a relational
algebra expression $E$ such that $E(s_1) = t_1$ and
$E(s_2) = t_2$. Obviously, $E$ cannot be monotone, since
$s_1 \subseteq s_2$ but $t_1 \not\subseteq t_2$, and
therefore must contain the difference operator. This
is also the case for the BP-set of Example 1.
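The nonmonotonicity argument amounts to finding two pairs whose sources are nested while their targets are not. A minimal sketch of that test follows, assuming relations are Python sets of tuples and a BP-set is a list of (source, target) pairs; the function name is hypothetical, and the instances are written as read from Examples 1 and 2.

```python
def violates_monotonicity(bp_set):
    """True if some s_i is contained in s_j while t_i is not contained in t_j."""
    return any(si <= sj and not ti <= tj
               for si, ti in bp_set
               for sj, tj in bp_set)

# BP-set of Example 2.
example2 = [
    ({("a", "a")}, {("a", "a")}),
    ({("a", "a"), ("b", "b")}, {("a", "b"), ("b", "a")}),
]

# BP-set of Example 1.
example1 = [
    ({("a", "a"), ("b", "b")}, {("a", "a"), ("b", "b")}),
    ({("b", "b"), ("c", "c")}, {("b", "b"), ("c", "c")}),
    ({("a", "a"), ("b", "b"), ("a", "b")}, set()),
]

print(violates_monotonicity(example2))   # True
print(violates_monotonicity(example1))   # True
```

Whenever this test succeeds, no difference-free expression can realize the BP-set, because every such expression is monotone.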
IV. COMPLEXITY OF THE BP-PAIR AND BP-PAIRS PROBLEMS
We next relate the complexity of BP-PAIR and
BP-PAIRS to several well known graph decision
problems. First, we present some terminology.
Definition 7: A graph $G$ is a binary relation $\mathcal{E}$
over a finite domain $V \subset D$. We write $G = (V, \mathcal{E})$,
where $V$ is called the set of vertices and $\mathcal{E} \subseteq V \times V$
is called the set of edges.
Definition 8: Two graph decision problems.
• Subgraph Isomorphism (SubGI): given two
graphs $G_1$ and $G_2$, is $G_1$ isomorphic to a
subgraph of $G_2$?
• Graph Isomorphism (GI): given two graphs $G_1$
and $G_2$, are they isomorphic?
SubGI is a typical NP-complete problem [8].
Clearly GI is also in NP; it is unknown, however,
whether GI is in P, is NP-complete, or neither [8].
We immediately observe the following.
Lemma 1: BP-PAIRS is in coNP.
Proof: Recall that coNP is the class of
problems which have polynomial time
disqualifications (for example, see [8]). A violation of
condition (2a) of Theorem 1 can be detected directly in
polynomial time. For condition (2b), given a BP-set
$\{(s_1, t_1), \ldots, (s_k, t_k)\}$, guess an $i$, a $j$, and
an isomorphism $\varphi$ from $s_i$ to $s_j$, and check in
polynomial time whether or not $\varphi$ is also an
isomorphism from $t_i$ to $t_j$. If not, then, using the
characterization of BP-PAIRS (Theorem 1), reject
$\{(s_1, t_1), \ldots, (s_k, t_k)\}$.

¹ For the same reason, one cannot simply reduce the BP-PAIRS
problem for a BP-set $\{(s_1, t_1), \ldots, (s_k, t_k)\}$ to the BP-PAIR
problem for the pair $(s_1 \times \cdots \times s_k, t_1 \times \cdots \times t_k)$,
even in the case that $m, n \geq 1$ and all relations under
consideration are nonempty.
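The disqualification step in the proof of Lemma 1 is a plain verification once the guess is fixed. The sketch below illustrates this under assumptions not found in the note: relations are Python sets of tuples, the BP-set is a list of (source, target) pairs indexed from 0, condition (2a) already holds, and the guessed map `phi` is a dict defined on the active domain of the chosen source. The function names are hypothetical.

```python
def image(phi, r):
    """Apply the atom mapping phi (a dict) to every tuple of r."""
    return {tuple(phi[a] for a in t) for t in r}

def is_disqualification(bp_set, i, j, phi):
    """Check that (i, j, phi) witnesses a violation of condition (2b):
    phi is injective, maps s_i onto s_j, but does not map t_i onto t_j."""
    (si, ti), (sj, tj) = bp_set[i], bp_set[j]
    injective = len(set(phi.values())) == len(phi)
    return injective and image(phi, si) == sj and image(phi, ti) != tj

# Example 1 with <b, c> and <c, b> added to t_2 (first two pairs only).
s1 = t1 = {("a", "a"), ("b", "b")}
s2 = {("b", "b"), ("c", "c")}
t2_mod = s2 | {("b", "c"), ("c", "b")}
bp_set = [(s1, t1), (s2, t2_mod)]
print(is_disqualification(bp_set, 0, 1, {"a": "b", "b": "c"}))   # True
```

Each step is a set comparison over the given relations, so the verification runs in time polynomial in the size of the BP-set.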
Definition 9: Given a relation $r$ and atom $v \in D$,
define $r^v = r \times \{\langle v \rangle\}$.
We denote by $\overline{P}$ the complement of decision
problem $P$, and by $P \leq^p_m P'$ that $P$ polynomial time
many-one reduces to problem $P'$ [8].
We can now show the main result of this section.
Theorem 2:
\[ \overline{\mathrm{GI}} \;\leq^p_m\; \text{BP-PAIR} \;\leq^p_m\; \text{BP-PAIRS} \;\leq^p_m\; \overline{\mathrm{SubGI}}. \]
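Proof: We establish the first reduction by
exhibiting a polynomial time many-one reduction $f$
from $\overline{\mathrm{GI}}$ to BP-PAIR. Let $G_1 = (V_1, \mathcal{E}_1)$ and
$G_2 = (V_2, \mathcal{E}_2)$ be a pair of graphs, and assume,
without loss of generality, that $V_1 \cap V_2 = \emptyset$.
If $\mathcal{E}_1$ or $\mathcal{E}_2$ is non-empty, define
$f(G_1, G_2) = (\mathcal{E}_1^{v_1} \cup \mathcal{E}_2^{v_2}, \{\langle v_1 \rangle\})$,
where $v_1$ and $v_2$ are two
different elements of $D - (V_1 \cup V_2)$. Otherwise, define
$f(G_1, G_2) = (\{\langle u \rangle, \langle v \rangle\}, \{\langle u \rangle\})$ where $u$ and $v$ are
two different elements of $D$. Clearly $f$ is polynomial
time computable. If $(G_1, G_2) \in \mathrm{GI}$ and $\mathcal{E}_1$ or $\mathcal{E}_2$
is not empty, then there exists an isomorphism $\varphi$
from $G_1$ to $G_2$. If we extend $\varphi$ such that $\varphi(v_1) = v_2$,
then $\varphi$ is an automorphism of $\mathcal{E}_1^{v_1} \cup \mathcal{E}_2^{v_2}$.
But then, by Theorem 1, $f(G_1, G_2) \notin$ BP-PAIR,
since $\varphi$ is not an automorphism of $\{\langle v_1 \rangle\}$. Now, if
$(G_1, G_2) \notin \mathrm{GI}$, then for each
$\varphi \in \mathrm{Aut}(\mathcal{E}_1^{v_1} \cup \mathcal{E}_2^{v_2})$
it clearly must be the case that $\varphi(v_1) = v_1$ and
$\varphi(v_2) = v_2$. By Theorem 1, it follows immediately
that $f(G_1, G_2) \in$ BP-PAIR. Finally, if $\mathcal{E}_1$ and $\mathcal{E}_2$ are
both empty, then clearly $(G_1, G_2) \in \mathrm{GI}$ if and only
if $f(G_1, G_2) \notin$ BP-PAIR.
The second reduction follows directly from the
definition of BP-PAIRS. The third reduction
follows from Lemma 1, since SubGI is NP-complete and
hence $\overline{\mathrm{SubGI}}$ is coNP-complete.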
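The mapping $f$ used in the first reduction is simple enough to write down directly. The following sketch assumes edge relations are Python sets of pairs over disjoint vertex sets, and that the fresh atoms passed as `v1`, `v2`, `u`, `v` do not occur as vertices; all function names and default atom names are hypothetical. A brute-force BP-PAIR test (exponential, for tiny inputs only) is included so the effect of the reduction can be observed. The usage lines exercise only the case where both edge relations are non-empty.

```python
from itertools import permutations

def adom(r):
    """Active domain of relation r."""
    return {a for t in r for a in t}

def image(phi, r):
    """Apply the atom mapping phi (a dict) to every tuple of r."""
    return {tuple(phi[a] for a in t) for t in r}

def bp_pair(s, t):
    """Exhaustive BP-PAIR test: atoms of t occur in s, and every
    automorphism of s is also an automorphism of t."""
    atoms = list(adom(s))
    if not adom(t) <= set(atoms):
        return False
    for perm in permutations(atoms):
        phi = dict(zip(atoms, perm))
        if image(phi, s) == s and image(phi, t) != t:
            return False
    return True

def tag(r, v):
    """r^v = r x {<v>}: append the atom v to every tuple of r."""
    return {t + (v,) for t in r}

def reduce_gi(e1, e2, v1="v1", v2="v2", u="u", v="v"):
    """The mapping f: a pair of edge relations becomes a (source, target) pair."""
    if e1 or e2:
        return tag(e1, v1) | tag(e2, v2), {(v1,)}
    return {(u,), (v,)}, {(u,)}

iso = reduce_gi({("1", "2")}, {("3", "4")})       # two single-edge graphs
non_iso = reduce_gi({("1", "2")}, {("3", "3")})   # an edge versus a loop
print(bp_pair(*iso))       # False: isomorphic graphs yield a no-instance
print(bp_pair(*non_iso))   # True: non-isomorphic graphs yield a yes-instance
```

The tagging atoms are what force any automorphism of the combined source either to keep the two graphs apart or to swap them wholesale.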
V. AN OBSERVATION ON GENERIC QUERIES
As an application of Theorem 1, we have the
following novel characterization of the generic re-
lational queries.
Theorem 3: Let $Q$ be a mapping from relations
of arity $m \geq 0$ to relations of arity $n \geq 0$. Then the
following statements are equivalent:
1) $Q$ is generic.
2) For any finite set $\mathcal{R}$ of relations of arity $m$,
there is a relational algebra expression $E_{\mathcal{R}}$
such that, for every $r \in \mathcal{R}$, $E_{\mathcal{R}}(r) = Q(r)$.
3) For any pair $\mathcal{R} = \{r_1, r_2\}$ of relations of arity
$m$, there is a relational algebra expression $E_{\mathcal{R}}$
such that, for $i = 1, 2$, $E_{\mathcal{R}}(r_i) = Q(r_i)$.
Proof: $(1 \Rightarrow 2)$ Let $\mathcal{R} = \{r_1, \ldots, r_k\}$. Consider
the pairs $(r_i, Q(r_i))$, $i = 1, \ldots, k$. Suppose that, for
$i, j = 1, \ldots, k$, $\varphi$ is an isomorphism from $r_i$ to
$r_j$. Extend $\varphi$ in an arbitrary way to a permutation
of $D$. Since $Q$ is generic,
$\varphi(Q(r_i)) = Q(\varphi(r_i)) = Q(r_j)$.
Hence, $\varphi$ is an isomorphism from $Q(r_i)$ to
$Q(r_j)$. Since it is also the case that
$\mathrm{adom}(Q(r_i)) \subseteq \mathrm{adom}(r_i)$,
we have from Theorem 1 that there exists
a relational algebra expression $E_{\mathcal{R}}$ such that, for
every $r \in \mathcal{R}$, $E_{\mathcal{R}}(r) = Q(r)$.
$(2 \Rightarrow 3)$ Obvious.
$(3 \Rightarrow 1)$ Let $r$ be a relation of arity $m$ and $\varphi$ be a
permutation on $D$. Let $\mathcal{R} = \{r, \varphi(r)\}$. By
assumption, there exists a relational algebra expression $E_{\mathcal{R}}$
such that $E_{\mathcal{R}}(r) = Q(r)$ and $E_{\mathcal{R}}(\varphi(r)) = Q(\varphi(r))$.
Since the relational algebra is generic, we have that
$\varphi(Q(r)) = \varphi(E_{\mathcal{R}}(r)) = E_{\mathcal{R}}(\varphi(r)) = Q(\varphi(r))$.
Theorem 3 highlights once more the fundamental
difference between the classic BP-PAIR case and
the BP-PAIRS case. Not only does the proof of
Theorem 3 heavily rely on the fact that $|\mathcal{R}| > 1$,
but furthermore, without this condition the result
simply does not hold. To see this, let $a \in D$.
Consider the mapping $Q$ for which
$Q(\{\langle a \rangle\}) = \{\langle a \rangle\}$ and $Q(r) = \emptyset$ for $r \neq \{\langle a \rangle\}$.
Clearly, $Q$ is computable, but not generic. To see this, choose
$b \in D$ such that $a \neq b$. Consider the permutation of
$D$ that swaps $a$ and $b$ and fixes all other elements
of $D$. While this permutation is an isomorphism
from $\{\langle a \rangle\}$ to $\{\langle b \rangle\}$, it is not an isomorphism from
$Q(\{\langle a \rangle\}) = \{\langle a \rangle\}$ to $Q(\{\langle b \rangle\}) = \emptyset$. Nevertheless,
$Q$ satisfies statement (2) of Theorem 3 for $|\mathcal{R}| = 1$.
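The counterexample can be checked mechanically. In this minimal sketch, relations are Python sets of tuples, the permutation is given as a dict on the two atoms it moves, and the names `Q`, `image`, and `swap` are hypothetical choices for illustration.

```python
def image(phi, r):
    """Apply the atom mapping phi (a dict) to every tuple of r."""
    return {tuple(phi[x] for x in t) for t in r}

def Q(r, a="a"):
    """Return {<a>} on input {<a>} and the empty relation otherwise."""
    return {(a,)} if r == {(a,)} else set()

swap = {"a": "b", "b": "a"}   # swaps a and b, all other atoms fixed
r = {("a",)}

lhs = image(swap, Q(r))       # phi(Q(r))
rhs = Q(image(swap, r))       # Q(phi(r))
print(lhs, rhs, lhs == rhs)   # {('b',)} set() False
```

The two sides differ, so the genericity equation of Definition 6 fails for this mapping and this permutation.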
VI. FINAL REMARKS
All of the results established above also hold
for the natural generalization of BP-PAIRS to the
nested relational model, following Gyssens et al.
[12]. It may also prove fruitful to investigate similar
generalizations of instance-driven query discovery
for graph [5] and XML [7] data. We close by noting
several further open questions which naturally arise
from the present investigation.
• Recently, results have been established on the
complexity of repairing data mapping
expressions for several logical languages [11]. In
the context of reasoning about BP-sets, one
can dually consider repairing instances for data
mapping discovery.
  – Suppose a BP-set only satisfies condition
  (2a) of Theorem 1. What is the minimal
  number of tuple additions and/or deletions
  required to “repair” the set such that it also
  satisfies condition (2b)? For some $k > 0$,
  can the set be repaired with at most $k$ such
  updates?
  – Suppose a BP-set $\{(s_1, t_1), \ldots, (s_k, t_k)\}$
  fails to satisfy condition (2a) of
  Theorem 1. Can one find a renaming of the
  atoms in $\bigcup_{1 \leq i \leq k} \mathrm{adom}(s_i)$ with atoms in
  $\bigcup_{1 \leq i \leq k} \mathrm{adom}(t_i)$ such that the set satisfies
  both conditions (2a) and (2b)? In other
  words, for the given BP-set, does there
  exist a relational algebra expression $E$ and
  binary relation
  $\tau \subseteq \bigcup_{1 \leq i \leq k} \mathrm{adom}(s_i) \times \bigcup_{1 \leq i \leq k} \mathrm{adom}(t_i)$
  such that $E(s_i, \tau) = t_i$,
  for each $1 \leq i \leq k$? Note that BP-PAIRS
  is just the special case of this problem
  where $\tau$ is restricted to subsets of the
  identity relation on $\bigcup_{1 \leq i \leq k} \mathrm{adom}(s_i)$.
Are there natural characterizations and
practical algorithmic solutions for such instance
repair problems?
• A BP-set can be thought of as a “sample” or
finite “trace” of an infinite query. Although
Theorem 1 provides an explicit means to construct
an appropriate mapping query when possible,
there is no guarantee that in practice this query
is a “desirable,” “interesting,” or the “best”
mapping for a given context. For example,
consider again the BP-set of Example 1. In this
case, the mapping expression constructed using
Theorem 1 consists of a union of expressions,
each of which consists of cross products and
unions of sizeable subexpressions [10]. In
contrast, we saw in Example 1 that the succinct
expression $s - (s \times \pi_{\langle\rangle}(\sigma_{1 \neq 2}(s)))$ is sufficient.
Consequently, towards applications of Theorem
1 it is important to develop meaningful notions
of query interestingness and goodness-of-fit.
REFERENCES
[1] F. Bancilhon. On the Completeness of Query Languages for Re-
lational Data Bases. Proc. MFCS, Springer LNCS 64, pp. 112–
123, Zakopane, Poland, 1978.
[2] A. Bilke and F. Naumann. Schema Matching using Duplicates.
Proc. IEEE ICDE, pp. 69–80, Tokyo, 2005.
[3] A.K. Chandra and D. Harel. Computable Queries for Relational
Data Bases. J. Comput. Syst. Sci. 21(2):156–178, 1980.
[4] G.H.L. Fletcher and C.M. Wyss. Data Mapping as Search. Proc.
EDBT, Springer LNCS 3896, pp. 95–111, Munich, 2006.
[5] M. Gemis, J. Paredaens, P. Peelman, and J. Van den Bussche.
Expressiveness and Complexity of Generic Graph Machines.
Theory Comput. Syst. 31(3):231–249, 1998.
[6] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, and S. Flesca.
The Lixto Data Extraction Project—Back and Forth between
Theory and Practice. Proc. ACM PODS, pp. 1–12, Paris, 2004.
[7] M. Gyssens, J. Paredaens, D. Van Gucht, and G.H.L. Fletcher.
Structural Characterizations of the Semantics of XPath as
Navigation Tool on a Document. Proc. ACM PODS, pp. 318–
327, Chicago, 2006.
[8] J. Köbler, U. Schöning, and J. Torán. The Graph Isomorphism
Problem: Its Structural Complexity. Birkhäuser, Boston, 1993.
[9] P.G. Kolaitis. Schema Mappings, Data Exchange, and Metadata
Management. Proc. ACM PODS, pp. 61–75, Baltimore, 2005.
[10] J. Paredaens. On the Expressive Power of the Relational Alge-
bra. Information Processing Letters 7(2):107–111, 1978.
[11] P. Senellart and G. Gottlob. On the Complexity of Deriving
Schema Mappings from Database Instances. Proc. ACM PODS,
pp. 23–32, Vancouver, Canada, 2008.
[12] M. Gyssens, J. Paredaens, and D. Van Gucht. A Uniform
Approach Toward Handling Atomic and Structured Information
in the Nested Relational Database Model. Journal of the ACM
36(4):790–825, 1989.