Query Evaluation with Soft-Key Constraints
Abhay Jha
CSE, University of Washington
Seattle WA, 98195-2350
abhaykj@cs.washington.edu
Vibhor Rastogi
CSE, University of Washington
Seattle WA, 98195-2350
vibhor@cs.washington.edu
Dan Suciu
CSE, University of Washington
Seattle WA, 98195-2350
suciu@cs.washington.edu
ABSTRACT
Key violations often occur in real-life datasets, especially in those
integrated from different sources. Enforcing constraints strictly on
these datasets is not feasible. In this paper we formalize the notion
of soft-key constraints on probabilistic databases, which allow for
violations of a key constraint by penalizing every violating world by a
quantity proportional to the violation. To represent our probabilistic
database with constraints, we define a class of Markov networks,
where we can do query evaluation in PTIME. We also study the
evaluation of conjunctive queries on relations with soft keys and
present a dichotomy that separates this set into those in PTIME and
the rest, which are #P-hard.
1. INTRODUCTION
Soft constraints are emerging as a promising approach to cope
with various kinds of uncertainty in data, as found in many modern
applications. Soft constraints have been used to enhance the quality
of information extraction [25], of object reconciliation [26, 18], in
query optimization [16], and in data cleaning [1].
While discovering soft constraints is possible today given ad-
vances in machine learning, using the soft constraints during query
processing over large volumes of data is much harder. Current ap-
proaches to probabilistic inference are based either on Monte Carlo
Markov Chain [22] or on message passing [7], and these do not
scale to large volumes of data. The main problem that prevents us
from adopting soft constraints is the lack of scalable query process-
ing techniques in the presence of soft constraints.
In this paper we study soft key constraints, or soft keys in short,
and examine the query evaluation problem: evaluate a Boolean
conjunctive query on a database given a set of soft keys. We in-
terpret soft keys using Markov Networks whose potential consists
of two parts, one that depends on individual tuples, and the other
that depends only on the number of tuples that have the same key;
such potentials have been recently studied in [15]. Our soft keys are
in fact general cardinality constraints. We define query evaluation
as computing the marginal probability of the query, which is the
common semantics in probabilistic databases [3, 12, 5, 9]: this is
different from computing the most likely world, a problem studied
in [15].

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
PODS’08, June 9–12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-108-8/08/06 ...$5.00.

While general query evaluation on probabilistic databases is known
to be #P-hard, even without soft keys [9], we identify two cases that
are tractable and describe two polynomial time algorithms for eval-
uating conjunctive queries in the presence of soft key constraints:
to our knowledge these are the first provably tractable algorithms in
the presence of any soft constraints. Our first algorithm applies to
queries over a single relation, in the presence of multiple soft keys.
Our second algorithm applies to conjunctive queries over multiple
relations, with multiple soft keys. In both cases we also establish a
dichotomy: if our algorithms do not apply then we can show, under
certain assumptions, that the query is #P-hard.
Our analysis is similar in spirit to a previously known dichotomy
result for query evaluation on disjoint/independent probabilistic databases [9],
which can be thought of as probabilistic databases with hard key
constraints. That result defines a syntactic condition on the query,
called safety, then proves that every safe query can be evaluated in
PTIME and that every unsafe query is #P-hard. In our work we also
define a syntactic safety condition for queries in the presence of soft
keys. Then we show that a query is safe in the presence of soft keys
iff it remains safe after making every key either hard, or removing
it altogether, in all possible ways. The intuitive significance of this
connection is the following: by varying the weight attached to a
soft key one can either make it hard or remove it completely, and
therefore any PTIME algorithm that can handle soft keys needs to
be able to handle these extremes. However, the PTIME algorithm
that we describe in this paper in the presence of soft keys is signif-
icantly more difficult than the previous algorithm for safe queries.
This is by necessity: even one soft key makes query evaluation
much more difficult, and the interaction between multiple soft keys
on the same table adds even more complexity.
1.1 Motivating Example
To motivate our work, we illustrate with a concrete problem: du-
plicate elimination in dirty data. This occurs frequently in data in-
tegration because of different representation conventions, or simply
because of typos, and results in key violations: a person has multiple
addresses, a company has multiple CEOs, a scientific paper
has multiple years of publication. Andritsos et al. [1] have pro-
posed a probabilistic approach to answer queries directly on the
dirty data. They assign to each duplicate tuple a probability, such
that the probabilities for the same key sum up to 1. For exam-
ple, consider a relation Person(name, city), where we de-
fine name to be a key. If we find two tuples with the same name,
say (Joe, Seattle) and (Joe, Whistler) then we have
a key violation: the approach in [1] is to assign to each tuple a prob-
ability, say 0.5, indicating that only one tuple may be present in a
clean instance. While this approach uses probabilities, the key con-
straint is hard: in each possible world only one of the two tuples
may be present.

Person:
      Name    City         W
t1    Joe     Seattle      w1 = 4
t2    Joe     Whistler     w2 = 3
t3    Frank   Seattle      w3 = 0
t4    Frank   Paris        w4 = −2
t5    Frank   Honolulu     w5 = 1
t6    Sue     Portland     w6 = 0
t7    Sue     Whistler     w7 = −1
t8    Lisa    Paris        w8 = 3
t9    Lisa    Milan        w9 = −1
t10   Lisa    Saint Malo   w10 = −1
Figure 1: A probabilistic relation given by weights

SOFT KEY x ON Person(x,y)
  SIZE 2 WEIGHT -4;
SOFT KEY ON Person(Frank,y)
  SIZE 2 WEIGHT 4
  SIZE 3 WEIGHT -3;
SOFT KEY x WHERE Income(x,z), z > 1M
  ON Person(x,y)
  SIZE 2 WEIGHT 3
  SIZE s WHERE s > 2 WEIGHT -2*s
Figure 2: Soft key constraints for Person

With a soft key, we can relax the constraint, by allowing a possi-
ble world to contain multiple occurrences of the same key, but by
assigning a certain penalty for multiple occurrences. There are two
reasons why we need such a relaxation. First is that constraints are
learned from training data, rather than stated by an administrator,
and these are always “soft”. For example, in the case of historical
data, a person’s address is not unique at all! A machine learning
tool may infer, for example, that people over 50 years old typically
have 6 addresses, because they moved in the past, while people
under 25 typically have 1 address. Such detailed statistical infor-
mation about the data can be represented using our notion of soft
keys.
Second, by ignoring the “softness” of a key, we get wrong query
answers. For example, assume that most people have a single res-
idence, but wealthy people may have a second vacation home, and
perhaps a third small apartment in a big city. Consider a user who
integrates Person(name, city) with Hobby(name, hobby),
searching for cities likely to host skiers:
q(y) :- Person(x,y), Hobby(x, ’Ski’)
Whistler is a popular ski resort in Canada, but very few people
actually live there. However, many wealthy people have a condo
or a vacation home in Whistler. With a hard key constraint, all
these entries appear to the system to be wrong, and it will decrease
their probabilities: other cities will rank higher than Whistler,
only because the system found fewer key violations for those cities.
In fact, a probabilistic database system that returns only the top k
most likely answers [19] may not retrieve Whistler at all.
Our approach is to use soft keys instead of hard keys; we illus-
trate it in Figure 1. Person is a probabilistic database [9], where
instead of probabilities we indicate the weight of each tuple (the
weight w and the probability p are generally related through
w = log(p/(1−p)), but see Examples 2.6 and 2.7). As
in [1], we can assign different weights to different tuples to indicate
our belief in the quality of the data that produced each tuple.
However, unlike [1], tuples with the same key are not disjoint: keys
are “soft”. In our approach the soft keys are specified separately,
using a declarative language that is illustrated in Fig 2. The first
constraint says that two cities for the same person should be penal-
ized with a weight w = −4. The second constraint defines two soft
keys. One says that Frank is allowed to have two cities: its weight
w = 4 cancels the weight of the first soft key. The second penal-
izes with a weight -3 if 3 tuples are found with the key Frank.
The next soft key says that wealthy people are actually more likely
to have two homes, then gives a formula on how to decrease their
weight as a function of the number of homes. Together, these soft
keys capture our statistical knowledge about the data, and need to
be used during query evaluation. For example, the query q above
will return Whistler with a much higher probability, because the
wealthy people that have vacation homes are no longer considered
errors by the system.
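The relation between a tuple's weight and its probability stated above, w = log(p/(1−p)), can be sketched in Python as follows. This is only an illustration for a tuple that participates in no soft key; the function names are ours and do not come from the paper.

```python
import math

def weight_from_prob(p: float) -> float:
    """Log-odds of p: w = log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def prob_from_weight(w: float) -> float:
    """Inverse map: p = e^w / (1 + e^w)."""
    return math.exp(w) / (1.0 + math.exp(w))

# Weights of tuples t1, t3, t4 in Figure 1.
for w in (4, 0, -2):
    print(w, round(prob_from_weight(w), 3))
```

A weight of 0 corresponds to probability 1/2, positive weights to probabilities above 1/2, and negative weights to probabilities below 1/2.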
Organization We give the basic definitions and the background
on Markov Networks in Sec. 2, then study the query evaluation
problem on a single relation in Sec. 3. We show how to extend
it to conjunctive queries over multiple relations and establish the
dichotomy in Sec. 4.
2. DEFINITIONS AND NOTATIONS
We begin with a brief review of Markov networks (for a detailed
description, we refer to [17, 7]). Then, we define a particular kind
of a Markov Network to interpret the soft keys.
2.1 Markov Networks
A Markov Network is a concise presentation of a probability distribution
of a set of random variables $\bar X = \{t_1, t_2, \ldots, t_n\}$. A state
is defined to be an assignment to the variables in $\bar X$. In this paper
we restrict to Boolean variables, and therefore we will assimilate a
state with a subset $W \subseteq \bar X$, and call it a world. The set of
possible worlds is $\mathcal{W} = 2^{\bar X}$. Thus, a probability distribution on the
random variables $\bar X$ is a finite probability space $(\mathcal{W}, P)$, where
$\mathcal{W} = 2^{\bar X}$ and $P : \mathcal{W} \to [0,1]$ s.t. $\sum_{W \in \mathcal{W}} P(W) = 1$; the
atomic, exclusive events of the probability space are the possible
worlds $W \in \mathcal{W}$.
Typically, n is very large: in a probabilistic database each tuple
corresponds to a Boolean variable $t_i$ (hence we will refer to $t_i$ as a
tuple) and a world corresponds to a subset of tuples. Thus, n is the
number of tuples in the probabilistic database. It is not possible to
enumerate all $2^n$ values of P.
A Markov Network (MN) describes the function P more concisely.
The MN is a triple $(\bar X, K, (\Phi_c)_{c \in K})$, where K is a set of
subsets of $\bar X$, called cliques, and for each $c \in K$,
$\Phi_c : \{0,1\}^c \to \mathbb{R}^+$ is called a potential function. The probability
distribution $P : \mathcal{W} \to [0,1]$ defined by the MN is $P(W) = \frac{1}{Z}\Phi(W)$, where
$\Phi(W)$ is called the weight of the world W and Z is a normalization
factor, and are given by:
$$\Phi(W) = \prod_{c \in K} \Phi_c(W \cap c) \qquad Z = \sum_{W \in \mathcal{W}} \Phi(W)$$
Thus, instead of having to enumerate $2^n$ values for P, we can just
enumerate $2^{|c|}$ values of each potential $\Phi_c$. The basic assumption
in Markov Networks is that each clique c is small.
An MN defines an undirected graph with nodes $\bar X$, and with
edges $\{(x,y) \mid \exists c \in K.\ x \in c, y \in c\}$. Then each set $c \in K$ is
indeed a clique in the graph, justifying the terminology: in this paper
we allow the cliques in K to be non-maximal, which is a minor
departure from the standard definition. The importance of the graph
is that edges correspond to correlations: an edge (x,y) means that
the variables x,y are correlated, while the lack of an edge means
that they are independent.
Figure 3: A Markov Network.
An MN is often represented as a log-linear model, as follows:
each potential function is given by $\Phi_c(W_c) = \exp(w_c f_c(W_c))$,
where $w_c \in \mathbb{R}$ is a weight and $f_c$ is a feature function. Thus, the
weight of a world is:
$$\Phi(W) = \exp\left(\sum_{c \in K} w_c f_c(W \cap c)\right)$$
2.2 Size-Constrained Markov Networks (SCMN)
Our goal is to use a Markov Network to model soft keys. A
clique will correspond to a set of tuples violating the key constraint,
and therefore can be large: the basic assumption in Markov Networks
that cliques are small no longer holds. On the other hand,
to model soft keys, the clique’s potential needs to be a function
only of the number of tuples violating the key constraint, and not
the actual set of tuples. For that purpose we introduce a restricted
form of Markov Networks, which we call Size Constrained MN,
and represent as a log-linear model:
DEFINITION 2.1. A Size-Constrained Markov Network (SCMN)
is $M = (\bar X, K, (w_i)_{i=1,n}, (f_c)_{c \in K})$, where:
• $\bar X = \{t_1, \ldots, t_n\}$ is a set of Boolean variables. We refer to
them as tuples.
• K is a set of nonempty subsets of $\bar X$ called cliques.
• $w_i \in \mathbb{R}$ is called the weight of the tuple $t_i$.
• For all $c \in K$, $f_c : \{0, \ldots, |c|\} \to \mathbb{R}$, s.t. $f_c(0) = 0$.
An SCMN defines the following probability space on $\mathcal{W} = 2^{\bar X}$:
$P(W) = \frac{1}{Z}\Phi(W)$, where
$$\Phi(W) = \exp\left(\sum_{t_i \in W} w_i + \sum_{c \in K} f_c(|W \cap c|)\right) \qquad Z = \sum_{W \in \mathcal{W}} \Phi(W)$$
Thus, like in [15], the potential in an SCMN has two parts: one
that depends only on the individual tuples, and one that depends
only on the number of tuples in each clique.
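The definition can be made concrete with a brute-force sketch that enumerates all worlds of a tiny SCMN. The tuple weights and feature values below are hypothetical numbers chosen for illustration, and the helper names (`scmn_weight`, `all_worlds`, `prob`) are ours; enumeration is exponential in n and is meant only to illustrate the semantics.

```python
import math
from itertools import combinations

def scmn_weight(world, w, cliques, f):
    """Phi(W) = exp(sum of tuple weights + sum over cliques of f_c(|W ∩ c|))."""
    total = sum(w[t] for t in world)
    for name, members in cliques.items():
        total += f[name][len(world & members)]
    return math.exp(total)

def all_worlds(tuples):
    """Enumerate every subset of the tuple set (2^n worlds)."""
    items = sorted(tuples)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

# Hypothetical numbers: five tuples, cliques {t1,t2} and {t3,t4,t5}.
w = {1: 0.5, 2: -1.0, 3: 0.0, 4: 2.0, 5: 1.0}
cliques = {"c1": frozenset({1, 2}), "c2": frozenset({3, 4, 5})}
# f[c][s] is the feature value when s tuples of c are present; f[c][0] must be 0.
f = {"c1": [0.0, 0.0, -4.0], "c2": [0.0, 0.0, -1.0, -3.0]}

Z = sum(scmn_weight(W, w, cliques, f) for W in all_worlds(w))

def prob(world):
    """P(W) = Phi(W) / Z."""
    return scmn_weight(frozenset(world), w, cliques, f) / Z
```

Note that each feature function is stored as a list of |c|+1 numbers, so the representation stays small even when a clique is large.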
Example 2.2 We illustrate a Size-Constrained Markov Network
over five Boolean variables $\bar X = \{t_1, \ldots, t_5\}$. The graph of the
MN is in Fig. 3 and says that the variables $t_1$ and $t_2$ are correlated,
and so are $t_3, t_4, t_5$, but that $t_1, t_2$ are independent from $t_3, t_4, t_5$.
Define now the following SCMN:
$$M = (\bar X, K, (w_i)_{i=1,5}, (f_c)_{c \in K})$$
where $w_1, \ldots, w_5 \in \mathbb{R}$ are the weights, $K = \{c_1, c_2\}$ and:
$$c_1 = \{t_1, t_2\} \qquad f_{c_1}(1) = v_1 \quad f_{c_1}(2) = v_2$$
$$c_2 = \{t_3, t_4, t_5\} \qquad f_{c_2}(1) = u_1 \quad f_{c_2}(2) = u_2 \quad f_{c_2}(3) = u_3$$
Here $v_1, v_2, u_1, u_2, u_3 \in \mathbb{R}$. The probability function P(W)
multiplies the potentials $e^{w_i}$ for all $t_i \in W$, then it examines the
cardinalities of $W \cap c_1$ and of $W \cap c_2$ and multiplies with the
corresponding potential. For example, for the two worlds $W = \{t_2, t_3, t_4\}$
and $W' = \{t_1, t_3, t_4, t_5\}$ we have:
$$P(W) = \frac{1}{Z}\exp(w_2 + w_3 + w_4 + v_1 + u_2)$$
$$P(W') = \frac{1}{Z}\exp(w_1 + w_3 + w_4 + w_5 + v_1 + u_3)$$
2.3 Soft Keys
We now formalize the notion of soft keys and describe their semantics
using an SCMN.
Syntax We start by defining a probabilistic relational schema:
DEFINITION 2.3. A probabilistic relation is a relation schema
$R(A_1, \ldots, A_k, W)$ with a distinguished attribute W, called the
weight.
A probabilistic database schema is $\bar R = (R_1, \ldots, R_m)$, and a
probabilistic instance is simply an instance I for $\bar R$: the term
“probabilistic” refers to how we will interpret the weights, as we show
below. As usual, we denote $R_i^I$ the relation $R_i$ of the instance I.
Given a relation name R we denote Attr(R) the set of attributes
without the weight attribute. We consider in this paper conjunctive
queries that refer only to attributes in Attr(R); thus, a subgoal
g on R means a predicate (with variables and/or constants) over
Attr(R), without the weight attribute. We denote vars(g) the set
of variables in g. For example, given a relation R(A,B,C,D,W),
a subgoal is g = R(x,a,y,x), where vars(g) = {x,y}.
DEFINITION 2.4. Let R be a relation name.
• A soft key schema for R is a pair $\sigma = (\bar x, g)$ where $\bar x$ is a set
of variables and g is a subgoal on R, s.t. $\bar x \subseteq \mathit{vars}(g)$. We
denote $\mathit{Key}_\sigma(R) \subseteq \mathit{Attr}(R)$ the set of attributes in R that
have in g either a constant or an $\bar x$-variable.
• A soft key for R is a triple $\gamma = (\sigma, s, w)$, where σ is a soft
key schema, $s \in \mathbb{N}^+$ is called the size and $w \in \mathbb{R}$ is called
the weight.
We write s = size(γ) and w = weight(γ) to indicate the
size and the weight of the soft key. We also use interchangeably σ
and γ when clear from the context, e.g. write $\mathit{Key}_\gamma(R)$ instead of
$\mathit{Key}_\sigma(R)$.
Informally, the soft key (σ,s,w) says this. If there exist s tuples
in R with the same values $\bar x$, then charge with a weight w. If w < 0
then that will penalize s occurrences of $\bar x$; otherwise it will reward
them.
Example 2.5 The first, second, and last soft key in Figure 2 are
expressed as follows in our formalism:
((x, Person(x,y)), 2, −4)
((∅, Person(Frank,y)), 2, 4)
((∅, Person(a,y)), s, −2s)
The last line represents a set of soft keys: there is one for each
a that satisfies Income(a,z), z > 1M in the current database
instance, and for each number s between 1 and the cardinality of
Person. We will assume in this paper that the system performs
automatically the conversion from a user-friendly syntax as in Fig. 2
to formal soft keys.
Semantics Given an instance I and a set of soft keys Γ, its semantics
is given by the following SCMN:
$$M(\Gamma, I) = (\bar X, K, (w_i)_{i=1,n}, (f_c)_{c \in K})$$
• $\bar X = \{t_1, \ldots, t_n\}$ is the set of tuples in all probabilistic
relations in I. That is, $\bar X = \bigcup_{j=1,m} R_j^I$; we assume the union
to be disjoint.
• $w_i = t_i.W$ (the weight of the tuple $t_i$ in I).
• Let γ ∈ Γ be a soft key for a relation $R_j$, i.e. γ = (σ,s,w),
where $\sigma = (\bar x, g)$, and denote $\bar y = \mathit{vars}(g) - \bar x$. Let $\bar a, \bar b$
be tuples of constants with the same arity as $\bar x$ and $\bar y$ respectively.
Denote $g[\bar a/\bar x, \bar b/\bar y]$ the ground tuple obtained by
substituting $\bar x$ with $\bar a$ and $\bar y$ with $\bar b$ in g. For any γ and $\bar a$, we
define the following set:
$$c_{\gamma,\bar a} = \{t_i \mid \exists \bar b : g[\bar a/\bar x, \bar b/\bar y] = t_i\}$$
A set of the form $c_{\gamma,\bar a}$ consists of a subset of tuples in $R_j^I$
that are affected by the soft key γ. We define the set of
cliques K to be all sets of this form, and the feature functions
$f_c$ to combine the weights of all soft keys that define
the same clique c:
$$K = \{c_{\gamma,\bar a} \mid c_{\gamma,\bar a} \neq \emptyset\}$$
$$f_c(s) = \exp\left(\sum_{\gamma \in \Gamma :\ (\exists \bar a.\, c_{\gamma,\bar a} = c)\ \wedge\ \mathit{size}(\gamma) = s} \binom{|c|}{s}\, \mathit{weight}(\gamma)\right)$$
We explain the definition through three examples. First, we ex-
amine a probabilistic instance I without any soft keys: in this case
the probability defined by M(∅,I) is precisely a tuple-independent
probabilistic database [9].
Example 2.6 Let I be an instance of the probabilistic relational
schema R(A,B,W): $I = \{t_1, \ldots, t_n\}$. Denote $w_i = t_i.W$ the
weight of tuple $t_i$ in I, and let $p_i = e^{w_i}/(1 + e^{w_i})$. Consider the
SCMN M(∅,I); then the probability P(W) of a possible world
W ⊆ I is (see Def 2.1):
$$P(W) = \frac{1}{Z}\exp\left(\sum_{t_i \in W} w_i\right) = \frac{1}{Z}\prod_{t_i \in W}\frac{p_i}{1 - p_i} \qquad (1)$$
From $\sum_W P(W) = 1$, through direct calculation we obtain
$Z = 1/\prod_{i=1,n}(1 - p_i)$. Thus, $P(W) = \prod_{t_i \in W} p_i \times \prod_{t_i \notin W}(1 - p_i)$.
This is precisely a tuple-independent probabilistic database: every
tuple $t_i$ appears in W independently, and its marginal probability
is $P(t_i) = p_i$.
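The closed forms in this example can be checked numerically by enumerating all worlds. The sketch below uses four hypothetical tuple weights and no soft keys; it compares the enumerated Z with $1/\prod_i(1 - p_i)$ and each enumerated marginal with $p_i$. The helper names are ours.

```python
import math
from itertools import combinations

weights = [4.0, 3.0, 0.0, -2.0]            # hypothetical tuple weights w_i
p = [math.exp(w) / (1 + math.exp(w)) for w in weights]
n = len(weights)

def worlds(n):
    """All subsets of {0, ..., n-1}."""
    for r in range(n + 1):
        for combo in combinations(range(n), r):
            yield set(combo)

def phi(world):
    """Weight of a world when there are no soft keys: exp(sum of w_i)."""
    return math.exp(sum(weights[i] for i in world))

Z = sum(phi(W) for W in worlds(n))
Z_closed = 1.0 / math.prod(1 - pi for pi in p)   # closed form from the example

def marginal(i):
    """P(t_i) computed by enumeration."""
    return sum(phi(W) for W in worlds(n) if i in W) / Z
```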
Now we examine how soft keys affect the probability distribu-
tion, by introducing correlations between tuples.
Example 2.7 Continuing the previous example, consider one soft
key with schema:
$$\sigma = (x, R(x,y))$$
For concreteness, suppose the soft key has size 3 and weight −5.0,
i.e. γ = (σ,3,−5.0). This says that three or more occurrences
of the same value for A should be penalized by −5.0. Suppose
a world W contains three tuples $(a,b_1), (a,b_2), (a,b_3)$: then the
SCMN semantics penalizes P(W) by multiplying Eq.(1) by $e^{-5}$.
Now suppose that W contains n violations, $(a,b_1), \ldots, (a,b_n)$:
then Eq.(1) is multiplied with $e^{-5\binom{n}{3}}$: every three distinct
occurrences of a contribute with a weight of −5.0. This is a desirable
behavior: the user says that three occurrences should be penalized
by −5.0, and therefore she expects the penalty to increase
with n. Our particular choice of defining this penalty, by multiplying
with $\binom{n}{3}$, is somewhat arbitrary. Our choice was inspired
by Markov Logic [22]: this is precisely the semantics obtained in
Markov Logic if one assigns weight −5.0 to the formula:
$$\exists y_1.\exists y_2.\exists y_3.\ \bigwedge_{i} R(x, y_i) \wedge \bigwedge_{i \neq j} y_i \neq y_j$$
However, other choices are possible, for example one could multiply
with n − 3 instead of $\binom{n}{3}$. The results in this paper are not
affected by the particular choice of the feature functions $f_c$.
Clearly, by adding the soft key γ, we introduced correlations
between tuples that share the same value of A. But, in addition,
the soft key changes the marginal probability of every tuple $t_i$ =
R(a,b), even if it does not violate the soft key, because Z changes.
In example 2.6 the marginal probability of a tuple $t_i$ was simply
$P(t_i) = p_i$. By adding a single soft key constraint it becomes
unclear how to compute the marginal probability of any tuple, even
one that doesn’t violate the key constraint.
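The effect of this soft key on a marginal can be observed on a toy instance. The sketch below assumes four tuples that all share the same value a for A, each with weight 0 (so each has marginal 1/2 without the key), and applies the penalty described in the example, a factor exp(−5·C(n,3)) for a world with n such tuples. The instance and the variable names are hypothetical.

```python
import math
from itertools import combinations

SIZE, WEIGHT = 3, -5.0                  # the soft key gamma = (sigma, 3, -5.0)
tuple_w = [0.0, 0.0, 0.0, 0.0]          # hypothetical: tuples (a,b1)..(a,b4), weight 0

def phi(world):
    """Eq. (1) weight times the penalty exp(WEIGHT * C(n, SIZE)).

    Every tuple shares the key value a, so n is simply the world size.
    """
    n = len(world)
    return math.exp(sum(tuple_w[i] for i in world) + WEIGHT * math.comb(n, SIZE))

worlds = [set(c) for r in range(len(tuple_w) + 1)
          for c in combinations(range(len(tuple_w)), r)]
Z = sum(phi(W) for W in worlds)
marginal_t0 = sum(phi(W) for W in worlds if 0 in W) / Z
print(round(marginal_t0, 4))            # below the key-free marginal of 0.5
```

Even the worlds with one or two tuples, which violate nothing, change probability, because Z no longer factorizes as in Example 2.6.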
Finally, we examine the impact of several keys.
Example 2.8 We add a second key to the previous example:
$$\gamma' = ((\emptyset, R(x,y)), 1000, +3.5)$$
Now Γ = {γ, γ′}. Here γ′ rewards worlds that have at least 1000
tuples. In other words, while γ encourages us to remove tuples that
share the same key, γ′ penalizes us if we remove too many globally:
more precisely by rewarding worlds with over 1000 tuples it
increases Z and thus penalizes worlds with fewer than 1000 tuples.
This illustrates that multiple soft keys may interact and further
complicate the probability space: with this new soft key it seems even
more difficult to compute the marginal probability of a tuple.
PROPOSITION 2.9. The size of M(Γ,I) is bounded by a poly-
nomial in the size of Γ and I.
PROOF. (Sketch) Let n = |I| and m = |Γ|. Consider a soft
key γ ∈ Γ: the cliques it generates, $c_{\gamma,\bar a}$, are disjoint sets, hence γ
generates at most n cliques. Thus there are at most mn cliques in
M(Γ,I).
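The clique construction behind this bound can be sketched as a grouping of tuples by their key values. The encoding of a soft key schema as a per-attribute pattern, and the function name `cliques_for_key`, are assumptions made for this illustration; the sketch also assumes that no variable repeats inside the subgoal. The data is a subset of Figure 1 without the weight column.

```python
from collections import defaultdict

def cliques_for_key(instance, pattern):
    """Group tuples into the cliques c_{gamma, a} of one soft key schema.

    pattern has one entry per attribute: ("key",) for an x-bar variable,
    ("const", v) for a constant, ("free",) for a y-bar variable.
    Assumption: no variable repeats inside the subgoal.
    """
    groups = defaultdict(set)
    for t in instance:
        if any(p[0] == "const" and p[1] != v for p, v in zip(pattern, t)):
            continue                       # tuple does not match the subgoal g
        a = tuple(v for p, v in zip(pattern, t) if p[0] == "key")
        groups[a].add(t)
    return dict(groups)

person = [("Joe", "Seattle"), ("Joe", "Whistler"), ("Frank", "Seattle"),
          ("Frank", "Paris"), ("Frank", "Honolulu"), ("Sue", "Portland")]

# Schema (x, Person(x,y)): one clique per name.
by_name = cliques_for_key(person, [("key",), ("free",)])
# Schema (∅, Person(Frank,y)): a single clique, indexed by the empty tuple.
frank = cliques_for_key(person, [("const", "Frank"), ("free",)])
```

Each tuple lands in at most one group per schema, which is why one soft key yields at most n cliques. The two schemas above produce the same clique for Frank, the situation in which the semantics combines the weights of several soft keys into one feature function.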
2.4 Problem Definition
We fix the probabilistic relational schema $\bar R$ and a Boolean conjunctive
query q, and study the following problem: given a set of
soft keys Γ and an instance I for $\bar R$, compute the marginal probability
P(q):
$$P(q) = \sum_{W \subseteq I :\ W \models q} P(W) \qquad (2)$$
where P is the probability distribution defined by M(Γ,I). We call
P(q) the value of q on I given the soft keys Γ.
Notice that the soft keys Γ are part of the input; this is because
we insist on evaluating q using an algorithm that is generic in the
sizes and weights in Γ.
In this paper we restrict our discussion to conjunctive queries
without self-joins, i.e. where every relation name occurs at most
once in the query. For example the two queries R(x,y),S(y,a,z)
and R(a,x,x),S(x,y,y,b),T(x,z) are without self-joins, while
the query R(x,y),R(y,z) has a self-join. This restriction is sim-
ilar to other complexity results for the query evaluation on proba-
bilistic databases [10, 20, 9, 21]. Queries that have self-joins are
significantly harder to analyze: the only case that has been stud-
ied is that of queries over tuple-independent databases [8], and that
turned out to be significantly harder.
3. SINGLE SUBGOAL QUERIES
We start our investigation by studying the complexity of queries
consisting of a single subgoal. Examples include q :- R(x,a,y),
or q :- R(x,x,x) or q :- R(x,y,z). These are select-project
queries, and we call them in this section disjunctive queries. In the
presence of soft keys, even such queries can be hard:
PROPOSITION 3.1. Consider a relation R(A,B,W) and the
query q :- R(x,y) (which simply checks if R is non-empty). Consider
two key schemas: σ1 = (x,R(x,y)) and σ2 = (y,R(x,y)).
Then the problem: for inputs I, w1,w2, compute the value of q on
I given the soft keys (σ1,2,w1) and (σ2,2,w2) is #P-hard.
PROOF. We will first prove hardness for w1 = w2 = −∞,
then show how to extend the proof to w1,w2 ∈ R. We use a reduction
from the IMPERFECT MATCHING (IPM) problem, which is:
given a bipartite graph G = (U,V,E), with E ⊆ U × V, compute
the number of matches (full or partial). A match is an M ⊆ E s.t.
every u ∈ U occurs at most once, and every v ∈ V occurs at most
once in M. IPM is #P-hard as shown in [28]. Given a graph G =
(U,V,E) for the IPM problem, we reduce it to an instance I over
the schema R(A,B,W) as follows: I = {(u,v,0) | (u,v) ∈ E}.
Thus, each tuple has weight 0. A world $W \in \mathcal{W}$ corresponds to
a subset of edges M ⊆ E. Denote $\mathcal{W}_1$ the set of worlds that are
matches, and $\mathcal{W}_0 = \mathcal{W} - \mathcal{W}_1$. If $W \in \mathcal{W}_1$ then Φ(W) = 1, and if
$W \in \mathcal{W}_0$ then Φ(W) = 0 (because either γ1 or γ2 is violated and
their weights are −∞). Denote $m = |\mathcal{W}_1|$ the number of partial
matches in G. We have:
$$Z = \sum_{W \in \mathcal{W}} \Phi(W) = m \qquad \Phi(q) = \sum_{W \in \mathcal{W},\ W \models q} \Phi(W) = Z - 1$$
because the only world in $\mathcal{W}$ that does not satisfy q is ∅, which is
also a partial match. It follows that P(q) = Φ(q)/Z = (m−1)/m,
hence m = 1/(1 − P(q)). This completes the reduction, and
shows that computing P(q) is #P-hard when w1 = w2 = −∞.
Now we prove hardness assuming that w1,w2 are in R, and part
of the input. Define w1 = w2 = w; we choose w later. Denoting
$\varepsilon = \sum_{W \in \mathcal{W}_0} \Phi(W)$ we have Z = m + ε and Φ(q) = Z − 1,
hence m = 1/(1 − P(q)) − ε. We choose w small enough to
ensure ε ≤ 1/2: this allows us to compute m as $\lfloor 1/(1 - P(q)) \rfloor$.
Namely, choose w s.t. $e^w \le 1/2^{|E|+1}$, thus $\forall W \in \mathcal{W}_0$, $\Phi(W) \le 1/2^{|E|+1}$,
which implies ε ≤ 1/2 because there are only $2^{|E|}$ possible worlds.
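The reduction can be checked numerically on a small graph. The sketch below counts the matchings of the complete bipartite graph on two plus two nodes directly, then recovers the same number from P(q) with a finite weight w chosen as in the proof. It assumes the penalty of Example 2.7, so a size-2 key with weight w contributes w·C(n,2) when n tuples share a key value. The graph and the function names are hypothetical.

```python
import math
from itertools import combinations

def count_matchings(edges):
    """Number of matchings (partial ones and the empty one included)."""
    count = 0
    for r in range(len(edges) + 1):
        for M in combinations(edges, r):
            us = [u for u, _ in M]
            vs = [v for _, v in M]
            if len(set(us)) == len(us) and len(set(vs)) == len(vs):
                count += 1
    return count

def prob_nonempty(edges, w):
    """P(q) for q :- R(x,y): tuples of weight 0, soft keys (s1,2,w), (s2,2,w)."""
    Z = phi_q = 0.0
    for r in range(len(edges) + 1):
        for W in combinations(edges, r):
            penalty = 0.0
            for pos in (0, 1):                  # sigma1 keys on A, sigma2 on B
                counts = {}
                for t in W:
                    counts[t[pos]] = counts.get(t[pos], 0) + 1
                penalty += sum(w * math.comb(n, 2) for n in counts.values())
            phi = math.exp(penalty)
            Z += phi
            if W:                               # q holds iff the world is non-empty
                phi_q += phi
    return phi_q / Z

edges = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2")]
w = -(len(edges) + 1) * math.log(2)             # e^w = 1 / 2^(|E|+1)
m = math.floor(1.0 / (1.0 - prob_nonempty(edges, w)))
```

Here `m` agrees with `count_matchings(edges)`: the violating worlds add only a small ε to Z, which the floor removes.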
Thus, the interaction between soft keys on the same relation can
make even a disjunctive query hard. We show, however, that if
the soft keys are “hierarchical”, then any disjunctive query can be
computed in polynomial time.
DEFINITION 3.2. Let γ1,γ2 be two soft keys on a relation R.
We say that they are hierarchical if either (1) $\mathit{Key}_{\gamma_1}(R) \subseteq \mathit{Key}_{\gamma_2}(R)$
or (2) $\mathit{Key}_{\gamma_1}(R) \supseteq \mathit{Key}_{\gamma_2}(R)$ or (3) the subgoals $g_1, g_2$ of γ1
and γ2 are incompatible (i.e. non-unifiable): we write $g_1 \cap g_2 = \emptyset$
to indicate that $g_1, g_2$ are incompatible.
Let Γ be a set of soft keys on R. We say that Γ is hierarchical if
∀γ1,γ2 ∈ Γ, the pair γ1,γ2 is hierarchical.
For example, consider the following soft keys on R(A,B,C,W)
(we show only the key schemas: the sizes and weights are not used
in the definition):
$$\gamma_1 = (y_1, R(x_1, a_1, y_1))$$
$$\gamma_2 = (y_2, R(y_2, a_2, x_2))$$
$$\gamma_3 = ((y_3, y_4), R(x_3, y_3, y_4))$$
• γ1,γ2 are hierarchical (assuming a1 ≠ a2). This is because
R(x1,a1,y1) and R(y2,a2,x2) are incompatible: in our notation,
R(x1,a1,y1) ∩ R(y2,a2,x2) = ∅.
• γ2,γ3 are non-hierarchical: $\mathit{Key}_{\gamma_2}(R) = \{A,B\}$,
$\mathit{Key}_{\gamma_3}(R) = \{B,C\}$, and their subgoals are compatible.
• γ1,γ3 are hierarchical because $\mathit{Key}_{\gamma_3}(R) = \{B,C\}$ and
$\mathit{Key}_{\gamma_1}(R) = \{B,C\}$.
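The three cases of Definition 3.2 can be tested mechanically. The following sketch encodes a key schema as a pair (set of key variables, subgoal), with variables as strings and constants wrapped in a small `Const` class; this encoding and the helper names are assumptions made for illustration. Unifiability is decided with a simple union-find over terms, assuming the two subgoals use disjoint variable names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    name: str

def key_attrs(key_vars, subgoal, attrs):
    """Key_sigma(R): attributes holding a constant or an x-bar variable in g."""
    return {a for a, term in zip(attrs, subgoal)
            if isinstance(term, Const) or term in key_vars}

def unifiable(g1, g2):
    """True if two subgoals (disjoint variable names) can match a common tuple."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for s, t in zip(g1, g2):
        rs, rt = find(s), find(t)
        if rs == rt:
            continue
        if isinstance(rs, Const) and isinstance(rt, Const):
            return False                      # two distinct constants clash
        if isinstance(rs, Const):             # keep a constant as representative
            rs, rt = rt, rs
        parent[rs] = rt
    return True

def hierarchical(k1, k2, attrs):
    """Cases (1)-(3) of Definition 3.2 for key schemas k1, k2."""
    a1, a2 = key_attrs(*k1, attrs), key_attrs(*k2, attrs)
    return a1 <= a2 or a2 <= a1 or not unifiable(k1[1], k2[1])

attrs = ("A", "B", "C")
g1 = ({"y1"}, ("x1", Const("a1"), "y1"))
g2 = ({"y2"}, ("y2", Const("a2"), "x2"))
g3 = ({"y3", "y4"}, ("x3", "y3", "y4"))
```

On these three schemas the function reproduces the classification given in the bullets above.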
We prove two results in this section:
THEOREM 3.3. If Γ is hierarchical, then any disjunctive query
on R can be evaluated in time $O(n^{1+\mathit{arity}(R)})$.
THEOREM 3.4. If Γ is non-hierarchical and contains no con-
stants, then every disjunctive query on R is #P-hard.
3.1 A PTIME Algorithm for Hierarchical SCMN
We prove here Theorem 3.3. In fact, we will prove a more general
result. Fix any SCMN $M = (\bar X, K, (w_i)_{i=1,n}, (f_c)_{c \in K})$, and
assume w.l.o.g. that $\forall c \in K$, $|c| \ge 2$.
DEFINITION 3.5. M is hierarchical if $\forall c_1, c_2 \in K$ either
$c_1 \cap c_2 = \emptyset$ or $c_1 \subseteq c_2$ or $c_1 \supseteq c_2$. The height of the hierarchy is the
largest number h s.t. there exist h cliques in K s.t.
$c_1 \supset c_2 \supset \ldots \supset c_h \neq \emptyset$.
THEOREM 3.6. Let M be a hierarchical SCMN of height h and
$Q \subseteq \bar X$ a set of tuples. Then the probability P(Q) defined as:
$$P(Q) = \sum_{W \in \mathcal{W} :\ Q \cap W \neq \emptyset} P(W)$$
can be computed in time $O(n^{h+1})$, where $n = |\bar X|$.
Theorem 3.6 proves Theorem 3.3: this is because if Γ is hierarchical
then for every I the SCMN M(Γ,I) is hierarchical and its
height is ≤ arity(R). Moreover, on the probabilistic instance I,
the query q is equivalent to a fixed set of tuples Q. More precisely,
defining Q ⊆ I (note that here $\bar X = I$) to be the set of tuples that
match the subgoal q, then we have for every world W ⊆ I: W ⊨ q
iff W ∩ Q ≠ ∅.
In the remainder of this section we prove Theorem 3.6.
The proof of the theorem is given by Algorithm 3.1, which
computes P(Q) using dynamic programming. We first describe
the notations used by the algorithm and explain it, then prove its
correctness and running time.
Algorithm 3.1 Computing P(Q) on a hierarchical SCMN
Inputs: SCMN $(\bar X, K, (w_i)_{i=1,n}, (f_c)_{c \in K})$, query $Q \subseteq \bar X$.
Outputs: P(Q).
Let $Z^0_\varepsilon = 1$, $S^0_\varepsilon = 0$
for k = 1,n do
  for $(s_1, \ldots, s_{h_k}) \in \{0, \ldots, k\}^{h_k}$ do
    Let $\bar s = (s_1, \ldots, s_{h_k})$
    Let $c_1 \supset \ldots \supset c_L$ be the common ancestors of $t_{k-1}, t_k$
    if $\bigwedge_{i=L+1}^{h_k} (s_i = 0)$ then
      Let $U = \sum_{\bar s' : \bar s'_L = \bar s_L} Z^{k-1}_{\bar s'}$ and $R = \sum_{\bar s' : \bar s'_L = \bar s_L} S^{k-1}_{\bar s'}$
    else
      U = R = 0
    end if
    if $\bigwedge_{i=L+1}^{h_k} (s_i = 1)$ then
      Let $F = \exp\left(w_k + \sum_{i=1,L} (f_{c_i}(s_i) - f_{c_i}(s_i - 1))\right)$
      $V = F \sum_{\bar s' : \bar s'_L = \bar s_L - 1} Z^{k-1}_{\bar s'}$
      if $t_k \in Q$ then
        T = V
      else
        $T = F \sum_{\bar s' : \bar s'_L = \bar s_L - 1} S^{k-1}_{\bar s'}$
      end if
    else
      Let T = V = 0
    end if
    Let $Z^k_{\bar s} = U + V$ and $S^k_{\bar s} = R + T$
  end for
end for
Let $\Phi(Q) = \sum_{\bar s} S^n_{\bar s}$ and $Z = \sum_{\bar s} Z^n_{\bar s}$
Return P(Q) = Φ(Q)/Z.
We start by defining the following forest T:
$$\mathit{Nodes}(T) = K \cup \{\{t_i\} \mid t_i \in \bar X\}$$
$$\mathit{Edges}(T) = \{(c, c') \mid c \supset c' \wedge \neg\exists c''.(c \supset c'' \supset c')\}$$
The leaves of this forest correspond precisely to the tuples $t_1, \ldots, t_n$.
Fix any order on the forest: this defines both an order on the leaf
nodes, $t_1, t_2, \ldots, t_n$, and of the internal nodes. The order ensures
that all tuples belonging to a clique form a contiguous subsequence of the leaf
nodes: $t_i, t_{i+1}, \ldots, t_j$. This was possible because the cliques are
hierarchical.
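The forest T also makes it easy to cross-check P(Q) on small inputs. The sketch below is not Algorithm 3.1: it swaps in a bottom-up convolution over the forest, in which each node carries a table indexed by the number of tuples chosen below it, because that variant is short enough to compare against brute-force enumeration. It assumes that the cliques are pairwise distinct, hierarchical, and of size at least 2, and that each feature function is given as a list of length |c|+1. All names and numbers are hypothetical.

```python
import math
from itertools import combinations

def brute_force(w, cliques, f, Q):
    """P(Q) by enumerating all worlds; exponential, for checking only."""
    tuples = sorted(w)
    Z = hit = 0.0
    for r in range(len(tuples) + 1):
        for combo in combinations(tuples, r):
            W = set(combo)
            phi = math.exp(sum(w[t] for t in W)
                           + sum(f[c][len(W & cliques[c])] for c in cliques))
            Z += phi
            if W & Q:
                hit += phi
    return hit / Z

def convolve(a, b):
    """Combine two count-indexed tables of disjoint tuple sets."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def forest_sum(w, cliques, f, excluded):
    """Sum of Phi(W) over worlds avoiding `excluded`, via the clique forest."""
    def parent(members):
        # Smallest clique strictly containing `members`, or None for a root.
        sup = [c for c in cliques if members < cliques[c]]
        return min(sup, key=lambda c: len(cliques[c])) if sup else None

    children = {c: [] for c in cliques}
    roots = []
    nodes = [("leaf", t, frozenset({t})) for t in w]
    nodes += [("clique", c, cliques[c]) for c in cliques]
    for node in nodes:
        p = parent(node[2])
        (children[p] if p is not None else roots).append(node)

    def table(node):
        # table[s] = total weight of sub-worlds with exactly s tuples below node
        kind, name, _ = node
        if kind == "leaf":
            return [1.0, 0.0 if name in excluded else math.exp(w[name])]
        acc = [1.0]
        for child in children[name]:
            acc = convolve(acc, table(child))
        return [x * math.exp(f[name][s]) for s, x in enumerate(acc)]

    return math.prod(sum(table(r)) for r in roots)

def prob_Q(w, cliques, f, Q):
    """P(Q) = 1 - (weight of worlds that miss Q) / Z."""
    return 1.0 - forest_sum(w, cliques, f, Q) / forest_sum(w, cliques, f, set())

w = {1: 0.5, 2: -1.0, 3: 0.2, 4: 1.0, 5: -0.3}
cliques = {"c1": frozenset({1, 2, 3}), "c2": frozenset({1, 2}),
           "c3": frozenset({4, 5})}
f = {"c1": [0, 0.3, -2.0, -4.0], "c2": [0, 0, -1.5], "c3": [0, 0, 2.0]}
Q = {2, 5}
```

On this instance, which has height 2, `prob_Q` and `brute_force` return the same value up to rounding.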
We now define $O((n+1)^{h+1})$ subsets of $\mathcal{W}$. For k = 0,...,n,
denote:
• $\bar X_k = \{t_1, \ldots, t_k\}$ (the first k tuples).
• $h_k$ = the number of cliques containing $t_k$ (for k = 0, we set
$h_k = 0$). Note that $h_k \le h$.
• $c_1 \supset c_2 \supset \ldots \supset c_{h_k} \supset \{t_k\}$ the longest path in T to $t_k$.
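DEFINITION 3.7. Let k ∈ {0,...,n} and $\bar s = (s_1, \ldots, s_{h_k})$,
where $s_1, \ldots, s_{h_k} \in \{0, \ldots, n\}$. Define:
$$\mathcal{W}^k_{\bar s} = \{W \mid W \subseteq \bar X_k \wedge \forall i \in [h_k].\ |W \cap c_i| = s_i\}$$
$$\mathcal{Q}^k_{\bar s} = \{W \mid W \in \mathcal{W}^k_{\bar s} \wedge W \cap Q \neq \emptyset\}$$
In other words, $\mathcal{W}^k_{\bar s}$ consists of all worlds that (a) use only the
first k tuples, and (b) their intersection with the cliques $c_1, \ldots, c_{h_k}$
on the path from a root to $t_k$ have cardinalities $s_1, \ldots, s_{h_k}$. Note
that we have the following:
$$\mathcal{W}^0_\varepsilon = \{\emptyset\} \qquad \mathcal{W} = \bigcup_{\bar s} \mathcal{W}^n_{\bar s}$$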
The algorithm computes iteratively $O((n+1)^{h+1})$ quantities:
$$Z^k_{\bar s} = \sum_{W \in \mathcal{W}^k_{\bar s}} \Phi(W) \qquad S^k_{\bar s} = \sum_{W \in \mathcal{Q}^k_{\bar s}} \Phi(W)$$
Then $Z = \sum_{\bar s} Z^n_{\bar s}$, and $\Phi(Q) = \sum_{\bar s} S^n_{\bar s}$, which allows us to compute
P(Q) = Φ(Q)/Z. We still need to introduce a few notations
used by the algorithm. Given $\bar s = (s_1, s_2, \ldots, s_d)$ denote:
$$\bar s - 1 = (s_1 - 1, s_2 - 1, \ldots, s_d - 1)$$
$$\bar s_L = (s_1, s_2, \ldots, s_L) \text{ for } L \le d$$
Given two consecutive leaves $t_{k-1}, t_k$ in the forest T, we denote
with $c_1 \supset c_2 \supset \ldots \supset c_L$ all their common ancestors.
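We now prove the correctness of the algorithm.
Consider a world $W \in \mathcal{W}^k_{\bar s}$. There are two cases. The first case
is when $t_k \notin W$; then W contains at most the tuples $t_1, \ldots, t_{k-1}$.
Recall that $c_1 \supset \ldots \supset c_L$ are the common ancestors of $t_{k-1}$ and
$t_k$, thus $W \in \mathcal{W}^{k-1}_{\bar s'}$ for some $\bar s'$ s.t. $s'_1 = s_1, \ldots, s'_L = s_L$. Let
$c_{L+1} \supset \ldots \supset c_{h_k}$ be the rest of the path to $t_k$: in all these cliques,
$t_k$ is the smallest element, hence W cannot contain any tuples from
these cliques. We therefore must have $s_{L+1} = \ldots = s_{h_k} = 0$.
This completes the analysis of the first case. The second case is
when $t_k \in W$: then we must have $W - \{t_k\} \in \mathcal{W}^{k-1}_{\bar s'}$, where
$s'_1 = s_1 - 1, \ldots, s'_L = s_L - 1$ (since we removed $t_k$), and we can
argue similarly that $s_{L+1} = \ldots = s_{h_k} = 1$. This allows us to
derive a recursive formula for $\mathcal{W}^k_{\bar s}$. We can derive a similar one for
$\mathcal{Q}^k_{\bar s}$: the only extra wrinkle here is that, when $t_k \in W$ then we also
need to check if $t_k \in Q$: if not then we recur with $\mathcal{Q}^{k-1}_{\bar s'}$; if yes, then
we recur with $\mathcal{W}^{k-1}_{\bar s'}$. We state the resulting recurrence formally. In
the lemma below we denote A ∪ {t} the set {W ∪ {t} | W ∈ A}
for a set of worlds $A \subseteq \mathcal{W}$ and a tuple t:
LEMMA 3.8. $\mathcal{W}^k_{\bar s} = U^k_{\bar s} \cup V^k_{\bar s}$, where
$$U^k_{\bar s} = \begin{cases} \bigcup_{\bar s' : \bar s'_L = \bar s_L} \mathcal{W}^{k-1}_{\bar s'} & \text{if } \bigwedge_{i=L+1}^{h_k} s_i = 0 \\ \emptyset & \text{otherwise} \end{cases}$$
$$V^k_{\bar s} = \begin{cases} \left(\bigcup_{\bar s' : \bar s'_L = \bar s_L - 1} \mathcal{W}^{k-1}_{\bar s'}\right) \cup \{t_k\} & \text{if } \bigwedge_{i=L+1}^{h_k} s_i = 1 \\ \emptyset & \text{otherwise} \end{cases}$$
If $t_k \in Q$ then $\mathcal{Q}^k_{\bar s} = R^k_{\bar s} \cup V^k_{\bar s}$ and if $t_k \notin Q$ then
$\mathcal{Q}^k_{\bar s} = R^k_{\bar s} \cup T^k_{\bar s}$, where:
$$R^k_{\bar s} = \begin{cases} \bigcup_{\bar s' : \bar s'_L = \bar s_L} \mathcal{Q}^{k-1}_{\bar s'} & \text{if } \bigwedge_{i=L+1}^{h_k} s_i = 0 \\ \emptyset & \text{otherwise} \end{cases}$$
$$T^k_{\bar s} = \begin{cases} \left(\bigcup_{\bar s' : \bar s'_L = \bar s_L - 1} \mathcal{Q}^{k-1}_{\bar s'}\right) \cup \{t_k\} & \text{if } \bigwedge_{i=L+1}^{h_k} s_i = 1 \\ \emptyset & \text{otherwise} \end{cases}$$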
We can now prove:
PROPOSITION 3.9. Algorithm 3.1 correctly computes the probability $P(Q)$ of a disjunctive query $Q$ over a hierarchical SCMN.
PROOF. We need to show that the quantities $Z^k_{\bar s}$ and $S^k_{\bar s}$ are computed correctly. This follows immediately from the previous lemma, observing that the values denoted $U$, $V$, $R$, $T$ in the algorithm are precisely $\Phi(U^k_{\bar s})$, $\Phi(V^k_{\bar s})$, $\Phi(R^k_{\bar s})$, $\Phi(T^k_{\bar s})$. The crux of the correctness proof lies in examining the case $t_k \in W$: then $\Phi(W) = \Phi(W - \{t_k\}) F$, where the factor $F$ is
$$F = \exp\Big(w_k + \sum_{i=1}^{L} \big(f_{c_i}(s_i) - f_{c_i}(s_i - 1)\big)\Big).$$
This is because $W$ contributes, in addition to $W - \{t_k\}$, the weight $w_k$ for the tuple $t_k$ and the weight $f_{c_i}(s_i)$ for the clique $c_i$; on the other hand $W - \{t_k\}$ contributes a weight $f_{c_i}(s_i - 1)$ for that clique, which justifies the formula for $F$.
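The identity $\Phi(W) = \Phi(W - \{t_k\}) F$ can be checked by brute force on a toy instance. The sketch below is only an illustration under stated assumptions: it takes $\Phi(W) = \exp(\sum_{t \in W} w_t + \sum_c f_c(|W \cap c|))$, sums the correction over every clique that contains $t_k$ (which matches the formula above whenever the feature of a clique holding only $t_k$ is zero), and uses made-up weights, cliques and a hypothetical `penalty` feature.

```python
import math
from itertools import combinations

def phi(world, weights, cliques, features):
    """Assumed potential: exp(tuple weights + clique features on |W & c|)."""
    total = sum(weights[t] for t in world)
    for name, members in cliques.items():
        total += features[name](len(world & members))
    return math.exp(total)

def factor(tk, world, weights, cliques, features):
    """F = exp(w_k + sum over cliques containing t_k of f_c(s_c) - f_c(s_c - 1))."""
    delta = weights[tk]
    for name, members in cliques.items():
        if tk in members:
            s = len(world & members)  # s >= 1 because tk is in the world
            delta += features[name](s) - features[name](s - 1)
    return math.exp(delta)

# Hypothetical toy data: nested cliques c1 ⊃ c2, penalty only for 2+ tuples.
weights = {"t1": 0.3, "t2": -0.2, "t3": 0.5}
cliques = {"c1": frozenset({"t1", "t2", "t3"}), "c2": frozenset({"t2", "t3"})}
penalty = lambda n: -1.5 * max(0, n - 1)
features = {"c1": penalty, "c2": penalty}

def check_all(tk):
    """Check Phi(W) == Phi(W - {tk}) * F for every world W containing tk."""
    others = [t for t in weights if t != tk]
    for r in range(len(others) + 1):
        for rest in combinations(others, r):
            world = frozenset(rest) | {tk}
            lhs = phi(world, weights, cliques, features)
            rhs = phi(world - {tk}, weights, cliques, features) * factor(
                tk, world, weights, cliques, features)
            if not math.isclose(lhs, rhs):
                return False
    return True
```

Calling `check_all("t3")` enumerates the four worlds that contain `t3` and compares both sides of the identity for each.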
3.2 Hardness of Non-hierarchical Keys
In this section we prove Theorem 3.4. For that we first extend
Proposition 3.1:
PROPOSITION 3.10. Consider the relation $R(A,B,W)$ and the key schemas $\sigma_1 = (x, R(x,y))$ and $\sigma_2 = (y, R(x,y))$ as in Proposition 3.1. Consider the following four queries:
$q_1$ :− $R(x,y)$
$q_2$ :− $R(x,x)$
$q_3$ :− $R(a,y)$
$q_4$ :− $R(a,b)$
where $a, b$ are constants. Then, for each of the queries $q_i$, $i = 1,2,3,4$, the problem: for inputs $I$, $w_1$, $w_2$, compute the value of $q_i$ on $I$ given the soft keys $(\sigma_1, 2, w_1)$ and $(\sigma_2, 2, w_2)$ is #P-hard.
Note that this extends Proposition 3.1 from $q_1$ to $q_2$, $q_3$, and $q_4$. We refer the reader to the full version of this paper for the proof.
Now we can prove Theorem 3.4 by reduction from one of the four queries in Proposition 3.10. Let $\Gamma$ be a set of soft keys without constants. Since it is non-hierarchical, there exist two soft keys $\gamma_1, \gamma_2$ s.t. $Key_{\gamma_1}(R)$ and $Key_{\gamma_2}(R)$ are incomparable sets. (Note that case (3) of Definition 3.2 cannot happen because there are no constants.) Thus, there exist two attributes $A, B$ s.t. $A \in Key_{\gamma_1}(R) - Key_{\gamma_2}(R)$ and $B \in Key_{\gamma_2}(R) - Key_{\gamma_1}(R)$. We now examine the query $q$ on these two attributes (recall that $q$ is a single subgoal): it can have two distinct variables, the same variable, a variable and a constant, or two constants. We then do a reduction from the corresponding query $q_i$ in Proposition 3.10, by setting the weights of all soft keys other than $\gamma_1, \gamma_2$ to 0.
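The first step of this argument, finding two incomparable key sets and the witnessing attributes $A$ and $B$, can be sketched as follows. This is an illustrative helper only: the dictionary representation of keys and the function name `incomparable_pair` are assumptions, and the sketch covers keys without constants, as in the proof.

```python
from itertools import combinations

def incomparable_pair(key_sets):
    """Return (key1, key2, A, B) with A in Key1 - Key2 and B in Key2 - Key1,
    or None if every pair of key attribute sets is comparable by inclusion."""
    for (n1, k1), (n2, k2) in combinations(key_sets.items(), 2):
        only1, only2 = k1 - k2, k2 - k1
        if only1 and only2:
            # min() just picks a deterministic witness attribute.
            return n1, n2, min(only1), min(only2)
    return None

# Hypothetical key attribute sets for soft keys on R(A, B).
non_hierarchical = {"g1": {"A"}, "g2": {"B"}}
nested = {"g1": {"A"}, "g2": {"A", "B"}}
```

For `non_hierarchical` the helper returns the pair of keys together with the attributes `A` and `B` used in the reduction; for `nested` it returns `None` because one key set contains the other.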
4. CONJUNCTIVE QUERIES
Consider a relational schema $\bar R$ with probabilistic relations $R_1, \ldots, R_m$. Let $\Gamma = \{\gamma_1 \cup \ldots \cup \gamma_m\}$ be the set of soft keys s.t. the soft key $\gamma_i$ applies to $R_i$. Thus, each relation $R_i$ has exactly one soft key: this includes the case when there is no soft key for $R_i$, because in that case we can define the soft key for $R_i$ to be the no-op key, whose schema is $(\bar x, R_i(\bar x))$, where $R_i(\bar x)$ has a distinct variable for each attribute: $R_i(\bar x) = R_i(x_1, x_2, \ldots)$. In this section we impose a restriction on the soft keys: their subgoals may have no repeated variables, e.g. we allow $((x,y), R(y,a,z,b,x))$ but not $(x, R(x,x,y,y))$. We also assume there are no trivial soft keys, i.e. all soft keys have non-zero weight; otherwise it does not make sense to consider them.
DEFINITION 4.1. Let $g$ be a subgoal over $R$ whose soft key has schema $\gamma = (\bar x, g')$. We define the set $Key(g) \subseteq Vars(g)$ as follows. First, if $g \cap g' = \emptyset$ (see Def. 3.2) then $Key(g) = Vars(g)$; otherwise $Key(g)$ consists of all variables in $g$ that occur on a position where $g'$ has a key variable.
For example, consider the relation $R(A,B,C,W)$ with soft key $\gamma = (x, R(a,x,y))$. Then
$Key(R(a,y,z)) = \{y\}$
$Key(R(b,y,z)) = \{y,z\}$
$Key(R(x,b,z)) = \emptyset$
Let $q$ be a conjunctive query. For a variable $x \in Vars(q)$ we denote $Sg(x) = \{g \mid x \in Key(g)\}$. Thus, $Sg(x)$ contains all subgoals in which $x$ occurs in a key position.
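A small sketch of how $Key(g)$ could be computed for the example above. It rests on an assumption about Definition 3.2, which is not restated here: $g \cap g' = \emptyset$ is read as "some position holds two different constants". The `Var` class, the tuple encoding of subgoals, and the function names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str  # variables are Var objects; constants are plain strings

def is_var(term):
    return isinstance(term, Var)

def disjoint(g, key_goal):
    """Assumed reading of g ∩ g' = ∅: two different constants at one position."""
    return any(not is_var(s) and not is_var(t) and s != t
               for s, t in zip(g, key_goal))

def key(g, key_vars, key_goal):
    """Key(g) for a subgoal g, given the soft key schema (key_vars, key_goal)."""
    if disjoint(g, key_goal):
        return {t.name for t in g if is_var(t)}  # Key(g) = Vars(g)
    key_positions = [i for i, t in enumerate(key_goal)
                     if is_var(t) and t.name in key_vars]
    return {g[i].name for i in key_positions if is_var(g[i])}

# Soft key gamma = (x, R(a, x, y)) from the example.
key_goal = ("a", Var("x"), Var("y"))
key_vars = {"x"}
```

With these definitions, `key(("a", Var("y"), Var("z")), key_vars, key_goal)` gives `{"y"}`, replacing the first constant by `"b"` gives `{"y", "z"}`, and `key((Var("x"), "b", Var("z")), key_vars, key_goal)` gives the empty set, matching the three cases listed above.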
DEFINITION 4.2 (SAFE QUERIES). Let $q$ be a boolean conjunctive query without self-joins. $q$ is safe for soft keys $\Gamma$ if one of the following holds:
1. Base case: $q := g$, where $g$ is a single subgoal.
2. Disconnected components: $q = q_1 q_2$ where $Vars(q_1) \cap Vars(q_2) = \emptyset$, and $q_1$ and $q_2$ are both safe.
3. Projectable Variable: $\exists x \in Vars(q)$, such that (a) $\forall g \in Sg(q)$, $x \in Vars(g)$ (i.e. $x$ appears in all subgoals), (b) $\forall y \in Vars(q)$, $Sg(y) \subseteq Sg(x)$ and (c) for every constant $a$, $q[a/x]$ is safe.
Thus, if a variable $x$ is a projectable variable then $x$ must occur in all subgoals. It does not have to occur everywhere in key positions, but the set of subgoals where it does not occur in a key position should not have any other key variable either. Note that the safety of $q[a/x]$ is independent of the choice of the constant $a$.
Example 4.3 We illustrate with two queries, and underline in each subgoal $g$ the variables in $Key(g)$: thus, $R(\underline{x},y)$ means that there exists a soft key with schema $(x, R(x,y))$, while $S(y)$ means there exists a soft key $(\emptyset, S(y))$:
$q_1 = R(\underline{x},y), S(x)$
$q_2 = R(\underline{x},y), S(y)$
$q_1$ is safe: $x$ occurs everywhere, and $Sg(x) = \{R\}$ while $Sg(y) = \emptyset$. On the other hand $q_2$ is unsafe: $x$ does not occur everywhere. While $y$ does occur everywhere, $Sg(y) = \emptyset$, while $Sg(x) = \{R\}$.
Intuitively, the projectable variable condition states that there is a variable $x$ in all the subgoals of $q$, and all other variables $y$ in each subgoal $g$ are such that $y \le_g x$. We call $x$ the projectable variable.
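The recursive structure of the safety definition can be sketched as a small checker. This is an illustration under assumptions, not the paper's algorithm: each subgoal is given with its variable set and a precomputed $Key(g)$, subgoals have no repeated variables, and substituting a constant for $x$ is modeled by simply dropping $x$ from both sets. The names `Subgoal`, `components` and `is_safe` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subgoal:
    name: str
    vars: frozenset
    key: frozenset  # Key(g), a subset of vars

def components(q):
    """Group subgoals into connected components by shared variables."""
    comps = []
    for g in q:
        merged, rest = [g], []
        for c in comps:
            if any(g.vars & h.vars for h in c):
                merged.extend(c)
            else:
                rest.append(c)
        comps = rest + [merged]
    return comps

def substitute(g, x):
    """Model q[a/x]: x stops being a variable of g."""
    return Subgoal(g.name, g.vars - {x}, g.key - {x})

def is_safe(q):
    if len(q) <= 1:                      # base case: single subgoal
        return True
    comps = components(q)
    if len(comps) > 1:                   # disconnected components
        return all(is_safe(c) for c in comps)
    all_vars = frozenset().union(*(g.vars for g in q))
    sg = {v: {g.name for g in q if v in g.key} for v in all_vars}
    for x in all_vars:                   # look for a projectable variable
        everywhere = all(x in g.vars for g in q)
        dominates = all(sg[y] <= sg[x] for y in all_vars)
        if everywhere and dominates and is_safe([substitute(g, x) for g in q]):
            return True
    return False

# The two queries of Example 4.3: x is a key variable of R only.
q1 = [Subgoal("R", frozenset("xy"), frozenset("x")), Subgoal("S", frozenset("x"), frozenset())]
q2 = [Subgoal("R", frozenset("xy"), frozenset("x")), Subgoal("S", frozenset("y"), frozenset())]
```

On these inputs `is_safe(q1)` holds while `is_safe(q2)` does not, in line with the discussion of Example 4.3.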
THEOREM 4.4. Let $\Gamma$ be a set of soft keys, with one key per relation. Let $q$ be a conjunctive query without self-joins. If $q$ is safe for $\Gamma$ then it can be evaluated in PTIME. If $q$ is unsafe for $\Gamma$, then it is #P-hard.
In the remainder of this section we prove the theorem.
4.1 Hardness of unsafe queries
We start by proving that every unsafe query is #P-hard. We use
a result in [9] that establishes a dichotomy of conjunctive queries
without self-joins on disjoint-independent probabilistic databases.
We briefly review that result, using the terminology in our paper.
DEFINITION 4.5. A soft key $\gamma = (\sigma, s, w)$ on a relation $R$ is called a hard key if $s = 2$ and $w = -\infty$.
A hard key is called a standard key if its subgoal $g$ has no constants.
A standard key is called a trivial key if $\bar x = Vars(g)$.
We illustrate with three keys on $R(A,B,C,W)$:
$\gamma_1 = ((x, R(x,a,y)), 2, -\infty)$
$\gamma_2 = ((x, R(x,y,z)), 2, -\infty)$
$\gamma_3 = (((x,y,z), R(x,y,z)), 2, -\infty)$
$\gamma_1$ is hard, $\gamma_2$ is standard (it says that $A$ is a key), and $\gamma_3$ is trivial (it says that $A, B, C$ are a key, which is the same as not giving any key for $R$).
DEFINITION 4.6. A disjoint-independent probabilistic database instance is an instance $I$ together with a set of standard keys $\Gamma$.
We now review the definition of safe queries over disjoint-independent probabilistic databases, which we call here h-safe queries to distinguish them from ours, and review the dichotomy result on disjoint-independent probabilistic databases. Recall that $Key(g)$ denotes the set of variables that appear in a key position; denote by $NKey(g)$ the set of variables that appear in a non-key position in $g$. These sets are not necessarily disjoint, because variables may be repeated.
DEFINITION 4.7. [9] A Boolean conjunctive query is h-safe for a set of keys $\Gamma$ if:
1. Base case: $q := g$ where $g$ is a single subgoal.
2. Disconnected components: $q = q_1 q_2$ where $Vars(q_1) \cap Vars(q_2) = \emptyset$ and $q_1, q_2$ are both h-safe.
3. Independent project: $\exists x \in Vars(q)$ s.t. $\forall g \in Sg(q)$, $x \in Key(g)$, and for any constant $a$, $q[a/x]$ is h-safe. Thus, $x$ must occur in a key position in every subgoal.
4. Disjoint project: $\exists g \in Sg(q)$ s.t. $Key(g) = \emptyset$ and $\exists y \in NKey(g)$ s.t. $q[a/y]$ is h-safe. Thus, $y$ must occur in a non-key position in $g$ and $g$ has no key variables.
The dichotomy for disjoint-independent databases is:
THEOREM 4.8. [9] Let Γ be a set of hard keys. If a query q is
h-safe for Γ, then it can be computed in PTIME. If it is not h-safe,
then it is #P-hard.
Now consider a set of soft keys $\Gamma$ without variables for the relations $\bar R$. A hardening of $\Gamma$ is a set of keys $\Gamma^h$ obtained by either hardening or trivializing every soft key in $\Gamma$: that is, for every key $(\sigma, s, w)$ in $\Gamma$ there is a key $(\sigma, 2, w)$ in $\Gamma^h$ where $w = -\infty$ or $w = 0$. We prove the following:
PROPOSITION 4.9. Let $\Gamma$ be a set of soft keys without variables and $q$ be a query. Then $q$ is safe w.r.t. $\Gamma$ iff for every hardening $\Gamma^h$ of $\Gamma$, $q$ is h-safe w.r.t. $\Gamma^h$.
PROOF. We prove “only if” by induction on the structure of the query. The interesting case is given by condition 3: the others are straightforward. Let $x$ occur in all subgoals. If $x$ occurs in each subgoal in a key position, then the independent project case applies and we conclude that $q$ is h-safe. So suppose $x$ occurs in some subgoal $g$ only on non-key positions. Then no variable $y$ can occur in $g$ on a key position, otherwise $Sg(y) \not\subseteq Sg(x)$. So $Key(g) = \emptyset$, hence $x$ appears in a non-key position and we apply a disjoint project on $x$, hence $q$ is h-safe.
We now prove the “if” direction. Here also we consider condition 3 of the safety definition, and assume that it fails. Then we “harden” the key constraints as follows. Let $G_1 = \{g \mid Key(g) = \emptyset\}$ and $G_2 = \{g \mid Key(g) \neq \emptyset\}$. For all $g \in G_1$ we trivialize its key, hence $Key^h(g) = Vars(g)$. For all $g \in G_2$ we harden the key, hence $Key^h(g) = Key(g)$. We prove that the resulting query is h-unsafe. Indeed, a disjoint project is not possible, since we trivialized all keys in $G_1$ where such a project would have been possible. Suppose an independent project is possible on some variable $x$. We show that $x$ is a projectable variable by checking condition 3. Clearly it occurs in all subgoals. In addition $Sg(x) = G_2$. Moreover, for any variable $y$, $Sg(y) \subseteq G_2$, since no variables occur in key positions in $G_1$. Contradiction.
Recall the query $q_1 = R(\underline{x},y), S(x)$ in Example 4.3. There are four ways to harden it: in each subgoal $g$ either keep $Key(g)$ unchanged, or include all variables. Each of these is h-safe. For example $q_1$ is h-safe because we do a disjoint project on $y$; $R(\underline{x},y), S(\underline{x})$ is h-safe because we do an independent project on $x$. On the other hand $q_2$ is not safe, because the hardening $R(\underline{x},y), S(\underline{y})$ is h-unsafe.
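The hardening argument can be replayed mechanically on small queries. The sketch below is an assumption-laden illustration: it encodes each subgoal by its variables and $Key(g)$, implements the four rules of h-safety for subgoals without repeated variables (so $NKey(g) = Vars(g) - Key(g)$), and enumerates the $2^m$ hardenings by either keeping $Key(g)$ or replacing it with $Vars(g)$. All names are hypothetical.

```python
from itertools import product
from typing import NamedTuple

class Subgoal(NamedTuple):
    name: str
    vars: frozenset
    key: frozenset

def components(q):
    """Group subgoals into connected components by shared variables."""
    comps = []
    for g in q:
        merged, rest = [g], []
        for c in comps:
            if any(g.vars & h.vars for h in c):
                merged.extend(c)
            else:
                rest.append(c)
        comps = rest + [merged]
    return comps

def subst(q, v):
    """Model q[a/v]: v stops being a variable in every subgoal."""
    return [Subgoal(g.name, g.vars - {v}, g.key - {v}) for g in q]

def h_safe(q):
    if len(q) <= 1:
        return True
    comps = components(q)
    if len(comps) > 1:
        return all(h_safe(c) for c in comps)
    all_vars = frozenset().union(*(g.vars for g in q))
    for x in all_vars:   # independent project: x is a key variable everywhere
        if all(x in g.key for g in q) and h_safe(subst(q, x)):
            return True
    for g in q:          # disjoint project: g has no key variables
        if not g.key and any(h_safe(subst(q, y)) for y in g.vars):
            return True
    return False

def hardenings(q):
    """Each key is either hardened (Key kept) or trivialized (Key = Vars)."""
    for choice in product([False, True], repeat=len(q)):
        yield [Subgoal(g.name, g.vars, g.vars if trivial else g.key)
               for g, trivial in zip(q, choice)]

q1 = [Subgoal("R", frozenset("xy"), frozenset("x")), Subgoal("S", frozenset("x"), frozenset())]
q2 = [Subgoal("R", frozenset("xy"), frozenset("x")), Subgoal("S", frozenset("y"), frozenset())]
```

Here `all(h_safe(h) for h in hardenings(q1))` holds, whereas for `q2` the hardening that trivializes the key of `S` fails the h-safety test.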
We use this to prove:
COROLLARY 4.10. If a query q is unsafe for a set of soft keys
Γ then q is #P-hard.
For the proof, we first use Prop. 4.9 to harden the keys s.t. $q$ is h-unsafe w.r.t. $\Gamma^h$. Next, we observe that we can remove the constants in these hard keys by eliminating the corresponding attribute in the relation: the new query (over a new schema) is still h-unsafe, but now all keys are standard; thus the new query is #P-hard by Theorem 4.8. Finally, we use the same technique as in Proposition 3.1 to replace weights $-\infty$ with weights $w \in \mathbb{R}$.
4.2 Algorithm for safe queries
We explain and prove the algorithm with the following sequence
of definitions and proofs.
Given a soft key $\gamma = (\_, g)$ we denote $g$ by $g_\gamma$.
DEFINITION 4.11. Homogenization: Consider a relation $R$ which has a soft key $\gamma$. Given a subgoal $g$ over $R$, the homogenized instance $hom(R, g, \gamma)$ is defined as $R^t = \sigma_{\bar A = \bar a} R$, where $\bar A$ are the attribute positions where $g_\gamma$ has key attributes and $g$ has constants $\bar a$.
Example 4.12
$hom(R(A,B,C), R(a,y,z), ((x,y), R(x,y,z))) = \sigma_{A=a} R$
$hom(R(A,B,C), R(x,y,a), ((x,y), R(x,y,z))) = R$
$hom(R(A,B,C), R(a,b,a), (y, R(a,y,z))) = \sigma_{A=a,B=b} R$
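As a rough illustration of homogenization, the following sketch filters a relation on the positions where the key subgoal has a key variable and the query subgoal has a constant. It follows the wording of Definition 4.11 literally and therefore only reproduces the first two cases of Example 4.12; how positions that hold a constant in the key subgoal itself are treated (as in the third case) is not modeled. The `Var` class, the tuple encoding and the sample rows are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str  # variables are Var objects; constants are plain strings

def is_var(term):
    return isinstance(term, Var)

def hom(relation, g, key_vars, key_goal):
    """Keep rows matching g's constants on key-variable positions of key_goal."""
    positions = [i for i, t in enumerate(key_goal)
                 if is_var(t) and t.name in key_vars and not is_var(g[i])]
    return [row for row in relation if all(row[i] == g[i] for i in positions)]

# Hypothetical instance of R(A, B, C) and the key ((x, y), R(x, y, z)).
rows = [("a", "b", "c"), ("d", "b", "a"), ("a", "e", "a")]
key_goal = (Var("x"), Var("y"), Var("z"))
key_vars = {"x", "y"}
```

For the subgoal `("a", Var("y"), Var("z"))` only the rows with `A = a` survive, while for `(Var("x"), Var("y"), "a")` the relation is returned unchanged because the constant sits on a non-key position.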
Consider a CQ $q = g_1 \ldots g_m$ over $\bar R$. Since the query has no self-join, we can assume $g_i$ is over $R_i$ with soft key $\Gamma_i = (\_, G_i)$. We assume that the subgoals of the query have no repeated variables. For example, $R(x,x,y)$ is not allowed. This is just for ease of presentation; the algorithm can be made to work for that case as well.
DEFINITION 4.13. Let $\bar s = (s_1, \ldots, s_m)$ with $s_i \in \mathbb{N} \cup \{0\}$. Define:
$$\mathcal{W}_{\bar R, \bar s} = \{W \mid W \subseteq \bar R \wedge \forall i\; |W \cap R_i \cap G_i| = s_i\}$$
$$Q_{\bar R, \bar s} = \{W \mid W \in \mathcal{W}_{\bar R, \bar s} \wedge W \models q\}$$
$$Z_{\bar R, \bar s} = \sum_{W \in \mathcal{W}_{\bar R, \bar s}} \Phi(W)$$
$$\Phi_{\bar R, \bar s} = \sum_{W \in Q_{\bar R, \bar s}} \Phi(W)$$
So $\mathcal{W}_{\bar R, \bar s}$ is the set of all worlds of $\bar R$ s.t. $\forall 1 \le i \le m$, the number of tuples from $R_i$ that are influenced by the soft key is exactly $s_i$. $Q_{\bar R, \bar s}$ is the subset of $\mathcal{W}_{\bar R, \bar s}$ where the query is true. $Z_{\bar R, \bar s}$ and $\Phi_{\bar R, \bar s}$ are respectively the sums of potentials of all these worlds. Hence we have
$$Pr(q) = \frac{\sum_{\bar s} \Phi_{\bar R, \bar s}}{\sum_{\bar s} Z_{\bar R, \bar s}}$$

Algorithm 4.1 To compute $\Phi_{\bar R, \bar s}(q)$, $Z_{\bar R, \bar s}$
Inputs: CQ $q$, over $\bar R$. $\bar s \in \{\mathbb{N} \cup 0\}^m$.
Outputs: $\Phi_{\bar R, \bar s}(q)$, $Z_{\bar R, \bar s}(q)$
if $q = q_1 q_2$ and $Vars(q_1) \cap Vars(q_2) = \emptyset$ then
    Calculate the components corresponding to $q_1$ and $q_2$ separately and then multiply them
end if
if $q$ is a single subgoal query then
    Use Algorithm 3.1
end if
if $\exists$ Projectable Variable $x$ then
    if $\exists i$ s.t. $s_i > |R_i|$ then
        Return 0, 0
    end if
    Let $Z^0_{\bar s} = 1$ if $\bar s = \bar 0$ else 0
    Let $S^0_{\bar s} = 0 \;\forall \bar s$
    Let $dom(\bigcup A_{ix}) = \{a_1 \ldots a_n\}$
    Let $q = g_1 \ldots g_m$
    for $k = 1, n$ do
        Let $q' = q[a_k/x]$
        $SS = 0$
        $ZZ = 0$
        Let $R'_i = \{t \in R_i \mid t.A_{ix} = a_k\}$
        for $\bar j = 0^m, \bar s$ do
            Let $\bar d = (\min(|R'_1 \cap G_1|, j_1), \ldots, \min(|R'_m \cap G_m|, j_m))$
            for $\bar i = 0^m, \bar d$ do
                Let $F = \exp\big(\sum_{p=1,m} (f_p(j_p) - f_p(j_p - i_p))\big)$
                Let $P_k = \Phi_{\bar R', \bar i}(q')$
                Let $Z_k = Z_{\bar R', \bar i}$
                Let $U = Z^{k-1}_{\bar j - \bar i} P_k$
                Let $V = S^{k-1}_{\bar j - \bar i} (Z_k - P_k)$
                Let $R = Z^{k-1}_{\bar j - \bar i} Z_k$
                $SS = SS + F(U + V)$
                $ZZ = ZZ + FR$
            end for
            $S^k_{\bar j} = SS$
            $Z^k_{\bar j} = ZZ$
        end for
    end for
end if
Return $S^n_{\bar s}$, $Z^n_{\bar s}$
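The decomposition of $Pr(q)$ into per-$\bar s$ sums from Definition 4.13 can be illustrated by brute-force enumeration on a single small relation. This is a sketch under assumptions, not Algorithm 4.1: it uses one relation with one soft key, takes $\Phi(W) = \exp(\sum_{t \in W} w_t + f(|W \cap G|))$, and all names, weights and the sample query are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def worlds(tuples):
    """Enumerate every subset of the given tuples."""
    for r in range(len(tuples) + 1):
        for combo in combinations(tuples, r):
            yield frozenset(combo)

def grouped_potentials(weights, influenced, feature, query):
    """Return (Phi_s, Z_s): potentials grouped by s = |W & influenced|."""
    z = defaultdict(float)
    phi_q = defaultdict(float)
    for w in worlds(sorted(weights)):
        s = len(w & influenced)
        potential = math.exp(sum(weights[t] for t in w) + feature(s))
        z[s] += potential
        if query(w):
            phi_q[s] += potential
    return phi_q, z

def probability(weights, influenced, feature, query):
    """Pr(q) = sum_s Phi_s / sum_s Z_s."""
    phi_q, z = grouped_potentials(weights, influenced, feature, query)
    return sum(phi_q.values()) / sum(z.values())

# Two tuples with zero weights and no penalty: every world is equally likely.
uniform = probability({"t1": 0.0, "t2": 0.0}, frozenset({"t1", "t2"}),
                      lambda n: 0.0, lambda w: "t1" in w)
```

In the uniform case the query "t1 is present" holds in two of the four worlds, so `uniform` evaluates to one half. The enumeration is exponential in the number of tuples and is meant only to make the definitions concrete.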
Let $x$ be a projectable variable. Then the following proposition establishes the intuition behind projectable variables and homogenization.
PROPOSITION 4.14. Let $A_{ix}$ be the attribute of $R_i$ where $x$ occurs in $g_i$. If $R^t_i = hom(R_i, g_i, \Gamma_i)$, then
1. $Pr(q) = Pr(q')$ where $q'$ is the same as $q$ but over $\bar R^t$ instead of $\bar R$.
2. For any two worlds $W_1 \in \sigma_{A_{ix}=a} R^t_i$ and $W_2 \in \sigma_{A_{ix}=b} R^t_i$, $a \neq b \in dom(x)$,
$$\Phi(W_1 \wedge W_2) = \Phi(W_1)\,\Phi(W_2)\,\exp\big(h_i(|W_1 \cap g_{\Gamma_i}|, |W_2 \cap g_{\Gamma_i}|)\big)$$
where $h_i$ is a real-valued function that depends only on $\Gamma_i$.
PROOF. (Sketch) Consider the SCMN corresponding to $(R_i, \Gamma_i)$. It is composed of independent cliques, and $R^t_i$ contains all those cliques which contain tuples that can make $g_i$ true, i.e. they can have an influence on the query $q$. Since this set of cliques is independent of those in $R_i - R^t_i$, $P(q)$ can be computed just over $\bar R^t$.
Now consider $M = SCMN(R^t_i, \Gamma_i)$. By our projectable variable condition, either $x$ is a key, in which case we say $f_i = 1$. Otherwise no other variable in $g_i$ is a key, which means our homogenization ensures that $M$ is composed of at most one clique $c$, with feature function say $f_i$. So to compute the probability of any world, we just need to keep track of the number of tuples from $c$. The reader can convince themselves that $h_i(x,y) = f_i(x+y) - f_i(x) - f_i(y)$.
Algorithm 4.1 first checks for the Join or projectable variable condition. The former is trivial. In the latter case, it proceeds by dividing the possible worlds of $\bar R^t$ into $\sigma_{\bar A = a} \bar R^t$, where $\bar A = \bigcup A_{ix}$ and $a \in \bigcup dom(A_{ix})$. After that the algorithm is similar in spirit to Algorithm 3.1, and due to lack of space we refer the reader to the full version of this paper, which describes the algorithm in detail even for hierarchical constraints.
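The form $h_i(x,y) = f_i(x+y) - f_i(x) - f_i(y)$ from Proposition 4.14 can be sanity-checked numerically for the single-clique case. The sketch assumes that all tuples lie in one clique and that $\Phi(W) = \exp(\sum_{t \in W} w_t + f(|W|))$; the weights and the feature function below are made up for illustration.

```python
import math

def phi(world, weights, feature):
    """Assumed single-clique potential: exp(tuple weights + f(|W|))."""
    return math.exp(sum(weights[t] for t in world) + feature(len(world)))

def h(feature, x, y):
    """Correction term relating the potential of a union to its two parts."""
    return feature(x + y) - feature(x) - feature(y)

# Hypothetical data: three tuples in one clique, penalty for 2+ tuples.
weights = {"t1": 0.4, "t2": -0.1, "t3": 0.25}
feature = lambda n: -0.7 * max(0, n - 1)

w1 = frozenset({"t1", "t2"})   # tuples sharing one value of x
w2 = frozenset({"t3"})         # tuples with a different value of x
lhs = phi(w1 | w2, weights, feature)
rhs = phi(w1, weights, feature) * phi(w2, weights, feature) * math.exp(
    h(feature, len(w1), len(w2)))
```

The two sides `lhs` and `rhs` agree up to floating-point error, because the tuple weights factor across disjoint worlds and only the clique feature needs the correction term.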
We should mention that the algorithm can be made more efficient by using some independence properties. For example, when $q[a/x]$ and $q[b/x]$ are independent, $Pr(q[a/x] \vee q[b/x]) = 1 - (1 - Pr(q[a/x]))(1 - Pr(q[b/x]))$. But this does not affect the worst-case complexity, and so for the sake of brevity we have not written it here.
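For independent events the probability of a disjunction is one minus the product of the complements. A minimal sketch of that shortcut, with a hypothetical helper name and made-up inputs:

```python
import math

def prob_any_independent(probs):
    """Probability that at least one of several independent events holds."""
    return 1.0 - math.prod(1.0 - p for p in probs)

# Two independent sub-queries, each true with probability 0.5.
both = prob_any_independent([0.5, 0.5])
```

Here `both` is 0.75: the only way the disjunction fails is for both events to fail, which happens with probability 0.25.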
5. RELATED WORK
Query evaluation over probabilistic databases is a well studied
problem. Methods for query evaluation can broadly be classified
into two categories: Intensional ([4, 2, 13, 23]) and Extensional ([6,
10, 20, 11]). Our approach belongs to the extensional category.
Intensional methods work by associating with each boolean query a symbolic event. Query evaluation is then performed by manipulating expressions over these symbolic events. For example, in [4], lineage is used for defining the symbolic events. In principle, intensional methods can evaluate any given query over a probabilistic database with arbitrary correlations among tuples. However, as the correlations and/or queries become complicated, the symbolic expressions become very large, making query evaluation intractable.
On the other hand, extensional methods use efficient operators over real numbers for query evaluation. They work for a restricted set of correlations and queries. Prior work on extensional methods assumes very simple correlations like independence ([6, 10]) or exclusions ([3, 20, 11]). To our knowledge, this is the first paper that uses an extensional approach to handle more complicated correlations involving soft constraints.
Query evaluation is closely connected to the inference problem in AI. Many methods proposed in the AI literature have been adapted for query evaluation. Deshpande et al. [24] proposed the use of Markov Networks to represent tuple correlations. In particular, the Size-Constrained Markov Networks used in this paper are a subset of the correlations they consider. However, they propose an intensional method of query evaluation that makes query evaluation intractable even for safe queries on Size-Constrained Markov Networks. In [15], Gupta et al. solve an inference problem for Markov Networks that use cardinality-based potential functions similar to Size-Constrained Markov Networks. However, they solve the simpler MAP problem, which amounts to finding the most likely world among the set of all possible worlds. Query evaluation requires finding the sum of the probabilities of all worlds that satisfy a query. Intuitively, the latter is harder because it has to deal with all possible worlds, while the former can make a greedy choice in the selection of its worlds.
For a look at some more recent work on modeling probabilistic databases with graphical models, we refer the reader to [14]. They describe how to represent relational data with Bayesian Networks according to both possible-worlds and domain-frequency semantics. Our representation is different though, as we do not keep a random variable for every attribute of every tuple. Our representation closely resembles that of Markov Logic Networks (MLNs) [22]. An MLN is just a collection of relations and a set of first-order formulas over them with real weights. It gives semantics to these formulas by representing them as features in a Markov network over the relations. Our model corresponds to MLNs with formulas like key constraints. But MLNs are a very general model where inference can be very expensive, hence our work also helps to identify some subsets where inference is tractable.
Finally, [1] and [27] are some other works which propose handling of constraints during query answering instead of cleaning data a priori. While the former works in a probabilistic setting like ours, it enforces hard keys. The latter offers a deterministic way of conflict resolution using some form of user specification.
Acknowledgments This work was partially supported by NSF
Grants IIS-0513877, IIS-0415193, and IIS-0713576, and by a Gift
from Microsoft.
6. REFERENCES
[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering, page 30, Washington, DC, USA, 2006. IEEE Computer Society.
[2] L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In ICDE, 2007.
[3] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Trans. on Knowledge and Data Eng., 1992.
[4] O. Benjelloun, A. D. Sarma, A. Y. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953–964, 2006.
[5] O. Benjelloun, A. D. Sarma, C. Hayworth, and J. Widom. An introduction to ULDBs and the Trio system. IEEE Data Eng. Bull, 29(1):5–16, 2006.
[6] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In Proceedings of VLDB, pages 71–81, 1987.
[7] R. Cowell, P. Dawid, S. Lauritzen, and D. Spiegelhalter, editors. Probabilistic Networks and Expert Systems. Springer, 1999.
[8] N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293–302, 2007.
[9] N. Dalvi and D. Suciu. Management of probabilistic data: foundations and challenges. In PODS ’07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1–12, New York, NY, USA, 2007. ACM Press.
[10] N. N. Dalvi and D. Suciu. Efficient query evaluation on
probabilistic databases. VLDB, 2004.
[11] N. N. Dalvi and D. Suciu. Management of probabilistic data:
foundations and challenges. In PODS, 2007.
[12] N. Fuhr and T. Roelleke. A probabilistic relational algebra
for the integration of information retrieval and database
systems. ACM Trans. Inf. Syst., 15(1):32–66, 1997.
[13] N. Fuhr. A probabilistic relational model for the integration of IR and databases. In SIGIR, 1993.
[14] L. Getoor. An introduction to probabilistic graphical models for relational data. Data Engineering Bulletin, 29(1), March 2006.
[15] R. Gupta, A. Diwan, and S. Sarawagi. Efficient inference
with cardinality-based clique potentials. In ICML. ACM,
2007.
[16] I. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga.
Cords: Automatic discovery of correlations and soft
functional dependencies. In SIGMOD, pages 647–658, 2004.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems :
Networks of Plausible Inference. Morgan Kaufmann,
September 1988.
[18] H. Poon and P. Domingos. Joint inference in information
extraction. In AAAI, pages 913–918, 2007.
[19] C. Re, N. Dalvi, and D. Suciu. Efficient Top-k query
evaluation on probabilistic data. In ICDE, 2007.
[20] C. Re, N. N. Dalvi, and D. Suciu. Query evaluation on
probabilistic databases. IEEE Data Eng. Bull, 2006.
[21] C. Re and D. Suciu. Efficient evaluation of having queries on a probabilistic database. In Proceedings of DBPL, 2007.
[22] M. Richardson and P. Domingos. Markov logic networks.
Mach. Learn., 62(1-2):107–136, 2006.
[23] F. Sadri. Integrity constraints in the information source
tracking method. IEEE Transactions on Knowledge and
Data Engineering, 1995.
[24] P. Sen and A. Deshpande. Representing and querying
correlated tuples in probabilistic databases. In ICDE. IEEE,
2007.
[25] W. Shen, X. Li, and A. Doan. Constraint-based entity
matching. In AAAI, pages 862–867, 2005.
[26] P. Singla and P. Domingos. Entity resolution with markov
logic. In ICDM, pages 572–582, 2006.
[27] S. Staworko, J. Chomicki, and J. Marcinkowski.
Preference-driven querying of inconsistent relational
databases. In EDBT Workshops, pages 318–335, 2006.
[28] L. G. Valiant. The complexity of enumeration and reliability
problems. SIAM Journal on Computing, 8(3):410–421, 1979.