Oblivious Bounds on the Probability of Boolean Functions
WOLFGANG GATTERBAUER, Carnegie Mellon University
DAN SUCIU, University of Washington
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independently of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
Categories and Subject Descriptors: G.3 [Probability and Statistics]; H.2.m [Database Management]:
Miscellaneous; I.1.1 [Symbolic and algebraic manipulation]: Expressions and Their Representation
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Probabilistic databases, Weighted model counting, Boolean expressions,
Oblivious approximations, Relaxation
1. INTRODUCTION
Query evaluation on probabilistic databases is based on weighted model counting
for positive Boolean expressions. Since model counting is #P-hard in general, today’s
probabilistic database systems evaluate queries using one of the following three ap-
proaches: (1) incomplete approaches identify tractable cases (e.g., read-once formu-
las) either at the query-level [Dalvi and Suciu 2007; Dalvi et al. 2010] or the data-
level [Olteanu and Huang 2008; Sen et al. 2010]; (2) exact approaches apply exact
probabilistic inference, such as repeated application of Shannon expansion [Olteanu
et al. 2009] or tree-width based decompositions [Jha et al. 2010]; and (3) approxi-
mate approaches either apply general purpose sampling methods [Jampani et al. 2008;
Kennedy and Koch 2010; Re et al. 2007] or approximate the number of models of the
Boolean lineage expression [Olteanu et al. 2010; Fink and Olteanu 2011].
This paper provides a new algebraic framework for approximating the probability
of positive Boolean expressions. While our method was motivated by query evaluation
on probabilistic databases, it is more general and applies to all problems that rely
on weighted model counting, e.g., general probabilistic inference in graphical mod-
Authors’ addresses: W. Gatterbauer (corresponding author), Tepper School of Business, Carnegie Mellon
University; email: gatt@cmu.edu; D. Suciu, Computer Science and Engineering Department, University of
Washington, Seattle.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with credit is per-
mitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component
of this work in other works requires prior specific permission and/or a fee. Permissions may be requested
from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© YYYY ACM 0362-5915/YYYY/01-ART1 $10.00
DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
ACM Transactions on Database Systems, Vol. V, No. N, Article 1, Publication date: January YYYY.
arXiv:1409.6052v1 [cs.AI] 21 Sep 2014
[Figure 1: two panels plot the assignment space ⟨p′1, p′2⟩ ∈ [0,1]² for conjunctive (left) and disjunctive (right) dissociation, marking the regions of oblivious lower and upper bounds, separated by the curve of optimal assignments. The accompanying table:]

                      Conjunctive dissociation                Disjunctive dissociation
                      ϕ  = ϕ1 ∧ ϕ2                            ϕ  = ϕ1 ∨ ϕ2
                      ϕ′ = ϕ1[x′1/x] ∧ ϕ2[x′2/x]              ϕ′ = ϕ1[x′1/x] ∨ ϕ2[x′2/x]
Optimal oblivious     Upper: p′1 · p′2 = p                    Upper: p′1 = p, p′2 = p
bounds                Lower: p′1 = p, p′2 = p                 Lower: (1−p′1)·(1−p′2) = 1−p
Model-based           Upper: p′1 = p, p′2 = 1 (optimal)       Upper: p′1 = p, p′2 = 1 (non-optimal)
bounds                Lower: p′1 = p, p′2 = 0 (non-optimal)   Lower: p′1 = p, p′2 = 0 (optimal)
Relaxation & Comp.    p′1 = p, p′2 = P[x|ϕ1]                  p′1 = p, p′2 = P[x|¬ϕ1]
Fig. 1. Dissociation as a framework that allows one to determine optimal oblivious upper and lower bounds for the probabilities p′ = ⟨p′1, p′2⟩ of dissociated variables. Oblivious here means that we assign new values after looking at only a limited scope of the expression. Model-based upper conjunctive and lower disjunctive bounds are obliviously optimal (they fall on the red line of optimal assignments), whereas lower conjunctive and upper disjunctive bounds are not. Relaxation & Compensation is a form of dissociation which is not oblivious (p′2 is calculated with knowledge of ϕ1) and does not, in general, guarantee an upper or lower bound.
els [Chavira and Darwiche 2008].¹ An important aspect of our method is that it is not model-based in the traditional sense. Instead, it enlarges the original variable space by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation² and explain where existing relaxation-based and model-based approximations fit into this larger space of approximations. We characterize probability assignments that lead to guaranteed upper or lower bounds on the original expression and identify the best possible oblivious bounds, i.e. those chosen after looking at only a limited scope of the expression. We prove that for every model-based bound there is always a dissociation-based bound that is as good or better. And we illustrate how a standard relational DBMS can both upper and lower bound hard probabilistic conjunctive queries without self-joins with appropriate SQL queries that use dissociation in a query-centric way.
We briefly discuss our results: We want to compute the probability P[ϕ] of a Boolean expression ϕ when each of its Boolean variables x_i is set independently to true with some given probability p_i = P[x_i]. Computing P[ϕ] is known to be #P-hard in general [Valiant 1979] and remains hard to even approximate [Roth 1996]. Our approach is to approximate P[ϕ] with P[ϕ′] that is easier to compute. The new formula ϕ′ is derived from ϕ through a sequence of dissociation steps, where each step replaces d distinct occurrences of some variable x in ϕ with d fresh variables x′1, x′2, …, x′_d. Thus,

¹Note that weighted model counting is essentially the same problem as computing the probability P[ϕ] of a Boolean expression ϕ. Each truth assignment of the Boolean variables corresponds to one model whose weight is the probability of this truth assignment. Weighted model counting then asks for the sum of the weights of all satisfying assignments.
²Dissociation is the breaking of an existing association between related, but not necessarily identical, items.
after applying dissociation repeatedly, we transform ϕ into another expression ϕ′ and approximate P[ϕ] with P[ϕ′]. The question that we address in this paper is: how should we set the probabilities of the dissociated variables x′_i in order to ensure that P[ϕ′] is a good approximation of P[ϕ]? In particular, we seek conditions under which ϕ′ is guaranteed to be either an upper bound P[ϕ′] ≥ P[ϕ] or a lower bound P[ϕ′] ≤ P[ϕ].

Our main result can be summarized as follows: Suppose that x occurs positively in ϕ. Dissociate x into two variables x′1 and x′2 such that the dissociated formula is ϕ′ = ϕ′1 ∧ ϕ′2, where x′1 occurs only in ϕ′1 and x′2 occurs only in ϕ′2; in other words, ϕ ≡ ϕ′1[x′1/x] ∧ ϕ′2[x′2/x] (we will later define this as "conjunctive dissociation"). Let p = P[x], p′1 = P[x′1], p′2 = P[x′2] be their probabilities. Then (1) P[ϕ′] ≥ P[ϕ] iff p′1 · p′2 ≥ p, and (2) P[ϕ′] ≤ P[ϕ] iff p′1 ≤ p and p′2 ≤ p. In particular, the best upper bounds are obtained by choosing any p′1, p′2 that satisfy p′1 · p′2 = p, and the best lower bound is obtained by setting p′1 = p′2 = p. The "only if" direction holds assuming ϕ′ satisfies certain mild conditions (e.g., it should not be redundant), and under the assumption that p′1, p′2 are chosen obliviously, i.e. they are functions only of p = P[x] and independent of the probabilities of all other variables. This restriction to oblivious probabilities guarantees that the computation of the probabilities p′1, p′2 is very simple.³ Our result extends immediately to the case when the variable x is dissociated into several variables x′1, x′2, …, x′_d, and also extends (with appropriate changes) to the case when the expressions containing the dissociated variables are separated by ∨ rather than ∧ (Fig. 1).
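As a numerical sanity check of this main result, the following sketch (the formula and all probabilities are assumed for illustration, not taken from the paper) verifies both directions on ϕ = (x∨y1)(x∨y2), dissociated into ϕ′ = (x′1∨y1)(x′2∨y2):

```python
import random

# Check the main result on phi = (x OR y1)(x OR y2),
# dissociated into phi' = (x1' OR y1)(x2' OR y2).
def prob_phi(p, q1, q2):
    # P[(x OR y1)(x OR y2)] by conditioning on x
    return p * 1.0 + (1 - p) * q1 * q2

def prob_diss(p1, p2, q1, q2):
    # independent copies: P[x1' OR y1] * P[x2' OR y2]
    return (1 - (1 - p1) * (1 - q1)) * (1 - (1 - p2) * (1 - q2))

random.seed(0)
p = 0.3
for _ in range(1000):
    q1, q2 = random.random(), random.random()
    # any p1', p2' with p1' * p2' = p gives an oblivious upper bound
    p1 = random.uniform(p, 1.0)
    p2 = p / p1
    assert prob_diss(p1, p2, q1, q2) >= prob_phi(p, q1, q2) - 1e-12
    # p1' = p2' = p is the best oblivious lower bound
    assert prob_diss(p, p, q1, q2) <= prob_phi(p, q1, q2) + 1e-12
```

Running the loop over many random q confirms that the upper-bound condition depends only on the product p′1·p′2 and the lower-bound condition only on each p′ individually, exactly as the theorem states.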
Example 1.1 (2CNF Dissociation). For a simple illustration of our main result, consider a Positive-Partite-2CNF expression with |E| clauses

ϕ = ⋀_{(i,j)∈E} (x_i ∨ y_j)    (1)

for which calculating its probability is already #P-hard [Provan and Ball 1983]. If we dissociate all occurrences of all m variables x_i, then the expression becomes

ϕ′ = ⋀_{(i,j)∈E} (x′_{i,j} ∨ y_j)    (2)

which is equivalent to ⋀_j (y_j ∨ ⋀_{i:(i,j)∈E} x′_{i,j}). This is a read-once expression whose probability can always be computed in PTIME [Gurvich 1977]. Our main result implies the following: Let p_i = P[x_i], i ∈ [m], be the probabilities of the original variables, and denote by p′_{i,j} = P[x′_{i,j}] the probabilities of the fresh variables. Then (1) if ∀i ∈ [m]: p′_{i,j_1} · p′_{i,j_2} ⋯ p′_{i,j_{d_i}} = p_i, then ϕ′ is an upper bound (P[ϕ′] ≥ P[ϕ]); (2) if ∀i ∈ [m]: p′_{i,j_1} = p′_{i,j_2} = … = p′_{i,j_{d_i}} = p_i, then ϕ′ is a lower bound (P[ϕ′] ≤ P[ϕ]). Furthermore, these are the best possible oblivious bounds, i.e. where p′_{i,j} depends only on p_i = P[x_i] and is chosen independently of other variables in ϕ.
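To make the bounds concrete, the following sketch computes the exact probability and both read-once dissociation bounds for a small hypothetical instance (the edge set E and all probabilities are assumed values, not from the paper):

```python
from itertools import product

# Positive-partite 2CNF from Example 1.1: phi = AND over (i,j) in E of (x_i OR y_j).
E = [(1, 1), (1, 2), (2, 1), (2, 3)]
p = {1: 0.5, 2: 0.5}          # p_i = P[x_i]
q = {1: 0.4, 2: 0.5, 3: 0.6}  # q_j = P[y_j]

def exact_prob():
    """Brute-force P[phi] by summing over all truth assignments (exponential)."""
    total = 0.0
    xs, ys = sorted(p), sorted(q)
    for xv in product([0, 1], repeat=len(xs)):
        for yv in product([0, 1], repeat=len(ys)):
            x, y = dict(zip(xs, xv)), dict(zip(ys, yv))
            if all(x[i] or y[j] for (i, j) in E):
                w = 1.0
                for i in xs:
                    w *= p[i] if x[i] else 1 - p[i]
                for j in ys:
                    w *= q[j] if y[j] else 1 - q[j]
                total += w
    return total

def dissociated_prob(new_prob):
    """P[phi'] for the read-once dissociation AND_j (y_j OR AND_i x'_{i,j}).
    new_prob(i) gives the probability assigned to each fresh copy of x_i."""
    result = 1.0
    for j in sorted(q):
        prod_x = 1.0
        for (i, jj) in E:
            if jj == j:
                prod_x *= new_prob(i)
        result *= 1 - (1 - q[j]) * (1 - prod_x)
    return result

deg = {i: sum(1 for (ii, _) in E if ii == i) for i in p}  # d_i occurrences of x_i
upper = dissociated_prob(lambda i: p[i] ** (1 / deg[i]))  # product of copies = p_i
lower = dissociated_prob(lambda i: p[i])                  # every copy keeps p_i
exact = exact_prob()
assert lower <= exact <= upper
```

The symmetric upper-bound choice p′_{i,j} = p_i^{1/d_i} is one of many assignments whose product equals p_i; any other factorization of p_i across the copies would also yield an oblivious upper bound.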
We now explain how dissociation generalizes two other approximation methods in
the literature (Fig.1 gives a high-level summary and Sect.5 the formal details).
Relaxation & Compensation. This is a framework by Choi and Darwiche [2009; 2010] for approximate probabilistic inference in graphical models. The approach performs exact inference in an approximate model that is obtained by relaxing equivalence constraints in the original model, i.e. by removing edges. The framework allows one to improve the resulting approximations by compensating for the relaxed constraints. In the particular case of a conjunctive Boolean formula ϕ = ϕ1 ∧ ϕ2, relaxation refers to substituting any variable x that occurs in both ϕ1 and ϕ2 with two fresh variables x′1 in ϕ1 and x′2 in ϕ2. Compensation refers to then setting their probabilities p′1 = P[x′1] and p′2 = P[x′2] to p′1 = p and p′2 = P[x|ϕ1]. This new probability assignment is justified by the fact that, if x is the only variable shared by ϕ1 and ϕ2, then compensation ensures that P[ϕ′] = P[ϕ] (we will show this claim in Prop. 5.1). In general, however, ϕ1 and ϕ2 have more than one variable in common, in which case P[ϕ′] ≠ P[ϕ] for the same compensation. Thus, in general, compensation is applied as a heuristic. Furthermore, it is then not known whether compensation provides an upper or lower bound.

Indeed, let p′1 = p, p′2 = P[x|ϕ1] be the probabilities set by the compensation method. Recall that our condition for P[ϕ′] to be an upper bound is p′1 · p′2 ≥ p, but we have p′1 · p′2 = p · P[x|ϕ1] ≤ p. Thus, the compensation method does not satisfy our oblivious upper bound condition. Similarly, because p′1 = p and p′2 ≥ p, these values fail to satisfy our oblivious lower bound condition. Thus, relaxation is neither a guaranteed upper bound nor a guaranteed lower bound. In fact, relaxation is not oblivious at all (since p′2 is computed from the probabilities of all variables, not just P[x]). This enables it to be an exact approximation in the special case of a single shared variable, but it fails to guarantee any bounds in general.

³Our usage of the term oblivious is inspired by the notion of oblivious routing algorithms [Valiant 1982], which use only local information and can therefore be implemented very efficiently. Similarly, our oblivious framework forces p′1, p′2 to be computed only as a function of p, without access to the rest of ϕ. One can always find values p′1, p′2 for which P[ϕ] = P[ϕ′]. However, to find those values in general, one has to first compute q = P[ϕ], then find appropriate values p′1, p′2 for which the equality P[ϕ′] = q holds. This is not practical, since our goal is to compute q in the first place.
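The claim that compensation is exact for a single shared variable (Prop. 5.1) can be checked numerically. A minimal sketch with assumed probabilities on the hypothetical formula ϕ1 = x∨y1, ϕ2 = x∨y2:

```python
# Compensation is exact when phi1 and phi2 share a single variable x.
# Here phi1 = x OR y1, phi2 = x OR y2, phi = phi1 AND phi2.
p, q1, q2 = 0.3, 0.4, 0.5

P_phi1 = 1 - (1 - p) * (1 - q1)          # P[x OR y1]
P_phi = p + (1 - p) * q1 * q2            # P[phi] by conditioning on x

# Relaxation: fresh x1' in phi1, x2' in phi2. Compensation: p1' = p, p2' = P[x|phi1].
p1 = p
p2 = p / P_phi1                           # P[x | phi1], since x implies phi1
P_diss = (1 - (1 - p1) * (1 - q1)) * (1 - (1 - p2) * (1 - q2))
assert abs(P_diss - P_phi) < 1e-9         # exact for a single shared variable

# But it satisfies neither the oblivious upper-bound condition (p1'*p2' <= p)
# nor the lower-bound condition (p2' >= p), so no guarantee survives once
# more than one variable is shared.
assert p1 * p2 <= p and p2 >= p
```

Note how p2 depends on q1 through P[ϕ1], which is precisely what makes compensation non-oblivious.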
Model-based approximations. Another technique for approximation, described by Fink and Olteanu [2011], is to replace ϕ with another expression whose set of models is either a subset or a superset of those of ϕ. Equivalently, the upper bound is a formula ϕU such that ϕ ⇒ ϕU, and the lower bound is a formula ϕL such that ϕL ⇒ ϕ. We show in this paper that, if ϕ is a positive Boolean formula, then all upper and lower model-based bounds can be obtained by repeated dissociation: the model-based upper bound is obtained by repeatedly setting probabilities of dissociated variables to 1, and the model-based lower bound by setting the probabilities to 0. While the thus generated model-based upper bounds for conjunctive expressions correspond to optimal oblivious dissociation bounds, the model-based lower bounds for conjunctive expressions are not optimal and can always be improved by dissociation.

Indeed, consider first the upper bound for conjunctions: the implication ϕ ⇒ ϕU holds iff there exists a formula ϕ1 such that ϕ ≡ ϕ1 ∧ ϕU.⁴ Pick a variable x, denote by p = P[x] its probability, dissociate it into x′1 in ϕ1 and x′2 in ϕU, and set their probabilities as p′1 = 1 and p′2 = p. Thus, ϕU remains unchanged (except for the renaming of x to x′2), while in ϕ1 we have set x′1 = 1. By repeating this process, we eventually transform ϕ1 into true (recall that our formula is monotone). Thus, model-based upper bounds are obtained by repeated dissociation, setting p′1 = 1 and p′2 = p at each step. Our results show that this is only one of many oblivious upper bounds, as any choice with p′1 · p′2 ≥ p leads to an oblivious upper bound for conjunctive dissociations.

Consider now the lower bound: the implication ϕL ⇒ ϕ is equivalent to ϕL ≡ ϕ ∧ ϕ2. Then there is a sequence of finitely many conjunctive dissociation steps which transforms ϕ into ϕ ∧ ϕ2 and thus into ϕL. At each step, a variable x is dissociated into x′1 and x′2, and their probabilities are set to p′1 = p and p′2 = 0, respectively.⁵ According to our result, this choice is not optimal: instead, one obtains a tighter bound by also setting p′2 = p, which no longer corresponds to a model-based lower bound.

⁴Fink and Olteanu [2011] describe their approach for approximating DNF expressions only. However, the idea of model-based bounds applies equally well to arbitrary Boolean expressions, including those in CNF.
⁵The details here are more involved and are given in Sect. 5.2.
Thus, model-based lower bounds for conjunctive expressions are not optimal and can
always be improved by using dissociation.
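The gap between the two lower-bound choices is easy to see on a two-clause example. A sketch with assumed probabilities (the formula is ours, chosen for illustration):

```python
# For phi = (x OR y1)(x OR y2), compare the model-based lower bound
# (p1' = p, p2' = 0, i.e., phi_L = (x OR y1) AND y2) with the optimal
# oblivious dissociation lower bound (p1' = p2' = p).
p, q1, q2 = 0.3, 0.4, 0.5

exact = p + (1 - p) * q1 * q2                        # P[phi]
clause = lambda pp, qq: 1 - (1 - pp) * (1 - qq)      # P[x' OR y]
model_lower = clause(p, q1) * q2                     # p2' = 0
diss_lower = clause(p, q1) * clause(p, q2)           # p2' = p

assert model_lower <= diss_lower <= exact            # dissociation is tighter
```

Since clause(p, q2) ≥ q2 for any p ≥ 0, the dissociation-based lower bound dominates the model-based one on every input.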
Our dual result states the following for the case when the two formulas are connected with disjunction ∨ instead of conjunction ∧: (1) the dissociation is an upper bound iff p′1 ≥ p and p′2 ≥ p, and (2) it is a lower bound iff (1 − p′1)(1 − p′2) ≥ 1 − p. We see that model-based approximation gives an optimal lower bound for disjunctions, because (1 − p′1)(1 − p′2) = (1 − p) · 1 = 1 − p, but non-optimal upper bounds. Example 7.2 illustrates this asymmetry and the possible improvement through dissociation with a detailed simulation-based example.
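The dual conditions can likewise be verified numerically. A sketch on the hypothetical formula ϕ = xy1 ∨ xy2 with assumed probabilities:

```python
import random

# Dual result for disjunctive dissociation: phi = x*y1 OR x*y2
# dissociated into phi' = x1'*y1 OR x2'*y2.
def prob_phi(p, q1, q2):            # condition on x
    return p * (1 - (1 - q1) * (1 - q2))

def prob_diss(p1, p2, q1, q2):      # inclusion-exclusion over independent copies
    return p1 * q1 + p2 * q2 - p1 * q1 * p2 * q2

random.seed(1)
p = 0.4
for _ in range(1000):
    q1, q2 = random.random(), random.random()
    # best oblivious upper bound: p1' = p2' = p
    assert prob_diss(p, p, q1, q2) >= prob_phi(p, q1, q2) - 1e-12
    # lower bound whenever (1 - p1')(1 - p2') = 1 - p
    p1 = random.uniform(0, p)
    p2 = 1 - (1 - p) / (1 - p1)
    assert prob_diss(p1, p2, q1, q2) <= prob_phi(p, q1, q2) + 1e-12
```

Note the duality to the conjunctive case: the product condition now constrains the complements 1 − p′ rather than the probabilities themselves.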
Bounds for hard probabilistic queries. Query evaluation on probabilistic databases reduces to the problem of computing the probability of its lineage expression, which is a monotone, k-partite Boolean DNF where k is fixed by the number of joins in the query. Computing the probability of the lineage is known to be #P-hard for some queries [Dalvi and Suciu 2007], hence we are interested in approximating these probabilities by computing dissociated Boolean expressions for the lineage. We have previously shown in [Gatterbauer et al. 2010] that every query plan for a query corresponds to one possible dissociation of its lineage expression. The results in this paper show how to best set the probabilities for the dissociated expressions in order to obtain both upper and lower bounds. We further show that all the computation can be pushed inside a standard relational database engine with the help of SQL queries that use User-Defined Aggregates and views that replace the probabilities of input tuples with their optimal symmetric lower bounds. We illustrate this approach in Sect. 6 and validate it on TPC-H data in Sect. 7.5.
Main contributions. (1) We introduce an algebraic framework for approximating
the probability of Boolean functions by treating multiple occurrences of variables as
independent and assigning them new individual probabilities. We call this approach
dissociation; (2) we determine the optimal upper and lower bounds for conjunctive and
disjunctive dissociations under the assumption of oblivious value assignments; (3) we
show how existing relaxation-based and model-based approximations fit into the larger
design space of dissociations, and show that for every model-based bound there is at
least one dissociation-based bound which is as good or tighter; (4) we apply our gen-
eral framework to both upper and lower bound hard probabilistic conjunctive queries
without self-joins in guaranteed PTIME by translating the query into a sequence of
standard SQL queries; and (5) we illustrate and evaluate with several detailed ex-
amples the application of this technique. Note that this paper does not address the
algorithmic complexities in determining alternative dissociations, in general.
Outline. Section 2 starts with some notational background, and Sect. 3 formally
defines dissociation. Section 4 contains our main results on optimal oblivious bounds.
Section 5 formalizes the connection between relaxation, model-based bounds and disso-
ciation, and shows how both previous approaches can be unified under the framework
of dissociation. Section 6 applies our framework to derive upper and lower bounds for
hard probabilistic queries with standard relational database management systems.
Section 7 gives detailed illustrations on the application of dissociation and oblivious
bounds. Finally, Sect.8 relates to previous work before Sect.9 concludes.
2. GENERAL NOTATIONS AND CONVENTIONS
We use [m] as short notation for {1, …, m}, use the bar sign for the complement of an event or probability (e.g., x̄ = ¬x, and p̄ = 1 − p), and use a bold notation for sets (e.g., s ⊆ [m]) or vectors (e.g., x = ⟨x1, …, x_m⟩) alike. We assume a set x of independent Boolean random variables, and assign to each variable x_i a primitive event which is true with probability p_i = P[x_i]. We do not formally distinguish between the variable x_i and the event x_i that it is true. By default, all primitive events are assumed to be
[Figure 2: two bipartite primal graphs over variables x and y. (a): f, where x1 and x2 each co-occur with two y-variables (k1 = 2, k2 = 2). (b): f′, with fresh variables x′_{1,1}, x′_{1,2}, and x′2 (d2 = 1).]
Fig. 2. Example 3.2. (a): Bipartite primal graph for the CNF representing f. (b): A dissociation f′ where variable x1, appearing k1 = 2 times in f, is dissociated into (replaced by) d1 = 2 fresh variables in f′.
independent (e.g., P[x1x2] = p1p2). We are interested in bounding the probability P[f] of a Boolean function f, i.e. the probability that the function is true if each of the variables is independently true or false with given probabilities. When no confusion arises, we blur the distinction between a Boolean expression ϕ and the Boolean function f_ϕ it represents (cf. [Crama and Hammer 2011, Sect. 1.2]) and write P[ϕ] instead of P[f_ϕ]. We also use the words formula and expression interchangeably. We write f(x) to indicate that x is the set of primitive events appearing in the function f, and f[x1/x] to indicate that x1 is substituted for x in f. We often omit the ∧ operator and denote conjunction by mere juxtaposition instead.
3. DISSOCIATION OF BOOLEAN FUNCTIONS AND EXPRESSIONS
We define here dissociation formally. Let f(x,y) and f′(x′,y) be two Boolean functions, where x, x′, y are three disjoint sets of variables. Denote |x| = m, |x′| = m′, and |y| = n. We restrict f and f′ to be positive in x and x′, respectively [Crama and Hammer 2011, Def. 1.24].

Definition 3.1 (Dissociation). We call a function f′ a dissociation of f if there exists a substitution θ : x′ → x s.t. f′[θ] = f.
Example 3.2 (CNF Dissociation). Consider two functions f and f′ given by the CNF expressions

f  = (x1 ∨ y1)(x1 ∨ y2)(x2 ∨ y1)(x2 ∨ y3)
f′ = (x′_{1,1} ∨ y1)(x′_{1,2} ∨ y2)(x′2 ∨ y1)(x′2 ∨ y3)

Then f′ is a dissociation of f as f′[θ] = f for the substitution θ = {(x′_{1,1}, x1), (x′_{1,2}, x1), (x′2, x2)}. Figure 2 shows the CNF expressions' primal graphs.⁶
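The substitution check f′[θ] = f from this example can be verified by enumeration. A quick sketch (the variable encoding is ours):

```python
from itertools import product

# Example 3.2: f and its dissociation f' as Python predicates.
f = lambda x1, x2, y1, y2, y3: (x1 or y1) and (x1 or y2) and (x2 or y1) and (x2 or y3)
fp = lambda x11, x12, x2, y1, y2, y3: (x11 or y1) and (x12 or y2) and (x2 or y1) and (x2 or y3)

# Substituting x1 for both x'_{1,1} and x'_{1,2} (the substitution theta) recovers f.
assert all(
    fp(x1, x1, x2, y1, y2, y3) == f(x1, x2, y1, y2, y3)
    for x1, x2, y1, y2, y3 in product([0, 1], repeat=5)
)
```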
In practice, to find a dissociation for a function f(x,y), one proceeds like this: Choose any expression ϕ(x,y) for f, and thus f = f_ϕ. Replace the k_i distinct occurrences of variable x_i in ϕ with d_i fresh variables x′_{i,1}, x′_{i,2}, …, x′_{i,d_i}, with d_i ≤ k_i. The resulting expression ϕ′ represents a function f′ that is a dissociation of f. Notice that we may obtain different dissociations by deciding for which occurrences of x_i to use distinct fresh variables and for which occurrences to use the same variable. We may further obtain more dissociations by starting with different, equivalent expressions ϕ for the function f. In fact, we may construct infinitely many dissociations this way. We also note that every dissociation of f can be obtained through the process outlined here. Indeed, let f′(x′,y) be a dissociation of f(x,y) according to Definition 3.1, and let θ be the substitution for which f′[θ] = f. Then, if ϕ′ is any expression representing f′, the expression ϕ = ϕ′[θ] represents f. We can thus apply the described dissociation process to a certain expression ϕ and obtain f′.
⁶The primal graph of a CNF (DNF) has one node for each variable and one edge for each pair of variables that co-occur in some clause (conjunct). This concept originates in constraint satisfaction, and it is also variously called co-occurrence graph or variable interaction graph [Crama and Hammer 2011].
Example 3.3 (Alternative Dissociations). Consider the two expressions:

ϕ = (x ∨ y1)(x ∨ y2)(x ∨ y3)(y4 ∨ y5)
ψ = xy4 ∨ xy5 ∨ y1y2y3y4 ∨ y1y2y3y5

Both are equivalent (ϕ ≡ ψ) and thus represent the same Boolean function (f_ϕ = f_ψ). Yet each leads to a quite different dissociation in the variable x:

ϕ′ = (x′1 ∨ y1)(x′2 ∨ y2)(x′3 ∨ y3)(y4 ∨ y5)
ψ′ = x′1y4 ∨ x′2y5 ∨ y1y2y3y4 ∨ y1y2y3y5

Here, ϕ′ and ψ′ represent different functions (f_{ϕ′} ≠ f_{ψ′}) and are both dissociations of f for the substitutions θ1 = {(x′1, x), (x′2, x), (x′3, x)} and θ2 = {(x′1, x), (x′2, x)}, respectively.
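One can confirm by brute force that ϕ′ and ψ′ indeed compute different functions. A small sketch (the variable ordering and 0/1 encoding are ours):

```python
from itertools import product

# Example 3.3: the two dissociations as predicates over (x1', x2', x3', y1..y5).
def phi_(v):   # (x1' OR y1)(x2' OR y2)(x3' OR y3)(y4 OR y5)
    x1, x2, x3, y1, y2, y3, y4, y5 = v
    return (x1 or y1) and (x2 or y2) and (x3 or y3) and (y4 or y5)

def psi_(v):   # x1'y4 OR x2'y5 OR y1y2y3y4 OR y1y2y3y5
    x1, x2, x3, y1, y2, y3, y4, y5 = v
    return (x1 and y4) or (x2 and y5) or \
           (y1 and y2 and y3 and y4) or (y1 and y2 and y3 and y5)

# The dissociated functions disagree on at least one truth assignment.
diff = [v for v in product([0, 1], repeat=8) if bool(phi_(v)) != bool(psi_(v))]
assert diff
```

For instance, x′3 = 1, y1 = y2 = y4 = 1 and all other variables 0 satisfies ϕ′ but not ψ′.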
Example 3.4 (More Alternative Dissociations). Consider the AND function f(x, y) = xy. It can be represented by the expressions xxy, or xxxy, etc., leading to the dissociations x′1x′2y, or x′1x′2x′3y, etc. For even more dissociations, represent f using the expression (x ∨ x)yxy, which can dissociate to (x′1 ∨ x′2)yx′3y, or (x′1 ∨ x′2)yx′1y, etc. Note that several occurrences of a variable can be replaced by the same new variable in the dissociated expression.
4. OBLIVIOUS BOUNDS FOR DISSOCIATED EVENT EXPRESSIONS
Throughout this section, we fix two Boolean functions f(x,y) and f′(x′,y) such that f′ is a dissociation of f. We are given the probabilities p = P[x] and q = P[y]. Our goal is to find probabilities p′ = P[x′] of the dissociated variables so that P[f′] is an upper or lower bound for P[f]. We first define oblivious bounds (Sect. 4.1), then characterize them in general through valuations (Sect. 4.2) and, in particular, for conjunctive and disjunctive dissociations (Sect. 4.3), then derive optimal bounds (Sect. 4.4), and end with illustrated examples for CNF and DNF dissociations (Sect. 4.5).
4.1. Definition of Oblivious Bounds
We use the subscript notation P_{p,q}[f] and P_{p′,q}[f′] to emphasize that the probability space is defined by the probabilities p = ⟨p1, p2, …⟩, q = ⟨q1, q2, …⟩, and p′ = ⟨p′1, p′2, …⟩, respectively. Given p and q, our goal is thus to find p′ such that P_{p′,q}[f′] ≥ P_{p,q}[f] or P_{p′,q}[f′] ≤ P_{p,q}[f].

Definition 4.1 (Oblivious Bounds). Let f′ be a dissociation of f and p = P[x]. We call p′ an oblivious upper bound for p and dissociation f′ of f iff ∀q: P_{p′,q}[f′] ≥ P_{p,q}[f]. Similarly, p′ is an oblivious lower bound iff ∀q: P_{p′,q}[f′] ≤ P_{p,q}[f].

In other words, p′ is an oblivious upper bound if the probability of the dissociated function f′ is bigger than that of f for every choice of q. Put differently, the probabilities of x′ depend only on the probabilities of x and not on those of y.

An immediate upper bound is given by p′ = 1, since f is monotone and f′[1/x′] = f[1/x]. Similarly, p′ = 0 is a naïve lower bound. This proves that the set of upper and lower bounds is never empty. Our next goal is to characterize all oblivious bounds and to then find optimal ones.
4.2. Characterization of Oblivious Bounds through Valuations
We will give a necessary and sufficient characterization of oblivious bounds, but first we need to introduce some notation. If f(x,y) is a Boolean function, let ν : y → {0,1} be a truth assignment or valuation for y. We use ν for the vector ⟨ν(y1), …, ν(y_n)⟩, and denote with f[ν] the Boolean function obtained after applying the substitution ν. Note that f[ν] depends on the variables x only. Furthermore, let g be n Boolean functions,
over variables z. We denote with g^ν the Boolean function g^ν = ⋀_j g^ν_j, where g^ν_j = ḡ_j if ν(y_j) = 0 and g^ν_j = g_j if ν(y_j) = 1.

[Figure 3: Karnaugh maps over z1, z2, z3. (a): the Boolean functions g = ⟨g1, g2⟩. (b): the functions g^ν for the four valuations ν = ⟨0,0⟩, ⟨0,1⟩, ⟨1,0⟩, ⟨1,1⟩.]
Fig. 3. Example 4.2. Illustration of the valuation notation with Karnaugh maps. (a): Boolean functions g1 = z1z2 and g2 = z1z3. (b): Boolean functions g^ν for all 4 possible valuations ν. For example, g^ν = z1z̄2z3 for ν = ⟨0,1⟩.
Example 4.2 (Valuation Notation). Assume g = ⟨g1, g2⟩ with g1 = z1z2 and g2 = z1z3, and ν = ⟨0,1⟩. Then g^ν = ¬(z1z2) ∧ z1z3 = z1z̄2z3. Figure 3 illustrates our notation for this simple example with the help of Karnaugh maps. We encourage the reader to take a moment and carefully study the correspondences between g, ν, and g^ν.

Then, any function f(x,y) admits the following expansion by the y-variables:

f(x,y) = ⋁_ν f[ν] ∧ y^ν    (3)

Note that any two expressions in the expansion above are logically contradictory, a property called determinism by Darwiche and Marquis [2002], and that the expansion can be seen as the result of applying Shannon expansion to all variables of y.

Example 4.3 (Valuation Notation continued). Consider the function f = (x ∨ y1)(x ∨ y2). For the example valuation ν = ⟨0,1⟩, we have f[ν] = (x ∨ 0)(x ∨ 1) = x and y^ν = ȳ1y2. Equation 3 gives us an alternative way to write f as a disjunction over all 2² valuations of y as f = x(ȳ1ȳ2) ∨ x(y1ȳ2) ∨ x(ȳ1y2) ∨ y1y2.
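Equation 3 can be checked numerically for this example. A sketch with assumed probabilities:

```python
from itertools import product

# Verify the expansion (Eq. 3) for f = (x OR y1)(x OR y2) from Example 4.3:
# f = x(~y1~y2) OR x(y1~y2) OR x(~y1 y2) OR y1y2, a disjunction of disjoint events.
p, q1, q2 = 0.3, 0.4, 0.7

# Left-hand side: brute-force P[f].
lhs = sum(
    (p if x else 1 - p) * (q1 if y1 else 1 - q1) * (q2 if y2 else 1 - q2)
    for x, y1, y2 in product([0, 1], repeat=3)
    if (x or y1) and (x or y2)
)

# Right-hand side: sum over the 4 valuations nu of y; f[nu] = x except nu = (1,1).
rhs = 0.0
for n1, n2 in product([0, 1], repeat=2):
    p_f_nu = 1.0 if (n1 and n2) else p                       # P[f[nu]]
    p_y_nu = (q1 if n1 else 1 - q1) * (q2 if n2 else 1 - q2)  # P[y^nu]
    rhs += p_f_nu * p_y_nu

assert abs(lhs - rhs) < 1e-12
```

Because the four events y^ν are disjoint and exhaustive, the sum on the right is an instance of the law of total probability, which is exactly the argument used in the proof of Prop. 4.4 below.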
The following proposition gives a necessary and sufficient condition for oblivious upper and lower bounds, based on valuations.

PROPOSITION 4.4 (OBLIVIOUS BOUNDS AND VALUATIONS). Fix two Boolean functions f(x,y), f′(x′,y) s.t. f′ is a dissociation of f, and let p and p′ denote the probabilities of the variables x and x′, respectively. Then p′ is an oblivious upper bound iff P_{p′}[f′[ν]] ≥ P_p[f[ν]] for every valuation ν for y. The proposition holds similarly for oblivious lower bounds.

PROOF. Remember that any two events in Eq. 3 are disjoint. The law of total probability thus allows us to sum over the probabilities of all conjuncts:

P_{p,q}[f(x,y)] = Σ_ν P_p[f[ν]] · P_q[y^ν]
P_{p′,q}[f′(x′,y)] = Σ_ν P_{p′}[f′[ν]] · P_q[y^ν]

The "if" direction follows immediately. For the "only if" direction, assume that p′ is an oblivious upper bound. By definition, P_{p′,q}[f′] ≥ P_{p,q}[f] for every q. Fix any valuation ν : y → {0,1}, and define the following probabilities q: q_i = 0 when ν(y_i) = 0, and q_i = 1 when ν(y_i) = 1. It is easy to see that P_{p,q}[f] = P_p[f[ν]] and, similarly, P_{p′,q}[f′] = P_{p′}[f′[ν]], which proves P_{p′}[f′[ν]] ≥ P_p[f[ν]].
A consequence of choosing p′ obliviously is that it remains a bound even if we allow the variables y to be arbitrarily correlated. More precisely:

COROLLARY 4.5 (OBLIVIOUS BOUNDS AND CORRELATIONS). Let f′(x′,y) be a dissociation of f(x,y), let p′ be an oblivious upper bound for p, and let g = ⟨g1, …, g_{|y|}⟩ be Boolean functions in some variables z with probabilities r = P[z]. Then: P_{p′,r}[f′(x′, g(z))] ≥ P_{p,r}[f(x, g(z))]. The result for oblivious lower bounds is similar.

The intuition is that, by substituting the variables y with functions g in f(x,y), we make y correlated. The corollary thus says that an oblivious upper bound remains an upper bound even if the variables y are correlated. This follows from the folklore result that any correlation between the variables y can be captured by general Boolean functions g. For completeness, we include the proof in Appendix B.

PROOF OF COROLLARY 4.5. We derive the probabilities of f and f′ from Eq. 3:

P_{p,r}[f(x,g)] = Σ_ν P_p[f[ν]] · P_r[g^ν]
P_{p′,r}[f′(x′,g)] = Σ_ν P_{p′}[f′[ν]] · P_r[g^ν]

The proof now follows immediately from Prop. 4.4.
4.3. Oblivious Bounds for Unary Conjunctive and Disjunctive Dissociations
A dissociation f'(x',y) of f(x,y) is called unary if |x| = 1, in which case we write the function as f(x,y). We next focus on unary dissociations and establish a necessary and sufficient condition for probabilities to be oblivious upper or lower bounds for the important classes of conjunctive and disjunctive dissociations. The criterion also extends as a sufficient condition to non-unary dissociations, since these can be obtained as a sequence of unary dissociations.^7
Definition 4.6 (Conjunctive and Disjunctive Dissociation). Let f'(x',y) be a Boolean function in variables x', y. We say that the variables x' are conjunctive in f' if f'(x',y) = ⋀_{j∈[d]} f_j(x'_j, y), d = |x'|. We say that a dissociation f'(x',y) of f(x,y) is conjunctive if x' are conjunctive in f'. Similarly, we say that x' are disjunctive in f' if f'(x',y) = ⋁_{j∈[d]} f_j(x'_j, y), and a dissociation is disjunctive if x' are disjunctive in f'.

Thus, in a conjunctive dissociation, each dissociated variable x'_j occurs in exactly one Boolean function f_j, and these functions are combined by ∧ to obtain f'. In practice, we start with f written as a conjunction, then replace x with a fresh variable in each conjunct:

f(x,y) = ⋀_j f_j(x,y)
f'(x',y) = ⋀_j f_j(x'_j, y)

Disjunctive dissociations are similar.
Note that if x' is conjunctive in f'(x',y), then for any substitution ν : y → {0,1}, f'[ν] is either 0, 1, or a conjunction of variables in x': f'[ν] = ⋀_{j∈s} x'_j, for some set s ⊆ [d], where d = |x'|. Similarly, if x' is disjunctive, then f'[ν] is 0, 1, or ⋁_{j∈s} x'_j.^8

^7 Necessity does not always extend to non-unary dissociations. The reason is that an oblivious dissociation for x may set the probability of a fresh variable by examining all variables x, while in a sequence of oblivious dissociations each new probability P[x'_{i,j}] may depend only on the variable x_i currently being dissociated.
^8 Note that for s = ∅: f'[ν] = ⋀_{j∈s} x'_j = 1 and f'[ν] = ⋁_{j∈s} x'_j = 0.
1:10 W. Gatterbauer and D. Suciu
We need one more definition before we state the main result of our paper.

Definition 4.7 (Cover). Let x' be conjunctive in f'(x',y). We say that f' covers the set s ⊆ [d] if there exists a substitution ν s.t. f'[ν] = ⋀_{j∈s} x'_j. Similarly, if x' is disjunctive, then we say that f' covers s if there exists ν s.t. f'[ν] = ⋁_{j∈s} x'_j.
THEOREM 4.8 (OBLIVIOUS BOUNDS). Let f'(x',y) be a conjunctive dissociation of f(x,y), and let p = P[x], p' = P[x'] be the probabilities of x and x', respectively. Then:

(1) If p'_j ≤ p for all j, then p' is an oblivious lower bound for p, i.e. ∀q: P_{p',q}[f'] ≤ P_{p,q}[f]. Conversely, if p' is an oblivious lower bound for p and f' covers all singleton sets {j} with j ∈ [d], then p'_j ≤ p for all j.
(2) If ∏_j p'_j ≥ p, then p' is an oblivious upper bound for p, i.e. ∀q: P_{p',q}[f'] ≥ P_{p,q}[f]. Conversely, if p' is an oblivious upper bound for p and f' covers the set [d], then ∏_j p'_j ≥ p.

Similarly, the dual result holds for disjunctive dissociations f'(x',y) of f(x,y):

(3) If p'_j ≥ p for all j, then p' is an oblivious upper bound for p. Conversely, if p' is an oblivious upper bound for p and f' covers all singleton sets {j}, j ∈ [d], then p'_j ≥ p.
(4) If ∏_j (1 − p'_j) ≥ 1 − p, then p' is an oblivious lower bound for p. Conversely, if p' is an oblivious lower bound for p and f' covers the set [d], then ∏_j (1 − p'_j) ≥ 1 − p.
PROOF. We make repeated use of Prop. 4.4. We give here the proof for conjunctive dissociations only; the proof for disjunctive dissociations is dual and similar.

(1) We need to check P_{p'}[f'[ν]] ≤ P_p[f[ν]] for every ν. Since the dissociation is unary, f[ν] can only be 0, 1, or x, while f'[ν] is 0, 1, or ⋀_{j∈s} x'_j for some set s ⊆ [d].

Case 1: f[ν] = 0. We will show that f'[ν] = 0, which implies P_{p'}[f'[ν]] = P_p[f[ν]] = 0. Recall that, by definition, f'(x',y) becomes f(x,y) if we substitute x for all variables x'_j. Therefore, f'[ν][x/x'_1, ..., x/x'_d] = 0, which implies f'[ν] = 0 because f' is monotone in the variables x'.

Case 2: f[ν] = 1. Then P_{p'}[f'[ν]] ≤ P_p[f[ν]] holds trivially.

Case 3: f[ν] = x. Then P_p[f[ν]] = p, while P_{p'}[f'[ν]] = ∏_{j∈s} p'_j. We prove that s ≠ ∅: this implies our claim, because ∏_{j∈s} p'_j ≤ p'_j ≤ p for any choice of j ∈ s. Suppose otherwise, that s = ∅, hence f'[ν] = 1. Substituting all variables x' with x transforms f' into f. This implies f[ν] = 1, which contradicts f[ν] = x.

For the converse, assume that p' is an oblivious lower bound. Since f' covers {j}, there exists a substitution ν s.t. f'[ν] = x'_j, and therefore f[ν] = x. By Prop. 4.4 we have p'_j = P_{p'}[f'[ν]] ≤ P_p[f[ν]] = p, proving the claim.

(2) Here we need to check P_{p'}[f'[ν]] ≥ P_p[f[ν]] for every ν. The cases when f[ν] is either 0 or 1 are similar to the cases above, so we only consider the case f[ν] = x. Then f'[ν] = ⋀_{j∈s} x'_j and P_{p'}[f'[ν]] = ∏_{j∈s} p'_j ≥ ∏_j p'_j ≥ p = P_p[f[ν]].

For the converse, assume p' is an oblivious upper bound, and let ν be the substitution for which f'[ν] = ⋀_j x'_j (which exists since f' covers [d]). Then P_{p'}[f'[ν]] ≥ P_p[f[ν]] implies ∏_j p'_j ≥ p.
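All four conditions of Theorem 4.8 can be spot-checked by enumeration. The sketch below is our own illustration with hypothetical probability values (not code from the paper), using f = (x∨y1)(x∨y2) for the conjunctive case and f = xy1 ∨ xy2 for the disjunctive case:

```python
from itertools import product

def prob(f, probs):
    # exact probability of f over independent 0/1 variables
    tot = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        w = 1.0
        for b, p in zip(bits, probs):
            w *= p if b else 1 - p
        tot += w * f(*bits)
    return tot

p = 0.7
grid = [(q1, q2) for q1 in (0.2, 0.6, 0.9) for q2 in (0.2, 0.6, 0.9)]

# conjunctive: f = (x v y1)(x v y2), dissociation f' = (x1' v y1)(x2' v y2)
fc  = lambda x, y1, y2: (x or y1) and (x or y2)
fcp = lambda x1, x2, y1, y2: (x1 or y1) and (x2 or y2)
low = [p, 0.5]            # case (1): both p_j' <= p    -> lower bound
up  = [0.9, p / 0.9]      # case (2): p1' * p2' = p     -> upper bound
for q1, q2 in grid:
    e = prob(fc, [p, q1, q2])
    assert prob(fcp, low + [q1, q2]) <= e + 1e-12
    assert prob(fcp, up + [q1, q2]) >= e - 1e-12

# disjunctive: f = x y1 v x y2, dissociation f' = x1' y1 v x2' y2
fd  = lambda x, y1, y2: (x and y1) or (x and y2)
fdp = lambda x1, x2, y1, y2: (x1 and y1) or (x2 and y2)
up_d  = [p, 0.95]                  # case (3): both p_j' >= p          -> upper bound
low_d = [0.4, 1 - (1 - p) / 0.6]   # case (4): (1-p1')(1-p2') = 1-p    -> lower bound
for q1, q2 in grid:
    e = prob(fd, [p, q1, q2])
    assert prob(fdp, up_d + [q1, q2]) >= e - 1e-12
    assert prob(fdp, low_d + [q1, q2]) <= e + 1e-12
```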
4.4. Optimal Oblivious Bounds for Unary Conjunctive and Disjunctive Dissociations
We are naturally interested in the "best possible" oblivious bounds. Call a dissociation f' non-degenerate if it covers all singleton sets {j}, j ∈ [d], and the complete set [d]. Theorem 4.8 then implies:
COROLLARY 4.9 (OPTIMAL OBLIVIOUS BOUNDS). If f' is a conjunctive dissociation of f and f' is non-degenerate, then the optimal oblivious lower bound is p'_1 = p'_2 = ... = p, while the optimal oblivious upper bounds are obtained whenever p'_1 · p'_2 ··· = p. Similarly, if f' is a disjunctive dissociation of f and f' is non-degenerate, then the optimal oblivious upper bound is p'_1 = p'_2 = ... = p, while the optimal oblivious lower bounds are obtained whenever (1 − p'_1)·(1 − p'_2)··· = 1 − p.
Notice that while optimal lower bounds for conjunctive dissociations and optimal upper bounds for disjunctive dissociations are uniquely defined by p'_j = p, there are infinitely many optimal bounds in the other directions (see Fig. 1). Let us call bounds symmetric if all dissociated variables have the same probability. Then the optimal symmetric upper bounds for conjunctive dissociations are p'_j = p^{1/d}, and the optimal symmetric lower bounds for disjunctive dissociations are p'_j = 1 − (1−p)^{1/d}.
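The optimal symmetric bounds can be computed directly; a minimal sketch, assuming a d-fold dissociation of a single variable with probability p:

```python
def conj_upper(p, d):
    # optimal symmetric upper bound for a conjunctive dissociation into d copies:
    # the d-th root of p, so that the product of all copies equals p
    return p ** (1.0 / d)

def disj_lower(p, d):
    # optimal symmetric lower bound for a disjunctive dissociation into d copies:
    # 1 - d-th root of (1-p), so that the product of (1 - p_j') equals 1-p
    return 1 - (1 - p) ** (1.0 / d)

p, d = 0.8, 3
assert abs(conj_upper(p, d) ** d - p) < 1e-12
assert abs((1 - disj_lower(p, d)) ** d - (1 - p)) < 1e-12
```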
We give two examples of degenerate dissociations. First, the dissociation f' = (x'_1 y_1 ∨ y_3)(x'_2 y_2 ∨ y_3) covers neither {1} nor {2}: no matter how we substitute y_1, y_2, y_3, we can never transform f' into x'_1. For example, f'[1/y_1, 0/y_2, 0/y_3] = 0 and f'[1/y_1, 0/y_2, 1/y_3] = 1. But f' does cover the set {1,2} because f'[1/y_1, 1/y_2, 0/y_3] = x'_1 x'_2. Second, the dissociation f' = (x'_1 y_1 ∨ y_2)(x'_2 y_2 ∨ y_1) covers both {1} and {2}, but does not cover the entire set {1,2}. In these cases the oblivious upper or lower bounds in Theorem 4.8 still hold, but are not necessarily optimal.
However, most cases of practical interest result in dissociations that are non-degenerate, in which case the bounds in Theorem 4.8 are tight. We explain this here. Consider the original function, pre-dissociation, written in conjunctive form:

f(x,y) = g_0 ∧ ⋀_{j∈[d]} (x ∨ g_j) = g_0 ∧ ⋀_{j∈[d]} f_j    (4)

where each g_j is a Boolean function in the variables y, and where we denote f_j = x ∨ g_j. For example, if f is a CNF expression, then each f_j is a clause containing x, and g_0 is the conjunction of all clauses that do not contain x. Alternatively, we may start with a CNF expression and group the clauses containing x into equivalence classes, such that each f_j represents one equivalence class. For example, starting with four clauses, we group into two functions f = [(x∨y_1)(x∨y_2)] ∧ [(x∨y_3)(x∨y_4)] = (x ∨ y_1y_2)(x ∨ y_3y_4) = f_1 ∧ f_2. Our only assumption about Eq. 4 is that it is non-redundant, meaning that none of the expressions g_0 or f_j may be dropped. Then we prove:
PROPOSITION 4.10 (NON-DEGENERATE DISSOCIATION). Suppose the function f in Eq. 4 is non-redundant. Define f' = g_0 ∧ ⋀_j (x'_j ∨ g_j). Then f' covers every singleton set {j}. Moreover, if the implication g_0 ⇒ ⋁_j g_j does not hold, then f' also covers the set [d]. Hence f' is non-degenerate. A similar result holds for disjunctive dissociations if the dual implication ⋀_j g_j ⇒ g_0 does not hold.

PROOF. We give here the proof for conjunctive dissociations only; the proof for disjunctive dissociations follows from duality. We first prove that f' covers any singleton set {j}, for j ∈ [d]. We claim that the following logical implication does not hold:

g_0 ∧ ⋀_{i≠j} g_i ⇒ g_j    (5)

Indeed, suppose the implication holds for some j. Then the following implication also holds: g_0 ∧ ⋀_{i≠j} (x ∨ g_i) ⇒ (x ∨ g_j), since for x = 0 it is the implication above, while for x = 1 it is a tautology. Therefore, the function f_j is redundant in Eq. 4, which contradicts our assumption. Hence, the implication Eq. 5 does not hold. Let ν be any
 #    ν         | f'_c[ν]         f_c[ν]  P[f'_c[ν]]     P[f_c[ν]] | f'_d[ν]          f_d[ν]  P[f'_d[ν]]       P[f_d[ν]]
 1.  ⟨0,0,0,0⟩  | 0               0       0              0         | 0                0       0                0
 2.  ⟨1,0,0,0⟩  | 0               0       0              0         | x'_1             x       p'_1             p
 3.  ⟨0,1,0,0⟩  | 0               0       0              0         | x'_2             x       p'_2             p
 4.  ⟨0,0,1,0⟩  | 0               0       0              0         | x'_3             x       p'_3             p
 5.  ⟨0,0,0,1⟩  | x'_1x'_2x'_3    x       p'_1p'_2p'_3   p         | 1                1       1                1
 6.  ⟨1,1,0,0⟩  | 0               0       0              0         | x'_1∨x'_2        x       1−p̄'_1p̄'_2       p
 7.  ⟨1,0,1,0⟩  | 0               0       0              0         | x'_1∨x'_3        x       1−p̄'_1p̄'_3       p
 8.  ⟨0,1,1,0⟩  | 0               0       0              0         | x'_2∨x'_3        x       1−p̄'_2p̄'_3       p
 9.  ⟨1,0,0,1⟩  | x'_2x'_3        x       p'_2p'_3       p         | 1                1       1                1
10.  ⟨0,1,0,1⟩  | x'_1x'_3        x       p'_1p'_3       p         | 1                1       1                1
11.  ⟨0,0,1,1⟩  | x'_1x'_2        x       p'_1p'_2       p         | 1                1       1                1
12.  ⟨1,1,1,0⟩  | 0               0       0              0         | x'_1∨x'_2∨x'_3   x       1−p̄'_1p̄'_2p̄'_3   p
13.  ⟨1,1,0,1⟩  | x'_3            x       p'_3           p         | 1                1       1                1
14.  ⟨1,0,1,1⟩  | x'_2            x       p'_2           p         | 1                1       1                1
15.  ⟨0,1,1,1⟩  | x'_1            x       p'_1           p         | 1                1       1                1
16.  ⟨1,1,1,1⟩  | 1               1       1              1         | 1                1       1                1
(a) Comparing the 2^4 valuations for determining oblivious bounds.

(b) Non-trivial valuations f'_c[ν]: Hasse diagram of implication with the singletons x'_1, x'_2, x'_3 at the top (upper bounds), the pairs x'_1x'_2, x'_1x'_3, x'_2x'_3 in the middle, and x'_1x'_2x'_3 at the bottom (lower bounds).

(c) Non-trivial valuations f'_d[ν]: the dual Hasse diagram with x'_1∨x'_2∨x'_3 at the top, the pairwise disjunctions x'_1∨x'_2, x'_1∨x'_3, x'_2∨x'_3 in the middle, and the singletons x'_1, x'_2, x'_3 at the bottom.

Fig. 4. Example 4.11 (CNF f_c) and Example 4.12 (DNF f_d). (a): Determining oblivious bounds by ensuring that the bounds hold for all valuations. (b), (c): Partial order of implication (⇒) among the non-trivial valuations f'_c[ν] and f'_d[ν], e.g.: from x'_1x'_2 ⇒ x'_1 it follows that p'_1p'_2 ≥ p implies p'_1 ≥ p. Note that f_c ≠ f_d.
assignment that causes Eq. 5 to fail: thus, g_i[ν] = 1 for all i ∈ {0, ..., d} with i ≠ j, and g_j[ν] = 0. Therefore f'[ν] = x'_j, proving that f' covers {j}.

Next, assume that g_0 ⇒ ⋁_j g_j does not hold. We prove that f' covers [d]. Let ν be any substitution that causes the implication to fail: g_0[ν] = 1 and g_j[ν] = 0 for all j ∈ [d]. Then f'[ν] = ⋀_{j∈[d]} x'_j.
4.5. Illustrated Examples for Optimal Oblivious Bounds
We next give two examples that illustrate optimal oblivious bounds for conjunctive and disjunctive dissociations in some detail.

Example 4.11 (CNF Dissociation). Consider the function f_c given by a CNF expression and its dissociation f'_c:

f_c = (x ∨ y_1)(x ∨ y_2)(x ∨ y_3) ∧ y_4
f'_c = (x'_1 ∨ y_1)(x'_2 ∨ y_2)(x'_3 ∨ y_3) ∧ y_4
There are 2^4 = 16 valuations for y = ⟨y_1, y_2, y_3, y_4⟩. Probabilities p' = ⟨p'_1, p'_2, p'_3⟩ are thus an oblivious upper bound exactly if they satisfy the 16 inequalities given under "CNF dissociation" in Fig. 4a. For valuations with ν_4 = 0 (and thus f_c[ν] = 0) or all ν_j = 1 (and thus f'_c[ν] = 1) the inequalities trivially hold. Among the remaining 7 non-trivial inequalities, p'_1p'_2p'_3 ≥ p implies all others. Figure 4b shows the partial order between the non-trivial valuations, with x'_1x'_2x'_3 implying all others. Since f_c and f'_c are positive in x and x', respectively, it follows that the optimal oblivious upper bounds are given by p'_1p'_2p'_3 = p, e.g., by setting p'_i = p^{1/3} for symmetric bounds. Oblivious lower bounds are given by the 16 inequalities after inverting the inequality sign. Here we see that the three inequalities p'_j ≤ p together imply the others. Hence, oblivious lower bounds are those that satisfy all three inequalities. The only optimal oblivious lower bound is then given by p'_j = p.
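The 16 inequalities of Fig. 4a can be checked mechanically. A sketch of our own (p = 0.6 and the split 0.9 · 0.9 · p/0.81 are arbitrary hypothetical choices satisfying p'_1p'_2p'_3 = p):

```python
from itertools import product

# f_c = (x v y1)(x v y2)(x v y3) & y4, and its dissociation f_c'
fc  = lambda x, y: int((x or y[0]) and (x or y[1]) and (x or y[2]) and y[3])
fcp = lambda xs, y: int((xs[0] or y[0]) and (xs[1] or y[1]) and (xs[2] or y[2]) and y[3])

def pr(f, probs):
    # P[f] for a function taking a tuple of len(probs) independent 0/1 variables
    tot = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        w = 1.0
        for b, p in zip(bits, probs):
            w *= p if b else 1 - p
        tot += w * f(bits)
    return tot

p = 0.6
pp = [0.9, 0.9, p / 0.81]                      # p1' * p2' * p3' = p (optimal upper bound)
for nu in product([0, 1], repeat=4):
    lhs = pr(lambda xs: fcp(xs, nu), pp)       # P_p'[f_c'[nu]]
    rhs = pr(lambda xs: fc(xs[0], nu), [p])    # P_p[f_c[nu]]
    assert lhs >= rhs - 1e-12                  # all 16 inequalities hold
```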
Example 4.12 (DNF Dissociation). Consider the function f_d given by a DNF expression and its dissociation f'_d:

f_d = xy_1 ∨ xy_2 ∨ xy_3 ∨ y_4
f'_d = x'_1y_1 ∨ x'_2y_2 ∨ x'_3y_3 ∨ y_4

An oblivious upper bound p' = ⟨p'_1, p'_2, p'_3⟩ must thus satisfy the 16 inequalities^9 given under "DNF dissociation" in Fig. 4a. For valuations with ν_4 = 1 (and thus f'_d[ν] = 1) or all ν_j = 0 (and thus f_d[ν] = 0) the inequalities trivially hold. For the remaining inequalities, we see that the elements of the set {x'_1, x'_2, x'_3} together imply all others, and that x'_1 ∨ x'_2 ∨ x'_3 is implied by all others (Fig. 4c shows the partial order between the non-trivial valuations). Thus, an oblivious upper bound must satisfy p'_j ≥ p, and the optimal one is given by p'_j = p. Analogously, an oblivious lower bound must satisfy p̄'_1p̄'_2p̄'_3 ≥ p̄. Optimal ones are given for p̄'_1p̄'_2p̄'_3 = p̄, e.g., by setting p'_j = 1 − p̄^{1/3}.
5. RELAXATION AND MODEL-BASED BOUNDS AS DISSOCIATION
This section formalizes the connection between relaxation, model-based bounds, and dissociation that was outlined in the introduction. In other words, we show how both previous approaches can be unified under the framework of dissociation.

5.1. Relaxation & Compensation
The following proposition presents relaxation & compensation as conjunctive dissociation and was brought to our attention by Choi and Darwiche [2011].

PROPOSITION 5.1 (COMPENSATION AND CONJUNCTIVE DISSOCIATION). Let f_1, f_2 be two monotone Boolean functions which share only one single variable x. Let f be their conjunction, and f' the dissociation of f on x, i.e.

f = f_1 ∧ f_2
f' = f_1[x'_1/x] ∧ f_2[x'_2/x]

Then P[f] = P[f'] for P[x'_1] = P[x] and P[x'_2] = P[x | f_1].
PROOF OF PROP. 5.1. First, note that P[f] = P[f_1] · P[f_2 | f_1]. On the other hand, P[f'] = P[f'_1] · P[f'_2], as f'_1 and f'_2 are independent after dissociating on the only shared variable x. We also have P[f_1] = P[f'_1] since P[x] = P[x'_1]. It remains to be shown that P[f'_2] = P[f_2 | f_1]. Indeed:

P[f'_2] = P[x'_2 · f'_2[1/x'_2] ∨ x̄'_2 · f'_2[0/x'_2]]
        = P[x'_2] · P[f'_2[1/x'_2]] + P[x̄'_2] · P[f'_2[0/x'_2]]
        = P[x | f_1] · P[f_2[1/x]] + P[x̄ | f_1] · P[f_2[0/x]]
        = P[x | f_1] · P[f_2[1/x] | f_1] + P[x̄ | f_1] · P[f_2[0/x] | f_1]
        = P[f_2 | f_1]

which proves the claim.
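The exactness of compensation is easy to confirm numerically. A sketch with the hypothetical choice f_1 = x ∨ y_1 and f_2 = x ∨ y_2 (which share only x), and arbitrary probability values:

```python
from itertools import product

def pr(f, probs):
    # exact probability by enumerating all 0/1 assignments
    tot = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        w = 1.0
        for b, p in zip(bits, probs):
            w *= p if b else 1 - p
        tot += w * f(*bits)
    return tot

p, q1, q2 = 0.6, 0.3, 0.8
f = lambda x, y1, y2: (x or y1) and (x or y2)     # f1 = x v y1, f2 = x v y2, shared: x
Pf = pr(f, [p, q1, q2])

# compensation: P[x1'] = P[x]; P[x2'] = P[x | f1] = p / P[f1], since x implies f1
Pf1 = pr(lambda x, y1: x or y1, [p, q1])
p2c = p / Pf1
fp = lambda x1, x2, y1, y2: (x1 or y1) and (x2 or y2)
assert abs(pr(fp, [p, p2c, q1, q2]) - Pf) < 1e-12  # the dissociation is exact
```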
Note that compensation is not oblivious, since the probability p'_2 depends on the other variables occurring in f_1. Further note that, in general, f_1 and f_2 have more than one variable in common, and in this case we have P[f'] ≠ P[f] for the same compensation. Thus, in general, compensation is applied as a heuristic, and it is then not known whether it provides an upper or lower bound.

The dual result for disjunctions holds by replacing f_1 with its negation ¬f_1 in P[x'_2] = P[x | ¬f_1]. This result is not immediately obvious from the previous one and has, to the best of our knowledge, not been stated or applied anywhere before.
PROPOSITION 5.2 ("DISJUNCTIVE COMPENSATION"). Let f_1, f_2 be two monotone Boolean functions which share only one single variable x. Let f be their disjunction, and f' the dissociation of f on x, i.e. f = f_1 ∨ f_2 and f' = f_1[x'_1/x] ∨ f_2[x'_2/x]. Then P[f] = P[f'] for P[x'_1] = P[x] and P[x'_2] = P[x | ¬f_1].

PROOF OF PROP. 5.2. Let g = ¬f, g_1 = ¬f_1, and g_2 = ¬f_2. Then f = f_1 ∨ f_2 is equivalent to g = g_1 ∧ g_2. From Prop. 5.1, we know that P[g] = P[g'], and thus P[f] = P[f'], for P[x'_1] = P[x] and P[x'_2] = P[x | g_1] = P[x | ¬f_1].
5.2. Model-based Approximation
The following proposition shows that all model-based bounds can be derived by repeated dissociation. However, not all dissociation bounds can be explained as models, since dissociation is in essence an algebraic and not a model-based technique (dissociation creates more variables and thus changes the probability space). Therefore, dissociation can improve any existing model-based approximation approach. Example 7.2 will illustrate this with a detailed simulation-based example.

PROPOSITION 5.3 (MODEL-BASED BOUNDS AS DISSOCIATIONS). Let f, f_U be two monotone Boolean functions over the same set of variables for which the logical implication f ⇒ f_U holds. Then: (a) there exists a sequence of optimal conjunctive dissociations that transforms f into f_U, and (b) there exists a sequence of non-optimal disjunctive dissociations that transforms f into f_U. The dual result holds for the logical implication f_L ⇒ f: (c) there exists a sequence of optimal disjunctive dissociations that transforms f into f_L, and (d) there exists a sequence of non-optimal conjunctive dissociations that transforms f into f_L.
PROOF OF PROP. 5.3. We focus here on the implication f ⇒ f_U. The proposition for the result f_L ⇒ f then follows from duality.

(a) The implication f ⇒ f_U holds iff there exists a positive function f_2 such that f = f_U ∧ f_2. Pick a set of variables x s.t. f_2[1/x] = 1, and dissociate f on x into f' = f_U[x'_1/x] ∧ f_2[x'_2/x]. By setting the probabilities of the dissociated variables to p'_1 = p and p'_2 = 1, the bounds become optimal (p'_1 · p'_2 = p). Furthermore, f_U remains unchanged (except for the renaming of x to x'_1), whereas f_2 becomes true. Hence, we get f' = f_U. Thus, all model-based upper bounds can be obtained by conjunctive dissociation and choosing optimal oblivious bounds at each dissociation step.

(b) The implication f ⇒ f_U also holds iff there exists a positive function f_d such that f_U = f ∨ f_d. Let m be the positive minterm, or elementary conjunction, involving all variables of f. The function f_d can then be written as a DNF f_d = c_1 ∨ c_2 ∨ ⋯, with products c_i whose variables are contained in m. Since f is monotone, we know m ⇒ f, and thus also m·f_d ⇒ f. We can therefore write f = f ∨ m·f_d, or as

f = f ∨ m·c_1 ∨ m·c_2 ∨ ⋯

Let x_i be the set of all variables in m that do not occur in c_i, and denote by m_i the conjunction of the variables x_i. Then each m·c_i can instead be written as m_i·c_i, and thus:

f = f ∨ m_1c_1 ∨ m_2c_2 ∨ ⋯

WLOG, we now separate one particular conjunct m_i c_i and dissociate on the set x_i:

f' = (f ∨ m_1c_1 ∨ m_2c_2 ∨ ⋯)[x'_1/x_i] ∨ (m_i c_i)[x'_2/x_i]

where the first disjunct plays the role of f_1 and the second that of f_2. By setting the probabilities of the dissociated variables to the non-optimal upper bounds p'_1 = p and p'_2 = 1, f_1 remains unchanged (except for the renaming of x_i to x'_1), whereas f_2 becomes c_i. Hence, we get f' = f ∨ m_1c_1 ∨ m_2c_2 ∨ ⋯ ∨ c_i. We can now repeat the same process for all conjuncts m·c_i and obtain, after a finite number of dissociation steps,

f'' = f ∨ (c_1 ∨ c_2 ∨ ⋯) = f ∨ f_d

Hence f'' = f_U. Thus, all model-based upper bounds can be obtained by disjunctive dissociation and choosing non-optimal bounds at each dissociation step.
6. QUERY-CENTRIC DISSOCIATION BOUNDS FOR PROBABILISTIC QUERIES
Our previous work [Gatterbauer et al. 2010] has shown how to upper bound the probability of conjunctive queries without self-joins by issuing a sequence of SQL statements over a standard relational DBMS. This section illustrates such dissociation-based upper bounds and also complements them with new lower bounds. We use the Boolean query Q :− R(X), S(X,Y), T(Y), for which the probability computation problem is known to be #P-hard, over the following database instance D:

    R  | A        S  | A  B        T  | B
    x1 | 1        z1 | 1  1        y1 | 1
    x2 | 2        z2 | 2  1        y2 | 2
                  z3 | 2  2

Thus, relation S has three tuples (1,1), (2,1), (2,2), and both R and T have two tuples (1) and (2). Each tuple is annotated with a Boolean variable x_1, x_2, z_1, ..., y_2, which represents the independent event that the corresponding tuple is present in the database. The lineage expression ϕ is then the DNF that states which tuples need to be present in order for the Boolean query Q to be true:

ϕ = x_1z_1y_1 ∨ x_2z_2y_1 ∨ x_2z_3y_2
Calculating P[ϕ] for a general database instance is #P-hard. However, if we treat each occurrence of a variable x_i in ϕ as different (in other words, we dissociate ϕ eagerly on all tuples x_i from table R), then we get a read-once expression

ϕ' = x'_1z_1y_1 ∨ x'_{2,1}z_2y_1 ∨ x'_{2,2}z_3y_2
   = (x'_1z_1 ∨ x'_{2,1}z_2)y_1 ∨ x'_{2,2}z_3y_2
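The read-once structure of ϕ' makes both bounds directly computable. The sketch below (our own illustration with hypothetical tuple probabilities) compares them against the exact P[ϕ] obtained by enumerating all 2^7 assignments:

```python
from itertools import product

def pr(f, probs):
    # exact probability by enumerating all 0/1 assignments of the tuple variables
    tot = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        w = 1.0
        for b, p in zip(bits, probs):
            w *= p if b else 1 - p
        tot += w * f(*bits)
    return tot

ior = lambda a, b: 1 - (1 - a) * (1 - b)        # independent-or

# phi = x1 z1 y1  v  x2 z2 y1  v  x2 z3 y2  (7 independent tuple variables)
phi = lambda x1, x2, z1, z2, z3, y1, y2: \
    (x1 and z1 and y1) or (x2 and z2 and y1) or (x2 and z3 and y2)

p1, p2 = 0.5, 0.7             # P[x1], P[x2]  (hypothetical)
r1, r2, r3 = 0.8, 0.6, 0.9    # P[z1], P[z2], P[z3]
q1, q2 = 0.4, 0.5             # P[y1], P[y2]
exact = pr(phi, [p1, p2, r1, r2, r3, q1, q2])

# read-once dissociation phi' = (x1' z1 v x2,1' z2) y1  v  x2,2' z3 y2
def readonce(a1, a21, a22):
    return ior(ior(a1 * r1, a21 * r2) * q1, a22 * r3 * q2)

upper = readonce(p1, p2, p2)       # original probabilities -> upper bound
lo = 1 - (1 - p2) ** 0.5           # x2 occurs d = 2 times
lower = readonce(p1, lo, lo)       # 1 - sqrt(1 - p2) -> lower bound
assert lower - 1e-12 <= exact <= upper + 1e-12
```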
select IOR(Q3.P) as P
from
  (select T.B, T.P*Q2.P as P
   from T,
        (select Q1.B, IOR(Q1.P) as P
         from (select S.A, S.B, S.P*R.P as P
               from R, S
               where R.A = S.A) as Q1
         group by Q1.B) as Q2
   where T.B = Q2.B) as Q3

(a) Query P_R

create view VR as
select R.A,
       1-power(1-R.P, 1e0/count(*)) as P
from R, S, T
where R.A = S.A
  and S.B = T.B
group by R.A, R.P

(b) View V_R for lower bound with P_R

Fig. 5. (a): SQL query corresponding to plan P_R for deriving an upper bound for the hard probabilistic Boolean query Q :− R(X), S(X,Y), T(Y). Table R needs to be replaced with the view V_R from (b) for deriving a lower bound. IOR is a user-defined aggregate explained in the text and stated in Appendix C.
Writing p_i, q_i, r_i for the probabilities of the variables x_i, y_i, z_i, respectively, we can calculate

P[ϕ'] = ((p'_1·r_1) ⊗ (p'_{2,1}·r_2)) · q_1 ⊗ (p'_{2,2}·r_3·q_2)

where · stands for multiplication and ⊗ for independent-or.^10

We know from Theorem 4.8 (3) that P[ϕ'] is an upper bound on P[ϕ] if we assign the original probabilities to the dissociated variables. Furthermore, we have shown in [Gatterbauer et al. 2010] that P[ϕ'] can be calculated with a probabilistic query plan

P_R = π^p_∅ ⋈^p_Y [π^p_Y ⋈^p_X [R(X), S(X,Y)], T(Y)]

where the probabilistic join operator ⋈^p (in prefix notation) and the probabilistic project operator with duplicate elimination π^p compute the probability assuming that their inputs are independent [Fuhr and Rölleke 1997]. Thus, when the join operator joins two tuples with probabilities p_1 and p_2, respectively, the output has probability p_1p_2. When the independent project operator eliminates k duplicate records with probabilities p_1, ..., p_k, respectively, the output has probability 1 − p̄_1 ··· p̄_k. This connection between read-once formulas and query plans was first observed by Olteanu and Huang [2008]. We write here P_R to emphasize that this plan dissociates tuples in table R.^11 Figure 5a shows the corresponding SQL query, assuming that each of the input tables has one additional attribute P for the probability of a tuple. The query deploys a user-defined aggregate (UDA) IOR that calculates the independent-or of the probabilities of the tuples grouped together, i.e. IOR(p_1, p_2, ..., p_n) = 1 − p̄_1p̄_2 ··· p̄_n. Appendix C states the UDA definition for PostgreSQL.
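The semantics of the IOR aggregate can be sketched in a few lines of Python (the actual PostgreSQL UDA is given in Appendix C):

```python
from functools import reduce

def ior_agg(ps):
    """Independent-or aggregate: 1 - (1-p1)(1-p2)...(1-pn), i.e. the
    probability that at least one of n independent events occurs."""
    return 1 - reduce(lambda acc, p: acc * (1 - p), ps, 1.0)

assert abs(ior_agg([0.5, 0.5]) - 0.75) < 1e-12
assert ior_agg([]) == 0.0           # empty group: no event can occur
assert ior_agg([1.0, 0.3]) == 1.0   # a certain event dominates
```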
We also know from Theorem 4.8 (4) that P[ϕ'] is a lower bound on P[ϕ] if we assign the new probabilities 1 − √(1−p_2) to x'_{2,1} and x'_{2,2} (or, more generally, any probabilities p'_{2,1} and p'_{2,2} with p̄'_{2,1}·p̄'_{2,2} ≥ p̄_2). Because of the connection between the read-once expression ϕ' and the query plan P_R, we can calculate the lower bound by using the same SQL query from Fig. 5a after exchanging the table R with a view V_R (Fig. 5b); V_R is basically a copy of R that replaces the probability p_i of a tuple x_i appearing in ϕ with 1 − (1−p_i)^{1/d_i}, where d_i is the number of times that x_i appears in the lineage of Q. The view joins tables R, S, and T, groups the original input tuples x_i from R, and assigns each x_i the new probability 1 − (1−p_i)^{1/d_i}, calculated as 1-power(1-R.P, 1e0/count(*)).

^10 The independent-or combines two probabilities as if calculating the disjunction between two independent events. It is defined as p_1 ⊗ p_2 := 1 − p̄_1p̄_2.
^11 Note that we use the notation P_R for both the probabilistic query plan and the corresponding SQL query.
As an alternative to ϕ', if we treat each occurrence of a variable y_j in ϕ as different (in other words, we dissociate ϕ eagerly on all tuples y_j from table T), then we get another read-once expression

ϕ'' = x_1z_1y'_{1,1} ∨ x_2z_2y'_{1,2} ∨ x_2z_3y_2
    = x_1z_1y'_{1,1} ∨ x_2(z_2y'_{1,2} ∨ z_3y_2)

P[ϕ''] is an upper bound on P[ϕ] if we assign the original probabilities to the dissociated variables. In turn, P[ϕ''] can be calculated with another probabilistic query plan that dissociates all tuples from table T instead of R:

P_T = π^p_∅ ⋈^p_X [R(X), π^p_X ⋈^p_Y [S(X,Y), T(Y)]]

Similarly to before, P[ϕ''] is a lower bound on P[ϕ] if we assign the new probabilities 1 − √(1−q_1) to y'_{1,1} and y'_{1,2}. And we can calculate this lower bound with the same query P_T after exchanging T with a view V_T that replaces the probability q_j of a tuple y_j with 1 − (1−q_j)^{1/d_j}, where d_j is the number of times that y_j appears in the lineage of Q.

Note that both query plans calculate upper and lower bounds for query Q over any database instance D. In fact, all possible query plans give upper bounds on the true query probability. And as we have illustrated here, by replacing the input tables with appropriate views, we can use the same query plans to derive lower bounds. We refer the reader to [Gatterbauer et al. 2010], where we develop the theory of the partial dissociation order among all possible query plans and give a sound and complete algorithm that returns a set of query plans which are guaranteed to give the tightest bounds possible in a query-centric way for any conjunctive query without self-joins. For our example hard query Q, the plans P_R and P_T are the best possible plans. We further refer to [Gatterbauer and Suciu 2013] for more details and an extensive discussion of how to speed up the resulting multi-query evaluation.
Also note that these upper and lower bounds can be derived with the help of any standard relational database, even cloud-based databases, which commonly do not allow users to define their own UDAs.^12 To the best of our knowledge, this is the first technique that can upper and lower bound hard probabilistic queries without any modifications to the database engine and without performing any calculations outside the database.
7. ILLUSTRATIONS OF OBLIVIOUS BOUNDS
In this section, we study the quality of oblivious bounds across varying scenarios: We study the bounds as a function of the correlation between non-dissociated variables (Sect. 7.1), compare dissociation-based with model-based approximations (Sect. 7.2), illustrate a fundamental asymmetry between optimal upper and lower bounds (Sect. 7.3), show that increasing the number of simultaneous dissociations does not necessarily worsen the bounds (Sect. 7.4), and apply our framework to approximate hard probabilistic queries over TPC-H data with a standard relational database management system (Sect. 7.5).
7.1. Oblivious Bounds as a Function of the Correlation between Variables
Example 7.1 (Oblivious Bounds and Correlations). Here we dissociate the DNF ϕ_d = xA ∨ xB and the analogous CNF ϕ_c = (x ∨ A)(x ∨ B) on x and study the error of the optimal oblivious bounds as a function of the correlation between A and B.^13

^12 The UDA IOR can be expressed with standard SQL aggregates, e.g., IOR(Q3.P) can be evaluated with 1-exp(sum(log(case Q3.P when 1 then 1E-307 else 1-Q3.P end))) on Microsoft SQL Azure.

[Figure 6 panels: (a) CNF for p = 0.2, (b) CNF for p = 0.5, (c) CNF for p = 0.8, (d) DNF for p = 0.8, (e) DNF for p = 0.5, (f) DNF for p = 0.2. Each panel plots P[ϕ] and its bounds on [0,1] against the correlation ρ(A,B), with curves for q = 0.8, 0.5, 0.2.]

Fig. 6. Example 7.1. Probabilities of the CNF ϕ_c = (x∨A)(x∨B) and the DNF ϕ_d = xA ∨ xB together with their symmetric optimal upper and lower oblivious bounds (borders of shaded areas) as a function of the correlation ρ(A,B) between A and B, and of the parameters p = P[x] and q = P[A] = P[B]. For every choice of p, there are some A and B for which the upper or lower bound becomes tight.
Clearly, the bounds also depend on the probabilities of the variables x, A, and B. Let p = P[x] and assume A and B have the same probability q = P[A] = P[B]. We set p' = P[x'_1] = P[x'_2] according to the optimal symmetric bounds from Corollary 4.9. In a few steps, one can calculate the probabilities as

P[ϕ_d] = 2pq − p·P[AB]
P[ϕ'_d] = 2p'q − p'^2·P[AB]
P[ϕ_c] = p + (1−p)·P[AB]
P[ϕ'_c] = 2p'q + p'^2(1−2q) + (1−p')^2·P[AB]
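These four closed forms can be verified against direct enumeration over a correlated joint distribution for (A, B). In the sketch below (our own illustration), the values of p, q, P[AB], and p' are arbitrary hypothetical choices:

```python
from itertools import product

# correlated pair (A, B) with P[A] = P[B] = q and a chosen P[AB]
p, q, pab = 0.8, 0.5, 0.35          # requires max(0, 2q-1) <= pab <= q
joint = {(1, 1): pab, (1, 0): q - pab, (0, 1): q - pab, (0, 0): 1 - 2 * q + pab}

def pr(f, pxs):
    # P[f(x..., A, B)] with independent x's and the correlated (A, B) above
    tot = 0.0
    for xs in product([0, 1], repeat=len(pxs)):
        w = 1.0
        for bit, px in zip(xs, pxs):
            w *= px if bit else 1 - px
        for (a, b), wab in joint.items():
            tot += w * wab * f(*xs, a, b)
    return tot

pp = 0.9                            # common probability of both dissociated copies
assert abs(pr(lambda x, a, b: (x and a) or (x and b), [p])
           - (2 * p * q - p * pab)) < 1e-12
assert abs(pr(lambda x1, x2, a, b: (x1 and a) or (x2 and b), [pp, pp])
           - (2 * pp * q - pp ** 2 * pab)) < 1e-12
assert abs(pr(lambda x, a, b: (x or a) and (x or b), [p])
           - (p + (1 - p) * pab)) < 1e-12
assert abs(pr(lambda x1, x2, a, b: (x1 or a) and (x2 or b), [pp, pp])
           - (2 * pp * q + pp ** 2 * (1 - 2 * q) + (1 - pp) ** 2 * pab)) < 1e-12
```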
Results: Figure 6 shows the probabilities P[ϕ] of the expressions (full lines) and those of their dissociations P[ϕ'] (borders of shaded areas) for various values of p and q, as a function of the correlation ρ(A,B).^14 For example, Fig. 6d shows the case of the DNF when P[x] is p = 0.8 and A, B have the same probability q of either 0.8, 0.5, or 0.2. When A, B are not correlated at all (ρ = 0), the upper bound is a better approximation when q is small, and the lower bound is a better approximation when q is large. On the other hand, if A, B are not correlated, then there is no need to dissociate the two instances of x, as one can then compute P[(x∨A)(x∨B)] simply as p + p̄·P[A]P[B]. The more interesting case is when A, B are positively correlated (P[AB] ≥ P[A]P[B]; e.g., positive Boolean functions of other independent variables z, such as the provenance of probabilistic conjunctive queries). The area to the right of the vertical dashed line in Fig. 6d shows that, in this case, dissociation offers very good upper and lower bounds, especially when the formula has a low probability. The graph also shows the effect of dissociation when A, B are negatively correlated (left of the dashed line). Notice that the correlation cannot always reach −1 (e.g., two events, each with probability > 0.5, can never be disjoint). The graphs also illustrate why these bounds are obliviously optimal, i.e. optimal without knowledge of A, B: for every choice of p, there are some A, B for which the upper or lower bound becomes tight.

^13 Note that this simplified example also illustrates the more general case ψ_d = xA ∨ xB ∨ C when C is independent of A and B, and thus P[ψ_d] = P[ϕ_d]·(1 − P[C]) + P[C]. As a consequence, the graphs in Fig. 6 for P[C] ≠ 0 would be vertically compressed and the bounds tighter in absolute terms.

^14 The correlation ρ(A,B) between Boolean events A and B is defined as ρ(A,B) = cov(A,B)/√(var(A)·var(B)), with covariance cov(A,B) = P[AB] − P[A]·P[B] and variance var(A) = P[A] − (P[A])^2 [Feller 1968].
7.2. Oblivious Bounds versus Model-based Approximations
Example 7.2 (Disjunctive Dissociation and Models). This example compares the approximation quality of our dissociation-based approach with the model-based approach by Fink and Olteanu [2011] and illustrates that dissociation-based bounds are, in general, tighter than model-based approximations. For this purpose, we consider again the hard Boolean query Q :− R(X), S^d(X,Y), T(Y) over the database D from Sect. 6. We now only assume that the table S is deterministic, as indicated by the superscript d in S^d. The query-equivalent lineage formula is then

ϕ = x_1y_1 ∨ x_2y_1 ∨ x_2y_2

for which Fig. 7a shows the bipartite primal graph. We use this instance because its primal graph forms a P_4, which is the simplest 2-partite lineage that is not read-once.^15 In order to compare the approximation quality, we need to limit ourselves to an example which is tractable enough that we can generate the whole probability space. In practice, we allow each variable to have any of the 11 discrete probabilities D = {0, 0.1, 0.2, ..., 1} and consider all 11^4 = 14641 possible probability assignments ν : ⟨p_1, p_2, q_1, q_2⟩ ∈ D^4 with p = P[x] and q = P[y]. For each ν, we calculate both the absolute error δ = P̂[ϕ] − P[ϕ] and the relative error ε = δ/P[ϕ], where P̂[ϕ] stands for any of the approximations, and the exact probability P[ϕ] is calculated by Shannon expansion on y_1 as ϕ ≡ y_1(x_1 ∨ x_2) ∨ ¬y_1(x_2y_2), and thus P_{p,q}[ϕ] = (1 − (1−p_1)(1−p_2))q_1 + (1−q_1)p_2q_2.
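The Shannon-expansion formula for P_{p,q}[ϕ] can be cross-checked against brute-force enumeration; a minimal sketch (the probability grid below is an arbitrary subset of D):

```python
from itertools import product

def exact(p1, p2, q1, q2):
    # Shannon expansion on y1: phi = y1 (x1 v x2)  v  !y1 (x2 y2)
    return (1 - (1 - p1) * (1 - p2)) * q1 + (1 - q1) * p2 * q2

phi = lambda x1, x2, y1, y2: (x1 and y1) or (x2 and y1) or (x2 and y2)

for p1, p2, q1, q2 in product([0.0, 0.3, 0.7, 1.0], repeat=4):
    tot = 0.0
    for bits in product([0, 1], repeat=4):
        w = 1.0
        for b, pv in zip(bits, [p1, p2, q1, q2]):
            w *= pv if b else 1 - pv
        tot += w * phi(*bits)
    assert abs(tot - exact(p1, p2, q1, q2)) < 1e-12
```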
Models: We use the model-based approach by Fink and Olteanu [2011] to approximate ϕ with lowest upper bound (LUB) formulas ϕUi and greatest lower bound (GLB) formulas ϕLi, for which ϕLi ⇒ ϕ and ϕ ⇒ ϕUi, and no upper (lower) bound implies another upper (lower) bound. Among all models considered, we focus on only read-once formulas. Given the lineage ϕ, the 4 LUBs are ϕU1 = x1y1 ∨ x2, ϕU2 = y1 ∨ x2y2, ϕU3 = (x1 ∨ x2)y1 ∨ y2, and ϕU4 = x1 ∨ x2(y1 ∨ y2). The 3 GLBs are ϕL1 = (x1 ∨ x2)y1, ϕL2 = x2(y1 ∨ y2), and ϕL3 = x1y1 ∨ x2y2. For each ν, we choose mini P[ϕUi] and maxi P[ϕLi] as the best upper and lower model-based bounds, respectively.
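Because implication between Boolean functions carries over to probabilities under any product distribution, the bound properties of these formulas can be verified purely logically. The following Python sketch (names are ours) checks that ϕ implies every LUB, that every GLB implies ϕ, and that the bounds are pairwise incomparable:

```python
from itertools import product

# phi = x1 y1 v x2 y1 v x2 y2 and its model-based bounds from the text
PHI = lambda x1, x2, y1, y2: (x1 and y1) or (x2 and y1) or (x2 and y2)
LUBS = [lambda x1, x2, y1, y2: (x1 and y1) or x2,
        lambda x1, x2, y1, y2: y1 or (x2 and y2),
        lambda x1, x2, y1, y2: ((x1 or x2) and y1) or y2,
        lambda x1, x2, y1, y2: x1 or (x2 and (y1 or y2))]
GLBS = [lambda x1, x2, y1, y2: (x1 or x2) and y1,
        lambda x1, x2, y1, y2: x2 and (y1 or y2),
        lambda x1, x2, y1, y2: (x1 and y1) or (x2 and y2)]

def implies(f, g):
    """Check f => g over all Boolean assignments."""
    return all(not f(*a) or g(*a) for a in product([0, 1], repeat=4))
```

Since P is monotone with respect to implication, P[ϕLi] ≤ P[ϕ] ≤ P[ϕUj] follows for every product distribution.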
ρ(A,B) = (P[AB] − q²)/(q − q²) and, hence, P[AB] = ρ(A,B)·(q − q²) + q². Further, P[AB] = 0 (i.e., disjointness between A and B) is not possible for q > 0.5, and from P[A ∨ B] ≤ 1, one can derive P[AB] ≥ 2q − 1. In turn, ρ = −1 is not possible for q < 0.5, and it must hold that P[AB] ≥ 0. From both together, one can derive the condition ρmin(q) = max(−q/(1−q), −(1 + q² − 2q)/(q − q²)), which gives the minimum possible value for ρ, and which marks the left starting point of the graphs in Fig. 6 as a function of q.
15 A path Pn is a graph with vertices {x1, ..., xn} and edges {x1x2, x2x3, ..., xn−1xn}.
ACM Transactions on Database Systems, Vol. V, No. N, Article 1, Publication date: January YYYY.
1:20 W. Gatterbauer and D. Suciu
[Figure 7: (a) bipartite primal graph of ϕ over x1, x2, y1, y2; (b) primal graph of ϕ′1 over x1, x′2,1, x′2,2, y1, y2; (c) primal graph of ϕ′2 over x1, x2, y′1,1, y′1,2, y2; (d) optimal choices of p′ ∈ [0,1]² for the DNF dissociation, showing the region of oblivious lower bounds and the single oblivious upper bound (m marks model-based, d dissociation-based choices); (e) error table below; (f) relative errors for the top 15% of 14641 data points, sorted by decreasing εd; (g) relative errors for the top 50% of 14641 data points, sorted for each approximation individually.]

(e) Errors:
                    average    worst
absolute  m:  δmU    1.54%    15.8%
              δmL    0.98%    10.5%
          d:  δdU    0.35%     3.7%
              δdL    0.94%     8.7%
relative  m:  εmU    4.55%   289.3%
              εmL    2.23%    28.9%
          d:  εdU    0.73%     6.7%
              εdL    2.17%    28.9%

Fig. 7. Example 7.2. (a)-(c): Bipartite primal graphs for DNF ϕ and two dissociations. Notice that the primal graphs of ϕ′1 and ϕ′2 are forests and thus correspond to read-once expressions. (d): For a given disjunctive dissociation d, there is only one optimal oblivious upper bound but infinitely many optimal lower bounds. We evaluate P[ϕ′] for three of the latter (two of which coincide with models m) and keep the maximum as the best oblivious lower bound. (e): In comparison, dissociation gives substantially better upper bounds than model-based bounds (0.73% vs. 4.55% average and 6.7% vs. 289.3% worst-case relative errors), yet lower bounds are only slightly better. (f): Relative errors of 4 approximations for individual data points, sorted by the dissociation error for upper bounds and for lower bounds separately; this is why the dissociation errors show up as smooth curves (red) while the model-based errors are unsorted and thus ragged (blue). (g): Here we sorted the errors for each approximation individually; this is why all curves are smooth.
Dissociation: Analogously, we consider the possible dissociations into read-once formulas. For our given ϕ, those are ϕ′1 = (x1 ∨ x′2,1)y1 ∨ x′2,2y2 and ϕ′2 = x1y′1,1 ∨ x2(y′1,2 ∨ y2), with Fig. 7b and Fig. 7c illustrating the dissociated read-once primal graphs.16 From Corollary 4.9, we know that Pp′,q[ϕ′1] ≥ Pp,q[ϕ] for the only optimal oblivious upper bound p′2,1 = p′2,2 = p2, and Pp′,q[ϕ′1] ≤ Pp,q[ϕ] for any p′2 with p̄′2,1 p̄′2,2 = p̄2, i.e., (1−p′2,1)(1−p′2,2) = 1−p2. In particular, we choose 3 alternative optimal oblivious lower bounds p′2 ∈ {⟨p2, 0⟩, ⟨1−√(1−p2), 1−√(1−p2)⟩, ⟨0, p2⟩} (see Fig. 7d). Analogously, Pp,q′[ϕ′2] ≥ Pp,q[ϕ] for q′1,1 = q′1,2 = q1, and Pp,q′[ϕ′2] ≤ Pp,q[ϕ] for q′1 ∈ {⟨q1, 0⟩, ⟨1−√(1−q1), 1−√(1−q1)⟩, ⟨0, q1⟩}. For each ν, we choose the minimum among the 2 upper bounds and the maximum among the 6 lower bounds as the best upper and lower dissociation-based bounds, respectively.
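The following Python sketch (names and the probability assignment are ours) evaluates the read-once dissociation ϕ′1, its unique oblivious upper bound, and the three chosen lower bounds against the exact closed form from the beginning of this example:

```python
from math import sqrt

def prob_diss1(p1, pd1, pd2, q1, q2):
    """Read-once P[phi'_1] for phi'_1 = (x1 v x'_{2,1}) y1  v  x'_{2,2} y2,
    where the two fresh copies of x2 get probabilities pd1 and pd2."""
    return 1 - (1 - (1 - (1-p1)*(1-pd1)) * q1) * (1 - pd2*q2)

p1, p2, q1, q2 = 0.2, 0.5, 0.3, 0.7
exact = (1 - (1-p1)*(1-p2))*q1 + (1-q1)*p2*q2   # Shannon closed form for phi
upper = prob_diss1(p1, p2, p2, q1, q2)          # p'_{2,1} = p'_{2,2} = p2
s = 1 - sqrt(1 - p2)                            # complements multiply to 1-p2
lowers = [prob_diss1(p1, p2, 0.0, q1, q2),      # <p2, 0>
          prob_diss1(p1, s, s, q1, q2),         # symmetric choice
          prob_diss1(p1, 0.0, p2, q1, q2)]      # <0, p2>
```

For this assignment the sandwich is max(lowers) = 0.389 ≤ P[ϕ] = 0.425 ≤ 0.467 = upper.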
Results: Figures 7e-g show that dissociation-based bounds are always better than or equal to model-based bounds. The reason is that all model-based bounds are special cases of oblivious dissociation bounds. Furthermore, dissociation gives far better upper bounds, but only slightly better lower bounds. The reason is illustrated in Fig. 7d: the single dissociation-based upper bound p′ = ⟨p, p⟩ always dominates the two model-based upper bounds, whereas the two model-based lower bounds are special cases of the infinitely many optimal oblivious lower dissociation-based bounds. In the extreme case, it is therefore possible for a model-based lower bound to coincide with the best among all optimal
16 Note that we consider here dissociations on both x- and y-variables, and thus do not treat them as distinct.
[Figure 8: (a) bipartite primal graph of ψ over x1, x2, y1, y2; (b) primal graph of ψ′1 over x1, x′2,1, x′2,2, y1, y2; (c) primal graph of ψ′2 over x1, x2, y′1,1, y′1,2, y2; (d) optimal choices of p′ ∈ [0,1]² for the CNF dissociation, showing the single oblivious lower bound and the region of oblivious upper bounds; (e) error table below; (f) relative errors for the top 15% of 14641 data points, sorted by decreasing εd; (g) relative errors for the top 50% of 14641 data points, sorted for each approximation individually.]

(e) Errors:
                      average    worst
absolute  DNF:  δdU    0.35%     3.7%
                δdL    0.94%     8.7%
          CNF:  δcU    0.94%     8.7%
                δcL    0.35%     3.7%
relative  DNF:  εdU    0.73%     6.7%
                εdL    2.17%    28.9%
          CNF:  εcU    2.69%    54.5%
                εcL    1.09%    28.2%

Fig. 8. Example 7.3. (a)-(c): Bipartite primal graphs for CNF ψ and two dissociations. (d): For a given conjunctive dissociation c, we use the only optimal oblivious lower bound and three of the infinitely many optimal oblivious upper bounds. (e): In comparison, disjunctive dissociation gives better upper bounds than conjunctive dissociation (0.73% vs. 2.69% average and 6.7% vs. 54.5% worst-case relative errors), and vice versa for lower bounds. (f): Relative errors of 4 approximations for individual data points, sorted by the disjunctive dissociation error εd for upper bounds and for lower bounds separately; this is why the DNF dissociation errors show up as smooth curves (red) while the CNF dissociation errors are unsorted and thus ragged (gray). (g): Here we sorted the errors for each approximation individually; this is why all curves are smooth.
oblivious lower dissociation bounds. For our example, we evaluate three oblivious lower
bounds, two of which coincide with models.
7.3. Conjunctive versus Disjunctive Dissociation
Example 7.3 (Disjunctive and Conjunctive Dissociation). This example illustrates an interesting asymmetry: optimal upper bounds for disjunctive dissociations and optimal lower bounds for conjunctive dissociations are not only unique but also better, on average, than optimal upper bounds for conjunctive dissociations and optimal lower bounds for disjunctive dissociations, respectively (see Figure 1). We show this by comparing the approximations obtained by dissociating either a conjunctive or a disjunctive expression for the same function.

We re-use the setup from Example 7.2, where we had a function expressed by a disjunctive expression ϕ. Our DNF ϕ can be written as the CNF ψ = (x1 ∨ x2)(y1 ∨ x2)(y1 ∨ y2) with fϕ = fψ,17 and we consider the two conjunctive dissociations ψ′1 = (x1 ∨ x′2,1)(y1 ∨ x′2,2)(y1 ∨ y2) and ψ′2 = (x1 ∨ x2)(y′1,1 ∨ x2)(y′1,2 ∨ y2) (Figures 8a-c show the primal graphs).
Again from Corollary 4.9, we know that Pp′,q[ψ′1] ≤ Pp,q[ψ] for the only optimal oblivious lower bound p′2,1 = p′2,2 = p2, and Pp′,q[ψ′1] ≥ Pp,q[ψ] for any p′2 with p′2,1 p′2,2 = p2. In particular, we choose 3 alternative optimal oblivious upper bounds p′2 ∈ {⟨p2, 1⟩, ⟨√p2, √p2⟩, ⟨1, p2⟩} (see Fig. 8d). Analogously, Pp,q′[ψ′2] ≤ Pp,q[ψ] for
17 Notice that this transformation from DNF to CNF is hard in general. We do not focus on algorithmic aspects in this paper, but rather show the potential of this new approach.
[Figure 9: (a) primal graph of the path DNF ϕn over x1, ..., xn and y1, ..., yn; (b) primal graph of its dissociation ϕ′n with fresh variables x′1,1, x′1,2, ..., x′n−1,2 and the original xn; (c) probabilities as a function of n (log scale) for constant p = 0.1; (d) probabilities as a function of n with P[ϕn] = 0.5 held constant, annotated with max P[ϕ′n] ≈ 0.5312 and the decreasing p.]

Fig. 9. Example 7.4. (a), (b): Primal graphs for the path Pn DNF ϕn and its dissociation ϕ′n. (c), (d): P[ϕn] together with their symmetric optimal upper and lower oblivious bounds (borders of shaded areas) as functions of n. (c) keeps p = 0.1 constant, whereas (d) varies p so as to keep P[ϕ] = 0.5 constant for increasing n. The upper bounds approximate the probability of the DNF very well and even become tight for n → ∞.
q′1,1 = q′1,2 = q1, and Pp,q′[ψ′2] ≥ Pp,q[ψ] for q′1 ∈ {⟨q1, 1⟩, ⟨√q1, √q1⟩, ⟨1, q1⟩}. For each ν, we choose the maximum among the 2 lower bounds and the minimum among the 6 upper bounds as the best lower and upper conjunctive dissociation-based bounds, respectively. We then compare with the approximations from the DNF ϕ in Example 7.2.
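The conjunctive counterpart can be spot-checked the same way; the following Python sketch (names and probabilities are ours) evaluates ψ′1 by brute force over its 5 independent variables and checks the sandwich around the exact probability:

```python
from itertools import product
from math import sqrt

def prob_psi1(p1, pd1, pd2, q1, q2):
    """Brute-force P[psi'_1] for psi'_1 = (x1 v x'_{2,1})(y1 v x'_{2,2})(y1 v y2)
    over 5 independent Boolean variables."""
    total = 0.0
    for x1, xd1, xd2, y1, y2 in product([0, 1], repeat=5):
        if (x1 or xd1) and (y1 or xd2) and (y1 or y2):
            total += ((p1 if x1 else 1-p1) * (pd1 if xd1 else 1-pd1)
                      * (pd2 if xd2 else 1-pd2)
                      * (q1 if y1 else 1-q1) * (q2 if y2 else 1-q2))
    return total

p1, p2, q1, q2 = 0.2, 0.5, 0.3, 0.7
exact = (1 - (1-p1)*(1-p2))*q1 + (1-q1)*p2*q2    # P[psi] = P[phi], Example 7.2
lower = prob_psi1(p1, p2, p2, q1, q2)            # p'_{2,1} = p'_{2,2} = p2
s = sqrt(p2)                                     # product of copies equals p2
uppers = [prob_psi1(p1, p2, 1.0, q1, q2),        # <p2, 1>
          prob_psi1(p1, s, s, q1, q2),           # <sqrt(p2), sqrt(p2)>
          prob_psi1(p1, 1.0, p2, q1, q2)]        # <1, p2>
```

For this assignment, lower = 0.327 ≤ P[ψ] = 0.425 ≤ 0.474 = min(uppers).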
Results: Figures 8e-g show that optimal disjunctive upper bounds are, in general but not consistently, better than optimal conjunctive upper bounds for the same function (83.5% of those data points with different approximations are better for disjunctive dissociations). The dual result holds for lower bounds. This duality can best be seen in the correspondence of absolute errors between upper and lower bounds.
7.4. Multiple Dissociations at Once
Here we investigate the influence of the primal graph and the number of dissociations on the tightness of the bounds. Both examples correspond to the lineage of the standard unsafe query Q :− R(X), S(X,Y), T(Y) over two different database instances.

Example 7.4 (Path Pn as Primal Graph). This example considers a DNF expression whose primal graph forms a Pn, i.e., a path of length n (see Fig. 9a). Note that this is a generalization of the path P4 from Example 7.2 and corresponds to the lineage of the same unsafe query over a larger database instance with 2n−1 tuples:

    ϕn = x1y1 ∨ x1y2 ∨ x2y2 ∨ ... ∨ xn−1yn ∨ xnyn
    ϕ′n = x′1,1y1 ∨ x′1,2y2 ∨ x′2,1y2 ∨ ... ∨ x′n−1,2yn ∨ xnyn
Exact: In the following, we assume the probabilities of all variables to be p and use the notation pn := P[ϕn] and p⁻n := P[ϕ⁻n], where ϕ⁻n corresponds to the formula ϕn without the last conjunct xnyn. We can then express pn as a function of p⁻n, pn−1, and p⁻n−1 by recursive application of Shannon's expansion to xn and yn:

    pn = P[xn](P[yn] + P[ȳn] pn−1) + P[x̄n] p⁻n
    p⁻n = P[yn](P[xn−1] + P[x̄n−1] p⁻n−1) + P[ȳn] pn−1

Substituting the second equation into the first, we thus get the linear recurrence system

    pn = A1 pn−1 + B1 p⁻n−1 + C1
    p⁻n = A2 pn−1 + B2 p⁻n−1 + C2
[Figure 10: (a) complete bipartite primal graph of ϕn over x1, ..., xn and y1, ..., yn; (b) primal graph of the dissociation ϕ′n with fresh variables x′1,1, ..., x′n,n; (c) probabilities as a function of n for constant p = 0.1; (d) probabilities as a function of n (log scale) with P[ϕn] = 0.5 held constant, annotated with limn→∞ P[ϕ′n] ≈ 0.5803 and the decreasing p.]

Fig. 10. Example 7.5. (a), (b): Primal graphs for complete bipartite DNF ϕn and its dissociation ϕ′n. (c), (d): P[ϕn] together with their symmetric optimal upper and lower oblivious bounds (borders of shaded areas) as functions of n. (d) varies p so as to keep P[ϕ] = 0.5 constant for increasing size n. The oblivious upper bound is ultimately bounded despite having n² fresh variables in the dissociated DNF for increasing n.
with A1 = p̄, B1 = p p̄², C1 = p²(2−p), A2 = p̄, B2 = p p̄, and C2 = p². With a few manipulations, this recurrence system can be transformed into a linear non-homogeneous recurrence relation of second order, pn = A pn−1 + B pn−2 + C, where A = A1 + B2 = p̄(1+p), B = A2B1 − A1B2 = −p²p̄², and C = B1C2 + C1(1−B2) = p²(p p̄² + (2−p)(1−p p̄)). Thus we can recursively calculate P[ϕn] for any probability assignment p, starting with the initial values p1 = p² and p2 = 3p² − 2p³.
Dissociation: Figure 9b shows the primal graph for the dissociation ϕ′n. Variables x1 to xn−1 are each dissociated into two variables with the same probability p′, whereas xn keeps its original probability p. In other words, with increasing n, more variables are dissociated into two fresh ones each. The probability P[ϕ′n] is then equal to the probability that at least one variable xi is connected to one variable yj:

    P[ϕ′n] = 1 − (1 − p p′)(1 − p(1 − p̄′²))^(n−2) (1 − p(1 − p̄ p̄′))

We set p′ = p for upper bounds, and p′ = 1 − √(1−p) for lower bounds.
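Both the recurrence from the Exact paragraph and the closed-form dissociation bounds above can be checked numerically for small n. The following Python sketch (function names are ours; uniform probability p as in the text) compares brute-force enumeration over all truth assignments with the recurrence and the bounds:

```python
from itertools import product
from math import sqrt

def prob_path(n, p):
    """Brute-force P[phi_n] for the path DNF
    phi_n = x1y1 v x1y2 v x2y2 v ... v x_{n-1}y_n v x_n y_n."""
    clauses = [(i, i) for i in range(n)] + [(i, i + 1) for i in range(n - 1)]
    total = 0.0
    for bits in product([0, 1], repeat=2 * n):
        xs, ys = bits[:n], bits[n:]
        if any(xs[i] and ys[j] for i, j in clauses):
            k = sum(bits)
            total += p**k * (1 - p)**(2 * n - k)
    return total

def prob_path_rec(n, p):
    """Same probability via the linear recurrence system from the text."""
    A1 = A2 = 1 - p
    B1, C1 = p * (1 - p)**2, p * p * (2 - p)
    B2, C2 = p * (1 - p), p * p
    pn, pmn = p * p, 0.0                 # p_1 = p^2; phi-_1 is empty
    for _ in range(n - 1):
        pn, pmn = A1*pn + B1*pmn + C1, A2*pn + B2*pmn + C2
    return pn

def prob_path_diss(n, p, pd):
    """Closed-form P[phi'_n] of the read-once dissociation (n >= 2)."""
    return 1 - ((1 - p*pd) * (1 - p*(1 - (1-pd)**2))**(n - 2)
                * (1 - p*(1 - (1-p)*(1-pd))))

n, p = 4, 0.3
exact = prob_path_rec(n, p)
upper = prob_path_diss(n, p, p)               # oblivious upper bound: p' = p
lower = prob_path_diss(n, p, 1 - sqrt(1-p))   # complements multiply to 1-p
```

The test values are small enough that full enumeration remains feasible.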
Results: Figures 9c-d show the interesting result that the disjunctive upper bounds become tight for increasing size of the primal graph, and thus increasing number of dissociations. This can best be seen in Fig. 9d, for which p is chosen so as to keep P[ϕ] = 0.5 constant for varying n, and we have limn→∞ P[ϕ′n] = P[ϕn] = 0.5 for the upper bounds. In contrast, the disjunctive lower bounds become weaker but still have a limit value limn→∞ P[ϕ′n] ≈ 0.2929 (derived numerically).
Example 7.5 (Complete bipartite graph Kn,n as Primal Graph). This example considers a DNF whose primal graph forms a complete bipartite graph of size n, i.e., each variable xi appears in one clause with each variable yj (see Fig. 10a). Note that this example corresponds to the lineage of the standard unsafe query over a database instance with O(n²) tuples:

    ϕn = ⋁(i,j)∈[n]² xiyj
    ϕ′n = ⋁j∈[n] yj (⋁i∈[n] x′i,j)

Exact: We again assume that all variables have the same probability p = P[xi] = P[yi]. P[ϕn] is then equal to the probability that there is at least one tuple xi and at
select distinct s_nationkey
from Supplier, Partsupp, Part
where s_suppkey = ps_suppkey
and ps_partkey = p_partkey
and s_suppkey <= $1
and p_name like $2

(a) Deterministic query Q(a)

Supplier(s_suppkey, s_nationkey)
PartSupp(ps_suppkey, ps_partkey)
Part(p_partkey, p_name)

(b) Relevant TPC-H schema

Fig. 11. Example 7.6. Parameterized SQL query Q(a) and relevant portion of the TPC-H schema.
least one tuple yi:

    P[ϕn] = (1 − (1−p)^n)²    (6)
Dissociation: Figure 10b shows the primal graph for the dissociation ϕ′n. Each variable xi is dissociated into n fresh variables with the same probability p′, i.e., there are n² fresh variables in total. The probability P[ϕ′n] is then equal to the probability that at least one variable yj is connected to one variable x′i,j:

    P[ϕ′n] = 1 − (1 − p(1 − (1−p′)^n))^n
We will again choose p so as to keep r := P[ϕn] constant with increasing n, and then calculate P[ϕ′n] as a function of r. From Eq. 6, we get p = 1 − (1−√r)^(1/n), and then set p′ = p for upper bounds and p′ = 1 − (1−p)^(1/n) for lower bounds, as each dissociated variable is replaced by n fresh variables. It can then be shown that P[ϕ′n] for the upper bound is monotonically increasing in n and bounded below 1 with the limit value:

    limn→∞ P[ϕ′n] = 1 − (1 − √r)^√r
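Eq. (6), the dissociated probability, and the claimed limit can be checked numerically. The following Python sketch (function names are ours) verifies Eq. (6) by enumeration for small n and confirms that, for r = 0.5, the upper bound approaches 1 − (1−√r)^√r ≈ 0.5803:

```python
from itertools import product

def prob_kn_brute(n, p):
    """Brute-force P[phi_n]: some x_i is true AND some y_j is true."""
    total = 0.0
    for bits in product([0, 1], repeat=2 * n):
        xs, ys = bits[:n], bits[n:]
        if any(xs) and any(ys):
            k = sum(bits)
            total += p**k * (1 - p)**(2 * n - k)
    return total

def prob_kn(n, p):
    """Closed form of Eq. (6)."""
    return (1 - (1 - p)**n)**2

def prob_kn_diss(n, p, pd):
    """P[phi'_n] after dissociating each x_i into n fresh copies of prob. pd."""
    return 1 - (1 - p * (1 - (1 - pd)**n))**n

r, n = 0.5, 1000
p = 1 - (1 - r**0.5)**(1.0 / n)       # keeps P[phi_n] = r for every n
upper = prob_kn_diss(n, p, p)         # oblivious upper bound with p' = p
limit = 1 - (1 - r**0.5)**(r**0.5)    # claimed limit value
```

Already at n = 1000, the upper bound agrees with the limit to within about 10^-4.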
Results: Figure 10d keeps P[ϕ] = 0.5 constant (by decreasing p for increasing n) and shows the interesting result that the optimal upper bound is itself upper bounded and reaches a limit value, even though more variables are dissociated and each variable is dissociated into more fresh ones. This limit value is ≈0.5803 for r = 0.5. The lower bounds, however, are not useful in this case.
7.5. Dissociation with a Standard Relational Database Management System
Example 7.6 (TPC-H). Here we apply the theory of dissociation to bound hard probabilistic queries with the help of PostgreSQL 9.2, an open-source relational DBMS.18 We use the TPC-H DBGEN data generator19 to generate a 1GB database. We then add a column P to each table and assign to each tuple either the probability p = 0.1, or p = 0.5, or a random probability from the set {0.01, 0.02, ..., 0.5} ("p = rand 0.5"). Choosing tuple probabilities p ≤ 0.5 helps us avoid floating-point errors for queries with very large lineages, whose query answer probabilities would otherwise be too close to 1.20 Our experiments use the following parameterized query (Fig. 11):

    Q(a) :− S(s, a), PS(s, u), P(u, n), s ≤ $1, n like $2
Relations S, PS, and P represent the tables Supplier, PartSupp, and Part, respectively. Variable a stands for attribute nationkey ("answer tuple"), s for suppkey, u for partkey ("unit"),
18 http://www.postgresql.org/download/
19 http://www.tpc.org/tpch/
20 Compare Fig. 13b and Fig. 13e to observe the impact of the choice of input tuple probabilities p on the probabilities of query answers for the same query.
create view VP as
select p_partkey, s_nationkey,
       1 - power(1-P.P, 1e0/count(*)) as P
from Part P, Partsupp, Supplier
where p_partkey = ps_partkey
and ps_suppkey = s_suppkey
and s_suppkey <= $1
and p_name like $2
group by p_partkey, s_nationkey, P.P

(a) View VP for lower bounds with P′_P

select s_nationkey, IOR(Q3.P) as P
from
  (select S.s_nationkey, S.P*Q2.P as P
   from Supplier S,
     (select Q1.ps_suppkey, s_nationkey, IOR(Q1.P) as P
      from
        (select ps_suppkey, s_nationkey, PS.P*VP.P as P
         from Partsupp PS, VP
         where ps_partkey = p_partkey
         and ps_suppkey <= $1) as Q1
      group by Q1.ps_suppkey, s_nationkey) as Q2
   where s_suppkey = Q2.ps_suppkey) as Q3
group by Q3.s_nationkey

(b) Query P′_P

Fig. 12. Example 7.6. (a) View definition VP and (b) query P′_P for deriving the lower bounds by dissociating table P. Note the inclusion of the attribute nationkey in VP for reasons explained in the text.
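The expression 1 - power(1-P.P, 1e0/count(*)) in the view assigns each of the d fresh copies of a tuple with probability q the value 1 − (1−q)^(1/d), so that the complements of the d independent copies multiply back to 1 − q; this is the optimal oblivious lower-bound choice for a disjunctive dissociation. A minimal sketch of this arithmetic (function name is ours):

```python
def fresh_prob(q, d):
    """Probability given to each of the d fresh copies of a tuple with
    probability q, mirroring 1 - power(1-P, 1e0/count(*)) in view VP."""
    return 1 - (1 - q)**(1.0 / d)
```

For example, with q = 0.3 and d = 4, each copy gets probability ≈0.0853, and the disjunction of the 4 independent copies again has probability exactly 0.3.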
and n for name. The probabilistic version of this query asks which nations (as determined by the attribute nationkey) are most likely to have suppliers with suppkey ≤ $1 that supply parts with a name like $2, when all tuples in Supplier, PartSupp, and Part are probabilistic. Parameters $1 and $2 allow us to reduce the number of tuples that can participate from tables Supplier and Part, respectively, and thus to study the effects of the lineage size on the predicted dissociation bounds and running times. By default, tables Supplier, Partsupp, and Part have 10k, 800k, and 200k tuples, respectively, and there are 25 different numeric values for the attribute nationkey. For parameter $1, we choose a value ∈ {500, 1000, ..., 10000}, which selects the respective number of tuples from table Supplier. For parameter $2, we choose a value ∈ {'%', '%red%', '%red%green%'}, which selects 200k, 11k, or 251 tuples in table Part, respectively.
Translation into SQL: Note that the lineage for each individual query answer corresponds to the lineage of the Boolean query Q from Sect. 6, which we know is hard in general. We thus bound the probabilities of the query answers by evaluating four different queries that correspond to the query-centric dissociation bounds from Sect. 6: dissociating either table Supplier or table Part, and calculating either upper or lower bounds. To get the final upper (lower) bounds, we take the minimum (maximum) of the two upper (lower) bounds for each answer tuple. The two query plans are as follows:

    P_S(a) = π^p_a ⋈^p_u [π^p_{a,u} ⋈^p_s [S(s,a), PS(s,u), s ≤ $1], P(u,n), n like $2]
    P_P(a) = π^p_a ⋈^p_s [S(s,a), π^p_s ⋈^p_u [PS(s,u), s ≤ $1, P(u,n), n like $2]]
Notice one technical detail for determining the lower bounds with plan P_P: Any tuple t from table Part may appear a different number of times in the lineages for different query answers.21 Thus, for every answer a that has t in its lineage, we need to create a distinct copy of t in the view VP, with a probability that depends only on the number of times that t appears in the lineage for a.22 Thus, the view definition for VP needs to include the attribute nationkey (Fig. 12a), and P_P needs to be adapted as follows (Fig. 12b):

    P′_P(a) = π^p_a ⋈^p_s [S(s,a), π^p_{s,a} ⋈^p_u [PS(s,u), s ≤ $1, VP(u,a)]]

21 Lower bounds by dissociating Supplier are easier since the table includes the query answer attribute nationkey. As a consequence, any tuple from Supplier may appear in the lineage of one query answer only.
22 We could actually use the total number of times the tuple appears in all lineages and still get lower bounds. However, the resulting lower bounds would be weaker and not obliviously optimal.
[Figure 13: probabilities of the top 10 query answers in six settings, each showing intervals from dissociation on Supplier, the best of both dissociations, dissociation on Part, and the actual probability:
(a) $1=10000, $2='%red%green%', p=rand 0.5, MaxLin=48
(b) $1=10000, $2='%red%', p=rand 0.5, MaxLin=1941
(c) $1=10000, $2='%red%green%', p=0.5, MaxLin=48
(d) $1=3000, $2='%red%', p=0.1, MaxLin=611
(e) $1=10000, $2='%red%', p=0.1, MaxLin=1941
(f) $1=10000, $2='%', p=0.1, MaxLin=35040]

(g) Overview of 4 cases discussed in the text:
Case  Answer  Fig.  #Part   %diss.  #fresh  #Supp.  %diss.  #fresh  tighter bounds
(A)   (2)     (c)   40      7.5%    2       43      0.0%    -       P_S
(B)   (6)     (a)   42      11.9%   2       53      7.5%    2       P_S
(C)   (11)    (b)   1830    5.8%    2-3     434     95.6%   2-11    P_P
(D)   (11)    (f)   32899   6.3%    2-4     438     100.0%  80      P_P

Fig. 13. Example 7.6. Probabilities for the top 10 query answers for varying query parameters $1, $2, and tuple probabilities p. The ranking is determined by the upper dissociation bounds (upper end of the red interval) and is identical to the one determined by the actual probabilities (crosses), except in (c) where tuples 6 and 21 are flipped. MaxLin shows the maximal lineage size among the query answers, which is too big in (f) for exact probabilistic inference. (b): x = 0.999999999999 = 1 − 10^−12. (f): x = 0.9999999999 = 1 − 10^−10.
In order to speed up the resulting multi-query evaluation, we apply a deterministic semi-join reduction on the input tables and then reuse intermediate query results across all four subsequent queries. Since query optimization is not the focus of this paper, we do not show these techniques in Fig. 12 and instead refer to [Gatterbauer and Suciu 2013] for a detailed discussion of techniques to speed up query dissociation. In addition, the exact SQL statements that allow the interested reader to repeat our experiments with PostgreSQL are available on the LaPushDB project page.23
23 http://LaPushDB.com/
Ground truth: To compare our bounds against ground truth, we construct the lineage DNF for each query answer and use DeMorgan's law to write it as a dual lineage CNF without exponential increase in size. For example, the DNF ϕ = x1x3 ∨ x1x2 can be written as the CNF ¬ϕ = (¬x1 ∨ ¬x3)(¬x1 ∨ ¬x2) with P[¬ϕ] = 1 − P[ϕ]. The problem of evaluating a probabilistic CNF can further be translated into the problem of computing the partition function of a propositional Markov random field, for which the AI community has developed sophisticated solvers. For our experiments, we use a tool called SampleSearch [Gogate and Domingos 2010; Gogate and Dechter 2011].24
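The complement identity is easy to spot-check; the following Python sketch (names and the probability assignment are ours) evaluates the example DNF and its DeMorgan dual under one product distribution:

```python
from itertools import product

def prob(f, probs):
    """P[f] over independent Booleans x1..xk with the given marginals."""
    total = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        if f(*bits):
            w = 1.0
            for b, q in zip(bits, probs):
                w *= q if b else 1 - q
            total += w
    return total

dnf = lambda x1, x2, x3: (x1 and x3) or (x1 and x2)
# DeMorgan dual of the complement: same number of clauses, negated literals
cnf = lambda x1, x2, x3: (not x1 or not x3) and (not x1 or not x2)

probs = [0.2, 0.5, 0.7]
```

Here prob(dnf, probs) and prob(cnf, probs) sum to 1, as required.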
Quality Results: Figure 13 shows the top 10 query answers, as predicted by the upper dissociation bounds, for varying query parameters $1 and $2, as well as varying tuple probabilities p. The crosses show the ground truth probabilities as determined by SampleSearch. The red intervals show the range between upper and lower dissociation bounds. Recall that the final dissociation interval is the intersection of the interval from dissociation on Supplier (left of the red interval) and on Part (right of the red interval). The graphs suggest that the upper dissociation bounds are very close to the actual probabilities, which is consistent with Sect. 7.4, where we showed that upper bounds for DNF dissociations are commonly closer to the true probabilities than lower bounds. For Fig. 13f, we have no ground truth, as the lineage for the top tuple (11) has size 32899 (i.e., the corresponding DNF has 32899 clauses), which is too big for exact probabilistic inference. However, extrapolating from Fig. 13d and Fig. 13e to Fig. 13f, we speculate that upper dissociation bounds give good approximations here as well.
The different interval sizes (i.e., quality of the bounds) arise from different numbers of dissociated tuples in the respective lineages. We illustrate with 4 cases (Fig. 13g):
(A) If there is a plan that does not dissociate any tuple, then both upper and lower bounds coincide. This scenario is also called data-safe [Jha et al. 2010]. For example, the lineage for answer (2) in Fig. 13c has 40 unique tuples from table Part, out of which 3 (7.5%) are dissociated into 2 fresh ones with P_P. However, all of the 43 tuples from table Supplier that appear in the lineage appear only once. Therefore, P_S dissociates no tuple when calculating the answer probability for (2), and as a result gives us the exact value.
(B) The lineage for the top-ranked answer (6) in Fig. 13a has 42 unique tuples from table Part, out of which 5 (11.9%) are dissociated into 2 fresh ones with P_P. In contrast, the lineage has 53 unique tuples from table Supplier, out of which only 4 (7.5%) are dissociated into 2 fresh ones with P_S. Intuitively, P_S should therefore give us tighter bounds, which is confirmed by the results.
(C) The lineage for the top-ranked answer (11) in Fig. 13b and Fig. 13e (both figures show results for the same query, but for different tuple probabilities p = 0.1 or p = rand 0.5) has 1830 unique tuples from table Part, out of which 5.8% are dissociated into 2 or 3 fresh ones with P_P. In contrast, the lineage has only 434 unique tuples from table Supplier, out of which 95.6% are dissociated into 2 to 11 fresh variables. Thus, P_P gives far tighter bounds in this case.
(D) The lineage for the same answer (11) in Fig. 13f has 32899 unique tuples from table Part, out of which 6.3% are dissociated into 2-4 fresh ones with P_P. In contrast, the lineage has only 438 unique tuples from table Supplier, all of which are dissociated into 80 fresh ones with P_S (this is an artifact of the TPC-H random database generator). Thus, the bounds for P_S are very poor.
Importantly, the relevance ranking of the answer tuples by upper dissociation bounds approximates the ranking by query reliability very well. For example, for the case of p = 0.1 and $2 = '%red%' (Fig. 13d and Fig. 13e), the ranking given by the minimum upper bounds was identical to the ranking given by the ground truth for all parameter
[Figure 14: running times as functions of $1 for (a) $2 = '%red%green%', (b) $2 = '%red%', and (c) $2 = '%', each comparing four curves: upper & lower bounds, upper bounds only, lineage query, and deterministic SQL.]

(d) Timing overview:
$2                           '%red%green%'      '%red%'            '%'
$1                           500     10000      500     10000      500     10000
Total lineage size           42      1,004      2,218   44,152     40,000  800,000
SampleSearch [msec]          321     808        1,152   106,396    79,449  -
Upper & lower bounds [msec]  169     342        1,070   7,803      5,886   118,144
Upper bounds only [msec]     139     199        932     3,203      3,569   42,164
Lineage SQL [msec]           109     119        417     1,112      844     5,731
Deterministic SQL [msec]     124     128        444     746        649     2,032

Fig. 14. Example 7.6. Timing results for queries with varying parameters $1 and $2: For small lineages (<10000), dissociation bounds can be calculated in a small multiple (<4) of the time needed for a standard deterministic query. For large lineages (>10000), calculation scales linearly in the size of the total lineage.
choices $1 ∈ {500, 1000, ..., 10000}. Figure 13c shows an example of an incorrect ranking (for parameters $1=10000, $2='%red%green%', p = 0.5): here tuples 6 and 21 are flipped as compared to their actual probabilities 0.99775 and 0.99777, respectively.

Timing Results: Figure 14 compares the times for evaluating the deterministic query (Fig. 11a) with the times for calculating the dissociation bounds for changing parameters $1 and $2. As experimental platform, we use PostgreSQL 9.2 on a 2.5 GHz Intel Core i5 with 16 GB of main memory. We run each query 5 times and take the average execution time. Figure 14d also shows the size of the total lineage of a query (which is the same as the number of query results for the deterministic query without projection) and the times needed by SampleSearch to evaluate the ground truth, if possible.25 Since table Supplier contains exactly 10k tuples with suppkey ∈ {1, ..., 10000}, any choice of $1 ≥ 10000 has no effect on the query. We show separate graphs for the time needed to calculate the upper bounds only (which our theory and experiments suggest give better absolute approximations) and the time for both upper and lower bounds (lower bounds are more expensive due to the required manipulation of the input tuples). We also show the times for retrieving the lineage with a lineage query. Any probabilistic approach that evaluates the probabilities outside of the database engine needs to issue this query to retrieve the DNF for each answer. The time needed for the lineage query thus serves as a minimum benchmark for any probabilistic approximation.

Our timing results show that, for small lineages (<10000), calculating upper and lower bounds can be achieved in a time that is only a small multiple (<4) of the time needed for an equivalent deterministic query. For large lineages (>10000), calculating the bounds scales linearly with the size of the lineage (Fig. 14c), whereas deterministic query evaluation can scale in sublinear time (recall that the cardinality of the answer set is at most 25 across all queries, since the database contains only 25 different values for the answer attribute nationkey). Importantly, scalability for dissociation is independent of the tractability of the data instance, e.g., the maximum treewidth of the lineage for any query answer. In contrast, SampleSearch quickly takes too long for increasing lineage sizes. For example, SampleSearch needs 108 sec to calculate the ground truth for Fig. 13e (the maximum lineage size among all 25 query answers is 1941 in this scenario). Upper dissociation bounds can be calculated in only 3 sec and give the same ranking (and almost the same probabilities).

25 The reported times are for evaluating all answer DNFs without the overhead for the lineage query.
Key take-aways: Overall, our quality and timing experiments suggest that dissociation bounds (in particular, the upper bounds) can provide a good approximation of the actual probabilities and a ranking of the query answers that is often identical to the ranking by their actual probabilities. These bounds can be calculated with guaranteed polynomial scalability in the size of the data. In particular, for queries with small lineage sizes (<10000), calculating the bounds took only a small multiple (<4) of the time needed to evaluate standard deterministic queries.
8. RELATED WORK AND DISCUSSION
Dissociation is related to a number of recent approaches in the graphical models and
constraint satisfaction literature which approximate an intractable problem with a
tractable relaxed version after treating multiple occurrences of variables or nodes as
independent or ignoring some equivalence constraints: Choi et al. [2007] approximate
inference in Bayesian networks by “node splitting, i.e. removing some dependencies
from the original model. Ram´
ırez and Geffner [2007] treat the problem of obtaining
a minimum cost satisfying assignment of a CNF formula by “variable renaming,” i.e.
replacing a variable that appears in many clauses by many fresh new variables that
appear in few. Pipatsrisawat and Darwiche [2007] provide lower bounds for MaxSAT
by “variable splitting,” i.e. compiling a relaxation of the original CNF. Andersen et al.
[2007] improve the relaxation for constraint satisfaction problems by “refinement
through node splitting,” i.e. making explicit some interactions between variables. Choi
and Darwiche [2009] relax the maximum a posteriori (MAP) problem in probabilistic
graphical models by dropping equivalence constraints and partially compensating for
the relaxation. Our work provides a general framework for approximating the proba-
bility of Boolean functions with both upper and lower bounds. These bounds shed light
on the connection between previous relaxation-based and model-based approximations
and unify them as concrete choices in a larger design space. We thus refer to all of the
above approaches as instances of dissociation-based approximations.
Another line of work, variously called “discretization,” “bucketing,” “binning,”
or “quantization,” proposes relaxations that merge or partition states or nodes (instead
of splitting them) and then perform simplified calculations over those partitions:
Dechter and Rish [2003] approximate a function with high arity by a collection of
smaller-arity functions with a parameterized scheme called “mini-buckets.” As an example,
the sum Σᵢ f(xᵢ)·g(xᵢ) for non-negative functions f and g can be upper bounded
by the summation (maxᵢ f(xᵢ)) · Σᵢ g(xᵢ), i.e. all different values of f(xᵢ) are replaced
by the single maximum value maxᵢ f(xᵢ), which simplifies the calculations. Similarly,
the sum can be lower bounded by (minᵢ f(xᵢ)) · Σᵢ g(xᵢ). St-Aubin et al. [2000] use
Algebraic Decision Diagrams (ADDs) and reduce the sizes of the intermediate value
functions generated by replacing the values at the terminals of the ADD with ranges
of values. Bergman et al. [2011; 2013] construct relaxations of Multivalued Decision
Diagrams (MDDs) by merging vertices when the size of the partially constructed MDD
grows too large. Gogate and Domingos [2011] compress potentials computed during
the execution of variable elimination by “quantizing” them, i.e. replacing a number
ACM Transactions on Database Systems, Vol. V, No. N, Article 1, Publication date: January YYYY.
1:30 W. Gatterbauer and D. Suciu
of distinct values in the range of the potential by a single value. Since all of the above
approaches reduce the number of distinct values in the range of a function, we collec-
tively refer to them as quantization-based approximations. A more detailed literature
overview in this space is given by Gogate and Domingos [2013].
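The mini-bucket bound from above can be checked numerically; a minimal sketch in Python, with arbitrary non-negative function values chosen for illustration:

```python
# Mini-bucket style bounds: for non-negative f and g,
#   (min_i f(x_i)) * sum_i g(x_i)  <=  sum_i f(x_i)*g(x_i)  <=  (max_i f(x_i)) * sum_i g(x_i)
f = [0.5, 2.0, 1.0]   # hypothetical non-negative values f(x_i)
g = [1.0, 3.0, 2.0]   # hypothetical non-negative values g(x_i)

exact = sum(fi * gi for fi, gi in zip(f, g))  # 0.5 + 6.0 + 2.0 = 8.5
upper = max(f) * sum(g)                       # 2.0 * 6.0 = 12.0
lower = min(f) * sum(g)                       # 0.5 * 6.0 = 3.0
assert lower <= exact <= upper
```

Replacing all values of f by its single maximum (or minimum) is what makes the relaxed sum factor apart, at the price of a one-sided error.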
Note that dissociation-based approaches (that split nodes) and quantization-based
approaches (that merge nodes) are not inverse operations, but are rather two complementary
approaches that may be combined to yield improved methods. The inverse
of dissociation is what we refer to as assimilation:26 Consider the Boolean formula
ϕ′ = x₁y₁ ∨ x₁y₂ ∨ x₂y₂, which is a dissociation of the much simpler formula
ϕ = x(y₁ ∨ y₂). Hence, we know from dissociation that P[ϕ] ≤ P[ϕ′] for p ≤ min(p₁, p₂),
and that P[ϕ] ≥ P[ϕ′] for p ≥ 1 − p̄₁p̄₂. Note that for quantization, we can always choose
max(p₁, p₂) or min(p₁, p₂) to get a guaranteed upper or lower bound. In contrast, for
assimilation, we may have to choose a different value to get a guaranteed bound. Also,
for the case p₁ = p₂, assimilation will generally still be an approximation, whereas
quantization would be exact. Thus, these are two different approaches.
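Both assimilation bounds can be verified by exhaustive enumeration over truth assignments; a minimal sketch in Python (the probability values are arbitrary and the helper `prob` is our own):

```python
from itertools import product

def prob(formula, probs):
    """Exact probability of a Boolean formula over independent variables,
    computed by enumerating all truth assignments."""
    names = list(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        nu = dict(zip(names, bits))
        weight = 1.0
        for v in names:
            weight *= probs[v] if nu[v] else 1.0 - probs[v]
        if formula(nu):
            total += weight
    return total

p1, p2, q1, q2 = 0.3, 0.7, 0.4, 0.6
# Dissociation phi' = x1 y1 v x1 y2 v x2 y2 with fixed p1, p2:
phi_d = prob(lambda n: n['x1'] and n['y1'] or n['x1'] and n['y2']
                       or n['x2'] and n['y2'],
             {'x1': p1, 'x2': p2, 'y1': q1, 'y2': q2})
# Assimilated phi = x(y1 v y2): p = min(p1,p2) lower bounds phi',
# and p = 1 - (1-p1)(1-p2) upper bounds it.
for p, check in [(min(p1, p2), lambda a, b: a <= b),
                 (1 - (1 - p1) * (1 - p2), lambda a, b: a >= b)]:
    phi = prob(lambda n: n['x'] and (n['y1'] or n['y2']),
               {'x': p, 'y1': q1, 'y2': q2})
    assert check(phi, phi_d)
```

With these numbers, P[ϕ′] = 0.522, while the assimilated formula evaluates to 0.228 with p = min(p₁, p₂) and to 0.6004 with p = 1 − p̄₁p̄₂, bracketing the dissociation from below and above.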
Existing approaches for query processing over probabilistic databases that are both
general and tractable are either: (1) simulation-based approaches that adapt general
purpose sampling methods [Jampani et al. 2008; Kennedy and Koch 2010; Re et al.
2007]; or (2) model-based approaches that approximate the original number of mod-
els with guaranteed lower or upper bounds [Olteanu et al. 2010; Fink and Olteanu
2011]. We have shown in this paper that, for every model-based bound, there exists a
dissociation bound that is at least as good or better.
Our work on dissociation originated while generalizing propagation-based ranking
methods from graphs [Detwiler et al. 2009] to hypergraphs and conjunctive queries.
In [Gatterbauer et al. 2010], we applied dissociation in a query-centric way to upper
bound hard probabilistic queries, and showed the connection between propagation on
graphs and dissociation on hypergraphs (see [Gatterbauer and Suciu 2013] for all de-
tails). In this paper, we provide the theoretical underpinnings of these results in a
generalized framework with both upper and lower bounds. A previous version of this
paper was made available as [Gatterbauer and Suciu 2011].
9. OUTLOOK
We introduced dissociation as a new algebraic technique for approximating the prob-
ability of Boolean functions. We applied this technique to derive obliviously optimal
upper and lower bounds for conjunctive and disjunctive dissociations and proved that
dissociation always gives equally good or better approximations than models. We did
not address the algorithmic complexities of exploring the space of alternative disso-
ciations, but rather see our technique as a basic building block for new algorithmic
approaches. Such future approaches can apply dissociation at two conceptual levels:
(1) at the query-level, i.e. at query time and before analyzing the data, or (2) at the
data-level, i.e. while analyzing the data.
The advantage of query-centric approaches is that they can run in guaranteed poly-
nomial time,27 yet at the cost of no general approximation guarantees.28 Here, we en-
vision that the query-centric, first-order logic-based view of operating on data by the
26We prefer the word “assimilation” as the inverse of dissociation over the more natural choice of “association”
as it implies correctly that two items are actually merged and not merely associated.
27Recall from our experiments (Fig. 14c) that query-centric dissociation scales linearly in the size of the
lineage, independent of intricacies in the data, such as the treewidth of the lineage.
28However, also recall from our experiments (Fig. 13) that query-centric approaches may work well in practice.
Notice here the similarity to loopy belief propagation [Frey and MacKay 1997], which is applied widely
and successfully, despite lacking general performance guarantees.
database community can also influence neighboring communities, in particular those
working on lifted inference (see e.g. [Van den Broeck et al. 2011]).
The advantage of data-centric approaches is that exact solutions can be arbitrarily
approximated, yet at the cost of no guaranteed runtime [Roth 1996]. Here, we envision
a range of new approaches (that may combine dissociation with quantization) to com-
pile an existing intractable formula into a tractable target language, e.g., read-once
formulas or formulas with bounded treewidth. For example, one can imagine an ap-
proximation scheme that adds repeated dissociation to Shannon expansion in order to
avoid the intermediate state explosion.
ACKNOWLEDGMENTS
We would like to thank Arthur Choi and Adnan Darwiche for helpful discussions on Relaxation & Com-
pensation, and for bringing Prop. 5.1 to our attention [Choi and Darwiche 2011]. We would also like to
thank Vibhav Gogate for helpful discussions on quantization and guidance for using his tool SampleSearch,
Alexandra Meliou for suggesting the name “dissociation,” and the reviewers for their careful reading of
this manuscript and their detailed feedback. This work was partially supported by NSF grants IIS-0915054
and IIS-1115188. More information about this research, including the PostgreSQL statements to repeat the
experiments on TPC-H data, can be found on the project page: http://LaPushDB.com/.
REFERENCES
ANDERSEN, H. R., HADZIC, T., HOOKER, J. N., AND TIEDEMANN, P. 2007. A constraint store based on
multivalued decision diagrams. In Proceedings of the 13th International Conference on Principles and
Practice of Constraint Programming (CP’07). 118–132.
BERGMAN, D., CIRE, A. A., VAN HOEVE, W.-J., AND HOOKER, J. N. 2013. Optimization bounds from binary
decision diagrams. In INFORMS Journal on Computing. (to appear).
BERGMAN, D., VAN HOEVE, W. J., AND HOOKER, J. N. 2011. Manipulating MDD relaxations for combinatorial
optimization. In 8th International Conference on Integration of AI and OR Techniques in Constraint
Programming for Combinatorial Optimization Problems (CPAIOR’11). 20–35.
CHAVIRA, M. AND DARWICHE, A. 2008. On probabilistic inference by weighted model counting. Artif.
Intell. 172, 6-7, 772–799.
CHOI, A., CHAVIRA , M., AND DARWICHE, A. 2007. Node splitting: A scheme for generating upper bounds
in Bayesian networks. In Proceedings of the 23rd Conference in Uncertainty in Artificial Intelligence
(UAI’07). 57–66.
CHOI, A. AND DARWICHE, A. 2009. Relax then compensate: On max-product belief propagation and more.
In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS’09).
351–359. (Alternative title: Approximating MAP by Compensating for Structural Relaxations).
CHOI, A. AND DARWICHE, A. 2010. Relax, compensate and then recover. In Proceedings of New Frontiers in
Artificial Intelligence Workshops (JSAI-isAI’10). 167–180.
CHOI, A. AND DARWICHE, A. 2011. Personal communication.
CRAMA, Y. AND HAMMER, P. L. 2011. Boolean Functions: Theory, Algorithms, and Applications. Cambridge
University Press.
DALVI, N. N., SCHNAITTER, K., AND SUCIU, D. 2010. Computing query probability with incidence algebras.
In Proceedings of the 29th Symposium on Principles of Database Systems (PODS’10). 203–214.
DALVI, N. N. AND SUCIU, D. 2007. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4,
523–544.
DARWICHE, A. AND MARQUIS, P. 2002. A knowledge compilation map. J. Artif. Int. Res. 17, 1, 229–264.
DECHTER, R. AND RISH, I. 2003. Mini-buckets: A general scheme for bounded inference. J. ACM 50, 2,
107–153.
DETWILER, L., GATTERBAUER, W., LOUIE, B., SUCIU, D., AND TARCZY-HORNOCH, P. 2009. Integrating
and ranking uncertain scientific data. In Proceedings of the 25th International Conference on Data
Engineering (ICDE’09). 1235–1238.
FELLER, W. 1968. An Introduction to Probability Theory and Its Applications, 3rd ed. Wiley, New York.
FINK, R. AND OLTEANU, D. 2011. On the optimal approximation of queries using tractable propositional
languages. In Proceedings 14th International Conference on Database Theory (ICDT’11). 174–185.
FREY, B. J. AND MACKAY, D. J. C. 1997. A revolution: Belief propagation in graphs with cycles. In NIPS.
FUHR, N. AND RÖLLEKE, T. 1997. A probabilistic relational algebra for the integration of information
retrieval and database systems. ACM Trans. Inf. Syst. 15, 1, 32–66.
GATTERBAUER, W., JHA, A. K., AND SUCIU, D. 2010. Dissociation and propagation for efficient query
evaluation over probabilistic databases. In Proceedings of the 4th International VLDB Workshop on
Management of Uncertain Data (MUD’10). 83–97.
GATTERBAUER, W. AND SUCIU, D. 2011. Optimal upper and lower bounds for Boolean expressions by
dissociation. arXiv:1105.2813 [cs.AI].
GATTERBAUER, W. AND SUCIU, D. 2013. Dissociation and propagation for efficient query evaluation over
probabilistic databases. arXiv:1310.6257 [cs.DB].
GOGATE, V. AND DECHTER, R. 2011. SampleSearch: Importance sampling in presence of determinism.
Artificial Intelligence 175, 2, 694–729.
GOGATE, V. AND DOMINGOS, P. 2010. Formula-based probabilistic inference. In Proceedings of the 26th
Conference in Uncertainty in Artificial Intelligence (UAI’10). 210–219.
GOGATE, V. AND DOMINGOS, P. 2011. Approximation by quantization. In Proceedings of the 27th Conference
in Uncertainty in Artificial Intelligence (UAI’11). 247–255.
GOGATE, V. AND DOMINGOS, P. 2013. Approximation by quantization. In Proceedings of the 29th Conference
in Uncertainty in Artificial Intelligence (UAI’13). 252–261.
GURVICH, V. 1977. Repetition-free Boolean functions. Uspekhi Mat. Nauk 32, 183–184. (in Russian).
JAMPANI, R., XU, F., WU, M., PEREZ, L. L., JERMAINE, C. M., AND HAAS, P. J. 2008. MCDB: a Monte
Carlo approach to managing uncertain data. In Proceedings International Conference on Management
of Data (SIGMOD’08). 687–700.
JHA, A., OLTEANU, D., AND SUCIU, D. 2010. Bridging the gap between intensional and extensional query
evaluation in probabilistic databases. In Proceedings of the 13th International Conference on Extending
Database Technology (EDBT’10). 323–334.
KENNEDY, O. AND KOCH, C. 2010. Pip: A database system for great and small expectations. In Proceedings
of the 26th International Conference on Data Engineering (ICDE’10). 157–168.
OLTEANU, D. AND HUANG, J. 2008. Using OBDDs for efficient query evaluation on probabilistic databases.
In Proceedings of the 4th International Conference on Scalable Uncertainty Management (SUM’08). 326–
340.
OLTEANU, D., HUANG, J., AND KOCH, C. 2009. Sprout: Lazy vs. eager query plans for tuple-independent
probabilistic databases. In Proceedings of the 25th International Conference on Data Engineering
(ICDE’09). 640–651.
OLTEANU, D., HUANG, J., AND KOCH, C. 2010. Approximate confidence computation in probabilistic
databases. In Proceedings of the 26th International Conference on Data Engineering (ICDE’10). 145–
156.
PIPATSRISAWAT, K. AND DARWICHE, A. 2007. Clone: Solving weighted max-sat in a reduced search space.
In Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AUS-AI’07). 223–233.
POOLE, D. 1993. Probabilistic horn abduction and Bayesian networks. Artif. Intell. 64, 1, 81–129.
PROVAN, J. S. AND BALL, M. O. 1983. The complexity of counting cuts and of computing the probability
that a graph is connected. SIAM J. Comput. 12, 4, 777–788.
RAMÍREZ, M. AND GEFFNER, H. 2007. Structural relaxations by variable renaming and their compilation
for solving MinCostSAT. In Proceedings of the 13th International Conference on Principles and Practice of
Constraint Programming (CP’07). 605–619.
RE, C., DALVI, N. N., AND SUCIU, D. 2007. Efficient top-k query evaluation on probabilistic data. In Pro-
ceedings of the 23rd International Conference on Data Engineering (ICDE’07). 886–895.
ROTH, D. 1996. On the hardness of approximate reasoning. Artif. Intell. 82, 1-2, 273–302.
SEN, P., DESHPANDE, A., AND GETOOR, L. 2010. Read-once functions and query evaluation in probabilistic
databases. Proceedings of the VLDB Endowment 3, 1, 1068–1079.
ST-AUBIN, R., HOEY, J., AND BOUTILIER, C. 2000. APRICODD: Approximate policy construction using
decision diagrams. In Advances in Neural Information Processing Systems 13 (NIPS’00). 1089–1095.
VALIANT, L. 1982. A scheme for fast parallel communication. SIAM Journal on Computing 11, 2, 350–361.
VALIANT, L. G. 1979. The complexity of computing the permanent. Theor. Comput. Sci. 8, 189–201.
VAN DEN BROECK, G., TAGHIPOUR, N., MEERT, W., DAVIS, J., AND RAEDT, L. D. 2011. Lifted probabilistic
inference by first-order knowledge compilation. In Proceedings of the 22nd International Joint Confer-
ence on Artificial Intelligence (IJCAI’11). 2178–2185.
A. NOMENCLATURE
x, y, z          independent Boolean random variables
ϕ, ψ             Boolean formulas, probabilistic event expressions
f, g, fϕ         Boolean functions; fϕ is represented by an expression ϕ
P[x], P[ϕ]       probability of an event or expression
pᵢ, qⱼ, rₖ       probabilities pᵢ = P[xᵢ], qⱼ = P[yⱼ], rₖ = P[zₖ]
x, g, p          sets {x₁, . . . , xₖ} or vectors ⟨x₁, . . . , xₖ⟩ of variables, functions, or probabilities
P_{p,q}[f]       probability of function f(x, y) for p = P[x], q = P[y]
x̄, ϕ̄, p̄          complements ¬x, ¬ϕ, 1 − p
f′, ϕ′           dissociation of a function f or expression ϕ
θ                substitution θ : x′ → x; defines a dissociation f′ of f if f′[θ] = f
f[x′/x]          substitution of x′ for x in f
m, m′, n         m = |x|, m′ = |x′|, n = |y|
dᵢ               number of new variables that xᵢ is dissociated into
ν                valuation or truth assignment ν : y → {0, 1} with yᵢ = νᵢ
f[ν], ϕ[ν]       function or expression ϕ with valuation ν substituted for y
g^ν              g^ν = ⋀ⱼ gⱼ^ν, where gⱼ^ν = ḡⱼ if νⱼ = 0 and gⱼ^ν = gⱼ if νⱼ = 1
B. REPRESENTING COMPLEX EVENTS (DISCUSSION OF COROLLARY 4.5)
It is known from Poole’s independent choice logic [Poole 1993] that arbitrary correlations
between events can be composed from disjoint-independent events only. A
disjoint-independent event is represented by a non-Boolean independent random variable
y which takes one of k values v = ⟨v₁, . . . , vₖ⟩ with respective probabilities
q = ⟨q₁, . . . , qₖ⟩ and Σᵢ qᵢ = 1. Poole writes such a “disjoint declaration” as
y([v₁:q₁, . . . , vₖ:qₖ]).
In turn, any k disjoint events can be represented starting from k−1 independent
Boolean variables z = ⟨z₁, . . . , z_{k−1}⟩ with probabilities P[zᵢ] = qᵢ/(1 − q₁ − · · · − q_{i−1}),
i.e. P[z] = ⟨q₁, q₂/(1−q₁), q₃/(1−q₁−q₂), . . .⟩, by assigning the disjoint-independent
event variable y its value vᵢ whenever event Aᵢ is true, with Aᵢ defined as:

(y = v₁) ⇔ A₁ : z₁
(y = v₂) ⇔ A₂ : z̄₁z₂
  ⋮
(y = v_{k−1}) ⇔ A_{k−1} : z̄₁ · · · z̄_{k−2} z_{k−1}
(y = vₖ) ⇔ Aₖ : z̄₁ · · · z̄_{k−2} z̄_{k−1}.
For example, a disjoint-independent event y(v₁: 1/5, v₂: 1/2, v₃: 1/5, v₄: 1/10) can be represented
with three independent Boolean variables z = (z₁, z₂, z₃) and P[z] = (1/5, 5/8, 2/3).
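This conversion can be sketched in a few lines of Python (the function name is our own); enumerating the assignments of z recovers the original disjoint probabilities:

```python
from itertools import product

def disjoint_to_independent(q):
    """Map probabilities <q1,...,qk> of a disjoint-independent event
    to probabilities of k-1 independent Boolean variables z."""
    z, remaining = [], 1.0
    for qi in q[:-1]:
        z.append(qi / remaining)  # P[z_i] = q_i / (1 - q_1 - ... - q_{i-1})
        remaining -= qi
    return z

q = [1/5, 1/2, 1/5, 1/10]
z = disjoint_to_independent(q)   # [0.2, 0.625, 0.666...]

# Check: y takes value v_i for the first true z_i, and v_k if none is true.
k = len(q)
recovered = [0.0] * k
for bits in product([0, 1], repeat=k - 1):
    weight = 1.0
    for zi, b in zip(z, bits):
        weight *= zi if b else 1.0 - zi
    i = next((j for j, b in enumerate(bits) if b), k - 1)
    recovered[i] += weight
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, q))
```

For the example above this yields z = (1/5, 5/8, 2/3), matching the probabilities stated in the text.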
It follows that arbitrary correlations between events can be modeled starting from
independent Boolean random variables alone. For example, two complex events A and
B with P[A] = P[B] = q and varying correlation (see Sect. 7.1) can be represented as
composed events A: z₁z₂ ∨ z₃ ∨ z₄ and B: z̄₁z₂ ∨ z₃ ∨ z₅ over the primitive events z
with varying probabilities P[z]. Events A and B become identical for P[z] = (0, 0, q, 0, 0),
independent for P[z] = (0, 0, 0, q, q), and disjoint for P[z] = (0.5, 2q, 0, 0, 0) with q ≤ 0.5.
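The three correlation regimes can be checked by exact enumeration over the five primitive events; a sketch in Python, assuming the composed events A = z₁z₂ ∨ z₃ ∨ z₄ and B = z̄₁z₂ ∨ z₃ ∨ z₅ and picking q = 0.25:

```python
from itertools import product

def prob(event, pz):
    """P[event] by enumeration over independent Booleans z1..z5."""
    total = 0.0
    for bits in product([0, 1], repeat=5):
        w = 1.0
        for p, b in zip(pz, bits):
            w *= p if b else 1.0 - p
        if event(bits):
            total += w
    return total

A = lambda z: (z[0] and z[1]) or z[2] or z[3]
B = lambda z: (not z[0] and z[1]) or z[2] or z[4]
AB = lambda z: A(z) and B(z)

q = 0.25
for pz, pab in [((0, 0, q, 0, 0), q),        # identical:   P[A and B] = q
                ((0, 0, 0, q, q), q * q),    # independent: P[A and B] = q^2
                ((0.5, 2 * q, 0, 0, 0), 0)]: # disjoint:    P[A and B] = 0
    assert abs(prob(A, pz) - q) < 1e-9
    assert abs(prob(B, pz) - q) < 1e-9
    assert abs(prob(AB, pz) - pab) < 1e-9
```

In each regime both marginals stay at q while the joint probability moves from q (identical) through q² (independent) to 0 (disjoint).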
C. USER-DEFINED AGGREGATE IOR (SEC. 6 AND SEC. 7.5)
Here we show the User-defined Aggregate (UDA) IOR in PostgreSQL:
create or replace function ior_sfunc(float, float) returns float as
  'select $1 * (1.0 - $2)'
language SQL;

create or replace function ior_finalfunc(float) returns float as
  'select 1.0 - $1'
language SQL;

create aggregate ior (float)(
  sfunc     = ior_sfunc,
  stype     = float,
  finalfunc = ior_finalfunc,
  initcond  = '1.0');
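The aggregate folds its input probabilities into the independent-or 1 − ∏ᵢ(1 − pᵢ); the same computation, sketched in Python for clarity:

```python
def ior(probabilities):
    """Independent-or: probability that at least one of several
    independent events is true, i.e. 1 - prod(1 - p_i).
    Mirrors the SQL aggregate: initcond, sfunc, finalfunc."""
    acc = 1.0                 # initcond = '1.0'
    for p in probabilities:
        acc *= 1.0 - p        # sfunc: $1 * (1.0 - $2)
    return 1.0 - acc          # finalfunc: 1.0 - $1

assert abs(ior([0.5, 0.5]) - 0.75) < 1e-9
```

Because multiplication is commutative and associative, the fold is order-independent, which is what makes it safe to use as a database aggregate.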
... They replace the input variables in a given query with fresh input variables depending on the structure of the query. Dissociation is similar in spirit: It is used to define upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of a random variable as independent and assigning them new individual probabilities [13]. Query dissociation serves the same purpose [14]. ...
... let ltree = Acc-Top(ω key ✮L , (O|I)) 13 return htrees ∪ { ltree } Figure 10: Construction of a set of variable orders from a canonical variable order ω of a hierarchical CQAP query with access pattern (O|I). Each constructed variable order corresponds to an evaluation strategy of some part of the query result. ...
... Figure 24 gives the trigger procedure OnUpdate that maintains a set T of view trees under a sequence of single-tuple updates to input relations. It first applies an update δR = {x → m} to the view trees from T using UpdateTrees from Figure 20 containing x[k i ] to the relation parts whose HL-signature matches the degree of x[k i ] in base relations (Lines [11][12][13][14]. We state the amortised maintenance time of our approach under a sequence of single-tuple updates. ...
Preprint
Full-text available
We study the dynamic evaluation of conjunctive queries with output access patterns. An access pattern is a partition of the free variables of the query into input and output.The query returns tuples over the output variables given a tuple over the input variables. Our contribution is threefold. First, we give a syntactic characterisation of queries that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given an input tuple. Second, we define a class of queries that admit optimal, albeit non-constant, update time and delay. Their optimality is predicated on the Online Matrix-Vector Multiplication conjecture. Third, we chart the complexity trade-off between preprocessing, update time and enumeration delay for such queries. Our results recover prior work on the dynamic evaluation of conjunctive queries without access patterns.
... Without loss of generality, the imprecise version then amounts to consider that the information we have is an interval [p i , p i ], as every convex set of probabilities on a binary space (here, {0, 1}) is an interval. We then consider that a probability set P BR over Y amounts to consider the robust version of Equation (14), that is ...
... Depending on the internal structure of the covariance matrix parameterΣ of the set of Gaussian distributions, the inference complexity can become harder to compute, and that is why we opted to use an approximation method [23] 14 to calculate the lower probability bound instead of an exact procedure as was presented in the original paper [6], all this without reducing the cautious predictive global results. ...
... Finally, let us notice that while this paper focused on the issue of multi-label learning problems, our results readily apply to any Boolean vectors of m items. As Boolean vectors and structures as well as probability bounds naturally appear in a number of other applications, including occupancy grids [22] or data bases [14], a future work would be to investigate how our present findings can help in such problems. ...
Preprint
Full-text available
In this paper, we consider the problem of making distributionally robust, skeptical inferences for the multi-label problem, or more generally for Boolean vectors. By distributionally robust, we mean that we consider a set of possible probability distributions, and by skeptical we understand that we consider as valid only those inferences that are true for every distribution within this set. Such inferences will provide partial predictions whenever the considered set is sufficiently big. We study in particular the Hamming loss case, a common loss function in multi-label problems, showing how skeptical inferences can be made in this setting. Our experimental results are organised in three sections; (1) the first one indicates the gain computational obtained from our theoretical results by using synthetical data sets, (2) the second one indicates that our approaches produce relevant cautiousness on those hard-to-predict instances where its precise counterpart fails, and (3) the last one demonstrates experimentally how our approach copes with imperfect information (generated by a downsampling procedure) better than the partial abstention [31] and the rejection rules.
... A different recent line of work is complete in that it can address all queries and databases, but it is query-centric [21,22] and unnecessarily approximates answers even for cases that allow an exact solution (e.g. the Example 1). Our intuitive goal is to get the best of both worlds (Fig. 1): a complete approach that includes the tractable case for exact inference as special cases. ...
... Every lineage has a unique 1OF dissociation up to renaming of variables. Specializing previous results on oblivious bounds [21] to 1OF dissociations, the important take-aways for that prior work is that ( ) lower and upper bounds for intractable expressions can be found very efficiently, ( ) that there is no guarantee on the accuracy of approximation, and ( ) that those bounds work better the fewer times variables are repeated. This motivates our interest in compiling an existing provenance polynomial into the smallest representation. ...
... Proof Lemma 3.9. Whenever Δ ⪯ Δ ′ , the lineage calculated with Δ ′ is a dissociation of the lineage of Δ , which means that each variable occurs at least as many times in ( Δ ′ ) as in ( Δ ) [21]. The statement follows now immediately. ...
Preprint
We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e., we want to find an equivalent propositional formula that minimizes the number of variable occurrences. Our work is partly motivated from probabilistic inference where read-once formulas are known to allow exact PTIME solutions and non-read-once formulas allow approximate solutions with an error that depends on the number of repetitions of variables. We embark on the challenge of characterizing the data complexity of this problem and show its connection to the query resilience problem. While the problem is NP-complete in general, we develop an encoding as max-flow problem that is guaranteed to give the exact solution for several queries (and otherwise approximate minimizations). We show that our encoding is guaranteed to return a read-once factorization if it exists. Our problem and approach is a complete solution that naturally recovers exact solutions for all known PTIME cases, as well as identifying additional queries for which the problem can be solved in PTIME.
... There, the authors observed that no single FPRAS algorithm performed significantly better than the rest on all benchmarks. Applications of #DNF to probabilistic databases also motivated a number of algorithms designed for approximate #DNF that try to optimize query evaluation [13,14,15,16]. These algorithm are, however, either impractical (in terms of time complexity) or are designed to work on restricted classes of formulas such as read-once, monotone, etc. ...
Preprint
Full-text available
Model counting is a fundamental problem in many practical applications, including query evaluation in probabilistic databases and failure-probability estimation of networks. In this work, we focus on a variant of this problem where the underlying formula is expressed in the Disjunctive Normal Form (DNF), also known as #DNF. This problem has been shown to be #P-complete, making it often intractable to solve exactly. Much research has therefore focused on obtaining approximate solutions, particularly in the form of (ε,δ)(\varepsilon, \delta) approximations. The primary contribution of this paper is a new approach, called pepin, an approximate #DNF counter that significantly outperforms prior state-of-the-art approaches. Our work is based on the recent breakthrough in the context of the union of sets in the streaming model. We demonstrate the effectiveness of our approach through extensive experiments and show that it provides an affirmative answer to the challenge of efficiently computing #DNF.
... Finally, let us notice that while this chapter focused on the issue of multi-label learning problems, our results readily apply to any Boolean vectors of m items. As Boolean vectors and structures as well as probability bounds naturally appear in a number of other applications, including occupancy grids [Mouhagir et al., 2017] or data bases [Gatterbauer et al., 2014], a future work would be to investigate how our present findings can help in such problems. ...
Thesis
Full-text available
Decision makers are often faced with making single hard decisions, without having any knowledge of the amount of uncertainties contained in them, and taking the risk of making damaging, if not dramatic, mistakes. In such situations, where the uncertainty is higher due to imperfect information, it may be useful to provide set-valued but more reliable decisions. This works thus focuses on making distributionally robust, skeptical inferences (or decisions) in supervised classification problems using imprecise probabilities. By distributionally robust, we mean that we consider a set of possible probability distributions, i.e. imprecise probabilities, and by skeptical we understand that we consider as valid only those inferences that are true for every distribution within this set. Specifically, we focus on extending the Gaussian discriminant analysis and multilabel classification approaches to the imprecise probabilistic setting. Regarding to Gaussian discriminant analysis, we extend it by proposing a new imprecise classifier, considering the imprecision as part of its basic axioms, based on robust Bayesian analysis and near-ignorance priors. By including an imprecise component in the model, our proposal highlights those hard instances on which the precise model makes mistakes in order to provide cautious decisions in the form of set-valued class, instead. Regarding to multi-label classification, we first focus on reducing the time complexity of making a cautious decision over its output space of exponential size by providing theoretical justifications and efficient algorithms applied to the Hamming loss. Relaxing the assumption of independence on labels, we obtain partial decisions, i.e. not classifying at all over some labels, which generalize the binary relevance approach by using imprecise marginal distributions. 
Secondly, we extend the classifierchains approach by proposing two different strategies to handle imprecise probabilityestimates, and a new dynamic, context-dependent label ordering which dynamically selects the labels with low uncertainty as the chain moves forwards.
Article
Comparing relational languages by their logical expressiveness is well understood. Less understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than ''a certain way of writing a query''? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? Our SIGMOD 2024 paper proposes a semantic definition of relational query patterns that uses a variant of structurepreserving mappings between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational languages. Notably, for the nondisjunctive language fragment, we show that relational calculus (RC) can express a larger class of patterns than the basic operators of relational algebra (RA). We also propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus. These diagrams can represent all query patterns for unions of non-disjunctive queries, in contrast to visual query representations that derive visual marks from the basic operators of algebra. Our anonymously preregistered user study shows that Relational Diagrams allow users to recognize relational patterns meaningfully faster and more accurately than they can with SQL.
Article
We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e.,we want to find a formula that minimizes the number of variable repetitions. This problem is equivalent to solving the fundamental Boolean formula factorization problem for the restricted setting of the provenance formulas of self-join free queries. While general Boolean formula minimization is Σ p 2 -complete, we show that the problem is NP-Complete in our case. Additionally, we identify a large class of queries that can be solved in PTIME, expanding beyond the previously known tractable cases of read-once formulas and hierarchical queries. We describe connections between factorizations, Variable Elimination Orders (VEOs), and minimal query plans. We leverage these insights to create an Integer Linear Program (ILP) that can solve the minimal factorization problem exactly. We also propose a Max-Flow Min-Cut (MFMC) based algorithm that gives an efficient approximate solution. Importantly, we show that both the Linear Programming (LP) relaxation of our ILP, and our MFMC-based algorithm are always correct for all currently known PTIME cases. Thus, we present two unified algorithms (ILP and MFMC) that can both recover all known PTIME cases in PTIME, yet also solve NP-Complete cases either exactly (ILP) or approximately (MFMC), as desired.
Article
Comparing relational languages by their logical expressiveness is well understood. Less well understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than "a certain way of writing a query"? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving mappings between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational language fragments and create a hierarchy of languages with equal logical expressiveness yet different pattern expressiveness. Notably, for the non-disjunctive language fragment, we show that relational calculus can express a larger class of patterns than the basic operators of relational algebra. Our language-independent definition of query patterns opens novel paths for assisting database users. For example, these patterns could be leveraged to create visual query representations that faithfully represent query patterns, speed up interpretation, and provide visual feedback during query editing. As a concrete example, we propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus that is provably (i) unambiguous, (ii) relationally complete, and (iii) able to represent all query patterns for unions of non-disjunctive queries. Among all diagrammatic representations for relational queries that we are aware of, ours is the only one with these three properties. Furthermore, our anonymously preregistered user study shows that Relational Diagrams allow users to recognize patterns meaningfully faster and more accurately than with SQL.
Article
The role of uncertainty in data management has become more prominent than ever before, especially because of the growing importance of machine learning-driven applications that produce large uncertain databases. A well-known approach to querying such databases is to blend rule-based reasoning with uncertainty. However, techniques proposed so far struggle with large databases. In this paper, we address this problem by presenting a new technique for probabilistic reasoning that exploits Trigger Graphs (TGs) -- a notion recently introduced for the non-probabilistic setting. The intuition is that TGs can effectively store a probabilistic model by avoiding an explicit materialization of the lineage and by grouping together similar derivations of the same fact. Firstly, we show how TGs can be adapted to support the possible world semantics. Then, we describe techniques for efficiently computing a probabilistic model and formally establish the correctness of our approach. We also present an extensive empirical evaluation using a prototype called LTGs. Our comparison against other leading engines shows that LTGs is not only faster, even against approximate reasoning techniques, but can also reason over probabilistic databases that existing engines cannot scale to.
Article
Occupancy grids are common tools used in robotics to represent the robot's environment; they may be used to plan trajectories, select additional measurements to acquire, and so on. However, deriving information about these occupancy grids from sensor measurements often induces a lot of uncertainty, especially for grid elements that correspond to areas occluded from or far away from the robot. This means that occupancy information may be quite uncertain and imprecise in some places while being very accurate in others. Modelling this occupancy information finely is essential to decide the optimal action the robot should take, but a refined modelling of uncertainty often implies a higher computational cost, a prohibitive feature for real-time applications. In this paper, we introduce the notion of credal occupancy grids, using the very general theory of imprecise probabilities to model occupancy uncertainty. We also show how one can perform efficient, real-time inference with such a model, and present a use case applying the model to an autonomous vehicle trajectory planning problem.
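The imprecise-probability idea behind credal grids can be sketched in a few lines. This is an illustrative simplification of our own, not the paper's model: each cell carries a probability interval rather than a point estimate, and independent interval estimates are fused conservatively.

```python
def fuse(cell_a, cell_b):
    """Each cell is an interval (lower, upper) on occupancy probability.
    Independent estimates are fused by intersecting intervals; on conflict
    (empty intersection) we fall back to the union, staying conservative."""
    lo = max(cell_a[0], cell_b[0])
    hi = min(cell_a[1], cell_b[1])
    if lo > hi:  # conflicting evidence: widen instead of guessing
        lo, hi = min(cell_a[0], cell_b[0]), max(cell_a[1], cell_b[1])
    return (lo, hi)

occluded = (0.2, 0.9)   # far/occluded cell: wide, imprecise interval
measured = (0.6, 0.7)   # later, precise sensor reading for the same cell
print(fuse(occluded, measured))  # (0.6, 0.7)
```

Wide intervals mark cells the planner should treat cautiously; point-valued grids cannot distinguish "equally likely free or occupied" from "unknown".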
Conference Paper
Full-text available
Bounds on the optimal value are often indispensable for the practical solution of discrete optimization problems, particularly in the branching procedures used by constraint programming (CP) and integer programming solvers. Such bounds are frequently obtained by solving a continuous relaxation of the problem, perhaps a linear programming (LP) relaxation. In this paper, we explore an alternative strategy of obtaining bounds from a discrete relaxation, namely a binary decision diagram (BDD). Such a strategy is particularly suitable for CP, because BDDs provide enhanced propagation as well [2-5].
Article
Full-text available
In this paper, we present structured message passing (SMP), a unifying framework for approximate inference algorithms that take advantage of structured representations such as algebraic decision diagrams and sparse hash tables. These representations can yield significant time and space savings over the conventional tabular representation when the message has several identical values (context-specific independence) or zeros (determinism) or both in its range. Therefore, in order to fully exploit the power of structured representations, we propose to artificially introduce context-specific independence and determinism in the messages. This yields a new class of powerful approximate inference algorithms which includes popular algorithms such as cluster-graph Belief propagation (BP), expectation propagation and particle BP as special cases. We show that our new algorithms introduce several interesting bias-variance trade-offs. We evaluate these trade-offs empirically and demonstrate that our new algorithms are more accurate and scalable than state-of-the-art techniques.
Article
Full-text available
We explore the idea of obtaining bounds on the value of an optimization problem from a discrete relaxation based on binary decision diagrams (BDDs). We show how to construct a BDD that represents a relaxation of a 0-1 optimization problem, and how to obtain a bound for a separable objective function by solving a shortest (or longest) path problem in the BDD. As a test case we apply the method to the maximum independent set problem on a graph. We find that for most problem instances, it delivers tighter bounds in less computation time, than state-of-the-art integer programming software obtains by solving a continuous relaxation augmented with cutting planes.
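The bounding mechanism described above, once a relaxed decision diagram is built, reduces to a longest-path computation over a weighted DAG. The following sketch is our own minimal illustration (the node names and weights are invented, and building the relaxed BDD itself is the hard part the paper addresses):

```python
from functools import lru_cache

def longest_path(succ, root, terminal):
    """succ maps node -> list of (child, arc_weight) in a DAG.
    Returns the longest root-to-terminal path length via memoized recursion."""
    @lru_cache(maxsize=None)
    def best(node):
        if node == terminal:
            return 0
        return max(w + best(child) for child, w in succ[node])
    return best(root)

# Tiny relaxed diagram for a 2-variable maximization: 1-arcs are weighted by
# the objective contribution of setting that variable; 0-arcs weigh 0.
succ = {
    'r': [('u', 3), ('u', 0)],   # x1 = 1 gains 3
    'u': [('t', 2), ('t', 0)],   # x2 = 1 gains 2
}
print(longest_path(succ, 'r', 't'))  # 5, an upper bound on the true optimum
```

Because the relaxed diagram over-approximates the feasible set, the longest-path value can only exceed (or equal) the true optimum, which is exactly what makes it usable as a bound in branch-and-bound search.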
Article
This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization. With each decomposition step, lower and upper bounds on the probability of the partially compiled formula can be quickly computed and checked against the allowed error. This algorithm can be effectively used to compute approximate confidence values of answer tuples to positive relational algebra queries on general probabilistic databases (c-tables with discrete probability distributions). We further tune our algorithm so as to capture all known tractable conjunctive queries without self-joins on tuple-independent probabilistic databases: In this case, the algorithm requires time polynomial in the input size even for exact computation. We implemented the algorithm as an extension of the SPROUT query engine. An extensive experimental effort shows that it consistently outperforms state-of-the-art approximation techniques by several orders of magnitude.
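The anytime-bounds idea behind this incremental compilation can be sketched as follows. This is a hedged illustration of our own, not the SPROUT implementation: after a Shannon expansion F = x·F[x=1] ∨ ¬x·F[x=0], any still-uncompiled residual subformula contributes a probability somewhere in [0, 1], which immediately yields lower and upper bounds on Pr[F] that tighten as compilation proceeds.

```python
def prob_bounds(node, p):
    """node is ('leaf', lo, hi) for a residual with known probability bounds,
    or ('shannon', x, high_child, low_child) for an expansion on variable x."""
    if node[0] == 'leaf':
        return node[1], node[2]
    _, x, hi_child, lo_child = node
    hl, hu = prob_bounds(hi_child, p)   # bounds for F[x=1]
    ll, lu = prob_bounds(lo_child, p)   # bounds for F[x=0]
    px = p[x]
    return (px * hl + (1 - px) * ll, px * hu + (1 - px) * lu)

# Expanded on x only; the x=1 branch collapsed to true, while the x=0 branch
# is still uncompiled and hence bounded by the trivial interval [0, 1].
partial = ('shannon', 'x', ('leaf', 1.0, 1.0), ('leaf', 0.0, 1.0))
lo, up = prob_bounds(partial, {'x': 0.4})
print(lo, up)  # 0.4 1.0
```

Compilation stops as soon as the gap `up - lo` falls below the allowed error, which is what gives the algorithm its deterministic guarantee without always paying for exact computation.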
Article
We introduce a new perspective on approximations to the maximum a posteriori (MAP) task in probabilistic graphical models that is based on simplifying a given instance and then tightening the approximation. First, we start with a structural relaxation of the original model. We then infer from the relaxation its deficiencies and compensate for them. This perspective allows us to identify two distinct classes of approximations. First, we find that max-product belief propagation can be viewed as a way to compensate for a relaxation, based on a particular idealized case for exactness. We identify a second approach to compensation that is based on a more refined idealized case, resulting in a new approximation with distinct properties. We go on to propose a new class of algorithms that, starting with a relaxation, iteratively seeks tighter approximations.