Implementing NOT EXISTS Predicates over a Probabilistic Database*

Ting-You Wang**, Christopher Re, and Dan Suciu
University of Washington

* This work was partially supported by NSF grants IIS-0513877 and IIS-0713576.
** Corresponding author: tingyouuw@gmail.com.
Abstract. Systems for managing uncertain data need to support queries with negated subgoals, which are typically expressed in SQL through the NOT EXISTS predicate. For example, the user of an RFID tracking system may want to find all RFID tags (people or objects) that have traveled from a point A to a point C without going through a point D. Such queries are difficult to support in a probabilistic database management system, because offending tuples do not necessarily disqualify an answer, but only decrease its probability. In this paper, we present an approach for supporting queries with NOT EXISTS in a probabilistic database management system, by leveraging the existing query processing infrastructure. Our approach is to break up the query into multiple, monotone queries, which can be evaluated in the current system, then to combine their probabilities by addition and subtraction to compute that of the original query. We will also describe how this technique was integrated with MystiQ, and how we incorporated the top-k multi-simulation and safe-plans optimizations.
1 Introduction
Probabilistic databases have been used recently in a variety of applications of un-
certain data: in acquisition systems such as sensor nets or RFID deployments [5,
9, 10, 15], to manage data extracted by information extraction systems [8], to
query incompletely cleaned data [1], data obtained as a result of imprecise data
mappings [7], and data that has been anonymized to protect privacy [12].
The research on query processing on probabilistic databases has led to techniques for processing select-project-join queries [4, 3], queries on data with explicit provenance [2], queries on Markov Network data [16], top-k queries [13], and queries with HAVING predicates [14].
An area that has not been explored so far is queries with negated subgoals, which, in SQL, can be expressed using the NOT EXISTS predicate. Such queries are an important part of managing uncertain data. For example, in an RFID application a user may want to find all events when an RFID tag has traveled from point A to point C without going through a point D; in large scale information extraction systems such as DBLive [6], a user may want to find all database
researchers that have never served on a VLDB committee (e.g. when a PC chair
forming a Program Committee is searching for junior researchers that have never
served on the PC before), or a PC chair may want to find all authors that have
never published before in a database conference (a much needed query when
selecting the best newcomer award).
Yet queries with negated subgoals are difficult to compute on a probabilistic
database. In a standard (deterministic) database, if at least one witness tuple
satisfies the negated subgoal then the answer is immediately disqualified. But
in a probabilistic database such a witness does not immediately disqualify the
answer, only reduces its probability. In fact, if there are many witnesses then
their probabilities need to be aggregated in order to compute by how much to
reduce the probability of the answer.
In this paper we describe a technique for computing SQL queries with NOT
EXISTS predicates on a probabilistic database, which we have implemented on
top of MystiQ, a probabilistic database system [11]. Our approach splits a query
with a conjunction of NOT EXISTS as its predicate into multiple parts: a query
to represent the positive results, and many queries to capture the not exists sub-
goals. Importantly, these queries are all simple select-from-where queries, which
can be evaluated using the existing MystiQ infrastructure. Then, we combine
their results and compute the probability of each output tuple by taking into
account the semantics of the NOT EXISTS predicate. We also describe how to
incorporate in this approach two optimizations present in MystiQ: top-k query
evaluation and safe subplans.
Motivating application: Logging Activities. The RFID Ecosystem project
at the University of Washington [17] deploys about 132 antennas throughout the
department’s building and tracks thousands of RFID tags attached to objects or
carried by people. As these tagged people/objects roam through the hallways, a
database table is populated with their movements. In a perfect world, the table
would contain exactly every instance a tagged person passes by an antenna. How-
ever, this is not the case: readings are often dropped (these are called false negatives), and at other times extra readings are registered (false positives). As a consequence, every
reading recorded in the database is probabilistic. (We refer the reader to [15] for
a full description of the probabilistic model.)
Fig. 1. Person is walking down the hall starting at A
For example, consider the diagram in Fig. 1 that shows the position of three antennas A, B, C (out of a few dozen), and suppose that we are interested in retrieving all timestamps when a user has walked from A to C going through B. The antennas read at a frequency of four readings per second. False negatives are common (missed tags) and false positives are also frequent (readings from nearby antennas). Our system converts these uncertainties into probabilities, and as a consequence might record a sequence of readings like this:

$$A_1\, A_2\, B_3\, A_4\, A_5\, B_6\, B_7\, B_8\, C_9\, C_{10}\, B_{11}\, C_{12}\, \ldots\, A_{21}\, A_{22}\, B_{23}\, A_{24}\, D_{25}\, D_{26}\, D_{27}\, E_{28}\, E_{29}\, C_{30}$$
If the database were deterministic then $(A_5, C_9)$ is the only answer: we don't consider $(A_1, C_9)$ because there are intermediate readings at A, and we don't consider $(A_{24}, C_{30})$ an answer because obviously the user has taken a different route to get from A to C, other than through B. Our query is expressed by the following SQL statement, which finds all pairs of readings (A, C) for which there exist no intermediate readings other than those at antenna B:
SELECT distinct r1.pid, r1.time, r2.time
FROM Data r1, Data r2
WHERE r1.time < r2.time AND r1.pid = <UserID> AND
r2.pid = r1.pid AND
r1.antenna = ’A’ AND r2.antenna = ’C’ AND
NOT EXISTS (SELECT distinct *
FROM Data r3
WHERE r3.pid = r1.pid AND r3.time > r1.time
AND r3.time < r2.time
AND r3.antenna != ’B’)
Evaluating this query on a probabilistic database poses important technical challenges. It is no longer the case that $(A_5, C_9)$ is the single answer. For example, $(A_1, C_9)$ may also be an answer, namely when the offending readings $A_2, A_4, A_5$ are false positives. In fact the probability of the event $(A_1, C_9)$ is $(1-p_2)(1-p_4)(1-p_5)$ (we denote by $p_2$ the probability of the event $A_2$, etc.; these values are stored in the database). Similarly, $(A_{24}, C_{30})$ is now also a possible answer, and its probability is that of all the offending witnesses being absent.
Paper Organization We start by describing the basic data model in Sec. 2.
We describe the main idea in Sec. 3, then provide the general approach in Sec. 4.
We describe some implementation details in Sec. 5 and some preliminary exper-
iments in Sec. 6. We conclude in Sec. 7.
2 Data and Query Model
We briefly describe MystiQ’s query and data model from [3]. The relations are
independent/disjoint probabilistic relations, in which every two tuples are either
independent, or disjoint events. Queries are expressed in SQL, and each answer is
annotated by a probability representing the system’s confidence in that answer:
the answers are typically presented to the user in decreasing order of their output
probability, see Fig. 2. MystiQ supports SELECT-DISTINCT-FROM-WHERE
Fig. 2. An example of a query for MystiQ
queries and uses two algorithms to compute probabilities: safe plans [4] and
multisimulation [13].
The first algorithm translates a SQL query $Q$ into another query $Q'$ that manipulates probabilities explicitly: it multiplies or adds them, or computes their complement $(1-p)$. When such a rewriting is possible then $Q$ is called a safe query and $Q'$ is called a safe plan. The relational database server executes $Q'$ and returns tuples ordered decreasingly by their probabilities: MystiQ will display them without any further processing.
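For intuition, here is a minimal sketch (our illustration, not MystiQ's actual operator) of one such explicit manipulation: an independent-project step, which computes the probability that a group key appears in the output as the complement of the product of complements of its contributing tuples' probabilities. The dict-based representation is an assumption made for the example.

def independent_project(groups):
    """One safe-plan-style operator (a sketch): project tuples that
    are independent events onto a group key. The probability that a
    key appears in the output is 1 - prod(1 - p_i) over the group's
    tuples: the multiply/complement manipulation described above.
    groups: dict mapping a group key to its tuples' probabilities."""
    result = {}
    for key, probs in groups.items():
        p_absent = 1.0
        for p in probs:
            # probability that every contributing tuple is absent
            p_absent *= (1.0 - p)
        result[key] = 1.0 - p_absent
    return result

# Example: three independent readings support the same output tuple.
print(independent_project({"t1": [0.5, 0.4, 0.2]}))  # {'t1': 0.76}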
The second algorithm is used when $Q$ is not safe. In this case MystiQ runs a much simpler query on the database engine, but needs to do considerably more work to compute the probabilities. The database engine simply returns all possible answer tuples $t_1, \ldots, t_n$ together with their lineage [2]: it does not compute any output probabilities. The lineage of a tuple is a positive DNF formula whose propositional variables are tuples from the input database: importantly, these input tuples carry with them the input probability. The probability of each tuple $t_i$ is precisely the probability of its associated DNF formula, and MystiQ evaluates these probabilities $p_1, p_2, \ldots, p_n$ by running $n$ Monte Carlo simulation algorithms. This is an expensive process, and here MystiQ uses an optimization called multisimulation, introduced in [13], whose goal is to run the smallest number of simulation steps required to identify and rank only the top $k$ tuples, ordered decreasingly by their probabilities. We describe here how the multisimulation algorithm identifies the top $k$ tuples, without necessarily sorting them, and refer the reader to [13] for details. Suppose that at some point we have a confidence interval for each tuple: $[a_1, b_1], [a_2, b_2], \ldots, [a_n, b_n]$, s.t. it is guaranteed that $p_i \in [a_i, b_i]$, for $i = 1, \ldots, n$. (Initially, $a_1 = a_2 = \ldots = 0$ and $b_1 = b_2 = \ldots = 1$.) The crux of the algorithm consists of choosing which DNF formula to simulate next. Let $c$ be the $k$'th largest value in $\{a_1, \ldots, a_n\}$ and let $d$ be the $(k+1)$'st largest value in $\{b_1, \ldots, b_n\}$. The first observation is that if $d < c$ then the algorithm has succeeded in identifying the top $k$ tuples: these are precisely the $k$ tuples $t_i$ s.t. $c \le a_i$, since for any other tuple $t_j$ not in this set it is the case that $b_j \le d$, hence $p_j < p_i$. If $c \le d$, then the interval $[c, d]$ is called the critical region: the second observation is that in this case one must improve the confidence interval $[a_i, b_i]$ of some tuple that contains the critical region: $a_i \le c \le d \le b_i$. Thus, the algorithm chooses one of the tuples $t_i$ whose current confidence interval $[a_i, b_i]$ "crosses" the critical region (hence $t_i$ is called a crosser), then runs a few simulation steps on the DNF formula for $t_i$, which further shrinks the interval $[a_i, b_i]$, and then re-computes the critical region $c, d$. It stops when $d < c$.
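A minimal sketch of this selection loop follows, under simplifying assumptions: `simulate` is a hypothetical callback that runs a few Monte Carlo steps on tuple i's DNF formula and returns a tightened interval, and a crosser is assumed to exist whenever the critical region is nonempty (the full algorithm in [13] handles the corner cases).

import random

def multisimulation(intervals, k, simulate):
    """Sketch of the top-k multisimulation loop.
    intervals: list of (a_i, b_i) bounds, with the true probability
               p_i guaranteed to lie in [a_i, b_i].
    simulate(i): runs a few Monte Carlo steps on tuple i's DNF
                 formula and returns a tightened (a_i, b_i).
    Returns the indices of the top-k tuples, not necessarily sorted."""
    n = len(intervals)
    assert 0 < k < n
    while True:
        c = sorted((a for a, _ in intervals), reverse=True)[k - 1]
        d = sorted((b for _, b in intervals), reverse=True)[k]
        if d < c:
            # critical region is empty: the top-k tuples are exactly
            # those whose lower bound reaches c
            return [i for i in range(n) if intervals[i][0] >= c][:k]
        # otherwise pick a crosser, i.e. an interval containing the
        # critical region [c, d], and shrink it
        crossers = [i for i in range(n)
                    if intervals[i][0] <= c and d <= intervals[i][1]]
        i = random.choice(crossers)
        intervals[i] = simulate(i)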
In summary, MystiQ tries to evaluate a safe query, when possible, or runs
the multisimulation algorithm otherwise.
3 Evaluating Queries with a Single NOT EXISTS
In this work we considered SQL queries that have a conjunction of NOT EXISTS
in the where clause; the nested queries are in turn simple select-from-where
queries, i.e. we do not allow nested NOT EXISTS. We describe in this section
our approach, and we start with the simple case of a single nested NOT EXISTS
query. We use the following as our running example:
Original query Q:
SELECT distinct col_1, col_2, col_3
FROM R1, R2, R3
WHERE condition1 AND NOT EXISTS
( SELECT *
FROM R’1, R’2, R’3
WHERE condition2 )
From Q, we derive the following two queries. The outer query is the query Q
with the NOT EXISTS predicate removed:
Outer query Q1:
SELECT distinct col_1, col_2, col_3
FROM R1, R2, R3
WHERE condition1
The inner query is the query under the NOT EXISTS predicate:
Inner query Q2:
SELECT distinct col_1, col_2, col_3
FROM R1, R2, R3, R’1, R’2, R’3
WHERE condition1 AND condition2
Let $E_1$ be the event that a tuple $t$ is in the answer to $Q_1$, and $E_2$ be the event that $t$ is in the answer to $Q_2$. Then the answers to $Q$ are precisely the tuples satisfying the event $E_1 \wedge \neg E_2$, and their probability is:

$$P(t \in Q) = P(E_1) - P(E_1 \wedge E_2) = P(E_1) - P(E_2)$$

The last equality holds because every tuple that is an answer to $Q_2$ is also an answer to $Q_1$.
In summary, to evaluate $Q$ on a probabilistic database we proceed as follows. First we evaluate $Q_1$ and obtain a set of tuples $t_1, \ldots, t_n$ with probabilities $p_1, \ldots, p_n$. Next, we evaluate $Q_2$: we assume its answer is the same list of tuples, with probabilities $p'_1, \ldots, p'_n$: if $Q_2$ returns some tuple that is not among $t_1, \ldots, t_n$ then we can ignore it, and conversely, if it does not return some tuple $t_i$ we simply add it and set its probability $p'_i = 0$. The answer to the query $Q$ is the list of tuples $t_1, \ldots, t_n$, and their probabilities are $p_1 - p'_1, \ldots, p_n - p'_n$.
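When both queries can be evaluated exactly (e.g., via safe plans), this combination step is a keyed subtraction. A minimal sketch, assuming each answer set arrives as a dict from answer tuple to probability (an assumed representation, not MystiQ's internal one):

def combine_single_not_exists(outer, inner):
    """Combine the answers of the outer query Q1 and the inner
    query Q2 of a query Q with a single NOT EXISTS.
    Tuples of Q2 not among Q1's answers are ignored; tuples of Q1
    missing from Q2 implicitly get probability 0.
    Returns P(t in Q) = p_i - p'_i for every answer tuple."""
    return {t: p - inner.get(t, 0.0) for t, p in outer.items()}

# Example: t has probability 0.9 in Q1 and 0.3 in Q2, so its
# probability in Q is 0.9 - 0.3 = 0.6; u is ignored.
print(combine_single_not_exists({"t": 0.9}, {"t": 0.3, "u": 0.5}))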
4 Evaluating Queries with Multiple NOT EXISTS
We now show how to generalize to multiple NOT EXISTS predicates. Our run-
ning example is:
Q:
SELECT DISTINCT col_1, col_2, col_3
FROM A_1, B_1, C_1
WHERE condition_1 AND NOT EXISTS
(SELECT *
FROM A_2, B_2, C_2
WHERE condition_2)
AND
...
AND NOT EXISTS
(SELECT *
FROM A_m, B_m,C_m
WHERE condition_m)
As before we define an outer query $Q_1$, and several inner queries $Q_2, Q_3, \ldots, Q_m$; their definitions are a straightforward generalization of those in the previous section.
Let $E_i$ be the event that a tuple $t$ is in the answer set to the query $Q_i$. Then, the event "$t$ is in the answer to $Q$" is equivalent to:

$$(t \in Q) \;\equiv\; E_1 \wedge \neg E_2 \wedge \ldots \wedge \neg E_k \wedge \ldots \wedge \neg E_m$$

Therefore, by using the inclusion-exclusion formula we derive:

$$P(t \in Q) = P(E_1 \wedge \neg E_2 \wedge \ldots \wedge \neg E_m) = P(E_1) - P(E_1 \wedge (E_2 \vee E_3 \vee \ldots \vee E_m))$$
$$= P(E_1) - \sum_{2 \le i_2 \le m} P(E_1 \wedge E_{i_2}) + \sum_{2 \le i_2 < i_3 \le m} P(E_1 \wedge E_{i_2} \wedge E_{i_3}) - \ldots + (-1)^{m+1} P(E_1 \wedge E_2 \wedge \ldots \wedge E_m)$$
Thus, in order to evaluate the query $Q$ we proceed as follows. We start by evaluating the query $Q_1$: suppose it returns $n$ tuples, $t_1, \ldots, t_n$, with probabilities $p_1, \ldots, p_n$. Next we evaluate the following $2^{m-1} - 1$ queries: $Q_{i_2} \wedge \ldots \wedge Q_{i_k}$, for $k \ge 2$ and $1 < i_2 < \ldots < i_k \le m$. (Note that $Q_1 \wedge Q_{i_2} \wedge \ldots \wedge Q_{i_k}$ is equivalent to $Q_{i_2} \wedge \ldots \wedge Q_{i_k}$ when $k \ge 2$, since each inner query already includes the relations and conditions of $Q_1$.) Assume that each of these queries returns the same set of tuples $t_1, \ldots, t_n$, with some probabilities. (Any additional tuples can be removed, and any missing tuples can be added with probability 0.) Then the answer to $Q$ consists of the tuples $t_1, \ldots, t_n$, and the probability of $t_i$ is obtained by summing the $2^{m-1}$ probabilities of $t_i$ in all queries $Q_1 \wedge Q_{i_2} \wedge \ldots \wedge Q_{i_k}$, with signs $(-1)^{k+1}$, where $k$ counts the conjoined queries including $Q_1$.
We end this section with a note on our algorithm for generating the list of $2^{m-1}$ queries that need to be evaluated. This process is briefly demonstrated in Fig. 3. We maintain a list of already generated queries and extend the list by merging each new nested query with the queries already in it. However, the order of appending is very important; it must proceed from the back of the list to the front in order to create the alternating property that we desire (a sketch follows Fig. 3). With this we are ready to begin the implementation process for NOT EXISTS.
Fig. 3. Example of how a list of generated queries is populated
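A minimal sketch of this back-to-front generation (our illustration of Fig. 3; `merge` is a placeholder for conjoining a nested query's relations and conditions onto an existing query, which the string concatenation below only hints at):

def generate_queries(outer, nested):
    """Generate the 2^(m-1) (query, sign) pairs of Sec. 4 for an
    outer query and its m-1 nested (NOT EXISTS) queries. Walking
    the current list from back to front while appending merged
    queries with the opposite sign keeps the signs alternating
    along the list, matching the signs (-1)^(k+1)."""
    def merge(query, inner):  # placeholder for the real merge
        return query + " AND " + inner
    queries = [(outer, +1)]
    for inner in nested:
        for q, sign in reversed(list(queries)):  # back to front
            queries.append((merge(q, inner), -sign))
    return queries

# Example with two NOT EXISTS subgoals:
for q, s in generate_queries("Q1", ["N2", "N3"]):
    print("+" if s > 0 else "-", q)
# prints: + Q1 / - Q1 AND N2 / + Q1 AND N2 AND N3 / - Q1 AND N3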
5 Implementing NOT EXISTS
It is clear that the first step of implementing NOT EXISTS is to parse a query
and recognize the NOT EXISTS clauses. Finding the clauses is straightforward
since it is a standard traversal of a tree generated to represent the input query.
But once these locations are found, we cannot immediately begin the genera-
tion of queries because there are two main issues that must be resolved. First,
the naming of table aliases could conflict between the outer and inner queries.
Second, we need to generate the queries in the correct order to produce the
alternating signs that we desire for the probabilities.
The first problem requires a renaming that is done at the moment an input
query is parsed. For every table alias that is given, we append a suffix to make the name unique from all others. The suffix chosen for MystiQ was _X, where X is a non-negative integer. An example of this renaming is shown below:
SELECT R1.attr1, R1.attr2, R2.attr1
FROM Relation R1, Relation R2
WHERE R1.attr1 = R2.attr2 and R1.attr2 = ’in’ AND NOT EXISTS
(SELECT *
FROM Relation R1
WHERE R1.attr1 = R2.attr2 and R1.attr2 = ’out’)
TRANSLATES INTO:
SELECT R1_1.attr1, R1_1.attr2, R2_2.attr1
FROM Relation R1_1, Relation R2_2
WHERE R1_1.attr1 = R2_2.attr2 and R1_1.attr2 = ’in’ and NOT EXISTS
(SELECT *
FROM Relation R1_3
WHERE R1_3.attr1 = R2_2.attr2 and R1_3.attr2 = ’out’)
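A minimal sketch of the renaming pass, assuming the parser hands us the alias occurrences in the order it encounters them (a hypothetical helper; MystiQ performs this while building the query tree):

import itertools

def rename_aliases(occurrences):
    """Append a '_X' suffix, X a non-negative integer drawn from a
    global counter, to every alias occurrence, so that the same
    alias used in the outer query and in a NOT EXISTS subquery
    receives two distinct names."""
    counter = itertools.count(1)
    return [f"{alias}_{next(counter)}" for alias in occurrences]

# The example above: R1 and R2 in the outer query, R1 again in
# the NOT EXISTS subquery.
print(rename_aliases(["R1", "R2", "R1"]))  # ['R1_1', 'R2_2', 'R1_3']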
The second problem is much simpler to solve. In order to generate the
new queries in the correct order, we first make a pass over the input query and
extract/remove all the inner queries. At the end of this process, we have a list of
all the nested queries and the outer query we need. For the first of the nested
queries, we merge it with the outer query, then we take another nested query, if
it exists, and merge it with the first two generated queries in reverse order (just
as Fig. 3 demonstrated). We proceed in this fashion for all nested queries. After
this process is done, we are ready to begin calculating probabilities.
The calculation of probabilities can take one of two pathways. The first is
to compute the probabilities explicitly, taking advantage of the safe plans opti-
mization. The second is to use the multi-simulation method. Clearly, explicitly
calculating the probability is much more desirable, so unless a user specifies that
simulations must be run, MystiQ will check if all of the generated queries, which
are all monotone, can be computed exactly. If so, we can compute the NOT
EXISTS query by substituting the probability values directly into the formulas
discussed in Sections 3 and 4 to get an exact solution.
In a simulation, we no longer have exact probabilities to work with, but rather
entire sets of confidence intervals, one set for each of the generated queries. Recall from Section 2 that an interval represents a range in which the true probability lies for a particular result of the query. As mentioned in Section 4, the possible tuples
of the original query are precisely those of the outer query. However, there is the
additional complication of the dependency between the intervals for the original
query and those of the generated queries. Since we never actually ran the original
query we have no DNF formula to simulate, so we must use the other intervals.
The endpoints for a tuple of the original query are obtained by finding the extreme points of the probability formula. For example, to compute the lower endpoint of an interval, one computes the smallest possible
probability value that could result from substituting the probabilities of the
equivalent tuples of the generated queries. Similarly, the upper endpoint would
be the largest possible probability value. Figure 4 provides a demonstration of
the actual calculation and equations.
Fig. 4. Example of how intervals are calculated
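A minimal sketch of this endpoint computation, assuming each generated query contributes, for the tuple at hand, its current confidence interval together with its sign from the formula of Sec. 4 (the final clamping to [0, 1] is our addition):

def combine_intervals(signed_intervals):
    """Compute the widest-case interval for the signed sum of
    Sec. 4. For the lower endpoint, take the lower bound of every
    positive term and the upper bound of every negative term; the
    upper endpoint is symmetric.
    signed_intervals: list of ((a, b), sign) pairs, one per
    generated query, for a single answer tuple."""
    lo = hi = 0.0
    for (a, b), sign in signed_intervals:
        if sign > 0:
            lo += a
            hi += b
        else:
            lo -= b
            hi -= a
    return max(lo, 0.0), min(hi, 1.0)

# Q1's interval minus the interval of the single generated
# query Q1 AND Q2:
print(combine_intervals([((0.7, 0.9), +1), ((0.1, 0.3), -1)]))
# roughly (0.4, 0.8), up to floating-point rounding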
With this key issue resolved, the simulation process can proceed in much
the same way as it did in the original implementation of MystiQ, with only a
slight modification. The process begins with initializing a set of intervals (to
represent the original query), $S$, to the intervals of the outer query. Then one interval, $I \in S$, is chosen (exactly in the same way an interval would have been chosen before). However, this interval itself is not simulated. Rather, all intervals representing the same tuple from the generated queries are simulated. These associated intervals are then aggregated together, using the method just described, to come up with the new interval for our original query. This process is repeated until enough intervals are isolated from each other to give us the top-k tuples.
6 Experiments
Experiments were run in order to judge how well the implementation performed
when adjusting various parameters. For starters, we adjusted database table sizes
and toggled using safe plans. Next, we varied the number of Monte Carlo simula-
tions run every step in the multisimulation algorithm. Finally, we experimented
with how the query system handles more complex queries.
First, experiments were run to test how the implementation performed over
different data sizes. For this, we used a synthetic database (rows are generated
with ’random’ probabilities) consisting of Products and Orders, and ran the
following query:
SELECT DISTINCT p.id, p.name
FROM ProductEvent p
WHERE NOT EXISTS(
SELECT DISTINCT *
FROM OrderEvent o
WHERE o.productid = p.id and o.price > 10)
We varied the data set size (split almost equally between Products and Orders):
4000, 9500, 14000, 19000, and 28500 rows. We also ran this query both with the safe plan optimizations and without. Fig. 5 shows the results of those runs.
Fig. 5. Comparison of optimized and unoptimized queries
From this graph, we see that the running times of NOT EXISTS queries
exponentially rise as the data set size increases. However, as the size increases,
the slopes do seem to grow shallower. We also see in the graph that safe plan
optimizations make a significant improvement in performance. From the tests,
there was generally a 94% improvement in time when optimizations could be
found.
Another experiment examined how changing the number of simulations run
every step in the simulating process of MystiQ affected the running time. The
graph in Fig. 6 charts the change in times as we varied the number of simulations
from 10000 to 70000 on a fixed data set of about 10000 rows, about 5000 products
and 5000 orders (we used the same query as before).
Fig. 6. Execution times against changing the number of simulations per step
We see that for both small and large values, the running times dramatically
increase; however, the cause is very different. For the smaller values, the increase
in running time is mostly due to the fact that many more steps have to run to
reach enough simulations on the intervals to isolate the top k. For the larger
values, the increase in time is due to the fact that, though we reduce the number
of steps run, we are over-simulating the intervals. The extra simulations are
simply taking up time and not adding any new information. The middle numbers,
particularly at around 30000, balance the two problems well. They run enough
simulations to reduce the number of steps taken, but they do not over-simulate
the intervals.
Although Fig. 6 showed 30000 to be the optimal number of simulations per step, this is only for this particular case. Different queries will have different results. However, even in those cases, the graph of times ought to follow the same curve, since those queries face the same problems; it will only be translated along the x-axis.
Lastly, experiments were performed on gathered data closely related to the
RFID data (times when a person entered and left a room). A complex query
was written that searched for consecutive events of entering and leaving a room
for a particular person, but the data set size was only 660 tuples (split between
entering and leaving rooms). The query at first used two NOT EXISTS subqueries; here is the query:
SELECT DISTINCT er.tag_id, er.room_num, er.timestamp, lr.timestamp
FROM EnteredRoomEvent er, LeftRoomEvent lr
WHERE er.tag_id = lr.tag_id and er.room_num =
lr.room_num and er.timestamp < lr.timestamp AND NOT EXISTS
(SELECT DISTINCT *
FROM EnteredRoomEvent e3
WHERE e3.tag_id = er.tag_id and e3.timestamp >
er.timestamp and e3.timestamp < lr.timestamp) AND NOT EXISTS
(SELECT DISTINCT *
FROM LeftRoomEvent e4
WHERE e4.tag_id = er.tag_id and e4.timestamp >
er.timestamp and e4.timestamp < lr.timestamp)
Then the query was translated into a single NOT EXISTS query to see how
timings are affected. This new query is found below (where the AllSightings table
is simply the union of EnteredRoom/LeftRoom events).
SELECT DISTINCT er.tag_id, er.room_num, er.timestamp, lr.timestamp
FROM EnteredRoomEvent er, LeftRoomEvent lr
WHERE er.tag_id = lr.tag_id and er.room_num =
lr.room_num and er.timestamp < lr.timestamp AND NOT EXISTS
(SELECT distinct *
FROM AllSightings s
WHERE s.tag_id = er.tag_id and s.timestamp >
er.timestamp and s.timestamp < lr.timestamp)
The results of the test showed that the query is about 25 times slower when adding one additional level of complexity. With only one NOT EXISTS subquery, results were found in 51.062 seconds, whereas by adding one more level of complexity, it took 1260.389 seconds. This increase comes from executing twice the number of queries and running simulations on twice the number of sets of intervals.
In addition to the immediate results of the experiment, another observation
was made. An issue arises when there are many probabilities that lie very close
together, particularly when they are all close to zero (which was the case for
this experiment). While executions were run, there was a clear slowdown when all the intervals narrowed and were centered about zero. At this point, the simulator makes only small changes to an interval's endpoints, leading to longer execution times to distinguish a top-k set of intervals. This is more prevalent in NOT EXISTS queries, since the NOT EXISTS clause can easily push many probabilities close to zero; but if certain monotone queries were chosen carefully over data sets with very small probabilities, this case could also arise.
7 Conclusion
Systems for managing uncertain data need to support queries with negated sub-
goals, such as SQL queries with NOT EXISTS predicates. We have described
in this paper an approach for supporting such queries, while leveraging much
of the infrastructure of an existing probabilistic database management system.
We have described optimizations which, in our preliminary experiments, improve performance by about 94%. However, the execution time is still fairly slow and can be improved upon. As our experiments show, the running times are seemingly exponential with respect to the data size. Also, as the complexity of the query increases, there is a significant increase in running time. Further improvements in any of these areas would be extremely beneficial to making the system more practical to use in many different environments.
Possible directions for future work would be to improve the simulation process or the interpretation of a NOT EXISTS query. Some specific areas that can be improved include implementing a faster way of running Monte Carlo steps, understanding how to better choose intervals to simulate, or figuring out how to choose the number of steps to take each time we simulate an interval. Other work could be to find better ways to break apart a NOT EXISTS query so that we no longer have an exponential number of queries. Unless we get away from the exponential increase in generated queries, we can expect an exponential increase in time as more NOT EXISTS predicates are added.
References
1. P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In
ICDE, 2006.
2. O. Benjelloun, A. D. Sarma, C. Hayworth, and J. Widom. An introduction to ULDBs and the Trio system. IEEE Data Engineering Bulletin, 29(1):5–16, 2006.
3. N. Dalvi, C. Re, and D. Suciu. Query evaluation on probabilistic databases. IEEE
Data Engineering Bulletin, 29(1):25–31, 2006.
4. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In
VLDB, Toronto, Canada, 2004.
5. A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Using
probabilistic models for data management in acquisitional environments. In CIDR,
pages 317–328, 2005.
6. A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayya-
dian, and W. Shen. Community information management. IEEE Data Engineer-
ing Bulletin, Special Issue on Probabilistic Data Management, 29(1):64–72, March
2006.
7. X. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. In VLDB,
2007.
8. R. Gupta and S. Sarawagi. Creating probabilistic databases from information
extraction models. In VLDB, pages 965–976, 2006.
9. T. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic
data. In SODA, 2007.
10. S. Jeffery, M. Garofalakis, and M. Franklin. Adaptive cleaning for RFID data
streams. In VLDB, pages 163–174, 2006.
11. MystiQ: a probabilistic database system, available at http://mystiq.cs.washington.edu/.
12. V. Rastogi, D. Suciu, and S. Hong. The boundary between privacy and utility in
data publishing. In VLDB, 2007.
13. C. Re, N. Dalvi, and D. Suciu. Efficient Top-k query evaluation on probabilistic
data. In ICDE, 2007.
14. C. Re and D. Suciu. Efficient evaluation of HAVING queries on a probabilistic database. In Proceedings of DBPL, 2007.
15. C. Ré, J. Letchner, M. Balazinska, and D. Suciu. Extracting events from correlated streams. In SIGMOD, 2008. To appear.
16. P. Sen and A. Deshpande. Representing and querying correlated tuples in proba-
bilistic databases. In ICDE, 2007.
17. E. Welbourne et al. Challenges for pervasive RFID-based infrastructures. In PERTEC Workshop, March 2007.