Fig 5
Source publication
Systems for managing uncertain data need to support queries with negated subgoals, which are typically expressed in SQL through the NOT EXISTS predicate. For example, the user of an RFID tracking system may want to find all RFID tags (people or objects) that have traveled from a point A to a point C without going through a point D. Such queries...
Context in source publication
Context 1
... EXISTS. The first step in implementing NOT EXISTS is to parse a query and recognize the NOT EXISTS clauses. Finding the clauses is straightforward, since it is a standard traversal of the tree generated to represent the input query. Once these locations are found, however, we cannot immediately begin the generation of queries, because two main issues must be resolved. First, the naming of table aliases could conflict between the outer and inner queries. Second, we need to generate the queries in the correct order to produce the alternating signs that we desire for the probabilities.

The first problem requires a renaming that is done at the moment an input query is parsed. To every table alias that is given, we append a suffix to make the name unique from all others. The suffix chosen for MystiQ was X, where X is a non-negative integer; for instance, two conflicting aliases p would be renamed to, say, p0 and p1. The second problem is much simpler to solve. To generate the new queries in the correct order, we first make a pass over the input query and extract and remove all the inner queries. At the end of this process, we have the outer query and a list of all the nested queries. We merge the first nested query with the outer query; then we take the next nested query, if it exists, and merge it with the first two generated queries in reverse order (just as Fig. 3 demonstrated). We proceed in this fashion for all nested queries. Once this process is done, we are ready to begin calculating probabilities.

The calculation of probabilities can take one of two pathways. The first is to compute the probabilities explicitly, taking advantage of the safe-plan optimization. The second is to use the multisimulation method. Explicitly calculating the probability is clearly much more desirable, so unless the user specifies that simulations must be run, MystiQ checks whether all of the generated queries, which are all monotone, can be computed exactly. If so, we can compute the NOT EXISTS query by substituting the probability values directly into the formulas discussed in Sections 3 and 4 to get an exact solution.

In a simulation, we no longer have exact probabilities to work with, but rather entire sets of confidence intervals, one set for each of the generated queries. Recall from Section 2 that an interval represents a range in which the true probability lies for a particular result of the query. As mentioned in Section 4, the possible tuples of the original query are precisely those of the outer query. However, there is the additional complication of the dependency between the intervals for the original query and those of the generated queries. Since we never actually ran the original query, we have no DNF formula to simulate, so we must work from the other intervals. Computing the endpoints for a tuple of the original query amounts to finding the extreme points of the probability formula: the lower endpoint is the smallest possible probability value that could result from substituting the probabilities of the corresponding tuples of the generated queries, and the upper endpoint is the largest possible value. Figure 4 provides a demonstration of the actual calculation and equations. With this key issue resolved, the simulation process can proceed in much the same way as it did in the original implementation of MystiQ, with only a slight modification.
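For concreteness, the two parsing-time steps just described can be sketched in a few lines of Python. This is a minimal illustration, not MystiQ's actual code; the function names, the query representation, and the merge callback are all hypothetical.

```python
# Illustrative sketch of the parsing-time steps described above:
# (1) make table aliases unique by appending an integer suffix, and
# (2) merge the extracted nested queries against the previously
#     generated queries in reverse order.

def rename_aliases(aliases):
    """Append a non-negative integer suffix to every table alias so
    that outer- and inner-query aliases can no longer collide."""
    return {alias: f"{alias}{i}" for i, alias in enumerate(aliases)}

def generate_queries(outer, nested_queries, merge):
    """Build the list of generated monotone queries in the order that
    yields the alternating signs of the probability formula.  `merge`
    is a caller-supplied function combining a generated query with a
    nested (NOT EXISTS) query."""
    generated = [outer]
    for nested in nested_queries:
        # Merge against every query generated so far, in reverse order.
        generated.extend([merge(q, nested) for q in reversed(generated)])
    return generated
```

Each NOT EXISTS predicate doubles the number of generated queries in this scheme, which is consistent with the remark (quoted under Citations below) that the algorithm is exponential in the number of NOT EXISTS predicates.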
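The endpoint computation, and the modified simulation loop it feeds into (described in the next paragraph), can be sketched the same way. The sketch assumes, following the discussion of Figs. 3 and 4, that the probability of an original-query tuple is the alternating sum p1 − p2 + p3 − ... of the probabilities of its corresponding tuples in the generated queries; for a single NOT EXISTS this is simply P(outer) − P(outer AND inner). Again, all names are illustrative.

```python
def aggregate_interval(generated_intervals):
    """Extreme points of the alternating sum p1 - p2 + p3 - ... when
    each p_i may lie anywhere in its (low, high) confidence interval."""
    low = high = 0.0
    for i, (lo, hi) in enumerate(generated_intervals):
        if i % 2 == 0:                       # '+' term
            low, high = low + lo, high + hi
        else:                                # '-' term: subtract the
            low, high = low - hi, high - lo  # most / the least
    return max(low, 0.0), min(high, 1.0)     # clamp to a probability

def multisimulation(S, associated, choose, simulate_step, top_k_isolated):
    """Skeleton of the modified loop.  S maps each possible tuple of
    the original query to its current interval; associated[t] lists
    the generated-query interval objects for tuple t."""
    while not top_k_isolated(S):
        t = choose(S)                  # pick an interval as before
        for g in associated[t]:
            simulate_step(g)           # Monte Carlo the generated queries
        S[t] = aggregate_interval([g.interval for g in associated[t]])
    return S
```

For example, with two generated queries whose current intervals are (0.6, 0.8) and (0.1, 0.3), aggregate_interval returns (0.3, 0.7): the smallest value of p1 − p2 is 0.6 − 0.3 and the largest is 0.8 − 0.1.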
The process begins by initializing a set of intervals S, representing the original query, to the intervals of the outer query. Then one interval I ∈ S is chosen (in exactly the same way an interval would have been chosen before). However, this interval itself is not simulated. Rather, all intervals representing the same tuple in the generated queries are simulated, and these associated intervals are then aggregated, using the method described above, into the new interval for our original query. This process is repeated until a certain number of intervals are isolated from each other, giving us the top-k tuples.

Experiments were run to judge how well the implementation performed as various parameters were adjusted. For starters, we adjusted database table sizes and toggled the use of safe plans. Next, we varied the number of Monte Carlo simulations run at every step of the multisimulation algorithm. Finally, we experimented with how the query system handles more complex queries.

First, experiments were run to test how the implementation performed over different data sizes. For this, we used a synthetic database (rows are generated with 'random' probabilities) consisting of Products and Orders, over which we ran a single fixed NOT EXISTS query. We varied the data set size (split almost equally between Products and Orders): ∼4000, ∼9500, ∼14000, ∼19000, and ∼28500 rows, and we ran the query both with and without the safe-plan optimizations. Fig. 5 shows the results of those runs. From this graph, we see that the running times of NOT EXISTS queries rise exponentially as the data set size increases, although the slopes do seem to grow shallower at the larger sizes. We also see that the safe-plan optimizations make a significant improvement in performance: across the tests, there was generally a 94% improvement in running time when optimizations could be found.

Another experiment examined how changing the number of simulations run at every step of MystiQ's simulation process affected the running time. The graph in Fig. 6 charts the change in times as we varied the number of simulations from 10000 to 70000 on a fixed data set of about 10000 rows (about 5000 products and 5000 orders), using the same query as before. We see that the running times increase dramatically for both small and large values, but for very different reasons. For the smaller values, the increase comes mostly from the many additional steps needed before the intervals accumulate enough simulations to isolate the top k. For the larger values, the increase comes from oversimulating the intervals: although fewer steps are run, the extra simulations simply take up time without adding new information. The middle values, particularly around 30000, balance the two problems well; they run enough simulations to reduce the number of steps taken, yet do not oversimulate the intervals. Although Fig. 6 showed 30000 to be the optimal number of simulations per step, this holds only for this particular case, and different queries will have different optima. Even so, the graph of times ought to follow the same curve for other queries, since they face the same trade-off; the curve will merely be translated along the x-axis.

Lastly, experiments were performed on gathered data closely resembling RFID data (times when a person entered and left a room).
A complex query was written that searched for consecutive events of a particular person entering and leaving a room, over a data set of only 660 tuples (split between entering and leaving events). The query at first used two NOT EXISTS subqueries. It was then translated into a query with a single NOT EXISTS subquery, over an AllSightings table that is simply the union of the EnteredRoom/LeftRoom events, to see how the timings are affected. The results of the test showed that adding one additional level of complexity makes evaluation about 25 times slower: with only one NOT EXISTS subquery, results were found in 51.062 seconds, whereas with one more level of complexity it took 1260.389 seconds (a factor of roughly 24.7). This increase comes from executing twice the number of queries and running simulations on twice the number of sets of intervals.

In addition to the immediate results of the experiment, another observation was made. An issue arises when many probabilities lie very close together, particularly when they are all close to zero (which was the case in this experiment). During execution, there was a clear slowdown once all the intervals narrowed and became centered about zero. At that point, the simulator makes only small changes to an interval's endpoints, so it takes much longer to distinguish a top-k set of intervals. This is more prevalent in NOT EXISTS queries, since the NOT EXISTS clause makes it easy to push many probabilities close to zero; but if certain monotone queries were chosen carefully over data sets with very small probabilities, the same situation could arise.

Systems for managing uncertain data need to support queries with negated subgoals, such as SQL queries with NOT EXISTS predicates. We have described in this paper an approach for supporting such queries while leveraging much of the infrastructure of an existing probabilistic database management system. We have also described optimizations which, in our preliminary experiments, improve performance by about 94%. However, execution is still fairly slow and can be improved upon. As our experiments show, the running times are seemingly exponential with respect to the data size, and running time also increases significantly with the complexity of the query. Further improvements in any of these areas would be extremely beneficial to making the system practical in many different environments. Possible directions for future work are to improve the simulation process or to improve the interpretation of a NOT EXISTS query. Specific areas that could be improved include a faster way of running Monte Carlo steps, better choices of which intervals to simulate, and how to choose the number of steps to take each time we simulate an interval. Other work could be to find better ways to break apart a ...
Similar publications
Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many a...
We study the following autocompletion problem, which is at the core of a new full-text search technology that we have developed over the last year. The problem is, for a given document collection, to precompute a data structure using as little space as possible such that queries of the following kind can be processed as quickly as possible: given a...
The join or correlated filtering of sensor readings is one of the fundamental query operations in wireless sensor networks (WSNs). Although the join in centralized or distributed databases is a well-researched problem, join processing in WSNs has quite different characteristics and is much more difficult to perform due to the lack of statistics on...
For tracing tag locations, the trajectories should be modeled and indexed in a radio frequency identification (RFID) system. The trajectory of a tag is represented as a line that connects two spatiotemporal locations captured when the tag enters and leaves the vicinity of a reader. If a tag enters but does not leave a reader, its trajectory is re...
Recent advances in wireless communications and positioning devices have generated a tremendous amount of interest in the continuous monitoring of spatial queries. However, such applications can incur a heavy burden on the data owner (DO), due to very frequent location updates. Database outsourcing is a viable solution, whereby the DO delegates its...
Citations
... In temporal databases, the existence of a matching tuple in n does not disqualify the tuple of p itself but timepoints at which it is valid [1], [2]. In probabilistic databases, where tuples have a probability to be true or false, the existence of a matching tuple in n only reduces the probability with which a tuple is included in the output [3], [4]. ...
... The answer tuple ('Ann, ZAK, hotel₁', a₁ ∧ b₃, [4,6), 0.49) expresses that, with probability 0.49, Ann wants to visit Zakynthos (a₁) and stay at hotel₁ in Zakynthos (b₃) during the interval [4,6). It is valid over the intersection of the intervals of tuples a₁ and b₃, and it is true when both these tuples are true. ...
The result of a temporal-probabilistic (TP) join with negation includes, at each time point, the probability with which a tuple of a positive relation matches none of the tuples in a negative relation, for a given join condition. TP outer and anti joins thus resemble the characteristics of relational outer and anti joins also in the case when there exist time points at which input tuples from the positive relation have non-zero probabilities to be true and input tuples from the negative relation have non-zero probabilities to be false, respectively. For the computation of TP joins with negation, we introduce generalized lineage-aware temporal windows, a mechanism that binds an output interval to the lineages of all the matching valid tuples of each input relation. We group the windows of two TP relations into three disjoint sets based on the way attributes, lineage expressions and intervals are produced. We compute all windows in an incremental manner, and we show that pipelined computations allow for the direct integration of our approach into PostgreSQL. We thereby alleviate the prevalent redundancies in the interval computations of existing approaches, which is proven by an extensive experimental evaluation with real-world datasets.
... In temporal databases, the existence of a matching tuple in the negative relation does not disqualify the tuple of p itself but timepoints at which it is valid [BBJ98,BJ09]. In probabilistic databases, where tuples have a probability to be true or false, the existence of a matching tuple in n only reduces the probability with which a tuple is included in the output [Suc09,WRS08]. ...
... In all these works, the focus is restricted to select-project-join queries. Probabilistic anti-joins expressed using the NOT EXISTS predicate in SQL have been explored by Wang et al. [WRS08]. They have been integrated in MystiQ by breaking the initial query into positive and negative subqueries that are separately evaluated and then combined. ...
... In temporal databases, the existence of a matching tuple in the negative relation does not disqualify the tuple of p itself but time points at which it is valid [BBJ98,BJ09]. In probabilistic databases, where tuples have a probability to be true or false, the existence of a matching tuple in n only reduces the probability with which a tuple is included in the output [Suc09,WRS08]. ...
... Furthermore, in this survey we focus only on work that relies on a similarity function (or distance function) when executing the data matching process. Other work in the literature uses probabilistic theory or statistical methods (Takasu, Fukagawa and Akutsu, 2007), for example, in order to manage heterogeneous data (Widom, 2008; Wang, Re and Suciu, 2008; Getoor and Taskar, 2007). Another important consideration in reading this paper is that we present proposals that deal with structured data (relational tuples, records, XML, and so on); for proposals that consider unstructured documents, and use some specific algorithm in order to match them, see (Wan, 2008; Aygn, 2008). ...
Approximate data matching is a central problem in several data management processes, such as data integration, data cleaning, approximate queries, similarity search, and so on. An approximate matching process aims at deciding whether two data items represent the same real-world object. For atomic values (strings, dates, etc.), similarity functions have been defined for several value domains (person names, addresses, and so on). For matching aggregated values, such as relational tuples and XML trees, approaches range from simple functions that combine the similarity values of record attributes to sophisticated techniques based on machine learning, for example. For complex data comparison, including structured and semistructured documents, existing approaches use both structure and data for the comparison, either considering or not considering data semantics. This survey presents the terminology and concepts that underlie approximate data matching, and discusses related work on the use of similarity functions in this area.
Keywords: Instance data matching – Similarity function – Similarity matching – Record linkage – Record matching – Duplicate detection – Object matching – Entity resolution
... Some applications, such as management of RFID data, require SQL queries with NOT EXISTS predicates. We have investigated evaluation algorithms for queries with one level of NOT EXISTS predicates in [28], and described an evaluation algorithm that could be fully integrated in MystiQ. Our algorithm is exponential in the number of NOT EXISTS predicates; some optimizations are likely required in order to make this approach more practical. ...
MystiQ is a system that allows users to define a probabilistic database and then evaluate SQL queries over it. MystiQ is a middleware: the data itself is stored in a standard relational database system, and MystiQ provides the probabilistic semantics. The advantage of a middleware over a re-implementation from scratch is that it can leverage the infrastructure of an existing database engine, e.g. indexes, query evaluation, query optimization, etc. Furthermore, MystiQ attempts to perform most of the probabilistic inference inside the relational database engine. MystiQ is currently available from mystiq.cs.washington.edu .
... The complexity of several other query languages has been considered in the literature: queries with disequality joins (≠) by Olteanu and Huang [2008], with inequality joins (<), with NOT EXISTS predicates by Wang et al. [2008b], with a HAVING clause by Ré and Suciu [2009], and unrestricted relational algebra queries, in particular "quantified queries" such as relational division, by Fink et al. [2011b]. ...
Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model and present some fundamental theoretical results for query evaluation on probabilistic databases.
This article charts the tractability frontier of two classes of relational algebra queries in tuple-independent probabilistic databases. The first class consists of queries with join, projection, selection, and negation but without repeating relation symbols and union. The second class consists of quantified queries that express the following binary relationships among sets of entities: set division, set inclusion, set equivalence, and set incomparability. Quantified queries are expressible in relational algebra using join, projection, nested negation, and repeating relation symbols.
Each query in the two classes has either polynomial-time or #P-hard data complexity and the tractable queries can be recognised efficiently. Our result for the first query class extends a known dichotomy for conjunctive queries without self-joins to such queries with negation. For quantified queries, their tractability is sensitive to their outermost projection operator: They are tractable if no attribute representing set identifiers is projected away and #P-hard otherwise.
This paper shows that any non-repeating conjunctive relational query with negation has either polynomial time or #P-hard data complexity on tuple-independent probabilistic databases. This result extends a dichotomy by Dalvi and Suciu for non-repeating conjunctive queries to queries with negation. The tractable queries with negation are precisely the hierarchical ones and can be recognized efficiently.
As the size and complexity of analytic data processing systems continue to grow, the effort required to mitigate faults and performance skew has also risen. However, in some environments we have encountered, users prefer to continue query execution even in the presence of failures (e.g., the unavailability of certain data sources), and receive a "partial" answer to their query. We explore ways to characterize and classify these partial results, and describe an analytical framework that allows the system to perform coarse to fine-grained analysis to determine the semantics of a partial result. We propose that if the system is equipped with such a framework, in some cases it is better to return and explain partial results than to attempt to avoid them.
This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algorithm is used by the SPROUT query engine to approximate the probabilities of results to relational algebra queries on expressive probabilistic databases.
Extensive work has recently been done on the evaluation of positive queries on probabilistic databases. The case of queries with negation has notoriously been left out, since it raises serious additional challenges to efficient query evaluation. This paper provides a complete framework for the evaluation of full relational algebra queries in probabilistic databases. In particular, it proposes exact and approximate evaluation techniques for relational algebra queries on representation systems that can accommodate any finite probability space over relational databases. Key ingredients to these techniques are (1) the manipulation of nested propositional expressions used for probability computation without unfolding them into disjunctive normal form, and (2) efficient computation of lower and upper probability bounds of such expressions by deriving coarser expressions in tractable theories such as one occurrence form. We complement our evaluation techniques with a tractability map for relational algebra queries without repeating relation symbols and for quantified queries such as set inclusion, equality, incomparability, and relational division, which are expressible in relational algebra using nested negation and repeating relation symbols. Based on this tractability study, we syntactically define a practical class of tractable relational algebra queries. We incorporated this framework in the SPROUT engine and show its efficiency experimentally in TPC-H and RFID scenarios.