Fig. 2.1 An example query with two expensive user-defined predicates, two serial plans for it, and one conditional plan that uses age to decide which of the two expensive predicates to apply first.

Source publication
Article
Full-text available
As the data management field has diversified to consider settings in which queries are increasingly complex, statistics are less available, or data is stored remotely, there has been an acknowledgment that the traditional optimize-then-execute paradigm is insufficient. This has led to a plethora of new techniques, generally placed under the common...

Contexts in source publication

Context 1
... Selection ordering refers to the problem of determining the order in which to apply a given set of commutative selection predicates (filters) to all the tuples of a relation, so as to find the tuples that satisfy all the predicates. Figure 2.1 shows an example selection query over a persons relation and several query plans for that query. ...
Context 2
... Serial Plans: The natural class of execution plans to consider for evaluating such queries is the class of serial orders that specify a single order in which the predicates should be applied to the tuples of the relation (Figure 2.1). Given a serial order, S_{π(1)}, . . . ...
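To make the cost trade-off behind serial orders concrete, here is a minimal sketch (not from the survey; the predicate names, costs, and selectivities are illustrative assumptions) of a serial order's expected per-tuple cost, together with the classic rank ordering, ascending c/(1 - p), which is optimal when the predicates are independent:

```python
# Sketch: expected per-tuple cost of a serial selection order, and the
# rank ordering (ascending cost / (1 - selectivity)) that minimizes it
# for independent predicates. All numbers below are made up.

def expected_cost(order):
    """order: list of (cost, selectivity) pairs, applied left to right."""
    total, pass_fraction = 0.0, 1.0
    for cost, selectivity in order:
        total += pass_fraction * cost   # only tuples that survived so far pay
        pass_fraction *= selectivity    # fraction passing this predicate
    return total

predicates = [
    ("check_education",    10.0, 0.5),  # (name, per-tuple cost, selectivity)
    ("check_credit_score", 20.0, 0.1),
]

# Ascending rank c / (1 - p): cheap, highly filtering predicates go first.
by_rank = sorted(predicates, key=lambda p: p[1] / (1.0 - p[2]))
print([name for name, _, _ in by_rank])
print(expected_cost([(c, p) for _, c, p in by_rank]))
```

The rank balances a predicate's cost against how many tuples it filters out; Fig. 2.1's two serial plans correspond to the two possible orders of the two expensive predicates.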
Context 3
... class of plans can be especially beneficial when the attributes are highly correlated with each other, and when there is a large disparity in the evaluation costs of the predicates. Figure 2.1 shows an example conditional plan that uses an inexpensive predicate on age, which is strongly correlated with both the query predicates, to decide which of the two predicates to apply first. Specifically, for tuples with age > 25, the predicate on credit score is likely to be more selective than the predicate on education, and hence should be applied first, whereas the opposite would be true for tuples with age < 25. ...
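Such a conditional plan amounts to a per-tuple branch; a minimal sketch follows (the predicate bodies and attribute thresholds inside them are placeholder assumptions; only the age-25 split mirrors the figure):

```python
# Sketch of the conditional plan in Fig. 2.1: a cheap test on age picks
# which expensive predicate to apply first. Predicate bodies are stubs.
def check_credit_score(t):      # expensive user-defined predicate (stub)
    return t["credit_score"] > 700

def check_education(t):         # expensive user-defined predicate (stub)
    return t["education"] == "graduate"

def conditional_plan(persons):
    for t in persons:
        if t["age"] > 25:
            # credit-score predicate is likely more selective here
            if check_credit_score(t) and check_education(t):
                yield t
        else:
            # education predicate is likely more selective here
            if check_education(t) and check_credit_score(t):
                yield t
```

Python's short-circuiting `and` captures the ordering semantics: the second predicate is evaluated only for tuples that pass the first.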
Context 4
... left-deep or "left-linear" plan is one where only the left child of a join operator may be the result of another relational algebra operation; the right child must be a base relation. See Figure 2.2(ii). ...
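A minimal way to see the structural constraint (the relation names and the nested-pair encoding are assumptions, not the survey's notation):

```python
# Sketch: join trees as nested pairs; in a left-deep tree only the left
# child of a join may itself be a join. Relation names are illustrative.
left_deep = ((("R", "S"), "T"), "U")   # ((R join S) join T) join U
bushy     = (("R", "S"), ("T", "U"))   # (R join S) join (T join U)

def is_left_deep(tree):
    # Base relations are strings; a join is a (left, right) pair whose
    # right child must be a base relation.
    if isinstance(tree, str):
        return True
    left, right = tree
    return isinstance(right, str) and is_left_deep(left)

print(is_left_deep(left_deep), is_left_deep(bushy))  # True False
```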
Context 5
... predicates on edges that are eliminated (to remove cycles) are applied after performing the join, as "residual" predicates. Figure 2.2 shows an example multi-way join query that we use as a running example, and two possible join orders for it: one using a tree of binary join operators and the other using a ternary join operator. ...
Context 6
... Algorithms: The next aspect of join execution is the physical join algorithm used to implement each join in the join order: nested-loop join, merge join, hash join, etc. These decisions are typically made during cost estimation, along with the join order selection, and are partly constrained by the available access methods. [Fig. 2.2: (i) A multi-way join query that we use as the running example; (ii) a left-deep join order, in which the right child of each join must be a base relation; (iii) a join order that uses a ternary join operator.] For ...
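For reference, here is a minimal in-memory sketch of one such physical algorithm, the classic hash join with a blocking build phase (the join key name "id" is an assumption):

```python
# Sketch of a classic hash join: blocking build on one input, pipelined
# probe on the other. The join key name "id" is an assumption.
from collections import defaultdict

def hash_join(build_input, probe_input, key="id"):
    table = defaultdict(list)
    for r in build_input:               # build phase: blocks until done
        table[r[key]].append(r)
    for s in probe_input:               # probe phase: tuples stream through
        for r in table.get(s[key], []):
            yield {**r, **s}
```

The build phase blocks on one input, which is exactly the asymmetry the symmetric variant discussed below removes.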
Context 7
... Non-pipelined plans: These contain at least one blocking operator that segments execution into stages: the blocking operator materializes its sub-results to a temporary table, which is then read by the next operator in a subsequent step. Each materialization is performed at a materialization point, illustrated in Figure 2.3 (i). ...
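A blocking operator can be sketched as one that consumes its entire input before producing any output; sorting is the standard example (this is illustrative, not the survey's code):

```python
# Sketch: a blocking (pipeline-breaking) operator materializes its whole
# input before the next stage reads it; a list stands in for the temp table.
def sort_materialize(tuples, key):
    temp_table = list(tuples)            # materialization point
    temp_table.sort(key=lambda t: t[key])
    return temp_table                    # read by the next stage afterwards
```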
Context 8
... Pipelined plans: These plans execute all operators in the query plan in parallel, by sending the output tuples of an operator directly to the next operator in the pipeline. Figure 2.3(ii) shows an example pipelined plan that uses hash join operators. ...
Context 9
... subtlety of this plan is that it is actually only "partly pipelined": the traditional hash join performs a build phase, reading one of the source relations into a hash table, before it begins pipelining tuples from input to output. Symmetric or doubly pipelined hash join operators (Figure 2.4) are fully pipelined: when a tuple appears at either input, it is incrementally added to the corresponding hash table and probed against the opposite hash table. Symmetric operators are extensively used when quicker response time is needed, or when the inputs are streaming in over a wide-area network, as they can read tuples from whichever input is available, and they incrementally produce output based on the tuples received so far. ...
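The doubly pipelined behavior described here can be sketched as follows (the tuple representation and join key name are assumptions):

```python
# Sketch of a symmetric (doubly pipelined) hash join: a tuple arriving on
# either input is inserted into its side's hash table and immediately
# probed against the opposite table. The key name is an assumption.
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self, key="id"):
        self.key = key
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, tup):
        other = "right" if side == "left" else "left"
        self.tables[side][tup[self.key]].append(tup)   # incremental build
        for match in self.tables[other].get(tup[self.key], []):
            yield {**tup, **match}                     # incremental output
```

Because insert works for either side, the operator can consume whichever input currently has data available and emits joined tuples as soon as a match exists.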
Context 10
... Symmetric operators are extensively used when quicker response time is needed, or when the inputs are streaming in over a wide-area network, as they can read tuples from whichever input is available, and they incrementally produce output based on the tuples received so far. We will continue to refer to plans such as the one shown in Figure 2.3(ii) as pipelined plans, as long as all the major operations in the plan are executed in a single pipeline. ...
Context 11
... other respects, the query execution remains essentially the same, except that the earliest tuples may have to be deleted from the join operator data structures when they have gone outside the range of the window (i.e., its size or duration). For example, if the symmetric hash join operator shown in Figure 2.4 is used to execute the join between R and S, then tuples should be removed from the hash tables once they have been followed by sufficient tuples to exceed the sliding window's capacity. ...
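A count-based variant of that eviction might look like the following sketch (the window size, key name, and FIFO bookkeeping are assumptions):

```python
# Sketch: count-based sliding-window eviction for one side of a symmetric
# hash join. A FIFO tracks arrival order; once the window capacity is
# exceeded, the oldest tuple is also removed from the hash table.
from collections import defaultdict, deque

class WindowedSide:
    def __init__(self, key="id", window=1000):
        self.key, self.window = key, window
        self.table, self.fifo = defaultdict(list), deque()

    def insert(self, tup):
        self.table[tup[self.key]].append(tup)
        self.fifo.append(tup)
        if len(self.fifo) > self.window:    # window exceeded
            old = self.fifo.popleft()
            self.table[old[self.key]].remove(old)
```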
Context 12
... symmetric hash join operator [99,121] introduced in Section 2.1.3 solves both these problems by building hash tables on both inputs (Figure 2.4); when an input tuple is read, it is stored in the appropriate hash table and probed against the opposite table, resulting in incremental output. Because the operator is symmetric, it can process data from either input, depending on availability. ...
Context 13
... is built into the (three) hash indexes on the relation S.
- Let the probing sequence chosen for s be T → R → U (Figure 3.2).
- s is used to probe into the hash table on T to find matches. ...
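Such a probing sequence can be sketched as a chain of hash-index lookups, with the survivors of each probe moving on to the next (a single shared join key is a simplifying assumption; in general each probe uses its own key):

```python
# Sketch: probe an arriving s tuple through the sequence T -> R -> U using
# in-memory hash indexes; survivors of each probe move to the next one.
def probe_sequence(s_tuple, indexes, sequence=("T", "R", "U"), key="id"):
    partials = [s_tuple]
    for rel in sequence:
        next_partials = []
        for t in partials:
            for match in indexes[rel].get(t[key], []):
                next_partials.append({**t, **match})
        partials = next_partials
        if not partials:        # no matches at this step: s produces nothing
            break
    return partials
```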
Context 14
... eddy operator [6], discussed in Section 3.1.2, can be used in a fairly straightforward manner to adapt a selection ordering query (Figure 4.2). To execute the query σ_{S1 ∧ ··· ∧ Sn}(R), one eddy operator and n selection operators are instantiated (one for each selection predicate). ...
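A minimal eddy for selection ordering can be sketched as a per-tuple router that tracks which selections have already been applied and picks among the rest; the uniform-random routing policy here is a naive placeholder for the policies discussed in [6]:

```python
# Sketch of an eddy adapting selection ordering: each tuple tracks the
# selections it has passed and is routed to one of the rest; it is emitted
# once all n predicates succeed. The routing policy here is naive.
import random

def eddy(tuples, selections):
    """selections: dict mapping operator name -> predicate function."""
    for t in tuples:
        done, alive = set(), True
        while alive and len(done) < len(selections):
            name = random.choice([s for s in selections if s not in done])
            alive = selections[name](t)    # route tuple to chosen operator
            done.add(name)
        if alive:
            yield t                        # tuple passed every selection
```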
Context 15
... with the operators that the tuple has already been through, the operators whose input queues are full are also considered ineligible to receive the tuple. The latter condition, called backpressure, allows the eddy to indirectly consider the operator costs when making routing decisions; the intuition being that operators with full input queues are likely to have high relative per-tuple execution cost (Figure 4.2). ...
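Backpressure can be approximated in such a sketch by weighting routing choices by queue headroom, so that operators with full input queues become effectively ineligible (the queue-capacity model below is an assumption):

```python
# Sketch: backpressure-aware routing weight for an eddy. Fuller input
# queues (a proxy for high per-tuple cost) get proportionally less work;
# a full queue gets weight 0 and is effectively ineligible.
import random

def pick_operator(eligible, queue_len, capacity=100):
    weights = [max(capacity - queue_len[op], 0) for op in eligible]
    if sum(weights) == 0:
        return None          # every queue is full: the eddy must wait
    return random.choices(eligible, weights=weights, k=1)[0]
```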

Similar publications

Article
Full-text available
In this survey chapter, we discuss adaptive query processing (AdQP) techniques for distributed environments. We also investigate the issues involved in extending AdQP techniques originally proposed for single-node processing so that they become applicable to multi-node environments as well. In order to make it easier for the reader to understand...
Conference Paper
Full-text available
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator...
Article
Full-text available
This paper addresses the problem of predicting the k events that are most likely to occur next, over historical real-time event streams. Existing approaches to causal prediction queries have a number of limitations. First, they exhaustively search over an acyclic causal network to find the most likely k effect events; however, data from real event...

Citations

... Although this is a somewhat exaggerated example, similar redundant probes can be seen in typical join workloads as we will show later. This inefficiency, sometimes called the "caching effect", has been observed in previous literature [4,11,18], but was not systematically explored until recently. ...
Preprint
As database query processing techniques are being used to handle diverse workloads, a key emerging challenge is how to efficiently handle multi-way join queries containing multiple many-to-many joins. While uncommon in traditional enterprise settings that have been the focus of much of the query optimization work to date, such queries are seen frequently in other contexts such as graph workloads. This has led to much work on developing join algorithms for handling cyclic queries, on compressed (factorized) representations for more efficient storage of intermediate results, and on use of semi-joins or predicate transfer to avoid generating large redundant intermediate results. In this paper, we address a core query optimization problem in this context. Specifically, we introduce an improved cost model that more accurately captures the cost of a query plan in such scenarios, and we present several optimization algorithms for query optimization that incorporate these new cost functions. We present an extensive experimental evaluation, that compares the factorized representation approach with a full semi-join reduction approach as well as to an approach that uses bitvectors to eliminate tuples early through sideways information passing. We also present new analyses of robustness of these techniques to the choice of the join order, potentially eliminating the need for more complex query optimization and selectivity estimation techniques.
... Additionally, if incorrect optimizer decisions are possible, the optimizer should be able to predict requirement violations (e.g., via a feedback loop) and adapt the mapping of client to system tasks or fall back to isolated execution, while the system must be able to adapt at runtime to accommodate the re-optimizations. Techniques for runtime adaptation have already been studied in the context of adaptive query processing [17]. ...
Conference Paper
Full-text available
Enterprises collect data in large volumes and leverage them to drive numerous concurrent decisions and business processes. Their teams deploy multiple applications that often operate concurrently on the same data and infrastructure but have widely different performance requirements. To meet these requirements, enterprises enforce resource boundaries between applications, isolating them from one another. However, boundaries necessitate separate resources per application, making processing increasingly resource-hungry and expensive as concurrency increases. While cross-task optimizations, such as data and work sharing, are known to curb the increase in total resource requirements, resource boundaries render them inapplicable. We propose the principle of functional isolation: cross-task optimizations can and should be combined with performance isolation. Systems should permit cross-optimization as long as participating tasks achieve indistinguishable or improved performance compared to isolated execution. The outcome is faster, more cost-effective, and more sustainable data processing. We make an initial step toward our vision by addressing functional isolation for work sharing and propose GroupShare, a strategy that reduces both total CPU consumption and the latency of all queries.
... Traditionally, streaming data processing benefits from adaptive techniques [12], since in a streaming setting, both the data characteristics and the environmental conditions are subject to potentially frequent changes. The motivation of our work is the current lack of proposals for handling adaptivity in continuous massively parallel distance-based outlier detection. ...
Article
Full-text available
We deal with the problem of dynamically allocating the workload to multiple workers in massively parallel continuous distance-based outlier detection, where the workload is conceptually split in contiguous overlapping regions. The main challenges stem from the fact that modern streaming processing frameworks, such as Apache Flink and Spark Streaming, do not support feedback loops, the process is stateful while the adaptations do not result in key redistribution but in modifying the region boundaries associated with each key. These challenges correspond to overlooked issues, which call for novel solutions that we provide in our work. More specifically, firstly, we propose an architecture for allowing such adaptations in Flink. Secondly, we propose specific techniques for adaptive region definition that are applicable to any distance metric. Finally, we conduct thorough experimental evaluation and our results show that our proposal is both efficient and effective even in small finite streams. In addition, our proposal is shown to be insensitive to the exact continuous outlier detection algorithm and outlier query parameters.
... The authors provided an overview of these techniques along with their characteristics, e.g., the focus or the aim of the techniques. Deshpande et al. [187] conducted a survey to identify common issues, themes and approaches in adaptive query processing, in particular, focusing on adaptive join processing. Some of the techniques surveyed in the above studies are covered by our identified enactment categories. ...
Thesis
Stream processing is a popular paradigm to process huge amounts of unbounded data, which has gained significant attention in both academia and industry. Typical stream processing applications such as stock trading and network traffic monitoring require continuously analyzed results provided to end-users. During processing, the characteristics of data streams such as volume or velocity can vary, e.g., peak load or bursty streams can occur at certain points. In order to cope with such situations, it requires the analytical systems to be able to adapt the execution of stream processing as quickly as possible. In literature, different approaches adapting data stream processing such as load-shedding and elastic parallelization do exist. However, each of them have their different shortcomings like skewed results (due to the dropped data) or strong limits on the adaptation due to the parallelization overhead. One specific challenge motivating us is to minimize the impact of runtime adaptation on the overall data processing, in particular for real-time data analytics. Moreover, while the need to create adaptive stream processing systems is well known, there is currently no systematic and broad analysis of the solution range of creating adaptation mechanisms for stream processing applications. In this dissertation, we focus on algorithm switching as a fundamental approach to the construction of adaptive stream processing systems. Algorithm switching is a form of adaptation, where stream processing algorithms, with fundamentally similar input-/output-characteristics but different runtime tradeoffs like resource consumption or precision, are replaced to optimize the processing. As our overall goal, we present a general algorithm switching framework that models a wide range of switching solutions (called switch variants) in a systematic and reusable manner as well as characterizes the switch variants with their quality guarantees. Concretely, we focus on developing a general model of algorithm switching to systematically capture possible variants of different switching behavior. We also present a theoretical specification to predict the timeliness-related qualities for the switch variants. Moreover, from the practical perspective, we also develop a component-based design to ease the realization effort of the algorithm switching variants. Finally, we provide a validation of the algorithm switching framework against the realized switch variants.
... Adaptive query processing. The idea of adapting the optimal join plan at runtime according to the shape of the data in GraphRex is closely related to the literature on adaptive query processing [94]. Ripple joins [125] generalize nested-loop and hash joins to optimize online aggregation. ...
Article
Today’s largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore’s Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. Enabling and expanding with these workloads are highly scalable data center networks that connect up to hundreds of thousands of networked servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable. Rather than optimize these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network-awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications and codesign data processing systems to achieve optimal query processing performance. To transform the network to future designs, we facilitate network innovation at scale. This dissertation presents a line of systems work that covers all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at datacenter scale with machine learning approximation. These systems demonstrate that our ideas in network-centric designs achieve orders of magnitude higher efficiency compared to the state of the art at hyperscale.
... GATI builds on this idea and provides inference service owners an option to specify an array of models at each node of the query DAG. In other domains, several other systems have used query replanning and dynamic adaptation to resource or workload changes [25,40,56,57,70] to get performance benefits. Using ML in Systems. ...
Preprint
Full-text available
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems. However, this high accuracy has been achieved by building deeper networks, posing a fundamental challenge to the low latency inference desired by user-facing applications. Current low latency solutions trade-off on accuracy or fail to exploit the inherent temporal locality in prediction serving workloads. We observe that caching hidden layer outputs of the DNN can introduce a form of late-binding where inference requests only consume the amount of computation needed. This enables a mechanism for achieving low latencies, coupled with an ability to exploit temporal locality. However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches - caches that consist of simple ML models that are continuously updated. We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference. Results show that GATI can reduce inference latency by up to 7.69X on realistic workloads.
... Key concepts in single-pass, non-adaptive query optimization in relational databases are reviewed in [27,32]. Conventional optimization techniques include selection ordering; optimizer choices in a multi-way join query (such as access methods, join order, join algorithm, and pipelining); finding a single robust query plan; and finding a small set of plans that are appropriate for different situations. ...
... The obtained results demonstrate that this technique has comprehensive superiority in tackling the feature selection problem. Adaptive query processing is additionally considered in [27,32]. It uses an adaptive greedy algorithm to maintain the order used by the query processor and supports query processing techniques such as selection ordering for a single table. ...
Conference Paper
Data Management can be defined as the process of extracting, storing, organizing, and maintaining the data created and collected in organizations. Today's organizations invest in data management solutions that provide an efficient way to manage data in a unified structure. The enormous growth of data in the last decades has created a necessity for fast extraction, access, and processing of data. Optimization has been a key component in improving the system's performance, searching, and accessing data in different data management solutions. Optimization is a mathematical discipline that formulates mathematical models and finds the best solution among a set of feasible solutions. This paper aims to give a general overview of applications of optimization techniques and algorithms in different areas of data management in the last decades. Data management includes a large group of functionalities, but we will focus on studying and reviewing the recent development of optimization algorithms used in databases, data warehouses, big data, and machine learning. Furthermore, this paper will identify applications of optimization in data management, review the current solutions proposed, and emphasize future topics where there is a lack of studies in data management.
... As a consequence, the implementation generated by Hydro will likely need to change over time. While Hydro's architecture is designed to tackle that aspect by not having hard-wired rules for code generation, we will also devise new runtime monitoring and adaptive code generation techniques, in the spirit of prior work [17,25,34]. ...
Preprint
Full-text available
Nearly twenty years after the launch of AWS, it remains difficult for most developers to harness the enormous potential of the cloud. In this paper we lay out an agenda for a new generation of cloud programming research aimed at bringing research ideas to programmers in an evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics, Availability, Consistency and Targets of optimization. We propose to migrate developers gradually to PACT programming by lifting familiar code into our more declarative level of abstraction. We then propose a multi-stage compiler that emits human-readable code at each stage that can be hand-tuned by developers seeking more control. Our agenda raises numerous research challenges across multiple areas including language design, query optimization, transactions, distributed consistency, compilers and program synthesis.
... The challenge of efficiently evaluating DIFF in conjunction with one or more JOINs is a specialized scenario of the multi-operator query optimization problem: A small estimation error in the size of one or more intermediate outputs can transitively yield a very large estimation error for the cost of the entire query plan [37]. This theoretical fact inspired extensive work in adaptive query processing [21], including systems such as Eddies [5] and RIO [7]. Here, we take a similar approach and design an adaptive algorithm for evaluating the DIFF-JOIN that avoids the pitfalls of expensive intermediate outputs. ...
... In addition, our proposed optimizations draw from research in adaptive query processing [5,7,21]. We show in Sect. 4 how to optimize DIFF-JOIN queries using our adaptive algorithm, which builds upon extensive work on optimizing JOINs [51,52,54,60,61]. ...
Article
Full-text available
A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
... In [2], Acosta et al. propose to use networks of Linked Data Eddies to allow the TPF client to switch between several join strategies. The smart client uses adaptive query processing techniques [10,20] to dynamically adapt query execution to adjust to changes in runtime execution and data availability. In scenarios with unpredictable transfer delays and data distributions, Linked Data Eddies outperform the existing LDF approach. ...
... So, it will have to download histograms from the server, which increases the data transferred. To solve it, we could rely on adaptive query processing techniques [10,20] for client-side SPARQL processing. Federated SPARQL query processing [3] and smart LDF clients [2] successfully applied similar methods to perform on-the-fly query optimization without having to download too much statistics from the server. ...
Thesis
Full-text available
Following the Linked Open Data principles, data providers have published billions of RDF documents using public SPARQL query services. To ensure these services remain stable and responsive, they enforce quotas on server usage. Queries which exceed these quotas are interrupted and deliver partial results. Such interruption is not an issue if it is possible to resume query execution afterward. Unfortunately, there is no preemption model for the Web that allows for suspending and resuming SPARQL queries. In this thesis, we propose to tackle the issue of building public SPARQL query servers that allow any data consumer to execute any SPARQL query with complete results. First, we propose a new query execution model called Web Preemption. It allows SPARQL queries to be suspended by the Web server after a fixed time quantum and resumed upon client request. Web preemption is tractable only if its cost in time is negligible compared to the time quantum. Thus, we propose SaGe: a SPARQL query engine that implements Web Preemption with minimal overhead. Experimental results demonstrate that SaGe outperforms existing SPARQL query processing approaches by several orders of magnitude in terms of the average total query execution time and the time for first results.