Conference Paper

# Small Selectivities Matter: Lifting the Burden of Empty Samples

Article
This work presents an in-depth analysis of the majority of the deep neural networks (DNNs) proposed in the state of the art for image recognition. For each DNN, multiple performance indices are observed, such as recognition accuracy, model complexity, computational complexity, memory usage, and inference time. The behavior of these performance indices, and of some combinations of them, is analyzed and discussed. To measure the indices, we run the DNNs on two different computer architectures: a workstation equipped with an NVIDIA Titan X Pascal and an embedded system based on an NVIDIA Jetson TX1 board. This experimentation allows a direct comparison between DNNs running on machines with very different computational capacities. This study is useful for researchers, to obtain a complete view of the solutions explored so far and of the research directions worth exploring in the future, and for practitioners, to select the DNN architecture(s) that best fit the resource constraints of practical deployments and applications. To complete this work, all the DNNs, as well as the software used for the analysis, are available online.
Article
Optimization of queries with conjunctive predicates for main memory databases remains a challenging task. The traditional way of optimizing this class of queries relies on predicate ordering based on selectivities or ranks. However, the optimization of queries with conjunctive predicates is a much more challenging task, requiring a holistic approach in view of (1) an accurate cost model that is aware of CPU architectural characteristics such as branch (mis)prediction, (2) a storage layer, allowing for a streamlined query execution, (3) a common subexpression elimination technique, minimizing column access costs, and (4) an optimization algorithm able to pick the optimal plan even in presence of a small (bounded) estimation error. In this work, we embrace the holistic approach, and show its superiority experimentally. Current approaches typically base their optimization algorithms on at least one of two assumptions: (1) the predicate selectivities are assumed to be independent, (2) the predicate costs are assumed to be constant. Our approach is not based on these assumptions, as they in general do not hold.
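The rank-based ordering this abstract contrasts itself with can be sketched in a few lines (an illustrative toy with made-up predicate names, relying on exactly the constant-cost and independence assumptions the paper rejects):

```python
# Classical rank-based ordering of conjunctive predicates: rank(p) =
# (selectivity - 1) / cost_per_tuple. More negative is better, so sorting
# by rank ascending applies cheap, selective predicates first.

def order_by_rank(predicates):
    """predicates: list of (name, selectivity, cost_per_tuple) tuples."""
    return sorted(predicates, key=lambda p: (p[1] - 1.0) / p[2])

def expected_cost(ordered, n_tuples):
    """Expected evaluation cost: each predicate only sees the tuples that
    survived the predicates before it (independence assumed)."""
    cost, survivors = 0.0, float(n_tuples)
    for _, sel, per_tuple_cost in ordered:
        cost += survivors * per_tuple_cost
        survivors *= sel
    return cost

# Hypothetical predicates: (name, selectivity, cost per tuple).
preds = [("p1", 0.5, 1.0), ("p2", 0.1, 4.0), ("p3", 0.9, 0.1)]
best = order_by_rank(preds)   # p3 first: cheap enough to beat p2's selectivity
```

Note how the rank metric puts the cheap but unselective `p3` first, whereas ordering by selectivity alone would start with the expensive `p2`.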
Chapter
Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.
Conference Paper
Quickly and accurately estimating the selectivity of multidimensional predicates is a vital part of a modern relational query optimizer. The state-of-the art in this field are multidimensional histograms, which offer good estimation quality but are complex to construct and hard to maintain. Kernel Density Estimation (KDE) is an interesting alternative that does not suffer from these problems. However, existing KDE-based selectivity estimators can hardly compete with the estimation quality of state-of-the art methods. In this paper, we substantially expand the state-of-the-art in KDE-based selectivity estimation by improving along three dimensions: First, we demonstrate how to numerically optimize a KDE model, leading to substantially improved estimates. Second, we develop methods to continuously adapt the estimator to changes in both the database and the query workload. Finally, we show how to drastically improve the performance by pushing computations onto a GPU. We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.
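The core idea behind KDE-based selectivity estimation can be sketched in one dimension (a minimal illustration with a hand-picked bandwidth; the paper's contribution lies in numerically optimizing such models, adapting them online, and running them on a GPU):

```python
import math

def kde_range_selectivity(sample, lo, hi, bandwidth):
    """Estimate the selectivity of the range predicate lo <= X <= hi with a
    Gaussian kernel density estimate over a sample: each sample point
    contributes the probability mass its kernel places inside [lo, hi]."""
    def normal_cdf(z):  # standard normal CDF via the error function
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    mass = sum(normal_cdf((hi - x) / bandwidth) - normal_cdf((lo - x) / bandwidth)
               for x in sample)
    return mass / len(sample)

# Hypothetical sample of an attribute; the bandwidth is fixed by hand here,
# whereas the paper tunes it numerically against the workload.
sample = [12.0, 15.0, 18.0, 30.0, 45.0]
est = kde_range_selectivity(sample, 10.0, 20.0, bandwidth=2.0)
```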
Article
Histograms that guarantee a maximum multiplicative error (q-error) for estimates may significantly improve the plan quality of query optimizers. However, the construction time for histograms with maximum q-error has been too high for practical use. In this paper we extend this concept with a threshold, i.e., an estimate or true cardinality θ below which we do not care about the q-error because we still expect optimal plans. This allows us to develop far more efficient construction algorithms for histograms with bounded error. The test for θ,q-acceptability that we developed also exploits the order-preserving dictionary encoding of SAP HANA. We have integrated this family of histograms into SAP HANA, and we report on the construction time, histogram size, and estimation errors on real-world data sets. In virtually all cases the histograms can be constructed in far less than one second, requiring less than 5% of the space of the original compressed data.
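As a rough illustration, one possible reading of the threshold idea looks like this (a hypothetical formalization for intuition, not the paper's exact definition):

```python
def theta_q_acceptable(estimate, truth, theta, q):
    """Hypothetical reading of theta,q-acceptability: when both the estimate
    and the true cardinality fall below the threshold theta, any estimate is
    acceptable, because such tiny cardinalities are not expected to change
    the optimal plan; otherwise the multiplicative q-error must stay
    within q."""
    if estimate <= theta and truth <= theta:
        return True
    return max(estimate / truth, truth / estimate) <= q
```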
Article
We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
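The simpler reservoir-sampling scheme that Algorithm Z accelerates can be sketched as follows (this is the naive Algorithm R; Vitter's contribution is computing how many records to skip at each step so that far fewer random numbers are drawn, while producing the same distribution):

```python
import random

def reservoir_sample(stream, n, rng=random):
    """One-pass uniform sample of n records without replacement from a
    stream whose length N is unknown beforehand, using constant space."""
    reservoir = []
    for t, record in enumerate(stream):
        if t < n:
            reservoir.append(record)          # fill the reservoir first
        else:
            j = rng.randrange(t + 1)          # uniform in [0, t]
            if j < n:
                reservoir[j] = record         # replace with prob. n/(t+1)
    return reservoir
```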
Article
Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Experiments with our prototype implementation in DB2 UDB show that use of the ME approach can improve the optimizer’s cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times. For almost all queries, these improvements are obtained while adding only tens of milliseconds to the overall time required for query optimization.
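For the simplest case of two predicates with known marginal selectivities and no correlation information, the maximum-entropy estimate can be found by brute force, and it recovers the independence assumption exactly as the abstract states (a toy sketch; the paper solves the general case with arbitrary sets of overlapping statistics):

```python
import math

def max_entropy_conjunction(s1, s2, steps=20000):
    """Maximum-entropy estimate for P(p1 AND p2) given only the marginal
    selectivities s1 and s2. The four atoms b11, b10, b01, b00 must match
    the marginals, leaving one free parameter b11; we scan it and keep the
    entropy-maximizing value."""
    lo = max(0.0, s1 + s2 - 1.0)     # feasibility bounds for b11
    hi = min(s1, s2)
    def entropy(b11):
        atoms = [b11, s1 - b11, s2 - b11, 1.0 - s1 - s2 + b11]
        return -sum(p * math.log(p) for p in atoms if p > 0)
    return max((lo + (hi - lo) * i / steps for i in range(steps + 1)),
               key=entropy)

# With no correlation information, ME reduces to independence:
# max_entropy_conjunction(0.3, 0.5) is approximately 0.3 * 0.5 = 0.15.
```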
Conference Paper
Accurate cardinality estimation is critically important to high-quality query optimization. It is well known that conventional cardinality estimation based on histograms or similar statistics may produce extremely poor estimates in a variety of situations, for example, queries with complex predicates, correlation among columns, or predicates containing user-defined functions. In this paper, we propose a new, general cardinality estimation technique that combines random sampling and materialized view technology to produce accurate estimates even in these situations. As a major innovation, we exploit feedback information from query execution and process control techniques to assure that estimates remain statistically valid when the underlying data changes. Experimental results based on a prototype implementation in Microsoft SQL Server demonstrate the practicality of the approach and illustrate the dramatic effects improved cardinality estimates may have.
Article
Query optimizers rely on accurate estimations of the sizes of intermediate results. Wrong size estimations can lead to overly expensive execution plans. We first define the *q-error* to measure deviations of size estimates from actual sizes. The q-error enables the derivation of two important results: (1) We provide bounds such that if the q-error is smaller than this bound, the query optimizer constructs an optimal plan. (2) If the q-error is bounded by a number q, we show that the cost of the produced plan is at most a factor of q^4 worse than the optimal plan. Motivated by these findings, we next show how to find the best approximation under the q-error. These techniques can then be used to build synopses for size estimates. Finally, we give some experimental results where we apply the developed techniques.
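The q-error itself is a one-liner, and the q^4 result gives it teeth (a minimal sketch of the definition; the bound is quoted from the abstract above, and the two estimates below are hypothetical):

```python
def q_error(estimate, truth):
    """Symmetric multiplicative deviation: max(est/true, true/est) >= 1,
    with 1 meaning a perfect estimate."""
    return max(estimate / truth, truth / estimate)

# If every size estimate feeding the optimizer has q-error at most q, the
# abstract's result (2) bounds the chosen plan's cost by q**4 times the
# optimum. With two hypothetical estimates of a true size of 100:
q = max(q_error(80, 100), q_error(150, 100))   # 1.5
worst_case_factor = q ** 4                     # 5.0625
```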
Article
Attributes of a relation are not typically independent. Multidimensional histograms can be an effective tool for accurate multiattribute query selectivity estimation. In this paper, we introduce STHoles, a “workload-aware” histogram that allows bucket nesting to capture data regions with reasonably uniform tuple density. STHoles histograms are built without examining the data sets, but rather by just analyzing query results. Buckets are allocated where needed the most as indicated by the workload, which leads to accurate query selectivity estimations. Our extensive experiments demonstrate that STHoles histograms consistently produce good selectivity estimates across synthetic and real-world data sets and across query workloads, and, in many cases, outperform the best multidimensional histogram techniques that require access to and processing of the full data sets during histogram construction.
Article
Cardinality estimation has long been grounded in statistical tools for density estimation. To capture the rich multivariate distributions of relational tables, we propose the use of a new type of high-capacity statistical model: deep autoregressive models. However, direct application of these models leads to a limited estimator that is prohibitively expensive to evaluate for range or wildcard predicates. To produce a truly usable estimator, we develop a Monte Carlo integration scheme on top of autoregressive models that can efficiently handle range queries with dozens of dimensions or more. Like classical synopses, our estimator summarizes the data without supervision. Unlike previous solutions, we approximate the joint data distribution without any independence assumptions. Evaluated on real-world datasets and compared against real systems and dominant families of techniques, our estimator achieves single-digit multiplicative error at tail, an up to 90x accuracy improvement over the second best method, and is space- and runtime-efficient.
Book
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches describes basic principles and recent developments in building approximate synopses (that is, lossy, compressed representations) of massive data. Such synopses enable approximate query processing, in which the user's query is executed against the synopsis instead of the original data. It focuses on the four main families of synopses: random samples, histograms, wavelets, and sketches. A random sample comprises a "representative" subset of the data values of interest, obtained via a stochastic mechanism. Samples can be quick to obtain, and can be used to approximately answer a wide range of queries. A histogram summarizes a data set by grouping the data values into subsets, or "buckets," and then, for each bucket, computing a small set of summary statistics that can be used to approximately reconstruct the data in the bucket. Histograms have been extensively studied and have been incorporated into the query optimizers of virtually all commercial relational DBMSs. Wavelet-based synopses were originally developed in the context of image and signal processing. The data set is viewed as a set of M elements in a vector, i.e., as a function defined on the set {0, 1, 2, …, M−1}, and the wavelet transform of this function is found as a weighted sum of wavelet "basis functions." The weights, or coefficients, can then be "thresholded", e.g., by eliminating coefficients that are close to zero in magnitude. The remaining small set of coefficients serves as the synopsis. Wavelets are good at capturing features of the data set at various scales. Sketch summaries are particularly well suited to streaming data. Linear sketches, for example, view a numerical data set as a vector or matrix, and multiply the data by a fixed matrix. Such sketches are massively parallelizable. They can accommodate streams of transactions in which data is both inserted and removed.
Sketches have also been used successfully to estimate the answer to COUNT DISTINCT queries, a notoriously hard problem. Synopses for Massive Data describes and compares the different synopsis methods. It also discusses the use of AQP within research systems, and discusses challenges and future directions. It is essential reading for anyone working with, or doing research on massive data.
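The wavelet-synopsis idea from the passage above can be sketched with the Haar transform: transform the data, keep only the k largest coefficients, and reconstruct approximately (a minimal, unnormalized variant for power-of-two-length vectors; function names are ad hoc):

```python
def haar_transform(data):
    """Unnormalized Haar wavelet transform: repeatedly store pairwise
    averages plus the differences needed to undo them. Returns
    [overall average, coarse details ... fine details]."""
    coeffs = []
    while len(data) > 1:
        details = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
        data = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
        coeffs = details + coeffs
    return data + coeffs

def threshold(coeffs, k):
    """Keep only the k largest-magnitude coefficients: the synopsis."""
    keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]))[-k:]
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def haar_inverse(coeffs):
    """Invert haar_transform, level by level."""
    data, pos = coeffs[:1], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(data)]
        data = [v for a, d in zip(data, details) for v in (a + d, a - d)]
        pos += len(details)
    return data
```

Thresholding trades accuracy for space: reconstructing from a 2-coefficient synopsis of `[2, 4, 6, 8]` yields the bucket-averaged `[3, 3, 7, 7]`.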
Article
We present an overview of SAP HANA's Native Store Extension (NSE). This extension substantially increases database capacity, allowing HANA to scale far beyond available system memory. NSE is based on a hybrid in-memory and paged column store architecture composed of data access primitives. These primitives enable the processing of hybrid columns using the same algorithms optimized for HANA's traditional in-memory columns. Using only three key primitives, we fabricated byte-compatible counterparts for complex memory-resident data structures (e.g. dictionary and hash-index), compression schemes (e.g. sparse and run-length encoding), and exotic data types (e.g. geo-spatial). We developed a new buffer cache which optimizes the management of paged resources by smart strategies sensitive to page type and access patterns. The buffer cache integrates with HANA's new execution engine, which issues pipelined prefetch requests to improve disk access patterns. A novel load unit configuration, along with a unified persistence format, allows the hybrid column store to dynamically switch between in-memory and paged data access to balance performance and storage economy according to application demands while reducing Total Cost of Ownership (TCO). A new partitioning scheme supports load unit specification at table, partition, and column level. Finally, a new advisor recommends optimal load unit configurations. Our experiments illustrate the performance and memory footprint improvements on typical customer scenarios.
Article
Query optimizers depend on selectivity estimates of query predicates to produce a good execution plan. When a query contains multiple predicates, today's optimizers use a variety of assumptions, such as independence between predicates, to estimate selectivity. While such techniques have the benefit of fast estimation and small memory footprint, they often incur large selectivity estimation errors. In this work, we reconsider selectivity estimation as a regression problem. We explore application of neural networks and tree-based ensembles to the important problem of selectivity estimation of multi-dimensional range predicates. While their straightforward application does not outperform even simple baselines, we propose two simple yet effective design choices, i.e., regression label transformation and feature engineering, motivated by the selectivity estimation context. Through extensive empirical evaluation across a variety of datasets, we show that the proposed models deliver both highly accurate estimates as well as fast estimation.
Conference Paper
We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run ad-hoc queries against trained sketches. We also estimate query cardinalities with HyPer and PostgreSQL to visualize the gains over traditional cardinality estimators.
Conference Paper
Cardinality estimation is a fundamental task in database query processing and optimization. Unfortunately, the accuracy of traditional estimation techniques is poor, resulting in non-optimal query execution plans. With the recent expansion of machine learning into the field of data management, there is a general notion that data analysis, especially with neural networks, can lead to better estimation accuracy. Up to now, all proposed neural network approaches for cardinality estimation follow a global approach, considering the whole database schema at once. These global models are prone to sparse training data, leading to misestimates for queries that were not represented in the sample space used for generating training queries. To overcome this issue, we introduce a novel local-oriented approach in this paper, where the local context is a specific sub-part of the schema. As we will show, this leads to a better representation of data correlation and thus better estimation accuracy. Compared to global approaches, our novel approach achieves an improvement of two orders of magnitude in accuracy and a factor of four in training time for local models.
Article
Estimating selectivities remains a critical task in query processing. Optimizers rely on the accuracy of selectivities when generating execution plans and, in approximate query answering, estimated selectivities affect the quality of the result. Many systems maintain synopses, e.g., histograms, and, in addition, provide sampling facilities. In this paper, we present a novel approach to combine knowledge from synopses and sampling for the purpose of selectivity estimation for conjunctive queries. We first show how to extract information from synopses and sampling such that they are mutually consistent. In a second step, we show how to combine them and decide on an admissible selectivity estimate. We compare our approach to state-of-the-art methods and evaluate the strengths and limitations of each approach.
Conference Paper
Industrial as well as academic analytics systems are usually evaluated based on well-known standard benchmarks, such as TPC-H or TPC-DS. These benchmarks test various components of the DBMS, including the join optimizer, the implementation of the join and aggregation operators, concurrency control, and the scheduler. However, these benchmarks fall short of evaluating the "real" challenges imposed by modern BI systems, such as Tableau, that emit machine-generated query workloads. This paper reports a comprehensive study based on a set of more than 60k real-world BI data repositories together with their generated query workloads. The machine-generated workload posed by BI tools differs from the "hand-crafted" benchmark queries in multiple ways: structurally simple relational operator trees often come with extremely complex scalar expressions, such that expression evaluation becomes the limiting factor. At the same time, we also encountered much more complex relational operator trees than covered by benchmarks. This long tail in both operator tree and expression complexity is not adequately represented in standard benchmarks. We contribute various statistics gathered from the large dataset, e.g., data type distributions, operator frequencies, string length distributions, and expression complexity. We hope our study gives an impetus to database researchers and benchmark designers alike to address the relevant problems in future projects and to enable better database support for data exploration systems, which become more and more important in the Big Data era.
Book
Clarifies modern data analysis through nonparametric density estimation for a complete working knowledge of the theory and methods. Featuring a thoroughly revised presentation, Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition maintains an intuitive approach to the underlying methodology and supporting theory of density estimation. Including new material and updated research in each chapter, the Second Edition presents additional clarification of theoretical opportunities, new algorithms, and up-to-date coverage of the unique challenges presented in the field of data analysis. The new edition focuses on the various density estimation techniques and methods that can be used in the field of big data. Defining optimal nonparametric estimators, the Second Edition demonstrates the density estimation tools to use when dealing with various multivariate structures in univariate, bivariate, trivariate, and quadrivariate data analysis. Continuing to illustrate the major concepts in the context of the classical histogram, Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition also features:
- Over 150 updated figures to clarify theoretical results and to show analyses of real data sets
- An updated presentation of graphic visualization using computer software such as R
- A clear discussion of selections of important research during the past decade, including mixture estimation, robust parametric modeling algorithms, and clustering
- More than 130 problems to help readers reinforce the main concepts and ideas presented
- Boxed theorems and results allowing easy identification of crucial ideas
- Figures in color in the digital versions of the book
- A website with related data sets
Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition is an ideal reference for theoretical and applied statisticians, practicing engineers, as well as readers interested in the theoretical aspects of nonparametric estimation and the application of these methods to multivariate data. The Second Edition is also useful as a textbook for introductory courses in kernel statistics, smoothing, advanced computational statistics, and general forms of statistical distributions.
Article
Database Management Systems (DBMS) continue to be the foundation of mission critical applications, both OLTP and Analytics. They provide a safe, reliable and efficient platform to store and retrieve data. SQL is the lingua franca of the database world. A database developer writes a SQL statement to specify data sources and express the desired result, and the DBMS will figure out the most efficient way to implement it. The query optimizer is the component in a DBMS responsible for finding the best execution plan for a given SQL statement based on statistics, access structures, location, and format. At the center of a query optimizer is a cost model that consumes the above information and helps the optimizer make decisions related to query transformations, join order, join methods, access paths, and data movement. The final execution plan produced by the query optimizer depends on the quality of information used by the cost model, as well as the sophistication of the cost model. In addition to statistics about the data, the cost model also relies on statistics generated internally for intermediate results, e.g. the size of the output of a join operation. This paper presents the problems caused by incorrect statistics of intermediate results, surveys the existing solutions, and presents our solution introduced in Oracle 12c. The solution includes validating the generated statistics using table data and the automatic creation of auxiliary statistics structures. We limit the overhead of the additional work by confining their use to cases where it matters the most, caching the computed statistics, and using table samples. The statistics management is automated. We demonstrate the benefits of our approach based on experiments using two SQL workloads, a benchmark that uses data from the Internet Movie Database (IMDb) and a real customer workload.
Article
We present a proof of the non-solvability in radicals of a general algebraic equation of degree greater than four. This proof relies on the non-solvability of the monodromy group of a general algebraic function.
Book
Book (with Dekking, Kraaikamp, and Meester). Probability and statistics are studied by most science students, usually in a second- or third-year course. Many current texts in the area are just cookbooks and, as a result, students do not know why they perform the methods they are taught, or why the methods work. The strength of this book is that it readdresses these shortcomings; by using examples, often from real life and using real data, the authors show how the fundamentals of probabilistic and statistical theories arise intuitively. It provides a tried and tested, self-contained course that can also be used for self-study. A Modern Introduction to Probability and Statistics has numerous quick exercises to give direct feedback to the students. In addition, the book contains over 350 exercises, half of which have answers, of which half have full solutions. A website at www.springeronline.com/1-85233-896-2 gives access to the data files used in the text and, for instructors, the remaining solutions. The only prerequisite for the book is a first course in calculus; the text covers standard statistics and probability material, and develops beyond traditional parametric models to the Poisson process, and on to useful modern methods such as the bootstrap. This will be a key text for undergraduates in Computer Science, Physics, Mathematics, Chemistry, Biology and Business Studies who are studying a mathematical statistics course, and also for more intensive engineering statistics courses for undergraduates in all engineering subjects.
Article
Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs between the different synopsis types.
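A minimal example of the sampling branch of AQP: answer a SUM query against a synopsis (here, a uniform random sample) and attach an error bound (a sketch under textbook CLT assumptions; the population and sample below are synthetic):

```python
import math
import random

def sample_sum_estimate(population_size, sample, z=1.96):
    """Approximate SUM(x) from a uniform random sample without replacement:
    scale the sample mean by the population size and attach a CLT-based
    95% error bound (ignoring the finite-population correction)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    estimate = population_size * mean
    half_width = z * population_size * math.sqrt(var / n)
    return estimate, half_width

# Synthetic data, for illustration only.
random.seed(42)
population = [random.uniform(0, 100) for _ in range(100000)]
sample = random.sample(population, 1000)
est, err = sample_sum_estimate(len(population), sample)
```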
Article
A database is described that has been designed to fulfill the need for daily climate data over global land areas. The dataset, known as Global Historical Climatology Network (GHCN)-Daily, was developed for a wide variety of potential applications, including climate analysis and monitoring studies that require data at a daily time resolution (e.g., assessments of the frequency of heavy rainfall, heat wave duration, etc.). The dataset contains records from over 80 000 stations in 180 countries and territories, and its processing system produces the official archive for U.S. daily data. Variables commonly include maximum and minimum temperature, total daily precipitation, snowfall, and snow depth; however, about two-thirds of the stations report precipitation only. Quality assurance checks are routinely applied to the full dataset, but the data are not homogenized to account for artifacts associated with the various eras in reporting practice at any particular station (i.e., for changes in systematic bias). Daily updates are provided for many of the station records in GHCN-Daily. The dataset is also regularly reconstructed, usually once per week, from its 20+ data source components, ensuring that the dataset is broadly synchronized with its growing list of constituent sources. The daily updates and weekly reprocessed versions of GHCN-Daily are assigned a unique version number, and the most recent dataset version is provided on the GHCN-Daily website for free public access. Each version of the dataset is also archived at the NOAA/National Climatic Data Center in perpetuity for future retrieval.
Article
While studying the median of the binomial distribution we discovered that the mean-median-mode inequality, recently discussed in Statistica Neerlandica by Runnenburg [4] and Van Zwet [7] for continuous distributions, does not hold for the binomial distribution. If the mean is an integer, then mean = median = mode. In Theorem 1 a sufficient condition is given for mode = median = rounded mean. If median and mode differ, the mean lies in between.
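The integer-mean case is easy to check numerically (a small sketch; the helper names are ad hoc):

```python
from math import comb

def binomial_pmf(n, p, k):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_median(n, p):
    """Smallest m with P(X <= m) >= 1/2."""
    acc = 0.0
    for k in range(n + 1):
        acc += binomial_pmf(n, p, k)
        if acc >= 0.5:
            return k

def binomial_mode(n, p):
    return max(range(n + 1), key=lambda k: binomial_pmf(n, p, k))

# Bin(20, 0.3) has integer mean 20 * 0.3 = 6, and indeed median = mode = 6.
```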
Conference Paper
We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on the problem of ordering the filters adaptively to minimize processing cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We identify a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We develop a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms in the STREAM prototype Data Stream Management System and a thorough performance evaluation is presented.
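The greedy invariant underlying A-Greedy can be illustrated offline (a static sketch over a fixed tuple profile with hypothetical filters; the actual algorithm maintains this invariant incrementally as stream and filter characteristics drift):

```python
def greedy_order(filters, profile):
    """Greedy filter ordering from observed behavior on a profile of tuples:
    repeatedly pick the filter that drops the largest fraction of the tuples
    surviving the filters chosen so far. Measuring on survivors (rather than
    the raw stream) is what accounts for correlated selectivities."""
    remaining, survivors, order = list(filters), list(profile), []
    while remaining:
        best = min(remaining,
                   key=lambda f: sum(1 for t in survivors if f(t)))
        order.append(best)
        remaining.remove(best)
        survivors = [t for t in survivors if best(t)]
    return order

# Hypothetical filters over an integer stream; f1 and f3 are correlated, so
# once f3 has run, f1 drops nothing and is correctly ordered last.
def f1(t): return t % 2 == 0
def f2(t): return t < 50
def f3(t): return t % 4 == 0

order = greedy_order([f1, f2, f3], list(range(100)))
```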
Conference Paper
Current methods for selectivity estimation fall into two broad categories, synopsis-based and sampling-based. Synopsis-based methods, such as histograms, incur minimal overhead at query optimization time and thus are widely used in commercial database systems. Sampling-based methods are more suited for ad-hoc queries, but often involve high I/O cost because of random access to the underlying data. Though both methods serve the same purpose of selectivity estimation, their interaction in the case of selectivity estimation for conjuncts of predicates on multiple attributes is largely unexplored. Our work aims at taking the best of both worlds, by making consistent use of synopses and sample information when they are both present. To achieve this goal, we propose HASE, a novel estimation scheme based on a powerful mechanism called generalized raking. We formalize selectivity estimation in the presence of single attribute synopses and sample information as a constrained optimization problem. By solving this problem, we obtain a new set of weights associated with the sampled tuples, which has the nice property of reproducing the known selectivities when applied to individual predicates. We discuss different variants of the optimization problem and provide algorithms for solving it. We also provide asymptotic error bounds on the estimate. Extensive experiments are performed on both synthetic and real data, and the results show that HASE significantly outperforms both synopsis-based and sampling-based methods.
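The core idea — reweight the sample so that the weighted selectivity of each individual predicate reproduces its synopsis-provided value — can be sketched with iterative proportional fitting, a simple stand-in for the generalized raking that HASE actually solves as a constrained optimization (function and variable names here are illustrative, not from the paper):

```python
def rake_weights(sample, predicates, targets, iters=50):
    """Rescale per-tuple sample weights until the weighted selectivity
    of each single predicate matches its synopsis target."""
    w = [1.0 / len(sample)] * len(sample)
    for _ in range(iters):
        for pred, target in zip(predicates, targets):
            hit = sum(w[i] for i, t in enumerate(sample) if pred(t))
            if 0 < hit < 1:
                for i, t in enumerate(sample):
                    w[i] *= (target / hit) if pred(t) else \
                            ((1 - target) / (1 - hit))
    return w

# Toy sample of (a, b) tuples; synopses say sel(a=1)=0.3, sel(b=1)=0.6.
sample = [(1, 1), (1, 0), (0, 1), (0, 0)]
p_a = lambda t: t[0] == 1
p_b = lambda t: t[1] == 1
w = rake_weights(sample, [p_a, p_b], [0.3, 0.6])
# Conjunct estimate sel(a=1 AND b=1) = total weight of tuples passing both:
est = sum(wi for wi, t in zip(w, sample) if p_a(t) and p_b(t))
```

The reweighted sample now reproduces both single-predicate selectivities exactly, while the conjunct estimate still reflects whatever correlation the sample exhibits.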
Article
An algorithm is presented for finding a zero of a function which changes sign in a given interval. The algorithm combines linear interpolation and inverse quadratic interpolation with bisection. Convergence is usually superlinear, and is never much slower than for bisection. ALGOL 60 procedures are given.
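The flavor of the algorithm — interpolation steps safeguarded by a bracketing fallback — can be sketched with the Illinois variant of regula falsi. This is a simpler cousin of the cited method, not the full algorithm (which additionally uses inverse quadratic interpolation and tighter convergence tests); in practice one would call an existing implementation such as `scipy.optimize.brentq`:

```python
def illinois_root(f, a, b, tol=1e-12, max_iter=100):
    """Find a zero of f in [a, b], where f(a) and f(b) differ in sign.
    Each step is a linear-interpolation (secant) step; the 'Illinois'
    halving of a stale endpoint value guarantees the bracket keeps
    shrinking, so convergence is never much slower than bisection."""
    fa, fb = f(a), f(b)
    assert fa * fb <= 0, "root must be bracketed"
    for _ in range(max_iter):
        x = b - fb * (b - a) / (fb - fa)   # secant step inside [a, b]
        fx = f(x)
        if abs(fx) < tol:
            return x
        if fx * fb < 0:
            a, fa = b, fb                  # bracket is now [b, x]
        else:
            fa *= 0.5                      # Illinois trick: damp stale end
        b, fb = x, fx
    return b

root = illinois_root(lambda x: x * x - 2.0, 1.0, 2.0)  # ~ sqrt(2)
```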
Article
Object-relational database management systems allow knowledgeable users to define new data types as well as new methods (operators) for the types. This flexibility produces an attendant complexity, which must be handled in new ways for an object-relational database management system to be efficient. In this article we study techniques for optimizing queries that contain time-consuming methods. The focus of traditional query optimizers has been on the choice of join methods and orders; selections have been handled by "pushdown" rules. These rules apply selections in an arbitrary order before as many joins as possible, using the assumption that selection takes no time. However, users of object-relational systems can embed complex methods in selections. Thus selections may take significant amounts of time, and the query optimization model must be enhanced. In this article we carefully define a query cost framework that incorporates both selectivity and cost estimates for selections. We develop an algorithm called Predicate Migration, and prove that it produces optimal plans for queries with expensive methods. We then describe our implementation of Predicate Migration in the commercial object-relational database management system Illustra, and discuss practical issues that affect our earlier assumptions. We compare Predicate Migration to a variety of simpler optimization techniques, and demonstrate that Predicate Migration is the best general solution to date. The alternative techniques we present may be useful for constrained workloads.
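A cost framework that weighs both selectivity and per-tuple cost typically orders a sequence of commutative selections by the classical rank metric, rank = (selectivity − 1) / cost: ascending rank minimizes expected work. A minimal sketch (predicate names and numbers are invented for illustration):

```python
def rank(selectivity, cost_per_tuple):
    """Rank metric for ordering expensive selections: applying
    predicates in ascending rank order minimizes expected cost."""
    return (selectivity - 1.0) / cost_per_tuple

# (name, selectivity, cost per tuple) -- hypothetical predicates
preds = [
    ("cheap_unselective",   0.9,   1.0),
    ("expensive_selective", 0.1, 100.0),
    ("cheap_selective",     0.2,   1.0),
]
ordered = [name for name, s, c in sorted(preds, key=lambda p: rank(p[1], p[2]))]
# -> cheap_selective, cheap_unselective, expensive_selective
```

Note how the highly selective but expensive predicate is deferred to last: by then, the cheap predicates have already discarded most tuples, so its high per-tuple cost is paid rarely. Predicate Migration builds on this kind of ordering to decide where such selections belong relative to joins.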
Conference Paper
Research on query optimization has focused almost exclusively on reducing query execution time, while important qualities such as consistency and predictability have largely been ignored, even though most database users consider these qualities to be at least as important as raw performance. In this paper, we explore how the query optimization process can be made more robust, focusing on the important subproblem of cardinality estimation. The robust cardinality estimation technique that we propose allows for a user- or application-specified trade-off between performance and predictability, and it captures multi-dimensional correlations while remaining space- and time-efficient.
Article
The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small! 1 Prehistory The word 'histogram' is of Greek origin, as it is a composite of the words 'istos' (ἱστός) (= 'mast', also means 'web' but this is not relevant to this discussion) and 'gramma' (γράμμα) (= 'something written'). Hence, it should be interpreted as a form of writing consisting of 'masts', i.e., long shapes vertically standing, or something similar.
• F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. The SAP HANA database -- an architecture overview.
• SAP HANA Platform 2.0 SPS 04 Reference. Accessed: 2020-07-02.