Article

The Star Schema Benchmark (SSB)

Authors: Patrick O'Neil, Elizabeth O'Neil, and Xuedong Chen

Abstract

The Star Schema benchmark, or SSB, was devised to evaluate database system performance of star schema data warehouse queries. The schema for SSB is based on the TPC-H benchmark, but in a highly modified form. We believe the details of modification to be instructive in answering an important question: given a database schema that is not in star schema form, how can it be transformed to star schema form without loss of important query information? The SSB has been used to measure a number of major commercial database products on Linux to evaluate a new product. We also intend to use SSB to compare star schema query performance of three major commercial database products running on Windows, which will be reported separately.
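For readers unfamiliar with the benchmark's shape, the sketch below builds a heavily simplified SSB-style star schema (one lineorder fact table joined to a single date dimension) in an in-memory SQLite database and runs a Query-Flight-1-style aggregation. The table and column names follow SSB conventions, but the data, constants, and the reduced schema are illustrative assumptions, not the benchmark's specification.

    import sqlite3

    # Minimal SSB-style star schema: one fact table (lineorder) and one
    # dimension (date_dim); the full benchmark also has customer, supplier, part.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_dim (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
        CREATE TABLE lineorder (
            lo_orderdate     INTEGER REFERENCES date_dim(d_datekey),
            lo_quantity      INTEGER,
            lo_discount      INTEGER,
            lo_extendedprice INTEGER
        );
    """)
    conn.executemany("INSERT INTO date_dim VALUES (?, ?)",
                     [(19930101, 1993), (19940101, 1994)])
    conn.executemany("INSERT INTO lineorder VALUES (?, ?, ?, ?)",
                     [(19930101, 10, 2, 1000), (19930101, 30, 2, 2000),
                      (19940101, 10, 2, 1500)])

    # A Query-Flight-1-style query: restrict on the dimension, filter the
    # fact table, and aggregate a derived revenue measure.
    revenue = conn.execute("""
        SELECT SUM(lo_extendedprice * lo_discount) AS revenue
        FROM lineorder JOIN date_dim ON lo_orderdate = d_datekey
        WHERE d_year = 1993 AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25
    """).fetchone()[0]
    print(revenue)  # only the first fact row qualifies -> 2000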

... For this reason, later benchmarks such as SSB [12], TPC-DS [13] and TPCx-BB [14] propose the use of star or snowflake schemas, more suitable [1] for systems that support OLAP applications. In [15], the authors apply the SSB benchmark on Apache Druid [6], where they test different implementations of the data model proposed by SSB. ...
... It not only allows comparing OLAP systems [11][12][13][14] but also allows OLAP cube designers to compare different OLAP cube designs on a chosen Big Data OLAP technology. ...
... It can be applied to any OLAP system, but it is especially recommended for Big Data OLAP systems, because it addresses the specific requirements of Big Data scenarios unlike other proposals [12]. Based on the analysis of these specific requirements, we propose a complete set of benchmarking metrics in contrast to other reviewed approaches [15,16,18,19]. ...
Article
Full-text available
In recent years, several new technologies have enabled OLAP processing over Big Data sources. Among these technologies, we highlight those that allow data pre-aggregation because of their demonstrated performance in data querying. This is the case of Apache Kylin, a Hadoop-based technology that supports sub-second queries over fact tables with billions of rows combined with ultra-high-cardinality dimensions. However, taking advantage of data pre-aggregation techniques to design analytic models for Big Data OLAP is not a trivial task. It requires very advanced knowledge of the underlying technologies and user querying patterns. A wrong design of the OLAP cube significantly alters several key performance metrics, including: (i) the analytic capabilities of the cube (time and ability to provide an answer to a query), (ii) the size of the OLAP cube, and (iii) the time required to build the OLAP cube. Therefore, in this paper we (i) propose a benchmark to aid Big Data OLAP designers in choosing the most suitable cube design for their goals, (ii) identify and describe the main requirements and trade-offs for effectively designing a Big Data OLAP cube that takes advantage of data pre-aggregation techniques, and (iii) validate our benchmark in a case study.
... It takes a dataflow graph, G, as the input, and returns an execution tree graph, Gτ, as the output. The algorithm does the partitioning using depth-first search (DFS), which starts from the data source components, i.e., the vertexes whose in-degree is equal to zero in G (see lines 6-9). An execution tree is created taking a data source component as the root (see line 7). ...
... The thread will be blocked if there is no space available in the queue. The shared cache is processed sequentially through all the activity threads (see lines 1-11). For each component, it has to process m shared caches in a sequential order. ...
... We use the star schema benchmark (SSB) based on TPC-H [9] for the evaluation. The star schema consists of one fact table, lineorder, and four dimension tables including customer, part, supplier and date (see [9] for the detailed schema information). ...
Conference Paper
Full-text available
Extract-Transform-Load (ETL) handles large amounts of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. In order to minimize the time and the resources required by ETL dataflows, this paper presents an optimization framework using partitioning and parallelization. The framework first partitions an ETL dataflow into multiple execution trees according to the characteristics of ETL constructs, then within an execution tree pipelined parallelism and shared cache are used to optimize the partitioned dataflow. Furthermore, multi-threading is used in component-based optimization. The experimental results show that the proposed framework can run 4.7 times faster than ordinary ETL dataflows (without using the proposed partitioning and optimization methods), and is comparable to similar ETL tools.
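The excerpts above describe partitioning an ETL dataflow graph into execution trees by depth-first search from the data source components (the vertexes with in-degree zero). The snippet below is a minimal re-implementation of that general idea under assumed data structures (a dict-of-lists adjacency representation and first-come assignment of shared components), not the authors' actual algorithm or code.

    from collections import defaultdict

    def partition_into_execution_trees(graph):
        """Partition a dataflow DAG into trees rooted at its source components.

        graph maps component -> list of downstream components.  Sources are the
        vertexes with in-degree zero; each remaining vertex is assigned to the
        first tree whose DFS reaches it (a simplification of the paper's rules).
        """
        in_degree = defaultdict(int)
        for src, succs in graph.items():
            in_degree.setdefault(src, 0)
            for s in succs:
                in_degree[s] += 1
        sources = [v for v, d in in_degree.items() if d == 0]

        assigned = set()
        trees = []
        for root in sources:                  # one execution tree per data source
            edges, stack = [], [root]
            assigned.add(root)
            while stack:                      # iterative depth-first search
                node = stack.pop()
                for succ in graph.get(node, []):
                    if succ not in assigned:
                        assigned.add(succ)
                        edges.append((node, succ))
                        stack.append(succ)
            trees.append((root, edges))
        return trees

    # Two extract sources feeding a shared transform-and-load pipeline.
    dataflow = {"extract_a": ["clean"], "extract_b": ["clean"],
                "clean": ["join_dims"], "join_dims": ["load_dw"]}
    for root, edges in partition_into_execution_trees(dataflow):
        print(root, edges)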
... It takes a dataflow graph, G, as the input, and returns an execution tree graph, Gτ, as the output. The algorithm does the partitioning using depth-first search (DFS), which starts from the data source components, i.e., the vertexes whose in-degree is equal to zero in G (see lines 6-9). An execution tree is created taking a data source component as the root (see line 7). ...
... The thread will be blocked if there is no space available in the queue. The shared cache is processed sequentially through all the activity threads (see lines 1-11). For each component, it has to process m shared caches in a sequential order. ...
... We use the star schema benchmark (SSB) based on TPC-H [9] for the evaluation. The star schema consists of one fact table, lineorder, and four dimension tables including customer, part, supplier and date (see [9] for the detailed schema information). ...
Article
Full-text available
Extract-Transform-Load (ETL) handles large amounts of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. In order to minimize the time and the resources required by ETL dataflows, this paper presents a framework to optimize dataflows using shared cache and parallelization techniques. The framework classifies the components in an ETL dataflow into different categories based on their data operation properties. The framework then partitions the dataflow based on the classification at different granularities. Furthermore, the framework applies optimization techniques such as cache re-using, pipelining and multi-threading to the already-partitioned dataflows. The proposed techniques reduce system memory footprint and the frequency of copying data between different components, and also take full advantage of the computing power of multi-core processors. The experimental results show that the proposed optimization framework is 4.7 times faster than ordinary ETL dataflows (without using the proposed optimization techniques), and outperforms a similar tool (Kettle).
... To explain FTree, we use the SSB (star schema benchmark) [27]; this is demonstrated in Table 2. The SSB consists of a data warehouse with the lineorder fact table as well as the customer, supplier, part, and date dimension tables. ...
... The SSB consists of a data warehouse with the lineorder fact table as well as the customer, supplier, part, and date dimension tables. To explain FTree, we use the SSB (star schema benchmark) [27]; this is demonstrated in Table 2. ...
Article
Full-text available
Data warehousing gives frameworks and means for enterprise administrators to methodically prepare, comprehend, and utilize their data to improve strategic decision-making skills. One of the principal challenges for data warehouse designers is fragmentation. Currently, several fragmentation approaches for data warehouses have been developed, since this technique can decrease OLAP (online analytical processing) query response time and provides considerable benefits in table loading and maintenance tasks. In this paper, a horizontal fragmentation method called FTree, which uses decision trees to fragment data warehouses, is presented to take advantage of the effectiveness that this technique provides in classification. FTree determines the OLAP queries with major relevance, evaluates the predicates found in the workload, and, according to this, builds the decision tree to select the horizontal fragmentation scheme. To verify that the design is correct, the SSB (star schema benchmark) was used in the first instance; later, a tourist data warehouse was built, and the fragmentation method was tested on it. The results of the experiments proved the efficacy of the method.
... In order to predict the performance of a RDBMS on a large workload, it is common to evaluate it on a benchmark workload that resembles the target workload. Historically, benchmarked workloads are either standardized (such as TPC-H [2], SSB [30], YCSB [8], and Wisconsin Benchmark [12]) or created by domain experts who manually curate queries. If we wish to construct a custom benchmark for every use case of each customer, the second solution becomes unsustainable. ...
... We use a workload of 2200 queries executed on a database with SF=1 and uniform data distribution. 3. SSB [30]: A benchmark designed to measure the performance of database products in support of classical data warehousing applications. ...
Preprint
Full-text available
This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e., a representative workload, over time. To construct such a workload in a principled manner, we formalize the notions of workload representativity and coverage. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is NP-hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs of our techniques.
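To make the representativity intuition concrete, the toy sketch below greedily picks k queries whose feature histogram stays close to the full workload's feature distribution. The feature encoding, the L1 distance objective, and the function name are illustrative assumptions; the paper's actual metrics, greedy algorithm, and approximation guarantees are more involved.

    from collections import Counter

    def greedy_compress(workload_features, k):
        """Greedily pick k queries whose feature histogram tracks the workload's.

        workload_features: one hashable feature tuple per query.  A toy stand-in
        for a representativity objective, not the published algorithm.
        """
        n_total = len(workload_features)
        target = {f: c / n_total for f, c in Counter(workload_features).items()}

        chosen_idx, chosen_counts = [], Counter()
        for _ in range(k):
            best_i, best_feat, best_err = None, None, float("inf")
            for i, feat in enumerate(workload_features):
                if i in chosen_idx:
                    continue
                trial = chosen_counts.copy()
                trial[feat] += 1
                n = len(chosen_idx) + 1
                # L1 distance between the trial histogram and the target one.
                err = sum(abs(trial[f] / n - target.get(f, 0.0))
                          for f in set(trial) | set(target))
                if err < best_err:
                    best_i, best_feat, best_err = i, feat, err
            chosen_idx.append(best_i)
            chosen_counts[best_feat] += 1
        return chosen_idx

    # Feature = (number of joins, has aggregation); pick a 3-query summary.
    features = [(0, False), (0, False), (2, True), (2, True), (2, True), (4, True)]
    print(greedy_compress(features, 3))   # e.g. [2, 0, 3]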
... This issue is exacerbated for the most demanding yet popular visualization scenarios such as crossfilter [41,58,67], where one interaction may generate hundreds of queries per second, with an expectation of near-immediate results. Unfortunately, existing database benchmarks such as TPC-H [65], TPC-DS [64], or the Star Schema Benchmark (SSB) [47] are insufficient for making these comparisons. One main reason is that the workloads modeled in these benchmarks are not representative of how database queries are generated through user interactions, such as with tools like Tableau [59] or Spotfire [58]. ...
... Therefore, this use case emphasizes the need for real-time results. Unlike in traditional database scenarios (e.g., TPC-H [65], TPC-DS [64], or the Star Schema Benchmark (SSB) [47]), latencies of just one second can make interactive data exploration interfaces appear unresponsive [37]. ...
... In this research the widely known Star Schema Benchmark (SSB) [36] was adopted as a baseline schema and dataset. The SSB defines a collection of base relations along with a set of queries typically used in data warehousing. ...
... The SSB comprises a dimensional data model (four dimensions, one fact table), an extensible dataset (its size depending on a scaling factor, SF), and a set of queries typical for data warehouse applications, arranged in four categories/families designated as Query Flights (a detailed definition of the SSB is available online [36]). Thirteen Select-Project-Join query statements in total compose the full query set of the SSB. ...
Article
Full-text available
The information explosion the world has witnessed in the last two decades has forced businesses to adopt a data-driven culture for them to be competitive. These data-driven businesses have access to countless sources of information, and face the challenge of making sense of overwhelming amounts of data in an efficient and reliable manner, which implies the execution of read-intensive operations. In the context of this challenge, a framework for the dynamic read-optimization of large dimensional datasets has been designed, and on top of it a workload-driven mechanism for automatic materialized view selection and creation has been developed. This paper presents an extensive description of this mechanism, along with a proof-of-concept implementation of it and its corresponding performance evaluation. Results show that the proposed mechanism is able to derive a limited but comprehensive set of views leading to a drop in query latency ranging from 80% to 99.99% at the expense of 13% of the disk space used by the base dataset. This way, the devised mechanism enables speeding up query execution by building materialized views that match the actual demand of query workloads.
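The payoff of materialized views described above is easy to see in miniature: precompute an aggregate that matches a frequent query pattern and answer matching queries from the much smaller table. The sketch below does only that, with made-up table and column names and SQLite as a stand-in engine (SQLite has no MATERIALIZED VIEW, so a derived table plays that role); the paper's contribution, automatically selecting which views to build from the workload, is not shown.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE lineorder (lo_custkey INTEGER, lo_orderdate INTEGER,
                                lo_revenue INTEGER);
        INSERT INTO lineorder VALUES (1, 19930101, 100), (1, 19940101, 150),
                                     (2, 19930101, 200);

        -- "Materialized view" for a frequently observed aggregate pattern.
        CREATE TABLE mv_rev_by_cust_year AS
            SELECT lo_custkey, lo_orderdate / 10000 AS year,
                   SUM(lo_revenue) AS revenue
            FROM lineorder GROUP BY lo_custkey, lo_orderdate / 10000;
    """)

    # Matching workload queries are redirected to the much smaller view.
    print(conn.execute(
        "SELECT year, SUM(revenue) FROM mv_rev_by_cust_year GROUP BY year"
    ).fetchall())   # 1993 -> 300, 1994 -> 150 (row order may vary)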
... Accordingly, considering the SELECT statement in the example above and the dimensional schema described in (O'Neil et al., 2009) which comprises one fact table and four dimension tables, a complete feature vector instance (its decimal equivalent for length and clarity) is shown below: ...
... This proof-of-concept implementation leverages the Star Schema Benchmark (SSB) as baseline schema and dataset, and therefore both the predefined views and query templates, as well as the query generator module, were designed and built so they conform to the data model the SSB embodies. Thirteen ASPJ query statements compose the full query set of the SSB, arranged in four categories/families designated as Query Flights (a detailed definition of the SSB is available at (O'Neil et al., 2009)). For this proof-of-concept, three view definitions were derived based on the original SSB query set, and from each view definition, four query templates were prepared. ...
Conference Paper
Full-text available
Small-to-medium businesses are increasingly relying on big data platforms to run their analytical workloads in a cost-effective manner, instead of using conventional and costly data warehouse systems. However, the distributed nature of big data technologies makes it time-consuming to process typical analytical queries, especially those involving aggregate and join operations, preventing business users from performing efficient data exploration. In this sense, a workload-driven approach for automatic view selection was devised, aimed at speeding up analytical queries issued against distributed dimensional data. This paper presents a detailed description of the proposed approach, along with an extensive evaluation to test its feasibility. Experimental results show that the conceived mechanism is able to automatically derive a limited but comprehensive set of views able to reduce query processing time by up to 89%–98%.
... We use the Star Schema Benchmark [41] and the TPC-H Benchmark [7] dataset to run our experiments. We use Scale Factor 1 for the experiments including a full exploration of all code variants. ...
... Due to current implementation restrictions of our prototype system (e.g., a missing LIKE operator), we limit the evaluation queries to a representative subset. The star schema benchmark showcases a data warehouse workload, where one central fact table is connected to four dimension tables [41]. The queries are organized into four groups, which vary in the number of joins involved. ...
Article
Full-text available
Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall to deliver improved performance. Currently, database engines are manually optimized for each processor: a costly and error-prone process. In this paper, we propose concepts to enable the database engine to perform per-processor optimization automatically. Our core idea is to create variants of generated code and to learn a fast variant for each processor. We create variants by modifying parallelization strategies, specializing data structures, and applying different code transformations. Our experimental results show that the performance of variants may diverge up to two orders of magnitude. Therefore, we need to generate custom code for each processor to achieve peak performance. We show that our approach finds a fast custom variant for multi-core CPUs, GPUs, and MICs.
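The core idea above, generating several code variants and learning the fastest one per processor, can be mimicked at a very coarse level in a few lines: time each candidate on the machine at hand and keep the winner. This is only a scheduling-level illustration with two hand-written Python variants; the paper creates its variants at the code-generation level inside the database engine.

    import timeit

    def sum_python(values):
        total = 0
        for v in values:        # straightforward interpreted loop
            total += v
        return total

    def sum_builtin(values):
        return sum(values)      # a differently "generated" variant

    def pick_fastest_variant(variants, sample):
        """Micro-benchmark every variant on this machine and keep the winner."""
        timings = {f.__name__: timeit.timeit(lambda f=f: f(sample), number=50)
                   for f in variants}
        return min(timings, key=timings.get), timings

    sample = list(range(100_000))
    best, timings = pick_fastest_variant([sum_python, sum_builtin], sample)
    print(best, timings)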
... In addition, index comparison and selection has been widely studied in the traditional RDBMS area [11], [26] and the spatial database area [9], [20], [25], but these works either focus on one-dimensional indexes or are mainly suitable for spatial data. Moreover, the storage and reading model of index and data tables differs between traditional databases and Hive: in the former, the data is well organized as heap or tree-based files and processed locally on one server. ...
... These three works mainly focus on spatial data processing, not traditional data processing like ours. [26] proposes a star schema benchmark for index comparison in RDBMSs, but gives no detailed comparison. [11] proposes several kinds of B-Tree indexes, but no experimental evaluation of them. ...
Article
Full-text available
Apache Hive has been widely used for big data processing over large scale clusters by many companies. It provides a declarative query language called HiveQL. The efficiency of filtering out query-irrelevant data from HDFS closely affects the performance of query processing. This is especially true for multi-dimensional, highly selective queries involving few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in the ORC file) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques to achieve this. However, it is not trivial to choose a suitable index and columnar storage format based on data and query features. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and suggest best practices for improving multi-dimensional queries in Hive under different conditions.
... SSB models the data warehouse of a wholesale supplier and its queries are simplified versions of TPC-H queries. They are organized in four flights of three to four queries each [6]. Each flight consists of a sequence of queries that someone working with a data warehouse system would ask, e.g., for a drill down. ...
... Figure 4 shows the SSB database schema. This schema registers the sales that are made [6]. In this paper we reviewed three Decision Support benchmarks (TPC-DS, TPC-H and SSB), explaining their structure and main objectives. ...
Article
Database management systems form the core of all business intelligence solutions, with information being the key to success in today’s businesses. Performance evaluation of database systems is a non-trivial activity, due to different architectures and functionalities tuned for specific requirements. In this paper we provide an overview and analyze the main properties of three benchmarks: TPC-DS, TPC-H and the SSB, which are online analytical processing benchmarks, in order to validate the performance of a Decision Support System.
... The scenario of Figure 3 illustrates TPC-H [16]. The SSB is based on TPC-H, and its queries are simplified versions of TPC-H queries. They are organized in four flights of three to four queries each [13]. Each flight consists of a sequence of queries that someone working with a data warehouse system would ask. ...
... Figure 4 shows the SSB database schema. This schema registers the sales that are made [13]. ...
Conference Paper
Benchmarking is a common practice for the evaluation of database computer systems. By executing certain benchmarks, manufacturers and researchers are able to highlight the characteristics of a certain system and are also able to rank the system against the rest. On the other hand, at the moment, BigData is a hot topic. It concerns dealing efficiently with information that is challenging to handle, due to volume, velocity or variety. As more and more platforms are proposed to deal with BigData, it becomes important to have benchmarks that can be used to evaluate performance characteristics of such platforms. At the same time, Decision Support applications are related to BigData, as they need to efficiently deal with huge datasets. In this paper we describe benchmarks representing Decision Support Systems (TPC-H, SSB, TPC-DS), and benchmarks for the Big Data class (YCSB, BigBench, and BigFrame), in order to help users to choose the most appropriate one for their needs. We also characterize the relationship between Big Data benchmarks and Decision Support benchmarks.
... Combining the data from social media feeds with user reviews, transactions, and analytics in an online retailer case results in complex, hybrid data processing pipelines. The data of interest and value is both relational and vector-based, not fitting fully the cases outlined in the TPC-H, TPC-DS, or SSB [9] benchmarks, or in the vector-search-oriented ANN-Benchmarks [10]. ...
... Unlike TPC-H, it uses a multidimensional star conceptual model composed of one fact table and four dimension tables. It contains fewer queries than TPC-H and has no strict requirements about which forms of tuning are allowed or forbidden [132]. ...
Thesis
Full-text available
In today's world, we depend heavily on digital devices for work, entertainment, and social life. We are therefore compelled to seek every possible way of saving the energy consumed by their hardware and software components and by the applications they run. Database management systems (DBMSs) are becoming energy sinks because of the massive explosion of data they must collect, process, and store. The query processor is one of the most energy-hungry components of a DBMS; its role is to process queries efficiently. Given the volume of data and the complexity of analytical queries, studying the energy efficiency of this component is undoubtedly a crucial and urgent question. This importance was strongly emphasized in the Claremont report, co-signed by about thirty researchers, experts, and architects of the database world, and was later taken up in the Beckman report, which ranks energy efficiency among the challenges to be met in the field of massive data. Most current query optimizers are designed to minimize input/output operations and try to exploit RAM as much as possible; as a consequence, they generally ignore energy aspects. In this thesis, to optimize the energy consumed by a DBMS, we propose a software-oriented approach, which we call Auditer l'Énergie – Avant de Déployer (Audit the Energy Before Deploying), for designing less energy-hungry query processors. The approach consists of five main steps: (1) audit the components of the query processor to understand how it works and determine its energy-sensitive parameters; (2) build a state of the art of existing solutions, with a twofold objective: (i) provide a roadmap for designers and students who want to understand energy-efficiency issues in the database world, and (ii) support the development of energy models; (3) model the energy consumption of query processors by defining mathematical cost models dedicated to estimating the energy consumed by the system during query execution; (4) use deep machine learning techniques to identify the values of the energy-sensitive parameters belonging to both the hardware and software components; to apply our approach, we chose three open-source data processing systems, two of which are disk-oriented: PostgreSQL, MonetDB, and Hyrise; (5) deploy the validated models in the three DBMSs for energy estimation and optimization studies.
... First, our developed HOME algorithm uses three tables from SSB: lineorder1, parts1, and customer1 [70]. Nineteen select statements are generated for evaluation. ...
Thesis
Analyzing Big Data has emerged as a significant activity for many organizations. Big Data analysis is simplified by the MapReduce framework and execution environments such as Hadoop, and by parallel systems such as Hive. On the other hand, most MapReduce users have complex analytical queries that are expressed as individual MapReduce jobs. Using high-level query languages such as Pig, Hive, and Jaql, a user's complex query is expressed as a workflow of MapReduce jobs. The work in this thesis concerns how to reuse previous results stored in the Hive output file, in the same or different sessions, to improve Hive performance. This has been done by introducing two algorithms. The first is called HOME (HiveQL Optimization in a Multi-Session Environment). To evaluate the developed HOME algorithm, it was implemented using 19 different SQL statements to reduce I/O in MapReduce jobs. Based on the HOME algorithm, a new HiveQL execution architecture built on materialized previous results has been proposed. This thesis also concerns how to minimize the size of stored data by storing an index of the previous results in the Hive output file, so that it can be reused in the same or different sessions to improve Hive performance. This has been done by introducing a second algorithm called iHOME (indexing HiveQL Optimization for JOIN over a Multi-Session Environment). Based on the iHOME algorithm, a new HiveQL execution architecture built on materialized previous results has been proposed. To evaluate the developed iHOME algorithm, it was implemented using 10 different SQL statements to reduce I/O in MapReduce jobs. The implementation framework of both algorithms is built on top of the Hive dataflow system without any change to Hive. To evaluate the performance of the proposed HiveQL architecture, the Star Schema Benchmark (SSB) has been used. According to the experimental results, the developed algorithms outperform Hive by 67% for HOME and 84% for iHOME, on average, in terms of execution time.
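The reuse idea behind HOME and iHOME, keeping earlier query results around and answering repeated queries from them instead of re-running the job, can be illustrated with a simple result cache keyed by the normalized query text. Everything below (class name, normalization, the stand-in job function) is an assumption for illustration; the thesis's actual algorithms additionally handle indexing and cross-session storage of results in the Hive output file.

    import hashlib

    class ResultCache:
        """Very small stand-in for result reuse across query sessions."""

        def __init__(self):
            self._store = {}

        @staticmethod
        def _key(query):
            normalized = " ".join(query.lower().split())   # trivial normalization
            return hashlib.sha1(normalized.encode()).hexdigest()

        def run(self, query, execute):
            key = self._key(query)
            if key in self._store:                # hit: skip the expensive job
                return self._store[key], True
            result = execute(query)               # miss: run the real engine
            self._store[key] = result
            return result, False

    def run_mapreduce_job(query):
        return [("row", 1)]                       # stand-in for a Hive/MapReduce job

    cache = ResultCache()
    print(cache.run("SELECT * FROM ssb.lineorder", run_mapreduce_job))   # miss
    print(cache.run("select *   from ssb.lineorder", run_mapreduce_job)) # hit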
... In order to develop a predictive model to assess the VTE predisposing it was necessary to gather data from several sources and carry out an Extract, Transform, and Load process to organize the information. The information was held in a star schema, which consists of a collection of tables that are logically related to each other (O'Neil, O'Neil, & Chen, 2009). To obtain a star schema, it was essential to follow a few steps. ...
Chapter
Full-text available
Thrombophilia stands for a genetic or an acquired tendency to hypercoagulable states, frequently as venous thrombosis. Venous thromboembolism, represented mainly by deep venous thrombosis and pulmonary embolism, is often a chronic illness, associated with high morbidity and mortality. Therefore, it is crucial to identify the cause of the disease, the most appropriate treatment, the length of treatment or prevent a thrombotic recurrence. This work will focus on the development of a diagnosis decision support system in terms of a formal agenda built on a Logic Programming approach to knowledge representation and reasoning, complemented with a computational framework based on Artificial Neural Networks. The proposed model has been quite accurate in the assessment of thrombophilia predisposition (accuracy close to 95%). Furthermore, the model classified properly the patients that really presented the pathology, as well as classifying the disease absence (sensitivity and specificity higher than 95%).
... The analysis of available solutions [25] shows that no advantage can be obtained by storing data by columns using a row-oriented DBMS with a vertically divided schema, or by indexing all columns to ensure independent access to them. Database systems with row storage have considerably lower performance than column-oriented databases on the Star Schema Benchmark (SSBM) [46][47][48]. The difference in performance shows that there are significant differences at the level of query execution between the two systems (in addition to the evident differences at the level of storage). ...
Article
Full-text available
The development and investigation of efficient methods of parallel processing of very large databases using a columnar data representation designed for computer clusters is discussed. An approach that combines the advantages of relational and column-oriented DBMSs is proposed. A new type of distributed column index, fragmented based on the domain-interval principle, is introduced. The column indexes are auxiliary structures that are constantly stored in the distributed main memory of a computer cluster. To match the elements of a column index to the tuples of the original relation, surrogate keys are used. Resource-hungry relational operations are performed on the corresponding column indexes rather than on the original relations of the database. As a result, a precomputation table is obtained. Using this table, the DBMS reconstructs the resulting relation. For basic relational operations on column indexes, methods for their parallel decomposition that do not require massive data exchanges between the processor nodes are proposed. This approach improves OLAP query performance by hundreds of times.
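The domain-interval fragmentation mentioned above can be pictured with a tiny example: a column index of (surrogate key, value) pairs is split into fragments according to which value interval each entry falls into. The interval boundaries, column values, and function name below are illustrative assumptions; distribution across cluster nodes and the parallel relational operators are omitted.

    import bisect

    def fragment_column_index(column_values, boundaries):
        """Split a (surrogate_key, value) column index into domain intervals.

        boundaries: sorted upper bounds, e.g. [100, 1000] produces the
        fragments (-inf, 100], (100, 1000], (1000, +inf).
        """
        fragments = [[] for _ in range(len(boundaries) + 1)]
        for surrogate_key, value in enumerate(column_values):
            frag_no = bisect.bisect_left(boundaries, value)
            fragments[frag_no].append((surrogate_key, value))
        return fragments

    prices = [30, 250, 990, 5, 4200]
    for i, frag in enumerate(fragment_column_index(prices, [100, 1000])):
        print(f"fragment {i}: {frag}")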
... In order to develop a predictive model to assess the VTE predisposing it was necessary to gather data from several sources and carry out an Extract, Transform, and Load process to organize the information. The information was held in a star schema, which consists of a collection of tables that are logically related to each other (O'Neil, O'Neil, & Chen, 2009). To obtain a star schema, it was essential to follow a few steps. ...
Article
Thrombophilia stands for a genetic or an acquired tendency to hypercoagulable states, frequently as venous thrombosis. Venous thromboembolism, represented mainly by deep venous thrombosis and pulmonary embolism, is often a chronic illness, associated with high morbidity and mortality. Therefore, it is crucial to identify the cause of the disease, the most appropriate treatment, the length of treatment or prevent a thrombotic recurrence. This work will focus on the development of a diagnosis decision support system in terms of a formal agenda built on a Logic Programming approach to knowledge representation and reasoning, complemented with a computational framework based on Artificial Neural Networks. The proposed model has been quite accurate in the assessment of thrombophilia predisposition (accuracy close to 95%). Furthermore, the model classified properly the patients that really presented the pathology, as well as classifying the disease absence (sensitivity and specificity higher than 95%).
... We evaluate our strategies based on a standard benchmark originally designed to measure the performance of aggregate queries in relational database systems: the Star Schema Benchmark (SSB) [19]. This benchmark is well known in the database community and was chosen for its simple design (it refines the decision support benchmark TPC-H [11]) and its well-defined testbed. ...
Conference Paper
More and more RDF data is exposed on the Web via SPARQL endpoints. With the recent SPARQL 1.1 standard, these datasets can be queried in novel and more powerful ways, e.g., complex analysis tasks involving grouping and aggregation, and even data from multiple SPARQL endpoints, can now be formulated in a single query. This enables Business Intelligence applications that access data from federated web sources and can combine it with local data. However, as both aggregate and federated queries have become available only recently, state-of-the-art systems lack sophisticated optimization techniques that facilitate efficient execution of such queries over large datasets. To overcome these shortcomings, we propose a set of query processing strategies and the associated Cost-based Optimizer for Distributed Aggregate queries (CoDA) for executing aggregate SPARQL queries over federations of SPARQL endpoints. Our comprehensive experiments show that CoDA significantly improves performance over current state-of-the-art systems.
... However, we also conclude that it is limited in terms of capabilities. For the second part of our evaluation we choose the Star Schema Benchmark, one of the standard benchmarks used to evaluate Decision Support Systems (O'Neil et al., 2009). This benchmark has a star schema table organization and provides four primary queries with variations. ...
Article
The big data era brought a set of new database features, such as parallel execution of requests and data distribution among different nodes, as well as new types of databases. NoSQL technology emerged to help people and companies easily scale systems with simple and transparent data distribution. It became possible to cope with higher demand in less time while performing different types of operations and storing large amounts of data. In this paper, we evaluate Cassandra's scalability and the execution time of CRUD operations and subsequently compare one relational and one non-relational system by evaluating their performance during the execution of decision support queries. For that purpose, we used two standard benchmarks: the Yahoo! Cloud Serving Benchmark, to evaluate the execution time of requests and the speedup of Cassandra, and the Star Schema Benchmark, to run queries over MySQL Cluster, as the relational database, and Hadoop with Hive as the non-SQL counterpart. We draw conclusions about the capabilities and limitations of those systems.
... We illustrate our approach using TPC-H [5] and Star Schema Benchmark (SSB) [12] schemas. TPC-H is a decision support benchmark which represents our source. ...
... Besides TPC-H, we also measured the performance and scalability of our system on the Star Schema Benchmark (SSB) [26], which mimics data warehousing scenarios. Table 3 shows that our parallelization framework works very well on this workload, achieving a speedup of over 40 for most queries. The scalability is higher than on TPC-H, because TPC-H is a much more complex and challenging workload. ...
Article
Full-text available
With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for plan-driven parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA). In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (morsels) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.
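A stripped-down picture of the morsel-driven execution described above: the input is cut into small fragments (morsels) placed in a shared queue, and worker threads repeatedly grab the next morsel and push it through an operator pipeline, producing partial results that are merged at the end. NUMA-awareness, elastic degree of parallelism, and the dispatcher, which are the paper's main contributions, are not modeled in this sketch; all names are illustrative.

    import queue
    import threading

    def morsel_driven_scan(data, morsel_size, pipeline, num_workers=4):
        """Process `data` in small morsels pulled from a shared work queue."""
        work = queue.Queue()
        for start in range(0, len(data), morsel_size):
            work.put(data[start:start + morsel_size])     # one morsel per item

        results, lock = [], threading.Lock()

        def worker():
            while True:
                try:
                    morsel = work.get_nowait()
                except queue.Empty:
                    return
                partial = pipeline(morsel)                 # run operator pipeline
                with lock:
                    results.append(partial)

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    # Filter-and-aggregate pipeline: per-morsel partial sums merged at the end.
    data = list(range(1_000))
    partials = morsel_driven_scan(
        data, morsel_size=64,
        pipeline=lambda m: sum(x for x in m if x % 2 == 0))
    print(sum(partials))   # 249500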
Article
Index tuning, which adjusts index settings to optimize database performance, plays a crucial role in facilitating efficient data retrieval within database systems. Recently, with the growth of data volumes, the complexity of workloads, and the diversification of database applications, various Automatic Index Tuning (AIT) methods have been proposed to address these challenges. In this paper, we provide a comprehensive survey on Automatic Index Tuning. First, we overview the AIT techniques from multiple aspects, including (i) problem definition, (ii) workflow, (iii) framework, (iv) index types, (v) index interaction, (vi) changing factors, and (vii) automation level, and show the development history. Second, we summarize techniques in the main modules of AIT, including preprocessing, index benefit estimation, and index selection. Preprocessing involves workload compression, index candidate generation, feature representation of workloads and databases, and workload reduction. Index benefit estimation approaches are categorized into empirical methods and machine-learning-based methods. Index selection involves algorithms for offline AIT and online AIT. Moreover, we summarize the commonly used datasets in AIT and discuss the applications of index tuning in commercial and open-source database products. Finally, we outline potential future research directions. Our survey aims to enhance both general knowledge and in-depth insights into AIT, and to inspire researchers to address the ongoing challenges.
Article
Star-join query is the fundamental task in data warehouses and has wide applications in On-line Analytical Processing (OLAP) scenarios. Due to the large number of foreign key constraints and the asymmetric effect in the neighboring instance between the fact and dimension tables, even the latest DP efforts specifically designed for joins, if directly applied to star-join queries, will suffer from extremely large estimation errors and expensive computational cost. In this paper, we are thus motivated to propose DP-starJ, a novel Differentially Private framework for star-Join queries. DP-starJ consists of a series of strategies tailored to specific features of star-join, including: 1) we unveil the different effects of fact and dimension tables on the neighboring database instances, and accordingly revisit the definitions tailored to different cases of star-join; 2) we propose the Predicate Mechanism (PM), which utilizes predicate perturbation to inject noise into the join procedure instead of the results; and 3) to further boost robust performance, we propose a DP-compliant star-join algorithm for various types of star-join tasks based on PM. We provide both theoretical analysis and empirical study, which demonstrate the superiority of the proposed methods over the state-of-the-art solutions in terms of accuracy, efficiency, and scalability.
Article
Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.
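One of the recommendations above, using dictionary encoding by default, is easy to illustrate: repeated column values are replaced by small integer codes plus a dictionary, which shrinks the column and lets predicates be evaluated on the codes. The snippet is a generic illustration of the technique, not Parquet's or ORC's actual encoding scheme.

    def dictionary_encode(column):
        """Replace repeated values with integer codes plus a lookup dictionary."""
        dictionary, codes = [], []
        positions = {}
        for value in column:
            if value not in positions:
                positions[value] = len(dictionary)
                dictionary.append(value)
            codes.append(positions[value])
        return dictionary, codes

    column = ["DE", "FR", "DE", "DE", "US", "FR"]
    dictionary, codes = dictionary_encode(column)
    print(dictionary)   # ['DE', 'FR', 'US']
    print(codes)        # [0, 1, 0, 0, 2, 1]

    # Predicates can be evaluated on the codes: country = 'DE'  ->  code == 0
    de_code = dictionary.index("DE")
    print([i for i, c in enumerate(codes) if c == de_code])   # rows 0, 2, 3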
Article
Hybrid Transactional/Analytical Processing (HTAP) databases are designed to execute real-time analytics and provide performance isolation for online transactions and analytical queries. Real-time analytics emphasizes analyzing the fresh data generated by online transactions, while performance isolation concerns the performance interference between concurrently executing online transactions and analytical queries. However, there is an extreme lack of micro-benchmarks that accurately measure the data freshness of HTAP databases. Despite the abundance of HTAP databases and benchmarks, more thorough research is needed on the performance isolation and real-time analytics capabilities of HTAP databases. This paper focuses on the critical designs of mainstream HTAP databases and the state-of-the-art and state-of-the-practice HTAP benchmarks. First, we systematically introduce the advanced technologies adopted by HTAP databases for real-time analytics and performance isolation capabilities. Then, we summarize the pros and cons of the state-of-the-art and state-of-the-practice HTAP benchmarks. Next, we design and implement a micro-benchmark for HTAP databases, which can precisely control the rate of fresh data generation and the granularity of fresh data access. Finally, we devise experiments to evaluate the performance isolation and real-time analytics capabilities of the state-of-the-art HTAP database. In our continued pursuit of transparency and community collaboration, we will soon make available our comprehensive specifications, meticulously crafted source code, and significant results for public access at https://www.benchcouncil.org/mOLxPBench.
Article
In the two decades following its initial release, SQLite has become the most widely deployed database engine in existence. Today, SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. Several factors are likely responsible for its ubiquity, including its in-process design, standalone codebase, extensive test suite, and cross-platform file format. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical processing (OLAP). DuckDB, a database engine nicknamed "the SQLite for analytics", has recently emerged to meet this demand. While DuckDB has shown strong performance on OLAP benchmarks, it is unclear how SQLite compares. Furthermore, we are aware of no work that attempts to identify root causes for SQLite's performance behavior on OLAP workloads. In this paper, we discuss SQLite in the context of this changing workload landscape. We describe how SQLite evolved from its humble beginnings to the full-featured database engine it is today. We evaluate the performance of modern SQLite on three benchmarks, each representing a different flavor of in-process data management, including transactional, analytical, and blob processing. We delve into analytical data processing on SQLite, identifying key bottlenecks and weighing potential solutions. As a result of our optimizations, SQLite is now up to 4.2X faster on SSB. Finally, we discuss the future of SQLite, envisioning how it will evolve to meet new demands and challenges.
Article
In recent years, GPU database management systems (DBMSes) have rapidly become popular largely due to their remarkable acceleration capability obtained through extreme parallelism in query evaluations. However, there has been relatively little study on the characteristics of these GPU DBMSes for a better understanding of their query performance in various contexts. Also, little has been known about what the potential factors could be that affect the query processing jobs within the GPU DBMSes. To fill this gap, we have conducted a study to identify such factors and to propose a structural causal model, including key factors and their relationships, to explicate the variances of the query execution times on the GPU DBMSes. We have also established a set of hypotheses drawn from the model that explained the performance characteristics. To test the model, we have designed and run comprehensive experiments and conducted in-depth statistical analyses on the obtained empirical data. As a result, our model achieves about 77% amount of variance explained on the query time and indicates that reducing kernel time and data transfer time are the key factors to improve the query time. Also, our results show that the studied systems should resolve several concerns such as bounded processing within GPU memory, lack of rich query evaluation operators, limited scalability, and GPU under-utilization.
Article
Recent advances in computing are widely leveraged to design energy-efficient computing platforms for the Internet of Things. In particular, the scan is a crucial operation that consumes a major share of energy in main-memory column-stores. Scanning a column involves returning a result bit vector that indicates whether each record satisfies a predicate (i.e., a filter condition) or not. ByteSlice, an existing storage layout, chops the data into multiple bytes and thus possesses an early-stop capability by comparing high-order bytes while scanning data. However, column widths are usually not multiples of a byte, and in these cases the last byte of the ByteSlice layout is padded with zeros, which wastes computation energy and memory bandwidth. To make the scan operation energy-efficient, a lightweight index is squeezed into the bits that are originally padded with zeros. In this way, our proposed storage layout DIFusion integrates the data and the index in a unified structure. DIFusion enables an energy-efficient scan because it inherits the early-stopping capability from ByteSlice and possesses the data-skipping ability at the same time. The experimental evaluation on TPC-H, SSB and a real workload shows promising speedups for scans on DIFusion.
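ByteSlice's early-stopping idea can be sketched in plain Python: values are split into byte planes (most significant byte first), and a predicate scan only descends to lower bytes for rows whose higher bytes tie with the constant. The 16-bit values and function names below are illustrative; DIFusion's trick of packing an index into the otherwise zero-padded last byte is not shown.

    def byte_slices(values, width_bytes):
        """Store the i-th byte (most significant first) of every value together."""
        return [[(v >> (8 * (width_bytes - 1 - i))) & 0xFF for v in values]
                for i in range(width_bytes)]

    def scan_less_than(slices, constant, width_bytes):
        """Evaluate value < constant, stopping early once a byte decides it."""
        const_bytes = [(constant >> (8 * (width_bytes - 1 - i))) & 0xFF
                       for i in range(width_bytes)]
        n = len(slices[0])
        result = [False] * n
        undecided = list(range(n))            # rows whose high bytes tie so far
        for i in range(width_bytes):
            still_tied = []
            for row in undecided:
                b = slices[i][row]
                if b < const_bytes[i]:
                    result[row] = True        # decided: value < constant
                elif b == const_bytes[i]:
                    still_tied.append(row)    # need the next, lower byte
                # b > const byte: decided False, nothing more to do
            undecided = still_tied            # early stop for decided rows
        return result

    values = [0x01F0, 0x0203, 0x0201, 0x0105]
    slices = byte_slices(values, width_bytes=2)
    print(scan_less_than(slices, 0x0202, width_bytes=2))  # [True, False, True, True]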
Article
Materialized view selection (MVS) plays a vital role in efficient decision making in a data warehouse. The problem is NP-hard and a constrained optimization problem. The authors handle both the space and maintenance cost constraints using penalty functions. Three penalty function methods, i.e., static, dynamic, and adaptive penalty functions, have been used for handling constraints, and the Backtracking Search Optimization Algorithm (BSA) has been used for optimizing the total query processing cost. Experiments were conducted comparing the static, dynamic, and adaptive penalty functions while varying the space constraint. The adaptive penalty function method yields the best results in terms of minimum query processing cost and achieves optimality, scalability, and feasibility of the problem when varying the lattice dimensions and increasing the number of user queries. The proposed work has been compared with other evolutionary algorithms, i.e., PSO and the genetic algorithm, and yields better results in terms of the minimum total query processing cost of the materialized views.
Conference Paper
Technology limitations are making the use of heterogeneous computing devices much more than an academic curiosity. In fact, the use of such devices is widely acknowledged to be the only promising way to achieve application speedups that users urgently need and expect. However, building a robust and efficient query engine for heterogeneous co-processor environments is still a significant challenge. In this paper, we identify two effects that limit performance in case co-processor resources become scarce. Cache thrashing occurs when the working set of queries does not fit into the co-processor's data cache, resulting in performance degradations up to a factor of 24. Heap contention occurs when multiple operators run in parallel on a co-processor and when their accumulated memory footprint exceeds the main memory capacity of the co-processor, slowing down query execution by up to a factor of six. We propose solutions for both effects. Data-driven operator placement avoids data movements when they might be harmful; query chopping limits co-processor memory usage and thus avoids contention. The combined approach, data-driven query chopping, achieves robust and scalable performance on co-processors. We validate our proposal with our open-source GPU-accelerated database engine CoGaDB and the popular star schema and TPC-H benchmarks.
Article
Full-text available
Storage of pre-computed views in a data warehouse can substantially reduce query processing cost for decision support queries. The problem is to choose an optimal set of materialized views. Various frameworks, such as lattice, MVPP and AND-OR graphs, and algorithms, such as heuristic-based, greedy, and stochastic algorithms, have been proposed in the literature for materialized view selection. Heuristic and greedy algorithms become slow in high-dimensional search spaces, while stochastic algorithms do not guarantee a globally optimal solution but reach a near-optimal solution in a fast and efficient way. In this paper we have implemented the Particle Swarm Optimization (PSO) algorithm, a stochastic algorithm, on the lattice framework to select an optimal set of views for materialization in a data warehouse by minimizing query processing cost. We have compared our results with a genetic algorithm to prove the effectiveness of the PSO algorithm over the genetic algorithm.
Article
In a typical commercial multi-core processor, the last level cache (LLC) is shared by two or more cores. Existing studies have shown that the shared LLC is beneficial to concurrent query processes with commonly shared data sets. However, the shared LLC can also be a performance bottleneck to concurrent queries, each of which has private data structures, such as a hash table for the widely used hash join operator, causing serious cache conflicts. We show that cache conflicts on multi-core processors can significantly degrade overall database performance. In this paper, we propose a hybrid system method called MCC-DB for accelerating executions of warehouse-style queries, which relies on the DBMS knowledge of data access patterns to minimize LLC conflicts in multi-core systems through an enhanced OS facility of cache partitioning. MCC-DB consists of three components: (1) a cache-aware query optimizer carefully selects query plans in order to balance the numbers of cache-sensitive and cache-insensitive plans; (2) a query execution scheduler makes decisions to co-run queries with an objective of minimizing LLC conflicts; and (3) an enhanced OS kernel facility partitions the shared LLC according to each query's cache capacity need and locality strength. We have implemented MCC-DB by patching the three components in PostgreSQL and the Linux kernel. Our intensive measurements on an Intel multi-core system with warehouse-style queries show that MCC-DB can reduce query execution times by up to 33%.
Article
Full-text available
Denormalization is a common tactic for enhancing the performance of data warehouses, though its side-effects are quite obvious. Besides being confronted with update anomalies, denormalization has to consume additional storage space. As a result, this tactic is rarely used in main memory databases, which regard storage space, i.e., RAM, as a scarce resource. Nevertheless, our research reveals that main memory databases can benefit enormously from denormalization, as it is able to remarkably simplify the query processing plans and reduce the computation cost. In this paper, we present A-Store, a main memory OLAP engine customized for star/snowflake schemas. Instead of generating fully materialized denormalization, A-Store resorts to virtual denormalization by treating array indexes as primary keys. This design allows us to harvest the benefit of denormalization without sacrificing additional RAM space. A-Store uses a generic query processing model for all SPJGA queries. It applies a number of state-of-the-art optimization methods, such as vectorized scan and aggregation, to achieve superior performance. Our experiments show that A-Store outperforms the most prestigious MMDB systems significantly in star/snowflake schema based query processing.
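The virtual denormalization idea above, treating array positions as the dimension tables' primary keys so a fact-table foreign key doubles as a direct array index, can be shown in a few lines. The column names and data are made up, and real engines like A-Store do this over columnar, vectorized storage rather than Python lists.

    # Dimension columns stored as arrays; the array offset acts as the
    # dimension's surrogate primary key, so no join hash table is needed.
    c_nation = ["FRANCE", "GERMANY", "BRAZIL"]        # customer dimension
    d_year   = [1993, 1994]                            # date dimension

    # Fact rows: (lo_custkey, lo_orderdate, lo_revenue), where the first two
    # columns are offsets into the dimension arrays above.
    lineorder = [(0, 0, 100), (1, 0, 200), (2, 1, 150), (0, 1, 50)]

    # "Star join" by direct array indexing: revenue per nation for 1993.
    revenue = {}
    for custkey, datekey, rev in lineorder:
        if d_year[datekey] == 1993:                    # filter via offset lookup
            nation = c_nation[custkey]                 # join via offset lookup
            revenue[nation] = revenue.get(nation, 0) + rev
    print(revenue)   # {'FRANCE': 100, 'GERMANY': 200}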
Article
In current databases, GPUs are used as dedicated accelerators to process each individual query. Sharing GPUs among concurrent queries is not supported, causing serious resource underutilization. Based on the profiling of an open-source GPU query engine running commonly used single-query data warehousing workloads, we observe that the utilization of main GPU resources is only up to 25%. The underutilization leads to low system throughput. To address the problem, this paper proposes concurrent query execution as an effective solution. To efficiently share GPUs among concurrent queries for high throughput, the major challenge is to provide software support to control and resolve resource contention incurred by the sharing. Our solution relies on GPU query scheduling and device memory swapping policies to address this challenge. We have implemented a prototype system and evaluated it intensively. The experiment results confirm the effectiveness and performance advantage of our approach. By executing multiple GPU queries concurrently, system throughput can be improved by up to 55% compared with dedicated processing.
Conference Paper
Full-text available
Column-stores have gained market share as a promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pay performance penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. The proposed approach exploits a decision tree algorithm to cluster attributes for each projection and also eliminates frequent database scanning. Experiments with TPC-H data show the effectiveness of the proposed approach.
Article
SUMMARY: A data warehouse can store very large amounts of data that should be processed in parallel in order to achieve reasonable query execution times. The MapReduce programming model is a very convenient way to process large amounts of data in parallel on commodity hardware clusters. A very popular query used in data warehouses is star-join. In this paper, we present a fast and efficient star-join query execution algorithm built on top of a MapReduce framework called Hadoop. By using dynamic filters against dimension tables, the algorithm needs a single scan of the fact table, which means a significant reduction of input/output operations and computational complexity. Also, the algorithm requires only two MapReduce iterations in total: one to build the filters against dimension tables and one to scan the fact table. Our experiments show that the proposed algorithm performs much better than the existing solutions in terms of execution time and input/output. Copyright © 2014 John Wiley & Sons, Ltd.
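The algorithm's key move, building small filters from the filtered dimension tables first and then making a single pass over the fact table, translates naturally into a sequential sketch. The dimension/fact contents, predicate functions, and the use of plain key sets instead of more compact filter structures are assumptions for illustration; the actual system runs the two steps as two Hadoop MapReduce iterations.

    def build_dimension_filters(dimensions, predicates):
        """First pass: evaluate dimension predicates and keep qualifying keys."""
        return {name: {key for key, attrs in rows.items() if predicates[name](attrs)}
                for name, rows in dimensions.items()}

    def star_join(fact_rows, filters):
        """Second pass: a single scan of the fact table probes the key filters."""
        for row in fact_rows:
            if all(row[fk] in keys for fk, keys in filters.items()):
                yield row

    dimensions = {
        "customer": {1: {"region": "EUROPE"}, 2: {"region": "ASIA"}},
        "date":     {19930101: {"year": 1993}, 19940101: {"year": 1994}},
    }
    predicates = {"customer": lambda d: d["region"] == "EUROPE",
                  "date":     lambda d: d["year"] == 1993}
    filters = build_dimension_filters(dimensions, predicates)

    fact = [{"customer": 1, "date": 19930101, "revenue": 10},
            {"customer": 2, "date": 19930101, "revenue": 20},
            {"customer": 1, "date": 19940101, "revenue": 30}]
    print(list(star_join(fact, filters)))   # only the first fact row survives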
Conference Paper
Full-text available
Row-stores commonly use a volcano-style "once-a-tuple" pipeline processor for processing efficiency, but lose I/O efficiency when only a small part of the columns of a wide table is accessed. Academic column-stores usually use "once-a-column" processing for I/O and cache efficiency, but they have to suffer multi-pass column scans for complex queries. This paper focuses on how to achieve the maximal gains from storage models for both pipeline processing efficiency and column processing efficiency. Based on the "address-value" mapping for surrogate keys in dimension tables, we can map incremental primary keys to offset addresses, so the foreign keys in the fact table can be utilized as a native join index for dimension tuples. We use predicate vectors as bitmap filters for the dimensions to enable star-join as a pipeline operator, and pre-generate hash aggregators for column-based aggregation. Using these approaches, star-join and pre-grouping can be completed in a one-pass scan of the dimensional attributes in the fact table, and the following aggregate column scan is responsible for the sparse-access aggregation. We gain both I/O efficiency from vector processing and CPU efficiency from pipelined aggregation. We perform experiments with both a simulated column-based algorithm and a commercial column-store database.