Conference Paper

Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines.

Authors: S. Ghandeharizadeh and D. J. DeWitt

Abstract

In shared-nothing multiprocessor database machines, the relational operators that form a query are executed on the processors where the relations they reference are stored. In general, as the number of processors over which a relation is declustered is increased, the execution time for the query is decreased because more processors are used, each of which has to process fewer tuples. However, for some queries increasing the degree of declustering actually increases the query's response time as the result of increased overhead for query startup, communication, and termination. In general, the declustering strategy selected for a relation can have a significant impact on the overall performance of the system. This paper presents the hybrid-range partitioning strategy, a new declustering strategy for multiprocessor database machines. In addition to describing its characteristics and operation, its performance is compared to that of the current partitioning strategies provided by the Gamma database machine.
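The behaviour described in the abstract can be pictured with a small Python sketch. It follows the two ideas named above and in the citing excerpts further down (many small range fragments of the partitioning attribute, dealt to processors round-robin), but the helper names, the constant fragment size, and the sorted-key input are illustrative assumptions; the paper itself derives the fragment cardinality from the resource requirements of the queries that access the relation.

    import bisect

    def build_fragments(sorted_keys, fragment_cardinality):
        """Cut a sorted relation (or key sample) into logical fragments, each
        covering a small contiguous range of the partitioning attribute."""
        bounds = []  # inclusive upper bound of each fragment
        for i in range(fragment_cardinality - 1, len(sorted_keys), fragment_cardinality):
            bounds.append(sorted_keys[i])
        if not bounds or bounds[-1] < sorted_keys[-1]:
            bounds.append(sorted_keys[-1])
        return bounds

    def assign_round_robin(num_fragments, num_nodes):
        """Fragment f lives on node f mod num_nodes, so adjacent ranges sit on
        different processors and long scans are spread across the machine."""
        return [f % num_nodes for f in range(num_fragments)]

    def nodes_for_range(bounds, placement, lo, hi):
        """Route a range query only to the nodes whose fragments overlap [lo, hi]."""
        first = bisect.bisect_left(bounds, lo)
        last = bisect.bisect_left(bounds, hi)
        return sorted({placement[f] for f in range(first, min(last, len(bounds) - 1) + 1)})

    # Example: 10,000 tuples, 500 tuples per fragment, 8 processors.
    keys = list(range(10_000))
    bounds = build_fragments(keys, fragment_cardinality=500)
    placement = assign_round_robin(len(bounds), num_nodes=8)
    print(nodes_for_range(bounds, placement, 1200, 1800))  # small range -> 2 nodes
    print(nodes_for_range(bounds, placement, 0, 9_999))    # full scan -> all 8 nodes

With such a placement, an exact-match or small-range query is directed to only one or two processors, while a long scan still enjoys full intra-query parallelism, which is the trade-off the abstract describes.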

... We note that the data partitioning problem is very well studied in the database literature in both the parallel and distributed context [5,6,7,8]. We compare and contrast the partitioning problem that results in the context of hybrid clouds with the previously developed state-of-the-art approaches in the related work section. ...
... Our work builds upon a significant body of prior work on data partitioning (e.g., [7,8,5,6]), distributed query processing (e.g., evolution from systems such as SDD-1 [17] to DISCO [18] that operates on heterogeneous data sources, to Internet-scale systems such as Astrolabe [19], and cloud systems [20]), and data privacy [21,4,22]. However, to our knowledge, this is the first paper that takes a risk-based approach to data security in the hybrid model. ...
... We summarize a few of the most relevant previous works below. Data partitioning has been studied fairly extensively in distributed and parallel databases from a variety of perspectives, ranging from load balancing [5], efficient transaction processing [6] to physical database design [7,8]. In [8], the authors consider the problem of workload driven horizontal data partitioning for parallel databases. ...
Article
This paper explores query processing in a hybrid cloud model where a user's local computing capability is exploited alongside public cloud services to deliver an efficient and secure data management solution. Hybrid clouds offer numerous economic advantages including the ability to better manage data privacy and confidentiality, as well as exerting control on monetary expenses of consuming cloud services by exploiting local resources. Nonetheless, query processing in hybrid clouds introduces numerous challenges, the foremost of which is how to partition data and computation between the public and private components of the cloud. The solution must account for the characteristics of the workload that will be executed, the monetary costs associated with acquiring/operating cloud services as well as the risks affiliated with storing sensitive data on a public cloud. This paper proposes a principled framework for distributing data and processing in a hybrid cloud that meets the conflicting goals of performance, disclosure risk and resource allocation cost. The proposed solution is implemented as an add-on tool for a Hadoop and Hive based cloud computing infrastructure.
... The Hybrid-Range Partitioning Strategy [GD90] is an extension to range partitioning which divides a relation into a large number of small logical fragments that do not depend on the number of processors in the system. Each fragment is a small range of the partitioning key domain. ...
... Such a scheme will be called an assignment scheme and should include reassignment rules for self repair and node reintegration. Further on, the assignment scheme should give a fragment allocation resembling the Hybrid-Range Partitioning Strategy [GD90] (described in Section 7.1.5) to ensure scalable performance for a wide range of query types. ...
... The Hybrid-Range Partitioning Strategy (HRPS) [GD90] ensures efficient execution of exact match, small range, and long range queries on the partitioning key by partitioning the relation into a number of fragments that is independent of the number of nodes available in the system. Small relations will be partitioned into few fragments and large ones into many. ...
Thesis
Full-text available
The thesis introduces the concept of multi-site declustering strategies with self repair for databases demanding very high service availability. Existing work on declustering strategies is centered around providing high performance and reliability inside a small geographical area (site). Applications demanding robustness against site failures, such as fire and power outages, cannot use these methods. Such applications often replicate information inside one site and then replicate the site at another site, resulting in unnecessarily high redundancy cost. Multi-site declustering provides robustness against site failures with only two replicas of the data, without compromising performance and reliability. Self repair is proposed to reduce the probability of double failures causing data loss and to reduce the need for rapid replacement of failed hardware. A prerequisite for multi-site declustering with self repair is fast, long-distance communication networks such as ATM. The thesis shows how existing declustering strategies such as Mirrored, Interleaved, Chained, and HypRa declustering can be used as multi-site declustering strategies. In addition, a new strategy called Q-rot declustering is proposed. Compared with the others, it gives greater flexibility with respect to repair strategy, number of sites, and usage pattern. To evaluate the availability of systems using these methods, a general evaluation model has been developed. Multi-site Chained declustering provides the best availability of the methods evaluated. Q-rot declustering has comparable availability but is significantly more flexible. The evaluation model provides insight and can be used to understand the declustering problem better and to develop new and improved multi-site declustering strategies. The model can also be used as a configuration tool by organizations wanting to deploy one of the declustering strategies.
... Hash partition achieves a more balanced query distribution, while range partition facilitates range queries [9]. To achieve a trade-off between query balance and fast range query, we adopt a hybrid range-hash strategy [16]. We first partition a vector into several ranges based on feature indexes, then use hash partition to put each partition onto one node. ...
... Our system, however, is able to customize both push and pull functions. Therefore, we move the split finding operation of Algorithm 1 (lines 10-17) to the pull function to implement our two-phase split algorithm. With this method, we reduce the transferred size of a partition to one integer and two floating-point numbers. ...
Conference Paper
Gradient boosting decision tree (GBDT) is one of the most popular machine learning models, widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets with dimensionality up to 330K, we observed suboptimal performance for all of these aforementioned systems. In this paper, we ask "Can we build a scalable GBDT training system whose performance scales better with respect to the dimensionality of the data?" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
... Some tasks need to get the whole parameter, while others need a portion of the parameter [18]. To achieve a trade-off between query balance and fast range query, we adopt a hybrid strategy, called range-hash partition [19]. We first partition the parameter into several ranges based on the indexes, and then use hash partition to assign the partitions to distributed nodes [28]. ...
... Range partition accelerates range query as it minimizes the number of multi-sited transactions, while hash partition obtains better load balance for random queries [15]. Some other works propose hybrid partition strategies [19] to combine different partition methods according to specific scenarios. ...
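A rough, hedged illustration of the range-hash idea mentioned in the excerpts above: contiguous index ranges keep range pulls local, and hashing whole ranges onto nodes balances the load. The range width, the modulo hash, and the function name are choices made for this example only, not the cited systems' actual interfaces.

    def range_hash_partition(dim, range_width, num_nodes):
        """Cut indexes [0, dim) into contiguous ranges, then hash each whole
        range onto one node: a range pull touches few ranges (hence few
        nodes), while hashing the ranges keeps the load roughly balanced."""
        placement = {}
        for start in range(0, dim, range_width):
            rng = (start, min(start + range_width, dim) - 1)  # inclusive index range
            placement[rng] = hash(rng) % num_nodes            # the node owning this range
        return placement

    placement = range_hash_partition(dim=1_000_000, range_width=10_000, num_nodes=16)
    # Pulling indexes 25_000..34_999 touches at most two ranges, and therefore at
    # most two nodes, while the hash step spreads the 100 ranges over the 16 nodes.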
Conference Paper
We study distributed machine learning in heterogeneous environments in this work. We first conduct a systematic study of existing systems running distributed stochastic gradient descent; we find that, although these systems work well in homogeneous environments, they can suffer performance degradation, sometimes up to 10x, in heterogeneous environments where stragglers are common because their synchronization protocols cannot fit a heterogeneous setting. Our first contribution is a heterogeneity-aware algorithm that uses a constant learning rate schedule for updates before adding them to the global parameter. This allows us to suppress stragglers' harm on robust convergence. As a further improvement, our second contribution is a more sophisticated learning rate schedule that takes into consideration the delayed information of each update. We theoretically prove the valid convergence of both approaches and implement a prototype system in the production cluster of our industrial partner Tencent Inc. We validate the performance of this prototype using a range of machine-learning workloads. Our prototype is 2-12x faster than other state-of-the-art systems, such as Spark, Petuum, and TensorFlow; and our proposed algorithm takes up to 6x fewer iterations to converge.
... Intra-query parallelism is a key factor in quickly answering analytical queries on large-scale databases [5]-[7]. One major direction is multi-node processing, which scatters the query processing work among multiple processing nodes [8]-[11]. Employing a higher number of processing nodes potentially allows the query processing to utilize more underlying resources, such as processing capacity and I/O bandwidth, thus speeding up the query. ...
Article
Parallel processing is a typical approach to answer analytical queries on large databases. As the size of the database increases, we often try to increase the parallelism by incorporating more processing nodes. However, this approach increases the possibility of node failure as well. According to the conventional practice, if a failure occurs during query processing, the database system restarts the query processing from the beginning. Such temporal cost may be unacceptable to the user. This paper proposes a fault-tolerant query processing mechanism, named PhoeniQ, for analytical parallel database systems. PhoeniQ continuously takes a checkpoint for every operator pipeline and replicates the output of each stateful operator among different processing nodes. If a single processing node fails during query processing, another can promptly take over the processing. Hence, PhoeniQ allows the database system to efficiently resume query processing after a partial failure event. This paper presents the key design of PhoeniQ and prototype-based experiments to demonstrate that PhoeniQ imposes negligible performance overhead and efficiently continues query processing in the face of node failure.
... Range partitioning splits a set of keys based on a given set of range splitters [GD90], which can be derived, e.g., from quantiles [DNS91] or samples [MRL98]. It is used to, e.g., horizontally partition (shard) the data in distributed database systems [HCZZ21]. ...
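A minimal sketch of the splitter-based range partitioning summarized in this excerpt, assuming the splitters are simply taken as evenly spaced quantiles of a sample (the function names and the Gaussian test data are illustrative):

    import bisect, random

    def choose_splitters(sample, num_partitions):
        """Derive num_partitions - 1 splitters as evenly spaced quantiles of a sample."""
        ordered = sorted(sample)
        return [ordered[(i * len(ordered)) // num_partitions]
                for i in range(1, num_partitions)]

    def partition_of(key, splitters):
        """Binary-search the splitters: keys below the first splitter map to
        partition 0, keys at or above the last one to the last partition."""
        return bisect.bisect_right(splitters, key)

    sample = [random.gauss(0, 1) for _ in range(10_000)]
    splitters = choose_splitters(sample, num_partitions=8)
    counts = [0] * 8
    for key in (random.gauss(0, 1) for _ in range(100_000)):
        counts[partition_of(key, splitters)] += 1
    print(counts)  # roughly balanced, because the splitters track the data distribution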
... We describe two of the main works: the Multi-Attribute GrId deClustering (MAGIC) [GD94] and Bubba's Extended Range Declustering (BERD) [BAC + 90]. MAGIC is an extension of the Hybrid-Range partitioning strategy published in [GD90] that strikes a compromise between the sequential execution paradigm of range declustering and the intra-query parallelism achieved with hash and round-robin. ...
Thesis
The Resource Description Framework (RDF) and SPARQL are very popular graph-based standards initially designed to represent and query information on the Web. The flexibility offered by RDF motivated its use in other domains, and today RDF datasets are great information sources. They gather billions of triples in Knowledge Graphs that must be stored and efficiently exploited. The first generation of RDF systems was built on top of traditional relational databases. Unfortunately, performance in these systems degrades rapidly, as the relational model is not suitable for handling RDF data inherently represented as a graph. Native and distributed RDF systems seek to overcome this limitation. The former mainly use indexing as an optimization strategy to speed up queries. Distributed and parallel RDF systems resort to data partitioning. The logical representation of the database is crucial to designing data partitions in the relational model. The logical layer defining the explicit schema of the database provides a degree of comfort to database designers. It lets them choose, manually or automatically (through advisors), the tables and attributes to be partitioned. Besides, it allows the core partitioning concepts to remain constant regardless of the database management system. This design scheme is no longer valid for RDF databases, essentially because the RDF model does not explicitly enforce a schema, since RDF data is mostly implicitly structured. Thus, the logical layer is nonexistent and data partitioning depends strongly on the physical implementation of the triples on disk. This situation leads to different partitioning logics depending on the target system, which is quite different from the relational model's perspective. In this thesis, we promote the novel idea of performing data partitioning at the logical level in RDF databases. Thereby, we first process the RDF data graph to support logical entity-based partitioning. After this preparation, we present a partitioning framework built upon these logical structures. This framework is accompanied by data fragmentation, allocation, and distribution procedures. The framework was incorporated into a centralized (RDF_QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that confirmed the feasibility of integrating our framework into existing systems, improving their performance for certain queries. Finally, we design a set of RDF data partitioning management tools, including a data definition language (DDL) and an automatic partitioning wizard.
... Atte et al. (2011) adapted the range-based partitioning method (DeWitt et al., 1992) for MapReduce, which primarily divides the join attribute into subranges that have approximately the same number of records in order to balance the workload in subsequent processing. The introduced skew handling algorithm (SAND Join) utilizes two partitioning methods: the simple range partitioner and the virtual processor range partitioner (DeWitt & Ghandeharizadeh, 1990). The simple range partitioner collects samples from each input split and merges them into table T. ...
Article
Full-text available
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for large-scale data processing. However, it has some limitations in processing heterogeneous datasets. In this study, we review the state-of-the-art strategies for joining two datasets based on an equi-join condition and provide a detail implementation for each strategy. We also present an in-depth analysis of the join strategies and discuss their feasibilities and limitations to assist the reader in selecting the most efficient strategy. Concluding, we outline interesting directions for future join processing systems.
... Prior literature reviews mainly examine placement schemes without referring to the problem of declustering [48,41,69,67]. To the best of our knowledge, heuristic approaches are mainly adopted for the creation of the declustered objects, such as in the Virtual Microscope [151,65], where the application developer is responsible for generating the declustered image data, and the studies presented by Faloutsos and Bhagwat [64], Chen and Shinha [51] and Ghandeharizadeh and DeWitt [158], where objects are assumed to be declustered across all their coordinate axes. ...
Article
This copy of the thesis has been supplied on the condition that anyone who consults it is understood to recognise that the copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without the prior written consent of the author or the university (as may be appropriate).
... Range partitioning has long been studied for assigning tuples to partitions on the basis of ranges rather than hash values of a join key. [23] sorts input datasets according to join keys. Depending on the processing capability of the system, the number of tuples T to be allocated to each partition is determined. ...
... Finally, we note that partitioning algorithms for relational databases have been studied extensively in the past, but the focus has been on data declustering and physical design tuning in order to speed up query evaluation by taking advantage of parallelization. Partitioning strategies include range partitioning, hash partitioning, and partitioning based on query cost models [5,23,21,19,12,18,20]. Furthermore, modern distributed storage systems, such as BigTable [7], are not optimized for relational workloads on multiple tables. ...
Article
With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, relational workloads cannot always be evaluated efficiently using MapReduce without extensive data migrations, which cause network congestion and reduced query throughput. We study the problem of computing data placement strategies that minimize the data communication costs incurred by typical relational query workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of Graph Partitioning, which is NP-Hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives (not communication cost). We study several practical extensions of the problem: with load balancing, with replication, with materialized views, and with complex query plans consisting of sequences of intermediate operations that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the Graph Partitioning solution of the no-replication case. Using the TPC-DS workload, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.
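The reduction described in this abstract can be pictured with a toy sketch: data items become vertices, co-access frequencies become edge weights, and a graph-partitioning routine splits the items across servers so that heavily co-accessed items stay together. The use of networkx's Kernighan-Lin bisection instead of METIS, and the hand-made workload, are simplifications for illustration; the paper's exact cost model and IP formulations are not reproduced.

    import itertools
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    # A toy workload: each query lists the data items (e.g., table fragments) it touches.
    workload = [
        {"orders", "lineitem"}, {"orders", "lineitem"}, {"orders", "customer"},
        {"part", "partsupp"}, {"part", "partsupp"}, {"supplier", "partsupp"},
    ]

    # Vertices are data items; an edge weight counts how often two items are used
    # by the same query, i.e. the communication cost of placing them apart.
    G = nx.Graph()
    for query in workload:
        for a, b in itertools.combinations(sorted(query), 2):
            w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=w)

    # Bisect the graph so the total weight of cut edges (cross-server traffic) is
    # small; a k-way partitioner such as METIS plays this role at larger scale.
    server_a, server_b = kernighan_lin_bisection(G, weight="weight")
    print(sorted(server_a), sorted(server_b))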
... Instead of using hash partitioning, which directs a skewed key to a single processing node, we use range-based partitioning, which distributes the skewed keys among a number of processing nodes. Range-based partitioning exploits the characteristics of the data for load balancing; two such strategies are simple range partitioning and virtual processor range partitioning [7]. ...
Article
Full-text available
The simplicity and flexibility of the MapReduce framework have motivated programmers of large scale distributed data processing applications to develop their applications using this framework. However, the implementations of this framework, including Hadoop, do not handle skew in the input data effectively. Skew in the input data results in poor load balancing which can swamp the benefits achievable by parallelization of applications on such parallel processing frameworks. The performance of join operation, which is the most expensive and most frequently executed operation, is severely degraded in the presence of heavy skew in the input datasets to be joined. Hadoop's implementation of the join operation cannot effectively handle such skewed joins, attributed to the use of hash partitioning for load distribution. In this work, we introduce “Skew hANDling Join” (SAND Join) that employs range partitioning instead of hash partitioning for load distribution. Experiments show that SAND Join algorithm can efficiently perform joins on the datasets that are sufficiently skewed. We also compare the performance of this algorithm with that of Hadoop's join algorithms.
... Unlike the other declustering strategies, the hybrid-range partitioning strategy utilizes the characteristics of the queries that access a relation to obtain the appropriate degree of intra-query parallelism. In particular, it strikes a compromise between the sequential execution paradigm of the range declustering strategy and the load balancing/intra-query parallelism characteristics of the hash and round-robin declustering strategies [10]. In our system, we choose the range partitioning method to divide our data. ...
... This strategy has been termed as declustering. Early studies on declustering for parallel range searching date back to the beginning of the 90's [56, 64, 49]. However, newer declustering techniques have been found and there has been plenty of recent attention to this topic. ...
Chapter
Parallel processing is a flagship approach for answering analytical queries on large-scale database. As the database scale increases, a larger number of processing nodes are likely to be incorporated to increase the degree of parallelism. However, this solution results in an increased probability of node failure. If such a failure happens during query processing, the processing often has to restart from scratch. This temporal cost may not be acceptable for the user. In this paper, we propose PhoeniQ, a fault-tolerant query processing mechanism for analytical parallel database systems. PhoeniQ takes a package-level checkpoint for every operator pipeline and replicates the output of stateful operators among different processing nodes. If a single processing node fails during processing, another node is enabled to resume the execution state of the failed node, so that the query can continue to run. This paper presents our intensive experiments based on our prototype, which demonstrate that PhoeniQ can continue the query processing in the face of node failures with significantly smaller cost than the conventional approach.
Conference Paper
Big Data has brought great challenges to traditional DBMSs. Nearly all Big Data management systems choose to handle Big Data by partitioning it according to some specific attributes and distributing the partitioned data across a cluster. However, when data relationships and query requirements are exceedingly complex, it is difficult to decide which attributes should be chosen to partition the data, because a data table can only be partitioned in exactly one specific way, while different query requirements may need different partition plans that conflict with each other. At the same time, replication is a common approach to obtaining high fault tolerance. Consequently, it is reasonable to improve system performance by using multiple replicas to resolve partition conflicts. In this paper we analyze the reasons for these contradictions and present a method to identify and resolve them. Our method first classifies queries into different categories by their requirements, and then uses a partition algorithm to search for the optimal partition plan for each query category. By introducing a two-tier server architecture, we make more effective use of these replicas. TPC-E and TPC-H are chosen to evaluate our method; the evaluation results demonstrate that it can improve system performance by up to 4x over a single-partition-plan method.
Article
Full-text available
A fully self-managed DBMS that does not require administrator intervention is the ultimate goal of database developers. Such a system should automate deployment, configuration, administration, monitoring, and tuning tasks. Although there are some advances in this field, self-managed technology is largely not ready for industrial use and remains an active area of research. One of the most crucial tasks for such a system is automated physical design tuning. A self-managed approach to this task implies that the physical design of a database should be automatically adapted to changing workloads. The problems of materialized view and index selection, data allocation, and horizontal and vertical partitioning have been studied for a long time, and hundreds of approaches have been developed. However, most of these approaches are static and thus unsuitable for self-managed systems. In this paper we discuss the prospects of an adaptive distributed relational column-store. We show that the column-store approach holds great promise for the construction of an efficient self-managed database. First, we present a short survey of existing physical design studies and provide a classification of approaches. In the survey, we highlight the self-managed aspects. Then, we provide some views on the organization of a self-managed distributed column-store system. We discuss its three core components: an alerter, a reorganization controller, and a set of physical design options (actions) available to such a system. We present possible approaches for each of these components and evaluate them. Several physical design problems are formulated and discussed. This study is the first step towards the creation of an adaptive distributed column-store system.
Conference Paper
In cloud computing environments, data is growing exponentially, which raises new challenges in efficient distributed data storage and management for large-scale OLTP and OLAP applications. Horizontal and vertical database partitioning can improve the performance and manageability of shared-nothing systems, which are popular nowadays. However, existing partitioning techniques cannot deal with dynamic information efficiently and cannot produce partitioning strategies in real time. In this paper, we present WSPA: a workload-driven stream vertical partitioning approach based on a streaming framework. We construct an affinity matrix to capture the mapping information from a workload, cluster attributes according to attribute affinity, and then obtain the optimal partitioning scheme via a cost model. The experimental results show that WSPA has good partitioning quality and lower time complexity than existing vertical partitioning methods. It is an efficient partitioning method for processing dynamic and large-scale query workloads.
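The affinity-matrix step mentioned in this abstract follows a classic pattern in vertical partitioning; the sketch below shows one conventional way such a matrix can be built from a workload (the clustering and cost-model steps of WSPA are omitted, and all names are illustrative):

    from collections import defaultdict

    def attribute_affinity(workload):
        """Affinity of an attribute pair = total frequency of the queries that
        access both attributes; pairs with high affinity are candidates for the
        same vertical fragment."""
        affinity = defaultdict(int)
        for attrs, freq in workload:
            for a in attrs:
                for b in attrs:
                    if a != b:
                        affinity[(a, b)] += freq
        return affinity

    # (attributes accessed, frequency) pairs observed in the query stream
    workload = [({"id", "name"}, 40), ({"id", "salary", "dept"}, 25), ({"name", "dept"}, 10)]
    aff = attribute_affinity(workload)
    print(aff[("salary", "dept")])  # 25: always accessed together in this workload
    print(aff[("name", "salary")])  # 0: never accessed together, a natural split point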
Article
In this paper we survey various DBMS physical design options. We consider both vertical and horizontal partitioning, and briefly cover replication. This survey is not limited to local systems, but also includes distributed ones. The latter add an interesting new question: how to actually distribute data among several processing nodes. Aside from theoretical approaches, we consider practical ones implemented in contemporary DBMSs. We cover these aspects not only from the user perspective, but also from the architect and programmer perspectives.
Conference Paper
Host-side caches use a form of storage that is faster than disk and less expensive than DRAM, e.g., NAND flash, to deliver the speed demanded by data-intensive applications. A host-side cache may integrate into an existing application seamlessly using an infrastructure component (such as a storage stack middleware or the operating system) to intercept the application's read and write requests for disk pages, populate the flash cache with disk pages, and use the flash to service read and write requests intelligently. This study provides an overview of host-side caches, an analysis of their overhead and costs to justify their use, alternative architectures including the use of the emerging Non-Volatile Memory (NVM) for the host-side cache, and future research directions. Results from Dell's Fluid Cache demonstrate that it enhances the performance of a social networking workload by a factor of 3.6 to 18.
Chapter
A prerequisite for the parallel processing of queries and the exploitation of data parallelism in parallel database systems is a suitable data distribution, so that multiple processes can work in parallel on disjoint portions of the data. While shared-everything and shared-disk systems only involve distributing the data across multiple disks or external storage devices, shared nothing additionally requires distributing the data among the processing nodes. In this architecture, the data distribution therefore also has a direct impact on communication overhead and is thus of particular importance for performance. We therefore focus largely on determining the data distribution for shared nothing, which, similarly to distributed DBSs, requires the steps of fragmentation and allocation. To achieve effective parallelization, however, these tasks must be coordinated more closely. In particular, it is advisable to fix the degree of declustering of a relation even before fragmentation. In presenting the individual steps, we discuss variants of horizontal fragmentation in particular detail, also covering multidimensional and multi-level approaches. We then present three data allocation schemes with replicated data placement that are applicable to shared nothing. Afterwards we discuss data distribution for shared everything and shared disk. At the end of the chapter we address data distribution in NoSQL systems.
Conference Paper
In this paper, we discuss a self-managed distributed column-store system that would adapt its physical design to changing workloads. Architectural novelties of column-stores hold great promise for the construction of an efficient self-managed database. First, we present a short survey of existing self-managed systems. Then, we provide some views on the organization of a self-managed distributed column-store system. We discuss its three core components: an alerter, a reorganization controller, and the set of physical design options (actions) available to such a system. We present possible approaches to each of these components and evaluate them. This study is the first step towards the creation of an adaptive distributed column-store system.
Article
This thesis defines an approach for exploring parallelism in object-oriented database systems outside the context of SQL queries. We propose a technique for parallelizing transactions in a flat, classical transaction model where a transaction is a sequence of operations. Intra-transaction parallelism is accomplished by transforming the transaction code definition in order to execute operations in parallel. Our approach for exploring parallelism inside applications first extends the intra-transaction parallelization model so that a transaction is considered a unit of parallelization. We then consider a nested transaction model for exploring parallelism inside applications. We develop a parallelization model for applications in which we merge the capabilities for parallel execution already provided by nested transactions with our approach of transaction parallelization by transformation. We implemented a prototype of the intra-transaction parallelization model using the O2 object-oriented database system. The prototype introduces parallel execution inside O2 transactions through the creation and synchronization of threads inside an O2 client running an application. Our prototype runs on a uniprocessor Unix-like workstation and supports virtual parallelism. We also applied the transaction parallelization model to the NAOS Rule System. Our approach considers the set of rules of an execution cycle for parallelization. We build an execution plan for the rules of a cycle that defines sequential or parallel execution of the rules.
Article
With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, data-intensive scientific workflows and join-intensive queries cannot always be evaluated efficiently using MapReduce-style processing without extensive data migrations, which cause network congestion and reduced query throughput. In this paper, we study the problem of computing data placement strategies that minimize the data communication costs incurred by such workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of Graph Partitioning, which is NP-Hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives. We study several practical extensions of the problem: with load balancing, with replication, and with complex workflows consisting of multiple steps that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the Graph Partitioning solution of the no-replication case. Using a workload based on TPC-DS, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.
Article
Scaling complex transactional workloads in parallel and distributed systems is a challenging problem. When transactions span data partitions that reside in different nodes, significant overheads emerge that limit the throughput of these systems. In this paper, we present a low-overhead data partitioning approach, termed JECB, that can reduce the number of distributed transactions in complex database workloads such as TPC-E. The proposed approach analyzes the transaction source code of the given workload and the database schema to find a good partitioning solution. JECB leverages partitioning by key-foreign key relationships to automatically identify the best way to partition tables using attributes from tables. We experimentally compare our approach with the state of the art data-partitioning techniques and show that over the benchmarks considered, JECB provides better partitioning solutions with significantly less overhead.
Conference Paper
With the development of Internet technology and Cloud Computing, more and more applications are confronted with the challenges of big data. NoSQL databases are well suited to the management of big data because of their high scalability, high availability and high fault tolerance. The data partitioning strategy plays an important role in a NoSQL database. Existing data partitioning strategies can cause problems such as low scalability, hot spots and low performance. In this paper we propose a new data partitioning strategy---HRCH, which partitions the data in a more reasonable way. Finally, we use experiments to verify the effectiveness of HRCH. The results show that HRCH can improve the scalability of the system, avoid the hot-spot problem as far as possible, and increase the degree of parallelism to improve the system's performance for some workloads.
Conference Paper
Zipfian distribution is used extensively to generate workloads to test, tune, and benchmark data stores. This paper presents a decentralized implementation of this technique, named D-Zipfian, using N parallel generators to issue requests. A request is a reference to a data item from a fixed population of data items. The challenge is for each generator to reference a disjoint set of data items. Moreover, they should finish at approximately the same time by performing work proportional to their processing capability. Intuitively, D-Zipfian assigns a total probability of 1/N to each of the N generators and requires each generator to reference data items with a scaled probability. In the case of heterogeneous generators, the total probability of each generator is proportional to its processing capability. We demonstrate the effectiveness of D-Zipfian using empirical measurements of the chi-square statistic.
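One way to realize the scheme this abstract describes, not necessarily the paper's exact construction, is to hand each generator a disjoint subset of the items whose Zipfian probabilities sum to roughly its capability-proportional share, and to let each generator sample only within its subset with renormalized probabilities:

    import random

    def zipf_probs(num_items, theta=0.99):
        """Global Zipfian probabilities over a fixed population of data items."""
        weights = [1.0 / (i ** theta) for i in range(1, num_items + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def split_items(probs, shares):
        """Hand items (heaviest first) to the generator currently furthest below
        its target share, so the subsets are disjoint and each generator owns
        roughly shares[g] of the total probability mass."""
        owned = [[] for _ in shares]
        load = [0.0] * len(shares)
        for item in sorted(range(len(probs)), key=lambda i: -probs[i]):
            g = min(range(len(shares)), key=lambda j: load[j] / shares[j])
            owned[g].append(item)
            load[g] += probs[item]
        return owned

    probs = zipf_probs(10_000)
    shares = [0.5, 0.25, 0.25]  # heterogeneous generators: capability-proportional shares
    owned = split_items(probs, shares)

    def generate(g, count):
        """Generator g samples only from its own disjoint items, with the global
        Zipfian probabilities renormalized over that subset."""
        items = owned[g]
        return random.choices(items, weights=[probs[i] for i in items], k=count)

    print(generate(0, 5))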
Conference Paper
This paper compares the performance of an SQL solution that implements a relational data model with a document store named MongoDB. We report on the performance of a single node configuration of each data store and assume the database is small enough to fit in main memory. We analyze utilization of the CPU cores and the network bandwidth to compare the two data stores. Our key findings are as follows. First, for those social networking actions that read and write a small amount of data, the join operator of the SQL solution is not slower than the JSON representation of MongoDB. Second, with a mix of actions, the SQL solution provides either the same performance as MongoDB or outperforms it by 20%. Third, a middle-tier cache enhances the performance of both data stores as query result look up is significantly faster than query processing with either system.
Conference Paper
Parallel spatial databases are an inevitable development trend for high-performance spatial data management. A key problem in improving the performance of a parallel spatial database is to achieve a balanced distribution of spatial data between parallel nodes, especially in a shared-nothing parallel architecture. Given the unique characteristics of spatial objects, the existing traditional data declustering methods find it difficult to obtain good declustering results when facing unstructured, variable-length spatial objects and intricate spatial locality relations. To address this, this paper presents a Hilbert-curve-based spatial data declustering method: it uses rectangular grids to partition the data space, assigns a unique Hilbert code to each sub-grid according to the order in which the Hilbert curve passes through it, uses the Hilbert codes to impose a linear ordering on the multidimensional spatial objects, and then declusters the spatial objects based on their Hilbert codes to preserve the spatial locality between them. To avoid the code conflict problem between different sub-grids caused by hierarchically decomposing sub-grids, a fake Hilbert code strategy is given. Experimental results show that the proposed method attains a well-balanced spatial data distribution while preserving the spatial locality of the data objects in each declustering unit.
Article
Efficient storage and retrieval of multi-attribute data sets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses. Although the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. The authors derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods, and this is corroborated by extensive simulation experiments. From the practical point of view, the formulas given in the paper provide a simple measure that can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.
Article
Full-text available
Declustering techniques reduce query response times through parallel I/O by distributing data among parallel disks. Recently, replication-based approaches were proposed to further reduce the response time. Efficient retrieval of replicated data from multiple disks is a challenging problem. Existing retrieval techniques are designed for storage arrays with identical disks, having no initial load or network delay. In this article, we consider the generalized retrieval problem of replicated data, where the disks in the system might be heterogeneous, the disks may have initial load, and the storage arrays might be located on different sites. We first formulate the generalized retrieval problem using a Linear Programming (LP) model and solve it with mixed integer programming techniques. Next, the generalized retrieval problem is formulated as a more efficient maximum flow problem. We prove that the retrieval schedule returned by the maximum flow technique yields the optimal response time, and this result matches the LP solution. We also propose a low-complexity online algorithm for the generalized retrieval problem that does not guarantee the optimality of the result. The performance of the proposed and state-of-the-art retrieval strategies is investigated using various replication schemes, query types, query loads, disk specifications, network delays, and initial loads.
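The maximum-flow formulation mentioned in this abstract can be sketched as a feasibility check: blocks requested by a query are matched to the disks holding their replicas without exceeding each disk's remaining budget. The capacities below assume a unit retrieval cost per block; the article's full model, which also covers heterogeneous disks, initial loads and network delays, is not reproduced here.

    import networkx as nx

    def feasible(requested, replicas, budget):
        """True if every requested block can be fetched while no disk serves
        more than its remaining budget, checked with one max-flow computation."""
        G = nx.DiGraph()
        for b in requested:
            G.add_edge("src", ("blk", b), capacity=1)
            for d in replicas[b]:                    # disks holding a copy of block b
                G.add_edge(("blk", b), ("disk", d), capacity=1)
        for d, cap in budget.items():                # blocks disk d may still serve
            G.add_edge(("disk", d), "sink", capacity=cap)
        value, _ = nx.maximum_flow(G, "src", "sink")
        return value == len(requested)

    replicas = {0: [0, 1], 1: [1, 2], 2: [0, 2], 3: [2, 3]}
    budget = {0: 1, 1: 1, 2: 1, 3: 1}                # e.g. the candidate time allows 1 block per disk
    print(feasible([0, 1, 2, 3], replicas, budget))  # True: one block per disk suffices

Binary-searching over the candidate response time and rerunning the check with the corresponding per-disk budgets then yields the smallest feasible response time, in the spirit of the optimality result stated above.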
Article
Various strategies have been developed to assist in determining an effective data placement for a parallel database system. However, little work has been done on assessing the relative performance obtained from different strategies. This paper studies the effects of different data placement strategies on a shared-nothing parallel relational database system when the number of disks attached to each node is varied. Three representative strategies have been used for the study and the performance of the resulting configuration has been assessed in the context of the TPC–C benchmark running on a specific parallel system. Results show an increase in sensitivity to data placement strategy with increasing number of disks per node.
Article
In a shared-disk database cluster, the overhead of data synchronization between nodes cannot be ignored, especially when the interconnect network is not fast enough; it is then likely to become the system bottleneck. However, most of this overhead can be avoided. This paper presents a logical data partitioning method that decouples the nodes and reduces the interconnect communication overhead by using a data-oriented task dispatching strategy. Because the partitioning method is logical, it is easy to change in order to avoid data skew. The experimental results show that the proposed method greatly reduces the amount of communicated data and improves system performance.
Article
Load balancing is an important factor in achieving better performance for a heterogeneous database cluster. In order to evaluate the impact of the various resources of heterogeneous nodes on different types of applications, a weighted load index that takes both the utilization of different resources and the types of workloads into consideration is introduced, based on a load balancer architecture. An efficient, dynamic load balancing scheme is then proposed for OLTP workloads to maximize the utilization of distributed resources and achieve better performance; it does not need to collect load-information feedback from the lower nodes and effectively avoids data skew. The simulation results for OLTP services obtained with the TPC-C tool show that the dynamic weighted balancing policy keeps the heterogeneous cluster well balanced and leads to sublinear throughput speedup.
Article
Full-text available
Autonomous Disks (1) are high-function disks that realize indexed shared-nothing parallel data storage and implement fault tolerance as well as handling of data distribution and access skews transparently to client systems. A database should be capable of storing data items of arbitrary size, and declustering such data items across several disks can enhance performance and facilitate load balancing. This paper describes the design changes made to the Autonomous Disks system to enable it to handle arbitrary-size data and offer a flexible and extensible way of declustering such data.
Article
Spatial data declustering is an important data processing method for parallel spatial databases, especially in a shared-nothing parallel architecture. Spatial data declustering enables parallel dataflow that exploits the I/O bandwidth of multiple parallel nodes by reading and writing in parallel, which can markedly improve the performance of a parallel spatial database. Addressing the unique locality of spatial objects, this paper presents a novel spatial data declustering method, which uses a Hilbert space-filling curve to impose a linear ordering on multidimensional spatial objects, partitions the spatial objects into logical segments according to this ordering to preserve their spatial locality, and then allocates the logical segments to physical parallel nodes in a round-robin fashion. Experimental results show that the proposed method obtains good spatial data declustering results.
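The two steps described here, Hilbert linearization followed by round-robin allocation of logical segments, can be sketched as follows; the grid resolution, the segment size, and the use of integer grid-cell coordinates for the objects are illustrative simplifications:

    def hilbert_index(n, x, y):
        """Distance of grid cell (x, y) along the Hilbert curve filling an n x n
        grid (n a power of two); nearby cells receive nearby distances, which is
        what preserves spatial locality."""
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if (x & s) > 0 else 0
            ry = 1 if (y & s) > 0 else 0
            d += s * s * ((3 * rx) ^ ry)
            if ry == 0:                       # rotate/flip the quadrant
                if rx == 1:
                    x, y = n - 1 - x, n - 1 - y
                x, y = y, x
            s //= 2
        return d

    def decluster(objects, n, segment_size, num_nodes):
        """Sort objects by Hilbert code, cut the ordering into logical segments,
        and deal the segments to the parallel nodes round-robin."""
        ordered = sorted(objects, key=lambda o: hilbert_index(n, o[0], o[1]))
        assignment = {}
        for seg, start in enumerate(range(0, len(ordered), segment_size)):
            for obj in ordered[start:start + segment_size]:
                assignment[obj] = seg % num_nodes
        return assignment

    # Example: objects reduced to cells of a 16 x 16 grid, 4 parallel nodes.
    objs = [(x, y) for x in range(16) for y in range(16)]
    print(decluster(objs, n=16, segment_size=8, num_nodes=4)[(3, 2)])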
Article
Full-text available
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.
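The q-gram machinery this abstract relies on can be sketched outside the DBMS as follows. The padded q-gram construction and the count filter (strings within edit distance k share at least max(|s1|, |s2|) + q - 1 - k*q padded q-grams) follow the standard formulation; the paper's SQL rewriting and its positional and length filters are not shown.

    from collections import Counter

    def qgrams(s, q=3, pad="#"):
        """Padded q-grams of s (q - 1 pad characters on each side), as in the
        standard q-gram construction for approximate string matching."""
        padded = pad * (q - 1) + s + pad * (q - 1)
        return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

    def may_match(s1, s2, k, q=3):
        """Count filter: if edit_distance(s1, s2) <= k, the padded q-gram bags
        share at least max(|s1|, |s2|) + q - 1 - k*q q-grams, so strings failing
        this cheap test need no expensive edit-distance verification."""
        shared = sum((qgrams(s1, q) & qgrams(s2, q)).values())
        return shared >= max(len(s1), len(s2)) + q - 1 - k * q

    print(may_match("declustering", "declustring", k=1))   # True: survives the filter
    print(may_match("declustering", "partitioning", k=1))  # False: pruned cheaply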
Article
Parallel databases based on the shared-nothing architecture are ideally suited for the increasingly demanding data management needs of enterprise decision support systems. In an ideal parallel system, twice as many nodes can perform twice as large a task in the same time, resulting in linear scale-up; or, twice as many nodes can perform the same task twice as quickly, resulting in linear speed-up. Round-robin, hash, range, hybrid, and other declustering techniques ensure that the needed data is available at each node for processing, and thus help approximate the ideal scalability characteristics. Parallelism can be applied to each of the relational operators. For the select operator, interquery parallelism executes several relational queries simultaneously; interoperator parallelism executes several operations within the same query simultaneously; and intraoperator parallelism is employed to each operator within a query.
Article
In this paper, we will discuss a highly dependable system configuration method and recovery method for our proposed Fat-Btree structure, which is a directory structure for shared-nothing parallel computers. The goals are to enhance the service of operation during failure, to minimize the probability of data loss, and to enhance availability by combining physiological logging, logical logging, disk mirroring, staggered allocation, etc. Various system configurations formed by the combination of these methods will be quantitatively evaluated in this paper. As a result, it will be shown that a combination of physiological logging and disk mirroring is most appropriate when hardware cost can be ignored, and that a combination of global logical logging and physiological logging is most appropriate when hardware cost is considered. © 2006 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 89(12): 42–58, 2006; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.20288
Chapter
Efficient browsing and retrieval of geographically referenced information requires the allocation of data on different storage devices for concurrent retrieval. By dividing a two-dimensional space into tiles, a system can allow users to specify regions of interest using a query rectangle and then retrieve all information related to tiles overlapping with the query. In this paper, we derive the necessary and sufficient conditions for strictly optimal allocations of two-dimensional data. These methods, when they exist, guarantee that for any query the minimum number of tiles are assigned to the same storage device, and hence ensure maximal retrieval concurrency.
Article
In a heterogeneous database cluster, the performance of load balancing is closely related to the computing capabilities of the heterogeneous nodes and the different types of workloads. Thus, a method is introduced to evaluate the load status of the nodes through weighted load values that take into account both the utilization of different resources and the workload types in a load balancer. An efficient, dynamic load balancing scheme is then proposed for OLTP (online transaction processing) workloads to maximize the utilization of distributed resources and achieve better performance; it does not need to collect load-information feedback from the lower nodes and effectively avoids data skew. The simulation results for OLTP services obtained with the TPC-C tool show that the dynamic weighted balancing policy leads to sub-linear throughput speedup and keeps the heterogeneous cluster well balanced.
Article
Full-text available
Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
Article
In this paper we present data distribution methods for a parallel processing environment. The primary objective is to process partial match retrieval type queries on parallel devices. The main contribution of this paper is the development of a new approach called FX (Fieldwise eXclusive) distribution for maximizing data access concurrency. An algebraic property of the exclusive-or operation and field transformation techniques are fundamental to this data distribution technique. We have shown through theorems and corollaries that the FX distribution approach performs better than other methods proposed earlier. We have also shown, by computing the probability of optimal distribution and the query response time, that FX distribution gives better performance than others over a large class of partial match queries. This approach presents a new basis on which optimal data distribution for more general types of queries can be formulated.
Article
In dataflow architectures, each dataflow operation is typically executed on a single physical node. We are concerned with distributed data-intensive systems, in which each base (i.e., persistent) set of data has been declustered over many physical nodes to achieve load balancing. Because of large base set size, each operation is executed where the base set resides, and intermediate results are transferred between physical nodes. In such systems, each dataflow operation is typically executed on many physical nodes. Furthermore, because computations are data-dependent, we cannot know until run time which subset of the physical nodes containing a particular base set will be involved in a given dataflow operation. This uncertainty creates several problems. We examine the problems of efficient program loading, dataflow-operation activation and termination, control of data transfer among dataflow operations, and transaction commit and abort in a distributed data-intensive system. We show how these problems are interrelated, and we present a unified set of mechanisms for efficiently solving them. For some of the problems, we present several solutions and compare them quantitatively.
Article
This paper presents the results of an initial performance evaluation of the Gamma database machine. In our experiments we measured the effect of relation size and indices on response time for selection, join, and aggregation queries, and single-tuple updates. A Teradata DBC/1012 database machine of similar size is used as a basis for interpreting the results obtained. We also analyze the performance of Gamma relative to the number of processors employed and study the impact of varying the memory size and disk page size on the execution time of a variety of selection and join queries. We analyze and interpret the results of these experiments based on our understanding of the system hardware and software, and conclude with an assessment of the strengths and weaknesses of Gamma.
Article
The optimization problem discussed in this paper is the translation of an SQL query into an efficient parallel execution plan for a multiprocessor database machine under the performance goal of reduced response times as well as increased throughput in a multiuser environment. We describe and justify the most important research problems which have to be solved to achieve this task, and we explain our approach to solve these problems.
Article
Cartesian product files have recently been shown to exhibit attractive properties for partial match queries. This paper considers the file allocation problem for Cartesian product files, which can be stated as follows: Given a k-attribute Cartesian product file and an m-disk system, allocate buckets among the m disks in such a way that, for all possible partial match queries, the concurrency of disk accesses is maximized. The Disk Modulo (DM) allocation method is described first, and it is shown to be strictly optimal under many conditions commonly occurring in practice, including all possible partial match queries when the number of disks is 2 or 3. It is also shown that although it has good performance, the DM allocation method is not strictly optimal for all possible partial match queries when the number of disks is greater than 3. The General Disk Modulo (GDM) allocation method is then described, and a sufficient but not necessary condition for strict optimality of the GDM method for all partial match queries and any number of disks is derived. Simulation studies comparing the DM and random allocation methods in terms of the average number of disk accesses, in response to various classes of partial match queries, show the former to be significantly more effective even when the number of disks is greater than 3, that is, even in cases where the DM method is not strictly optimal. The results that have been derived formally and shown by simulation can be used for more effective design of optimal file systems for partial match queries. When considering multiple-disk systems with independent access paths, it is important to ensure that similar records are clustered into the same or similar buckets, while similar buckets are dispersed uniformly among the disks.
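As a concrete illustration of the allocation rule described above, here is a small Python sketch of Disk Modulo as we read it: a bucket identified by its coordinate vector (i1, ..., ik) is placed on disk (i1 + ... + ik) mod m. The example query and parameters are invented for illustration.

```python
# Sketch of the Disk Modulo (DM) allocation rule: a bucket of a
# k-attribute Cartesian product file is identified by its coordinate
# vector (i1, ..., ik), and DM places it on disk (i1 + ... + ik) mod m.
def dm_disk(bucket: tuple[int, ...], m: int) -> int:
    """Disk assigned to the bucket with the given coordinate vector."""
    return sum(bucket) % m

if __name__ == "__main__":
    m = 3
    # Buckets matching the partial match query (a1 = 2, a2 = *) land on
    # different disks where possible, so they can be fetched concurrently.
    for i2 in range(4):
        print((2, i2), "->", dm_disk((2, i2), m))
```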
Conference Paper
Scalable parallel computers have been the Holy Grail of much of computer science research in the past two decades. There are now several products on the market, ranging from dozen-processor bus-based systems to the Connection Machine with thousands of processing elements. These products, including data management systems, use parallelism to satisfy a range of system goals including performance, availability, and cost. In this paper, I discuss parallelism issues in the context of data management. My focus is the Shared Nothing class of parallel systems, examples of which include products from Tandem and Teradata and experimental systems such as the University of Wisconsin's Gamma and MCC's Bubba. I outline a number of key research areas, emphasizing the inherent problems and the current state of the art. Next, I summarize recent performance results. The basic message of the paper is that the Holy Grail of scalable parallelism, even in the limited application domain of data management, is still elusive, but we have made several significant steps toward attaining it.
Article
Five well-known scheduling policies for movable head disks are compared using the performance criteria of expected seek time (system oriented) and expected waiting time (individual I/O request oriented). Both analytical and simulation results are obtained. The variance of waiting time is introduced as another meaningful measure of performance, showing possible discrimination against individual requests. Then the choice of a utility function to measure total performance including system oriented and individual request oriented measures is described. Such a function allows one to differentiate among the scheduling policies over a wide range of input loading conditions. The selection and implementation of a maximum performance two-policy algorithm are discussed.
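The following Python sketch illustrates the kind of comparison the paper performs, for two classic movable-head policies (first-come-first-served and shortest-seek-time-first), using total head movement over a synthetic batch of cylinder requests; the workload parameters are invented for the example and are not the paper's.

```python
# Compare FCFS and SSTF by total head movement over one request batch.
import random

def fcfs(head: int, requests: list[int]) -> int:
    """Serve requests in arrival order."""
    moved = 0
    for r in requests:
        moved += abs(r - head)
        head = r
    return moved

def sstf(head: int, requests: list[int]) -> int:
    """Always serve the pending request closest to the current head position."""
    pending, moved = list(requests), 0
    while pending:
        nxt = min(pending, key=lambda r: abs(r - head))
        moved += abs(nxt - head)
        head = nxt
        pending.remove(nxt)
    return moved

if __name__ == "__main__":
    random.seed(0)
    reqs = [random.randrange(200) for _ in range(50)]  # cylinder numbers
    print("FCFS seek:", fcfs(100, reqs), " SSTF seek:", sstf(100, reqs))
```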
Article
We investigate two schemes for placing data on multiple disks. We show that declustering (spreading each file across several disks) is inherently better than clustering (placing each file on a single disk) for a number of reasons, including parallelism and a more uniform load across all disks.
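A toy Python illustration of the two placement schemes, under our own simplifying assumptions (equal-sized blocks, round-robin striping for the declustered case):

```python
# Clustering keeps every block of a file on one disk; declustering stripes
# the blocks round-robin across all disks, so a scan can proceed on every
# disk in parallel and the I/O load is spread evenly.
def cluster(file_blocks: int, home_disk: int) -> dict[int, int]:
    """All blocks of the file live on a single disk."""
    return {home_disk: file_blocks}

def decluster(file_blocks: int, disks: int) -> dict[int, int]:
    """Blocks are spread round-robin, giving each disk an equal share."""
    load = {d: 0 for d in range(disks)}
    for b in range(file_blocks):
        load[b % disks] += 1
    return load

if __name__ == "__main__":
    print("clustered  :", cluster(100, home_disk=0))  # one disk does all the I/O
    print("declustered:", decluster(100, disks=4))    # the load is split four ways
```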
Article
We describe the implementation of a flexible data storage system for the UNIX environment that has been designed as an experimental vehicle for building database management systems. The storage component forms a foundation upon which a variety of database systems can be constructed including support for unconventional types of data. We describe the system architecture, the design decisions incorporated within its implementation, our experiences in developing this large piece of software, and the applications that have been built on top of it.
Article
The design of the Gamma database machine and the techniques employed in its implementation are described. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to hundreds of processors. First, all relations are horizontally partitioned across multiple disk drives, enabling relations to be scanned in parallel. Second, parallel algorithms based on hashing are used to implement the complex relational operators, such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques, it is possible to control the execution of very complex queries with minimal coordination. The design of the Gamma software is described, and a thorough performance evaluation of the iPSC/2 hypercube version of Gamma is presented.
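As a small illustration of the first idea, horizontal partitioning by hashing, here is a sketch (not Gamma's actual code) that routes each tuple to a home processor by hashing its partitioning attribute; the attribute values, hash function, and processor count are invented for the example.

```python
# Sketch of hash-based horizontal partitioning: each tuple is routed by a
# hash of its partitioning attribute, so a relation's tuples are spread
# over all processors/disks and can be scanned, joined, or aggregated in
# parallel.
import zlib

def home_processor(partitioning_value: str, n_processors: int) -> int:
    """Return the processor (and its disk) that stores this tuple."""
    return zlib.crc32(partitioning_value.encode()) % n_processors

if __name__ == "__main__":
    for key in ("emp_001", "emp_002", "emp_003", "emp_004"):
        print(key, "-> processor", home_processor(key, 8))
```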
GAMMA - A High Performance Dataflow Database Machine
  • D Dewitt
  • R Gerber
  • G Graefe
  • M Heytens
  • K Kumar
  • M Muralikrishna
DeWitt, D., Gerber, R., Graefe, G., Heytens, M., Kumar, K., and M. Muralikrishna, "GAMMA - A High Performance Dataflow Database Machine," Proceedings of the VLDB Conf., 1986.
A Performance Analysis of the Gamma Database Machine
  • D Dewitt
  • S Ghandeharizadeh
  • D Schneider
DeWitt, D., Ghandeharizadeh, S., and D. Schneider, "A Performance Analysis of the Gamma Database Machine," Proc. ACM SIGMOD Conf., June 1988.
RAID: Redundant Arrays of Inexpensive Disks
  • D Patterson
Patterson, D., et al., "RAID: Redundant Arrays of Inexpensive Disks," Proc. ACM SIGMOD Conf., June 1988.
Computers and Intractability: A Guide to the Theory of NP-Completeness
  • M R Garey
  • D S Johnson
Garey, M. R., and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, 1979.
Performance Analysis of Alternative Declustering Strategies
  • S Ghandeharizadeh
  • D J Dewitt
Ghandeharizadeh, S., and D. J. DeWitt, "Performance Analysis of Alternative Declustering Strategies," Proceedings of the 6th International Conference on Data Engineering, February 1990.
Database Processing Models in Parallel Processing Systems
  • S Pramanik
  • M H Kim
Pramanik, S., and M. H. Kim, "Database Processing Models in Parallel Processing Systems," in Database Machines, Boral, H., and P. Faudemay, eds., Springer-Verlag, 1989.
DBC/1012 Data Base Computer System Manual, Rel. 2.0
  • Teradata Corp
Teradata Corp., "DBC/1012 Data Base Computer System Manual, Rel. 2.0," Teradata Corp. Document No. C10-0001-02, November 1985.