Transaction Processing on Modern Hardware
Abstract
The last decade has brought groundbreaking developments in transaction processing. This resurgence of an otherwise mature research area has sprung from the diminishing cost per GB of DRAM, which allows many transaction processing workloads to be entirely memory-resident. This shift demanded a pause to fundamentally rethink the architecture of database systems. The data storage lexicon has now expanded beyond spinning disks and RAID levels to include the cache hierarchy, memory consistency models, cache coherence and write invalidation costs, NUMA regions, and coherence domains. New memory technologies promise fast non-volatile storage and expose uncharted trade-offs for transactional durability, such as exploiting byte-addressable hot and cold storage through persistent programming that promotes simpler recovery protocols. In the meantime, plateauing single-threaded processor performance has brought massive concurrency within a single node, first in the form of multi-core, and now with many-core and heterogeneous processors.
The exciting possibility of reshaping the storage, transaction, logging, and recovery layers of next-generation systems on emerging hardware has prompted the database research community to vigorously debate the trade-offs between specialized kernels that narrowly focus on transaction processing performance and designs that permit transactionally consistent data accesses from decision support and analytical workloads. In this book, we aim to classify and distill the new body of work on transaction processing that has surfaced in the last decade, to help researchers and practitioners navigate this intricate research subject.
... Given the rise of key-value stores (broadly classified as NoSQL) over the last decade, there has been a huge effort to accelerate NoSQL platforms using modern hardware [1]. One approach has been to capitalize on the ever-increasing size of main memory in each machine. ...
... HydraDB and FaRM scale well as the number of clients increases because of their use of READ for Get operations in read-intensive workloads (for a complete analysis, please see the extended version of the paper [9]). Although Pilaf uses READ for Get operations, it cannot outperform HydraDB and FaRM due to the higher number of READs and the cost of CRC64 in its design. ...
... A blockchain fabric with high performance and scalability is crucial [5,25,27,45]. PoE [26] removes one phase from Pbft [12] by introducing speculative execution. RCC [28], FlexiTrust [30], and SpotLess [32] extend single-leader protocols to multiple leaders to improve parallelism. ...
In the realm of blockchain systems, smart contracts have gained widespread adoption owing to their programmability. Consequently, developing a system capable of facilitating high throughput and scalability is of paramount importance. Directed acyclic graph (DAG) consensus protocols have demonstrated notable enhancements in both throughput and latency; however, serial execution is now becoming a bottleneck. Numerous approaches prove impractical for smart contracts because they assume that read/write sets are known a priori. This paper introduces Thunderbolt, a novel architecture built on DAG-based protocols that aims to furnish scalable, concurrent execution for smart contract transactions. Inspired by Hyperledger, Thunderbolt also extends the Execute-Order-Validate architecture, in which transactions are distributed among distinct replicas and execution outcomes are determined prior to ordering through the DAG-based protocol. Existing protocols adopt serial execution after the ordering to avoid non-determinism. Thunderbolt, however, provides parallel pre-execution before the ordering as well as parallel verification once any source of non-determinism is removed. Each replica validates transaction results during the construction of the DAG, rather than after the ordering that follows the construction, to improve latency. To enhance smart contract execution, we implement an execution engine that constructs a dependency graph to dynamically assign transaction orders, thus mitigating abort rates due to execution conflicts. Additionally, we introduce a novel shard reconfiguration to withstand malicious attacks by relocating replicas from the current DAG to a new DAG and rotating the shards among different replicas. Our comparison with serial execution on Narwhal-Tusk, using SmallBank, revealed a remarkable 50x speedup with 64 replicas.
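To make the dependency-graph execution engine mentioned above concrete, the following is a minimal C++ sketch (not Thunderbolt's actual code) of one way such an engine could assign transaction orders: conflicts on shared keys become edges from earlier-admitted to later-admitted transactions, and a topological sort yields an execution order that avoids aborts due to conflicts. All names (Txn, build_order) are hypothetical, and the input is assumed to be in admission order.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Txn {
  int id;                             // ids assigned in admission order
  std::vector<std::string> reads;
  std::vector<std::string> writes;
};

// Returns a commit/execution order in which every conflict on a key is
// resolved in favor of the earlier-admitted transaction.
std::vector<int> build_order(const std::vector<Txn>& txns) {
  std::map<std::string, std::vector<int>> touched;   // key -> txn ids, in admission order
  for (const Txn& t : txns) {
    for (const auto& k : t.reads)  touched[k].push_back(t.id);
    for (const auto& k : t.writes) touched[k].push_back(t.id);
  }
  std::map<int, std::vector<int>> succ;               // dependency edges id -> later id
  std::map<int, int> indeg;
  for (const Txn& t : txns) indeg[t.id] = 0;
  for (const auto& [key, ids] : touched)
    for (size_t i = 0; i + 1 < ids.size(); ++i)
      if (ids[i] != ids[i + 1]) { succ[ids[i]].push_back(ids[i + 1]); ++indeg[ids[i + 1]]; }
  // Kahn's algorithm; the graph is acyclic because edges always point from a
  // lower admission id to a higher one.
  std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
  for (const auto& [id, d] : indeg) if (d == 0) ready.push(id);
  std::vector<int> order;
  while (!ready.empty()) {
    int id = ready.top(); ready.pop();
    order.push_back(id);
    for (int nxt : succ[id]) if (--indeg[nxt] == 0) ready.push(nxt);
  }
  return order;
}
```

Transactions whose keys never meet end up unordered with respect to each other and can be executed in parallel; only the conflicting ones are serialized along the graph edges.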
... Achieving fault-tolerant distributed consensus is an age-old problem. Commit protocols such as Two-Phase Commit (Gray 1978), Three-Phase Commit (Skeen 1982), and EasyCommit (Gupta and Sadoghi 2018, 2020) help in reaching agreement among the participants in a partitioned distributed database (Qadah and Sadoghi 2018; Qadah et al. 2020; Sadoghi and Blanas 2019). However, commit protocols can only handle node failures and are unsafe under message delay or loss. ...
A blockchain is an append-only linked list of blocks, maintained at each participating node. Each block records a set of transactions and their associated metadata. Blockchain transactions act on the identical ledger data stored at each node. Blockchain was first conceived by Satoshi Nakamoto as a peer-to-peer digital-commodity (also known as cryptocurrency) exchange system. Blockchains received traction due to their inherent property of immutability: once a block is accepted, it cannot be reverted.
... Decades of academic research and industry experience have helped the community in designing efficient distributed applications [20], [21], [22], [23]. We use these principles to illustrate the design of a high-throughput yielding permissioned blockchain fabric, ResilientDB. ...
... Big data challenges are characterized not only by the large volume of data that has to be processed, but also by a high rate of data production and consumption, i.e., high velocity [30], [45], [36], [37], [44]. Explosions in data volume and velocity are commonplace in a wide range of monitoring applications. ...
Due to the recent explosion of data volume and velocity, a new array of lightweight key-value stores has emerged to serve as alternatives to traditional databases. The majority of these storage engines, however, sacrifice read performance in order to cope with write throughput, avoiding random disk access when writing a record in favor of fast sequential accesses. But the boundary between sequential and random access is becoming blurred with the advent of solid-state drives (SSDs). In this work, we propose our new key-value store, LogStore, optimized for hybrid storage architectures. Additionally, we introduce a novel cost-based data staging model based on log-structured storage, in which recent changes are first stored on SSDs and pushed to HDDs as they age, while minimizing the read/write amplification for merging data from SSDs and HDDs. Furthermore, we take a holistic approach to improving both read and write performance by dynamically optimizing the data layout, such as deferring and reversing the compaction process, and by developing an access strategy that leverages the strengths of each available medium in our storage hierarchy. Lastly, in our extensive evaluation, we demonstrate that LogStore achieves up to a 6x improvement in throughput/latency over LevelDB, a state-of-the-art key-value store.
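The following is a minimal C++ sketch (not LogStore's implementation) of the staging idea described above: writes land in an SSD-resident log-structured tier, cold entries are migrated to the HDD tier in bulk, and reads consult the SSD tier first. The class and method names, and the fixed age threshold, are illustrative assumptions; the paper's model is cost-based rather than age-based.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

class TieredStore {
 public:
  void put(const std::string& key, const std::string& value, uint64_t now) {
    ssd_[key] = {value, now};                  // fast, log-structured SSD tier
  }
  std::optional<std::string> get(const std::string& key) const {
    if (auto it = ssd_.find(key); it != ssd_.end()) return it->second.value;
    if (auto it = hdd_.find(key); it != hdd_.end()) return it->second;
    return std::nullopt;
  }
  // Periodically migrate cold entries; a real system would weigh the cost of
  // the merge (read/write amplification) rather than use a fixed age cut-off.
  void migrate_cold(uint64_t now, uint64_t max_age) {
    for (auto it = ssd_.begin(); it != ssd_.end();) {
      if (now - it->second.written_at > max_age) {
        hdd_[it->first] = it->second.value;    // batched, sequential write on HDD
        it = ssd_.erase(it);
      } else {
        ++it;
      }
    }
  }

 private:
  struct Entry { std::string value; uint64_t written_at; };
  std::map<std::string, Entry> ssd_;           // stands in for the SSD-resident tier
  std::map<std::string, std::string> hdd_;     // stands in for the HDD-resident tier
};
```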
... Furthermore, it admits multiple execution paradigms (i.e., speculative or conservative) and multiple isolation levels (i.e., serializable isolation or read-committed isolation) seamlessly, unlike existing deterministic database proposals. It is important to note that several existing non-deterministic database systems already support multiple isolation levels (e.g., [22,30,33,37,38]). ...
Distributed database systems partition the data across multiple nodes to improve concurrency, which leads to higher throughput. Traditional concurrency control algorithms aim at producing an execution history equivalent to some serial history of transaction execution; hence, an agreement on the final serial history is required for concurrent transaction execution. Traditional agreement protocols such as Two-Phase Commit (2PC) are typically used but act as a significant bottleneck when processing distributed transactions that access many partitions, because 2PC requires extensive coordination among the participating nodes to commit a transaction. Unlike traditional techniques, deterministic concurrency control techniques aim to produce an execution history that obeys a predetermined transaction ordering. Recent proposals for deterministic transaction processing demonstrate high potential for improving system throughput, which has led to their successful commercial adoption. However, these proposals do not efficiently utilize and exploit modern computing resources and are limited by design to conservative execution. In this paper, we propose a novel distributed queue-oriented transaction processing paradigm that fundamentally rethinks how deterministic transaction processing is performed. The proposed paradigm supports multiple execution paradigms and multiple isolation levels, and is amenable to efficient resource utilization. We employ the principles of our proposed paradigm to build Q-Store, which is the first deterministic, distributed transaction processing system to support speculative execution and to efficiently exploit intra-transaction parallelism. We perform an extensive evaluation against both deterministic and non-deterministic transaction processing protocols and demonstrate up to two orders of magnitude improvement in performance.
... Resilient systems and consensus protocols have been widely studied by the distributed computing community (e.g., [50,51,54,76,80,82,84,87,88]). Here, we restrict ourselves to works addressing some of the challenges addressed by GeoBFT: consensus protocols supporting high-performance or geo-scale aware resilient system designs. ...
Recent developments in blockchain technology have inspired innovative new designs in resilient distributed and database systems. At their core, these blockchain applications typically use Byzantine fault-tolerant consensus protocols to maintain a common state across all replicas, even if some replicas are faulty or malicious. Unfortunately, existing consensus protocols are not designed to deal with geo-scale deployments in which many replicas spread across a geographically large area participate in consensus.
To address this, we present the Geo-Scale Byzantine Fault-Tolerant consensus protocol (GeoBFT). GeoBFT is designed for excellent scalability by using a topology-aware grouping of replicas into local clusters, by introducing parallelization of consensus at the local level, and by minimizing communication between clusters. To validate our vision of high-performance geo-scale resilient distributed systems, we implement GeoBFT in our efficient ResilientDB permissioned blockchain fabric. We show that GeoBFT is not only sound and provides great scalability, but also outperforms state-of-the-art consensus protocols by a factor of six in geo-scale deployments.
... Considering the widespread impact of database-system-based applications in recent days, researchers have also started working to improve the performance of conventional database systems by improving classic 2PC and 2PL [139][140][141][142][143]. Furthermore, it will be interesting to widen the discussion of the priority inversion problem to energy-aware RTDBS query processing [144][145][146][147][148][149], mobile DRTDBS [150][151][152][153][154], replicated DRTDBS [155,156], nested DRTDBS [157][158][159], active DRTDBS [126], and mobile ad-hoc network (MANET) databases [160][161][162]. ...
... The low throughput and high latency are the key reasons why Bft algorithms are often ignored. Prior works [26,49,50] have shown that traditional distributed systems can achieve throughputs on the order of 100K transactions per second, while the initial blockchain applications, Bitcoin [42] and Ethereum [60], have throughputs of at most ten transactions per second. Such low throughputs do not affect the users of these applications, as these applications were designed with an aim of open membership, that is, anyone can join and the identities of the participants are kept hidden. ...
Since the inception of Bitcoin, the distributed and database communities have shown interest in the design of efficient blockchain systems. At the core of any blockchain application is a Byzantine Fault-Tolerant (BFT) protocol that helps a set of replicas reach agreement on the order of a client request. Initial blockchain applications (like Bitcoin) attain very low throughput and are computationally expensive. Hence, researchers moved towards the design of permissioned blockchain systems that employ classical BFT protocols, such as PBFT, to reach consensus. However, existing permissioned blockchain systems still attain low throughputs (on the order of 10K txns/s). As a result, existing works blame this low throughput on the associated BFT protocol and expend resources in developing optimized protocols. We believe such blame depicts only a one-sided story. Specifically, we raise a simple question: can a well-crafted system based on a classical BFT protocol outperform a modern protocol? We show that designing such a well-crafted system is possible and illustrate cases where a three-phase protocol can outperform a single-phase protocol. Further, we dissect a permissioned blockchain system and identify several factors that affect its performance. We also design a high-throughput yielding permissioned blockchain system, ResilientDB, that employs parallel pipelines to balance tasks at a replica, and we provide guidelines for future designs.
... ResilientDB is a spin-off of ExpoDB [48,89,92], our exploratory data platform, aiming at high-performance fault-resilient data processing. ...
Since the introduction of blockchains, several new database systems and applications have tried to employ them. At the core of such blockchain designs are Byzantine Fault-Tolerant (BFT) consensus protocols that enable designing systems that are resilient to failures and malicious behavior. Unfortunately, existing BFT protocols seem unsuitable for usage in database systems due to their high computational costs, high communication costs, high client latencies, and/or reliance on trusted components and clients. In this paper, we present the Proof-of-Execution consensus protocol (PoE) that alleviates these challenges. At the core of PoE are out-of-order processing and speculative execution, which allow PoE to execute transactions before consensus is reached among the replicas. With these techniques, PoE manages to reduce the costs of BFT in normal cases, while still providing reliable consensus toward clients in all cases. We envision the use of PoE in high-performance resilient database systems. To validate this vision, we implement PoE in our efficient ResilientDB blockchain and database framework. ResilientDB helps us to implement and evaluate PoE against several state-of-the-art BFT protocols. Our evaluation shows that PoE achieves more throughput than existing BFT protocols.
... ResilientDB is a spin-off of ExpoDB [27,56,58], our exploratory data platform, aiming at high-performance fault-resilient data processing. ...
The recent surge in blockchain applications and database systems has renewed interest in traditional Byzantine Fault Tolerant (BFT) consensus protocols. Several such BFT protocols follow a primary-backup design, in which a primary replica coordinates the consensus protocol. In primary-backup designs, the normal-case operations are rather simple. At the same time, however, primary-backup designs place an unreasonable burden on primaries and allow malicious primaries to affect the system throughput substantially. To resolve this situation, we propose the MultiBFT paradigm, a protocol-agnostic approach towards improving the performance of primary-backup consensus protocols. At the core of MultiBFT is an approach to continuously order client transactions by running several instances of the underlying BFT protocol in parallel. We apply our paradigm to two well-established BFT protocols and demonstrate that the resulting parallelized protocols are not only safe and live but also significantly outperform their original non-parallelized forms. Further, we show that our MultiBFT paradigm reaches a throughput of up to 320K transactions per second.
... In fact, atomicity acts as a contract and establishes trust among multiple communicating parties. However, it is common knowledge [49,55,62] that distributed systems undergo node failures. Recent failures [23,52,74] have shown that distributed systems are still miles away from achieving undeterred availability. ...
Large-scale distributed databases are designed to support commercial and cloud-based applications. The minimal expectation from such systems is that they ensure consistency and reliability in case of node failures. A distributed database guarantees reliability through the use of atomic commitment protocols, which ensure that either all the changes of a transaction are applied or none of them are. To ensure an efficient commitment process, the database community has mainly used the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures. This necessitated the development of the non-blocking three-phase commit (3PC) protocol. However, the database community is still reluctant to use the 3PC protocol, as it acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present the EasyCommit protocol, which leverages the best of both worlds: it is non-blocking (like 3PC) yet requires only two phases (like 2PC). EasyCommit achieves these goals through two key design insights: (i) first transmit and then commit, and (ii) message redundancy. We present the design of the EasyCommit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of the EasyCommit protocol and show that it is nearly as efficient as the 2PC protocol. To cater to the needs of geographically distributed, large-scale systems, we also design a topology-aware agreement protocol (Geo-scale EasyCommit) that is non-blocking, safe, live, and outperforms the 3PC protocol.
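The following is a minimal C++ sketch (not the paper's code) of the "first transmit and then commit" rule stated above, under the assumption that a participant forwards the coordinator's global decision to its peers before applying it locally, so that a peer that never hears from the coordinator can still terminate from a forwarded decision. The Participant type, callbacks, and member names are hypothetical.

```cpp
#include <functional>

enum class Decision { kCommit, kAbort };

struct Participant {
  int id;
  bool committed = false;
  bool aborted = false;

  // 'broadcast' delivers the decision to all other participants (message
  // redundancy); applying the decision happens strictly after the transmit.
  void on_global_decision(Decision d,
                          const std::function<void(int /*from*/, Decision)>& broadcast) {
    broadcast(id, d);   // (i) first transmit ...
    apply(d);           // (ii) ... then commit/abort locally
  }

  // A participant that never received the coordinator's message can still
  // terminate once it receives a forwarded decision from any peer.
  void on_forwarded_decision(Decision d) {
    if (!committed && !aborted) apply(d);
  }

 private:
  void apply(Decision d) { (d == Decision::kCommit ? committed : aborted) = true; }
};
```

The sketch deliberately omits the voting phase, timeouts, and recovery logic that the full protocol needs for its safety and liveness proofs.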
Online-transaction-processing (OLTP) applications require the underlying storage system to guarantee consistency and serializability for distributed transactions involving large numbers of servers, which tends to introduce high coordination cost and cause low system performance. In-network coordination is a promising approach to alleviate this problem, which leverages programmable switches to move a piece of coordination functionality into the network. This paper presents a fast and scalable transaction processing system called SwitchTx. At the core of SwitchTx is a decentralized multi-switch in-network coordination mechanism, which leverages modern switches' programmability to reduce coordination cost while avoiding the central-switch-caused problems in the state-of-the-art Eris transaction processing system. SwitchTx abstracts various coordination tasks (e.g., locking, validating, and replicating) as in-switch gather-and-scatter (GaS) operations, and offloads coordination to a tree of switches for each transaction (instead of to a central switch for all transactions) where the client and the participants connect to the leaves. Moreover, to control the transaction traffic intelligently, SwitchTx reorders the coordination messages according to their semantics and redesigns the congestion control combined with admission control. Evaluation shows that SwitchTx outperforms current transaction processing systems in various workloads by up to 2.16X in throughput, 40.4% in latency, and 41.5% in lock time.
In-memory key-value stores have quickly become a key enabling technology for building high-performance applications that must cope with massively distributed workloads. In-memory key-value stores (also referred to as NoSQL) primarily aim to offer low-latency and high-throughput data access, which motivates the rapid adoption of modern network cards such as those supporting Remote Direct Memory Access (RDMA). In this paper, we present the fundamental design principles for exploiting RDMA in modern NoSQL systems. Moreover, we present a break-down analysis of state-of-the-art RDMA-based in-memory NoSQL systems regarding indexing, data consistency, and the communication protocol. In addition, we compare traditional in-memory NoSQL systems with their RDMA-enabled counterparts. Finally, we present a comprehensive analysis and evaluation of the existing systems according to the impact of the number of clients, real-world request distributions, and workload read-write ratios.
We investigate a coordination-free approach to transaction processing on emerging multi-socket, many-core, shared-memory architectures to harness their unprecedented available parallelism. We propose a queue-oriented, control-free concurrency architecture, referred to as QueCC, that exhibits minimal contention among concurrent threads by eliminating the overhead of concurrency control from the critical path of the transaction. QueCC operates on batches of transactions in two deterministic phases of priority-based planning followed by control-free execution. We extensively evaluate our transaction execution architecture and compare its performance against seven state-of-the-art concurrency control protocols designed for in-memory key-value stores. We demonstrate that QueCC can significantly outperform state-of-the-art concurrency control protocols under high contention by up to 6.3x. Moreover, our results show that QueCC can process nearly 40 million YCSB transactional operations per second while maintaining serializability guarantees with write-intensive workloads. Remarkably, QueCC outperforms H-Store by up to two orders of magnitude.
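The following is a minimal C++ sketch, not QueCC's implementation, of the two-phase, queue-oriented idea: a planning phase places a batch's operations into per-range queues in priority order, and an execution phase lets each worker drain exactly one queue with no locks and no aborts, because no two workers ever touch the same record. All names, the hash-based partitioning, and the toy operations are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <thread>
#include <vector>

struct Op { uint64_t priority; std::string key; int delta; };

int main() {
  // All keys pre-exist, so concurrent access via at() touches disjoint elements.
  std::map<std::string, int> db{{"a", 0}, {"b", 0}, {"c", 0}, {"d", 0}};
  std::vector<Op> batch{{1, "a", 5}, {2, "c", 1}, {1, "b", 2}, {3, "a", -1}};

  // Phase 1: planning. Sort by priority (transaction order) and hash each key
  // to one of N execution queues.
  const size_t kWorkers = 2;
  std::vector<std::vector<Op>> queues(kWorkers);
  std::stable_sort(batch.begin(), batch.end(),
                   [](const Op& l, const Op& r) { return l.priority < r.priority; });
  for (const Op& op : batch)
    queues[std::hash<std::string>{}(op.key) % kWorkers].push_back(op);

  // Phase 2: control-free execution. Each worker owns its queue (and therefore
  // the keys hashed to it), so no locking is needed on the critical path.
  std::vector<std::thread> workers;
  for (size_t w = 0; w < kWorkers; ++w)
    workers.emplace_back([&, w] {
      for (const Op& op : queues[w]) db.at(op.key) += op.delta;
    });
  for (auto& t : workers) t.join();
  return 0;
}
```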
Interoperability in healthcare has traditionally been focused on data exchange between business entities, for example, different hospital systems. However, there has been a recent push towards patient-driven interoperability, in which health data exchange is patient-mediated and patient-driven. Patient-centered interoperability, however, brings with it new challenges and requirements around security and privacy, technology, incentives, and governance that must be addressed for this type of data sharing to succeed at scale. In this paper, we look at how blockchain technology might facilitate this transition through five mechanisms: (1) digital access rules, (2) data aggregation, (3) data liquidity, (4) patient identity, and (5) data immutability. We then look at barriers to blockchain-enabled patient-driven interoperability, specifically clinical data transaction volume, privacy and security, patient engagement, and incentives. We conclude by noting that while patient-driven interoperability is an exciting trend in healthcare, given these challenges, it remains to be seen whether blockchain can facilitate the transition from institution-centric to patient-centric data sharing.
It has long been known that hardware components are not perfect and that soft errors in the form of single-bit flips happen all the time. Up to now, these single-bit flips have mainly been addressed in hardware using general-purpose protection techniques. However, recent studies have shown that future hardware components are becoming less and less reliable overall and that multi-bit flips are occurring regularly rather than exceptionally. Additionally, hardware aging effects will lead to error models that change during run-time. Scaling hardware-based protection techniques to cover changing multi-bit flips is possible, but this introduces large performance, chip area, and power overheads, which will become unaffordable in the future. To tackle this, an emerging research direction is employing protection techniques in higher software layers like compilers or applications. The knowledge available at these layers can be efficiently used to specialize and adapt protection techniques. Thus, in this paper we propose a novel, adaptable, on-the-fly hardware error detection approach called AHEAD for database systems. AHEAD provides configurable error detection in an end-to-end fashion and reduces the overhead (storage and computation) compared to other techniques at this level. Our approach uses an arithmetic error coding technique which, on the one hand, allows query processing to work entirely on hardened data. On the other hand, this enables on-the-fly detection during query processing of (i) errors that modify data stored in memory or transferred on an interconnect and (ii) errors induced during computations. Our exhaustive evaluation clearly shows the benefits of our AHEAD approach.
In this paper we present BatchDB, an in-memory database engine designed for hybrid OLTP and OLAP workloads. BatchDB achieves good performance, provides a high level of data freshness, and minimizes load interaction between the transactional and analytical engines, thus enabling real-time analysis over fresh data under tight SLAs for both OLTP and OLAP workloads.
BatchDB relies on primary-secondary replication with dedicated replicas, each optimized for a particular workload type (OLTP, OLAP), and a light-weight propagation of transactional updates. The evaluation shows that for standard TPC-C and TPC-H benchmarks, BatchDB can achieve competitive performance to specialized engines for the corresponding transactional and analytical workloads, while providing a level of performance isolation and predictable runtime for hybrid workload mixes (OLTP+OLAP) otherwise unmet by existing solutions.
Multiversion databases store both current and historical data. Rows are typically annotated with timestamps representing the period when the row is/was valid. We develop novel techniques to reduce index maintenance in multiversion databases, so that indexes can be used effectively for analytical queries over current data without being a heavy burden on transaction throughput. To achieve this end, we re-design persistent index data structures in the storage hierarchy to employ an extra level of indirection. The indirection level is stored on solid-state disks that can support very fast random I/Os, so that traversing the extra level of indirection incurs a relatively small overhead. The extra level of indirection dramatically reduces the number of magnetic disk I/Os that are needed for index updates and localizes maintenance to indexes on updated attributes. Additionally, we batch insertions within the indirection layer in order to reduce physical disk I/Os for indexing new records. In this work, we further exploit SSDs by introducing novel DeltaBlock techniques for storing the recent changes to data on SSDs. Using DeltaBlocks, we propose an efficient method to periodically flush the recently changed data from SSDs to HDDs such that, on the one hand, we keep track of every change (or delta) for every record and, on the other hand, we avoid redundantly storing the unchanged portion of updated records. By reducing the index maintenance overhead on transactions, we enable operational data stores to create more indexes to support queries. We have developed a prototype of our indirection proposal by extending the widely used generalized search tree open-source project, which is also employed in PostgreSQL. Our working implementation demonstrates that we can significantly reduce index maintenance and/or query processing cost by a factor of 3. For the insertion of new records, our novel batching technique can save up to 90% of the insertion time. For updates, our prototype demonstrates that we can significantly reduce the database size by up to 80%, even with a modest space allocated for DeltaBlocks on SSDs.
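The following is a minimal C++ sketch, not the paper's implementation, of the extra level of indirection it describes: index entries store a stable logical identifier (LID), and a small SSD-resident table maps each LID to the physical row id (RID) of the current version, so an update only rewrites the LID-to-RID slot and leaves indexes on unchanged attributes untouched. All type and member names are illustrative assumptions.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

using LID = uint64_t;  // stable logical id stored inside every index entry
using RID = uint64_t;  // physical location of a row version

struct Table {
  std::unordered_map<LID, RID> indirection;    // stands in for the SSD-resident mapping
  std::map<RID, std::string> rows;             // stands in for HDD-resident row versions
  std::multimap<std::string, LID> name_index;  // a secondary index on the "name" attribute
  RID next_rid = 0;

  void insert(LID lid, const std::string& name) {
    RID rid = next_rid++;
    rows[rid] = name;
    indirection[lid] = rid;
    name_index.emplace(name, lid);             // the index points at the LID, not the RID
  }

  // An update appends a new version and repoints the LID; the secondary index
  // is only maintained if the indexed attribute itself changed.
  void update(LID lid, const std::string& new_value, bool name_changed) {
    RID rid = next_rid++;
    rows[rid] = new_value;
    indirection[lid] = rid;                    // single small random write (SSD-friendly)
    if (name_changed) name_index.emplace(new_value, lid);
  }

  const std::string& read(LID lid) { return rows[indirection[lid]]; }
};
```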
Over the last two releases SQL Server has integrated two specialized engines into the core system: the Apollo column store engine for analytical workloads and the Hekaton in-memory engine for high-performance OLTP workloads. There is an increasing demand for real-time analytics, that is, for running analytical queries and reporting on the same system as transaction processing so as to have access to the freshest data. SQL Server 2016 will include enhancements to column store indexes and in-memory tables that significantly improve performance on such hybrid workloads. This paper describes four such enhancements: column store indexes on in-memory tables, making secondary column store indexes on disk-based tables updatable, allowing B-tree indexes on primary column store indexes, and further speeding up the column store scan operator.
The limitations of traditional general-purpose processors have motivated the use of specialized hardware solutions (e.g., FPGAs) to achieve higher performance in stream processing. However, state-of-the-art hardware-only solutions have limited support for adapting to changes in the query workload. In this work, we present a reconfigurable hardware-based streaming architecture that offers the flexibility to accept new queries and to change existing ones without the need for expensive hardware reconfiguration. We introduce the Online Programmable Block (OP-Block), a "Lego-like" connectable stream processing element, for constructing a custom Flexible Query Processor (FQP) suitable for a wide range of data streaming applications, including real-time data analytics, information filtering, intrusion detection, algorithmic trading, targeted advertising, and complex event processing. Through evaluations, we conclude that updating OP-Blocks to support new queries takes on the order of nanoseconds to microseconds (e.g., 40 ns to realize a join operator on an OP-Block), a feature critical to supporting streaming applications on FPGAs.
The Flexible Query Processor (FQP) constitutes a family of hardware-based data stream processors that support dynamic changes to queries and streams, as well as static changes to the processor-internal fabric, in order to maximize performance for given workloads. FQP is prototyped on field-programmable gate arrays (FPGAs). To this end, FQP supports select, project, and window-join queries over data streams. While processing incoming tuples, FQP can accept new queries, a key characteristic distinguishing FQP from related approaches employing FPGAs for stream processing. In this paper, we present our vision of FQP, focusing on a few internal details that support the flexibility dimension, in particular, the segment-at-a-time mechanism to realize processing of tuples of variable sizes. While many of these features are readily available in software, their hardware-based realization has been one of the main shortcomings of existing research efforts.
In multi-version databases, updates and deletions of records by transactions require appending a new record to tables rather than performing in-place updates. This mechanism incurs non-negligible performance overhead in the presence of multiple indexes on a table, where changes need to be propagated to all indexes. Additionally, an uncommitted record update will block other active transactions from using the index to fetch the most recently committed values for the updated record. In general, in order to support snapshot isolation and/or multi-version concurrency, either each active transaction is forced to search a database temporary area (e.g., roll-back segments) to fetch old values of desired records, or each transaction is forced to scan the entire table to find the older versions of the record in a multi-version database (in the absence of specialized temporal indexes).
In this work, we describe a novel kV-Indirection structure to enable efficient (parallelizable) optimistic and pessimistic multi-version concurrency control by utilizing the old versions of records (at most two versions of each record) to provide direct access to the recent changes of records without the need for temporal indexes. As a result, our technique yields a higher degree of concurrency by reducing the clashes between readers and writers of data and avoiding extended lock delays. We have a working prototype of our concurrency model and kV-Indirection structure in a commercial database and conducted an extensive evaluation to demonstrate the benefits of our multi-version concurrency control, obtaining orders of magnitude speedup over single-version concurrency control.
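As a rough illustration of keeping at most two versions of each record directly reachable, the following C++ sketch (not the paper's kV-Indirection code) gives every key a slot holding the latest committed version plus an optional in-flight version: snapshot readers take the committed version in place, without consulting rollback segments or temporal indexes, while a writer installs a pending version and promotes it on commit. All names and the single-writer-per-key simplification are assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct VersionSlot {
  std::string committed;               // most recent committed version
  std::optional<std::string> pending;  // in-flight (uncommitted) version, if any
  uint64_t committed_ts = 0;
};

class VersionedStore {
 public:
  // Snapshot readers never block: they read the committed version in place.
  std::string read(const std::string& key) const { return slots_.at(key).committed; }

  bool write(const std::string& key, const std::string& value) {
    VersionSlot& s = slots_[key];
    if (s.pending) return false;       // another writer is active: caller retries or aborts
    s.pending = value;
    return true;
  }

  void commit(const std::string& key, uint64_t ts) {
    VersionSlot& s = slots_[key];
    s.committed = std::move(*s.pending);  // promote: the old version is superseded
    s.pending.reset();
    s.committed_ts = ts;
  }

 private:
  std::unordered_map<std::string, VersionSlot> slots_;
};
```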
This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures.
We present preliminary performance data on a subset of TPC-H and show that the system we are building, C-Store, is substantially faster than popular commercial products. Hence, the architecture looks very encouraging.
Online Transaction Processing (OLTP) databases include a suite of features---disk-resident B-trees and heap files, locking-based concurrency control, support for multi-threading---that were optimized for computer technology of the late 1970's. Advances in modern processors, memories, and networks mean that today's computers are vastly different from those of 30 years ago, such that many OLTP databases will now fit in main memory, and most OLTP transactions can be processed in milliseconds or less. Yet database architecture has changed little.
Based on this observation, we look at some interesting variants of conventional database systems that one might build that exploit recent hardware trends, and speculate on their performance through a detailed instruction-level breakdown of the major components involved in a transaction processing database system (Shore) running a subset of TPC-C. Rather than simply profiling Shore, we progressively modified it so that after every feature removal or optimization, we had a (faster) working system that fully ran our workload. Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20 in raw performance. We also show that there is no single "high pole in the tent" in modern (memory resident) database systems, but that substantial time is spent in logging, latching, locking, B-tree, and buffer management operations.
Fabric is a modular and extensible open-source system for deploying and operating permissioned blockchains and one of the Hyperledger projects hosted by the Linux Foundation (www.hyperledger.org).
Fabric is the first truly extensible blockchain system for running distributed applications. It supports modular consensus protocols, which allows the system to be tailored to particular use cases and trust models. Fabric is also the first blockchain system that runs distributed applications written in standard, general-purpose programming languages, without systemic dependency on a native cryptocurrency. This stands in sharp contrast to existing blockchain platforms that require "smart-contracts" to be written in domain-specific languages or rely on a cryptocurrency. Fabric realizes the permissioned model using a portable notion of membership, which may be integrated with industry-standard identity management. To support such flexibility, Fabric introduces an entirely novel blockchain design and revamps the way blockchains cope with non-determinism, resource exhaustion, and performance attacks.
This paper describes Fabric, its architecture, the rationale behind various design decisions, its most prominent implementation aspects, as well as its distributed application programming model. We further evaluate Fabric by implementing and benchmarking a Bitcoin-inspired digital currency. We show that Fabric achieves end-to-end throughput of more than 3500 transactions per second in certain popular deployment configurations, with sub-second latency, scaling well to over 100 peers.
Key-value storage systems are an integral part of many data centre applications, but as demand increases so does the need for high performance. This has motivated new designs that use Remote Direct Memory Access (RDMA) to reduce communication overhead. Current RDMA-enabled key-value stores (RKVSes) target workloads involving small values, running on dedicated servers on which no other applications are running. Outside of these domains, however, there may be other RKVS designs that provide better performance. In this paper, we introduce Nessie, an RKVS that is fully client-driven, meaning no server process is involved in servicing requests. Nessie also decouples its index and storage data structures, allowing indices and data to be placed on different servers. This flexibility can decrease the number of network operations required to service a request. These design elements make Nessie well-suited for a different set of workloads than existing RKVSes. Compared to a server-driven RKVS, Nessie more than doubles system throughput when there is CPU contention on the server, improves throughput by 70% for PUT-oriented workloads when data value sizes are 128 KB or larger, and reduces power consumption by 18% at 80% system utilization and 41% at 20% system utilization compared with idle power consumption.
Multi-core in-memory databases promise high-speed online transaction processing. However, the performance of individual designs suffers when the workload characteristics miss their small sweet spot of a desired contention level, read-write ratio, record size, processing rate, and so forth.
Cicada is a single-node multi-core in-memory transactional database with serializability. To provide high performance under diverse workloads, Cicada reduces overhead and contention at several levels of the system by leveraging optimistic and multi-version concurrency control schemes and multiple loosely synchronized clocks while mitigating their drawbacks. On the TPC-C and YCSB benchmarks, Cicada outperforms Silo, TicToc, FOEDUS, MOCC, two-phase locking, Hekaton, and ERMIA in most scenarios, achieving up to 3X higher throughput than the next fastest design. It handles up to 2.07 M TPC-C transactions per second and 56.5 M YCSB transactions per second, and scans up to 356 M records per second on a single 28-core machine.
With the advent of trusted execution environments provided by recent general purpose processors, a class of replication protocols has become more attractive than ever: Protocols based on a hybrid fault model are able to tolerate arbitrary faults yet reduce the costs significantly compared to their traditional Byzantine relatives by employing a small subsystem trusted to only fail by crashing. Unfortunately, existing proposals have their own price: We are not aware of any hybrid protocol that is backed by a comprehensive formal specification, complicating the reasoning about correctness and implications. Moreover, current protocols of that class have to be performed largely sequentially. Hence, they are not well-prepared for just the modern multi-core processors that bring their very own fault model to a broad audience. In this paper, we present Hybster, a new hybrid state-machine replication protocol that is highly parallelizable and specified formally. With over 1 million operations per second using only four cores, the evaluation of our Intel SGX-based prototype implementation shows that Hybster makes hybrid state-machine replication a viable option even for today's very demanding critical services.
Increasing transaction volumes have led to a resurgence of interest in distributed transaction processing. In particular, partitioning data across several servers can improve throughput by allowing servers to process transactions in parallel. But executing transactions across servers limits the scalability and performance of these systems.
In this paper, we quantify the effects of distribution on concurrency control protocols in a distributed environment. We evaluate six classic and modern protocols in an in-memory distributed database evaluation framework called Deneva, providing an apples-to-apples comparison between each. Our results expose severe limitations of distributed transaction processing engines. Moreover, in our analysis, we identify several protocol-specific scalability bottlenecks. We conclude that to achieve truly scalable operation, distributed concurrency control solutions must seek a tighter coupling with either novel network hardware (in the local area) or applications (via data modeling and semantically-aware execution), or both.
In order to guarantee recoverable transaction execution, database systems permit a transaction's writes to be observable only at the end of its execution. As a consequence, there is generally a delay between the time a transaction performs a write and the time later transactions are permitted to read it. This delayed write visibility can significantly impact the performance of serializable database systems by reducing concurrency among conflicting transactions.
This paper makes the observation that delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.
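The following is a minimal C++ sketch, not PWV's protocol code, of early write visibility with commit dependencies: once a writer is past the point where the system can still abort it, its writes become visible; a transaction that reads such a write merely records a dependency and defers its own commit until that writer has committed. All names are illustrative, and concurrency control details are omitted.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <unordered_map>

struct Version { std::string value; uint64_t writer; bool writer_committed; };

struct Store {
  std::unordered_map<std::string, Version> latest;
  std::set<uint64_t> committed;   // ids of committed transactions
};

struct Reader {
  std::set<uint64_t> commit_deps; // writers we read from before they committed

  std::string read(Store& store, const std::string& key) {
    const Version& v = store.latest.at(key);
    if (!v.writer_committed) commit_deps.insert(v.writer);  // early-visible write
    return v.value;
  }

  // The reader may commit only after every transaction it depends on has
  // committed; under the restricted-abort assumption those writers can no
  // longer abort, so waiting (rather than aborting) is sufficient.
  bool can_commit(const Store& store) const {
    for (uint64_t dep : commit_deps)
      if (!store.committed.count(dep)) return false;
    return true;
  }
};
```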
Future servers will be equipped with thousands of CPU cores and deep memory hierarchies. Traditional concurrency control (CC) schemes---both optimistic and pessimistic---slow down orders of magnitude in such environments for highly contended workloads. Optimistic CC (OCC) scales the best for workloads with few conflicts, but suffers from clobbered reads for high conflict workloads. Although pessimistic locking can protect reads, it floods cache-coherence backbones in deep memory hierarchies and can also cause numerous deadlock aborts.
This paper proposes a new CC scheme, mostly-optimistic concurrency control (MOCC), to address these problems. MOCC achieves orders of magnitude higher performance for dynamic workloads on modern servers. The key objective of MOCC is to avoid clobbered reads for high conflict workloads, without any centralized mechanisms or heavyweight interthread communication. To satisfy such needs, we devise a native, cancellable reader-writer spinlock and a serializable protocol that can acquire, release and re-acquire locks in any order without expensive interthread communication. For low conflict workloads, MOCC maintains OCC's high performance without taking read locks.
Our experiments with high conflict YCSB workloads on a 288-core server reveal that MOCC performs 8× and 23× faster than OCC and pessimistic locking, respectively. It achieves 17 million TPS for TPC-C and more than 110 million TPS for YCSB without conflicts, 170× faster than pessimistic methods.
Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these applications exceed the capabilities of a single server, and thus their database has to be deployed in a distributed DBMS. The key factor affecting such a system's performance is how the database is partitioned. If the database is partitioned incorrectly, the number of distributed transactions can be high. These transactions have to synchronize their operations over the network, which is considerably slower and leads to poor performance. Previous work on elastic database repartitioning has focused on a certain class of applications whose database schema can be represented in a hierarchical tree structure. But many applications cannot be partitioned in this manner, and thus are subject to distributed transactions that impede their performance and scalability.
In this paper, we present a new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships. Clay dynamically creates blocks of tuples to migrate among servers during repartitioning, placing no constraints on the schema but taking care to balance load and reduce the amount of data migrated. Clay achieves this goal by including in each block a set of hot tuples and other tuples co-accessed with these hot tuples. To evaluate our approach, we integrate Clay in a distributed, main-memory DBMS and show that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
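To illustrate the block-formation idea, the following is a minimal C++ sketch, not Clay's published algorithm, of a greedy heuristic: seed a block with the hottest tuple on the overloaded partition, then keep adding the tuple most frequently co-accessed with anything already in the block until enough access load has been collected to migrate. All names, inputs, and the stopping rule are illustrative assumptions; the heat map is assumed non-empty and co-accessed tuples are assumed to appear in it.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>

using TupleId = uint64_t;

std::set<TupleId> build_block(
    const std::map<TupleId, uint64_t>& heat,                       // per-tuple access counts
    const std::map<std::pair<TupleId, TupleId>, uint64_t>& coacc,  // co-access counts
    uint64_t load_to_move) {
  // Seed with the hottest tuple.
  TupleId seed = heat.begin()->first;
  for (const auto& [t, h] : heat)
    if (h > heat.at(seed)) seed = t;

  std::set<TupleId> block{seed};
  uint64_t moved = heat.at(seed);
  while (moved < load_to_move) {
    TupleId best = 0;
    uint64_t best_co = 0;
    for (const auto& [pair, c] : coacc) {       // pick the strongest co-access edge
      auto [a, b] = pair;
      if (block.count(a) && !block.count(b) && c > best_co) { best = b; best_co = c; }
      if (block.count(b) && !block.count(a) && c > best_co) { best = a; best_co = c; }
    }
    if (best_co == 0) break;                    // nothing co-accessed remains to add
    block.insert(best);
    moved += heat.at(best);
  }
  return block;  // tuples to migrate together, keeping co-accessed data on one partition
}
```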
In this paper, we describe our experiences and lessons learned from building a general-purpose in-memory key-value middleware called HydraDB. HydraDB synthesizes a collection of state-of-the-art techniques, including continuous fault tolerance, Remote Direct Memory Access (RDMA), and awareness of multicore systems, to deliver a high-throughput, low-latency access service in a reliable manner for cluster computing applications.
The uniqueness of HydraDB mainly lies in its design commitment to fully exploit the RDMA protocol to comprehensively optimize various aspects of a general-purpose key-value store, including latency-critical operations, read enhancement, and data replications for high-availability service, etc. At the same time, HydraDB strives to efficiently utilize multicore systems to prevent data manipulation on the servers from curbing the potential of RDMA.
Many teams in our organization have adopted HydraDB to improve the execution of their cluster computing frameworks, including Hadoop, Spark, Sensemaking analytics, and Call Record Processing. In addition, our performance evaluation with a variety of YCSB workloads also shows that HydraDB can substantially outperform several existing in-memory key-value stores by an order of magnitude. Our detailed performance evaluation further corroborates our design choices.
Concurrency control for on-line transaction processing (OLTP) database management systems (DBMSs) is a nasty game. Achieving higher performance on emerging many-core systems is difficult. Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores. In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes. We implemented TicToc along with four other concurrency control algorithms in an in-memory, shared-everything OLTP DBMS and compared their performance on different workloads. Our results show that TicToc achieves up to 92% better throughput while reducing the abort rate by 3.3x over these previous algorithms.
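The following is a minimal, single-threaded C++ sketch of the data-driven timestamp idea described above; it is not TicToc's actual protocol code, and locking, concurrency, and abort handling are omitted. Each tuple carries a write timestamp (wts) and a read timestamp (rts); at commit time a transaction derives its commit timestamp from the tuples it touched instead of drawing one from a central allocator, then validates its reads against that timestamp. All struct and field names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Tuple { std::string value; uint64_t wts = 0; uint64_t rts = 0; };

struct ReadEntry { Tuple* tuple; uint64_t observed_wts; };

struct Txn {
  std::vector<ReadEntry> read_set;
  std::vector<std::pair<Tuple*, std::string>> write_set;

  // commit_ts must be >= wts of every tuple read (we read that version) and
  // > rts of every tuple written (no one may have read a newer version yet).
  bool commit() {
    uint64_t commit_ts = 0;
    for (const ReadEntry& r : read_set)
      commit_ts = std::max(commit_ts, r.observed_wts);
    for (const auto& [t, v] : write_set)
      commit_ts = std::max(commit_ts, t->rts + 1);

    // Validate reads: the version we read must either still be current (so its
    // rts can be extended to commit_ts) or already be readable at commit_ts.
    for (const ReadEntry& r : read_set) {
      if (r.tuple->wts != r.observed_wts && r.tuple->rts < commit_ts) return false;
      r.tuple->rts = std::max(r.tuple->rts, commit_ts);
    }
    for (auto& [t, v] : write_set) { t->value = v; t->wts = t->rts = commit_ts; }
    return true;
  }
};
```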
Although significant recent progress has been made in improving the multi-core scalability of high throughput transactional database systems, modern systems still fail to achieve scalable throughput for workloads involving frequent access to highly contended data. Most of this inability to achieve high throughput is explained by the fundamental constraints involved in guaranteeing ACID --- the addition of cores results in more concurrent transactions accessing the same contended data for which access must be serialized in order to guarantee isolation. Thus, linear scalability for contended workloads is impossible. However, there exist flaws in many modern architectures that exacerbate their poor scalability, and result in throughput that is much worse than fundamentally required by the workload. In this paper we identify two prevalent design principles that limit the multi-core scalability of many (but not all) transactional database systems on contended workloads: the multi-purpose nature of execution threads in these systems, and the lack of advanced planning of data access. We demonstrate the deleterious results of these design principles by implementing a prototype system, Orthrus, that is motivated by the principles of separation of database component functionality and advanced planning of transactions. We find that these two principles alone result in significantly improved scalability on high-contention workloads, and an order of magnitude increase in throughput for a non-trivial subset of these contended workloads.
Multicore in-memory databases often rely on traditional concurrency control schemes such as two-phase locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution opportunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency control (IC3), which provides serializability while allowing for parallel execution of certain conflicting transactions. IC3 combines static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3's runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC-C benchmark show that IC3 outperforms traditional concurrency control schemes under contention. It achieves a throughput of 434K transactions/sec on the TPC-C benchmark configured with only one warehouse. It also scales better than several recent concurrency control schemes that also target contended workloads.
Data-intensive applications seek to obtain insights in real time by analyzing a combination of historical data sets alongside recently collected data. This means that to support such hybrid workloads, database management systems (DBMSs) need to handle both fast ACID transactions and complex analytical queries on the same database. But the current trend is to use specialized systems that are optimized for only one of these workloads, and thus require an organization to maintain separate copies of the database. This adds additional cost to deploying a database application in terms of both storage and administration overhead.
To overcome this barrier, we present a hybrid DBMS architecture that efficiently supports varied workloads on the same database. Our approach differs from previous methods in that we use a single execution engine that is oblivious to the storage layout of data without sacrificing the performance benefits of the specialized systems. This obviates the need to maintain separate copies of the database in multiple independent systems. We also present a technique to continuously evolve the database's physical storage layout by analyzing the queries' access patterns and choosing the optimal layout for different segments of data within the same table. To evaluate this work, we implemented our architecture in an in-memory DBMS. Our results show that our approach delivers up to 3x higher throughput compared to static storage layouts across different workloads. We also demonstrate that our continuous adaptation mechanism allows the DBMS to achieve a near-optimal layout for an arbitrary workload without requiring any manual tuning.
Large main memories and massively parallel processors have triggered not only a resurgence of high-performance transaction processing systems optimized for large main-memory and massively parallel processors, but also an increasing demand for processing heterogeneous workloads that include read-mostly transactions. Many modern transaction processing systems adopt a lightweight optimistic concurrency control (OCC) scheme to leverage its low overhead in low contention workloads. However, we observe that the lightweight OCC is not suitable for heterogeneous workloads, causing significant starvation of read-mostly transactions and overall performance degradation.
In this paper, we present ERMIA, a memory-optimized database system built from scratch to cater to the need of handling heterogeneous workloads. ERMIA adopts snapshot isolation concurrency control to coordinate heterogeneous transactions and provides serializability when desired. Its physical layer supports the concurrency control schemes in a scalable way. Experimental results show that ERMIA delivers comparable or superior performance and near-linear scalability in a variety of workloads, compared to a recent lightweight OCC-based system. At the same time, ERMIA maintains high throughput on read-mostly transactions when the performance of the OCC-based system drops by orders of magnitude.
An increasing number of applications rely on workflows that involve (1) continuous stream processing, (2) transactional and write-heavy workloads, and (3) interactive SQL analytics. These applications need to consume high-velocity streams to trigger real-time alerts, ingest them into a write-optimized store, and perform OLAP-style analytics to derive deep insight quickly.
Consequently, the demand for mixed workloads has resulted in several composite data architectures, exemplified in the "lambda" architecture, requiring multiple systems to be stitched together---an exercise that can be hard, time consuming and expensive.
Instead, our system, SnappyData, fulfills this need by (i) enabling streaming, transactions, and interactive analytics in a single unifying system, rather than stitching together different solutions, and (ii) delivering true interactive speeds via a state-of-the-art approximate query engine that leverages a multitude of synopses as well as the full dataset.
The widely adopted single-threaded OLTP model assigns a single thread to each static partition of the database for processing transactions in a partition. This simplifies concurrency control while retaining parallelism. However, it suffers performance loss arising from skewed workloads as well as transactions that span multiple partitions. In this paper, we present a dynamic single-threaded in-memory OLTP system, called LADS, that extends the simplicity of the single-threaded model. The key innovation in LADS is the separation of dependency resolution and execution into two non-overlapping phases for batches of transactions. After the first phase of dependency resolution, the record actions of the transactions are partitioned and ordered. Each independent partition is then executed sequentially by a single thread, avoiding the need for locking. By careful mapping of the tasks to be performed to threads, LADS is able to achieve a high degree of balanced parallelism. We evaluate LADS against H-Store, a partition-based database; DORA, a data-oriented transaction processing system; and SILO, a multi-core in-memory OLTP engine. The experimental study shows that LADS achieves up to 20x higher throughput than existing systems and exhibits better robustness with various workloads.
The Optimistic Concurrency Control (OCC) method has been commonly used for in-memory databases to ensure transaction serializability --- a transaction will be aborted if its read set has been changed during execution. This simple abort criterion causes a large proportion of false positives, leading to excessive transaction aborts. Transactions aborted false-positively (i.e., false aborts) waste system resources and can significantly degrade system throughput (by as much as 3.68x in our experiments) when data contention is intensive.
Modern in-memory databases run on systems with increasingly parallel hardware and handle workloads with growing concurrency. They must efficiently deal with data contention in the presence of greater concurrency by minimizing false aborts. This paper presents a new concurrency control method named Balanced Concurrency Control (BCC) which aborts transactions more carefully than OCC does. BCC detects data dependency patterns that more reliably indicate unserializable transactions than the criterion used in OCC. The paper studies the design options and implementation techniques that can effectively detect data contention by identifying dependency patterns with low overhead. To test the performance of BCC, we have implemented it in Silo and compared its performance against that of the vanilla Silo system with OCC and two-phase locking (2PL). Our extensive experiments with TPC-W-like, TPC-C-like and YCSB workloads demonstrate that when data contention is intensive, BCC can increase transaction throughput by more than 3x versus OCC and more than 2x versus 2PL; meanwhile, BCC has performance comparable to OCC for workloads with low data contention.
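For reference, the sketch below shows the plain OCC criterion the paper argues against: abort whenever any read version has changed. It is a simplified single-threaded illustration with hypothetical names, not BCC's dependency-pattern detector.

```cpp
// Minimal sketch of the OCC criterion discussed above (not BCC itself):
// each record carries a version counter; a transaction records the versions
// it read and aborts at commit if any of them changed, even when the
// interleaving was actually serializable (a "false abort").
#include <cstdint>
#include <string>
#include <unordered_map>
#include <iostream>

struct Record { uint64_t version = 0; int64_t value = 0; };
using Table = std::unordered_map<std::string, Record>;

struct OccTxn {
    std::unordered_map<std::string, uint64_t> read_set;   // key -> version seen
    std::unordered_map<std::string, int64_t>  write_set;  // key -> new value

    int64_t read(const Table& db, const std::string& k) {
        const Record& r = db.at(k);
        read_set.emplace(k, r.version);
        return r.value;
    }
    void write(const std::string& k, int64_t v) { write_set[k] = v; }

    // Validation: abort if any read record was modified since we read it.
    bool try_commit(Table& db) {
        for (const auto& [k, seen] : read_set)
            if (db.at(k).version != seen) return false;   // abort
        for (const auto& [k, v] : write_set) {
            Record& r = db[k];
            r.value = v;
            ++r.version;
        }
        return true;
    }
};

int main() {
    Table db = {{"x", {0, 1}}, {"y", {0, 2}}};
    OccTxn t1, t2;
    t1.read(db, "x");          // t1 reads x
    t2.write("x", 42);         // t2 blindly updates x ...
    t2.try_commit(db);         // ... and commits first
    std::cout << (t1.try_commit(db) ? "commit" : "abort") << "\n";  // "abort"
}
```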
For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMS) provide the necessary scale-out support beyond what is possible with single-node systems. These DBMSs are optimized for the short-lived transactions that are common in on-line transaction processing (OLTP) workloads. One way that they achieve this is to partition the database into disjoint subsets and use a single-threaded transaction manager per partition that executes transactions one-at-a-time in serial order. This minimizes the overhead of concurrency control mechanisms, but requires careful partitioning to limit distributed transactions that span multiple partitions. Previous methods used off-line analysis to determine how to partition data, but the dynamic nature of these applications means that they are prone to hotspots. In these situations, the DBMS needs to reconfigure how data is partitioned in real-time to maintain performance objectives. Bringing the system off-line to reorganize the database is unacceptable for on-line applications.
To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs. Squall supports fine-grained repartitioning of databases in the presence of distributed transactions, high throughput client workloads, and replicated data. An evaluation of our approach on a distributed DBMS shows that Squall can reconfigure a database with no downtime and minimal overhead on transaction latency.
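A minimal sketch of the general idea of fine-grained live migration, under assumed names rather than Squall's actual protocol: keys in a range being reconfigured are pulled from the old partition to the new one on first access, so the database never goes offline.

```cpp
// Minimal sketch of reactive fine-grained migration (not Squall's protocol):
// keys in a range being reconfigured are pulled from the old partition to the
// new one the first time a transaction touches them, so no downtime is needed.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <iostream>

struct Partition { std::unordered_map<uint64_t, std::string> data; };

struct Router {
    Partition* src;      // old owner of the migrating key range
    Partition* dst;      // new owner
    uint64_t lo, hi;     // [lo, hi) is being reconfigured

    // Route a read; migrate the tuple lazily if it still lives at the source.
    std::optional<std::string> get(uint64_t key) {
        Partition* owner = (key >= lo && key < hi) ? dst : src;
        if (owner == dst && !dst->data.count(key) && src->data.count(key)) {
            dst->data[key] = src->data[key];   // pull on demand
            src->data.erase(key);
        }
        auto it = owner->data.find(key);
        if (it == owner->data.end()) return std::nullopt;
        return it->second;
    }
};

int main() {
    Partition p0, p1;
    p0.data = {{1, "a"}, {5, "b"}, {9, "c"}};
    Router r{&p0, &p1, 5, 10};        // keys [5,10) are moving from p0 to p1
    std::cout << r.get(5).value_or("-") << "\n";   // migrated on first access
    std::cout << p1.data.count(5) << "\n";         // 1: now owned by p1
}
```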
Dynamic memory allocators (malloc/free) rely on mutual exclusion locks to protect the consistency of their shared data structures under multithreading. The use of locking has many disadvantages with respect to performance, availability, robustness, and programming flexibility. A lock-free memory allocator guarantees progress regardless of whether some threads are delayed or even killed and regardless of scheduling policies. This paper presents a completely lock-free memory allocator. It uses only widely-available operating system support and hardware atomic instructions. It offers guaranteed availability even under arbitrary thread termination and crash-failure, and it is immune to deadlock regardless of scheduling policies, and hence it can be used even in interrupt handlers and real-time applications without requiring special scheduler support. Also, by leveraging some high-level structures from Hoard, our allocator is highly scalable, limits space blowup to a constant factor, and is capable of avoiding false sharing. In addition, our allocator allows finer concurrency and much lower latency than Hoard. We use PowerPC shared memory multiprocessor systems to compare the performance of our allocator with the default AIX 5.1 libc malloc and two widely-used multithreaded allocators, Hoard and Ptmalloc. Our allocator outperforms the other allocators in virtually all cases, often by substantial margins, under various levels of parallelism and allocation patterns. Furthermore, our allocator also offers the lowest contention-free latency among the allocators, by significant margins.
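The building block such allocators rely on can be illustrated with a Treiber-style lock-free free list; this is a generic sketch of the CAS pattern, not the paper's allocator, and a real allocator must also handle the ABA problem and safe memory reclamation.

```cpp
// Minimal sketch of the kind of CAS-based structure a lock-free allocator is
// built from (not the paper's allocator): a Treiber-style LIFO free list.
// A production allocator must additionally handle the ABA problem (e.g. with
// tagged pointers) and safe memory reclamation.
#include <atomic>
#include <iostream>

struct Block {
    Block* next = nullptr;
};

class FreeList {
public:
    void push(Block* b) {
        Block* old_head = head_.load(std::memory_order_relaxed);
        do {
            b->next = old_head;
        } while (!head_.compare_exchange_weak(old_head, b,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    Block* pop() {
        Block* old_head = head_.load(std::memory_order_acquire);
        while (old_head &&
               !head_.compare_exchange_weak(old_head, old_head->next,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire)) {
        }
        return old_head;   // may be nullptr if the list was empty
    }

private:
    std::atomic<Block*> head_{nullptr};
};

int main() {
    FreeList fl;
    Block a, b;
    fl.push(&a);
    fl.push(&b);
    std::cout << (fl.pop() == &b) << (fl.pop() == &a) << "\n";  // prints "11"
}
```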
Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current database management systems (DBMSs) were not designed for. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts.
To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms on a main-memory DBMS and, using computer simulations, scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude, but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from the ground up and is tightly coupled with the hardware.
Scaling-out a database system typically requires partitioning the database across multiple servers. If applications do not partition perfectly, then transactions accessing multiple partitions end up being distributed, which has well-known scalability challenges. To address them, we describe a high-performance transaction mechanism that uses optimistic concurrency control on a multi-versioned tree-structured database stored in a shared log. The system scales out by adding servers, without partitioning the database.
Our solution is modeled on the Hyder architecture, published by Bernstein, Reid, and Das at CIDR 2011. We present the design and evaluation of the first full implementation of that architecture. The core of the system is a log roll-forward algorithm, called meld, that does optimistic concurrency control. Meld is inherently sequential and is therefore the main bottleneck. Our main algorithmic contributions are optimizations to meld that significantly increase transaction throughput. They use a pipelined design that parallelizes meld onto multiple threads. The slowest pipeline stage is much faster than the original meld algorithm, yielding a 3x improvement in system throughput.
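The conflict test at the heart of a meld-style roll-forward can be sketched as follows, using a flat key-value state and hypothetical names rather than Hyder's binary-tree meld: an intention record commits only if the versions it read are still current in the last committed state.

```cpp
// Minimal sketch of a meld-style log roll-forward (not Hyder's actual tree
// meld): each intention record carries the versions its transaction read;
// roll-forward commits it only if those versions are still current, then
// installs its writes, advancing the last-committed state.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>
#include <iostream>

struct Intention {
    std::unordered_map<std::string, uint64_t> reads;   // key -> version read
    std::unordered_map<std::string, int64_t>  writes;  // key -> new value
};

struct Committed { uint64_t version = 0; int64_t value = 0; };

int main() {
    std::unordered_map<std::string, Committed> state = {{"x", {0, 1}},
                                                        {"y", {0, 2}}};
    // Two intentions that both read x at version 0; only the first can meld.
    std::vector<Intention> log = {
        {{{"x", 0}}, {{"x", 10}}},
        {{{"x", 0}}, {{"x", 20}}},
    };

    for (size_t i = 0; i < log.size(); ++i) {
        bool conflict = false;
        for (const auto& [k, v] : log[i].reads)
            if (state[k].version != v) conflict = true;
        if (conflict) { std::cout << "record " << i << ": abort\n"; continue; }
        for (const auto& [k, val] : log[i].writes) {
            state[k].value = val;
            ++state[k].version;
        }
        std::cout << "record " << i << ": commit\n";
    }
}
```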
We present Centiman, a system for high performance, elastic transaction processing in the cloud. Centiman provides serializability on top of a key-value store with a lightweight protocol based on optimistic concurrency control (OCC).
Centiman is designed for the cloud setting, with an architecture that is loosely coupled and avoids synchronization wherever possible. Centiman supports sharded transaction validation; validators can be added or removed on-the-fly in an elastic manner. Processors and validators scale independently of each other and recover from failure transparently to each other. Centiman's loosely coupled design creates some challenges: it can cause spurious aborts and it makes it difficult to implement common performance optimizations for read-only transactions. To deal with these issues, Centiman uses a watermark abstraction to asynchronously propagate information about transaction commits through the system.
In an extensive evaluation we show that Centiman provides fast elastic scaling, low-overhead serializability for read-heavy workloads, and scales to millions of operations per second.
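The watermark abstraction can be illustrated with a small sketch (hypothetical names, not Centiman's implementation): the watermark is the highest timestamp below which every transaction has completed, so a read-only transaction reading as of the watermark can skip validation.

```cpp
// Minimal sketch of the watermark idea (not Centiman's implementation): the
// watermark is the highest timestamp w such that every transaction with
// timestamp <= w has completed. Read-only transactions that read as of a
// snapshot <= w can skip validation entirely.
#include <algorithm>
#include <cstdint>
#include <set>
#include <iostream>

class WatermarkTracker {
public:
    void begin(uint64_t ts) { in_flight_.insert(ts); }

    void finish(uint64_t ts) {
        in_flight_.erase(ts);
        last_finished_ = std::max(last_finished_, ts);
    }

    // Everything strictly below the oldest in-flight transaction is complete.
    uint64_t watermark() const {
        return in_flight_.empty() ? last_finished_ : *in_flight_.begin() - 1;
    }

private:
    std::set<uint64_t> in_flight_;
    uint64_t last_finished_ = 0;
};

int main() {
    WatermarkTracker wt;
    wt.begin(10); wt.begin(11); wt.begin(12);
    wt.finish(10);
    wt.finish(12);                       // 11 still in flight
    std::cout << wt.watermark() << "\n"; // 10: a reader at ts<=10 skips validation
}
```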
Server hardware is about to drastically change. As typified by emerging hardware such as UC Berkeley's Firebox project and by Intel's Rack-Scale Architecture (RSA), next generation servers will have thousands of cores, large DRAM, and huge NVRAM. We analyze the characteristics of these machines and find that no existing database is appropriate. Hence, we are developing FOEDUS, an open-source, from-scratch database engine whose architecture is drastically different from traditional databases. It extends in-memory database technologies to further scale up and also allows transactions to efficiently manipulate data pages in both DRAM and NVRAM. We evaluate the performance of FOEDUS in a large NUMA machine (16 sockets and 240 physical cores) and find that FOEDUS achieves multiple orders of magnitude higher TPC-C throughput compared to H-Store with anti-caching.
Although the Latin square construction is a well-known way to build low-density parity-check (LDPC) codes with long code length, high code rate, good error-correcting capability, and a low error floor, it has the drawback of a large submatrix, so the hardware implementation suffers from a large barrel shifter and severe routing congestion when fitted to NAND flash applications. In this paper, a top-down design methodology is presented that covers not only code construction and optimization but also a hardware implementation that meets all the critical requirements. A two-step array dispersion algorithm is proposed to construct long LDPC codes with a small submatrix size. Then, the constructed LDPC code is optimized by a masking matrix to obtain better bit-error-rate (BER) performance and a lower error floor. In addition, our LDPC codes have a diagonal-like structure in the parity-check matrix, leading to a proposed hybrid storage architecture, which has the advantages of better area efficiency and sufficient data bandwidth for high decoding throughput. For NAND flash applications, an (18900, 17010) LDPC code with a code rate of 0.9 and a submatrix size of 63 is constructed, and field-programmable gate array simulations show that the error floor is successfully suppressed down to a BER of 10⁻¹². An LDPC decoder using a normalized min-sum variable-node-centric sequential scheduling decoding algorithm is implemented in a UMC 90-nm CMOS process. The post-layout result shows that the proposed LDPC decoder can achieve a throughput of 1.58 Gb/s at six iterations with a gate count of 520k under a clock frequency of 166.6 MHz. It meets the throughput requirement of NAND flash memories with both Toggle double data rate 1.0 and Open NAND Flash Interface 2.3 NAND interfaces.
The release of hardware transactional memory (HTM) in commodity CPUs has major implications on the design and implementation of main-memory databases, especially on the architecture of high-performance lock-free indexing methods at the core of several of these systems. This paper studies the interplay of HTM and lock-free indexing methods. First, we evaluate whether HTM will obviate the need for crafty lock-free index designs by integrating it in a traditional B-tree architecture. HTM performs well for simple data sets with small fixed-length keys and payloads, but its benefits disappear for more complex scenarios (e.g., larger variable-length keys and payloads), making it unattractive as a general solution for achieving high performance. Second, we explore fundamental differences between HTM-based and lock-free B-tree designs. While lock-freedom entails design complexity and extra mechanism, it has performance advantages in several scenarios, especially high-contention cases where readers proceed uncontested (whereas HTM aborts readers). Finally, we explore the use of HTM as a method to simplify lock-free design. We find that using HTM to implement a multi-word compare-and-swap greatly reduces lock-free programming complexity at the cost of only a 10-15% performance degradation. Our study uses two state-of-the-art index implementations: a memory-optimized B-tree extended with HTM to provide multi-threaded concurrency and the Bw-tree lock-free B-tree used in several Microsoft production environments.
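A generic sketch of an HTM-based multi-word compare-and-swap using Intel's RTM intrinsics is shown below; it assumes a TSX-capable CPU and the -mrtm compiler flag, and it illustrates the technique rather than the paper's implementation (a production version needs a fallback path that transactional readers also observe).

```cpp
// Minimal sketch of an HTM-based multi-word compare-and-swap (a generic
// illustration, not the paper's implementation). Requires a TSX-capable CPU
// and compiling with -mrtm; a real design needs a bounded retry policy and a
// non-transactional fallback path that HTM readers also respect.
#include <immintrin.h>
#include <cstddef>
#include <mutex>
#include <iostream>

std::mutex fallback_lock;   // taken when the hardware transaction keeps aborting

// Atomically: if *addr[i] == expected[i] for all i, set *addr[i] = desired[i].
bool mwcas(size_t n, long** addr, const long* expected, const long* desired) {
    for (int attempt = 0; attempt < 8; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            for (size_t i = 0; i < n; ++i)
                if (*addr[i] != expected[i]) { _xabort(0xff); }
            for (size_t i = 0; i < n; ++i) *addr[i] = desired[i];
            _xend();
            return true;
        }
        if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff)
            return false;   // values really differed; no point retrying
    }
    // Fallback: serialize through a lock (sketch only; see caveat above).
    std::lock_guard<std::mutex> g(fallback_lock);
    for (size_t i = 0; i < n; ++i)
        if (*addr[i] != expected[i]) return false;
    for (size_t i = 0; i < n; ++i) *addr[i] = desired[i];
    return true;
}

int main() {
    long a = 1, b = 2;
    long* addrs[2] = {&a, &b};
    long expect[2] = {1, 2}, desire[2] = {10, 20};
    std::cout << mwcas(2, addrs, expect, desire) << " " << a << " " << b << "\n";
}
```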
In-memory key-value stores play a critical role in data processing by providing high-throughput and low-latency data accesses. In-memory key-value stores have several unique properties: (1) data-intensive operations demand high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demand many slim parallel computing units, and (3) the working set is large. As data volume continues to increase, our experiments show that conventional, general-purpose multicore systems are increasingly mismatched to these properties because they do not provide massive data parallelism and high memory bandwidth; their powerful but limited number of computing cores does not satisfy the demands of this data-processing task; and the cache hierarchy provides little benefit for the large working set.
In this paper, we make a strong case for GPUs to serve as special-purpose devices to greatly accelerate the operations of in-memory key-value stores. Specifically, we present the design and implementation of Mega-KV, a GPU-based in-memory key-value store system that achieves high performance and high throughput. Effectively utilizing the high memory bandwidth and latency hiding capability of GPUs, Mega-KV provides fast data accesses and significantly boosts overall performance. Running on a commodity PC installed with two CPUs and two GPUs, Mega-KV can process up to 160+ million key-value operations per second, which is 1.4-2.8 times as fast as the state-of-the-art key-value store system on a conventional CPU-based platform.
RAMCloud is a storage system that provides low-latency access to large-scale datasets. To achieve low latency, RAMCloud stores all data in DRAM at all times. To support large capacities (1PB or more), it aggregates the memories of thousands of servers into a single coherent key-value store. RAMCloud ensures the durability of DRAM-based data by keeping backup copies on secondary storage. It uses a uniform log-structured mechanism to manage both DRAM and secondary storage, which results in high performance and efficient memory usage. RAMCloud uses a polling-based approach to communication, bypassing the kernel to communicate directly with NICs; with this approach, client applications can read small objects from any RAMCloud storage server in less than 5μs, and durable writes of small objects take about 13.5μs. RAMCloud does not keep multiple copies of data online; instead, it provides high availability by recovering from crashes very quickly (1 to 2 seconds). RAMCloud's crash recovery mechanism harnesses the resources of the entire cluster working concurrently so that recovery performance scales with cluster size.
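The uniform log-structured idea can be sketched as follows (hypothetical names, not RAMCloud's code): writes append to a single log and a hash index maps each key to the offset of its newest entry; the same append-only log is what would be shipped to backups.

```cpp
// Minimal sketch of a log-structured in-memory store (not RAMCloud's code):
// every write is appended to a log and a hash index maps each key to the
// offset of its newest entry; the same append-only log could be shipped to
// backups on secondary storage for durability.
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>
#include <iostream>

class LogStore {
public:
    void put(const std::string& key, const std::string& value) {
        size_t offset = log_.size();
        append(key);
        append(value);
        index_[key] = offset;          // newest version wins; the old entry
                                       // becomes garbage for a log cleaner
    }

    std::string get(const std::string& key) const {
        auto it = index_.find(key);
        if (it == index_.end()) return "";
        size_t off = it->second;
        std::string k = read(off);               // stored key
        return read(off + 4 + k.size());         // stored value
    }

private:
    void append(const std::string& s) {
        uint32_t len = static_cast<uint32_t>(s.size());
        log_.insert(log_.end(), reinterpret_cast<char*>(&len),
                    reinterpret_cast<char*>(&len) + 4);
        log_.insert(log_.end(), s.begin(), s.end());
    }
    std::string read(size_t off) const {
        uint32_t len;
        std::copy(log_.begin() + off, log_.begin() + off + 4,
                  reinterpret_cast<char*>(&len));
        return std::string(log_.begin() + off + 4, log_.begin() + off + 4 + len);
    }

    std::vector<char> log_;                         // append-only log
    std::unordered_map<std::string, size_t> index_; // key -> log offset
};

int main() {
    LogStore s;
    s.put("k1", "hello");
    s.put("k1", "world");                // supersedes the first entry
    std::cout << s.get("k1") << "\n";    // prints "world"
}
```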
The Deuteronomy transactional key value store executes millions of serializable transactions/second by exploiting multi-version timestamp order concurrency control. However, it has not supported range operations, only individual record operations (e.g., create, read, update, delete). In this paper, we enhance our multiversion timestamp order technique to handle range concurrency and prevent phantoms. Importantly, we maintain high performance while respecting the clean separation of duties required by Deuteronomy, where a transaction component performs purely logical concurrency control (including range support), while a data component performs data storage and management duties. Like the rest of the Deuteronomy stack, our range technique manages concurrency information in a latch-free manner. With our range enhancement, Deuteronomy can reach scan speeds of nearly 250 million records/s (more than 27 GB/s) on modern hardware, while providing serializable isolation complete with phantom prevention.
With prices of main memory constantly decreasing, people nowadays are more interested in performing their computations in main memory and leaving the high I/O costs of traditional disk-based systems out of the equation. This change of paradigm, however, presents new challenges to the way data should be stored and indexed in main memory in order to be processed efficiently. Traditional data structures, like the venerable B-tree, were designed for disk-based systems, but they are no longer the way to go in main-memory systems, at least not in their original form, due to poor cache utilization. Because of this, during the last decade there has been a considerable amount of research on index data structures for main-memory systems. Among the most recent and most interesting data structures for main-memory systems is the recently proposed adaptive radix tree ARTful (ART for short). The authors of ART presented experiments indicating that ART was clearly a better choice than other recent tree-based data structures like FAST and B+-trees. However, ART was not the first adaptive radix tree. To the best of our knowledge, the first was the Judy Array (Judy for short), and a comparison between ART and Judy was not shown. Moreover, the same set of experiments indicated that only a hash table was competitive with ART. The hash table used by the authors of ART in their study was a chained hash table, but this kind of hash table can be suboptimal in terms of space and performance due to its potentially heavy use of pointers. In this paper we present a thorough experimental comparison between ART, Judy, two variants of hashing via quadratic probing, and three variants of Cuckoo hashing. These hashing schemes are known to be very efficient. For our study we consider whether the data structures are to be used as a non-covering index (relying on an additional store) or as a covering index (covering key-value pairs). We consider both OLAP and OLTP scenarios. Our experiments strongly indicate that neither ART nor Judy is competitive with the aforementioned hashing schemes in terms of performance, and, in the case of ART, sometimes not even in terms of space.
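One of the compared schemes, open addressing with quadratic probing, can be sketched as below; it keeps entries inline in a single array, avoiding the per-entry pointers of a chained table (a generic illustration, not the paper's implementation).

```cpp
// Minimal sketch of open addressing with quadratic probing, one of the
// hashing schemes compared above; unlike a chained table it stores entries
// inline in one array, avoiding per-entry pointer overhead.
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>
#include <iostream>

class QuadraticProbingMap {
public:
    explicit QuadraticProbingMap(size_t capacity)       // power of two assumed;
        : slots_(capacity) {}                            // no resizing here

    bool insert(uint64_t key, uint64_t value) {
        for (size_t i = 0; i < slots_.size(); ++i) {
            size_t idx = probe(key, i);
            if (!slots_[idx] || slots_[idx]->first == key) {
                slots_[idx].emplace(key, value);
                return true;
            }
        }
        return false;                                    // table full
    }

    std::optional<uint64_t> find(uint64_t key) const {
        for (size_t i = 0; i < slots_.size(); ++i) {
            size_t idx = probe(key, i);
            if (!slots_[idx]) return std::nullopt;       // hit an empty slot
            if (slots_[idx]->first == key) return slots_[idx]->second;
        }
        return std::nullopt;
    }

private:
    // Triangular-number probe sequence: visits every slot when the table
    // size is a power of two.
    size_t probe(uint64_t key, size_t i) const {
        return (std::hash<uint64_t>{}(key) + (i * i + i) / 2) % slots_.size();
    }
    std::vector<std::optional<std::pair<uint64_t, uint64_t>>> slots_;
};

int main() {
    QuadraticProbingMap m(64);
    m.insert(42, 7);
    m.insert(106, 8);                                   // may collide with 42
    std::cout << m.find(42).value_or(0) << " "
              << m.find(106).value_or(0) << "\n";       // prints "7 8"
}
```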
Multi-Version Concurrency Control (MVCC) is a widely employed concurrency control mechanism, as it allows for execution modes where readers never block writers. However, most systems implement only snapshot isolation (SI) instead of full serializability. Adding serializability guarantees to existing SI implementations tends to be prohibitively expensive.
We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees. Updating data in-place and storing versions as before-image deltas in undo buffers not only allows us to retain the high scan performance of single-version systems but also forms the basis of our cheap and fine-grained serializability validation mechanism. The novel idea is based on an adaptation of precision locking and verifies that the (extensional) writes of recently committed transactions do not intersect with the (intensional) read predicate space of a committing transaction. We experimentally show that our MVCC model allows very fast processing of transactions with point accesses as well as read-heavy transactions and that there is little need to prefer SI over full serializability any longer.
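The validation idea can be sketched as follows (hypothetical names, not the paper's engine): the committing transaction's read predicates are evaluated against the before- and after-images of recently committed writes, and any intersection forces an abort.

```cpp
// Minimal sketch of precision-locking-style validation (not the paper's
// engine): a committing transaction's read predicates are evaluated against
// the before- and after-images of writes committed during its lifetime; an
// intersection means the reads may be stale, so the transaction aborts.
#include <functional>
#include <string>
#include <vector>
#include <iostream>

struct Row { std::string name; double balance; };

struct CommittedWrite { Row before; Row after; };

using Predicate = std::function<bool(const Row&)>;

bool validate(const std::vector<Predicate>& read_predicates,
              const std::vector<CommittedWrite>& recent_writes) {
    for (const auto& p : read_predicates)
        for (const auto& w : recent_writes)
            if (p(w.before) || p(w.after)) return false;   // intersection: abort
    return true;                                            // reads unaffected
}

int main() {
    // The committing transaction read "all rows with balance < 100".
    std::vector<Predicate> preds = {
        [](const Row& r) { return r.balance < 100; }};

    // A concurrent transaction moved bob from 150 to 50: it now matches the
    // predicate, so the reader's result set may be stale.
    std::vector<CommittedWrite> writes = {
        {{"bob", 150.0}, {"bob", 50.0}}};

    std::cout << (validate(preds, writes) ? "commit" : "abort") << "\n";
}
```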
The Q100 uses hardware specialization to improve the energy efficiency of analytic database applications. The proposed accelerators are called database processing units. DPUs are analogous to GPUs, but where GPUs target graphics applications, DPUs target analytic database workloads. This article demonstrates a proof of concept design, called the Q100, which provides one to two orders of magnitude improvement in efficiency over single- and multithreaded software database management systems. The Q100 exploits the innate structure of the workload, viewing the data in terms of tables and columns rather than as an unstructured array of bytes, to more efficiently move and manipulate database content. Because a large proportion of a chip's energy is spent delivering data to the computation engines, this approach both improves overall energy efficiency and complements other specialized computation engines.
Five years ago I proposed a common database approach for transaction processing and analytical systems using a columnar in-memory database, disputing the common belief that column stores are not suitable for transactional workloads. Today, the concept has been widely adopted in academia and industry and it is proven that it is feasible to run analytical queries on large data sets directly on a redundancy-free schema, eliminating the need to maintain pre-built aggregate tables during data entry transactions. The resulting reduction in transaction complexity leads to a dramatic simplification of data models and applications, redefining the way we build enterprise systems. First analyses of productive applications adopting this concept confirm that system architectures enabled by in-memory column stores are conceptually superior for business transaction processing compared to row-based approaches. Additionally, our analyses show a shift of enterprise workloads to even more read-oriented processing due to the elimination of updates of transaction-maintained aggregates.
Modern data appliances face severe bandwidth bottlenecks when moving vast amounts of data from storage to the query processing nodes. A possible solution to mitigate these bottlenecks is query off-loading to an intelligent storage engine, where partial or whole queries are pushed down to the storage engine. In this paper, we present Ibex, a prototype of an intelligent storage engine that supports off-loading of complex query operators. Besides increasing performance, Ibex also reduces energy consumption, as it uses an FPGA rather than conventional CPUs to implement the off-load engine. Ibex is a hybrid engine, with dedicated hardware that evaluates SQL expressions at line-rate and a software fallback for tasks that the hardware engine cannot handle. Ibex supports GROUP BY aggregation, as well as projection- and selection-based filtering. GROUP BY aggregation has a higher impact on performance but is also a more challenging operator to implement on an FPGA.