Article

Abstract

Many of today’s applications need massive real-time data processing. In-memory database systems have become a good alternative for meeting these requirements. These systems maintain the primary copy of the database in main memory to achieve high throughput rates and low latency. However, a database kept in RAM is more vulnerable to failures than a traditional disk-oriented database because of memory volatility. DBMSs implement recovery activities (logging, checkpoint, and restart) for recovery purposes. Although the recovery component looks similar in disk- and memory-oriented systems, these systems differ dramatically in the way they implement their architectural components, such as data storage, indexing, concurrency control, query processing, durability, and recovery. This survey aims to provide a thorough review of in-memory database recovery techniques. To achieve this goal, we review the main concepts of database recovery and the architectural choices involved in implementing an in-memory database system. Only then do we present the techniques to recover in-memory databases and discuss the recovery strategies of a representative sample of modern in-memory databases.
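The recovery activities named above (logging, checkpoint, and restart) can be illustrated with a minimal sketch. The C++ snippet below is not taken from the survey or from any particular system; it assumes a simplified in-memory key-value store (the names RedoRecord, Checkpoint, and MemDb are invented for illustration) that appends redo-only log records during normal processing and, after a crash, reloads the latest checkpoint and replays the log tail.

// Minimal sketch of redo-only logging, checkpointing, and restart for an
// in-memory key-value store. Illustrative only; names and structure are
// assumptions, not the survey's API.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct RedoRecord {              // redo-only: after-image of the update
    uint64_t    lsn;             // log sequence number
    std::string key;
    std::string after_image;
};

struct Checkpoint {              // snapshot taken as of checkpoint_lsn
    uint64_t                           checkpoint_lsn = 0;
    std::map<std::string, std::string> snapshot;
};

class MemDb {
public:
    // Normal processing: apply in memory, then append a redo record (WAL).
    void put(const std::string& key, const std::string& value) {
        table_[key] = value;
        log_.push_back({next_lsn_++, key, value});
    }

    // Checkpoint: copy the in-memory state; log records older than
    // checkpoint_lsn are no longer needed for restart.
    Checkpoint checkpoint() const { return {next_lsn_, table_}; }

    // Restart after a crash: load the snapshot, then redo the log tail
    // (the log parameter stands in for the persistent log on stable storage).
    void restart(const Checkpoint& cp, const std::vector<RedoRecord>& log) {
        table_ = cp.snapshot;
        for (const auto& rec : log)
            if (rec.lsn >= cp.checkpoint_lsn)   // only records after the checkpoint
                table_[rec.key] = rec.after_image;
        next_lsn_ = log.empty() ? cp.checkpoint_lsn : log.back().lsn + 1;
    }

private:
    std::map<std::string, std::string> table_;   // primary copy in RAM
    std::vector<RedoRecord>            log_;     // stand-in for the persistent log
    uint64_t                           next_lsn_ = 1;
};

Real systems differ mainly in how the log and checkpoint reach stable storage and in how much of this work overlaps with new transactions, which is the design space the survey covers.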

... However, modern MMDB systems cannot achieve high performance under hybrid workloads that include transactions updating the database while complex analytical queries are executed on the same dataset [6][7][8][9][10][11][12][13]. For hybrid workloads, some studies combine the advantages of the row-store and column-store storage models to design the row+column storage model [14][15][16][17][18]. ...
... We implement our Index-Organized storage model based on PelotonDB [14] with approximately 13,000 lines of C++ code. PelotonDB is an MMDB optimized for high-performance transaction processing [7], providing row-store, column-store, and row+column storage models. Even though the Peloton project has been discontinued, many research works based on PelotonDB continue [7,23,[58][59][60]. ...
Preprint
Full-text available
Large-scale data-intensive applications need massive real-time data processing. Recent hybrid DRAM-PM main memory database systems provide an effective approach by persisting data to persistent memory (PM) in an append-based manner for efficient storage while maintaining the primary database copy in DRAM for high throughput rates. However, they cannot achieve high performance under a hybrid workload because they are unaware of the impact of pointer chasing. In this work, we investigate the impact of chasing pointers on modern main memory database systems to eliminate this bottleneck. We propose an Index-Organized storage model that supports efficient reads and updates. We combine two techniques, i.e., cacheline-aligned node layout and cache prefetching, to accelerate pointer chasing, reducing memory access latency. We present four optimizations, i.e., pending versions, fine-grained memory management, Index-SSN, and cacheline-aligned writes, for supporting efficient transaction processing and fast logging. We implement our proposed storage model based on an open-sourced main memory database system. We extensively evaluate performance on a 20-core system featuring Intel Optane DC Persistent Memory Modules. Our experiments reveal that the Index-Organized approach achieves up to a 3× speedup compared to traditional storage models (row-store, column-store, and row+column).
... However, modern MMDB systems cannot achieve high performance under hybrid workloads that include transactions updating the database while complex analytical queries are executed on the same dataset [6][7][8][9][10][11][12][13]. For hybrid workloads, some studies combine the advantages of the row-store and column-store data layouts to design the row+column data layout [14][15][16][17][18]. ...
... We implement our Index-Organized data layout based on PelotonDB [14] with approximately 13,000 lines of C++ code. PelotonDB is an MMDB optimized for high-performance transaction processing [7], providing row-store, column-store, and row+column data layouts. Even though the Peloton project has been discontinued, many research works based on PelotonDB continue [7,23,[56][57][58]. ...
Preprint
Full-text available
Large-scale data-intensive applications need massive real-time data processing. Recent hybrid DRAM-PM main memory database systems provide an effective approach by persisting data to persistent memory (PM) in an append-based manner for efficient storage while maintaining the primary database copy in DRAM for high throughput rates. However, they fail to achieve high performance under a hybrid workload because they are unaware of the impact of pointer chasing. In this work, we investigate the impact of chasing pointers on modern main memory database systems to eliminate this bottleneck. We propose an Index-Organized data layout that supports efficient reads and updates. We combine two techniques, i.e., cacheline-aligned node layout and cache prefetching, to accelerate pointer chasing, reducing memory access latency. We present four optimizations, i.e., pending versions, fine-grained memory management, Index-SSN, and cacheline-aligned writes, for supporting efficient transaction processing and fast logging. We implement our proposed data layout based on an open-sourced main memory database system. We conduct extensive evaluations on a 20-core machine equipped with Intel Optane DC Persistent Memory Modules. Experimental results demonstrate that Index-Organized achieves up to a 3x speedup over the conventional data layouts, i.e., row-store, column-store, and row+column.
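The two pointer-chasing optimizations named in the abstract, a cacheline-aligned node layout and cache prefetching, can be sketched as follows. This is only a minimal illustration under an assumed 64-byte node format; it is not the Index-Organized implementation, and __builtin_prefetch is simply the GCC/Clang prefetch intrinsic.

// Sketch of cacheline-aligned index nodes plus software prefetching while
// chasing child pointers. Illustrative only; the node layout is an assumption.
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// Keep one node exactly one cache line wide so a probe touches a single line.
struct alignas(kCacheLine) Node {
    uint64_t keys[6];      // 48 bytes of keys
    Node*    child;        // next node on the search path
    uint8_t  count;        // number of valid keys
    uint8_t  pad[7];       // pad to 64 bytes
};
static_assert(sizeof(Node) == kCacheLine, "node must fit one cache line");

// Walk a chain of nodes, prefetching the next node before scanning the
// current one so the memory access overlaps with the key comparisons.
bool contains(const Node* n, uint64_t key) {
    while (n != nullptr) {
        if (n->child != nullptr)
            __builtin_prefetch(n->child, /*rw=*/0, /*locality=*/3);
        for (uint8_t i = 0; i < n->count; ++i)
            if (n->keys[i] == key)
                return true;
        n = n->child;
    }
    return false;
}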
... Nonetheless, in the nineties, advances in hardware technology renewed interest in MMDB research. New memory technologies have provided larger storage capacity at lower cost [15,51]. Moreover, other recent hardware/architecture improvements have boosted the development of MMDBs, such as non-volatile memory [2], NUMA architecture [41], SIMD instructions [85], RDMA networking [56], and hardware transactional memory [43]. ...
... MMDBs avoid the traditional design of disk-resident databases for performance reasons. The fact that the database resides in volatile storage influences the design approaches adopted by MMDBs, such as data storage, concurrency control, query processing, indexing, and durability and recovery [15,51]. Due to lack of space, this section details only durability and recovery. ...
... In [51], the authors provide an in-depth survey on recovery techniques for MMDBs. Besides, the authors detail the main features of recovery mechanisms delivered by well-known MMDBs. ...
Article
Full-text available
Main memory database (MMDB) technology keeps the primary database in Random Access Memory (RAM) to provide high throughput and low latency. However, volatile memory makes MMDBs much more sensitive to system failures. The contents of the database are lost in these failures and, as a result, systems may be unavailable for a long time until the database recovery process has finished. Therefore, novel recovery techniques are needed to repair crashed MMDBs as quickly as possible. This paper presents MM-DIRECT (Main Memory Database Instant RECovery with Tuple consistent checkpoint), a recovery technique that enables MMDBs to schedule transactions simultaneously with the database recovery process at system startup. Thus, it gives the impression that the database is instantly restored. The approach implements a tuple-level consistent checkpoint to reduce recovery time. To validate the proposed approach, experiments were performed in a prototype implemented on the Redis database. The results show that the instant recovery technique effectively provides high transaction throughput rates during the recovery process as well as during normal database processing.
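The core idea of scheduling transactions while recovery is still running can be sketched as on-demand redo: a tuple that has not been recovered yet is redone when a transaction first touches it, while a background pass restores the rest. The C++ sketch below uses invented names (InstantRecoveryStore, recoverStep) and ignores checkpoints, concurrency, and undo; it is not MM-DIRECT's code.

// Sketch of instant-recovery-style on-demand redo: a transaction touching a
// not-yet-recovered tuple triggers redo for that tuple only; everything else
// is restored by a background pass. Illustrative structures only.
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct LogRecord {
    std::string key;
    std::string after_image;
};

class InstantRecoveryStore {
public:
    explicit InstantRecoveryStore(std::vector<LogRecord> log_tail)
        : log_tail_(std::move(log_tail)) {}

    // Called by new transactions right after startup, before recovery ends.
    std::optional<std::string> get(const std::string& key) {
        if (!recovered_.count(key))
            recoverTuple(key);                      // on-demand redo for this key
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

    // Background recovery: redo remaining tuples while transactions run.
    void recoverStep() {
        if (next_ >= log_tail_.size()) return;
        const std::string& key = log_tail_[next_++].key;
        if (!recovered_.count(key)) recoverTuple(key);
    }

private:
    void recoverTuple(const std::string& key) {
        // Replay every log record for this key, in log order.
        for (const auto& rec : log_tail_)
            if (rec.key == key)
                table_[rec.key] = rec.after_image;
        recovered_.insert(key);
    }

    std::vector<LogRecord>                       log_tail_;
    std::unordered_map<std::string, std::string> table_;
    std::unordered_set<std::string>              recovered_;
    std::size_t                                  next_ = 0;
};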
... MMDBs keep the database in RAM to achieve very high IOPS (Input/Output Operations Per Second) rates. Such a feature makes MMDBs much more sensitive to system failures since it causes loss of main memory content [Magalhães et al. 2021a, Wu et al. 2017, Faerber et al. 2017]. ...
... MMDBs produce only Redo log records of modified data to reduce the amount of data written to secondary storage. The commit processing uses group commit, i.e., it tries to group multiple log records into one large I/O [Magalhães et al. 2021a, Stonebraker and Weisberg 2013, Faerber et al. 2017]. ...
... Thus, the recovery manager should load the last snapshot into memory and redo log records. The system can process new transactions only after complete recovery [Magalhães et al. 2021a, Stonebraker and Weisberg 2013, Faerber et al. 2017]. ...
Conference Paper
Main Memory Database (MMDB) technology keeps the primary database in Random Access Memory (RAM) to provide high throughput and low latency. However, volatile memory makes MMDBs much more sensitive to system failures. The contents of the database are lost in these failures. As a result, systems may be unavailable for a long time until the database recovery process has finished. Therefore, novel recovery techniques are needed to repair crashed MMDBs as quickly as possible. This thesis presents MM-DIRECT, a recovery technique that enables MMDBs to schedule transactions immediately after system startup. The approach also implements a tuple-level consistent checkpoint to reduce recovery time. To validate the proposed approach, experiments were performed in a prototype implemented on the Redis database. The results show that the proposed instant recovery technique effectively provides high transaction throughput rates during the recovery process as well as during normal database processing.
... This tutorial focuses on relational MMDB recovery. However, the recovery strategies presented are implemented similarly in other types of MMDBs [6,29,55]. ...
... In addition, he is a Ph.D. student at the Federal University of Ceara, where he has been researching the MMDB recovery area. During his doctoral studies, he published some relevant articles in his area of research, such as [29], [27], and [28]. He has areas of interest in database and software engineering, working mainly on the following themes: self-tuning databases, cloud databases, and in-memory databases. ...
... However, the recovery strategies presented are implemented similarly in other types of MMDBs. Besides, the tutorial introduces modern recovery techniques, such as instant recovery [Magalhães et al. 2021b, Magalhães 2021, Magalhães et al. 2021a, Sauer 2019]. ...
... 6. Main challenges and future directions: This section intends to discuss some aspects related to challenges and future directions of research in MMDBs in order to provide guidance for other researchers. [Magalhães et al. 2021b], [Magalhães 2021], and [Magalhães et al. 2021a]. He has areas of interest in database and software engineering, working mainly on the following themes: self-tuning databases, cloud databases, and in-memory databases. ...
Conference Paper
Main Memory Database (MMDB) technology has been an efficient alternative for high-performance, real-time, mission-critical applications. However, MMDBs are more vulnerable to crashes due to memory volatility. Although the recovery component looks similar on disk- and memory-oriented systems, these systems differ dramatically in the way they implement their architectural components. This tutorial aims to provide a complete review of MMDB recovery techniques. To achieve this goal, the tutorial reviews key database recovery concepts and MMDB implementations. Only then do we introduce MMDB recovery techniques and discuss recovery strategies for a representative sample of modern MMDBs.
... Many applications need massive real-time data analysis and parallel processing in today's big data era. In-memory database systems, which keep the database in random access memory instead of on disk, have become an efficient technology for achieving high throughput and low latency (Magalhaes et al., 2021). In-memory database technology arose in the 1980s (Margaret, 1988), but its development relies on recent advances such as larger storage capacity, better hardware architecture, lower running overhead, etc. ...
... Despite the above advances, in-memory database systems are more vulnerable than disk-based database systems (Magalhaes et al., 2021) due to the nature of their storage medium, as memory errors, failures, and even crashes are harder to predict and prevent than those suffered by disk-based database systems. For this reason, it becomes a critical problem to design data backup and recovery policies that cope with memory failures in in-memory database management systems. ...
Article
Full-text available
In-memory database systems are becoming an efficient technology for achieving high throughput and low latency. However, they are more vulnerable than disk-based database systems because of the volatile medium they run on, so designing data backup and recovery policies that cope with memory failures becomes a critical problem for in-memory database management systems. From this viewpoint, this paper first describes the stochastic processes of bulk-data updates, triggers of failure-oblivious computing, and in-memory database failures, and then models the expected cost rates for data backup and recovery when full backups are performed at time T and at bulk-data update N, respectively. In order to compare the policies of T and N, integrated models of the backup policies performed at time T and at bulk-data update N are studied, using the triggering approaches of first and last from maintenance theory. Furthermore, the policies of T and N are reconsidered when the full backup is planned at the completion of the forthcoming bulk-data update. In addition, a cumulative cost of failure-oblivious computing is considered in the modified backup policies. All of the expected cost rates and their optimal full-backup policies are obtained analytically.
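The paper's expected cost rates are not reproduced here, but such backup models typically take the standard renewal-reward shape from maintenance theory. The lines below are only a generic illustration with assumed notation (c_B backup cost, c_R failure-plus-recovery cost, F failure-time distribution); the actual models in the paper also account for bulk-data updates and failure-oblivious computing.

% Generic renewal-reward cost rate, shown only to illustrate the shape such
% full-backup models take; the notation (c_B, c_R, F) is assumed for
% illustration and is not the cited paper's formulation.
% A renewal cycle ends either at the planned full backup (age T) or at a
% database failure followed by recovery, whichever occurs first.
\[
  C(T) \;=\; \frac{c_B\,\overline{F}(T) \;+\; c_R\,F(T)}
                  {\displaystyle\int_{0}^{T} \overline{F}(t)\,\mathrm{d}t},
  \qquad \overline{F}(t) \equiv 1 - F(t).
\]
% The policy "back up at bulk-data update N" has the same structure, with the
% deterministic horizon T replaced by the random arrival time of the N-th
% update; the optimal T* or N* is the value minimizing C.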
... Today's HACs make a distinction between the different layers of an application. In protecting a layer 3 component (i.e., database), a HAC can either manage it by employing a database-specific extension (agent) or utilising replica or mirroring features that are offered natively by the database [96]. ...
... Subsequently, states are replicated to a standby application or service that is hosted on the secondary node [61]. A more advanced approach is a State Machine Replication (SMR) which creates replicas of client and process states to one or more nodes deterministically [141], which can even support more comprehensive solutions such as databases [142,96]. An example of SMR concerning a HAC is an implementation of a HAC for HPC, which employed SMR to synchronise states between nodes in a symmetric active-active topology [83]. ...
Preprint
Full-text available
The delivery of key services in domains ranging from finance and manufacturing to healthcare and transportation is underpinned by a rapidly growing number of mission-critical enterprise applications. Ensuring the continuity of these complex applications requires the use of software-managed infrastructures called high-availability clusters (HACs). HACs employ sophisticated techniques to monitor the health of key enterprise application layers and of the resources they use, and to seamlessly restart or relocate application components after failures. In this paper, we first describe the manifold uses of HACs to protect essential layers of a critical application and present the architecture of high availability clusters. We then propose a taxonomy that covers all key aspects of HACs -- deployment patterns, application areas, types of cluster, topology, cluster management, failure detection and recovery, consistency and integrity, and data synchronisation; and we use this taxonomy to provide a comprehensive survey of the end-to-end software solutions available for the HAC deployment of enterprise applications. Finally, we discuss the limitations and challenges of existing HAC solutions, and we identify opportunities for future research in the area.
... In the asynchronous log transfer phase, logs generated by incoming transactions are transferred to the destination node and replayed to ensure that the destination node obtains the latest data version. In-memory databases utilize redo logs to persist data for fault recovery [32]. Aion leverages these redo logs directly without incurring additional overhead. ...
Article
Full-text available
Distributed in-memory databases are widely adopted to achieve low latency and high bandwidth for data-intensive applications. They support scale-out by sharding and distributing data across multiple nodes. To efficiently adapt to various workloads, distributed in-memory databases must be capable of migrating shards across nodes. In this paper, we demonstrate that state-of-the-art approaches experience significant performance degradation during migration due to service downtime and redundant data transfer. Furthermore, our findings indicate that the presence of service downtime constrains the scalability of migration strategies, while the transfer of redundant data during the snapshot transfer phase limits their adaptability to dynamic workloads. To this end, this paper proposes Aion, a live migration strategy designed for distributed in-memory databases. Aion eliminates any potential service downtime by immediately switching transaction routing to the destination node. To ensure data consistency between the source and destination nodes, as well as serializable execution during migration, Aion proposes the mutual validation phase. Moreover, Aion introduces an analysis phase before the snapshot transfer phase to identify dynamically changing hotspots in workloads. The analysis phase identifies and transfers tuples and versions accessed less frequently to the destination node, reducing the amount of data transferred. Aion is implemented on a distributed in-memory database and evaluated using various OLTP workloads. The results demonstrate that Aion can fundamentally eliminate service downtime, adapt effectively to various workloads and exhibit robust scalability. Compared to state-of-the-art approaches, Aion achieves up to 2.25x–6.57x higher throughput during migration and shortens the migration duration by 53.7–68.2%.
... In other words, in the event of a database failure where workload interruption or restart is inevitable, all data may be lost. Therefore, many in-memory database products use the following methods during database failure recovery to ensure persistence [11,12]: ...
Article
Full-text available
As the demand for container technology and platforms increases due to the efficiency of IT resources, various workloads are being containerized. Although there are efforts to integrate various workloads into Kubernetes, the most widely used container platform today, the nature of containers makes it challenging to support persistence for memory-centric workloads like in-memory databases. In this paper, we discuss the drawbacks of one of the persistence support methods used for in-memory databases in a Kubernetes environment, namely, the data snapshot. To address these issues, we propose a compromise solution of using container checkpoints. Through this approach, we can perform checkpointing without incurring additional memory usage due to CoW, which is a problem in fork-based data snapshots during snapshot creation. Additionally, container checkpointing induces up to 7.1 times less downtime compared to the main process-based data snapshot. Furthermore, during database recovery, it is possible to achieve up to 11.3 times faster recovery compared to the data snapshot method.
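The fork-based data snapshot that the paper compares against container checkpointing works roughly as sketched below: the parent process keeps serving requests while a forked child serializes a copy-on-write view of memory, and pages the parent modifies in the meantime are duplicated by the kernel (the extra memory usage mentioned above). This POSIX C++ sketch uses invented names (Table, snapshot_async) and is not the code of Redis or any other in-memory database.

// Minimal sketch of a fork-based (copy-on-write) data snapshot, the approach
// the abstract compares against container checkpointing. POSIX only.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>

using Table = std::map<std::string, std::string>;

static void save_snapshot(const Table& table, const char* path) {
    FILE* f = std::fopen(path, "w");
    if (!f) std::exit(1);
    for (const auto& [key, value] : table)
        std::fprintf(f, "%s\t%s\n", key.c_str(), value.c_str());
    std::fclose(f);
}

// Parent returns immediately and keeps processing writes; pages it modifies
// after the fork are duplicated by the kernel (the CoW memory overhead the
// abstract refers to).
pid_t snapshot_async(const Table& table, const char* path) {
    pid_t pid = fork();
    if (pid == 0) {                       // child: sees a frozen view
        save_snapshot(table, path);
        _exit(0);
    }
    return pid;                           // parent: child pid, or -1 on error
}

int main() {
    Table db = {{"k1", "v1"}, {"k2", "v2"}};
    pid_t child = snapshot_async(db, "/tmp/snapshot.tsv");
    db["k1"] = "v1-updated";              // parent continues; triggers CoW
    if (child > 0) waitpid(child, nullptr, 0);
    return 0;
}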
... In the modern computing world, HTAP generally adopts in-memory techniques [1]. Storing the entire database structure in the memory is the signature characteristic of in-memory databases. ...
Article
Full-text available
The increasing demand for the simultaneous transaction and review of the data for either decision making or forecasting has created a need for faster and better Hybrid Transactional/Analytical Processing (HTAP). This paper emphasizes the speedup of Online Analytical Processing (OLAP) operations in an HTAP environment where analytical queries are mainly repetitive and contain non-indexed keys as their predicates. Zone maps and materialized views are popular approaches adopted by more extensive databases to address this issue. However, they are absent in in-memory databases because of space constraints. Instead, in-memory databases load the cache with result pages of frequently accessed queries. Increasing the number of such queries can fill the cache and raise the system’s overhead. This paper presents Query_Dictionary, a hybrid storage solution that leverages the full capabilities of SQLite by retaining less information of repetitive queries in the cache and efficiently accommodating the newly updated data by the end-user. The solution proposes storing page-level metadata query information for a larger result set and row-level information for a smaller result set. It demonstrates Query_Dictionary capabilities on three types of representative queries: single table, binary join, and transactional queries on non-indexed attributes. In comparison with SQLite, the proposed method performs better.
Article
A database replay system (DRS) captures workloads on a production system and then replays them in a test system to test various system changes, avoiding any risk before realizing them in production. The dependency graph generation in a DRS is crucial in preserving output determinism while maximizing concurrency. The state-of-the-art dependency graph generation algorithm deployed in a commercial DBMS uses a generate-and-prune strategy. It first generates a dependency graph by performing backward scans for each request in a workload. It then prunes all redundant edges using an expensive, transitive reduction algorithm. However, we notice that this generates a large dependency graph that contains many redundant edges and its worst-case time complexity is quadratic to the number of requests in a workload. In order to solve these challenging problems, we formally propose four classes of dependency graphs for DRSs. We then present a stateful single forward scan algorithm, SSFS, to generate any class of dependency graphs by performing a single scan over all requests while succinctly maintaining states. Here, states refer to information that is stored and maintained for efficient dependency graph generation. We also propose the parallel SSFS to utilize the computation power with multi-core CPUs while balancing the loads. We implemented our DRS in a leading commercial DBMS. Extensive experiments using the TPC-C, SD benchmarks, and a real-world customer workload show that our DRS significantly improves the dependency graph generation time by up to two orders of magnitude, compared to the state-of-the-art.
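The stateful single forward scan idea can be illustrated with a simplified sketch: while scanning requests in order, keep per-object state (the last writer and the readers seen since that write) and emit dependency edges directly, instead of performing a backward scan per request. The C++ sketch below is an assumed, reduced read/write-conflict version for illustration only, not the paper's SSFS algorithm or its dependency-graph classes.

// Simplified sketch of a stateful single forward scan over a request log:
// per-object state lets us emit dependency edges in one pass.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Request {
    uint64_t    id;
    std::string object;   // accessed object (e.g., a row or page id)
    bool        is_write;
};

struct ObjectState {
    int64_t               last_writer = -1;  // request id, -1 if none yet
    std::vector<uint64_t> readers_since_write;
};

// Returns edges (from, to): "from" must be replayed before "to".
std::vector<std::pair<uint64_t, uint64_t>>
build_dependencies(const std::vector<Request>& log) {
    std::unordered_map<std::string, ObjectState> state;
    std::vector<std::pair<uint64_t, uint64_t>> edges;

    for (const Request& r : log) {             // single forward scan
        ObjectState& s = state[r.object];
        if (r.is_write) {
            // A write depends on readers since the last write; if there were
            // none, it depends on the last writer directly.
            for (uint64_t reader : s.readers_since_write)
                edges.emplace_back(reader, r.id);
            if (s.readers_since_write.empty() && s.last_writer >= 0)
                edges.emplace_back(static_cast<uint64_t>(s.last_writer), r.id);
            s.last_writer = static_cast<int64_t>(r.id);
            s.readers_since_write.clear();
        } else {
            // A read depends on the last writer of the object.
            if (s.last_writer >= 0)
                edges.emplace_back(static_cast<uint64_t>(s.last_writer), r.id);
            s.readers_since_write.push_back(r.id);
        }
    }
    return edges;
}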
Article
Having dominated databases and various data management systems for decades, the B⁺-tree is infamously subject to a logging dilemma: one could improve B⁺-tree speed performance by equipping it with a larger log, which nevertheless will degrade its crash recovery speed. Such a logging dilemma is particularly prominent in the presence of modern workloads that involve intensive small writes. In this paper, we propose a novel solution, called per-page logging based B⁺-tree, which leverages the emerging computational storage drive (CSD) with built-in transparent compression to fundamentally resolve the logging dilemma. Our key idea is to divide the large single log into many small (e.g., 4KB), highly compressible per-page logs, each being statically bound to a B⁺-tree page. All per-page logs together form a very large over-provisioned log space for the B⁺-tree to improve its operational speed performance. Meanwhile, during crash recovery, the B⁺-tree does not need to scan any per-page logs, leading to a recovery latency independent of the total log size. We have developed and open-sourced a fully functional prototype. Our evaluation results show that, under small-write intensive workloads, our design solution can improve B⁺-tree operational throughput by up to 625.6% and maintain a crash recovery time of as low as 19.2 ms, while incurring a minimal storage overhead of only 0.5-1.6%.
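The per-page logging idea can be sketched as follows: each page carries its own small log area, small writes append a record there, and the records are folded into the page image only when the page is next read or rewritten, so crash recovery never scans a global log. The sizes, structures, and names below are assumptions for illustration; the cited design additionally relies on the CSD's built-in transparent compression to make the per-page logs cheap.

// Sketch of per-page logging: each 4 KB page is paired with its own small log
// area; updates append a record there, and recovery/reads fold the records
// into the image lazily. Illustrative sizes and layout only.
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kPageSize   = 4096;
constexpr std::size_t kMaxPerPage = 16;     // per-page log capacity (assumed)

struct PageLogRecord {
    uint16_t offset;                        // where in the page to apply
    uint16_t length;
    uint8_t  data[64];                      // small after-image
};

struct LoggedPage {
    uint8_t                    image[kPageSize]{};   // last materialized image
    std::vector<PageLogRecord> log;                  // page-local redo records
};

// Fast path for small writes: append to the page-local log only.
bool append_update(LoggedPage& p, uint16_t off, const uint8_t* src, uint16_t len) {
    if (len > sizeof(PageLogRecord::data) ||
        static_cast<std::size_t>(off) + len > kPageSize ||
        p.log.size() >= kMaxPerPage)
        return false;                       // caller must materialize the page
    PageLogRecord rec{off, len, {}};
    std::memcpy(rec.data, src, len);
    p.log.push_back(rec);
    return true;
}

// On read (or when the per-page log fills up): fold the records into the
// image. No global log scan is ever needed, so restart cost stays constant.
void materialize(LoggedPage& p) {
    for (const PageLogRecord& rec : p.log)
        std::memcpy(p.image + rec.offset, rec.data, rec.length);
    p.log.clear();
}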
Article
The delivery of key services in domains ranging from finance and manufacturing to healthcare and transportation is underpinned by a rapidly growing number of mission-critical enterprise applications. Ensuring the continuity of these complex applications requires the use of software-managed infrastructures called high-availability clusters (HACs). HACs employ sophisticated techniques to monitor the health of key enterprise application layers and of the resources they use, and to seamlessly restart or relocate application components after failures. In this paper, we first describe the manifold uses of HACs to protect essential layers of a critical application and present the architecture of high availability clusters. We then propose a taxonomy that covers all key aspects of HACs—deployment patterns, application areas, types of cluster, topology, cluster management, failure detection and recovery, consistency and integrity, and data synchronisation; and we use this taxonomy to provide a comprehensive survey of the end-to-end software solutions available for the HAC deployment of enterprise applications. Finally, we discuss the limitations and challenges of existing HAC solutions, and we identify opportunities for future research in the area.
Article
Full-text available
Traditional disk-resident OLTP systems were mainly designed for computers with relatively small memory. Driven by the advance of hardware, OLTP systems need to be redesigned for larger memory and multi-core environments. Compared to disk-resident systems, in-memory systems have significant performance advantages, from the perspectives of both transaction throughput and query latency. Their performance is no longer limited by disk I/Os. Instead, the efficiency and scalability over multi-core CPUs become more important. In this paper, we survey and summarize a wide spectrum of design and implementation considerations that may affect the efficiency or scalability of an in-memory OLTP system. These considerations are concerned with most of the main components of databases, including concurrency control, logging, indexing and transaction compilation. For each of the components, we provide some in-depth analysis based on recent research works. This survey also aims to provide some guidance for designing or implementing high-performance in-memory OLTP systems.
Article
Full-text available
This article provides an overview of recent developments in main-memory database systems. With growing memory sizes and memory prices dropping by a factor of 10 every 5 years, data having a “primary home” in memory is now a reality. Main-memory databases eschew many of the traditional architectural pillars of relational database systems that optimized for disk-resident data. The result of these memory-optimized designs are systems that feature several innovative approaches to fundamental issues (e.g., concurrency control, query processing) that achieve orders of magnitude performance improvements over traditional designs. Our survey covers five main issues and architectural choices that need to be made when building a high performance main-memory optimized database: data organization and storage, indexing, concurrency control, durability and recovery techniques, and query processing and compilation. We focus our survey on four commercial and research systems: H-Store/VoltDB, Hekaton, HyPer, and SAP HANA. These systems are diverse in their design choices and form a representative sample of the state of the art in main-memory database systems. We also cover other commercial and academic systems, along with current and future research trends.
Article
Full-text available
Instant recovery improves system availability by reducing the mean time to repair, i.e., the interval during which a database is not available for queries and updates due to recovery activities. Variants of instant recovery pertain to system failures, media failures, node failures, and combinations of multiple failures. After a system failure, instant restart permits new transactions immediately after log analysis, before and concurrent to “redo” and “undo” recovery actions. After a media failure, instant restore permits new transactions immediately after allocation of a replacement device, before and concurrent to restoring backups and replaying the recovery log. Write-ahead logging is already ubiquitous in data management software. The recent definition of single-page failures and techniques for log-based single-page recovery enable immediate, lossless repair after a localized wear-out in novel or traditional storage hardware. In addition, they form the backbone of on-demand “redo” in instant restart, instant restore, and eventually instant failover. Thus, they complement on-demand invocation of traditional single-transaction “undo” or rollback. In addition to these instant recovery techniques, the discussion introduces self-repairing indexes and much faster offline restore operations, which impose no slowdown in backup operations and hardly any slowdown in log archiving operations. The new restore techniques also render differential and incremental backups obsolete, complete backup commands on a database server practically instantly, and even permit taking full up-to-date backups without imposing any load on the database server.
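Single-page recovery, described above as the backbone of instant restart and instant restore, can be sketched as replaying only the log records that affect one page, located by following a per-page chain of log records. The C++ sketch below uses assumed structures (a per-page prev-LSN pointer and whole-page after-images) and is not the authors' implementation.

// Sketch of log-based single-page recovery: each log record keeps the LSN of
// the previous record for the same page, so repairing one page means walking
// that chain back to the page's current LSN and replaying the missing records
// oldest-first. Assumed structures, for illustration only.
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct PageLogRecord {
    uint64_t    lsn;
    uint64_t    prev_lsn_same_page;   // 0 if there is no earlier record
    std::string after_image;          // simplified: whole-page after-image
};

struct Page {
    uint64_t    page_lsn = 0;         // LSN of the last update reflected in it
    std::string contents;
};

// log: all records indexed by LSN; expected_lsn: newest record for this page
// (e.g., taken from a page recovery index). Replays only what is missing.
void recover_single_page(Page& page,
                         const std::unordered_map<uint64_t, PageLogRecord>& log,
                         uint64_t expected_lsn) {
    std::vector<const PageLogRecord*> missing;
    for (uint64_t lsn = expected_lsn; lsn != 0 && lsn > page.page_lsn;) {
        const PageLogRecord& rec = log.at(lsn);   // assumes the record exists
        missing.push_back(&rec);
        lsn = rec.prev_lsn_same_page;
    }
    std::reverse(missing.begin(), missing.end()); // apply oldest first
    for (const PageLogRecord* rec : missing) {
        page.contents = rec->after_image;         // redo
        page.page_lsn = rec->lsn;
    }
}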
Article
Recovery is an intricate aspect of transaction processing architectures. In its traditional implementation, recovery requires the management of two persistent data stores---a write-ahead log and a materialized database---which must be carefully orchestrated to maintain transactional consistency. Furthermore, the design and implementation of recovery algorithms have deep ramifications into almost every component of the internal system architecture, from concurrency control to buffer management and access path implementation. Such complexity not only incurs high costs for development, testing, and training, but also unavoidably affects system performance, introducing overheads and limiting scalability. This paper proposes a novel approach for transactional storage and recovery called FineLine. It simplifies the implementation of transactional database systems by eliminating the log-database duality and maintaining all persistent data in a single, log-structured data structure. This approach not only provides more efficient recovery with less overhead, but also decouples the management of persistent data from in-memory access paths. As such, it blurs the lines that separate in-memory from disk-based database systems, providing the efficiency of the former with the reliability of the latter.
Article
Highly available database systems rely on data replication to tolerate machine failures. Both classes of existing replication algorithms, active-passive and active-active, were designed in a time when network was the dominant performance bottleneck. In essence, these techniques aim to minimize network communication between replicas at the cost of incurring more processing redundancy; a trade-off that suitably fitted the conventional wisdom of distributed database design. However, the emergence of next-generation networks with high throughput and low latency calls for revisiting these assumptions. In this paper, we first make the case that in modern RDMA-enabled networks, the bottleneck has shifted to CPUs, and therefore the existing network-optimized replication techniques are no longer optimal. We present Active-Memory Replication, a new high availability scheme that efficiently leverages RDMA to completely eliminate the processing redundancy in replication. Using Active-Memory, all replicas dedicate their processing power to executing new transactions, as opposed to performing redundant computation. Active-Memory maintains high availability and correctness in the presence of failures through an efficient RDMA-based undo-logging scheme. Our evaluation against active-passive and active-active schemes shows that Active-Memory is up to a factor of 2 faster than the second-best protocol on RDMA-based networks.
Conference Paper
I/O latency and throughput are among the major performance bottlenecks for disk-based database systems. Upcoming persistent memory (PMem) technologies, like Intel's Optane DC Persistent Memory Modules, promise to bridge the gap between NAND-based flash (SSD) and DRAM, and thus eliminate the I/O bottleneck. In this paper, we provide one of the first performance evaluations of PMem in terms of bandwidth and latency. Based on the results, we develop guidelines for efficient PMem usage and two essential I/O primitives tuned for PMem: log writing and block flushing.
Chapter
Online Transaction Processing (OLTP) databases include a suite of features---disk-resident B-trees and heap files, locking-based concurrency control, support for multi-threading---that were optimized for computer technology of the late 1970's. Advances in modern processors, memories, and networks mean that today's computers are vastly different from those of 30 years ago, such that many OLTP databases will now fit in main memory, and most OLTP transactions can be processed in milliseconds or less. Yet database architecture has changed little. Based on this observation, we look at some interesting variants of conventional database systems that one might build that exploit recent hardware trends, and speculate on their performance through a detailed instruction-level breakdown of the major components involved in a transaction processing database system (Shore) running a subset of TPC-C. Rather than simply profiling Shore, we progressively modified it so that after every feature removal or optimization, we had a (faster) working system that fully ran our workload. Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. We also show that there is no single "high pole in the tent" in modern (memory resident) database systems, but that substantial time is spent in logging, latching, locking, B-tree, and buffer management operations.
Conference Paper
We present the Height Optimized Trie (HOT), a fast and space-efficient in-memory index structure. The core algorithmic idea of HOT is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. The layout of each node is carefully engineered for compactness and fast search using SIMD instructions. Our experimental results, which use a wide variety of workloads and data sets, show that HOT outperforms other state-of-the-art index structures for string keys both in terms of search performance and memory footprint, while being competitive for integer keys. We believe that these properties make HOT highly useful as a general-purpose index structure for main-memory databases.
Conference Paper
Non-volatile memory (NVM) is a new storage technology that combines the performance and byte addressability of DRAM with the persistence of traditional storage devices like flash (SSD). While these properties make NVM highly promising, it is not yet clear how to best integrate NVM into the storage layer of modern database systems. Two system designs have been proposed. The first is to use NVM exclusively, i.e., to store all data and index structures on it. However, because NVM has a higher latency than DRAM, this design can be less efficient than main-memory database systems. For this reason, the second approach uses a page-based DRAM cache in front of NVM. This approach, however, does not utilize the byte addressability of NVM and, as a result, accessing an uncached tuple on NVM requires retrieving an entire page. In this work, we evaluate these two approaches and compare them with in-memory databases as well as more traditional buffer managers that use main memory as a cache in front of SSDs. This allows us to determine how much performance gain can be expected from NVM. We also propose a lightweight storage manager that simultaneously supports DRAM, NVM, and flash. Our design utilizes the byte addressability of NVM and uses it as an additional caching layer that improves performance without losing the benefits from the even faster DRAM and the large capacities of SSDs.
Article
In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principle bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-of-the-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance. We present a query processing model called "relaxed operator fusion" that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2× and achieves up to 1.8× better performance compared to other in-memory DBMSs.
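The staging-point idea behind relaxed operator fusion can be sketched as follows: instead of probing a hash table one tuple at a time inside a fully fused loop, probe keys are buffered in a small cache-resident batch, prefetches are issued for the slots the batch will touch, and only then are the probes executed. The C++ sketch below uses an assumed toy hash table and batch size; it is not Peloton's execution engine, and it omits the SIMD part of the technique.

// Simplified sketch of a staging point between two fused operators: buffer a
// small batch of probe keys, prefetch their hash-table slots, then probe.
#include <cstdint>
#include <vector>

// Tiny open-addressing hash table (linear probing) so slot addresses can be
// prefetched explicitly. Keys must be nonzero (0 marks an empty slot) and the
// capacity must exceed the number of inserted keys.
struct JoinHashTable {
    explicit JoinHashTable(std::size_t capacity) : slots(capacity, 0) {}
    void insert(uint64_t key) {
        std::size_t i = key % slots.size();
        while (slots[i] != 0) i = (i + 1) % slots.size();
        slots[i] = key;
    }
    bool contains(uint64_t key) const {
        std::size_t i = key % slots.size();
        while (slots[i] != 0) {
            if (slots[i] == key) return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }
    std::size_t home_slot(uint64_t key) const { return key % slots.size(); }
    std::vector<uint64_t> slots;
};

constexpr std::size_t kBatch = 256;   // staging buffer small enough for cache

uint64_t probe_with_staging(const std::vector<uint64_t>& probe_keys,
                            const JoinHashTable& table) {
    uint64_t matches = 0;
    std::vector<uint64_t> stage;
    stage.reserve(kBatch);

    auto flush = [&]() {
        // Staging point: prefetch every home slot the batch will touch ...
        for (uint64_t key : stage)
            __builtin_prefetch(&table.slots[table.home_slot(key)], 0, 1);
        // ... then probe; the first slot of each chain is likely cached now.
        for (uint64_t key : stage)
            matches += table.contains(key) ? 1 : 0;
        stage.clear();
    };

    for (uint64_t key : probe_keys) {
        stage.push_back(key);
        if (stage.size() == kBatch) flush();
    }
    flush();                           // drain the partial last batch
    return matches;
}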
Article
Storage Class Memory (SCM) is a novel class of memory technologies that promise to revolutionize database architectures. SCM is byte-addressable and exhibits latencies similar to those of DRAM, while being non-volatile. Hence, SCM could replace both main memory and storage, enabling a novel single-level database architecture without the traditional I/O bottleneck. Fail-safe persistent SCM allocation can be considered conditio sine qua non for enabling this novel architecture paradigm for database management systems. In this paper we present PAllocator, a fail-safe persistent SCM allocator whose design emphasizes high concurrency and capacity scalability. Contrary to previous works, PAllocator thoroughly addresses the important challenge of persistent memory fragmentation by implementing an efficient defragmentation algorithm. We show that PAllocator outperforms state-of-the-art persistent allocators by up to one order of magnitude, both in operation throughput and recovery time, and enables up to 2.39x higher operation throughput on a persistent B-Tree.
Conference Paper
Media failures usually leave database systems unavailable for several hours until recovery is complete, especially in applications with large devices and high transaction volume. Previous work introduced a technique called single-pass restore, which increases restore bandwidth and thus substantially decreases time to repair. Instant restore goes further as it permits read/write access to any data on a device undergoing restore--even data not yet restored--by restoring individual data segments on demand. Thus, the restore process is guided primarily by the needs of applications, and the observed mean time to repair is effectively reduced from several hours to a few seconds. This paper presents an implementation and evaluation of instant restore. The technique is incrementally implemented on a system starting with the traditional ARIES design for logging and recovery. Experiments show that the transaction latency perceived after a media failure can be cut down to less than a second and that the overhead imposed by the technique on normal processing is minimal. The net effect is that a few "nines" of availability are added to the system using simple and low-overhead software techniques.
Conference Paper
The difference in the performance characteristics of volatile (DRAM) and non-volatile storage devices (HDD/SSDs) influences the design of database management systems (DBMSs). The key assumption has always been that the latter is much slower than the former. This affects all aspects of a DBMS's runtime architecture. But the arrival of new non-volatile memory (NVM) storage that is almost as fast as DRAM with fine-grained read/writes invalidates these previous design choices. In this tutorial, we provide an outline on how to build a new DBMS given the changes to hardware landscape due to NVM. We survey recent developments in this area, and discuss the lessons learned from prior research on designing NVM database systems. We highlight a set of open research problems, and present ideas for solving some of them.
Conference Paper
Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volume of on-line transactions on modern multi-core machines. But existing durability schemes, namely, tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery without introducing any costly logging overhead to the execution of concurrent transactions. We propose PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging. PACMAN leverages a combination of static and dynamic analyses to parallelize the log recovery: at compile time, PACMAN decomposes stored procedures by carefully analyzing dependencies within and across programs; at recovery time, PACMAN exploits the availability of the runtime parameter values to attain an execution schedule with a high degree of parallelism. As such, recovery performance is remarkably increased. We evaluated PACMAN in a fully-fledged main-memory DBMS running on a 40-core machine. Compared to several state-of-the-art database recovery mechanisms, PACMAN can significantly reduce recovery time without compromising the efficiency of transaction processing.
Article
Scaling the performance of shared-everything on-line transaction processing to highly-parallel multicore hardware remains a great challenge for database system designers. Developments in OLTP technology remove locking and logging from being scalability bottlenecks on such systems, leaving page latching as the next potential problem. To tackle the page latching problem, we design a system around physiological partitioning (PLP). The PLP design applies logical-only partitioning, maintaining the desired properties of shared-everything designs, and introduces a multi-rooted B+Tree index structure (MRBTree) which allows us to partition the accesses at the physical page level. That is, logical partitioning, along with MRBTrees ensure that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free. We extend the design to make heap page accesses thread-private as well. The elimination of page latching allows us to simplify key code paths in the system such as B+Tree operations leading to more efficient yet easier maintainable code. The profiling of a prototype PLP system shows that it acquires 85% and 68% fewer contentious critical sections per transaction than an optimized conventional design and one based on logical-only partitioning respectively. As a result the PLP prototype improves performance by up to 40% and 18% over the two systems on two multicore machines.
Article
The design of the logging and recovery components of database management systems (DBMSs) has always been influenced by the difference in the performance characteristics of volatile (DRAM) and non-volatile storage devices (HDD/SSDs). The key assumption has been that non-volatile storage is much slower than DRAM and only supports block-oriented read/writes. But the arrival of new non-volatile memory (NVM) storage that is almost as fast as DRAM with fine-grained read/writes invalidates these previous design choices. This paper explores the changes that are required in a DBMS to leverage the unique properties of NVM in systems that still include volatile DRAM. We make the case for a new logging and recovery protocol, called write-behind logging, that enables a DBMS to recover nearly instantaneously from system failures. The key idea is that the DBMS logs what parts of the database have changed rather than how it was changed. Using this method, the DBMS flushes the changes to the database before recording them in the log. Our evaluation shows that this protocol improves a DBMS's transactional throughput by 1.3×, reduces the recovery time by more than two orders of magnitude, and shrinks the storage footprint of the DBMS on NVM by 1.5×. We also demonstrate that our logging protocol is compatible with standard replication schemes.
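The write-behind idea, logging what changed rather than how it changed, can be sketched as follows: the new tuple versions are made durable first, and the log then records only which tuples a committed transaction touched. The C++ sketch below emulates NVM with an ordinary map and invents the names CommitRecord and persist_to_nvm; the real protocol's epochs, commit timestamps, and cache-line flushing are described in the paper.

// Sketch of the write-behind idea: persist the changed tuples first, then
// record only *what* changed in a small commit record. Assumed names and an
// emulated persistence call; not the paper's actual protocol.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Tuple {
    std::string value;
    uint64_t    version;      // commit timestamp of the writing transaction
};

// Stand-in for a durable, byte-addressable NVM table.
using NvmTable = std::unordered_map<std::string, Tuple>;

struct CommitRecord {         // "what changed", not "how it changed"
    uint64_t                 commit_ts;
    std::vector<std::string> dirty_keys;
};

void persist_to_nvm(NvmTable& nvm, const std::string& key, const Tuple& t) {
    nvm[key] = t;             // real NVM code: store + cache-line write-back + fence
}

// Commit: 1) write the new versions to NVM, 2) append a commit record naming
// the changed tuples. After a crash, recovery mainly has to disregard versions
// newer than the last durable commit record, instead of redoing data changes.
CommitRecord commit(NvmTable& nvm, uint64_t commit_ts,
                    const std::unordered_map<std::string, std::string>& writes) {
    CommitRecord rec{commit_ts, {}};
    for (const auto& [key, value] : writes) {
        persist_to_nvm(nvm, key, Tuple{value, commit_ts});
        rec.dirty_keys.push_back(key);
    }
    return rec;               // making this record durable commits the transaction
}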
Article
Real-time analytics on massive datasets has become a very common need in many enterprises. These applications require not only rapid data ingest, but also quick answers to analytical queries operating on the latest data. MemSQL is a distributed SQL database designed to exploit memory-optimized, scale-out architecture to enable real-time transactional and analytical workloads which are fast, highly concurrent, and extremely scalable. Many analytical queries in MemSQL's customer workloads are complex queries involving joins, aggregations, sub-queries, etc. over star and snowflake schemas, often ad-hoc or produced interactively by business intelligence tools. These queries often require latencies of seconds or less, and therefore require the optimizer to not only produce a high quality distributed execution plan, but also produce it fast enough so that optimization time does not become a bottleneck. In this paper, we describe the architecture of the MemSQL Query Optimizer and the design choices and innovations which enable it quickly produce highly efficient execution plans for complex distributed queries. We discuss how query rewrite decisions oblivious of distribution cost can lead to poor distributed execution plans, and argue that to choose high-quality plans in a distributed database, the optimizer needs to be distribution-aware in choosing join plans, applying query rewrites, and costing plans. We discuss methods to make join enumeration faster and more effective, such as a rewrite-based approach to exploit bushy joins in queries involving multiple star schemas without sacrificing optimization time. We demonstrate the effectiveness of the MemSQL optimizer over queries from the TPC-H benchmark and a real customer workload.
Article
This tutorial provides an overview of recent developments in main-memory database systems. With growing memory sizes and memory prices dropping by a factor of 10 every 5 years, data having a "primary home" in memory is now a reality. Main-memory databases eschew many of the traditional architectural tenets of relational database systems that optimized for disk-resident data. Innovative approaches to fundamental issues such as concurrency control and query processing are required to unleash the full performance potential of main-memory databases. The tutorial is focused around design issues and architectural choices that must be made when building a high performance database system optimized for main-memory: data storage and indexing, concurrency control, durability and recovery techniques, query processing and compilation, support for high availability, and ability to support hybrid transactional and analytics workloads. This will be illustrated by example solutions drawn from four state-of-the-art systems: H-Store/VoltDB, Hekaton, HyPeR, and SAP HANA. The tutorial will also cover current and future research trends.
Conference Paper
Emerging non-volatile memory (NVM) technologies offer fast, byte-addressable access, making it possible to rethink the durability mechanisms of in-memory databases. In this paper, we present Hyrise-NV, a database storage engine that maintains table and index structures on NVM. Our architecture updates the database state and index structures on NVM in a transactionally consistent way using multi-version data structures, allowing databases to be recovered instantly, independent of their size. For index structures, we present nvBTree, which uses multi-versioning to provide failure-atomic tree updates on NVM. We evaluate Hyrise-NV both on DRAM and with hardware-based emulation of NVM using the TPC-C benchmark. Hyrise-NV recovers databases independent of their size, allowing the recovery of a table with 10 million rows in less than 100 ms.
Article
This paper presents an algorithm, called ARIES/CSA (Algorithm for Recovery and Isolation Exploiting Semantics for Client-Server Architectures), for performing recovery correctly in client-server (CS) architectures. In CS, the server manages the disk version of the database. The clients, after obtaining database pages from the server, cache them in their buffer pools. Clients perform their updates on the cached pages and produce log records. The log records are buffered locally in virtual storage and later sent to the single log at the server. ARIES/CSA supports write-ahead logging (WAL), fine-granularity (e.g., record) locking, partial rollbacks, and flexible buffer management policies like steal and no-force. It does not require that the clocks on the clients and the server be synchronized. Checkpointing by the server and the clients allows for flexible and easier recovery.
Article
Persistent memory invites applications to manipulate persistent data via load and store instructions. Because failures during updates may destroy transient data (e.g., in CPU registers), preserving data integrity in the presence of failures requires failure-atomic bundles of updates. Prior failure atomicity approaches for persistent memory entail overheads due to logging and CPU cache flushing. Persistent caches can eliminate the need for flushing, but conventional logging remains complex and memory intensive. We present the design and implementation of JUSTDO logging, a new failure atomicity mechanism that greatly reduces the memory footprint of logs, simplifies log management, and enables fast parallel recovery following failure. Crash-injection tests confirm that JUSTDO logging preserves application data integrity and performance evaluations show that it improves throughput 3x or more compared with a state-of-the-art alternative for a spectrum of data-intensive algorithms.
Article
This report examines the relative advantages of a storage model based on decomposition (of community view relations into binary relations containing a surrogate and one attribute) over conventional n-ary storage models. There seems to be a general consensus among the database community that the n-ary approach is better. This conclusion is usually based on a consideration of only one or two dimensions of a database system. The purpose of this report is not to claim that decomposition is better. Instead, we claim that the consensus opinion is not well founded and that neither is clearly better until a closer analysis is made along the many dimensions of a database system. The purpose of this report is to move further in both scope and depth toward such an analysis. We examine such dimensions as simplicity, generality, storage requirements, update performance and retrieval performance.
Article
With memory prices dropping and memory sizes increasing accordingly, a number of researchers are addressing the problem of designing high-performance database systems for managing memory-resident data. In this paper we address the recovery problem in the context of such a system. We argue that existing database recovery schemes fall short of meeting the requirements of such a system, and we present a new recovery mechanism which is designed to overcome their shortcomings. The proposed mechanism takes advantage of a few megabytes of reliable memory in order to organize recovery information on a per “object” basis. As a result, it is able to amortize the cost of checkpoints over a controllable number of updates, and it is also able to separate post-crash recovery into two phases—high-speed recovery of data which is needed immediately by transactions, and background recovery of the remaining portions of the database. A simple performance analysis is undertaken, and the results suggest our mechanism should perform well in a high-performance, memory-resident database environment.
Article
This paper provides a comprehensive treatment of index management in transaction systems. We present a method, called ARIES/IM, for concurrency control and recovery of B+-trees. ARIES/IM guarantees serializability and uses write-ahead logging for recovery. It supports very high concurrency and good performance by (1) treating as the lock of a key the same lock as the one on the corresponding record data in a data page (e.g., at the record level), (2) not acquiring, in the interest of permitting very high concurrency, commit duration locks on index pages even during index structure modification operations (SMOs) like page splits and page deletions, and (3) allowing retrievals, inserts, and deletes to go on concurrently with SMOs. During restart recovery, any necessary redos of index changes are always performed in a page-oriented fashion (i.e., without traversing the index tree) and, during normal processing and restart recovery, whenever possible undos are performed in a page-oriented fashion. ARIES/IM permits different granularities of locking to be supported in a flexible manner. A subset of ARIES/IM has been implemented in the OS/2 Extended Edition Database Manager. Since the locking ideas of ARIES/IM have general applicability, some of them have also been implemented in SQL/DS and the VM Shared File System, even though those systems use the shadow-page technique for recovery.
Article
In a main memory database (MMDB), the primary copy of the database may be stored in volatile memory. When a crash occurs, a reload of the database from archive memory to main memory must be performed. It is essential that an efficient reload scheme be used to ensure that the expectations of high-performance database systems are met. This implies that the overall merit of any potential reload algorithm should not be measured simply by reload time, but by its impact on overall system performance. This paper presents four different reload algorithms that aim at fast transaction response time and high throughput of the overall system. Simulation studies comparing the algorithms indicate that the best overall approach is one based on frequency of access.
Conference Paper
Data-intensive applications seek to obtain insights in real time by analyzing a combination of historical data sets alongside recently collected data. This means that to support such hybrid workloads, database management systems (DBMSs) need to handle both fast ACID transactions and complex analytical queries on the same database. But the current trend is to use specialized systems that are optimized for only one of these workloads, and thus require an organization to maintain separate copies of the database. This adds additional cost to deploying a database application in terms of both storage and administration overhead. To overcome this barrier, we present a hybrid DBMS architecture that efficiently supports varied workloads on the same database. Our approach differs from previous methods in that we use a single execution engine that is oblivious to the storage layout of data without sacrificing the performance benefits of the specialized systems. This obviates the need to maintain separate copies of the database in multiple independent systems. We also present a technique to continuously evolve the database's physical storage layout by analyzing the queries' access patterns and choosing the optimal layout for different segments of data within the same table. To evaluate this work, we implemented our architecture in an in-memory DBMS. Our results show that our approach delivers up to 3x higher throughput compared to static storage layouts across different workloads. We also demonstrate that our continuous adaptation mechanism allows the DBMS to achieve a near-optimal layout for an arbitrary workload without requiring any manual tuning.
Conference Paper
As it becomes increasingly common for transaction processing systems to operate on datasets that fit within the main memory of a single machine or a cluster of commodity machines, traditional mechanisms for guaranteeing transaction durability---which typically involve synchronous log flushes---incur increasingly unappealing costs to otherwise lightweight transactions. Many applications have turned to periodically checkpointing full database state. However, existing checkpointing methods---even those which avoid freezing the storage layer---often come with significant costs to operation throughput, end-to-end latency, and total memory usage. This paper presents Checkpointing Asynchronously using Logical Consistency (CALC), a lightweight, asynchronous technique for capturing database snapshots that does not require a physical point of consistency to create a checkpoint, and avoids conspicuous latency spikes incurred by other database snapshotting schemes. Our experiments show that CALC can capture frequent checkpoints across a variety of transactional workloads with extremely small cost to transactional throughput and low additional memory usage compared to other state-of-the-art checkpointing systems.
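The core of an asynchronous, logically consistent checkpoint can be sketched with a stable/live pair per record. The Record layout, the global flag, and the lazy copy-on-first-update shown here are illustrative assumptions rather than CALC's exact data structures; they convey why writers never block on the checkpointer.

// Minimal sketch of stable/live record versions for asynchronous checkpointing.
#include <atomic>
#include <optional>
#include <string>

struct Record {
    std::string live;                  // current value seen by transactions
    std::optional<std::string> stable; // value as of the virtual point of consistency
};

std::atomic<bool> checkpoint_active{false};

// During an active checkpoint, the first update to a record preserves its
// pre-checkpoint value so a background thread can emit a consistent snapshot.
void update_record(Record& r, const std::string& new_value) {
    if (checkpoint_active.load(std::memory_order_acquire) && !r.stable) {
        r.stable = r.live;             // capture the checkpoint version lazily
    }
    r.live = new_value;                // writers never wait for the checkpointer
}

// The checkpoint thread writes r.stable if present, else r.live, then clears r.stable.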
Conference Paper
By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, in-memory systems still need to flush the log to disk, which incurs a substantial number of I/Os. Recently, command logging has been proposed to replace the traditional data log (e.g., ARIES logging) in in-memory databases. Instead of recording how the tuples are updated, command logging only tracks the transactions that are being executed, thereby effectively reducing the size of the log and improving the performance. However, when a failure occurs, all the transactions in the log after the last checkpoint must be redone sequentially and this significantly increases the cost of recovery. In this paper, we first extend the command logging technique to a distributed system, where all the nodes can perform their recovery in parallel. We show that in a distributed system, the only bottleneck of recovery caused by command logging is the synchronization process that attempts to resolve the data dependency among the transactions. We then propose an adaptive logging approach by combining data logging and command logging. The percentage of data logging versus command logging becomes a tuning knob between the performance of transaction processing and recovery to meet different OLTP requirements, and a model is proposed to guide such tuning. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.
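The trade-off between the two log formats is easiest to see side by side. The struct definitions below are simplified illustrations, not taken from H-Store or the paper; they only contrast what each approach must write per transaction and what recovery must do with it.

// Illustrative contrast between a data (ARIES-style) log record and a command log record.
#include <cstdint>
#include <vector>

// Data logging: records *how* tuples changed, so redo of each record is
// independent and parallelizable, but every update produces a large record.
struct DataLogRecord {
    uint64_t txn_id;
    uint32_t table_id;
    uint64_t tuple_id;
    std::vector<uint8_t> after_image;  // new tuple bytes
};

// Command logging: records only *which* transaction ran, so the log is tiny,
// but recovery must re-execute transactions and respect their dependencies.
struct CommandLogRecord {
    uint64_t txn_id;
    uint32_t stored_procedure_id;      // logical transaction to re-run
    std::vector<uint8_t> parameters;   // serialized invocation arguments
};

// Adaptive logging chooses per transaction (or per partition) which record type
// to emit, trading runtime throughput against recovery time.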
Article
Traditional theory and practice of write-ahead logging and of database recovery focus on three failure classes: transaction failures (typically due to deadlocks) resolved by transaction rollback; system failures (typically power or software faults) resolved by restart with log analysis, "redo," and "undo" phases; and media failures (typically hardware faults) resolved by restore operations that combine multiple types of backups and log replay. The recent addition of single-page failures and single-page recovery has opened new opportunities far beyond the original aim of immediate, lossless repair of single-page wear-out in novel or traditional storage hardware. In the contexts of system and media failures, efficient single-page recovery enables on-demand incremental "redo" and "undo" as part of system restart or media restore operations. This can give the illusion of practically instantaneous restart and restore: instant restart permits processing new queries and updates seconds after system reboot.
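On-demand single-page repair can be sketched as follows. The per-page log chain and fetch_page function are illustrative assumptions rather than the paper's exact mechanism; the point is that redo work is deferred until a page is first touched after the failure, so new transactions can start almost immediately.

// Sketch of on-demand ("instant restart") single-page redo; simplified types.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct PageLogChain {
    std::vector<uint64_t> lsns;  // LSNs of all log records affecting this page
};

struct Page {
    uint64_t page_lsn = 0;
    bool     recovered = false;
};

std::unordered_map<uint32_t, PageLogChain> per_page_log;  // built during log analysis

// The cost of redo is paid lazily, per page, the first time each page is needed.
Page& fetch_page(uint32_t page_id, std::unordered_map<uint32_t, Page>& pool) {
    Page& page = pool[page_id];
    if (!page.recovered) {
        for (uint64_t lsn : per_page_log[page_id].lsns) {
            if (page.page_lsn < lsn) {
                // apply_redo(page, lsn);   // replay only this page's history (omitted)
                page.page_lsn = lsn;
            }
        }
        page.recovered = true;
    }
    return page;
}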
Conference Paper
The increasing number of cores every generation poses challenges for high-performance in-memory database systems. While these systems use sophisticated high-level algorithms to partition a query or run multiple queries in parallel, they also utilize low-level synchronization mechanisms to synchronize access to internal database data structures. Developers often spend significant development and verification effort to improve concurrency in the presence of such synchronization. The Intel® Transactional Synchronization Extensions (Intel® TSX) in the 4th Generation Core™ Processors enable hardware to dynamically determine whether threads actually need to synchronize even in the presence of conservatively used synchronization. This paper evaluates the effectiveness of such hardware support in a commercial database. We focus on two index implementations: a B+Tree Index and the Delta Storage Index used in the SAP HANA® database system. We demonstrate that such support can improve performance of database data structures such as index trees and presents a compelling opportunity for the development of simpler, scalable, and easy-to-verify algorithms.
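A common way to exploit such hardware support is lock elision around index operations. The sketch below uses the Intel RTM intrinsics (_xbegin, _xend, _xabort) and assumes a generic Index type with a lookup method; it is a simplified illustration rather than the SAP HANA implementation, and it must be compiled with RTM support (e.g., -mrtm) on a TSX-capable CPU.

// Hedged sketch of hardware lock elision with Intel TSX (RTM) around an index lookup.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_lock{false};

template <typename Index, typename Key, typename Value>
bool elided_lookup(const Index& index, const Key& key, Value& out) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Abort if another thread holds the fallback lock, so we never read
        // state that a non-transactional critical section is modifying.
        if (fallback_lock.load(std::memory_order_relaxed)) _xabort(0xff);
        bool found = index.lookup(key, out);   // runs without taking any lock
        _xend();
        return found;
    }
    // Fallback path: the hardware transaction aborted (conflict, capacity, ...).
    while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
    bool found = index.lookup(key, out);
    fallback_lock.store(false, std::memory_order_release);
    return found;
}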
Article
Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current database management systems (DBMSs) were not designed for. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts. To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms on a main-memory DBMS and using computer simulations scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from ground up and is tightly coupled with the hardware.
Article
Server hardware is about to drastically change. As typified by emerging hardware such as UC Berkeley's Firebox project and by Intel's Rack-Scale Architecture (RSA), next generation servers will have thousands of cores, large DRAM, and huge NVRAM. We analyze the characteristics of these machines and find that no existing database is appropriate. Hence, we are developing FOEDUS, an open-source, from-scratch database engine whose architecture is drastically different from traditional databases. It extends in-memory database technologies to further scale up and also allows transactions to efficiently manipulate data pages in both DRAM and NVRAM. We evaluate the performance of FOEDUS in a large NUMA machine (16 sockets and 240 physical cores) and find that FOEDUS achieves multiple orders of magnitude higher TPC-C throughput compared to H-Store with anti-caching.
Article
The release of hardware transactional memory (HTM) in commodity CPUs has major implications on the design and implementation of main-memory databases, especially on the architecture of high-performance lock-free indexing methods at the core of several of these systems. This paper studies the interplay of HTM and lock-free indexing methods. First, we evaluate whether HTM will obviate the need for crafty lock-free index designs by integrating it in a traditional B-tree architecture. HTM performs well for simple data sets with small fixed-length keys and payloads, but its benefits disappear for more complex scenarios (e.g., larger variable-length keys and payloads), making it unattractive as a general solution for achieving high performance. Second, we explore fundamental differences between HTM-based and lock-free B-tree designs. While lock-freedom entails design complexity and extra mechanism, it has performance advantages in several scenarios, especially high-contention cases where readers proceed uncontested (whereas HTM aborts readers). Finally, we explore the use of HTM as a method to simplify lock-free design. We find that using HTM to implement a multi-word compare-and-swap greatly reduces lock-free programming complexity at the cost of only a 10-15% performance degradation. Our study uses two state-of-the-art index implementations: a memory-optimized B-tree extended with HTM to provide multi-threaded concurrency and the Bw-tree lock-free B-tree used in several Microsoft production environments.
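The multi-word compare-and-swap idea mentioned at the end can be sketched directly with RTM. The MwcasEntry layout is an assumption for illustration; a production design would also need a software fallback for transactions that abort repeatedly, which is omitted here.

// Sketch of a multi-word compare-and-swap built on RTM; compile with -mrtm.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

struct MwcasEntry {
    uint64_t* addr;
    uint64_t  expected;
    uint64_t  desired;
};

// Atomically install all new values iff every word still holds its expected
// value. Returns false if any comparison fails or the hardware txn aborts.
bool mwcas(MwcasEntry* entries, size_t n) {
    unsigned status = _xbegin();
    if (status != _XBEGIN_STARTED) return false;
    for (size_t i = 0; i < n; ++i) {
        if (*entries[i].addr != entries[i].expected) {
            _xabort(0x01);                 // a word changed underneath us
        }
    }
    for (size_t i = 0; i < n; ++i) {
        *entries[i].addr = entries[i].desired;
    }
    _xend();
    return true;
}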
Article
The increase in the capacity of main memory coupled with the decrease in cost has fueled research in and development of in-memory databases. In recent years, the emergence of new hardware has further given rise to new challenges which have attracted a lot of attention from the research community. In particular, it is widely accepted that hardware solutions can provide promising alternatives for realizing the full potential of in-memory systems. Here, we argue that naive adoption of hardware solutions does not guarantee superior performance over software solutions, and identify problems in such hardware solutions that limit their performance. We also highlight the primary challenges faced by in-memory databases, and summarize their potential solutions, from both software and hardware perspectives.
Conference Paper
The advent of non-volatile memory (NVM) will fundamentally change the dichotomy between memory and durable storage in database management systems (DBMSs). These new NVM devices are almost as fast as DRAM, but all writes to it are potentially persistent even after power loss. Existing DBMSs are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy DBMSs are unnecessary and will degrade the performance of data intensive applications. To better understand these issues, we implemented three engines in a modular DBMS testbed that are based on different storage management architectures: (1) in-place updates, (2) copy-on-write updates, and (3) log-structured updates. We then present NVM-aware variants of these architectures that leverage the persistence and byte-addressability properties of NVM in their storage and recovery methods. Our experimental evaluation on an NVM hardware emulator shows that these engines achieve up to 5.5X higher throughput than their traditional counterparts while reducing the amount of wear due to write operations by up to 2X. We also demonstrate that our NVM-aware recovery protocols allow these engines to recover almost instantaneously after the DBMS restarts.
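The persistence primitive underlying an NVM-aware in-place engine can be sketched as a cache-line write-back followed by a fence. The persist_range and update_tuple_in_place functions below are illustrative assumptions (the NVM allocation and recovery metadata handling are omitted), not the paper's engine code; the sketch requires CLWB support (e.g., -mclwb).

// Hedged sketch of an in-place update persisted directly on byte-addressable NVM.
#include <immintrin.h>
#include <cstddef>
#include <cstring>

constexpr size_t kCacheLine = 64;

// Flush the modified range from the CPU caches and order the write-backs so
// the update is durable on NVM before the function returns.
void persist_range(void* addr, size_t len) {
    char* p   = static_cast<char*>(addr);
    char* end = p + len;
    for (char* line = p; line < end; line += kCacheLine) {
        _mm_clwb(line);
    }
    _mm_sfence();
}

// In-place engine: the tuple is updated directly in NVM and then persisted;
// log and recovery metadata handling are deliberately omitted from this sketch.
void update_tuple_in_place(void* nvm_tuple, const void* new_value, size_t len) {
    std::memcpy(nvm_tuple, new_value, len);
    persist_range(nvm_tuple, len);
}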
Article
Next-generation non-volatile memories (NVMs) promise DRAM-like performance, persistence, and high density. They can attach directly to processors to form non-volatile main memory (NVMM) and offer the opportunity to build very low-latency storage systems. These high-performance storage systems would be especially useful in large-scale data center environments where reliability and availability are critical. However, providing reliability and availability to NVMM is challenging, since the latency of data replication can overwhelm the low latency that NVMM should provide. We propose Mojim, a system that provides the reliability and availability that large-scale storage systems require, while preserving the performance of NVMM. Mojim achieves these goals by using a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains one or more secondary backup nodes with weakly consistent copies of data. Mojim uses highly-optimized replication protocols, software, and networking stacks to minimize replication costs and expose as much of NVMM's performance as possible. We evaluate Mojim using raw DRAM as a proxy for NVMM and using an industrial NVMM emulation system. We find that Mojim provides replicated NVMM with similar or even better performance than un-replicated NVMM (reducing latency by 27% to 63% and delivering between 0.4 to 2.7X the throughput). We demonstrate that replacing MongoDB's built-in replication system with Mojim improves MongoDB's performance by 3.4 to 4X.
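The two-tier commit path can be sketched at a very high level. The Node type and the send_sync/enqueue_async transport stubs below are hypothetical placeholders, not Mojim's protocol or API; the sketch only shows the division between synchronous replication to the mirror and weakly consistent propagation to the backup tier.

// Hedged sketch of a two-tier replicated commit, under the assumptions above.
#include <cstddef>

struct Node { int id; /* connection state omitted */ };

// Placeholder transports standing in for an optimized replication stack.
bool send_sync(Node&, const void*, size_t) { return true; }   // blocks until acked
void enqueue_async(Node&, const void*, size_t) {}             // background propagation

// A write commits once it is durable on the primary and its mirror; the
// secondary tier is updated in the background, trading freshness for latency.
bool replicated_commit(const void* log_entry, size_t len,
                       Node& mirror, Node* backups, size_t n_backups) {
    if (!send_sync(mirror, log_entry, len)) {
        return false;                       // cannot commit without the mirror copy
    }
    for (size_t i = 0; i < n_backups; ++i) {
        enqueue_async(backups[i], log_entry, len);
    }
    return true;
}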
Conference Paper
Multi-Version Concurrency Control (MVCC) is a widely employed concurrency control mechanism, as it allows for execution modes where readers never block writers. However, most systems implement only snapshot isolation (SI) instead of full serializability. Adding serializability guarantees to existing SI implementations tends to be prohibitively expensive. We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees. Updating data in-place and storing versions as before-image deltas in undo buffers not only allows us to retain the high scan performance of single-version systems but also forms the basis of our cheap and fine-grained serializability validation mechanism. The novel idea is based on an adaptation of precision locking and verifies that the (extensional) writes of recently committed transactions do not intersect with the (intensional) read predicate space of a committing transaction. We experimentally show that our MVCC model allows very fast processing of transactions with point accesses as well as read-heavy transactions and that there is little need to prefer SI over full serializability any longer.
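The in-place update with before-image deltas can be sketched with a per-transaction undo buffer. The names and the string-valued table below are illustrative simplifications, not the described system's internals; the sketch only shows why the main table always holds the newest version (fast scans) while older readers reconstruct versions from the undo entries.

// Minimal sketch of in-place MVCC updates with before-image deltas in undo buffers.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct UndoEntry {
    uint64_t    tuple_id;
    std::string before_image;   // old value, used to reconstruct older versions
};

struct Transaction {
    uint64_t               start_ts;
    std::vector<UndoEntry> undo_buffer;  // deltas written by this transaction
};

// Update in place: readers with an older timestamp follow the undo entries
// backwards to the version visible to them.
void mvcc_update(std::unordered_map<uint64_t, std::string>& table,
                 Transaction& txn, uint64_t tuple_id, const std::string& new_value) {
    txn.undo_buffer.push_back({tuple_id, table[tuple_id]});  // save before-image
    table[tuple_id] = new_value;                             // in-place update
}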