IBM solidDB: In-Memory Database Optimized for Extreme
Speed and Availability
Jan Lindström, Vilho Raatikka, Jarmo Ruuth, Petri Soini, Katriina Vakkila
IBM Helsinki Lab, Oy IBM Finland Ab
jplindst@gmail.com, raatikka@iki.fi, Jarmo.Ruuth@fi.ibm.com, Petri.Soini@fi.ibm.com,
vakkila@fi.ibm.com
Abstract
A relational in-memory database, IBM solidDB is used worldwide for its ability to deliver extreme
speed and availability. As the name implies, an in-memory database resides entirely in main memory
rather than on disk, making data access several orders of magnitude faster than with conventional,
disk-based databases. Part of that leap is due to the fact that RAM simply provides faster data access
than hard disk drives. But solidDB also has data structures and access methods specifically designed
for storing, searching, and processing data in main memory. As a result, it outperforms ordinary disk-
based databases even when the latter have data fully cached in memory. Some databases deliver low
latency but cannot handle large numbers of transactions or concurrent sessions. IBM solidDB provides
throughput measured in the range of hundreds of thousands to millions of transactions per second while
consistently achieving response times (or latency) measured in microseconds. This article explores the
structural differences between in-memory and disk-based databases, and how solidDB works to deliver
extreme speed.
1 Introduction
IBM solidDB [1, 18] is a relational database server that combines the high performance of in-memory tables
with the nearly unlimited capacity of disk-based tables. Pure in-memory databases are fast but strictly limited by
the size of memory. Pure disk-based databases allow nearly unlimited amounts of storage but their performance
is dominated by disk access. Even if the server has enough memory to store the entire database in memory,
database servers designed for disk-based tables can be slow because data structures that are optimal for disk-
based tables [13] are far from optimal for in-memory tables.
The solidDB solution is to provide a single hybrid database server [5] that contains two optimized engines:
• The disk-based engine (DBE) is optimized for disk-based access.
• The main-memory engine (MME) is optimized for in-memory access.
Both engines coexist inside the same server process and a single SQL statement may access data from both
engines. The key components of solidDB MME technology (Vtrie and index concurrency control) are discussed
in more detail in section 2.
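To make the hybrid model concrete, the sketch below declares one in-memory table and one disk-based table and joins them in a single statement. The STORE MEMORY / STORE DISK clauses follow the table-level storage option described in solidDB documentation, but the table names, columns, and query are purely illustrative and not taken from this paper.

/* Illustrative only: per-table storage selection in a hybrid solidDB schema.
 * The SQL is kept in C strings so it could be passed to any ODBC/JDBC layer. */
const char *ddl_hot =
    "CREATE TABLE subscriber ("
    "  s_id INTEGER PRIMARY KEY,"
    "  vlr_location INTEGER)"
    " STORE MEMORY";                 /* served by the main-memory engine (MME) */

const char *ddl_cold =
    "CREATE TABLE call_history ("
    "  s_id INTEGER,"
    "  started TIMESTAMP,"
    "  duration_s INTEGER)"
    " STORE DISK";                   /* served by the disk-based engine (DBE)  */

const char *cross_engine_query =
    "SELECT s.s_id, COUNT(*)"
    " FROM subscriber s JOIN call_history h ON h.s_id = s.s_id"
    " GROUP BY s.s_id";              /* one statement, data from both engines  */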
To take full advantage of the in-memory technology, you can also link your client application directly
to the database server routines for higher performance and tighter control over the server. The shared memory
access (SMA) feature of solidDB uses direct function calls to the solidDB server code. The in-memory database
is located in a shared memory segment available to all applications connected to the server via SMA. Multiple
concurrent applications running in separate processes can utilize this connection option to reduce response
times significantly. This means that the application's ODBC or JDBC requests are processed almost fully in the
application process space, with no need for a context switch among processes.
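The application-side view of SMA can be illustrated with standard ODBC calls. In this sketch the data source name, credentials, and table are invented for illustration; as described above, whether the calls travel over the network or execute in-process via SMA is decided by which solidDB driver library the application links against, not by the code itself.

#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

int main(void)
{
    SQLHENV  env  = SQL_NULL_HENV;
    SQLHDBC  dbc  = SQL_NULL_HDBC;
    SQLHSTMT stmt = SQL_NULL_HSTMT;

    SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
    SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);

    /* "solid_sma_dsn" and the credentials are illustrative placeholders. */
    if (SQL_SUCCEEDED(SQLConnect(dbc, (SQLCHAR *)"solid_sma_dsn", SQL_NTS,
                                 (SQLCHAR *)"dba", SQL_NTS,
                                 (SQLCHAR *)"dba", SQL_NTS))) {
        SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
        /* Executed almost entirely in the application process when linked
         * against the SMA library; over the network otherwise. */
        SQLExecDirect(stmt, (SQLCHAR *)"SELECT COUNT(*) FROM subscriber", SQL_NTS);
        SQLFreeHandle(SQL_HANDLE_STMT, stmt);
        SQLDisconnect(dbc);
    }
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}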
In addition to a fully functional relational database server, solidDB provides synchronization features be-
tween multiple solidDB instances or between solidDB and other enterprise data servers. The solidDB server
can also be configured for high availability. Synchronization between solidDB instances uses statement-based
replication and supports multi-master topologies.
The same protocol extends to synchronization between solidDB and mobile devices running the IBM Mobile
Database. For synchronization between solidDB and other data servers, log-based replication is used. The
solidDB database can cache data from a back-end database, for example, to improve application performance
using local in-memory processing or to save resources on a potentially very expensive back-end database server.
Caching functionality supports back-end databases from multiple vendors including IBM, Oracle and Microsoft.
The Hot-Standby functionality in solidDB combines with the in-memory technology to provide durability
with high performance. This is discussed in more detail in section 3.
2 Makings of solidDB in-memory technology
Database operations on solidDB in-memory tables can be extremely fast because the storage and index structures
are optimized for in-memory operation. The in-memory engine can reduce the number of instructions needed to
access the data. For example, the solidDB in-memory engine does not implement page-oriented indexes or data
structures that would introduce inherent overhead for in-page processing.
In-memory databases typically forgo the use of large-block indexes, sometimes called bushy trees, in favor
of slimmer structures (tall trees) where the number of index levels is increased and the index node size is kept
to a minimum to avoid costly in-node processing. IBM solidDB uses an index called a trie [4] that was originally
created for text searching but is well suited for in-memory indexing. The Vtrie and index concurrency control
designs are discussed in more detail in sections 2.1 and 2.3.
The solidDB in-memory technology is further enhanced by checkpoint execution. A checkpoint is a persis-
tent image of the whole database, allowing the database to be recovered after a system crash or other cases of
downtime. IBM solidDB creates a snapshot-consistent checkpoint that is alone sufficient to recover the database
to a consistent state that existed at some point in the past.
2.1 Vtrie indexes
The basic index structure in the in-memory engine is a Vtrie (variable-length trie), an optimized variation
of a path- and level-compressed trie. The Vtrie is built on top of a leaf-node level very similar to that used in
B+-trees. Key values are encoded in such a way that comparison of corresponding key parts can use bytewise
comparison, and the end of a key part evaluates as smaller than any byte value. Each Vtrie node can point to a maximum of
257 child nodes, one for each byte value plus "end of key part". The Vtrie index stores the lowest key value of each
leaf node, which guarantees that the Vtrie index will not occupy excessive space even if the key value distribution
would cause a low branching factor. The worst-case branching factor for the Vtrie is 2, which occurs when key values
are strings consisting of only two possible byte values. The B+-tree-like leaf level guarantees that the Vtrie still has
far fewer nodes than there are stored key values in the index.
There are two types of key values stored in the index: routing keys and row keys. Routing keys are stored in Vtrie
nodes, while row keys are stored at the leaf level, which consists of an ordered list of leaf nodes. Every leaf node
is pointed to by one routing key stored in the Vtrie. The value of the routing key is higher than the high key of
the node's left sibling and at most the same as the low key of the leaf node it points to.
Leaf node size varies depending on the keys stored in it, but the default size is a few cache lines. Unlike
B-trees and binary trees, the Vtrie does not execute any comparisons during tree traversal. Each part of a key is applied
as an array index into the child-pointer array of a node. Contrary to a value comparison, an array lookup is a fast
operation if the array is cached in the processor caches. When an individual pointer array is sparsely populated, it is
compressed to avoid unnecessary cache misses. Finally, on the leaf level, the row key lookup is performed by
scanning prefix-compressed keys in a cache-aligned leaf node.
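The following C sketch shows the comparison-free descent idea in isolation: each node has 257 child slots and each key byte is used directly as an array index. It deliberately omits the routing-key/leaf-node split, prefix compression, path and level compression, and concurrency control described above, and all names are illustrative.

#include <stddef.h>
#include <stdint.h>

#define END_OF_KEYPART 256          /* extra slot; sorts lower than any real byte value */

typedef struct trie_node {
    struct trie_node *child[257];   /* one slot per byte value plus the end marker       */
    void             *row;          /* set on the node reached via the end-marker slot   */
} trie_node;

/* Exact-match lookup: descend one level per key byte using pure array
 * indexing, then follow the end-of-keypart slot to reach the row. */
static void *trie_lookup(const trie_node *root, const uint8_t *key, size_t len)
{
    const trie_node *node = root;
    for (size_t i = 0; i < len && node != NULL; i++)
        node = node->child[key[i]];          /* no key comparisons on the way down */
    if (node == NULL)
        return NULL;
    const trie_node *leaf = node->child[END_OF_KEYPART];
    return leaf != NULL ? leaf->row : NULL;
}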
2.2 Main-memory checkpointing
A checkpoint is a persistent image of the whole database, allowing the database to be recovered after a system
crash or other cases of downtime. IBM solidDB executes a snapshot-consistent checkpoint [10] that is alone
sufficient to recover the database to a consistent state that existed at some point in the past. Other database
products do not normally allow this; the transaction log files must be used to reconstruct a consistent state.
However, solidDB allows transaction logging to be turned off, if desired. The solidDB solution is memory-
friendly because of its ability to allocate row images and row shadow images (different versions of the same
row) without using inefficient block structures. Only those images that correspond to a consistent snapshot are
written to the checkpoint file, and the row shadows allow the currently executing transactions to run unrestricted
during checkpointing.
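A minimal sketch of the row-shadow idea: each row keeps the image visible to the checkpoint snapshot separately from a newer shadow image created by a transaction running while the checkpoint proceeds, and the checkpointer writes only snapshot-visible images. The structures, timestamps, and file format below are assumptions for illustration, not solidDB's actual layout; the real algorithm is described in [10].

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct row_image {
    uint64_t commit_ts;             /* commit timestamp of this image          */
    char     data[64];              /* row payload                             */
} row_image;

typedef struct row {
    row_image *snapshot_image;      /* image visible at checkpoint start       */
    row_image *shadow_image;        /* newer image from an active or later txn */
} row;

/* Write the checkpoint-consistent image of every row to the checkpoint file;
 * ongoing transactions keep working on shadow images and are never blocked. */
static void checkpoint_rows(FILE *ckpt, row *rows, size_t n, uint64_t ckpt_ts)
{
    for (size_t i = 0; i < n; i++) {
        row_image *img = rows[i].snapshot_image;
        if (img != NULL && img->commit_ts <= ckpt_ts)
            fwrite(img, sizeof *img, 1, ckpt);
        /* shadow_image is ignored: it belongs to a transaction that committed
         * after the snapshot or has not committed yet. */
    }
}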
2.3 Main-memory index concurrency control
The in-memory engine’s indexing scheme [12] allows any number of readers to simultaneously traverse the
index structure without locking conflicts. A read request on an index uses an optimistic scheme that relies on
version numbers in index nodes to protect the integrity of the search path. The read-only traversal algorithm is
presented in Figure 1.
retry:
  parent = root; current = root; parent_version = parent.version;
  traverse_ongoing = TRUE;
  while traverse_ongoing loop
    current_version = current.version;
    if parent.version_mismatch(parent_version) then
      goto retry;
    child = current.child_by_key(key);
    if current.version_mismatch(current_version) then
      goto retry;
    if child.terminal() then
      result = child; traverse_ongoing = FALSE;
    else
      parent_version = current_version; parent = current;
      current = child;
    end if
  end loop
  return result;
Figure 1: Index concurrency control algorithm.
A write operation on an index uses a two-level locking scheme to protect the operation against other write
requests.
To support the lock-free read-only traversal scheme, when a writer changes any index node it first increments
the node's version number (to an odd value). After the consistency of the path being modified is achieved again, the
(odd) version numbers are incremented again (to even values). The method version_mismatch() above returns TRUE if
either of these applies:
1. node.version <> the version argument, OR
2. the version is an odd value.
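Conditions 1 and 2, together with the writer-side version bracketing, can be sketched with C11 atomics as follows. The sketch mirrors the description above but is not solidDB's code; the actual latching and memory-ordering details are those of [12].

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct index_node {
    _Atomic uint64_t version;           /* even = stable, odd = change in progress */
    /* ... keys and child pointers omitted ... */
} index_node;

/* Readers record the version before examining the node. */
static uint64_t node_version(index_node *n)
{
    return atomic_load(&n->version);
}

/* TRUE if the node changed since 'observed' was recorded, or a change is in
 * progress (odd value), exactly as conditions 1 and 2 state. */
static bool version_mismatch(index_node *n, uint64_t observed)
{
    uint64_t now = atomic_load(&n->version);
    return now != observed || (now & 1) != 0;
}

/* Writers hold the write latch on the node; bumping the version makes the
 * change visible to optimistic readers. */
static void writer_begin_change(index_node *n)
{
    atomic_fetch_add(&n->version, 1);   /* even -> odd: readers will retry        */
}

static void writer_end_change(index_node *n)
{
    atomic_fetch_add(&n->version, 1);   /* odd -> even: node is consistent again  */
}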
To help a read request complete the search even if the search path is being heavily updated, a hot-spot
protection scheme is also implemented. A search that needs to retry several times due to version conflicts
falls back to a locking mode that performs the traversal as if it were a write request. In practice this is very seldom
needed.
The above method still needs some refinement: when a "child" pointer is dereferenced, two issues arise.
1. Can we safely even read the version number by dereferencing the pointer?
2. Even if the pointer value is still a legal memory reference, how do we know that the version number field
mismatches in case the memory location has been freed and then reallocated for some other purpose?
Both of those issues would certainly be problems if freeing an unused tree node were a direct call to the
allocator's free() function. To resolve this issue, solidDB uses a memory manager for versioned memory objects
[11] that guarantees the version number field is neither overwritten by an arbitrary value nor decremented. The
engine has a "purge level", which is the deallocation counter value before the oldest index traversal started. When
an index traversal ends, it may increase the purge level. When the purge level increases, freed nodes can be
released to the underlying memory manager. This ensures that the version-checking traversal remains safe even
when index nodes are freed. Recycling of memory to new versioned memory objects, i.e., index tree nodes, is
still possible, because unpurged freed nodes can be reused as new versioned memory objects.
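The purge-level bookkeeping can be sketched as follows. This is a minimal, single-threaded illustration with invented names; the actual memory manager for versioned objects is the one described in [11].

#include <stdint.h>
#include <stdlib.h>

typedef struct freed_node {
    struct freed_node *next;
    uint64_t           freed_at;    /* value of dealloc_counter when retired   */
    void              *node;        /* the index node awaiting purge           */
} freed_node;

static uint64_t    dealloc_counter; /* incremented on every node retirement    */
static freed_node *pending;         /* retired but not yet purged nodes        */

/* Called instead of free(): the node stays readable until it is purged. */
void node_retire(void *node)
{
    freed_node *f = malloc(sizeof *f);
    f->node     = node;
    f->freed_at = ++dealloc_counter;
    f->next     = pending;
    pending     = f;
}

/* purge_level = dealloc_counter value before the oldest live traversal
 * started; anything retired before that can no longer be reached. */
void purge_until(uint64_t purge_level)
{
    freed_node **pp = &pending;
    while (*pp != NULL) {
        if ((*pp)->freed_at < purge_level) {
            freed_node *f = *pp;
            *pp = f->next;
            free(f->node);          /* safe: no traversal can still see it     */
            free(f);
        } else {
            pp = &(*pp)->next;
        }
    }
}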
3 High Availability Through the Use of Hot-Standby Replication
IBM solidDB provides high availability through automated error detection and fast failover functionality in a two-
node Hot-Standby cluster (HSB, for short). HSB can be efficiently combined with the in-memory engine because
replicating REDO logs provides failure resiliency without having to wait for a single disk access before commit.
Full durability is achieved by means of asynchronous logging. Prior to active HSB replication, the nodes, namely
the Primary (master) and the Secondary (standby, slave), establish a connection which guarantees that the databases in both nodes
are in sync. The connection is maintained during normal operation and monitored by a heartbeat mechanism. When
connected, the Primary accepts write transactions, generates row-level redo log records of the changes, and sends them
to the Secondary. The Secondary secures the consistency of the standby database by first acquiring long-term locks for rows
before re-executing log operations. As a consequence, conflicting updates [16] are serialized due to the use of
locks, but non-conflicting ones benefit from parallelized, multi-threaded execution in multi-core environments.
In an HSB database, transaction logs are sent to the Secondary server by way of a replication protocol. In
order to preserve database consistency in the presence of failover, the replication protocol is built very much
on the same principles as physical log writing: the transaction order is preserved, and commit records denote
committed transactions. If a failover happens, the standby server performs a database recovery similar to one
using a transaction log: the uncommitted transactions are removed and the committed ones are queued for
execution.
Synchronous (2Safe) replication protocols provide varying balances between safety and latency [16]. How-
ever, they all ensure that there is no single point of failure that could lose transactions. There are three 2Safe
variants available. 2Safe Received commits as soon as the Secondary acknowledges that it has received the transac-
tion log. 2Safe Visible and 2Safe Committed both commit when the Secondary has executed and committed the
transaction, but when 2Safe Visible is used, the Secondary sends the acknowledgment to the Primary prior to physical
log writing. Unlike the 2Safe protocols, the asynchronous replication protocol (1Safe) prefers performance over safety
by committing immediately without waiting for the Secondary's response.
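The difference between the protocols is essentially the point at which the Primary may acknowledge the commit, as the sketch below summarizes. The enum and stub functions are placeholders standing for steps of the HSB protocol, not solidDB APIs.

#include <stdbool.h>

typedef enum {
    REPL_1SAFE,            /* asynchronous: do not wait for the Secondary          */
    REPL_2SAFE_RECEIVED,   /* Secondary has received the transaction log           */
    REPL_2SAFE_VISIBLE,    /* Secondary has executed the txn (ack before its log)  */
    REPL_2SAFE_COMMITTED   /* Secondary has executed and physically logged it      */
} repl_mode;

/* Stand-ins for acknowledgment messages of the HSB replication protocol. */
static bool secondary_ack_received(void) { return true; }
static bool secondary_ack_executed(void) { return true; }
static bool secondary_ack_durable(void)  { return true; }

/* Returns TRUE when the Primary may report the transaction as committed. */
bool primary_commit(repl_mode mode)
{
    switch (mode) {
    case REPL_1SAFE:
        return true;                       /* commit immediately, no wait          */
    case REPL_2SAFE_RECEIVED:
        return secondary_ack_received();
    case REPL_2SAFE_VISIBLE:
        return secondary_ack_executed();   /* ack sent before Secondary's log write */
    case REPL_2SAFE_COMMITTED:
        return secondary_ack_durable();    /* ack sent after Secondary's log write  */
    }
    return false;
}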
When the solidDB server is used in Hot-Standby mode, it is state conscious: inside solidDB there is a non-
deterministic finite-state automaton which, at any moment, unambiguously defines the server's capabilities, for
example, whether the server can execute write transactions. solidDB provides an administrative API for querying
and modifying the HSB states.
solidDB also provides High-Availability Controller (HAC, for short) software to automatically detect er-
rors in HSB and recover from them. Every HSB node includes a solidDB HSB server and a HAC instance.
In addition to sub-second failure detection and failover, HAC handles all single failures and several failure
sequences without downtime. Network errors are detected by using an External Reference Entity (ERE) [15].
The ERE can be any device sharing the network with the HSB nodes that responds to ping
commands. When the HSB connection gets broken, HAC first attempts to reach the ERE to determine the correct
failure type. If the ERE responds, HAC concludes that the other server has either failed or become isolated. Only
after that can the local HSB server continue (if it was the Primary) or start (otherwise) to execute write transactions.
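The decision HAC makes when the HSB link breaks can be summarized as follows; ping_ere() and the role handling are placeholders for illustration, not the HAC interface.

#include <stdbool.h>
#include <stdio.h>

static bool ping_ere(void) { return true; }   /* stand-in for pinging the ERE        */

typedef enum { ROLE_PRIMARY, ROLE_SECONDARY } hsb_role;

void on_hsb_link_broken(hsb_role local_role)
{
    if (ping_ere()) {
        /* ERE reachable: the other node has failed or is isolated, so this
         * node may safely serve writes (continue as Primary, or take over). */
        if (local_role == ROLE_PRIMARY)
            printf("continue accepting write transactions\n");
        else
            printf("switch to Primary and start accepting writes\n");
    } else {
        /* ERE unreachable: this node is the isolated one; it must not accept
         * writes, or a split-brain situation could arise. */
        printf("remain (or become) Secondary and wait for connectivity\n");
    }
}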
4 Performance and Conclusions
In solidDB, the main focus is on short response times, which naturally result from the fact that the data is already
in memory. Additional techniques are available to avoid any unnecessary latency in local database usage,
such as linking applications with the database server code by way of special drivers. By using those approaches,
one can shorten the query response time to a fraction (one tenth or even less) of that available in a traditional
database system. The improved response time also fuels high throughput. Nevertheless, techniques for improving
the throughput of concurrent operations are applied too. The outcome is a unique combination of short response
times and high throughput.
The advantages of the solidDB in-memory database over a disk-based database are illustrated in Figure 2, which
shows the response times in milliseconds of a single primary-key fetch and a single-row update based on the primary
key.
Figure 2: solidDB single operation response time.
Figure 3 shows the results of an experiment involving a database benchmark called Telecom Application
Transaction Processing (TATP)¹ that was run on a mid-range system platform². The TATP benchmark
simulates a typical Home Location Register (HLR) database used by a mobile carrier. The HLR is an application
mobile network operators use to store all relevant information about valid subscribers, including the mobile
phone number, the services to which they have subscribed, access privileges, and the current location of the
subscriber's handset. Every call to and from a mobile phone involves lookups against the HLRs of both parties,
making it a perfect example of a demanding, high-throughput environment where the workloads are pertinent to
all applications requiring extreme speed: telecommunications, financial services, gaming, event processing and
alerting, reservation systems, and so on. The benchmark generates a flooding load on the database server, meaning
that the load is generated up to the maximum throughput point that the server can sustain. The load is
composed of pre-defined transactions run against a specified target database.
The benchmark uses four tables and a set of seven transactions that may be combined in different mixes. The
most typical mix is a combination of 80% read transactions and 20% modification transactions. The TATP
benchmark has been used in industry [6] and research [7, 3, 8, 9, 14]. The experiment used a database containing
1 million subscribers. The results of a TATP benchmark show the overall throughput of the system, measured
as the Mean Qualified Throughput (MQTh) of the target database system, in transactions per second, over the
seven transaction types. This experiment used shared memory access for the clients, and the figure shows the scalability
of solidDB as the number of concurrent clients increases.
Figure 3: solidDB user load scalability.
In addition to telecom solutions, solidDB has shown its strength in various other business areas where
predictable low-latency and high-throughput transactional data processing is a must, such as media delivery [2]
and electronic trading platforms.
References
[1] Chuck Ballard, Dan Behman, Asko Huumonen, Kyösti Laiho, Jan Lindström, Marko Milek, Michael Roche, John
Seery, Katriina Vakkila, Jamie Watters and Antoni Wolski, IBM solidDB: Delivering Data with Extreme Speed, IBM
RedBook, ISBN 07384355457, 2011.
¹ http://tatpbenchmark.sourceforge.net
² Hardware configuration: Sandy Bridge EP Processor: Intel Xeon E5-2680 (2.7 GHz, 2 sockets, 8 cores/16 threads per socket),
Memory: 32 GB (8 x 4 GB DDR3-1333), Storage: 4x 32 GB Intel X25-E SSDs, OS: RHEL 6.1 (64-bit)
[2] Fabrix Systems - Clustered Video Storage, Processing and Delivery Platform,
http://www-304.ibm.com/partnerworld/gsd/solutiondetails.do?solution=44455&expand=true&lc=en, IBM, 2013.
[3] Ru Fang, Hui-I Hsiao, Bin He, C. Mohan, and Yun Wang: A Novel Design of Database Logging System using Storage
Class Memory. ICDE 2011.
[4] Edward Fredkin: Trie Memory, Communications of the ACM 3 (9): 1960.
[5] Joan Guisado-Gámez, Antoni Wolski, Calisto Zuzarte, Josep-Lluís Larriba-Pey, and Victor Muntés-Mulero, Hybrid
In-memory and On-disk Tables for Speeding-up Table Accesses, DEXA 2010.
[6] Intel and IBM Collaborate to Double In-Memory Database Performance, Intel 2009,
http://communities.intel.com/docs/DOC-2985.
[7] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and Anastasia Ailamaki: Data-Oriented Transaction
Execution. PVLDB, 3(1), 2010.
[8] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki: Scalability of
write-ahead logging on multicore and multisocket hardware. VLDB Journal 21(2), 2011.
[9] Per-Åke Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling:
High-Performance Concurrency Control Mechanisms for Main-Memory Databases. VLDB 2012.
[10] Antti-Pekka Liedes and Antoni Wolski, SIREN: A Memory-Conserving, Snapshot-Consistent Checkpoint Algorithm
for in-Memory Databases, ICDE 2006.
[11] Antti-Pekka Liedes and Petri Soini, Memory allocator for optimistic data access, US Patent number 12/121,133,
2008.
[12] Antti-Pekka Liedes, Bottom-up Optimistic Latching Method For Index Trees, US Patent 20120221531, 2012.
[13] Kerttu Pollari-Malmi, Jarmo Ruuth and Eljas Soisalon-Soininen, Concurrency Control for B-Trees with Differential
Indices, IDEAS 2000.
[14] Kishore Kumar Pusukuri, Rajiv Gupta, and Laxmi N. Bhuyan: No More Backstabbing... A Faithful Scheduling Policy
for Multithreaded Programs. PACT 2011.
[15] Vilho Raatikka and Antoni Wolski, External Reference Entity Method For Link Failure Detection in Highly Available
Systems, International Workshop On Dependable Services and Systems (IWoDSS), 2010.
[16] Antoni Wolski and Vilho Raatikka, Performance Measurement and Tuning of Hot-Standby Databases, ISAS 2006,
Lecture Notes in Computer Science, Volume 4328, 2006.
[17] Antoni Wolski and Kyösti Laiho, Rolling Upgrades for Continuous Services, ISAS 2004.
[18] Antoni Wolski, Sally Hartnell: solidDB and Secrets of the Speed, IBM Data Management (1), 2010.