A Novel Index-Organized Storage Model for Hybrid
DRAM-PM Main Memory Database Systems
Qian Zhang
East China Normal University
Xueqing Gong
East China Normal University
Hassan Ali Khan
East China Normal University
Jianhao Wei
East China Normal University
Yiyang Ren
East China Normal University
Research Article
Keywords: DRAM-PM, Main Memory Database System, Index-Organized, Hybrid Workload
Posted Date: November 18th, 2024
DOI: https://doi.org/10.21203/rs.3.rs-5286510/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Additional Declarations: No competing interests reported.
A Novel Index-Organized Storage Model for
Hybrid DRAM-PM Main Memory Database
Systems
Qian Zhang, Xueqing Gong, Hassan Ali Khan, Jianhao Wei, Yiyang Ren
Software Engineering Institute, East China Normal University, Putuo
North Zhongshan Road, Shanghai, 200062, China.
*Corresponding author(s). E-mail(s): 52184501012@stu.ecnu.edu.cn;
Contributing authors: xqgong@sei.ecnu.edu.cn;
hassankhanzae@gmail.com; 52215902007@stu.ecnu.edu.cn;
51275902044@stu.ecnu.edu.cn
Abstract
Large-scale data-intensive applications need massive real-time data processing.
Recent hybrid DRAM-PM main memory database systems provide an effective
approach by persisting data to persistent memory (PM) in an append-based man-
ner for efficient storage while maintaining the primary database copy in DRAM
for high throughput rates. However, they cannot achieve high performance under
a hybrid workload because they are unaware of the impact of pointer chasing.
In this work, we investigate the impact of chasing pointers on modern main mem-
ory database systems to eliminate this bottleneck. We propose an Index-Organized
storage model that supports efficient reads and updates. We combine two
techniques, i.e., cacheline-aligned node layout and cache prefetching, to accelerate
pointer chasing, reducing memory access latency. We present four optimiza-
tions, i.e., pending versions, fine-grained memory management, Index-SSN, and
cacheline-aligned writes, for supporting efficient transaction processing and fast
logging. We implement our proposed storage model based on an open-sourced
main memory database system. We extensively evaluate performance on a 20-core
system featuring Intel Optane DC Persistent Memory Modules. Our experiments
reveal that the Index-Organized approach achieves up to 3× speedup compared
to traditional storage models (row-store, column-store, and row+column).
Keywords: DRAM-PM, Main Memory Database System, Index-Organized, Hybrid
Workload
1 Introduction
The advancement of Non-Volatile Memory (NVM) technology has led to the increasing
maturity of byte-addressable persistent memory (PM). Recent main memory database
(MMDB) systems have adopted a hybrid DRAM-PM hierarchy [1–5]. These systems
leverage PM to achieve high performance with low-latency reads and writes compa-
rable to DRAM, while also providing persistent writes and extensive storage capacity
similar to SSDs.
However, modern MMDB systems cannot achieve high performance under hybrid
workloads that include transactions updating the database while also executing
complex analytical queries on the same dataset [6–13]. For hybrid workloads, some
studies combine the advantages of the row-store and column-store storage models to
design the row+column storage model [14–18]. However, we observe that modern
MMDB systems face a new bottleneck in data access. These systems have applied
optimization techniques to achieve a higher level of parallelism, e.g., write-optimized
indexes [19, 20], multi-version concurrency control (MVCC) [21–23], lock-free
structures [20, 24, 25], and partitioning [26–28]. Unfortunately, such new designs,
implemented on memory-resident and pointer-based storage structures, result in
complex read paths, leading to more pointer chasing and increased memory access
latency. After these optimization techniques are applied, the data access bottleneck
in modern MMDB systems shifts to chasing pointers.
In main memory systems, storage structures are often implemented as pointer-
based, e.g., nested arrays, linked lists. The relationships between different data items
are maintained by using pointers. Pointer-chasing is the fundamental behavior to tra-
verse those linked data structures [29]. Unfortunately, complex data access patterns
are very difficult (if not impossible) for hardware prefetchers to predict accurately,
leading to a very high likelihood of last-level cache misses¹ upon pointer dereference
[30]. As a result, data stalls often dominate total CPU execution time [29, 31–33].
In this work, we study the performance impact of pointer chasing on storage
models. The data access of an MMDB system typically consists of four steps: (1) a
root-to-leaf tree traversal to search for a key in the tree-like table index; (2) a lookup
in the indirection array to find the logical address of the key's version chain; (3) a
lookup in the mapping table to find the physical block storing the HEAD of the
version chain; and (4) a linear traversal over the version chain until the target version
is found. As shown in Fig. 3, it is the massively expensive pointer chasing that hides
the performance gap between the row-store, column-store, and row+column storage
models, thus heavily impacting the system's performance.
To address this issue, we introduce a novel storage model called Index-Organized.
The contributions of this paper are summarized as follows:
1. We observe that chasing pointers is becoming the new bottleneck of data access
in current hybrid DRAM-PM main memory database systems. To our knowledge,
this paper is the first work that comprehensively studies the performance impact of
chasing pointers on storage models.
¹ Unless otherwise specified, in this paper, cache misses specifically denote last-level misses that require
accessing memory.
2. We design and implement an index-organized storage model for modern MMDBs.
We combine two techniques to accelerate pointer chasing, i.e., cacheline-aligned
node layout and cache prefetching.
3. We present four optimizations for supporting efficient transaction processing and
fast logging, i.e., pending versions, fine-grained memory management, Index-SSN,
and cacheline-aligned writes.
4. We experimentally analyze the performance of Index-Organized using a compre-
hensive set of workloads and show that Index-Organized outperforms conventional
storage models.
2 Background and Motivation
2.1 Typical Storage Models
Fig. 1 Performance impact of the three typical storage models (projectivity = 0.01) - execution time when running workload 1 (left) and workload 2 (right).
For a relational table, there are typically three storage models: row-store [34, 35],
column-store [36], and row+column [37]. In the row-store manner, all attributes of a
record are stored consecutively on one page. This yields good insert performance
because a record can be added to the table with a single write. In the column-store
manner, each attribute of a record is stored on a different page, and values of the
same attribute from different records are placed consecutively. For a query workload
that accesses only a subset of the attributes, a column-store storage model is the
best choice because it exploits spatial locality. For a hybrid workload, some works
combine the advantages of row-store and column-store to design the row+column
storage model [37–39].
In Fig. 1, we reproduce the experiment of previous research [37]. We create a database
containing a single table RT(a0, a1, a2, ..., a500) that consists of 500 attributes. Each
attribute ak holds a random integer value. During initialization, we load 1 million
records. Two types of queries are considered: scan (select a0, a1, a2, ..., ak from RT
where a0 < δ) and insert (insert into RT values (a0, a1, a2, ..., a500)). Two workloads
are examined: (1) a mixed workload comprising 1000 scan queries followed by 10
million insert queries, and (2) a read-only workload of 1000 scan queries. Note that
the k and δ values control the queries' projectivity and selectivity, respectively.
As depicted in Fig. 1 (left), the row-store storage model demonstrates superior
performance compared to the column-store layout, achieving up to a 1.2× speedup.
Furthermore, as the number of insert queries increases, the performance gap widens.
This is because the column-store must split a record's attributes during each insert
operation and store them in separate memory locations. For the scan queries, the
results are shown in Fig. 1 (right). We see that the column-store executes the scan
query up to 1.6× faster than the row-store. The better performance is attributed to
the column-store's improved cache efficiency and bandwidth utilization, since
unnecessary attributes are excluded.
2.2 Techniques to Absorb More Writes
As shown in Fig. 2, in current MMDB systems, state-of-the-art techniques (write-
optimized indexes, lock-free structures, MVCC, and data partitioning) can significantly
improve write throughput but also complicate reads.
Fig. 2 Four state-of-the-art techniques used in modern MMDB systems to absorb more writes.
Write-optimized Index. To relieve write amplification, researchers have proposed
various modifications [10, 19, 20, 40–42] to the typical B-tree index. Among these,
the Bw-tree's delta-chain approach represents a notable advancement. A write to
a leaf node or an internal node is appended to a delta chain, avoiding directly editing
nodes. As the delta chains grow, they are eventually merged with the tree nodes in
a batched manner, thereby reducing write amplification. However, long delta chains
slow down data searches because more pointer chasing is needed to find the target
key/value pair.
MVCC. Most main memory database systems have adopted multi-version concurrency
control (MVCC) because updates never block reads [22]. The system maintains
multiple physical versions of a record within the database to support concurrent
operations on the same record. The versions are managed as a linked list, i.e., a
version chain. During query processing, traversing a long-tail version chain to locate
the required version is very slow due to pointer chasing and pollutes the cache by
accessing redundant data [43]. Therefore, as short-lived updates produce more and
more versions, the version chain grows longer, and the search structures and
algorithms [44] can seriously limit throughput.
Lock-free structures. Many main memory database systems [41, 45–47] implement
lock-free data structures to mitigate the bottlenecks inherent in locking mechanisms.
A widely used design is an indirection structure that maps logical IDs to the physical
addresses of the database's objects, e.g., an indirection mapping table [20] or an
indirection array [13, 48]. While this indirection allows the system to update a
physical address atomically using compare-and-swap (CAS), it degrades performance
because it adds expensive pointer chasing, resulting in a slower read path [41].
Data partitioning. Data partitioning can substantially improve system performance
[26, 49, 50]. It enables vectorized processing [51], which passes a block of records to
each operator, achieving higher performance than the canonical tuple-at-a-time
iterator model [52]. However, it adds the cost of maintaining many data fragments
(horizontal partitions), increasing the complexity of the database's data structures
[27] and degrading the spatial locality of data searches.
2.3 The Impacts of Pointer Chasing
Fig. 3 Profiling results comparing pointer-chasing code to other system code - time breakdown of various components of the MMDB system when running a hybrid workload.
Fig. 3 shows the practical performance impact of different storage models on a real
main memory database system [48]. For this experiment, we use the YCSB benchmark
[53] to build an HTAP workload. The hybrid workload consists of a write-only
(100% update) thread and a read-only (100% scan) thread. We use a Zipfian access
distribution with a 0.9 skew factor (80% of requests access 33% of records). We
bulk-load the database with 10,000 records (1000 B per tuple).
We observe that the column-store does not outperform the row-store in most
scenarios. One reason is that the row-store reduces the record reconstruction cost;
it avoids unnecessary memory references and pointer chasing, thus achieving a higher
throughput rate. Second, the column-store benefits less from inter-record spatial
locality: complicated data access patterns result in massively expensive pointer
chasing, which limits the effectiveness of hardware prefetching, so the cache-efficient
column-store cannot do better.
As shown in Fig. 3, the performance gap between the row-store and column-store is
smaller than reported in previous work, i.e., less than 1.2× under this heterogeneous
workload. This is because, in this experiment, the complicated pointer-based data
structures involve long pointer-chasing sequences, which is a major practical problem
that limits throughput. Neither the row-store nor the column-store storage model can
achieve high throughput. Due to its index organization and pointer-chasing
acceleration, the index-organized scan exhibits much higher and more robust
performance as chain lengths grow.
3 Characteristics of Modern Hardware
Fig. 4 Performance achieved when traversing a linked list containing pointers to the payloads - total traversal time vs. number of nodes (left) and per-node traversal time vs. payload size (right).
Table 1 Comparison of the characteristics of DRAM and PM (Optane DC PMMs).

                                            DRAM     PM (Optane DC PMMs)
Sequential  read latency (ns)                 70       170
            read bandwidth, 6 channels (GB/s) 120       91
            write bandwidth, 6 channels (GB/s) 120      27
Random      read latency (ns)                110       320
            read bandwidth, 6 channels (GB/s) 120       28
            write bandwidth, 6 channels (GB/s) 120       6
Others      addressability                   Byte     Byte
            access granularity (bytes)         64      256
            persistence                        No      Yes
            endurance (cycles)               10^16    10^10
In Fig. 4, we explore the pointer chasing of realistic systems on modern processors.
In this experiment, we consider a fundamental reference behavior, specifically
traversing a linked list. We set several parameters for this behavior, including the
node count and the data payload size per node. Every node comprises a "next"
pointer and an indirect data payload, i.e., a pointer to the actual payload data. We
separate the linked-list nodes and their data by 4 KB to defeat cache-line locality and
hardware prefetching. In Fig. 4 (left), the traversal time increases linearly with the
number of linked-list nodes. We attribute this to spending more time retrieving each
node payload from memory by pointer chasing. In Fig. 4 (right), we notice a similar
trend in the average traversal time per node as the payload size ranges from 64 bytes
to 2 KB. With larger payload sizes, traversal performance is notably influenced by
fetching the payload data through a reference pointer. This experiment highlights
that linked-list traversal performance is very sensitive to pointer chasing due to its
pointer-based structure and complex data access patterns.
In Table 1, we present a summary of the characteristics of the memory devices used
in the experiments. Persistent memory (PM) is an innovative memory technology
that effectively bridges the performance gap between DRAM and flash-based storage
(e.g., SSDs). It offers latency comparable to DRAM while providing persistence akin
to traditional storage media. The Intel Optane DC Persistent Memory Module
(Optane DCPMM) is the first commercially available PM product [54]. We observe
that Optane DCPMM exhibits higher latency and lower bandwidth than DRAM, and
both DRAM and Optane DCPMM show high random read latency. In current MMDB
systems, data access patterns introduce large numbers of random memory accesses,
notably impacting the efficiency of pointer-chasing operations. Therefore, to achieve
higher throughput, database developers should optimize the storage structures to
minimize the number of random memory accesses.
4 Index-Organized
In MVCC, a single logical record corresponds to one or more physically materialized
versions. These versions are organized as a singly linked list, which represents a
version chain. With new-to-old ordering, a version chain's HEAD is the latest version,
which refers to its predecessor. Index-Organized maintains the latest version of each
record in the tree store and the older versions of the same record in the version store.
Fig. 5 shows the overall architecture of Index-Organized. It consists of three
components: (1) the tree index for index versions and tombstone versions; (2) the
mapping table for pending versions; and (3) the version store for snapshot versions.
Index-Organized leverages tree indexes and a cache-friendly layout to support
efficient read and update operations. While this design minimizes the number of
random memory accesses and improves the performance of pointer chasing, it
introduces new concurrency, memory management, and resource utilization challenges.
Subsequent sections discuss the design of each component in detail, specifically how
we store versions in cache-friendly layouts and how we handle concurrent operations.
Finally, we present how we leverage cache prefetching to eliminate the increase in
version-search latency caused by pointer chasing.
4.1 Version Types
Fig. 5 Architecture of the Index-Organized storage model and the version types of a logical record.

Index Versions are created upon the insertion of new records. Every index version
comprises key and non-key attributes stored alongside the key attribute within a B-tree
structure. Additionally, it contains a pointer to the predecessor and the transaction
timestamp of the inserting transaction. The latter is essential for index-only visibility
checks. For example, a transaction with transaction timestamp 9 (Fig. 5) inserts a
new record (r0) in its initial version (r0.v0), causing the creation of an index version
in the tree's leaf node. Generally, all of the table data can be held in its primary key
index.
Pending Versions result from updates to existing index versions. When an update
transaction attempts to modify an existing record, it acquires an empty slot in the
memory pool and copies the index version to this location. Then, it inserts the new
version into the mapping table to logically replace the index version, together with
the key and the version information (update timestamp, create timestamp). Before
the update transaction commits, concurrent transactions reading the record are
tracked in the reader list. Once the update transaction commits successfully, the
contents of the index version are changed, and the pending version is evicted from
the mapping table. For example, a transaction with transaction timestamp 15 (Fig. 5)
updates the attributes of record r1, producing a pending version. Although the
attributes remain unchanged, the version information of r1 has to be updated, causing
the creation of a pending version in the mapping table.
Snapshot Versions are no longer referenced by any write operation. Snapshot
versions are appended to the version store and placed in new-to-old order. A snapshot
version contains both the start and end timestamps denoting the version's lifetime,
together with its search key and the non-key attributes (Fig. 5). When a transaction
invokes a read operation on an existing record, the system searches for a visible
version whose start and end timestamps enclose the transaction's timestamp. For
example, a transaction with transaction timestamp 10 (Fig. 5) updates record r1,
creating a snapshot version (r1.v3) and modifying the attributes from 'AA, BA' to
'AA, BB'.
Tombstone Versions indicate the deletion of a record. If a record is logically
deleted, it is not erased immediately from the tree index because it could still be
visible to a concurrent transaction. Rather, a tombstone version is inserted into the
version chain, and the deletion is reflected in the tree index. Tombstone versions
mark the end of the whole version chain. A tombstone version contains the is-visible
flag and the transaction timestamp of the deleting transaction. For example, a
transaction with transaction timestamp 20 (Fig. 5) deletes record r2, creating a
tombstone version (r2.v2) and marking the whole version chain r2.v2 → r2.v1 → r2.v0
as deleted.
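The C++ sketch below summarizes the per-version metadata described above and drawn in Fig. 5; the field names, types, and widths are illustrative assumptions, not the system's actual definitions.

#include <cstdint>
#include <vector>

// Illustrative per-version metadata, following the fields shown in Fig. 5.
struct IndexVersion {           // lives in a B-tree leaf node
    uint64_t txn_timestamp;     // timestamp of the inserting transaction
    const void* predecessor;    // pointer to the older versions (version chain)
    // key and non-key attributes are stored inline in the leaf node
};

struct PendingVersion {         // lives in the mapping table until commit
    uint64_t update_timestamp;  // timestamp of the updating transaction
    uint64_t create_timestamp;  // timestamp that created the overwritten version
    std::vector<uint64_t> reader_list;  // concurrent readers, tracked for serializability
    const void* predecessor;
};

struct SnapshotVersion {        // appended to the version store (new-to-old order)
    uint64_t start_timestamp;   // version becomes visible at this timestamp
    uint64_t end_timestamp;     // version is superseded at this timestamp
    const SnapshotVersion* next;  // next (older) version
    const SnapshotVersion* jump;  // two-ahead jump pointer used for prefetching (Section 4.6)
};

struct TombstoneVersion {       // marks the logical deletion of a record
    bool     is_visible;        // cleared once the delete commits
    uint64_t txn_timestamp;     // timestamp of the deleting transaction
};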
4.2 Basic Operations
Insert. An insertion creates an Index Version in the tree index with the key attribute
and non-key attributes of the newly inserted version and the timestamp of the
inserting transaction. During insertion, the process traverses the tree to find the
target leaf node and inspects the status word and frozen bit of the leaf node.
Subsequently, it reads the occupied-space and node-size fields and calculates the
currently occupied space. If sufficient free space exists, it initializes a record meta
and allocates storage space for the new version. Next, it copies the new version to
the reserved space and updates the fields in the corresponding record meta, setting
the visible flag to invisible and the state flag to in-inserting. A concurrent insert may
encounter this in-progress insertion; in that case, it rechecks the position until the
in-process insert has finished. Finally, once the insert succeeds, the record meta's
visible flag is set to visible, and the offset field is set to the actual storage offset.
Read. The read operation starts by searching the mapping table, which may contain
the target key and record. If a pending version is found in the mapping table and is
visible to the current transaction, we return it and terminate the read operation
early. If the record is not found, we traverse the tree index from the root to the leaf
node. Afterward, the matching index version is retrieved and checked for visibility.
Typically, a read-for-update operation searches for the index version in the leaf nodes
of the tree index, avoiding the more complex retrieval of versions in the version store.
For read-only operations, an index-only scan is sufficient: upon reaching a leaf node,
both the key and non-key attributes can be retrieved, reducing the need for random
memory accesses. A long-running read transaction may need to traverse a long
version chain to find a visible version. Cache prefetching is used to look ahead to
farther versions during the traversal of the version chain; it speeds up pointer-chasing
operations and achieves good cache performance, improving searches and scans.
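A simplified C++ sketch of this read path is shown below; the MappingTable and TreeIndex stand-ins (plain STL containers) and the Version fields are assumptions used only for illustration, not the system's actual interfaces.

#include <cstdint>
#include <map>
#include <unordered_map>

struct Version {
    uint64_t start_ts = 0, end_ts = UINT64_MAX;
    const Version* next = nullptr;   // older version in the chain
    bool VisibleTo(uint64_t ts) const { return start_ts <= ts && ts < end_ts; }
};

// Hypothetical stand-ins for the real structures described in Section 4.
using Key          = uint64_t;
using MappingTable = std::unordered_map<Key, const Version*>;  // pending versions
using TreeIndex    = std::map<Key, const Version*>;            // index versions in leaf nodes

// Read path: mapping table first, then the tree index, then the version chain.
const Version* Read(Key key, uint64_t txn_ts,
                    const MappingTable& mapping_table, const TreeIndex& tree) {
    // 1. Pending version: an in-flight overwrite of the record, if any.
    if (auto it = mapping_table.find(key); it != mapping_table.end())
        if (it->second->VisibleTo(txn_ts)) return it->second;   // early return
    // 2. Index version: the latest committed version, held in the leaf node.
    auto it = tree.find(key);                                   // stands in for the root-to-leaf search
    if (it == tree.end()) return nullptr;                       // not found (or tombstoned)
    // 3. Older snapshot versions: walk the chain until a visible version is
    //    found (cache prefetching along the chain is covered in Section 4.6).
    for (const Version* v = it->second; v != nullptr; v = v->next)
        if (v->VisibleTo(txn_ts)) return v;
    return nullptr;                                             // no visible version
}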
Update. If a transaction attempts to update an existing record, it performs a
read-for-update operation on the tree index. Upon locating the leaf node containing
the searched key, it checks the record meta and ensures the record has not been
deleted. Then, within the leaf node, it replaces the changed attributes with the
updated values. Before the replacement, it does the following: (1) acquires a slot,
copies the index version to that location, and inserts the pending version into the
mapping table; (2) updates the corresponding record meta, setting the state flag to
in-processing to block other concurrent updates. Future readers can directly read
pending versions from the mapping table. Once the update transaction commits, the
index version becomes visible, and the pending version is removed from the
mapping table.
Delete. Like an update, deleting a record essentially inserts a tombstone version
into the tree index. During a read operation, encountering a tombstone in the leaf
node results in a not-found result. In the event of a split/merge operation on the tree
index, any existing tombstone version in the leaf node is removed.
4.3 Cacheline-aligned Node Layout
Fig. 6 Leaf node layout (64 bytes aligned).
Inner nodes and leaf nodes share the same layout, storing key-value pairs in sorted
or nearly-sorted order and allowing efficient lookups. The key is the indexed attribute
of the record. Within an inner node, the value is a pointer to a child, while in a leaf
node, the value is the Index Version, including the non-key attributes. Note that an
inner node is immutable once created and stores key/child-pointer pairs sorted by
key. By default, a leaf node has a fixed size (16 KB) and stores index versions in
nearly-sorted order.
As shown in Fig. 6, the node layout starts with a 20-byte Node Header, which encodes
the node size, successor pointer, record count (the number of child pointers/index
versions in the node), control, frozen, block size, and delete size. The control and
frozen fields flag in-process writes on the node. The block size field is used to
calculate the offset of a new index version within the available free space. The delete
size field is useful for performing node merges.
The Node Header is followed by an array of Record Meta entries, which store the
keys and the metadata of the key-value pairs. The Record Meta and the actual data
are stored separately to support efficient searches. By storing the keys in a sorted
array, the binary search can benefit from prefetching and obtain good cache
performance. Generally, a 16 KB leaf node can hold more than 16 index versions,
and the Record Meta array spans 8 × 64 bytes; before starting the binary search, we
can leverage the parallelism of the modern memory subsystem to prefetch the entire
Record Meta array. Furthermore, allowing the index versions to be nearly sorted
enables efficient insertion, as we only need to sort out-of-order keys when splitting
or merging leaf nodes.
Each Record Meta in a leaf node is 32 bytes; it stores the control and visibility bits
of the key and index version and the offset of the index version in the node. It also
stores the key and index-version lengths, the key attribute, the transaction timestamp
that created the index version, and the pointer to the predecessors. Unlike the leaf
node, each Record Meta in an inner node is 16 bytes, excluding the transaction
timestamp and pointer fields.
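The following C++ sketch illustrates the node header and leaf-node record metadata described above; the exact bit widths and packing are assumptions, since only the total sizes and field names are specified.

#include <cstdint>

// Illustrative 20-byte Node Header; alignment padding brings this sketch to 24 bytes.
struct NodeHeader {
    uint32_t node_size;          // total node size in bytes
    uint64_t successor;          // pointer to the successor node (installed on split)
    // 8-byte status word checked by optimistic latch-coupling (Section 4.5)
    uint64_t record_count : 16,  // number of child pointers / index versions
             control      : 1,   // flags an in-process write on the node
             frozen       : 1,   // write lock taken by a split/merge
             block_size   : 32,  // offset of the next free slot in the node
             delete_size  : 14;  // space freed by deletes, used to trigger merges
};
static_assert(sizeof(NodeHeader) <= 24, "header stays close to the 20-byte budget");

// Illustrative 32-byte leaf-node Record Meta; the inner-node variant drops the
// timestamp and predecessor pointer and is 16 bytes.
struct RecordMeta {
    uint64_t control   : 1,      // in-inserting / in-processing state bit
             visible   : 1,      // whether the index version is visible to readers
             offset    : 30,     // offset of the index version within the node
             key_len   : 16,
             value_len : 16;
    uint64_t key;                // key attribute (a fixed 8-byte key is assumed here)
    uint64_t txn_timestamp;      // transaction that created the index version
    uint64_t predecessor;        // pointer to the older versions
};
static_assert(sizeof(RecordMeta) == 32, "leaf record meta is 32 bytes");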
4.4 Pending Versions and Mapping Table
Fig. 7 Pending versions and the mapping table.
Index-Organized uses pending versions to reduce index-management operations and
avoid contention while reducing the indirection cost of multi-version execution. A
mapping table in Index-Organized links the search key to the physical location of
the pending version.
When a transaction attempts to update an existing record, it first searches for the
index version of the record in the leaf node of the tree index and then locks it by
setting the control bit of the Record Meta to in-processing. It then creates a pending
version by acquiring a slot and copying the latest version to that location. If the
transaction commits successfully, it uses an atomic compare-and-swap (CAS) on the
index version to make it point to the new HEAD. Otherwise, if it aborts, the pending
version will be deallocated by later reclamation. Before the update transaction
commits, any concurrent transaction can still read the record; the mapping table is
looked up first to find the pending version of the record.
As shown in Fig. 7, transaction Txn15 updates record r1 and creates a pending
version. The pending version allows Index-Organized to scale to high concurrency, as
concurrent transactions such as Txn16 can access the visible version without
blocking. A mapping table associates the record key with the physical location of the
pending version to accelerate the search for pending versions. We implement the
mapping table as a hash table for its simplicity and performance. To support
transaction serializability, we leverage the mapping table to track the dependent
reader transactions. The update transaction cannot commit until all the transactions
on which it depends have committed.
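The sketch below shows, in simplified C++, how a pending version might be installed and retired; the real mapping table is a lock-free hash table, whereas this sketch protects it with a mutex, and all names and signatures are assumptions.

#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct RecordMeta { std::atomic<uint8_t> state{0}; };          // 0 = idle, 1 = in-processing
struct PendingVersion { uint64_t update_ts; uint64_t create_ts; void* data; };

std::unordered_map<uint64_t, PendingVersion> mapping_table;    // key -> pending version
std::mutex table_mutex;                                        // stand-in for a lock-free hash table

// Start an update of record `key`: lock the record meta by flipping its state
// bit with a CAS, then publish a pending version in the mapping table.
bool BeginUpdate(uint64_t key, RecordMeta& meta, uint64_t txn_ts,
                 uint64_t create_ts, void* new_data) {
    uint8_t expected = 0;
    // The state bit blocks other concurrent updaters ("first-updater-wins").
    if (!meta.state.compare_exchange_strong(expected, 1)) return false;
    std::lock_guard<std::mutex> g(table_mutex);
    mapping_table[key] = PendingVersion{txn_ts, create_ts, new_data};
    return true;
}

// On commit, the index version is rewritten in place (the commit-time CAS is
// omitted here) and the pending version is evicted; on abort, only the eviction
// and the release of the state bit happen, and the slot is reclaimed later.
void FinishUpdate(uint64_t key, RecordMeta& meta, bool committed) {
    { std::lock_guard<std::mutex> g(table_mutex); mapping_table.erase(key); }
    meta.state.store(0);   // release the in-processing bit
    (void)committed;
}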
4.5 Contention Management
In Index-Organized, a split is triggered when a node's size exceeds a maximum
threshold, 16 key-value pairs by default. Similarly, a merge operation is triggered
when a node's size drops below a specified minimum threshold. The problem is that
split/merge operations and update transactions may contend for the latch of a leaf
node, which causes performance degradation.
Index-Organized implements optimistic latch-coupling [55] on tree nodes to reduce
this contention, based on the observation that although nodes are highly contended,
they are rarely modified, i.e., split/merged. Specifically, we use an 8-byte status word
(record count, control, frozen, block size, delete size) in the Node Header to check
the state of the node. A split operation (a write operation) acquires a write lock
(frozen) on the node and bumps the status word after modification. If the split
operation encounters an update transaction (a read operation), detected by checking
the control bit of the Record Meta, it adds the new node as the successor of the old
one; the update transaction can follow the successor to find the actual location of
the accessed index version. If an update transaction encounters a node that needs
splitting, it temporarily suspends its operation so that the split can be performed
before it continues.
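A C++ sketch of this status-word protocol, in the spirit of optimistic latch-coupling [55], is given below; the bit encoding of the status word and the retry policy are assumptions.

#include <atomic>
#include <cstdint>

struct Node {
    // Status word: version counter in the upper bits, frozen flag in bit 0
    // (this encoding is an assumption).
    std::atomic<uint64_t> status{0};
};

constexpr uint64_t kFrozen = 1ull;

// Retry the supplied node read until it observes a stable, unfrozen status word.
template <typename ReadFn>
auto OptimisticRead(const Node& node, ReadFn read_fn) {
    while (true) {
        uint64_t before = node.status.load(std::memory_order_acquire);
        if (before & kFrozen) continue;              // a split/merge is in progress
        auto result = read_fn();                     // perform the read on the node
        uint64_t after = node.status.load(std::memory_order_acquire);
        if (before == after) return result;          // no concurrent modification observed
    }
}

// A writer (e.g., a split) freezes the node, modifies it, then bumps the
// version counter so that in-flight optimistic readers retry.
void WriteNode(Node& node) {
    uint64_t s = node.status.load();
    do {
        s &= ~kFrozen;                               // only lock an unfrozen node
    } while (!node.status.compare_exchange_weak(s, s | kFrozen));
    /* ... modify the node here ... */
    node.status.store(((s >> 1) + 1) << 1, std::memory_order_release);  // bump version, clear frozen
}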
4.6 Jump Pointers and Cache Prefetching
In current main memory database systems, version chains are commonly structured
as linked lists, leading to random accesses during traversals. When a transaction
reads from the HEAD of the version chain, it incurs random memory accesses until
it locates the appropriate snapshot version.
Fig. 8 illustrates a version chain depicting multiple updates performed on a record.
The version chain is organized in new-to-old order. Since transaction Txn Y starts
with timestamp 2, it must traverse the version chain using the next pointer to retrieve
the target version v1 with start and end timestamps (0, 4). This is slow due to pointer
chasing, and reading each version in a sequential access pattern introduces many
cache misses. Furthermore, with many concurrent updates, the number of active
versions grows quickly, causing a long-tail version chain. The system may suffer from
increased memory access latency caused by traversing a long version chain, leading
to a significant drop in overall throughput.
Index-Organized solves this pointer-chasing problem by using cache prefetching
[29, 31, 32]. Specifically, each snapshot version is accompanied by a jump pointer
field. These pointers indicate the version that should be prefetched next. For
instance, in Fig. 8, v_{i+2} can directly prefetch v_i using a two-ahead jump pointer.
This strategy aims to ensure that when the traversal is ready to access a version,
that version is already cached, effectively reducing the latency of accessing v_i.
Through this approach, Index-Organized can improve the search for the visible
version with a parallel access pattern.
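A minimal C++ sketch of such a traversal follows; the prefetch call mirrors the __builtin_prefetch usage shown in Fig. 8, while the struct layout and two-ahead policy details are assumptions.

#include <cstdint>

struct SnapshotVersion {
    uint64_t start_ts, end_ts;       // version lifetime
    const SnapshotVersion* next;     // immediate predecessor (older version)
    const SnapshotVersion* jump;     // two-ahead jump pointer
};

// Walk the chain new-to-old; prefetch the version two hops ahead so it is
// (likely) already cached by the time the traversal reaches it.
const SnapshotVersion* FindVisible(const SnapshotVersion* head, uint64_t ts) {
    for (const SnapshotVersion* v = head; v != nullptr; v = v->next) {
        if (v->jump != nullptr)
            __builtin_prefetch(v->jump, /*rw=*/0, /*locality=*/3);
        if (v->start_ts <= ts && ts < v->end_ts) return v;   // visible version found
    }
    return nullptr;                                          // no visible version
}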
Fig. 8 Jump pointers and cache prefetching.
5 Optimizations
5.1 Fine-grained Memory Management
Valenblock. In Index-Organized, inner nodes remain immutable once created, holding
key/child-pointer pairs sorted by key. Because inner nodes are mostly read and
change infrequently, they are entirely replaced by newly created inner nodes during
split operations. This approach improves searches within the node but necessitates
variable-length memory allocations. To address this challenge, we design and
implement a valenblock allocator tailored to allocating storage space for inner nodes.
The valenblock allocator comprises two layers. In the first layer, a chunk list maintains
a set of fixed-size memory chunks. When the allocator needs a new chunk, it acquires
one from the linked list, sets its reference counter to 0, and adds it to the active
chunk array. In the second layer, upon creating an inner node, a thread secures a
small space from a random active chunk and increments its reference count by one.
Upon releasing the inner node, the reference count of the active chunk decreases by
one. A background thread periodically examines and reclaims active chunks that are
no longer referenced by any inner node.
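The C++ sketch below captures the two-layer idea of the valenblock allocator; the chunk size, the policy of picking the first chunk with room (rather than a random one), and the single-threaded list handling are simplifying assumptions.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Chunk {
    static constexpr size_t kSize = 1 << 20;   // fixed chunk size (1 MB assumed)
    std::atomic<uint32_t> refs{0};             // inner nodes still alive in this chunk
    std::atomic<size_t>   used{0};             // bump-allocation cursor
    uint8_t data[kSize];
};

class ValenblockAllocator {
public:
    // Allocate variable-length space for a new inner node and bump the chunk's
    // reference count (capacity races are ignored in this sketch).
    void* Allocate(size_t bytes, Chunk** owner) {
        Chunk* c = PickActiveChunk(bytes);
        size_t off = c->used.fetch_add(bytes);
        c->refs.fetch_add(1);
        *owner = c;
        return c->data + off;
    }
    // Release an inner node; a background thread reclaims active chunks whose
    // reference count has dropped back to zero.
    static void Release(Chunk* owner) { owner->refs.fetch_sub(1); }

private:
    Chunk* PickActiveChunk(size_t bytes) {
        for (Chunk* c : active_)                         // any active chunk with room
            if (c->used.load() + bytes <= Chunk::kSize) return c;
        Chunk* c;
        if (free_.empty()) {                             // chunk list exhausted: grow it
            c = new Chunk();
        } else {                                         // layer 1: take a fixed-size chunk
            c = free_.front();
            free_.pop_front();
        }
        c->refs.store(0);                                // reference counter starts at zero
        active_.push_back(c);                            // layer 2: make it an active chunk
        return c;
    }
    std::deque<Chunk*>  free_;    // layer 1: pre-allocated fixed-size chunks
    std::vector<Chunk*> active_;  // layer 2: chunks currently serving allocations
};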
Free-list. In the version store, snapshot versions are stored in fixed-size memory
chunks aligned to the cache-line size. This design significantly benefits table scan
operations. The version store also maintains multiple free lists, one per size class,
to track the recently de-allocated memory space of snapshot versions. A snapshot
version is de-allocated when it is no longer referenced by any active transaction. A
background thread periodically retrieves the global minimum timestamp from the
global transaction list, gathers the invalid snapshot versions in an ephemeral queue,
and simultaneously unlinks these versions from the associated transaction contexts.
Finally, the de-allocated memory space is added to the free lists and reused for
future version allocations.
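A compact C++ sketch of the size-class free lists follows; the size-class granularity and the reclamation interface are assumptions.

#include <cstddef>
#include <map>
#include <vector>

class VersionFreeLists {
public:
    // Reuse a recently de-allocated slot from the smallest size class that fits;
    // returns nullptr if that class is empty (a fresh allocation is then needed).
    void* Acquire(size_t bytes) {
        auto it = lists_.lower_bound(bytes);
        if (it == lists_.end() || it->second.empty()) return nullptr;
        void* slot = it->second.back();
        it->second.pop_back();
        return slot;
    }
    // Called by the background thread once a snapshot version falls below the
    // global minimum transaction timestamp and is no longer visible to anyone.
    void Reclaim(void* slot, size_t bytes) { lists_[bytes].push_back(slot); }

private:
    std::map<size_t, std::vector<void*>> lists_;   // size class -> free slots
};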
5.2 Transaction Serializability
To ensure robust and scalable performance on heterogeneous workloads, Index-
Organized adopts snapshot-isolation concurrency control and provides serializability.
A transaction T's accesses generate serial dependencies that constrain T's position
in the global partial order of transactions. There are two forms of serial dependency:
(1) T reads or overwrites a version that T_i created, giving T a dependency on T_i
(T_i → T); (2) T reads a version that T_j overwrites, giving T an anti-dependency
on T_j (T → T_j). Treating these relations as a directed graph G, whose vertices are
committed transactions and whose edges indicate the required serialization ordering,
serializability requires T_i → T → T_j. For an edge T_i --(w:r or r:w)--> T or
T_i --(w:w)--> T, T_i is a forward edge of T; for an edge T --(r:w)--> T_j, T_j is a
backward edge of T. A cycle in G, e.g., T_i → T_j → T_i, indicates a serialization
failure. Therefore, to detect potential cycles in the dependency graph G, two
timestamps are associated with T: a high watermark t_hw and a low watermark t_lw.
The high watermark is computed over all forward edges, and the low watermark over
all backward edges. When executing the serializability check for transaction T, if
t_lw ≤ t_hw, T must be aborted.
This design, which we call Index-SSN, is inspired by SSN [56], which achieves a
similar goal. Algorithm 1 (Appendix A) shows the detailed implementation. Our key
insight is to reduce the overhead of tracking transaction dependencies and detecting
cycles. We forbid write-write dependencies, i.e., T overwriting a version that T_i
created (T_i --(w:w)--> T). We use a state bit (1 bit) in the record meta to control
updates over the tree index. We maintain the full version timestamps for the pending
version, while the index version or snapshot version stores only the create timestamp.
We use the mapping table to track all overwriters and detect anti-dependencies.
In the READ phase, the transaction t records the accessed version's create timestamp
(cstamp) as t_hw, the largest forward edge. It then searches the mapping table to
check whether a concurrent transaction has overwritten the version. If so, it records
that version's sstamp as t_lw, the smallest backward edge. Then, it verifies
serializability and aborts t upon detecting any violation. If the version has not been
overwritten, it is added to the read set and checked for late-arriving overwrites
during commit processing.
In the COMMIT phase, the transaction first obtains a commit timestamp from a
global counter incremented with an atomic fetch-and-add operation. It then traverses
the write set to calculate the high watermark. Because all reads are tracked in the
version's reader list, it only needs to acquire the latest reader. If a reader r is found,
the transaction waits for r to acquire its commit timestamp; if r has committed, it
updates t_hw using r's cstamp. It then traverses the read set to calculate the low
watermark. If another concurrent transaction has overwritten a version in the read
set, it finds that transaction in the tree index or the mapping table and uses its
timestamp to compute t_lw. Lastly, with t_hw and t_lw available, a straightforward
comparison identifies serializability violations: if t_lw > t_hw, the transaction can
proceed with a successful commit; otherwise, it must abort.
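The watermark test at the heart of this commit protocol can be sketched in C++ as follows; the read-set and write-set records are hypothetical simplifications of Algorithm 1 (Appendix A).

#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadEntry  { uint64_t cstamp; bool overwritten; uint64_t overwriter_stamp; };
struct WriteEntry { bool has_committed_reader; uint64_t latest_reader_cstamp; };

// Returns true if the transaction may commit, false if it must abort.
bool CommitCheck(const std::vector<ReadEntry>& read_set,
                 const std::vector<WriteEntry>& write_set) {
    uint64_t t_hw = 0;            // high watermark: largest forward edge
    uint64_t t_lw = UINT64_MAX;   // low watermark: smallest backward edge

    // Forward edges: creators of versions we read, and committed readers of
    // versions we overwrote (tracked in the reader list).
    for (const ReadEntry& r : read_set)
        t_hw = std::max(t_hw, r.cstamp);
    for (const WriteEntry& w : write_set)
        if (w.has_committed_reader)
            t_hw = std::max(t_hw, w.latest_reader_cstamp);

    // Backward edges: concurrent transactions that overwrote a version we read.
    for (const ReadEntry& r : read_set)
        if (r.overwritten)
            t_lw = std::min(t_lw, r.overwriter_stamp);

    // Exclusion-window test: a potential dependency cycle exists unless the low
    // watermark stays strictly above the high watermark.
    return t_lw > t_hw;
}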
5.3 Logging and Cacheline-aligned Writes
Fig. 9 LSN management and log files.
Persistent memory is an emerging storage device that is non-volatile and
byte-addressable, providing data persistence while achieving I/O throughput
comparable to DRAM. In contrast to DRAM, Table 1 shows significantly lower
random read and write bandwidths for Optane DCPMM. To improve throughput, we
perform sequential reads and writes on Optane DCPMM.
Index-Organized ensures recoverability by implementing write-ahead logging [57]. A
log record consists of the log sequence number (LSN), transaction context, log record
type, and write contents. As shown in Fig. 9, we segment the PM log files into smaller
units, typically 16 MB in size, known as blocks. These blocks are used for sequentially
writing log records to thread-local blocks. Each block contains multiple log records
aligned to cachelines, improving PM's write performance by minimizing redundant
flushes to the same cacheline.
In Index-Organized, we leverage physical segment files to generate monotonic LSNs.
Although the LSNs may not be contiguous, they can be effectively mapped to
physical locations on PM. An LSN comprises two parts: the upper bits identify a
physical segment file (e.g., segment 0), and the lower bits indicate an offset within
the segment file (e.g., offset 0). Each segment file is 16 KB by default, containing
2 × 1024 offsets (LSNs). We use DAX mmap to map a segment file residing on the
PM device: pmm_ptr = mmap((caddr_t)0, len, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0). In this manner, we can use the resulting address to manage
the LSN space directly.
During execution, a transaction can request an LSN at any time, provided it possesses
a valid log sequence number. The log manager provides LSNs by incrementing a
globally shared log offset by the allocated size upon request. Subsequently, once the
transaction is complete, the log records are written to the thread-local block, and the
position of the stored log record within the block is recorded at the corresponding
offset in the segment file.
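The C++ sketch below shows how an LSN can be composed from a segment id and an offset and how a segment file on a DAX-mounted PM device can be mapped; the 48/16-bit split follows Fig. 9, while the file handling and error paths are assumptions.

#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr uint64_t kOffsetBits = 16;   // lower 16 bits: offset, upper 48 bits: segment id

inline uint64_t MakeLsn(uint64_t segment_id, uint64_t offset) {
    return (segment_id << kOffsetBits) | offset;
}
inline uint64_t SegmentOf(uint64_t lsn) { return lsn >> kOffsetBits; }
inline uint64_t OffsetOf(uint64_t lsn)  { return lsn & ((1ull << kOffsetBits) - 1); }

// Map a 16 KB segment file residing on the PM device so that LSN slots can be
// addressed directly through the returned pointer.
void* MapSegment(const char* path, size_t len = 16 * 1024) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(len)) != 0) { close(fd); return nullptr; }
    void* pmm_ptr = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          // the mapping outlives the file descriptor
    return pmm_ptr == MAP_FAILED ? nullptr : pmm_ptr;
}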
6 Evaluation
In this section, we perform experiments on Index-Organized with various workloads.
To verify the efficiency of our solution, we compare it with the row-store, column-store,
and row+column storage models.
Implementation. We implement Index-Organized² based on PelotonDB [14] with
approximately 13,000 lines of C++ code. PelotonDB is an MMDB optimized for
high-performance transaction processing [7], providing row-store, column-store, and
row+column storage models. Even though the Peloton project is no longer
maintained, research based on PelotonDB continues [7, 23, 58–60].
In the row-oriented storage scheme, we store all attributes of a single version
consecutively. In the column-oriented storage scheme, the values of each attribute
are stored sequentially in separate columns. In the hybrid row+column storage
scheme, columns are partitioned into multiple groups. We use the Index-SSN
concurrency control algorithm for all compared storage models.
Evaluation Platform. We run our experiments on a one-socket server with an
Intel(R) Xeon(R) Gold 5218R processor (2.10 GHz, 40 logical cores, 20 physical cores,
Cascade Lake) and Optane DC PMMs. It is equipped with 96 GB DRAM (6 channels
× 16 GB/DIMM) and 756 GB PM (6 channels × 126 GB/PMM). We run Ubuntu
with Linux kernel version 5.4 compiled from source. Each working thread is pinned
to a specific core to reduce context-switching overhead and inter-socket communication
expenses. For each run, data is loaded from scratch, and the benchmark is executed
for 20 seconds.
Workloads. (1) YCSB: We evaluate the different storage models on lookup, update,
and scan operations from the Yahoo! Cloud Serving Benchmark (YCSB) [53]. The
workload contains a single table with records featuring an 8-byte primary key and
ten columns of random string data, each 100 bytes long. We construct the following
workload mixtures by varying the read ratio. Scan-only: 100% range scan;
Read-only: 100% lookup; Read-intensive: 90% lookup, 10% update; Write-
intensive: 10% lookup, 90% update; Balanced: 50% lookup, 50% update. The
default access distribution is uniform.
(2) TPC-C: This is the current standard benchmark for measuring the performance
of transaction processing systems [61]. It is write-intensive by default, with less than
20% read operations, and it is a dynamically adjustable workload. In this experiment,
we use 100% New-Order transactions, with the number of warehouses set to 20.
(3) TPC-C-hybrid: To simulate heterogeneous workloads, we also use TPC-C to
construct three transaction mixes: TPC-C RA, a 90% New-Order and 10% Stock-Level
mix; TPC-C RB, a 50% New-Order and 50% Stock-Level mix; and TPC-C RC, which
mixes New-Order with Query 2 (CH-Q2) of the TPC-CH benchmark [62]. CH-Q2
lists suppliers in a certain region and their items having the lowest stock level;
compared with Stock-Level, it is more read-heavy. In this mix, we set the ratio of
CH-Q2 to 50% to measure overall throughput and latency. In addition, because the
majority of accesses in Stock-Level and CH-Q2 go to the item and stock tables, those
transactions frequently conflict with New-Order and with each other.
² https://github.com/gitzhqian/Stage-IndexOrganized.git
6.1 Overall Performance
Fig. 10 Comparison of different storage models with important workloads: lookup, update, and scan.

Fig. 11 Overall throughput and abort rate when running the TPC-C New-Order transaction.
In Fig. 10, we examine how the four storage models perform on the three most
important workloads in practice: point lookup, update, and scan. Fig. 10 shows the
throughput of the systems on these workloads with 20 threads. We see that
Index-Organized outperforms the other three storage models in all workloads.
Row-store, column-store, and row+column all implement a heap-based organization
that absorbs more writes, resulting in poor read performance. For range scans,
Index-Organized achieves 3× higher throughput than the row-store, column-store,
and row+column layouts because it only needs to traverse the tree index.
In Fig. 11, we measure the overall throughput and abort rate in TPC-C with one
warehouse and 20 warehouses, respectively, using 20 threads. Index-Organized
achieves up to 2× higher throughput than the other storage models. New-Order
transactions are dominated by primary-key accesses, i.e., lookup and update
operations. Index-Organized clusters the index versions in primary-key order, thereby
accelerating lookup and update operations.
This section provides a high-level view of the system's performance on different
workloads. It shows that Index-Organized performs well on all representative
workloads because it minimizes the number of random memory accesses.
6.2 Read Write Ratio
Fig. 12 Impact of read and write workloads (YCSB).
Fig. 12 shows the impact of read and write workloads on Index-Organized and the
other storage models. The number of concurrent threads is 20. As expected, for all
storage models the throughput is lower with more write operations. This is because
we adopt the "first-updater-wins" rule: the majority of write conflicts are identified
while the leading version remains uncommitted, and the conflicting updater is
promptly aborted to prevent dirty writes, thereby reducing the amount of work lost
to aborts. Overall, Index-Organized performs significantly better than the conventional
storage models, i.e., row-store, column-store, and row+column. This is because
Index-Organized's tree index (with its cacheline-aligned node layout) supports
index-only version searches and minimizes the number of random memory accesses,
leading to the best performance across all read/write ratios.
6.3 Scalability
This section examines the scalability of Index-Organized and conventional storage
models (i.e., row-store, column-store, and row+column) with YCSB and TPC-C-
hybrid workloads.
Fig. 13 uses the YCSB workloads and varies the number of worker threads. We
consider both uniform and Zipfian distributions and configure the Zipfian skew
parameter to 0.9. All storage models scale well before reaching the maximum number
of physical cores. Among them, Index-Organized performs best, with the lowest read
starvation and the most efficient caching. Its efficient pending-version handling also
allows it to scale well without being bottlenecked by concurrency control overhead.
Under write-intensive workloads, conventional storage models suffer from centralized
memory heap management, achieving 1.5× lower throughput than Index-Organized.
Fig. 14 shows the overall throughput and abort rate of the TPC-C-hybrid workloads
when varying the number of worker threads. At low thread counts, Index-Organized
performs slightly better than conventional storage models, as it avoids the overhead
of traversing version chains; specifically, for each primary-key access, Index-Organized
may incur more than one fewer cache miss than conventional storage models.
Fig. 13 The scalability of Index-Organized vs. conventional storage models (YCSB workloads, uniform and Zipfian access distributions).

Fig. 14 The scalability of Index-Organized vs. conventional storage models (TPC-C-hybrid workloads, throughput and abort rate).
At high thread counts, Index-Organized achieves up to 1.6× higher throughput than
conventional storage models and a 40% lower abort rate, indicating that
Index-Organized handles contention more efficiently.
Index-Organized achieves higher performance because its separate memory-management
structures effectively protect read-intensive transactions from updates. Further
performance gains come from the index versions and the cache-prefetching policy,
which reduce memory latency and benefit transaction verification and commit
processing. In contrast, conventional storage models employing a heap-based storage
scheme increase contention and make the version chains grow quickly; this increases
memory access latency, which degrades overall throughput.
6.4 Latency Analysis
In this section, we perform a latency analysis. Fig. 15 shows the latency of CH-Q2
under the TPC-C-hybrid workload over a varying number of warehouses with 20
worker threads. We calculated the average transaction latency within a run and
present the median value of three runs in the figure. At low warehouse counts,
Index-Organized has latency similar to the conventional storage models. This is
because, when the database is small, the hot versions are more likely to be cached in
the CPU cache. The gap between Index-Organized and the conventional storage
models widens as the number of warehouses increases. Thanks to its index
organization, Index-Organized has the lowest latency at 36 warehouses, almost 3×
lower than the conventional storage models. Clearly, as the number of active versions
increases, index-only scans gain importance because they reduce unnecessary
main-memory accesses. In contrast, for conventional storage models, each version
access may incur more than four random main-memory accesses (tree searches,
indirection searches, mapping table searches, and version chain searches).
Fig. 16 shows the flamegraph [63] of Index-Organized for the TPC-C RC workload.
Tree index searches take roughly 70% of the time, involving multiple inner-node
searches and one leaf-node search. Searching the leaf node first loads the record meta
for the keys, then binary-searches the keys to find the requested key, and finally
retrieves the index version. Because the index versions are nearly sorted, each index
version access first locates the offset and then uses the physical address to load the
index version. Mapping table searches account for 18.5% of the time: a read
transaction always accesses the mapping table to find the pending version, requiring
only one memory access. Version chain searches account for only 10.5% of the time.
This is because an update transaction (New-Order) does not need to access version
chains, while a long-running transaction (CH-Q2) benefits from cache prefetching
and spends less time accessing main memory.
Fig. 15 Latency of CH-Q2 with the TPC-C-hybrid workload (TPC-C RC), varying the number of warehouses.
Fig. 16 Time spent in different components of Index-Organized, recreated from the flamegraph for clarity: tree index 71% (inner nodes 20%, leaf nodes 51%, of which record meta 15.4%, index version search 35.6%, and key comparison 12.5%), mapping table 18.5%, version chains 10.5%.
6.5 Version Chain Length
In this experiment, we examine the impact of version chain length. We run a
long-running query (CH-Q2) and execute a varying number of short update
transactions (New-Order). As the number of New-Order threads increases, the stock
table is frequently updated, quickly producing more and more versions. As discussed
in Section 4.6, versions accumulate quickly over time, slowing down readers. Fig. 17
shows that the scan time of the CH-Q2 query thread is highly affected by concurrent
New-Order transaction threads. As the version chain length increases, version scans
slow down the conventional storage models by an order of magnitude. Index-
Organized's efficient caching mechanisms, i.e., the cacheline-aligned node layout and
cache prefetching, reduce the pointer-chasing overhead and hide the latency. This
makes its read performance superior to that of the other storage models, whose
version-search access pattern suffers heavily from pointer chasing.
Fig. 17 Impact of version chain length - increasing the number of New-Order transaction threads (1 to 36) with one CH-Q2 query thread; the resulting version chain length grows from 1 to 30.
6.6 Impact of PM
To evaluate our design and implementation on PM (Intel Optane DCPMMs), we run the
YCSB insert-only microbenchmark with 20 worker threads. We configure three logging
modes: no logging (transactions do not write log records), logging-DRAM (transactions
write log records to DRAM), and logging-PM (transactions write log records to PM).
As shown in Fig. 18, the overall throughput peaks at a duration of 10 s. At this
point the server's memory utilization reaches 50% because new versions are inserted
continually, and memory contention becomes the limiting factor for overall performance.
In logging-PM mode, Index-Organized's throughput keeps growing slowly until the
duration reaches 20 s. We attribute this to the low write latency and PM's large
capacity. During execution, we also use the Intel Processor Counter Monitor (PCM)
library [64] to collect hardware counter metrics. The memory access statistics in
Table 2 confirm this behavior by showing DRAM writes and PM writes.
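As an illustration of the logging-PM mode, the following minimal sketch appends a log record to a PM-resident buffer using cacheline-aligned writes, then flushes the touched cachelines with clwb and orders them with a store fence. It is a sketch only: the buffer layout and the function name pm_log_append are hypothetical, it assumes the PM region is already mapped (e.g., via a DAX file) and that the CPU supports the clwb instruction (compile with -mclwb), and the real system's log format is more involved.

#include <immintrin.h>   // _mm_clwb, _mm_sfence
#include <cstdint>
#include <cstring>

constexpr size_t kCacheline = 64;

// Append `len` bytes at the current tail of a PM log buffer, padding the
// record to a cacheline boundary so every write is cacheline-aligned.
// Returns the new tail offset.
size_t pm_log_append(char* pm_base, size_t tail, const void* rec, size_t len) {
  size_t padded = (len + kCacheline - 1) & ~(kCacheline - 1);
  char* dst = pm_base + tail;

  std::memcpy(dst, rec, len);                 // copy the log record
  std::memset(dst + len, 0, padded - len);    // zero the padding bytes

  // Flush every cacheline of the record out of the CPU caches to PM.
  for (size_t off = 0; off < padded; off += kCacheline) {
    _mm_clwb(dst + off);
  }
  _mm_sfence();   // order the flushes before the record counts as durable

  return tail + padded;
}

In logging-DRAM mode, the same append path can simply omit the flush and fence, since the log records are not expected to survive a power failure.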
[Fig. 18 plots throughput (Mops) against run duration (5, 10, 15, 20, 25 s) for the no logging, logging-DRAM, and logging-PM modes.]
Fig. 18 YCSB insert-only performance on Intel Optane DCPMMs.
Table 2 Microbenchmark results for YCSB insert-only with 20 worker threads

Duration (s)   L2 hit ratio   L3 hit ratio   DRAM writes (bytes)   PM writes (bytes)
5              0.74           0.90           27975517312           109719168
10             0.76           0.89           65096697088           498359552
15             0.76           0.90           95974431744           1901509120
20             0.75           0.90           127873320192          3732100224
25             0.75           0.89           160037408896          5578359936
7 Conclusion
To the best of our knowledge, this paper is the first to comprehensively discuss the
impact of pointer chasing on storage models for modern MMDBs. We observe that once
state-of-the-art techniques have been applied, pointer chasing becomes the new
bottleneck for data access in modern MMDB systems. In this paper, we propose
Index-Organized, a novel storage model that supports index-only reads and efficient
updates. We combine a cacheline-aligned node layout with cache prefetching to minimize
the number of random memory accesses. To improve update performance, we design and
implement fine-grained memory management, pending versions, the Index-SSN algorithm,
and cacheline-aligned writes. Our evaluation shows that Index-Organized outperforms
conventional storage models, achieving a 3× speedup for reads and a 2× speedup for
transaction processing. In future work, we aim to leverage machine learning to identify
hot and cold data and further improve logging and recovery.
Acknowledgements. This work was supported in part by the National Natural
Science Foundation of China under Grant 61572194 and Grant 61672233.
Author contributions. The authors confirm their contribution to the paper as
follows: study conception and design: Qian Zhang and Xueqing Gong; software engi-
neering, code development, and interpretation of results: Qian Zhang, Hassan Ali
Khan, Jianhao Wei, and Yiyang Ren; draft manuscript preparation: Qian Zhang, Xue-
qing Gong and Hassan Ali Khan; all authors reviewed the results and approved the
final version of the manuscript.
Declarations
Conflict of interest: The authors declare no conflict of interest.
Data availability: All data from this study are in the manuscript, and the source code
is accessible at https://github.com/gitzhqian/Stage-IndexOrganized.git.
Appendix A Index-SSN Algorithm
Algorithm 1 presents the read and commit procedures of the Index-SSN algorithm.
References
[1] Zhou, X., Arulraj, J., Pavlo, A., Cohen, D.: Spitfire: A three-tier buffer manager
for volatile and non-volatile memory. In: Proceedings of the 2021 International
Conference on Management of Data, pp. 2195–2207 (2021)
[2] Ziegler, T., Binnig, C., Leis, V.: Scalestore: A fast and cost-efficient storage
engine using dram, nvme, and rdma. In: Proceedings of the 2022 International
Conference on Management of Data, pp. 685–699 (2022)
[3] Ruan, C., Zhang, Y., Bi, C., Ma, X., Chen, H., Li, F., Yang, X., Li, C., Aboul-
naga, A., Xu, Y.: Persistent memory disaggregation for cloud-native relational
databases. In: Proceedings of the 28th ACM International Conference on Archi-
tectural Support for Programming Languages and Operating Systems, Volume 3,
pp. 498–512 (2023)
[4] Hao, X., Zhou, X., Yu, X., Stonebraker, M.: Towards buffer management with
tiered main memory. Proceedings of the ACM on Management of Data 2(1), 1–26
(2024)
[5] Liu, G., Chen, L., Chen, S.: Zen+: a robust numa-aware oltp engine optimized
for non-volatile main memory. The VLDB Journal 32(1), 123–148 (2023)
[6] Kissinger, T., Schlegel, B., Habich, D., Lehner, W.: Kiss-tree: smart latch-free
in-memory indexing on modern architectures. In: Proceedings of the Eighth
International Workshop on Data Management on New Hardware, pp. 16–23
(2012)
[7] Magalhaes, A., Monteiro, J.M., Brayner, A.: Main memory database recovery: A
survey. ACM Computing Surveys (CSUR) 54(2), 1–36 (2021)
[8] Raza, A., Chrysogelos, P., Anadiotis, A.C., Ailamaki, A.: One-shot garbage collec-
tion for in-memory oltp through temporality-aware version storage. Proceedings
of the ACM on Management of Data 1(1), 1–25 (2023)
[9] Yu, G.X., Markakis, M., Kipf, A., Larson, P.-Å., Minhas, U.F., Kraska, T.: Treeline:
an update-in-place key-value store for modern storage. Proceedings of the VLDB
Endowment 16(1) (2022)
[10] Hao, X., Chandramouli, B.: Bf-tree: A modern read-write-optimized concurrent
larger-than-memory range index. Proceedings of the VLDB Endowment 17(11),
3442–3455 (2024)
[11] Xu, H., Li, A., Wheatman, B., Marneni, M., Pandey, P.: Bp-tree: Overcoming
the point-range operation tradeoff for in-memory b-trees. In: Proceedings of the
2024 ACM Workshop on Highlights of Parallel Computing, pp. 29–30 (2024)
[12] Li, Z., Chen, J.: Eukv: Enabling efficient updates for hybrid pm-dram key-value
store. IEEE Access 11, 30459–30472 (2023)
[13] Kim, K., Wang, T., Johnson, R., Pandis, I.: Ermia: Fast memory-optimized
database system for heterogeneous workloads. In: Proceedings of the 2016
International Conference on Management of Data, pp. 1675–1687 (2016)
[14] Pavlo, A., Angulo, G., Arulraj, J., Lin, H., Lin, J., Ma, L., Menon, P., Mowry,
T.C., Perron, M., Quah, I., et al.: Self-driving database management systems. In:
CIDR, vol. 4, p. 1 (2017)
[15] Kemper, A., Neumann, T.: Hyper: A hybrid oltp&olap main memory database
system based on virtual memory snapshots. In: 2011 IEEE 27th International
Conference on Data Engineering, pp. 195–206 (2011). IEEE
[16] Makreshanski, D., Giceva, J., Barthels, C., Alonso, G.: Batchdb: Efficient iso-
lated execution of hybrid oltp+ olap workloads for interactive applications. In:
Proceedings of the 2017 ACM International Conference on Management of Data,
pp. 37–50 (2017)
[17] NuoDB. 2020. https://nuodb.com/
[18] Alagiannis, I., Idreos, S., Ailamaki, A.: H2o: a hands-free adaptive store. In:
Proceedings of the 2014 ACM SIGMOD International Conference on Management
of Data, pp. 1103–1114 (2014)
[19] Levandoski, J.J., Lomet, D.B., Sengupta, S.: The bw-tree: A b-tree for new hard-
ware platforms. In: 2013 IEEE 29th International Conference on Data Engineering
(ICDE), pp. 302–313 (2013). IEEE
[20] Wang, Z., Pavlo, A., Lim, H., Leis, V., Zhang, H., Kaminsky, M., Andersen, D.G.:
Building a bw-tree takes more than just buzz words. In: Proceedings of the 2018
International Conference on Management of Data, pp. 473–488 (2018)
[21] Neumann, T., Mühlbauer, T., Kemper, A.: Fast serializable multi-version con-
currency control for main-memory database systems. In: Proceedings of the 2015
ACM SIGMOD International Conference on Management of Data, pp. 677–689
(2015)
[22] Wu, Y., Arulraj, J., Lin, J., Xian, R., Pavlo, A.: An empirical evaluation of in-
memory multi-version concurrency control. Proceedings of the VLDB Endowment
10(7), 781–792 (2017)
[23] Böttcher, J., Leis, V., Neumann, T., Kemper, A.: Scalable garbage collection for
in-memory mvcc systems. Proceedings of the VLDB Endowment 13(2), 128–141
(2019)
[24] Herlihy, M., Shavit, N., Luchangco, V., Spear, M.: The Art of Multiprocessor
Programming. Newnes (2020)
[25] Kim, K., Wang, T., Johnson, R., Pandis, I.: Ermia: Fast memory-optimized
database system for heterogeneous workloads. In: Proceedings of the 2016
International Conference on Management of Data, pp. 1675–1687 (2016)
[26] Chevan, A., Sutherland, M.: Hierarchical partitioning. The American Statistician
45(2), 90–96 (1991)
[27] Agrawal, S., Narasayya, V., Yang, B.: Integrating vertical and horizontal par-
titioning into automated physical database design. In: Proceedings of the 2004
ACM SIGMOD International Conference on Management of Data, pp. 359–370
(2004)
[28] Bian, H., Yan, Y., Tao, W., Chen, L.J., Chen, Y., Du, X., Moscibroda, T.: Wide
table layout optimization based on column ordering and duplication. In: Pro-
ceedings of the 2017 ACM International Conference on Management of Data, pp.
299–314 (2017)
[29] Hsieh, K., Khan, S., Vijaykumar, N., Chang, K.K., Boroumand, A., Ghose, S.,
Mutlu, O.: Accelerating pointer chasing in 3d-stacked memory: Challenges, mech-
anisms, evaluation. In: 2016 IEEE 34th International Conference on Computer
Design (ICCD), pp. 25–32 (2016). IEEE
[30] Huang, K., Wang, T., Zhou, Q., Meng, Q.: The art of latency hiding in modern
database engines. Proceedings of the VLDB Endowment 17(3), 577–590 (2023)
[31] Ebrahimi, E., Mutlu, O., Patt, Y.N.: Techniques for bandwidth-efficient prefetch-
ing of linked data structures in hybrid prefetching systems. In: 2009 IEEE 15th
International Symposium on High Performance Computer Architecture, pp. 7–17
(2009). IEEE
[32] Weisz, G., Melber, J., Wang, Y., Fleming, K., Nurvitadhi, E., Hoe, J.C.: A
study of pointer-chasing performance on shared-memory processor-fpga sys-
tems. In: Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pp. 264–273 (2016)
[33] Ainsworth, S.: Prefetching for complex memory access patterns. Technical report,
University of Cambridge, Computer Laboratory (2018)
[34] Ramakrishnan, R., Gehrke, J.: Database Management Systems. McGraw-Hill,
Inc. (2002)
[35] Codd, E.F.: A relational model of data for large shared data banks. Communica-
tions of the ACM 13(6), 377–387 (1970)
[36] Copeland, G.P., Khoshafian, S.N.: A decomposition storage model. ACM sigmod
record 14(4), 268–279 (1985)
[37] Arulraj, J., Pavlo, A., Menon, P.: Bridging the archipelago between row-stores
and column-stores for hybrid workloads. In: Proceedings of the 2016 International
Conference on Management of Data, pp. 583–598 (2016)
[38] Baykan, E.: Recent research on database system performance (2005)
[39] Hankins, R.A., Patel, J.M.: Data morphing: An adaptive, cache-conscious storage
technique. In: Proceedings 2003 VLDB Conference, pp. 417–428 (2003). Elsevier
[40] Mao, Y., Kohler, E., Morris, R.T.: Cache craftiness for fast multicore key-value
storage. In: Proceedings of the 7th ACM European Conference on Computer
Systems, pp. 183–196 (2012)
[41] Arulraj, J., Levandoski, J., Minhas, U.F., Larson, P.-A.: Bztree: A high-
performance latch-free range index for non-volatile memory. Proceedings of the
VLDB Endowment 11(5), 553–565 (2018)
[42] Kester, M.S., Athanassoulis, M., Idreos, S.: Access path selection in main-memory
optimized data systems: Should i scan or should i probe? In: Proceedings of the
2017 ACM International Conference on Management of Data, pp. 715–730 (2017)
[43] Boroumand, A., Ghose, S., Oliveira, G.F., Mutlu, O.: Polynesia: Enabling effective
hybrid transactional/analytical databases with specialized hardware/software co-
design. arXiv preprint arXiv:2103.00798 (2021)
[44] Kim, J., Kim, K., Cho, H., Yu, J., Kang, S., Jung, H.: Rethink the scan in mvcc
databases. In: Proceedings of the 2021 International Conference on Management
of Data, pp. 938–950 (2021)
[45] Valois, J.D.: Lock-free linked lists using compare-and-swap. In: Proceedings of the
Fourteenth Annual ACM Symposium on Principles of Distributed Computing,
pp. 214–222 (1995)
[46] Wang, T., Levandoski, J., Larson, P.-A.: Easy lock-free indexing in non-volatile
memory. In: 2018 IEEE 34th International Conference on Data Engineering
(ICDE), pp. 461–472 (2018). IEEE
[47] Kelly, R., Pearlmutter, B.A., Maguire, P.: Lock-free hopscotch hashing. In:
Symposium on Algorithmic Principles of Computer Systems, pp. 45–59 (2020).
SIAM
[48] Peloton Database Management System. 2019. http://pelotondb.org
[49] Boissier, M., Daniel, K.: Workload-driven horizontal partitioning and pruning
for large htap systems. In: 2018 IEEE 34th International Conference on Data
Engineering Workshops (ICDEW), pp. 116–121 (2018). IEEE
[50] Al-Kateb, M., Sinclair, P., Au, G., Ballinger, C.: Hybrid row-column partitioning
in teradata®. Proceedings of the VLDB Endowment 9(13), 1353–1364 (2016)
[51] Polychroniou, O., Ross, K.A.: Towards practical vectorized analytical query
engines. In: Proceedings of the 15th International Workshop on Data Management
on New Hardware, pp. 1–7 (2019)
[52] Graefe, G.: Volcano - an extensible and parallel query evaluation system.
IEEE Transactions on Knowledge and Data Engineering 6(1), 120–135 (1994)
[53] Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmark-
ing cloud serving systems with ycsb. In: Proceedings of the 1st ACM Symposium
on Cloud Computing, pp. 143–154 (2010)
[54] Benson, L., Makait, H., Rabl, T.: Viper: An efficient hybrid pmem-dram key-value
store (2021)
[55] Leis, V., Scheibner, F., Kemper, A., Neumann, T.: The art of practical syn-
chronization. In: Proceedings of the 12th International Workshop on Data
Management on New Hardware, pp. 1–8 (2016)
[56] Wang, T., Johnson, R., Fekete, A., Pandis, I.: Efficiently making (almost) any
concurrency control mechanism serializable. The VLDB Journal 26, 537–562
(2017)
[57] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P.: Aries: A trans-
action recovery method supporting fine-granularity locking and partial rollbacks
using write-ahead logging. ACM Transactions on Database Systems (TODS)
17(1), 94–162 (1992)
[58] Wu, Y., Guo, W., Chan, C.-Y., Tan, K.-L.: Fast failure recovery for main-memory
dbmss on multicores. In: Proceedings of the 2017 ACM International Conference
on Management of Data, pp. 267–281 (2017)
[59] Lee, L., Xie, S., Ma, Y., Chen, S.: Index checkpoints for instant recovery in
in-memory database systems. Proceedings of the VLDB Endowment 15(8), 1671–
1683 (2022)
[60] Wang, H., Wei, Y., Yan, H.: Automatic single table storage structure selection for
hybrid workload. Knowledge and Information Systems 65(11), 4713–4739 (2023)
[61] TPC: TPC Benchmark C (OLTP) Standard Specification, Revision 5.11, 2010.
[Online]. Available: http://www.tpc.org/tpcc
[62] Funke, F., Kemper, A., Neumann, T.: Benchmarking hybrid oltp&olap database
systems (2011)
[63] Gregg, B.: The flame graph. Communications of the ACM 59(6), 48–57 (2016)
[64] Intel Corporation et al.: Processor Counter Monitor. 2019. https://github.com/opcm/pcm
Algorithm 1 Index-SSN Algorithm (Read and Commit)
Require: t_h_w = 0, t_l_w = ∞

function serializable_read(t, v, cstamp)
    // 1. Update T's high watermark (w:r edge Ti → T from v's creator)
    t_h_w = max(t_h_w, cstamp)
    mapp_version = mapp_table(v.rcd_meta.ptr)
    if mapp_version != null then
        if mapp_version.endstamp is not infinity then
            // 2. Update T's low watermark (r:w edge T → Tj to v's overwriter)
            t_l_w = min(t_l_w, mapp_version.sstamp)
        end if
    end if
    if t_l_w <= t_h_w then abort
    end if
    add v to t.read_set
    return true
end function

function serializable_commit(t)
    cmm_stamp = counter.fetch_add(1)
    // 1. Traverse t's write set and update T's high watermark
    //    (forward edges Ti r:w → T from readers of the overwritten versions)
    for v in t.write_set do
        mapp_version = mapp_table(v.rcd_meta.ptr)
        for reader r in mapp_version.read_list do
            if (r_txn = glb_act_txn.find(r)) then
                if r_txn.cmm_stamp < cmm_stamp then
                    wait until r_txn is finished
                    t_h_w = max(t_h_w, r_txn.cmm_stamp)
                end if
            end if
        end for
    end for
    // 2. Traverse t's read set and update T's low watermark
    //    (backward edges T r:w → Tj to overwriters of the versions T read)
    for v in t.read_set do
        r = v
        if r.cstamp != v.rcd_meta.cstamp then        // v has already been overwritten
            u = v.rcd_meta.cstamp
            u_txn = glb_act_txn.find(u)
            t_l_w = min(t_l_w, u_txn.t_l_w)
        else                                         // check for a pending overwriter
            mapp_version_loc = v.rcd_meta.ptr
            u = mapp_table(mapp_version_loc)
            if (u_txn = glb_act_txn.find(u.cstamp)) then
                wait until u_txn is finished
                t_l_w = min(t_l_w, u_txn.t_l_w)
            end if
        end if
    end for
    if t_l_w <= t_h_w then abort
    end if
end function
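For illustration, the following minimal C++ sketch shows the read-side watermark bookkeeping and exclusion-window check that Algorithm 1 performs. It is a simplified rendering under assumed types (the Transaction and MappedVersion structs and their fields are hypothetical), not the actual implementation.

#include <algorithm>
#include <cstdint>
#include <limits>

constexpr uint64_t kInfinity = std::numeric_limits<uint64_t>::max();

struct MappedVersion {
  uint64_t end_stamp = kInfinity;  // kInfinity means not yet overwritten
  uint64_t sstamp    = kInfinity;  // successor stamp set by the overwriter
};

struct Transaction {
  uint64_t high_watermark = 0;         // t_h_w: max over w:r predecessors
  uint64_t low_watermark  = kInfinity; // t_l_w: min over r:w successors
};

// Returns false if the read closes the exclusion window and t must abort.
bool serializable_read(Transaction& t, uint64_t creator_cstamp,
                       const MappedVersion* mapped) {
  // w:r edge from the version's creator into t.
  t.high_watermark = std::max(t.high_watermark, creator_cstamp);

  // r:w edge from t to the overwriter, if the version was already overwritten.
  if (mapped != nullptr && mapped->end_stamp != kInfinity) {
    t.low_watermark = std::min(t.low_watermark, mapped->sstamp);
  }

  // Exclusion window check: abort if the successor stamp no longer
  // exceeds the predecessor stamp.
  return t.low_watermark > t.high_watermark;
}

The commit path in Algorithm 1 repeats the same min/max updates over the read and write sets before re-checking the window.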