Architecture-Aware, High Performance Transaction
for Persistent Memory
Kai Wu
kwu42@ucmerced.edu
Jie Ren
jren6@ucmerced.edu
Dong Li
dli35@ucmerced.edu
University of California, Merced
Abstract
Byte-addressable non-volatile main memory (NVM) demands
transactional mechanisms to access and manipulate data on
NVM atomically. Those transaction mechanisms often em-
ploy a logging mechanism (undo logging or redo logging).
However, the logging mechanisms bring large runtime over-
head (8%-49% in our evaluation), and 41%-78% of the overhead is attributed to frequent cache-line flushing. Such large
overhead significantly diminishes the performance benefits
offered by NVM. In this paper, we introduce a new method
to remove the overhead of cache-line flushing for logging-
based transactions. Different from the traditional method that
works at the program level and leverages program semantics
to reduce the logging overhead, we introduce architecture
awareness. In particular, we do not flush certain cache blocks,
as long as they are estimated to be evicted from the cache by the caching mechanism (e.g., the cache replacement algorithm). Furthermore, we coalesce those cache blocks with low dirtiness to improve the efficiency of cache-line flushing. We implement an architecture-aware, high performance transaction runtime system for persistent memory, Archapt. Our results show that compared with the traditional undo logging, Archapt reduces cache flushing by 66% and improves system throughput by 22% on average (42% at most),
when running TPC-C and YCSB (A-F) with Redis, and OLTP-
bench (TPC-C, LinkBench and YCSB) with SQLite.
1 Introduction
Non-volatile memory (NVM), addressed at a byte granularity directly by the CPU and accessed at roughly the latency of main memory, is coming. While NVM as main memory provides an appealing interface that uses simple load/store instructions, it brings new challenges to the design of persistent data structures, storage systems, and databases. In particular, a store does not immediately make data persistent, because the memory hierarchy (e.g., caches and store buffers) and processor state can remain non-persistent. There is a need to ensure that data is modified atomically when moving from one consistent state to another, in order to provide consistency after a crash (e.g., power loss or hardware failure).
The NVM challenges have resulted in investigations of
transactional mechanisms to access and manipulate data on
persistent memory (NVM) atomically [8,27,33-35,42,45,46].
Those transactional mechanisms often employ a logging tech-
nique (undo logging or redo logging). However, those trans-
actional mechanisms have a high overhead. Our performance
evaluation reveals that when running TPC-C [2,28] and YCSB (A-F) [3,14] against Redis [7], and OLTP-bench [1] (TPC-C, LinkBench [6] and YCSB) against SQLite [4], with an implementation of undo logging from Intel PMDK [24] or the redo logging from [45] to enable transactions, the overhead is 8%-49%. Such large overhead significantly di-
minishes the performance benefit NVM promises to provide
in many workloads.
Most overhead of logging mechanisms comes from data
copy for creating logs and cache-line flushing by special in-
structions. Cache-line flushing takes a large portion of the
total overhead. Take our evaluation with the above workloads as an example again: on average, cache-line flushing takes 65% and 51% of the total overhead for the undo logging and redo logging mechanisms respectively. Removing the overhead of cache-line flushing is the key to enabling high performance transactions for persistent memory.
The traditional methods reduce the overhead of cache-line flushing using asynchronous cache-line flushing (e.g., blurring the persistency boundary [32] and relaxing persistency ordering [32,45]). Those methods move the overhead of cache-line flushing off the critical path by overlapping cache-line flushing with the transaction. However, the effectiveness of asynchronous cache-line flushing depends on the characteristics of the transaction (e.g., how frequently data updates happen); cache-line flushing can still be exposed on the critical path, increasing the latency of the transaction.
In this paper, we introduce a new method to remove the
overhead of cache-line flushing. The traditional methods work at the program level and leverage program semantics: as long as the transaction semantics remain correct, we can change the order of persisting data and trigger asynchronous cache-line flushing. Different from the traditional methods, our method introduces architecture awareness. In particular, we do not flush certain cache lines, as long as those cache lines are evicted from the cache by the caching mechanism (e.g., the cache replacement algorithm). In other words, we rely on the existing hardware mechanism to automatically and implicitly flush cache lines. The traditional methods do not have architecture awareness: ignoring the possible effects of the caching mechanism, they flush cache lines by explicitly issuing cache flush instructions, even though those cache lines will soon be, or have already been, evicted from the cache by the hardware.
Furthermore, we examine the cache line dirtiness to quan-
tify the efficiency of cache-line flushing. The dirtiness of a
cache line is defined as the ratio of dirty bytes to the total number
of bytes in a cache line. Since a cache line is the finest gran-
ularity to enforce data persistency, the whole cache line has
to be flushed, even though only a few bytes in the cache line
are dirty. Take our evaluation with the above workloads as an example again: the average dirtiness of flushed cache lines in Redis and SQLite is 49% for both the undo and redo logging mechanisms. Flushing clean data in a cache line
wastes memory bandwidth and decreases the efficiency of
cache-line flushing.
To leverage architecture awareness to enable high performance transactions, we must address a couple of challenges. First, we must have a software mechanism to reason about and decide the residence of cache blocks¹ in the cache, without hardware modification. The mechanism must be lightweight and allow us to make a quick decision on whether a cache-line flush is necessary.

¹ We distinguish cache line and cache block in this paper: a cache line is a location in the cache, and a cache block refers to the data that goes into a cache line.
Second, we must provide a strong guarantee on data consis-
tency to implement transactions. Skipping cache-line flushing
for some persistent objects raises the risk of losing data con-
sistency for committed transactions. The software mechanism
to reason about the residence of a cache block in the cache is an
approximation to the hardware-based caching mechanism. If
the software mechanism skips a cache-line flushing, but the
corresponding dirty cache block is still in the cache, then there
is a chance that the cache block is inconsistent when a crash
happens. We must have a mechanism to detect and correct
such inconsistency in persistent memory.
To address the above two challenges, we introduce Archapt (Architecture-aware, performant and persistent transaction),
an architecture-aware, high performance transaction runtime
system. Archapt provides a new way to perform transactional
updates on persistent memory with efficient cache-line flush-
ing. To address the first challenge, Archapt uses an LRU
queue to reason about the residence of the cache blocks of a persistent
object in the cache and decide whether cache flushing for a
persistent object in a transaction is necessary.
To address the second challenge, Archapt introduces a
lightweight checksum mechanism. Checksums are built using
multiple cache blocks from one or more persistent objects
to establish implicit invariant relationships between cache
blocks. Leveraging the invariant, Archapt can detect data inconsistency and make best efforts to correct it after a crash happens. The checksum mechanism provides a strong guarantee on data consistency, while causing small
runtime overhead (less than 5% loss in throughput in our
evaluation).
Furthermore, to improve the efficiency of cache-line flush-
ing, we examine the implementation of common database sys-
tems (Redis and SQLite), and find two problems accounting
for the low dirtiness of flushed cache lines. The two problems
are unaligned cache-line flushing and uncoordinated cache-
line flushing. The two problems come from the fundamental
limitation of the existing memory allocation mechanism de-
signed for the traditional DRAM. In particular, the existing
memory allocation does not consider the effects of cache-
line flushing on persistent memory, and spreads data structures
with different dirtiness across cache blocks. This causes the
low dirtiness of flushed cache lines. Archapt introduces a cus-
tomized memory allocation mechanism to coalesce cache-line
flushing and improve efficiency.
In summary, the paper makes the following contributions:
• An architecture-aware new method to achieve high performance transactions on persistent memory;
• A mechanism that determines the necessity of cache-line flushing based on the locality of cache blocks, and a checksum mechanism to detect and correct data inconsistency to provide a strong guarantee on crash consistency;
• We reveal the low dirtiness of flushed cache lines in two common databases, and provide a solution to improve the efficiency of cache-line flushing;
• With Archapt, we reduce cache flushing by 66% and improve system throughput by 22% (42% at most), when running YCSB (A-F) and TPC-C against Redis, and OLTP-bench (TPC-C, LinkBench and YCSB) against SQLite (using the traditional undo logging as baseline). Archapt provides strong crash consistency, demonstrated by our crash tests.
2 Background and Motivation
Many studies build atomic and durable transactions [8,12,20,21,24,27,32,44,45] to handle the crash consistency issue on NVM. With such a transaction, each single update must be “all or nothing”, i.e., it either successfully completes, or fails completely with the data in NVM intact. With such a transaction, one has to write back the modified data from the volatile cache to NVM to provide durability. To ensure
a cache line is indeed written to NVM in the correct order, one often uses cache-line flushing instructions (e.g., clflush, clflushopt or clwb) and persistent barriers (e.g., sfence and mfence). Cache-line flushing is expensive for two reasons: (1) it may need to invalidate cache lines (with the clflush and clflushopt instructions) and trigger cache-line-sized writes to the memory controller; and (2) it needs persistent barriers to ensure that all flushes are completed and to force any updates in the memory controller to be written to NVM.
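To make this concrete, here is a minimal sketch (ours, not code from any of the cited systems) of how a range of persistent data is typically flushed on x86-64; the function name persist_range is our own, and the intrinsics assume a CPU with CLFLUSHOPT support:

    #include <stdint.h>
    #include <stddef.h>
    #include <immintrin.h>   /* _mm_clflushopt, _mm_sfence */

    #define CACHE_LINE_SIZE 64

    /* Flush every cache line covering [addr, addr+len), then issue a
     * persistent barrier so all flushes complete before later stores. */
    static void persist_range(const void *addr, size_t len)
    {
        uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
        uintptr_t end = (uintptr_t)addr + len;
        for (uintptr_t p = start; p < end; p += CACHE_LINE_SIZE)
            _mm_clflushopt((void *)p);
        _mm_sfence();   /* wait for the flushes to drain */
    }

Every persist of a log record or persistent object pays this cost, which motivates reducing the number of such flushes.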
In the rest of the paper, we use the term persistent object to represent a data object that is modified within a transaction and needs to be persisted. We use the term log record to represent a log (a copy of the old data in an undo logging mechanism, or a copy of the new data in a redo logging mechanism). To persist a persistent object, the current common practice is to flush all cache blocks of the persistent object [24]². We use “flushing all cache blocks” for a persistent object and “cache-line flushing” for a persistent object interchangeably in the paper.

² We have to flush all cache blocks, even though some of them may not be in the cache, because there is no mechanism to faithfully track which cache blocks are in the cache. Flushing cache blocks not resident in the cache has a similar cost to flushing those resident in the cache.
2.1 Performance Analysis on Log-based Per-
sistent Memory Transactions
Undo and redo logging are two of the most common mecha-
nisms to build persistent transactions on persistent memory.
In undo and redo logging, the logging operations (including
data copy and log record manipulation) and persistence op-
erations (including cache-line flushing and store barrier) are
necessary. Both of them cause performance loss in a trans-
action. To quantify the impact of the persistent logging on
transaction throughput, we run multiple workloads, including
YCSB and TPC-C against Redis, and OLTP-bench (TPC-C,
LinkBench and YCSB) against SQLite with and without the
persistent logging. For each workload, we use eight client
threads. More experimental details are available in Section 5.1. Figure 1 shows the results.
The figure reveals that logging decreases throughput by
8%-49%. For a workload with frequent updates (YCSB-A)
or large updates (LinkBench), the logging overhead can be
very large (33% and 49% for YCSB-A and LinkBench respec-
tively). Furthermore, we measure the delay (latency overhead)
caused by logging operations and persistence operations. Fig-
ure 2 shows the results. In the undo logging, the persistence operations account for 56%-78% of the latency overhead; in the redo logging, the persistence operations account for 41%-64% of the latency overhead. The overhead of those persistence operations is exposed on the critical path of transactions. The above results show that the persistence operations can significantly impact transaction performance. Thus, we must avoid frequent cache-line flushing.

Figure 1: Throughput when running YCSB workloads (A-F), TPC-C and LinkBench against Redis and SQLite with and without logging.

Figure 2: Breakdown of undo/redo logging latency overhead.
Introducing architecture awareness into the design of a
transaction, we want to skip cache flushing by leveraging data
reuse information in the cache. If data reuse is low, then there
is a very good chance that the data has been evicted from the
cache by the hardware-based caching mechanism. We study
data reuse in the next section.
2.2 Data Reuse and Dirtiness Analysis
Data in a transaction includes log records and persistent ob-
jects. Log records, which are used to maintain the transaction
atomicity, are seldom reused. We study data reuse at the per-
sistent object level, and explore whether there are persistent objects with little reuse. These persistent objects are candidates for skipping cache-line flushing.
To study data reuse, we count the number of operations
(read and write) for each persistent object, and then report the percentage of persistent objects with 0, 1, 2 or more operations, which we call the distribution of data reuse. Figure 3 shows the results. The figure reveals that 78% of persistent objects are used only once or twice in all workloads except YCSB-E. In YCSB-E, about 89% of persistent objects have data reuse of no less than 2. Such high data reuse has the following reason: this workload has frequent range queries, each of which covers a range of persistent objects. Those query ranges overlap with each other, causing high data reuse.
Figure 3: Distribution of data reuse for persistent data objects.
Table 1: Average dirtiness of flushed cache lines.

Redis:  TPC-C 0.31 | YCSB-A 0.55 | YCSB-B 0.55 | YCSB-D 0.51 | YCSB-E 0.51 | YCSB-F 0.56
SQLite: TPC-C 0.40 | LinkBench 0.49 | YCSB 0.46
We also explore the efficiency of cache-line flushing. In
particular, we quantify the average dirtiness of flushed cache
lines. Table 1 shows the results for undo and redo logging (the two logging mechanisms have the same dirtiness). In general,
the dirtiness is less than 0.6 in all workloads, which is low.
Conclusions.
Using industry-standard workloads, our
analysis on data reuse and dirtiness shows great opportunities
to enable high performance transactions by skipping cache-
line flushing and improving its efficiency.
3 Design
Motivated by the above performance analysis, we introduce
a high performance transaction runtime system that targets reducing the overhead of persistence operations. We describe our design in detail in this section.
3.1 Overview
Archapt avoids cache-line flushing for persistent objects (but
not log records) to enable high performance transactions with-
out disturbing transaction atomicity. Archapt uses an LRU-
based method to reason about whether persistent objects are in the cache.
With this approach, Archapt does not immediately make a de-
cision on flushing cache blocks for a persistent object, when a
cache flushing request is issued from a transaction to persist a
persistent object. Archapt delays the decision until it collects
more information on read/write of the persistent object and
estimates the locality of the persistent object, using the LRU
queue. For the persistent object that is estimated not to be
resident in the cache, the cache flushing for all of its cache
blocks is skipped.
Archapt is also featured with a checksum mechanism. Skip-
ping cache-line flushing for some persistent objects raises the
risk of having inconsistent data for committed transactions,
when a crash happens. To remove the risk, we introduce a
checksum mechanism. This mechanism generates checksums
for persistent objects that have cache-line flushing skipped.
The checksum mechanism builds an invariant relationship
between cache blocks. Upon a crash, the checksums can be used to detect and correct data inconsistency. We design the mechanism with the consideration of avoiding frequent checksum updates for best performance and maximizing the capability of correcting data inconsistency.

Figure 4: The architecture of Archapt.
Furthermore, we identify two reasons that account for the
low dirtiness of flushed cache lines: unaligned cache-line
flushing and uncoordinated cache line flushing. To address
the two problems, Archapt introduces a customized memory
allocation mechanism. It clusters persistent objects with the
same functionality (i.e., key, field, value, and log) into con-
tiguous cache blocks to coordinate cache-line flushing and
align cache-line flushing, based on which Archapt improves
the efficiency of cache-line flushing.
Overall architecture of Archapt.
Archapt has four major
components: transaction management unit, memory manage-
ment unit, persistent management unit, and history manage-
ment unit. Figure 4 shows the architecture of Archapt.
(1) The transaction management unit includes a set of APIs
to establish a transaction (i.e., start and end). Such transaction
information is sent to the Archapt runtime to implement trans-
action semantics. The transaction management unit processes
the requested operations of the transaction. It also flushes
cache blocks for persistent objects that are estimated to be in
the cache. (2) The memory management unit pre-allocates a
set of memory pools for coalescing cache blocks and manages
the pools to meet memory allocation requests from transac-
tions. (3) The persistent management unit builds checksums
for persistent objects for which Archapt skips the cache-line
flushing. (4) The history management unit maintains an LRU queue and a hash table, ObjHT. The LRU queue is used to estimate the locality of persistent objects (i.e., in the cache or not). ObjHT provides metadata for each persistent object in the LRU queue, such as its location in the LRU queue and whether there is any pending cache-line flushing.
3.2 Architecture-Aware Cache-Line Flushing
The architecture-aware cache-line flushing uses an LRU
queue to reason about whether a persistent object is in the cache, and skips cache-line flushing for it if not. When a persistent object is updated, its cache blocks are placed into the LRU queue (the queue length is equal to the capacity of the last level cache), and the decision on cache flushing for this persistent object is pending until we have enough information to estimate the residence of the persistent object in the cache, based on the LRU queue. We describe our design in detail as follows.
First, once Archapt receives a request (i.e., a read or write
operation to a persistent object) from the client, the transac-
tion management unit queries ObjHT to see if the requested
persistent object has a record there. If yes, we infer that the
persistent object is accessed recently. The hardware cache
may have the persistent object resident in the cache because
of a previous operation on the persistent object. If the previ-
ous operation is a write operation, flushing cache blocks for
the persistent object must be pending. We finish the cache
flushing for the previous write operation. Furthermore, we
update the location of the persistent object in the LRU queue,
because of the current request. If the current request is a write
operation, we hold the cache flushing for the current request,
waiting for the opportunity to skip it in the future.
If the transaction management unit cannot find the requested persistent object's information in ObjHT, we conclude that the persistent object has not been accessed recently. The hardware cache may have evicted the persistent object out of the cache, or never cached it at all. The transaction management unit then skips any pending cache flushing request for the persistent object. Afterwards, the transaction management unit asks the history management unit to insert the information for the persistent object into the LRU queue, and suspends the cache flushing for the most recent request if it writes the persistent object. In future transactions, as other persistent objects are accessed, the target persistent object can be evicted out of the LRU queue according to the LRU policy; its record is then removed from ObjHT and the pending cache flushing is skipped.
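The following sketch (our illustration; helper names such as objht_lookup, lru_touch and lru_insert are hypothetical, not Archapt's actual interfaces) summarizes this decision flow for a write request:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t object_id_t;
    typedef struct objht_entry {
        object_id_t id;
        bool pending_flush;           /* a held, not-yet-issued flush */
    } objht_entry_t;

    /* Hypothetical helpers standing in for the history management unit. */
    objht_entry_t *objht_lookup(object_id_t id);
    objht_entry_t *lru_insert(object_id_t id);
    void lru_touch(objht_entry_t *e);
    void flush_all_cache_blocks(object_id_t id);

    /* Decision flow on each write request to a persistent object. */
    void on_write(object_id_t id)
    {
        objht_entry_t *e = objht_lookup(id);
        if (e) {
            /* Accessed recently: the object may still be cached, so
               finish the flush held from the previous write. */
            if (e->pending_flush)
                flush_all_cache_blocks(id);
            lru_touch(e);             /* update its LRU position */
        } else {
            /* Not found: likely already evicted by hardware (or never
               cached); insert it into the LRU queue and ObjHT. */
            e = lru_insert(id);
        }
        e->pending_flush = true;      /* hold this write's flush; it is
                                         skipped (with checksums added) if
                                         the object later falls out of the
                                         LRU queue */
    }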
We must carefully maintain the commit status of a transaction. After the completion of a transaction, we cannot label it as committed as in a traditional transaction, because cache flushing for some persistent objects in the transaction may be pending. For such a transaction, we label it as logically committed. Only after all cache flushing for persistent objects in the transaction is either finished or skipped (but with checksums added to the persistent objects; see Section 3.3) do we label the transaction as physically committed.
A logically committed transaction has completed all read
and write operations in the transaction. For such a transaction,
the system does not respond to the client to announce the
transaction commitment. For a physically committed trans-
action, the system does so, as in the traditional undo or redo
logging mechanisms.
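A minimal sketch of the resulting commit-status lifecycle (the state names are ours):

    /* Commit status of a transaction under Archapt (our naming). */
    typedef enum {
        TX_ACTIVE,           /* operations still executing */
        TX_LOGICAL_COMMIT,   /* all reads/writes done; some cache flushing
                                may still be pending; client not yet
                                acknowledged */
        TX_PHYSICAL_COMMIT   /* every pending flush finished or skipped
                                with checksums added; client acknowledged */
    } tx_state_t;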
The modern hardware-based cache hierarchy employs so-
phisticated caching policies. It is possible that a persistent
object is resident in the cache while the LRU estimates oth-
erwise. For this case, skipping cache-line flushing can poten-
tially cause data inconsistency for a physically committed
transaction, when a crash happens. We introduce a check-
sum mechanism to detect and correct inconsistent data (see
Section 3.3).
Handling log records.
Log records, once created for a
transaction, are seldom accessed (unless a crash happens).
We could skip cache flushing for log records and rely on
the hardware-based caching mechanism to implicitly persist
them. However, by doing so, some log records that are not flushed in time by the hardware would be lost when a crash happens, raising the risk of losing transaction atomicity before the physical commitment of the transaction. Hence, we do not skip cache-line flushing for log records. They are committed and maintained as in the traditional logging mechanisms.
3.3 Checksum Design
Skipping cache-line flushing for some persistent objects raises
the risk of disturbing transaction atomicity: once a transaction
is physically committed, there is no strong guarantee on data
consistency, because we estimate data locality and the esti-
mation can be inaccurate. To remove the risk, we introduce a
checksum mechanism.
We have multiple requirements for the checksum design.
First, the checksum mechanism should have the capability
to detect data inconsistency in physically committed persis-
tent objects. Second, the checksum mechanism must provide
strong guarantee on data consistency for persistent objects
when they are physically committed. Third, the checksum
mechanism must be lightweight. Unlike RAID or some ECC
that can come with large performance overhead, the overhead
of our checksum construction and maintenance should be
small, and smaller than the performance benefit of skipping
cache flushing for persistent objects. We describe the design
of the checksum mechanism in this section.
General Description.
Our checksums are built with cache blocks of multiple persistent objects from one or more transactions. To build the checksums, the cache blocks of multiple persistent objects are logically organized as an M×N matrix (M and N are the dimension sizes of the matrix, discussed later). Those persistent objects have cache flushing skipped. Each column of the matrix corresponds to the cache blocks of one persistent object, where each element of the column is a cache block. Checksums are built as one extra row (the (M+1)-th row) and one extra column (the (N+1)-th column) of the matrix. The matrix becomes (M+1)×(N+1). The extra row, named consistency checksums, is used to detect data inconsistency in the N columns (i.e., N persistent objects), and each element of the extra row is a consistency checksum for one column. The extra column, named correlation checksums, builds an invariant relationship between cache blocks across the multiple persistent objects. The correlation checksums can correct data inconsistency. We name the matrix the virtual matrix in the following discussion.

Figure 5: Three examples for checksum creation and correcting data inconsistency: (a) checksum creation; (b) an example of correctable data inconsistency; (c) an example of uncorrectable data inconsistency.
Consistency checksums to detect data inconsistency.
When a persistent object with cache flushing skipped is logically committed, we immediately create a consistency checksum. The checksum is a simple summation of the cache blocks of the persistent object. The consistency checksum for a persistent object is implemented as an extra cache block added to the persistent object. The checksum is immediately flushed for consistency once it is created. When a crash happens, for each persistent object with a consistency checksum, we recalculate the checksum and compare it with the existing one in persistent memory. If there is a mismatch, then data inconsistency is detected.
The consistency checksum mechanism is very effective at detecting data inconsistency for a persistent object: any inconsistent cache block of the persistent object easily causes a checksum mismatch. In our evaluation with ten workloads, hundreds of millions of transactions and 10,000 crash tests, the consistency checksum mechanism detected all data inconsistencies.
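As an illustration, here is a minimal sketch of such a summation-based consistency checksum; the word-wise, wrap-around arithmetic over 64-byte blocks is our assumption, since the paper only says the checksum is a simple summation of cache blocks:

    #include <stdint.h>
    #include <stddef.h>

    #define WORDS_PER_BLOCK 8   /* a 64-byte cache block as eight 64-bit words */

    /* Sum the cache blocks of one persistent object into one extra
     * checksum block (modulo-2^64 per word). */
    void consistency_checksum(const uint64_t *blocks, size_t nblocks,
                              uint64_t out[WORDS_PER_BLOCK])
    {
        for (int w = 0; w < WORDS_PER_BLOCK; w++)
            out[w] = 0;
        for (size_t b = 0; b < nblocks; b++)
            for (int w = 0; w < WORDS_PER_BLOCK; w++)
                out[w] += blocks[b * WORDS_PER_BLOCK + w];
    }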
Correlation checksums to correct data inconsistency.
A correlation checksum, as an element of the (N+1)-th column of the virtual matrix, is a summation of the cache blocks of one row of the virtual matrix. The (N+1)-th column is composed of M correlation checksums, each of which is for one row. Since the cache blocks of a row come from N persistent objects, a correlation checksum aims to correct data inconsistency for any of the N persistent objects. The correlation checksums (i.e., the (N+1)-th column) are immediately flushed out of the cache to commit, once they are fully built.
Once a crash happens, we recalculate the correlation checksums and compare them with the existing ones in persistent memory. If there is a mismatch in any correlation checksum (say the element m_k(N+1) of the virtual matrix), then the corresponding row (row k) must have data inconsistency. Using the consistency checksums, we can reason which element of row k is inconsistent. Assume that the element m_kj is. This
element is corrected by the following:
    m_kj = m_k(N+1) − Σ_{i=1, i≠j}^{N} m_ki        (1)

where m_k(N+1) is the correlation checksum committed in persistent memory.
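Under the same word-wise summation assumption as the sketch above, recovering the inconsistent block via Equation 1 would look like this (illustrative only):

    /* Recover block j of row k: subtract the other N-1 (consistent)
     * blocks of the row from the committed correlation checksum. */
    void correct_block(uint64_t *row[], int n, int j,
                       const uint64_t corr_cksum[WORDS_PER_BLOCK])
    {
        for (int w = 0; w < WORDS_PER_BLOCK; w++) {
            uint64_t v = corr_cksum[w];
            for (int i = 0; i < n; i++)
                if (i != j)
                    v -= row[i][w];
            row[j][w] = v;        /* m_kj = m_k(N+1) - sum_{i != j} m_ki */
        }
    }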
N, the column size of the virtual matrix, is the number of persistent objects we use to build the virtual matrix. A smaller N causes more frequent creation of checksums and hence larger performance overhead, but reduces the possibility of losing updates to persistent objects (because checksums are frequently committed); a larger N has smaller performance overhead, but increases that possibility. We empirically choose N as 16 to strike a balance. In other words, we commit correlation checksums for N persistent objects together.
M, the row size of the virtual matrix, is determined by the largest persistent object among the N persistent objects: M is the number of cache blocks in the largest persistent object. For the shorter persistent objects, their corresponding columns in the virtual matrix are padded with zero-valued elements to make them as long as the largest persistent object.
An example.
Figure 5(a) shows an example to further explain the idea of the checksums. In this example, we have four persistent objects with four, four, three, and four cache blocks respectively. Hence, the virtual matrix is 4×4, and each column is for one persistent object. The consistency checksums are in the fifth row, and the correlation checksums are in the fifth column. The consistency checksums CkSum1-CkSum4 can detect data inconsistency for the first to fourth persistent objects respectively. The correlation checksums CkSum5-CkSum8 can be used to correct data inconsistency for the cache blocks in the first to fourth rows. Suppose CB#32 has inconsistency detected by the consistency checksum CkSum2. The inconsistency can be corrected by the correlation checksum CkSum7. In particular, CB#32 = CkSum7 − CB#31 − CB#33 − CB#34.
Enabling high performance checksum mechanism.
Our checksum mechanism does not cause large performance overhead, for the following reasons. First, creating checksums is not on the critical path of a transaction. A checksum for a persistent object is created only after the persistent object is estimated to be evicted out of the LRU queue, which indicates that the persistent object is highly unlikely to be accessed in the near future. Also, creating the checksum for the persistent object and committing it later do not block the execution of other transactions. Hence, checksum creation can happen in parallel with other operations, which removes it from the critical path.
Second, checksums do not need to be updated frequently. Once a persistent object is updated, its checksums must be recalculated and updated to maintain their validity. Such updates can cause performance overhead. This performance problem is common in other mechanisms, such as ECC or RAID. However, it is not a problem in our design, because we build checksums over persistent objects that are not frequently accessed (according to the LRU queue). Updating checksums therefore does not happen often.
Third, the overhead of flushing the cache blocks of checksums can be smaller than that of flushing cache blocks for persistent objects. Hence, the performance benefits of the checksum mechanism provide opportunities to outweigh its overhead. Given an M×N virtual matrix, we need to flush (N+M) cache blocks to make the checksums consistent. In contrast, to make N persistent objects in the virtual matrix consistent, we need to flush at least (M+N−1) cache blocks (assuming that the largest persistent object has M cache blocks while each of the other persistent objects has just one cache block), and at most (M×N) cache blocks (assuming that each persistent object has M cache blocks). In fact, when we build a virtual matrix, we try to use persistent objects with similar sizes, such that in nearly all cases, the number of cache blocks to flush for persistent objects is close to (M×N). In other words, in nearly all cases, the number of cache blocks to flush for checksums is significantly smaller than that for N persistent objects (i.e., N+M vs. M×N). Although we need to update the checksums and flush again when there is any update to the persistent objects, that does not happen very often, because of the second reason discussed above. In our evaluation, using the checksum mechanism always saves cache-block flushes, hence bringing performance benefits.
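As a concrete illustration (our arithmetic, using the paper's N = 16 and assuming M = 4): committing the checksums costs N + M = 20 cache-block flushes, whereas flushing the 16 persistent objects directly would cost up to M×N = 64 flushes.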
Analysis on the capability of correcting data inconsis-
tency.
The correlation checksums have a strong capability to correct data inconsistency. If one cache block in a row is inconsistent, we can easily correct it using Equation 1. If more than one cache block in a row is inconsistent, we can use the consistency checksums and the correlation checksums together to correct them. Note that a consistency checksum, although mainly used to detect data inconsistency, can also be used to correct inconsistent cache blocks that fall into the same column, using a method similar to Equation 1 for the correlation checksum.
Figure 5(b) gives an example where we have two inconsistent cache blocks in the first row. Using the correlation checksum CkSum5 alone is not able to correct them. However, using the consistency checksums CkSum2 or CkSum4, we can correct at least one of the inconsistent cache blocks. Afterwards, we can use CkSum5 to correct the other.
It is possible that a row has multiple inconsistent cache blocks and the columns where those inconsistent cache blocks reside have other inconsistent cache blocks. Figure 5(c) gives an example. In this example, the first row has two inconsistent cache blocks (CB#12 and CB#14), resident in columns two and four. These two cache blocks are not correctable by the correlation checksum CkSum5. Meanwhile, each of columns two and four has another inconsistent cache block (CB#42 and CB#24), making the consistency checksums (CkSum2 and CkSum4) incapable of correcting the inconsistent cache blocks too.
In this case, no checksum, including the combination of consistency checksums and correlation checksums, can correct those cache blocks. However, such a case is extremely rare: those inconsistent cache blocks must “coincidentally” fall into the same rows and columns together. In our evaluation with ten workloads, hundreds of millions of transactions and 100,000 crash tests, our checksums corrected all data inconsistencies for committed transactions.
Post-crash processing.
After a crash happens, we examine the persistent objects in persistent memory. If they do not have checksums and their transactions are physically committed, then the persistent objects must be consistent, because no cache flushing was skipped. If they do not have checksums and their transactions are not physically committed, then the transaction updates are cancelled and the persistent objects are restored using traditional logs.
If the persistent objects have checksums and their transactions are physically committed, then we use the consistency checksums to detect the consistency of each persistent object. If there is any inconsistency, then we use the correlation and consistency checksums to correct it. If the data inconsistency is not correctable, which is very rare, then the corresponding transaction is aborted. To avoid uncorrectable data inconsistency after physical commitment, we could add another row and column of consistency and correlation checksums. The new and old checksums, each built upon half of the rows or columns, would improve correction for those rare cases, but come with larger performance overhead. Study of this tradeoff is out of the scope of this paper, because the current checksum mechanism already works very well in our evaluation.
Ensuring transaction atomicity.
Before the physical commitment, Archapt relies on logs, as in the traditional undo and redo logging mechanisms, to ensure atomicity. After the physical commitment, the persistent objects have been successfully updated with the assistance of checksums, and atomicity is enforced. In the extremely rare case where a persistent object is not consistent and the checksum mechanism cannot correct it, we rely on a traditional checkpoint mechanism and go back to the last valid checkpoint to ensure atomicity.
3.4 Coalescing of Cache-Line Flushing
To reduce the overhead of cache-line flushing, we have two
methodologies: one is to avoid cache flushing for persistent
objects as in Section 3.2; the other is to coalesce cache-line
flushing to avoid the low dirtiness of flushed cache lines. After investigating two common databases (Redis and SQLite), we find two reasons that account for the low dirtiness of flushed cache lines: unaligned cache-line flushing and uncoordinated cache-line flushing.

Figure 6: Uncoordinated cache-line flushing in the two-level hash table in Redis.
Unaligned cache-line flushing happens when a persistent object is not aligned with cache lines. For example, consider a persistent object of 100 bytes. Ideally, the object should use two cache blocks (assuming that the cache block size is 64 bytes). However, if the object is not aligned well during memory allocation, it uses three cache blocks. Once the object is modified, we have to flush three cache lines instead of two. This easily increases the number of cache-line flushes by 50%. We find this problem in both Redis and SQLite.
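A minimal sketch of the aligned-allocation fix, using the standard posix_memalign (Archapt's own allocation is pool-based, as described below):

    #include <stdlib.h>

    /* Return memory whose start address is a multiple of the 64-byte
     * cache line, so a 100-byte object spans two cache blocks, not three. */
    void *cacheline_aligned_alloc(size_t size)
    {
        void *p = NULL;
        if (posix_memalign(&p, 64, size) != 0)
            return NULL;
        return p;
    }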
Uncoordinated cache-line flushing happens when multiple associated data objects are allocated to separate cache blocks. The data objects are associated because they are often updated together; if they are allocated into the same cache blocks, we can reduce the number of cache-line flushes. This problem happens more often in NoSQL systems, such as key-value stores. Figure 6 gives an example in Redis where uncoordinated cache-line flushing happens. The figure shows the basic data structure of Redis. As a key-value store, Redis enables secondary indexing based on a two-level hash table. In the second level, Redis has a set of field-value pairs. Each field-value pair stores the field ID (F_n) and its data. For each field-value pair, the field and value objects are allocated separately on different cache blocks. This is inefficient on persistent memory, because the field and value have to be persisted by flushing separate cache lines. The size of the field object is small (usually less than one cache block), and the field object is usually updated together with the value object. Therefore, coalescing the field and value objects into fewer contiguous cache blocks can reduce the number of cache-line flushes.
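For illustration, a sketch of the coalesced layout (our own struct, not Redis code), with sizes chosen so one field-value pair fills exactly one 64-byte cache block:

    #include <stdint.h>

    /* Field and value co-located in one cache block: a single cache-line
     * flush persists both, instead of two flushes for separate objects. */
    struct fv_pair {
        char     field[16];    /* small field ID, e.g. "F_1" */
        uint32_t value_len;
        char     value[44];    /* 16 + 4 + 44 = 64 bytes = one block */
    } __attribute__((aligned(64)));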
To address the above two problems, we introduce a new
memory allocation mechanism to improve the efficiency of
cache-line flushing. We use Redis as an example again. The
original implementation of Redis uses the traditional memory
allocation, without considering the implications of memory
allocation on cache flushing. Whenever a key or a complex
value is created, Redis allocates the corresponding memory
space on demand, without coordination with other memory allocations.

Table 2: Archapt APIs

API Name                               | Functionality
Archapt_Init()                         | Pre-allocate memory pools and initialization
Archapt_Tx_Start()                     | Identify the beginning of a transaction
Archapt_Tx_End()                       | Identify the end of a transaction
Archapt_Tx_LCommit()                   | Logical commitment
Archapt_Malloc(int type, size_t size)  | Memory allocation for coalescing cache-line flushing
Archapt_Free(int type, size_t size)    | Free memory allocated for coalescing cache-line flushing
In our new design for Redis, we pre-allocate three memory pools instead of allocating memory on demand. The three memory pools, key_pool, field_value_pool and log_pool, meet the memory allocation requests for keys, field-value pairs, and log records, respectively. We use the three memory pools, instead of allocating memory on demand, because the three pools can use separate memory allocation methods to minimize cache-line flushing, and they can cluster objects with the same functionality (i.e., key, field, value, and log) into contiguous cache blocks to coordinate cache-line flushing.
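A bump-pointer sketch of such pool-based allocation (our illustration; the pool names follow the paper, everything else is assumed):

    #include <stddef.h>

    enum pool_type { POOL_KEY, POOL_FIELD_VALUE, POOL_LOG };

    typedef struct { char *next, *end; } pool_t;
    static pool_t pools[3];   /* key_pool, field_value_pool, log_pool,
                                 pre-allocated at initialization */

    /* Serve a request from the pool of the object's class, rounded up to
     * whole cache blocks so same-class objects stay contiguous and
     * cache-line aligned. */
    void *pool_alloc(enum pool_type t, size_t size)
    {
        pool_t *p = &pools[t];
        size = (size + 63) & ~(size_t)63;
        if (p->next + size > p->end)
            return NULL;               /* pool exhausted */
        void *obj = p->next;
        p->next += size;
        return obj;
    }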
4 Implementation
Programming APIs.
Archapt is implemented as a user-level library that provides persistence support and can be integrated with existing log-based transaction implementations, such as Intel PMDK [24], Redis and SQLite. Archapt includes a set of APIs, defined in Table 2.
Archapt_Init() is used to pre-allocate multiple memory pools for coalescing cache-line flushing (Section 3.4) and to initialize critical data structures (e.g., the LRU queue and ObjHT). Archapt_Tx_Start() and Archapt_Tx_End() are used to identify transactions for the Archapt runtime, and can be embedded into the existing transaction start/finalization functions. Archapt_Tx_LCommit() is used to replace the traditional transaction commit to implement the logical commit for Archapt. Archapt_Malloc() and Archapt_Free() are used to replace the traditional memory allocation and free APIs in the transaction implementation; they allocate and free memory from/to the pre-allocated memory pools for coalescing cache-line flushing.
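A hypothetical usage sketch of these APIs (the transaction body and the type tags KEY and FIELD_VALUE are our assumptions, not defined by the paper):

    Archapt_Init();                               /* pools, LRU queue, ObjHT */

    Archapt_Tx_Start();
    char *key = Archapt_Malloc(KEY, 16);          /* from key_pool */
    char *val = Archapt_Malloc(FIELD_VALUE, 64);  /* from field_value_pool */
    /* ... update key/val; Archapt holds or issues their cache-block
       flushes based on the LRU queue ... */
    Archapt_Tx_LCommit();    /* logical commit; physical commit follows once
                                pending flushes are finished or safely skipped */
    Archapt_Tx_End();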
System optimization.
Archapt includes a number of opti-
mization techniques to enable high performance and thread
safety. These techniques include SIMD vectorization of check-
sum creation and update, a high-performance concurrent lock-free hash table, and a high-performance LRU queue based on circular buffers. In addition, to avoid contention on the LRU queue from multiple transactions (multiple threads), Archapt creates a transaction management unit for each transaction, and the transaction management unit puts information on reads/writes of persistent objects into a local buffer. By fetching the information from those local buffers, the history management unit collectively updates the LRU queue.

Table 3: The percentage of different operations in the evaluated workloads; “R”, “U”, “I”, “RU”, “S” and “D” stand for read, update, insert, read & update, scan, and delete operations respectively.

Ops |                        Redis                          |        SQLite
    | TPC-C  YCSB-A  YCSB-B  YCSB-C  YCSB-D  YCSB-E  YCSB-F | TPC-C  LinkBench  YCSB
R   |   8      50      95     100     95      -       50    |   8       64       50
U   |  47      50       5      -       -      -        -    |  47       16       10
I   |  45       -       -      -       5      5        -    |  45       12        5
RU  |   -       -       -      -       -      -       50    |   -        -       10
S   |   -       -       -      -       -     95        -    |   -        4       15
D   |   -       -       -      -       -      -        -    |   -        4       10
5 Evaluation
5.1 Experimental Methodology
The goal of the evaluation is to measure the performance of Archapt with a range of workloads with different characteristics. We use both NoSQL and SQL systems (Redis and
SQLite). We use four persistent transaction mechanisms for
evaluation: undo logging with Archapt, undo logging, the
existing rollback journal system in SQLite, and the existing
AOF mechanism (logging every write operation) in Redis. We
do not show the results of redo logging in our evaluation because of the space limitation, but the results of redo logging are similar to those of undo logging. Our implementation of
the undo logging is based on Intel PMDK [24]. For the roll-
back journal and AOF, whenever a transaction commitment
happens, we commit the transaction updates to memory (not
hard drive), in order to enable fair performance comparison.
We run YCSB [14] (A-F) and TPC-C [2,28] against
Redis, and run OLTP-bench [18] (particularly, TPC-C,
LinkBench [6] and YCSB) against SQLite. These workloads
are chosen for Redis and SQLite respectively, because they
can easily run on the two database systems without any modi-
fication. Table 3 gives some details of these workloads. For
YCSB running against Redis, we perform 10M operations;
for other workloads, we use the default configurations.
All experiments are performed on a 24-core machine with
two 12-core Xeon Gold 6126 processors, 187GB of memory and a 19.25MB last level cache. We use DRAM to emulate
NVM, since NVDIMM is not on the market at the time of
preparing this manuscript. For other slower NVMs, the bene-
fits of Archapt would only be larger because of the reduction
of cache-line flushing. We use the clflushopt instruction to flush cache lines, which is one of the most recent and efficient cache-line flushing instructions. Other advanced instructions such as clwb are not available in processors on the market at the time of preparing this manuscript.
5.2 Experimental Results
Basic Performance.
We use different numbers of threads to
evaluate throughput and latency of the four persistent transac-
tion mechanisms. Figure 7 and Figure 8 show the results.
Figure 7 reveals that undo with Archapt has the highest
throughput among the four transaction mechanisms. On av-
erage, undo with Archapt offers 22%, 60% and 35% better
throughput than the traditional undo logging, the rollback
journal system and AOF respectively. The biggest improve-
ment happens on YCSB-A, YCSB-F and LinkBench, which
are write-intensive. For these three workloads, undo with Ar-
chapt offers up to 36%, 80% and 58% higher throughput than
the other three transaction mechanisms. As we increase the
number of client threads, Archapt consistently performs best.
With 12 threads, Archapt offers 22%, 34% and 57% better
throughput than the traditional undo logging, the rollback jour-
nal system and AOF respectively. For the read-only workload
(YCSB-C), Archapt cannot offer performance improvement,
but the throughput is at most 1% lower than other transaction
mechanisms, which is small.
Figure 8 shows the 99th-percentile latency for the four
transaction mechanisms. We run eight client threads for each
workload. Archapt has the shortest latency among the four
transaction mechanisms. On average, Archapt decreases the
tail latency by 7%, 9% and 20%, compared with the tradi-
tional undo logging, AOF and the rollback journal system
respectively. Such performance improvement in tail latency
comes from the reduction of unnecessary cache flushing. For
the read-only workload (YCSB-C) that offers no opportunity
to reduce cache flushing, Archapt still provides comparable
performance to the other three transaction mechanisms.
Quantifying the effectiveness of reducing cache-line
flushing.
We measure the number of cache-line flushes before and after applying Archapt to undo logging. We only show the results for undo logging, because it has fewer cache-line flushes than the rollback journal and AOF; hence, reducing cache-line flushing for undo logging is more challenging than doing so for the rollback journal and AOF. Figure 9
shows the number of reduced cache-line flushing after ap-
plying Archapt. The numbers in the figure are normalized
by the total numbers of cache-line flushing before applying
Archapt. The figure does not include YCSB-C, because this
workload is read-only and does not need cache flushing. The
figure also isolates the contributions of the two techniques (the LRU-based approach and the coalescing of cache-line flushing) to compare their effectiveness.
The figure reveals that Archapt greatly reduces the number of cache-line flushes, by 66% on average. YCSB-E has less reduction than other workloads, because it has more data reuse in persistent objects, which provides fewer opportunities to skip cache-line flushing.
Figure 7: Throughput with the four transaction mechanisms, as the number of threads varies from four to twelve; (a) YCSB A-F (Redis); (b) TPC-C (Redis); (c) three workloads (SQLite). “T” = “threads”.

Figure 8: 99th-percentile transaction latency with eight threads; (a) Redis; (b) SQLite.
This result is consistent with that shown in Figure 3.
We further notice that both techniques effectively reduce cache-line flushing. The contribution of the LRU-based approach to the reduction of cache-line flushing varies between workloads, because different workloads have different data reuse of persistent objects. Furthermore, compared with Redis, SQLite gains less benefit from the coalescing of cache-line flushing, because the strict SQL data structures in SQLite already have some existing optimizations for cache line alignment.
Quantifying the dirtiness of flushed cache lines. Figure 10 shows the distribution of the dirtiness of flushed cache lines before and after applying Archapt. The figure does not include YCSB-C, because this workload is read-only and does not need cache flushing. In general, our memory allocation optimization works in every workload; the average improvement is 12%. Among all workloads, YCSB-E (Redis) has the largest increase: the average cache line dirtiness increases from 51% to 68%.
Figure 9: The numbers of reduced cache-line flushes after applying Archapt to undo logging, normalized by the total numbers of cache-line flushes before using Archapt.

Random crash tests. We examine data consistency in physically committed transactions using random crash tests.
We aim to evaluate the effectiveness of our checksum mecha-
nism. We use an NVM crash emulator [40], because of two
reasons: (1) a large number of crash tests affect the reliability
of our physical machine; and (2) DRAM, used for our NVM
emulation, loses data when the crash happens. The emula-
tor is based on PIN [38] and emulates a configurable LRU
cache hierarchy. The crash emulator retains data in the emulated main memory after a crash is triggered, allowing us to examine data consistency. The crash emulator randomly triggers crashes. In our evaluation, the cache capacity (19.25 MB in the last level cache) and associativity (11) in the crash emulator are the same as those in our physical machine. For each workload we perform 100 crash tests to ensure statistical significance.

Figure 10: Distribution of the dirtiness of flushed cache lines with and without Archapt; “Y” and “N” stand for using Archapt and not using Archapt respectively.

Table 4: Crash test results. The numbers for each workload in the table are for 100 crash tests.

System | Workload  | # of inconsistent  | # of inconsistent persistent  | # of detected inconsistent persistent
       |           | persistent objects | objects detected by checksums | objects not correctable by checksums
Redis  | TPC-C     | 693                | 693                           | 0
Redis  | YCSB-A    | 901                | 901                           | 0
Redis  | YCSB-B    | 0                  | 0                             | 0
Redis  | YCSB-D    | 0                  | 0                             | 0
Redis  | YCSB-E    | 1024               | 1024                          | 0
Redis  | YCSB-F    | 295                | 295                           | 0
SQLite | TPC-C     | 782                | 782                           | 0
SQLite | LinkBench | 811                | 811                           | 0
SQLite | YCSB      | 495                | 495                           | 0
Table 4 shows the results. The table reports the total number of inconsistent persistent objects measured by the crash emulator and the total numbers of inconsistent persistent objects detected and corrected by the checksum mechanism, over 100 crash tests. We have a couple of observations. First, the checksum mechanism successfully detects and corrects all inconsistent persistent objects for all workloads. This demonstrates that the checksum mechanism is highly effective. Second, we do not have a large number of inconsistent persistent objects after crashes: we see only hundreds of inconsistent persistent objects, while each workload updates millions of persistent objects. For YCSB-B and YCSB-D, we do not find any inconsistent persistent objects at all. This indicates that our LRU-based approach successfully estimates data locality, such that skipping cache-line flushing causes only a small number of inconsistent persistent objects after transactions are physically committed.
6 Related Work
Persistency in NVM has received significant research attention recently. Previous work on runtime systems for persistent memory transactions [8,12,20,21,24,27,32,35,44,45] exploits software-based approaches at the user level to provide safe, transactional access to non-volatile memories. Some of those runtime systems employ undo logging mechanisms [8,12,20,21,24,27], while others employ redo logging mechanisms [32,44,45]. Our work can be applied to those logging-based mechanisms to improve the performance of persistent transactions.
Enabling crash consistency on NVM.
Enabling crash
consistency on NVM can be expensive, because of cache-
line flushing and persistent barriers. Strict persistency [37]
enforces crash consistency by strictly enforcing write orders
in persistent memory and can cause a large performance loss.
Some work [13,25,26,37,39] relaxes the constraints on write
orders to improve performance. Different from the above
existing work, we do not relax write orders, but optimize
performance by skipping and coalescing cache-line flushing.
Detection and correction of data errors.
Previous efforts on RAID [9,23,36,41] and ECC [17,29-31,43,49] exploit hardware- and software-based approaches to detect and correct data errors, but their relatively large runtime overhead and possible hardware modifications make them hard to apply to detecting and correcting data inconsistency on NVM for a transaction mechanism.
Algorithm-based fault tolerance, as an efficient software
mechanism to correct data errors, has been used for fault toler-
ance in high performance computing (HPC) [10,11,15,16,19,
22,47,48]. However, those techniques are customized for specific numerical algorithms, and are hard to apply to transactional workloads in database systems.
The lazy persistency work [5] by Alshboul et al. is closely relevant to ours. They focus on computation-intensive HPC applications and skip cache-line flushing for all dirty data objects, relying on periodic cache flushing of all dirty cache blocks. Our work is significantly different in two respects. First, lazy persistency does not have a systematic way to decide which cache-line flushing can be skipped. Second, lazy persistency cannot correct data inconsistency after a crash. These two limitations cause unpredictable data loss in committed transactions; hence lazy persistency cannot be reliably applied to transaction systems. Archapt avoids these limitations with its LRU-based approach and well-designed checksums.
7 Conclusions
Enabling high performance transactions is critical to realizing the performance benefit of persistent memory for many applications. In this paper, we present Archapt, an architecture-aware, high performance transaction runtime system for persistent memory. Archapt reduces the number of cache-line flushes to improve the performance of transactions. Archapt estimates whether the cache blocks of a persistent object are in the cache to determine the necessity of cache-line flushing. Relying on a checksum mechanism to detect data inconsistency and correct it if possible, Archapt provides strong data consistency. Archapt also coalesces cache blocks with low dirtiness to improve the efficiency of cache-line flushing. Our
results show that Archapt reduces cache flushing by 66% and improves system throughput by 22% (42% at most), when run-
ning YCSB (A-F) and TPC-C against Redis, and OLTP-bench
(TPC-C, LinkBench and YCSB) against SQLite (using the
traditional undo logging as baseline).
References
[1] OLTPBench. https://github.com/oltpbenchmark/oltpbench.
[2] Python TPC-C. https://github.com/apavlo/py-tpcc.
[3] Yahoo! Cloud Serving Benchmark. https://github.com/brianfrankcooper/YCSB.
[4] Grant Allen and Mike Owens. The Definitive Guide to SQLite. Apress, Berkeley, CA, USA, 2nd edition, 2010.
[5] M. Alshboul, J. Tuck, and Y. Solihin. Lazy Persistency: A High-Performing and Write-Efficient Software Persistency Technique. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture, June 2018.
[6] Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. LinkBench: A Database Benchmark Based on the Facebook Social Graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013.
[7] Josiah L. Carlson. Redis in Action. Manning Publications, Greenwich, 2013.
[8] Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. Atlas: Leveraging Locks for Non-volatile Memory Consistency. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, 2014.
[9] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Patterson. RAID: High-performance, Reliable Secondary Storage. ACM Computing Surveys, 1994.
[10] Zizhong Chen. Algorithm-based Recovery for Iterative Methods without Checkpointing. In International Symposium on High Performance Distributed Computing, 2011.
[11] Zizhong Chen. Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013.
[12] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.
[13] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009.
[14] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, 2010.
[15] Teresa Davies and Zizhong Chen. Correcting Soft Errors Online in LU Factorization. In International Symposium on High-Performance Parallel and Distributed Computing, 2013.
[16] Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen. High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing. In International Conference on Supercomputing, 2011.
[17] T. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. Technical report, IBM Microelectronics Division, 1997.
[18] Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases. Proc. VLDB Endow., 2013.
[19] Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault, and Jack Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
[20] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System Software for Persistent Memory. In Proceedings of the Ninth European Conference on Computer Systems, 2014.
[21] E. R. Giles, K. Doshi, and P. Varman. SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory. In 2015 31st Symposium on Mass Storage Systems and Technologies, May 2015.
[22] Kuang-Hua Huang and Jacob A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers, 1984.
[23] Kai Hwang, Hai Jin, and Roy Ho. RAID-x: A New Distributed Disk Array for I/O-centric Cluster Computing. In Proceedings of the Ninth International Symposium on High-Performance Distributed Computing, 2000.
[24] Intel. Persistent Memory Development Kit. https://pmem.io/.
[25] Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis Viglas. Efficient Persist Barriers for Multicores. In International Symposium on Microarchitecture, 2015.
[26] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch. Delegated Persist Ordering. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016.
[27] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and Thomas F. Wenisch. High-Performance Transactions for Persistent Memories. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016.
[28] Scott T. Leutenegger and Daniel Dias. A Modeling Study of the TPC-C Benchmark. In SIGMOD Record, 1993.
[29] Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S. Vetter. Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2013.
[30] Sheng Li, Doe H. Yoon, Ke Chen, Jishen Zhao, Jung H. Ahn, Jay B. Brockman, Yuan Xie, and Norman P. Jouppi. MAGE: Adaptive Granularity and ECC for Resilient and Power Efficient Memory Systems. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
[31] S. Lu, H. Li, and K. Miyase. Progressive ECC Techniques for Phase Change Memory. In 2018 IEEE 27th Asian Test Symposium, 2018.
[32] Y. Lu, J. Shu, and L. Sun. Blurred Persistence in Transactional Persistent Memory. In 2015 31st Symposium on Mass Storage Systems and Technologies, 2015.
[33] Y. Lu, J. Shu, L. Sun, and O. Mutlu. Loose-Ordering Consistency for Persistent Memory. In 2014 IEEE 32nd International Conference on Computer Design, October 2014.
[34] Virendra J. Marathe, Achin Mishra, Amee Trivedi, Yihe Huang, Faisal Zaghloul, Sanidhya Kashyap, Margo Seltzer, Tim Harris, Steve Byan, Bill Bridge, and Dave Dice. Persistent Memory Transactions. CoRR, 2018.
[35] Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee, Yanqi Zhou, Ramnatthan Alagappan, Karin Strauss, and Steven Swanson. Atomic In-place Updates for Non-volatile Main Memories with Kamino-Tx. In Proceedings of the Twelfth European Conference on Computer Systems, 2017.
[36] Jai Menon and Jim Cortney. The Architecture of a Fault-tolerant Cached RAID Controller. In Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993.
[37] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. Memory Persistency. In Proceedings of the 41st Annual International Symposium on Computer Architecture, 2014.
[38] V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn. Pin: A Binary Instrumentation Tool for Computer Architecture Research and Education. In Proceedings of the 2004 Workshop on Computer Architecture Education, 2004.
[39] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu. ThyNVM: Enabling Software-transparent Crash Consistency in Persistent Memory Systems. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture, 2015.
[40] Jie Ren, Kai Wu, and Dong Li. Understanding Application Recomputability Without Crash Consistency in Non-Volatile Memory. In Proceedings of the Workshop on Memory Centric High Performance Computing, 2018.
[41] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, 2002.
[42] H. Shu, H. Chen, H. Liu, Y. Lu, Q. Hu, and J. Shu. Empirical Study of Transactional Management for Persistent Memory. In 2018 IEEE 7th Non-Volatile Memory Systems and Applications Symposium, 2018.
[43] Aniruddha N. Udipi, Naveen Muralimanohar, Rajeev Balasubramonian, Al Davis, and Norman P. Jouppi. LOT-ECC: Localized and Tiered Reliability Mechanisms for Commodity Memory Systems. In International Symposium on Computer Architecture, 2012.
[44] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell. Consistent and Durable Data Structures for Non-volatile Byte-addressable Memory. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, 2011.
[45] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight Persistent Memory. In Architectural Support for Programming Languages and Operating Systems, 2011.
[46] H. Wan, Y. Lu, Y. Xu, and J. Shu. Empirical Study of Redo and Undo Logging in Persistent Memory. In 5th Non-Volatile Memory Systems and Applications Symposium, 2016.
[47] Panruo Wu, Chong Ding, Longxiang Chen, Teresa Davies, Christer Karlsson, and Zizhong Chen. On-line Soft Error Correction in Matrix-Matrix Multiplication. Journal of Computational Science, 2013.
[48] S. Yang, K. Wu, Y. Qiao, D. Li, and J. Zhai. Algorithm-Directed Crash Consistence in Non-volatile Memory for HPC. In 2017 IEEE International Conference on Cluster Computing, 2017.
[49] Doe Hyun Yoon and Mattan Erez. Virtualized and Flexible ECC for Main Memory. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, 2010.