Architecture-Aware, High Performance Transaction
for Persistent Memory
University of California, Merced
Byte-addressable non-volatile main memory (NVM) demands transactional mechanisms to access and manipulate data on NVM atomically. Those transaction mechanisms often employ a logging mechanism (undo logging or redo logging).
However, the logging mechanisms bring large runtime overhead (8%-49% in our evaluation), and 41%-78% of the overhead is attributable to frequent cache-line flushing. Such large overhead significantly diminishes the performance benefits offered by NVM. In this paper, we introduce a new method to remove the overhead of cache-line flushing for logging-based transactions. Different from the traditional method that works at the program level and leverages program semantics to reduce the logging overhead, we introduce architecture
awareness. In particular, we do not flush certain cache blocks, as long as they are estimated to have been evicted from the cache by the caching mechanism (e.g., the cache replacement algorithm). Furthermore, we coalesce those cache blocks with low dirtiness to improve the efficiency of cache-line flushing. We implement an architecture-aware, high performance transaction runtime system for persistent memory, Archapt. Our results show that, compared with the traditional
undo logging, Archapt reduces cache flushing by 66% and improves system throughput by 22% on average (42% at most), when running TPC-C and YCSB (A-F) with Redis, and OLTP-bench (TPC-C, LinkBench and YCSB) with SQLite.
1 Introduction

Non-volatile memory (NVM), addressed at a byte granularity directly by the CPU and accessed at roughly the latency of main memory, is coming. While NVM as main memory provides an appealing interface that uses simple load and store instructions, it brings new challenges to the designs of persistent data structures, storage systems, and databases. In particular, a store does not immediately make data persistent, because the memory hierarchy (e.g., caches and store buffers) and processor state can remain non-persistent. There is a need to ensure that data is modified atomically when moving from one consistent state to another, in order to provide consistency after a crash (e.g., power loss or hardware failure).
The NVM challenges have resulted in investigations of
transactional mechanisms to access and manipulate data on
persistent memory (NVM) atomically [8, 27, 33].
Those transactional mechanisms often employ a logging technique (undo logging or redo logging). However, those transactional mechanisms have a high overhead. Our performance evaluation reveals that, when running TPC-C [2, 28] and YCSB (A-F) [3, 14] against Redis, and OLTP-bench (TPC-C, LinkBench and YCSB) against SQLite, with transactions enabled by an implementation of undo logging from Intel PMDK or an implementation of redo logging, we observe overheads of 8%-49%. Such large overhead significantly diminishes the performance benefit NVM promises to provide in many workloads.
Most of the overhead of logging mechanisms comes from data copying for creating logs and from cache-line flushing by special instructions. Cache-line flushing takes a large portion of the total overhead: in our evaluation with the above workloads, cache-line flushing takes, on average, 65% and 51% of the total overhead for the undo logging and redo logging mechanisms respectively. Removing the overhead of cache-line flushing is the key to enabling high performance transactions for persistent memory.
The traditional methods reduce the overhead of cache-line flushing using asynchronous cache-line flushing (e.g., blurring the persistency boundary and relaxing persistency ordering [32, 45]). Those methods move the overhead of cache-line flushing off the critical path, by overlapping cache-line flushing with the transaction. However, the effectiveness of asynchronous cache-line flushing depends on the characteristics of the transaction (e.g., how frequently data updates happen); cache-line flushing can still be exposed on the critical path, increasing the latency of the transaction.
(arXiv:1903.06226v1 [cs.DC] 14 Mar 2019)
In this paper, we introduce a new method to remove the overhead of cache-line flushing. The traditional methods work at the program level and leverage program semantics: as long as the transaction semantics remains correct, we can change the order of persisting data and trigger asynchronous cache-line flushing. Different from the traditional methods, our method introduces architecture awareness. In particular, we do not flush certain cache lines, as long as those cache lines are evicted from the cache by the caching mechanism (e.g., the cache replacement algorithm). In other words, we rely on the existing hardware mechanism to automatically and implicitly flush cache lines. The traditional methods do not have architecture awareness. Ignoring the possible effects of the caching mechanism, the traditional methods flush cache lines by explicitly issuing cache flush instructions, even though those cache lines will soon be, or already have been, evicted from the cache by the hardware.
Furthermore, we examine cache line dirtiness to quantify the efficiency of cache-line flushing. The dirtiness of a cache line is defined as the ratio of dirty bytes to the total number of bytes in the cache line. Since a cache line is the finest granularity at which to enforce data persistency, the whole cache line has to be flushed, even if only a few bytes in the cache line are dirty. Take our evaluation with the above workloads as an example again: the average dirtiness of flushed cache lines in Redis and SQLite is 49% for both the undo and redo logging mechanisms. Flushing clean data in a cache line wastes memory bandwidth and decreases the efficiency of cache-line flushing.
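To make the dirtiness metric concrete, the following is a minimal sketch (the helper name is ours, and stores are modeled as hypothetical (address, size) pairs) that computes the average dirtiness of the cache lines a set of stores touches:

```python
CACHE_LINE = 64  # bytes per cache line

def dirtiness_of_flushed_lines(writes):
    """Given (address, size) stores, return the average dirtiness of the
    cache lines they touch: dirty bytes / 64 for each touched line."""
    dirty = {}  # line base address -> set of dirty byte offsets
    for addr, size in writes:
        for b in range(addr, addr + size):
            line = b - (b % CACHE_LINE)
            dirty.setdefault(line, set()).add(b - line)
    ratios = [len(offsets) / CACHE_LINE for offsets in dirty.values()]
    return sum(ratios) / len(ratios)

# A 16-byte value and an 8-byte key on two different lines: both whole
# lines must be flushed, yet the average dirtiness is (16/64 + 8/64) / 2.
avg = dirtiness_of_flushed_lines([(0, 16), (128, 8)])
```

In this toy trace the average dirtiness is 0.1875, illustrating how small, scattered updates force near-empty lines through the flush path.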
To leverage architecture awareness to enable high performance transactions, we must address two challenges. First, we must have a software mechanism to reason about and decide the residence of cache blocks in the cache, without hardware modification. The mechanism must be lightweight and allow us to make a quick decision on whether a cache-line flushing is necessary.
Second, we must provide a strong guarantee on data consistency to implement transactions. Skipping cache-line flushing for some persistent objects raises the risk of losing data consistency for committed transactions. The software mechanism to reason about the residence of a cache block in the cache is an approximation of the hardware-based caching mechanism. If the software mechanism skips a cache-line flushing, but the corresponding dirty cache block is still in the cache, then there is a chance that the cache block is inconsistent when a crash happens. We must have a mechanism to detect and correct such inconsistency in persistent memory.
To address the above two challenges, we introduce Archapt (Architecture-aware, performant and persistent transaction), an architecture-aware, high performance transaction runtime system. (We distinguish cache line and cache block in this paper: the cache line is a location in the cache, and the cache block refers to the data that goes into a cache line.) Archapt provides a new way to perform transactional updates on persistent memory with efficient cache-line flushing. To address the first challenge, Archapt uses an LRU queue to reason about the residence of the cache blocks of a persistent object in the cache and to decide whether cache flushing for a persistent object in a transaction is necessary.
To address the second challenge, Archapt introduces a lightweight checksum mechanism. Checksums are built using multiple cache blocks from one or more persistent objects to establish implicit invariant relationships between cache blocks. Leveraging the invariant, Archapt can detect data inconsistency and make best efforts to correct it after a crash happens. The checksum mechanism provides a strong guarantee on data consistency, while causing small runtime overhead (less than 5% loss in throughput in our evaluation).
Furthermore, to improve the efficiency of cache-line flushing, we examine the implementation of common database systems (Redis and SQLite), and find two problems accounting for the low dirtiness of flushed cache lines: unaligned cache-line flushing and uncoordinated cache-line flushing. The two problems come from a fundamental limitation of the existing memory allocation mechanism designed for traditional DRAM. In particular, the existing memory allocation does not consider the effects of cache-line flushing on persistent memory, and spreads data structures with different dirtiness across cache blocks. This causes the low dirtiness of flushed cache lines. Archapt introduces a customized memory allocation mechanism to coalesce cache-line flushing and improve its efficiency.
In summary, the paper makes the following contributions:
• An architecture-aware new method to achieve high performance transactions on persistent memory;
• A mechanism that determines the necessity of cache-line flushing based on the locality of cache blocks, and a checksum mechanism to detect and correct data inconsistency to provide a strong guarantee on crash consistency;
• We reveal the low dirtiness of flushed cache lines in two common databases, and provide a solution to improve the efficiency of cache-line flushing;
• With Archapt, we reduce cache flushing by 66% and improve system throughput by 22% (42% at most), when running YCSB (A-F) and TPC-C against Redis, and OLTP-bench (TPC-C, LinkBench and YCSB) against SQLite (using the traditional undo logging as baseline). Archapt provides strong crash consistency, demonstrated by our crash tests.
2 Background and Motivation
Many studies build an atomic and durable transaction mechanism [8, 12, 20, 21, 24, 27, 32, 44, 45] to handle the crash consistency issue on NVM. With such a transaction, each single update must be "all or nothing", i.e., it either successfully completes, or fails completely with the data in NVM intact. With such a transaction, one has to write back the modified data from the volatile cache to NVM, which provides the durability. To ensure a cache line is indeed written to NVM in the correct order, one often uses cache-line flushing instructions (e.g., clflush, clflushopt and clwb) and persistent barriers (e.g., sfence). Cache-line flushing is expensive for two reasons: (1) it may need to invalidate cache lines (with clflush instructions) and trigger cache-line sized writes to the memory controller; and (2) it needs persistent barriers to ensure that all flushes are completed and to force any updates in the memory controller to be written to NVM.
In the rest of the paper, we use the term persistent object to represent a data object that is modified within the transaction and needs to be persisted. We use the term log record to represent a log (a copy of the old data in an undo logging mechanism, or a copy of the new data in a redo logging mechanism). To persist a persistent object, the current common practice is to flush all cache blocks of the persistent object. We use "flushing all cache blocks" for a persistent object and "cache-line flushing" for a persistent object interchangeably in the paper.
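Flushing all cache blocks of a persistent object amounts to flushing every 64-byte line the object overlaps, including lines it only partially covers. A minimal sketch of that address arithmetic (the helper name is ours; 64-byte lines are assumed):

```python
CACHE_LINE = 64  # bytes per cache line

def lines_to_flush(addr, size):
    """Return the base address of every cache line spanned by the byte
    range [addr, addr + size). The whole line is flushed even if the
    object covers only part of it."""
    if size <= 0:
        return []
    first = addr - (addr % CACHE_LINE)                      # align start down
    end = addr + size - 1                                   # last byte touched
    last = end - (end % CACHE_LINE)                         # align end down
    return list(range(first, last + CACHE_LINE, CACHE_LINE))

# A 100-byte object starting at 0x1030 straddles three lines, so three
# flush instructions are needed even though only 100 bytes are dirty.
spanned = lines_to_flush(0x1030, 100)
```

This is exactly why low dirtiness hurts: each returned line costs a full 64-byte write-back regardless of how few of its bytes changed.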
2.1 Performance Analysis on Log-based Persistent Memory Transactions
Undo and redo logging are two of the most common mechanisms to build persistent transactions on persistent memory. In undo and redo logging, the logging operations (including data copying and log record manipulation) and persistence operations (including cache-line flushing and store barriers) are necessary, and both cause performance loss in a transaction. To quantify the impact of persistent logging on transaction throughput, we run multiple workloads, including YCSB and TPC-C against Redis, and OLTP-bench (TPC-C, LinkBench and YCSB) against SQLite, with and without persistent logging. For each workload, we use eight client threads. More experiment details are available in Section 5.1. Figure 1 shows the results.
The figure reveals that logging decreases throughput by 8%-49%. For a workload with frequent updates (YCSB-A) or large updates (LinkBench), the logging overhead can be very large (33% and 49% for YCSB-A and LinkBench respectively). Furthermore, we measure the delay (latency overhead) caused by logging operations and persistence operations. Figure 2 shows the results. In the undo logging, the persistence operations account for 56%-78% of the latency overhead; in the redo logging, the persistence operations account for 41%-64% of the latency overhead. (We have to flush all cache blocks, even though some of them may not be in the cache, because there is no mechanism to faithfully track which cache blocks are in the cache; flushing cache blocks not resident in the cache has a similar cost to flushing those resident in the cache.) The overhead of those persistence operations is exposed on the critical path of transactions. The above results show that the persistence operations can significantly impact transaction performance. Thus, we must avoid frequent cache-line flushing.

Figure 1: Throughput when running YCSB workloads (A-F), TPC-C and LinkBench against Redis and SQLite with and without logging.

Figure 2: Breakdown of undo/redo logging latency overhead.
Introducing architecture awareness into the design of a transaction, we want to skip cache flushing by leveraging data reuse information in the cache. If data reuse is low, then there is a very good chance that the data has been evicted from the cache by the hardware-based caching mechanism. We study data reuse in the next section.
2.2 Data Reuse and Dirtiness Analysis
Data in a transaction includes log records and persistent objects. Log records, which are used to maintain transaction atomicity, are seldom reused. We study data reuse at the persistent-object level, and explore whether there are persistent objects with little reuse. These persistent objects are candidates for skipping cache-line flushing.
To study data reuse, we count the number of operations (reads and writes) on each persistent object, and then report the percentage of persistent objects with 0, 1, 2 or more operations, which we call the distribution of data reuse. Figure 3 shows the results. The figure reveals that 78% of persistent objects are used only once or twice in all workloads except YCSB-E. In YCSB-E, about 89% of persistent objects have data reuse no less than 2. Such high data reuse arises because this workload has frequent range queries, each of which covers a range of persistent objects; those ranges overlap with each other, causing high data reuse.

Figure 3: Distribution of data reuse for persistent data objects.
Table 1: Average dirtiness of flushed cache lines.

         |                    Redis                      |        SQLite
         | TPC-C  YCSB-A  YCSB-B  YCSB-D  YCSB-E  YCSB-F | TPC-C  LinkBench  YCSB
         | 0.31   0.55    0.55    0.51    0.51    0.56   | 0.40   0.49       0.46
We also explore the efficiency of cache-line flushing. In particular, we quantify the average dirtiness of flushed cache lines. Table 1 shows the results for undo and redo logging (the two logging mechanisms have the same dirtiness). In general, the dirtiness is less than 0.6 in all workloads, which is low.
Using industry-standard workloads, our analysis of data reuse and dirtiness shows great opportunities to enable high performance transactions by skipping cache-line flushing and improving its efficiency.
3 Design

Motivated by the above performance analysis, we introduce a high performance transaction runtime system that targets reducing the overhead of persistence operations. We describe our design in detail in this section.

3.1 Overview

Archapt avoids cache-line flushing for persistent objects (but
not log records) to enable high performance transactions without disturbing transaction atomicity. Archapt uses an LRU-based method to reason about whether persistent objects are in the cache. With this approach, Archapt does not immediately make a decision on flushing cache blocks for a persistent object when a cache flushing request is issued from a transaction to persist the object. Archapt delays the decision until it collects more information on reads/writes of the persistent object and estimates the locality of the persistent object using the LRU queue. For a persistent object that is estimated not to be resident in the cache, the cache flushing for all of its cache blocks is skipped.
Archapt also features a checksum mechanism. Skipping cache-line flushing for some persistent objects raises the risk of having inconsistent data for committed transactions when a crash happens. To remove the risk, we introduce a checksum mechanism that generates checksums for persistent objects whose cache-line flushing is skipped. The checksum mechanism builds an invariant relationship between cache blocks. Upon a crash, the checksums can be used to detect and correct data inconsistency. We design the mechanism to avoid frequently updating checksums for best performance, and to maximize the capability of correcting data inconsistency.

Figure 4: The architecture of Archapt.
Furthermore, we identify two reasons that account for the low dirtiness of flushed cache lines: unaligned cache-line flushing and uncoordinated cache-line flushing. To address the two problems, Archapt introduces a customized memory allocation mechanism. It clusters persistent objects with the same functionality (i.e., key, field, value, and log) into contiguous cache blocks to coordinate and align cache-line flushing, based on which Archapt improves the efficiency of cache-line flushing.
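The clustering idea can be sketched as a per-kind bump allocator: objects of the same functionality are packed densely into their own pool, so dirty bytes concentrate in fewer lines and one flush covers several small objects. This is an illustrative toy model under our own naming, not Archapt's actual allocator:

```python
CACHE_LINE = 64  # bytes per cache line

class PooledAllocator:
    """Each functionality (key, field, value, log) gets its own pool,
    modeled here as disjoint address regions. Same-kind objects are
    packed back to back, so consecutive small keys share cache lines
    while a key never shares a flushed line with an unrelated value."""
    def __init__(self, kinds=("key", "field", "value", "log"), pool_size=1 << 20):
        # Bump-pointer cursor per pool; pool i starts at i * pool_size.
        self.cursor = {k: i * pool_size for i, k in enumerate(kinds)}

    def alloc(self, kind, size):
        addr = self.cursor[kind]
        self.cursor[kind] += size   # pack densely within the pool
        return addr

alloc = PooledAllocator()
k1 = alloc.alloc("key", 16)
k2 = alloc.alloc("key", 16)    # lands on the same line as k1
v1 = alloc.alloc("value", 100) # a different pool, so a different line
```

Because k1 and k2 fall on one cache line, persisting both costs a single aligned flush with 32/64 dirty bytes, instead of two flushes of sparsely dirty, unaligned lines.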
Overall architecture of Archapt. Archapt has four major components: the transaction management unit, the memory management unit, the persistent management unit, and the history management unit. Figure 4 shows the architecture of Archapt. (1) The transaction management unit includes a set of APIs to establish a transaction (i.e., start and end). Such transaction information is sent to the Archapt runtime to implement transaction semantics. The transaction management unit processes the requested operations of the transaction. It also flushes cache blocks for persistent objects that are estimated to be in the cache. (2) The memory management unit pre-allocates a set of memory pools for coalescing cache blocks and manages the pools to meet memory allocation requests from transactions. (3) The persistent management unit builds checksums for persistent objects for which Archapt skips the cache-line flushing. (4) The history management unit maintains an LRU queue and a hash table, ObjHT. The LRU queue is used to estimate the locality of persistent objects (i.e., in the cache or not); ObjHT is used to provide metadata information for each persistent object in the LRU queue, such as the location in the LRU queue and whether there is any pending cache-line flushing.
3.2 Architecture-Aware Cache-Line Flushing

The architecture-aware cache-line flushing uses an LRU queue to reason about whether a persistent object is in the cache, and skips cache-line flushing for the object if it is not. When a persistent object is updated, its cache blocks are placed into the LRU queue (the queue length is equal to the capacity of the last level cache), and the decision on cache flushing for this persistent object is pending until we have enough information to estimate the residence of the persistent object in the cache, based on the LRU queue. We describe our design in detail as follows.
First, once Archapt receives a request (i.e., a read or write operation on a persistent object) from the client, the transaction management unit queries ObjHT to see if the requested persistent object has a record there. If yes, we infer that the persistent object was accessed recently. The hardware cache may have the persistent object resident in the cache because of a previous operation on the persistent object. If the previous operation was a write operation, flushing cache blocks for the persistent object must be pending; we finish the cache flushing for the previous write operation. Furthermore, we update the location of the persistent object in the LRU queue because of the current request. If the current request is a write operation, we hold the cache flushing for the current request, waiting for the opportunity to skip it in the future.
If the transaction management unit cannot find the requested persistent object's information in ObjHT, we conclude that the persistent object has not been accessed recently. The hardware cache may have evicted the persistent object out of the cache, or never accessed it at all. The transaction management unit then skips any pending cache flushing request for the persistent object. Afterwards, the transaction management unit asks the history management unit to insert the information for the persistent object into the LRU queue, and to suspend the cache flushing for the most recent request if it writes the persistent object. In future transactions, as other persistent objects are accessed, the target persistent object can be evicted out of the LRU queue according to the LRU policy; its record will then be removed from ObjHT and the pending cache flushing will be skipped.
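The flow above can be condensed into a small model of the history management unit. This is a simplified sketch under our own naming (an ordered dict plays both the LRU queue and ObjHT, and objects rather than cache blocks are the queue entries), not Archapt's implementation:

```python
from collections import OrderedDict

class FlushHistory:
    """LRU queue + ObjHT sketch. A write leaves its flush pending; a hit
    on an object with a pending write finishes that flush (the object is
    likely still cached); an object that falls off the queue has its
    pending flush skipped, relying on hardware eviction plus a checksum."""
    def __init__(self, capacity):
        self.capacity = capacity          # models last-level-cache capacity
        self.lru = OrderedDict()          # object id -> flush pending?
        self.skipped = []                 # objects whose flush was skipped

    def access(self, obj, is_write):
        flushed = False
        if obj in self.lru:
            if self.lru.pop(obj):         # previous write still pending:
                flushed = True            # finish that flush now
        self.lru[obj] = is_write          # hold the current write's flush
        while len(self.lru) > self.capacity:
            victim, pending = self.lru.popitem(last=False)  # LRU eviction
            if pending:
                self.skipped.append(victim)  # skip; checksum it instead
        return flushed

h = FlushHistory(capacity=2)
h.access("a", True)
h.access("b", True)
hit_flush = h.access("a", True)   # reuse of "a": pending flush performed
h.access("c", True)               # "b" falls off the queue: flush skipped
```

After this sequence, "a" was flushed once on reuse, while the flush for "b" was skipped entirely and would be covered by a checksum.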
We must carefully maintain the commit status of a transaction. After the completion of a transaction, we cannot label it as committed as in a traditional transaction, because cache flushing for some persistent objects in the transaction may be pending. For such a transaction, we label it as logical commit. Only after all cache flushes for persistent objects in the transaction are either finished or skipped (but with checksums added to the persistent objects; see Section 3.3) do we label the transaction as physical commit.
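The two-stage commit labeling can be sketched as a tiny state machine (class and method names are ours, for illustration only):

```python
class Txn:
    """A transaction is 'logical-commit' once all its operations finish,
    and 'physical-commit' only after every held flush has been either
    performed or skipped with a checksum added."""
    def __init__(self, pending_objects):
        self.pending = set(pending_objects)  # objects with held flushes
        self.state = "active"

    def finish_operations(self):
        self.state = "logical-commit"        # do NOT answer the client yet

    def resolve(self, obj):
        """Call when obj is flushed, or skipped with a checksum added."""
        self.pending.discard(obj)
        if self.state == "logical-commit" and not self.pending:
            self.state = "physical-commit"   # now respond to the client

t = Txn({"x", "y"})
t.finish_operations()
t.resolve("x")
t.resolve("y")   # last held flush resolved -> physical commit
```

The client is answered only at the physical-commit transition, matching the behavior described in the next paragraph.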
A logically committed transaction has completed all read and write operations in the transaction. For such a transaction, the system does not yet respond to the client to announce the transaction commitment. For a physically committed transaction, the system does so, as in the traditional undo or redo logging.
phisticated caching policies. It is possible that a persistent
object is resident in the cache while the LRU estimates oth-
erwise. For this case, skipping cache-line ﬂushing can poten-
tially cause data inconsistency for a physically committed
transaction, when a crash happens. We introduce a check-
sum mechanism to detect and correct inconsistent data (see
Handling log records. Log records, once created for a transaction, are seldom accessed (unless a crash happens). We could skip cache flushing for log records and rely on the hardware-based caching mechanism to implicitly persist them. However, by doing so, some log records that are not timely flushed by the hardware would be lost when a crash happens, raising the risk of losing transaction atomicity before the physical commitment of the transaction. Hence, we do not skip cache-line flushing for log records; they are committed and maintained as in the traditional logging mechanisms.
3.3 Checksum Design
Skipping cache-line flushing for some persistent objects raises the risk of disturbing transaction atomicity: once a transaction is physically committed, there is no strong guarantee on data consistency, because we estimate data locality and the estimation can be inaccurate. To remove the risk, we introduce a checksum mechanism.
We have multiple requirements for the checksum design. First, the checksum mechanism should have the capability to detect data inconsistency in physically committed persistent objects. Second, the checksum mechanism must provide a strong guarantee on data consistency for persistent objects when they are physically committed. Third, the checksum mechanism must be lightweight. Unlike RAID or some ECC schemes that can come with large performance overhead, the overhead of our checksum construction and maintenance should be small, and smaller than the performance benefit of skipping cache flushing for persistent objects. We describe the design of the checksum mechanism in this section.
Our checksums are built with cache blocks of multiple persistent objects from one or more transactions. To build the checksums, cache blocks of multiple persistent objects are logically organized as an M x N matrix (M and N are the dimension sizes of the matrix, discussed later). Those persistent objects have cache flushing skipped. Each column of the matrix corresponds to the cache blocks of a persistent object, where each element of the column is a cache block. Checksums are built as one extra row (the (M+1)-th row) and one extra column (the (N+1)-th column) of the matrix. The matrix becomes (M+1) x (N+1). The extra row, named consistency checksums, is used to detect data inconsistency (one checksum for each of the N persistent objects), and each element of the extra row is a consistency checksum for one column. The extra column, named correlation checksums, builds an invariant relationship between cache blocks across the multiple persistent objects. The correlation checksums can correct data inconsistency. We name the matrix the virtual matrix in the future discussion.

Figure 5: Three examples for checksum creation and correcting data inconsistency: (a) checksum creation; (b) an example of correctable data inconsistency; (c) an example of uncorrectable data inconsistency.
Consistency checksums to detect data inconsistency. When a persistent object with cache flushing skipped is logically committed, we immediately create a consistency checksum. The checksum is a simple summation of the cache blocks of the persistent object. The consistency checksum for a persistent object is implemented as an extra cache block added to the persistent object, and is immediately flushed for consistency once it is created. When a crash happens, for each persistent object with a consistency checksum, we recalculate the checksum and compare it with the existing one in persistent memory. If there is a mismatch, then data inconsistency is detected.
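The detection step can be sketched in a few lines, modeling cache blocks as integers and the checksum as their plain sum, per the paper's description (function names are ours):

```python
def consistency_checksum(blocks):
    """Per-object checksum: a summation over the object's cache blocks,
    stored as an extra cache block appended to the object."""
    return sum(blocks)

def is_consistent(blocks, stored_checksum):
    """Recovery-time check: recompute the sum and compare it with the
    copy that was flushed to persistent memory."""
    return consistency_checksum(blocks) == stored_checksum

blocks = [11, 22, 33]                  # cache blocks modeled as integers
saved = consistency_checksum(blocks)   # flushed immediately at logical commit
blocks[1] = 99                         # simulate a lost update after a crash
ok = is_consistent(blocks, saved)      # mismatch -> inconsistency detected
```

Any single corrupted block changes the sum, so the mismatch flags the object for correction by the correlation checksums described next.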
The consistency checksum mechanism is very effective at detecting data inconsistency for a persistent object: any cache block of the persistent object with data inconsistency easily causes a checksum mismatch. In our evaluation with ten workloads, hundreds of millions of transactions, and 10,000 crash tests, the consistency checksum mechanism detects all data inconsistency.
Correlation checksums to correct data inconsistency. A correlation checksum, as an element of the (N+1)-th column of the virtual matrix, is a summation of the cache blocks of a row in the virtual matrix. The (N+1)-th column is composed of M correlation checksums, each of which is for one row. Since the cache blocks of a row come from N persistent objects, a correlation checksum aims to correct data inconsistency for any of the N persistent objects. Correlation checksums (the (N+1)-th column) are immediately flushed out of the cache to commit, once they are fully built.

Once a crash happens, we recalculate the correlation checksums and compare them with the existing ones in persistent memory. If there is a mismatch in any correlation checksum (say the element m_k(N+1) of the virtual matrix), then the corresponding row (the row k) must have data inconsistency. Using the consistency checksums, we can reason which element of the row k is inconsistent. Assume that the element m_kj is inconsistent; the element is corrected by the following:

    m_kj = m_k(N+1) - Σ_{i=1, i≠j}^{N} m_ki        (1)

where m_k(N+1) is the correlation checksum committed in persistent memory.

N, the column size of the virtual matrix, is the number of persistent objects we use to build the virtual matrix. A smaller N causes more frequent creation of checksums and hence larger performance overhead, but reduces the possibility of losing updates to persistent objects (because checksums are frequently committed); a larger N has smaller performance overhead, but increases that possibility. We empirically choose N as 16 to strike a balance. In other words, we commit correlation checksums for N persistent objects together.
M, the row size of the virtual matrix, is determined by the largest persistent object among the N persistent objects: M is the number of cache blocks in the largest persistent object. For the shorter persistent objects, their corresponding columns in the virtual matrix are padded with zero-valued elements to make them as long as the largest persistent object.
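The construction and Equation 1 can be sketched end to end (cache blocks modeled as integers; function names are ours, and indices are 0-based in the code while the paper counts from 1):

```python
def build_virtual_matrix(objects):
    """objects: N lists of cache blocks (ints). Pads short objects with
    zeros to M rows, appends the correlation column (row sums) and the
    consistency row (column sums), yielding an (M+1) x (N+1) matrix."""
    n = len(objects)
    m = max(len(o) for o in objects)
    cols = [o + [0] * (m - len(o)) for o in objects]      # zero padding
    matrix = [[cols[j][i] for j in range(n)] for i in range(m)]
    for row in matrix:
        row.append(sum(row))                              # correlation checksums
    matrix.append([sum(matrix[i][j] for i in range(m))
                   for j in range(n + 1)])                # consistency checksums
    return matrix

def correct(matrix, k, j, n):
    """Equation 1: rebuild element m[k][j] from the row's correlation
    checksum minus the row's other (assumed-consistent) blocks."""
    return matrix[k][n] - sum(matrix[k][i] for i in range(n) if i != j)

# Four persistent objects with four, four, three, and four cache blocks.
objs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11], [12, 13, 14, 15]]
vm = build_virtual_matrix(objs)
good = vm[2][1]                          # remember a block's true value
vm[2][1] = 0                             # simulate an inconsistent block
vm[2][1] = correct(vm, 2, 1, len(objs))  # recovered via Equation 1
```

Detection (a column sum mismatch) locates the bad column, and the row's correlation checksum restores the lost block, mirroring the Figure 5.a scenario.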
Figure 5.a shows an example to further explain the idea of checksums. In this example, we have four persistent objects with four, four, three, and four cache blocks respectively. Hence, the virtual matrix is 5 x 5, and each column is for one persistent object. The consistency checksums are in the fifth row, and the correlation checksums are in the fifth column. The consistency checksums detect data inconsistency for the first-fourth persistent objects respectively; the correlation checksums can be used to correct data inconsistency for the cache blocks in the first-fourth rows. Suppose the cache block CB#32 is inconsistent, as detected by a consistency checksum. The inconsistency can be corrected by the correlation checksum CkSum7:

    CB#32 = CkSum7 - CB#31 - CB#33 - CB#34
Enabling a high performance checksum mechanism. The checksum mechanism does not cause large performance overhead, for the following reasons. First, creating checksums is not on the critical path of a transaction. A checksum for a persistent object is created only after the persistent object is estimated to have been evicted out of the LRU queue, which indicates that the persistent object is highly likely not to be accessed in the near future. Also, creating the checksum for the persistent object and committing the checksum later do not block the execution of other transactions. Hence, checksum creation can happen in parallel with other operations, which removes it from the critical path.
Second, checksums do not need to be updated frequently. Once a persistent object is updated, its checksums must be recalculated and updated to maintain the validity of the checksums. Such updates can cause performance overhead. This performance problem is common in other mechanisms, such as ECC or RAID. However, it is not a problem in our design, because we use persistent objects that are not frequently accessed (according to the LRU queue) to build checksums. Updating checksums therefore does not happen often.
Third, the overhead of flushing the cache blocks of checksums can be smaller than that of flushing cache blocks for persistent objects; hence, the performance benefit of the checksum mechanism can outweigh its overhead. For an M x N virtual matrix, we need to flush (M + N) cache blocks to make the checksums consistent. In contrast, to flush the N persistent objects in the virtual matrix, we would need to flush at least (M + N - 1) cache blocks (assuming that the largest persistent object has M cache blocks while each of the other persistent objects has just one cache block), and at most (M x N) cache blocks (assuming that each of the persistent objects has M cache blocks). In fact, when we build a virtual matrix, we try to use persistent objects with similar sizes, such that in nearly all cases the number of cache blocks to flush for the persistent objects is close to (M x N). In other words, in nearly all cases, the number of cache blocks to flush for checksums (M + N) is significantly smaller than that to flush for the persistent objects (i.e., M x N). Although we need to update the checksums and flush again when there is any update to the persistent objects, that does not happen very often, because of the second reason discussed above. In our evaluation, using the checksum mechanism always saves flushes of cache blocks, hence bringing performance benefits.
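The flush-count comparison is easy to check numerically. Using the paper's empirical choice N = 16 with a hypothetical M = 4:

```python
def checksum_flushes(m, n):
    # N consistency checksums (one per object) + M correlation
    # checksums (one per row) = M + N cache blocks.
    return m + n

def object_flushes_worst(m, n):
    # Worst (and, with similarly sized objects, typical) case: every
    # object has M cache blocks, so M * N blocks would be flushed.
    return m * n

m, n = 4, 16
saved = object_flushes_worst(m, n) - checksum_flushes(m, n)
```

Here flushing checksums costs 20 cache blocks versus 64 for the objects themselves, saving 44 flushes per committed virtual matrix.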
Analysis of the capability of correcting data inconsistency. The correlation checksums have a strong capability to correct data inconsistency. If a cache block in a row is inconsistent, we can easily correct it using Equation 1. If more than one cache block in a row is inconsistent, we can use the consistency checksums and the correlation checksums together to correct them. Note that a consistency checksum, although mainly used to detect data inconsistency, can also be used to correct the inconsistency of cache blocks that fall into the same column, using a method similar to Equation 1 for the correlation checksums.
Figure 5.b gives an example where we have two incon-
sistent cache blocks in the first row. Using the correlation
checksum alone is not able to correct them. However, using
the consistency checksums, we can correct at least one of
the inconsistent cache blocks. Afterwards, we can use
CkSum5 to correct the other.
It is possible that a row has multiple inconsistent cache
blocks and the columns where those inconsistent cache blocks
reside have other inconsistent cache blocks. Figure 5.c
gives an example. In this example, the first row has two
inconsistent cache blocks. They reside in columns two and
four. These two cache blocks are not correctable by the
correlation checksum. Meanwhile, each of columns two and
four has another inconsistent cache block, making the
consistency checksums incapable of correcting the
inconsistent cache blocks too.
In this case, no checksum, including the combination of
consistency checksums and correlation checksums, can cor-
rect those cache blocks. However, such a case is extremely
rare: the inconsistent cache blocks must "coincidentally"
fall into the same rows and columns together. In our
evaluation with ten workloads, hundreds of millions of
transactions, and 100,000 crash tests, our checksums cor-
rected all data inconsistencies for committed transactions.
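The row/column correction logic can be illustrated with a small sketch (an illustration only: XOR parity stands in for the paper's checksums, and the matrix dimensions and block values are made up):

```python
# Sketch: 2D XOR-parity over a 3x4 "virtual matrix" of cache blocks.
# Row checksums play the role of the correlation checksums; column
# checksums play the role of the consistency checksums.
from functools import reduce

ROWS, COLS = 3, 4
matrix = [[(r * COLS + c + 1) * 17 for c in range(COLS)] for r in range(ROWS)]

def xor(vals):
    return reduce(lambda a, b: a ^ b, vals, 0)

row_ck = [xor(row) for row in matrix]                              # per-row checksum
col_ck = [xor(matrix[r][c] for r in range(ROWS)) for c in range(COLS)]

# Corrupt two blocks in row 0 (columns 1 and 3): the row checksum alone
# cannot separate them, but each column checksum pins down one block,
# mirroring the Figure 5.b scenario.
saved = (matrix[0][1], matrix[0][3])
matrix[0][1] ^= 0xBEEF
matrix[0][3] ^= 0xF00D

# Correct each corrupted block from its column checksum: the block value
# is the XOR of the column checksum with the other blocks in that column.
for c in (1, 3):
    matrix[0][c] = col_ck[c] ^ xor(matrix[r][c] for r in range(1, ROWS))

assert (matrix[0][1], matrix[0][3]) == saved  # both blocks recovered
```

Once one of the two blocks is fixed through its column, the remaining single error in the row could equally be recovered from the row checksum alone.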
After a crash happens, we examine
persistent objects in persistent memory. If they do not have
checksums and the transactions are physically committed,
then the persistent objects must be consistent without any
skipped cache flushing. If they do not have checksums and the
transactions are not physically committed, then the transaction
updates are cancelled and the persistent objects are restored
using traditional logs.
If the persistent objects have checksums and the trans-
actions are physically committed, then we use consistency
checksums to detect consistency of each persistent object. If
there is any inconsistency, then we use correlation and con-
sistency checksums to correct them. If the data inconsistency
is not correctable, which is very rare, then the corresponding
transaction is aborted. To avoid incorrectable data inconsis-
tency after physical commitment, we could add another row
and column as consistency and correlation checksums. The
new and old checksums, each of which is built upon half of
rows or columns, can improve correction for those rare cases,
but come with larger performance overhead. The study of this
tradeoff is out of the scope of this paper, because the current
checksum mechanism already works very well in our evaluation.
Ensuring transaction atomicity.
Before the physical com-
mitment, Archapt relies on logs, as in the traditional undo
and redo logging mechanisms, to ensure atomicity. After the
physical commitment, the persistent objects are successfully
updated with the assists of checksums and the atomicity is
enforced. In the extremely rare case where the persistent ob-
ject is not consistent and the checksum mechanism cannot
correct it, we rely on a traditional checkpoint mechanism and
go back to the last valid checkpoint to ensure atomicity.
3.4 Coalescing of Cache-Line Flushing
To reduce the overhead of cache-line ﬂushing, we have two
methodologies: one is to avoid cache ﬂushing for persistent
objects as in Section 3.2; the other is to coalesce cache-line
flushing to avoid low dirtiness of flushed cache lines.

Figure 6: Uncoordinated cache-line flushing in the two-level hash
table in Redis.

After investigating two common databases (Redis and SQLite), we
find two reasons that account for the low dirtiness of flushed
cache lines: unaligned cache-line flushing and uncoordinated
cache-line flushing.

The unaligned cache-line flushing happens when a persistent
object is not aligned with cache lines. For example, a
persistent object is 100 bytes. Ideally, the object should use
two cache blocks (assuming that the cache block size is 64
bytes). However, the object could not be aligned well during
the memory allocation, and uses three cache blocks. Once
the object is modified, we have to flush three cache lines
instead of two. This easily increases the number of cache-line
flushes by 50%. We find this problem in both Redis and SQLite.
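The arithmetic behind this example can be sketched as follows (assuming the common 64-byte cache-line size; the starting offsets are illustrative):

```python
# Number of 64-byte cache lines touched by an object of `size` bytes
# starting at byte offset `addr`.
LINE = 64

def lines_touched(addr: int, size: int) -> int:
    first = addr // LINE               # line holding the first byte
    last = (addr + size - 1) // LINE   # line holding the last byte
    return last - first + 1

print(lines_touched(0, 100))    # aligned start: 2 lines to flush
print(lines_touched(30, 100))   # unaligned start: 3 lines to flush
```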
The uncoordinated cache-line ﬂushing happens when mul-
tiple, associated data objects are allocated into separate cache
blocks. The multiple data objects are associated, because they
are often updated together. If they are allocated into the same
cache blocks, then we can reduce the number of cache-line
ﬂushing. This problem happens more often in NoSQL sys-
tems, such as a key-value store system. Figure 6 gives an
example in Redis where the uncoordinated cache-line flushing
happens. The figure shows the basic data structure of
Redis. As a key-value store system, Redis enables secondary
indexing based on a two-level hash table. In the second level,
Redis has a set of ﬁeld-value pairs. Each ﬁeld-value pair stores
the field ID (F_n) and their data. For each field-value pair,
the ﬁeld and value objects are allocated separately on differ-
ent cache blocks. This is inefﬁcient on persistent memory,
because the ﬁeld and value have to be persisted by ﬂushing
separate cache lines. The size of the ﬁeld object is small (usu-
ally less than one cache block), and the ﬁeld object is usually
updated with the value object together. Therefore, coalescing
the field and value objects into fewer contiguous cache
blocks can reduce the number of cache-line flushes.
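A back-of-the-envelope sketch of the saving (the 16-byte field and 40-byte value sizes are illustrative assumptions, not Redis's actual layout):

```python
# Compare flush counts for a field-value pair placed on separate
# cache blocks versus coalesced into one contiguous region.
LINE = 64

def lines_touched(addr, size):
    return (addr + size - 1) // LINE - addr // LINE + 1

FIELD, VALUE = 16, 40  # illustrative sizes for a small field-value pair

# Uncoordinated: field and value land on separate cache blocks,
# e.g. at offsets 0 and 128; persisting the pair flushes two lines.
separate = lines_touched(0, FIELD) + lines_touched(128, VALUE)

# Coalesced: the value is placed right after the field; as long as the
# pair fits in one line, persisting it flushes a single line.
together = lines_touched(0, FIELD + VALUE)

print(separate, together)  # 2 1
```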
To address the above two problems, we introduce a new
memory allocation mechanism to improve the efﬁciency of
cache-line ﬂushing. We use Redis as an example again. The
original implementation of Redis uses the traditional memory
allocation, without considering the implications of memory
allocation on cache flushing. Whenever a key or a complex
value is created, Redis allocates the corresponding memory
space on demand, without coordination with other memory
allocations.
Table 2: Archapt APIs

API Name                   Functionality
…()                        Pre-allocate memory pools and initialization
Archapt_Tx_Start()         Identify the beginning of a transaction
…()                        Identify the end of a transaction
Archapt_Tx_LCommit()       Logical commitment
…(type, size_t size)       Memory allocation for coalescing cache-line flushing
…(type, size_t size)       Free memory allocation for coalescing cache-line flushing
In our new design for Redis, we pre-allocate three memory
pools without allocating memory on demand. The three
pools (including field_value_pool) meet the memory allocation
requests for keys, field-value pairs, and log records,
respectively. We use the three memory pools, instead of
allocating memory on demand, because the three pools can use
separate memory allocation methods to minimize cache-line
flushing; the three memory pools can also cluster objects with
the same functionality (i.e., key, field, value, and log) into
contiguous cache blocks to coordinate cache-line flushing.
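The pool-based allocation can be pictured as simple per-kind bump allocators (a simplified sketch; only the name field_value_pool appears in the text, so the other pool names and the allocator logic here are assumptions):

```python
# Sketch: per-kind bump allocators that keep same-kind objects in
# contiguous cache blocks, so associated objects share cache lines.
LINE = 64

class Pool:
    def __init__(self, capacity):
        self.buf = bytearray(capacity)  # pre-allocated backing region
        self.top = 0                    # bump pointer

    def alloc(self, size):
        off = self.top
        self.top += size
        return off                      # offset into this pool's region

pools = {"key": Pool(4096), "field_value": Pool(4096), "log": Pool(4096)}

# Allocating a field and its value from the same pool back-to-back
# places them in the same cache block whenever they fit in one line.
f = pools["field_value"].alloc(16)
v = pools["field_value"].alloc(40)
assert f // LINE == v // LINE   # one flush covers both objects
```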
Archapt is implemented as a user-level
library to provide persistence support, and it can be integrated
with existing log-based transaction implementations, such as
Intel PMDK, Redis and SQLite. Archapt includes a set
of APIs, defined in Table 2.
The initialization API is used to pre-allocate multiple memory
pools for coalescing cache-line flushing (Section 3.4) and to
initialize critical data structures (e.g., the LRU queue and
QbjHT). The transaction start/end APIs are
used to identify transactions for the Archapt runtime, and
can be embedded into the existing transaction start/finalization
code. Archapt_Tx_LCommit() is used to replace
the traditional transaction commit to implement the logical
commit for Archapt. The memory allocation and free APIs
are used to replace the traditional memory allocation and free
APIs in the transaction implementation; they allocate and
free memory from/to the pre-allocated
memory pools for coalescing cache-line flushing.
Table 3: The percentage of different operations in the evaluated
workloads; "R", "U", "I", "RU", "S" and "D" stand for read, update,
insert, read & update, scan, and delete operations respectively.

         Redis: TPCC, YCSB A-F                SQLite
Ops   TPCC    A    B    C    D    E    F   TPCC  LinkBH  YCSB
R        8   50   95  100   95    -   50      8      64    50
U       47   50    5    -    -    -    -     47      16    10
I       45    -    -    -    5    5    -     45      12     5
RU       -    -    -    -    -    -   50      -       -    10
S        -    -    -    -    -   95    -      -       4    15
D        -    -    -    -    -    -    -      -       4    10

Archapt includes a number of optimization techniques to
enable high performance and thread safety. These techniques
include SIMD vectorization of checksum creation and update,
a high-performance concurrent lock-free hash table, and a
high-performance LRU queue based on circular buffers. In
addition, to avoid contention on the LRU queue from multiple
transactions (multiple threads), Archapt creates a transaction
management unit for each transaction, and the transaction
management unit puts information on reads/writes of persistent
objects into a local buffer. By fetching the information from
those local buffers, the history management unit collectively
updates the LRU queue.
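The contention-avoidance scheme can be sketched as follows (simplified, assumed structures; the actual implementation uses a concurrent lock-free hash table and circular buffers rather than Python containers):

```python
# Sketch: per-transaction units record accesses locally; a history
# manager drains the local buffers and updates one LRU queue in batch.
from collections import OrderedDict, deque

class TxUnit:
    """Per-transaction management unit: records object accesses locally."""
    def __init__(self):
        self.local = deque()

    def record(self, obj_id):
        self.local.append(obj_id)   # no shared-state contention here

class HistoryManager:
    """Fetches local buffers and collectively updates the LRU queue."""
    def __init__(self):
        self.lru = OrderedDict()    # object id -> None, oldest first

    def drain(self, tx_units):
        for tx in tx_units:
            while tx.local:
                obj = tx.local.popleft()
                self.lru.pop(obj, None)
                self.lru[obj] = None    # move to the most-recent end

tx1, tx2 = TxUnit(), TxUnit()
tx1.record("A"); tx1.record("B")
tx2.record("A")
hm = HistoryManager()
hm.drain([tx1, tx2])
print(list(hm.lru))  # ['B', 'A']  (A was touched last during draining)
```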
5.1 Experimental Methodology
The goal of the evaluation is to study the performance
of Archapt with a range of workloads with different
characteristics. We use both NoSQL and SQL systems (Redis and
SQLite). We use four persistent transaction mechanisms for
evaluation: undo logging with Archapt, undo logging, the
existing rollback journal system in SQLite, and the existing
AOF mechanism (logging every write operation) in Redis. We
do not show the results of redo logging in our evaluation, be-
cause of the space limitation. But the results of redo logging
are similar to those of undo logging. Our implementation of
the undo logging is based on Intel PMDK. For the rollback
journal and AOF, whenever a transaction commitment
happens, we commit the transaction updates to memory (not
hard drive), in order to enable fair performance comparison.
We run YCSB (A-F) and TPC-C [2, 28] against
Redis, and run OLTP-bench (particularly, TPC-C,
LinkBench and YCSB) against SQLite. These workloads
are chosen for Redis and SQLite respectively, because they
can easily run on the two database systems without any modi-
fication. Table 3 gives some details for these workloads. For
YCSB running against Redis, we perform 10M operations;
for other workloads, we use the default conﬁgurations.
All experiments are performed on a 24-core machine with
two 12-core Xeon Gold 6126 processors, 187 GB memory,
and a 19.25 MB last-level cache. We use DRAM to emulate
NVM, since NVDIMMs were not on the market at the time of
preparing this manuscript. For other, slower NVMs, the bene-
fits of Archapt would only be larger because of the reduction
of cache-line flushing. To flush cache lines, we use one of
the most recent and efficient cache-line flushing instructions;
other advanced instructions were not available in processors
on the market at the time of preparing this manuscript.
5.2 Experimental Results
We use different numbers of threads to
evaluate the throughput and latency of the four persistent transac-
tion mechanisms. Figure 7 and Figure 8 show the results.
Figure 7 reveals that undo with Archapt has the highest
throughput among the four transaction mechanisms. On av-
erage, undo with Archapt offers 22%, 60% and 35% better
throughput than the traditional undo logging, the rollback
journal system and AOF respectively. The biggest improve-
ment happens on YCSB-A, YCSB-F and LinkBench, which
are write-intensive. For these three workloads, undo with Ar-
chapt offers up to 36%, 80% and 58% higher throughput than
the other three transaction mechanisms. As we increase the
number of client threads, Archapt consistently performs best.
With 12 threads, Archapt offers 22%, 34% and 57% better
throughput than the traditional undo logging, the rollback jour-
nal system and AOF respectively. For the read-only workload
(YCSB-C), Archapt cannot offer performance improvement,
but the throughput is at most 1% lower than other transaction
mechanisms, which is small.
Figure 8 shows the 99th-percentile latency for the four
transaction mechanisms. We run eight client threads for each
workload. Archapt has the shortest latency among the four
transaction mechanisms. On average, Archapt decreases the
tail latency by 7%, 9% and 20%, compared with the tradi-
tional undo logging, AOF and the rollback journal system
respectively. Such performance improvement in tail latency
comes from the reduction of unnecessary cache ﬂushing. For
the read-only workload (YCSB-C) that offers no opportunity
to reduce cache ﬂushing, Archapt still provides comparable
performance to the other three transaction mechanisms.
Quantifying the effectiveness of reducing cache-line
flushing. We measure the number of cache-line flushes be-
fore and after applying Archapt to undo logging. We only
show the results for undo logging, because it has fewer cache-
line flushes than the rollback journal and AOF; hence, reduc-
ing cache-line flushing for undo logging is more challenging
than doing so for the rollback journal and AOF. Figure 9
shows the number of reduced cache-line flushes after ap-
plying Archapt. The numbers in the figure are normalized
by the total numbers of cache-line flushes before applying
Archapt. The figure does not include YCSB-C, because this
workload is read-only and does not need cache flushing. The
figure also isolates the contributions of the two techniques
(the LRU-based approach and the coalescing of cache-line
flushing) to compare their effectiveness.
The figure reveals that Archapt greatly reduces the number
of cache-line flushes, by 66% on average. YCSB-E has less
reduction in the number of cache-line flushes than other
workloads, because it has more data reuse in persistent objects,
which provides fewer opportunities to skip cache-line flushing.
Figure 7: Throughput with the four transaction mechanisms, as the
number of threads varies from four to twelve ("T" = "threads").
(a) YCSB A-F (Redis). (b) TPC-C (Redis). (c) Three workloads (SQLite).

Figure 8: 99th-percentile transaction latency with eight threads.
(a) Redis. (b) SQLite.
This result is consistent with that shown in Figure 3.
We further notice that both techniques effectively reduce
cache-line ﬂushing. The contribution of the LRU-based ap-
proach to the reduction of cache-line ﬂushing varies between
different workloads, because different workloads have differ-
ent data reuse of persistent objects. Furthermore, compared
with Redis, SQLite gains less benefit from the coalescing of
cache-line flushing. This is because the strict SQL data
structures in SQLite already have some existing optimizations
for cache-line alignment.
Quantifying the dirtiness of flushed cache lines. We also
measure the distribution of the dirtiness of flushed cache lines
before and after applying Archapt. The corresponding figure
does not include YCSB-C, because this workload is read-only
and does not need cache flushing. In general, our memory
allocation optimization works in every workload. The average
improvement is 12%. Among all workloads, YCSB-E (Redis)
has the largest increase: the average cache-line dirtiness
increases from 51% to 68%.
Figure 9: The numbers of reduced cache-line flushing after applying
Archapt to undo logging. The numbers are normalized by the total
numbers of cache-line flushing before using Archapt.

Random crash tests. We examine data consistency in
physically committed transactions using random crash tests.
We aim to evaluate the effectiveness of our checksum mecha-
nism. We use an NVM crash emulator, for two
reasons: (1) a large number of crash tests could affect the
reliability of our physical machine; and (2) DRAM, used for
our NVM emulation, loses data when a crash happens. The
emulator is based on PIN and emulates a configurable LRU
cache hierarchy. The crash emulator retains data in the
emulated main memory after a crash is triggered, allowing us
to examine data consistency. The crash emulator randomly
triggers crashes. In our evaluation, the cache capacity (19.25
MB in the last level cache) and associativity (11) in the crash
emulator are the same as those in our physical machine. For
each workload, we perform 100 crash tests.

Figure: Distribution of the dirtiness of flushed cache lines with
and without Archapt; "Y" and "N" stand for using Archapt and
without using Archapt, respectively.

Table 4: Crash test results. The numbers for each workload in the
table are for 100 crash tests.

System   Workload    # inconsistent   # detected and   # not
                     objects          corrected        correctable
Redis    TPC-C       693              693              0
Redis    YCSB-A      901              901              0
Redis    YCSB-B      0                0                0
Redis    YCSB-D      0                0                0
Redis    YCSB-E      1024             1024             0
Redis    YCSB-F      295              295              0
SQLite   TPC-C       782              782              0
SQLite   LinkBench   811              811              0
SQLite   YCSB        495              495              0
Table 4 shows the results. The table reports the total number
of inconsistent persistent objects measured by the crash emu-
lator, and the total numbers of inconsistent persistent objects
detected and corrected by the checksum mechanism over 100
crash tests. We have a couple of observations. First, the checksum
mechanism successfully detects and corrects all inconsistent
persistent objects for all workloads. This demonstrates that
the checksum mechanism is highly effective. Second,
we do not have a large number of inconsistent persistent ob-
jects after crashes: We have only hundreds of inconsistent
persistent objects, while each workload updates millions of
persistent objects. For YCSB-B and YCSB-D, we do not even
ﬁnd any inconsistent persistent objects. This indicates that
our LRU-based approach successfully estimates data locality,
such that skipping cache-line ﬂushing causes only a small
number of inconsistent persistent objects after transactions
are physically committed.
6 Related Work
Persistency in NVM has received significant research attention
recently. Previous work on runtime systems for persistent
memory transactions [8, 12, 20, 21, 24, 27, 32, 35, 44, 45]
exploits software-based approaches at the user level to pro-
vide safe, transactional access to non-volatile memories.
Some of those runtime systems employ undo logging mecha-
nisms [8,12,20,21,24,27], while others employ redo logging
mechanisms [32, 44, 45]. Our work can be applied to those
logging-based mechanisms to improve the performance of
persistent memory transactions.

Enabling crash consistency on NVM. Enabling crash
consistency on NVM can be expensive, because of cache-
line flushing and persist barriers. Strict persistency
enforces crash consistency by strictly enforcing write orders
in persistent memory and can cause a large performance loss.
Some work [13, 25, 26, 37, 39] relaxes the constraints on write
orders to improve performance. Different from the above
existing work, we do not relax write orders, but optimize
performance by skipping and coalescing cache-line ﬂushing.
Detection and correction of data errors. Previous ef-
forts on RAID [9, 23, 36, 41] and ECC [17, 29]
exploit hardware- and software-based approaches to detect
and correct data errors, but the relatively large runtime over-
head and possible hardware modifications make them hard
to apply to detecting and correcting data inconsistency on
NVM for a transaction mechanism.
Algorithm-based fault tolerance, as an efﬁcient software
mechanism to correct data errors, has been used for fault toler-
ance in high performance computing (HPC) [10,11,15,16,19,
22, 47, 48]. However, these techniques are customized for
specific numerical algorithms and can be hard to apply to
transactional workloads in database systems.
The lazy persistency proposed by Alshboul et al. is very
relevant to our work. They focus on computation-intensive
HPC applications and skip cache-line flushing for all dirty
data objects, relying on periodic cache flushing of all
dirty cache blocks. Our work differs from theirs in two
respects. First, lazy persistency does not have a systematic
way to decide which cache-line flushes can be skipped.
Second, lazy persistency cannot correct data inconsistency
after a crash. These two limitations cause unpredictable data
loss in committed transactions; hence, lazy persistency cannot
be reliably applied to transaction systems. Archapt avoids
these limitations with its LRU-based approach and well-designed
checksums.
7 Conclusions
Enabling high performance transactions is critical to unlock
the performance benefit of persistent memory for
many applications. In this paper, we present Archapt, an
architecture-aware, high performance transaction runtime sys-
tem for persistent memory. Archapt reduces the number of
cache-line flushes to improve the performance of transactions.
Archapt estimates whether the cache blocks of a persistent
object are in the cache to determine the necessity of cache-line
flushing. Relying on a checksum mechanism to detect data
inconsistency and correct it when possible, Archapt provides
strong data consistency. Archapt also coalesces cache blocks
with low dirtiness to improve the efficiency of cache-line
flushing. Our results show that Archapt reduces cache flushing
by 66% and improves system throughput by 22% on average
(42% at most), when running YCSB (A-F) and TPC-C against
Redis, and OLTP-bench (TPC-C, LinkBench and YCSB) against
SQLite (using the traditional undo logging as the baseline).
Python TPC-C.
Yahoo! Cloud Serving Benchmark.
Grant Allen and Mike Owens. The Deﬁnitive Guide to
SQLite. Apress, Berkely, CA, USA, 2nd edition, 2010.
M. Alshboul, J. Tuck, and Y. Solihin. Lazy Persistency:
A High-Performing and Write-Efficient Software Persistency
Technique. In 2018 ACM/IEEE 45th Annual International
Symposium on Computer Architecture, June 2018.
Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba
Borthakur, and Mark Callaghan. Linkbench: A database
benchmark based on the facebook social graph. In Pro-
ceedings of the 2013 ACM SIGMOD International Con-
ference on Management of Data, 2013.
J. L. Carlson. Redis in Action. Manning Publications.
Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud
Bhandari. Atlas: Leveraging Locks for Non-volatile
Memory Consistency. In Proceedings of the 2014 ACM
International Conference on Object Oriented Program-
ming Systems Languages & Applications, 2014.
Peter M. Chen, Edward K. Lee, Garth A. Gibson,
Randy H. Katz, and David A. Patterson. RAID: High-
performance, Reliable Secondary Storage. ACM Com-
put. Surv., 1994.
Zizhong Chen. Algorithm-based Recovery for Iterative
Methods without Checkpointing. In International Symposium
on High Performance Distributed Computing, 2011.
Zizhong Chen. Online-ABFT: An Online Algorithm
Based Fault Tolerance Scheme for Soft Error Detection
in Iterative Methods. In ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, 2013.
Joel Coburn, Adrian M. Caulﬁeld, Ameen Akel,
Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and
Steven Swanson. NV-Heaps: Making Persistent Ob-
jects Fast and Safe with Next-generation, Non-volatile
Memories. In Proceedings of the Sixteenth International
Conference on Architectural Support for Programming
Languages and Operating Systems, 2011.
Jeremy Condit, Edmund B. Nightingale, Christopher
Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick
Coetzee. Better I/O Through Byte-addressable, Persistent
Memory. In Proceedings of the ACM SIGOPS 22nd Symposium
on Operating Systems Principles, 2009.
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu
Ramakrishnan, and Russell Sears. Benchmarking Cloud
Serving Systems with YCSB. In Proceedings of the 1st
ACM Symposium on Cloud Computing, 2010.
Teresa Davies and Zizhong Chen. Correcting Soft Errors
Online in LU Factorization. In International Symposium
on High-Performance Parallel and Distributed Computing.
Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding,
and Zizhong Chen. High Performance Linpack Bench-
mark: A Fault Tolerant Implementation without Checkpointing.
In International Conference on Supercomputing.
T. Dell. A White Paper On The Beneﬁts Of Chipkill-
Correct ECC for PC Server Main Memory. Technical
report, IBM Microelectronics Division, 1997.
Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino,
and Philippe Cudre-Mauroux. OLTP-Bench: An Exten-
sible Testbed for Benchmarking Relational Databases.
Proc. VLDB Endow., 2013.
Peng Du, Aurelien Bouteiller, George Bosilca, Thomas
Herault, and Jack Dongarra. Algorithm-based Fault Tol-
erance for Dense Matrix Factorizations. In Proceedings
of the 17th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, 2012.
Subramanya R. Dulloor, Sanjay Kumar, Anil Keshava-
murthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran,
and Jeff Jackson. System Software for Persistent Mem-
ory. In Proceedings of the Ninth European Conference
on Computer Systems, 2014.
E. R. Giles, K. Doshi, and P. Varman. SoftWrAP: A
Lightweight Framework for Transactional Support of
Storage Class Memory. In 2015 31st Symposium on
Mass Storage Systems and Technologies, May 2015.
Kuang-Hua Huang and Jacob A. Abraham. Algorithm-Based
Fault Tolerance for Matrix Operations. IEEE Transactions
on Computers, 1984.
Kai Hwang, Hai Jin, and Roy Ho. RAID-x: A New Dis-
tributed Disk Array for I/O-centric Cluster Computing.
In Proceedings the Ninth International Symposium on
High-Performance Distributed Computing, 2000.
Intel. Persistent Memory Development Kit.
Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis
Viglas. Efﬁcient Persist Barriers for Multicores. In
International Symposium on Microarchitecture, 2015.
A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley,
S. Liu, P. M. Chen, and T. F. Wenisch. Delegated Persist
Ordering. In 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture, 2016.
Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen,
and Thomas F. Wenisch. High-Performance Transac-
tions for Persistent Memories. In Proceedings of the
Twenty-First International Conference on Architectural
Support for Programming Languages and Operating Systems, 2016.
Scott T. Leutenegger and Daniel Dias. A Modeling
Study of the TPC-C Benchmark. SIGMOD Record, 1993.
Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S.
Vetter. Rethinking Algorithm-Based Fault Tolerance
with a Cooperative Software-Hardware Approach. In
ACM/IEEE International Conference for High Perfor-
mance Computing, Networking, Storage and Analysis, 2013.
Sheng Li, Doe H. Yoon, Ke Chen, Jishen Zhao, Jung H.
Ahn, Jay B. Brockman, Yuan Xie, and Norman P. Jouppi.
MAGE: Adaptive Granularity and ECC for Resilient and
Power Efﬁcient Memory Systems. In International Con-
ference for High Performance Computing, Networking,
Storage and Analysis, 2012.
S. Lu, H. Li, and K. Miyase. Progressive ECC Tech-
niques for Phase Change Memory. In 2018 IEEE 27th
Asian Test Symposium, 2018.
Y. Lu, J. Shu, and L. Sun. Blurred Persistence in Trans-
actional Persistent Memory. In 2015 31st Symposium
on Mass Storage Systems and Technologies, 2015.
Y. Lu, J. Shu, L. Sun, and O. Mutlu. Loose-Ordering
Consistency for Persistent Memory. In 2014 IEEE
32nd International Conference on Computer Design, 2014.
Virendra J. Marathe, Achin Mishra, Amee Trivedi,
Yihe Huang, Faisal Zaghloul, Sanidhya Kashyap, Margo
Seltzer, Tim Harris, Steve Byan, Bill Bridge, and Dave
Dice. Persistent Memory Transactions. CoRR, 2018.
Amirsaman Memaripour, Anirudh Badam, Amar Phan-
ishayee, Yanqi Zhou, Ramnatthan Alagappan, Karin
Strauss, and Steven Swanson. Atomic In-place Updates
for Non-volatile Main Memories with Kamino-Tx. In
Proceedings of the Twelfth European Conference on
Computer Systems, 2017.
Jai Menon and Jim Cortney. The Architecture of a Fault-
tolerant Cached RAID Controller. In Proceedings of
the 20th Annual International Symposium on Computer Architecture, 1993.
Steven Pelley, Peter M. Chen, and Thomas F. Wenisch.
Memory Persistency. In Proceedings of the 41st Annual
International Symposium on Computer Architecture, 2014.
V.J. Reddi, A. Settle, D.A. Connors, and R.S. Cohn. Pin:
A Binary Instrumentation Tool for Computer Architec-
ture Research and Education. In Proceedings of the
2004 Workshop on Computer Architecture Education.
J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu.
ThyNVM: Enabling Software-transparent Crash Consistency
in Persistent Memory Systems. In 2015 48th Annual IEEE/ACM
International Symposium on Microarchitecture, 2015.
Jie Ren, Kai Wu, and Dong Li. Understanding Appli-
cation Recomputability Without Crash Consistency in
Non-Volatile Memory. In Proceedings of the Work-
shop on Memory Centric High Performance Computing, 2018.
Frank Schmuck and Roger Haskin. GPFS: A Shared-
Disk File System for Large Computing Clusters. In
Proceedings of the 1st USENIX Conference on File and
Storage Technologies, 2002.
H. Shu, H. Chen, H. Liu, Y. Lu, Q. Hu, and J. Shu. Em-
pirical Study of Transactional Management for Persis-
tent Memory. In 2018 IEEE 7th Non-Volatile Memory
Systems and Applications Symposium, 2018.
Aniruddha N. Udipi, Naveen Muralimanohar, Rajeev
Balsubramonian, Al Davis, and Norman P. Jouppi. LOT-
ECC: Localized and Tiered Reliability Mechanisms for
Commodity Memory Systems. In International Sympo-
sium on Computer Architecture, 2012.
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ran-
ganathan, and Roy H. Campbell. Consistent and Durable
Data Structures for Non-volatile Byte-addressable Mem-
ory. In Proceedings of the 9th USENIX Conference on
File and Stroage Technologies, 2011.
H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne:
Lightweight Persistent Memory. In Architectural Support
for Programming Languages and Operating Systems, 2011.
H. Wan, Y. Lu, Y. Xu, and J. Shu. Empirical Study of
Redo and Undo Logging in Persistent Memory. In 5th
Non-Volatile Memory Systems and Applications Symposium.
Panruo Wu, Chong Ding, Longxiang Chen, Teresa
Davies, Christer Karlsson, and Zizhong Chen. On-line
Soft Error Correction in Matrix-Matrix Multiplication.
Journal of Computational Science, 2013.
S. Yang, K. Wu, Y. Qiao, D. Li, and J. Zhai. Algorithm-
Directed Crash Consistence in Non-volatile Memory
for HPC. In 2017 IEEE International Conference on
Cluster Computing, 2017.
Doe Hyun Yoon and Mattan Erez. Virtualized and ﬂex-
ible ecc for main memory. In Proceedings of the Fif-
teenth Edition of ASPLOS on Architectural Support for
Programming Languages and Operating Systems, 2010.