Wren: Nonblocking Reads in a Partitioned
Transactional Causally Consistent Data Store
Kristina Spirovska
Diego Didona
Willy Zwaenepoel
Abstract—Transactional Causal Consistency (TCC) extends
causal consistency, the strongest consistency model compatible
with availability, with interactive read-write transactions, and is
therefore particularly appealing for geo-replicated platforms.
This paper presents Wren, the first TCC system that at the
same time i) implements nonblocking read operations, thereby
achieving low latency, and ii) allows an application to efficiently
scale out within a replication site by sharding.
Wren introduces new protocols for transaction execution,
dependency tracking and stabilization. The transaction protocol
supports nonblocking reads by providing a transaction with a
snapshot that is the union of a fresh causal snapshot S installed
by every partition in the local data center and a client-side cache
for writes that are not yet included in S. The dependency tracking
and stabilization protocols require only two scalar timestamps,
resulting in efficient resource utilization and providing scalability
in terms of replication sites. In return for these benefits, Wren
slightly increases the visibility latency of updates.
We evaluate Wren on an AWS deployment using up to 5
replication sites and 16 partitions per site. We show that Wren
delivers up to 1.4x higher throughput and up to 3.6x lower latency
when compared to the state-of-the-art design. The choice of an
older snapshot increases local update visibility latency by a few
milliseconds. The use of only two timestamps to track causality
increases remote update visibility latency by less than 15%.
Many large-scale data platforms rely on geo-replication to
meet strict performance and availability requirements [1], [2],
[3], [4], [5]. Geo-replication reduces latencies by keeping a
copy of the data close to the clients, and enables availability
by replicating data at geographically distributed data centers
(DCs). To accommodate the ever-growing volumes of data,
today’s large-scale on-line services also partition the data
across multiple servers within a single DC [6], [7].
Transactional Causal Consistency (TCC). TCC [8] is an
attractive consistency level for building geo-replicated data-
stores. TCC enforces causal consistency (CC) [9], which is the
strongest consistency model compatible with availability [10],
[11]. Compared to strong consistency [12], CC does not suffer
from high synchronization latencies, limited scalability and
unavailability in the presence of network partitions between
DCs [13], [14], [15]. Compared to eventual consistency [2],
CC avoids a number of anomalies that plague programming
with weaker models. In addition, TCC extends CC with interactive read-write transactions, that allow applications to read from a causal snapshot and to perform atomic multi-item writes.
Enforcing CC while offering always-available interactive
multi-partition transactions is a challenging problem [7]. The
main culprit is that in a distributed environment, unavoidably,
partitions do not progress at the same pace. Current TCC
designs either avoid this issue altogether, by not supporting
sharding [16], or block reads to ensure that the proper snapshot
is installed [8]. The former approach sacrifices scalability,
while the latter incurs additional latencies.
Wren. This paper presents Wren, the first TCC system that
implements nonblocking reads, thereby achieving low latency,
and allows an application to scale out by sharding. Wren
implements CANToR (Client-Assisted Nonblocking Trans-
actional Reads), a novel transaction protocol in which the
snapshot of the data store visible to a transaction is defined as
the union of two components: i) a fresh causal snapshot that
has been installed by every partition within the DC; and ii)
a per-client cache, which stores the updates performed by the
client that are not yet reflected in said snapshot. This choice
of snapshot departs from earlier approaches where a snapshot
is chosen by simply looking at the local clock value of the
partition acting as transaction coordinator.
Wren also introduces Binary Dependency Time (BDT), a
new dependency tracking protocol, and Binary Stable Time
(BiST), a new stabilization protocol. Regardless of the number
of partitions and DCs, these two protocols assign only two
scalar timestamps to updates and snapshots, corresponding
to dependencies on local and remote items. These protocols
provide high resource efficiency and scalability, and preserve availability.
Wren exposes to clients a snapshot that is slightly in the
past with respect to the one exposed by existing approaches.
We argue that this is a small price to pay for the performance
improvements that Wren offers.
We compare Wren with Cure [8], the state-of-the-art TCC
system, on an AWS deployment with up to 5 DCs with 16
partitions each. Wren achieves up to 1.4x higher throughput
and up to 3.6x lower latencies. The choice of an older snapshot
increases local update visibility latency by a few milliseconds.
The use of only two timestamps to track causality increases
remote update visibility latency by less than 15%.
We make the following contributions.
1) We present the design and implementation of Wren, the
first TCC key-value store that achieves nonblocking reads,
efficiently scales horizontally, and tolerates network partitions
between DCs.
2) We propose new dependency and stabilization protocols that
achieve high resource efficiency and scalability.
3) We experimentally demonstrate the benefits of Wren over
state-of-the-art solutions.
Roadmap. The paper is organized as follows. Section 2
describes TCC and the target system model. Section 3 presents
the design of Wren. Section 4 describes the protocols in Wren.
Section 5 presents the evaluation of Wren. Section 6 discusses
related work. Section 7 concludes the paper.
A. System model
We consider a distributed key-value store whose data-set is split into N partitions. Each key is deterministically assigned to one partition by a hash function. We denote by p_x the partition that contains key x.
The data-set is fully replicated: each partition is replicated at all M DCs. We assume a multi-master system, i.e., each replica can update the keys in its partition. Updates are replicated asynchronously to remote DCs.
The data store is multi-versioned. An update operation
creates a new version of a key. Each version stores the value
corresponding to the key and some meta-data to track causality. The system periodically garbage-collects old versions of items.
At the beginning of a session, a client c connects to a DC, referred to as the local DC. All c's operations are performed within said DC to preserve availability [17]¹. c does not issue another operation until it receives the reply to the current one.
Partitions communicate through point-to-point lossless FIFO
channels (e.g., a TCP socket).
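The deterministic key-to-partition assignment described above can be sketched as follows; the function name and the use of SHA-256 are our illustrative choices, not Wren's actual code:

```python
import hashlib

N_PARTITIONS = 16  # partitions per DC (the evaluation uses up to 16)

def partition_of(key: str, n: int = N_PARTITIONS) -> int:
    """Deterministically map a key to one of n partitions.

    Every client and server computes the same mapping, so p_x,
    the partition holding key x, is known without coordination.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

# The same key always lands on the same partition.
assert partition_of("x") == partition_of("x")
```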
B. Causal consistency
Causal consistency requires that the key-value store returns
values that are consistent with causality [9], [18]. For two
operations a, b, we say that b causally depends on a, and write a → b, if and only if at least one of the following conditions holds: i) a and b are operations in a single thread of execution, and a happens before b; ii) a is a write operation, b is a read operation, and b reads the version written by a; iii) there is some other operation c such that a → c and c → b. Intuitively, CC ensures that if a client has seen the effects of b and a → b, then the client also sees the effects of a.
We use lower-case letters, e.g., x, to refer to a key and the corresponding upper-case letter, e.g., X, to refer to a version of the key. We say that X causally depends on Y if the write of X causally depends on the write of Y.
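The relation a → b can be checked mechanically: conditions i) and ii) yield direct edges, and condition iii) makes → their transitive closure. A toy sketch (all names are ours):

```python
def causally_depends(edges, a, b):
    """Return True iff a -> b under the given direct edges.

    edges: set of (x, y) pairs from conditions i) (same thread,
    x before y) and ii) (y reads the version written by x).
    Condition iii) (transitivity) is a reachability search.
    """
    frontier = [a]
    seen = set()
    while frontier:
        x = frontier.pop()
        if x == b:
            return True
        if x in seen:
            continue
        seen.add(x)
        frontier.extend(y for (w, y) in edges if w == x)
    return False

# w1 -> r1 (read-from), r1 -> w2 (same thread) implies w1 -> w2.
edges = {("w1", "r1"), ("r1", "w2")}
assert causally_depends(edges, "w1", "w2")
assert not causally_depends(edges, "w2", "w1")
```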
We use the term availability to indicate that a client opera-
tion never blocks as the result of a network partition between
DCs [19].
¹Wren can be extended to allow a client c to move to a different DC by blocking c until the last snapshot seen by c has been installed in the new DC.
C. Transactional causal consistency
Semantics. TCC extends CC by means of interactive read-
write transactions in which clients can issue several operations
within a transaction, each reading or writing (potentially)
multiple items [8]. TCC provides a more powerful semantics
than one-shot read-only or write-only transactions provided
by earlier CC systems [7], [15], [20], [21]. It enforces the
following two properties.
1. Transactions read from a causal snapshot. A causal snap-
shot is a set of item versions such that all causal dependencies
of those versions are also included in the snapshot. For any two items, x and y, if X → Y and both X and Y belong to the same causal snapshot, then there is no X′, such that X → X′ → Y.
Transactional reads from a causal snapshot avoid undesir-
able anomalies that can arise by issuing multiple individual
read operations. For example, they prevent the well-known
anomaly in which person A removes person B from the access
list of a photo album and adds a photo to it, only to have
person B read the original permissions and the new version of
the album [15].
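The causal-snapshot property above can be restated as a simple membership check: every causal dependency of a version in the snapshot must itself be in the snapshot. A minimal sketch, with our own names:

```python
def is_causal_snapshot(snapshot, deps):
    """A set of versions is a causal snapshot iff every causal
    dependency of a member version is also a member.

    snapshot: set of version ids
    deps: dict mapping a version id to the set of version ids
          it causally depends on
    """
    return all(deps.get(v, set()) <= snapshot for v in snapshot)

# X2 depends on X1; a snapshot with X2 but not X1 is not causal,
# which is exactly the photo-album anomaly described above.
deps = {"X2": {"X1"}, "X1": set(), "Y1": set()}
assert is_causal_snapshot({"X1", "X2", "Y1"}, deps)
assert not is_causal_snapshot({"X2", "Y1"}, deps)
```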
2. Updates are atomic. Either all items written by a transaction
are visible to other transactions, or none is. If a transaction
writes Xand Y, then any snapshot visible to other transactions
either includes both Xand Yor neither one of them.
Atomic updates increase the expressive power of applica-
tions, e.g., they make it easier to maintain symmetric relation-
ships among entities within an application. For example, in
a social network, if person A becomes friend with person B,
then B simultaneously becomes friend with A. By putting both
updates inside a transaction, both or neither of the friendship
relations are visible to other transactions [21].
Conflict resolution. Two writes are conflicting if they are
not related by causality and update the same key. Conflicting
writes are resolved by means of a commutative and associative
function, that decides the value corresponding to a key given
its current value and the set of updates on the key [15].
For simplicity, Wren resolves write conflicts using the last-writer-wins rule based on the timestamp of the updates [22]. Possible ties are settled by looking at the id of the update's originating DC combined with the identifier of the transaction that created the update. Wren can be extended to support other conflict resolution mechanisms [8], [15], [21], [23].
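Under this rule, the merge is a deterministic maximum over (timestamp, DC id, transaction id), which makes it commutative and associative as required. A sketch (the tuple layout is ours):

```python
def lww_winner(v1, v2):
    """Pick the winning version of two conflicting writes.

    A version is a (timestamp, dc_id, tx_id, value) tuple: the
    highest timestamp wins, with ties settled by the originating
    DC id and then by the transaction id, as described above.
    """
    return max(v1, v2, key=lambda v: (v[0], v[1], v[2]))

older = (5, 0, "t1", "A")
newer = (9, 1, "t7", "B")
assert lww_winner(older, newer)[3] == "B"
# Equal timestamps: the tie is settled deterministically by DC id.
assert lww_winner((9, 0, "t2", "C"), (9, 1, "t3", "D"))[3] == "D"
```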
A client starts a transaction T, issues read and write (multi-
key) operations and commits T. Wren’s client API exposes
the following operations:
• ⟨TID, S⟩ ← START(): starts an interactive transaction T and returns T's transaction identifier TID and the causal snapshot S visible to T.
• ⟨vals⟩ ← READ(k1, ..., kn): reads the set of items corresponding to the input set of keys within T.
• WRITE(⟨k1, v1⟩, ..., ⟨kn, vn⟩): updates a set of given input keys to the corresponding values within T.
(a) Blocking reads in existing systems.
(b) Nonblocking reads in Wren.
Fig. 1: In existing systems (a), a transaction can be assigned a snapshot that has not been installed by every partition in the local DC. c1's transaction is assigned timestamp 10, but p_x has not installed snapshot 10 by the time c1 reads. This leads p_x to block c1's read. In Wren (b), c1's transaction is assigned a timestamp corresponding to a snapshot installed by every partition in the local DC, thus avoiding blocking. The trade-off is that older versions of x and y are returned.
• COMMIT(): finalizes the transaction T and atomically updates the items modified by means of a WRITE operation within T, if any.
In TCC, conflicting updates do not cause transactions to
abort, because they are resolved by the conflict resolution
mechanism. Transactions can abort by means of explicit APIs,
or because of system-related issues, e.g., not enough space
on a server to perform an update. For simplicity, we do not
consider aborts in this paper.
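The API above can be exercised against a toy, single-site store. The sketch below only illustrates the shape of the four calls; it omits replication, causal meta-data and 2PC, and all names are ours, not Wren's:

```python
class ToyTCCStore:
    """Toy, single-site illustration of the client API shape
    (START/READ/WRITE/COMMIT); not Wren's protocol."""

    def __init__(self):
        self.data = {}        # committed key -> value
        self.next_tid = 0
        self.snapshots = {}   # tid -> snapshot visible to the tx
        self.pending = {}     # tid -> buffered write set

    def start(self):                      # <TID, S> <- START()
        tid = self.next_tid
        self.next_tid += 1
        self.snapshots[tid] = dict(self.data)
        self.pending[tid] = {}
        return tid, self.snapshots[tid]

    def read(self, tid, keys):            # <vals> <- READ(k1, ..., kn)
        ws, snap = self.pending[tid], self.snapshots[tid]
        return [ws[k] if k in ws else snap.get(k) for k in keys]

    def write(self, tid, kvs):            # WRITE(<k1,v1>, ..., <kn,vn>)
        self.pending[tid].update(kvs)

    def commit(self, tid):                # COMMIT(): all-or-nothing
        self.data.update(self.pending.pop(tid))
        del self.snapshots[tid]

store = ToyTCCStore()
tid, _ = store.start()
store.write(tid, {"x": 1, "y": 2})
assert store.read(tid, ["x", "z"]) == [1, None]  # own writes visible in T
store.commit(tid)
```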
We first illustrate the challenge in providing nonblocking
reads, by showing how reads can block in the state-of-the-art
Cure system [8]. We then present CANToR (§III-B), and BDT
and BiST (§III-C). We discuss fault tolerance and availability
in Wren (§III-D).
A. The challenge in providing nonblocking reads
For the sake of simplicity, we assume that a transaction snapshot S is defined by a logical timestamp, denoted st. We say that a server has installed a snapshot with timestamp t if the server has applied the modifications of all committed transactions with timestamp up to and including t. Once a server installs a snapshot with timestamp t, the server cannot commit any transaction with a timestamp ≤ t.
Achieving nonblocking reads in TCC is challenging, be-
cause they have to preserve consistency and respect the atom-
icity of multi-item (and hence multi-partition) write transac-
tions. Assume that a transaction writes X and Y. A transaction T that reads x and y must either see both X and Y or neither of them. The complexity of the problem is increased by the fact that the reads on individual keys in a transactional READ may proceed in parallel. In other words, a READ(TID, x, y) sends in parallel a read(x) operation to p_x and a read(y) operation to p_y, and the read(x) taking place on p_x is unaware of the item returned by the read(y) on p_y.
Cure [8] provides the state-of-the-art solution to this prob-
lem. When a client c starts a transaction T, T is assigned a causal snapshot S by a randomly chosen coordinator partition. S includes all previous snapshots seen by c. To this end, the coordinator sets st as the maximum between the highest snapshot timestamp seen by c and the current clock value at the coordinator. When T commits, it is assigned a commit timestamp by means of a two-phase commit (2PC) protocol. Every partition that stores an item modified by T proposes a timestamp (strictly higher than st), and the coordinator picks the maximum as the commit timestamp ct of T. All items written by T are assigned ct as timestamp. Because ct > st, all such writes carry the information that they depend on the items in S, whose timestamps are less than or equal to st.
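Cure's commit-timestamp rule just summarized (each touched partition proposes a timestamp strictly higher than st; the coordinator takes the maximum) can be sketched as:

```python
def cure_commit_timestamp(st, partition_clocks):
    """Sketch of Cure-style commit timestamping, as summarized
    above. Every partition touched by the transaction proposes a
    timestamp strictly higher than the snapshot time st; the
    coordinator picks the maximum as the commit timestamp ct.
    """
    proposals = [max(clock, st + 1) for clock in partition_clocks]
    ct = max(proposals)
    assert ct > st  # the written items depend on the snapshot S
    return ct

# As in Figure 1a (with an assumed st = 3): p_x proposes 6,
# p_y proposes 10, so the coordinator picks ct = 10.
assert cure_commit_timestamp(st=3, partition_clocks=[6, 10]) == 10
```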
Cure achieves causality and enforces atomicity. If a trans-
action is assigned a snapshot timestamp st, the individual read operations of a READ transaction can in parallel read the version of any requested key with the highest timestamp ≤ st. This protocol, however, enforces causality and atomicity at the cost of potentially blocking read operations. We show this behavior by means of an example, depicted in Figure 1a.
To initiate T1, client c1 contacts a coordinator partition, in this case p_z. T1 is the first transaction issued by c1, so c1 does not piggyback any snapshot timestamp to initiate a transaction. The local time on p_z is 10. To maximize the freshness of the snapshot visible to T1, p_z assigns to T1 a timestamp equal to 10. In the meantime, c2 commits T2, which writes X2 and Y2. During the 2PC, p_x proposes 6 as commit timestamp, i.e., the current clock's value on p_x. Similarly, p_y proposes 10. The coordinator of T2, p_w, picks the maximum between these two values and assigns to T2 a commit timestamp 10. p_y receives the commit message, writes Y2 and installs a snapshot with timestamp 10. p_x, instead, does not immediately receive the commit message, and its snapshot still has the value 5.
At this point, c1 issues its READ(T1, x, y) operation by sending a request to p_x and p_y with the snapshot timestamp of T1, which is 10. p_y has installed a snapshot that is fresh enough, and returns Y2. Instead, p_x has to block the read of T1, because p_x cannot determine which version of x to return. p_x cannot safely return X1, because it could violate CC and atomicity. p_x cannot return X2 either, because p_x does not yet know the commit timestamp of X2. If X2 were eventually to be assigned a commit timestamp > 10, then returning X2 to T1 would violate CC. p_x can install X2 and the corresponding snapshot only when receiving the commit message from p_w. Then, p_x can serve c1's pending read with the consistent value X2.
Similar dynamics also characterize other CC systems with write transactions, e.g., Eiger [21].
B. Nonblocking reads in Wren
Wren implements CANToR, a novel transaction protocol
that, similarly to Cure, is based on snapshots and 2PC, but
avoids blocking reads by changing how snapshots visible to
transactions are defined. In particular, a transaction snapshot
is expressed as the union of two components:
1) a fresh causal snapshot installed by every partition in
the local DC, which we call local stable snapshot, and
2) a client-side cache for writes done by the client that have not yet been included in the local stable snapshot.
1) Causal snapshot. Existing approaches block reads, because the snapshot assigned to a transaction T may be "in the future" with respect to the snapshot installed by a server from which T reads an item. CANToR avoids blocking by providing to a transaction a snapshot that only includes writes of transactions that have been installed at all partitions. When using such a snapshot, clearly all reads can proceed without blocking.
To ensure freshness, the snapshot timestamp st provided to
a client is the largest timestamp such that all transactions with
a commit timestamp smaller than or equal to st have been
installed at all partitions. We call this timestamp the local
stable time (LST), and the snapshot that it defines the local
stable snapshot. The LST is determined by a stabilization
protocol, by which partitions within a DC gossip the latest
snapshots they have installed (§III-C). In CANToR, when a transaction starts, it chooses a transaction coordinator, and it uses as its snapshot timestamp the LST value known to the coordinator.
Figure 1b depicts the nonblocking behavior of Wren. p_z proposes 5 as snapshot timestamp (because of p_x). Then c1 can read without blocking on both p_x and p_y, despite the concurrent commit of T2. The trade-off is that c1 reads older versions of x and y, namely X1 and Y1, compared to the scenario in Figure 1a, where it reads X2 and Y2.
Assigning a snapshot slightly in the past, however, does not completely solve the issue of blocking reads. The local stable snapshot includes all the items that have been written by all clients up until the boundary defined by the snapshot and on which c (potentially) depends. The local stable snapshot, however, might not include the most recent writes performed by c in earlier transactions.
Consider, for example, the case in which c commits a transaction T that includes a write on item x, and obtains a value ct as its commit timestamp. Subsequently, c starts another transaction T′, and obtains a snapshot timestamp smaller than ct, because ct has not yet been installed at all partitions. If we were to let c read from this snapshot, and it were to read x, it would not see the value it had written previously in T.
A simple solution would be to block the commit of T until ct ≤ LST. This would guarantee that c can issue its next transaction T′ only after the modifications of T have been applied at every partition in the DC. This approach, however, introduces high commit latencies.
2) Client-side cache. Wren takes a different approach that leverages the fact that the only causal dependencies of c that may not be in the local stable snapshot are items that c has written itself in earlier transactions (e.g., x). Wren therefore provides clients with a private cache for such items: all items written by c are stored in its private cache, from which it reads when appropriate, as detailed below.
When starting a transaction, the client removes from the cache all the items that are included in the causal snapshot, in other words all items with commit timestamp up to its causal snapshot time st. When reading x, a client first looks up x in its cache. If there is a version of x in the cache, it means that the client has written a version of x that is not included in the transaction snapshot. Hence, it must be read from the cache. Otherwise, the client reads x from p_x. In either case, the read is performed without blocking².
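The cache maintenance just described (prune on transaction start, cache-first lookup on read) can be sketched as follows; the field names are ours:

```python
class ClientCache:
    """Sketch of the per-client cache described above: it holds the
    client's own committed writes until they enter the local stable
    snapshot."""

    def __init__(self):
        self.entries = {}  # key -> (commit_ts, value)

    def prune(self, st):
        """On transaction start: drop items already covered by the
        causal snapshot time st."""
        self.entries = {k: (ct, v) for k, (ct, v) in self.entries.items()
                        if ct > st}

    def lookup(self, key):
        """On read: a cached version, if present, is newer than the
        snapshot and must be returned instead of asking p_key."""
        hit = self.entries.get(key)
        return None if hit is None else hit[1]

cache = ClientCache()
cache.entries["x"] = (10, "X2")   # the client's own write, commit ts 10
cache.prune(st=8)                 # 10 > 8: not yet in the snapshot
assert cache.lookup("x") == "X2"  # read served from the cache
cache.prune(st=12)                # now included in the snapshot
assert cache.lookup("x") is None  # read goes to p_x instead
```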
C. Dependency tracking and stabilization protocols
BDT. Wren implements BDT, a novel protocol to track the
causal dependencies of items. The key feature of BDT is that
every data item tracks dependencies by means of only two
scalar timestamps, regardless of the scale of the system. One
entry tracks the dependencies on local items and the other
entry summarizes the dependencies on remote items.
The use of only two timestamps enables higher efficiency
and scalability than other designs. State-of-the-art solutions
employ dependency meta-data whose size grows with the
number of DCs [8], [16], partitions [24] or causal dependen-
cies [7], [15], [21], [25]. Meta-data efficiency is paramount
for many applications dominated by very small items, e.g.,
Facebook [3], [26], in which meta-data can easily grow bigger
than the item itself. Large meta-data increases processing,
communication and storage overhead.
BiST. Wren relies on BDT to implement BiST, an efficient sta-
bilization protocol to determine when updates can be included
in the snapshots proposed to clients within a DC (i.e., when
they are visible within a DC). BiST allows updates originating
in a DC to become visible in that DC without waiting for the
receipt of remote items. A remote update d, instead, is visible
in a DC when it is stable, i.e., when all the causal dependencies
of d have been received in the DC.
²The client can avoid contacting p_x, because Wren uses the last-writer-wins rule to resolve conflicting updates (see §II-C). With other conflict resolution methods, the client would always have to read the version of x from p_x, and apply the update(s) in the cache to that version.
BiST computes two cut-off values that indicate, respectively,
which local and remote items can become visible to transac-
tions within a DC. The local component computed by BiST is
the LST, which we described earlier. The remote component
is the Remote Stable Time (RST), that, similarly to the LST,
indicates a lower bound on remote snapshots that have been
installed by every node within the local DC.
By decoupling local and remote items, BiST allows trans-
actions to determine the visibility of local items without
synchronizing with remote DCs [8], in contrast to systems that
use a single scalar timestamp for dependency tracking [20],
[24]. This decoupling enables availability and nonblocking
reads also in the geo-replicated case, because a snapshot
visible to a transaction includes only remote items that have
already been received in the local DC.
With BiST, partitions within a DC periodically exchange the commit timestamps of the latest local and remote transactions they have applied. Then, each partition computes the LST, resp. RST, as the minimum of the received timestamps corresponding to local, resp. remote, transactions. Therefore, LST and RST reflect local and remote snapshots that have already been installed by all partitions in the DC, and from which transactions can read without blocking, as we explain in the following.
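BiST's cut-off computation reduces to two minima over the gossiped per-partition timestamps. A sketch under the stated assumptions (names are ours):

```python
def bist_cutoffs(reports):
    """Sketch of BiST's two cut-offs. Each partition in the DC
    gossips a pair (local_ts, remote_ts): the commit timestamps of
    the latest local, resp. remote, transactions it has applied.
    The DC-wide LST and RST are the minima over all partitions, so
    they reflect snapshots installed by every partition.
    """
    lst = min(local for local, _ in reports)
    rst = min(remote for _, remote in reports)
    return lst, rst

# Three partitions; the slowest partition bounds each cut-off.
assert bist_cutoffs([(15, 4), (12, 6), (20, 5)]) == (12, 4)
```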
Snapshots and nonblocking reads. When a transaction T starts, the local, resp. remote, entry of the corresponding snapshot S is set to the maximum between the LST, resp. RST, on the coordinator and the highest LST, resp. RST, value seen by the client, ensuring that clients see monotonically increasing snapshots.
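This assignment rule is a per-entry maximum, which is exactly what makes client-observed snapshots monotonic; a minimal sketch:

```python
def assign_snapshot(coord_lst, coord_rst, client_lst, client_rst):
    """Sketch of snapshot assignment at transaction start: each
    entry is the max of the coordinator's current cut-off and the
    highest value the client has already seen, so a client's
    snapshots never move backwards."""
    return max(coord_lst, client_lst), max(coord_rst, client_rst)

# A client that saw (lst=10, rst=4) never moves backwards, even if
# it picks a coordinator whose cut-offs lag behind.
assert assign_snapshot(8, 3, 10, 4) == (10, 4)
assert assign_snapshot(12, 5, 10, 4) == (12, 5)
```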
T uses the timestamps in S to determine the version of an item that it can read, namely the freshest version of an item that falls within the visible snapshot (aside from the items that are read from the client-side cache, as described in §III-B). S includes local, resp. remote, items whose timestamp is no higher than the LST, resp. the RST. Because both LST and RST reflect snapshots installed by every partition in the DC, T can read from S in a nonblocking fashion.
Trade-off. BiST enables high scalability and performance at
the expense of a slight increase in the time that it takes for
an update to become visible in a DC (the so called visibility
latency). By using BiST, Wren only tracks the lower bound
on the commit time of local transactions (LST) and replicated
transactions coming from all the remote DCs (RST). We
describe this trade-off by sketching in Figure 2 how BiST
and other existing stabilization protocols work at a high level.
In the example, the local DC (DC2) has committed transactions with timestamp up to 15. It has received commits from DC0 with timestamp up to 4, and from DC1 with timestamp up to 6. Wren exposes to transactions remote items with timestamp up to 4, the minimum of 4 and 6. Cure [8] uses one timestamp per DC, so transactions can see items from DC0 with timestamp up to 4 and from DC1 with timestamp up to 6. GentleRain [20] uses a single timestamp (the local one) to encode both local and remote snapshots, so transactions
Fig. 2: Resource efficiency vs. freshness in BiST (one partition per DC). DC2 is the local DC.
can see all items up to timestamp 15. However, they have to
block until the local DC has received all remote updates with
timestamps lower than or equal to 15.
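The Figure 2 comparison can be replayed in a few lines. The function below simply encodes, for the example's values, which remote snapshot each protocol exposes (it is our restatement, not the protocols' code):

```python
def visible_snapshot(local_ts, remote_ts_per_dc, protocol):
    """Remote items visible in the local DC under each protocol,
    for the Figure 2 scenario: local commits up to local_ts, and
    remote commits received per remote DC in remote_ts_per_dc."""
    if protocol == "bist":        # one scalar: min across remote DCs
        return min(remote_ts_per_dc)
    if protocol == "cure":        # one entry per DC: fresher, bigger
        return list(remote_ts_per_dc)
    if protocol == "gentlerain":  # single timestamp: freshest cut,
        return local_ts           # but reads may block until safe
    raise ValueError(protocol)

# DC2 committed locally up to 15; received up to 4 from DC0, 6 from DC1.
assert visible_snapshot(15, [4, 6], "bist") == 4
assert visible_snapshot(15, [4, 6], "cure") == [4, 6]
assert visible_snapshot(15, [4, 6], "gentlerain") == 15
```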
Timestamps. So far, we have assumed that Wren uses logical Lamport clocks to generate timestamps. Wren, instead, uses Hybrid Logical Physical Clocks (HLC) [27]. In brief, an HLC
is a logical clock whose value on a partition is the maximum
between the local physical clock and the highest timestamp
seen by the partition plus one. HLCs combine the advantages
of logical and physical clocks. Like logical clocks, HLCs can
be moved forward to match the timestamp of an incoming
event. Like physical clocks, they advance in the absence of
events and at approximately the same pace.
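Following the simplified single-scalar description above (not the full two-component HLC of [27]), an HLC can be sketched as:

```python
class SimpleHLC:
    """Single-scalar sketch of the HLC summarized above: the clock
    value is the max of the local physical clock and the highest
    timestamp seen so far plus one. The full HLC of [27] keeps a
    separate logical component; this is only the paper's summary.
    """

    def __init__(self):
        self.highest_seen = 0

    def tick(self, physical):
        """Generate a timestamp for a local event."""
        self.highest_seen = max(physical, self.highest_seen + 1)
        return self.highest_seen

    def observe(self, remote_ts):
        """Move the clock forward to match an incoming timestamp."""
        self.highest_seen = max(self.highest_seen, remote_ts)

hlc = SimpleHLC()
assert hlc.tick(physical=100) == 100  # tracks the physical clock
hlc.observe(140)                      # event from a fast remote clock
assert hlc.tick(physical=101) == 141  # jumps past the seen timestamp
```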
Wren’s use of HLCs improves the freshness of the snapshot
determined by BiST, which, as a by-product, also reduces the
amount of data stored in the client-side caches. HLCs have
previously been employed by CC systems to avoid waiting
for physical clocks to catch up when generating timestamps
for updates [28], [29]. HLCs alone, however, do not solve the
problem of blocking reads with TCC. A snapshot timestamp
can be “in the future” with respect to the installed snapshot of
a partition, regardless of whether the clock is logical, physical
or hybrid.
D. Fault tolerance and Availability
Fault tolerance (within a DC). Similarly to previous trans-
actional systems based on 2PC, Wren can integrate fault-
tolerance capabilities by means of standard replication tech-
niques such as Paxos [30].
Wren preserves nonblocking reads, even if such fault tol-
erance mechanisms are enabled. In blocking systems, instead,
fault tolerance increases the latency incurred by transactions
upon blocking, because it increases the duration of a commit.
The failure of a server blocks the progress of BiST, but
only during the short amount of time during which a backup
partition has not yet replaced the failed one. The failure of a
client does not affect the behavior of the system. The clients
only keep local meta-data, and cache data that have already
been committed to the data-store.
Availability (between DCs). BiST is always available. Trans-
actions are never blocked or delayed as a result of the
disconnection or failure of a DC. The disconnection of DC_i causes the RST to freeze in all DCs that get disconnected from DC_i. However, because BiST decouples local from remote
dependencies, any RST assigned to a transaction refers to a
snapshot that is already available in the DC, and the trans-
action can thus proceed. The LST, instead, always advances, ensuring that clients can prune their local caches even if a DC is unreachable.

Algorithm 1 Wren client c (open session towards p_n^m).
1: function START
2:   send ⟨StartTxReq lst_c, rst_c⟩ to p_n^m
3:   receive ⟨StartTxResp id, lst, rst⟩ from p_n^m
4:   rst_c ← rst; lst_c ← lst; id_c ← id
5:   RS_c ← ∅; WS_c ← ∅
6:   Remove from WC_c all items with commit timestamp up to lst_c
7: end function
8: function READ(χ)
9:   D ← ∅; χ′ ← ∅
10:  for each k ∈ χ do
11:    d ← check WS_c, RS_c, WC_c (in this order)
12:    if (d ≠ NULL) then D ← D ∪ d
13:  end for
14:  χ′ ← χ \ D.keySet()
15:  send ⟨TxReadReq id_c, χ′⟩ to p_n^m
16:  receive ⟨TxReadResp D′⟩ from p_n^m
17:  D ← D ∪ D′
18:  RS_c ← RS_c ∪ D
19:  return D
20: end function
21: function WRITE(χ)
22:  for each ⟨k, v⟩ ∈ χ do  ▷ Update WS_c or write new entry
23:    if (∃d ∈ WS_c : d.k == k) then d.v ← v else WS_c ← WS_c ∪ ⟨k, v⟩
24:  end for
25: end function
26: function COMMIT  ▷ Only invoked if WS_c ≠ ∅
27:  send ⟨CommitReq id_c, hwt_c, WS_c⟩ to p_n^m
28:  receive ⟨CommitResp ct⟩ from p_n^m
29:  hwt_c ← ct  ▷ Update client's highest write time
30:  Tag WS_c entries with hwt_c
31:  Move WS_c entries to WC_c  ▷ Overwrite (older) duplicate entries
32: end function
We now describe in more detail the meta-data stored and
the protocols implemented by clients and servers in Wren.
A. Meta-data
Items. An item d is a tuple ⟨k, v, ut, rdt, idT, sr⟩. k and v are the key and value of d, respectively. ut is the timestamp of d, which is assigned upon commit of d and summarizes the dependencies on local items. rdt is the remote dependency time of d, i.e., it summarizes the dependencies towards remote items. idT is the id of the transaction that created the item version. sr is the source replica of d.
Client. In a client session, a client c maintains id_c, which identifies the current transaction, and lst_c and rst_c, which correspond to the local and remote timestamps of the transaction snapshot, respectively. c also stores the commit time of its last update transaction, represented by hwt_c. Finally, c stores WS_c, RS_c and WC_c, corresponding to the client's write set, read set and client-side cache, respectively.
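The item tuple and the per-session client state can be transcribed as plain records; the sketch below is a literal transcription of the fields listed above, not Wren's code:

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """The item tuple <k, v, ut, rdt, idT, sr> described above."""
    k: str        # key
    v: object     # value
    ut: int       # update (commit) time: summarizes local deps
    rdt: int      # remote dependency time: summarizes remote deps
    idT: str      # id of the transaction that created this version
    sr: int       # source replica (originating DC)

@dataclass
class ClientSession:
    """Per-session client state (names follow the text)."""
    id_c: str = ""                            # current transaction id
    lst_c: int = 0                            # local snapshot timestamp
    rst_c: int = 0                            # remote snapshot timestamp
    hwt_c: int = 0                            # commit time of last update tx
    WS_c: dict = field(default_factory=dict)  # write set
    RS_c: dict = field(default_factory=dict)  # read set
    WC_c: dict = field(default_factory=dict)  # client-side cache
```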
Servers. A server pm
nis identified by the partition id (n)
and the DC id (m). In our description, thus, mis the local
DC of the server. Each server has access to a monotonically
increasing physical clock, Clockm
n. The local clock value on
nis represented by the hybrid clock HLCm
nalso maintains V V m
n, a vector of HLCs with Mentries.
V V m
n[i], i 6=mindicates the timestamp of the latest update
Algorithm 2 Wren server p_n^m - transaction coordinator.
1: upon receive ⟨StartTxReq lst_c, rst_c⟩ from c do
2:   rst_n^m ← max{rst_n^m, rst_c}  ▷ Update remote stable time
3:   lst_n^m ← max{lst_n^m, lst_c}  ▷ Update local stable time
4:   id_T ← generateUniqueId()
5:   TX[id_T] ← ⟨lst_n^m, min{rst_n^m, lst_n^m − 1}⟩  ▷ Save TX context
6:   send ⟨StartTxResp id_T, TX[id_T]⟩ to c  ▷ Assign transaction snapshot
7: upon receive ⟨TxReadReq id_T, χ⟩ from c do
8:   ⟨lt, rt⟩ ← TX[id_T]
9:   D ← ∅
10:  χ_i ← {k ∈ χ : partition(k) == i}  ▷ Partitions with ≥ 1 key to read
11:  for (i : χ_i ≠ ∅) do
12:    send ⟨SliceReq χ_i, lt, rt⟩ to p_i^m
13:    receive ⟨SliceResp D_i⟩ from p_i^m
14:    D ← D ∪ D_i
15:  end for
16:  send ⟨TxReadResp D⟩ to c
17: upon receive ⟨CommitReq id_T, hwt, WS⟩ from c do
18:  ⟨lt, rt⟩ ← TX[id_T]
19:  ht ← max{lt, rt, hwt}  ▷ Max timestamp seen by the client
20:  D_i ← {⟨k, v⟩ ∈ WS : partition(k) == i}
21:  for (i : D_i ≠ ∅) do  ▷ Done in parallel
22:    send ⟨PrepareReq id_T, lt, rt, ht, D_i⟩ to p_i^m
23:    receive ⟨PrepareResp id_T, pt_i⟩ from p_i^m
24:  end for
25:  ct ← max_{i : D_i ≠ ∅}{pt_i}  ▷ Max proposed timestamp
26:  for (i : D_i ≠ ∅) do send ⟨Commit id_T, ct⟩ to p_i^m end for
27:  delete TX[id_T]  ▷ Clear transactional context of c
28:  send ⟨CommitResp ct⟩ to c
received by p_n^m that comes from the n-th partition at the i-th DC. VV_n^m[m] is the version clock of the server and represents the local snapshot installed by p_n^m. The server also stores lst_n^m and rst_n^m. lst_n^m = t indicates that p_n^m is aware that every partition in the local DC has installed a local snapshot with timestamp at least t. rst_n^m = t' indicates that p_n^m is aware that every partition in the local DC has installed all the updates generated from all remote DCs with update time up to t'. Finally, p_n^m keeps a list of prepared and a list of committed transactions. The former stores transactions for which p_n^m has proposed a commit timestamp and for which p_n^m is awaiting the commit message. The latter stores transactions that have been assigned a commit timestamp and whose modifications are going to be applied to p_n^m.
B. Operations
Start. Client c initiates a transaction T by picking at random a coordinator partition (denoted p_n^m) and sending it a start request with lst_c and rst_c. p_n^m uses these values to update its lst_n^m and rst_n^m, so that p_n^m can propose a snapshot that is at least as fresh as the one accessed by c in previous transactions. Then, p_n^m generates the snapshot visible to T. The local snapshot timestamp is lst_n^m. The remote one is set as the minimum between rst_n^m and lst_n^m − 1. Wren enforces the remote snapshot time to be lower than the local one, to efficiently deal with concurrent conflicting updates. Assume c wants to read x, that c has a version X_l in its private cache with commit timestamp ct > lst_n^m, and that there exists a visible remote version X_r with commit timestamp ct. Then, c must retrieve X_r, its commit timestamp and its source replica to determine whether X_l or X_r should be read according to the last-writer-wins rule. By forcing the remote stable time to be lower than
Algorithm 3 Wren server p_n^m - transaction cohort.
1: upon receive ⟨SliceReq χ, lt, rt⟩ from the coordinator do
2:   rst_n^m ← max{rst_n^m, rt}  ▷ Update remote stable time
3:   lst_n^m ← max{lst_n^m, lt}  ▷ Update local stable time
4:   D ← ∅
5:   for (k ∈ χ) do
6:     D_k ← {d : d.k == k}  ▷ All versions of k
7:     D_lv ← {d : d.sr == m ∧ d.ut ≤ lt ∧ d.rdt ≤ rt}  ▷ Local visible
8:     D_rv ← {d : d.sr ≠ m ∧ d.ut ≤ rt ∧ d.rdt ≤ lt}  ▷ Remote visible
9:     D_kv ← D_k ∩ (D_lv ∪ D_rv)  ▷ All visible versions of k
10:    D ← D ∪ {argmax_{d.ut}{d ∈ D_kv}}  ▷ Freshest visible vers. of k
11:  end for
12:  reply ⟨SliceResp D⟩ to the coordinator
13: upon receive ⟨PrepareReq id_T, lt, rt, ht, D_i⟩ from the coordinator do
14:  HLC_n^m ← max(Clock_n^m, ht + 1, HLC_n^m + 1)  ▷ Update HLC
15:  pt ← HLC_n^m  ▷ Proposed commit time
16:  lst_n^m ← max{lst_n^m, lt}  ▷ Update local stable time
17:  rst_n^m ← max{rst_n^m, rt}  ▷ Update remote stable time
18:  Prepared_n^m ← Prepared_n^m ∪ {⟨id_T, rt, D_i⟩}  ▷ Append to pending list
19:  send ⟨PrepareResp id_T, pt⟩ to the coordinator
20: upon receive ⟨Commit id_T, ct⟩ from the coordinator do
21:  HLC_n^m ← max(HLC_n^m, ct, Clock_n^m)  ▷ Update HLC
22:  ⟨id_T, rst, D⟩ ← ⟨i, r, φ⟩ ∈ Prepared_n^m : i == id_T
23:  Prepared_n^m ← Prepared_n^m \ {⟨id_T, rst, D⟩}  ▷ Remove from pending
24:  Committed_n^m ← Committed_n^m ∪ {⟨id_T, ct, rst, D⟩}  ▷ Mark to commit
lst – and hence lower than ct – the client knows that the freshest visible version of x is X_l, which can be read locally from the private cache³.
After defining the snapshot visible to T, p_n^m also generates a unique identifier for T, denoted id_T, and inserts T in a private data structure. p_n^m replies to c with id_T and the snapshot. Upon receiving the reply, c updates lst_c and rst_c, and evicts from the cache any version with timestamp lower than lst_c. c can prune the cache using lst_c because p_n^m has enforced that the highest remote timestamp visible to T is lower than lst_n^m. This ensures that if, after pruning, there is a version X in the private cache of c, then X.ct > lst and hence the freshest version of x visible to c is X.
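The snapshot choice and the resulting cache pruning admit a compact sketch. The following is an illustrative Python rendering with scalar timestamps; the function and variable names are ours, not Wren’s code:

```python
def assign_snapshot(lst_server, rst_server, lst_client, rst_client):
    # Coordinator: advance its stable times with the client's view, so the
    # snapshot is at least as fresh as what c saw in previous transactions.
    lst = max(lst_server, lst_client)
    rst = max(rst_server, rst_client)
    # Cap the remote snapshot time at lst - 1: any cached version with
    # commit timestamp > lst is then fresher than every visible remote
    # version, so last-writer-wins can be resolved from the cache alone.
    return lst, min(rst, lst - 1)

def prune_cache(cache, lst):
    # Client: versions with commit timestamp <= lst are covered by the
    # causal snapshot and can be evicted from the private cache.
    return {k: (val, ct) for k, (val, ct) in cache.items() if ct > lst}
```

For example, with server stable times (10, 12) and client times (8, 9), the assigned snapshot is (10, 9); a cached version committed at 7 is evicted, while one committed at 11 survives.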
Read. The client cprovides the set of keys to read. For each
key kto read, csearches the write-set, the read-set and the
client cache, in this order. If an item corresponding to kis
found, it is added to the set of items to return, ensuring
read-your-own-writes and repeatable-reads semantics. Reads
for keys that cannot be served locally are sent in parallel to
the corresponding partitions, together with the snapshot from
which to serve them. Upon receiving a read request, a server
first updates the server’s LST and RST, if they are smaller than the client’s (Alg. 3 Lines 2–3). Then, the server returns to the client, for each key, the version within the snapshot with the highest timestamp (Alg. 3 Lines 6–10). c inserts the returned items in the read-set.
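The server-side visibility test (Alg. 3 Lines 7–10) can be expressed as follows. This is an illustrative Python sketch in which items are dictionaries with the fields described under Meta-data above:

```python
def visible(d, m, lt, rt):
    # A local item (d["sr"] == m) is visible if its update time is within
    # the local snapshot and its remote dependencies within the remote one;
    # for a remote item the two conditions are mirrored.
    if d["sr"] == m:
        return d["ut"] <= lt and d["rdt"] <= rt
    return d["ut"] <= rt and d["rdt"] <= lt

def freshest_visible(versions, m, lt, rt):
    # Return the visible version with the highest update timestamp, if any.
    vis = [d for d in versions if visible(d, m, lt, rt)]
    return max(vis, key=lambda d: d["ut"]) if vis else None
```

Note the asymmetry: a remote item must fall within the (older) remote snapshot time, which is what lets the client resolve conflicts with its private cache without blocking.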
Write. Client c locally buffers the writes in its write-set WS_c. If a key being written is already present in WS_c, it is updated; otherwise, it is inserted.
Commit. The client sends a commit request to the coordinator
³The likelihood of rst_n^m being higher than lst_n^m is low given that i) geo-replication delays are typically higher than the skew among the physical clocks [31] and ii) rst_n^m is the minimum value across all timestamps of the latest updates received in the local DC.
Algorithm 4 Wren server p_n^m - auxiliary functions.
1: function UPDATE(k, v, ut, rdt, id_T)
2:   create d : ⟨d.k, d.v, d.ut, d.rdt, d.id_T, d.sr⟩ ← ⟨k, v, ut, rdt, id_T, m⟩
3:   insert new item d in the version chain of key k
4: end function
5: upon every ∆_R do
6:   if (Prepared_n^m ≠ ∅) then ub ← min{p.pt : p ∈ Prepared_n^m} − 1
7:   else ub ← max{Clock_n^m, HLC_n^m} end if
8:   if (Committed_n^m ≠ ∅) then  ▷ Commit tx in increasing order of ct
9:     C ← {⟨id, ct, rst, D⟩ ∈ Committed_n^m : ct ≤ ub}
10:    for (T ← {⟨id, rst, D⟩} ∈ (group C by ct)) do
11:      for (⟨id, rst, D⟩ ∈ T) do
12:        for (⟨k, v⟩ ∈ D) do UPDATE(k, v, ct, rst, id) end for
13:      end for
14:      for (i ≠ m) do send ⟨Replicate T, ct⟩ to p_n^i end for
15:      Committed_n^m ← Committed_n^m \ T
16:    end for
17:    VV_n^m[m] ← ub  ▷ Set version clock
18:  else
19:    VV_n^m[m] ← ub  ▷ Set version clock
20:    for (i ≠ m) do send ⟨Heartbeat VV_n^m[m]⟩ to p_n^i end for
21:  end if
22: upon receive ⟨Replicate T, ct⟩ from p_n^i do
23:  for (⟨id, rst, D⟩ ∈ T) do
24:    for (⟨k, v⟩ ∈ D) do UPDATE(k, v, ct, rst, id) end for
25:  end for
26:  VV_n^m[i] ← ct  ▷ Update remote snapshot of i-th replica
27: upon receive ⟨Heartbeat t⟩ from p_n^i do
28:  VV_n^m[i] ← t  ▷ Update remote snapshot of i-th replica
29: upon every ∆_G do  ▷ Compute remote and local stable snapshots
30:  rst_n^m ← min_{i=0,...,M−1, i≠m; j=0,...,N−1} VV_j^m[i]
31:  lst_n^m ← min_{j=0,...,N−1} VV_j^m[m]
with the content of WS_c, the id of the transaction and the commit time of its last update transaction hwt_c, if any. The coordinator contacts the partitions that store the keys that need to be updated (the cohorts) and sends them the corresponding updates and hwt_c. The partitions update their HLCs, propose
a commit timestamp and append the transaction to the pending
list. To reflect causality, the proposed timestamp is higher than
the snapshot timestamps and hwtc. The coordinator then picks
the maximum among the proposed timestamps [32], sends it to
the cohort partitions, clears the local context of the transaction
and sends the commit timestamp to the client. The cohort
partitions move the transaction from the pending list to the
commit list, with the new commit timestamp.
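The timestamping side of this protocol can be sketched as follows (a hedged Python sketch showing only the clock arithmetic of Alg. 3 Line 14 and the coordinator’s choice in Alg. 2 Line 25, not the messaging):

```python
def propose_commit_time(hlc, clock, ht):
    # Cohort prepare: the proposed time strictly exceeds ht (the highest
    # timestamp seen by the client) and the cohort's own HLC, so the
    # eventual commit timestamp reflects causality.
    return max(clock, ht + 1, hlc + 1)

def commit_timestamp(proposals):
    # Coordinator: the commit timestamp is the maximum among the cohorts'
    # proposals, hence acceptable to every cohort.
    return max(proposals)
```

For instance, with ht = 7, a cohort whose physical clock reads 3 and whose HLC is 5 proposes 8; if the other cohorts propose 12 and 9, the transaction commits at 12.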
Applying and replicating transactions. Periodically, the servers apply the effects of committed transactions, in increasing commit timestamp order (Alg. 4 Lines 6–20). p_n^m applies the modifications of transactions that have a commit timestamp lower than the lowest timestamp present in the pending list. This timestamp represents a lower bound on the commit timestamps of future transactions on p_n^m. After applying the transactions, p_n^m updates its local version clock and replicates the transactions to remote DCs. When there are multiple transactions with the same commit time ct, p_n^m updates its local version clock only after applying the last transaction with the same ct, and packs them together to be propagated in one replication message (Alg. 4 Lines 10–17).
If a server does not commit a transaction for a given amount of time, it sends a heartbeat with its current HLC to its peer
Fig. 3: Performance of Wren, H-Cure and Cure on 3 DCs, 8 partitions/DC, 4 partitions involved per transaction, and 95:5 r:w ratio. (a) Throughput (1000 x TX/s) vs. average TX latency (msec); (b) throughput vs. mean blocking time (msec) in Cure and H-Cure (Wren never blocks). Wren achieves better latencies because it never blocks reads (a). H-Cure achieves performance in-between Cure and Wren, showing that only using HLCs does not solve the problem of blocking reads in TCC. Cure and H-Cure incur a mean blocking time that grows with the load (b). Because of blocking, Cure and H-Cure need higher concurrency to fully utilize the resources on the servers. This leads to higher contention on physical resources and to a lower throughput (a).
replicas, ensuring the progress of the RST.
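The upper bound used by the apply step (Alg. 4 Lines 6–7) can be sketched as follows (illustrative Python; `prepared_pts` stands for the proposed timestamps in the pending list):

```python
def apply_upper_bound(prepared_pts, clock, hlc):
    # Any future commit timestamp exceeds every pending proposal, so
    # min(pending) - 1 is a safe bound; with nothing pending, the
    # physical clock and the HLC bound future proposals instead.
    if prepared_pts:
        return min(prepared_pts) - 1
    return max(clock, hlc)

def applicable(committed, ub):
    # Committed transactions safe to apply, in increasing ct order.
    return sorted((ct, tx) for ct, tx in committed if ct <= ub)
```

With pending proposals {10, 14}, only transactions committed at or below 9 are applied, guaranteeing that the installed snapshot never advances past a still-uncommitted transaction.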
BiST. Periodically, partitions within a DC exchange their version vectors. The LST is computed as the minimum across the local entries in such vectors; the RST as the minimum across the remote ones (Alg. 4 Lines 30–32). Partitions within a DC are organized as a tree to reduce communication costs [20].
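The BiST aggregation itself reduces to two scalar minima over the exchanged version vectors (a sketch under the convention that entry m of each vector is the local one):

```python
def bist(version_vectors, m):
    # LST: minimum of the local entries across the DC's partitions;
    # RST: minimum of all remote entries across the DC's partitions.
    lst = min(vv[m] for vv in version_vectors)
    rst = min(vv[i] for vv in version_vectors
              for i in range(len(vv)) if i != m)
    return lst, rst
```

Because only these two scalars are propagated, the stabilization cost is independent of the number of DCs, which is the source of Wren’s meta-data savings over Cure.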
Garbage collection. Periodically, the partitions within a DC exchange the oldest snapshot corresponding to an active transaction (p_n^m sends its current visible snapshot if it has no running transaction). The aggregate minimum determines the oldest snapshot S_old that is visible to a running transaction. The partitions scan the version chain of each key backwards and keep all the versions up to (and including) the oldest one within S_old. Earlier versions are removed.
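One way to realize this scan is the following (hypothetical Python; `chain` is sorted from newest to oldest and `in_s_old` tests membership in S_old):

```python
def gc_version_chain(chain, in_s_old):
    # Keep every version down to (and including) the first one that is
    # within S_old: that version may still be read by the oldest active
    # transaction, while everything older is unreachable and removed.
    kept = []
    for d in chain:
        kept.append(d)
        if in_s_old(d):
            break
    return kept
```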
C. Correctness
Because of space constraints, we provide only a high-level
argument to show the correctness of Wren.
Snapshots are causal. To start a transaction, a client c piggybacks the freshest snapshot it has seen, ensuring the monotonicity of the snapshots seen by c (Alg. 2 Lines 1–6). Commit timestamps reflect causality (Alg. 2 Line 19), and BiST tracks a lower bound on the snapshot installed by every partition in a DC. If X is within the snapshot of a transaction, so are its dependencies, because i) dependencies generated in the same DC where X is created have a timestamp lower than X’s and ii) dependencies generated in a remote DC have a timestamp lower than X.rdt. On top of the snapshot provided by the coordinator, the client applies its writes that are not in the snapshot. These writes cannot depend on items created by other clients that are outside the snapshot visible to c.
Writes are atomic. Items written by a transaction have the
same commit timestamp and RST. LST and RST are computed
as the minimum values across all the partitions within a DC.
If a transaction has written X and Y, and a snapshot contains X, then it also contains Y (and vice versa).
We evaluate the performance of Wren in terms of through-
put, latency and update visibility. We compare Wren with
Cure [8], the state-of-the-art approach to TCC, and with
H-Cure, a variant of Cure that uses HLCs. By comparing
with H-Cure, we show that using HLCs alone, as in existing
systems [29], [33], is not sufficient to achieve the same
performance as Wren, and that nonblocking reads in the
presence of multi-item atomic writes are essential.
A. Experimental environment
Platform. We consider a geo-replicated setting deployed
across up to 5 replication sites on Amazon EC2 (Virginia,
Oregon, Ireland, Mumbai and Sydney). When using 3 DCs,
we use Virginia, Oregon and Ireland. In each DC we use up
to 16 servers (m4.large instances with 2 VCPUs and 8 GB
of RAM). We spawn one client process per partition in each
DC. Clients issue requests in a closed loop, and are collocated
with the server partition they use as coordinator. We spawn
different numbers of client threads to generate different load
conditions. In particular, we spawn 1, 2, 4, 8, 16 threads per
client process. Each “dot” in the curve plots corresponds to a
different number of threads per client.
Implementation. We implement Wren, H-Cure and Cure in the same C++ code-base⁴. All protocols implement the last-writer-wins rule for convergence. We use Google Protobufs
for communication, and NTP to synchronize physical clocks.
The stabilization protocols run every 5 milliseconds.
Workloads. We use workloads with 95:5, 90:10 and 50:50 r:w
ratios. These are standard workloads also used to benchmark
other TCC systems [8], [16], [34]. In particular, the 50:50
and 95:5 r:w ratio workloads correspond, respectively, to the
update-heavy (A) and read-heavy (B) YCSB workloads [35].
Transactions generate the three workloads by executing 19
reads and 1 write (95:5), 18 reads and 2 writes (90:10), and
10 reads and 10 writes (50:50). A transaction first executes all
reads in parallel, and then all writes in parallel.
Our default workload uses the 95:5 r:w ratio and runs
transactions that involve 4 partitions on a platform deployed
over 3 DCs and 8 partitions. We also consider variations of
this workload in which we change the value of one parameter
Fig. 4: Performance of Wren, Cure and H-Cure with 90:10 (a) and 50:50 (b) r:w ratios, 4 partitions involved per transaction (3 DCs, 8 partitions). Each panel plots throughput (1000 x TX/s) vs. average TX latency (msec). Wren outperforms Cure and H-Cure for both read-heavy and write-heavy workloads.
Fig. 5: Performance of Wren, Cure and H-Cure with transactions that read from 2 (a) and 8 (b) partitions with 95:5 r:w ratio (3 DCs, 8 partitions). Each panel plots throughput (1000 x TX/s) vs. average TX latency (msec). Wren outperforms Cure and H-Cure with both small and large transactions.
and keep the others at their default values. Transactions access
keys within a partition according to a zipfian distribution,
with parameter 0.99, which is the default in YCSB and
resembles the strong skew that characterizes many production
systems [26], [36], [37]. We use small items (8 bytes), which
are prevalent in many production workloads [26], [36]. With
bigger items Wren would retain the benefits of its nonblocking
reads. The effectiveness of BDT and BiST would naturally
decrease as the size of the items increases, because meta-data
overhead would become less critical.
B. Performance evaluation
Latency and throughput. Figure 3a reports the average
transaction latency vs. throughput achieved by Wren, H-Cure
and Cure with the default workload. Wren achieves up to
2.33x lower response times than Cure, because it never blocks
a read due to clock skew or to wait for a snapshot to be
installed. Wren also achieves up to 25% higher throughput
than Cure. Cure needs a higher number of concurrent clients
to fully utilize the processing power left idle by blocked reads.
The presence of more threads creates more contention on the
physical resources and implies more synchronization to block
and unblock reads, which ultimately leads to lower throughput.
Wren also outperforms H-Cure, achieving up to 40% lower latency and up to 15% higher throughput. HLCs enable H-Cure to avoid blocking the read of a transaction T because of clock skew. This blocking happens on a partition if the local timestamp of T’s snapshot is t, there are no pending or committed transactions on the partition with commit timestamp lower than t, but the physical clock on the partition is lower than t. HLCs, however, cannot avoid blocking T if there are pending transactions on the partition, and T is assigned a snapshot that has not been installed on the partition.
Statistics on blocking in Cure and H-Cure. Figure 3b provides insights on the blocking occurring in Cure and H-Cure, which leads to the aforementioned performance differences. The plots show the mean blocking time of transactions that block upon reading. A transaction T is considered blocked if at least one of its individual reads blocks. The blocking time of T is computed as the maximum blocking time of a read belonging to T.
Blocking can take up a vast portion of a transaction execution time. In Cure, blocking reads introduce a delay of 2 milliseconds at low load, and almost 4 milliseconds at high load (without considering overload conditions). These values correspond to 35–48% of the total mean transaction execution time. Similar considerations hold for H-Cure. The blocking
time increases with the load, because higher load leads to
more transactions being inserted in the pending and commit
queues, and to higher latency between the time a transaction
is committed and the corresponding snapshot is installed.
C. Varying the workload
Figure 4a and Figure 4b report the average transaction
latency as a function of the load for the 90:10 and 50:50
r:w ratios, respectively. Figure 5a and Figure 5b report the
same metric with the default r:w ratio of 95:5, but with p = 2 and p = 8 partitions involved in a transaction, respectively.
These figures show that Wren delivers better performance than
Cure and H-Cure for a wide range of workloads. It achieves
transaction latencies up to 3.6x lower than Cure, and up to
1.6x lower than H-Cure. It achieves maximum throughput up
to 1.33x higher than Cure and 1.23x higher than H-Cure. The
Fig. 6: Throughput achieved by Wren when increasing the number of partitions per DC (a: 4, 8 and 16 partitions/DC with 3 DCs) and the number of DCs (b: 3 and 5 DCs with 16 partitions/DC), for the 95:5, 90:10 and 50:50 r:w ratios. Each bar represents the throughput of Wren normalized w.r.t. Cure (y axis starts from 1). The number on top of each bar reports the absolute value of the throughput achieved by Wren in 1000 x TX/s. Wren consistently achieves better throughput than Cure and achieves good scalability both when increasing the number of partitions and the number of DCs.
peak throughput of all three systems decreases with a lower
r:w ratio, because writing more items increases the duration of
the commit and the replication overhead. Similarly, a higher
value of p decreases throughput, because more partitions are
contacted during a transaction.
D. Varying the number of partitions
Figure 6a reports the throughput achieved by Wren with 4,
8 and 16 partitions per DC. The bars represent the throughput
of Wren normalized with respect to the throughput achieved
by Cure in the same setting. The number on top of each bar
represents the absolute throughput achieved by Wren.
The plots show three main results. First, Wren consistently achieves higher throughput than Cure, with a maximum improvement of 38%. Second, the performance improvement of Wren is more evident with more partitions and lower r:w ratios. More partitions touched by transactions and more writes increase the chances that a read in Cure targets a laggard partition and blocks, leading to higher latencies, lower resource efficiency, and worse throughput. Third, Wren provides efficient support for application scale-out. When increasing the number of partitions from 4 to 16, throughput increases by 3.76x for the write-heavy and 3.88x for the read-heavy workload, approximating the ideal improvement of 4x.
Fig. 7: (a) Bytes exchanged for replication and stabilization, normalized w.r.t. Cure: BiST incurs lower overhead than Cure to track the dependencies of replicated updates and to determine transactional snapshots (default workload). (b) CDF of update visibility latency (msec): Wren achieves a slightly higher remote update visibility latency w.r.t. Cure, and makes local updates visible when they are within the local stable snapshot (3 DCs).
E. Varying the number of DCs
Figure 6b shows the throughput achieved by Wren with
3 and 5 DCs (16 partitions per DC). The bars represent the
throughput normalized with respect to Cure’s throughput in
the same scenario. The numbers on top of the bars indicate
the absolute throughput achieved by Wren.
Wren obtains higher throughput than Cure for all workloads,
achieving an improvement of up to 43%. Wren’s performance
gains are higher with 5 DCs, because the meta-data overhead
is constant in BiST, while in Cure it grows linearly with the
number of DCs. The throughput achieved by Wren with 5
DCs is 1.53x, 1.49x, and 1.44x higher than the throughput
achieved with 3 DCs, for the 95:5, 90:10 and 50:50 workloads,
respectively, approximating the ideal improvement of 1.66x.
A higher write intensity reduces the performance gain when
scaling from 3 to 5 DCs, because it implies more updates
being replicated.
F. Resource efficiency
Figure 7a shows the amount of data exchanged in Wren
to run the stabilization protocol and to replicate updates,
with the default workload. The results are normalized with
respect to the amounts of data exchanged in Cure at the same
throughput. With 5 DCs, Wren exchanges up to 37% fewer
bytes for replication and up to 60% fewer bytes for running
the stabilization protocol. With 5 DCs, updates, snapshots and
stabilization messages carry 2 timestamps in Wren versus 5
in Cure.
G. Update visibility
Figure 7b shows the CDF of the update visibility latency with 3 DCs. The visibility latency of an update X in DC_i is the difference between the wall-clock time when X becomes visible in DC_i and the wall-clock time when X was committed in its original DC (which is DC_i itself in the case of local visibility latency). The CDFs are computed as follows: we first obtain the CDF on every partition and then we compute the mean for each percentile.
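A sketch of this aggregation (hypothetical Python; the nearest-rank percentile is our illustrative choice, as the paper does not specify one):

```python
import statistics

def percentile(samples, p):
    # Nearest-rank percentile over one partition's latency samples.
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

def mean_cdf(per_partition_samples, percentiles=(50, 90, 99)):
    # Average, percentile by percentile, the per-partition CDFs.
    return {p: statistics.mean(percentile(s, p)
                               for s in per_partition_samples)
            for p in percentiles}
```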
Cure achieves lower update visibility latencies than Wren. The remote update visibility time in Wren is slightly higher than in Cure (68 vs. 59 milliseconds in the worst case, i.e., 15% higher), because Cure tracks dependencies at the granularity of the DC, while Wren only tracks local and remote dependencies (see §III-C, Figure 2). Local updates become visible immediately in Cure. In Wren they become visible after a few milliseconds, because Wren chooses a slightly older snapshot. We argue that these slightly higher update visibility latencies are a small price to pay for the performance improvements offered by Wren.
TCC systems. In Cure [8] a transaction T can be assigned a snapshot that has not been installed by some partitions. If T reads from any such laggard partition, it blocks. Wren,
on the contrary, achieves low-latency nonblocking reads by
either reading from a snapshot that is already installed in all
partitions, or from the client-side cache.
Occult [34] implements a master-slave design in which
only the master replica of a partition accepts writes and
replicates them asynchronously. The commit of a transaction,
then, may span multiple DCs. A replicated item can be read
before its causal dependencies are received, hence achieving
the lowest data staleness. However, a read may have to be retried several times in case of missing dependencies, and may even have to contact the remote master replica, which might not be accessible due to a network partition. The effect of retrying in Occult has a negative impact on performance that is comparable to blocking the read to receive the right
value to return. Wren, instead, implements always-available
transactions that complete wholly within a DC, and never
block nor retry read operations.
In SwiftCloud [16] clients declare the items in which they
are interested, and the system sends them the corresponding
updates, if any. SwiftCloud uses a sequencer-based approach,
which totally orders updates, both those generated in a DC
and those received from remote DCs. The sequencer-based
approach ensures that the stream of updates pushed to clients
is causally consistent. However, sequencing the updates also
makes it cumbersome to achieve horizontal scalability. Wren,
instead, implements decentralized protocols that efficiently
enable horizontal scalability.
Cure and SwiftCloud use dependency vectors with one entry
per DC. Occult uses one dependency timestamp per master
replica. By contrast, Wren timestamps items and snapshots
with constant dependency meta-data, which increases resource
efficiency and scalability.
The trade-off that Wren makes to achieve low latency,
availability and scalability is that it exposes snapshots slightly
older than those exposed by other TCC systems.
CC systems. Many CC systems provide weaker semantics
than TCC. COPS [15], Orbe [24], GentleRain [20], ChainReaction [25], POCC [38] and COPS-SNOW [7] implement
read-only transactions. Eiger [21] additionally supports write-
only transactions. These systems either block a read while
waiting for the receipt of remote updates [20], [24], [38],
require a large amount of meta-data [7], [15], [21], or rely
on a sequencer process per DC [25].
Highly available transactional systems. Bailis et al. [39],
[40] propose several flavors of transactional protocols that are
available and support read-write transactions. These protocols
rely on fine-grained dependency tracking and enforce a consistency level that is weaker than CC. TARDiS [41] supports
merge functions over conflicting states of the application,
rather than at key granularity. This flexibility requires a significant amount of meta-data and a resource-intensive garbage
collection scheme to prune old states. Moreover, TARDiS does
not implement sharding. GSP [42] is an operational model
for replicated data that supports highly available transactions.
GSP targets non-partitioned data stores and uses a system-wide
broadcast primitive to totally order the updates. Wren, instead,
is designed for applications that scale-out by sharding and
achieves scalability and consistency by lightweight protocols.
Strongly consistent transactional systems. Many systems
support geo-replication with consistency guarantees stronger
than CC (e.g., Spanner [1], Walter [43], Gemini [44],
Lynx [45], Jessy [46], Clock-SRM[47], SDUR [48] and
Droopy [49]). These systems require cross-DC coordination to
commit transactions, hence they are not always-available [11],
[19]. Wren targets a class of applications that can tolerate a
weaker form of consistency, and for these applications it provides low latency, high throughput, scalability and availability.
Client-side caching. Caching at the client side is a technique
primarily used to support disconnected clients, especially in
mobile and wide area network settings [50], [51], [52]. Wren,
instead, uses client-side caching to guarantee consistency.
We have presented Wren, the first TCC system that at the same time implements nonblocking reads, thereby achieving low latency, and allows applications to scale out by sharding.
Wren implements a novel transactional protocol, CANToR,
that defines transaction snapshots as the union of a fresh causal
snapshot and the contents of a client-side cache. Wren also
introduces BDT, a new dependency tracking protocol, and
BiST, a new stabilization protocol. BDT and BiST use only 2
timestamps per update and per snapshot, enabling scalability
regardless of the size of the system. We have compared Wren
with the state-of-the-art TCC system, and we have shown
that Wren achieves lower latencies and higher throughput,
while only slightly penalizing the freshness of data exposed
to clients.
We thank the anonymous reviewers, Fernando Pedone,
Sandhya Dwarkadas, Richard Sites and Baptiste Lepers for
their valuable suggestions and helpful comments. This research has been supported by the Swiss National Science Foundation through Grant No. 166306, by an EcoCloud postdoctoral research fellowship, and by Amazon through AWS
Cloud Credits.
[1] J. C. Corbett, J. Dean, M. Epstein, and et al., “Spanner: Google’s
Globally-distributed Database,” in Proc. of OSDI, 2012.
[2] G. DeCandia, D. Hastorun, M. Jampani, and et al., “Dynamo: Amazon’s
Highly Available Key-value Store,” in Proc. of SOSP, 2007.
[3] R. Nishtala, H. Fugal, S. Grimm, and et al., “Scaling Memcache at
Facebook,” in Proc. of NSDI, 2013.
[4] S. A. Noghabi, S. Subramanian, P. Narayanan, and et al., “Ambry: LinkedIn’s Scalable Geo-Distributed Object Store,” in Proc. of SIGMOD, 2016.
[5] A. Verbitski, A. Gupta, D. Saha, and et al., “Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases,” in Proc. of SIGMOD, 2017.
[6] F. Cruz, F. Maia, M. Matos, R. Oliveira, J. Paulo, J. Pereira, and R. Vilaça, “MeT: Workload Aware Elasticity for NoSQL,” in Proc. of EuroSys, 2013.
[7] H. Lu, C. Hodsdon, K. Ngo, S. Mu, and W. Lloyd, “The SNOW Theorem
and Latency-Optimal Read-Only Transactions,” in OSDI, 2016.
[8] D. D. Akkoorath, A. Tomsic, M. Bravo, and et al., “Cure: Strong
semantics meets high availability and low latency,” in Proc. of ICDCS, 2016.
[9] M. Ahamad, G. Neiger, J. E. Burns, P. Kohli, and P. W. Hutto, “Causal
Memory: Definitions, Implementation, and Programming,” Distributed
Computing, vol. 9, no. 1, pp. 37–49, 1995.
[10] H. Attiya, F. Ellen, and A. Morrison, “Limitations of Highly-Available
Eventually-Consistent Data Stores,” in Proc. of PODC, 2015.
[11] P. Mahajan, L. Alvisi, and M. Dahlin, “Consistency, Availability, Con-
vergence,” Computer Science Department, University of Texas at Austin,
Tech. Rep. TR-11-22, May 2011.
[12] M. P. Herlihy and J. M. Wing, “Linearizability: A Correctness Condition
for Concurrent Objects,” ACM Trans. Program. Lang. Syst., vol. 12,
no. 3, pp. 463–492, Jul. 1990.
[13] K. Birman, A. Schiper, and P. Stephenson, “Lightweight Causal and
Atomic Group Multicast,” ACM Trans. Comput. Syst., vol. 9, no. 3, pp.
272–314, Aug. 1991.
[14] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat, “Providing High Availability Using Lazy Replication,” ACM Trans. Comput. Syst., vol. 10, no. 4, pp. 360–391, Nov. 1992.
[15] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen, “Don’t
Settle for Eventual: Scalable Causal Consistency for Wide-area Storage
with COPS,” in Proc. of SOSP, 2011.
[16] M. Zawirski, N. Preguiça, S. Duarte, and et al., “Write Fast, Read in
the Past: Causal Consistency for Client-Side Applications,” in Proc. of
Middleware, 2015.
[17] P. Bailis, A. Ghodsi, J. M. Hellerstein, and I. Stoica, “Bolt-on Causal
Consistency,” in Proc. of SIGMOD, 2013.
[18] L. Lamport, “Time, Clocks, and the Ordering of Events in a Distributed
System,” Commun. ACM, vol. 21, no. 7, pp. 558–565, Jul. 1978.
[19] E. A. Brewer, “Towards Robust Distributed Systems (Abstract),” in Proc.
of PODC, 2000.
[20] J. Du, C. Iorgulescu, A. Roy, and W. Zwaenepoel, “GentleRain: Cheap
and Scalable Causal Consistency with Physical Clocks,” in Proc. of
SoCC, 2014.
[21] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen, “Stronger
Semantics for Low-latency Geo-replicated Storage,” in Proc. of NSDI, 2013.
[22] R. H. Thomas, “A Majority Consensus Approach to Concurrency
Control for Multiple Copy Databases,” ACM Trans. Database Syst.,
vol. 4, no. 2, pp. 180–209, Jun. 1979.
[23] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski, “Conflict-free Replicated Data Types,” in Proc. of SSS, 2011.
[24] J. Du, S. Elnikety, A. Roy, and W. Zwaenepoel, “Orbe: Scalable Causal
Consistency Using Dependency Matrices and Physical Clocks,” in Proc.
of SoCC, 2013.
[25] S. Almeida, J. Leitão, and L. Rodrigues, “ChainReaction: A Causal+
Consistent Datastore Based on Chain Replication,” in Proc. of EuroSys,
2013.
[26] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny,
“Workload Analysis of a Large-scale Key-value Store,” in Proc. of SIGMETRICS, 2012.
[27] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone,
“Logical Physical Clocks,” in Proc. of OPODIS, 2014.
[28] C. Gunawardhana, M. Bravo, and L. Rodrigues, “Unobtrusive Deferred
Update Stabilization for Efficient Geo-Replication,” in Proc. of ATC, 2017.
[29] M. Roohitavaf, M. Demirbas, and S. Kulkarni, “CausalSpartan: Causal
Consistency for Distributed Data Stores using Hybrid Logical Clocks,”
in Proc. of SRDS, 2017.
[30] L. Lamport, “The Part-time Parliament,” ACM Trans. Comput. Syst.,
vol. 16, no. 2, pp. 133–169, May 1998.
[31] H. Lu, K. Veeraraghavan, P. Ajoux, et al., “Existential Consistency:
Measuring and Understanding Consistency at Facebook,” in Proc. of
SOSP, 2015.
[32] J. Du, S. Elnikety, and W. Zwaenepoel, “Clock-SI: Snapshot isolation
for partitioned data stores using loosely synchronized clocks,” in Proc.
of SRDS, 2013.
[33] D. Didona, K. Spirovska, and W. Zwaenepoel, “Okapi: Causally Consis-
tent Geo-Replication Made Faster, Cheaper and More Available,” ArXiv
e-prints, Feb. 2017.
[34] S. A. Mehdi, C. Littley, N. Crooks, L. Alvisi, N. Bronson, and W. Lloyd,
“I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No
Slowdown Cascades,” in Proc. of NSDI, 2017.
[35] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears,
“Benchmarking Cloud Serving Systems with YCSB,” in Proc. of SoCC, 2010.
[36] F. Nawab, V. Arora, D. Agrawal, and A. El Abbadi, “Minimizing
Commit Latency of Transactions in Geo-Replicated Data Stores,” in
Proc. of SIGMOD, 2015.
[37] O. Balmau, D. Didona, R. Guerraoui, et al., “TRIAD: Creating
Synergies Between Memory, Disk and Log in Log Structured Key-Value
Stores,” in Proc. of ATC, 2017.
[38] K. Spirovska, D. Didona, and W. Zwaenepoel, “Optimistic Causal
Consistency for Geo-Replicated Key-Value Stores,” in Proc. of ICDCS, 2017.
[39] P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and
I. Stoica, “Highly Available Transactions: Virtues and Limitations,”
Proc. VLDB Endow., vol. 7, no. 3, pp. 181–192, Nov. 2013.
[40] P. Bailis, A. Fekete, J. M. Hellerstein, et al., “Scalable Atomic
Visibility with RAMP Transactions,” in Proc. of SIGMOD, 2014.
[41] N. Crooks, Y. Pu, N. Estrada, T. Gupta, L. Alvisi, and A. Clement,
“TARDiS: A Branch-and-Merge Approach To Weak Consistency,” in
Proc. of SIGMOD, 2016.
[42] S. Burckhardt, D. Leijen, J. Protzenko, and M. Fahndrich, “Global
Sequence Protocol: A Robust Abstraction for Replicated Shared State,”
in Proc. of ECOOP, 2015.
[43] Y. Sovran, R. Power, M. K. Aguilera, and J. Li, “Transactional Storage
for Geo-replicated Systems,” in Proc. of SOSP, 2011.
[44] V. Balegas, C. Li, M. Najafzadeh, et al., “Geo-Replication: Fast If
Possible, Consistent If Necessary,” Data Engineering Bulletin, vol. 39,
no. 1, pp. 81–92, Mar. 2016.
[45] Y. Zhang, R. Power, S. Zhou, et al., “Transaction Chains: Achieving
Serializability with Low Latency in Geo-distributed Storage Systems,”
in Proc. of SOSP, 2013.
[46] M. S. Ardekani, P. Sutra, and M. Shapiro, “Non-monotonic Snapshot
Isolation: Scalable and Strong Consistency for Geo-replicated Transac-
tional Systems,” in Proc. of SRDS, 2013.
[47] J. Du, D. Sciascia, S. Elnikety, W. Zwaenepoel, and F. Pedone, “Clock-
RSM: Low-Latency Inter-datacenter State Machine Replication Using
Loosely Synchronized Physical Clocks,” in Proc. of DSN, 2014.
[48] D. Sciascia and F. Pedone, “Geo-replicated storage with scalable de-
ferred update replication,” in Proc. of DSN, 2013.
[49] S. Liu and M. Vukolić, “Leader Set Selection for Low-Latency Geo-
Replicated State Machine,” IEEE Transactions on Parallel and Dis-
tributed Systems, vol. 28, no. 7, pp. 1933–1946, July 2017.
[50] M. E. Bjornsson and L. Shrira, “BuddyCache: High-performance Object
Storage for Collaborative Strong-consistency Applications in a WAN,”
in Proc. of OOPSLA, 2002.
[51] D. Perkins, N. Agrawal, A. Aranya, et al., “Simba: Tunable End-
to-end Data Consistency for Mobile Apps,” in Proc. of EuroSys, 2015.
[52] I. Zhang, N. Lebeck, P. Fonseca, et al., “Diamond: Automating
Data Management and Storage for Wide-Area, Reactive Applications,”
in Proc. of OSDI, 2016.
Causal consistency (CC) is an attractive consistency model for geo-replicated data stores because it hits a sweet spot in the ease-of-programming versus performance trade-off. We present a new approach for implementing CC in geo-replicated data stores, which we call Optimistic Causal Consistency (OCC). OCC's main design goal is to maximize data freshness. The optimism in our approach lies in the fact that the updates replicated to a remote data center are made visible immediately, without checking if their causal dependencies have been received. Servers perform the dependency check needed to enforce CC only upon serving a client operation, rather than on receipt of a replicated data item as in existing systems. OCC offers a significant gain in data freshness, which is of crucial importance for various types of applications, such as real-time systems. OCC's potentially blocking behavior makes it vulnerable to network partitions. We therefore propose a recovery mechanism that allows an OCC system to fall back on a pessimistic protocol to continue operating during network partitions. We implement POCC, the first causally consistent geo-replicated multi-master key-value data store designed to maximize data freshness. We show that POCC improves data freshness, while offering comparable or better performance than its pessimistic counterparts.
Causal consistency has attracted considerable attention in distributed systems because it meets high-availability and high-performance requirements in the presence of network partitions. Existing causal consistency models seldom pay attention to hot-data governance and run the data stabilization process periodically, and thus fail to meet user requirements for real-time data and high concurrency. In response to this problem, this study proposes a model based on hot-data governance, the Horae model, which simplifies causal-order verification by sorting, accelerates data stabilization, and optimizes the model's update visibility and read-response latency. Furthermore, the Horae model stores hot data, reduces the number of partition loads, increases operation parallelism, and ultimately improves throughput. Theoretical analysis and simulation experiments show that the proposed model outperforms existing models in terms of throughput, read response time, and update visibility.
Data consistency has always been a significant topic in distributed systems. Among existing consistency models, causal consistency attracts particular attention because it can meet high-performance requirements even when the system suffers network partitions. The synchronization method between replicas is one of the key factors affecting the performance of causal consistency, especially when the system contains a large number of nodes. For such large deployments, this paper optimizes the synchronization mode between data centers and proposes a causal consistency model based on a grouping strategy (Gart). Gart manages all nodes in groups to reduce the management cost of data synchronization and adopts a leader mechanism to improve the management efficiency of the system. At the same time, a client migration mechanism is introduced to ensure that throughput can be improved without sacrificing remote update visibility. Simulation results demonstrate that, compared with existing causal consistency models, Gart achieves better throughput when handling a large number of nodes and, under the same communication delay, higher update visibility.
Current causal consistency models struggle with the high synchronization overhead and response delays found in cloud storage systems. This paper proposes a causal consistency model for distributed storage based on partial geographical replication and a Cloud-Edge collaboration structure (PGCE). The model builds on the distributed network architecture of Cloud-Edge collaboration: a hash function divides the cloud dataset into multiple subsets, which are stored on edge nodes close to the user network to realize partial geo-replication. At the same time, a timestamp stabilization mechanism and a metadata processing service enforce data consistency between nodes while preserving causality, reducing the overhead of metadata processing and data synchronization. Clients interact directly with the edge nodes, which reduces the response delay compared with interacting with the cloud DC. An evaluation of PGCE against existing models shows that it achieves better response latency and throughput.
Data consistency is a critical topic in distributed systems. In existing consistency models, causal consistency has attracted a significant amount of attention because it can satisfy high-performance requirements even in the presence of network partitions. At present, most causal consistency models face a tradeoff between throughput and update visibility. Simultaneously, they cannot take full advantage of partial geo-replication. To resolve these problems, this paper proposes a causal consistency model that supports partial replication using an adjacency list, called Adjoin. In Adjoin, each data center (DC) stores only a subset of the full data; by reading adjacency relationships, the relevant nodes quickly reach synchronization. We also introduce the Adjacency Stable Vector and Adjacency Dependency Set to capture causality, which reduces the system's storage overhead. We evaluate Adjoin with different workloads on a cloud platform using multiple sites. The results show that Adjoin performs well in terms of throughput and update visibility compared with previous causal consistency models.
Causal consistency is an intermediate consistency model that can be achieved together with high availability and high-performance requirements even in the presence of network partitions. In the context of partitioned data stores, it has been shown that implicit dependency tracking using clocks is more efficient than explicit dependency tracking by sending dependency check messages. Existing clock-based solutions depend on monotonic physical clocks that are closely synchronized. These requirements make current protocols vulnerable to clock anomalies. In this paper, we propose a new clock-based algorithm, CausalSpartan, that instead of physical clocks, utilizes Hybrid Logical Clocks (HLCs). We show that using HLCs, without any overhead, we make the system robust to physical clock anomalies. This improvement is more significant in the context of query amplification, where a single query results in multiple GET/PUT operations. We also show that CausalSpartan decreases the visibility latency for a given data item compared to existing clock-based approaches. In turn, this reduces the completion time of collaborative applications where two clients accessing two different replicas edit the same items of the data store. Like previous protocols, CausalSpartan assumes that a given client does not access more than one replica. We show that in the presence of network partitions, this assumption (made in several other works) is essential if one were to provide causal consistency as well as immediate availability to local updates.
We present TRIAD, a new persistent key-value (KV) store based on Log-Structured Merge (LSM) trees. TRIAD improves LSM KV throughput by reducing the write amplification arising in the maintenance of the LSM tree structure. Although occurring in the background, write amplification consumes significant CPU and I/O resources. By reducing write amplification, TRIAD allows these resources to be used instead to improve user-facing throughput. TRIAD uses a holistic combination of three techniques. At the LSM memory component level, TRIAD leverages skew in data popularity to avoid frequent I/O operations on the most popular keys. At the storage level, TRIAD amortizes management costs by deferring and batching multiple I/O operations. At the commit log level, TRIAD avoids duplicate writes to storage. We implement TRIAD as an extension of Facebook’s RocksDB and evaluate it with production and synthetic workloads. With these workloads, TRIAD yields up to 193% improvement in throughput. It reduces write amplification by a factor of up to 4x, and decreases the amount of I/O by an order of magnitude.
In this paper we present a new approach to implementing causal consistency in geo-replicated data stores, which we call Optimistic Causal Consistency (OCC). The optimism in our approach lies in the fact that updates from a remote data center are immediately made visible in the local data center, without checking if their causal dependencies have been received. Servers perform the dependency check needed to enforce causal consistency only upon serving a client operation, rather than on the receipt of a replicated data item as in existing systems. OCC explores a novel trade-off in the landscape of causal consistency protocols. The potentially blocking behavior of OCC makes it vulnerable to network partitions. Because network partitions are rare in practice, however, OCC chooses to trade availability to maximize data freshness and reduce the communication overhead. We further propose a recovery mechanism that allows an OCC system to fall back on a pessimistic protocol to continue operating even during network partitions. POCC is an implementation of OCC based on physical clocks. We show that OCC improves data freshness, while offering comparable or better performance than its pessimistic counterpart.
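The read-vs-replication trade-off described above lends itself to a small illustration. The following is a hedged sketch, not the paper's actual protocol, and every class and field name here is hypothetical: remote updates are installed immediately, and the causal-dependency check is deferred to read time, where a read may have to wait until missing dependencies arrive.

```python
class BlockedOnDependencies(Exception):
    """Raised when a read must wait for missing causal dependencies."""

class Server:
    def __init__(self):
        self.store = {}      # key -> (value, deps); deps: {key: version}
        self.installed = {}  # key -> highest locally installed version

    def apply_remote(self, key, value, version, deps):
        # Optimistic step: make the replicated update visible at once,
        # without verifying that its causal dependencies have arrived.
        self.store[key] = (value, deps)
        self.installed[key] = version

    def read(self, key):
        value, deps = self.store[key]
        # Deferred check: enforce causal consistency only when a client
        # actually reads, blocking if any dependency is still missing.
        missing = {k: v for k, v in deps.items()
                   if self.installed.get(k, 0) < v}
        if missing:
            raise BlockedOnDependencies(missing)  # a real server would wait
        return value
```

A read whose dependencies are all installed returns immediately; otherwise the server waits, which is why the abstract pairs this scheme with a pessimistic fallback for network partitions.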
Okapi is a new causally consistent geo-replicated key-value store. Okapi leverages two key design choices to achieve high performance. First, it relies on hybrid logical/physical clocks to achieve low latency even in the presence of clock skew. Second, Okapi achieves higher resource efficiency and better availability, at the expense of a slight increase in update visibility latency. To this end, Okapi implements a new stabilization protocol that uses a combination of vector and scalar clocks and makes a remote update visible when its delivery has been acknowledged by every data center. We evaluate Okapi with different workloads on Amazon AWS, using three geographically distributed regions and 96 nodes. We compare Okapi with two recent approaches to causal consistency, Cure and GentleRain. We show that Okapi delivers up to two orders of magnitude better performance than GentleRain and that Okapi achieves up to 3.5x lower latency and a 60% reduction of the meta-data overhead with respect to Cure.
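The ack-based visibility rule can be illustrated with a toy computation. This is a sketch under simplifying assumptions, not Okapi's actual clock scheme, and the names are illustrative: a remote update becomes visible once its timestamp is covered by the minimum timestamp acknowledged by every data center.

```python
def stable_timestamp(acks):
    """acks maps each peer data center to the highest update timestamp
    whose delivery it has acknowledged; anything at or below the minimum
    has been acknowledged everywhere."""
    return min(acks.values())

def is_visible(update_ts, acks):
    # A remote update is made visible only after every data center
    # has acknowledged its delivery.
    return update_ts <= stable_timestamp(acks)
```

The minimum over all peers is what trades a slight increase in visibility latency for the resource efficiency and availability the abstract describes.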
In this paper we propose a novel approach to manage the throughput vs latency tradeoff that emerges when managing updates in geo-replicated systems. Our approach consists in allowing full concurrency when processing local updates and using a deferred local serialisation procedure before shipping updates to remote datacenters. This strategy makes it possible to implement inexpensive mechanisms to ensure system consistency requirements while avoiding intrusive effects on update operations, a major performance limitation of previous systems. We have implemented our approach as a variant of Riak KV. Our extensive evaluation shows that we outperform sequencer-based approaches by almost an order of magnitude in the maximum achievable throughput. Furthermore, unlike previous sequencer-free solutions, our approach reaches nearly optimal remote update visibility latencies without limiting throughput.
Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this paper, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic, but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. We then describe how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols. Finally, having operated Aurora as a production service for over 18 months, we share the lessons we have learnt from our customers on what modern cloud applications expect from databases.
There is a gap between the theory and practice of distributed systems in terms of the use of time. The theory of distributed systems shunned the notion of time, and introduced “causality tracking” as a clean abstraction to reason about concurrency. Practical systems employed physical time (NTP) information, but in a best-effort manner due to the difficulty of achieving tight clock synchronization. In an effort to bridge this gap and reconcile the theory and practice of distributed systems on the topic of time, we propose a hybrid logical clock, HLC, that combines the best of logical clocks and physical clocks. HLC captures the causality relationship like logical clocks, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of physical/NTP clocks since it maintains its logical clock to be always close to the NTP clock. Moreover, HLC fits into the 64-bit NTP timestamp format, and is masking-tolerant to NTP kinks and uncertainties. We show that HLC has many benefits for wait-free transaction ordering and performing snapshot reads in multiversion globally distributed databases.
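The update rules sketched in this abstract can be written down compactly. The following is a minimal sketch of Kulkarni et al.'s HLC algorithm; the class name and the injectable physical clock are illustrative choices, not from the paper. Each timestamp is a pair (l, c), where l never falls behind the physical clock and c breaks ties among events sharing the same l.

```python
import time

class HLC:
    """Hybrid Logical Clock: timestamps are (l, c) pairs."""

    def __init__(self, physical_clock=time.time):
        self.l = 0            # highest physical time observed so far
        self.c = 0            # logical counter for ties on l
        self.pt = physical_clock

    def now(self):
        """Timestamp a local or send event."""
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, m_l, m_c):
        """Merge the timestamp (m_l, m_c) carried by a received message."""
        l_old = self.l
        self.l = max(l_old, m_l, self.pt())
        if self.l == l_old and self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        return (self.l, self.c)
```

Because l tracks the physical clock, timestamps stay close to NTP time while still ordering causally related events, which is what makes picking consistent snapshots easy.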
Modern replicated data stores aim to provide high availability, by immediately responding to client requests, often by implementing objects that expose concurrency. Such objects, for example, multi-valued registers (MVRs), do not have sequential specifications. This paper explores a recent model for replicated data stores that can be used to precisely specify causal consistency for such objects, and liveness properties like eventual consistency, without revealing details of the underlying implementation. The model is used to prove the following results: 1) An eventually consistent data store implementing MVRs cannot satisfy a consistency model strictly stronger than observable causal consistency (OCC). OCC is a model somewhat stronger than causal consistency, which captures executions in which client observations can use causality to infer concurrency of operations. This result holds under certain assumptions about the data store. 2) Under the same assumptions, an eventually consistent and causally consistent replicated data store must send messages of size linear in the size of the system: If s objects, each Ω(lg k) bits in size, are supported by n replicas, then there is an execution in which an Ω(n s lg k)-bit message is sent.
Collaborative applications provide a shared work environment for groups of networked clients collaborating on a common task. They require strong consistency for shared persistent data and efficient access to fine-grained objects. These properties are difficult to provide in wide-area networks because of high network latency. BuddyCache is a new transactional caching approach that improves the latency of access to shared persistent objects for collaborative strong-consistency applications in high-latency network environments. The challenge is to improve performance while providing the correctness and availability properties of a transactional caching protocol in the presence of node failures and slow peers. We have implemented a BuddyCache prototype and evaluated its performance. Analytical results, confirmed by measurements of the BuddyCache prototype using the multi-user OO7 benchmark, indicate that for typical Internet latencies, e.g. ranging from 40 to 80 milliseconds round trip time to the storage server, peers using BuddyCache can reduce by up to 50% the latency of access to shared objects compared to accessing the remote servers directly.
Modern planetary-scale distributed systems largely rely on a State Machine Replication protocol to keep their service reliable, yet it comes with a specific challenge: latency, bounded by the speed of light. In particular, clients of a single-leader protocol, such as Paxos, must communicate with the leader, which must in turn communicate with other replicas: inappropriate selection of a leader may result in unnecessary round-trips across the globe. To cope with this limitation, several all-leader and leaderless alternatives have been proposed recently. Unfortunately, none of them fits all circumstances. In this article we argue that the “right” choice of the number of leaders depends on a given replica configuration and the workload. Then we present Droopy and Dripple, two sister approaches built upon state machine replication protocols. Droopy dynamically reconfigures the set of leaders, whereas Dripple coordinates state partitions wisely, so that each partition can be reconfigured (by Droopy) separately. Our experimental evaluation on Amazon EC2 shows that Droopy and Dripple reduce latency under imbalanced or localized workloads, compared to their native protocol. When most requests are non-commutative, our approaches do not affect the performance of their native protocol and both outperform a state-of-the-art leaderless protocol.