Efficient State-based CRDTs by Delta-Mutation
Paulo Sérgio Almeida, Ali Shoker, and Carlos Baquero
HASLab/INESC TEC and Universidade do Minho, Portugal
Abstract. CRDTs are distributed data types that make eventual con-
sistency of a distributed object possible and non ad-hoc. Specifically,
state-based CRDTs achieve this by sharing local state changes through
shipping the entire state, that is then merged to other replicas with
an idempotent, associative, and commutative join operation, ensuring
convergence. This imposes a large communication overhead as the state
size becomes larger. We introduce Delta State Conflict-Free Replicated
Datatypes (δ-CRDT), which make use of δ-mutators, defined in such a
way to return a delta-state, typically, with a much smaller size than the
full state. Delta-states are joined to the local state as well as to the
remote states (after being shipped). This can achieve the best of both
worlds: small messages with an incremental nature, as in operation-based
CRDTs, disseminated over unreliable communication channels, as in tra-
ditional state-based CRDTs. We introduce the δ-CRDT framework, and
we explain it through establishing a correspondence to current state-
based CRDTs. In addition, we present two anti-entropy algorithms: a
basic one that provides eventual convergence, and another one that en-
sures both convergence and causal consistency. We also introduce two
δ-CRDT specifications of well-known replicated datatypes.
Keywords: Distributed systems. Eventual consistency. CRDT.
1 Introduction
Eventual consistency (EC) is a relaxed consistency model that is often adopted
by large-scale distributed systems [1,2,3] where availability must be maintained,
despite outages and partitioning, whereas delayed consistency is acceptable.
The limitations resulting from the CAP theorem [4] suggest trading strong
consistency for high availability. A typical approach in EC systems is to al-
low replicas of a distributed object to temporarily diverge, provided that they
can eventually be reconciled into a common state. To avoid application-specific
reconciliation methods, costly and error-prone, Conflict-Free Replicated Data
Types (CRDTs) [5,6] were introduced, allowing the design of self-contained dis-
tributed data types that are always available and eventually converge when all
operations are reflected at all replicas. Though CRDTs are being deployed in
practice [1], more work is still required to improve their design and performance.
CRDTs support two complementary designs: state-based, which disseminates
object states, and operation-based, which disseminates operations [5,6]. In a
state-based design [7,6] an operation is only executed on the local replica
state. A replica periodically propagates its local changes to other replicas by
shipping its entire state. A received state is incorporated with the local state
via a merge function that deterministically reconciles both states. To maintain
convergence, merge is defined as a join: a least upper bound over a
join-semilattice [7,6].
arXiv:1410.2803v1 [cs.DC] 10 Oct 2014
A major drawback in current state-based CRDTs is the communication over-
head of shipping the entire state, which can get very large in size. For instance,
the state size of a counter CRDT (a vector of integer counters, one per replica)
increases with the number of replicas; whereas in a grow-only Set, the state size
depends on the set size, that grows as more operations are invoked. This com-
munication overhead limits the use of state-based CRDTs to data-types with
small state size (e.g., counters are reasonable while sets are not); recently there
has been demand for CRDTs with large state sizes (e.g., in RIAK DT Maps [8]
that can compose multiple CRDTs).
In this paper, we rethink the way state-based CRDTs should be designed,
having in mind the problematic shipping of the entire state. Our aim is to ship a
representation of the effect of recent update operations on the state, rather than
the whole state, while preserving the idempotent nature of join; thus, allowing
unreliable communication, in contrast to operation-based CRDTs, which demand
exactly-once delivery and are prone to message replays. To achieve this,
we introduce Delta State-based CRDTs (δ-CRDT): a state is a join-semilattice
that results from the join of multiple fine-grained states, i.e., deltas, generated
by what we call δ-mutators; these are new versions of the datatype mutators
that return the effect of these mutators on the state. In this way, deltas can be
retained in a buffer to be shipped individually (or joined in groups) instead of
shipping the entire object. The changes to the local state are then incorporated
at other replicas by joining the shipped deltas with their own states.
A key point in our approach is a simple equation relating the novel δ-mutators
with the original CRDT mutators. The challenge when designing a new δ-CRDT
that corresponds to an existing CRDT is to derive δ-mutators that obey this
equation. In this paper, we prove that eventual consistency is guaranteed in
δ-CRDT as long as all deltas produced by δ-mutators are delivered and joined
at other replicas, and we present a corresponding simple anti-entropy algorithm.
We then focus on causal consistency, introducing the concept of delta-interval
and the causal delta-merging condition. Based on these, we then present an anti-
entropy algorithm for δ-CRDT, where sending and then joining delta-intervals
into another replica state produces the same effect as if the entire state had
been shipped and joined. We illustrate our approach by explaining a simple
counter δ-CRDT specification; and then we introduce a challenging non-trivial
specification for a widely used datatype: Optimized Add-Wins Observed-Remove
Sets. In addition, we make a basic δ-CRDT C++ library available online [9] for
various CRDTs. Our experience shows that a δ-CRDT version can be devised
for any CRDT; however, this requires some design effort that varies with the
complexity of different CRDTs.
1.1 System Model
Consider a distributed system with nodes containing local memory, with no
shared memory between them. Any node can send messages to any other node.
The network is asynchronous, there being no global clock, no bound on the time
it takes for a message to arrive, nor bounds on relative processing speeds. The
network is unreliable: messages can be lost, duplicated or reordered (but are
not corrupted). Some messages will, however, eventually get through: if a node
sends infinitely many messages to another node, infinitely many of these will be
delivered. In particular, this means that there can be arbitrarily long partitions,
but these will eventually heal. Nodes have access to durable storage; nodes can
crash but eventually will recover with the content of the durable storage as at the
time of the crash. Durable state is written atomically at each state transition.
Each node has access to its globally unique identifier in a set I.
2 A Background of State-based CRDTs
Conflict-Free Replicated Data Types [5,6] (CRDTs) are distributed datatypes
that allow different replicas of a distributed CRDT instance to diverge and
ensure that, eventually, all replicas converge to the same state. State-based
CRDTs achieve this through propagating updates of the local state by dissem-
inating the entire state across replicas. The received states are then merged to
remote states, leading to convergence.
A state-based CRDT consists of a triple $(S, M, Q)$, where $S$ is a
join-semilattice [10], $Q$ is a set of query functions (which return some result
without modifying the state), and $M$ is a set of mutators that perform updates;
a mutator $m \in M$ takes a state $X \in S$ as input and returns a new state
$X' = m(X)$. A join-semilattice is a set with a partial order $\sqsubseteq$ and a
binary join operation $\sqcup$ that returns the least upper bound (LUB) of two
elements in $S$; a join is designed to be commutative, associative, and
idempotent. Mutators are defined in such a way to be inflations, i.e., for any
mutator $m$ and state $X$, the following holds:
$$X \sqsubseteq m(X)$$
In this way, for each replica there is a monotonic sequence of states, defined under
the lattice partial order, where each subsequent state subsumes the previous state
when joined elsewhere.
Both query and mutator operations are always available since they are per-
formed using the local state without requiring inter-replica communication; how-
ever, as mutators are concurrently applied at distinct replicas, replica states will
likely diverge. Eventual convergence is then obtained using an anti-entropy pro-
tocol that periodically ships the entire local state to other replicas. Each replica
merges the received state with its local state using the join operation in S.
Given the mathematical properties of join, if mutators stop being issued, all
replicas eventually converge to the same state, i.e., the least upper bound of all
states involved. State-based CRDTs are interesting as they demand little guar-
antees from the dissemination layer, working under message loss, duplication,
reordering, and temporary network partitioning, without impacting availability
and eventual convergence.
$\Sigma = I \hookrightarrow \mathbb{N}$
$\sigma_i^0 = \{\}$
$\mathsf{inc}_i(m) = m\{i \mapsto m(i) + 1\}$
$\mathsf{value}_i(m) = \sum_{j \in I} m(j)$
$m \sqcup m' = \{(i, \max(m(i), m'(i))) \mid i \in I\}$
Fig. 1: State-based Counter CRDT; replica i.
Fig. 1 represents a state-based increment-only counter. The CRDT state
$\Sigma$ is a map from replica identifiers to positive integers. Initially,
$\sigma_i^0$ is an empty map (assuming that unmapped keys implicitly map to
zero, and only non-zero mappings are stored). A single mutator, i.e., inc, is
defined that increments the value corresponding to the local replica $i$
(returning the updated map). The query operation value returns the counter
value by adding the integers in the map entries. The join of two states is
the point-wise maximum of the maps.
The main weakness of state-based CRDTs is the cost of dissemination of
updates, as the full state is sent. In this simple example of counters, even though
increments only update the value corresponding to the local replica i, the whole
map will always be sent in messages though the other map values remained
intact (since no messages have been received and merged).
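The counter of Fig. 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `inc`, `value` and `join`, the use of a `dict` for the map, and the string replica identifiers are choices made here, not part of the paper or of its C++ library. Note that `inc` returns the full map, which is exactly what a state-based replica would have to ship.

```python
from typing import Dict

State = Dict[str, int]  # replica id -> count; a missing key stands for 0


def inc(i: str, m: State) -> State:
    """Standard mutator: returns the FULL updated map, not just the changed entry."""
    out = dict(m)
    out[i] = out.get(i, 0) + 1
    return out


def value(m: State) -> int:
    """Query: the counter value is the sum of all per-replica entries."""
    return sum(m.values())


def join(m1: State, m2: State) -> State:
    """Join: point-wise maximum over the union of keys."""
    return {k: max(m1.get(k, 0), m2.get(k, 0)) for k in m1.keys() | m2.keys()}


a = inc("a", inc("a", {}))  # replica a increments twice
b = inc("b", {})            # replica b increments once, concurrently
merged = join(a, b)
```

Joining `merged` with `a` again leaves it unchanged, which is the idempotence that lets the dissemination layer duplicate messages safely.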
It would be interesting to only ship the recent modification incurred on
the state. This is, however, not possible with the current model of state-based
CRDTs as mutators always return a full state. Approaches which simply ship
operations (e.g., an “increment n” message), like in operation-based CRDTs,
require reliable communication (e.g., because increment is not idempotent). In
contrast, our approach allows producing and encoding recent mutations in an in-
cremental way, while keeping the advantages of the state-based approach, namely
the idempotent, associative, and commutative properties of join.
3 Delta-state CRDTs
We introduce Delta-State Conflict-Free Replicated Data Types, or δ-CRDT for
short, as a new kind of state-based CRDTs, in which delta-mutators are defined
to return a delta-state: a value in the same join-semilattice which represents the
updates induced by the mutator on the current state.
Definition 1 (Delta-mutator). A delta-mutator $m^\delta$ is a function,
corresponding to an update operation, which takes a state $X$ in a
join-semilattice $S$ as parameter and returns a delta-mutation $m^\delta(X)$,
also in $S$.
Definition 2 (Delta-group). A delta-group is inductively defined as either a
delta-mutation or a join of several delta-groups.
Definition 3 (δ-CRDT). A δ-CRDT consists of a triple $(S, M^\delta, Q)$, where
$S$ is a join-semilattice, $M^\delta$ is a set of delta-mutators, and $Q$ a set
of query functions, where the state transition at each replica is given by either
joining the current state $X \in S$ with a delta-mutation:
$$X' = X \sqcup m^\delta(X)$$
or joining the current state with some received delta-group $D$:
$$X' = X \sqcup D$$
In a δ-CRDT, the effect of applying a mutation, represented by a
delta-mutation $\delta = m^\delta(X)$, is decoupled from the resulting state
$X' = X \sqcup \delta$, which allows shipping this $\delta$ rather than the entire
resulting state $X'$. All state transitions in a δ-CRDT, even upon applying
mutations locally, are the result of some join with the current state. Unlike
standard CRDT mutators, delta-mutators do not need to be inflations in order to
inflate a state; this is however ensured by joining their output, i.e., deltas,
into the current state.
In principle, a delta could be shipped immediately to remote replicas once ap-
plied locally. For efficiency reasons, multiple deltas returned by applying several
delta-mutators can be joined locally into a delta-group and retained in a buffer.
The delta-group can then be shipped to remote replicas to be joined with their
local states. Received delta-groups can optionally be joined into their buffered
delta-group, allowing transitive propagation of deltas. A full state can be seen
as a special (extreme) case of a delta-group.
If the causal order of operations is not important and the intended aim is
merely eventual convergence of states, then delta-groups can be shipped using
an unreliable dissemination layer that may drop, reorder, or duplicate messages.
Delta-groups can always be re-transmitted and re-joined, possibly out of order,
or can simply be subsumed by a less frequent sending of the full state, e.g. for
performance reasons or when doing state transfers to new members. In Section 4,
we address state convergence when causal consistency is not required, and we
address the latter in Section 5.
3.1 Delta-state decomposition of standard CRDTs
A δ-CRDT $(S, M^\delta, Q)$ is a delta-state decomposition of a state-based CRDT
$(S, M, Q)$, if for every mutator $m \in M$, we have a corresponding mutator
$m^\delta \in M^\delta$ such that, for every state $X \in S$:
$$m(X) = X \sqcup m^\delta(X)$$
This equation states that applying a delta-mutator and joining into the current
state should produce the same state transition as applying the corresponding
mutator of the standard CRDT.
Given an existing state-based CRDT (which is always a trivial decomposition
of itself, i.e., $m(X) = X \sqcup m(X)$, as mutators are inflations), it will be
useful to find a non-trivial decomposition such that delta-states returned by
delta-mutators in $M^\delta$ are smaller than the resulting state:
$$\mathrm{size}(m^\delta(X)) \ll \mathrm{size}(m(X))$$
3.2 Example: δ-CRDT Counter
$\Sigma = I \hookrightarrow \mathbb{N}$
$\sigma_i^0 = \{\}$
$\mathsf{inc}_i^\delta(m) = \{i \mapsto m(i) + 1\}$
$\mathsf{value}_i(m) = \sum_{j \in I} m(j)$
$m \sqcup m' = \{(i, \max(m(i), m'(i))) \mid i \in I\}$
Fig. 2: A δ-CRDT counter; replica i.
Fig. 2 depicts a δ-CRDT specification of a counter datatype that is a
delta-state decomposition of the state-based counter in Fig. 1. The state, join
and value query operation remain as before. Only the mutator
$\mathsf{inc}^\delta$ is newly defined, which increments the map entry
corresponding to the local replica and only returns that entry, instead of
the full map as inc in the state-based CRDT counter does. This maintains
the original semantics of the counter while allowing the smaller deltas
returned by the delta-mutator to be sent, instead of the full map. As before,
the received payload (whether one or more deltas) might not include entries for
all keys in $I$, which are assumed to have zero values. The decomposition is easy
to understand in this example since the equation
$\mathsf{inc}_i(X) = X \sqcup \mathsf{inc}_i^\delta(X)$ holds as
$m\{i \mapsto m(i) + 1\} = m \sqcup \{i \mapsto m(i) + 1\}$.
In other words, the single value for key $i$ in the delta, corresponding to the
local replica identifier, will overwrite the corresponding one in $m$ since the
former maps to a higher value (i.e., using max). Here it can be noticed that:
(1) a delta is just a state, that can be joined possibly several times without
requiring exactly-once delivery, and without being a representation of the
“increment” operation (as in operation-based CRDTs), which is itself
non-idempotent; (2) joining deltas into a delta-group and disseminating
delta-groups at a lower rate than the operation rate reduces data communication
overhead, since multiple increments from a given source can be collapsed into a
single state counter.
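The two observations above can be checked on concrete values. The following Python sketch is illustrative only: `inc`, `inc_delta` and `join` are names chosen here, and the sample state `X` is made up. It verifies the decomposition equation for one state, shows that a duplicated delta is harmless, and shows two successive deltas from the same replica collapsing into a single-entry delta-group.

```python
from typing import Dict

State = Dict[str, int]  # replica id -> count; a missing key stands for 0


def join(m1: State, m2: State) -> State:
    """Point-wise maximum over the union of keys."""
    return {k: max(m1.get(k, 0), m2.get(k, 0)) for k in m1.keys() | m2.keys()}


def inc(i: str, m: State) -> State:
    """Standard mutator: returns the full updated map."""
    out = dict(m)
    out[i] = out.get(i, 0) + 1
    return out


def inc_delta(i: str, m: State) -> State:
    """Delta-mutator: returns only the entry of the local replica."""
    return {i: m.get(i, 0) + 1}


X = {"a": 4, "b": 7, "c": 1}          # made-up state at replica a
delta = inc_delta("a", X)             # a one-entry map

# Decomposition equation: inc(X) equals X joined with inc_delta(X).
decomposition_holds = inc("a", X) == join(X, delta)

# (1) Joining the same delta twice has no further effect.
once = join(X, delta)
twice = join(once, delta)

# (2) Two successive deltas collapse into one single-entry delta-group.
d1 = inc_delta("a", X)
d2 = inc_delta("a", join(X, d1))
group = join(d1, d2)
```

The delta-group here carries one map entry regardless of how many local increments it summarizes, whereas the full state carries one entry per replica.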
4 State Convergence
In the δ-CRDT execution model, and regardless of the anti-entropy algorithm
used, a replica state always evolves by joining the current state with some delta:
either the result of a delta-mutation, or some arbitrary delta-group (which itself
can be expressed as a join of delta-mutations). Therefore, all states can be ex-
pressed as joins of delta-mutations, which makes state convergence in δ-CRDT
easy to achieve: it is enough that all delta-mutations generated in the system
reach every replica, as expressed by the following proposition.
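inputs:
    $n_i \in \mathcal{P}(I)$, set of neighbors
    $t_i \in \mathbb{B}$, true for transitive mode
    $\mathsf{choose}_i \in S \times S \to S$, ship state or delta
durable state:
    $X_i \in S$, CRDT state; initially $X_i = \bot$
volatile state:
    $D_i \in S$, join of deltas; initially $D_i = \bot$
on $\mathsf{operation}_i(m^\delta)$
    $d = m^\delta(X_i)$
    $X_i' = X_i \sqcup d$
    $D_i' = D_i \sqcup d$
periodically
    $m = \mathsf{choose}_i(X_i, D_i)$
    for $j \in n_i$ do
        $\mathsf{send}_{i,j}(m)$
    $D_i' = \bot$
on $\mathsf{receive}_{j,i}(d)$
    $X_i' = X_i \sqcup d$
    if $t_i$ then
        $D_i' = D_i \sqcup d$
    else
        $D_i' = D_i$
Algorithm 1: Basic anti-entropy algorithm for δ-CRDT.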
Proposition 1. (δ-CRDT convergence) Consider a set of replicas of a δ-CRDT
object, replica $i$ evolving along a sequence of states
$X_i^0 = \bot, X_i^1, \ldots$, each replica performing delta-mutations of the
form $m^\delta(X_i^k)$ at some subset of its sequence of states, and evolving by
joining the current state either with self-generated deltas or with delta-groups
received from others. If each delta-mutation $m^\delta(X_i^k)$ produced at each
replica is joined (directly or as part of a delta-group) at least once with every
other replica, all replica states become equal.
Proof. Trivial, given the associativity, commutativity, and idempotence of the
join operation in any join-semilattice.
This opens up the possibility of having anti-entropy algorithms that are only
devoted to enforce convergence, without necessarily providing causal consistency
(enforced in standard CRDTs); thus, making a trade-off between performance
and consistency guarantees. For instance, in a counter (e.g., for the number of
likes on a social network), it may not be critical to have causal consistency, but
merely not to lose increments and achieve convergence.
4.1 Basic Anti-Entropy Algorithm
A basic anti-entropy algorithm that ensures eventual convergence in δ-CRDT is
presented in Algorithm 1. For the node corresponding to replica $i$, the durable
state, which persists after a crash, is simply the δ-CRDT state $X_i$. The
volatile state $D_i$ stores a delta-group that is used to accumulate deltas before
eventually sending it to other replicas. Without loss of generality, we assume
that the join-semilattice has a bottom $\bot$, which is the initial value for
both $X_i$ and $D_i$.
When an operation is performed, the corresponding delta-mutator $m^\delta$ is
applied to the current state of $X_i$, generating a delta $d$. This delta is
joined both with $X_i$ to produce a new state, and with $D_i$. In the same spirit
as standard state-based CRDTs, a node sends its messages in a periodic fashion,
where the message payload is either the delta-group $D_i$ or the full state
$X_i$; this decision is made by the function $\mathsf{choose}_i$, which returns
one of them. To keep the algorithm simple, a node simply broadcasts its messages
without distinguishing between neighbors. After each send, the delta-group is
reset to $\bot$.
Once a message is received, the payload $d$ is joined into the current δ-CRDT
state. The basic algorithm operates in two modes: (1) a transitive mode (when
$t_i$ is true) in which $d$ is also joined into $D_i$, allowing transitive
propagation of delta-mutations; meaning that deltas received at node $i$ from
some node $j$ can later be sent to some other node $k$; (2) a direct mode where a
delta-group is exclusively the join of local delta-mutations ($j$ must send its
deltas directly to $k$). The decisions of whether to send a delta-group versus
the full state (typically sent less frequently), and whether to use the
transitive or direct mode are out of the scope of this paper. In general,
decisions can be made considering many criteria like delta-group size, state
size, message loss distribution assumptions, and network topology.
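Algorithm 1 can be exercised with a small in-memory simulation. The sketch below is an illustration under several assumptions, not the authors' implementation: it uses the counter of Fig. 2 as the datatype, replaces the function choose by a hypothetical boolean flag `ship_full_state`, and uses a made-up driver `run_round` that delivers every message twice to mimic duplication. The three nodes form a line a, b, c, so a and c can only learn about each other through b, which is what transitive mode provides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Dict[str, int]  # counter state or delta; an empty dict plays the role of bottom


def join(a: State, b: State) -> State:
    """Point-wise maximum, the join of the counter semilattice."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


@dataclass
class Node:
    ident: str
    neighbors: List[str]
    transitive: bool = True                 # plays the role of t_i
    X: State = field(default_factory=dict)  # durable CRDT state
    D: State = field(default_factory=dict)  # volatile delta-group

    def operation(self) -> None:
        """on operation: apply the counter delta-mutator, join into X and D."""
        d = {self.ident: self.X.get(self.ident, 0) + 1}
        self.X = join(self.X, d)
        self.D = join(self.D, d)

    def tick(self, ship_full_state: bool = False) -> List[Tuple[str, State]]:
        """periodically: pick a payload, address it to every neighbor, reset D."""
        payload = self.X if ship_full_state else self.D  # stands in for choose
        messages = [(j, dict(payload)) for j in self.neighbors]
        self.D = {}
        return messages

    def receive(self, d: State) -> None:
        """on receive: join into X, and into D only in transitive mode."""
        self.X = join(self.X, d)
        if self.transitive:
            self.D = join(self.D, d)


def run_round(nodes: Dict[str, Node]) -> None:
    """Hypothetical driver: each message is delivered twice (duplication)."""
    for node in list(nodes.values()):
        for dest, payload in node.tick():
            nodes[dest].receive(payload)
            nodes[dest].receive(payload)


# Line topology a - b - c.
nodes = {
    "a": Node("a", ["b"]),
    "b": Node("b", ["a", "c"]),
    "c": Node("c", ["b"]),
}
nodes["a"].operation()
nodes["a"].operation()
nodes["c"].operation()
for _ in range(2):
    run_round(nodes)
states = [n.X for n in nodes.values()]
```

In this run all three replicas end with the same map even though every message was duplicated. With `transitive=False` on node b, the increments of a would never reach c in this topology, which is the trade-off between the two modes described above.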
5 Causal Consistency
For some CRDTs with commutative operations, like the counter in Fig. 2, eventual
convergence of states may be enough, and thus any anti-entropy algorithm
that satisfies the condition in Proposition 1, like Algorithm 1, can be used. How-
ever, stronger consistency guarantees, like causal consistency, are often required
by today’s applications. When using an anti-entropy mechanism which dissem-
inates deltas with no order guarantees (like Algorithm 1) the execution is, in
general, not causally consistent.
Traditional state-based CRDTs converge using joins of the full state, which
implicitly ensures per-object causal consistency [11]: each state of some replica
of an object reflects the causal past of operations on the object (either applied
locally, or applied at other replicas and transitively joined).
Therefore, it is desirable to have δ-CRDTs offer the same causal-consistency
guarantees that standard state-based CRDTs offer. This raises the question
of how delta propagation and merging in a δ-CRDT can be constrained (and
expressed in an anti-entropy algorithm) so as to give the same results
as if a standard state-based CRDT were used. Towards this objective, it is
useful to define a particular kind of delta-group, which we call a delta-interval:
Definition 4 (Delta-interval). Given a replica $i$ progressing along the states
$X_i^0, X_i^1, \ldots$, by joining delta $d_i^k$ (either local delta-mutation or
received delta-group) into $X_i^k$ to obtain $X_i^{k+1}$, a delta-interval
$\Delta_i^{a,b}$ is a delta-group resulting from joining deltas
$d_i^a, \ldots, d_i^{b-1}$:
$$\Delta_i^{a,b} = \bigsqcup \{ d_i^k \mid a \le k < b \}$$
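As a concrete reading of Definition 4, the following Python sketch builds the state sequence of one replica from a made-up list of deltas (counter deltas as in Fig. 2) and computes a delta-interval with a helper named `delta_interval`, a name chosen here for illustration. It then checks that joining the interval for indices a to b into the state at index a yields the state at index b.

```python
from functools import reduce
from typing import Dict, List

State = Dict[str, int]  # counter state or delta; an empty dict plays the role of bottom


def join(a: State, b: State) -> State:
    """Point-wise maximum, the join of the counter semilattice."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def delta_interval(deltas: List[State], a: int, b: int) -> State:
    """Join of the deltas with index k such that a <= k < b."""
    return reduce(join, deltas[a:b], {})


# Made-up history at replica a: a local increment, a received delta, a local increment.
deltas = [{"a": 1}, {"b": 1}, {"a": 2}]

# states[k] is the state into which deltas[k] is joined; states[0] is bottom.
states: List[State] = [{}]
for d in deltas:
    states.append(join(states[-1], d))

interval = delta_interval(deltas, 1, 3)
```

Here `interval` summarizes everything that happened between `states[1]` and `states[3]`, so a replica that already holds `states[1]` catches up by joining it.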
The use of delta-intervals in anti-entropy algorithms will be a key ingredient
towards achieving causal consistency. We now define a restricted kind of anti-
entropy algorithms for δ-CRDTs.
Definition 5 (Delta-interval-based anti-entropy algorithm). A given anti-
entropy algorithm for δ-CRDTs is delta-interval-based, if all deltas sent to other
replicas are delta-intervals.
Moreover, to achieve causal consistency the next condition must be satisfied:
Definition 6 (Causal delta-merging condition). A delta-interval based
anti-entropy algorithm is said to satisfy the causal delta-merging condition if
the algorithm only joins $\Delta_j^{a,b}$ from replica $j$ into replica $i$
states $X_i$ that satisfy:
$$X_i \sqsupseteq X_j^a$$
This means that a delta-interval is only joined into states that at least reflect
(i.e., subsume) the state into which the first delta in the interval was previously
joined. The causal delta-merging condition is important since any delta-interval
based anti-entropy algorithm of a δ-CRDT that satisfies it, can be used to obtain
the same outcome of standard CRDTs; this is formally stated in Proposition 2.
Proposition 2. (CRDT and δ-CRDT correspondence) Let $(S, M, Q)$ be a standard
state-based CRDT and $(S, M^\delta, Q)$ a corresponding delta-state
decomposition. Any δ-CRDT state reachable by an execution $E^\delta$ over
$(S, M^\delta, Q)$, by a delta-interval based anti-entropy algorithm $A^\delta$
satisfying the causal delta-merging condition, is equal to a state resulting from
an execution $E$ over $(S, M, Q)$, having the corresponding data-type operations,
by an anti-entropy algorithm $A$ for state-based CRDTs.
Proof. See appendix.
Corollary 1. (δ-CRDT causal consistency) Any δ-CRDT in which states are
propagated and joined using a delta-interval-based anti-entropy algorithm satis-
fying the causal delta-merging condition ensures causal consistency.
Proof. From Proposition 2 and causal consistency of state-based CRDTs.
5.1 Anti-Entropy Algorithm for Causal Consistency
Algorithm 2 is a delta-interval based anti-entropy algorithm which enforces the
causal delta-merging condition. It can be used whenever the causal consistency
guarantees of standard state-based CRDTs are needed. For simplicity, it excludes
some optimizations that are important, but easy to derive, in practice. The
algorithm distinguishes neighbor nodes, and only sends them delta-intervals that
are joined at the receiving node, obeying the delta-merging condition.
Each node $i$ keeps a contiguous sequence of deltas $d_i^l, \ldots, d_i^u$ in a
map $D_i$ from integers to deltas, with $l = \min(\mathrm{dom}(D_i))$ and
$u = \max(\mathrm{dom}(D_i))$. The sequence
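inputs:
    $n_i \in \mathcal{P}(I)$, set of neighbors
durable state:
    $X_i \in S$, CRDT state; initially $X_i = \bot$
    $c_i \in \mathbb{N}$, sequence number; initially $c_i = 0$
volatile state:
    $D_i \in \mathbb{N} \hookrightarrow S$, sequence of deltas; initially $D_i = \{\}$
    $A_i \in I \hookrightarrow \mathbb{N}$, acknowledges map; initially $A_i = \{\}$
on $\mathsf{receive}_{j,i}(\mathsf{delta}, d, n)$
    $X_i' = X_i \sqcup d$
    $D_i' = D_i\{c_i \mapsto d\}$
    $c_i' = c_i + 1$
    $\mathsf{send}_{i,j}(\mathsf{ack}, n)$
on $\mathsf{receive}_{j,i}(\mathsf{ack}, n)$
    $A_i' = A_i\{j \mapsto \max(A_i(j), n)\}$
on $\mathsf{operation}_i(m^\delta)$
    $d = m^\delta(X_i)$
    $X_i' = X_i \sqcup d$
    $D_i' = D_i\{c_i \mapsto d\}$
    $c_i' = c_i + 1$
periodically // ship delta-interval or state
    $j = \mathsf{random}(n_i)$
    if $D_i = \{\} \lor \min(\mathrm{dom}(D_i)) > A_i(j)$ then
        $d = X_i$
    else
        $d = \bigsqcup \{D_i(l) \mid A_i(j) \le l < c_i\}$
    $\mathsf{send}_{i,j}(\mathsf{delta}, d, c_i)$
periodically // garbage collect deltas
    $l = \min\{n \mid (\_, n) \in A_i\}$
    $D_i' = \{(n, d) \in D_i \mid n \ge l\}$
Algorithm 2: Anti-entropy algorithm ensuring causal consistency of δ-CRDT.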
numbers of deltas are obtained from the counter $c_i$ that is incremented when
a delta (whether a delta-mutation or delta-interval received) is joined with the
current state. Each node $i$ keeps an acknowledgments map $A_i$ that stores, for
each neighbor $j$, the index $b$ such that $\Delta_i^{a,b}$ is the last
delta-interval acknowledged by $j$ (after it receives $\Delta_i^{a,b}$ from $i$
and joins it into $X_j$).
Node $i$ sends a delta-interval $d = \Delta_i^{a,b}$ with a (delta, d, b)
message; the receiving node $j$, after joining $\Delta_i^{a,b}$ into its replica
state, replies with an acknowledgment message (ack, b); if an ack from $j$ is
successfully received by node $i$, it updates the entry of $j$ in the
acknowledgment map, using the max function. This handles possible old duplicates
and messages arriving out of order.
Like the δ-CRDT state, the counter $c_i$ is also kept in durable storage.
This is essential to avoid conflicts after potential crash and recovery incidents.
Otherwise, a delayed ack for a delta-interval sent before crashing could make
the node skip sending some deltas generated after recovery, thus violating the
delta-merging condition.
The algorithm for node $i$ periodically picks a random neighbor $j$. In
principle, $i$ sends the join of all deltas starting from the one that $j$ last
acknowledged. Exceptionally, $i$ sends the entire state in two cases: (1) if the
sequence of deltas $D_i$ is empty, or (2) if $j$ is expecting from $i$ a delta
that was already removed from $D_i$ (e.g., after a crash and recovery); $i$
tracks this in $A_i(j)$. To garbage collect old deltas, the algorithm
periodically removes the deltas that have been acked by all neighbors.
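The bookkeeping of Algorithm 2 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: the datatype is the counter of Fig. 2, message transport is replaced by direct method calls, a neighbor without an entry in the acknowledgments map is treated as having acknowledged sequence number 0, and garbage collection takes the list of neighbors explicitly. The class and method names (`CausalNode`, `make_message`, `garbage_collect`) are hypothetical.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Dict, List, Tuple

State = Dict[str, int]  # counter state or delta; an empty dict plays the role of bottom


def join(a: State, b: State) -> State:
    """Point-wise maximum, the join of the counter semilattice."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


@dataclass
class CausalNode:
    ident: str
    X: State = field(default_factory=dict)             # durable CRDT state
    c: int = 0                                         # durable sequence number
    D: Dict[int, State] = field(default_factory=dict)  # volatile: seq -> delta
    A: Dict[str, int] = field(default_factory=dict)    # volatile: neighbor -> acked seq

    def _join_and_log(self, d: State) -> None:
        """Join a delta into X and store it under the next sequence number."""
        self.X = join(self.X, d)
        self.D[self.c] = d
        self.c += 1

    def operation(self) -> None:
        self._join_and_log({self.ident: self.X.get(self.ident, 0) + 1})

    def on_delta(self, d: State, n: int) -> Tuple[str, int]:
        self._join_and_log(d)
        return ("ack", n)

    def on_ack(self, j: str, n: int) -> None:
        # max() makes stale or duplicated acks harmless.
        self.A[j] = max(self.A.get(j, 0), n)

    def make_message(self, j: str) -> Tuple[str, State, int]:
        acked = self.A.get(j, 0)  # assumption: unknown neighbors count as 0
        if not self.D or min(self.D) > acked:
            d = dict(self.X)      # fall back to the full state
        else:
            d = reduce(join, (self.D[l] for l in range(acked, self.c)), {})
        return ("delta", d, self.c)

    def garbage_collect(self, neighbors: List[str]) -> None:
        low = min(self.A.get(j, 0) for j in neighbors)
        self.D = {n: d for n, d in self.D.items() if n >= low}


a, b = CausalNode("a"), CausalNode("b")
for _ in range(3):
    a.operation()
_, d1, n1 = a.make_message("b")   # interval covering sequence numbers 0..2
_, ack1 = b.on_delta(d1, n1)
a.on_ack("b", ack1)
a.operation()
_, d2, n2 = a.make_message("b")   # only the delta logged after the ack
_, ack2 = b.on_delta(d2, n2)
a.on_ack("b", ack2)
a.on_ack("b", ack1)               # a stale duplicate ack arrives late
a.garbage_collect(["b"])
_, d3, _ = a.make_message("b")    # buffer is empty, so the full state is sent
```

The second message carries only what b has not yet acknowledged, and the late duplicate ack does not move the acknowledgment backwards.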
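$\Sigma = \mathcal{P}(I \times \mathbb{N} \times E) \times \mathcal{P}(I \times \mathbb{N})$
$\sigma_i^0 = (\{\}, \{\})$
$\mathsf{add}_i^\delta(e, (s, t)) = (\{(i, n + 1, e)\}, \{\})$
    with $n = \max(\{k \mid (i, k, \_) \in s\})$
$\mathsf{rmv}_i^\delta(e, (s, t)) = (\{\}, \{(j, n) \mid (j, n, e) \in s\})$
$\mathsf{elements}_i((s, t)) = \{e \mid (j, n, e) \in s \land (j, n) \notin t\}$
$(s, t) \sqcup (s', t') = (s \cup s', t \cup t')$
(a) With Tombstones
$\Sigma = \mathcal{P}(I \times \mathbb{N} \times E) \times \mathcal{P}(I \times \mathbb{N})$
$\sigma_i^0 = (\{\}, \{\})$
$\mathsf{add}_i^\delta(e, (s, c)) = (\{(i, n + 1, e)\}, \{(i, n + 1)\})$
    with $n = \max(\{k \mid (i, k) \in c\})$
$\mathsf{rmv}_i^\delta(e, (s, c)) = (\{\}, \{(j, n) \mid (j, n, e) \in s\})$
$\mathsf{elements}_i((s, c)) = \{e \mid (j, n, e) \in s\}$
$(s, c) \sqcup (s', c') = ((s \cap s') \cup \{(i, n, e) \in s \mid (i, n) \notin c'\} \cup \{(i, n, e) \in s' \mid (i, n) \notin c\}, c \cup c')$
(b) Without Tombstones (optimized)
Fig. 3: Add-wins observed-remove δ-CRDT set, replica i.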
Proposition 3. Algorithm 2 produces the same reachable states as a standard
algorithm over a CRDT for which the δ-CRDT is a decomposition.
Proof. See appendix.
6 δ-CRDTs for Add-Wins OR-Sets
An Add-wins Observed-Remove Set is a well-known CRDT datatype that offers
the same sequential semantics of a sequential set and adopts a specific resolution
semantics for operations that concurrently add and remove the same element.
Add-wins means that an add prevails over a concurrent remove. Remove opera-
tions, however, only affect elements added by causally preceding adds.
Fig. 3a depicts a simple, but inefficient, δ-CRDT implementation of a
state-based add-wins OR-Set. The state $\Sigma$ consists of a set of tagged
elements and a set of tags, acting as tombstones. Globally unique tags of the
form $I \times \mathbb{N}$ are used and ensured by pairing a replica identifier
in $I$ with a monotonically increasing natural counter. Once an element
$e \in E$ is added to the set, the delta-mutator $\mathsf{add}^\delta$ creates a
globally unique tag by incrementing the highest tag present in its local state
and that was created by replica $i$ itself (max returns 0 if no tag is present).
This tag is paired with value $e$ and stored as a new unique triple. Since
removes should only tombstone elements that are added before the remove
operation, the delta-mutator $\mathsf{rmv}^\delta$ retains in the tombstone set
all tags associated to element $e$, being removed from the local state. Function
elements only returns the elements that are added but not tombstoned yet. Join
$\sqcup$ simply unions the respective sets that are, therefore, both grow-only.
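The tombstone-based design just described can be sketched in Python as follows. This is an illustrative reading of Fig. 3a, with function names and the `frozenset` representation chosen here as assumptions. The short scenario at the end exercises the add-wins rule: one replica removes an element while another concurrently adds it again.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, int, str]                      # (replica, counter, element)
Tag = Tuple[str, int]                              # (replica, counter)
State = Tuple[FrozenSet[Triple], FrozenSet[Tag]]   # (tagged elements, tombstones)

BOTTOM: State = (frozenset(), frozenset())


def add_delta(i: str, e: str, state: State) -> State:
    """Tag e with one more than the highest counter replica i has used in s."""
    s, _ = state
    n = max((k for (j, k, _) in s if j == i), default=0)
    return (frozenset({(i, n + 1, e)}), frozenset())


def rmv_delta(e: str, state: State) -> State:
    """Tombstone every tag currently associated with e in the local state."""
    s, _ = state
    return (frozenset(), frozenset((j, n) for (j, n, x) in s if x == e))


def elements(state: State) -> Set[str]:
    s, t = state
    return {e for (j, n, e) in s if (j, n) not in t}


def join(a: State, b: State) -> State:
    """Both components are grow-only sets, so join is component-wise union."""
    return (a[0] | b[0], a[1] | b[1])


# Both replicas start from a state in which replica a has added "x".
X0 = join(BOTTOM, add_delta("a", "x", BOTTOM))

# Concurrently: replica a removes "x" while replica b adds "x" again.
Xa = join(X0, rmv_delta("x", X0))
Xb = join(X0, add_delta("b", "x", X0))
merged = join(Xa, Xb)
```

After the merge, the tag created by b is not tombstoned, so "x" is still a member. Both sets only ever grow, which is the inefficiency that the optimized design addresses.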
A more efficient design is presented in Fig. 3b, which offers the same semantics
and has a similar state structure; however, it uses a different join-semilattice,
allowing the set of tagged elements to shrink as elements are removed. Now,
elements returns all elements in the tagged set s. Instead of the tombstone set,
a causal context set c is used. Adding an element creates a unique tag by
resorting to the causal context c instead of s, which can now shrink; the new
triple is added to s as before, but now the new tag is also added to the causal
context c. The delta-mutator rmv^δ is the same as before, collecting all tags
associated with the element being removed. The desired semantics are achieved by
the novel join operation ⊔. To join two states, their causal contexts are simply
unioned, whereas the new tagged element set only preserves: (1) the triples
present in both sets (therefore, not removed in either), and (2) any triple
present in one of the sets whose tag is not present in the causal context of
the other state.
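The join rule just described can be illustrated with a small Python sketch. It is a reading of the prose above under stated assumptions rather than a transcription of Fig. 3b: the function names, string replica identifiers, and the (s, c) tuple encoding are hypothetical choices for the example.

```python
from typing import FrozenSet, Hashable, Set, Tuple

Tag = Tuple[str, int]                 # (replica id, counter)
Triple = Tuple[str, int, Hashable]    # (replica id, counter, element)
State = Tuple[FrozenSet[Triple], FrozenSet[Tag]]  # (tagged set s, causal context c)

EMPTY: State = (frozenset(), frozenset())


def add_delta(i: str, e: Hashable, state: State) -> State:
    """The new tag is derived from the causal context c, since s may shrink."""
    _s, c = state
    n = max((k for (j, k) in c if j == i), default=0)
    return (frozenset({(i, n + 1, e)}), frozenset({(i, n + 1)}))


def rmv_delta(e: Hashable, state: State) -> State:
    """Collects all tags locally associated with e; carries no triples."""
    s, _c = state
    return (frozenset(), frozenset((j, n) for (j, n, x) in s if x == e))


def join(a: State, b: State) -> State:
    (s1, c1), (s2, c2) = a, b
    kept = (
        (s1 & s2)                                     # present in both sets
        | {t for t in s1 if (t[0], t[1]) not in c2}   # unseen by the other side
        | {t for t in s2 if (t[0], t[1]) not in c1}
    )
    return (frozenset(kept), c1 | c2)


def elements(state: State) -> Set[Hashable]:
    s, _c = state
    return {e for (_j, _n, e) in s}


# Same scenario as before: B removes "x" while A concurrently re-adds it.
d1 = add_delta("A", "x", EMPTY)
a = join(EMPTY, d1)
b = join(EMPTY, d1)
d_rm = rmv_delta("x", b)
d_add = add_delta("A", "x", a)
a = join(join(a, d_add), d_rm)
b = join(join(b, d_rm), d_add)
```

Both replicas converge to a state whose tagged set holds only the triple of the second add, while the causal context remembers both tags. Re-delivering the old delta `d1` leaves the state unchanged, because its tag is already covered by the causal context.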
Causal Context Compression. For presentation simplicity, this optimized version
of the add-wins OR-Set has a grow-only causal context that collects all the
unique tags (even from elements added but no longer present). In practice the
causal context can be efficiently compressed without any loss of information.
When using an anti-entropy algorithm that provides causal consistency, e.g.,
Algorithm 2, then for each replica state X_i = (s_i, c_i) and replica identifier
j ∈ I, we have a contiguous sequence:
\[ 1 \le n \le \max(\{k \mid (j, k) \in c_i\}) \Rightarrow (j, n) \in c_i. \]
Thus, the causal context can always be encoded as a compact version vector [12]
I ↪ N that keeps the maximum sequence number for each replica. Even under
non-causal anti-entropy, compression is still possible by keeping a version vector
that encodes the offset of the contiguous sequence of tags from each replica,
together with a set for the non-contiguous tags. As anti-entropy proceeds, each
tag is eventually encoded in the vector, and thus the set typically remains small.
Compression is less likely for the causal context of delta-groups in transit or
buffered to be sent, but those contexts are only transient and smaller than those
in the actual replica states. Moreover, the same techniques that encode
contiguous sequences of tags can also be used for transient context compression [13].
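As an illustration of the non-causal case, the following Python sketch splits a set of tags into a version vector covering the contiguous prefix of each replica plus a residual set of non-contiguous tags. The helper names `compress` and `contains` are hypothetical, and counters are assumed to start at 1, as in the tags described earlier.

```python
from typing import Dict, Iterable, Set, Tuple

Tag = Tuple[str, int]  # (replica id, counter), counters start at 1


def compress(tags: Iterable[Tag]) -> Tuple[Dict[str, int], Set[Tag]]:
    """Encode tags as (version vector, leftover non-contiguous tags)."""
    by_replica: Dict[str, Set[int]] = {}
    for j, n in tags:
        by_replica.setdefault(j, set()).add(n)

    vector: Dict[str, int] = {}
    extras: Set[Tag] = set()
    for j, counters in by_replica.items():
        top = 0
        while top + 1 in counters:      # extend the contiguous prefix 1..top
            top += 1
        if top > 0:
            vector[j] = top
        extras.update((j, n) for n in counters if n > top)
    return vector, extras


def contains(vector: Dict[str, int], extras: Set[Tag], tag: Tag) -> bool:
    """Membership test on the compressed form."""
    j, n = tag
    return n <= vector.get(j, 0) or tag in extras


context = {("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 3)}
vector, extras = compress(context)

# Once the missing tag ("B", 2) arrives, the gap closes and the set empties.
vector2, extras2 = compress(context | {("B", 2)})
```

Here `vector` maps A to 3 and B to 1 while `extras` holds only the out-of-order tag of B; after the gap is filled, everything is absorbed into the vector. No tag is lost in either encoding, which matches the lossless compression claimed above.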
7 Message Complexity
Our delta-based framework, δ-CRDT, clearly introduces significant cost improve-
ments on messaging. Despite being generic, δ-CRDT requires deltas to be defined
per datatype. This makes the bit-message complexity datatype-based rather than
generic. To give an intuition about this complexity, we address the two datatypes
introduced above: counter and OR-Set. In classical state-based counter CRDTs,
the entire map of the counter is shipped. As the map size grows with the number
of replicas, this leads to a bit-message complexity of Õ(|I|)¹. In the δ-CRDT
case, by contrast, only the α recently updated map entries are shipped, yielding
a bit-message complexity of Õ(α), where α ≪ |I|. Moreover, if transitive
forwarding is not allowed, this drops to a constant Õ(1). As for the OR-Set,
shipping in classical OR-Set CRDTs delivers the entire state, which yields a
bit-message complexity of O(S), where S is the state size. In δ-CRDT, only deltas
are shipped, which gives a bit-message complexity of O(s), where s represents the
size of the updates that occurred since the last shipping. Clearly, s ≪ S, since
the updates that occur on a state in a period of time are often far fewer than
the total number of items.

¹ Õ is a variant of big O ignoring logarithmic factors in the size of integers and ids.
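To make the counter case concrete, the sketch below contrasts the payload of a classical state-based increment with that of a delta-mutator, using the map-based counter with pointwise-max join described earlier in the paper. The function names, the replica naming scheme, and the choice of 1000 replicas are assumptions for illustration only.

```python
from typing import Dict

Counter = Dict[str, int]  # replica id -> increments observed from that replica


def inc(i: str, m: Counter) -> Counter:
    """Classical state-based mutator: returns the full updated map."""
    return {**m, i: m.get(i, 0) + 1}


def inc_delta(i: str, m: Counter) -> Counter:
    """Delta-mutator: returns only the entry of the local replica."""
    return {i: m.get(i, 0) + 1}


def join(a: Counter, b: Counter) -> Counter:
    """Pointwise maximum; missing keys are treated as zero."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(m: Counter) -> int:
    return sum(m.values())


replicas = [f"r{k}" for k in range(1000)]
state = {r: 1 for r in replicas}          # every replica has incremented once

full = inc("r0", state)                   # payload shipped by a state-based CRDT
d1 = inc_delta("r0", state)               # payload shipped by a delta-CRDT
d2 = inc_delta("r0", join(state, d1))     # a second local increment
group = join(d1, d2)                      # delta-group collapsing both increments

print(len(full), len(d1), len(group))     # prints: 1000 1 1
```

The full state carries one entry per replica, whereas the delta and even the delta-group of two increments carry a single entry. Joining the delta into the old state yields exactly the state produced by the classical mutator.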
8 Related Work
Eventually convergent data types. The design of replicated systems that are
always available and eventually converge can be traced back to historical de-
signs in [14,15], among others. More recently, replicated data types that always
eventually converge, both by reliably broadcasting operations (called operation-
based) or gossiping and merging states (called state-based), have been formalized
as CRDTs [16,7,5,6]. These are also closely related to BloomL [17] and Cloud
Types [18]. State semi-lattices were used for deterministic parallel programming
in LVars [19], where variables progress in the lattice order by joining other values,
and are only accessible by special threshold reads.
Message size. A key feature of δ-CRDT is message size reduction and coalescing,
using small-sized deltas. The long-standing idea of using differences between
things, called “deltas” in many contexts, can lead to many designs, depending on
how exactly a delta is defined. The state-based deltas introduced for Computa-
tional CRDTs [20] require an extra delta-specific merge which does not ensure
idempotence. In [21], an improved synchronization method for non-optimized
OR-set CRDT [5] is presented, where delta information is propagated; in that
paper deltas are a collection of items (related to update events between synchro-
nizations), manipulated and merged through a protocol, as opposed to normal
states in the semilattice. No generic framework is defined (that could encompass
other data types) and the protocol requires several communication steps to
compute the information to exchange. Operation-based CRDTs [5,6,22] also support
small message sizes, in particular the pure flavors [22], which restrict messages
to the operation name and possible arguments. Though pure operation-based
CRDTs allow for compact states and are very fast at the source (since
operations are broadcast without consulting the local state), the model requires
more system guarantees than δ-CRDTs do, e.g., exactly-once reliable delivery and
membership information, and imposes a more complex integration of new replicas.
Encoding causal histories. State-based CRDTs are always designed to be causally
consistent [7,6]. Optimized implementations of sets, maps, and multi-value regis-
ters can build on this assumption to keep the meta-data small [11]. In δ-CRDT,
however, deltas and delta-groups are normally not causally consistent, and thus
the design of join, the meta-data state, as well as the anti-entropy algorithm used
must ensure this. Without causal consistency, the causal context in δ-CRDT
cannot always be summarized with version vectors, and consequently, techniques
that allow for gaps are often used. A well-known mechanism that allows for
encoding of gaps is found in Concise Version Vectors [23]. Interval Version Vec-
tors [13], later on, introduced an encoding that optimizes sequences and allows
gaps, while preserving efficiency when gaps are absent.
9 Conclusion
CRDTs allow flexible, while principled, design of distributed protocols that trade
strict consistency for improved availability, faster response time, and support for
disconnected operation. These benefits are harvested once a given application
can model all, or a part, of its behavior using CRDTs. In particular, state-based
CRDTs allow idempotent gossiping of states and require very basic guarantees
from the network, as they cope with message loss, re-ordering, and duplication.
State-based CRDTs come with a price: states have a potential to get very
large. This is even worse if multiple objects are composed into a single CRDT
object to benefit from intra-replica consistency and atomicity. In this paper,
we addressed these limitations by introducing the new concept of δ-CRDT,
devising delta-mutators over state-based datatypes that detach the changes
an operation induces on the state. This brings a significant performance gain,
as it allows shipping only small states, i.e., deltas, instead of the entire state.
The key property of δ-CRDT is that it preserves the crucial properties
(idempotence, associativity and commutativity) of standard state-based CRDTs.
We have shown how δ-CRDT can achieve convergence possibly with, or with-
out, causal consistency; and we presented an anti-entropy algorithm for each
case. In particular, the causally consistent algorithm allows replacing classical
state-based CRDTs by more efficient ones, while preserving their properties. As
a first application of our approach, we designed a novel δ-CRDT specification
for a well-known and widely used datatype: an optimized observed-remove set.
References
1. Cribbs, S., Brown, R.: Data structures in Riak. In: Riak Conference (RICON),
San Francisco, CA, USA (October 2012)
2. Terry, D.B., Theimer, M.M., Petersen, K., Demers, A.J., Spreitzer, M.J., Hauser,
C.H.: Managing update conflicts in Bayou, a weakly connected replicated storage
system. In: Symp. on Op. Sys. Principles (SOSP), Copper Mountain, CO, USA,
ACM SIGOPS, ACM Press (December 1995) 172–182
3. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin,
A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly avail-
able key-value store. In: Symp. on Op. Sys. Principles (SOSP). Volume 41 of
Operating Systems Review., Stevenson, Washington, USA, Assoc. for Computing
Machinery (October 2007) 205–220
4. Gilbert, S., Lynch, N.: Brewer’s conjecture and the feasibility of consistent, avail-
able, partition-tolerant web services. SIGACT News 33(2) (2002) 51–59
5. Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M.: A comprehensive study of
Convergent and Commutative Replicated Data Types. Rapp. Rech. 7506, Institut
National de la Recherche en Informatique et Automatique (INRIA), Rocquencourt,
France (January 2011)
6. Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M.: Conflict-free replicated data
types. In Défago, X., Petit, F., Villain, V., eds.: Int. Symp. on Stabilization, Safety,
and Security of Distributed Systems (SSS). Volume 6976 of Lecture Notes in Comp.
Sc., Grenoble, France, Springer-Verlag (October 2011) 386–400
7. Baquero, C., Moura, F.: Using structural characteristics for autonomous operation.
Operating Systems Review 33(4) (1999) 90–96
8. Brown, R., Cribbs, S., Meiklejohn, C., Elliott, S.: Riak DT map: A composable,
convergent replicated dictionary. In: Proceedings of the First Workshop on Princi-
ples and Practice of Eventual Consistency. PaPEC ’14, New York, NY, USA, ACM
(2014) 1:1–1:1
9. Baquero, C.: Delta-crdt-cpp.
10. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order (2. ed.). Cam-
bridge University Press (2002)
11. Burckhardt, S., Gotsman, A., Yang, H., Zawirski, M.: Replicated data types:
specification, verification, optimality. In Jagannathan, S., Sewell, P., eds.: POPL,
ACM (2014) 271–284
12. Parker, D.S., Popek, G.J., Rudisin, G., Stoughton, A., Walker, B.J., Walton, E.,
Chow, J.M., Edwards, D., Kiser, S., Kline, C.: Detection of mutual inconsistency
in distributed systems. IEEE Trans. Softw. Eng. 9(3) (May 1983) 240–247
13. Mukund, M., R., G.S., Suresh, S.P.: Optimized OR-sets without ordering
constraints. In: Proceedings of the International Conference on Distributed
Computing and Networking, New York, NY, USA, ACM (2014) 227–241
14. Wuu, G.T.J., Bernstein, A.J.: Efficient solutions to the replicated log and dictio-
nary problems. In: Symp. on Principles of Dist. Comp. (PODC), Vancouver, BC,
Canada (August 1984) 233–242
15. Johnson, P.R., Thomas, R.H.: The maintenance of duplicate databases. Internet
Request for Comments RFC 677, Information Sciences Institute (January 1976)
16. Letia, M., Preguiça, N., Shapiro, M.: CRDTs: Consistency without concurrency
control. Rapp. Rech. RR-6956, Institut National de la Recherche en Informatique
et Automatique (INRIA), Rocquencourt, France (June 2009)
17. Conway, N., Marczak, W.R., Alvaro, P., Hellerstein, J.M., Maier, D.: Logic and lat-
tices for distributed programming. In: Proceedings of the Third ACM Symposium
on Cloud Computing, ACM (2012) 1
18. Burckhardt, S., Fähndrich, M., Leijen, D., Wood, B.P.: Cloud types for eventual
consistency. In: ECOOP 2012–Object-Oriented Programming. Springer (2012)
19. Kuper, L., Newton, R.R.: LVars: lattice-based data structures for deterministic
parallelism. In: Proceedings of the 2nd ACM SIGPLAN workshop on Functional
high-performance computing, ACM (2013) 71–84
20. Navalho, D., Duarte, S., Preguiça, N., Shapiro, M.: Incremental stream processing
using computational conflict-free replicated data types. In: Proceedings of the 3rd
International Workshop on Cloud Data and Platforms, ACM (2013) 31–36
21. Deftu, A., Griebsch, J.: A scalable conflict-free replicated set data type. In: Pro-
ceedings of the 2013 IEEE 33rd International Conference on Distributed Comput-
ing Systems. ICDCS ’13, Washington, DC, USA, IEEE Computer Society (2013)
22. Baquero, C., Almeida, P.S., Shoker, A.: Making operation-based CRDTs operation-
based. In: to appear in Proceedings of Distributed Applications and Interoperable
Systems: 14th IFIP WG 6.1 International Conference, Springer (2014)
23. Malkhi, D., Terry, D.: Concise version vectors in WinFS. Distributed Computing
20(3) (2007) 209–219
A Proof of Proposition 1
Proof. By simulation, establishing a correspondence between an execution E_δ
and an execution E of a standard CRDT of which (S, M^δ, Q) is a decomposition, as
follows: 1) the state (X_i, D_i, . . .) of each node in E_δ, containing CRDT state
X_i, information about delta-intervals D_i and possibly other information,
corresponds to only the X_i component (in the same join-semilattice); 2) for each
action which is a delta-mutation m^δ in E_δ, E executes the corresponding
mutation m, satisfying m(X) = X ⊔ m^δ(X); 3) whenever E_δ contains a send action
of a delta-interval Δ_i^{a,b}, execution E contains a send action containing the
full state X_i^b; 4) whenever E_δ performs a join into some X_i of a
delta-interval Δ_j^{a,b}, execution E delivers and joins the corresponding
message containing the full CRDT state X_j^b. By induction on the length of the
trace, assume that for each replica i, each node state X_i in E is equal to the
corresponding component in the node state in E_δ, up to the last action in the
global trace. A send action does not change replica state, preserving the
correspondence. Replica states only change either by performing data-type update
operations or upon message delivery by merging deltas/states respectively. If
the next action is an update operation, the correspondence is preserved due to
the delta-state decomposition property m(X) = X ⊔ m^δ(X). If the next action is
a message delivery at replica i, with a merging of delta-interval/state from
another replica j, then, because algorithm A_δ satisfies the causal
merging-condition, it only joins into state X_i^k a delta-interval Δ_j^{a,b} if
X_i^k ⊒ X_j^a. In this case, the outcome will be:
\[
\begin{aligned}
X_i^{k+1} &= X_i^k \sqcup \Delta_j^{a,b} \\
&= X_i^k \sqcup \bigsqcup \{ d_j^l \mid a \le l < b \} \\
&= X_i^k \sqcup X_j^a \sqcup \bigsqcup \{ d_j^l \mid a \le l < b \} \\
&= X_i^k \sqcup X_j^a \sqcup d_j^a \sqcup d_j^{a+1} \sqcup \dots \sqcup d_j^{b-1} \\
&= X_i^k \sqcup X_j^{a+1} \sqcup d_j^{a+1} \sqcup \dots \sqcup d_j^{b-1} \\
&= \dots \\
&= X_i^k \sqcup X_j^{b-1} \sqcup d_j^{b-1} \\
&= X_i^k \sqcup X_j^b
\end{aligned}
\]
The resulting state X_i^{k+1} in E_δ will be, therefore, the same as the
corresponding one in E where the full CRDT state from j has been joined,
preserving the correspondence between E_δ and E.
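The key step of the proof, namely that joining a delta-interval has the same effect as joining the full state once the causal merging-condition holds, can be checked on a small example. The Python sketch below uses the map-based counter lattice as the underlying semilattice; the concrete deltas, the replica names, and the helper names `leq` and `interval` are hypothetical and serve only as a sanity check, not as a proof.

```python
from functools import reduce
from typing import Dict, List

State = Dict[str, int]  # counter lattice: pointwise max, missing keys are zero


def join(a: State, b: State) -> State:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def leq(a: State, b: State) -> bool:
    """Lattice order: True when a is below or equal to b."""
    return all(v <= b.get(k, 0) for k, v in a.items())


def interval(deltas: List[State], a: int, b: int) -> State:
    """Delta-interval: the join of deltas d^a, ..., d^(b-1)."""
    return reduce(join, deltas[a:b], {})


# Deltas joined at replica j, in order; the second one stands for a delta
# that j received from some replica k and merged into its own state.
deltas_j: List[State] = [{"j": 1}, {"k": 5}, {"j": 2}, {"j": 3}]

# Successive states of j: each state is the previous one joined with a delta.
states_j: List[State] = [{}]
for d in deltas_j:
    states_j.append(join(states_j[-1], d))

a, b = 2, 4
delta_ab = interval(deltas_j, a, b)

x_ok = {"i": 7, "j": 1, "k": 5}   # already includes j's state number a
x_bad = {"i": 7}                  # violates the causal merging-condition

same_when_causal = join(x_ok, delta_ab) == join(x_ok, states_j[b])
same_when_not = join(x_bad, delta_ab) == join(x_bad, states_j[b])
```

With `x_ok` the two joins coincide, as the derivation predicts. With `x_bad` they differ, because the delta-interval alone does not carry the entry for k that the full state of j contains, which is why the condition is needed for the simulation argument.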
B Proof of Proposition 3
Proof. From Proposition 1, it is enough to prove that the algorithm satisfies
the causal delta-merging condition. The algorithm explicitly keeps deltas d_i^k
tagged with increasing sequence numbers (even after a crash), in accordance with
the definition; node j only sends to i a delta-interval Δ_j^{a,b} if i has acked
a; this ack is sent only if i has already joined some delta-interval (possibly a
full state) Δ_j^{k,a}. Either k = 0 or, by the same reasoning, this Δ_j^{k,a}
could only have been joined at i if some other interval Δ_j^{l,k} had already
been joined into i. This reasoning can be recursed until a delta-interval
starting from zero is reached. Therefore,
\[ X_i \sqsupseteq \bigsqcup \{ d_j^k \mid 0 \le k < a \} = \Delta_j^{0,a} = X_j^a. \]
... This becomes costly when CRDTs grow larger. A solution to this problem is discussed by Almeida et al. [2] by only transmitting state-deltas instead of the complete data structure. In addition, certain CRDT designs su er from state in ation, e.g., due to accumulation of tombstone values. ...
General solutions of state machine replication have to ensure that all replicas apply the same commands in the same order, even in the presence of failures. Such strict ordering incurs high synchronization costs caused by distributed consensus or by the use of a leader. This paper presents a protocol for linearizable state machine replication of conflict-free replicated data types (CRDTs) that neither requires consensus nor a leader. By leveraging the properties of state-based CRDTs - in particular the monotonic growth of a join semilattice - synchronization overhead is greatly reduced. In addition, updates just need a single round trip and modify the state `in-place' without the need for a log. Furthermore, the message size overhead for coordination consists of a single counter per message. While reads in the presence of concurrent updates are not wait-free without a coordinator, we show that more than 97% of reads can be handled in one or two round trips under highly concurrent accesses. Our protocol achieves high throughput without auxiliary processes like command log management or leader election. It is well suited for all practical scenarios that need linearizable access on CRDT data on a fine-granular scale.
... We use it both as a distributed key/value store for IoT sensor data and a propagation tool for our generic task model. Lasp provides access to a wide range of Conflict-free Replicated Data Types (CRDTs), that ensure that conflicting operations on a same data entry are automatically handled using the underlying conflict resolution algorithm [10,18,31]. Consequently, Achlys clusters are able to preserve strong eventual consistency of data across nodes. ...
Full-text available
Edge computing is one of the key success factors for future Internet solutions that intend to support the ongoing IoT evolution. By offloading central areas using resources that are closer to clients, providers can offer reliable services with higher quality. But even industry standards are still lacking a valid solution for edge systems with actual sense-making capabilities when no preexisting infrastructure whatsoever is available. The current edge model involves a tight coupling with gateway devices and Internet access, even when autonomous ad hoc IoT networks could perform partial or even complete tasks correctly. In our previous research efforts, we have introduced Achlys, an Erlang programming framework that takes advantage of the GRiSP embedded system capabilities in order to bring edge computing one step further. GRiSP is an embedded board that can easily be programmed directly in Erlang without requiring deep low level knowledge, which offers the extensive toolset of the Erlang ecosystem directly on bare metal hardware. We have been able to demonstrate that our framework allows building reliable applications on unreliable networks of unreliable GRiSP nodes with a very simple programming API. In this paper, we present how Erlang can successfully be used to address edge computing challenges directly on IoT sensor nodes, taking advantage of our existing framework. We display results of deployed distributed programs at the edge and examples of the unique advantage that is offered by Erlang higher-order and concurrent programming in order to achieve reliable general-purpose computing through Achlys.
... CRDTs in Lasp are implemented using additional metadata that allows each operation at each node to be taken into consideration. In fact, the Lasp library uses an efficient implementation of CRDTs called delta-based dissemination mode, which propagates only delta-mutators [27], [22], i.e., update operations, instead of the full state, to achieve consistency. This uses significantly less traffic between nodes than a naive implementation that propagates the full state. ...
Conference Paper
Full-text available
Internet of Things (IoT) continues to grow exponentially , in number of devices and the amount of data they generate. Processing this data requires an exponential increase in computing power. For example, aggregation can be done directly at the edge. However, aggregation is very limited; ideally we would like to do more general computations at the edge. In this paper we propose a framework for doing general-purpose edge computing directly on sensor networks themselves, without requiring external connections to gateways or cloud. This is challenging because sensor networks have unreliable communication, unreliable nodes, and limited (if any) computing power and storage. How can we implement production-quality components directly on these networks? We need to bridge the gap between the unreliable, limited infrastructure and the stringent requirements of the components. To solve this problem we present Achlys, an edge computing framework that provides reliable storage, computation, and communication capabilities directly on wireless networks of IoT sensor nodes. Using Achlys, the sensor network is able to configure and manage itself directly, without external connectivity. Achlys combines the Lasp key/value store and the Partisan communication library. Lasp provides efficient decentralized storage based on the properties of CRDTs (Conflict-Free Replicated Data Types). Partisan provides efficient connectivity and broadcast based on hybrid gossip. Both Lasp and Partisan are specifically designed to be extremely resilient. They are able to continue working despite high node churn, frequent network partitions, and unreliable communication. Our first implementation of Achlys is on a network of GRiSP embedded system boards. We choose GRiSP as our first implementation platform because it implements high-level functionality, namely Erlang, directly on the bare hardware and because it directly supports Pmod sensors and wireless connectivity. 
We give some first results on using Achlys for building edge systems and we explain how we plan to evolve Achlys in the future. Achlys is a work in progress that is being done in the context of the LightKone European H2020 research project, and we are in the process of implementing and evaluating a proof-of-concept application in the area of precision agriculture.
... Delta-state CRDTs [3,4] address this issue in a principled way by propagating delta-mutators, that encode the changes that have been made to a replica since the last communication. The first time a replica communicates with some other replica, the full state needs to be propagated. ...
Internet-scale distributed systems often replicate data at multiple geographic locations to provide low latency and high availability, despite node and network failures. Geo-replicated systems that adopt a weak consistency model allow replicas to temporarily diverge, requiring a mechanism for merging concurrent updates into a common state. Conflict-free Replicated Data Types (CRDT) provide a principled approach to address this problem. This document presents an overview of Conflict-free Replicated Data Types research and practice, organizing the presentation in the aspects relevant for the application developer, the system developer and the CRDT developer.
Conference Paper
Conflict-free replicated data types (CRDTs) [7] aid programmers develop highly available and scalable distributed systems. However, CRDTs require operations to commute which is not practical. This means that programmers cannot replicate regular objects without worrying about concurrency. In this paper, we introduce strong eventually consistent replicated objects (SECROs), a generic data type that is highly available and guarantees strong eventual consistency (SEC) without imposing restrictions on its operations.
Conference Paper
Many web applications are built around direct interactions among users, from collaborative applications and social networks to multi-user games. Despite being user-centric, these applications are usually supported by services running on servers that mediate all interactions among clients. When users are in close vicinity of each other, relying on a centralized infrastructure for mediating user interactions leads to unnecessarily high latency while hampering fault-tolerance and scalability. In this paper, we propose to extend user-centric Internet services with peer-to-peer interactions. We have designed a framework named Legion that enables client web applications to securely replicate data from servers, and synchronize these replicas directly among them. Legion allows for client-side modules, that we dub adapters, to leverage existing web platforms for storing data and to assist in Legion operation. Using these adapters, legacy applications accessing directly the web platforms can co-exist with new applications that use our framework, while accessing the same shared objects.Our experimental evaluation shows that, besides supporting direct client interactions, even when disconnected from the servers, Legion provides lower latency for update propagation with decreased network traffic for servers.
Conference Paper
Pure operation-based (op-based) Conflict-free Replicated Data Types (CRDTs) are generic and very efficient as they allow for compact solutions in both sent messages and state size. Although the pure op-based model looks promising, it is still not fully understood in terms of practical implementation. In this paper, we explain the challenges faced in implementing pure op-based CRDTs in a real system: the well-known in-memory cache key-value store Redis. Our purpose of choosing Redis is to implement a multi-master replication feature, which the current system lacks. The experience demonstrates that pure op-based CRDTs can be implemented in existing systems with minor changes in the original API.
Conference Paper
State-based CRDTs allow updates on local replicas without remote synchronization. Once these updates are propagated, possible conflicts are resolved deterministically across all replicas. δ-CRDTs bring significant advantages in terms of the size of messages exchanged between replicas during normal operation. However, when a replica joins the system after a network partition, it needs to receive the updates it missed and propagate the ones performed locally. Current systems solve this by exchanging the full state bidirectionally or by storing additional metadata along the CRDT. We introduce the concept of join-decomposition for state-based CRDTs, a technique orthogonal and complementary to delta-mutation, and propose two synchronization methods that reduce the amount of information exchanged, with no need to modify current CRDT definitions.
Conference Paper
Full-text available
Mobile devices commonly access shared data stored on a server. To ensure responsiveness, many applications maintain local replicas of the shared data that remain instantly accessible even if the server is slow or temporarily unavailable. Despite its apparent simplicity and commonality, this scenario can be surprisingly challenging. In particular, a correct and reliable implementation of the communication protocol and the conflict resolution to achieve eventual consistency is daunting even for experts. To make eventual consistency more programmable, we propose the use of specialized cloud data types. These cloud types provide eventually consistent storage at the programming language level, and thus abstract the numerous implementation details (servers, networks, caches, protocols). We demonstrate (1) how cloud types enable simple programs to use eventually consistent storage without introducing undue complexity, and (2) how to provide cloud types using a system and protocol comprised of multiple servers and clients.
We propose efficient algorithms to maintain a replicated dictionary using a log in an unreliable network. A non-serializable approach is used to achieve high concurrency. The solutions are resilient to both node and communication failures. Optimizations are developed for networks which are not completely connected.
Conference Paper
Eventual consistency is a relaxation of strong consistency that guarantees that if no new updates are made to a replicated data object, then all replicas will converge. The conflict free replicated datatypes (CRDTs) of Shapiro et al. are data structures whose inherent mathematical structure guarantees eventual consistency. We investigate a fundamental CRDT called Observed-Remove Set (OR-Set) that robustly implements sets with distributed add and delete operations. Existing CRDT implementations of OR-Sets either require maintaining a permanent set of “tombstones” for deleted elements, or imposing strong constraints such as causal order on message delivery. We formalize a concurrent specification for OR-Sets without ordering constraints and propose a generalized implementation of OR-sets without tombstones that provably satisfies strong eventual consistency. We introduce Interval Version Vectors to succinctly keep track of distributed time-stamps in systems that allow out-of-order delivery of messages. The space complexity of our generalized implementation is competitive with respect to earlier solutions with causal ordering. We also formulate k-causal delivery, a generalization of causal delivery, that provides better complexity bounds.
Conflict-Free Replicated Data-Types (CRDTs) [6] provide greater safety properties to eventually-consistent distributed systems without requiring synchronization. CRDTs ensure that concurrent, uncoordinated updates have deterministic outcomes via the properties of bounded join-semilattices. We discuss the design of a new convergent (state-based) replicated data-type, the Map, as implemented by the Riak DT library [4] and the Riak data store [3]. Like traditional dictionary data structures, the Map associates keys with values, and provides operations to add, remove, and mutate entries. Unlike traditional dictionaries, all values in the Map data structure are also state-based CRDTs and updates to embedded values preserve their convergence semantics via lattice inflations [1] that propagate upward to the top-level. Updates to the Map and its embedded values can also be applied atomically in batches. Metadata required for ensuring convergence is minimized in a manner similar to the optimized OR-set [5]. This design allows greater flexibility to application developers working with semi-structured data, while removing the need for the developer to design custom conflict-resolution routines for each class of application data. We also discuss the experimental validation of the data-type using stateful property-based tests with QuickCheck [2].
Conflict-free Replicated Datatypes can simplify the design of predictable eventual consistency. They can be classified into state-based or operation-based. Operation-based approaches have the potential for allowing compact designs in both the sent message and the object state size, but current approaches are still far from this objective. Here we explore the design space for operation-based solutions, and we leverage the interaction with the middleware by offering a technique that delivers very compact solutions, while only broadcasting operation names and arguments.
Conference Paper
In recent years there has been interest in achieving application-level consistency criteria without the latency and availability costs of strongly consistent storage infrastructure. A standard technique is to adopt a vocabulary of commutative operations; this avoids the risk of inconsistency due to message reordering. Another approach was recently captured by the CALM theorem, which proves that logically monotonic programs are guaranteed to be eventually consistent. In logic languages such as Bloom, CALM analysis can automatically verify that programs achieve consistency without coordination. In this paper we present BloomL, an extension to Bloom that takes inspiration from both of these traditions. BloomL generalizes Bloom to support lattices and extends the power of CALM analysis to whole programs containing arbitrary lattices. We show how the Bloom interpreter can be generalized to support efficient evaluation of lattice-based code using well-known strategies from logic programming. Finally, we use BloomL to develop several practical distributed programs, including a key-value store similar to Amazon Dynamo, and show how BloomL encourages the safe composition of small, easy-to-analyze lattices into larger programs.
Conference Paper
Geographically distributed systems often rely on replicated eventually consistent data stores to achieve availability and performance. To resolve conflicting updates at different replicas, researchers and practitioners have proposed specialized consistency protocols, called replicated data types, that implement objects such as registers, counters, sets or lists. Reasoning about replicated data types has however not been on par with comparable work on abstract data types and concurrent data types, lacking specifications, correctness proofs, and optimality results. To fill in this gap, we propose a framework for specifying replicated data types using relations over events and verifying their implementations using replication-aware simulations. We apply it to 7 existing implementations of 4 data types with nontrivial conflict-resolution strategies and optimizations (last-writer-wins register, counter, multi-value register and observed-remove set). We also present a novel technique for obtaining lower bounds on the worst-case space overhead of data type implementations and use it to prove optimality of 4 implementations. Finally, we show how to specify consistency of replicated stores with multiple objects axiomatically, in analogy to prior work on weak memory models. Overall, our work provides foundational reasoning tools to support research on replicated eventually consistent stores.
Conference Paper
Programs written using a deterministic-by-construction model of parallel computation are guaranteed to always produce the same observable results, offering programmers freedom from subtle, hard-to-reproduce nondeterministic bugs that are the scourge of parallel software. We present LVars, a new model for deterministic-by-construction parallel programming that generalizes existing single-assignment models to allow multiple assignments that are monotonically increasing with respect to a user-specified lattice. LVars ensure determinism by allowing only monotonic writes and "threshold" reads that block until a lower bound is reached. We give a proof of determinism and a prototype implementation for a language with LVars and describe how to extend the LVars model to support a limited form of nondeterminism that admits failures but never wrong answers.
Conference Paper
Information has become a key commodity for most service providers. Analyzing streams of data efficiently, in real time, has become increasingly important for supporting new products and applications. This paper outlines a novel abstraction for performing incremental stream processing based on Computational Conflict-free Replicated Data Types. C-CRDTs are replicated objects that can be updated concurrently without coordination to perform a computation and still converge to a consistent state that reflects all contributions. Results obtained with a preliminary prototype show that C-CRDTs have the potential to match and improve computational throughput when compared with a state-of-the-art stream processing system.
Conference Paper
Replication of state is the fundamental approach to achieving scalability and availability. In order to maintain or restore replica consistency under updates, some form of synchronization is needed. Conflict-free Replicated Data Types (CRDTs) ensure eventual consistency, such that replicas converge to a common state, equivalent to a correct sequential execution without foreground synchronization. A particular CRDT is the set data type, which is a pervasive abstraction for storing collections of unique elements and constitutes an important building block for other, more complex data structures. Since the original specification is not scalable, we improve it by introducing an efficient algorithm for sending deltas of updates between replicas and by partitioning a set replica into disjoint subsets. We further add support for limited-lifetime elements, which, in turn, enable simple garbage collection strategies to address the problem of unbounded database growth. Lastly, implementation details and evaluation results of a client library for this data structure are presented.