Efficient State-based CRDTs by Delta-Mutation
Paulo Sérgio Almeida, Ali Shoker, and Carlos Baquero
HASLab/INESC TEC and Universidade do Minho, Portugal
Abstract. CRDTs are distributed data types that make eventual consistency
of a distributed object possible and non ad-hoc. Specifically, state-based
CRDTs achieve this by sharing local state changes through shipping the entire
state, which is then merged into other replicas with an idempotent,
associative, and commutative join operation, ensuring convergence. This
imposes a large communication overhead as the state size grows. We introduce
Delta State Conflict-Free Replicated Datatypes (δ-CRDT), which make use of
δ-mutators, defined so as to return a delta-state that is typically much
smaller than the full state. Delta-states are joined into the local state as
well as into the remote states (after being shipped). This can achieve the
best of both worlds: small messages with an incremental nature, as in
operation-based CRDTs, disseminated over unreliable communication channels, as
in traditional state-based CRDTs. We introduce the δ-CRDT framework and
explain it by establishing a correspondence to current state-based CRDTs. In
addition, we present two anti-entropy algorithms: a basic one that provides
eventual convergence, and another one that ensures both convergence and causal
consistency. We also introduce two δ-CRDT specifications of well-known
replicated datatypes.
Keywords: Distributed systems. Eventual consistency. CRDT.
1 Introduction
Eventual consistency (EC) is a relaxed consistency model that is often adopted
by large-scale distributed systems [1,2,3] where availability must be maintained,
despite outages and partitioning, whereas delayed consistency is acceptable.
The limitations resulting from the CAP theorem [4] suggest trading strong
consistency for high availability. A typical approach in EC systems is to al-
low replicas of a distributed object to temporarily diverge, provided that they
can eventually be reconciled into a common state. To avoid application-specific
reconciliation methods, costly and error-prone, Conflict-Free Replicated Data
Types (CRDTs) [5,6] were introduced, allowing the design of self-contained dis-
tributed data types that are always available and eventually converge when all
operations are reflected at all replicas. Though CRDTs are being deployed in
practice [1], more work is still required to improve their design and performance.
CRDTs support two complementary designs: state-based, which disseminates
object states, and operation-based, which disseminates operations [5,6]. In a
state-based design [7,6] an operation is only executed on the local replica state. A
arXiv:1410.2803v1 [cs.DC] 10 Oct 2014
replica periodically propagates its local changes to other replicas through ship-
ping its entire state. A received state is incorporated with the local state via a
merge function that deterministically reconciles both states. To maintain conver-
gence, merge is defined as a join: a least upper bound over a join-semilattice [7,6].
A major drawback in current state-based CRDTs is the communication over-
head of shipping the entire state, which can get very large in size. For instance,
the state size of a counter CRDT (a vector of integer counters, one per replica)
increases with the number of replicas; whereas in a grow-only Set, the state size
depends on the set size, that grows as more operations are invoked. This com-
munication overhead limits the use of state-based CRDTs to data-types with
small state size (e.g., counters are reasonable while sets are not); recently there
has been demand for CRDTs with large state sizes (e.g., in RIAK DT Maps [8]
that can compose multiple CRDTs).
In this paper, we rethink the way state-based CRDTs should be designed,
having in mind the problematic shipping of the entire state. Our aim is to ship a
representation of the effect of recent update operations on the state, rather than
the whole state, while preserving the idempotent nature of join, thus allowing
unreliable communication, in contrast to operation-based CRDTs, which demand
exactly-once delivery and are vulnerable to message replays. To achieve this,
we introduce Delta State-based CRDTs (δ-CRDT): a state is a join-semilattice
that results from the join of multiple fine-grained states, i.e., deltas, generated
by what we call δ-mutators; these are new versions of the datatype mutators
that return the effect of these mutators on the state. In this way, deltas can be
retained in a buffer to be shipped individually (or joined in groups) instead of
shipping the entire object. The changes to the local state are then incorporated
at other replicas by joining the shipped deltas with their own states.
A key point in our approach is a simple equation relating the novel δ-mutators
with the original CRDT mutators. The challenge when designing a new δ-CRDT
that corresponds to an existing CRDT is to derive δ-mutators that obey this
equation. In this paper, we prove that eventual consistency is guaranteed in
δ-CRDT as long as all deltas produced by δ-mutators are delivered and joined
at other replicas, and we present a corresponding simple anti-entropy algorithm.
We then focus on causal consistency, introducing the concept of delta-interval
and the causal delta-merging condition. Based on these, we then present an anti-
entropy algorithm for δ-CRDT, where sending and then joining delta-intervals
into another replica state produces the same effect as if the entire state had
been shipped and joined. We illustrate our approach by explaining a simple
counter δ-CRDT specification; and then we introduce a challenging non-trivial
specification for a widely used datatype: Optimized Add-Wins Observed-Remove
Sets. In addition, we make a basic δ-CRDT C++ library available online [9] for
various CRDTs. Our experience shows that a δ-CRDT version can be devised
for any CRDT; however, this requires some design effort that varies with the
complexity of the CRDT.
1.1 System Model
Consider a distributed system with nodes containing local memory, with no
shared memory between them. Any node can send messages to any other node.
The network is asynchronous, there being no global clock, no bound on the time
it takes for a message to arrive, nor bounds on relative processing speeds. The
network is unreliable: messages can be lost, duplicated or reordered (but are
not corrupted). Some messages will, however, eventually get through: if a node
sends infinitely many messages to another node, infinitely many of these will be
delivered. In particular, this means that there can be arbitrarily long partitions,
but these will eventually heal. Nodes have access to durable storage; nodes can
crash but eventually will recover with the content of the durable storage as at the
time of the crash. Durable state is written atomically at each state transition.
Each node has access to its globally unique identifier in a set I.
2 A Background of State-based CRDTs
Conflict-Free Replicated Data Types [5,6] (CRDTs) are distributed datatypes
that allow different replicas of a distributed CRDT instance to diverge and
ensure that, eventually, all replicas converge to the same state. State-based
CRDTs achieve this by propagating updates of the local state, disseminating
the entire state across replicas. The received states are then merged into the
states of the remote replicas, leading to convergence.
A state-based CRDT consists of a triple $(S, M, Q)$, where $S$ is a
join-semilattice [10], $Q$ is a set of query functions (which return some
result without modifying the state), and $M$ is a set of mutators that perform
updates; a mutator $m \in M$ takes a state $X \in S$ as input and returns a new
state $X' = m(X)$. A join-semilattice is a set with a partial order
$\sqsubseteq$ and a binary join operation $\sqcup$ that returns the least upper
bound (LUB) of two elements in $S$; a join is designed to be commutative,
associative, and idempotent. Mutators are defined to be inflations, i.e., for
any mutator $m$ and state $X$, the following holds:
$$X \sqsubseteq m(X)$$
In this way, for each replica there is a monotonic sequence of states, defined under
the lattice partial order, where each subsequent state subsumes the previous state
when joined elsewhere.
Both query and mutator operations are always available since they are per-
formed using the local state without requiring inter-replica communication; how-
ever, as mutators are concurrently applied at distinct replicas, replica states will
likely diverge. Eventual convergence is then obtained using an anti-entropy pro-
tocol that periodically ships the entire local state to other replicas. Each replica
merges the received state with its local state using the join operation in S.
Given the mathematical properties of join, if mutators stop being issued, all
replicas eventually converge to the same state, i.e., the least upper bound of
all states involved. State-based CRDTs are interesting as they demand few
guarantees from the dissemination layer, working under message loss,
duplication, reordering, and temporary network partitioning, without impacting
availability and eventual convergence.
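The role of the three join properties can be checked on a small case. The following Python sketch assumes states are maps from replica identifiers to integers joined by point-wise maximum (the lattice of the counter discussed in this section); the function names `join` and `deliver_all` are illustrative and not part of the paper.

```python
from functools import reduce
from itertools import permutations

def join(a, b):
    """Point-wise maximum of two maps; missing keys count as zero."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def deliver_all(local, messages):
    """Join every received state into the local state, in the given order."""
    return reduce(join, messages, local)

x, y, z = {"a": 2}, {"a": 1, "b": 3}, {"c": 1}

# The three semilattice laws that make merging safe over unreliable channels.
assert join(x, x) == x                               # idempotent
assert join(x, y) == join(y, x)                      # commutative
assert join(join(x, y), z) == join(x, join(y, z))    # associative

# Any delivery order, even with duplicated messages, reaches the same
# least upper bound.
lub = {"a": 2, "b": 3, "c": 1}
results = [deliver_all({}, list(p) + [y, x]) for p in permutations([x, y, z])]
assert all(r == lub for r in results)
```

Because only `join` is ever applied to received states, the dissemination layer does not need to deduplicate or order messages.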
$$
\begin{aligned}
\Sigma &= I \to \mathbb{N} \\
\sigma_i^0 &= \{\} \\
\mathsf{inc}_i(m) &= m\{i \mapsto m(i) + 1\} \\
\mathsf{value}_i(m) &= \sum_{i \in I} m(i) \\
m \sqcup m' &= \{(i, \max(m(i), m'(i))) \mid i \in I\}
\end{aligned}
$$
Fig. 1: State-based Counter CRDT; replica i.

Fig. 1 represents a state-based increment-only counter. The CRDT state
$\Sigma$ is a map from replica identifiers to positive integers. Initially,
$\sigma_i^0$ is an empty map (assuming that unmapped keys implicitly map to
zero, and only non-zero mappings are stored). A single mutator, inc, is
defined that increments the value corresponding to the local replica $i$
(returning the updated map). The query operation value returns the counter
value by adding the integers in the map entries. The join of two states is
the point-wise maximum of the maps.
The main weakness of state-based CRDTs is the cost of dissemination of
updates, as the full state is sent. In this simple example of counters, even though
increments only update the value corresponding to the local replica $i$, the whole
map is always sent in messages, even when the other map values have remained
unchanged (since no messages have been received and merged).
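The cost described above can be made concrete with a minimal Python sketch of the counter in Fig. 1. It assumes a dict as the map representation, and the replica identifiers `r0` to `r99` are hypothetical.

```python
def inc(i, m):
    """Mutator inc_i: returns the full updated map (an inflation of m)."""
    return {**m, i: m.get(i, 0) + 1}

def value(m):
    """Query: the counter value is the sum of all map entries."""
    return sum(m.values())

def join(a, b):
    """Point-wise maximum; unmapped keys implicitly map to zero."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

# Replica "a" already knows about 100 other replicas (hypothetical ids).
state_a = {f"r{k}": 1 for k in range(100)}
state_a = inc("a", state_a)

# State-based shipping sends the whole map although a single entry changed.
message = state_a
assert len(message) == 101
assert value(join({"b": 4}, message)) == 105
```

The message carries 101 entries to communicate one increment, which is the overhead that motivates the rest of the paper.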
It would be interesting to only ship the recent modification incurred on
the state. This is, however, not possible with the current model of state-based
CRDTs as mutators always return a full state. Approaches which simply ship
operations (e.g., an “increment n” message), like in operation-based CRDTs,
require reliable communication (e.g., because increment is not idempotent). In
contrast, our approach allows producing and encoding recent mutations in an in-
cremental way, while keeping the advantages of the state-based approach, namely
the idempotent, associative, and commutative properties of join.
3 Delta-state CRDTs
We introduce Delta-State Conflict-Free Replicated Data Types, or δ-CRDT for
short, as a new kind of state-based CRDTs, in which delta-mutators are defined
to return a delta-state: a value in the same join-semilattice which represents the
updates induced by the mutator on the current state.
Definition 1 (Delta-mutator). A delta-mutator $m^\delta$ is a function,
corresponding to an update operation, which takes a state $X$ in a
join-semilattice $S$ as parameter and returns a delta-mutation $m^\delta(X)$,
also in $S$.
Definition 2 (Delta-group). A delta-group is inductively defined as either a
delta-mutation or a join of several delta-groups.
Definition 3 (δ-CRDT). A δ-CRDT consists of a triple $(S, M^\delta, Q)$, where
$S$ is a join-semilattice, $M^\delta$ is a set of delta-mutators, and $Q$ a set
of query functions, where the state transition at each replica is given by
either joining the current state $X \in S$ with a delta-mutation:
$$X' = X \sqcup m^\delta(X),$$
or joining the current state with some received delta-group $D$:
$$X' = X \sqcup D.$$
In a δ-CRDT, the effect of applying a mutation, represented by a
delta-mutation $\delta = m^\delta(X)$, is decoupled from the resulting state
$X' = X \sqcup \delta$, which allows shipping this $\delta$ rather than the
entire resulting state $X'$. All state transitions in a δ-CRDT, even upon
applying mutations locally, are the result of some join with the current
state. Unlike standard CRDT mutators, delta-mutators do not need to be
inflations in order to inflate a state; inflation is instead ensured by
joining their output, i.e., deltas, into the current state.
In principle, a delta could be shipped immediately to remote replicas once ap-
plied locally. For efficiency reasons, multiple deltas returned by applying several
delta-mutators can be joined locally into a delta-group and retained in a buffer.
The delta-group can then be shipped to remote replicas to be joined with their
local states. Received delta-groups can optionally be joined into their buffered
delta-group, allowing transitive propagation of deltas. A full state can be seen
as a special (extreme) case of a delta-group.
If the causal order of operations is not important and the intended aim is
merely eventual convergence of states, then delta-groups can be shipped using
an unreliable dissemination layer that may drop, reorder, or duplicate messages.
Delta-groups can always be re-transmitted and re-joined, possibly out of order,
or can simply be subsumed by a less frequent sending of the full state, e.g. for
performance reasons or when doing state transfers to new members. In Section 4,
we address state convergence when causal consistency is not required, and we
address the latter in Section 5.
3.1 Delta-state decomposition of standard CRDTs
A δ-CRDT $(S, M^\delta, Q)$ is a delta-state decomposition of a state-based
CRDT $(S, M, Q)$, if for every mutator $m \in M$, we have a corresponding
mutator $m^\delta \in M^\delta$ such that, for every state $X \in S$:
$$m(X) = X \sqcup m^\delta(X)$$
This equation states that applying a delta-mutator and joining into the
current state should produce the same state transition as applying the
corresponding mutator of the standard CRDT.
Given an existing state-based CRDT (which is always a trivial decomposition
of itself, i.e., $m(X) = X \sqcup m(X)$, as mutators are inflations), it will
be useful to find a non-trivial decomposition such that delta-states returned
by delta-mutators in $M^\delta$ are smaller than the resulting state:
$$\mathrm{size}(m^\delta(X)) \ll \mathrm{size}(m(X))$$
3.2 Example: δ-CRDT Counter
$$
\begin{aligned}
\Sigma &= I \to \mathbb{N} \\
\sigma_i^0 &= \{\} \\
\mathsf{inc}^\delta_i(m) &= \{i \mapsto m(i) + 1\} \\
\mathsf{value}_i(m) &= \sum_{i \in I} m(i) \\
m \sqcup m' &= \{(i, \max(m(i), m'(i))) \mid i \in I\}
\end{aligned}
$$
Fig. 2: A δ-CRDT counter; replica i.

Fig. 2 depicts a δ-CRDT specification of a counter datatype that is a
delta-state decomposition of the state-based counter in Fig. 1. The state,
join and value query operation remain as before. Only the mutator
$\mathsf{inc}^\delta$ is newly defined, which increments the map entry
corresponding to the local replica and only returns that entry, instead of
the full map as inc in the state-based CRDT counter does. This maintains
the original semantics of the counter while allowing the smaller deltas
returned by the delta-mutator to be sent, instead of the full map. As before,
the received payload (whether one or more deltas) might not include entries
for all keys in $I$, which are assumed to have zero values. The decomposition
is easy to understand in this example since the equation
$\mathsf{inc}_i(X) = X \sqcup \mathsf{inc}^\delta_i(X)$ holds as
$m\{i \mapsto m(i) + 1\} = m \sqcup \{i \mapsto m(i) + 1\}$.
In other words, the single value for key $i$ in the delta, corresponding to the local
replica identifier, will overwrite the corresponding one in $m$ since the former
maps to a higher value (i.e., using max). Here it can be noticed that: (1) a
delta is just a state, that can be joined possibly several times without requiring
exactly-once delivery, and without being a representation of the “increment”
operation (as in operation-based CRDTs), which is itself non-idempotent; (2)
joining deltas into a delta-group and disseminating delta-groups at a lower rate
than the operation rate reduces data communication overhead, since multiple
increments from a given source can be collapsed into a single state counter.
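Both observations can be checked with a short Python sketch of the δ-CRDT counter. The dict representation and the names `inc_delta`, `state` and `group` are assumptions made for illustration.

```python
def join(a, b):
    """Point-wise maximum; unmapped keys implicitly map to zero."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def inc(i, m):
    """Standard mutator: returns the full updated map."""
    return {**m, i: m.get(i, 0) + 1}

def inc_delta(i, m):
    """Delta-mutator: returns only the entry of the local replica."""
    return {i: m.get(i, 0) + 1}

x = {"a": 3, "b": 7, "c": 1}

# Decomposition equation: inc_i(X) equals X joined with inc^delta_i(X).
assert inc("a", x) == join(x, inc_delta("a", x))

# Three local increments collapse into one single-entry delta-group.
state, group = dict(x), {}
for _ in range(3):
    d = inc_delta("a", state)
    state, group = join(state, d), join(group, d)
assert group == {"a": 6}

# A remote replica may receive the group twice; the result is unchanged.
remote = {"b": 9}
once = join(remote, group)
assert join(once, group) == once == {"a": 6, "b": 9}
```

Shipping an operation such as "increment by 3" twice would instead count six increments, which is why operation-based designs need exactly-once delivery.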
4 State Convergence
In the δ-CRDT execution model, and regardless of the anti-entropy algorithm
used, a replica state always evolves by joining the current state with some delta:
either the result of a delta-mutation, or some arbitrary delta-group (which itself
can be expressed as a join of delta-mutations). Therefore, all states can be ex-
pressed as joins of delta-mutations, which makes state convergence in δ-CRDT
easy to achieve: it is enough that all delta-mutations generated in the system
reach every replica, as expressed by the following proposition.
    inputs:
        n_i ∈ P(I), set of neighbors
        t_i ∈ B, true for transitive mode
        choose_i ∈ S × S → S, ship state or delta
    durable state:
        X_i ∈ S, CRDT state; initially X_i = ⊥
    volatile state:
        D_i ∈ S, join of deltas; initially D_i = ⊥

    on operation_i(m^δ)
        d = m^δ(X_i)
        X_i' = X_i ⊔ d
        D_i' = D_i ⊔ d

    periodically
        m = choose_i(X_i, D_i)
        for j ∈ n_i do
            send_{i,j}(m)
        D_i' = ⊥

    on receive_{j,i}(d)
        X_i' = X_i ⊔ d
        if t_i then
            D_i' = D_i ⊔ d
        else
            D_i' = D_i

Algorithm 1: Basic anti-entropy algorithm for δ-CRDT.
Proposition 1. (δ-CRDT convergence) Consider a set of replicas of a δ-CRDT
object, replica $i$ evolving along a sequence of states
$X_i^0 = \bot, X_i^1, \ldots$, each replica performing delta-mutations of the
form $m^\delta_{i,k}(X_i^k)$ at some subset of its sequence of states, and
evolving by joining the current state either with self-generated deltas or
with delta-groups received from others. If each delta-mutation
$m^\delta_{i,k}(X_i^k)$ produced at each replica is joined (directly or as
part of a delta-group) at least once with every other replica, all replica
states become equal.
Proof. Trivial, given the associativity, commutativity, and idempotence of the
join operation in any join-semilattice.
This opens up the possibility of having anti-entropy algorithms that are
devoted only to enforcing convergence, without necessarily providing causal
consistency (enforced in standard CRDTs), thus making a trade-off between
performance and consistency guarantees. For instance, in a counter (e.g., for
the number of
likes on a social network), it may not be critical to have causal consistency, but
merely not to lose increments and achieve convergence.
4.1 Basic Anti-Entropy Algorithm
A basic anti-entropy algorithm that ensures eventual convergence in δ-CRDT is
presented in Algorithm 1. For the node corresponding to replica $i$, the
durable state, which persists after a crash, is simply the δ-CRDT state $X_i$.
The volatile state $D_i$ stores a delta-group that is used to accumulate
deltas before eventually sending it to other replicas. Without loss of
generality, we assume that the join-semilattice has a bottom $\bot$, which is
the initial value for both $X_i$ and $D_i$.
When an operation is performed, the corresponding delta-mutator $m^\delta$ is
applied to the current state $X_i$, generating a delta $d$. This delta is
joined both with $X_i$, to produce a new state, and with $D_i$. In the same
spirit as standard state-based CRDTs, a node sends its messages periodically,
where the message payload is either the delta-group $D_i$ or the full state
$X_i$; this decision is made by the function $\mathsf{choose}_i$, which
returns one of them. To keep the algorithm simple, a node simply broadcasts
its messages without distinguishing between neighbors. After each send, the
delta-group is reset to $\bot$.
Once a message is received, the payload $d$ is joined into the current δ-CRDT
state. The basic algorithm operates in two modes: (1) a transitive mode (when
$t_i$ is true) in which $d$ is also joined into $D_i$, allowing transitive
propagation of delta-mutations, meaning that deltas received at node $i$ from
some node $j$ can later be sent to some other node $k$; (2) a direct mode
where a delta-group is exclusively the join of local delta-mutations ($j$ must
send its deltas directly to $k$). The decisions of whether to send a
delta-group versus the full state (typically sent less frequently), and
whether to use the transitive or direct mode, are out of the scope of this
paper. In general, decisions can be made considering many criteria like
delta-group size, state size, message loss distribution assumptions,
and network topology.
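The handlers of Algorithm 1 can be sketched in Python as follows. This is a minimal simulation under stated assumptions: the δ-CRDT is the counter of Fig. 2, the network is replaced by a loop that drops or duplicates each message at random, and the class name `Replica`, the `send_full_state` flag (standing in for the choose function) and the fixed random seed are illustrative choices.

```python
import random

def join(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def inc_delta(i, m):
    return {i: m.get(i, 0) + 1}

class Replica:
    def __init__(self, rid, transitive=True):
        self.rid, self.transitive = rid, transitive
        self.x = {}   # durable CRDT state X_i, initially bottom
        self.d = {}   # volatile delta-group D_i, initially bottom

    def operation(self, delta_mutator):
        d = delta_mutator(self.rid, self.x)
        self.x, self.d = join(self.x, d), join(self.d, d)

    def ship(self, send_full_state=False):
        """One periodic step: choose a payload, then reset the buffer."""
        payload = self.x if send_full_state else self.d
        self.d = {}
        return dict(payload)

    def receive(self, d):
        self.x = join(self.x, d)
        if self.transitive:
            self.d = join(self.d, d)

rng = random.Random(1)
replicas = [Replica(r) for r in "abc"]
for step in range(50):
    rng.choice(replicas).operation(inc_delta)
    src = rng.choice(replicas)
    msg = src.ship()
    for dst in replicas:
        if dst is src:
            continue
        # Unreliable channel: each copy may be dropped or delivered twice.
        for copy in range(rng.choice([0, 1, 2])):
            dst.receive(msg)

# Dropped delta-groups are subsumed by an occasional full-state round.
for src in replicas:
    full = src.ship(send_full_state=True)
    for dst in replicas:
        dst.receive(full)

assert replicas[0].x == replicas[1].x == replicas[2].x
assert sum(replicas[0].x.values()) == 50
```

No increment is lost because each replica's own entry lives in its durable state, and the final full-state round delivers whatever the lossy delta rounds missed.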
5 Causal Consistency
For some CRDTs with commutative operations, like the counter in Fig. 2,
eventual convergence of states may be enough, and thus any anti-entropy algorithm
that satisfies the condition in Proposition 1, like Algorithm 1, can be used. How-
ever, stronger consistency guarantees, like causal consistency, are often required
by today’s applications. When using an anti-entropy mechanism which dissem-
inates deltas with no order guarantees (like Algorithm 1) the execution is, in
general, not causally consistent.
Traditional state-based CRDTs converge using joins of the full state, which
implicitly ensures per-object causal consistency [11]: each state of some replica
of an object reflects the causal past of operations on the object (either applied
locally, or applied at other replicas and transitively joined).
Therefore, it is desirable to have δ-CRDTs offer the same causal-consistency
guarantees that standard state-based CRDTs offer. This raises the question
of how delta propagation and merging in δ-CRDTs can be constrained (and
expressed in an anti-entropy algorithm) so as to give the same results
as if a standard state-based CRDT were used. Towards this objective, it is
useful to define a particular kind of delta-group, which we call a delta-interval:
Definition 4 (Delta-interval). Given a replica $i$ progressing along the
states $X_i^0, X_i^1, \ldots$, by joining delta $d_i^k$ (either local
delta-mutation or received delta-group) into $X_i^k$ to obtain $X_i^{k+1}$, a
delta-interval $\Delta_i^{a,b}$ is a delta-group resulting from joining deltas
$d_i^a, \ldots, d_i^{b-1}$:
$$\Delta_i^{a,b} = \bigsqcup \{ d_i^k \mid a \le k < b \}$$
The use of delta-intervals in anti-entropy algorithms will be a key ingredient
towards achieving causal consistency. We now define a restricted kind of anti-
entropy algorithms for δ-CRDTs.
Definition 5 (Delta-interval-based anti-entropy algorithm). A given anti-
entropy algorithm for δ-CRDTs is delta-interval-based, if all deltas sent to other
replicas are delta-intervals.
Moreover, to achieve causal consistency the next condition must be satisfied:
Definition 6 (Causal delta-merging condition). A delta-interval based
anti-entropy algorithm is said to satisfy the causal delta-merging condition
if the algorithm only joins $\Delta_j^{a,b}$ from replica $j$ into replica $i$
states $X_i$ that satisfy:
$$X_i \sqsupseteq X_j^a.$$
This means that a delta-interval is only joined into states that at least reflect
(i.e., subsume) the state into which the first delta in the interval was previously
joined. The causal delta-merging condition is important since any delta-interval
based anti-entropy algorithm of a δ-CRDT that satisfies it, can be used to obtain
the same outcome of standard CRDTs; this is formally stated in Proposition 2.
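The effect of the condition can be illustrated with a small Python sketch. It assumes the counter lattice (maps joined by point-wise maximum), and the delta sequence at replica $j$ as well as the helper names `leq`, `interval` and `may_join` are made up for the example.

```python
from functools import reduce

def join(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def leq(a, b):
    """Lattice order on maps: every entry of a is dominated by b."""
    return all(v <= b.get(k, 0) for k, v in a.items())

def interval(deltas, a, b):
    """Delta-interval: join of the deltas d^a, ..., d^(b-1)."""
    return reduce(join, deltas[a:b], {})

# Deltas joined at replica j, in order: local increments and one delta-group
# received from a third replica k.
deltas_j = [{"j": 1}, {"k": 2}, {"j": 2}, {"j": 3}]
states_j = [{}]
for d in deltas_j:
    states_j.append(join(states_j[-1], d))   # X^(k+1) is X^k joined with d^k

def may_join(x_i, a):
    """Causal delta-merging condition: X_i subsumes X_j^a."""
    return leq(states_j[a], x_i)

x_i = {"j": 1, "k": 2, "i": 5}     # already reflects X_j^2
delta = interval(deltas_j, 2, 4)

assert may_join(x_i, 2)
# Joining the interval then equals joining the full state X_j^4.
assert join(x_i, delta) == join(x_i, states_j[4])

# A replica that has seen nothing from j must not receive this interval:
# the result would miss the entry for k that X_j^4 carries.
assert not may_join({"i": 5}, 2)
assert join({"i": 5}, delta) != join({"i": 5}, states_j[4])
```

When the condition holds, the receiver already subsumes $X_j^a$, so adding the interval is indistinguishable from adding $X_j^b$.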
Proposition 2. (CRDT and δ-CRDT correspondence) Let $(S, M, Q)$ be a standard
state-based CRDT and $(S, M^\delta, Q)$ a corresponding delta-state
decomposition. Any δ-CRDT state reachable by an execution $E^\delta$ over
$(S, M^\delta, Q)$, by a delta-interval based anti-entropy algorithm
$A^\delta$ satisfying the causal delta-merging condition, is equal to a state
resulting from an execution $E$ over $(S, M, Q)$, having the corresponding
data-type operations, by an anti-entropy algorithm $A$ for state-based CRDTs.
Proof. See appendix.
Corollary 1. (δ-CRDT causal consistency) Any δ-CRDT in which states are
propagated and joined using a delta-interval-based anti-entropy algorithm satis-
fying the causal delta-merging condition ensures causal consistency.
Proof. From Proposition 2 and causal consistency of state-based CRDTs.
5.1 Anti-Entropy Algorithm for Causal Consistency
Algorithm 2 is a delta-interval based anti-entropy algorithm which enforces the
causal delta-merging condition. It can be used whenever the causal consistency
guarantees of standard state-based CRDTs are needed. For simplicity, it excludes
some optimizations that are important, but easy to derive, in practice. The
algorithm distinguishes neighbor nodes, and only sends them delta-intervals that
are joined at the receiving node, obeying the delta-merging condition.
Each node $i$ keeps a contiguous sequence of deltas $d_i^l, \ldots, d_i^u$ in
a map $D_i$ from integers to deltas, with $l = \min(\mathrm{dom}(D_i))$ and
$u = \max(\mathrm{dom}(D_i))$. The sequence numbers of deltas are obtained
from the counter $c_i$ that is incremented when a delta (whether a
delta-mutation or delta-interval received) is joined with the current state.
Each node $i$ keeps an acknowledgments map $A_i$ that stores, for each
neighbor $j$, the index $b$ such that $\Delta_i^{a,b}$ is the last
delta-interval acknowledged by $j$ (after it receives $\Delta_i^{a,b}$ from
$i$ and joins it into $X_j$).

    inputs:
        n_i ∈ P(I), set of neighbors
    durable state:
        X_i ∈ S, CRDT state; initially X_i = ⊥
        c_i ∈ ℕ, sequence number; initially c_i = 0
    volatile state:
        D_i ∈ ℕ → S, sequence of deltas; initially D_i = {}
        A_i ∈ I → ℕ, acknowledgments map; initially A_i = {}

    on receive_{j,i}(delta, d, n)
        X_i' = X_i ⊔ d
        D_i' = D_i{c_i ↦ d}
        c_i' = c_i + 1
        send_{i,j}(ack, n)

    on receive_{j,i}(ack, n)
        A_i' = A_i{j ↦ max(A_i(j), n)}

    on operation_i(m^δ)
        d = m^δ(X_i)
        X_i' = X_i ⊔ d
        D_i' = D_i{c_i ↦ d}
        c_i' = c_i + 1

    periodically // ship delta-interval or state
        j = random(n_i)
        if D_i = {} ∨ min(dom(D_i)) > A_i(j) then
            d = X_i
        else
            d = ⨆{D_i(l) | A_i(j) ≤ l < c_i}
        send_{i,j}(delta, d, c_i)

    periodically // garbage collect deltas
        l = min{n | (_, n) ∈ A_i}
        D_i' = {(n, d) ∈ D_i | n ≥ l}

Algorithm 2: Anti-entropy algorithm ensuring causal consistency of δ-CRDT.
Node $i$ sends a delta-interval $d = \Delta_i^{a,b}$ with a (delta, $d$, $b$)
message; the receiving node $j$, after joining $\Delta_i^{a,b}$ into its
replica state, replies with an acknowledgment message (ack, $b$); when an ack
from $j$ is successfully received by node $i$, it updates the entry of $j$ in
the acknowledgment map, using the max function. This handles possible old
duplicates and messages arriving out of order.
Like the δ-CRDT state, the counter $c_i$ is also kept in durable storage.
This is essential to avoid conflicts after potential crash and recovery
incidents. Otherwise, there would be the danger that some delayed ack, for a
delta-interval sent before crashing, makes the node skip sending some deltas
generated after recovery, thus violating the delta-merging condition.
The algorithm for node $i$ periodically picks a random neighbor $j$. In
principle, $i$ sends the join of all deltas from the point that $j$ has acked
onward. Exceptionally, $i$ sends the entire state in two cases: (1) if the
sequence of deltas $D_i$ is empty, or (2) if $j$ is expecting from $i$ a delta
that was already removed from $D_i$ (e.g., after a crash and recovery); $i$
tracks this in $A_i(j)$. To garbage collect old deltas, the algorithm
periodically removes the deltas that have been acked by all neighbors.
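The bookkeeping described above can be sketched in Python. This is a simplified, single-process rendering under stated assumptions: the δ-CRDT is the counter of Fig. 2, message passing is replaced by direct method calls, the class and method names are illustrative, and the garbage collector treats a neighbor without an entry in the acknowledgments map as having acked nothing, which is a conservative choice of this sketch.

```python
from functools import reduce

def join(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def inc_delta(i, m):
    return {i: m.get(i, 0) + 1}

class CausalReplica:
    def __init__(self, rid, neighbors):
        self.rid, self.neighbors = rid, neighbors
        self.x, self.c = {}, 0             # durable: state X_i and counter c_i
        self.deltas, self.acks = {}, {}    # volatile: D_i and A_i

    def _store(self, d):
        self.x = join(self.x, d)
        self.deltas[self.c] = d
        self.c += 1

    def operation(self, delta_mutator):
        self._store(delta_mutator(self.rid, self.x))

    def on_delta(self, sender, d, n):
        self._store(d)
        return ("ack", n)                  # reply to be sent back to sender

    def on_ack(self, sender, n):
        self.acks[sender] = max(self.acks.get(sender, 0), n)

    def ship(self, j):
        """Build the (delta, d, c_i) message for neighbor j."""
        acked = self.acks.get(j, 0)
        if not self.deltas or min(self.deltas) > acked:
            d = dict(self.x)               # fall back to the full state
        else:
            d = reduce(join, (self.deltas[l] for l in range(acked, self.c)), {})
        return ("delta", d, self.c)

    def garbage_collect(self):
        low = min((self.acks.get(j, 0) for j in self.neighbors), default=0)
        self.deltas = {n: d for n, d in self.deltas.items() if n >= low}

a, b = CausalReplica("a", ["b"]), CausalReplica("b", ["a"])
a.operation(inc_delta)
a.operation(inc_delta)

_, d1, n1 = a.ship("b")              # interval covering deltas 0 and 1
b.on_delta("a", d1, n1)              # the ack for this copy is lost
_, d1_again, _ = a.ship("b")         # so a resends the same interval
_, acked = b.on_delta("a", d1_again, n1)
a.on_ack("b", acked)

a.operation(inc_delta)
_, d2, n2 = a.ship("b")              # only the unacknowledged delta is sent
a.garbage_collect()
b.on_delta("a", d2, n2)
assert b.x == a.x == {"a": 3}
```

The resend after the lost ack is harmless because the interval is joined idempotently, and the second message shrinks to the single delta that $b$ has not yet acknowledged.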
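$$
\begin{aligned}
\Sigma &= \mathcal{P}(I \times \mathbb{N} \times E) \times \mathcal{P}(I \times \mathbb{N}) \\
\sigma_i^0 &= (\{\}, \{\}) \\
\mathsf{add}^\delta_i(e, (s, t)) &= (\{(i, n + 1, e)\}, \{\}) \quad \text{with } n = \max(\{k \mid (i, k, \_) \in s\}) \\
\mathsf{rmv}^\delta_i(e, (s, t)) &= (\{\}, \{(j, n) \mid (j, n, e) \in s\}) \\
\mathsf{elements}_i((s, t)) &= \{e \mid (j, n, e) \in s \land (j, n) \notin t\} \\
(s, t) \sqcup (s', t') &= (s \cup s', t \cup t')
\end{aligned}
$$
(a) With Tombstones

$$
\begin{aligned}
\Sigma &= \mathcal{P}(I \times \mathbb{N} \times E) \times \mathcal{P}(I \times \mathbb{N}) \\
\sigma_i^0 &= (\{\}, \{\}) \\
\mathsf{add}^\delta_i(e, (s, c)) &= (\{(i, n + 1, e)\}, \{(i, n + 1)\}) \quad \text{with } n = \max(\{k \mid (i, k) \in c\}) \\
\mathsf{rmv}^\delta_i(e, (s, c)) &= (\{\}, \{(j, n) \mid (j, n, e) \in s\}) \\
\mathsf{elements}_i((s, c)) &= \{e \mid (j, n, e) \in s\} \\
(s, c) \sqcup (s', c') &= ((s \cap s') \cup \{(i, n, e) \in s \mid (i, n) \notin c'\} \\
&\qquad \cup \{(i, n, e) \in s' \mid (i, n) \notin c\},\; c \cup c')
\end{aligned}
$$
(b) Without Tombstones (optimized)

Fig. 3: Add-wins observed-remove δ-CRDT set, replica i.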
Proposition 3. Algorithm 2 produces the same reachable states as a standard
algorithm over a CRDT for which the δ-CRDT is a decomposition.
Proof. See appendix.
6 δ-CRDTs for Add-Wins OR-Sets
An Add-wins Observed-Remove Set is a well-known CRDT datatype that offers
the same semantics as a sequential set when used sequentially and adopts a
specific resolution semantics for operations that concurrently add and remove
the same element.
Add-wins means that an add prevails over a concurrent remove. Remove opera-
tions, however, only affect elements added by causally preceding adds.
Fig. 3a depicts a simple, but inefficient, δ-CRDT implementation of a
state-based add-wins OR-Set. The state $\Sigma$ consists of a set of tagged
elements and a set of tags, acting as tombstones. Globally unique tags of the
form $I \times \mathbb{N}$ are used, and uniqueness is ensured by pairing a
replica identifier in $I$ with a monotonically increasing natural counter.
Once an element $e \in E$ is added to the set, the delta-mutator
$\mathsf{add}^\delta$ creates a globally unique tag by incrementing the
highest tag present in its local state that was created by replica $i$ itself
(max returns 0 if no tag is present). This tag is paired with value $e$ and
stored as a new unique triple. Since removes should only tombstone elements
that are added before the remove operation, the delta-mutator
$\mathsf{rmv}^\delta$ places in the tombstone set all tags associated with the
element $e$ being removed from the local state. Function elements only returns
the elements that are added but not tombstoned yet. Join $\sqcup$ simply
unions the respective sets that are, therefore, both grow-only.
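The tombstone design can be exercised with a short Python sketch. It assumes a state is a pair of Python sets, and the scenario (replicas `a` and `b`, element `apple`) is made up to show the add-wins outcome.

```python
def add_delta(i, e, state):
    s, t = state
    # Highest counter among the tags created by replica i itself; 0 if none.
    n = max((k for (j, k, _) in s if j == i), default=0)
    return ({(i, n + 1, e)}, set())

def rmv_delta(i, e, state):
    s, t = state
    # Tombstone every tag currently associated with e in the local state.
    return (set(), {(j, n) for (j, n, x) in s if x == e})

def elements(state):
    s, t = state
    return {e for (j, n, e) in s if (j, n) not in t}

def join(a, b):
    return (a[0] | b[0], a[1] | b[1])

bottom = (set(), set())
x_a = join(bottom, add_delta("a", "apple", bottom))   # a adds apple
x_b = join(bottom, x_a)                               # b learns of it

# Concurrently: b removes apple while a adds it again under a fresh tag.
x_b = join(x_b, rmv_delta("b", "apple", x_b))
x_a = join(x_a, add_delta("a", "apple", x_a))

merged = join(x_a, x_b)
assert elements(x_b) == set()
assert elements(merged) == {"apple"}    # the concurrent add wins
assert len(merged[0]) == 2              # removed triples are never discarded
assert len(merged[1]) == 1              # and neither are tombstones
```

The remove only tombstones the tag it observed, so the triple created by the concurrent add survives the merge; the price is that both sets only ever grow.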
A more efficient design is presented in Fig. 3b, which offers the same
semantics and has a similar state structure; however, it uses a different
join-semilattice, allowing the set of tagged elements to shrink as elements
are removed. Now, elements returns all elements in the tagged set $s$. Instead
of the tombstone set, a causal context set is used. Adding an element creates
a unique tag by resorting to the causal context $c$ instead of $s$, which can
now shrink; the new triple is added to $s$ as before, but now the new tag is
also added to the causal context $c$. The delta-mutator $\mathsf{rmv}^\delta$
is the same as before, collecting all tags associated with the element being
removed. The desired semantics are achieved by the novel join operation
$\sqcup$. To join two states, their causal contexts are simply unioned,
whereas the new tagged element set only preserves: (1) the triples present in
both sets (therefore, not removed in either), and also (2) any triple present
in one of the sets and whose tag is not present in the causal context of the
other state.
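The join of the optimized design is easiest to follow on a concrete run. The Python sketch below assumes states are pairs of sets and, following the prose above, gives the new triple the same fresh tag that is added to the causal context; the scenario with replicas `a` and `b` is made up.

```python
def add_delta(i, e, state):
    s, c = state
    # The fresh tag comes from the causal context, which never shrinks.
    n = max((k for (j, k) in c if j == i), default=0)
    return ({(i, n + 1, e)}, {(i, n + 1)})

def rmv_delta(i, e, state):
    s, c = state
    return (set(), {(j, n) for (j, n, x) in s if x == e})

def elements(state):
    return {e for (_, _, e) in state[0]}

def join(a, b):
    (s1, c1), (s2, c2) = a, b
    kept = (
        (s1 & s2)                                                # in both
        | {(i, n, e) for (i, n, e) in s1 if (i, n) not in c2}    # unseen by b
        | {(i, n, e) for (i, n, e) in s2 if (i, n) not in c1}    # unseen by a
    )
    return (kept, c1 | c2)

bottom = (set(), set())
x_a = join(bottom, add_delta("a", "apple", bottom))   # a adds apple
x_b = join(bottom, x_a)                               # b learns of it

# Concurrently: b removes apple while a adds it again under a fresh tag.
x_b = join(x_b, rmv_delta("b", "apple", x_b))
x_a = join(x_a, add_delta("a", "apple", x_a))

merged = join(x_a, x_b)
assert x_b == (set(), {("a", 1)})       # the removed triple is really gone
assert elements(merged) == {"apple"}    # the concurrent add wins
assert merged[0] == {("a", 2, "apple")}
assert join(x_b, x_a) == merged and join(merged, merged) == merged
```

A triple that one side lacks is dropped only if that side's causal context shows it has already seen the tag, which is how a remove is distinguished from an add that has not arrived yet.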
Causal Context Compression. For presentation simplicity, this optimized version
of the add-wins OR-Set has a grow-only causal context that collects all the
unique tags (even from elements added but no longer present). In practice the
causal context can be efficiently compressed without any loss of information.
When using an anti-entropy algorithm that provides causal consistency, e.g.,
Algorithm 2, then for each replica state $X_i = (s_i, c_i)$ and replica
identifier $j \in I$, we have a contiguous sequence:
$$1 \le n \le \max(\{k \mid (j, k) \in c_i\}) \Rightarrow (j, n) \in c_i.$$
Thus, the causal context can always be encoded as a compact version vector [12]
$I \to \mathbb{N}$ that keeps the maximum sequence number for each replica. Even under
non-causal anti-entropy, compression is still possible by keeping a version vector
that encodes the offset of the contiguous sequence of tags from each replica,
together with a set for the non-contiguous tags. As anti-entropy proceeds, each
tag is eventually encoded in the vector, and thus the set remains typically small.
Compression is less likely for the causal context of delta-groups in transit or
buffered to be sent, but those contexts are only transient and smaller than those
in the actual replica states. Moreover, the same techniques that encode contigu-
ous sequences of tags can also be used for transient context compression [13].
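The vector-plus-set encoding for the non-causal case can be sketched as follows. This is an illustrative Python fragment, not the encoding of any particular library: the names `compress` and `contains` are made up, and tags are assumed to be numbered from 1 per replica.

```python
def compress(vector, cloud):
    """Fold tags into the version vector where they extend it contiguously.

    vector maps a replica id to the top of its contiguous run of tags;
    cloud is a set of (replica, counter) tags. Returns the updated vector
    and the tags that still sit beyond a gap.
    """
    vector, rest = dict(vector), set()
    for (j, n) in sorted(cloud):          # ascending counters per replica
        top = vector.get(j, 0)
        if n == top + 1:
            vector[j] = n                 # extends the contiguous prefix
        elif n > top + 1:
            rest.add((j, n))              # gap: keep the tag explicitly
        # n <= top: already covered by the vector, so it is dropped
    return vector, rest

def contains(vector, cloud, tag):
    j, n = tag
    return n <= vector.get(j, 0) or tag in cloud

# Tags arrive out of causal order: (a, 3) and (b, 1) are still missing.
vector, cloud = compress({}, {("a", 1), ("a", 2), ("a", 4), ("b", 2)})
assert vector == {"a": 2} and cloud == {("a", 4), ("b", 2)}
assert contains(vector, cloud, ("a", 4)) and not contains(vector, cloud, ("a", 3))

# Once anti-entropy delivers the missing tags, the set empties again.
vector, cloud = compress(vector, cloud | {("a", 3), ("b", 1)})
assert vector == {"a": 4, "b": 2} and cloud == set()
```

Under a causally consistent algorithm the set is always empty, which is the case where the context reduces to a plain version vector.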
7 Message Complexity
Our delta-based framework, δ-CRDT, clearly introduces significant cost improve-
ments on messaging. Despite being generic, δ-CRDT requires deltas to be defined
per datatype. This makes the bit-message complexity datatype-based rather than
generic. To give an intuition about this complexity, we address the two datatypes
introduced above: counter and OR-Set. In classical state-based counter CRDTs,
the entire map of the counter is shipped. As the map size grows with the number
of replicas, this leads to a bit-message complexity of $\tilde{O}(|I|)$.¹ In the
δ-CRDT case, by contrast, only the $\alpha$ recently updated map entries are shipped,
yielding a bit-message complexity of $\tilde{O}(\alpha)$, where $\alpha \ll |I|$. Moreover,
if transitive forwarding is not allowed, this drops to a constant, $\tilde{O}(1)$. As for
the OR-Set, classical OR-Set CRDTs ship the entire state, which yields a bit-message
complexity of $O(S)$, where $S$ is the state size. In δ-CRDT, only deltas are shipped,
which yields a bit-message complexity of $O(s)$, where $s$ is the size of the updates
that occurred since the last shipping. Typically $s \ll S$, since the updates that occur
on a state in a period of time are often far fewer than the total number of items.
¹ $\tilde{O}$ is a variant of big $O$ that ignores logarithmic factors in the size of integers and ids.
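The counter case can be made concrete with a small Python sketch. It assumes the counter state is a map from replica identifiers to integers joined by pointwise maximum; the function names (`inc_delta`, `join`, `value`) and the replica names are hypothetical and serve only this example.

```python
def inc_delta(state, i):
    """Delta-mutator: return only replica i's entry, incremented by one."""
    return {i: state.get(i, 0) + 1}


def join(a, b):
    """Join of two counter maps: pointwise maximum over all keys."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state):
    """Query: the counter value is the sum of all entries."""
    return sum(state.values())


replicas = [f"r{n}" for n in range(100)]
local = {r: 5 for r in replicas}     # full state: one entry per replica
remote = dict(local)                 # a remote replica holding the same state

delta = inc_delta(local, "r7")       # the increment yields a one-entry delta
local = join(local, delta)           # joined into the local state...
remote = join(remote, delta)         # ...and into the remote one after shipping

print(len(local), len(delta))        # 100 1
print(value(remote))                 # 501
```

Shipping `delta` instead of `local` sends one map entry rather than one hundred, and joining the same delta twice leaves the state unchanged because the join is idempotent.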
8 Related Work
Eventually convergent data types. The design of replicated systems that are
always available and eventually converge can be traced back to historical designs
in [14,15], among others. More recently, replicated data types that always
eventually converge, either by reliably broadcasting operations (called operation-based)
or by gossiping and merging states (called state-based), have been formalized
as CRDTs [16,7,5,6]. These are also closely related to Bloom^L [17] and Cloud
Types [18]. State semi-lattices were used for deterministic parallel programming
in LVars [19], where variables progress in the lattice order by joining other values,
and are only accessible by special threshold reads.
Message size. A key feature of δ-CRDT is message size reduction and coalescing,
using small-sized deltas. The old and general idea of using differences between
things, called “deltas” in many contexts, can lead to many designs, depending on
how exactly a delta is defined. The state-based deltas introduced for Computa-
tional CRDTs [20] require an extra delta-specific merge which does not ensure
idempotence. In [21], an improved synchronization method for non-optimized
OR-set CRDT [5] is presented, where delta information is propagated; in that
paper deltas are a collection of items (related to update events between synchro-
nizations), manipulated and merged through a protocol, as opposed to normal
states in the semilattice. No generic framework is defined (that could encompass
other data types) and the protocol requires several communication steps to compute
the information to exchange. Operation-based CRDTs [5,6,22] also support
small message sizes; in particular, pure flavors [22] restrict messages
to the operation name and possible arguments. Though pure operation-based
CRDTs allow for compact states and are very fast at the source (since operations
are broadcast without consulting the local state), the model requires more
system guarantees than δ-CRDTs do, e.g., exactly-once reliable delivery and
membership information, and imposes more complex integration of new replicas.
Encoding causal histories. State-based CRDTs are always designed to be causally
consistent [7,6]. Optimized implementations of sets, maps, and multi-value registers
can build on this assumption to keep the meta-data small [11]. In δ-CRDT,
however, deltas and delta-groups are normally not causally consistent, and thus
the design of join, the meta-data state, and the anti-entropy algorithm used
must together ensure causal consistency. Without causal consistency, the causal
context in δ-CRDT cannot always be summarized with version vectors, and consequently,
techniques that allow for gaps are often used. A well-known mechanism that allows for
encoding of gaps is found in Concise Version Vectors [23]. Interval Version Vectors [13],
later on, introduced an encoding that optimizes sequences and allows
gaps, while preserving efficiency when gaps are absent.
9 Conclusion
CRDTs allow flexible, while principled, design of distributed protocols that trade
strict consistency for improved availability, faster response time, and support for
disconnected operation. These benefits are harvested once a given application
can model all, or a part, of its behavior using CRDTs. In particular, state-based
CRDTs allow idempotent gossiping of states and require very basic guarantees
from the network, as they cope with message loss, re-ordering, and duplication.
State-based CRDTs come with a price: states have a potential to get very
large. This is even worse if multiple objects are composed into a single CRDT
object to benefit from intra-replica consistency and atomicity. In this paper,
we addressed these limitations by introducing the new concept of δ-CRDT,
devising delta-mutators over state-based datatypes that detach the changes
that an operation induces on the state. This brings a significant performance gain,
as it allows shipping only small states, i.e., deltas, instead of the entire state.
A key property of δ-CRDT is that it preserves the crucial properties
(idempotence, associativity, and commutativity) of standard state-based CRDTs.
We have shown how δ-CRDT can achieve convergence with or without
causal consistency, and we presented an anti-entropy algorithm for each
case. In particular, the causally consistent algorithm allows replacing classical
state-based CRDTs by more efficient ones, while preserving their properties. As
a first application of our approach, we designed a novel δ-CRDT specification
for a well-known and widely used datatype: an optimized observed-remove set.
References
1. Cribbs, S., Brown, R.: Data structures in Riak. In: Riak Conference (RICON),
San Francisco, CA, USA (October 2012)
2. Terry, D.B., Theimer, M.M., Petersen, K., Demers, A.J., Spreitzer, M.J., Hauser,
C.H.: Managing update conflicts in Bayou, a weakly connected replicated storage
system. In: Symp. on Op. Sys. Principles (SOSP), Copper Mountain, CO, USA,
ACM SIGOPS, ACM Press (December 1995) 172–182
3. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin,
A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly avail-
able key-value store. In: Symp. on Op. Sys. Principles (SOSP). Volume 41 of
Operating Systems Review., Stevenson, Washington, USA, Assoc. for Computing
Machinery (October 2007) 205–220
4. Gilbert, S., Lynch, N.: Brewer’s conjecture and the feasibility of consistent, avail-
able, partition-tolerant web services. SIGACT News 33(2) (2002) 51–59
5. Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M.: A comprehensive study of
Convergent and Commutative Replicated Data Types. Rapp. Rech. 7506, Institut
National de la Recherche en Informatique et Automatique (INRIA), Rocquencourt,
France (January 2011)
6. Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M.: Conflict-free replicated data
types. In Défago, X., Petit, F., Villain, V., eds.: Int. Symp. on Stabilization, Safety,
and Security of Distributed Systems (SSS). Volume 6976 of Lecture Notes in Comp.
Sc., Grenoble, France, Springer-Verlag (October 2011) 386–400
7. Baquero, C., Moura, F.: Using structural characteristics for autonomous operation.
Operating Systems Review 33(4) (1999) 90–96
8. Brown, R., Cribbs, S., Meiklejohn, C., Elliott, S.: Riak dt map: A composable,
convergent replicated dictionary. In: Proceedings of the First Workshop on Princi-
ples and Practice of Eventual Consistency. PaPEC ’14, New York, NY, USA, ACM
(2014) 1:1–1:1
9. Baquero, C.: Delta-crdt-cpp. https://github.com/CBaquero/delta-enabled-crdts
10. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order (2. ed.). Cam-
bridge University Press (2002)
11. Burckhardt, S., Gotsman, A., Yang, H., Zawirski, M.: Replicated data types:
specification, verification, optimality. In Jagannathan, S., Sewell, P., eds.: POPL,
ACM (2014) 271–284
12. Parker, D.S., Popek, G.J., Rudisin, G., Stoughton, A., Walker, B.J., Walton, E.,
Chow, J.M., Edwards, D., Kiser, S., Kline, C.: Detection of mutual inconsistency
in distributed systems. IEEE Trans. Softw. Eng. 9(3) (May 1983) 240–247
13. Mukund, M., R., G.S., Suresh, S.P.: Optimized OR-sets without ordering constraints.
In: Proceedings of the International Conference on Distributed Computing
and Networking, New York, NY, USA, ACM (2014) 227–241
14. Wuu, G.T.J., Bernstein, A.J.: Efficient solutions to the replicated log and dictio-
nary problems. In: Symp. on Principles of Dist. Comp. (PODC), Vancouver, BC,
Canada (August 1984) 233–242
15. Johnson, P.R., Thomas, R.H.: The maintenance of duplicate databases. Internet
Request for Comments RFC 677, Information Sciences Institute (January 1976)
16. Letia, M., Preguiça, N., Shapiro, M.: CRDTs: Consistency without concurrency
control. Rapp. Rech. RR-6956, Institut National de la Recherche en Informatique
et Automatique (INRIA), Rocquencourt, France (June 2009)
17. Conway, N., Marczak, W.R., Alvaro, P., Hellerstein, J.M., Maier, D.: Logic and lat-
tices for distributed programming. In: Proceedings of the Third ACM Symposium
on Cloud Computing, ACM (2012) 1
18. Burckhardt, S., Fähndrich, M., Leijen, D., Wood, B.P.: Cloud types for eventual
consistency. In: ECOOP 2012–Object-Oriented Programming. Springer (2012)
283–307
19. Kuper, L., Newton, R.R.: LVars: lattice-based data structures for deterministic
parallelism. In: Proceedings of the 2nd ACM SIGPLAN workshop on Functional
high-performance computing, ACM (2013) 71–84
20. Navalho, D., Duarte, S., Preguiça, N., Shapiro, M.: Incremental stream processing
using computational conflict-free replicated data types. In: Proceedings of the 3rd
International Workshop on Cloud Data and Platforms, ACM (2013) 31–36
21. Deftu, A., Griebsch, J.: A scalable conflict-free replicated set data type. In: Pro-
ceedings of the 2013 IEEE 33rd International Conference on Distributed Comput-
ing Systems. ICDCS ’13, Washington, DC, USA, IEEE Computer Society (2013)
186–195
22. Baquero, C., Almeida, P.S., Shoker, A.: Making operation-based CRDTs operation-
based. In: to appear in Proceedings of Distributed Applications and Interoperable
Systems: 14th IFIP WG 6.1 International Conference, Springer (2014)
23. Malkhi, D., Terry, D.: Concise version vectors in WinFS. Distributed Computing
20(3) (2007) 209–219
A Proof of Proposition 1
Proof. By simulation, establishing a correspondence between an execution $E_\delta$
and an execution $E$ of a standard CRDT of which $(S, M^\delta, Q)$ is a decomposition, as
follows: 1) the state $(X_i, D_i, \ldots)$ of each node in $E_\delta$, containing the CRDT state $X_i$,
information about delta-intervals $D_i$, and possibly other information, corresponds
to only the $X_i$ component (in the same join-semilattice); 2) for each action which is
a delta-mutation $m^\delta$ in $E_\delta$, $E$ executes the corresponding mutation $m$, satisfying
$m(X) = X \sqcup m^\delta(X)$; 3) whenever $E_\delta$ contains a send action of a delta-interval
$\Delta_i^{a,b}$, execution $E$ contains a send action containing the full state $X_i^b$; 4) whenever
$E_\delta$ performs a join into some $X_i$ of a delta-interval $\Delta_j^{a,b}$, execution $E$ delivers
and joins the corresponding message containing the full CRDT state $X_j^b$. By
induction on the length of the trace, assume that for each replica $i$, each node
state $X_i$ in $E$ is equal to the corresponding component in the node state in $E_\delta$, up
to the last action in the global trace. A send action does not change replica state,
preserving the correspondence. Replica states only change either by performing
data-type update operations or upon message delivery by merging deltas/states
respectively. If the next action is an update operation, the correspondence is
preserved due to the delta-state decomposition property $m(X) = X \sqcup m^\delta(X)$.
If the next action is a message delivery at replica $i$, with a merging of a
delta-interval/state from another replica $j$, then, because algorithm $A_\delta$ satisfies the
causal delta-merging condition, it only joins into state $X_i^k$ a delta-interval
$\Delta_j^{a,b}$ if $X_i^k \sqsupseteq X_j^a$.
In this case, the outcome will be:
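\begin{align*}
X_i^{k+1} &= X_i^k \sqcup \Delta_j^{a,b} \\
          &= X_i^k \sqcup \bigsqcup \{ d_j^l \mid a \le l < b \} \\
          &= X_i^k \sqcup X_j^a \sqcup \bigsqcup \{ d_j^l \mid a \le l < b \} \\
          &= X_i^k \sqcup X_j^a \sqcup d_j^a \sqcup d_j^{a+1} \sqcup \ldots \sqcup d_j^{b-1} \\
          &= X_i^k \sqcup X_j^{a+1} \sqcup d_j^{a+1} \sqcup \ldots \sqcup d_j^{b-1} \\
          &= \ldots \\
          &= X_i^k \sqcup X_j^{b-1} \sqcup d_j^{b-1} \\
          &= X_i^k \sqcup X_j^b
\end{align*}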
The resulting state $X_i^{k+1}$ in $E_\delta$ will be, therefore, the same as the corresponding
one in $E$ where the full CRDT state from $j$ has been joined, preserving the
correspondence between $E_\delta$ and $E$.
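The central step of this proof can be checked on a toy instance. The Python sketch below assumes a grow-only set as the join-semilattice (join is set union) and singleton deltas; the names `deltas`, `states`, and `interval` are hypothetical. It confirms that joining a delta-interval gives the same result as joining the full state when the receiver already includes the state at the start of the interval, and that the two results differ when it does not.

```python
from functools import reduce


def join(x, y):
    """Join of the grow-only set lattice: set union."""
    return x | y


# Deltas d^0 .. d^4 produced at replica j, and the states X^0 .. X^5 they induce.
deltas = [frozenset({e}) for e in "abcde"]
states = [frozenset()]
for d in deltas:
    states.append(join(states[-1], d))   # X^{l+1} = X^l joined with d^l


def interval(a, b):
    """Delta-interval from a to b: the join of d^a .. d^{b-1}."""
    return reduce(join, deltas[a:b], frozenset())


a, b = 2, 5

# Receiver that satisfies the causal delta-merging condition (it includes X^a).
x_i = join(states[a], frozenset({"z"}))
same = join(x_i, interval(a, b)) == join(x_i, states[b])

# Receiver that violates the condition: it never saw the first two deltas.
stale = frozenset({"z"})
differs = join(stale, interval(a, b)) != join(stale, states[b])

print(same, differs)   # True True
```

The second case shows why the condition matters: without it, the elements carried by the earlier deltas are missing at the receiver, so the delta-based execution no longer matches the full-state execution.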
B Proof of Proposition 3
Proof. From Proposition 1, it is enough to prove that the algorithm satisfies
the causal delta-merging condition. The algorithm explicitly keeps deltas $d_i^k$
tagged with increasing sequence numbers (even after a crash), in accordance with
the definition; node $j$ only sends to $i$ a delta-interval $\Delta_j^{a,b}$ if $i$ has acked $a$; this ack
is sent only if $i$ has already joined some delta-interval (possibly a full state) $\Delta_j^{k,a}$.
Either $k = 0$ or, by the same reasoning, this $\Delta_j^{k,a}$ could only have been joined
at $i$ if some other interval $\Delta_j^{l,k}$ had already been joined into $i$. This reasoning
can be applied recursively until a delta-interval starting from zero is reached. Therefore,
$$X_i \sqsupseteq \bigsqcup \{ d_j^k \mid 0 \le k < a \} = \Delta_j^{0,a} = X_j^a.$$