Eventually Consistent Transactions (full version)
January 6, 2012
Technical Report
MSR-TR-2011-117
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
Eventually Consistent Transactions
Sebastian Burckhardt, Manuel Fähndrich, Daan Leijen
Microsoft Research
{sburckha,daan,maf}@microsoft.com
Mooly Sagiv
Tel-Aviv University
msagiv@post.tau.ac.il
Abstract
When distributed clients query or update shared data, eventual
consistency can provide better availability than strong consistency
models. However, programming and implementing such systems
can be difficult unless we establish a reasonable consistency model,
i.e. some minimal guarantees that programmers can understand and
systems can provide effectively.
To this end, we propose a novel consistency model based on
eventually consistent transactions. Unlike serializable transactions,
eventually consistent transactions are ordered by two order rela-
tions (visibility and arbitration) rather than a single order relation.
To demonstrate that eventually consistent transactions can be effec-
tively implemented, we establish a handful of simple operational
rules for managing replicas, versions and updates, based on graphs
called revision diagrams. We prove that these rules are sufficient to
guarantee correct implementation of eventually consistent transac-
tions. Finally, we present two operational models (single server and
server pool) of systems that provide eventually consistent transac-
tions.
1. Introduction
Eventual Consistency [16] is a well-known workaround to the fun-
damental problem of providing CAP [8] (consistency, availability,
and partition tolerance) to clients that perform queries and updates
against shared data in a distributed system. It weakens traditional
consistency guarantees (such as linearizability) in order to allow
clients to perform updates against any replica, at any time. Even-
tually consistent systems guarantee that all updates are eventually
delivered to all replicas, and that they are applied in a consistent
order.
Eventual consistency is popular with system builders. One rea-
son is that it allows temporarily disconnected replicas to remain
fully available to clients. This is particularly useful for implement-
ing clients on mobile devices [19]. Another reason is that it does not
require updates to be immediately performed on all server replicas,
thus improving scalability. In more theoretical terms, the benefit of
eventual consistency can be understood as its ability to delay con-
sensus [15].
However, eventual consistency is a weak consistency model
that breaks with traditional approaches (e.g. serializable operations)
and thus requires developers to be more careful. The essential
problem is that updates are not immediately applied globally, thus
the conditions under which they are applied are subject to change,
which can easily break data invariants. Many eventually consistent
systems address this issue by providing higher-level data types
to programmers. Still, the semantic details often remain sketchy.
Experience has shown that ad-hoc approaches to the semantics and
implementation of such systems can lead to surprising behaviors
(e.g. a shopping cart where deleted items reappear [6]). To take
eventual consistency to its full potential, we need answers to the
following questions:
How can we provide consistency guarantees that are as strong
as possible without forsaking lazy consensus?
How can we effectively understand and implement systems that
provide those guarantees?
In this paper, we propose a two-pronged solution that addresses
both questions, based on (1) a notion of transactions for eventual
consistency, and (2) a general implementation technique based on
revision diagrams.
Eventually consistent transactions differ significantly from tra-
ditional transactions, as they are not serializable. Nevertheless, they
uphold traditional atomicity and isolation guarantees. Even better,
they exhibit some strong properties that simplify the life of pro-
grammers and are not typically offered by traditional transactions:
(1) transactions cannot fail and never roll back, and (2) all code,
even long-running tasks, can run inside transactions without com-
promising performance.
We first present an abstract, concise specification of eventually
consistent transactions. This formalization uses mathematical tech-
niques (sets of events, partial orders, and equivalence relations) that
are commonly used in research on relaxed memory models and
transactional memory. Our definition provides immediate insight
on how eventual consistency is related to strong consistency: the
only difference is that eventual consistency uses two separate order
relations (visibility order and arbitration order) rather than a single
order over transactions.
We then proceed to describe a more concrete and operational
implementation technique based on revision diagrams [5]. Revi-
sion diagrams provide implementors with a simple set of rules for
managing updates and replicas. Revision diagrams make the fork
and join of versions explicit, which determines the visibility and
arbitration of transactions. We prove a theorem that guarantees that
any system following the revision diagram rules provides eventu-
ally consistent transactions according to the abstract definition. We
also illustrate the use of revision diagrams by presenting two sim-
ple system models (one using a single server, and one using a server
pool).
Overall, we make the following contributions:
We introduce a notion of eventually consistent transactions and
give a concise and abstract definition.
We present a systematic approach for building systems that sup-
port such transactions, based on revision diagrams. We present
a precise, operational definition of revision diagrams.
We prove a theorem stating that the revision diagram rules
are sufficient to guarantee eventual consistency. The proof is
nontrivial as it depends on deep structural properties of revision
diagrams.
We illustrate the use of revision diagrams by presenting two
operational system models, using a single server and a server
pool, respectively.
2. Formulation
To get started, we need to establish some precise terminology. Per-
haps the very first question is: what is a database? At a high abstrac-
tion level, databases are no different than abstract data types, which
are semantically defined by the operations they support to update
them and retrieve data. Taking cues from common definitions of
abstract data types, we define:
DEFINITION 1. A query-update interface is a tuple (Q, V, U )
where Q is an abstract set of query operations, V is an abstract set
of values returned by queries, and U is an abstract set of update
operations.
Note that the sets of queries, query results, and updates are not
required to be finite (and usually are not). Query-update interfaces
can apply in various scenarios, where they may describe abstract
data types, relational databases, or simple random-access mem-
ory, for example. For databases, queries are typically defined re-
cursively by a query language.
EXAMPLE 1. Consider random-access memory that supports loads
and stores of bytes in a 64-bit address space A = {a ∈ ℕ | 0 ≤ a < 2^64}.
For that example we define Q = {load(a) | a ∈ A},
V = {v ∈ ℕ | 0 ≤ v < 2^8}, and U = {store(a, v) | a ∈ A and v ∈ V}.
This example is excellent for illustration purposes (we will revisit
it throughout), and it provides an explicit connection between our
results and previous work on relaxed memory models and transac-
tional memory. Of course, most databases also fit in this abstract
interface where the queries are SQL queries and the update opera-
tions are SQL updates like insertion and deletion.
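To make the interface concrete, the sets Q, V, and U of Example 1 can be sketched in Python (a minimal sketch; the function names load and store mirror the operations in the example, and the bound checks stand in for set membership):

```python
# Sketch of the query-update interface from Example 1.
# Queries and updates are plain tagged tuples; the sets Q, V, U are
# represented implicitly by validity checks rather than materialized.

ADDR_BITS, VALUE_BITS = 64, 8

def load(a):
    """Query operation load(a) in Q."""
    assert 0 <= a < 2 ** ADDR_BITS
    return ("load", a)

def store(a, v):
    """Update operation store(a, v) in U."""
    assert 0 <= a < 2 ** ADDR_BITS and 0 <= v < 2 ** VALUE_BITS
    return ("store", a, v)
```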
So far, our interfaces have no inherent meaning. The most direct
way to define the semantics of queries and updates is to relate them
to some notion of state:
DEFINITION 2. A query-update automaton (QUA) for the interface
(Q, V, U) is a tuple (S, s₀) where S is a set of states with (1) an
initial state s₀ ∈ S, (2) an interpretation q# of each query q ∈ Q
as a function S → V, and (3) an interpretation u# of each update
operation u ∈ U as a function S → S.
EXAMPLE 2. The random-access memory interface described in
Example 1 above can be represented by a QUA (S, s₀) where S
is the set of total functions A → V, where s₀ is the constant
function that maps all locations to zero, and where load(a)#(s) =
s(a) and store(a, v)#(s) = s[a ↦ v].
QUAs can naturally support abstract data types (e.g. collections, or
even entire documents) that offer higher-level operations (queries
and updates) beyond just loads and stores. Such data types are often
important when programming against a weak consistency model
[17], since they can ensure that the data representation remains in-
tact when handling concurrent and potentially conflicting updates.
The following two characteristics of QUAs are important to
understand how they relate to other definitions of abstract data
types:
There is a strict separation between query and update opera-
tions: it is not possible for an operation to both update the data
and return information to the caller.
All updates are total functions. It is thus not possible for an
update to ’fail’; however, it is of course possible to define
updates to have no effect in the case some precondition is not
satisfied.
For instance, in our formalization, we would not allow a classic
stack abstract data type with a pop operation, for two reasons: (1)
pop both removes the top element of the stack and returns it, so it
is neither an update nor a query, and (2) pop is not total, i.e. it
cannot be applied to the empty stack.
This restriction is crucial to enable eventual consistency, where
the sequencing and application of updates may be delayed, and
updates may thus be applied to a different state than the one in
which they were originally issued by the program.
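As an illustration of this split, a stack could be made to fit the formalization by separating pop into a pure query and a total update (a sketch; the names top and pop_ are hypothetical):

```python
# Sketch: adapting a stack to the query/update separation.
# 'top' is a pure query; 'pop_' is a total update that is a no-op on the
# empty stack instead of failing.

def top(stack):
    """Query: returns the top element, or None for the empty stack."""
    return stack[-1] if stack else None

def pop_(stack):
    """Total update: returns the new state; a no-op on the empty stack."""
    return stack[:-1] if stack else stack

assert top([1, 2, 3]) == 3
assert pop_([1, 2, 3]) == [1, 2]
assert pop_([]) == []   # total: applying to the empty stack does not fail
```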
2.1 Clients and Transactions
Things become more interesting and challenging once we consider
a distributed system. We call the participants of our system clients.
Clients typically reside on physically distinct devices, but are not
required to do so. When clients in a distributed system issue queries
and updates against some shared QUA, we need to define what con-
sistency programmers can expect. This consistency model should
also address the semantics of transactions, which provide clients
with the ability to perform several updates as an atomic “bundle”.
We formally represent this scenario by defining a set C of
clients. Each client, at its own speed, issues a sequence of trans-
actions. We assume that each client runs some form of program (the
details of which we leave unspecified for simplicity and general-
ity). This program determines when to begin and end a transaction,
and what operations to perform in each transaction, which may de-
pend on various factors, such as the results returned by queries, or
external factors such as user inputs.
For uniformity, we require that all operations are part of a
transaction. This assumption comes at no loss of generality: a
device that does not care about transactions can simply issue each
operation in its own transaction.
Since all operations are inside transactions, we need not dis-
tinguish between the end of a transaction and the beginning of a
transaction. Formally, we can thus represent the activities on a de-
vice as a stream of operations (queries or updates) interrupted by
special yield operations that mark the transaction boundary.¹
We can thus fully describe the interaction between programs
executing on the clients and the database by the following three
types of operations:
1. Updates u ∈ U issued by the program,
2. Pairs (q, v) representing a query q ∈ Q issued by the program,
together with a response v ∈ V by the database system,
3. The yield operations issued by the program.
DEFINITION 3. A history H for a set C of clients and a query-update
interface (Q, V, U) is a map H which maps each client
c ∈ C to a finite or infinite sequence H(c) of operations from
the alphabet Σ = U ∪ (Q × V) ∪ {yield}.
Note that our history does not a priori include a global ordering
of events, since such an order is not always meaningful when
working with relaxed consistency models. Rather, the existence of
certain orderings, subject to certain conditions, is what determines
whether a history satisfies a consistency model or not.
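The way yield markers carve a client's operation sequence into transactions can be sketched directly (a minimal sketch with a hypothetical encoding: "yield" strings as markers, all other entries as events; a transaction is committed iff it is succeeded by a yield):

```python
# Sketch: splitting a client's sequence H(c) over U ∪ (Q × V) ∪ {yield}
# into its transactions. Each transaction is a maximal nonempty contiguous
# run of non-yield events; it is committed iff followed by a yield.

def transactions(seq):
    """Return a list of (events, committed) pairs for one client."""
    txns, current = [], []
    for op in seq:
        if op == "yield":
            if current:                        # ignore empty runs between yields
                txns.append((current, True))   # succeeded by yield: committed
                current = []
        else:
            current.append(op)
    if current:
        txns.append((current, False))          # no trailing yield: uncommitted
    return txns

h = ["u1", "u2", "yield", "q1", "yield", "u3"]
assert transactions(h) == [(["u1", "u2"], True),
                           (["q1"], True),
                           (["u3"], False)]
```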
¹ We call this operation yield() since it is semantically similar to a yield
we may encounter on a uniprocessor performing cooperative multitasking:
such a yield marks locations where other threads may read and modify the
current state of the data, while at all other locations, only the current thread
may read or modify the state.
2.1.1 Notation and Terminology
To reason about a history H, it is helpful to introduce the following
auxiliary terminology. We let E_H be the set of all events in H, by
which we mean all occurrences of operations in Σ \ {yield} in the
sequences H(c) (we consider yield to be just a marker within the
operation sequence, but not an event).

For a client c, we call a maximal nonempty contiguous subsequence
of events in H(c) that does not contain yield a transaction
of c. We call a transaction committed if it is succeeded by
a yield operation, and uncommitted otherwise. We let T_H be the
set of all transactions of all clients, and committed(T_H) ⊆ T_H
the subset of all committed transactions. For an event e, we let
trans(e) ∈ T_H be the transaction that contains e. Moreover, we let
committed(E_H) ⊆ E_H be the subset of events that are contained in
committed transactions. We conclude by giving definitions related
to ordering events and transactions:
Program order. For a given history H, we define a partial order
<_p over events in H such that e <_p e′ iff e appears before e′
in some sequence H(c).

Apply in order. For a history H, for a state s ∈ S, for a subset
of events E′ ⊆ E_H, and for a total order < over the events in
E′, we let apply(E′, <, s) be the state obtained by applying all
updates appearing in E′ to the state s, in the order specified by <.

Factoring. We define an equivalence relation ∼_t (meaning
same-transaction) over events such that e ∼_t e′ iff trans(e) =
trans(e′). For any partial order ≺ over events, we say that ≺
factors over ∼_t iff for any events x and y from different transactions,
x ≺ y implies x′ ≺ y′ for any x′, y′ such that x ∼_t x′
and y ∼_t y′. This is an important property to have for any
ordering ≺, since if ≺ factors over ∼_t, it induces a corresponding
partial order on the transactions.
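The apply operation can be sketched directly (a minimal sketch; we assume each event carries its update interpretation u# as a function, or None for query events, and that the total order < over E′ is given as a sort key):

```python
# Minimal sketch of apply(E', <, s): linearize the events of E' by the
# given total order and fold their updates over the state s. Hypothetical
# representation: each event is (key, update_fn), where update_fn is
# u# : S -> S or None for queries, and 'order_key' realizes the order <.

def apply_in_order(events, order_key, s):
    for _, update_fn in sorted(events, key=order_key):
        if update_fn is not None:       # query events contribute no effect
            s = update_fn(s)
    return s

# Example with RAM-style states (dicts): store(a,v)# = s[a -> v].
def store_fn(a, v):
    return lambda s: {**s, a: v}

events = [(2, store_fn(0, 7)), (1, store_fn(0, 5)), (3, None)]
final = apply_in_order(events, lambda e: e[0], {})
assert final == {0: 7}   # the store ordered later (key 2) wins
```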
2.2 Sequential Consistency
Sequential consistency posits that the observed behavior must be
consistent with an interleaving of the transactions by the various de-
vices. We formalize this interleaving as a partial order over events
(rather than a total order as more commonly used) since some
events are not instantly ordered by the system; for example, the
relative order of operations in uncommitted transactions may not
be fully determined yet.
DEFINITION 4. A history H is sequentially consistent if there exists
a partial order < over the events in E_H that satisfies the following
conditions for all events e₁, e₂, e ∈ E_H:

(compatible with program order) if e₁ <_p e₂ then e₁ < e₂.

(total order on past events) if e₁ < e and e₂ < e then either
e₁ < e₂ or e₂ < e₁.

(consistent query results) for all (q, v) ∈ E_H,
v = q#(apply({e ∈ E_H | e < q}, <, s₀)).
This simply says that a query returns the state as it results from
applying all past updates to the initial state.

(atomicity) < factors over ∼_t.

(isolation) if e₁ ∉ committed(E_H) and e₁ < e₂, then e₁ <_p e₂.
That is, events in uncommitted transactions precede only
events on the same client.

(eventual delivery) for all committed transactions t, there exist
only finitely many transactions t′ ∈ T_H such that t ≮ t′.
Sequential consistency fundamentally limits availability in the
presence of network partitions. The reason is that any query issued
by some transaction t must see the effect of all updates that occur in
transactions that are globally ordered before t, even if on a remote
device. Thus we cannot conclusively commit transactions in the
presence of network partitions.
2.3 Eventual Consistency
Eventual consistency relaxes sequential consistency by allowing
queries in a transaction t to see only a subset of all transactions that
are globally ordered before t. It does so by distinguishing between a
visibility order (a partial order that defines what updates are visible
to a query), and an arbitration order (a partial order that determines
the relative order of updates).
DEFINITION 5. A history H is eventually consistent if there exist
two partial orders <_v (the visibility order) and <_a (the arbitration
order) over events in H, such that the following conditions are
satisfied for all events e₁, e₂, e ∈ E_H:

(arbitration extends visibility) if e₁ <_v e₂ then e₁ <_a e₂.

(total order on past events) if e₁ <_v e and e₂ <_v e, then either
e₁ <_a e₂ or e₂ <_a e₁.

(compatible with program order) if e₁ <_p e₂ then e₁ <_v e₂.

(consistent query results) for all (q, v) ∈ E_H,
v = q#(apply({e ∈ E_H | e <_v q}, <_a, s₀)).
This says that a query returns the state as it results from applying
all preceding visible updates (as determined by the visibility
order) to the initial state, in the order given by the arbitration
order.

(atomicity) Both <_v and <_a factor over ∼_t.

(isolation) if e₁ ∉ committed(E_H) and e₁ <_v e₂, then e₁ <_p e₂.
That is, events in uncommitted transactions are visible only
to later events by the same client.

(eventual delivery) for all committed transactions t, there exist
only finitely many transactions t′ ∈ T_H such that t ≮_v t′.
The reason why eventual consistency can tolerate temporary
network partitions is that the arbitration order can be constructed
incrementally, i.e. may remain only partially determined for some
time after a transaction commits. This allows conflicting updates to
be committed even in the presence of network partitions.
Note that eventual consistency is a weaker consistency model
than sequential consistency. We can prove this statement as follows.
LEMMA 1. A sequentially consistent history is eventually consistent.

PROOF. Given a history H that is sequentially consistent, we know
there exists a partial order < satisfying all conditions. Now define
<_v = <_a = <; then all conditions for eventual consistency follow
easily.
2.4 Eventual Consistency In Related Work
Eventual consistency across the literature uses a variety of tech-
niques to propagate updates (e.g. general causally-ordered broad-
cast [17, 18], or pairwise anti-entropy [14]). All of these techniques
are particular implementations that specialize our general definition
of visibility as a partial order. As for the arbitration order, we found
that two main approaches prevail. The most common one is to use
(logical or actual) timestamps: Timestamps provide a simple way
to arbitrate events. Another approach (sometimes combined with
timestamps) is to make updates commutative, which makes arbitra-
tion unnecessary (i.e. we can pick an arbitrary serialization of the
visibility order to satisfy the conditions in Def. 5).
We show in the next section (Section 3) how to arbitrate updates
without using timestamps or requiring commutativity, a feature that
sets our work apart. We prefer to not use timestamps because they
exhibit the write stabilization problem [19], i.e. the inability to
finalize the effect of updates while older updates may still linger in
disconnected network partitions. Consider, for example, a mobile
user called Robinson performing an important update, but getting
stranded on a disconnected island before transmitting it. When
Robinson reconnects after years of exile, Robinson’s update is older
than (and may thus alter the effect of) all the updates committed by
other users in the meantime. So either (1) none of these updates can
stabilize until Robinson returns, or (2) after some timeout we give
up on Robinson and discard his update. Clearly, neither of these
solutions is satisfactory. A better solution is to abandon timestamps
and instead use an arbitration order that simply orders Robinson’s
update after all the other updates. In fact, this is the outcome we
achieve when using revision diagrams, as explained in Section 3.
3. Revision Consistency
Our definition of eventual consistency (Def. 5) is concise and gen-
eral. By itself, it is however not very constructive, insofar that it
does not give practical guidelines as to how a system can efficiently
and correctly construct the necessary ordering (visibility and arbi-
tration). We now proceed to describe a more specific implementa-
tion technique for eventually consistent systems, based on the no-
tion of revision diagrams introduced in [5].
Revision diagrams show an extended history not only of the
queries, updates, and transactions by each client, but also of the
forking and joining of revisions, which are logical replicas of the
state (Fig. 1(a)). A client works with one revision at a time, and
can perform operations (queries and updates) on it. Since differ-
ent clients work with different revisions, clients can perform both
queries and updates concurrently and in isolation (i.e. without cre-
ating race conditions). Reconciliation happens during join opera-
tions. When a revision joins another revision, it replays all the up-
dates performed in the joined revision at the join point.² After a
revision is joined, no more operations can be performed on it (i.e.
clients may need to fork new revisions to keep enough revisions
available).
3.1 Revision Diagrams
Revision diagrams are directed graphs constructed from three types
of edges (successor, fork, and join edges, or s-, f- and j-edges for
short), and five types of vertices (start, fork, join, update, and query
vertices). A start vertex represents the beginning of a revision, s-
edges represent successors within a revision, and fork/join edges
represent the forking and joining of revisions.
We pictorially represent revision diagrams using the following
conventions:
Use · for start, query, and update vertices
Use distinguished fork and join marks for fork and join vertices, respectively
Use vertical down-arrows for s-edges
Use horizontal-to-vertical curved arrows for f-edges
Use vertical-to-horizontal curved arrows for j-edges
A vertex x has an s-path (i.e. a path containing only s-edges) to
vertex y if and only if they are part of the same revision. Since all
s-edges are vertical in our pictures, vertices belonging to the same
revision are always aligned vertically. For any vertex x we let S(x)
be the start vertex of the revision that x belongs to. For any vertex
x whose start vertex S(x) is not the root, we define F(x) to be the
fork vertex such that F(x) →_f S(x) (i.e. the fork vertex that started
the revision x belongs to). We call a vertex with no outgoing s-
or j-edges a terminal; terminals are the last operations of revisions
that can still perform operations (have not been joined yet), and thus
represent potential extension points of the graph.

² This replay operation is conceptual. Rather than replaying a potentially
unbounded log, actual implementations can often use much more space-
and time-efficient merge functions, as explained in Section 4.

Figure 2. Visualization of the construction rules for revision diagrams in Def. 6.
We now give a formal, constructive definition for revision diagrams.

DEFINITION 6. A revision diagram is a directed graph constructed
by applying a (possibly empty or infinite) sequence of the following
construction steps (see Fig. 2) to a single initial start vertex (called
the root):

Query. Choose some terminal t, create a new query vertex x, and
add an edge t →_s x.

Update. Choose some terminal t, create a new update vertex x, and
add an edge t →_s x.

Fork. Choose some terminal t, create a new fork vertex x and a
new start vertex y, and add edges t →_s x and x →_f y.

Join. Choose two terminals t, t′ satisfying the join condition
F(t′) →* t, then create a new join vertex x and add edges
t →_s x and t′ →_j x.
The join condition expresses that the terminal t (the "joiner")
must be reachable from the fork vertex that started the revision that
contains t′ (the "joinee"). This condition makes revision diagrams
more restricted than general task graphs. See Fig. 1(b) for some
examples of invalid diagrams where the join condition does not
hold at construction of the join nodes.
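The four construction steps, including the join-condition check, can be sketched operationally (a minimal sketch with hypothetical names; the reachability test F(t′) →* t is computed by naive graph search rather than an efficient data structure, and vertex kinds beyond the edge structure are not tracked):

```python
# Sketch of the construction rules of Def. 6. Vertices are integers;
# edges carry their kind ('s', 'f', or 'j'). The root is vertex 0.

class RevisionDiagram:
    def __init__(self):
        self.edges = []            # (src, dst, kind)
        self.start_of = {0: 0}     # S(x): start vertex of x's revision
        self.fork_of = {}          # F: maps each non-root start vertex
        self.terminals = {0}       # vertices with no outgoing s- or j-edge
        self.next_id = 1

    def _new(self):
        v = self.next_id
        self.next_id += 1
        return v

    def _reachable(self, x, y):
        frontier, seen = [x], {x}
        while frontier:
            v = frontier.pop()
            if v == y:
                return True
            for a, b, _ in self.edges:
                if a == v and b not in seen:
                    seen.add(b)
                    frontier.append(b)
        return False

    def step(self, t):
        """Query/Update rule: extend terminal t with a new vertex."""
        assert t in self.terminals
        x = self._new()
        self.edges.append((t, x, 's'))
        self.start_of[x] = self.start_of[t]
        self.terminals.remove(t)
        self.terminals.add(x)
        return x

    def fork(self, t):
        """Fork rule: fork vertex x in t's revision, new start vertex y."""
        x = self.step(t)
        y = self._new()
        self.edges.append((x, y, 'f'))
        self.start_of[y] = y
        self.fork_of[y] = x
        self.terminals.add(y)
        return x, y

    def join(self, t, t2):
        """Join rule: check F(t') ->* t, then join t2 into t."""
        assert self.start_of[t2] in self.fork_of, "F(t') undefined"
        f = self.fork_of[self.start_of[t2]]
        assert self._reachable(f, t), "join condition F(t') ->* t violated"
        x = self.step(t)
        self.edges.append((t2, x, 'j'))
        self.terminals.remove(t2)      # the join ends the joinee
        return x
```

The invalid "parent join" diagram of Fig. 1(b) is rejected here because F(t′) is undefined for the root revision, and the "shortcut" diagram fails the reachability check.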
The join condition has some important, not immediately obvi-
ous consequences. For example, it implies that revision diagrams
are always semilattices (for a proof of this nontrivial fact see [5]).
Also, it ensures some diagram properties (Lemmas 2 and 3) that we
need to prove our main result (Thm. 1). Furthermore, it still allows
more general graphs than strict series-parallel graphs [20], which
allow only the recursive serial and parallel composition of tasks
(and are also called fork-join concurrency in some contexts, which
is potentially misleading). For instance, the right-most revision di-
agram in Fig. 1(a) is not a series-parallel graph but it is a valid revi-
sion diagram. While series-parallel graphs are easier to work with
than revision diagrams, they are not flexible enough for our pur-
pose, since they would enforce too much synchronization between
participants.
Also, note that the fork and the join are fundamentally asymmetric:
the revision that initiates the fork (the "forker") continues to exist
after the fork, but also starts a new revision (the "forkee"), and
similarly, the revision that initiates the join (the "joiner") can continue
to perform operations after the join, but ends the joined revision
(the "joinee").

Figure 1. (a) Four examples of revision diagrams (fork, nested, non-planar,
non-series-parallel). (b) Two diagrams (shortcut, parent join) that are not
revision diagrams since they violate the join condition at the creation of the
join node x. In the rightmost diagram, F(t′) is undefined on the main revision
and therefore F(t′) →* t does not hold.
3.2 Graph Properties
We now examine some properties of the revision diagrams, for
better visualization, and because we need some technical properties
in our later proofs. Most statements are easily proved by induction
over the construction rules in Def. 6; if not, we mention how to
prove them.
Revision diagrams are connected, and all vertices are reachable
from the root vertex. There can be multiple paths from the root to a
given vertex, but exactly one of those is free of j-edges.
DEFINITION 7. For any vertex v in a revision diagram, let the root-
path of v be the unique path from the root to v that does not contain
j-edges.
The join condition does not make revision diagrams necessarily
planar, i.e. when drawing revision diagrams, it is not always possi-
ble to avoid crossing lines (see the third diagram in Fig. 1(a) for an
example). However, it is always possible to choose horizontal coor-
dinates for the vertices such that (1) vertices in the same revisions
are vertically aligned, and (2) revisions are horizontally arranged
such that forkers are left of forkees, and (3) joiners are left of joi-
nees. The existence of such an order is not immediately obvious;
for example, such a layout is not possible for the incorrect revision
diagram at the right in Fig. 1(b). The following lemma formalizes
the claims (1,2,3) above (where the preorder ≼_l corresponds to a
relation on vertices that compares their horizontal coordinates):

LEMMA 2 (Layout Preorder). In any revision diagram, there exists
a preorder ≼_l on vertices³ such that

S(x) = S(y) ⟹ (x ≼_l y) ∨ (y ≼_l x)   (1)
x →_f y ⟹ x ≼_l y   (2)
x →_j y ⟹ y ≼_l x   (3)

³ A preorder is a reflexive transitive binary relation. Unlike partial orders,
preorders are not necessarily antisymmetric, i.e. they may contain cycles.
Figure 3. A labeled revision diagram: store(a, 1) on the main revision, a
forked revision performing store(a, 2) and store(b, 2), a join vertex labeled
with the composition store(a, 2); store(b, 2), and a final load(a). The
path-result of the bottom vertex is the query applied to its root-path:
load(a)#(store(b, 2)#(store(a, 2)#(store(a, 1)#(s₀)))) = 2.
We include the proof in the appendix. For proving our main result
later on, we need to establish another basic fact about revision
diagrams. We call a path direct if all of its f-edges (if any) appear
after all of its j-edges (if any). The following lemma (which appears
as a theorem in [5], and for which we include a proof in the
appendix as well) shows that we can always choose direct paths:

LEMMA 3 (Direct Paths). Let x, y be vertices in a revision diagram.
If x →* y, there exists a direct path from x to y.
3.3 Query and Update Semantics
We now proceed to explain how to determine the results of a query
in a revision diagram. The basic idea is to (1) return a result that is
consistent with applying all the updates along the root path, and (2)
if there are join vertices along that path, they summarize the effect
of all updates by the joined revision.
For example, consider the diagram in Fig. 3, a revision diagram
labeled with the operations of the random-access memory example
described in Example 2. The join vertex is labeled with the
composition of all update operations of the joinee. The path-result
of the final query node load(a) can now be evaluated by applying
the query to the composition of all update operations along the
root-path:

load(a)#(store(b, 2)#(store(a, 2)#(store(a, 1)#(s₀)))) = 2.
We can define this more formally. To reduce the verbosity of
our definitions, we assume a fixed query-update interface (Q, V, U)
and QUA (S, s₀) for the rest of this section.
DEFINITION 8. For any vertex x, we let the effect of x be a function
⟦x⟧ : S → S defined inductively as follows:

If x is a start, fork, or query vertex, the effect is a no-op, i.e.
⟦x⟧(s) = s.

If x is an update vertex for the update operation u, then the
effect is that update, i.e. ⟦x⟧(s) = u#(s).

If x is a join vertex, then the effect is the composition of all
effects in the joined revision, i.e. if y₁, …, y_n is the sequence
of vertices in the joined revision (i.e. y₁ is a start vertex,
y_i →_s y_{i+1} for all 1 ≤ i < n, and y_n →_j x), then
⟦x⟧(s) = ⟦y_n⟧(⟦y_{n−1}⟧(… ⟦y₁⟧(s))).
We can then define the expected query result as follows.

DEFINITION 9. Let x be a query vertex with query q, and let
(y₁, …, y_n, x) be the root-path of x. Then define the path-result
of x as q#(⟦y_n⟧(⟦y_{n−1}⟧(… ⟦y₁⟧(s₀)))).
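Definitions 8 and 9 suggest a direct recursive evaluation, sketched here for the random-access memory QUA (a sketch with a hypothetical vertex encoding; states are sparse dicts as in Example 2, and the effect of a join replays the joined revision):

```python
# Sketch of Defs. 8 and 9 for the RAM example. Each vertex is a dict with
# a 'kind' and, for updates/queries, an operation; join vertices carry the
# sequence of vertices of their joined revision (hypothetical encoding).

def effect(vertex, s):
    kind = vertex["kind"]
    if kind in ("start", "fork", "query"):
        return s                             # no-op
    if kind == "update":
        a, v = vertex["op"]                  # store(a, v)
        return {**s, a: v}
    if kind == "join":                       # compose the joinee's effects
        for y in vertex["joined"]:           # y1 (start vertex) ... yn
            s = effect(y, s)
        return s

def path_result(root_path, query_vertex, s0):
    """q#( [yn]([y_{n-1}](...[y1](s0))) ) for root-path (y1,...,yn, x)."""
    s = s0
    for y in root_path:
        s = effect(y, s)
    a = query_vertex["op"]                   # load(a)
    return s.get(a, 0)

# The example of Fig. 3 with a = 0 and b = 1: store(a,1); fork;
# the forked revision does store(a,2); store(b,2); join; load(a).
joined = [{"kind": "start"},
          {"kind": "update", "op": (0, 2)},  # store(a, 2)
          {"kind": "update", "op": (1, 2)}]  # store(b, 2)
path = [{"kind": "start"},
        {"kind": "update", "op": (0, 1)},    # store(a, 1)
        {"kind": "fork"},
        {"kind": "join", "joined": joined}]
assert path_result(path, {"kind": "query", "op": 0}, {}) == 2
```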
3.4 Revision Diagrams and Histories
We can naturally relate histories to revision diagrams by associating
each query event (q, v) ∈ E_H with a query vertex, and each update
event u ∈ E_H with an update vertex. The intention is to validate
the query results in the history using the path-results, and to keep
transactions atomic and isolated by ensuring that their events form
contiguous sequences within a revision.
DEFINITION 10. We call a revision diagram a witness for the
history H if it satisfies the following conditions:
1. For all query events (q, v) in E_H, the value v matches the path-
   result of the query vertex.
2. If x, y are two successive non-yield operations in H(c) for some
   c, then they must be connected by an s-edge.
3. If x is the last event of H(c) for some c and not a yield, then it
   must be a terminal.
4. If x, y are two operations preceding and succeeding some yield
   in H(c) for some c, then there must exist a path from x to y. In
   other words, the beginning of a transaction must be reachable
   from the end of the previous transaction.
We call a history H revision-consistent if there exists a witness
revision diagram.
To ensure eventual delivery of updates, we need to somehow
make sure there are enough forks and joins. To formulate a liveness
condition on infinite histories, we define “neglected vertices” as
follows:
DEFINITION 11. We call a vertex x in a revision diagram ne-
glected if there exists an infinite number of vertices y such that
there is no path from x to y.
We are now ready to state and prove our main result.
THEOREM 1. Let H be a history. If there exists a witness diagram
for H such that no committed events are neglected, then H is
eventually consistent.
Note that this theorem gives us a solid basis for implementing
eventually consistent transactions: an implementation can be based
on dynamically constructing a witness revision diagram and as a
consequence guarantee eventually consistent transactions. Moreover,
as we will see in Section 4, implementations do not need to actually
construct such witness diagrams at runtime but can rely on efficient
state-based implementations.
The proof of our Theorem (in Section 3.5 below) constructs
partial orders <_v, <_a from the revision diagram by (1) specifying
x <_v y iff there is a path from x to y in the revision diagram,
and (2) specifying <_a to order all events in a joined revision to
occur in between the joiner terminal and the join vertex. Note that
the converse of Thm. 1 is not true, not even if restricted to finite
histories (we include a finite counterexample in the appendix). Also
note that the most difficult part of the proof is the safety, not the
liveness, since the proof that <_a is a partial order extending <_v
depends on the join condition in a nontrivial way.
3.5 Proof of Thm 1
We devote the rest of this section to this proof, which requires some
deeper insight into structural properties of revision diagrams. First,
however, we need some definitions, notations, and lemmas.
A revision diagram is a connected graph. However, if we re-
move all f-edges from the picture, it may decompose into several
components. We define a join-component to be a maximal compo-
nent connected by s- and j-edges only. We say x ∼_j y if they are in
the same join-component, and let J(x) = {y | x ∼_j y}. It is easy
to see that each join-component contains exactly one terminal. For
a vertex x, we let T(x) be the terminal of J(x) (note that T(x) is
the unique terminal reachable from x by a path containing j- and s-
edges only).
DEFINITION 12. Define the binary relation →_a on vertices by
adding the following edges during the construction of a revision
diagram as in Def. 6:

- (Query, Update, Fork) for all y ∈ J(t), add y →_a x.
- (Join) for all y ∈ J(t) and y′ ∈ J(t′), add edges y →_a x,
  y′ →_a x, and y →_a y′.
LEMMA 4. For any revision diagram, →_a as defined above is a
partial order over all vertices in the diagram satisfying (1) when
restricted to any one join-component, →_a is a total order, and (2)
→_a does not cross join-components.

LEMMA 5. For vertices x, y in a revision diagram and a preorder
≤_l as guaranteed by Lemma 2, x ≼ y implies T(x) ≤_l T(y).
We include proofs for both lemmas in the appendix. The first
one is a simple induction, the second one is a bit more intricate
and uses the path properties guaranteed by Lemma 3 and the layout
preorder guaranteed by Lemma 2.
We are now ready to prove Theorem 1. Given a history H and a
witness revision diagram, define two binary relations

<_v = ≼   and   <_a = (<_v ∪ →_a)^+.

By Lemma 6 below, <_a and <_v are partial orders. We can then
prove the remaining claims as follows:
(arbitration extends visibility) By Lemma 6 below.
(total order on past events) If e1 <_v e and e2 <_v e, then by
Lemma 3 there exist direct paths for e1 ≼ e and for e2 ≼ e.
If either path is a prefix of the other, e1 and e2 are ordered by
<_v and thus by <_a. If not, they must combine in a join vertex,
implying that e1 ∼_j e2, which implies (by Lemma 4) that they
are ordered by <_a.
(compatible with program order) By conditions 2 and 4 of
Def. 10.

(consistent query results) We can show inductively (over Def. 6)
that for any vertex x, the combined effect of the vertices on the
root path to x (as in Def. 8) is equal to the combined effect of
all updates {x′ | x′ <_v x} ordered by <_a. This is trivial for all
but the join case. In the join case, Def. 12 orders all updates
in the joinee after the updates in the joiner, which is consistent with
interpreting them as an effect of the join vertex.

(atomicity) By condition 2 we know there can be no intervening
forks or joins. This implies that both ≼ and <_a factor over ∼_t.

(isolation) By condition 3.
(eventual delivery) Assume the condition is violated. Then there
exists a committed transaction t ∈ committed(T_H) and an
infinite number of transactions t1, t2, . . . such that for all i,
t ≮_v t_i. Since transactions cannot be empty, we can pick
vertices x ∈ t and x_i ∈ t_i, with x ≮_v x_i for all i. But that
implies that x is neglected, contradicting the condition in the
theorem.
The only thing left to prove is the lemma below, which ar-
guably contains the most interesting part of the proof. In particular,
it shows how consequences of the join condition (specifically, Lem-
mas 2 and 5) are used in the construction of an arbitration order <_a
that satisfies <_v ⊆ <_a, as required for eventual consistency.
LEMMA 6. Given some revision diagram, define binary relations
<_v = ≼ and <_a = (<_v ∪ →_a)^+. Then both <_v and <_a are
partial orders, and <_v ⊆ <_a.
PROOF. Clearly, <_v is a partial order (since revision diagrams are
acyclic) and <_v ⊆ <_a. The interesting part is to show that <_a is
antisymmetric (i.e. x <_a y and y <_a x implies x = y). We prove
this by showing that (≼ ∪ →_a) is acyclic. Consider some minimal
cycle. Since →_a is transitive, and both →_a and ≼ are acyclic on
their own, it must be of the following form (where n ≥ 1):

x1 ≼ y1 →_a x2 ≼ y2 →_a · · · →_a xn ≼ yn →_a x1

By Lemma 4 this implies

x1 ≼ y1 ∼_j x2 ≼ y2 ∼_j · · · ∼_j xn ≼ yn ∼_j x1

Using the preorder ≤_l guaranteed by Lemma 2 and Lemma 5, we get

T(x1) ≤_l T(y1) = T(x2) ≤_l T(y2) · · · T(xn) ≤_l T(yn) = T(x1)

But by Lemma 2 such an ≤_l-cycle implies that all vertices are in
the same revision, which is a contradiction.
4. System Implementation
Revision diagrams can help to develop efficient implementations
since they provide a solid abstraction that decouples the consis-
tency model from actual implementation choices. In this section,
we describe some implementation techniques that are likely to be
useful for that purpose. We present three sketches of client-server
systems that implement eventual consistency.
It is usually not necessary for implementations to store the
actual revision diagram. Rather, we found it highly convenient to
work with state representations that can directly provide fork and
join operations.
DEFINITION 13. A fork-join QUA (FJ-QUA) for a query-update
interface (Q, V, U) is a tuple (Σ, σ0, f, j) where (1) (Σ, σ0) is a
QUA over (Q, V, U), (2) f : Σ → Σ × Σ, and (3) j : Σ × Σ → Σ.
If we have a fork-join QUA, we can simply associate a Σ-state
with each revision, and then perform all queries and updates locally
on that state, without communicating with other revisions. The join
function of the FJ-QUA, if implemented correctly, guarantees that
all updates are applied at the join time. We can state this more
formally as follows.
DEFINITION 14. For a FJ-QUA (Σ, σ0, f, j) and a revision dia-
gram over the same interface (Q, V, U), define the state σ(x) of
each vertex x inductively by setting σ(r) = σ0 for the initial vertex
r, and (for the construction rules as they appear in Def. 6):

- (Query) Let σ(x) = σ(t).
- (Update) Let σ(x) = u#(σ(t)).
- (Fork) Let (σ(x), σ(y)) = f(σ(t)).
- (Join) Let σ(x) = j(σ(t), σ(t′)).
DEFINITION 15. A FJ-QUA (Σ, σ0, f, j) implements the QUA
(S, s0) over the same interface if and only if for all revision di-
agrams, for all vertices x, the locally computed state σ(x) (as in
Def. 14) matches the path-result (as in Def. 9).
EXAMPLE 3. Consider the QUA representing random access mem-
ory as defined in Example 2. We can implement this QUA using an
FJ-QUA that maintains a “write-set” as follows:

Σ = S × P(A)
σ0 = (s0, ∅)
load(a)#(s, W) = s(a)
store(a, v)#(s, W) = (s[a ↦ v], W ∪ {a})
f(s, W) = ((s, W), (s, ∅))
j((s1, W1), (s2, W2)) = (s′, W1 ∪ W2)

where

s′(a) = s1(a) if a ∉ W2
s′(a) = s2(a) if a ∈ W2

The write set (together with the current state) provides sufficient
information to conceptually replay all updates during join (since
only the last written value matters). Note that the write set gets
cleared on forks.
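Under the assumption that memory cells default to 0, the write-set FJ-QUA of Example 3 can be sketched in a few lines of executable code; all names below are illustrative, not part of the formal development:

```python
# Sketch of the write-set FJ-QUA of Example 3. A state is a pair
# (memory, writeset); memory is a dict with default value 0.

def fj_load(a, state):
    s, W = state
    return s.get(a, 0)

def fj_store(a, v, state):
    s, W = state
    return ({**s, a: v}, W | {a})

def fork(state):
    s, W = state
    return (s, W), (s, set())          # forkee starts with an empty write-set

def join(state1, state2):
    (s1, W1), (s2, W2) = state1, state2
    # joinee wins exactly on the cells it wrote (its write-set W2)
    merged = {**s1, **{a: s2.get(a, 0) for a in W2}}
    return (merged, W1 | W2)

# main revision forks a client, both write, client is joined back
main = ({}, set())
main = fj_store('a', 1, main)
main, client = fork(main)
main = fj_store('b', 7, main)
client = fj_store('a', 5, client)
main = join(main, client)
print(fj_load('a', main), fj_load('b', main))
```

Replaying the example by hand: the client's store to `a` overrides the main revision's earlier write at join time, while the main revision's write to `b` survives, so the final state reads a = 5 and b = 7.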
Since we can store a log of updates inside Σ, it is always possi-
ble to provide an FJ-QUA for any QUA (we show this construction
in detail in Section B.4 in the appendix). However, more space-
effective implementations are often possible for QUAs since logs
are typically compressible. We include several finite-state exam-
ples of FJ-QUAs in the appendix (Section C) as well.
4.1 System Models
If we have a FJ-QUA, we can implement eventually consistent
systems quite easily. We now present two models that demonstrate
this principle.
4.2 Single Synchronous Server Model
We first present a model using a single server. We define the set of
devices I = C ∪ {s}, where C is the set of clients and s is the single
server. We store on each device i a state from the FJ-QUA, that is,
we define R : I ⇀ Σ. To keep the transition rules simple, we use
the notation R[i ↦ σ] to denote the map R modified by mapping
i to σ, and we let R(c ↦ σ) be a pattern that matches R, c, and σ
such that R(c) = σ. Each client can perform updates and queries
while reading and writing only the local state:

UPDATE(c, u):
  σ′ = u#(σ)
  R(c ↦ σ) → R[c ↦ σ′]

QUERY(c, q, v):
  q#(σ) = v
  R(c ↦ σ) → R
As for synchronization, all we need is two rules, one to create a
new client (forking the server state), and one to perform the yield
on the client (joining the client state into the server, then forking a
fresh client state from the server):
SPAWN(c):
  c ∉ dom R    f(σ) = (σ1, σ2)
  R(s ↦ σ) → R[s ↦ σ1][c ↦ σ2]

YIELD(c):
  j(σ1, σ2) = σ3    f(σ3) = (σ4, σ5)
  R(s ↦ σ1)(c ↦ σ2) → R[s ↦ σ4][c ↦ σ5]
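To make the rules concrete, here is a minimal executable sketch of the single-server model (ours, not the paper's), instantiated with a grow-only counter FJ-QUA in which a state (base, buf) separates already-joined increments from locally buffered ones; all names are illustrative:

```python
# Sketch: single-server model over a grow-only counter FJ-QUA.
# Sigma = (base, buf): base counts joined increments, buf local ones.

def fork(sigma):                       # f : Sigma -> Sigma x Sigma
    base, buf = sigma
    merged = base + buf
    return (merged, 0), (merged, 0)    # both sides restart with empty buffer

def join(s_server, s_client):          # j : Sigma x Sigma -> Sigma
    base_s, buf_s = s_server
    _base_c, buf_c = s_client          # client's base is already known to server
    return (base_s, buf_s + buf_c)

R = {'server': (0, 0)}                 # the map R : I -> Sigma

def spawn(c):                          # SPAWN(c): fork the server state
    R['server'], R[c] = fork(R['server'])

def update_inc(c):                     # UPDATE(c, inc)
    base, buf = R[c]
    R[c] = (base, buf + 1)

def query_get(c):                      # QUERY(c, get, v)
    base, buf = R[c]
    return base + buf

def yield_(c):                         # YIELD(c): join into server, re-fork
    s3 = join(R['server'], R[c])
    R['server'], R[c] = fork(s3)

spawn('c1'); spawn('c2')
update_inc('c1'); update_inc('c1'); update_inc('c2')
yield_('c1')
yield_('c2'); yield_('c2')             # second yield picks up c1's increments
print(query_get('c2'))                 # 3: all three increments are visible
```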
Thanks to Theorem 1, we can precisely argue why this system
is eventually consistent. By induction over the transitions, we can
show that each state σ appearing in R corresponds to a terminal
in the revision diagram, and each transition rule manipulates those
terminals (applying fork, join, update or query) in accordance with
the revision diagram construction rules. In particular, the join con-
dition is always satisfied since all forks and joins are performed by
the same server revision. Transactions are not interrupted by forks
or joins, and no vertices are neglected: each yield creates a path
from the freshly committed vertices into the server revision, from
where they are visible to any new clients, and to any client that
performs an infinite number of yields.
An interesting observation is that, if the fork does not modify
the left component (i.e. for all σ ∈ Σ, f(σ) = (σ, σ′) for some
σ′), the server is effectively stateless, in the sense that it does not
store any information about the client. This is a highly desirable
characteristic for scalability, and in our experience it is well worth
going to some extra lengths to define FJ-QUAs that have this
property.
4.3 Server Pool Model
The single server model still suffers some drawbacks. For one,
clients performing a yield access both server and client state. This
means clients block if they have no connection. Also, a single
server may not scale to large numbers of clients.
We can fix both of these issues by using a server pool rather than
a single server, i.e. we let the set of devices be I = C ∪ S where S is
a set of server identifiers. Using multiple servers not only improves
scalability, but it helps with disconnected operation as well: if we
keep one server next to each client (e.g. on the same mobile device),
we can guarantee that the client does not block on yield. Servers
themselves can perform a sync operation (at any convenient time)
to exchange state with other servers.
However, we need to keep additional information on each device
to ensure that the join condition is maintained. We do so by (1)
storing on each client c a pair (σ, n), where σ is the revision state
as before and n is a counter indicating the current transaction, and
(2) storing on each server s a triple (σ, J, L), where σ is the revision
state as before, J is the set of servers that s may join, and L is a
vector clock (a partial function I ⇀ N) indicating for each client c
the latest transaction of c that s may join.
The transitions that involve the client are then as shown in
Fig. 4. The servers can perform forks and joins without involving
clients. On joins, servers join the state, take the union of the sets J
of joinable servers, and merge the vector clocks (defined as taking
the pointwise maximum).
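The vector-clock merge used here is just a pointwise maximum; as a quick illustrative sketch (representing a vector clock as a dictionary from client identifiers to counters):

```python
# Sketch: merge of two vector clocks (partial maps from client id to
# counter) as the pointwise maximum, as in the server-pool JOIN rule.
# A client absent from a clock is treated as having counter 0.

def merge(L1, L2):
    return {c: max(L1.get(c, 0), L2.get(c, 0)) for c in L1.keys() | L2.keys()}

merged = merge({'c1': 3, 'c2': 1}, {'c2': 4, 'c3': 2})
```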
Again, we can use Theorem 1 to reason that finite executions
of this system are eventually consistent (for infinite executions we
need additional fairness guarantees as discussed below). Again, all
states σ stored in R correspond to terminals in a revision diagram
and are manipulated according to the rules. This time, the join
condition is satisfied because of the following invariants: (1) if the
set J of server s1 contains s2, then s1's terminal is reachable from
the fork vertex that forked s2's revision, and (2) if L(c) = n for
server s, and client c's transaction counter is n, then s's terminal is
reachable from the fork vertex that forked c's revision.
Since the transition rules do not contain any guarantees that
force servers to synchronize with each other, it is possible to con-
struct infinite executions that violate eventual consistency. Actual
implementations would thus likely add a mechanism to guarantee
that updates eventually reach the main revision, and that clients that
perform an infinite sequence of transactions receive versions from
the main revision infinitely often.
5. Related Work
For a high-level comparison of our work with various notions of
eventual consistency appearing in the literature, see Section 2.4.
Briefly stated, our work is set apart by its unique use of revision
diagrams to determine both arbitration and visibility, rather than
separately using a causally consistent partial order for visibility,
and timestamps for arbitration.
There is of course a large body of work on transactions. Most
academic work considers strong consistency (serializable transac-
tions) only, and is thus not directly applicable to eventual consis-
tency. Nevertheless there are some similarities, to pick a few:
[9] provides insight on the limitations of serializable transac-
tions, and proposes similar workarounds as used by eventual
consistency (timestamps and commutative updates). However,
transactions remain tentative during disconnection.
Snapshot isolation [7] relaxes the consistency model, but trans-
actions can still fail, and can not commit in the presence of net-
work partitions.
Coarse-grained transactions [10, 13] share with our work the
use of abstract data types to facilitate concurrent transactions.
Automatic Mutual Exclusion [1], like our work, uses yield state-
ments to separate transactions.
Previous work on revisions [2, 5, 3, 4] introduces revision di-
agrams and conflict resolution. In this paper we feature a simpler,
more direct definition using graph construction rules. Also, we pur-
sue a different goal (eventually consistent transactions in a dis-
tributed system, rather than deterministic parallel programming).
In particular, eventually consistent transactions exhibit pervasive
nondeterminism caused by factors that are by definition outside the
control of the system, such as network partitions. Also, this paper
is the first to give a single, simple formalization of merge functions
(FJ-QUAs are optimized implementations of QUAs).
Research on persistent data types [12] is related to our defi-
nition of FJ-QUAs insofar it concerns itself with efficient imple-
mentations of data types that permit retrieval and mutations of past
versions. However, it does not concern itself with aspects related to
transactions or distribution.
Prior work on operational transformations [18] can be under-
stood as a specialized form of eventual consistency where updates
are applied to different replicas in different orders, but are them-
selves modified in such a way as to guarantee convergence. This
specialized formulation can provide highly efficient broadcast-
based real-time collaboration, but poses significant implementation
challenges [11].
If we consider transactions with single elements only, it is sen-
sible to compare our work with related work on conflict-free repli-
cated data types (CRDTs) [17] and Bayou’s weakly consistent
replication [19].
Our definition is strictly more general than CRDTs [17] in the
following sense: From any state-based CRDT we can obtain
a FJ-QUA by using the same state and initial state, the same
UPDATE(c, u):
  σ′ = u#(σ)
  R(c ↦ (σ, n)) → R[c ↦ (σ′, n)]

QUERY(c, q, v):
  q#(σ) = v
  R(c ↦ (σ, n)) → R

SPAWN(c):
  c ∉ dom R    f(σ) = (σ1, σ2)    L′ = L[c ↦ 0]
  R(s ↦ (σ, J, L)) → R[s ↦ (σ1, J, L′)][c ↦ (σ2, 0)]

YIELD(s, c):
  L(c) = n    L′ = L[c ↦ n + 1]    j(σ1, σ2) = σ3    f(σ3) = (σ4, σ5)
  R(s ↦ (σ1, J, L))(c ↦ (σ2, n)) → R[s ↦ (σ4, J, L′)][c ↦ (σ5, n + 1)]

FORK(s1, s2):
  s2 ∉ dom R    f(σ) = (σ1, σ2)    J′ = J ∪ {s2}
  R(s1 ↦ (σ, J, L)) → R[s1 ↦ (σ1, J′, L)][s2 ↦ (σ2, J, L)]

JOIN(s1, s2):
  s2 ∈ J1    σ′ = j(σ1, σ2)    J′ = J1 ∪ J2    L′ = merge(L1, L2)
  R(s1 ↦ (σ1, J1, L1))(s2 ↦ (σ2, J2, L2)) → R[s1 ↦ (σ′, J′, L′)][s2 ↦ ⊥]

Figure 4. The server pool model.
query and update functions, a fork function that creates a new
replica and then merges the forker state, and a join function
that uses the merge. Note that the definition of strong eventual
consistency in [17], just like ours, requires that updates can be
applied to any state.
In Bayou [19], and in the Concurrent Revisions work [5], users
can specify how to resolve conflicting updates by writing custom
merge functions. At first sight, this may appear more general
than QUAs. However, by performing a simple automatic
transformation of the QUA and the client program, we can sup-
port merge functions for conflict resolution purposes. The rea-
son is that QUAs already allow updates to perform any desired
total function. We describe this transformation in Section B.2
in the appendix.
6. Conclusion and Future Work
We have proposed eventually consistent transactions as a consis-
tency model that (1) generalizes earlier definitions of eventual con-
sistency and (2) shows how to make some strong guarantees (trans-
actions never fail, all code runs in transactions) to compensate for
weak consistency. We have shown that revision diagrams provide a
convenient way to build correct implementations of eventual con-
sistency, by relying on just a handful of simple rules that are easily
visualized using diagrams.
In future work, we would like to extend the study of the pro-
gramming model, investigate a selection of basic FJ-QUAs, and
ways to combine them. Furthermore, we would like to understand
whether stronger consistency guarantees are possible for subclasses
of eventually consistent transactions, and whether such classes can
be automatically recognized or synthesized.
Acknowledgments. We thank Marc Shapiro for introducing us to
a principled world of eventual consistency, and for general guid-
ance. We also thank Tom Ball, Sean McDirmid and Benjamin
Wood for inspired discussions, helpful examples, and constructive
comments.
References
[1] M. Abadi, A. Birrell, T. Harris, and M. Isard. Semantics of
transactional memory and automatic mutual exclusion. In Principles
of Programming Languages (POPL), 2008.
[2] S. Burckhardt, A. Baldassin, and D. Leijen. Concurrent programming
with revisions and isolation types. In Object-Oriented Programming,
Systems, Languages, and Applications (OOPSLA), 2010.
[3] S. Burckhardt, D. Leijen, and M. Fähndrich. Roll forward, not
back: A case for deterministic conflict resolution. In Workshop on
Determinism and Correctness in Parallel Progr., 2011.
[4] S. Burckhardt, D. Leijen, J. Yi, C. Sadowski, and T. Ball. Two for
the price of one: A model for parallel and incremental computation
(distinguished paper award). In Object-Oriented Programming,
Systems, Languages, and Applications (OOPSLA), 2011.
[5] S. Burckhardt and D. Leijen. Semantics of concurrent
revisions. In European Symposium on Programming (ESOP),
LNCS, volume 6602, pages 116–135, 2011. Full version as Microsoft
Technical Report MSR-TR-2010-94.
[6] G. Decandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman,
A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo:
amazon’s highly available key-value store. In Symposium on
Operating Systems Principles, pages 205–220, 2007.
[7] A. Fekete, D. Liarokapis, E. O’Neil, P. O’Neil, and D. Shasha.
Making snapshot isolation serializable. ACM Trans. Database Syst.,
30(2):492–528, 2005.
[8] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of
consistent, available, partition-tolerant web services. SIGACT News,
33:51–59, June 2002.
[9] J. Gray, P. Helland, P. O’Neil, and D. Shasha. The dangers of
replication and a solution. Sigmod Record, 25:173–182, 1996.
[10] M. Herlihy and E. Koskinen. Transactional boosting: a methodology
for highly-concurrent transactional objects. In Principles and
Practice of Parallel Programming (PPoPP), 2008.
[11] A. Imine, M. Rusinowitch, G. Oster, and P. Molli. Formal design
and verification of operational transformation algorithms for copies
convergence. Theoretical Computer Science, 351:167–183, 2006.
[12] H. Kaplan. Persistent data structures. In Handbook on Data Structures
and Applications, pages 241–246. CRC Press, 1995.
[13] E. Koskinen, M. Parkinson, and M. Herlihy. Coarse-grained
transactions. In Principles of Programming Languages (POPL),
2010.
[14] K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and A. Demers.
Flexible update propagation for weakly consistent replication.
Operating Systems Review, 31:288–301, 1997.
[15] Y. Saito and M. Shapiro. Optimistic replication. ACM Computing
Surveys, 37:42–81, 2005.
[16] M. Shapiro and B. Kemme. Eventual consistency. In Encyclopedia
of Database Systems, pages 1071–1072. 2009.
[17] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. Conflict-free
replicated data types. 2011.
[18] C. Sun and C. Ellis. Operational transformation in real-time group
editors: issues, algorithms, and achievements. In Conference on
Computer Supported Cooperative Work, pages 59–68, 1998.
[19] D. Terry, M. Theimer, K. Petersen, A. Demers, M. Spreitzer, and
C. Hauser. Managing update conflicts in bayou, a weakly connected
replicated storage system. SIGOPS Oper. Syst. Rev., 29:172–182,
December 1995.
[20] J. Valdes, R. Tarjan, and E. Lawler. The recognition of series parallel
digraphs. In ACM Symposium on Theory of Computing, pages 1–12,
1979.
A. Proofs
A.1 Proof of Lemma 2
PROOF. For a vertex x, we define the pedigree ped(x) to be the
word in {s, f}* obtained by (1) taking the root path of x and
concatenating the edge labels along the path into a word, and (2)
removing any trailing s letters from the word. Clearly, two vertices
are in the same revision iff they have the same pedigree.

Now we define x ≤_l y iff ped(x) ≤ ped(y) in the lexico-
graphic order on the words {s, f}*, where we order the letters as
s < f.⁴

Then it is straightforward to show that ≤_l is transitive. Claim
(1) follows from the observation above (revision determined by
pedigree). Claim (2) follows since adding an f at the end of a word
increases it in the lexicographic order. For claim (3), consider the
construction rule for the join: the join condition tells us that the
root path of the joiner terminal and the root path of the joinee
terminal must have a common prefix from which they diverge in
such a way that the joiner terminal's path continues with s and the
joinee terminal's path continues with f, which implies the desired
order after we add the join edge.
A.2 Proof of Lemma 3
PROOF. For a path p, let w(p) ∈ {f, s, j}* be the word representing
the sequence of edge labels, and let < be the lexicographic order
on words induced by the total order s < j < f on labels. Now
assume the claim is false and choose a path p = (v0, v1, . . . , vn)
from x = v0 to y = vn such that w(p) is minimal with respect to
<. Since the claim is false there must exist i < j such that

x = v0 ≼ vi →f vi+1 →s · · · →s vj →j vj+1 ≼ vn = y.

Now, consider the construction step that added vj+1 to the graph
(which is a join vertex). The join condition implies that there exists
an alternate path from vi to vj+1 that starts with an s-edge rather
than an f-edge, which contradicts the assumption that p is minimal
with respect to <.
A.3 Proof of Lemma 5
PROOF. Since y has a path to T(y), and x has a path to y, x has a
path to T(y); thus by Lemma 3 there must exist a direct path (i.e.
all j-edges appear before all f-edges) from x to T(y). If this path
does not contain fork edges, then x ∼_j y and the claim follows
because T(x) = T(y). Otherwise, let z be the vertex originating
the first f-edge on the path. Then (a) x ∼_j z, and (b) z has a
path to T(y) containing only s- and f-edges. From (a) we know
z has a path to T(x) containing only s- and j-edges, which implies
T(x) ≤_l z (using conditions (1) and (3) of Lemma 2). From (b) we
know z ≤_l T(y) (using conditions (1) and (2) of Lemma 2). Thus
T(x) ≤_l T(y), q.e.d.
B. Additional Material
B.1 Notions of Equivalence
Naturally, we may ask questions about how different QUAs with
the same interface can be compared, and whether two implementa-
tions are equivalent.
DEFINITION 16. Two QUAs (S_a, s_a) and (S_b, s_b) for the same
interface (Q, V, U) are isomorphic if there exists a bijection ρ :
S_a → S_b such that ρ(s_a) = s_b, and such that for all u ∈ U,
q ∈ Q, and s ∈ S_a we have q#a(s) = q#b(ρ(s)) and
ρ(u#a(s)) = u#b(ρ(s)).
⁴ To be precise: we define the lexicographic order over words by stating
w1 < w2 if either w1 is a prefix of w2, or there exists an index k such that
w1[i] = w2[i] for all i < k and w1[k] < w2[k].
Isomorphism is a very strong notion of equivalence. In fact, it is
often not necessary to require the existence of a bijection between
S
a
and S
b
. The reason is that we consider the state of a database
(its “internal organization”) to be only indirectly observable, by
queries. Thus, some aspects of the state may be irrelevant and can
be abstracted away. The following definition clarifies this intuition.
DEFINITION 17. Two QUAs (S_a, s_a) and (S_b, s_b) for the same
interface (Q, V, U) are observationally equivalent if for all n ≥ 0,
for all u1, . . . , un ∈ U, and for all q ∈ Q it is always the case
that q#a(un#a(· · · (u1#a(s_a)))) = q#b(un#b(· · · (u1#b(s_b)))).
This definition has a very practical implication: it means that
many different implementations of a QUA can satisfy the same re-
quirements. Even more interestingly, if the set of queries is limited,
it may be possible to dramatically reduce the space requirements
for storing the state.
B.2 Supporting custom conflict resolution in QUAs
Suppose we start with some QUA and would like to add cus-
tomized conflict resolution. Given some user-defined conflict res-
olution function f : U × S × S → S that computes the state
f(u, s, s′) that should result from update u being issued in state s
but applied in a possibly different state s′, we can construct a QUA
that uses f by (1) adding a query q_snapshot to the set Q of queries,
that returns the current state, q#_snapshot(s) = s, and (2) adding for
each state s and update u an update u_s to the set U of updates,
defined by u_s#(s′) = f(u, s, s′). Now, if the user issues updates
of the form u_s that include a snapshot s of the current state (ob-
tained by calling q_snapshot), the conflict resolution function is used
as desired.
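The transformation above can be sketched in executable form; all names below (`make_update`, `resolve`) and the example resolver are our own illustrations, not part of the formal development:

```python
# Sketch of the B.2 transformation: every issued update carries a
# snapshot of the state it was issued in, and a user-supplied resolver
# decides how to apply it in a possibly different current state.

def make_update(u, snapshot, resolve):
    # resolve(u, issued_in, applied_in) plays the role of f(u, s, s')
    return lambda current: resolve(u, snapshot, current)

# example resolver: apply a write only if nobody else wrote the key
# since the snapshot was taken (otherwise the concurrent write wins)
def resolve(u, issued_in, applied_in):
    key, value = u
    if applied_in.get(key) == issued_in.get(key):
        return {**applied_in, key: value}
    return applied_in

state = {'x': 0, 'y': 0}
upd = make_update(('x', 1), dict(state), resolve)  # issued in {'x':0,'y':0}
state = {**state, 'x': 5}                          # concurrent conflicting write
state = upd(state)                                 # conflict detected: x stays 5
state = make_update(('y', 9), dict(state), resolve)(state)  # no conflict: y = 9
```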
B.3 Counterexample to the converse of Thm 1
Consider a history containing 7 operations (all by different clients,
and all succeeded by a yield), ordered by the following visibility
order <_v (writing e → e′ for e <_v e′, and taking the transitive
closure):

x → u, y → u, x → v, z → v, y → w, z → w, u → s, v → s, w → s

Define now the arbitration order as x <_a y <_a z <_a u <_a v <_a
w <_a s. Then all the conditions for eventual consistency are
satisfied (we need not bother determining exactly what updates and
queries to use). However, these partial orders cannot result from any
revision diagram, since we cannot place the events on a revision
diagram without creating additional <_v edges.
To prove this, we need a few more structural graph properties.
First, for any set of nodes x1, . . . , xn in the same join-component
(i.e. nodes for which T(x1) = · · · = T(xn)) we define the
“join point” J(x1, . . . , xn) to be the vertex that was added during
the construction step which first caused the equality T(x1) =
T(x2) = · · · = T(xn) to hold (this will either be an update, query,
or join step).
LEMMA 7. If x1, . . . , xn are nodes in a revision diagram, and
x_i <_v z for all i, then J(x1, . . . , xn) <_v z.
PROOF. Suppose the lemma is not true, i.e. we have x_i with
x_i <_v z but not J(x1, . . . , xn) <_v z (i.e. either J(x1, . . . , xn)
does not exist, or does not have a path to z). There must exist di-
rect paths p_i from x_i to z by Lemma 3. Without loss of generality,
we can assume we picked the x_i and p_i in such a way as to mini-
mize the sum of the lengths of the p_i. Consider the vertex on which
all of these paths merge, let's call it a. One of the incoming edges
of a must be a j-edge. Let's say this edge is on the path p_t com-
ing from x_t. Since the segment of p_t from x_t to a ends with a j-
edge, it can contain only s- and j-edges (because it is direct), thus
T(x_t) = T(a). But this means we could have picked a instead of
x_t (clearly a <_v z, but not J(x1, . . . , xt−1, a, xt+1, . . . , xn) <_v z:
if J(x1, . . . , xt−1, a, xt+1, . . . , xn) exists, so does J(x1, . . . , xn),
and J(x1, . . . , xn) <_v J(x1, . . . , xt−1, a, xt+1, . . . , xn), contra-
dicting the assumption), contradicting minimality.
This lemma implies that J(u, v, w) <_v s; in particular, J(u, v, w)
must exist. Since join-components are (upside-down) trees whose
nodes have arity one or two, the following must hold without loss of
generality (for some permutation of x, y, z):

J(x, y) <_v J(x, y, z) and J(y, z) = J(x, z) = J(x, y, z)

But by the above lemma, J(y, z) <_v w, thus J(x, y, z) <_v w,
thus x <_v w, which contradicts the definition of <_v for this
example.
B.4 The standard FJ-QUA construction
THEOREM 2. For any QUA (S, s_0) over some interface (Q, V, U),
there exists a FJ-QUA (Σ, σ_0, f, j) that implements it.

PROOF. By construction. Let the state Σ = S × U* be a pair of
the QUA state and a log of operations. Define

  σ_0 = (s_0, ε)                                (4)
  u^#r (s, l) = (s, l.u)                        (5)
  q^#r (s, l) = q^# (apply l s)                 (6)
  f^r (s, l) = ((s, l), (apply l s, ε))         (7)
  j^r (s_1, l_1) (s_2, l_2) = (s_1, l_1.l_2)    (8)

where

  apply ε s = s                                 (10)
  apply l.u s = u^# (apply l s)                 (11)

Let ρ : Σ → S be the map (s, l) ↦ apply l s. The labeling of the original QUA
corresponds exactly to the labeling of the FJ-QUA mapped by ρ
on the same revision diagram. Furthermore, q^# (ρ σ) = q^#r σ by
definition.
We call the FJ-QUA constructed in the above proof the
reference implementation of the given QUA.
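To make the construction concrete, here is a small Python sketch of the reference implementation, instantiated for a counter QUA. The names apply_log, update, query, fork, and join are ours, chosen to mirror apply, u^#r, q^#r, f^r, and j^r above; this is an illustrative transcription, not code from the paper.

```python
# Reference FJ-QUA (Theorem 2): the state pairs the underlying QUA state
# with a log of pending updates; queries replay the log, forks snapshot
# the current value, and joins concatenate the joinee's log.

updates = {"inc": lambda n: n + 1, "reset": lambda n: 0}  # counter QUA

def apply_log(log, s):
    # apply l s: replay the logged updates on QUA state s
    for u in log:
        s = updates[u](s)
    return s

def update(sigma, u):       # u^#r (s, l) = (s, l.u)
    s, log = sigma
    return (s, log + [u])

def query(sigma):           # q^#r (s, l) = q^# (apply l s); the query is get
    s, log = sigma
    return apply_log(log, s)

def fork(sigma):            # f^r (s, l) = ((s, l), (apply l s, ε))
    s, log = sigma
    return (s, log), (apply_log(log, s), [])

def join(sigma1, sigma2):   # j^r (s1, l1)(s2, l2) = (s1, l1.l2)
    (s1, l1), (s2, l2) = sigma1, sigma2
    return (s1, l1 + l2)

main = (0, [])
main, branch = fork(main)
main = update(main, "inc")
branch = update(branch, "inc")
branch = update(branch, "inc")
main = join(main, branch)
print(query(main))  # 3: the join replays the branch's increments after main's
```

Note how the log grows without bound; this is exactly what the optimized implementations in Appendix C avoid.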
DEFINITION 18. A FJ-QUA B = (Σ_b, σ_b, f_b, j_b) implements
another FJ-QUA A = (Σ_a, σ_a, f_a, j_a) over the same interface
(Q, V, U) if and only if 1) there exists a function comp : Σ_a → Σ_b,
such that the labeling of a revision diagram computed by A can be
transformed into the labeling computed by B on the same diagram
by applying comp to each vertex, and 2) for each query q and each
state σ ∈ Σ_a, the query is answered the same by both FJ-QUAs:
q^#a σ = q^#b (comp σ).
Definition 18 together with Theorem 2 provide us with an alternative
formulation of Definition 15 by saying that a revision algebra
B implements a QUA if B implements the reference implementation
of the QUA.
C. Bounded FJ-QUA Implementations
This section contains a few examples of QUAs along with optimized
FJ-QUAs. For each case, we show that the revision algebra
implements the QUA using bounded space.
C.1 Counter with Reset
A counter with reset consists of a single integer along with two
update operations: inc, incrementing the counter, and reset, setting
the counter to 0. The query is get, which simply retrieves the counter
value. The QUA implementation is given by

  Σ = int
  σ_0 = 0
  inc^# n = n + 1
  reset^# n = 0
  get^# n = n
C.1.1 Counter FJ-QUA
An effective implementation of a revision consistent system requires
finding a more optimized representation of the state and corresponding
updates/queries in order to avoid having to compute the
sequence of operations of a forked branch at a join point and applying
them one-by-one to the target revision. Instead, the goal of
the optimized representation is to summarize the effect of a branch
compactly, so that it can be applied at a join point in time bounded
by the amount of observable information, rather than the number of
updates on the branch.
For the counter example, our FJ-QUA (Σ_b, σ_b, f_b, j_b) consists
of

  Σ_b = bool × int × int
  σ_b = (false, 0, 0)
  inc^#b (r, b, d) = (r, b, d + 1)
  reset^#b (r, b, d) = (true, 0, 0)
  get^#b (r, b, d) = b + d
  f_b (r, b, d) = ((r, b, d), (false, b + d, 0))
  j_b (r_1, b_1, d_1) (r_2, b_2, d_2) = (true, b_2, d_2)         if r_2 = true
                                        (r_1, b_1, d_1 + d_2)    otherwise
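The definitions above can be transcribed directly into a short executable sketch (Python, with our own naming): the triple (r, b, d) records whether a reset happened on the branch, the base value at the last fork or reset, and the increment delta since then.

```python
# Optimized counter FJ-QUA: constant-size state (r, b, d) instead of a log.

def inc(state):
    r, b, d = state
    return (r, b, d + 1)

def reset(state):
    return (True, 0, 0)            # reset: flag the branch, clear base and delta

def get(state):
    r, b, d = state
    return b + d

def fork(state):
    r, b, d = state
    return (r, b, d), (False, b + d, 0)   # branch starts from the current value

def join(s1, s2):
    (r1, b1, d1), (r2, b2, d2) = s1, s2
    if r2:                                # the branch reset: its state wins
        return (True, b2, d2)
    return (r1, b1, d1 + d2)              # otherwise add the branch's delta

main = (False, 0, 0)
main, branch = fork(main)
main = inc(main)                  # concurrent increments on both revisions
branch = inc(inc(branch))
main = join(main, branch)
print(get(main))                  # 3: both deltas survive the join
```

The join summarizes an arbitrarily long branch in constant time, which is the bounded-space claim of Theorem 3 below.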
THEOREM 3. The counter FJ-QUA implements the counter QUA
in constant space.
PROOF. The space bound is clear given Σ_b. We show that the FJ-QUA
implements the reference implementation (Σ^r, σ^r, f^r, j^r) of
the counter QUA using the function comp : Σ^r → Σ_b given by:

  comp(n, ε) = (false, n, 0)                    (12)
  comp(n, l.u) = u^#b (comp(n, l))              (13)

We now need to show 1) the consistent labeling of any revision
diagram and 2) that queries are answered the same by both implementations,
as summarized by the following proof obligations:

  comp(σ_0, ε) = σ_b                                               (14)
  ∀u. comp(u^#r (n, l)) = u^#b (comp(n, l))                        (15)
  (σ_1, σ_2) = f^r σ  ⟹  (comp σ_1, comp σ_2) = f_b (comp σ)      (16)
  comp(j^r (σ_1, σ_2)) = j_b (comp σ_1, comp σ_2)                  (17)
  ∀q. q^#r (n, l) = q^#b (comp(n, l))                              (18)
To show (14), we apply (12) to σ_0 = 0, which gives us (false, 0, 0) =
σ_b. To show (15), observe that comp(u^#r (n, l)) = comp(n, l.u)
by (5); applying (13), this is equal to u^#b (comp(n, l)). To show (16),
first note that by (7) σ_1 = σ, and thus comp(σ_1) = comp(σ), which
matches the definition of f_b. Assuming σ = (n, l), we have σ_2 =
(apply l n, ε), and we need to show comp σ_2 = (false, b + d, 0), assuming
comp σ = (x, b, d). comp(apply l n, ε) = (false, apply l n, 0)
by (12). Thus it remains to show that apply l n = b + d, i.e., the sum
of the second and third components of comp(n, l). We proceed by
induction over the length of the log l. If l = ε, then apply l n = n
and comp σ = (false, n, 0) (where b = n and d = 0). If
l = l_1.u, then we can assume that comp(n, l_1) = (.., b_0, d_0) and
that b_0 + d_0 = apply l_1 n. We split on the form of u. If u = reset,
then apply l n = 0 and comp(n, l) = reset^#b (comp(n, l_1)) =
(true, 0, 0). If u = inc, then apply l n = 1 + apply l_1 n, and
comp(n, l) = inc^#b (comp(n, l_1)) = (.., b_0, d_0 + 1), thus establishing
that b_0 + d_0 + 1 = apply l n.
To show (17), note that comp(j^r (σ_1, σ_2)) = comp(j^r (n_1, l_1) (n_2, l_2)) =
comp(n_1, l_1.l_2) = comp(comp(n_1, l_1), l_2) = comp(comp σ_1, l_2).
We consider two cases based on the implementation of j_b: if comp(n_2, l_2) =
(false, b_2, d_2), then by f_b, inc^#b, reset^#b, we know that l_2 contains
no reset operation, and that the number of inc operations is
d_2. Let comp σ_1 = (s_1, b_1, d_1); then j_b (comp σ_1, comp σ_2) =
(s_1, b_1, d_1 + d_2). Note that comp(comp σ_1, l_2) = inc^#b (... inc^#b (s_1, b_1, d_1)),
where the number of inc^#b applications is d_2. By inc^#b, the result
is (s_1, b_1, d_1 + d_2). For the other case, we have comp(n_2, l_2) =
(true, b_2, d_2) and j_b (comp σ_1, comp σ_2) = comp σ_2. By f_b, inc^#b, reset^#b,
we know there exists a reset operation in l_2 followed by d_2 inc operations.
Thus, comp(comp(n_1, l_1), l_2) = comp(x, l_2) for any x,
in particular for n_2, thus comp(j^r (σ_1, σ_2)) = comp(σ_2).
Finally, we show that all queries q return the same value in
both implementations (18) by induction over l. If l = ε, then
get^#r (n, l) = get^# (apply ε n) = n and get^#b (comp(n, l)) =
get^#b (false, n, 0) = n + 0 = n. For l = l_1.u, we perform a case
analysis based on u. By the induction hypothesis, get^#r (n, l_1) = get^#b (comp(n, l_1)). If
u = inc, then get^#r (n, l) = get^# (apply l n) = get^# (inc^# (apply l_1 n)) =
1 + get^# (apply l_1 n) = 1 + get^#r (n, l_1) = 1 + get^#b (comp(n, l_1)) =
get^#b (inc^#b (comp(n, l_1))) = get^#b (comp(n, l)). If u = reset,
then get^#r (n, l) = get^# (apply l n) = get^# (reset^# (apply l_1 n)) =
get^# 0 = 0 = get^#b (reset^#b (comp(n, l_1))) = get^#b (comp(n, l)).
C.2 Integer Register
An integer register is a generalization of the counter with reset. It also
consists of a single integer value along with two update operations:
add d, adding a delta d to the register, and set n, setting the
register to n. The query is get, which simply retrieves the register
value. The initial state is 0. The corresponding QUA is

  Σ = int
  σ_0 = 0
  add^# d n_0 = n_0 + d
  set^# n n_0 = n
  get^# n_0 = n_0
A FJ-QUA for the integer register with constant space looks
similar to the optimized counter:

  Σ_b = bool × int × int
  σ_b = (false, 0, 0)
  add^#b n (r, b, d) = (r, b, d + n)
  set^#b n (r, b, d) = (true, n, 0)
  get^#b (r, b, d) = b + d
  f_b (r, b, d) = ((r, b, d), (false, b + d, 0))
  j_b (r_1, b_1, d_1) (r_2, b_2, d_2) = (true, b_2, d_2)         if r_2 = true
                                        (r_1, b_1, d_1 + d_2)    otherwise
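As with the counter, the register FJ-QUA can be transcribed into a short Python sketch (illustrative naming, not the paper's code). The example exercises the r_2 = true case of j_b, where a set on the branch overwrites the joiner's value:

```python
# Integer register FJ-QUA: (r, b, d) = (set flag, base value, accumulated delta).

def add(n, state):
    r, b, d = state
    return (r, b, d + n)

def set_(n, state):          # trailing underscore avoids clashing with the
    return (True, n, 0)      # Python built-in name "set"

def get(state):
    r, b, d = state
    return b + d

def fork(state):
    r, b, d = state
    return (r, b, d), (False, b + d, 0)

def join(s1, s2):
    (r1, b1, d1), (r2, b2, d2) = s1, s2
    return (True, b2, d2) if r2 else (r1, b1, d1 + d2)

main = (False, 0, 0)
main, branch = fork(main)
main = add(5, main)                 # main adds 5 concurrently...
branch = add(2, set_(10, branch))   # ...while the branch sets 10, then adds 2
main = join(main, branch)
print(get(main))  # 12: the branch's set wins, and only its delta (2) remains
```

If the branch performs no set, the join instead simply adds the branch's delta to the joiner's state, exactly as for the counter.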
THEOREM 4. The integer register FJ-QUA implements the integer
register QUA in constant space.
PROOF. We use the same comp function as for the counter, and
the proof is analogous.
C.3 High Score
The high-score problem is to maintain the top k score-name pairs
for some game. The state consists of a list of at most k such pairs,
ordered by decreasing score (and increasing arbitration order in
case of a tie). The single update operation is post s p, posting a new
score of s from player p. The query operation is get i, retrieving the
i-th score (where i is less than k). The initial value of the high-score
table is all 0s and empty names. Written as a QUA we have

  Σ = list⟨int, string⟩
  σ_0 = [(0, ""), (0, ""), . . . , (0, "")]
  post^# s n l = take k (insertion-sort (s, n) l)
  get^# i l = l[i]
An optimized high-score implementation maintains an additional high-score list
holding only the scores newly posted on a branch, such that at a merge,
only the new scores are merged into the main revision. If we were
to merge the branch's full high-score table, we might end up
with duplicates of scores that were posted prior to the fork.
  Σ_b = list⟨int, string⟩ × list⟨int, string⟩
  σ_b = ([(0, ""), . . . , (0, "")], [(0, ""), . . . , (0, "")])
  post^#b s n (l, r) = (take k (insertion-sort (s, n) l),
                        take k (insertion-sort (s, n) r))
  get^#b i (l, r) = l[i]
  f_b (l, r) = ((l, r), (l, [(0, ""), . . . , (0, "")]))
  j_b (l_1, r_1) (l_2, r_2) = (merge-sort l_1 r_2,
                               merge-sort r_1 r_2)
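The following Python sketch (our own transcription; for brevity it breaks score ties arbitrarily rather than by arbitration order) illustrates why only the branch's table of new scores, r_2, is merged at a join:

```python
# High-score FJ-QUA: the state is (l, r), where l is the full top-K table and
# r holds only the scores posted since the last fork on this branch.

K = 3
EMPTY = [(0, "")] * K

def post(score, name, state):
    l, r = state
    def ins(t):  # take k (insertion-sort (s, n) t)
        return sorted(t + [(score, name)], key=lambda p: -p[0])[:K]
    return (ins(l), ins(r))

def get(i, state):
    l, r = state
    return l[i]

def fork(state):
    l, r = state
    return (l, r), (l, EMPTY)     # the branch starts with an empty "new" table

def merge(t1, t2):                # merge-sort, truncated to the top K
    return sorted(t1 + t2, key=lambda p: -p[0])[:K]

def join(s1, s2):
    (l1, r1), (l2, r2) = s1, s2
    return (merge(l1, r2), merge(r1, r2))   # only the branch's new scores merge

main = (EMPTY, EMPTY)
main = post(50, "ann", main)      # posted before the fork: seen by both sides
main, branch = fork(main)
branch = post(80, "bob", branch)
main = post(60, "cyd", main)
main = join(main, branch)
print([get(i, main) for i in range(K)])
# [(80, 'bob'), (60, 'cyd'), (50, 'ann')]
```

Merging the branch's full table l_2 instead of r_2 would re-insert (50, "ann"), which both revisions already contained before the fork, producing a duplicate entry.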