Bounded Version Vectors
José Bacelar Almeida Paulo Sérgio Almeida Carlos Baquero
Departamento de Informática, Universidade do Minho
{jba,psa,cbm}@di.uminho.pt
Abstract
Version vectors play a central role in update tracking un-
der optimistic distributed systems, allowing the detection
of obsolete or inconsistent versions of replicated data. Ver-
sion vectors do not have a bounded representation; they are
based on integer counters that grow indefinitely as updates
occur. Existing approaches to this problem are scarce; the
mechanisms proposed are either unbounded or operate only
under specific settings. This paper examines version vec-
tors as a mechanism for data causality tracking and clarifies
their role with respect to vector clocks. Then, it introduces
bounded stamps and proves them to be a correct alternative
to integer counters in version vectors. The resulting mecha-
nism, bounded version vectors, represents the first bounded
solution to data causality tracking between replicas subject
to local updates and pairwise symmetrical synchronization.
Keywords: Replication, causality, version vectors, up-
date tracking, bounded state.
1 Introduction
Optimistic replication is a critical technology in dis-
tributed systems, in particular when improving availability
of database systems and adding support to mobility and par-
titioned operation [17]. Under optimistic replication, data
replicas can evolve autonomously by incorporating new up-
dates into their state. Thus, when contact can be established
between two or more replicas, mutual consistency must be
evaluated and potential divergence detected.
The classic mechanism for assessing divergence between
mutable replicas is provided by version vectors which,
since their introduction by Parker et al. [13], have been one
of the cornerstones of optimistic data management. Version
vectors associate to each replica a vector of integer coun-
ters that keeps track of the last update that is known to have
been originated in every other replica and in the replica it-
self. The mechanism is simple and intuitive but requires a
state of unbounded size, since each counter in the vector
can grow indefinitely.
The potential existence of a bounded substitute to version
vectors has been overlooked by the community. A possible
cause is a frequent confusion of the roles played by ver-
sion vectors and vector clocks (e.g. [16, 17]), that have the
same representation [13, 4, 12], together with the existence
of a minimality result by Charron-Bost [3], stating that vec-
tor clocks are the most concise characterization of causality
among process events.
In this article we show that a bounded solution is possi-
ble for the problem addressed by version vectors: the detec-
tion of mutual inconsistency between replicas subject to lo-
cal updates and pairwise symmetrical synchronization. We
present a mechanism, bounded stamps, that can be used to
replace integer counters in version vectors, stressing that
the minimality result that precludes bounded vector clocks
does not apply to version vectors.
1.1 On version vectors and vector clocks
Asynchronous distributed systems track causality and log-
ical time among communicating processes by means of
several mechanisms [11, 18], in particular vector clocks
[4, 12].
While being structurally equivalent to version vectors,
vector clocks serve a very distinct purpose. Vector clocks
track causality by establishing a strict partial order on the
events of processes that communicate by message pass-
ing, and are known to be the most concise solution to this
problem. Vector clocks, being a vector of integer counters,
are unbounded in size, but so is the number of events that
must be ordered and timestamped by them. In short, vector
clocks order an unlimited number of events occurring in a
given number of processes.
If we consider the role of version vectors, data causal-
ity, there is always a limit to the number of possible rela-
tions that can be established on the set of replicas. This
limit is independent of the number of update events that are considered on any given run. For example, in a two replica system only four cases can occur: the replicas are equivalent, the first dominates the second, the second dominates the first, or they are divergent. If the two replicas are already divergent, the inclusion of new update events on any of the replicas does not change their mutual divergence and
the corresponding relation between them. In short, version
vectors order a given number of replicas, according to an
unlimited number of update events.
The existence of a limited number of relations is a nec-
essary but not sufficient condition for the existence of a
bounded characterization mechanism. A relation, which
is a global abstraction, must be encoded and computed
through local operations on replica pairs without the need
for a global view. This is one of the important properties of
version vectors.
2 Data causality and version vectors
Data causality on a set of replicas can be assessed via set
inclusion of the sets of update events known to each replica.
Data causality is the pre-order defined by:
$a \leq b$ iff $U_a \subseteq U_b$,
where $U_a$ and $U_b$ are the sets of update events (globally unique events) known to replicas $a$ and $b$.
When tracking data causality with version vectors in an $n$ replica system, one associates to each replica $a$ a vector $v_a$ of $n$ integer counters. The order on version vectors is the standard pointwise (coordinatewise) order:
$v_a \leq v_b$ iff $\forall i.\ v_a[i] \leq v_b[i]$,
where $v[i]$ denotes component $i$ of vector $v$.
The operations on version vectors, formally presented in
Figure 1, are as follows:
Initialization ($I$) establishes the initial system state. All vectors are initialized with zeroes.
Update ($u_a$): an update event in replica $a$ increments $v_a[a]$.
Synchronization ($s_{ab}$): synchronization of $a$ and $b$ is achieved by taking the pointwise join (greatest element) of $v_a$ and $v_b$.

Operation $I$: $v_a[i] = 0$, for all $a$ and $i$.
Operation $u_a$: $v_a[i] := v_a[i] + 1$ if $i = a$; $v_a[i]$ otherwise.
Operation $s_{ab}$: $v_a[i] := v_b[i] := v_a[i] \sqcup v_b[i]$.
Figure 1: Semantics of version vector operations.
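To make Figure 1 concrete, here is a minimal Python sketch of these operations; the class and method names are ours, not part of the paper.

```python
class VersionVector:
    """One replica's version vector in an n-replica system (names are ours)."""

    def __init__(self, n: int):
        self.v = [0] * n                 # initialization: all counters at zero

    def update(self, i: int) -> None:
        self.v[i] += 1                   # update event originated at replica i

    def sync(self, other: "VersionVector") -> None:
        # pairwise symmetrical synchronization: both sides keep the pointwise join
        joined = [max(x, y) for x, y in zip(self.v, other.v)]
        self.v, other.v = list(joined), list(joined)

    def leq(self, other: "VersionVector") -> bool:
        # data causality: the standard pointwise (coordinatewise) order
        return all(x <= y for x, y in zip(self.v, other.v))
```

Two replicas are divergent exactly when neither `leq` direction holds.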
This classic mechanism encodes data causality because
comparing version vectors gives the same result as compar-
ing sets of known update events. For all runs and replicas $a$ and $b$:
$a \leq b$ iff $U_a \subseteq U_b$ iff $v_a \leq v_b$.
Figure 2 shows a run with version vectors in a four
replica system. Updates are depicted by a “$\bullet$” and synchronizations by two “$\circ$” connected by a line.
2.1 Version vector slices
All operations over version vectors exhibit a pointwise na-
ture: a given vector position is only compared or updated to
the same position in other vectors, resulting from all infor-
mation about updates originated in replica $i$ being stored in component $i$ of each version vector. This allows a decomposition of the replicated system into $n$ slices, where each slice represents the updates that were originated in a given replica. Slice $i$ for an $n$ replica system is made up of the $i$th component of each version vector:
$S_i = \langle v_0[i], v_1[i], \ldots, v_{n-1}[i] \rangle$.
This means that data causality in $n$ replicas can be encoded by the concatenation of the representation for each of the $n$ slices. It also means that it is enough to concentrate on a subproblem: encoding the distributed knowledge about a single source of updates, and the corresponding version vector slice (VVS).
[Figure 2: Version vectors: example run, depicting slice 0 counters by a boxed digit.]
Operation $I$: $v_a = 0$, for all $a$.
Operation $u_a$: $v_a := v_a + 1$ if $a = 0$; $v_a$ otherwise.
Operation $s_{ab}$: $v_a := v_b := v_a \sqcup v_b$.
Figure 3: VVS semantics for slice 0.
The source of updates increments its counter and all other replicas keep potentially outdated copies of that counter; this subproblem amounts to storing a distributed representation of a total order.
For the remainder of the paper we will concentrate, for
notational convenience and without loss of generality, on
finding a bounded representation for slice 0. Figure 3
presents the semantics of version vectors restricted to slice
0; in the run presented in Figure 2 this slice is shown using
boxed counters.
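Restricted to slice 0, the state degenerates into one counter per replica; the following small sketch (function names are ours) mirrors the semantics of Figure 3.

```python
def vvs_init(n: int):
    return [0] * n                      # one slice-0 counter per replica

def vvs_update(c, a: int) -> None:
    if a == 0:                          # only updates at the source (replica 0)
        c[0] += 1                       # touch this slice

def vvs_sync(c, a: int, b: int) -> None:
    c[a] = c[b] = max(c[a], c[b])       # both ends keep the freshest value seen
```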
3 Informal presentation
We now give an informal presentation of the mechanism
and give some intuition of how it works and how it ac-
complishes its purpose. Having shown that it is enough to
concentrate on a subproblem (a single source of updates)
and the corresponding slice of version vectors, we now
present the stamp that will replace, in each replica, the in-
teger counter of the corresponding version vector.
For problem size $n$, i.e. assuming $n$ replicas, with $R_0$ the “primary” where updates take place and $R_1, \ldots, R_{n-1}$ the “secondary” replicas, we represent a stamp by something like

  c b a
  c a
  a
  c a

It has a representation of bounded size, as it consists of $n$ rows, each with at most $n$ symbols (letters here), taken from a finite set $\mathcal{S}$. An example run consisting of four replicas is presented in Figure 4.
A stamp is, in abstract, a vector of totally ordered sets.
Each of the $n$ components (rows in our notation) represents a total order, with the greatest element on the left (the first row above means $c > b > a$). In a stamp for replica $i$, row $i$ is what we call the principal order (displayed with a gray background in the figures), while the other rows are the cached orders. (Thus, the stamp above would belong to one of the secondary replicas.) The cached order in row $j$ represents the principal order of replica $j$ at some point in time, propagated to replica $i$ (either directly or indirectly through several synchronizations).
The greatest element of the principal order (on the left,
depicted in bold over gray) is what we call the principal
element. It represents the most recent update (in the pri-
mary) known by the replica. In a representation using an
infinite totally ordered set instead of $\mathcal{S}$, nothing more would
be needed. This element can be thought of as “correspond-
ing” to the value of the integer counter in version vectors.
The left column in a stamp (depicted in bold) is what
we call the principal vector; it is made up of the greatest
element of each order (row). It represents the most recent
local knowledge about the principal element of each replica
(including itself).
In a stamp, there is a relationship between the principal
order and the principal vector: the elements in the principal
vector are the same ones as in the principal order. In other
words, the set of elements in the principal vector is ordered
according to the principal order.
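The stamp structure just described can be pictured as a small data type; the following sketch uses our own naming, not the paper's, to record the rows and expose the principal order, principal vector and principal element.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stamp:
    """A bounded stamp for the replica with index `i` (naming is ours)."""
    i: int                  # index of the replica owning this stamp
    rows: List[List[str]]   # rows[j]: a total order, greatest element first

    @property
    def principal_order(self) -> List[str]:
        return self.rows[self.i]              # the stamp's own row

    @property
    def principal_vector(self) -> List[str]:
        return [row[0] for row in self.rows]  # greatest element of each row

    @property
    def principal_element(self) -> str:
        return self.rows[self.i][0]           # most recent update known here
```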
3.1 Comparison and synchronization as well-defined local operations
As we will show below, the mechanism is able to compare
two stamps by a local operation on the respective principal
orders. No global knowledge is used: not even a global
order on the set of symbols is assumed. For comparison purposes $\mathcal{S}$ is simply an unordered set, with elements that are ordered differently in different stamps.

[Figure 4: Bounded stamps: example run.]

As an example,
the comparison of
  b c
  c a
  c
  c

with

  c b a
  c a
  a
  c a

involves looking at the principal orders $b\,c$ and $c\,a$, and gives that the second stamp is dominated by the first: its principal element $c$ occurs in $b\,c$, while $b$ does not occur in $c\,a$.
When synchronizing two stamps, in the positions of the
two principal elements, the resulting value will be the max-
imum of the two principal elements; the rest of the resulting
principal vector will be the pointwise maximum of the re-
spective values. The comparisons are performed according
to the principal orders of the two stamps involved.
It is important to notice that, in general, it is not possi-
ble to take two arbitrary total orders and merge them into
a new total order. As such, it could be thought that com-
puting the maximum as mentioned above is ill defined. As
we will show, several properties of the model can be ex-
plored that make these operations indeed possible and well
defined. We will also show that it is possible to totally order
the elements in the resulting principal vector, i.e. to obtain
a new principal order.
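As sketched below, comparison of two stamps needs nothing beyond a membership test on the locally stored principal orders (continuing the `Stamp` sketch above; `leq` is our name).

```python
def leq(x: Stamp, y: Stamp) -> bool:
    """Data causality for one slice: x's most recent known update is also
    known to y exactly when it occurs in y's principal order."""
    return x.principal_element in y.principal_order
```

Within a single slice one of the two directions always holds; genuine divergence only shows up once the slices are concatenated back into full version vectors.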
3.2 Garbage collection for symbol reuse
The boundedness of the mechanism is only possible
through symbol reuse. When an update operation is per-
formed, instead of incrementing an integer counter, some
symbol is chosen to become the new principal element. By
using a finite set of symbols $\mathcal{S}$, an update will eventually
reuse a symbol that was already used in the past to repre-
sent some previous update that has been synchronized with
other replicas.
However, by reusing symbols, an obvious problem arises
that needs to be addressed: the symbol reuse cannot com-
promise the well-definedness of the comparison operations
described above. As an example, it would not be accept-
able that, due to reuse, the principal orders of two stamps
end up being $a\,b\,c$ and $c\,a$, as it would not be possible to overcome the ambiguity between $a > c$ and $c > a$ and to infer which one is the greatest stamp.
To address the problem, the mechanism implements a
distributed “garbage collection” of symbols. This is accom-
plished through the extra information in the cached orders.
As we will show, any element in the principal order/vector
of any replica is also present in the primary replica (in some
principal or cached order). This is the key property towards
symbol reuse: when an update is performed, any symbol
which is not present in the primary replica is considered
“garbage” and can be (re)used for the new principal ele-
ment.
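A possible reading of the update step, continuing the `Stamp` sketch above: the primary picks any symbol absent from its whole state (principal and cached orders) and restricts its principal order to the symbols still referenced by the principal vector. The restriction detail is our interpretation of how non-propagated symbols are discarded; the names below are ours.

```python
def used_at_primary(primary: Stamp) -> set:
    # every symbol occurring anywhere in the primary's state
    return {s for row in primary.rows for s in row}


def update(primary: Stamp, alphabet) -> None:
    """Update at the primary (replica 0): reuse a symbol absent from its
    whole state; the paper shows one always exists for a large enough set."""
    fresh = next(s for s in alphabet if s not in used_at_primary(primary))
    # symbols still referenced by the other positions of the principal vector
    still_held = {primary.rows[j][0] for j in range(len(primary.rows))
                  if j != primary.i}
    # new principal order: the fresh symbol on top; symbols no longer held
    # anywhere in the principal vector are dropped (and become reusable)
    primary.rows[primary.i] = [fresh] + [s for s in primary.principal_order
                                         if s in still_held]
```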
As an example, in Figure 4, when the final update occurs,
symbol $b$ can be used for the new principal element because
it is not present in the primary replica:

  c
  c a
  c
  c

[Figure 5: Counter mode principal vectors.]
Notice that the scheme only assures that $b$ does not occur in the principal orders/vectors. In this example $b$ occurs in some cached orders of replicas
  c b a
  c a
  a
  c a

and

  c b a
  c a
  c
  c
but this is not a problem because those elements will not be used in comparisons; the “old” $b$ will not be confused with the “new” $b$.
3.3 Synopsis of formal presentation
The formal presentation and proof of correctness will make
use of an unbounded mechanism which we call the counter
mode principal vectors (CMPV). This auxiliary mechanism
represents what the evolution of the principal vector would
be if we could afford to use integer counters. The mecha-
nism makes use of the total order on natural numbers and
does not encode orders locally. In Figure 5 we present part
of the run in Figure 4 using the counter mode mechanism.
The bulk of the proof consists in establishing several
properties of the CMPV model that allow the relevant com-
parison operations to be computed in a well-defined way using only local information. The key idea is that, exploiting these properties, bounded stamps can be seen as an encoding of CMPV using a finite set $\mathcal{S}$, where the principal orders are used to encode the relevant order information.

Operation $I$: $v_a[i] = 0$, for all $a$ and $i$.
Operation $u_0$: $v_0[i] := v_0[i] + 1$ if $i = 0$; $v_0[i]$ otherwise.
Operation $s_{ab}$: $v_a[i] := v_b[i] := v_a[a] \sqcup v_b[b]$ if $i \in \{a, b\}$; $v_a[i] \sqcup v_b[i]$ otherwise.
Figure 6: Semantics of operations in CMPV.
4 Counter Mode Principal Vectors
Version Vector Slices (VVS) rely on an unbounded totally
ordered set — the integers. Their unbounded nature is ac-
tually a consequence of adopting a predetermined order re-
lation (and hence globally known) to capture data causality
among replicas. To overcome this, we enrich VVS in a
way that order judgments become, in a sense, local to each
replica. In this way, it will be possible to dynamically en-
code the causality order and open the perspective of bound-
ing the “counters” domain.
For a replica index $a$, its local state in the CMPV model is defined as a vector $v_a$ of integers with size $n$ — the principal vector for $a$ (see Figure 5). The value in position $i$ of vector $v_a$ is denoted by $v_a[i]$ and represents the knowledge of stamp $a$ concerning the most recent update known by stamp $i$. The element $v_a[a]$ plays a central role since it holds $a$'s view about the most recent update — this is essentially the information contained in VVS counters and we call it the principal element for stamp $a$.
Figure 6 defines the semantics of the operations in the
CMPV model. Symbol $\sqcup$ denotes the join operation under integer ordering (i.e. taking the maximum element). Notice that the order information is only required to perform the synchronization operation. Moreover, comparisons are always between principal elements or pointwise (between the same position in two principal vectors). Occasionally, it will be convenient to write $a \sqcup b$ for the result of the synchronization on stamps $a$ and $b$ (i.e. the principal vector of one of these stamps after synchronization).
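A minimal sketch of the CMPV operations for slice 0 (the names are ours), mirroring Figure 6:

```python
class CMPV:
    """Counter-mode principal vector of one replica for slice 0 (names ours)."""

    def __init__(self, i: int, n: int):
        self.i = i
        self.v = [0] * n                     # principal vector, all zeroes

    @property
    def principal(self) -> int:
        return self.v[self.i]                # v_a[a], the principal element


def cmpv_update(primary: CMPV) -> None:
    assert primary.i == 0                    # only the primary issues updates
    primary.v[0] += 1


def cmpv_sync(a: CMPV, b: CMPV) -> None:
    top = max(a.principal, b.principal)      # join of the two principal elements
    joined = [max(x, y) for x, y in zip(a.v, b.v)]   # pointwise join elsewhere
    joined[a.i] = joined[b.i] = top
    a.v, b.v = list(joined), list(joined)
```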
A trace consists of a sequence of operations starting with $I$ and followed by an arbitrary number of updates and syn-
chronizations. In the remainder, when stating properties in
the CMPV, we will leave implicit that they only refer to
reachable states, i.e. states that result from some trace of
operations. Induction over the traces is the fundamental
tool to prove invariance properties, as the following simple
facts about CMPV.
Proposition 1. For every pair of replicas $a$, $b$ and index $i$,
1. $v_b[a] \leq v_a[a]$,
2. $v_a[i] \leq v_0[0]$,
3. $v_a[i] \leq v_a[a]$.
Proof. Simple induction on the length of traces.
Given stamps $a$ and $b$ we define their data causality order under CMPV ($\leq_c$) as the comparison of their principal elements:
$a \leq_c b$ iff $v_a[a] \leq v_b[b]$.
By Figure 6 it can be seen that the computation of princi-
pal elements only depends upon principal elements. More-
over, if we restrict the impact of the operations to the princi-
pal element we recover the VVS semantics (Figure 3). This
observation leads immediately to the correctness of CMPV
as a data causality encoding for slice 0: $a \leq b$ iff $a \leq_c b$.
This result is not surprising since CMPV was defined as a
semantics preserving extension of VVS.
Next we will show that the additional information con-
tained in the CMPV model makes it possible to avoid re-
lying on the integer order, and to replace it with a locally
encoded order. For this, we will use a non-trivial invariant
on the global state given by the following lemma. Its proof
is presented in the appendix since it requires an auxiliary
definition and some additional lemmata.
Lemma 2. For every pair of stamps $a$ and $b$ and index $i$,
$v_a[a] \leq v_b[b]$ and $v_b[i] \leq v_a[i]$ implies $v_a[i] \in \overline{v_b}$,
where $\overline{v_b}$ denotes the set of values occurring in the principal vector of $b$.
Proof. See appendix A.
Recall that the order information is only required to per-
form the synchronization operation. Moreover, compar-
isons are always between principal elements or pointwise
(between the same position in two principal vectors). In the
following we will show that these comparisons can be per-
formed without relying on integer order as long as we can
order the elements in the principal vector of each stamp in-
dividually.
Comparison between principal elements reduces to a membership test.
Proposition 3. For every pair of stamps $a$ and $b$,
$v_a[a] \leq v_b[b]$ iff $v_a[a] \in \overline{v_b}$.
Proof. If $v_a[a] \leq v_b[b]$ then, by Proposition 1(1), we have that $v_b[a] \leq v_a[a]$ and so, by Lemma 2, $v_a[a] \in \overline{v_b}$. If $v_a[a] \in \overline{v_b}$, say $v_a[a] = v_b[i]$, then, by Proposition 1(3), we have that $v_a[a] = v_b[i] \leq v_b[b]$.
For a stamp $a$, let us denote by $\leq_a$ the restriction of the intrinsic integer order to the values contained in the principal vector $v_a$:
$x \leq_a y$ iff $x \leq y$ and $x \in \overline{v_a}$ and $y \in \overline{v_a}$.
Using these orderings, we define new ones that are appropriate to perform the required comparisons. For stamps $a$ and $b$, let their combined order be defined as:
$x \leq_{ab} y$ iff $x \leq_a y$ or $x \leq_b y$.
For convenience, we also define the corresponding join operation as:
$x \sqcup_{ab} y = y$ if $x \leq_{ab} y$; $x$ otherwise.
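In code, these restricted and combined orders amount to the following checks over the integer-valued principal vectors (a sketch with our own names; `pv` is the `v` field of the CMPV sketch above):

```python
def restricted_leq(pv, x: int, y: int) -> bool:
    """Integer order restricted to the values occurring in principal vector pv."""
    return x in pv and y in pv and x <= y


def combined_leq(pv_a, pv_b, x: int, y: int) -> bool:
    # combined order of two stamps: defined whenever either restriction applies
    return restricted_leq(pv_a, x, y) or restricted_leq(pv_b, x, y)


def combined_join(pv_a, pv_b, x: int, y: int) -> int:
    return y if combined_leq(pv_a, pv_b, x, y) else x
```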
The following proposition establishes the claimed prop-
erties for this ordering.
Proposition 4. For every pair of stamps $a$ and $b$ and index $i \notin \{a, b\}$,
1. $v_a[a] \leq v_b[b]$ iff $v_a[a] \leq_{ab} v_b[b]$,
2. $v_a[i] \leq v_b[i]$ iff $v_a[i] \leq_{ab} v_b[i]$.
Proof. (1) Follows directly from Propositions 1 and 3. (2) Follows from Proposition 3 and Lemma 2, by case analysis on which of $v_a[a]$ and $v_b[b]$ is the greater: in each case the two values being compared are shown to both belong to the principal vector of one of the stamps, so the corresponding restricted order applies.
Restricted orders can be explicitly encoded (e.g. by a
sequence) and can be easily manipulated. We now show
that when a synchronization is performed, all the elements
in the resulting principal vector were already present in the
more up-to-date stamp. This means that the restricted order
that results is a restriction of the one from the more up-to-
date stamp.
Proposition 5. Let $a$ and $b$ be stamps and $c = a \sqcup b$ their synchronization. If $v_b[b] \leq v_a[a]$ then, for all $i$, $c[i] \in \overline{v_a}$.
Proof. For the pointwise join positions: if $v_b[i] \leq v_a[i]$ then $c[i] = v_a[i] \in \overline{v_a}$; if $v_a[i] < v_b[i]$ then, by Lemma 2, $c[i] = v_b[i] \in \overline{v_a}$. Otherwise, note that the resulting principal element ($v_a[a] \sqcup v_b[b] = v_a[a]$) is already in $\overline{v_a}$.
These observations, together with the fact that the global state can only retain a bounded number of integer values (an obvious limit is $n^2$, the total number of entries in all principal vectors), open the way for a change of domain, from the integers in the CMPV model to a finite set.
5 Bounded Stamps
A migration from the domain of integer counters in CMPV
to a finite set is faced with the following difficulty: the
update operation should be able to choose a value, that is
not present in any principal vector, for the new principal
element in the primary.
Adopting a set sufficiently large (e.g. with ele-
ments) guaranties that such a choice exists under a global
view. The problem lies in making that choice using only
the information in the state of the primary. To overcome
this problem we make a new extension of the model that
allows the primary to keep track of all the values in use in
the principal vectors of all stamps.
We will present this new model parameterized by a set $\mathcal{S}$ (the symbol domain), a distinguished element $\perp \in \mathcal{S}$ (the initial element), and an oracle $\mathsf{new}$ for new symbols (satisfying an axiom described below). For each replica index $a$, its local state in the bounded stamps model is denoted by $\alpha_a$ and defined as a tuple where:
$a$ is the replica index;
$v_a$ is a vector of values from $\mathcal{S}$ with size $n$ — the principal vector;
$o_a$ is a vector of $n$ total orders, encoded as sequences, representing the full bounded stamp.
This last component contains all the information in the principal vector, the principal order and the cached orders. Although the principal vector is redundant (as each component $v_a[i]$ is also present in the first position of $o_a[i]$), it is kept in the model for notational convenience in describing the operations and in establishing the correspondence between the models.
The intuitive idea is that the state for each stamp keeps an explicit representation of the restricted orders. More precisely, for stamp $a$, the sequence $o_a[a]$ contains precisely the elements of $\overline{v_a}$ ordered downward (the first element is $v_a[a]$). From that sequence one easily defines the restricted order for stamp $a$, which we call the principal order to emphasize its explicit nature:
$x \leq_a y$ iff $x = y$ or $o_a[a]|_{\{x,y\}} = y\,x$,
where $s|_X$ denotes the sequence $s$ restricted to the elements in $X$, i.e. keeping exactly the elements of $X$ in their original order. The combined order $\leq_{ab}$ and associated join $\sqcup_{ab}$ are defined precisely as in counter mode, that is,
$x \leq_{ab} y$ iff $x \leq_a y$ or $x \leq_b y$.
The other sequences in $o_a$ keep information about (potentially outdated) principal orders of other stamps — these are called the cached orders.
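With the sequence encoding, the same combined order is read off the stored sequences instead of the integer order; a short sketch (names ours, sequences are lists with the greatest element first):

```python
def seq_leq(seq, x, y) -> bool:
    """x <= y in the total order encoded by `seq` (greatest element first)."""
    return x in seq and y in seq and seq.index(y) <= seq.index(x)


def seq_combined_leq(seq_a, seq_b, x, y) -> bool:
    # same combined order as in counter mode, but decided locally from the
    # two stored principal orders rather than from the integer order
    return seq_leq(seq_a, x, y) or seq_leq(seq_b, x, y)
```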
Figure 7 gives the semantics for the operations in this model. The oracle for new symbols is a function $\mathsf{new}(\alpha_0)$ that gives an element of $\mathcal{S}$ satisfying the following axiom: for every stamp $a$,
$\mathsf{new}(\alpha_0) \notin \overline{v_a}$.
The argument $\alpha_0$ in the oracle intends to emphasize that the choice of the new symbol should be made based on the primary local state.
[Figure 7: Semantics of operations on the BS model.]
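The following sketch gives one plausible reading of the synchronization operation in this model, based on the informal description in Section 3 and on Proposition 5, and continuing the `Stamp` sketch above; in particular, the treatment of the cached orders (keeping, for each position, the row of whichever stamp has the fresher entry) is our assumption rather than a definition taken from Figure 7.

```python
def entry_leq(a: Stamp, b: Stamp, x: str, y: str) -> bool:
    """x <= y in the combined order of a and b, read from whichever of the
    two principal orders (greatest element first) contains both symbols."""
    for seq in (a.principal_order, b.principal_order):
        if x in seq and y in seq:
            return seq.index(y) <= seq.index(x)
    return False  # not needed for the comparisons synchronization performs


def entry_join(a: Stamp, b: Stamp, x: str, y: str) -> str:
    return y if entry_leq(a, b, x, y) else x


def sync(a: Stamp, b: Stamp) -> None:
    n = len(a.rows)
    # the more up-to-date stamp, judged by the two principal elements
    winner = b if entry_leq(a, b, a.principal_element, b.principal_element) else a
    top = winner.principal_element
    # new principal vector: join of principal elements at positions a.i and
    # b.i, combined-order join of the corresponding entries elsewhere
    pv = [entry_join(a, b, a.principal_vector[k], b.principal_vector[k])
          for k in range(n)]
    pv[a.i] = pv[b.i] = top
    # new principal order: the winner's order restricted to the new vector
    # (cf. Proposition 5: those elements are all present in the winner)
    new_order = [s for s in winner.principal_order if s in pv]
    # cached orders: keep, for each position, the row of whichever stamp has
    # the fresher entry there (our assumption about the cached-order update)
    for k in range(n):
        fresher = b if entry_leq(a, b, a.principal_vector[k],
                                 b.principal_vector[k]) else a
        row = list(fresher.rows[k])
        a.rows[k], b.rows[k] = list(row), list(row)
    # both replicas adopt the new principal order, in their own row and in
    # the row they cache for each other
    for idx in (a.i, b.i):
        a.rows[idx], b.rows[idx] = list(new_order), list(new_order)
```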
Data causality ordering under the Bounded Stamps model is defined by
$a \leq_{bs} b$ iff $v_a[a] \in \overline{v_b}$.
The correctness of the proposed model follows from the
observation that, apart from the cached orders used for the
symbol reuse mechanism, it is actually an encoding of the
CMPV model. To formalize the correspondence between
both models, we introduce an encoding function that maps each integer in the CMPV model into the corresponding symbol (in $\mathcal{S}$) in the state resulting from a given trace. This map is defined recursively on the traces: the counter value introduced by the latest update is mapped to the symbol $\mathsf{new}(\alpha_0)$ chosen for that update, and every older value keeps the symbol assigned by the shorter trace, with the initial value 0 mapped to $\perp$. Here $\alpha_0$ is the bounded stamp for the primary after the trace, and $\mathsf{new}(\alpha_0)$ gives a canonical choice for the new principal element on the primary after the update.
orders, the semantics of operations given in Figure 7 are
precisely the ones in CMPV (Figure 6) affected by the en-
coding map. Moreover, the principal orders are encodings
for the restricted orders presented in the previous section.
Lemma 6. For an arbitrary trace and replica indexes $a$ and $b$:
1. the principal vector of $a$ in the bounded stamps model is the pointwise image, under the encoding map, of its principal vector in CMPV;
2. the encoding map preserves the relevant comparisons between the values in use;
3. the principal order of $a$ is the encoding of the restricted order $\leq_a$ of the previous section.
Proof. This results from a simple induction on the length of traces. When the last operation was $I$ it is trivial. When it was an update, the result follows from the induction hypothesis and the axiom for the oracle $\mathsf{new}$. When it was a synchronization, the result follows from the induction hypothesis, the fact that, since $\sqcup_{ab}$ computes the required joins (Proposition 4), the definitions of both models are the same, and the correctness of the new restricted orders (Proposition 5).
As a simple consequence of the previous result, we can
state the following correctness result.
Proposition 7. For any arbitrary trace and replica indexes $a$ and $b$ we have
$a \leq_{bs} b$ iff $a \leq_c b$.
Proof. Immediate from Lemma 6 and the definitions of $\leq_{bs}$ and $\leq_c$.
It remains to instantiate the parameters of the model. A
trivial but unbounded instantiation would be: $\mathcal{S}$ as the integers, $\perp$ as the value 0, and $\mathsf{new}(\alpha_0) = v_0[0] + 1$. In this setting, principal orders would be an explicit representation of counter mode restricted orders. Obviously, we are interested in bounded instantiations of $\mathcal{S}$. To show that such instantiations exist, we introduce the following lemma, which puts in evidence the role of cached orders. Once again we postpone its proof to the appendix since it uses a similar technique to the proof of Lemma 2.
Lemma 8. For every stamp $a$ there exists a position $j$ such that the principal order of $a$ is contained in the cached order $o_0[j]$ of the primary.
Proof. See appendix B.
We are now able to present a bounded instantiation for
the model. Let $\mathcal{S}$ be a totally ordered set with sufficiently many elements ($n^2 + 1$ certainly suffice, since the primary's state holds at most $n^2$ symbols; the total order is here only to avoid making non-deterministic choices). We define $\perp$ as the least element of $\mathcal{S}$, and $\mathsf{new}(\alpha_0)$ as the least element of $\mathcal{S}$ not occurring in $\alpha_0$. Lemma 8 guarantees that $\mathsf{new}$ satisfies the axiom. It follows then that it acts as an encoding of the counter mode model (Proposition 7). Thus we have constructed a bounded model for the data causality problem in a slice, which generalizes, by concatenating slices, to the full data causality problem addressed by version vectors.
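Putting the pieces together, a bounded version vector for an $n$-replica system is just the concatenation of one bounded stamp per slice; the sketch below (names ours, reusing the per-slice `leq` comparison from Section 3.1) shows how obsolescence and mutual inconsistency are then decided.

```python
def bvv_leq(x_slices, y_slices) -> bool:
    """x_slices[i] / y_slices[i]: the slice-i stamps held by two replicas."""
    return all(leq(xs, ys) for xs, ys in zip(x_slices, y_slices))


def compare_replicas(x_slices, y_slices) -> str:
    forward, backward = bvv_leq(x_slices, y_slices), bvv_leq(y_slices, x_slices)
    if forward and backward:
        return "equivalent"
    if forward:
        return "first replica is obsolete"
    if backward:
        return "second replica is obsolete"
    return "divergent (mutually inconsistent)"
```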
6 Related Work
Concerning bounded replacements for version vectors there is, to our knowledge, no previous solution to
the problem. The possible existence of a bounded substi-
tute to version vectors was referred to in [1] while introducing
the version stamps concept. Version stamps allow the char-
acterization of data causality in settings where version vec-
tors cannot operate, namely when replicas can be created
and terminated autonomously.
There have been several approaches to version vector
compression. Update coalescing [14] takes advantage of
the fact that several consecutive updates issued in isolation
in a single replica can be made equivalent to a single large
update. Update coalescing is intrinsic in bounded stamps
since sequence restriction in the update operation discards
non-propagated symbols. Dynamic compression [14] can
effectively reduce the size of version vectors by removing
a common minimum from all entries (along each slice).
However, this technique requires distributed consensus on
all replicas and therefore cannot progress if one or more
replicas are unreachable. Unilateral version vector prun-
ing [16] avoids distributed consensus by allowing unilat-
eral deletion of inactive version vector entries, but relies on some timing assumptions about the physical clocks' skew.
Lightweight version vectors [8] develop an integer en-
coding technique that allows a gradual increase of integer
storage as counters increase. This technique is used in con-
junction with update coalescing to provide a dynamic size
representation. Hash histories [9] track data causality by
collecting hash fingerprints of contents. This representa-
tion is independent of the number of replicas but grows in
proportion to the number of updates.
The minimality of vector clocks as a characterization of Lamport causality [11], presented by Charron-Bost [3] and recently re-addressed in [6], indicates particular runs where the full expressiveness of vector clocks is required. However, there are cases in which smaller representations can operate: Plausible Clocks [19] offer a bounded substitute to vector clocks that is accurate in a large percentage of situations and may be used in settings where deviations only impact performance and not correctness; Resettable
Vector Clocks [2] allow a bounded implementation of vec-
tor clocks under a specific communication pattern between
processes.
The collection of cached copies of the knowledge in
other replicas has been explored before in [5, 20] and used
for optimization of message passing strategies. This con-
cept is sometimes referred to as matrix clocks [15]. These
clocks are based on integer counters and are similar to our
intermediate “counter mode principal vector” representa-
tion.
7 Conclusions
Version vectors are the key mechanism in the detection of
inconsistency and obsolescence among optimistically repli-
cated data. This mechanism has been used extensively in
the design of distributed file systems [10, 7], in particu-
lar for data causality tracking among file copies. It is well
known that version vectors are unbounded due to their use
of counters; some approaches in the literature have tried to
address this problem.
We have brought attention to the fact that causally
ordering a limited number of replicas does not require the
full expressive power of version vectors. Due to the limited
number of configurations among replicas, data causality
tracking does not necessarily imply the use of unbounded
mechanisms. As a consequence, Charron-Bost’s minimal-
ity of vector clocks cannot be transposed to version vectors.
We have noted that to find a bounded alternative to
version vectors, it was enough to concentrate on a sub-
problem: keeping distributed knowledge about a total order
generated by a single entity.
The key to bounded stamps was defining an intermediate
unbounded mechanism and showing that it was possible to
perform comparisons without requiring a global total order;
this was the bulk of the correctness proof; bounded stamps
were then derived as an encoding into a finite set of sym-
bols. This required the definition of a non-trivial symbol
reuse mechanism that is able to progress even if an arbitrary
number of replicas ceases to participate in the exchanges.
This mechanism may have a broader applicability beyond
its current use (e.g. log dissemination and pruning) and be-
come a building block in other mechanisms for distributed
systems.
The construction of the mechanism was supported by a
simulator (available at http://gsd.di.uminho.pt/bvv/bvv-simulator.py), which was used in the proof of correctness so as to probe (and discard) tentative hypotheses. The simulator was also turned into a model checker, which verified the correctness for small numbers of replicas, giving some confidence before the full proof of correctness was attempted.
Bounded version vectors are obtained by replacing the integer counters in version vectors with bounded stamps. They represent the first bounded mechanism for detection of obsolescence and mutual inconsistency in distributed systems.
References
[1] Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte.
Version stamps – decentralized version vectors. In Proceed-
ings of the 22nd International Conference on Distributed
Computing Systems (ICDCS), pages 544–551. IEEE Com-
puter Society, 2002.
[2] A. Arora, S. S. Kulkarni, and M. Demirbas. Resettable vector clocks. In 19th Symposium on Principles of Distributed Computing (PODC 2000), Portland, 2000. ACM.
[3] Bernadette Charron-Bost. Concerning the size of logical
clocks in distributed systems. Information Processing Let-
ters, 39:11–16, 1991.
[4] Colin Fidge. Timestamps in message-passing systems that
preserve the partial ordering. In 11th Australian Computer
Science Conference, pages 55–66, 1989.
[5] Michael J. Fischer and A. Michael. Sacrificing serializabil-
ity to attain high availability of data. In Proceedings of the
ACM Symposium on Principles of Database Systems, pages
70–75. ACM, 1982.
[6] V. K. Garg and C. Skawratananond. String realizers of
posets with applications to distributed computing. In Pro-
ceedings of the ACM Symposium on Principles of Dis-
tributed Computing (PODC’01), pages 72–80. ACM, 2001.
[7] Richard G. Guy, John S. Heidemann, Wai Mak, Thomas W.
Page, Gerald J. Popek, and Dieter Rothmeier. Implementa-
tion of the ficus replicated file system. In USENIX Confer-
ence Proceedings, pages 63–71. USENIX, June 1990.
[8] Yun-Wu Huang and Philip Yu. Lightweight version vec-
tors for pervasive computing devices. In Proceedings of
the 2000 International Workshops on Parallel Processing,
pages 43–48. IEEE Computer Society, 2000.
[9] Brent ByungHoon Kang, Robert Wilensky, and John Kubi-
atowicz. The hash history approach for reconciling mutual
inconsistency. In Proceedings of the 23rd International
Conference on Distributed Computing Systems (ICDCS),
pages 670–677. IEEE Computer Society, 2003.
[10] James Kistler and M. Satyanarayanan. Disconnected opera-
tion in the Coda file system. ACM Transactions on Computer
Systems, 10(1):3–25, February 1992.
[11] Leslie Lamport. Time, clocks and the ordering of events
in a distributed system. Communications of the ACM,
21(7):558–565, July 1978.
[12] Friedemann Mattern. Virtual time and global clocks in dis-
tributed systems. In Workshop on Parallel and Distributed
Algorithms, pages 215–226, 1989.
[13] D. Stott Parker, Gerald Popek, Gerard Rudisin, Allen
Stoughton, Bruce Walker, Evelyn Walton, Johanna Chow,
David Edwards, Stephen Kiser, and Charles Kline. Detec-
tion of mutual inconsistency in distributed systems. Trans-
actions on Software Engineering, 9(3):240–246, 1983.
[14] David Howard Ratner. Roam: A Scalable Replication Sys-
tem for Mobile and Distributed Computing. PhD thesis,
1998. UCLA-CSD-970044.
[15] Frédéric Ruget. Cheaper matrix clocks. In Proceedings of
the 8th International Workshop on Distributed Algorithms,
pages 355–369. Springer Verlag, LNCS, 1994.
[16] Yasushi Saito. Unilateral version vector pruning using
loosely synchronized clocks. Technical Report HPL-2002-
51, HP Labs, 2002.
[17] Yasushi Saito and Marc Shapiro. Optimistic replication.
Technical Report MSR-TR-2003-60, Microsoft Research,
2003.
[18] R. Schwarz and F. Mattern. Detecting causal relationships
in distributed computations: In search of the holy grail. Dis-
tributed Computing, 7(3):149–174, 1994.
[19] F. J. Torres-Rojas and M. Ahamad. Plausible clocks: con-
stant size logical clocks for distributed systems. Distributed
Computing, 12(4):179–196, 1999.
[20] G. T. J. Wuu and A. J. Bernstein. Efficient solutions to the
replicated log and dictionary problems. In Proceedings of
the ACM Symposium on Principles of Distributed Comput-
ing (PODC’84), pages 232–242. ACM, 1984.
A Proof of Lemma 2
The hypothesis of Lemma 2 concerns two stamps (say $a$ and $b$) in which we can identify some sort of conflict between the knowledge of each stamp: stamp $b$ has a better knowledge concerning the primary state ($v_a[a] \leq v_b[b]$) but has an outdated vision concerning some other stamp (say $i$), i.e. $v_b[i] \leq v_a[i]$. Lemma 2 states that when this happens stamp $b$ already attributes the value $v_a[i]$ to some other stamp (say $j$ — that is, $v_b[j] = v_a[i]$). In order to prove this result, it will be necessary to reinforce this statement: not only does the value occur in $\overline{v_b}$, but it is possible to identify a flow of information between the two stamps. Moreover, this flow of information (a sequence of synchronization operations) can be traced in the local state as a sequence of indexes enjoying some properties. These sequences of indexes are called delay paths and are defined as follows.
Definition 9 (Delay Path). A delay path between two stamps is a non-empty sequence of indexes such that,
for any stamp ,
1. ,
2. ,
3. for all ,
4. for all ,
5. for all .
Some simple facts concerning delay paths.
Proposition 10. Let be a delay path between
and . The following facts hold:
1. ,
2. ,
3. for all ,
4. .
Proof. The first three facts are immediate consequences of the definition and Proposition 1. Regarding the last
fact, if occurred in a position , being , by condi-
tion (4) of delay paths we have ; but this contra-
dicts condition (3). Thus, only occurs in a singleton delay
path.
Some of the conditions on delay paths impose global constraints that will allow us to reason about global state changes and their impact on the local states. The following lemma exposes the use of such global constraints.
Lemma 11 (Pointwise-join Lemma). Let be
a non-empty sequence of indexes. If for some ,
1. ,
2. for all , ,
3. for all and any stamp , if then
.
Then, for any stamp for which , there exists
such that and, for all ,
.
Proof. By induction on the length of the sequence
. For the base case (a singleton sequence) we have
that . Since we have and the
remaining condition is vacuous. For the induction step, we
consider the following cases: If then we set
since . Otherwise, we know that
and, by (4), that . So we apply the induction hypoth-
esis to the sequence and set to the resulting
index plus 1.
We now show that the conditions in Lemma 2 are suffi-
cient to establish the existence of delay paths.
Lemma 12. If and are two stamps and a position such
that
and
then there exists a delay path between and .
Proof. We prove by induction on the length of the trace. If
the last operation was we use the singleton sequence
for the delay path and the conditions hold trivially. If the
last operation was consider the following cases:
:we pick the sequence that satisfies trivially
all conditions;
:after the update , which contradicts the
hypothesis;
:if then , which contradicts the
hypothesis. If we use the same delay path that
comes from the IH, which is still valid after the update
because it does not contain position , since
(Proposition 10).
:we use the same delay path from the IH, which
is still valid: (1,2,3) because and are not affected
by the update; (4) because only changes; (5) be-
cause even if for some we have , if ,
then due to (4).
If the last operation was a synchronization (and let us assume, without loss
of generality, that is the more up-to-date stamp, i.e.
) we need to distinguish the following cases:
:we use the same delay path from the
IH, which is still valid: (1,2,3) because and are
not affected; (4) because can only increase; (5)
because for every , if , then either
is computed pointwise and follows from
the IH, or is either or and (by 4)
.
:stamps and become equal after the
synchronization and we pick the sequence for the
delay path;
:in this case the stamp re-
sults from the synchronization of and and we have
. Consider the following two
cases:
When and . First, given that
and , we can apply the IH to and
on index and establish the existence of a delay path
for in . Then we prefix it by ,
obtaining , which is a suitable delay path
between and , given that: (1) holds by construc-
tion, (2) from the IH, (3) from the IH and
(since ); (4) from the IH and
; (5) from the IH and because for
every stamp , .
Otherwise, then either or ; applying
the IH to either or and in position gives us
a valid delay path for the resulting configuration (all
conditions hold, including (5) as shown for the case
).
:in this case the stamp re-
sults from the synchronization of and .
When is either or , we have ; but
this means (as and ) that ;
therefore is a delay path.
Otherwise, ; this means that
and by the IH there exists a delay path between
and . Given that also , Lemma 11 estab-
lishes the existence of a sequence
(prefix of ) that is a delay path between and
for the following reasons. Positions and do not
appear in because we are assum-
ing , and for , otherwise
we would have (condition (4) of
delay paths of which is a prefix) and then
, which contradicts Lemma 11. Thus, all
elements , with are computed pointwise (i.e
), making conditions (2,3 and 5) immedi-
ate consequences of Lemma 11. Condition (1) is triv-
ially observed ( is a prefix of ); and condition (4)
from the IH and because upon a join values can only
increase.
We can finally state Lemma 2.
Lemma (2). For every pair of stamps $a$ and $b$, and every index $i$,
$v_a[a] \leq v_b[b]$ and $v_b[i] \leq v_a[i]$ implies $v_a[i] \in \overline{v_b}$.
Proof. Direct from Lemma 12.
B Proof of Lemma 8
Lemma 8 says that each principal order is already contained
in some cached order on the primary. Note that Lemma 2
already states that every principal element belongs to
the primary principal vector, and delay paths were used to
show where it can be found. Now, we will show that it is
precisely in the primary cached order located in the position
pointed out by the delay path between and that we can
find all the elements in . To prove this we need to reason
about cached orders along delay paths. This suggests an
extension of these to what we call principal delay paths.
Definition 13. A principal delay path for a stamp is a delay path between it and the primary that additionally satisfies the following condition: for every index in the path and
any stamp ,
implies or
and
We now prove the existence of principal delay paths by
extending the proof of existence in Lemma 12. Here we
only go through the cases that are relevant for the additional
condition.
Lemma 14. For every stamp there exists a principal
delay path.
Proof. (Sketch)
Consider the following additional arguments to the proof
of Lemma 12. If the last operation was (assume
):
:let . If is either or ,
we know that (since ). Let
. When , by condition (4), we
have or which determines that
. When is computed pointwise, the new
condition follows by the induction hypothesis.
:when and ,
let be the principal delay path for .
The new condition is verified since,
the case is trivial (because
). For , the new condition is satisfied
since (Proposition 5).
in this case the primary re-
sults from the synchronization of and (i.e. is the
primary before synchronization). Since , then
is computed pointwise. By IH we get a principal
delay path to which we apply Lemma 11 to get a
new sequence where and never occur (c.f. proof
of Lemma 12). The new condition follows by the in-
duction hypothesis.
Lemma (8). For every stamp $a$ there exists a position $j$ such that the principal order of $a$ is contained in the cached order $o_0[j]$ of the primary.
Proof. Let $p$ be the principal delay path for the stamp (given by Lemma 14). Instantiating the additional condition of Definition 13 on the last index of $p$, we get that the principal order of the stamp is contained in the corresponding cached order of the primary, as required.