Conference Paper

Bounded Version Vectors

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Version vectors play a central role in update tracking un- der optimistic distributed systems, allowing the detection of obsolete or inconsistent versions of replicated data. Ver- sion vectors do not have a bounded representation; they are based on integer counters that grow indenitely as updates occur. Existing approaches to this problem are scarce; the mechanisms proposed are either unbounded or operate only under specic settings. This paper examines version vec- tors as a mechanism for data causality tracking and claries their role with respect to vector clocks. Then, it introduces bounded stamps and proves them to be a correct alternative to integer counters in version vectors. The resulting mecha- nism, bounded version vectors, represents the rst bounded solution to data causality tracking between replicas subject to local updates and pairwise symmetrical synchronization.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... Causality tracking in a distributed system is a powerful concept in reasoning and analysing about a computation, and in the design of applications. It is key to solving a wide range of problems in distributed systems, such as in distributed algorithm design [CL85,Mat87], in tracking of dependent events [CM91,HW88,MN91], knowledge about the progress of the system [WB84,AAB04], and as a concurrency measure [CB89,Fid91]. An extensive discussion of causality and its applications can be found in Schwartz and Mattern [SM94]. ...
... In these cases, however, version vectors are overly expressive since, as vector clocks, they record information that enables them to assess causality and concurrency in the past of each possible consistent cut. In the typical case where past can be discarded, a bounded representation of causality can be achieved using Bounded Version Vectors [AAB04]. ...
... We refer the interested reader to the survey in [SM94] and to the historical notes in [BR02]. After an initial focus on message passing systems, recent developments have improved causality tracking for replicated data: they addressed efficient coding for groups of related objects [MT05]; bounded representation of version vectors [AAB04]; and the semantics of reconciliation [GKK + 06]. ...
... We refer the interested reader to the survey in [20] and to the historical notes in [4]. After an initial focus on message passing systems, recent developments have improved causality tracking for replicated data: they addressed efficient coding for groups of related objects [13]; bounded representation of version vectors [1]; and the semantics of reconciliation [10]. ...
... E.g. In version vector systems and in bounded version vectors [1] it models the atomic synchronization of two replicas. ...
Conference Paper
Causality tracking mechanisms, such as vector clocks and version vectors, rely on mappings from globally unique identifiers to integer counters. In a system with a well known set of entities these ids can be preconfigured and given distinct positions in a vector or distinct names in a mapping. Id management is more problematic in dynamic systems, with large and highly variable number of entities, being worsened when network partitions occur. Present solutions for causality tracking are not appropriate to these increasingly common scenarios. In this paper we introduce Interval Tree Clocks, a novel causality tracking mechanism that can be used in scenarios with a dynamic number of entities, allowing a completely decentralized creation of processes/replicas without need for global identifiers or global coordination. The mechanism has a variable size representation that adapts automatically to the number of existing entities, growing or shrinking appropriately. The representation is so compact that the mechanism can even be considered for scenarios with a fixed number of entities, which makes it a general substitute for vector clocks and version vectors.
... Bounded non-stabilizing solutions exist in the literature [1,21]. Selfstabilizing resettable vector clocks [3] consider distributed applications that are structured in phases and track causality merely within a bounded number of successive phases. ...
Chapter
Full-text available
Vector clock algorithms are basic wait-free building blocks that facilitate causal ordering of events. As wait-free algorithms, they are guaranteed to complete their operations within a finite number of steps. Stabilizing algorithms allow the system to recover after the occurrence of transient faults, such as soft errors and arbitrary violations of the assumptions according to which the system was designed to behave.
... Specifically, in asynchronous/partially synchronous systems, to identify whether two events could have happened at the same time, we need to use techniques such as vector clocks [5,6] that require O(n) space where n is the number of processes. Even though there are attempts to reduce the size [18,22], the worst case size is still O(n). By contrast, in fully synchronous systems, if two events happen at the same time on two different processes, we can conclude that they happened at the same time. ...
Article
Full-text available
Runtime verification focuses on analyzing the execution of a given program by a monitor to determine if it is likely to violate its specifications. There is often an impedance mismatch between the assumptions/model of the monitor and that of the underlying program. This constitutes problems especially for distributed systems, where the concept of current time and state are inherently uncertain. A monitor designed with asynchronous system model assumptions may cause false-positives for a program executing in a partially synchronous system: the monitor may flag a global predicate that does not actually occur in the underlying system. A monitor designed with a partially synchronous system model assumption may cause false negatives as well as false positives for a program executing in an environment where the bounds on partial synchrony differ (albeit temporarily) from the monitor model assumptions. In this paper we analyze the effects of the impedance mismatch between the monitor and the underlying program for the detection of conjunctive predicates. We find that there is a small interval where the monitor assumptions are hypersensitive to the underlying program environment. We provide analytical derivations for this interval, and also provide simulation support for exploring the sensitivity of predicate detection to the impedance mismatch between the monitor and the program under a partially synchronous system.
... Almeida et al. [3] propose a mechanism for bounding the size of the elements used in version vectors. It is intended for point-to-point communication and demands the transmission of a short list of previous versions. ...
Article
Full-text available
Many distributed services need to be scalable: internet search, electronic commerce, e-government(Formula presented.) In order to achieve scalability those applications rely on replicated components. Because of the dynamics of growth and volatility of customer markets, applications need to be hosted by adaptive systems. In particular, the scalability of the reliable multicast mechanisms used for supporting the consistency of replicas is of crucial importance. Reliable multicast may propagate updates in a pre-defined order (e.g., FIFO, total or causal). Since total order needs more communication rounds than causal order, the latter appears to be the preferable candidate for achieving multicast scalability, although the consistency guarantees based on causal order are weaker than those of total order. This paper provides a historical survey of different scalability approaches for reliable causal multicast protocols.
... rely on timestamps (TS) [80], vector clocks (VC) [94] or version vectors (VV) [8,99]. More recently, Sovran et al. [136] and Sciascia and Pedone [129, Section E] discovered independently the concept of vector timestamps (VTS) that allows the computation of partially consistent snapshots at the cost of communicating in the background with all replicas. ...
Article
In the first part, we study consistency in a transactional systems, and focus on reconciling scalability with strong transactional guarantees. We identify four scalability properties, and show that none of the strong consistency criteria ensure all four. We define a new scalable consistency criterion called Non-Monotonic Snapshot Isolation (NMSI), while is the first that is compatible with all four properties. We also present a practical implementation of NMSI, called Jessy, which we compare experimentally against a number of well-known criteria. We also introduce a framework for performing fair comparison among different transactional protocols. Our insight is that a large family of distributed transactional protocols have a common structure, called Deferred Update Replication (DUR). Protocols of the DUR family differ only in behaviors of few generic functions. We present a generic DUR framework, called G-DUR. We implement and compare several transactional protocols using the G-DUR framework.In the second part, we focus on ensuring consistency in non-transactional data stores. We introduce Tuba, a replicated key-value store that dynamically selects replicas in order to maximize the utility delivered to read operations according to a desired consistency defined by the application. In addition, unlike current systems, it automatically reconfigures its set of replicas while respecting application-defined constraints so that it adapts to changes in clients’ locations or request rates. Compared with a system that is statically configured, our evaluation shows that Tuba increases the reads that return strongly consistent data by 63%.
... Numerous instances of the above abstraction (V, <) have been proposed in the past. These include timestamps [14], vector clocks [17], version vectors [18, 19] , or more recently , version vectors with exception [20] and interval tree clocks [21]. Depending on how concurrency is tracked and for which purposes, the dimension of V may vary from a ...
Conference Paper
Full-text available
The ability to access and query data stored in multiple versions is an important asset for many applications, such as Web graph analysis, collaborative editing platforms, data forensics, or correlation mining. The storage and retrieval of versioned data requires a specific API and support from the storage layer. The choice of the data structures used to maintain versioned data has a fundamental impact on the performance of insertions and queries. The appropriate data structure also depends on the nature of the versioned data and the nature of the access patterns. In this paper we study the design and implementation space for providing versioning support on top of a distributed key-value store (KVS). We define an API for versioned data access supporting multiple writers and show that a plain KVS does not offer the necessary synchronization power for implementing this API. We leverage the support for listeners at the KVS level and propose a general construction for implementing arbitrary types of data structures for storing and querying versioned data. We explore the design space of versioned data storage ranging from a flat data structure to a distributed sharded index. The resulting system, ALEPH, is implemented on top of an industrial-grade open-source KVS, Infinispan. Our evaluation, based on real-world Wikipedia access logs, studies the performance of each versioning mechanisms in terms of load balancing, latency and storage overhead in the context of different access scenarios.
... Almeida et al. [3] address the problem of efficiently representing version vectors when replicas can leave and join the system. Almeida, and Baquero[1] show that version vectors can be represented using only a bounded amount of information per replica, when they are only being used to check dominance or conflict between current versions of data in a distributed system. Other work [24, 28, ?] has focused on improving the way vector clock schemes scale with the number of replicas. ...
Conference Paper
Full-text available
Current techniques for reconciling disconnected changes to optimistically replicated data often use version vectors or related mechanisms to track causal histories. This allows the system to tell whether the value at one replica dominates another or whether the two replicas are in conflict. However, current algorithms do not provide entirely satisfactory ways of repairing conflicts. The usual approach is to introduce fresh events into the causal history, even in situations where the causally independent values at the two replicas are actually equal. In some scenarios these events may later conflict with each other or with further updates, slowing or even preventing convergence of the whole system. To address this issue, we enrich the set of possible actions at a replica to include a notion of explicit conflict resolution between existing events, where the user at a replica declares that one set of events dominates another, or that a set of events are equivalent. We precisely specify the behavior of this refined replication framework from a user’s point of view and show that, if communication is assumed to be “reciprocal” (with pairs of replicas exchanging information about their current states), then this specification can be implemented by an algorithm with the property that the information stored at any replica and the sizes of the messages sent between replicas are bounded by a polynomial function of the number of replicas in the system.
Chapter
Version vectors constitute an essential feature of distributed systems that enable the computing elements to keep track of causality between the events of the distributed systems. In this article, we study a variant named Bounded Version Vectors. We define the semantics of version vectors using the framework of Mazurkiewicz traces. We use these semantics along with the solution for the gossip problem to come up with a succinct bounded representation of version vectors in distributed environments where replicas communicate via pairwise synchronization.
Article
Hybrid vector clock(s) (HVC) provide a mechanism to combine the theory and practice of distributed systems. Improving on traditional vector clock(s) (VC), HVC utilizes synchronized physical clocks to reduce the size by focusing only on causality where the physical time associated with two events is within a given uncertainty window ε and letting physical clock alone determine the order of events that are outside the uncertainty window. In this paper, we develop a model for determining the bounds on the size of HVC. Our model uses four parameters, ε: uncertainty window, 8: message delay, a: communication frequency and n: number of nodes in the system. We derive the size of HVC in terms of a delay differential equation, and show that the size predicted by our model is almost identical to the results obtained by simulation. We also identify closed form solutions that provide tight lower and upper bounds for useful special cases. We show that for many practical applications and deployment environments in Amazon EC2, the size of HVC remains only as a couple entries and substantially less than n. Finally, although the analytical results rely on a specific communication pattern they are useful in evaluating size of HVC in different communication scenarios.
Conference Paper
Eventual consistency is a relaxation of strong consistency that guarantees that if no new updates are made to a replicated data object, then all replicas will converge. The conflict free replicated datatypes (CRDTs) of Shapiro et al. are data structures whose inherent mathematical structure guarantees eventual consistency. We investigate a fundamental CRDT called Observed-Remove Set (OR-Set) that robustly implements sets with distributed add and delete operations. Existing CRDT implementations of OR-Sets either require maintaining a permanent set of “tombstones” for deleted elements, or imposing strong constraints such as causal order on message delivery. We formalize a concurrent specification for OR-Sets without ordering constraints and propose a generalized implementation of OR-sets without tombstones that provably satisfies strong eventual consistency. We introduce Interval Version Vectors to succinctly keep track of distributed time-stamps in systems that allow out-of-order delivery of messages. The space complexity of our generalized implementation is competitive with respect to earlier solutions with causal ordering. We also formulate k-causal delivery, a generalization of causal delivery, that provides better complexity bounds.
Article
A large family of distributed transactional protocols have a common structure, called Deferred Update Replication (DUR). DUR provides dependability by replicating data, and performance by not re-executing transactions but only applying their updates. Protocols of the DUR family differ only in behaviors of few generic functions. Based on this insight, we offer a generic DUR middleware, called G-DUR, along with a library of finely-optimized plug-in implementations of the required behaviors. This paper presents the middleware, the plugins, and an extensive experimental evaluation in a geo-replicated environment. Our empirical study shows that:(i) G-DUR allows developers to implement various transactional protocols under 600 lines of code; (ii) It provides a fair, apples-to-apples comparison between transactional protocols; (iii) By replacing plugs-ins, developers can use G-DUR to understand bottlenecks in their protocols; (iv) This in turn enables the improvement of existing protocols; and (v) Given a protocol, G-DUR helps evaluate the cost of ensuring various degrees of dependability.
Article
This document describes a Ph.D. level course, corresponding to a Curricular Unit covering the Theory of Distributed Computing, currently running in the joint MAP-i doctoral programme in Informatics, organized by three Portuguese universities (Minho, Aveiro and Porto). This course has been taught in the two previous editions of the MAP-i pro-gramme. Lecture material of the previous edition can be found in the slides folder at http://gsd.di.uminho.pt/teaching/DC/2008/. This course has been submitted for accreditation by the CMU doctoral pro-gramme in August 2008 and awaits for the process' outcome. The current proposal for the 2009-2010 edition fundamentally builds on the previous course further emphasizing on critical systems with deeper study of agreement problems, state-of-the-art formal modeling of timed asynchronous networks, and real-time systems. Moreover, the proponent team has been ex-tended encompassing now researchers from two universities, Minho and Porto.
Article
Conflicts naturally arise in optimistically replicated systems. The common way to detect update conflicts is via version vectors, whose storage and communication overhead are number of replicas × number of objects. These costs may be prohibitive for large systems. This paper presents predecessor vectors with exceptions (PVEs), a novel optimistic replication technique developed for Microsoft’s WinFS system. The paper contains a systematic study of PVE’s performance gains over traditional schemes. The results demonstrate a dramatic reduction of storage and communication overhead in normal scenarios, during which communication disruptions are infrequent. Moreover, they identify a cross-over threshold in communication failure-rate, beyond which PVEs loses efficiency compared with traditional schemes.
Conference Paper
Full-text available
The large and growing number of computing devices used by individuals has caused the challenges of distributed storage to take on increased importance. In addition to desktops and laptops, portable devices, such as cell phones, digital cameras, iPods, and PDAs are capable of storing and sharing data. These devices’ mobility coupled with wireless networking capabilities allows them to opportunistically propagate data to other devices they might encounter, even if connectivity is unplanned and transient. To take advantage of such ad hoc connectivity, devices must have an efficient method for determining which files they hold in common and which versions must be propagated to achieve consistency. This paper presents new techniques for reducing the cost of this data synchronization operation by over an order of magnitude in many situations.
Conference Paper
Full-text available
Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Code File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability.
Article
Full-text available
Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability.
Article
Full-text available
of the Dissertation Roam: A Scalable Replication System for Mobile and Distributed Computing by David Howard Ratner Doctor of Philosophy in Computer Science University of California, Los Angeles, 1998 Professor Gerald J. Popek, Co-chair Professor W. W. Chu, Co-chair Mobile computing is rapidly becoming a way of life. Users carry their laptops, PDAs, and other portable devices with them almost constantly, whether their mobility takes them across town or across the world. Recent hardware innovations and improvements in chip technology have made mobile computing truly feasible. Unfortunately, unlike the hardware industry, much of today's system software is not "mobile-ready." Such is the case with the replication service. Nomadic users require replication to store copies of critical data on their mobile machines, since disconnected or poorly connected machines must rely primarily on local resources. However, the existing replication services are designed for stationary environments, and do not provide mobile users with the capabilities they require. Replication in mobile environments requires fundamentally different solutions than those previously proposed, because nomadicity presents a fundamentally new and different computing paradigm. Mobile users require several key features from the replication system: the ability for direct synchronization between any two replicas, the capability for widespread scaling and large numbers of replicas, and detailed control over what files reside on their local (mobile) replica. Today's replication systems do not and cannot provide these features. Therefore, mobile users must adapt their behavior to match the service provided to them, complicating their ability to work while mobile and effectively hindering ...
Article
We propose efficient algorithms to maintain a replicated dictionary using a log in an unreliable network. A non-serializable approach is used to achieve high concurrency. The solutions are resilient to both node and communication failures. Optimizations are developed for networks which are not completely connected.
Conference Paper
Vector clocks (VC) are an inherent component of a rich class of distributed applications. In this paper, we consider the problem of realistic —more specifically, bounded-space and fault-tolerant— implementation of these client applications. To this end, we generalize the notion of VC to resettable vector clocks (RVC), and provide a realistic implementation of RVC. Further, we identify an interface contract under which our RVC implementation can be substituted for VC in client applications, without affecting the client's correctness. Based on such substitution, we show how to transform the client so that it is itself realistically implemented; we demonstrate our method in the context of Ricart-Agrawala's mutual exclusion program.
Conference Paper
We present a simple algorithm for maintaining a replicated distributed dictionary which achieves high availability of data, rapid processing of atomic actions, efficient utilization of storage, and tolerance to node or network failures including lost or duplicated messages. It does not require transaction logs, synchronized clocks, or other complicated mechanisms for its operation. It achieves consistency contraints which are considerably weaker than serial consistency but nonetheless are adequate for many dictionary applications such as electronic appointment calendars and mail systems. The degree of consistency achieved depends on the particular history of operation of the system in a way that is intuitive and easily understood. The algorithm implements a "best effort" approximation to full serial consistency, relative to whatever internode communication has successfully taken place, so the semantics are fully specified even under partial failure of the system. Both the correctness of the algorithm and the utility of such weak semantics depend heavily on special properties of the dictionary operations.
Article
Timestamping is a common method of totally ordering events in concurrent programs. However, for applications requiring access to the global state, a total ordering is inappropriate. This paper presents algorithms for timestamping events in both synchronous and asynchronous message-passing programs that allow for access to the partial ordering inherent in a parallel system. The algorithms do not change the communications graph or require a central timestamp issuing authority.
Article
In a Distributed System with N sites, the precise detection of causal relationships between events can only be done with vector clocks of size N. This gives rise to scalability and efficiency problems for logical clocks that can be used to order events accurately. In this paper we propose a class of logical clocks called plausible clocks that can be implemented with a number of components not affected by the size of the system and yet they provide good ordering accuracy. We develop rules to combine plausible clocks to produce more accurate clocks. Several examples of plausible clocks and their combination are presented. Using a simulation model, we evaluate the performance of these clocks. We also present examples of applications where constant size clocks can be used.
Conference Paper
Replication improves the performance and availability of sharing information in a large-scale network. Classical, pessimistic replication incurs network access before any access, in order to avoid conflicts and resulting stale reads and lost writes. Pessimistic protocols assume some central locking site or necessitate distributed consensus. The protocols are fragile in the presence of network failures, partitioning, or denial-of-service attacks. They are safe (i.e., stale reads and lost writes do not occur) but at the expense of performance and availability, and they do not scale well.
Conference Paper
Traditional version vectors can be used to optimize peer-to-peer synchronization for pervasive computing devices. However, their storage overhead may be a prohibitive factor in scalability in an environment with a typically low communication bandwidth and a relatively small storage memory. We present a dynamic version vector design that allows small data sizes for version vector items. We call it the lightweight version vector (LVV) approach, and argue that a step increase method of LVV can be an effective solution for peer-to-peer synchronization of pervasive computing devices
Article
We propose efficient algorithms to maintain a replicated dictionary using a log in an unreliable network. A non-serializable approach is used to achieve high concurrency. The solutions are resilient to both node and communication failures. Optimizations are developed for networks which are not completely connected.
Article
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
Article
Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional ?pessimistic? ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems ? ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence ?and provides a comprehensive survey of techniques developed for addressing these challenges.
Conference Paper
We introduce the hash history mechanism for capturing dependencies among distributed replicas. Hash histories, consisting of a directed graph of version hashes, are independent of the number of active nodes but dependent on the rate and number of modifications. We present the basic hash history scheme and discuss mechanisms for trimming the history over time. We simulate the efficacy of hash histories on several large CVS traces. Our results highlight a useful property of the hash history: the ability to recognize when two different non-commutative operations produce the same output, thereby reducing false conflicts and increasing the rate of convergence. We call these events coincidental equalities and demonstrate that their recognition can greatly reduce the time to global convergence.
Conference Paper
Version vectors and their variants play a central role in update tracking in optimistic distributed systems. Existing mechanisms for a variable number of participants use a mapping from identities to integers, and rely on some form of global configuration or distributed naming protocol to assign unique identifiers to each participant. These approaches are incompatible with replica creation under arbitrary partitions, a typical mode of operation in mobile or poorly connected environments. We present an update tracking mechanism that overcomes this limitation; it departs from the traditional mapping and avoids the use of integer counters, while providing all the functionality of version vectors in what concerns version tracking.
Conference Paper
Nomadic users require replication to store copies of critical data on their mobile machines while disconnected or poorly connected. Existing replication services do not provide all classes of mobile users with the capabilities they require, which include: the ability for direct synchronization between any two replicas, support for large numbers of replicas, and detailed control over what files reside on their local (mobile) replica. Mobile users must adapt their behavior to match the level of service provided by today's replication systems, thereby hindering mobility and costing additional time, money, and systems management. Roam is a replication system designed to satisfy the requirements of the mobile user. Roam is based on the Ward Model, a replication architecture for mobile environments. Using the Ward Model and new distributed algorithms, Roam provides a scalable replication solution for the mobile user. We describe the motivation, design, and implementation of Roam and report its performance.
Article
Many distributed systems are now being developed to provide users with convenient access to data via some kind of communications network. In many cases it is desirable to keep the system functioning even when it is partitioned by network failures. A serious problem in this context is how one can support redundant copies of resources such as files (for the sake of reliability) while simultaneously monitoring their mutual consistency (the equality of multiple copies). This is difficult since network faiures can lead to inconsistency, and disrupt attempts at maintaining consistency. In fact, even the detection of inconsistent copies is a nontrivial problem. Naive methods either 1) compare the multiple copies entirely or 2) perform simple tests which will diagnose some consistent copies as inconsistent. Here a new approach, involving version vectors and origin points, is presented and shown to detect single file, multiple copy mutual inconsistency effectively. The approach has been used in the design of LOCUS, a local network operating system at UCLA.
Article
A distributed system can be characterized by the fact that the global state is distributed and that a common time base does not exist. However, the notion of time is an important concept in every day life of our decentralized "real world" and helps to solve problems like getting a consistent population census or determining the potential causality between events. We argue that a linearly ordered structure of time is not (always) adequate for distributed systems and propose a generalized non-standardmodel of time which consists of vectors of clocks. These clock-vectors arepartially orderedand form a lattice. By using timestamps and a simple clock update mechanism the structureofcausality is represented in an isomorphic way. The new model of time has a close analogy to Minkowski's relativistic spacetime and leads among others to an interesting characterization of the global state problem. Finally, we present a new algorithm to compute a consistent global snapshot of a distributed system where messages may bereceived out of order.
Article
Matrix clocks have nice properties that can be used in the context of distributed database protocols and fault tolerant protocols. Unfortunately, they are costly to implement, requiring storage and communication overhead of size O(n2)\mathcal{O}(n^2 ) for a system of n sites. They are often considered a non feasible approach when the number of sites is large. In this paper, we firstly describe an efficient incremental algorithm to compute the matrix clock, which achieves storage and communication overhead of size O(n)\mathcal{O}(n) when the sites of the computation are “well synchronized”. Secondly, we introduce the k-matrix clock: an approximation to the genuine matrix clock that can be computed with a storage and communication overhead of size O(kn)\mathcal{O}(kn). k-matrix clocks can be useful to implement faulttolerant protocols for systems with crash failure semantics such that the maximum number of simultaneous faults is bounded by k−1.
Article
As we approach nation-wide integration of computer systems, it is clear that file replication will play a key role, both to improve data availability in the face of failures, and to improve performance by locating data near where it will be used. We expect that future file systems will have an extensible, modular structure in which features such as replication can be "slipped in" as a transparent layer in a stackable layered architecture. We introduce the Ficus replicated file system for NFS and show how it is layered on top of existing file systems. The Ficus file system differs from previous file replication services in that it permits update during network partition if any copy of a file is accessible. File and directory updates are automatically propagated to accessible replicas. Conflicting updates to directories are detected and automatically repaired; conflicting updates to ordinary files are detected and reported to the owner. The frequency of communications outages rendering ...
Article
We introduce the hash history mechanism for capturing dependencies among distributed replicas. Hash histories, consisting of a directed graph of version hashes, are independent of the number of active nodes but dependent on the rate and number of modifications. We present the basic hash history scheme and discuss mechanisms for trimming the history over time. We simulate the efficacy of hash histories on several large CVS traces. Our results highlight a useful property of the hash history: the ability to recognize when two different non-commutative operations produce the same output, thereby reducing false conflicts and increasing the rate of convergence. We call these events coincidental equalities and demonstrate that their recognition can greatly reduce the time to global convergence.
Article
inactive entries in version vectors (VVs). This algorithm lets each node remove inactive VV entries without any coordination with other nodes. It achieves this feature by devising a new way to compare two version vectors based on loosely synchronized clocks and placing a timing restriction on the behavior of the application. VVs computed by our algorithm can accurately and completely capture the "happened-before" relation between events just like ordinary VVs. This paper proves the correctness of our algorithm as well.
Conference Paper
We present a new replication algorithm that supports replication of a large number of objects on a diverse set of nodes. The algorithm allows replica sets to be changed dynamically on a per-object basis. It tolerates most types of failures, including multiple node failures, network partitions, and sudden node retirements. These advantages make the algorithm particularly attractive in large cluster-based data services that experience frequent failures and configuration changes. We prove the correctness of the algorithm and show that its performance is near-optimal.
Article
: The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The relative merits and limitations of the different approaches are discussed, and their general feasibility is analyzed. Keywords: Distributed Computation, Causality, Distributed System, Causal Ordering, Logical Time, Vector Time, Global Predicate Detection, Distributed Debugging 1 Introduction Today, distributed and parallel systems are generally available, and their technology has reached a certain degree of maturity. Unfortunately, we still lack complete understanding of how to design, realize, and test the software for such system...
Article
In this paper, we show the connection between vector clocks used in distributed computing and dimension theory of partially ordered sets. Based on this connection, we provide lower bounds on the number of coordinates for timestamping events in a distributed computation for capturing the happened-before relation. To this end, we introduce the notion of a string realizer and the string dimension of a poset. For distributed computing and other applications, the concept of string realizer is more natural than the chain realizer used in the classical dimension theory. We establish the relationship between the string dimension and the chain dimension of a poset. Using this relationship and Dilworth's theorem for the chain dimension of nite distributive lattices, we obtain the desired lower bound. The concept of strings also has applications in ecient encoding of partial orders because it requires fewer bits to encode a string realizer than a chain realizer. 1.
Versionstamps–decentralizedversionvectors. InProceed-ings of the
  • Paulo Sérgio
  • Carlos Almeida
  • Baquero
  • Fonte
Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte. Versionstamps–decentralizedversionvectors. InProceed-ings of the 22nd International Conference on Distributed Computing Systems (ICDCS), pages 544–551. IEEE Com-puter Society, 2002.
Disconnectedopera-tion in the codafile system
  • James Kistler
  • Andm Satyanarayanan
James Kistler andM. Satyanarayanan. Disconnectedopera-tion in the codafile system. ACM Transactionon Computer Systems, 10(1):3–25,February 1992.