Article

Restricted failure detectors: Definition and reduction protocols


Abstract

This paper investigates unreliable failure detectors with restricted properties, in the context of asynchronous distributed systems made up of n processes where at most f may crash. “Restricted” means that the completeness and the accuracy properties defining a failure detector class are not required to involve all the correct processes but only k and k′ of them, respectively (k are involved in the completeness property, and k′ in the accuracy property). These restricted properties define the classes R(k,k′) and ♢R(k,k′) of unreliable failure detectors. A reduction protocol that transforms a restricted failure detector into its non-restricted counterpart is presented. It is shown that the reduction requires k+k′>n (to be safe) and max(k,k′)≤n−f (to be live). So, when these two conditions are satisfied, R(k,k′) and ♢R(k,k′) are equivalent to Chandra–Toueg's failure detector classes S and ♢S, respectively. This theoretical transformation is also interesting from a practical point of view because the restricted properties are usually easier to satisfy than their non-restricted counterparts in asynchronous distributed systems.
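The two conditions stated in the abstract can be checked mechanically. The following is a minimal sketch only: the function name `reduction_conditions` and the shape of its result are illustrative assumptions, not part of the paper, and the code tests the stated inequalities rather than implementing the reduction protocol itself.

```python
def reduction_conditions(n: int, f: int, k: int, k_prime: int) -> dict:
    """Evaluate the two conditions quoted in the abstract.

    n        total number of processes
    f        maximum number of processes that may crash
    k        number of processes involved in the completeness property
    k_prime  number of processes involved in the accuracy property
    """
    safe = k + k_prime > n            # k + k' > n   (safety of the reduction)
    live = max(k, k_prime) <= n - f   # max(k, k') <= n - f   (liveness)
    return {"safe": safe, "live": live, "reducible": safe and live}


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    print(reduction_conditions(n=7, f=2, k=4, k_prime=5))
```

With n = 7 and f = 2, the pair k = 4, k′ = 5 satisfies both inequalities, whereas k = 3, k′ = 4 fails the safety condition because k + k′ = n.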


... Other types of failure detectors. Restricted failure detectors are studied in [34]. Unreliable failure detectors with a limited scope accuracy are investigated in [3]. ...
Conference Paper
Full-text available
Unreliable failure detectors are abstract devices that, when added to asynchronous distributed systems, allow to solve distributed computing problems (e.g., Consensus) that otherwise would be impossible to solve in these systems. This paper focuses on two classes of failure detectors defined by Chandra and Toueg, namely, the classes denoted ◊P (eventually perfect) and ◊S (eventually strong). Both classes include failure detectors that eventually detect permanently all process crashes, but while the failure detectors of ◊P eventually make no erroneous suspicions, the failure detectors of ◊S are only required to eventually not suspect a single correct process. In such a context, this paper addresses the following question related to the comparative power of these classes, namely: “Are there one-shot agreement problems that can be solved in asynchronous distributed systems with reliable links but prone to process crash failures augmented with ◊P, but cannot be solved when those systems are augmented with ◊S?” Surprisingly, the paper shows that the answer to this question is “no”. An important consequence of this result is that ◊P cannot be the weakest class of failure detectors that enables solving one-shot agreement problems in unreliable asynchronous distributed systems. These results are then extended to the case of more severe failure modes.
... Failure detectors where both the completeness property and the accuracy property hold on subsets of correct processes have been investigated by Raynal and Tronel [13]. Their completeness and accuracy properties are restricted in the following sense: they are not required to involve all the correct processes but only k_1 and k_2 of them (k_1 are involved in the completeness property, and k_2 in the accuracy property). ...
Conference Paper
Full-text available
Let the scope of the accuracy property of an unreliable failure detector be the minimum number (k) of processes that may not erroneously suspect a correct process to have crashed. Classical failure detectors implicitly consider a scope equal to n (the total number of processes). This paper investigates accuracy properties with limited scope, thereby giving rise to the S_k and ◊S_k classes of failure detectors. A reduction protocol transforming any failure detector belonging to S_k (resp. ◊S_k) into a failure detector (without limited scope) of the class S (resp. ◊S) is given. This reduction protocol requires f < k, where f is the maximum number of process crashes. (This leaves open the problem to prove/disprove that this condition is necessary.) Then, the paper studies the consensus problem in asynchronous distributed message-passing systems equipped with a failure detector of the class ◊S_k. It presents a simple consensus protocol that is explicitly based on ◊S_k. This protocol requires f < min(k, n/2).
... Unreliable failure detectors with limited completeness and/or accuracy are studied in [3], [14], [20], [21], [24], [25]. Realistic failure detectors have been introduced and investigated in [8]. ...
Article
Full-text available
Unreliable failure detectors were proposed by Chandra and Toueg as mechanisms that provide information about process failures. Chandra and Toueg defined eight classes of failure detectors, depending on how accurate this information is, and presented an algorithm implementing a failure detector of one of these classes in a partially synchronous system. This algorithm is based on all-to-all communication and periodically exchanges a number of messages that is quadratic in the number of processes. We study the implementability of different classes of failure detectors in several models of partial synchrony. We first show that no failure detector with perpetual accuracy (namely, P, Q, S, and W) can be implemented in these models in systems with even a single failure. We also show that, in these models of partial synchrony, a majority of correct processes is necessary to implement a failure detector of the class θ proposed by Aguilera et al. Then, we present a family of distributed algorithms that implement the four classes of unreliable failure detectors with eventual accuracy (namely, ◊P, ◊Q, ◊S, and ◊W). Our algorithms are based on a logical ring arrangement of the processes, which defines the monitoring and failure information propagation pattern. The resulting algorithms periodically exchange at most a linear number of messages.
... One drawback of global oracles is that communication overhead can limit their practicality for large-scale networks. Accordingly, scope-restricted oracles have been proposed that provide information only about subsets of processes [11, 12, 13]. Our dining solution uses a variant of ◊P defined in [14, 15] for which suspect information is only provided about immediate neighbors. ...
Conference Paper
We explore dining philosophers under eventual weak exclusion in environments subject to permanent crash faults. Eventual weak exclusion permits neighboring diners to eat concurrently only finitely many times, but requires that, for each run, there exists a (potentially unknown) time after which live neighbors never eat simultaneously. This safety property models systems where resources can be recovered from crashed processes or where sharing violations precipitate only transient faults. Although dining under eventual weak exclusion can be solved in synchronous environments, the problem is unsolvable in asynchronous systems, where crash faults can precipitate permanent starvation failures among other diners. We present the first wait-free solution to this problem under partial synchrony, along with a careful proof of correctness. Our solution is related to the hygienic algorithm of Chandy and Misra insofar as we use forks for safety and a dynamic ordering of process priorities for progress. Additionally, we use a local refinement of the eventually perfect failure detector ◊P_1 to guarantee wait-freedom in the presence of crash faults. This localized oracle provides information only about immediate neighbors, and, as such, is fundamental to the scalability of our approach, since it is implementable in sparse communication graphs that are partitionable by crash faults. Our results have potential applications to duty-cycle scheduling in sensor networks, as well as distributed daemon refinement for self-stabilizing algorithms.
... The second considers a failure detector of the class ◊S_k and requires f < max(K, max_{1≤α≤K}(min(n − α⌊n/(α + 1)⌋, α + k − 1))). Failure detectors where both the completeness property and the accuracy property hold on subsets of correct processes have been investigated by Raynal and Tronel [13]. Their completeness and accuracy properties are restricted in the following sense: they are not required to involve all the correct processes but only k_1 and k_2 of them (k_1 are involved in the completeness property, and k_2 in the accuracy property). ...
Article
Unreliable failure detectors are oracles that give information about process failures. Chandra and Toueg were first to study such failure detectors for distributed systems, and they identified a number that enabled the solution of the Consensus problem in asynchronous distributed systems. This paper focuses on two of these, denoted S (strong) and ◊S (eventually strong). The characteristics of a given unreliable failure detector are usually described by its completeness and accuracy properties. Completeness is a requirement on the actual detection of failures, while accuracy limits the mistakes a failure detector can make. Let the scope of the accuracy property of an unreliable failure detector be the minimum number (k) of processes that may not erroneously suspect a correct process to have crashed. Usual failure detectors implicitly consider a scope equal to n (the total number of processes). Accuracy properties with limited scope give rise to the classes of failure detectors that we call S_k and ◊S_k. This paper investigates the following question: "Given S_k and ◊S_k, under which condition is it possible to transform their failure detectors into their counterparts with unlimited accuracy, i.e., S and ◊S?" The paper answers this question in the following way. It first presents a particularly simple protocol that realizes such a transformation when f < k (where f is the maximum number of processes that may crash). Then, it shows that there is no reduction protocol when f ≥ k.
... A Failure Detector Oracle. Informally, a failure detector consists of a set of modules, each one attached to a process: the module attached to p_i maintains a set (named suspected_i) of processes it currently suspects to have crashed. We say "process p_i suspects process p_j" at some time, if at that time we have p_j ∈ suspected_i [7,24]. A failure detector class is formally defined by two abstract properties, namely a Completeness property and an Accuracy property. ...
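The module-per-process view described in this excerpt can be pictured with a small data structure. This is a rough sketch under stated assumptions: the class name `FailureDetectorModule` and its methods are hypothetical, and the code models only the local set of suspects, not how suspicions are produced or which completeness and accuracy properties hold.

```python
from typing import Set


class FailureDetectorModule:
    """Local failure detector module attached to one process.

    It keeps the set of processes that the owner currently suspects
    to have crashed (the set called "suspected" in the excerpt).
    """

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.suspected: Set[str] = set()

    def suspect(self, process: str) -> None:
        # Add a process to the local suspect set.
        self.suspected.add(process)

    def trust(self, process: str) -> None:
        # Withdraw a suspicion; an unreliable detector may change its mind.
        self.suspected.discard(process)

    def suspects(self, process: str) -> bool:
        # True exactly when the process is in the owner's suspect set.
        return process in self.suspected


if __name__ == "__main__":
    module = FailureDetectorModule("p1")
    module.suspect("p2")
    print(module.suspects("p2"))
    module.trust("p2")
    print(module.suspects("p2"))
```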
Article
This paper is an introduction to oracles the aim of which is to help solving distributed computing problems in asynchronous distributed systems prone to process crash failures and fair lossy channels. Actually, the combination of asynchrony and failures makes a lot of problems impossible to solve in unreliable asynchronous distributed systems. Hence, those systems have to be extended with appropriate oracles in order these problems become solvable. Using two such problems (namely, the design of a quiescent uniform reliable broadcast facility, and the consensus problem), this paper presents appropriate oracles allowing to solve these problems. In that sense, the paper is a guided tour to the definition of oracles suited to unreliable asynchronous distributed systems.
... The protocols implementing failure detectors in such systems obey the following principle: using successive approximations, each process dynamically determines a value ∆ that eventually becomes an upper bound on transfer delays. Restricted failure detectors are studied in [24]. Unreliable failure detectors with a limited scope accuracy are investigated in [3, 14, 21, 22, 28]. ...
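One common way to realize the "successive approximations" principle is to enlarge ∆ each time a suspicion turns out to be wrong. The sketch below assumes that policy; the class `AdaptiveTimeout`, the fixed additive increment, and the explicit `now` timestamps are illustrative choices and are not taken from the cited protocols.

```python
class AdaptiveTimeout:
    """Timeout-based suspicion with a delay estimate that only grows."""

    def __init__(self, initial_delta: float = 1.0, increment: float = 1.0) -> None:
        self.delta = initial_delta   # current estimate of the delay bound
        self.increment = increment   # assumed additive correction step
        self.last_heard = {}         # process -> time of last message received
        self.suspected = set()

    def on_message(self, sender: str, now: float) -> None:
        # A message from a suspected process reveals a false suspicion:
        # stop suspecting it and enlarge the estimate.
        if sender in self.suspected:
            self.suspected.discard(sender)
            self.delta += self.increment
        self.last_heard[sender] = now

    def check(self, now: float) -> set:
        # Suspect every known process that has been silent longer than delta.
        for process, heard_at in self.last_heard.items():
            if now - heard_at > self.delta:
                self.suspected.add(process)
        return set(self.suspected)


if __name__ == "__main__":
    detector = AdaptiveTimeout(initial_delta=1.0, increment=1.0)
    detector.on_message("q", now=0.0)
    print(detector.check(now=2.0))     # q has been silent for longer than delta
    detector.on_message("q", now=2.5)  # late message: the suspicion was wrong
    print(detector.delta)
```

If the real transfer delays are eventually bounded, ∆ stops growing once it exceeds that bound, after which no correct process is suspected by mistake.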
Conference Paper
Full-text available
Unreliable failure detectors introduced by Chandra and Toueg are abstract mechanisms that provide information on process failures. On the one hand, failure detectors allow to state the minimal requirements on process failures that allow to solve problems that cannot be solved in purely asynchronous systems. But, on the other hand, they cannot be implemented in such systems: their implementation requires that the underlying distributed system be enriched with additional assumptions. The usual failure detector implementations rely on additional synchrony assumptions (e.g., partial synchrony). This paper proposes a new look at the implementation of failure detectors and more specifically at Chandra-Toueg's failure detectors. The proposed approach does not rely on synchrony assumptions (e.g., it allows the communication delays to always increase). It is based on a query-response mechanism and assumes that the query/response messages exchanged obey a pattern where the responses from some processes to a query arrive among the (n - f) first ones (n being the total number of processes, f the maximum number of them that can crash, with 1 ≤ f < n). When we consider the particular case f = 1, and the implementation of a failure detector of the class denoted ◊S (the weakest class that allows to solve the consensus problem), the additional assumption the underlying system has to satisfy boils down to a simple channel property, namely, there is eventually a pair of processes (p_i, p_j) such that the channel connecting them is never the slowest among the channels connecting p_i or p_j to the other processes. A probabilistic analysis shows that this requirement is practically met in asynchronous distributed systems.
... One of them is based on random oracles (the progress of a process can be determined according to random numbers) [2,5]. Another family is based on the use of Chandra-Toueg's unreliable failure detectors [3,15] (failure detector-based consensus protocols are described in [3,9,11]). Interestingly, [11] presents a generic protocol that gives rise to a family of failure detector-based Consensus protocols. ...
Conference Paper
Full-text available
Atomic Broadcast (all processes deliver the same set of messages in the same order) is a very powerful communication primitive when one is interested in building fault-tolerant distributed systems. Moreover, it has been shown that Atomic Broadcast and Consensus are equivalent problems in asynchronous distributed systems prone to process crash failures. Hence, several Consensus-based Atomic Broadcast protocols have been designed. This paper introduces a new and particularly efficient Consensus-based Atomic Broadcast protocol. The efficiency is obtained by limiting the use of the Consensus subroutine to the cases where asynchrony and crashes prevent processes from obtaining a simple agreement on the message delivery order. The protocol assumes n>2f (where n is the number of processes and f the maximum number of them that can crash). In the most favorable cases, it requires two communication steps for processes to determine a message batch. In the worst case it requires an additional Consensus execution. It is shown that, when n>3f, the protocol can be simplified. It then requires a single communication step in the most favorable cases. This exhibits an interesting tradeoff relating the cost of the protocol with the maximum number of process failures
Conference Paper
Unreliable failure detectors are abstract devices that, when added to asynchronous distributed systems, allow to solve distributed computing problems (e.g., Consensus) that otherwise would be impossible to solve in these systems. This paper focuses on two classes of failure detectors defined by Chandra and Toueg, namely, the classes denoted ◊P (eventually perfect) and ◊S (eventually strong). Both classes include failure detectors that eventually detect permanently all process crashes, but while the failure detectors of ◊P eventually make no erroneous suspicions, the failure detectors of ◊S are only required to eventually not suspect a single correct process. In such a context, this paper addresses the following question related to the comparative power of these classes, namely: "Are there one-shot agreement problems that can be solved in asynchronous distributed systems with reliable links but prone to process crash failures augmented with ◊P, but cannot be solved when those systems are augmented with ◊S?" Surprisingly, the paper shows that the answer to this question is "no". An important consequence of this result is that ◊P cannot be the weakest class of failure detectors that enables solving one-shot agreement problems in unreliable asynchronous distributed systems. These results are then extended to the case of more severe failure modes.
Conference Paper
Full-text available
Let the scope of the accuracy property of an unreliable failure detector be the number x of processes that may not suspect a correct process. The scope notion gives rise to new classes of failure detectors among which we consider Sx and ◊Sx in this paper (usual failure detectors consider an implicit scope equal to n, the total number of processes). The k-set agreement problem generalizes the consensus problem: each correct process has to decide a value in such a way that a decided value is a proposed value, and the number of decided values is bounded by k. There exist protocols that solve this problem in asynchronous distributed systems when ƒ < k (where ƒ is the maximum number of processes that may crash). Moreover, it has been shown that there is no solution in those systems when ƒ ≥ k. The paper considers asynchronous distributed systems equipped with limited scope accuracy failure detectors. It studies conditions on n, ƒ, k and x that allow to solve the k-set agreement problem in those systems and presents two protocols. The first protocol solves the k-set agreement in asynchronous distributed systems augmented with a failure detector of the class Sx. It requires ƒ < k + x - 1. The second protocol works with any failure detector of the class ◊Sx. It actually defines a family of protocols. This family allows to solve the k-set agreement problem when ƒ < max(k, max_{1≤α≤k}(min(n - α⌊n/(α + 1)⌋, α + x - 1))). We conjecture that, when ƒ ≥ k, these conditions are necessary to solve the k-set agreement problem in asynchronous distributed systems equipped with failure detectors ∈ Sx or ◊Sx, respectively.
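The nested bound for the ◊Sx family is easier to read once evaluated for concrete parameters. The sketch below assumes the condition is the strict inequality ƒ < max(k, max over 1 ≤ α ≤ k of min(n − α⌊n/(α + 1)⌋, α + x − 1)); the relation symbol is an assumption on my part, and the function names are hypothetical.

```python
def diamond_sx_bound(n: int, k: int, x: int) -> int:
    """Evaluate max(k, max_{1<=a<=k} min(n - a*floor(n/(a+1)), a + x - 1))."""
    inner = max(
        min(n - alpha * (n // (alpha + 1)), alpha + x - 1)
        for alpha in range(1, k + 1)
    )
    return max(k, inner)


def tolerates(n: int, f: int, k: int, x: int) -> bool:
    # Assumed reading of the condition: f must be strictly below the bound.
    return f < diamond_sx_bound(n, k, x)


if __name__ == "__main__":
    # Illustrative parameters only.
    print(diamond_sx_bound(n=10, k=2, x=3))
    print(tolerates(n=10, f=3, k=2, x=3))
```

For k = 1 and x = n = 10 the bound evaluates to 5, which under the assumed strict inequality matches the familiar ƒ < n/2 requirement for consensus with ◊S.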
Conference Paper
This paper is a short and informal introduction to failure detector oracles for asynchronous distributed systems prone to process crashes and fair lossy channels. A distributed coordination problem (the implementation of Uniform Reliable Broadcast with a quiescent protocol) is used as a paradigm to visit two types of such oracles. One of them is a “guessing” oracle in the sense that it provides a process with information that the processes could only approximate if they had to compute it. The other is a “hiding” oracle in the sense that it allows to isolate and encapsulate the part of a protocol that has not the required behavioral properties. A quiescent uniform reliable broadcast protocol is described. The guessing oracle is used to ensure the “uniformity” requirement stated in the problem specification. The hiding oracle is used to ensure the additional “quiescence” property that the protocol behavior has to satisfy.
Conference Paper
Full-text available
Unreliable failure detectors, proposed by Chandra and Toueg [2], are mechanisms that provide information about process failures. In [2], eight classes of failure detectors were defined, depending on how accurate this information is, and an algorithm implementing a failure detector of one of these classes in a partially synchronous system was presented. This algorithm is based on all-to-all communication, and periodically exchanges a number of messages that is quadratic on the number of processes. To our knowledge, no other algorithm implementing these classes of unreliable failure detectors has been proposed. In this paper, we present a family of distributed algorithms that implement four classes of unreliable failure detectors in partially synchronous systems. Our algorithms are based on a logical ring arrangement of the processes, which defines the monitoring and failure information propagation pattern. The resulting algorithms periodically exchange at most a linear number of messages.
Conference Paper
Full-text available
This paper addresses the Consensus problem in asynchronous distributed systems (made of n processes, at most f of them may crash) equipped with unreliable failure detectors. A generic Consensus protocol is presented: it is quorum-based and works with any failure detector belonging to the class S (provided that f ≤ n − 1) or to the class ◊S (provided that f < n/2). This quorum-based generic approach for solving the Consensus problem is new (to our knowledge). Moreover, the proposed protocol is conceptually simple, allows early decision and uses messages shorter than previous solutions. The generic dimension and the surprising design simplicity of the proposed protocol provide a better understanding of the basic algorithmic structures and principles that allow to solve the Consensus problem with the help of unreliable failure detectors.
Article
Full-text available
Consensus is one of the most fundamental problems in the context of fault-tolerant distributed computing. The problem consists, given a set Ω of processes having each an initial value v_i, in deciding among Ω on a common value v. In 1985, Fischer, Lynch and Paterson proved that the consensus problem is not solvable in an asynchronous system subject to a single process crash. In 1991, Chandra and Toueg showed that, by augmenting the asynchronous system model with a well defined unreliable failure detector, consensus becomes solvable. They also give an algorithm that solves consensus using the ◊S failure detector. In this paper we propose a new consensus algorithm, also using the ◊S failure detector, that is more efficient than the Chandra-Toueg consensus algorithm. We measure efficiency by introducing the notion of latency degree, which defines the minimal number of communication steps needed to solve consensus. The Chandra-Toueg algorithm has a latency degree of 3 (it requires at least three communication steps), whereas our early consensus algorithm requires only two communication steps (latency degree of 2). We believe that this is an interesting result, which adds to our current understanding of the cost of consensus algorithms based on ◊S.
Article
Full-text available
The consensus problem involves an asynchronous system of processes, some of which may be unreliable. The problem is for the reliable processes to agree on a binary value. In this paper, it is shown that every protocol for this problem has the possibility of nontermination, even with only one faulty process. By way of contrast, solutions are known for the synchronous case, the “Byzantine Generals” problem.
Article
The Consensus problem is a fundamental paradigm for fault-tolerant asynchronous systems. It abstracts a family of problems known as Agreement (or Coordination) problems. Any solution to consensus can serve as a basic building block for solving such problems (e.g., atomic commitment or atomic broadcast). Solving consensus in an asynchronous system is not a trivial task: it has been proven (1985) by Fischer, Lynch and Paterson that there is no deterministic solution in asynchronous systems which are subject to even a single crash failure. To circumvent this impossibility result, Chandra and Toueg have introduced the concept of unreliable failure detectors (1991), and have studied how these failure detectors can be used to solve consensus in asynchronous systems with crash failures. This paper presents a new consensus protocol that uses a failure detector of the class S. Like previous protocols, it is based on the rotating coordinator paradigm and proceeds in asynchronous rounds. Simplicity and efficiency are the main characteristics of this protocol. From a performance point of view, the protocol is particularly efficient when, whether failures occur or not, the underlying failure detector makes no mistake (a common case in practice). From a design point of view, the protocol is based on the combination of three simple mechanisms: a voting mechanism, a small finite state automaton which manages the behavior of each process, and the possibility for a process to change its mind during a round.
Article
Devices]: Models of Computation ---automata; relations among models; F.1.2 [Computation by Abstract Devices]: Modes of Computation---parallelism and concurrency; H.2.4 [Database Management]: Systems---concurrency; distributed systems; transaction processing. General Terms: Algorithms, Reliability, Theory. Additional Key Words and Phrases: agreement problem, asynchronous systems, atomic broadcast, Byzantine Generals' problem, commit problem, consensus problem, crash failures, failure detection, fault-tolerance, message passing, partial synchrony, processor failures. A preliminary version of this paper appeared in Proceedings of the Tenth ACM Symposium on Principles of Distributed Computing, pages 325--340. ACM press, August 1991.