Automating the Addition of Fail-Safe Fault-Tolerance: Beyond Fusion-Closed Specifications

Felix C. Gärtner
École Polytechnique Fédérale de Lausanne (EPFL)
Departement de Systèmes de Communications
Laboratoire de Programmation Distribuée
CH-1015 Lausanne, Switzerland
fcg@acm.org
Arshad Jhumka
Technische Universität Darmstadt
Fachbereich Informatik
D-64283 Darmstadt, Germany
arshad@informatik.tu-darmstadt.de
April 11, 2003
Swiss Federal Institute of Technology (EPFL)
School of Computer and Communication Sciences
Technical Report IC/2003/23
Abstract
The fault tolerance theories of Arora and Kulkarni [3] and of Jhumka et
al. [11] view a fault-tolerant program as the result of composing a fault-intolerant
program with fault tolerance components called detectors and correctors. At
their core, the theories assume that the correctness specifications under consid-
eration are fusion closed. In general, fusion closure of specifications can be
achieved by adding history variables to the program. However, addition of his-
tory variables causes an exponential growth of the state space of the program,
causing addition of fault tolerance to be expensive. To redress this problem, we
present a method which can be used to add history information to a program in a
way that (in a certain sense) minimizes the additional states. Hence, automated
methods that add fault tolerance can now be efficiently applied in environments
where specifications are not necessarily fusion closed.
Keywords:
fault-tolerance, safety, fusion closure, specifications, transition systems, theory, exten-
sion
1 Introduction
It is an established engineering method in computer science to generate complicated
things from simpler things. The most obvious example of this is a compiler for a
programming language (like C). The compiler takes a high-level description
in the form of a C program and generates a sequence of machine code instructions that
perform the specified task. Of course, the original C program might be complicated too,
but it is at least easier to understand than the generated assembly code since it abstracts
away from the machine architecture and supports a more natural formulation of control
structures, etc.
Another area in which this technique has been applied is the area of fault-tolerant
systems. The goal is to start off with a system which is not fault-tolerant for certain
kinds of faults and use a sound procedure to transform it into a program which is
fault-tolerant. The approaches which have been proposed range from practical propos-
als like Schneider’s state machine approach [17] to theoretical studies like the one by
Basu et al. [5]. The former approach can be used to tolerate permanent faults in a cer-
tain number of replicated processes while the latter approach studies tolerance against
certain types of transient communication faults. Although these methods can be com-
bined, in general they seem a little oversized since they cannot be easily adapted to
other types of faults with finer granularity like a stuck-at-0 register.
To this end, Arora and Kulkarni [3] initially presented a method which can be used
to combat finer grained fault assumptions. Fault tolerance is achieved by composing a
fault-intolerant program with two types of fault-tolerance components called detectors
and correctors. Briefly, a detector is used to detect a certain (error) condition
on the system state and a corrector is used to bring the system into a valid state
again. Since common fault-tolerance methods like triple modular redundancy or
error-correcting codes can be modeled by using detectors and correctors, the theory can
be viewed as an abstraction of many existing fault tolerance techniques, including the
state machine approach.
Kulkarni and Arora [12] and more recently Jhumka et al. [11] proposed meth-
ods to automate the addition of detectors and correctors to a fault-intolerant program.
The basic idea of these methods is to perform a state space analysis of the fault-
affected program and change its transition relation in such a way that it still satisfies
its specification in the presence of faults. These changes result in either the removal
of transitions to satisfy a safety specification or the addition of transitions to satisfy
a liveness specification. Gärtner and Völzer [9] analyzed the assumptions behind the
original Kulkarni-Arora method and argued that it is based on two distinct forms of
redundancy: redundancy in space and redundancy in time. The former refers to non-
reachable states of the program while the latter refers to non-reachable transitions.
However, the detector/corrector method cannot be viewed as a method which “adds
redundancy” (like for example the state machine approach) because the redundancy is
already present in the fault intolerant program. This stems from the fact that Arora and
Kulkarni [3] assume that their correctness specifications are fusion closed.
Basically, fusion closure means that the next step of a program merely depends
on the current state and not on the previous history of the execution. For example,
given a program with a single variable $x \in \mathbb{N}$, the specification “never $x = 1$”
is fusion closed while the specification “$x = 4$ implies that previously $x = 2$” is
not. Specifications written in the popular Unity Logic [6] are fusion closed [10], as
are specifications consisting of state transition systems (like C programs). But general
temporal logic formulas which are usually used in the area of fault-tolerant program
synthesis and refinement [15, 16] are not. Arora and Kulkarni [3, p. 75] originally
argued that this assumption is not restrictive in the sense that for every non-fusion
closed specification there exists an “equivalent” specification which is fusion closed if
it is allowed to add history variables to the program. History variables are additional
control variables which are used to record the previous state sequence of an execution
and hence can be used to answer the question of, e.g., “has the program been in state
$x = 2$?”. Using such a history variable $h$, the example above which was not fusion
closed can be rephrased in a fusion-closed fashion as:
“never ($x = 4$ and $(x = 2) \notin h$)”
However, these history variables add states to the program and in effect add the neces-
sary redundancy to be fault-tolerant.
There are obvious “brute force” approaches on how to add history information like
the one sketched above where the history variable remembers the entire previous state
sequence of an execution. However, since history variables must be implemented,
they exponentially enlarge the state space of the fault-intolerant program. Rephrasing
this in the redundancy terminology of Gärtner and Völzer [9], history variables add
redundancy in space. Specifically, the history variables add exponential redundancy
in space, which is costly. So, we are interested in adding as little redundancy (i.e.,
as few additional states) as possible. Intuitively, the minimal amount of redundancy
which is necessary to tolerate a certain class of faults depends on the kind and nature
of the faults.
In this paper, we present a method to add history states to a program in a way
which (in general) avoids exponential growth of the state space. More specifically,
we start with a problem specification $\mathit{SPEC}_1$ which is not fusion closed, a program
$\Sigma_1$ which satisfies $\mathit{SPEC}_1$ and a class of faults $F$. Depending on $F$ we show how to
transform $\mathit{SPEC}_1$ and $\Sigma_1$ into $\mathit{SPEC}_2$ and $\Sigma_2$ in such a way that (a) $\mathit{SPEC}_2$ is fusion
closed, (b) $\Sigma_2$ can be made fault tolerant for $\mathit{SPEC}_2$ iff $\Sigma_1$ can be made fault tolerant
for $\mathit{SPEC}_1$, and (c) $\Sigma_2$ is (in a certain sense) minimal with respect to the added states.
We restrict our attention to cases where SPEC is a safety property and therefore are
only concerned with what Arora and Kulkarni call fail-safe fault-tolerance [3].
The proposed method has two benefits. First, it makes the methods
which automatically add detectors [11, 12] amenable to specifications which are not
fusion closed and closes a gap in the applicability of the detector/corrector theory [3].
Second, the presented method offers further insight into the efficiency of the
basic mechanisms which are applied in fault tolerance.
The paper is structured as follows: We first present some preliminary definitions
in Section 2 and then relate the assumption of fusion closure to the notion of state
space redundancy in Section 3. In Section 4 we study specifications which are not fu-
sion closed and present a method which makes these types of specifications efficiently
manageable in the context of automated methods which add fault tolerance. Finally,
Section 5 presents some open problems and directions for future work.
2 Formal Preliminaries
In this section we define the formal system model used throughout this paper.
2.1 States, Traces and Properties
The state space of a program is an unstructured finite nonempty set $C$ of states. A
state predicate over $C$ is a boolean predicate over $C$. A state transition over $C$ is a
pair $(r, s)$ of states from $C$.
In the following, let $C$ be a state set and $T$ be a state transition set. We define a
trace over $C$ to be a non-empty sequence $s_1, s_2, s_3, \ldots$ of states over $C$. We sometimes
use the notation $s_i$ to refer to the $i$-th element of a trace. Note that traces can be finite
or infinite. A trace is finite if its length is finite. We will always use Greek letters to
denote traces and normal lowercase letters to denote states. For two traces $\alpha$ and $\beta$,
we write $\alpha \cdot \beta$ to mean the concatenation of the two traces. We say that a transition $t$
occurs in some trace $\sigma$ if there exists an $i$ such that $(s_i, s_{i+1}) = t$.
We define a property over $C$ to be a set of traces over $C$. A trace $\sigma$ satisfies a
property $P$ iff $\sigma \in P$. If $\sigma$ does not satisfy $P$ we say that $\sigma$ violates $P$. There are two
important types of properties called safety and liveness [2, 13]. Informally, a
safety property demands that “something bad never happens” [13], i.e., it rules out a set
of unwanted trace prefixes. Mutual exclusion and deadlock freedom are two promi-
nent examples of safety properties. A liveness property on the other hand demands
that “something good will eventually happen” [13] and can be used to formalize, e.g.,
notions of termination. Since we are only concerned with safety properties we omit a
formal definition of liveness. Safety properties are formally defined as follows.
Definition 1 (safety property over $C$) A safety property $S$ over $C$ is a property over
$C$ for which the following holds: For each trace $\sigma$ which violates $S$ there exists a prefix
$\alpha$ of $\sigma$ such that for all traces $\beta$, $\alpha \cdot \beta$ violates $S$.
2.2 Programs, Specifications and Correctness
We define programs as state transition systems consisting of a state set $C$, a set of
initial states $I \subseteq C$ and a transition relation $T$ over $C$, i.e., a program (sometimes also
called system) is a triple $\Sigma = (C, I, T)$. The state predicate $I$ together with the state
transition set $T$ describes a safety property $S$, i.e., all traces which are constructable by
starting in a state in $I$ and using only state transitions from $T$. We denote this property
by safety-prop($\Sigma$). For brevity, we sometimes write $\Sigma$ instead of safety-prop($\Sigma$). A
state $s \in C$ of a program $\Sigma$ is reachable iff there exists a trace $\sigma \in \Sigma$ such that $s$
occurs in $\sigma$. Otherwise $s$ is non-reachable. Sometimes we will call a non-reachable
state redundant.
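As a minimal sketch of this definition, the following Python fragment computes the reachable and the redundant states of a program $\Sigma = (C, I, T)$ by breadth-first search. The function name `reachable_states` and the small example program are illustrative assumptions and are not taken from the report.

```python
from collections import deque

def reachable_states(initial, transitions):
    """Return all states that occur in some trace starting in `initial`."""
    successors = {}
    for src, dst in transitions:
        successors.setdefault(src, set()).add(dst)
    seen = set(initial)
    queue = deque(initial)
    while queue:
        state = queue.popleft()
        for nxt in successors.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical program: five states, initial state 0.
C = {0, 1, 2, 3, 4}
I = {0}
T = {(0, 1), (1, 2), (3, 4)}

reachable = reachable_states(I, T)
redundant = C - reachable  # non-reachable states
```

In this example the states 3 and 4 are redundant because no trace that starts in state 0 ever visits them.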
We define specifications to be properties, i.e., a specification over $C$ is a property
over $C$. A safety specification is a specification which is a safety property. Unlike
Arora and Kulkarni [3], we do not assume that problem specifications are fusion
closed. Fusion closure is defined as follows: Let $C$ be a state set, $s \in C$, $X$ be a property
over $C$, $\alpha$, $\gamma$ finite state sequences, and $\beta$, $\delta$, $\sigma$ be state sequences over $C$.
Definition 2 (fusion closed set) The set $X$ is fusion closed if the following holds: If
$\alpha \cdot s \cdot \beta$ and $\gamma \cdot s \cdot \delta$ are in $X$ then $\alpha \cdot s \cdot \delta$ and $\gamma \cdot s \cdot \beta$ are also in $X$.
It is easy to see that for every program $\Sigma$, safety-prop($\Sigma$) is fusion closed.
Intuitively, fusion closure means that the entire history of every trace is present in
every state of the trace. We will give examples for fusion closed and not fusion closed
specifications later.
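For finite sets of finite traces, Definition 2 can be checked mechanically. The following Python sketch is only an illustration under that restriction (properties in the sense of this report may contain infinite traces); the name `is_fusion_closed` and both example sets are assumptions made for the example.

```python
from itertools import product

def is_fusion_closed(traces):
    """Check Definition 2 on a finite set of finite traces (tuples of states)."""
    traces = set(traces)
    for t1, t2 in product(traces, repeat=2):
        for i, j in product(range(len(t1)), range(len(t2))):
            if t1[i] == t2[j]:
                # t1 = alpha.s.beta and t2 = gamma.s.delta share state s;
                # the fused trace alpha.s.delta must be in the set as well.
                if t1[:i + 1] + t2[j + 1:] not in traces:
                    return False
    return True

# Prefix-closed traces of a tiny transition system 0 -> 1 -> 2.
program_traces = {(0,), (0, 1), (0, 1, 2)}
# Two traces that satisfy "x = 4 implies that previously x = 2".
history_traces = {(0, 3, 2, 4), (2, 3, 4)}

closed_program = is_fusion_closed(program_traces)
closed_history = is_fusion_closed(history_traces)
```

Because the loop runs over ordered pairs, both fused traces required by the definition are covered. The second set fails the check since fusing its two traces at state 3 yields the trace 0, 3, 4, which is not in the set.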
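Let $\mathit{SPEC}$ be a specification and $\Sigma$ be a program over $C$. We say that $\Sigma$ satisfies
$\mathit{SPEC}$ iff all traces in $\Sigma$ satisfy $\mathit{SPEC}$. Consequently, we say that $\Sigma$ violates $\mathit{SPEC}$ iff
there exists a trace $\sigma \in \Sigma$ which violates $\mathit{SPEC}$.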
2.3 Extensions
Given some program $\Sigma_1 = (C_1, I_1, T_1)$ our goal is to define the notion of a fault-tolerant
version $\Sigma_2$ of $\Sigma_1$, meaning that $\Sigma_2$ does exactly what $\Sigma_1$ does in fault-free
scenarios and has additional fault-tolerance abilities which $\Sigma_1$ lacks. Sometimes,
$\Sigma_2 = (C_2, I_2, T_2)$ will have additional states (i.e., $C_2 \supseteq C_1$) and for this case we must define
what these states “mean” with respect to the original program $\Sigma_1$. This is done using
a state projection function $\pi : C_2 \mapsto C_1$ which tells which states of $\Sigma_2$ are “the same”
with respect to states of $\Sigma_1$. A state projection function can be naturally extended to
traces and properties, e.g., for a trace $s_1, s_2, \ldots$ over $C_2$ we have
$\pi(s_1, s_2, \ldots) = \pi(s_1), \pi(s_2), \ldots$
Definition 3 (extends) Let $\Sigma_1 = (C_1, I_1, T_1)$ and $\Sigma_2 = (C_2, I_2, T_2)$ be two programs.
Program $\Sigma_2$ extends program $\Sigma_1$ using state projection $\pi$ iff the following
conditions hold:
1. $C_2 \supseteq C_1$,
2. $\pi$ is a total mapping from $C_2$ to $C_1$ (for simplicity we assume that for any $s \in C_1$,
$\pi(s) = s$), and
3. $\pi(\text{safety-prop}(\Sigma_2)) = \text{safety-prop}(\Sigma_1)$.
Note that the concept of extension is related to the notion of refinement [1]. Ex-
tensions are refinements with the additional property that the original state space is
preserved and that there is no notion of stuttering [1].
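If $\Sigma_2$ extends $\Sigma_1$ using $\pi$ and $\Sigma_1$ satisfies $\mathit{SPEC}$ then obviously $\pi(\Sigma_2)$ satisfies
$\mathit{SPEC}$. When it is clear from the context that $\Sigma_2$ extends $\Sigma_1$ we will simply say that
“$\Sigma_2$ satisfies $\mathit{SPEC}$” instead of “$\pi(\Sigma_2)$ satisfies $\mathit{SPEC}$”.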
2.4 Fault Models and Fault-Tolerant Versions
Since we are concerned with fault tolerant systems we must have a way of modeling
faulty behavior. We define a fault model $F$ as being a program transformation [8],
i.e., a mapping $F$ from programs to programs. The resulting program is called the
fault-affected version. For a given program $\Sigma$, $F(\Sigma)$ is also called program $\Sigma$ in the
presence of faults $F$.
We require that a fault model does not tamper with the set of initial states, i.e.,
we rule out “immediate” faults that occur before the system is switched on. We also
restrict ourselves to the case where $F$ “adds” transitions, since this is the only way to
violate a safety specification.
Definition 4 (fault model) A fault model $F$ maps a program $\Sigma = (C, I, T)$ to a program
$F(\Sigma) = (F(C), F(I), F(T))$ such that the following conditions hold:
1. $F(C) = C$
2. $F(I) = I$
3. $F(T) \supseteq T$
For a given fault model $F$ and a specification $\mathit{SPEC}$, we say that a program $\Sigma$ is
$F$-intolerant with respect to $\mathit{SPEC}$ if $\Sigma$ satisfies $\mathit{SPEC}$ but $F(\Sigma)$ violates $\mathit{SPEC}$.
Given two programs $\Sigma_1$ and $\Sigma_2$ such that $\Sigma_2$ extends $\Sigma_1$ and a fault model $F$, it
makes sense to assume that $F$ treats $\Sigma_1$ and $\Sigma_2$ in a “similar way”. Basically, this
means that $F$ should at least add the same transitions to $\Sigma_1$ and $\Sigma_2$. But with respect
to the possible new states of $\Sigma_2$ it can possibly add new fault transitions. This models
faults which occur within the fault-detection and correction mechanisms.
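Definition 5 (fault extension monotonicity) A fault model $F$ is extension monotonic
iff for any two programs $\Sigma_1 = (C_1, I_1, T_1)$ and $\Sigma_2 = (C_2, I_2, T_2)$ such that $\Sigma_2$ extends
$\Sigma_1$ using $\pi$ the following holds:
$$F(T_1) \setminus T_1 \subseteq F(T_2) \setminus T_2$$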
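[Figure omitted: for each of two examples, the original system $\Sigma_1$ (top) and its extension $\Sigma_2$ (bottom), related by the state projection $\pi$. Left: extension monotonic. Right: not extension monotonic.]
Figure 1: Examples for extension monotonic and not extension monotonic fault models.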
An example is given in Fig. 1. The original system is given at the top and the
extension is given below (the state projection is implied by vertical orientation, i.e.,
states which are vertically aligned are mapped to the same state by $\pi$). In the left
example the fault model is extension monotonic since all fault transitions in $\Sigma_1$ are
also in $\Sigma_2$. The right example is not extension monotonic. Intuitively, an extension
monotonic fault model maintains at least its original transitions over extensions.
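The conditions of Definitions 4 and 5 can be illustrated with a small Python sketch. It assumes that a fault model is represented only by its effect on the transition set (states and initial states stay unchanged by definition), and it checks monotonicity for one concrete pair of programs rather than for all pairs. The transition sets and the two fault models `F_good` and `F_bad` are hypothetical and do not reproduce Figure 1.

```python
def added_by(F, T):
    """Fault transitions: what F adds to the transition set T."""
    return F(T) - T

def adds_only(F, T):
    """Condition 3 of Definition 4: F(T) is a superset of T."""
    return F(T) >= T

def monotonic_on(F, T1, T2):
    """Definition 5 for one pair: T1 (original) and T2 (extension)."""
    return added_by(F, T1) <= added_by(F, T2)

# Original program a -> b; the extension adds a non-reachable state d.
T1 = {("a", "b")}
T2 = T1 | {("d", "b")}

def F_good(T):
    # Always adds the same fault transition b -> a.
    return T | {("b", "a")}

def F_bad(T):
    # Redirects its fault transition as soon as state d exists.
    states = {s for transition in T for s in transition}
    extra = ("b", "d") if "d" in states else ("b", "a")
    return T | {extra}

good_is_monotonic = monotonic_on(F_good, T1, T2)
bad_is_monotonic = monotonic_on(F_bad, T1, T2)
```

`F_bad` fails the check because the fault transition from b to a, present in the original program, disappears in the extension.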
The extension monotonicity requirement does not restrict faulty behavior on the
new states of the extension. However, we have to restrict this type of behavior since
it would be impossible to build fault-tolerant versions otherwise. In this paper we
assume a very general type of restriction: it basically states that in any infinite sequence
of extensions of the original program there is always some point where $F$ does not
introduce new fault transitions anymore.
Definition 6 (finite fault model) An extension monotonic fault model $F$ is finite iff for
any infinite sequence of programs $\Sigma_1, \Sigma_2, \ldots$ such that for all $i$, $\Sigma_{i+1}$ extends $\Sigma_i$,
there exists a $j$ such that for all $k \geq j$ no new fault transition is introduced in $\Sigma_k$,
i.e., $F(T_{k+1}) \setminus T_{k+1} = F(T_k) \setminus T_k$.
Finite fault models retain the fault transitions in the original program (i.e., they are
extension monotonic for each pair of extensions). They do not restrict the additional
faulty behavior introduced in the new states of an extension. However, they exclude
fault models for which infinite redundancy is necessary to tolerate them. The engineering
process is as follows: Given a program $\Sigma_1$ and a fault model $F$, we extend $\Sigma_1$ to
$\Sigma_2$ to make $F$ tolerable. Then we look at the new states introduced in this process and
consider faults which might happen there. Regarding these new faults we construct a
new extension $\Sigma_3$ of $\Sigma_2$ to potentially tolerate these faults. This process is repeated.
In theory, this process might never terminate, namely if $F$ forever adds certain kinds
of faults to the new states. A finite fault model guarantees that this process must eventually
terminate. In this paper, we assume our fault model to be finite and extension
monotonic.
Now we are able to define a fault-tolerant version. It captures the idea of starting
with some program $\Sigma_1$ which is fault-intolerant regarding a specification $\mathit{SPEC}$ and
some fault model $F$. A fault-tolerant version $\Sigma_2$ of $\Sigma_1$ is a program which has the
same behavior as $\Sigma_1$ if no faults occur, but additionally satisfies $\mathit{SPEC}$ in the presence
of faults.
Definition 7 (fault-tolerant version) Let $F$ be a fault model, $\mathit{SPEC}$ be a specification
and $\Sigma_1$ and $\Sigma_2$ be programs. Assume that $\Sigma_1$ satisfies $\mathit{SPEC}$ but $F(\Sigma_1)$ violates $\mathit{SPEC}$.
We call a program $\Sigma_2$ the $F$-tolerant version of program $\Sigma_1$ for $\mathit{SPEC}$ using state
projection $\pi$ iff the following conditions hold:
1. $\Sigma_2$ extends $\Sigma_1$ using $\pi$,
2. $F(\Sigma_2)$ satisfies $\mathit{SPEC}$.
3 Problem Statement
The basic task we would like to solve is to construct a fault-tolerant version for a given
program and a safety specification.
Definition 8 (general fail-safe transformation problem) Given a fault model $F$ and
a program $\Sigma_1$ which is $F$-intolerant with respect to a general safety specification
$\mathit{SPEC}_1$, the general fail-safe transformation problem consists of finding a fault-tolerant
version of $\Sigma_1$, i.e., a program $\Sigma_2$ such that $\Sigma_2$ extends $\Sigma_1$ and $F(\Sigma_2)$ satisfies $\mathit{SPEC}_1$.
The case where SPEC is fusion closed has been studied by Kulkarni and Arora
[12] and Jhumka et al. [11], i.e., they solve a restricted transformation problem.
Definition 9 (fusion-closed fail-safe transformation problem) The fusion-closed fail-safe
transformation problem consists of solving the general fail-safe transformation
problem where $\mathit{SPEC}_1$ is fusion closed.
In the remainder of this section we briefly recall the approaches used by Kulkarni
and Arora [12] and Jhumka et al. [11] to solve the latter problem.
3.1 Adding Fail-Safe Fault Tolerance to Fusion-Closed Specifications
The basic mechanism which Kulkarni and Arora [12] and Jhumka et al. [11] apply
is the creation of non-reachable states. The fact that specifications are fusion closed
implies that safety specifications can be concisely represented by a set of “bad” transitions,
i.e., transitions which cause a violation of the specification [3, 10].
Definition 10 (maintains) Let $\Sigma$ be a program, $\mathit{SPEC}$ be a specification and $\alpha$ be a
finite computation of $\Sigma$. We say that $\alpha$ maintains $\mathit{SPEC}$ iff there exists a sequence of
states $\beta$ such that $\alpha \cdot \beta \in \mathit{SPEC}$.
If $\mathit{SPEC}$ is a safety property, every trace not in $\mathit{SPEC}$ has a prefix which does
not maintain $\mathit{SPEC}$. From the definition of maintains, we have that there must be a
transition where a given trace $\sigma$ switches from “good” to “bad”, i.e., $\sigma$ can be written
as $\alpha \cdot d \cdot b \cdot \beta$ such that $\alpha \cdot d$ maintains $\mathit{SPEC}$ and all “longer” prefixes (starting with
$\alpha \cdot d \cdot b$) do not maintain $\mathit{SPEC}$. Arora and Kulkarni have shown [4, “Only-if” part of
Lemma 3.2] that $(d, b)$ is a transition which will cause any trace in which it occurs to
violate $\mathit{SPEC}$. We rephrase this result as follows:
Lemma 1 Let $\Sigma = (C, I, T)$ be a system, $\mathit{SPEC}$ be a safety property which is fusion
closed and assume that $\Sigma$ violates $\mathit{SPEC}$ and that every $x \in I$ maintains
$\mathit{SPEC}$. Then there exists a transition $(d, b) \in T$ such that for all traces $\sigma$ of $\Sigma$ the following holds:
if $(d, b)$ occurs in $\sigma$ then $\sigma \notin \mathit{SPEC}$.
The known automated procedures [11, 12] which are based on the concept of non-reachable
states use the following approach for addition of fail-safe fault tolerance:
Since $F(\Sigma_1)$ violates $\mathit{SPEC}$, there must exist executions in which a specified bad
transition occurs. Inevitably, we must prevent the occurrence of such a transition. So,
for all bad transitions $t = (d, b)$ we must make either state $d$ or state $b$ unreachable in
$F(\Sigma_2)$. If $t$ is a program transition then it depends on whether or not $t$ is reachable in
$\Sigma_1$.
• If $t$ is a reachable program transition, then a violation of $\mathit{SPEC}$ can occur even
if no faults occur, so, obviously, no fault-tolerant version exists since we would
have to change the behavior of the original program.
• If $t$ is a redundant (i.e., non-reachable) program transition, then we can remove
it resulting in a smaller transition set $T_2$ of $\Sigma_2$.
If $t$ is a transition which has been introduced by $F$, then we cannot remove it directly.
The best we can do is make the starting state $d$ of $t$ unreachable. But this can only
be done if there exists a non-reachable program transition on the path to $d$. If such a
transition exists, we can safely remove it. If not, then again no fault-tolerant version
exists.
[Figure omitted: state chart of a program over the states a, b, c, d, e, f, g, h.]
Figure 2: Illustration for the Kulkarni-Arora method. The specification is “never $h$”.
As an illustration of the method consider Figure 2 which shows a program $\Sigma_1$
in a state-chart like notation. Again, states are drawn as circles and transitions are
arrows between states. Initial states are identified using arrows without starting states.
Transitions which are introduced by $F$ are shown as dashed arrows.
Assume the correctness specification $\mathit{SPEC}$ for $\Sigma_1$ is that it never reaches state $h$.
Obviously, the system satisfies $\mathit{SPEC}$ in the absence of faults but it violates $\mathit{SPEC}$ in
the presence of faults. The bad transitions which we must prevent in $F(\Sigma_2)$ are all
transitions which have state $h$ as destination state. We can remove the transition $(g, h)$
easily from $T_2$ because its removal does not change the fault-free behavior of $\Sigma_2$. But
we cannot remove transition $(f, h)$ since it is a fault transition. Luckily, there exists
a redundant transition $(d, e)$ on the path leading to $f$ which can be removed in $T_2$. So
$\Sigma_2$ is constructed from $\Sigma_1$ by removing $(g, h)$ and $(d, e)$ from $T_2$.
Figure 2 also helps to illustrate the cases where no fault-tolerant version exists. For
example, if there were a transition $(c, h) \in T_1$ then $h$ would be reachable in $\Sigma_1$ and, hence, $\Sigma_1$
would not satisfy the specification anyway. The other case arises, for example, if there
were a fault transition $(b, h) \in F(T_1)$, i.e., $h$ is reachable along a path with only fault
transitions and reachable program transitions. Again, such an $F$ is not tolerable. However,
the fact that $F$ is not tolerable is not a drawback of the transformation method; it
simply states that the chosen fault assumption is too severe to be tolerated.
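The pruning idea recalled above can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the algorithm of [11] or [12]: the specification is given directly as a set of bad transitions, and the sketch removes every program transition that enters a state from which faults alone can trigger a bad transition. The graph below is hypothetical (the exact edges of Figure 2 are not reproduced here), and the names `reach` and `add_fail_safe` are invented for the example.

```python
from collections import deque

def reach(initial, transitions):
    """States reachable from `initial` using `transitions`."""
    succ = {}
    for s, t in transitions:
        succ.setdefault(s, set()).add(t)
    seen, queue = set(initial), deque(initial)
    while queue:
        for t in succ.get(queue.popleft(), ()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def add_fail_safe(initial, program, faults, bad):
    """Return a pruned program transition set, or None if pruning fails."""
    # States from which fault transitions alone lead to a bad fault transition.
    unsafe = {s for (s, t) in faults if (s, t) in bad}
    changed = True
    while changed:
        changed = False
        for s, t in faults:
            if t in unsafe and s not in unsafe:
                unsafe.add(s)
                changed = True
    # Bad program transitions and program transitions entering unsafe states.
    removed = {(s, t) for (s, t) in program if (s, t) in bad or t in unsafe}
    # Only transitions that are non-reachable without faults may be removed.
    fault_free = reach(initial, program)
    if any(s in fault_free for (s, _) in removed) or unsafe & set(initial):
        return None
    return program - removed

# Hypothetical system; the specification is "never h".
initial = {"a"}
program = {("a", "b"), ("b", "c"), ("c", "a"),
           ("d", "e"), ("e", "f"), ("g", "h")}
faults = {("b", "d"), ("f", "h")}
bad = {(s, t) for (s, t) in program | faults if t == "h"}

T2 = add_fail_safe(initial, program, faults, bad)
```

On this graph the sketch removes $(g, h)$ and $(e, f)$. The text above removes $(d, e)$ instead of $(e, f)$; either choice cuts the path that leads to $f$ without changing the fault-free behavior.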
This concludes the recapitulation of the known approaches to automatically make
a program fail-safe fault tolerant. Recall that specifications are required to be fusion
closed. The above illustrations show that fusion closure together with the assumption
that a fault assumption is tolerable implies that state space redundancy (e.g., states $d$
and $g$) is already available in the fault-intolerant system $\Sigma_1$. This type of redundancy
allows detection predicates to be formulated in the language of guarded commands [6,
7] which is the basis of the Arora-Kulkarni theory [3]. These detection predicates
are conjoined to the guards of certain actions and hence have the effect of removing
transitions.
3.2 Handling Specifications which are Not Fusion Closed
Programs are presented in a guarded command notation [6, 7]. The state space of a
program is defined by a set of variables and the state transitions by a set of actions. An
action of a program has the form
$\langle \text{guard} \rangle \rightarrow \langle \text{statement} \rangle$
in which the guard is a boolean expression over the program variables and the statement
is either the empty statement or an instantaneous assignment to one or more variables.
An execution is constructed by repeatedly and non-deterministically choosing
any action where the guard evaluates to true and executing the corresponding action.

process Σ1
var x ∈ {0,1,2,3,4} init 0
begin
x = 0 → x := 1
[] x = 1 → x := 2
[] x = 2 → x := 3
[] x = 3 → x := 4
[] f: x = 1 → x := 3
end

process Σ2
var x ∈ {0,1,2,3,4} init 0
h sequence of {0,1,2,3,4} init ⟨⟩
begin
x = 0 → x := 1; h := ⟨1⟩
[] x = 1 → x := 2; h := ⟨1,2⟩
[] x = 2 → x := 3; h := ⟨1,2,3⟩
[] x = 3 → x := 4; h := ⟨1,2,3,4⟩
[] f: x = 1 → x := 3
end
Figure 3: Two programs in guarded command notation: Σ1 (left) and Σ2 (right).
Consider the program on the left side of Figure 3. The program has a variable
$x$ which can take five different values (0–4) and simply proceeds from state $x = 0$ to
$x = 4$ through all intermediate states. The fault assumption $F$ has added one transition
from $x = 1$ to $x = 3$ to the transition relation (the action is marked with an ‘$f$’).
Consider the correctness specification
$\mathit{SPEC}$ = “always ($x = 4$ implies that previously $x = 2$)”
Note that $F(\Sigma_1)$ does not satisfy $\mathit{SPEC}$ (i.e., $F(\Sigma_1)$ can reach state $x = 4$ without
having been in state $x = 2$), and that $\mathit{SPEC}$ is not fusion closed. To see the latter,
consider the two traces $0, 3, 2, 4$ and $2, 3, 4$ from $\mathit{SPEC}$. The fusion at state $x = 3$
yields trace $0, 3, 4$ which is not in $\mathit{SPEC}$. Since $\mathit{SPEC}$ is not fusion closed, we cannot
apply the known transformation methods [11, 12].
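The two observations of this paragraph can be replayed in a few lines of Python. The sketch below enumerates the maximal traces of the (acyclic) program with and without the fault transition and evaluates the specification on them; since the specification is a safety property that holds for every prefix of a trace which satisfies it, looking at maximal traces is sufficient here. The helper names `maximal_traces` and `satisfies_spec` are assumptions made for this illustration.

```python
def maximal_traces(initial, transitions):
    """Enumerate all maximal traces of an acyclic transition system."""
    succ = {}
    for s, t in transitions:
        succ.setdefault(s, []).append(t)

    def extend(trace):
        nxt = succ.get(trace[-1], [])
        if not nxt:
            yield trace
        for t in nxt:
            yield from extend(trace + (t,))

    for s in initial:
        yield from extend((s,))

def satisfies_spec(trace):
    """SPEC: always (x = 4 implies that previously x = 2)."""
    return all(2 in trace[:i] for i, x in enumerate(trace) if x == 4)

T1 = {(0, 1), (1, 2), (2, 3), (3, 4)}  # program transitions
fault = {(1, 3)}                        # fault transition f

fault_free = set(maximal_traces({0}, T1))
with_faults = set(maximal_traces({0}, T1 | fault))
violating = {t for t in with_faults if not satisfies_spec(t)}
```

The only violating trace is 0, 1, 3, 4, which reaches $x = 4$ without visiting $x = 2$. The same predicate accepts the traces 0, 3, 2, 4 and 2, 3, 4 but rejects their fusion 0, 3, 4.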
The specification can be made fusion closed by adding a history variable $h$ which
records the entire state history. Such a variable has been added to the program on the
right side of Figure 3. Now $\mathit{SPEC}$ can be rephrased as
$\mathit{SPEC}$ = “always ($x = 4$ implies $\langle 2 \rangle \in h$)”
or equivalently
$\mathit{SPEC}$ = “never ($x = 4$ and $\langle 2 \rangle \notin h$)”
Now we can identify a set of bad transitions which must be prevented, e.g.:
$x = 3 \wedge h = \langle 1 \rangle \;\rightarrow\; x = 4 \wedge h = \langle 1, 2, 3, 4 \rangle$
The precondition for the transition to a state where $x = 4$ must be strengthened by the
detection predicate $h \neq \langle 1 \rangle$, i.e., the fourth guarded command of $\Sigma_2$ must be changed
to:
x = 3 ∧ h ≠ ⟨1⟩ → x := 4; h := ⟨1,2,3,4⟩
Hence, bad transitions are prevented and the modified system satisfies $\mathit{SPEC}$ in the
presence of fault $f$.
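The effect of the detection predicate can be simulated with the following Python sketch. It models a state of $\Sigma_2$ as a pair `(x, h)` with `h` a tuple, encodes each guarded command as a pair of functions, and checks the original specification on the projection of each trace to $x$. The encoding and all function names are assumptions of this illustration; the fault action is modeled as leaving $h$ unchanged, as in Figure 3.

```python
def run(actions, state, trace=()):
    """Enumerate maximal traces of a guarded-command program (acyclic here)."""
    trace = trace + (state,)
    enabled = [update for guard, update in actions if guard(*state)]
    if not enabled:
        yield trace
    for update in enabled:
        yield from run(actions, update(*state), trace)

def sigma2(detector):
    """Actions of Sigma_2; `detector` is conjoined to the fourth guard."""
    return [
        (lambda x, h: x == 0, lambda x, h: (1, (1,))),
        (lambda x, h: x == 1, lambda x, h: (2, (1, 2))),
        (lambda x, h: x == 2, lambda x, h: (3, (1, 2, 3))),
        (lambda x, h: x == 3 and detector(h), lambda x, h: (4, (1, 2, 3, 4))),
        (lambda x, h: x == 1, lambda x, h: (3, h)),  # fault f, h untouched
    ]

def x_traces(actions):
    """Project every maximal trace to the values of x."""
    return {tuple(x for x, _ in t) for t in run(actions, (0, ()))}

def satisfies_spec(xs):
    """SPEC: always (x = 4 implies that previously x = 2)."""
    return all(2 in xs[:i] for i, x in enumerate(xs) if x == 4)

unguarded = x_traces(sigma2(lambda h: True))
guarded = x_traces(sigma2(lambda h: h != (1,)))
```

Without the detector the faulty run 0, 1, 3, 4 is possible. With the detector the run stops at $x = 3$ after the fault, so every trace satisfies the specification while the fault-free run 0, 1, 2, 3, 4 is preserved.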
3.3 State Space Redundancy Through History Variables
Adding a history variable $h$ in the previous example adds states to the state space of the
system. In fact, defining the domain of $h$ as the set of all sequences over $\{0, 1, 2, 3, 4\}$
adds infinitely many states. Clearly this can be reduced by the observation that if faults
do not corrupt $h$, then $h$ will only take on five different values ($\langle\rangle$, $\langle 1 \rangle$, $\langle 1, 2 \rangle$, $\langle 1, 2, 3 \rangle$,
and $\langle 1, 2, 3, 4 \rangle$). But still, the state space has been increased from five states to $5^2 = 25$
states.
Note that $\Sigma_2$ has redundant states and $\Sigma_1$ is not redundant at all. So the redundancy
is due to the history variable $h$. But even if the domain of $h$ has cardinality 5, the
redundancy is in a certain sense not minimal, as we now explain.
Consider the program $\Sigma_3$ on the left side of Figure 4. It tolerates the fault $f$ by
adding only one state to the state space of $\Sigma_1$ (namely, $x = 5$). The state space together
with the transitions is depicted on the right side of the figure. Note that $\Sigma_3$ has only one
redundant state, so $\Sigma_3$ can be regarded as redundancy-minimal with respect to $\mathit{SPEC}$.
The metric used for minimality is the number of redundant states. We want to exploit
this observation to deal with the general case.
process Σ3
var x ∈ {0,1,2,3,4,5} init 0
begin
x = 0 → x := 1
[] x = 1 → x := 2
[] x = 2 → x := 5
[] x = 5 → x := 4
[] f: x = 1 → x := 3
end
[State chart omitted: states 0, 1, 2, 3, 4 and the additional state 5.]
Figure 4: A redundancy-minimal version of the program in Figure 3 in guarded commands
(left) and a state chart notation (right). The specification is “always ($x = 4$
implies that previously $x = 2$)”.
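A short Python sketch can be used to check the claims about $\Sigma_3$. It assumes the state projection $\pi$ maps the new state 5 to state 3 and every other state to itself, which is our reading of the figure and not stated explicitly in the text. Comparing maximal traces is sufficient in this small acyclic example because the maximal traces determine all prefixes.

```python
def maximal_traces(initial, transitions):
    """Enumerate all maximal traces of an acyclic transition system."""
    succ = {}
    for s, t in transitions:
        succ.setdefault(s, []).append(t)

    def extend(trace):
        nxt = succ.get(trace[-1], [])
        if not nxt:
            yield trace
        for t in nxt:
            yield from extend(trace + (t,))

    for s in initial:
        yield from extend((s,))

def satisfies_spec(trace):
    """SPEC: always (x = 4 implies that previously x = 2)."""
    return all(2 in trace[:i] for i, x in enumerate(trace) if x == 4)

def pi(state):
    """Assumed state projection: the new state 5 stands for state 3."""
    return 3 if state == 5 else state

def project(traces):
    return {tuple(pi(s) for s in t) for t in traces}

T1 = {(0, 1), (1, 2), (2, 3), (3, 4)}  # Sigma_1
T3 = {(0, 1), (1, 2), (2, 5), (5, 4)}  # Sigma_3
fault = {(1, 3)}

fault_free_1 = set(maximal_traces({0}, T1))
fault_free_3 = project(maximal_traces({0}, T3))
faulty_3 = project(maximal_traces({0}, T3 | fault))
```

Under this projection the fault-free behavior of $\Sigma_3$ coincides with that of $\Sigma_1$, and the faulty run ends in state 3 without ever reaching $x = 4$.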
4 Beyond Fusion Closure
Although the automated procedures of [11, 12] were developed for fusion-closed specifications,
they may still work for specifications which are not fusion closed, provided that
the fault model has a certain pleasant form. For example, consider the system in Figure 5
and the specification
$\mathit{SPEC}$ = “($e$ implies previously $c$) and (never $g$)”
Obviously, the fault model $F$ can be tolerated using the known transformation methods
because $F$ does not “exploit” the part of the specification which is not fusion closed.
[Figure omitted: state chart over the states a, b, c, d, e, f, g.]
Figure 5: The fail-safe transformation can be successful even if the specification is not
fusion closed. The specification in this case is “($e$ implies previously $c$) and (never $g$)”.
4.1 Exploiting Non-Fusion Closure
Now we formalize what it means for a fault model to “exploit” the fact that a specifi-
cation is not fusion-closed (we call this property non-fusion closure). First we define
what it means for a trace to be the fusion of two other traces.
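Definition 11 (fusion and fusion point of traces) Let $s$ be a state and
$\alpha = \alpha_{\mathrm{pre}} \cdot s \cdot \alpha_{\mathrm{post}}$ and $\beta = \beta_{\mathrm{pre}} \cdot s \cdot \beta_{\mathrm{post}}$ be two traces in which $s$ occurs. Then we define
$$\mathit{fusion}(\alpha, s, \beta) = \alpha_{\mathrm{pre}} \cdot s \cdot \beta_{\mathrm{post}}$$
If $\mathit{fusion}(\alpha, s, \beta) \neq \alpha$ and $\mathit{fusion}(\alpha, s, \beta) \neq \beta$ we call $s$ a fusion point of $\alpha$ and $\beta$.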
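For finite traces, the fusion operator can be written down directly. The Python sketch below represents traces as tuples and, as a simplifying assumption, splits each trace at the first occurrence of $s$ (the definition itself does not fix which occurrence is meant when a state repeats). The function names are illustrative.

```python
def fusion(alpha, s, beta):
    """Return alpha_pre . s . beta_post for finite traces given as tuples.

    Both traces are split at the first occurrence of s; a ValueError is
    raised if s does not occur in one of them.
    """
    i, j = alpha.index(s), beta.index(s)
    return alpha[:i + 1] + beta[j + 1:]

def is_fusion_point(alpha, s, beta):
    """s is a fusion point if the fused trace differs from both inputs."""
    fused = fusion(alpha, s, beta)
    return fused != alpha and fused != beta

fused = fusion((0, 3, 2, 4), 3, (2, 3, 4))
```

Here `fused` is the trace 0, 3, 4 from the example in Section 3.2, and state 3 is a fusion point of the two traces. Fusing a trace with itself never yields a fusion point under the first-occurrence assumption.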
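Lemma 2 For the fusion of three traces $\alpha, \beta, \gamma$ the following holds: If $s$ occurs before $s'$ in $\beta$ then
$$\mathit{fusion}(\alpha, s, \mathit{fusion}(\beta, s', \gamma)) = \mathit{fusion}(\mathit{fusion}(\alpha, s, \beta), s', \gamma)$$
and
$$\mathit{fusion}(\gamma, s', \mathit{fusion}(\alpha, s, \beta)) = \mathit{fusion}(\gamma, s', \beta)$$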
Proofs are written in a structured style similar to proof trees of interactive theorem
proving environments. This approach is advocated by Lamport who promises that this
style “makes it much harder to prove things that are not true” [14]. The proof is a
sequence of numbered steps at different levels. Every step has a proof which may be
refined at lower levels by additional steps. For example, step $\langle 1 \rangle 2$. is the second step
on level 1. Proofs may also be read in a structured way, for example, by reading only
the top level steps and going into sublevels only when necessary.
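PROOF: Assume that $s$ occurs before $s'$ in $\beta$. The proof is by direct calculation. Let
$\alpha = \alpha_{pre} \cdot s \cdot \alpha_{post}$, $\beta = \beta_{pre} \cdot s \cdot \beta_{mid} \cdot s' \cdot \beta_{post}$, and $\gamma = \gamma_{pre} \cdot s' \cdot \gamma_{post}$. Then
$$\begin{aligned}
\mathit{fusion}(\alpha, s, \mathit{fusion}(\beta, s', \gamma)) &= \mathit{fusion}(\alpha, s, \beta_{pre} \cdot s \cdot \beta_{mid} \cdot s' \cdot \gamma_{post}) \\
&= \alpha_{pre} \cdot s \cdot \beta_{mid} \cdot s' \cdot \gamma_{post} \\
&= \mathit{fusion}(\alpha_{pre} \cdot s \cdot \beta_{mid} \cdot s' \cdot \beta_{post}, s', \gamma) \\
&= \mathit{fusion}(\mathit{fusion}(\alpha, s, \beta), s', \gamma)
\end{aligned}$$
proves the first equation and
$$\begin{aligned}
\mathit{fusion}(\gamma, s', \mathit{fusion}(\alpha, s, \beta)) &= \mathit{fusion}(\gamma, s', \alpha_{pre} \cdot s \cdot \beta_{mid} \cdot s' \cdot \beta_{post}) \\
&= \gamma_{pre} \cdot s' \cdot \beta_{post} \\
&= \mathit{fusion}(\gamma, s', \beta)
\end{aligned}$$
proves the second equation.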
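Both equations of Lemma 2 can be checked on concrete traces. The following Python sketch is only a sanity check on one hand-picked instance, not a proof. It assumes finite traces as tuples that are split at the first occurrence of the state; the state names are arbitrary, with `"t"` playing the role of s′.

```python
def fusion(alpha, s, beta):
    """alpha_pre . s . beta_post, splitting at the first occurrence of s."""
    return alpha[: alpha.index(s) + 1] + beta[beta.index(s) + 1:]


alpha = ("p", "s", "q")
beta = ("u", "s", "m", "t", "v")  # s occurs before t in beta
gamma = ("w", "t", "x")

# First equation: fusing at s and then at t can be re-associated.
lhs1 = fusion(alpha, "s", fusion(beta, "t", gamma))
rhs1 = fusion(fusion(alpha, "s", beta), "t", gamma)

# Second equation: the earlier fusion at s is invisible after fusing at t.
lhs2 = fusion(gamma, "t", fusion(alpha, "s", beta))
rhs2 = fusion(gamma, "t", beta)

print(lhs1 == rhs1, lhs1)  # True ('p', 's', 'm', 't', 'x')
print(lhs2 == rhs2, lhs2)  # True ('w', 't', 'v')
```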
If SPEC is a set of traces, we recursively define the fusion closure of SPEC, denoted
by $\mathit{fusion\text{-}closure}(\mathit{SPEC})$, as the set which is closed under finite applications of
the fusion operator.
Definition 12 (fusion closure) Given a specification SPEC, a trace $\sigma$ is in $\mathit{fusion\text{-}closure}(\mathit{SPEC})$
iff
1. $\sigma$ is in SPEC, or
2. $\sigma = \mathit{fusion}(\alpha, s, \beta)$ for traces $\alpha, \beta \in \mathit{fusion\text{-}closure}(\mathit{SPEC})$ and a state $s$ that
occurs in $\alpha$ and $\beta$.
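For a finite set of finite traces, the recursive definition above can be read as a fixpoint computation. The sketch below is an illustration under assumptions that go beyond the paper: traces are tuples split at the first occurrence of a state, the input is a small finite fragment rather than a full specification, and a hypothetical `max_len` bound is added so that the loop terminates even when fusions could keep producing longer traces.

```python
from itertools import product


def fusion(alpha, s, beta):
    """alpha_pre . s . beta_post, splitting at the first occurrence of s."""
    return alpha[: alpha.index(s) + 1] + beta[beta.index(s) + 1:]


def fusion_closure(spec, max_len=16):
    """Smallest superset of spec closed under fusion (up to max_len states)."""
    closure = set(spec)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot because the closure grows inside the loop.
        for alpha, beta in product(tuple(closure), repeat=2):
            for s in set(alpha) & set(beta):
                fused = fusion(alpha, s, beta)
                if len(fused) <= max_len and fused not in closure:
                    closure.add(fused)
                    changed = True
    return closure


fragment = {("a", "b", "c"), ("a", "d", "e", "f"), ("a", "b", "e")}
closure = fusion_closure(fragment)
print(sorted(closure))
```

On this fragment the closure adds exactly two traces, `('a', 'd', 'e')` and `('a', 'b', 'e', 'f')`, both obtained by fusing at state e.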
Lemma 2 guarantees that every trace in $\mathit{fusion\text{-}closure}(\mathit{SPEC})$ which is not in
SPEC has a “normal form”, i.e., it can be represented uniquely as a sequence of
fusions of traces in SPEC. This is shown in the following theorem.
Theorem 1 For every trace $\sigma \in \mathit{fusion\text{-}closure}(\mathit{SPEC})$ which is not in SPEC there
exists a sequence of traces $\alpha_0, \alpha_1, \alpha_2, \ldots$ and a sequence of states $s_1, s_2, s_3, \ldots$ such
that
1. for all $i \geq 0$, $\alpha_i \in \mathit{SPEC}$,
2. for all $i \geq 1$, $s_i$ is a fusion point of $\alpha_{i-1}$ and $\alpha_i$, and
3. $\sigma$ can be written as:
$$\sigma = \mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\mathit{fusion}(\alpha_0, s_1, \alpha_1), s_2, \alpha_2), s_3, \alpha_3), \ldots)$$
PROOF SKETCH: The proof is by induction on the structure of how $\sigma$ evolved from
traces in SPEC. Basically this means an induction on the number of fusion points
in $\sigma$. The induction step assumes that $\sigma$ is the fusion of two traces which have
at most $n$ fusion points and, depending on their relative positions, uses the rules of
Lemma 2 to construct the normal form for $\sigma$.
1 ⟨1⟩1. The theorem holds for all traces which have one fusion point.
PROOF: Since $\sigma$ has one fusion point, it can be written as $\sigma = \mathit{fusion}(\alpha_0, s_1, \alpha_1)$
with $\alpha_0$ and $\alpha_1$ from SPEC and $s_1$ a fusion point of $\alpha_0$ and $\alpha_1$.
2 ⟨1⟩2. ASSUME: The theorem holds for all traces with at most $n$ fusion points.
PROVE: The theorem holds for all traces $\sigma$ which are fusions of traces with at
most $n$ fusion points.
PROOF SKETCH: Take two traces $\tau$ and $\tau'$ which have at most $n$ fusion points and
which share an additional common fusion point $s$ (see Fig. 6). The new fusion point
$s$ divides the fusion points in $\tau$ and $\tau'$ into two groups of $k$ and $m$ fusion points in
$\tau$ (and $k'$ and $m'$ fusion points in $\tau'$, respectively). The fusion of both traces will
maintain the $k$ fusion points of $\tau$ and the $m'$ fusion points of $\tau'$. This follows from
the second equation of Lemma 2. Because of the ordering of the fusion points we
can use the first equation of Lemma 2 to construct the normal form. In general, the
resulting trace can have more than $n$ fusion points.
2.1 ⟨2⟩1. $\sigma$ can be written as $\sigma = \mathit{fusion}(\tau, s, \tau')$ where $\tau$ and $\tau'$ have at most $n$ fusion
points.
PROOF: Follows from the fact that $\sigma$ is the fusion of two traces with at most $n$
fusion points.
2.2 ⟨2⟩2. $\sigma$ can be written as
$$\sigma = \mathit{fusion}(\mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\alpha_0, s_1, \alpha_1), s_2, \alpha_2) \ldots),\; s,\; \mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\alpha'_0, s'_1, \alpha'_1), s'_2, \alpha'_2) \ldots))$$
PROOF: Follows from the induction hypothesis and by replacing $\tau$ and $\tau'$ with
their normal forms in the formula of step ⟨2⟩1.
2.3 ⟨2⟩3. Let $k$, $m$, $k'$ and $m'$ denote the number of fusion points to the left and right of
$s$ in $\tau$ and $\tau'$ (see Fig. 6). Then $\sigma$ can be written as
$$\mathit{fusion}(\mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\alpha_0, s_1, \alpha_1), s_2, \alpha_2) \ldots),\; s,\; \mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\alpha'_{k'}, s'_{k'+1}, \alpha'_{k'+1}), s'_{k'+2}, \alpha'_{k'+2}) \ldots))$$
PROOF: The first $k'$ fusion points of $\tau'$ precede $s$ and so by repeatedly applying
the second equation of Lemma 2 we can remove the first $k'$ applications of fusion
from the formula of step ⟨2⟩2.
2.4 ⟨2⟩4. $\sigma$ can be written as
$$\mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\mathit{fusion}(\mathit{fusion}(\ldots \mathit{fusion}(\mathit{fusion}(\alpha_0, s_1, \alpha_1), s_2, \alpha_2) \ldots, s_k, \alpha_k),\; s,\; \alpha'_{k'}),\; s'_{k'+1}, \alpha'_{k'+1}),\; s'_{k'+2}, \alpha'_{k'+2}), \ldots)$$
PROOF: From the definition of fusion, we can ignore the final $m$ fusion points of
$\tau$. The formula follows by repeatedly applying the first equation of Lemma 2 to
the formula of step ⟨2⟩3 (shifting the fusion operator to the left).
2.5 ⟨2⟩5. Q.E.D.
PROOF: The formula of step ⟨2⟩4 has the required normal form because all $\alpha_i$
and $\alpha'_j$ are in SPEC and all $s_i$ and $s'_j$ are fusion points of consecutive elements in
the formula.
3 ⟨1⟩3. Q.E.D.
PROOF: Follows from induction.
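Figure 6: Diagram accompanying the proof of Theorem 1. (The diagram shows the traces $\tau$ and $\tau'$ fused at state $s$ into $\sigma$, with $k$ and $m$ fusion points of $\tau$ and $k'$ and $m'$ fusion points of $\tau'$ to the left and right of $s$.)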
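The normal form of Theorem 1 is a left-nested chain of fusions, which corresponds to a left fold over the traces and fusion points. The following Python sketch illustrates this reading under the assumption that traces are finite tuples split at the first occurrence of a state; the helper name `normal_form` and the example traces are hypothetical and not taken from the paper.

```python
from functools import reduce


def fusion(alpha, s, beta):
    """alpha_pre . s . beta_post, splitting at the first occurrence of s."""
    return alpha[: alpha.index(s) + 1] + beta[beta.index(s) + 1:]


def normal_form(traces, points):
    """Evaluate fusion(... fusion(fusion(a0, s1, a1), s2, a2) ...)."""
    if len(traces) != len(points) + 1:
        raise ValueError("need exactly one fusion point between consecutive traces")
    return reduce(
        lambda acc, step: fusion(acc, step[0], step[1]),
        zip(points, traces[1:]),
        traces[0],
    )


alphas = [("a", "b", "c"), ("x", "b", "d", "e"), ("y", "d", "z")]
points = ["b", "d"]
print(normal_form(alphas, points))  # ('a', 'b', 'd', 'z')
```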
Now consider the system depicted in Figure 7. The corresponding specification is:
SPEC = “$f$ implies previously $d$”
The system may exhibit the following two traces in the absence of faults, namely
$\alpha = a \cdot b \cdot c$ and $\beta = a \cdot d \cdot e \cdot f$. In the presence of faults, a new trace is possible,
namely $\gamma = a \cdot b \cdot e \cdot f$. Observe that $\gamma$ violates SPEC and that $\gamma$ is the fusion of two
traces in SPEC, namely $a \cdot b \cdot e$ and $\beta$ (the state which plays the role of $s$ in Definition 11 is state $e$). In
such a case we say that fault model $F$ exploits the non-fusion closure of SPEC.
We now formally define what is meant by exploiting the non-fusion closure of a
specification.
Definition 13 (exploiting non-fusion closure) Let $\Sigma$ be a system, $F$ be a fault model
and SPEC be a specification which is satisfied by $\Sigma$. Then $F(\Sigma)$ exploits the non-fusion
closure of SPEC iff there exists a trace $\sigma \in F(\Sigma)$ such that $\sigma \notin \mathit{SPEC}$ and
$\sigma \in \mathit{fusion\text{-}closure}(\mathit{SPEC})$.
Figure 7 (state diagram over the states a, b, c, d, e, f): Example where the non-fusion closure of a specification is exploited by a
fault model. The specification is “$f$ implies previously $d$”.
Intuitively, exploiting the non-fusion closure means that there exists a bad computation
($\sigma \notin \mathit{SPEC}$) that can potentially “impersonate” a good computation
($\sigma \in \mathit{fusion\text{-}closure}(\mathit{SPEC})$). Definition 13 states that $F$ causes a violation of SPEC by
constructing a fusion of two (allowed) traces.
If $F$ is a fault model such that $F(\Sigma)$ exploits the non-fusion closure of SPEC,
then we also say that the non-fusion closure of SPEC is exploited for $\Sigma$ in the presence
of $F$.
Obviously, if for some specification SPEC and system $\Sigma$ such an $F$ exists, then
SPEC is not fusion closed. Similarly trivial to prove is the observation that no fault
model $F$ can exploit the non-fusion closure of a specification which is fusion closed.
On the other hand, if the non-fusion closure of SPEC cannot be exploited, this does
not necessarily mean that SPEC is fusion closed. To see this consider Figure 8. The
correctness specification SPEC of the program is “$c$ implies previously $a$”. Obviously,
a fault model can only generate traces that begin with $a$. Since $a$ is an initial state
and we assume initial state preservance, no $F$ can exploit the non-fusion closure. But
SPEC is not fusion closed.
Figure 8 (state diagram over the states a, b, c): Example where the non-fusion closure cannot be exploited but the specification
is not fusion closed. The specification is “$c$ implies previously $a$”.
4.2 Preventing the Exploitation of Non-Fusion Closure
The fact that a fault model may not exploit the non-fusion closure of a specification
will be important in our approach to solve the general fail-safe transformation problem
(Def. 8). A method to solve this problem, i.e., that of finding a fault-tolerant version
$\Sigma_2$, should be a generally applicable method which constructs $\Sigma_2$ from $\Sigma_1$ (this is
depicted in the top part of Figure 9). Instead of devising such a method from scratch, our
aim is to reuse the existing transformations to add fail-safe fault tolerance which are
based on fusion-closed specifications [11, 12]. This approach is shown in the bottom
part of Figure 9. Starting from $\Sigma_1$, we construct some intermediate program $\Sigma_2'$ and
some intermediate fusion-closed specification $\mathit{SPEC}_2$ to which we apply one of the
above mentioned methods for fusion-closed specifications [11, 12]. The construction
of $\Sigma_2'$ and $\mathit{SPEC}_2$ must be done in such a way that the resulting program satisfies the
properties of the general transformation problem stated in Definition 8. How can this
be done?
The idea of our approach is the following: First, choose $\mathit{SPEC}_2$ to be the fusion
closure of $\mathit{SPEC}_1$, i.e., choose
$$\mathit{SPEC}_2 = \mathit{fusion\text{-}closure}(\mathit{SPEC}_1)$$
and construct $\Sigma_2'$ from $\Sigma_1$ in such a way that $F(\Sigma_2')$ does not exploit the non-fusion
closure of $\mathit{SPEC}_1$. More precisely, $\Sigma_2'$ results from applying a constructive method
(which we give below) which ensures that
• $\Sigma_2'$ extends $\Sigma_1$ using some state projection $\pi$ and
• $F(\Sigma_2')$ does not exploit the non-fusion closure of $\mathit{SPEC}_1$.
Our claim, which we formally prove later, is that the program $\Sigma_2$ resulting from
applying (for example) the algorithms of [11, 12] to $\Sigma_2'$ with respect to $\mathit{SPEC}_2$ in fact
satisfies the requirements of Definition 8, i.e., $\Sigma_2$ is in fact an $F$-tolerant version of $\Sigma_1$
with respect to $\mathit{SPEC}_1$.
[Figure 9 diagram. Top: a general method maps $\Sigma_1$ (fault-intolerant w.r.t. the general specification $\mathit{SPEC}_1$) directly to $\Sigma_2$ (fault-tolerant w.r.t. $\mathit{SPEC}_1$). Bottom: this paper maps $\Sigma_1$ to $\Sigma_2'$, and the “standard” fail-safe transformation w.r.t. the fusion-closed $\mathit{SPEC}_2$ maps $\Sigma_2'$ to $\Sigma_2$.]
Figure 9: Overview of transformation problem (top) and our approach (bottom). The
constructive method described in Section 4.4 offers a solution to the first step (i.e.,
$\Sigma_1 \to \Sigma_2'$).
4.3 Bad Fusion Points
For a given system $\Sigma$ and a specification SPEC, how can we tell whether or not the
nature of SPEC is exploitable by a fault model? For the negative case (where it can
be exploited), we give a sufficient criterion. It is based on the notion of a bad fusion
point.
Definition 14 (bad fusion point) Let SPEC be a specification, $\Sigma$ be a system satisfying
SPEC, $s$ be a state of $\Sigma$, and $F$ a fault model such that $F(\Sigma)$ violates SPEC.
State $s$ is a bad fusion point of $\Sigma$ for SPEC in the presence of $F$ iff there exist traces
$\alpha, \beta \in \mathit{SPEC}$ such that
1. $s$ is a fusion point of $\alpha$ and $\beta$,
2. $\mathit{fusion}(\alpha, s, \beta) \in F(\Sigma)$, and
3. $\mathit{fusion}(\alpha, s, \beta) \notin \mathit{SPEC}$.
Intuitively, a bad fusion point is a state in which “multiple pasts” may have happened,
i.e., there may be two different execution paths passing through $s$, and from the
point of view of the specification it is important to tell the difference. We now give
an example of a bad fusion point.
Consider Fig. 7 where $e$ is a bad fusion point. To instantiate the
definition, take $\alpha = a \cdot b \cdot e \in F(\Sigma)$ and $\beta = a \cdot d \cdot e \cdot f \in F(\Sigma)$, both of which satisfy SPEC. The fusion at $e$
yields the trace $a \cdot b \cdot e \cdot f$ which is not in SPEC.
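The example can be replayed mechanically. The Python sketch below searches for bad fusion points in the sense of Definition 14 for the system of Fig. 7. It makes simplifying assumptions that are not in the paper: $F(\Sigma)$ is given as a finite, prefix-closed set of traces, and the candidate traces α and β are restricted to those traces of $F(\Sigma)$ that satisfy SPEC (the definition itself quantifies over all traces in SPEC). All function names are illustrative.

```python
from itertools import product


def fusion(alpha, s, beta):
    """alpha_pre . s . beta_post, splitting at the first occurrence of s."""
    return alpha[: alpha.index(s) + 1] + beta[beta.index(s) + 1:]


def in_spec(trace):
    """SPEC of Fig. 7: every occurrence of f is preceded by some d."""
    return all("d" in trace[:i] for i, state in enumerate(trace) if state == "f")


def prefixes(trace):
    """All non-empty prefixes of a trace."""
    return {trace[:n] for n in range(1, len(trace) + 1)}


def bad_fusion_points(faulty_traces, spec):
    """States s with alpha, beta in SPEC whose fusion is in F(Sigma) but not in SPEC."""
    candidates = [t for t in faulty_traces if spec(t)]
    bad = set()
    for alpha, beta in product(candidates, repeat=2):
        for s in set(alpha) & set(beta):
            fused = fusion(alpha, s, beta)
            if fused in (alpha, beta):
                continue  # s is not a fusion point of alpha and beta
            if fused in faulty_traces and not spec(fused):
                bad.add(s)
    return bad


# Traces of Fig. 7 in the presence of the fault transition (b, e).
maximal = [("a", "b", "c"), ("a", "d", "e", "f"), ("a", "b", "e", "f")]
faulty = set().union(*(prefixes(t) for t in maximal))
print(bad_fusion_points(faulty, in_spec))  # {'e'}
```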
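Theorem 2 (bad fusion point criterion) Let SPEC be a specification, $\Sigma$ be a system
satisfying SPEC and $F$ be a fault model. The following two statements are equivalent:
1. $\Sigma$ has no bad fusion point for SPEC in the presence of $F$.
2. $F(\Sigma)$ does not exploit the non-fusion closure of SPEC.
Figure 10: Diagram accompanying the proof of Theorem 2. (The diagram shows $\sigma$ composed of the traces $\alpha_0, \alpha_1, \ldots, \alpha_{k-1}, \alpha_k, \alpha_{k+1}$ at the fusion points $s_1, \ldots, s_k, s_{k+1}$, together with the prefix $\sigma'$ and the traces $\alpha$ and $\beta$ used in the proof.)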
PROOF SKETCH: We prove the contraposition of the theorem in both directions. First
we assume that $F(\Sigma)$ exploits the non-fusion closure and use Theorem 1 to construct
a bad fusion point. Second we prove that if there exists a bad fusion point then $F(\Sigma)$
exploits the non-fusion closure.
1 ⟨1⟩1. ASSUME: $\Sigma$ has no bad fusion point for SPEC in the presence of $F$.
PROVE: $F(\Sigma)$ does not exploit the non-fusion closure of SPEC.
1.1 ⟨2⟩1. ASSUME: $F(\Sigma)$ exploits the non-fusion closure of SPEC.
PROVE: False
1.1.1 ⟨3⟩1. Let $\sigma \in F(\Sigma)$ be the trace which exists according to Definition 13. There exists a minimal prefix $\sigma'$ of $\sigma$ which violates SPEC.
PROOF: Follows from the fact that $\sigma \notin \mathit{SPEC}$ and that SPEC is a safety property.
1.1.2 ⟨3⟩2. $\sigma'$ contains at least one fusion point.
PROOF: Since $\sigma \notin \mathit{SPEC}$ but $\sigma \in \mathit{fusion\text{-}closure}(\mathit{SPEC})$ we can apply Theorem 1
and write $\sigma$ as the fusion of traces $\alpha_i \in \mathit{SPEC}$ (see Fig. 10). If there were
no fusion point within $\sigma'$, then $\sigma'$ would be a prefix of $\alpha_0$, a contradiction to
the fact that $\alpha_0 \in \mathit{SPEC}$.
1.1.3 ⟨3⟩3. Let $s$ denote the rightmost fusion point $s_k$ in $\sigma'$ and let $\alpha$ denote the prefix
of $\sigma'$ up to and including state $s$ (see Fig. 10). Then $\alpha \in \mathit{SPEC}$.
PROOF: Follows from the fact that $\sigma'$ is minimal (i.e., proper prefixes of $\sigma'$ satisfy
SPEC, shown in step ⟨3⟩1) and the fact that $\alpha$ is a prefix of $\sigma'$.
1.1.4 ⟨3⟩4. If there exists a fusion point $s_{k+1}$ after $s_k$ in $\alpha_k$, let $\beta$ be the trace $\alpha_k$ up
to and including $s_{k+1}$ (see Fig. 10). Otherwise let $\beta$ be the trace $\alpha_k$. Then
$\beta \in \mathit{SPEC}$.
PROOF: In both cases $\beta$ is a prefix of $\alpha_k$, which is in SPEC, and so $\beta \in \mathit{SPEC}$
too.
1.1.5 ⟨3⟩5. $\mathit{fusion}(\alpha, s, \beta) \notin \mathit{SPEC}$
PROOF: Follows from the fact that $\sigma'$ is a prefix of $\mathit{fusion}(\alpha, s, \beta)$ and SPEC
is a safety property (any extension of $\sigma'$ is not in SPEC).
1.1.6 ⟨3⟩6. $\mathit{fusion}(\alpha, s, \beta) \in F(\Sigma)$
PROOF: Follows from the construction of $\alpha$ and $s$ (in step ⟨3⟩3) and $\beta$ (in step
⟨3⟩4) and the fact that $\mathit{fusion}(\alpha, s, \beta)$ is a prefix of $\sigma$ which is in $F(\Sigma)$.
1.1.7 ⟨3⟩7. $s$ is a bad fusion point for $\Sigma$ in the presence of $F$.
PROOF: Steps ⟨3⟩3 and ⟨3⟩4 exhibit traces $\alpha$ and $\beta$ which are both in SPEC.
Step ⟨3⟩6 shows that their fusion at state $s$ is in $F(\Sigma)$. Finally, step ⟨3⟩5 shows
that this fusion is not in SPEC. From Definition 14 it follows that $s$ is a bad
fusion point for $\Sigma$ in the presence of $F$.
1.1.8 ⟨3⟩8. Q.E.D.
PROOF: Step ⟨3⟩7 contradicts the assumption that $\Sigma$ has no bad fusion point in
the presence of $F$.
1.2 ⟨2⟩2. Q.E.D.
PROOF: Follows indirectly from step ⟨2⟩1.
2 ⟨1⟩2. ASSUME: $F(\Sigma)$ does not exploit the non-fusion closure of SPEC.
PROVE: $\Sigma$ has no bad fusion point for SPEC in the presence of $F$.
2.1 ⟨2⟩1. ASSUME: $\Sigma$ has a bad fusion point for SPEC in the presence of $F$.
PROVE: False
2.1.1 ⟨3⟩1. There exists a trace $\sigma$ in $F(\Sigma)$ such that $\sigma \notin \mathit{SPEC}$ and $\sigma$ is the fusion of
two traces $\alpha$ and $\beta$ in SPEC at some state $s$.
PROOF: From the assumption of step ⟨2⟩1 and Definition 14.
2.1.2 ⟨3⟩2. The non-fusion closure of SPEC can be exploited for $\Sigma$.
PROOF: From step ⟨3⟩1 and the definition of exploits (Definition 13).
2.1.3 ⟨3⟩3. Q.E.D.
PROOF: Step ⟨3⟩2 contradicts the assumption of step ⟨1⟩2.
2.2 ⟨2⟩2. Q.E.D.
PROOF: Follows indirectly from step ⟨2⟩1.
3 ⟨1⟩3. Q.E.D.
PROOF: The two top level steps show both directions of the equivalence.
4.4 Removal of Bad Fusion Points
Theorem 2 states that it is both necessary and sufficient to remove all bad fusion points
from Σto make its structure robust against fault models that exploit the non-fusion
closure of SPEC . So how can we get rid of bad fusion points?
Recall that a bad fusion point is one which has multiple pasts, and from the point
of view of the specification, it is necessary to distinguish between those pasts. Thus,
the basic idea of our method is to introduce additional states which split the fusion
paths. This is sketched in Figure 11. Let $\Sigma_1 = (C_1, I_1, T_1)$ be a system. If $s$ is a bad
fusion point of $\Sigma_1$ for SPEC, there exists a trace $\beta \in \mathit{SPEC}$ and a trace $\alpha \in F(\Sigma_1)$
which both go through $s$.
Constructive Method to Remove Bad Fusion Points: To remove bad fusion points,
we now construct an extension $\Sigma_2 = (C_2, I_2, T_2)$ of $\Sigma_1$ in the following way:
• $C_2 = C_1 \cup \{s'\}$ where $s'$ is a “new” state,
• $I_2 = I_1$, and
• $T_2$ results from $T_1$ by “diverting” the transitions of $\beta$ to and from $s'$ instead of $s$.
The extension is completed by defining the state projection function $\pi$ to map $s'$ to $s$.
Observe that $s$ is not a bad fusion point regarding $\alpha$ and $\beta$ anymore because $\alpha$ now
contains $s$ and $\beta$ a different state $s'$ which cannot be fused. So this procedure gets rid
of one bad fusion point. Also, it does not by itself introduce a new one, since $s'$ is an
extension state which cannot be referenced in SPEC. So we can repeatedly apply the
procedure and incrementally build a sequence of extensions $\Sigma_1, \Sigma_2, \ldots$ where in every
step one bad fusion point is removed and an additional state is added. However, $F$
may cause new bad fusion points to be created during this process by introducing new
fault transitions defined on the newly added states. But since the fault model is finite it
will do this only finitely often. Hence, repeating this construction for every bad fusion
point will terminate unless there are infinitely many bad fusion points. This, however,
is impossible if the state space is finite.
Note that in the extension process, certain states can be extended multiple times
because they might be bad fusion points for different combinations of traces.
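One step of the constructive method can be sketched in Python for the system of Fig. 7. This is a simplified illustration under assumptions that are not in the paper: the program is an acyclic set of transitions with a single initial state, the fault model is a fixed set of extra transitions that is not extended to the new state, and the helper names `split_state`, `maximal_traces` and `project` are hypothetical.

```python
def split_state(transitions, beta, s, s_new):
    """Divert the transitions of trace beta into and out of s through s_new."""
    i = beta.index(s)
    diverted = set(transitions)
    if i > 0:
        diverted.discard((beta[i - 1], s))
        diverted.add((beta[i - 1], s_new))
    if i + 1 < len(beta):
        diverted.discard((s, beta[i + 1]))
        diverted.add((s_new, beta[i + 1]))
    return diverted


def maximal_traces(initial, transitions):
    """All maximal paths from the initial state (assumes an acyclic system)."""
    succ = {}
    for u, v in transitions:
        succ.setdefault(u, []).append(v)
    found = set()

    def walk(path):
        nxt = succ.get(path[-1], [])
        if not nxt:
            found.add(tuple(path))
        for v in nxt:
            walk(path + [v])

    walk([initial])
    return found


def project(trace, pi):
    """Apply the state projection pi (identity on unmapped states)."""
    return tuple(pi.get(state, state) for state in trace)


program = {("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "f")}
fault = {("b", "e")}
before = maximal_traces("a", program | fault)  # contains ('a', 'b', 'e', 'f')

program2 = split_state(program, ("a", "d", "e", "f"), "e", "e'")
pi = {"e'": "e"}
after = maximal_traces("a", program2 | fault)
print(sorted(after))  # the bad trace a.b.e.f is gone; a.b.e now stops at e
```

In the absence of faults, the projected traces of the extended program coincide with those of the original program, which is the property required of an extension.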
Figure 11: Splitting fusion paths. (Left: the traces $\alpha$ and $\beta$ both pass through $s$. Right: $\alpha$ passes through $s$ while $\beta$ is diverted through the new state $s'$.)
We now prove that the above method results in a program with the desired properties.
Lemma 3 Let $F$ be a fault model, $\mathit{SPEC}_1$ be a non-fusion closed specification, and $\Sigma_1$
be a program such that $\Sigma_1$ satisfies $\mathit{SPEC}_1$ but $F(\Sigma_1)$ violates $\mathit{SPEC}_1$. The program
$\Sigma_2'$ which results from applying the constructive method described above satisfies the
following properties:
1. $\Sigma_2'$ extends $\Sigma_1$ using some state projection $\pi$ and
2. $F(\Sigma_2')$ does not exploit the non-fusion closure of $\mathit{SPEC}_1$.
PROOF SKETCH: To show the first point we argue that there exists a projection function
$\pi$ (which is induced by our method) such that every fault-free execution of $\Sigma_2'$ is an
execution of $\Sigma_1$. To show the second point, we argue that the method removes all bad
fusion points and apply the bad fusion point criterion proved as Theorem 2.
1 ⟨1⟩1. The induced projection function $\pi$ of the constructive method above is such that
$\Sigma_2'$ extends $\Sigma_1$ using $\pi$.
1.1 ⟨2⟩1. For every state $s$ of $\Sigma_1$ there exists a $\pi$-image $s'$ in the state space of $\Sigma_2'$.
PROOF: The constructive method starts off with the state space of $\Sigma_2'$ being
equal to the state space of $\Sigma_1$ and any subsequent changes to $\pi$ do not affect this
initial mapping.
1.2 ⟨2⟩2. Consider an arbitrary fault-free execution $\sigma' = s'_1, s'_2, \ldots$ of $\Sigma_2'$. Then $\pi(\sigma')$
is an execution of $\Sigma_1$.
PROOF: Looking at Figure 11, every execution $\sigma'$ of $\Sigma_2'$ evolves from an execution
of $\Sigma_1$ by splitting fusion paths and adapting $\pi$ appropriately. Therefore, under the
projection function $\pi$ both executions look the same. Formally, this is proved
using an induction on the length of the execution.
1.3 ⟨2⟩3. Q.E.D.
PROOF: Steps ⟨2⟩1 and ⟨2⟩2 prove the two conditions of Definition 3 (extension)
with respect to the projection function $\pi$. Hence, $\Sigma_2'$ extends $\Sigma_1$ using $\pi$.
2 ⟨1⟩2. $F(\Sigma_2')$ does not exploit the non-fusion closure of $\mathit{SPEC}_1$.
2.1 ⟨2⟩1. $\Sigma_2'$ has no bad fusion point in the presence of $F$.
PROOF: This is a result from applying the constructive method. Because all fusion
paths are split, no bad fusion points remain.
2.2 ⟨2⟩2. Q.E.D.
PROOF: Because of step ⟨2⟩1 we can apply the bad fusion point criterion (Theorem 2)
which shows that the non-fusion closure of $\mathit{SPEC}_1$ cannot be exploited for
$\Sigma_2'$ in the presence of $F$.
3 ⟨1⟩3. Q.E.D.
PROOF: The above two steps show the two consequents of the lemma.
4.5 Correctness of the Combined Method
Starting from a program $\Sigma_1$, Lemma 3 shows that the program $\Sigma_2'$ resulting from
the constructive method for removing bad fusion points enjoys certain properties (see
Fig. 9). We now prove that starting off from these properties and choosing $\mathit{SPEC}_2$
as the fusion closure of $\mathit{SPEC}_1$, the program $\Sigma_2$, which results from applying the
algorithms of [11, 12] on $\Sigma_2'$, has the desired properties of the transformation problem
(Definition 8).
Lemma 4 Given $F$, $\mathit{SPEC}_1$, and $\Sigma_1$ as in Lemma 3, let $\mathit{SPEC}_2 = \mathit{fusion\text{-}closure}(\mathit{SPEC}_1)$
and let $\Sigma_2$ be the result of applying any of the known methods that solve the fusion-closed
transformation problem of Definition 9 to $\Sigma_2'$ with respect to $F$ and $\mathit{SPEC}_2$,
where $\Sigma_2'$ results from $\Sigma_1$ through the application of the constructive method. Then
the following statements hold:
1. $\Sigma_2$ extends $\Sigma_1$ using some state projection $\pi$.
2. If $F(\Sigma_2)$ satisfies $\mathit{SPEC}_2$ then $F(\Sigma_2)$ satisfies $\mathit{SPEC}_1$.
PROOF SKETCH: To prove the first point we argue that a fault tolerance addition procedure
only removes non-reachable transitions. Hence, every fault-free execution of $\Sigma_2'$
is also an execution of $\Sigma_2$. But since $\Sigma_2'$ extends $\Sigma_1$ so must $\Sigma_2$. To show the second
point we first observe that $F(\Sigma_2')$ does not necessarily satisfy $\mathit{SPEC}_1$, but that the traces
responsible for this are no longer all contained in $F(\Sigma_2)$ (due to the removal of bad transitions during addition
of fault tolerance). Next we show that any trace of $F(\Sigma_2)$ which violates $\mathit{SPEC}_1$ must
exploit the non-fusion closure of $\mathit{SPEC}_1$. But this must also be a trace of $F(\Sigma_2')$ and
so is ruled out by assumption.
1 ⟨1⟩1. If $\Sigma_2'$ extends $\Sigma_1$ using state projection $\pi'$ then $\Sigma_2$ extends $\Sigma_1$ using state projection $\pi$.
1.1 ⟨2⟩1. Application of the known methods to add fail-safe fault tolerance according
to Definition 9 does not change the fault-free behavior of that system.
PROOF: For the methods of Kulkarni and Arora [12] and Jhumka et al. [11] this
has been discussed in Section 3.
1.2 ⟨2⟩2. Every (fault-free) execution of $\Sigma_2'$ is also a (fault-free) execution of $\Sigma_2$ and
vice versa.
PROOF: Follows from step ⟨2⟩1 and the fact that $\Sigma_2$ results from $\Sigma_2'$ by applying
the fail-safe-tolerance transformation (see Fig. 9).
1.3 ⟨2⟩3. Every execution of $\Sigma_2'$ under $\pi'$ is an execution of $\Sigma_1$.
PROOF: Follows from the assumption that $\Sigma_2'$ extends $\Sigma_1$ using $\pi'$.
1.4 ⟨2⟩4. Every execution of $\Sigma_2$ is also an execution of $\Sigma_1$ under $\pi'$ and vice versa.
PROOF: Starting with an arbitrary execution $\sigma$ of $\Sigma_2$, step ⟨2⟩2 allows us to find an
equivalent execution $\sigma'$ of $\Sigma_2'$. Then for $\sigma'$, step ⟨2⟩3 allows us to find an equivalent
execution $\sigma''$ of $\Sigma_1$.
1.5 ⟨2⟩5. Q.E.D.
PROOF: Step ⟨2⟩4 allows us to construct a state projection function such that the
safety properties of $\Sigma_1$ and $\Sigma_2$ are identical. Hence, $\Sigma_2$ extends $\Sigma_1$.
2 ⟨1⟩2. ASSUME: 1. $F(\Sigma_2')$ does not exploit the non-fusion closure of $\mathit{SPEC}_1$.
2. $F(\Sigma_2)$ satisfies $\mathit{SPEC}_2$.
PROVE: $F(\Sigma_2)$ satisfies $\mathit{SPEC}_1$.
2.1 ⟨2⟩1. All executions $\sigma$ of $F(\Sigma_2')$ that violate $\mathit{SPEC}_1$ are not in $F(\Sigma_2)$.
PROOF: This follows from applying a fail-safe tolerance transformation procedure,
such as those in [11, 12]. Since these procedures are proved to be sound,
i.e., the resulting programs are indeed fail-safe fault-tolerant, no execution
can violate the specification.
2.2 ⟨2⟩2. $\forall \sigma \in F(\Sigma_2) : \sigma \in \mathit{SPEC}_2$
PROOF: Follows directly from the second assumption, i.e., $F(\Sigma_2)$ satisfies $\mathit{SPEC}_2$.
2.3 ⟨2⟩3. $\forall \sigma \in F(\Sigma_2) : \sigma \in F(\Sigma_2')$
PROOF: The known fail-safe tolerance transformation procedures that solve Definition 9
guarantee that $F(\Sigma_2) \subseteq F(\Sigma_2')$, from which this step follows.
2.4 ⟨2⟩4. $F(\Sigma_2)$ does not exploit the non-fusion closure of $\mathit{SPEC}_1$.
PROOF: For a contradiction, assume that there is an execution $\tau \in F(\Sigma_2)$ that
exploits the non-fusion closure of $\mathit{SPEC}_1$. Since $\tau \in F(\Sigma_2)$, from step ⟨2⟩3 we have
that $\tau \in F(\Sigma_2')$. Hence, $F(\Sigma_2')$ also exploits the non-fusion closure of $\mathit{SPEC}_1$, a
contradiction to assumption 1.
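2.5 ⟨2⟩5. $\forall \sigma \in F(\Sigma_2) : \sigma \in \mathit{SPEC}_1$
2.5.1 ⟨3⟩1. ASSUME: $\sigma \in \Sigma_2$
PROVE: Q.E.D.
PROOF: Since $\sigma \in \Sigma_2$ and $\Sigma_2$ extends $\Sigma_1$ we have that $\sigma \in \Sigma_1$. But since $\Sigma_1$
satisfies $\mathit{SPEC}_1$ we conclude that $\sigma \in \mathit{SPEC}_1$.
2.5.2 ⟨3⟩2. ASSUME: $\sigma \in F(\Sigma_2) \setminus \Sigma_2$
PROVE: Q.E.D.
PROOF: First note that $\sigma$ cannot be in $\mathit{fusion\text{-}closure}(\mathit{SPEC}_1) \setminus \mathit{SPEC}_1$ (follows
from step ⟨2⟩4). But since $\mathit{fusion\text{-}closure}(\mathit{SPEC}_1) = \mathit{SPEC}_2$ and since $F(\Sigma_2)$
satisfies $\mathit{SPEC}_2$ we have that $\sigma$ must be in $\mathit{SPEC}_1$.
2.5.3 ⟨3⟩3. Q.E.D.
PROOF: Follows from steps ⟨3⟩1 and ⟨3⟩2 and the fact that they cover all
cases.
2.6 ⟨2⟩6. Q.E.D.
PROOF: Step ⟨2⟩5 shows that $F(\Sigma_2)$ satisfies $\mathit{SPEC}_1$ which is what we wanted
to prove.
3 ⟨1⟩3. Q.E.D.
PROOF: Steps ⟨1⟩1 and ⟨1⟩2 prove the first and second point of the lemma, respectively.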
Lemmas 3 and 4 together guarantee that the composition of the method described
in Section 4.4 and the fail-safe transformation methods for fusion-closed specifications
in fact solves the transformation problem for non-fusion closed specifications of
Definition 8.
Theorem 3 Let $F$ be a fault model and $\Sigma_1$ a program which is $F$-intolerant with respect
to a non-fusion closed specification $\mathit{SPEC}_1$. The composition of the constructive
method described in Section 4.4 and the fail-safe transformation methods for fusion-closed
specifications solves the general transformation problem of Definition 8, i.e.,
constructs a program $\Sigma_2$ such that $\Sigma_2$ extends $\Sigma_1$ and $F(\Sigma_2)$ satisfies $\mathit{SPEC}_1$.
4.6 Examples
Finally, we present two examples of the application of our method. The top of Figure 12
(system 1) shows the original system. The augmented system is depicted at the
bottom (system 4). The correctness specification for the system is “($d$ implies previously
$b$) and ($e$ implies previously $c$)”. There are only two bad fusion points, namely $c$
and $d$, which have to be extended. In the first step, $c$ is “removed” by splitting the fusion
path, which is indicated using two short lines. This results in system 2. Subsequently,
$d$ is refined, resulting in system 3. Note that $d$ has to be refined twice because there are
two sets of fusion paths. This results in system 4, which can be subject to the standard
fail-safe transformation methods, which will remove the transitions $(c, d'')$ and $(d, e)$.
Figure 12 (four stages: system 1 over the states a, b, c, d, e; system 2 adds $c'$; system 3 adds $d'$; system 4 adds $d''$): Removing bad fusion points. The specification is “($d$ implies previously $b$)
and ($e$ implies previously $c$)”.
A similar, yet more complex example is shown in Figure 13. The correctness
specification for system 1 at the top is “$g$ implies previously ($b$ or $e$)”. The figure
shows that again a “two level” extension is necessary here, since the only execution
which must be prevented is the one which uses both fault transitions. This means that
state $f$ is a bad fusion point for multiple execution paths and hence must be refined
twice (note that the fault transition $(d, f)$ is a new fault added to the system in the
extension).
4.7 Discussion
The complexity of our method directly depends on the number of bad fusion points
which have to be removed. Bad fusion points are not hard to find if the specification
is given as a temporal logic formula in the spirit of those used throughout this paper.
For example, if specifications are given in the form “$x$ only if previously $y$”, then only
states which occur in traces between $x$ and $y$ can be fusion points. Candidates for bad
fusion points are all states where two execution paths merge.
Our method requires checking, for every one of these states, whether it is a bad fusion
point. So obviously, applying our method induces a larger overhead than directly
adding history variables. But as can be seen in Figs. 12 and 13, the number of states is
significantly smaller than when adding a general history variable. For example, a clever addition
of history variables to the system in Fig. 12 would require two bits, one to record the
visit to state $b$ and one to record the visit to $c$. Overall this would result in $2 \times 2 \times 5 = 20$
states. Our method achieves the same result with a total of 8 states. The system in
Fig. 13 could employ a boolean history variable which records whether states $b$ or $e$
have been visited (it is set to true as soon as one of these states is reached). Adding
such a variable would create a total of 7 additional states. Our method adds just 5.
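The state counts quoted above can be reproduced with a few lines of Python. The sketch assumes that a history variable with n boolean bits multiplies the state space by 2^n, and it takes the numbers of split states (three for Fig. 12, five for Fig. 13) from the figures; the function name is illustrative.

```python
def states_with_history_bits(num_states: int, num_bits: int) -> int:
    """Size of the product state space after adding boolean history bits."""
    return num_states * 2 ** num_bits


# Fig. 12: five original states, two history bits versus three split states.
fig12_history_total = states_with_history_bits(5, 2)
fig12_split_total = 5 + 3  # c', d', d''

# Fig. 13: seven original states, one history bit versus five split states.
fig13_history_added = states_with_history_bits(7, 1) - 7
fig13_split_added = 5      # c', d', e', f', f''

print(fig12_history_total, fig12_split_total)  # 20 8
print(fig13_history_added, fig13_split_added)  # 7 5
```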
Note however that the resulting system in Fig. 12 is not redundancy minimal. The
state $d''$ is not necessary since it may become unreachable even in the presence of
faults after the fail-safe transformation is applied. This is the price we still have to pay
for the modularity of our approach, i.e., adding history states does at present not “look
ahead” to see which states might become unreachable even in the presence of faults.
In theory there are cases where our method of adding history states does not termi-
nate because there are infinitely many bad fusion points. For this to happen, the state
space must be infinite. If we consider the application area of embedded software, we
can safely assume a bounded state space.
Given a program Σand a general specification SPEC , then our combined method
will find a solution to the general transformation problem iff (a) there exists one with a
finite number of additional states and (b) the method of adding fail-safe fault-tolerance
for fusion-closed specifications is complete. Requirement (a) ensures that our method
of removing bad fusion points will terminate.
Figure 13 (six stages: system 1 over the states a, b, c, d, e, f, g; the subsequent systems successively add $c'$, $d'$, $e'$, $f'$ and $f''$): A more complex example. The specification is “$g$ implies previously ($b$ or
$e$)”.
5 Conclusions
In this paper, we have presented a way to get rid of a restriction upon which
procedures that add fault tolerance [11, 12] are based, namely that specifications have
to be fusion closed. Our method can be viewed as a finer grained method to add
history information to a given system and hence add state space redundancy. We have
shown that our method in general adds fewer history states than would be added using
standard history variables (which in general lead to an exponential growth of the state
space). Thus, adding state redundancy using the approach presented in this paper
makes addition of fault tolerance more efficient.
As future work, it would be interesting to combine our method with one of the
methods to add detectors so that the resulting method is redundancy minimal. We
are also investigating issues of non-masking fault-tolerance, i.e., adding tolerance with
respect to liveness properties.
Acknowledgments
We wish to thank Sandeep Kulkarni for helpful discussions. Work by the first author
was supported by Deutsche Forschungsgemeinschaft (DFG) as part of “Graduiertenkolleg
ISIA” and the Emmy Noether programme.
References
[1] Martín Abadi and Leslie Lamport. The existence of refinement mappings.
Theoretical Computer Science, 82(2):253–284, May 1991.
[2] Bowen Alpern and Fred B. Schneider. Defining liveness. Information Processing
Letters, 21:181–185, 1985.
[3] Anish Arora and Sandeep S. Kulkarni. Component based design of multitoler-
ant systems. IEEE Transactions on Software Engineering, 24(1):63–78, January
1998.
[4] Anish Arora and Sandeep S. Kulkarni. Detectors and correctors: A theory of
fault-tolerance components. In Proceedings of the 18th IEEE International Con-
ference on Distributed Computing Systems (ICDCS98), May 1998.
[5] Anindya Basu, Bernadette Charron-Bost, and Sam Toueg. Simulating reliable
links with unreliable links in the presence of process crashes. In Proceedings
of the 10th International Workshop on Distributed Algorithms (WDAG96), pages
105–122, Bologna, Italy, October 1996. Springer-Verlag.
[6] K. Mani Chandy and Jayadev Misra. Parallel Program Design: A Foundation.
Addison-Wesley, Reading, MA, 1988.
[7] Edsger W. Dijkstra. Guarded commands, nondeterminacy, and formal derivation
of programs. Communications of the ACM, 18(8):453–457, August 1975.
[8] Felix C. Gärtner. Transformational approaches to the specification and verification
of fault-tolerant systems: Formal background and classification. Journal
of Universal Computer Science (J.UCS), 5(10):668–692, October 1999. Special
Issue on Dependability Evaluation and Assessment.
[9] Felix C. Gärtner and Hagen Völzer. Redundancy in space in fault-tolerant systems.
Technical Report TUD-BS-2000-06, Department of Computer Science,
Darmstadt University of Technology, Darmstadt, Germany, July 2000.
[10] H. Peter Gumm. Another glance at the Alpern-Schneider characterization of
safety and liveness in concurrent executions. Information Processing Letters,
47(6):291–294, 1993.
[11] Arshad Jhumka, Felix C. Gärtner, Christof Fetzer, and Neeraj Suri. On systematic
design of fast and perfect detectors. Technical Report 200263, Swiss Federal
Institute of Technology (EPFL), School of Computer and Communication Sciences,
Lausanne, Switzerland, September 2002.
[12] Sandeep S. Kulkarni and Anish Arora. Automating the addition of fault-
tolerance. In Mathai Joseph, editor, Formal Techniques in Real-Time and Fault-
Tolerant Systems, 6th International Symposium (FTRTFT 2000) Proceedings,
number 1926 in Lecture Notes in Computer Science, pages 82–93, Pune, India,
September 2000. Springer-Verlag.
[13] Leslie Lamport. Proving the correctness of multiprocess programs. IEEE Trans-
actions on Software Engineering, 3(2):125–143, March 1977.
[14] Leslie Lamport. How to write a proof. American Mathematical Monthly,
102(7):600–608, August/September 1995.
[15] Zhiming Liu and Mathai Joseph. Specification and verification of fault-tolerance,
timing and scheduling. ACM Transactions on Programming Languages and Sys-
tems, 21(1):46–89, 1999.
[16] Heiko Mantel and Felix C. Gärtner. A case study in the mechanical verification
of fault tolerance. Journal of Experimental & Theoretical Artificial Intelligence
(JETAI), 12(4):473–488, October 2000.
[17] Fred B. Schneider. Implementing fault-tolerant services using the state machine
approach: A tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
... The example is one of those presented by Gärtner and Jhumka [13] in an extended Technical Report of their conference paper [6]. The example is not fully synthesized by Gärtner and Jhumka [13]. Here we finish the synthesis of this example. ...
Conference Paper
Synthesizing fault-tolerant systems from fault-intolerant systems simplifies design of fault-tolerance. Arora and Kulkarni developed a method and a tool to synthesize fault-tolerance under the assumption that specifications are not history-dependent (fusion-closed). Later, Gartner and Jhumka removed this assumption by presenting a modular extension of the Arora-Kulkarni method. This paper presents an implementation of the Gartner-Jhumka method which is evaluated on several examples. As additional safety net, we have added automatic verification of the results using the model checker Spin. In the context of this work, a fault in the Gartner-Jhumka method has been found. Though this fault is rare and does not cause incorrect results, there might be no result at all
... Fusionclosed specifications are non-restrictive in the sense that every specification which is not fusion-closed can be transformed into an equivalent fusion-closed specification by adding history variables. It was further shown how this transformation can be efficiently done in [6]. Alpern and Schneider [1] have shown that every specification can be written as the intersection of a safety specification and a liveness specification. ...
... Calculating these transitions can be achieved in polynomial time in the size of the state space of the program. Gaertner and Jhumka [6] showed how to circumvent the problem of requiring fusionclosed specification to minimize the expansion of the state space due to the fusion closure requirement. Also, note that the notion of satisfies deals with infinite computations , whereas maintains deals with computation prefixes. ...
Conference Paper
Full-text available
The design of a fault-tolerant program is known to be an inherently difficult task. Decisions taken during the design process will invariably have an impact on the efficiency of the resulting fault-tolerant program. In this paper, we focus on two such decisions, namely (i) the class of faults the program is to tolerate, and (ii) the variables that can be read and written. The impact these design issues have on the overall fault tolerance of the system needs to be well-understood, failure of which can lead to costly redesigns. For the case of understanding the impact of fault classes on the efficiency of fail-safe fault tolerance, we show that, under the assumption of a general fault model, it is impossible to preserve the original behavior of the fault-intolerant program. For the second problem of read and write constraints of variables, we again show that it is impossible to preserve the original behavior of the fault-intolerant program. We analyze the reasons that lead to these impossibility results, and suggest possible ways of circumventing them.
... Based on the work of Kulkarni and Arora, Gärtner and Jhumka propose a way to deal also with non fusion closed traces [8]. A specification is fusion closed iff the entire history of every trace is present in every state of the trace (hence the next state of the systems depends only on its current state and on the inputs, i.e., not on the sequence of previous events). ...
Conference Paper
Full-text available
Tolerating the value failures of sensors is an important problem in automated control processes and plants. In this paper, we address this problem in a theoretical framework in order to demonstrate the feasibility of an automatic method based on discrete controller synthesis. We consider a fault-intolerant program whose job is to control an automated process, here a liquid tank equipped with level sensors that can be subject to value faults. This fault-intolerant program is modeled as a finite labeled transition system. We then specify formally a fault hypothesis, i.e., how many sensors can fail simultaneously. We use discrete controller synthesis to obtain automatically a program, having the same behavior as the initial fault-intolerant one, and satisfying the fault tolerance requirements under the fault hypothesis. We advocate that, thanks to the use of discrete controller synthesis, our method offers flexibility, reliability, separation of concern, and it is automatic.
Article
Travaux de recherches effectués de 1995 à 2006, successivement au sein des équipes MEIJE (INRIA Sophia-Antipolis / Centre de Mathématiques Appliquées de l'Ecole des Mines de Paris), PTOLEMY et PATH (département EECS de l'Université de Californie à Berkeley), et enfin BIP et POP ART (INRIA Rhône-Alpes).
Article
Existing algorithms for automated model repair for adding fault-tolerance to fault-intolerant models incur an impediment that designers have to identify the set of legitimate states of the original model. This set determines states from where the original model meets its specification in the absence of faults. Experience suggests that of the inputs required for model repair, identifying such legitimate states is the most difficult. In this paper, we consider the problem of automated model repair for adding fault-tolerance where legitimate states are not explicitly given as input. We show that without this input, in some instances, the complexity of model repair increases substantially (from polynomial-time to NP-complete). In spite of this increase, we find that this formulation is relatively complete; i.e., if it was possible to perform model repair with explicit legitimate states, then it is also possible to do so without the explicit identification of the legitimate states. Finally, we show that if the problem of model repair can be solved with explicit legitimate states, then the increased cost of solving it without explicit legitimate states is very small. In summary, the results in this paper identify instances of automated addition of fault-tolerance, where the explicit knowledge of legitimate state is beneficial and where it is not very crucial.
Conference Paper
We focus on the problem of multi-graceful degradation. In multi-graceful degradation, the system provides successively reduced guarantees in the presence of increasingly severe faults. We present an automated technique for generation of a multi-graceful-degraded program from its original fault-intolerant/ideal version. In this algorithm, we begin with (1) an ideal program that satisfies all its specification in the absence of faults, (2) a set of faults that need to be tolerated and (3) reduced requirements in their presence. We subsequently generate several gracefullly degrading programs that only satisfy the reduced requirements. This step also identifies new states to which program needs to recover to satisfy the reduced specification. Subsequently, we utilize the original input program and the generated programs that ensures that (1) in the absence of faults, the entire specification is satisfied and (2) in the presence of faults, the program recovers to states from where the corresponding reduced specification is satisfied. We illustrate our technique with a case study of a system in the fuelcell lab of the Ohio Coal Research Center (OCRC). In this system, it is important to satisfy safety of lab personnel as well as safety of people in the building in which it is located. Moreover, in case of device failures, it is necessary to provide weaker guarantees that capture the best possible protection. In our example, we begin with an ideal model for this system and successively add multi-graceful degradation to obtain the same program (with some abstractions) as the one that was designed manually for this system.
Conference Paper
To keep pace with today's nano-technology, safety critical embedded systems are becoming less tolerant to errors. Research into techniques to cope with errors in these systems has mostly focused on transformational approach, replication of hardware devices, parallel program design, component based design and/or information redundancy. It would be better to tackle the issue early in the design process that a safety critical system never fails to satisfy its strict dependability requirements. A novel method is outlined in this paper that proposes an efficient approach to synthesize safety critical systems. The proposed method outperforms dominant existing work by introducing the technique of run time detection and completion of proper execution of the system in presence of faults.
Article
Automotive and avionics systems are complex, distributed, software-intensive systems-of-systems (SoS). Consequently, system integration is a central challenge in both domains. Important cross-cutting requirements aspects, such as security, authorization, and failure management, are best understood as properties of the interplay among sub-systems. Yet, traditional development processes address the integration challenge only late, at the level of implementation and deployment. Consequently, potentials for reuse within and across product lines are left unrealized. Furthermore, late integration leads to high calibration, configuration and redesign costs. Service-Oriented Architectures (SOAs) have emerged as a solution to the integration challenge. However, inappropriate application of SOA-principles results in a high degree of fragmentation and scattering of functionality-this leads to additional difficulties in requirements traceability and quality assurance. In this article, we give a comprehensive overview of these SOA-challenges, and present Rich Services as a hierarchical SOA blueprint and development process enabling SoS integration in a dependable way. Rich Services introduce services as hierarchical, partial interaction patterns; these interactions are then augmented with infrastructure elements to inject behaviors that address cross-cutting requirements aspects. Rich Services also seamlessly address the mapping from logical to deployment architectures. Using end-to-end failure management as an example, we illustrate the utility of Rich Services.
Article
A method of writing proofs is proposed that makes it much harder to provethings that are not true. The method, based on hierarchical structuring, issimple and practical.
Article
In order to derive a result such as the Alpern-Schneider theorem characterizing safety and liveness properties of concurrent program executions, it is shown that all that is needed is a ∨-preservingmap ϕ between complete Boolean algebras. Every property becomes a conjunction of a safety and a liveness property and safety properties can be characterized by sets of configurations that are to be “avoided”.Aside from the original result of B. Alpern and F.B. Schneider we also provide a new application by considering transition systems with a UNITY-style logic. Safety properties are characterized by a set of forbidden pairs of successive states and progress properties are those allowing all possible state-successor pairs. Every property of a transition system is shown to be a conjunction of a safety and a progress property.
Article
A formal definition for liveness properties is proposed. It is argued that this definition captures the intuition that liveness properties stipulate that ‘something good’ eventually happens during execution. A topological characterization of safety and liveness is given. Every property is shown to be the intersection of a safety property and a liveness property.
Conference Paper
Abstract To date, there is little evidence that modular reasoning about fault-tolerant systems can simplify the verification proce ss in practice. We study this question using a prominent,example from the fault tolerance literature: the problem of reli able broadcast in point-to-point networks,opposed,to crash failures of processes. The experiences from this case study show how,modular,specification techniques and rigorous proof reuse can indeed help in such undertakings.
Article
The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models—Byzantine and fail stop. Systems reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.
Article
Proving that a program suits its specification and thus can be called correct has been a research subject for many years resulting in a wide range of methods and formalisms. However, it is a common experience that even systems which have been proven correct can fail due to physical faults occurring in the system. As computer programs control an increasing part of todays critical infrastructure, the notion of correctness has been extended to fault tolerance, meaning correctness in the presence of a certain amount of faulty behavior of the environment. Formalisms to verify fault-tolerant systems must model faults and faulty behavior in some form or another. Common ways to do this are based on a notion of transformation either at the program or the specification level. We survey the wide range of formal methods to verify fault-tolerant systems which are based on some form of transformation. Our aim is to classify these methods, relate them to one another and, thus, structure the area. We hope that this might faciliate the involvement of researchers into this interesting field of computer science.
Article
Refinement mappings are used to prove that a lower-level specification correctly implements a higher-level one. We consider specifications consisting of a state machine (which may be infinite- state) that specifies safety requirements, and an arbitrary supplementary property that specifies liveness requirements. A refinement mapping from a lower-level specification S1 to a higher-level one S2 is a mapping from S1's state space to S2's state space. It maps steps of S1's state machine to steps of S2's state machine and maps behaviors allowed by S1 to behaviors allowed by S2. We show that, under reasonable assumptions about the specification, if S1 implements S2, then by adding auxiliary variables to S1 we can guarantee the existence of a refinement mapping. This provides a completeness result for a practical, hierarchical specification method.
Article
The inductive assertion method is generalized to permit formal, machine-verifiable proofs of correctness for multiprocess programs. Individual processes are represented by ordinary flowcharts, and no special synchronization mechanisms are assumed, so the method can be applied to a large class of multiprocess programs. A correctness proof can be designed together with the program by a hierarchical process of stepwise refinement, making the method practical for larger programs. The resulting proofs tend to be natural formalizations of the informal proofs that are now used.
Article
So-called “guarded commands” are introduced as a building block for alternative and repetitive constructs that allow nondeterministic program components for which at least the activity evoked, but possibly even the final state, is not necessarily uniquely determined by the initial state. For the formal derivation of programs expressed in terms of these constructs, a calculus will be be shown.