
Automating the Addition of Fail-Safe Fault-Tolerance: Beyond Fusion-Closed Specifications

January 2004
Lecture Notes in Computer Science 3253

Authors:

University of Leeds

The tolerance theory by Arora and Kulkarni views a fault-tolerant program as the composition of a fault-intolerant program and fault tolerance components called detectors and correctors. At its core, the theory assumes that the correctness specifications under consideration are fusion closed. In general, fusion closure of specifications can be achieved by adding history variables to the program. However, addition of history variables causes an exponential growth of the state space of the program. To redress this problem, we present a method which can be used to add history information to a program in a way that (in a certain sense) minimizes the additional states. Hence, automated methods that add fault tolerance can now be efficiently applied in environments where specifications are not fusion closed.



Automating the Addition of Fail-Safe Fault-Tolerance: Beyond Fusion-Closed Specifications

Felix C. Gärtner

École Polytechnique Fédérale de Lausanne (EPFL)

Département de Systèmes de Communications

Laboratoire de Programmation Distribuée

CH-1015 Lausanne, Switzerland

fcg@acm.org

Arshad Jhumka

Technische Universität Darmstadt

Fachbereich Informatik

D-64283 Darmstadt, Germany

arshad@informatik.tu-darmstadt.de

April 11, 2003

Swiss Federal Institute of Technology (EPFL)

School of Computer and Communication Sciences

Technical Report IC/2003/23

Abstract

The fault tolerance theories of Arora and Kulkarni [3] and of Jhumka et al. [11] view a fault-tolerant program as the result of composing a fault-intolerant program with fault tolerance components called detectors and correctors. At their core, the theories assume that the correctness specifications under consideration are fusion closed. In general, fusion closure of specifications can be achieved by adding history variables to the program. However, addition of history variables causes an exponential growth of the state space of the program, which makes the addition of fault tolerance expensive. To redress this problem, we present a method which can be used to add history information to a program in a way that (in a certain sense) minimizes the additional states. Hence, automated methods that add fault tolerance can now be efficiently applied in environments where specifications are not necessarily fusion closed.

Keywords:

fault-tolerance, safety, fusion closure, specifications, transition systems, theory, extension

1 Introduction

It is an established engineering method in computer science to generate complicated

things from simpler things. The most obvious example for this is a compiler for a pro-

gramming language (like C). The compiler takes a high-level programming instruction

in form of a C program and generates a sequence of machine code instructions that per-

form the speciﬁed task. Of course, the original C program might be complicated too,

but it is at least easier to understand than the generated assembly code since it abstracts

away from the machine architecture and supports a more natural formulation of control

structures etc.

Another area in which this technique has been applied is the area of fault-tolerant

systems. The goal is to start off with a system which is not fault-tolerant for certain

kinds of faults and use a sound procedure to transform it into a program which is

fault-tolerant. The approaches which have been proposed range from practical propos-

als like Schneider’s state machine approach [17] to theoretical studies like the one by

Basu et al. [5]. The former approach can be used to tolerate permanent faults in a cer-

tain number of replicated processes while the latter approach studies tolerance against

certain types of transient communication faults. Although these methods can be com-

bined, in general they seem a little oversized since they cannot be easily adapted to

other types of faults with ﬁner granularity like a stuck-at-0 register.

To this end, Arora and Kulkarni [3] initially presented a method which can be used

to combat ﬁner grained fault assumptions. Fault tolerance is achieved by composing a

fault-intolerant program with two types of fault-tolerance components called detectors

and correctors. Briefly, a detector detects a certain (error) condition on the system state, and a corrector brings the system back into a valid state. Since common fault-tolerance methods like triple modular redundancy or error-correcting codes can be modeled using detectors and correctors, the theory can

be viewed as an abstraction of many existing fault tolerance techniques, including the

state machine approach.

Kulkarni and Arora [12] and more recently Jhumka et al. [11] proposed meth-

ods to automate the addition of detectors and correctors to a fault-intolerant program.

The basic idea of these methods is to perform a state space analysis of the fault-

affected program and change its transition relation in such a way that it still satisﬁes

its speciﬁcation in the presence of faults. These changes result in either the removal

of transitions to satisfy a safety speciﬁcation or the addition of transitions to satisfy

a liveness specification. Gärtner and Völzer [9] analyzed the assumptions behind the

original Kulkarni-Arora method and argued that it is based on two distinct forms of

redundancy: redundancy in space and redundancy in time. The former refers to non-

reachable states of the program while the latter refers to non-reachable transitions.

However, the detector/corrector method cannot be viewed as a method which “adds

redundancy” (like for example the state machine approach) because the redundancy is

already present in the fault intolerant program. This stems from the fact that Arora and

Kulkarni [3] assume that their correctness speciﬁcations are fusion closed.

Basically, fusion closure means that the next step of a program depends only on the current state and not on the previous history of the execution. For example, given a program with a single variable $x \in \mathbb{N}$, the specification “never $x = 1$” is fusion closed while the specification “$x = 4$ implies that previously $x = 2$” is not. Specifications written in the popular Unity Logic [6] are fusion closed [10], as are specifications consisting of state transition systems (like C programs). But general temporal logic formulas, which are usually used in the area of fault-tolerant program synthesis and refinement [15, 16], are not. Arora and Kulkarni [3, p. 75] originally argued that this assumption is not restrictive in the sense that for every non-fusion closed specification there exists an “equivalent” specification which is fusion closed if it is allowed to add history variables to the program. History variables are additional control variables which record the previous state sequence of an execution and hence can be used to answer questions such as “has the program been in state $x = 2$?”. Using such a history variable $h$, the example above which was not fusion closed can be rephrased in a fusion-closed fashion as:

“never ($x = 4$ and $(x = 2) \notin h$)”

However, these history variables add states to the program and in effect add the neces-

sary redundancy to be fault-tolerant.

There are obvious “brute force” approaches on how to add history information like

the one sketched above where the history variable remembers the entire previous state

sequence of an execution. However, since history variables must be implemented,

they exponentially enlarge the state space of the fault-intolerant program. Rephrasing this in the redundancy terminology of Gärtner and Völzer [9], history variables add redundancy in space. Specifically, the history variables add exponential redundancy in space, which is costly. So, we are interested in adding as little redundancy (i.e., as few additional states) as possible. Intuitively, the minimal amount of redundancy

which is necessary to tolerate a certain class of faults depends on the kind and nature

of the faults.

In this paper, we present a method to add history states to a program in a way which (in general) avoids exponential growth of the state space. More specifically, we start with a problem specification $\mathit{SPEC}_1$ which is not fusion closed, a program $\Sigma_1$ which satisfies $\mathit{SPEC}_1$, and a class of faults $F$. Depending on $F$ we show how to transform $\mathit{SPEC}_1$ and $\Sigma_1$ into $\mathit{SPEC}_2$ and $\Sigma_2$ in such a way that (a) $\mathit{SPEC}_2$ is fusion closed, (b) $\Sigma_2$ can be made fault tolerant for $\mathit{SPEC}_2$ iff $\Sigma_1$ can be made fault tolerant for $\mathit{SPEC}_1$, and (c) $\Sigma_2$ is (in a certain sense) minimal with respect to the added states. We restrict our attention to cases where SPEC is a safety property and therefore are only concerned with what Arora and Kulkarni call fail-safe fault-tolerance [3].

The benefit of the proposed method is twofold. Firstly, it makes the methods which automatically add detectors [11, 12] amenable to specifications which are not fusion closed and closes a gap in the applicability of the detector/corrector theory [3].

And secondly, the presented method offers further insight into the efﬁciency of the

basic mechanisms which are applied in fault tolerance.

The paper is structured as follows: We ﬁrst present some preliminary deﬁnitions

in Section 2 and then relate the assumption of fusion closure to the notion of state

space redundancy in Section 3. In Section 4 we study speciﬁcations which are not fu-

sion closed and present a method which makes these types of speciﬁcations efﬁciently

manageable in the context of automated methods which add fault tolerance. Finally,

Section 5 presents some open problems and directions for future work.

2 Formal Preliminaries

In this section we deﬁne the formal system model used throughout this paper.

2.1 States, Traces and Properties

The state space of a program is an unstructured finite nonempty set $C$ of states. A state predicate over $C$ is a boolean predicate over $C$. A state transition over $C$ is a pair $(r, s)$ of states from $C$.

In the following, let $C$ be a state set and $T$ be a state transition set. We define a trace over $C$ to be a non-empty sequence $s_1, s_2, s_3, \ldots$ of states over $C$. We sometimes use the notation $s_i$ to refer to the $i$-th element of a trace. Note that traces can be finite or infinite. A trace is finite if its length is finite. We will always use Greek letters to denote traces and normal lowercase letters to denote states. For two traces $\alpha$ and $\beta$, we write $\alpha \cdot \beta$ to mean the concatenation of the two traces. We say that a transition $t$ occurs in some trace $\sigma$ if there exists an $i$ such that $(s_i, s_{i+1}) = t$.

We define a property over $C$ to be a set of traces over $C$. A trace $\sigma$ satisfies a property $P$ iff $\sigma \in P$. If $\sigma$ does not satisfy $P$ we say that $\sigma$ violates $P$. There are two important types of properties called safety and liveness [2, 13]. Informally, a safety property demands that “something bad never happens” [13], i.e., it rules out a set of unwanted trace prefixes. Mutual exclusion and deadlock freedom are two prominent examples of safety properties. A liveness property on the other hand demands that “something good will eventually happen” [13] and can be used to formalize, e.g., notions of termination. Since we are only concerned with safety properties we omit a formal definition of liveness. Safety properties are formally defined as follows.

Definition 1 (safety property over $C$) A safety property $S$ over $C$ is a property over $C$ for which the following holds: For each trace $\sigma$ which violates $S$ there exists a prefix $\alpha$ of $\sigma$ such that for all traces $\beta$, $\alpha \cdot \beta$ violates $S$.

2.2 Programs, Speciﬁcations and Correctness

We define programs as state transition systems consisting of a state set $C$, a set of initial states $I \subseteq C$ and a transition relation $T$ over $C$, i.e., a program (sometimes also called system) is a triple $\Sigma = (C, I, T)$. The state predicate $I$ together with the state transition set $T$ describes a safety property $S$, namely the set of all traces which can be constructed by starting in a state in $I$ and using only state transitions from $T$. We denote this property by safety-prop($\Sigma$). For brevity, we sometimes write $\Sigma$ instead of safety-prop($\Sigma$). A state $s \in C$ of a program $\Sigma$ is reachable iff there exists a trace $\sigma \in \Sigma$ such that $s$ occurs in $\sigma$. Otherwise $s$ is non-reachable. Sometimes we will call a non-reachable state redundant.

We define specifications to be properties, i.e., a specification over $C$ is a property over $C$. A safety specification is a specification which is a safety property. Unlike Arora and Kulkarni [3], we do not assume that problem specifications are fusion closed. Fusion closure is defined as follows: Let $C$ be a state set, $s \in C$, $X$ be a property over $C$, $\alpha$, $\gamma$ be finite state sequences, and $\beta$, $\delta$, $\sigma$ be state sequences over $C$.

Definition 2 (fusion closed set) The set $X$ is fusion closed if the following holds: If $\alpha \cdot s \cdot \beta$ and $\gamma \cdot s \cdot \delta$ are in $X$ then $\alpha \cdot s \cdot \delta$ and $\gamma \cdot s \cdot \beta$ are also in $X$.

It is easy to see that for every program $\Sigma$, safety-prop($\Sigma$) is fusion closed. Intuitively, fusion closure means that the entire history of every trace is present in every state of the trace. We will give examples for fusion closed and not fusion closed specifications later.

Let SPEC be a specification and $\Sigma$ be a program over $C$. We say that $\Sigma$ satisfies SPEC iff all traces in $\Sigma$ satisfy SPEC. Consequently, we say that $\Sigma$ violates SPEC iff there exists a trace $\sigma \in \Sigma$ which violates SPEC.

2.3 Extensions

Given some program $\Sigma_1 = (C_1, I_1, T_1)$, our goal is to define the notion of a fault-tolerant version $\Sigma_2$ of $\Sigma_1$, meaning that $\Sigma_2$ does exactly what $\Sigma_1$ does in fault-free scenarios and has additional fault-tolerance abilities which $\Sigma_1$ lacks. Sometimes, $\Sigma_2 = (C_2, I_2, T_2)$ will have additional states (i.e., $C_2 \supset C_1$) and for this case we must define what these states “mean” with respect to the original program $\Sigma_1$. This is done using a state projection function $\pi : C_2 \mapsto C_1$ which tells which states of $\Sigma_2$ are “the same” with respect to states of $\Sigma_1$. A state projection function can be naturally extended to traces and properties, e.g., for a trace $s_1, s_2, \ldots$ over $C_2$ it holds that $\pi(s_1, s_2, \ldots) = \pi(s_1), \pi(s_2), \ldots$

Definition 3 (extends) Let $\Sigma_1 = (C_1, I_1, T_1)$ and $\Sigma_2 = (C_2, I_2, T_2)$ be two programs. Program $\Sigma_2$ extends program $\Sigma_1$ using state projection $\pi$ iff the following conditions hold:

1. $C_2 \supseteq C_1$,

2. $\pi$ is a total mapping from $C_2$ to $C_1$ (for simplicity we assume that $\pi(s) = s$ for any $s \in C_1$), and

3. $\pi(\text{safety-prop}(\Sigma_2)) = \text{safety-prop}(\Sigma_1)$.

Note that the concept of extension is related to the notion of refinement [1]. Extensions are refinements with the additional property that the original state space is preserved and that there is no notion of stuttering [1].

If $\Sigma_2$ extends $\Sigma_1$ using $\pi$ and $\Sigma_1$ satisfies SPEC then obviously $\pi(\Sigma_2)$ satisfies SPEC. When it is clear from the context that $\Sigma_2$ extends $\Sigma_1$ we will simply say that $\Sigma_2$ satisfies SPEC instead of “$\pi(\Sigma_2)$ satisfies SPEC”.

2.4 Fault Models and Fault-Tolerant Versions

Since we are concerned with fault tolerant systems we must have a way of modeling faulty behavior. We define a fault model $F$ as being a program transformation [8], i.e., a mapping $F$ from programs to programs. The resulting program is called the fault-affected version. For a given program $\Sigma$, $F(\Sigma)$ is also called program $\Sigma$ in the presence of faults $F$.

We require that a fault model does not tamper with the set of initial states, i.e., we rule out “immediate” faults that occur before the system is switched on. We also restrict ourselves to the case where $F$ “adds” transitions, since this is the only way to violate a safety specification.

Definition 4 (fault model) A fault model $F$ maps a program $\Sigma = (C, I, T)$ to a program $F(\Sigma) = (F(C), F(I), F(T))$ such that the following conditions hold:

1. $F(C) = C$

2. $F(I) = I$

3. $F(T) \supset T$

For a given fault model $F$ and a specification SPEC, we say that a program $\Sigma$ is $F$-intolerant with respect to SPEC if $\Sigma$ satisfies SPEC but $F(\Sigma)$ violates SPEC.

Given two programs $\Sigma_1$ and $\Sigma_2$ such that $\Sigma_2$ extends $\Sigma_1$ and a fault model $F$, it makes sense to assume that $F$ treats $\Sigma_1$ and $\Sigma_2$ in a “similar way”. Basically, this means that $F$ should at least add the same transitions to $\Sigma_1$ and $\Sigma_2$. But with respect to the possible new states of $\Sigma_2$ it can possibly add new fault transitions. This models faults which occur within the fault-detection and correction mechanisms.

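Definition 5 (fault extension monotonicity) A fault model $F$ is extension monotonic iff for any two programs $\Sigma_1 = (C_1, I_1, T_1)$ and $\Sigma_2 = (C_2, I_2, T_2)$ such that $\Sigma_2$ extends $\Sigma_1$ using $\pi$ the following holds:

$F(T_1) \setminus T_1 \subseteq F(T_2) \setminus T_2$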

Figure 1: Examples for extension monotonic and not extension monotonic fault models. (Diagram not reproduced; each example shows the original system $\Sigma_1$ above its extension $\Sigma_2$.)

An example is given in Fig. 1. The original system is given at the top and the extension is given below (the state projection is implied by vertical orientation, i.e., states which are vertically aligned are mapped to the same state by $\pi$). In the left example the fault model is extension monotonic since all fault transitions in $\Sigma_1$ are also in $\Sigma_2$. The right example is not extension monotonic. Intuitively, an extension monotonic fault model maintains at least its original transitions over extensions.
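The inclusion in Definition 5 can be checked mechanically for one given pair of programs. The following is a minimal sketch, assuming programs are reduced to a pair (states, transitions) and a fault model is a function returning $F(T)$. Both fault models and both programs are hypothetical illustrations and are not the systems of Figure 1; that `SIGMA_2` extends `SIGMA_1` is assumed rather than checked.

```python
def monotone_faults(states, transitions):
    """Hypothetical fault model: always adds the fault transition (b, a)."""
    return transitions | {("b", "a")}

def fragile_faults(states, transitions):
    """Hypothetical fault model: adds (b, a) in two-state programs only,
    and (d, c) in all larger programs."""
    extra = {("b", "a")} if len(states) == 2 else {("d", "c")}
    return transitions | extra

def fault_transitions(fault_model, program):
    """Compute F(T) \\ T, the transitions introduced by the fault model."""
    states, transitions = program
    return fault_model(states, transitions) - transitions

def monotonic_on(fault_model, original, extension):
    """Check F(T1) \\ T1 <= F(T2) \\ T2 for this single pair of programs."""
    return (fault_transitions(fault_model, original)
            <= fault_transitions(fault_model, extension))

# Hypothetical original program and an assumed extension of it.
SIGMA_1 = (frozenset({"a", "b"}), frozenset({("a", "b")}))
SIGMA_2 = (frozenset({"a", "b", "c", "d"}),
           frozenset({("a", "b"), ("c", "d")}))

print(monotonic_on(monotone_faults, SIGMA_1, SIGMA_2))
print(monotonic_on(fragile_faults, SIGMA_1, SIGMA_2))
```

Note that Definition 5 quantifies over all pairs of programs, so a check on one pair can only refute extension monotonicity, never establish it.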

The extension monotonicity requirement does not restrict faulty behavior on the new states of the extension. However, we have to restrict this type of behavior since it would be impossible to build fault-tolerant versions otherwise. In this paper we assume a very general type of restriction: it basically states that in any infinite sequence of extensions of the original program there is always some point where $F$ does not introduce new fault transitions anymore.

Definition 6 (finite fault model) An extension monotonic fault model $F$ is finite iff for any infinite sequence of programs $\Sigma_1, \Sigma_2, \ldots$ such that for all $i$, $\Sigma_{i+1}$ extends $\Sigma_i$, there exists a $j$ such that for all $k \geq j$ no new fault transition is introduced in $\Sigma_k$, i.e., $F(T_{k+1}) \setminus T_{k+1} = F(T_k) \setminus T_k$.

Finite fault models retain the fault transitions in the original program (i.e., they are extension monotonic for each pair of extensions). They do not restrict the additional faulty behavior introduced in the new states of an extension. However, they exclude fault models for which infinite redundancy is necessary to tolerate them. The engineering process is as follows: Given a program $\Sigma_1$ and a fault model $F$, we extend $\Sigma_1$ to $\Sigma_2$ to make $F$ tolerable. Then we look at the new states introduced in this process and consider faults which might happen there. Regarding these new faults we construct a new extension $\Sigma_3$ of $\Sigma_2$ to potentially tolerate these faults. This process is repeated. In theory, this process might never terminate, namely if $F$ forever adds certain kinds of faults to the new states. A finite fault model guarantees that this process must eventually terminate. In this paper, we assume our fault model to be finite and extension monotonic.

Now we are able to define a fault-tolerant version. It captures the idea of starting with some program $\Sigma_1$ which is fault-intolerant regarding a specification SPEC and some fault model $F$. A fault-tolerant version $\Sigma_2$ of $\Sigma_1$ is a program which has the same behavior as $\Sigma_1$ if no faults occur, but additionally satisfies SPEC in the presence of faults.

Definition 7 (fault-tolerant version) Let $F$ be a fault model, SPEC be a specification and $\Sigma_1$ and $\Sigma_2$ be programs. Assume that $\Sigma_1$ satisfies SPEC but $F(\Sigma_1)$ violates SPEC. We call a program $\Sigma_2$ the $F$-tolerant version of program $\Sigma_1$ for SPEC using state projection $\pi$ iff the following conditions hold:

1. $\Sigma_2$ extends $\Sigma_1$ using $\pi$,

2. $F(\Sigma_2)$ satisfies SPEC.

3 Problem Statement

The basic task we would like to solve is to construct a fault-tolerant version for a given

program and a safety speciﬁcation.

Definition 8 (general fail-safe transformation problem) Let $F$ be a fault model and $\Sigma_1$ a program which is $F$-intolerant with respect to a general safety specification $\mathit{SPEC}_1$. The general fail-safe transformation problem consists of finding a fault-tolerant version of $\Sigma_1$, i.e., a program $\Sigma_2$ such that $\Sigma_2$ extends $\Sigma_1$ and $F(\Sigma_2)$ satisfies $\mathit{SPEC}_1$.

The case where SPEC is fusion closed has been studied by Kulkarni and Arora

[12] and Jhumka et al. [11], i.e., they solve a restricted transformation problem.

Definition 9 (fusion-closed fail-safe transformation problem) The fusion-closed fail-safe transformation problem consists of solving the general fail-safe transformation problem where $\mathit{SPEC}_1$ is fusion closed.

In the remainder of this section we brieﬂy recall the approaches used by Kulkarni

and Arora [12] and Jhumka et al. [11] to solve the latter problem.

3.1 Adding Fail-Safe Fault Tolerance to Fusion-Closed Speciﬁcations

The basic mechanism which Kulkarni and Arora [12] and Jhumka et al. [11] apply is the creation of non-reachable states. The fact that specifications are fusion closed implies that safety specifications can be concisely represented by a set of “bad” transitions, i.e., transitions which cause a violation of the specification [3, 10].

Definition 10 (maintains) Let $\Sigma$ be a program, SPEC be a specification and $\alpha$ be a finite computation of $\Sigma$. We say that $\alpha$ maintains SPEC iff there exists a sequence of states $\beta$ such that $\alpha \cdot \beta \in \mathit{SPEC}$.

If SPEC is a safety property, every trace not in SPEC has a prefix which does not maintain SPEC. From the definition of maintains, we have that there must be a transition where a given trace $\sigma$ switches from “good” to “bad”, i.e., $\sigma$ can be written as $\alpha \cdot d \cdot b \cdot \beta$ such that $\alpha \cdot d$ maintains SPEC and all “longer” prefixes (starting with $\alpha \cdot d \cdot b$) do not maintain SPEC. Arora and Kulkarni have shown [4, “Only-if” part of Lemma 3.2] that $(d, b)$ is a transition which will cause any trace in which it occurs to violate SPEC. We rephrase this result as follows:

Lemma 1 Let $\Sigma = (C, I, T)$ be a system, SPEC be a safety property which is fusion closed, and assume that $\Sigma$ violates SPEC and that every $x \in I$ maintains SPEC. Then there exists a transition $(d, b) \in T$ such that for all traces $\sigma$ of $\Sigma$ the following holds: if $(d, b)$ occurs in $\sigma$ then $\sigma \notin \mathit{SPEC}$.

The known automated procedures [11, 12], which are based on the concept of non-reachable states, use the following approach for the addition of fail-safe fault tolerance: Since $F(\Sigma_1)$ violates SPEC, there must exist executions in which a specified bad transition occurs. We must prevent the occurrence of such a transition. So, for every bad transition $t = (d, b)$ we must make either state $d$ or state $b$ unreachable in $F(\Sigma_2)$. If $t$ is a program transition, then the outcome depends on whether or not $t$ is reachable in $\Sigma_1$.

• If $t$ is a reachable program transition, then a violation of SPEC can occur even if no faults occur, so, obviously, no fault-tolerant version exists since we would have to change the behavior of the original program.

• If $t$ is a redundant (i.e., non-reachable) program transition, then we can remove it, resulting in a smaller transition set $T_2$ of $\Sigma_2$.

If $t$ is a transition which has been introduced by $F$, then we cannot remove it directly. The best we can do is make the starting state $d$ of $t$ unreachable. But this can only be done if there exists a non-reachable program transition on the path to $d$. If such a transition exists, we can safely remove it. If not, then again no fault-tolerant version exists.

Figure 2: Illustration for the Kulkarni-Arora method. The specification is “never $h$”. (State chart over the states $a$ to $h$ not reproduced.)

As an illustration of the method consider Figure 2, which shows a program $\Sigma_1$ in a state-chart like notation. Again, states are drawn as circles and transitions are arrows between states. Initial states are identified using arrows without starting states. Transitions which are introduced by $F$ are shown as dashed arrows.

Assume the correctness specification SPEC for $\Sigma_1$ is that it never reaches state $h$. Obviously, the system satisfies SPEC in the absence of faults but it violates SPEC in the presence of faults. The bad transitions which we must prevent in $F(\Sigma_2)$ are all transitions which have state $h$ as destination state. We can remove the transition $(g, h)$ easily from $T_2$ because its removal does not change the fault-free behavior of $\Sigma_2$. But we cannot remove transition $(f, h)$ since it is a fault transition. Luckily, there exists a redundant transition $(d, e)$ on the path leading to $f$ which can be removed in $T_2$. So $\Sigma_2$ is constructed from $\Sigma_1$ by removing $(g, h)$ and $(d, e)$ from $T_2$.
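The transition-removal idea can be sketched in a few lines of Python. This is a simplified variant and not the published procedures of [11, 12]: it removes every program transition that is non-reachable in the fault-free program (more than strictly necessary) and then checks whether a bad transition can still be triggered in the presence of faults. The edge sets below are hypothetical; they only mimic the roles of $(g, h)$, $(f, h)$ and $(d, e)$ described in the text and are not read off Figure 2.

```python
from collections import deque

def reachable(initial, transitions):
    """States reachable from `initial` via `transitions` (breadth-first)."""
    succ = {}
    for u, v in transitions:
        succ.setdefault(u, set()).add(v)
    seen, queue = set(initial), deque(initial)
    while queue:
        u = queue.popleft()
        for v in succ.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def add_fail_safe(initial, program_t, fault_t, bad_t):
    """Return a reduced program transition set, or None if the faults
    cannot be tolerated by removing redundant transitions only."""
    # Transitions reachable without faults must be kept, otherwise the
    # fault-free behavior of the program would change.
    fault_free = reachable(initial, program_t)
    kept = {(u, v) for (u, v) in program_t if u in fault_free}
    # Check the fault-affected program: no bad transition may be enabled.
    with_faults = reachable(initial, kept | fault_t)
    for u, _ in (kept | fault_t) & bad_t:
        if u in with_faults:
            return None
    return kept

# Hypothetical example (assumed edges, specification "never h").
INITIAL = {"a"}
PROGRAM_T = {("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("g", "h")}
FAULT_T = {("b", "d"), ("c", "g"), ("f", "h")}
BAD_T = {("g", "h"), ("f", "h")}

print(add_fail_safe(INITIAL, PROGRAM_T, FAULT_T, BAD_T))
```

In this hypothetical system the sketch keeps only $(a, b)$ and $(b, c)$; adding a fault transition $(b, h)$ to both the fault set and the bad set makes the function return `None`, matching the non-tolerable case discussed next.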

Figure 2 also helps to illustrate the cases where no fault-tolerant version exists. For example, if there were a transition $(c, h) \in T_1$, then $h$ would be reachable in $\Sigma_1$ and, hence, $\Sigma_1$ would not satisfy the specification anyway. The other case arises, for example, if there were a fault transition $(b, h) \in F(T_1)$, i.e., $h$ is reachable along a path with only fault transitions and reachable program transitions. Again, such an $F$ is not tolerable. However, the fact that $F$ is not tolerable is not a drawback of the transformation method; it simply states that the chosen fault assumption is too severe to be tolerated.

This concludes the recapitulation of the known approaches to automatically make

a program fail-safe fault tolerant. Recall that speciﬁcations are required to be fusion

closed. The above illustrations show that fusion closure, together with the assumption that a fault assumption is tolerable, implies that state space redundancy (e.g., states $d$ and $g$) is already available in the fault-intolerant system $\Sigma_1$. This type of redundancy makes it possible to formulate detection predicates in the language of guarded commands [6, 7], which is the basis of the Arora-Kulkarni theory [3]. These detection predicates are conjoined to the guards of certain actions and hence have the effect of removing transitions.

3.2 Handling Speciﬁcations which are Not Fusion Closed

Programs are presented in a guarded command notation [6, 7]. The state space of a program is defined by a set of variables and the state transitions by a set of actions. An action of a program has the form

⟨guard⟩ → ⟨statement⟩

in which the guard is a boolean expression over the program variables and the statement is either the empty statement or an instantaneous assignment to one or more variables. An execution is constructed by repeatedly and non-deterministically choosing any action whose guard evaluates to true and executing the corresponding statement.

process Σ1
var x ∈ {0,1,2,3,4} init 0
begin
      x = 0 → x := 1
  []  x = 1 → x := 2
  []  x = 2 → x := 3
  []  x = 3 → x := 4
  []  f: x = 1 → x := 3
end

process Σ2
var x ∈ {0,1,2,3,4} init 0
    h sequence of {0,1,2,3,4} init ⟨⟩
begin
      x = 0 → x := 1; h := ⟨1⟩
  []  x = 1 → x := 2; h := ⟨1,2⟩
  []  x = 2 → x := 3; h := ⟨1,2,3⟩
  []  x = 3 → x := 4; h := ⟨1,2,3,4⟩
  []  f: x = 1 → x := 3
end

Figure 3: Two programs in guarded command notation.

Consider the program on the left side of Figure 3. The program has a variable $x$ which can take five different values (0–4) and simply proceeds from state $x = 0$ to $x = 4$ through all intermediate states. The fault assumption $F$ has added one transition from $x = 1$ to $x = 3$ to the transition relation (the action is marked with an ‘f’). Consider the correctness specification

SPEC = “always ($x = 4$ implies that previously $x = 2$)”

Note that $F(\Sigma_1)$ does not satisfy SPEC (i.e., $F(\Sigma_1)$ can reach state $x = 4$ without having been in state $x = 2$), and that SPEC is not fusion closed. To see the latter, consider the two traces $0, 3, 2, 4$ and $2, 3, 4$ from SPEC. The fusion at state $x = 3$ yields the trace $0, 3, 4$, which is not in SPEC. Since SPEC is not fusion closed, we cannot apply the known transformation methods [11, 12].
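The counterexample above can be replayed directly. The sketch below assumes finite traces represented as Python tuples of values of $x$, and a fusion that uses the first occurrence of the fusion state in each trace; the helper names `fusion` and `satisfies_spec` are illustrative and not taken from the paper.

```python
def fusion(alpha, s, beta):
    """Prefix of alpha up to and including s, followed by the suffix
    of beta after s (first occurrence of s in each trace)."""
    i = alpha.index(s)
    j = beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]

def satisfies_spec(trace):
    """SPEC: always (x = 4 implies that previously x = 2)."""
    seen_two = False
    for x in trace:
        if x == 4 and not seen_two:
            return False
        if x == 2:
            seen_two = True
    return True

alpha = (0, 3, 2, 4)
beta = (2, 3, 4)
fused = fusion(alpha, 3, beta)

print(satisfies_spec(alpha), satisfies_spec(beta))  # both traces are in SPEC
print(fused, satisfies_spec(fused))                 # the fused trace is not
```

Both input traces satisfy SPEC while their fusion at $x = 3$ does not, which is exactly the failure of the condition in Definition 2.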

The specification can be made fusion closed by adding a history variable $h$ which records the entire state history. Such a variable has been added to the program on the right side of Figure 3. Now SPEC can be rephrased as

SPEC = “always ($x = 4$ implies $\langle 2 \rangle \in h$)”

or equivalently

SPEC = “never ($x = 4$ and $\langle 2 \rangle \notin h$)”

Now we can identify a set of bad transitions which must be prevented, e.g.:

$x = 3 \wedge h = \langle 1 \rangle \;\rightarrow\; x = 4 \wedge h = \langle 1, 2, 3, 4 \rangle$

The precondition for the transition to a state where $x = 4$ must be strengthened by the detection predicate $h \neq \langle 1 \rangle$, i.e., the fourth guarded command of $\Sigma_2$ must be changed to:

$x = 3 \wedge h \neq \langle 1 \rangle \;\longrightarrow\; x := 4;\ h := \langle 1, 2, 3, 4 \rangle$

Hence, bad transitions are prevented and the modified system satisfies SPEC in the presence of fault $f$.
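The effect of the strengthened guard can be examined by enumerating all maximal executions of $F(\Sigma_2)$. The following sketch transcribes the actions of Figure 3 literally, with states as pairs `(x, h)` and `h` as a tuple; the flag `strengthened` (an assumed name) toggles the detection predicate $h \neq \langle 1 \rangle$. Executions are projected onto $x$ and checked against the original, trace-based SPEC.

```python
def successors(state, strengthened):
    """Successor states of F(Sigma_2); the last rule is the fault action f."""
    x, h = state
    nxt = []
    if x == 0:
        nxt.append((1, (1,)))
    if x == 1:
        nxt.append((2, (1, 2)))
    if x == 2:
        nxt.append((3, (1, 2, 3)))
    if x == 3 and (not strengthened or h != (1,)):
        nxt.append((4, (1, 2, 3, 4)))
    if x == 1:
        nxt.append((3, h))  # fault f: x jumps to 3, h is left unchanged
    return nxt

def maximal_x_traces(strengthened, state=(0, ()), prefix=()):
    """All maximal executions from `state`, projected onto x."""
    prefix = prefix + (state[0],)
    nxt = successors(state, strengthened)
    if not nxt:
        return {prefix}
    result = set()
    for s in nxt:
        result |= maximal_x_traces(strengthened, s, prefix)
    return result

def satisfies_spec(trace):
    """SPEC: always (x = 4 implies that previously x = 2)."""
    seen_two = False
    for x in trace:
        if x == 4 and not seen_two:
            return False
        if x == 2:
            seen_two = True
    return True

before = maximal_x_traces(strengthened=False)
after = maximal_x_traces(strengthened=True)
print(sorted(before), all(satisfies_spec(t) for t in before))
print(sorted(after), all(satisfies_spec(t) for t in after))
```

With the original guard the faulty execution $0, 1, 3, 4$ exists; with the strengthened guard the faulty run stops at $x = 3$, which is the fail-safe behavior intended by the text.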

3.3 State Space Redundancy Through History Variables

Adding a history variable $h$ in the previous example adds states to the state space of the system. In fact, defining the domain of $h$ as the set of all sequences over $\{0, 1, 2, 3, 4\}$ adds infinitely many states. Clearly this can be reduced by the observation that if faults do not corrupt $h$, then $h$ will only take on five different values ($\langle\rangle$, $\langle 1 \rangle$, $\langle 1, 2 \rangle$, $\langle 1, 2, 3 \rangle$, and $\langle 1, 2, 3, 4 \rangle$). But still, the state space has been increased from five states to $5^2 = 25$ states.

Note that $\Sigma_2$ has redundant states and $\Sigma_1$ is not redundant at all. So the redundancy is due to the history variable $h$. But even if the domain of $h$ has cardinality 5, the redundancy is in a certain sense not minimal, as we now explain.

Consider the program $\Sigma_3$ on the left side of Figure 4. It tolerates the fault $f$ by adding only one state to the state space of $\Sigma_1$ (namely, $x = 5$). The state space together with the transitions is depicted on the right side of the figure. Note that $\Sigma_3$ has only one redundant state, so $\Sigma_3$ can be regarded as redundancy-minimal with respect to SPEC. The metric used for minimality is the number of redundant states. We want to exploit this observation to deal with the general case.

process Σ3
var x ∈ {0,1,2,3,4,5} init 0
begin
      x = 0 → x := 1
  []  x = 1 → x := 2
  []  x = 2 → x := 5
  []  x = 5 → x := 4
  []  f: x = 1 → x := 3
end

Figure 4: A redundancy-minimal version of the program in Figure 3 in guarded commands (left) and a state chart notation (right; not reproduced). The specification is “always ($x = 4$ implies that previously $x = 2$)”.
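The claims about $\Sigma_3$ can be checked by enumeration. The sketch below encodes $\Sigma_1$, $\Sigma_3$ and the fault $f$ as successor maps and assumes the state projection $\pi(5) = 3$ (identity elsewhere); this projection is not spelled out in the text and is an assumption of the example. Because both programs are acyclic, comparing maximal traces is enough to compare the fault-free behavior.

```python
SIGMA_1 = {0: [1], 1: [2], 2: [3], 3: [4]}
SIGMA_3 = {0: [1], 1: [2], 2: [5], 5: [4]}
FAULT = {1: [3]}                            # fault action f: x = 1 -> x := 3
PI = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 3}   # assumed projection, pi(5) = 3

def maximal_traces(program, faults=None, start=0):
    """All maximal traces of an acyclic program, optionally with faults."""
    succ = {s: list(ts) for s, ts in program.items()}
    for s, ts in (faults or {}).items():
        succ.setdefault(s, []).extend(ts)

    def walk(prefix):
        nxt = succ.get(prefix[-1], [])
        if not nxt:
            return {prefix}
        result = set()
        for t in nxt:
            result |= walk(prefix + (t,))
        return result

    return walk((start,))

def project(traces):
    return {tuple(PI[s] for s in trace) for trace in traces}

def satisfies_spec(trace):
    """SPEC: always (x = 4 implies that previously x = 2)."""
    seen_two = False
    for x in trace:
        if x == 4 and not seen_two:
            return False
        if x == 2:
            seen_two = True
    return True

same_fault_free = project(maximal_traces(SIGMA_3)) == maximal_traces(SIGMA_1)
faulty = project(maximal_traces(SIGMA_3, FAULT))
print(same_fault_free, sorted(faulty), all(satisfies_spec(t) for t in faulty))
```

Under the assumed projection, the fault-free behavior of $\Sigma_3$ coincides with that of $\Sigma_1$, and every execution of $F(\Sigma_3)$ satisfies SPEC, using six states instead of the 25 needed with the history variable.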

4 Beyond Fusion Closure

Although the automated procedures of [11, 12] were developed for fusion-closed specifications, they may still work for specifications which are not fusion closed, provided the fault model has a certain pleasant form. For example, consider the system in Figure 5 and the specification

SPEC = “($e$ implies previously $c$) and (never $g$)”

Obviously, the fault model $F$ can be tolerated using the known transformation methods because $F$ does not “exploit” the part of the specification which is not fusion closed.

Figure 5: The fail-safe transformation can be successful even if the specification is not fusion closed. The specification in this case is “($e$ implies previously $c$) and (never $g$)”. (State chart over the states $a$ to $g$ not reproduced.)

4.1 Exploiting Non-Fusion Closure

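Now we formalize what it means for a fault model to “exploit” the fact that a specification is not fusion-closed (we call this property non-fusion closure). First we define what it means for a trace to be the fusion of two other traces.

Definition 11 (fusion and fusion point of traces) Let $s$ be a state and $\alpha = \alpha_{pre} \cdot s \cdot \alpha_{post}$ and $\beta = \beta_{pre} \cdot s \cdot \beta_{post}$ be two traces in which $s$ occurs. Then we define

$\mathit{fusion}(\alpha, s, \beta) = \alpha_{pre} \cdot s \cdot \beta_{post}$

If $\mathit{fusion}(\alpha, s, \beta) \neq \alpha$ and $\mathit{fusion}(\alpha, s, \beta) \neq \beta$ we call $s$ a fusion point of $\alpha$ and $\beta$.

Lemma 2 For the fusion of three traces $\alpha, \beta, \gamma$ the following holds: If $s$ occurs before $s'$ in $\beta$ then

$\mathit{fusion}(\alpha, s, \mathit{fusion}(\beta, s', \gamma)) = \mathit{fusion}(\mathit{fusion}(\alpha, s, \beta), s', \gamma)$

and

$\mathit{fusion}(\gamma, s', \mathit{fusion}(\alpha, s, \beta)) = \mathit{fusion}(\gamma, s', \beta)$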

Proofs are written in a structured style similar to proof trees of interactive theorem

proving environments. This approach is advocated by Lamport who promises that this

style “makes it much harder to prove things that are not true” [14]. The proof is a

sequence of numbered steps at different levels. Every step has a proof which may be

refined at lower levels by additional steps. For example, step ⟨1⟩2 is the second step

on level 1. Proofs may also be read in a structured way, for example, by reading only

the top level steps and going into sublevels only when necessary.

PROOF: Assume that s occurs before s′ in β. The proof is by direct calculation. Let α = α_pre · s · α_post, β = β_pre · s · β_mid · s′ · β_post, and γ = γ_pre · s′ · γ_post. Then

fusion(α, s, fusion(β, s′, γ)) = fusion(α, s, β_pre · s · β_mid · s′ · γ_post)
= α_pre · s · β_mid · s′ · γ_post
= fusion(α_pre · s · β_mid · s′ · β_post, s′, γ)
= fusion(fusion(α, s, β), s′, γ)

proves the first equation and

fusion(γ, s′, fusion(α, s, β)) = fusion(γ, s′, α_pre · s · β_mid · s′ · β_post)
= γ_pre · s′ · β_post
= fusion(γ, s′, β)

proves the second equation.
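The fusion operator and both equations of Lemma 2 can be checked on concrete traces. The sketch below is illustrative only: it assumes traces are finite Python tuples in which the fused state occurs exactly once, and the state names are made up for the example.

```python
def fusion(alpha, s, beta):
    """Return alpha_pre · s · beta_post (Definition 11).

    Assumption: s occurs exactly once in each trace, so the split is unique.
    """
    i = alpha.index(s)
    j = beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]


def is_fusion_point(alpha, s, beta):
    """s is a fusion point if the fused trace differs from both inputs."""
    fused = fusion(alpha, s, beta)
    return fused != alpha and fused != beta


# Hypothetical traces; s occurs before t in beta, as Lemma 2 requires.
alpha = ("p", "s", "q")
beta = ("u", "s", "m", "t", "v")
gamma = ("w", "t", "z")

# First equation of Lemma 2.
lhs1 = fusion(alpha, "s", fusion(beta, "t", gamma))
rhs1 = fusion(fusion(alpha, "s", beta), "t", gamma)

# Second equation of Lemma 2.
lhs2 = fusion(gamma, "t", fusion(alpha, "s", beta))
rhs2 = fusion(gamma, "t", beta)

print(lhs1 == rhs1, lhs2 == rhs2)
```

Both comparisons hold for these traces, mirroring the direct calculation in the proof.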

If SPEC is a set of traces, we recursively deﬁne the fusion closure of SPEC , de-

noted by fusion-closure(SPEC ), as the set which is closed under ﬁnite applications of

the fusion operator.

Definition 12 (fusion closure) Given a specification SPEC, a trace σ is in fusion-closure(SPEC) iff

1. σ is in SPEC, or

2. σ = fusion(α, s, β) for traces α, β ∈ fusion-closure(SPEC) and a state s that occurs in α and β.
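For a finite set of finite traces, the recursive definition can be read as a fixpoint computation. The following sketch makes two assumptions that the paper does not: SPEC is given extensionally as a small finite set of tuples, and the traces are acyclic (no state repeats), which keeps the closure finite so the loop terminates.

```python
from itertools import product


def fusion(alpha, s, beta):
    """alpha_pre · s · beta_post, assuming s occurs once in each trace."""
    i, j = alpha.index(s), beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]


def fusion_closure(spec):
    """Smallest superset of `spec` closed under fusion (Definition 12)."""
    closure = set(spec)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot, since the closure grows inside the loop.
        for alpha, beta in product(list(closure), repeat=2):
            for s in set(alpha) & set(beta):
                fused = fusion(alpha, s, beta)
                if fused not in closure:
                    closure.add(fused)
                    changed = True
    return closure


# A two-trace fragment in the spirit of "f implies previously d".
spec = {("a", "b", "e"), ("a", "d", "e", "f")}
closure = fusion_closure(spec)
print(sorted(closure))
```

The closure of this fragment contains the trace a · b · e · f, which reaches f without passing through d; this is the kind of trace that separates a specification from its fusion closure.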

Lemma 2 guarantees that every trace in fusion-closure(SPEC) which is not in SPEC has a “normal form”, i.e., it can be represented uniquely as the sequence of fusions of traces in SPEC. This is shown in the following theorem.

Theorem 1 For every trace σ ∈ fusion-closure(SPEC) which is not in SPEC there exists a sequence of traces α_0, α_1, α_2, . . . and a sequence of states s_1, s_2, s_3, . . . such that

1. for all i ≥ 0, α_i ∈ SPEC,

2. for all i ≥ 1, s_i is a fusion point of α_{i−1} and α_i, and

3. σ can be written as:

σ = fusion(fusion(. . . fusion(α_0, s_1, α_1), s_2, α_2), s_3, α_3), . . .)
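The normal form is a left-nested chain of fusions, which corresponds to a left fold over the pairs (s_i, α_i). The sketch below illustrates this for a finite chain; the helper name `normal_form` and the example traces are hypothetical.

```python
from functools import reduce


def fusion(alpha, s, beta):
    """alpha_pre · s · beta_post, assuming s occurs once in each trace."""
    i, j = alpha.index(s), beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]


def normal_form(alpha0, steps):
    """Evaluate fusion(...fusion(fusion(alpha0, s1, a1), s2, a2)..., sn, an).

    `steps` is a list of pairs (s_i, alpha_i), in the order of Theorem 1.
    """
    return reduce(lambda acc, step: fusion(acc, step[0], step[1]),
                  steps, alpha0)


alpha0 = ("a", "b", "c")
alpha1 = ("x", "b", "d", "e")
alpha2 = ("y", "d", "f")

sigma = normal_form(alpha0, [("b", alpha1), ("d", alpha2)])
print(sigma)
```

Here the result keeps the past of α_0 up to b, the segment of α_1 between b and d, and the future of α_2 after d.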

PROOF SKETCH: The proof is by induction on the structure of how σ evolved from traces in SPEC. Basically this means an induction on the number of fusion points in σ. The induction step assumes that σ is the fusion of two traces which have at most n fusion points and, depending on their relative positions, uses the rules of Lemma 2 to construct the normal form for σ.

1 ⟨1⟩1. The theorem holds for all traces which have one fusion point.
PROOF: Since σ has one fusion point, it can be written as σ = fusion(α_0, s_1, α_1) with α_0 and α_1 from SPEC and s_1 a fusion point of α_0 and α_1.

2 ⟨1⟩2. ASSUME: The theorem holds for all traces with at most n fusion points.
PROVE: The theorem holds for all traces σ which are fusions of traces with at most n fusion points.
PROOF SKETCH: Take two traces τ and τ′ which have at most n fusion points and which share an additional common fusion point s (see Fig. 6). The new fusion point s divides the fusion points in τ into two groups of k and m fusion points (and those in τ′ into two groups of k′ and m′ fusion points, respectively). The fusion of both traces will maintain the k fusion points of τ and the m′ fusion points of τ′. This follows from the second equation of Lemma 2. Because of the ordering of the fusion points we can use the first equation of Lemma 2 to construct the normal form. In general, the resulting trace can have more than n fusion points.

2.1 ⟨2⟩1. σ can be written as σ = fusion(τ, s, τ′) where τ and τ′ have at most n fusion points.
PROOF: Follows from the fact that σ is the fusion of two traces with at most n fusion points.

2.2 ⟨2⟩2. σ can be written as

σ = fusion(fusion(. . . fusion(α_0, s_1, α_1), s_2, α_2) . . .), s, fusion(. . . fusion(α′_0, s′_1, α′_1), s′_2, α′_2) . . .))

PROOF: Follows from the induction hypothesis and by replacing τ and τ′ with their normal forms in the formula of step ⟨2⟩1.

2.3 ⟨2⟩3. Let k, m, k′ and m′ denote the number of fusion points to the left and right of s in τ and τ′ (see Fig. 6). Then σ can be written as

fusion(. . . fusion(α_0, s_1, α_1), s_2, α_2) . . .), s, fusion(. . . fusion(α′_{k′}, s′_{k′+1}, α′_{k′+1}), s′_{k′+2}, α′_{k′+2}) . . .))

PROOF: The first k′ fusion points of τ′ precede s and so by repeatedly applying the second equation of Lemma 2 we can remove the first k′ applications of fusion from the formula of step ⟨2⟩2.

2.4 ⟨2⟩4. σ can be written as

fusion(. . . fusion(α_0, s_1, α_1), s_2, α_2) . . .), s_k, α_k), s, α′_{k′}), s′_{k′+1}, α′_{k′+1}), s′_{k′+2}, α′_{k′+2}) . . .)

PROOF: From the definition of fusion, we can ignore the final m fusion points of τ. The formula follows by repeatedly applying the first equation of Lemma 2 to the formula of step ⟨2⟩3 (shifting the fusion operator to the left).

2.5 ⟨2⟩5. Q.E.D.
PROOF: The formula of step ⟨2⟩4 has the required normal form because all α_i and α′_j are in SPEC and all s_i and s′_j are fusion points of consecutive elements in the formula.

3 ⟨1⟩3. Q.E.D.
PROOF: Follows from induction.

Figure 6: Diagram accompanying the proof of Theorem 1.

Now consider the system depicted in Figure 7. The corresponding specification is:

SPEC = “f implies previously d”

The system may exhibit the following two traces in the absence of faults, namely α = a · b · c and β = a · d · e · f. In the presence of faults, a new trace is possible, namely γ = a · b · e · f. Observe that γ violates SPEC and that γ is the fusion of two traces α, β ∈ SPEC (the state which plays the role of s in Definition 11 is state e). In such a case we say that fault model F exploits the non-fusion closure of SPEC.

We now formally deﬁne what is meant by exploiting the non-fusion closure of a

speciﬁcation.

Definition 13 (exploiting non-fusion closure) Let Σ be a system, F be a fault model and SPEC be a specification which is satisfied by Σ. Then F(Σ) exploits the non-fusion closure of SPEC iff there exists a trace σ ∈ F(Σ) such that σ ∉ SPEC and σ ∈ fusion-closure(SPEC).

Figure 7: Example where the non-fusion closure of a specification is exploited by a fault model. The specification is “f implies previously d”.

Intuitively, exploiting the non-fusion closure means that there exists a bad computation (σ ∉ SPEC) that can potentially “impersonate” a good computation (σ ∈ fusion-closure(SPEC)). Definition 13 states that F causes a violation of SPEC by constructing a fusion of two (allowed) traces.

Given a fault model F such that F(Σ) exploits the non-fusion closure of SPEC, we also say that the non-fusion closure of SPEC is exploited for Σ in the presence of F.

Obviously, if for some specification SPEC and system Σ such an F exists, then SPEC is not fusion closed. Similarly trivial to prove is the observation that no fault model F can exploit the non-fusion closure of a specification which is fusion closed. On the other hand, if the non-fusion closure of SPEC cannot be exploited, this does not necessarily mean that SPEC is fusion closed. To see this consider Figure 8. The correctness specification SPEC of the program is “c implies previously a”. Obviously, a fault model can only generate traces that begin with a. Since a is an initial state and we assume initial state preservance, no F can exploit the non-fusion closure. But SPEC is not fusion closed.

Figure 8: Example where the non-fusion closure cannot be exploited but the specification is not fusion closed. The specification is “c implies previously a”.

4.2 Preventing the Exploitation of Non-Fusion Closure

The fact that a fault model may not exploit the non-fusion closure of a specification will be important in our approach to solve the general fail-safe transformation problem (Def. 8). A method to solve this problem, i.e., that of finding a fault-tolerant version Σ2, should be a generally applicable method, which constructs Σ2 from Σ1 (this is depicted in the top part of Figure 9). Instead of devising such a method from scratch, our aim is to reuse the existing transformations to add fail-safe fault tolerance which are based on fusion-closed specifications [11, 12]. This approach is shown in the bottom part of Figure 9. Starting from Σ1, we construct some intermediate program Σ′2 and some intermediate fusion-closed specification SPEC2 to which we apply one of the above mentioned methods for fusion-closed specifications [11, 12]. The construction of Σ′2 and SPEC2 must be done in such a way that the resulting program satisfies the properties of the general transformation problem stated in Definition 8. How can this be done?

The idea of our approach is the following: First, choose SPEC2 to be the fusion closure of SPEC1, i.e., choose

SPEC2 = fusion-closure(SPEC1)

and construct Σ′2 from Σ1 in such a way that F(Σ′2) does not exploit the non-fusion closure of SPEC1. More precisely, Σ′2 results from applying a constructive method (which we give below) which ensures that

• Σ′2 extends Σ1 using some state projection π and

• F(Σ′2) does not exploit the non-fusion closure of SPEC1.

Our claim, which we formally prove later, is that the program Σ2 resulting from applying (for example) the algorithms of [11, 12] to Σ′2 with respect to SPEC2 in fact satisfies the requirements of Definition 8, i.e., Σ2 is in fact an F-tolerant version of Σ1 with respect to SPEC1.

[Diagram: at the top, a general method takes Σ1 (fault-intolerant w.r.t. the general specification SPEC1) directly to Σ2 (fault-tolerant w.r.t. SPEC1). At the bottom, the step from Σ1 to Σ′2 is labelled “this paper”, and the step from Σ′2 to Σ2 is the “standard” fail-safe transformation w.r.t. the fusion-closed SPEC2.]

Figure 9: Overview of transformation problem (top) and our approach (bottom). The constructive method described in Section 4.4 offers a solution to the first step (i.e., Σ1 → Σ′2).

4.3 Bad Fusion Points

For a given system Σand a speciﬁcation SPEC, how can we tell whether or not the

nature of SPEC is exploitable by a fault model? For the negative case (where it can

be exploited), we give a sufﬁcient criterion. It is based on the notion of a bad fusion

point.

Definition 14 (bad fusion point) Let SPEC be a specification, Σ be a system satisfying SPEC, s be a state of Σ, and F a fault model such that F(Σ) violates SPEC. State s is a bad fusion point of Σ for SPEC in the presence of F iff there exist traces α, β ∈ SPEC such that

1. s is a fusion point of α and β,

2. fusion(α, s, β) ∈ F(Σ), and

3. fusion(α, s, β) ∉ SPEC.

Intuitively, a bad fusion point is a state in which “multiple pasts” may have happened, i.e., there may be two different execution paths passing through s, and from the point of view of the specification it is important to tell the difference. We now give several examples of bad fusion points.

As an example, consider Fig. 7 where e is a bad fusion point. To instantiate the definition, take α = a · b · e ∈ F(Σ) and β = a · d · e · f ∈ F(Σ). The fusion at e yields the trace a · b · e · f which is not in SPEC.
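The three conditions of Definition 14 can be checked by brute force on a finite example. The sketch below is an illustration under explicit assumptions: SPEC is represented by a predicate `in_spec`, the candidate traces α and β are drawn from a hand-picked finite list, and F(Σ) is given as a finite set of traces chosen to match the example of Fig. 7. None of these names come from the paper.

```python
from itertools import product


def fusion(alpha, s, beta):
    """alpha_pre · s · beta_post, assuming s occurs once in each trace."""
    i, j = alpha.index(s), beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]


def in_spec(trace):
    """SPEC = 'f implies previously d'."""
    seen_d = False
    for state in trace:
        if state == "d":
            seen_d = True
        if state == "f" and not seen_d:
            return False
    return True


def bad_fusion_points(candidates, fault_traces, in_spec):
    """States s with traces alpha, beta satisfying Definition 14."""
    bad = set()
    for alpha, beta in product(candidates, repeat=2):
        if not (in_spec(alpha) and in_spec(beta)):
            continue
        for s in set(alpha) & set(beta):
            fused = fusion(alpha, s, beta)
            is_fusion_point = fused != alpha and fused != beta
            if is_fusion_point and fused in fault_traces and not in_spec(fused):
                bad.add(s)
    return bad


# Assumed finite stand-ins for the traces discussed around Fig. 7.
candidates = [("a", "b", "e"), ("a", "d", "e", "f")]
fault_traces = {("a", "b", "c"), ("a", "d", "e", "f"), ("a", "b", "e", "f")}

print(bad_fusion_points(candidates, fault_traces, in_spec))
```

For these inputs the only state reported is e, in agreement with the example above; state a is shared by both traces but fusing there merely reproduces one of the inputs.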

Theorem 2 (bad fusion point criterion) Let SPEC be a specification, Σ be a system satisfying SPEC and F be a fault model. The following two statements are equivalent:

1. Σ has no bad fusion point for SPEC in the presence of F.

2. F(Σ) does not exploit the non-fusion closure of SPEC.

Figure 10: Diagram accompanying the proof of Theorem 2.

PROOF SKETCH: We prove the contraposition of the theorem in both directions. First

we assume that F(Σ) exploits the non-fusion closure and use Theorem 1 to construct

a bad fusion point. Second we prove that if there exists a bad fusion point then F(Σ)

exploits the non-fusion closure.

1 ⟨1⟩1. ASSUME: Σ has no bad fusion point for SPEC in the presence of F.
PROVE: F(Σ) does not exploit the non-fusion closure of SPEC.

1.1 ⟨2⟩1. ASSUME: F(Σ) exploits the non-fusion closure of SPEC.
PROVE: False

1.1.1 ⟨3⟩1. There exists a minimal prefix σ′ of σ which violates SPEC.
PROOF: By the assumption of step ⟨2⟩1 and Definition 13 there is a trace σ ∈ F(Σ) with σ ∉ SPEC and σ ∈ fusion-closure(SPEC). The claim follows from the fact that σ ∉ SPEC and that SPEC is a safety property.

1.1.2 ⟨3⟩2. σ′ contains at least one fusion point.
PROOF: Since σ ∉ SPEC but σ ∈ fusion-closure(SPEC) we can apply Theorem 1 and write σ as the fusion of traces α_i ∈ SPEC (see Fig. 10). If there were no fusion point within σ′, then σ′ would be a prefix of α_0, a contradiction to the fact that α_0 ∈ SPEC.

1.1.3 ⟨3⟩3. Let s denote the rightmost fusion point s_k in σ′ and let α denote the prefix of σ′ up to and including state s (see Fig. 10). Then α ∈ SPEC.
PROOF: Follows from the fact that σ′ is minimal (i.e., proper prefixes of σ′ satisfy SPEC, shown in step ⟨3⟩1) and the fact that α is a prefix of σ′.

1.1.4 ⟨3⟩4. If there exists a fusion point s_{k+1} after s_k in α_k, let β be the trace α_k up to and including s_{k+1} (see Fig. 10). Otherwise let β be the trace α_k. Then β ∈ SPEC.
PROOF: In both cases β is a prefix of α_k, which is in SPEC and so β ∈ SPEC too.

1.1.5 ⟨3⟩5. fusion(α, s, β) ∉ SPEC
PROOF: Follows from the fact that σ′ is a prefix of fusion(α, s, β) and SPEC is a safety property (any extension of σ′ is not in SPEC).

1.1.6 ⟨3⟩6. fusion(α, s, β) ∈ F(Σ)
PROOF: Follows from the construction of α and s (in step ⟨3⟩3) and β (in step ⟨3⟩4) and the fact that fusion(α, s, β) is a prefix of σ which is in F(Σ).

1.1.7 ⟨3⟩7. s is a bad fusion point for Σ in the presence of F.
PROOF: Steps ⟨3⟩3 and ⟨3⟩4 exhibit traces α and β which are both in SPEC. Step ⟨3⟩6 shows that their fusion at state s is in F(Σ). Finally, step ⟨3⟩5 shows that this fusion is not in SPEC. From Definition 14 it follows that s is a bad fusion point for Σ in the presence of F.

1.1.8 ⟨3⟩8. Q.E.D.
PROOF: Step ⟨3⟩7 contradicts the assumption that Σ has no bad fusion point in the presence of F.

1.2 ⟨2⟩2. Q.E.D.
PROOF: Follows indirectly from step ⟨2⟩1.

2 ⟨1⟩2. ASSUME: F(Σ) does not exploit the non-fusion closure of SPEC.
PROVE: Σ has no bad fusion point for SPEC in the presence of F.

2.1 ⟨2⟩1. ASSUME: Σ has a bad fusion point for SPEC in the presence of F.
PROVE: False

2.1.1 ⟨3⟩1. There exists a trace σ in F(Σ) such that σ ∉ SPEC and σ is the fusion of two traces α and β in SPEC at some state s.
PROOF: From the assumption of step ⟨2⟩1 and Definition 14.

2.1.2 ⟨3⟩2. The non-fusion closure of SPEC can be exploited for Σ.
PROOF: From step ⟨3⟩1 and the definition of exploits (Definition 13).

2.1.3 ⟨3⟩3. Q.E.D.
PROOF: Step ⟨3⟩2 contradicts the assumption of step ⟨1⟩2.

2.2 ⟨2⟩2. Q.E.D.
PROOF: Follows indirectly from step ⟨2⟩1.

3 ⟨1⟩3. Q.E.D.
PROOF: The two top level steps show both directions of the equivalence.

4.4 Removal of Bad Fusion Points

Theorem 2 states that it is both necessary and sufficient to remove all bad fusion points from Σ to make its structure robust against fault models that exploit the non-fusion closure of SPEC. So how can we get rid of bad fusion points?

Recall that a bad fusion point is one which has multiple pasts, and from the point of view of the specification, it is necessary to distinguish between those pasts. Thus, the basic idea of our method is to introduce additional states which split the fusion paths. This is sketched in Figure 11. Let Σ1 = (C1, I1, T1) be a system. If s is a bad fusion point of Σ1 for SPEC, there exists a trace β ∈ SPEC and a trace α ∈ F(Σ) which both go through s.

Constructive Method to Remove Bad Fusion Points: To remove bad fusion points, we now construct an extension Σ2 = (C2, I2, T2) of Σ1 in the following way:

• C2 = C1 ∪ {s′} where s′ is a “new” state,

• I2 = I1, and

• T2 results from T1 by “diverting” the transitions of β to and from s′ instead of s.

The extension is completed by defining the state projection function π to map s′ to s.

Observe that s is not a bad fusion point regarding α and β anymore because α now contains s and β a different state s′ which cannot be fused. So this procedure gets rid of one bad fusion point. Also, it does not by itself introduce a new one, since s′ is an extension state which cannot be referenced in SPEC. So we can repeatedly apply the procedure and incrementally build a sequence of extensions Σ1, Σ2, . . . where in every step one bad fusion point is removed and an additional state is added. However, F may cause new bad fusion points to be created during this process by introducing new fault transitions defined on the newly added states. But since the fault model is finite it will do this only finitely often. Hence, repeating this construction for every bad fusion point will terminate unless there are infinitely many bad fusion points. This, however, is impossible if the state space is finite.

Note that in the extension process, certain states can be extended multiple times because they might be bad fusion points for different combinations of traces.
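One step of the construction can be sketched as an operation on a transition set. This is an illustration under assumptions, not the authors' implementation: states are strings, transitions are pairs, the incoming transition of β is moved to the fresh state, and the outgoing transition of β is copied to the fresh state while the original outgoing transition of s is kept (Section 4.6 suggests such leftovers are later removed by the fail-safe transformation). The function names are hypothetical.

```python
def split_fusion_path(states, transitions, projection, s, beta):
    """Return an extended system in which trace `beta` passes through a
    fresh copy of `s` instead of `s` itself."""
    i = beta.index(s)
    s_new = s + "'"
    while s_new in states:          # pick a name not used so far
        s_new += "'"
    new_transitions = set(transitions)
    if i > 0:                       # divert the transition of beta into s
        new_transitions.discard((beta[i - 1], s))
        new_transitions.add((beta[i - 1], s_new))
    if i + 1 < len(beta):           # give the copy beta's outgoing transition
        new_transitions.add((s_new, beta[i + 1]))
    new_projection = dict(projection)
    new_projection[s_new] = s       # pi maps the new state back to s
    return states | {s_new}, new_transitions, new_projection


def project(trace, projection):
    """Apply the state projection pi to every state of a trace."""
    return tuple(projection.get(state, state) for state in trace)


# Program transitions assumed for the system of Fig. 7 (faults excluded).
states = {"a", "b", "c", "d", "e", "f"}
transitions = {("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "f")}

states2, transitions2, pi = split_fusion_path(
    states, transitions, {}, "e", ("a", "d", "e", "f"))
print(sorted(transitions2))
print(project(("a", "d", "e'", "f"), pi))
```

After the split, the fault-free path a · d · e′ · f projects back to a · d · e · f, while a fault that jumps from b to e no longer lands on a state shared with that path.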

Figure 11: Splitting fusion paths.

We now prove that the above method results in a program with the desired proper-

ties.

Lemma 3 Let F be a fault model, SPEC1 be a non-fusion closed specification, and Σ1 be a program such that Σ1 satisfies SPEC1 but F(Σ1) violates SPEC1. The program Σ′2 which results from applying the constructive method described above satisfies the following properties:

1. Σ′2 extends Σ1 using some state projection π and

2. F(Σ′2) does not exploit the non-fusion closure of SPEC1.

PROOF SKETCH: To show the first point we argue that there exists a projection function π (which is induced by our method) such that every fault-free execution of Σ′2 is an execution of Σ1. To show the second point, we argue that the method removes all bad fusion points and apply the bad fusion point criterion proved as Theorem 2.

1 ⟨1⟩1. The induced projection function π of the constructive method above is such that Σ′2 extends Σ1 using π.

1.1 ⟨2⟩1. For every state s of Σ1 there exists a π-image s′ in the state space of Σ′2.
PROOF: The constructive method starts off with the state space of Σ′2 being equal to the state space of Σ1 and any subsequent changes to π do not affect this initial mapping.

1.2 ⟨2⟩2. Consider an arbitrary fault-free execution σ′ = s′_1, s′_2, . . . of Σ′2. Then π(σ′) is an execution of Σ1.
PROOF: Looking at Figure 11, every execution σ′ of Σ′2 evolves from an execution of Σ1 by splitting fusion paths and adapting π appropriately. Therefore, under the projection function π both executions look the same. Formally, this is proved using an induction on the length of the execution.

1.3 ⟨2⟩3. Q.E.D.
PROOF: Steps ⟨2⟩1 and ⟨2⟩2 prove the two conditions of Definition 3 (extension) with respect to the projection function π. Hence, Σ′2 extends Σ1 using π.

2 ⟨1⟩2. F(Σ′2) does not exploit the non-fusion closure of SPEC1.

2.1 ⟨2⟩1. Σ′2 has no bad fusion point in the presence of F.
PROOF: This is a result of applying the constructive method. Because all fusion paths through bad fusion points are split, no bad fusion points remain.

2.2 ⟨2⟩2. Q.E.D.
PROOF: Because of step ⟨2⟩1 we can apply the bad fusion point criterion (Theorem 2) which shows that the non-fusion closure of SPEC1 cannot be exploited for Σ′2 in the presence of F.

3 ⟨1⟩3. Q.E.D.
PROOF: The above two steps show the two consequents of the lemma.

4.5 Correctness of the Combined Method

Starting from a program Σ1, Lemma 3 shows that the program Σ′2 resulting from the constructive method for removing bad fusion points enjoys certain properties (see Fig. 9). We now prove that starting off from these properties and choosing SPEC2 as the fusion closure of SPEC1, the program Σ2, which results from applying the algorithms of [11, 12] on Σ′2, has the desired properties of the transformation problem (Definition 8).

Lemma 4 Given F, SPEC1, and Σ1 as in Lemma 3, let SPEC2 = fusion-closure(SPEC1) and let Σ2 be the result of applying any of the known methods that solve the fusion-closed transformation problem of Definition 9 to Σ′2 with respect to F and SPEC2, where Σ′2 results from Σ1 through the application of the constructive method. Then the following statements hold:

1. Σ2 extends Σ1 using some state projection π.

2. If F(Σ2) satisfies SPEC2 then F(Σ2) satisfies SPEC1.

PROOF SKETCH: To prove the first point we argue that a fault tolerance addition procedure only removes non-reachable transitions. Hence, every fault-free execution of Σ′2 is also an execution of Σ2. But since Σ′2 extends Σ1 so must Σ2. To show the second point we first observe that F(Σ′2) does not necessarily satisfy SPEC1, but not all of the violating traces are in F(Σ2) anymore (due to the removal of bad transitions during addition of fault tolerance). Next we show that any trace of F(Σ2) which violates SPEC1 must exploit the non-fusion closure of SPEC1. But this must also be a trace of F(Σ′2) and so is ruled out by assumption.

1 ⟨1⟩1. If Σ′2 extends Σ1 using state projection π′ then Σ2 extends Σ1 using state projection π.

1.1 ⟨2⟩1. Application of the known methods to add fail-safe fault tolerance according to Definition 9 does not change the fault-free behavior of the system.
PROOF: For the methods of Kulkarni and Arora [12] and Jhumka et al. [11] this has been discussed in Section 3.

1.2 ⟨2⟩2. Every (fault-free) execution of Σ′2 is also a (fault-free) execution of Σ2 and vice versa.
PROOF: Follows from step ⟨2⟩1 and the fact that Σ2 results from Σ′2 by applying the fail-safe-tolerance transformation (see Fig. 9).

1.3 ⟨2⟩3. Every execution of Σ′2 under π′ is an execution of Σ1.
PROOF: Follows from the assumption that Σ′2 extends Σ1 using π′.

1.4 ⟨2⟩4. Every execution of Σ2 is also an execution of Σ1 under π′ and vice versa.
PROOF: Starting with an arbitrary execution σ of Σ2, step ⟨2⟩2 allows us to find an equivalent execution σ′ of Σ′2. Then for σ′, step ⟨2⟩3 allows us to find an equivalent execution σ″ of Σ1.

1.5 ⟨2⟩5. Q.E.D.
PROOF: Step ⟨2⟩4 allows us to construct a state projection function such that the safety properties of Σ1 and Σ2 are identical. Hence, Σ2 extends Σ1.

2 ⟨1⟩2. ASSUME: 1. F(Σ′2) does not exploit the non-fusion closure of SPEC1.
2. F(Σ2) satisfies SPEC2.
PROVE: F(Σ2) satisfies SPEC1.

2.1 ⟨2⟩1. All executions σ of F(Σ′2) that violate SPEC1 are not in F(Σ2).
PROOF: This follows from applying a fail-safe tolerance transformation procedure, such as those in [11, 12]. Since these procedures are proved to be sound, i.e., the resulting programs are indeed fail-safe fault-tolerant, no execution can violate the specification.

2.2 ⟨2⟩2. ∀σ ∈ F(Σ2) : σ ∈ SPEC2
PROOF: Follows directly from the second assumption, i.e., F(Σ2) satisfies SPEC2.

2.3 ⟨2⟩3. ∀σ ∈ F(Σ2) : σ ∈ F(Σ′2)
PROOF: The known fail-safe tolerance transformation procedures that solve Definition 9 guarantee that F(Σ2) ⊆ F(Σ′2), from which this step follows.

2.4 ⟨2⟩4. F(Σ2) does not exploit the non-fusion closure of SPEC1.
PROOF: For a contradiction, assume that there is an execution τ ∈ F(Σ2) that exploits the non-fusion closure of SPEC1. Since τ ∈ F(Σ2), from step ⟨2⟩3 we have that τ ∈ F(Σ′2). Hence, F(Σ′2) also exploits the non-fusion closure of SPEC1, a contradiction to assumption 1.

2.5 ⟨2⟩5. ∀σ ∈ F(Σ2) : σ ∈ SPEC1

2.5.1 ⟨3⟩1. ASSUME: σ ∈ Σ2
PROVE: QED
PROOF: Since σ ∈ Σ2 and Σ2 extends Σ1 we have that σ ∈ Σ1. But since Σ1 satisfies SPEC1 we conclude that σ ∈ SPEC1.

2.5.2 ⟨3⟩2. ASSUME: σ ∈ F(Σ2) \ Σ2
PROVE: QED
PROOF: First note that σ cannot be in fusion-closure(SPEC1) \ SPEC1 (follows from step ⟨2⟩4). But since fusion-closure(SPEC1) = SPEC2 and since F(Σ2) satisfies SPEC2 we have that σ must be in SPEC1.

2.5.3 ⟨3⟩3. Q.E.D.
PROOF: Follows from steps ⟨3⟩1 and ⟨3⟩2 and the fact that they cover all cases.

2.6 ⟨2⟩6. Q.E.D.
PROOF: Step ⟨2⟩5 shows that F(Σ2) satisfies SPEC1 which is what we wanted to prove.

3 ⟨1⟩3. Q.E.D.
PROOF: Steps ⟨1⟩1 and ⟨1⟩2 prove the first and second point of the lemma, respectively.

Lemmas 3 and 4 together guarantee that the composition of the method described in Section 4.4 and the fail-safe transformation methods for fusion-closed specifications in fact solves the transformation problem for non-fusion closed specifications of Definition 8.

Theorem 3 Let F be a fault model and Σ1 a program which is F-intolerant with respect to a non-fusion closed specification SPEC1. The composition of the constructive method described in Section 4.4 and the fail-safe transformation methods for fusion-closed specifications solves the general transformation problem of Definition 8, i.e., constructs a program Σ2 such that Σ2 extends Σ1 and F(Σ2) satisfies SPEC1.

4.6 Examples

Finally, we present two examples of the application of our method. The top of Figure 12 (system 1) shows the original system. The augmented system is depicted at the bottom (system 4). The correctness specification for the system is “(d implies previously b) and (e implies previously c)”. There are only two bad fusion points, namely c and d, which have to be extended. In the first step, c is “removed” by splitting the fusion path which is indicated using two short lines. This results in system 2. Subsequently, d is refined, resulting in system 3. Note that d has to be refined twice because there are two sets of fusion paths. This results in system 4, which can be subject to the standard fail-safe transformation methods, which will remove the transitions (c, d″) and (d, e).

Figure 12: Removing bad fusion points. The specification is “(d implies previously b) and (e implies previously c)”.

A similar, yet more complex example is shown in Figure 13. The correctness specification for system 1 at the top is “g implies previously (b or e)”. The figure shows that again a “two level” extension is necessary here, since the only execution which must be prevented is the one which uses both fault transitions. This means that state f is a bad fusion point for multiple execution paths and hence must be refined twice (note that the fault transition (d, f) is a new fault added to the system in the extension).

4.7 Discussion

The complexity of our method directly depends on the number of bad fusion points which have to be removed. Bad fusion points are not hard to find if the specification is given as a temporal logic formula in the spirit of those used throughout this paper. For example, if specifications are given in the form “x only if previously y” then only states which occur in traces between x and y can be fusion points. Candidates for bad fusion points are all states where two execution paths merge.

Our method requires checking for every one of these states whether it is a bad fusion point. So obviously, applying our method induces a larger overhead than directly adding history variables. But as can be seen in Figs. 12 and 13, the number of resulting states is significantly smaller than when adding a general history variable. For example, a clever addition of history variables to the system in Fig. 12 would require two bits, one to record the visit to state b and one to record the visit to c. Overall this would result in 2 × 2 × 5 = 20 states. Our method achieves the same result with a total of 8 states. The system in Fig. 13 could employ a boolean history variable which records whether states b or e have been visited (it is set to true as soon as one of these states is reached). Adding such a variable would create a total of 7 additional states. Our method adds just 5.
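The state counts quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that the added states are the primed states visible in Figures 12 and 13 (c′, d′, d″ and c′, d′, e′, f′, f″, respectively); the variable names are illustrative.

```python
# Figure 12: five original states a..e and two boolean history bits
# (one for "visited b", one for "visited c").
fig12_original = 5
fig12_history_total = 2 * 2 * fig12_original            # product state space
fig12_split_total = fig12_original + len(["c'", "d'", "d''"])

# Figure 13: seven original states a..g and one boolean history variable.
fig13_original = 7
fig13_history_added = 2 * fig13_original - fig13_original
fig13_split_added = len(["c'", "d'", "e'", "f'", "f''"])

print(fig12_history_total, fig12_split_total)
print(fig13_history_added, fig13_split_added)
```

The gap widens with every further history bit, since each bit doubles the product state space, whereas splitting adds states only where fusion paths actually merge.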

Note however that the resulting system in Fig. 12 is not redundancy minimal. The state d″ is not necessary since it may become unreachable even in the presence of faults after the fail-safe transformation is applied. This is the price we still have to pay for the modularity of our approach, i.e., adding history states does at present not “look ahead” which states might become unreachable even in the presence of faults.

In theory there are cases where our method of adding history states does not termi-

nate because there are inﬁnitely many bad fusion points. For this to happen, the state

space must be inﬁnite. If we consider the application area of embedded software, we

can safely assume a bounded state space.

Given a program Σ and a general specification SPEC, our combined method will find a solution to the general transformation problem iff (a) there exists one with a finite number of additional states and (b) the method of adding fail-safe fault-tolerance for fusion-closed specifications is complete. Requirement (a) ensures that our method of removing bad fusion points will terminate.

5 Conclusions

In this paper, we have presented a way to get rid of a restriction upon which procedures that add fault tolerance [11, 12] are based, namely that specifications have to be fusion closed. Our method can be viewed as a finer grained method to add history information to a given system and hence add state space redundancy. We have shown that our method in general adds fewer history states than would be added using standard history variables (which in general lead to an exponential growth of the state space). Thus, adding state redundancy using the approach presented in this paper makes addition of fault tolerance more efficient.

Figure 13: A more complex example. The specification is “g implies previously (b or e)”.

As future work, it would be interesting to combine our method with one of the

methods to add detectors so that the resulting method is redundancy minimal. We

are also investigating issues of non-masking fault-tolerance, i.e., adding tolerance with

respect to liveness properties.

Acknowledgments

We wish to thank Sandeep Kulkarni for helpful discussions. Work by the ﬁrst author

was supported by Deutsche Forschungsgemeinschaft (DFG) as part of “Graduiertenkol-

leg ISIA” and Emmy Noether programme.

References

[1] Martín Abadi and Leslie Lamport. The existence of refinement mappings. Theoretical Computer Science, 82(2):253–284, May 1991.

[2] Bowen Alpern and Fred B. Schneider. Deﬁning liveness. Information Processing

Letters, 21:181–185, 1985.

[3] Anish Arora and Sandeep S. Kulkarni. Component based design of multitoler-

ant systems. IEEE Transactions on Software Engineering, 24(1):63–78, January

1998.

[4] Anish Arora and Sandeep S. Kulkarni. Detectors and correctors: A theory of

fault-tolerance components. In Proceedings of the 18th IEEE International Con-

ference on Distributed Computing Systems (ICDCS98), May 1998.

[5] Anindya Basu, Bernadette Charron-Bost, and Sam Toueg. Simulating reliable

links with unreliable links in the presence of process crashes. In Proceedings

of the 10th International Workshop on Distributed Algorithms (WDAG96), pages

105–122, Bologna, Italy, October 1996. Springer-Verlag.

[6] K. Mani Chandy and Jayadev Misra. Parallel Program Design: A Foundation. Addison-Wesley, Reading, MA, 1988.

[7] Edsger W. Dijkstra. Guarded commands, nondeterminacy, and formal derivation

of programs. Communications of the ACM, 18(8):453–457, August 1975.

[8] Felix C. Gärtner. Transformational approaches to the specification and verification of fault-tolerant systems: Formal background and classification. Journal of Universal Computer Science (J.UCS), 5(10):668–692, October 1999. Special Issue on Dependability Evaluation and Assessment.

[9] Felix C. Gärtner and Hagen Völzer. Redundancy in space in fault-tolerant systems. Technical Report TUD-BS-2000-06, Department of Computer Science, Darmstadt University of Technology, Darmstadt, Germany, July 2000.

[10] H. Peter Gumm. Another glance at the Alpern-Schneider characterization of

safety and liveness in concurrent executions. Information Processing Letters,

47(6):291–294, 1993.

[11] Arshad Jhumka, Felix C. Gärtner, Christof Fetzer, and Neeraj Suri. On systematic design of fast and perfect detectors. Technical Report 200263, Swiss Federal Institute of Technology (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland, September 2002.

[12] Sandeep S. Kulkarni and Anish Arora. Automating the addition of fault-

tolerance. In Mathai Joseph, editor, Formal Techniques in Real-Time and Fault-

Tolerant Systems, 6th International Symposium (FTRTFT 2000) Proceedings,

number 1926 in Lecture Notes in Computer Science, pages 82–93, Pune, India,

September 2000. Springer-Verlag.

[13] Leslie Lamport. Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering, 3(2):125–143, March 1977.

[14] Leslie Lamport. How to write a proof. American Mathematical Monthly, 102(7):600–608, August/September 1995.

[15] Zhiming Liu and Mathai Joseph. Specification and verification of fault-tolerance, timing and scheduling. ACM Transactions on Programming Languages and Systems, 21(1):46–89, 1999.

[16] Heiko Mantel and Felix C. Gärtner. A case study in the mechanical verification of fault tolerance. Journal of Experimental & Theoretical Artificial Intelligence (JETAI), 12(4):473–488, October 2000.

[17] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.


