
To appear in Proc. 8th WDAG, October 1994

Self-Stabilization by Local Checking and Global Reset

Extended Abstract

Baruch Awerbuch, Boaz Patt-Shamir, George Varghese and Shlomi Dolev

Dept. of Computer Science, Johns Hopkins University

Lab. for Computer Science, MIT

Dept. of Computer Science, Washington University

Dept. of Computer Science, Texas A&M University

School of Computer Science, Carleton University

Abstract. We describe a method for transforming asynchronous network protocols

into protocols that can sustain any transient fault, i.e., become self-stabilizing. We

combine the known notion of local checking with a new notion of internal reset, and

prove that given any self-stabilizing internal reset protocol, any locally-checkable

protocol can be made self-stabilizing. Our proof is constructive in the sense that

we provide explicit code. The method applies to many practical network problems,

including spanning tree construction, topology update, and virtual circuit setup.

1 Introduction

A network protocol is called self-stabilizing (or stabilizing for short) if when started from

an arbitrary state, it eventually exhibits the desired behavior. In the context of computer

networks, a self-stabilizing system may have an initial state with arbitrary messages at the

links and arbitrary corruption of the state variables at the nodes. The practical appeal of

stabilizing protocols is that they are simpler (i.e., they avoid a slew of mechanisms to deal

with a catalog of anticipated faults), and they are more robust (e.g., they can recover from

transient faults such as memory corruption as well as common faults such as link and node

crashes).

Since the pioneering work of Dijkstra [11], the theory of self-stabilization has been

extensively studied (e.g., [9, 16, 12, 2, 5]). While most of the work was directed at self-

stabilization of specific tasks, some work was devoted to designing general algorithmic

transformers that take a protocol as input, and produce as their output a self-stabilizing

version of that protocol. These transformers typically exhibit trade-offs between their

generality (i.e., the range of input protocols they can transform) and the efficiency of

the resulting protocols. One such general transformation is given by Katz and Perry [16],

where they show how to compile an arbitrary asynchronous protocol into a stabilizing

equivalent. Briefly, the idea in [16] is that a leader node periodically takes “snapshots” of

the global network state, and resets the system if some inconsistency is detected. We call

this method global checking and correction. Due to its generality, this transformation is

expensive in terms of space and communication; another drawback of this approach is that

it requires an additional self-stabilizing mechanism that maintains routes that connect all

nodes to some leader.

Afek, Kutten and Yung [5] suggested that global inconsistency could sometimes be

detected by checking the states of neighbors — i.e., by local means. Using the idea of


detecting faults locally and correcting them by a global operation, a stabilizing spanning-tree

construction is developed in [5]. In [2], Arora and Gouda propose the use of distributed reset

to maintain diffusing computations. Their reset protocol requires an underlying stabilizing

spanning tree.

The idea of local detection of faults is formalized in [6, 7, 28] under the name of

local checking. In [6, 28], the class of locally correctable protocols is also defined; these are

protocols that can reach “good” global states by means of local correction actions. In [6, 28],

a transformer that uses local checking and local correction is described. The transformer of

[6] is efficient, but it can be applied only to protocols that are both locally checkable and

locally correctable. Unfortunately, many interesting network protocols can be shown to be

locally checkable, but not locally correctable.
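To make the idea of detecting a global inconsistency by local means concrete, the following minimal sketch checks a rooted spanning tree using only each node's own state and the states of its neighbors. The state layout `(root, parent, dist)` and the names `node_ok` and `violating_nodes` are assumptions made for illustration; this is one common formulation of such a check and not necessarily the one used in [5]. One can verify that, in a connected graph, an empty violation list implies that the parent pointers form a tree rooted at the agreed root.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-node state: (root id, parent id or None, hop distance to root).
NodeState = Tuple[int, Optional[int], int]


def node_ok(u: int, state: Dict[int, NodeState], adj: Dict[int, List[int]]) -> bool:
    """Predicate that node u can evaluate from its own and its neighbors' states."""
    root, parent, dist = state[u]
    if dist < 0:
        return False
    # Every neighbor must agree on the identity of the root.
    if any(state[v][0] != root for v in adj[u]):
        return False
    if parent is None:
        # Only the root itself may lack a parent, and it must be at distance 0.
        return root == u and dist == 0
    # A non-root node must point to a neighbor exactly one hop closer to the root.
    return parent in adj[u] and state[parent][2] == dist - 1


def violating_nodes(state: Dict[int, NodeState], adj: Dict[int, List[int]]) -> List[int]:
    """Nodes whose local predicate fails; any such node would request a correction."""
    return [u for u in adj if not node_ok(u, state, adj)]
```

Note that the check says nothing about how to repair a violation: a node that detects one cannot, in general, fix the tree by changing only its own state.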

In this paper, motivated on the one hand by the inefficiency of the transformer of [16],

and by the narrowness of the transformer of [6] on the other hand, we introduce a new

algorithmic transformer that can be used to make a wide class of protocols self-stabilizing.

The idea is to combine local checking and global correction: bad states are detected by a

local checking mechanism, and a global correction action (called “reset”) is used to recover

from faults. We contend that local checking and global reset is the right balance in many

practical situations. First, we argue that global detection mechanisms such as the

self-stabilizing snapshot [16] incur unnecessarily large overhead (in terms of time, space and

communication) practically always, since networks are fairly failure-free. Local checking

detects faults quickly, and it can be done, as we show in this paper, with only a small

increase in communication cost. Secondly, as mentioned above, there are many protocols

that are locally checkable but not locally correctable (e.g., spanning tree construction and

topology update [27, 20, 21, 22]). In these cases we are forced to use other techniques —

e.g., reset.

Even though resetting an entire network may seem drastic and inefficient, there is

evidence that this is not the case. For instance, consider routing protocols. The stabilization

time of our method (using the best reset protocols) is proportional to the cross-network latency,

which is the time it takes many protocols to compute their results anyhow, even after

being started in a good state. Empirical results also support the claim that resets perform

quite well in practice. Specifically, DEC SRC’s AN-1 network [25] employs a variant of

global reset for dealing with topology changes (by making a reset request whenever a

link fails or comes up).* The AN-1 designers found that the protocol recovered from link

failures very fast [23]. The reason is that usually the routing protocol only operates for a

small fraction of the time at a node; the remaining processing is devoted to forwarding

data. During a reset, however, no data forwarding is done; all processing and bandwidth are

devoted to the reset. The moral from the AN-1 experience is that reset schemes work well

for small-sized networks; for larger networks, the same approach should work if the routing

protocol is hierarchical [15] and each level is reset independently.

* The AN-1 reset is performed using a version of Finn’s unbounded counter protocol [14].

The main result of this paper is a precise description and statement of the method of

local checking and global reset. We provide formalization, analysis, and code. We believe

that in doing so we contribute something that will help both theoreticians and practitioners.

We remark that the ideas of local checking and global reset are not new; for instance, the

stabilizing spanning tree protocol of [5] uses local detection, and Arora and Gouda [2] use

reset to maintain diffusing computations. The contribution of this paper is in introducing

a general transformer that can be used to stabilize any locally checkable protocol. The

description of the transformer entails a description of a local checking mechanism, detailed

requirements that the reset protocol being used must meet, and a description of the way to

construct the resulting self-stabilizing protocol.

It is important to observe that the classical notion of reset [14, 1, 6] is insufficient for our

purposes. In these papers, the task is specified in terms of an external entity that triggers the

reset: for example, the reset can be triggered by a change in the topology (e.g., link crash).

The important point is that this specification formalizes a reset that is invoked regardless of

the way it affects the system. Below, we call such resets external. Notice that an external

reset is inadequate for a general transformer: in our method there is no external entity.

We have a protocol, which is checked by the local checking mechanism, that can trigger

the reset, which in turn changes the state of the protocol being checked. If, while resetting,

inconsistent states of the original protocol are created, the local checking mechanism might

invoke the reset again, resulting in an endless vicious cycle of reset invocations. Therefore,

another notion of reset is required in this setting. One of the contributions of this paper is an

appropriate specification of a stronger reset, hereafter called internal reset. Intuitively, the

requirement of external reset that there are only finitely many reset invocations is replaced

in internal reset by a specification that guarantees that when used properly, the reasons for

invoking reset eventually disappear.

Interestingly, some reset implementations [1, 6] are known to produce intermediate

global inconsistencies [1, 28]. In this paper, however, we show that a certain “pairwise

consistency” is sufficient; fortunately, it turns out that the above protocols (although designed

as external resets) meet the requirements of internal reset.

The remainder of the paper is organized as follows. We start, in Section 2, with an

overview of the network model and the definition of stabilization used in this paper. In

Section 3 we define the notion of local checkability (this is a straightforward formulation

of the ideas in [5, 6]). In Section 4 we give a definition of the requirements of internal

reset protocols. Then, in Section 5, we give our main result, which connects the known notion

of local checkability with the new notion of internal reset. Namely, we present a theorem

that says that any locally checkable protocol can be made self-stabilizing using any

self-stabilizing reset protocol. We sketch a specific implementation of the local checking process,

and outline a proof of correctness for the combined protocol (explicit code is omitted from

this extended abstract). Some applications of our main result are mentioned in Section 6.

2 Model

In this section we describe our network model. We first review briefly the underlying

formal model of Input/Output Automata (see [17, 18] for full definitions), and establish the

notation we use throughout this paper. We also formalize the notion of self-stabilization in

this framework. In the second part of this section, we specify the network model we are

dealing with in this paper.

IO Automata, Stabilization, Time Complexity. An Input/Output Automaton (abbreviated

IOA henceforth) is a state machine whose state transitions are given labels called actions.


There are three kinds of actions. The environment affects the automaton through input

actions which must be responded to in any state. The automaton affects the environment

through output actions; these actions are controlled by the automaton. Internal actions only
change the state of the automaton without affecting the environment. Formally, an IOA
$N$ is defined by a state set $S(N)$, an action set $A(N)$, a signature $G(N)$ (that classifies
the action set into input, output, and internal actions), a transition relation
$R(N) \subseteq S(N) \times A(N) \times S(N)$, and a non-empty set of initial states
$I(N) \subseteq S(N)$. We omit the automaton’s name when it is clear from the context. An action
$a$ is said to be enabled in state $s$ if there exists $s' \in S$ such that $(s, a, s') \in R$. Input
actions are always enabled. For an automaton $N$ and a non-empty set $L \subseteq S(N)$, we define
$N|L$ to be the automaton that is obtained from $N$ by setting the initial states to be $L$. In this
paper we often deal with uninitialized IOAs, for which $I = S$ and $S$ is finite. IOAs communicate
by means of shared actions. More formally, IOAs can be composed (under certain compatibility
conditions) to generate a composite state machine; an action which is output of one of the
components and input of the other is performed simultaneously.

When an IOA “runs” it produces an execution. Formally, an execution fragment is an
alternating sequence of states and actions $(s_0, a_0, s_1, \ldots)$, such that
$(s_i, a_i, s_{i+1}) \in R$ for all $i \ge 0$. An execution fragment is fair if any internal or
output action that is continuously enabled eventually occurs.† An execution is an execution
fragment that begins with an initial state and is fair. A schedule is a subsequence of an
execution consisting only of the actions. A behavior is a subsequence of a schedule consisting
only of its input and output actions. Each IOA generates a set of behaviors. An IOA $A$
implements another IOA $B$ if the behaviors of $A$ are a subset of the behaviors of $B$. For
stabilization, we weaken this definition and require only that $A$ eventually exhibit a
behavior of $B$. Formally, we say that $A$ stabilizes to $B$ if every behavior of $A$ has a
suffix which is a behavior of $B$. Note that this definition (based on a definition by Nancy
Lynch) is formulated in terms of external behavior, as opposed to a (somewhat circular)
state-based definition.

For time complexity, we use the timed IOA model of [18] (see [28] for formal details).
Informally, we assume that every internal or output action that is continuously enabled
occurs in 1 unit of time. We say that $A$ stabilizes to $B$ in time $t$ if every behavior of $A$
has a suffix that occurs within time $t$ and is a behavior of $B$. The stabilization time from
$A$ to $B$ is the smallest $t$ such that $A$ stabilizes to $B$ in time $t$. For any automaton $N$,
define $N(x)$ to be identical to $N$ except that the time associated with each action is now
$x$ time units instead of 1 time unit.
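The definitions above can be made concrete for finite automata. The sketch below is an illustration only: the class name `IOA`, its field names, and the helper functions are assumptions introduced here, not notation from the paper. It covers the enabled relation, the restriction $N|L$, finite execution fragments, and behaviors. The function `first_legal_suffix` is a finite-trace stand-in for "has a suffix which is a behavior of $B$"; the actual definition quantifies over complete (possibly infinite) behaviors.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, List, Optional, Sequence, Tuple

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class IOA:
    """Finite I/O automaton; field names are illustrative."""
    states: FrozenSet[State]
    inputs: FrozenSet[Action]
    outputs: FrozenSet[Action]
    internals: FrozenSet[Action]
    transitions: FrozenSet[Tuple[State, Action, State]]  # the relation R
    initial: FrozenSet[State]                             # the set I

    def enabled(self, s: State, a: Action) -> bool:
        """a is enabled in s if some triple (s, a, s') belongs to R."""
        return any(src == s and act == a for (src, act, _) in self.transitions)

    def restrict(self, new_initial: FrozenSet[State]) -> "IOA":
        """N|L: the same automaton with the initial states replaced by L."""
        if not new_initial or not new_initial <= self.states:
            raise ValueError("L must be a non-empty subset of the state set")
        return IOA(self.states, self.inputs, self.outputs,
                   self.internals, self.transitions, new_initial)

    def uninitialized(self) -> "IOA":
        """Every state is a possible initial state (I = S)."""
        return self.restrict(self.states)


def is_execution_fragment(ioa: IOA, seq: Sequence) -> bool:
    """seq alternates states and actions: s0, a0, s1, a1, ..., sk."""
    if len(seq) % 2 == 0:
        return False
    return all((seq[i], seq[i + 1], seq[i + 2]) in ioa.transitions
               for i in range(0, len(seq) - 2, 2))


def behavior(ioa: IOA, seq: Sequence) -> List[Action]:
    """Input and output actions of a finite execution fragment, in order."""
    external = ioa.inputs | ioa.outputs
    return [a for a in seq[1::2] if a in external]


def first_legal_suffix(trace: Sequence, legal: Callable[[Sequence], bool]) -> Optional[int]:
    """Smallest k such that trace[k:] satisfies `legal`, or None if there is none."""
    for k in range(len(trace) + 1):
        if legal(trace[k:]):
            return k
    return None
```

Starting an automaton from `uninitialized()` models the arbitrary initial state from which a stabilizing protocol must recover.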

Network Model. For the remainder of this paper we fix an underlying network topology,
modeled by a directed symmetric graph $G = (V, E)$ with unique node identifiers.‡ Each
node $v \in V$ represents a processor, and each directed edge represents a unidirectional
communication link. We denote the number of network nodes by $n = |V|$, and the network
diameter is denoted $d = \mathrm{diam}(G)$. Each node and link is modeled by an IOA. Below,
we describe verbally the links and node automata. Formal definitions are omitted from this
abstract.

† The IOA model specifies fairness in terms of equivalence classes; here we assume each action is in
a separate class.

‡ A graph $G = (V, E)$ is called symmetric if for all $u, v \in V$ we have that $(u, v) \in E$ implies
$(v, u) \in E$.
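The graph-level quantities of the network model (symmetry of $G$, $n = |V|$, and $d = \mathrm{diam}(G)$) can be computed as in the following minimal sketch. The adjacency-set representation and the function names are assumptions made for illustration, and the diameter computation additionally assumes that the graph is connected.

```python
from collections import deque
from typing import Dict, Hashable, Set

# Adjacency sets: v in graph[u] means that the directed edge (u, v) is in E.
Graph = Dict[Hashable, Set[Hashable]]


def is_symmetric(graph: Graph) -> bool:
    """True if (u, v) in E implies (v, u) in E."""
    return all(u in graph.get(v, set()) for u in graph for v in graph[u])


def diameter(graph: Graph) -> int:
    """Largest hop distance between any two nodes (assumes a connected graph)."""

    def eccentricity(src: Hashable) -> int:
        # Breadth-first search from src records hop distances to all nodes.
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            u = frontier.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        if len(dist) != len(graph):
            raise ValueError("graph is not connected")
        return max(dist.values())

    return max(eccentricity(u) for u in graph)
```

For a four-node path, `len(graph)` gives $n = 4$ and `diameter(graph)` gives $d = 3$.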


In our model, links have bounded storage, i.e., only a bounded number of outstanding

packets are stored on each link at any instant. The justification for this assumption is

twofold: first, not much can be done with unbounded links in a stabilizing setting [13], and

secondly, real links are inherently bounded anyway. In this paper we abstract this property

by postulating that a link can store at any given instant at most one outstanding packet.
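A link that stores at most one outstanding packet can be sketched in a few lines. The class below is illustrative only: its name and method names are assumptions that loosely mirror the SEND, RECEIVE and FREE actions of the link interface, and packets are assumed never to be `None`. The optional `initial` argument models a link that starts with an arbitrary packet already in transit, as may happen after a transient fault.

```python
from typing import Generic, Optional, TypeVar

P = TypeVar("P")


class SingleSlotLink(Generic[P]):
    """Unidirectional link holding at most one outstanding packet."""

    def __init__(self, initial: Optional[P] = None) -> None:
        # An arbitrary (possibly corrupted) packet may already be on the link.
        self._slot: Optional[P] = initial

    def is_free(self) -> bool:
        """Corresponds to the condition under which FREE is enabled."""
        return self._slot is None

    def send(self, packet: P) -> bool:
        """Store the packet if the link is empty; otherwise drop it.

        Returns True if the packet was accepted and False if it was dropped.
        """
        if self._slot is not None:
            return False
        self._slot = packet
        return True

    def receive(self) -> Optional[P]:
        """Deliver and remove the stored packet, or return None if the link is empty."""
        packet, self._slot = self._slot, None
        return packet
```

Dropping a packet that arrives at a full link is what forces senders to follow some discipline, such as waiting for an indication that the link is free before sending.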

Formally, a link from node $u$ to node $v$ is modeled as a queue $Q_{u,v}$ that can store at most one
packet from some packet alphabet at any instant. The external interface to the link (see
Figure 1) includes an input action $\mathrm{SEND}_{u,v}(p)$ (interpreted as “send packet $p$ from $u$”), an
output action $\mathrm{RECEIVE}_{u,v}(p)$ (interpreted as “deliver packet $p$ at $v$”), and an output action
$\mathrm{FREE}_{u,v}$ (interpreted as “the $(u, v)$ link is currently free”).§ If a $\mathrm{SEND}_{u,v}(p)$ occurs when
$Q_{u,v} = \emptyset$, the effect is that $Q_{u,v} = \{p\}$; when $Q_{u,v} = \{p\}$, $\mathrm{RECEIVE}_{u,v}(p)$ is enabled,
and when it is taken, its effect is to set $Q_{u,v} = \emptyset$; when $Q_{u,v} = \emptyset$, $\mathrm{FREE}_{u,v}$ is enabled. If
$\mathrm{SEND}_{u,v}(p)$ occurs when $Q_{u,v} \neq \emptyset$, there is no change of state (intuitively, the incoming
packet is just dropped). We note that by our timing assumptions, a packet stored in a link
will be delivered in one unit of time.

§ Our convention for action subscripts is that the first represents the sender and the second represents
the receiver.

[Figure 1, diagram not reproduced. Labels: Node u with queue[v] and free[v]; SEND_{u,v}; FREE_{u,v}; Q_{u,v}; RECEIVE_{u,v}; Node v.]

Fig. 1. Schematic representation of a single link, connecting queued node automata, from $u$ to $v$. The link from $v$ to $u$ is symmetric and is not shown.

A node automaton $u$ has, for each neighbor $v$, output actions $\mathrm{SEND}_{u,v}(p)$ to send
packets to $v$, input actions $\mathrm{RECEIVE}_{v,u}(p)$ to receive packets from $v$, and an input action
$\mathrm{FREE}_{u,v}$ to obtain indications of the $(u, v)$-link state. In this paper, we use a special node
automaton which we call a queued node automaton. A queued node automaton $u$ has the
following discipline for sending packets. It has a bounded output queue called queue[v] and
a boolean flag free[v] for each neighbor $v$ of $u$ (see Figure 1). Whenever a $\mathrm{FREE}_{u,v}$ action
is received from the link at $u$, the free[v] flag is set. A $\mathrm{SEND}_{u,v}(p)$ action is only performed
when $p$ is the head of queue[v] and free[v] is set; its effect is to clear free[v].

Queued node automata allow us to easily superimpose a local checking process. The
local checking process at each node needs access to the state of the node and also requires
the sending of control packets. (The requirement of access to state rules out the possibility
of formalizing the local checking process as a separate automaton.) Queued node automata
use a particular discipline for sending data packets on a link; this discipline makes it easy
to multiplex data and control packets on each link. Any node automaton requires some
discipline anyway to deal with bounded links. Thus the use of a particular discipline is not
overly restrictive and makes it easy to add local checking.

For the given graph $G = (V, E)$, we define the automaton for $G$ by the composition