To appear in Proc. 8th WDAG, October 1994
Self-Stabilization by Local Checking and Global Reset
Baruch Awerbuch, Boaz Patt-Shamir, George Varghese, and Shlomi Dolev
Dept. of Computer Science, Johns Hopkins University
Lab. for Computer Science, MIT
Dept. of Computer Science, Washington University
Dept. of Computer Science, Texas A&M University
School of Computer Science, Carleton University
Abstract. We describe a method for transforming asynchronous network protocols
into protocols that can sustain any transient fault, i.e., become self-stabilizing. We
combine the known notion of local checking with a new notion of internal reset, and
prove that given any self-stabilizing internal reset protocol, any locally-checkable
protocol can be made self-stabilizing. Our proof is constructive in the sense that
we provide explicit code. The method applies to many practical network problems,
including spanning tree construction, topology update, and virtual circuit setup.
A network protocol is called self-stabilizing (or stabilizing for short) if when started from
an arbitrary state, it eventually exhibits the desired behavior. In the context of computer
networks, a self-stabilizing system may have an initial state with arbitrary messages at the
links and arbitrary corruption of the state variables at the nodes. The practical appeal of
stabilizing protocols is that they are simpler (i.e., they avoid a slew of mechanisms to deal
with a catalog of anticipated faults), and they are more robust (e.g., they can recover from
transient faults such as memory corruption as well as common faults such as link and node
Since the pioneering work of Dijkstra , the theory of self-stabilization has been
extensively studied (e.g., [9, 16, 12, 2, 5]). While most of the work was directed at self-
stabilization of specific tasks, some work was devoted to designing general algorithmic
transformers that take a protocol as input, and produce as their output a self-stabilizing
version of that protocol. These transformers typically exhibit trade-offs between their
generality (i.e., the range of input protocols they can transform) and the efficiency of
the resulting protocols. One such general transformation is given by Katz and Perry ,
where they show how to compile an arbitrary asynchronous protocol into a stabilizing
equivalent. Briefly, the idea in  is that a leader node periodically takes “snapshots” of
the global network state, and resets the system if some inconsistency is detected. We call
this method global checking and correction. Due to its generality, this transformation is
expensive in terms of space and communication; another drawback of this approach is that
it requires an additional self-stabilizing mechanism that maintains routes that connect all
nodes to some leader.
Afek, Kutten and Yung  suggested that global inconsistency could sometimes be
detected by checking the states of neighbors — i.e., by local means. Using the idea of
to maintain diffusing computations. Their reset protocol requires an underlying stabilizing
The idea of local detection of faults is formalized in [6, 7, 28] under the name of
local checking. In [6, 28], the class of locally correctable protocols is also defined; these are
a transformer that uses local checking and local correction is described. The transformer of
 is efficient, but it can be applied only to protocols that are both locally checkable and
locally correctable. Unfortunately, many interesting network protocols can be shown to be
locally checkable, but not locally correctable.
In this paper, motivated on the one hand by the inefficiency of the transformer of ,
and by the narrowness of the transformer of  on the other hand, we introduce a new
algorithmic transformer that can be used to make a wide class of protocols self-stabilizing.
The idea is to combine local checking and global correction: bad states are detected by a
local checking mechanism, and a global correction action (called “reset”) is used to recover
from faults. We contend that local checking and global reset strike the right balance in many
practical situations. First, we argue that global detection mechanisms such as the self-
stabilizing snapshot  incur unnecessarily large overhead (in terms of time, space and
communication) practically always, since networks are fairly failure-free. Local checking
detects faults quickly, and it can be done, as we show in this paper, with only a small
increase in communication cost. Secondly, as mentioned above, there are many protocols
that are locally checkable but not locally correctable (e.g., spanning tree construction and
topology update [27, 20, 21, 22]). In these cases we are forced to use other techniques —
Even though resetting an entire network may seem drastic and inefficient, there is
evidence that this is not the case. For instance, consider routing protocols. The stabilization [...]
which is the time it takes many protocols to compute their results anyhow, even after
being started in a good state. Empirical results also support the claim that resets perform
quite well in practice. Specifically, DEC SRC’s AN-1 network  employs a variant of
global reset for dealing with topology changes (by making a reset request whenever a
link fails or comes up). The AN-1 designers found that the protocol recovered from link
failures very fast . The reason is that usually the routing protocol only operates for a
small fraction of the time at a node; the remaining processing is devoted to forwarding
data. During a reset, however, no data forwarding is done; all processing and bandwidth are
devoted to the reset. The moral from the AN-1 experience is that reset schemes work well
for small sized networks; for larger networks, the same approach should work if the routing
protocol is hierarchical  and each level is reset independently.
Footnote: The AN-1 reset is performed using a version of Finn’s unbounded counter protocol .
The main result of this paper is a precise description and statement of the method of
local checking and global reset. We provide formalization, analysis, and code. We believe
that in doing so we contribute something that will help both theoreticians and practitioners.
We remark that the ideas of local checking and global reset are not new; for instance, the
stabilizing spanning tree protocol of  uses local detection, and Arora and Gouda  use
reset to maintain diffusing computations. The contribution of this paper is in introducing
a general transformer that can be used to stabilize any locally checkable protocol. The
description of the transformer entails a description of a local checking mechanism, detailed
requirements that the reset protocol being used must meet, and a description of the way to
construct the resulting self-stabilizing protocol.
It is important to observe that the classical notion of reset [14, 1, 6] is insufficient for our
purposes. In these papers, the task is specified in terms of an external entity that triggers the
reset: for example, the reset can be triggered by a change in the topology (e.g., link crash).
The important point is that this specification formalizes a reset that is invoked regardless of
the way it affects the system. Below, we call such resets external. Notice that an external
reset is inadequate for a general transformer: in our method there is no external entity.
We have a protocol, which is checked by the local checking mechanism, that can trigger
the reset, which in turn changes the state of the protocol being checked. If while resetting
inconsistent states of the original protocol are created, the local checking mechanism might
invoke the reset again, resulting in an endless vicious cycle of reset invocations. Therefore,
another notion of reset is required in this setting. One of the contributions in this paper is an
appropriate specification of a stronger reset, hereafter called internal reset. Intuitively, the
requirement of external reset that there are only finitely many reset invocations is replaced
in internal reset by a specification that guarantees that when used properly, the reasons for
invoking reset eventually disappear.
Interestingly, some reset implementations [1, 6] are known to produce intermediate
global inconsistencies [1, 28]. In this paper, however, we show that a certain “pairwise
consistency” is sufficient; fortunately, it turns out that the above protocols (although designed
as external resets) meet the requirements of internal reset.
The remainder of the paper is organized as follows. We start, in Section 2, with an
overview of the network model and the definition of stabilization used in this paper. In
Section 3 we define the notion of local checkability (this is a straightforward formulation
of the ideas in [5, 6]). In Section 4 we give a definition of the requirements of internal
reset protocols. Then, in Section 5, we give our main result, which connects the known notion
of local checkability with the new notion of internal reset. Namely, we present a theorem
that says that any locally checkable protocol can be made self-stabilizing using any
self-stabilizing internal reset protocol. We describe the transformation
and outline a proof of correctness for the combined protocol (explicit code is omitted from
this extended abstract). Some applications of our main result are mentioned in Section 6.
In this section we describe our network model. We first review briefly the underlying
formal model of Input/Output Automata (see [17, 18] for full definitions), and establish the
notation we use throughout this paper. We also formalize the notion of self-stabilization in
this framework. In the second part of this section, we specify the network model we are
dealing with in this paper.
IO Automata, Stabilization, Time Complexity. An Input/Output Automaton (abbreviated
IOA henceforth) is a state machine whose state transitions are given labels called actions.
There are three kinds of actions. The environment affects the automaton through input
actions, which must be responded to in any state. The automaton affects the environment
through output actions; these actions are controlled by the automaton. Internal actions only
change the state of the automaton without affecting the environment. Formally, an IOA
$N$ is defined by a state set $S(N)$, an action set $A(N)$, a signature $G(N)$ (that classifies
the action set into input, output, and internal actions), a transition relation
$R(N) \subseteq S(N) \times A(N) \times S(N)$, and a non-empty set of initial states
$I(N) \subseteq S(N)$. We omit the automaton’s name when it is clear from the context. An action
$a$ is said to be enabled in state $s$ if there exists $s' \in S$ such that $(s, a, s') \in R$.
Input actions are always enabled. For an automaton $N$ and non-empty set $L \subseteq S(N)$,
we define $N|L$ to be the automaton that is obtained from $N$ by setting the initial states to be
$L$. In this paper we often deal with uninitialized IOA for which $S$ is finite. IOAs
communicate by means of shared actions. More formally, IOAs can be composed (under
certain compatibility conditions) to generate a composite state machine; an action which is
output of one of the components and input of the other is performed simultaneously.
When an IOA “runs” it produces an execution. Formally, an execution fragment is an
alternating sequence of states and actions $(s_0, a_1, s_1, \ldots)$ such that
$(s_i, a_{i+1}, s_{i+1}) \in R$ for all $i \geq 0$. An execution fragment is fair if any internal
or output action that is continuously enabled eventually occurs. An execution is an execution
fragment that begins with an initial state and is fair. A schedule is a subsequence of an
execution consisting only of the actions. A behavior is a subsequence of a schedule consisting
only of its input and output actions. Each IOA generates a set of behaviors. An IOA $A$
implements another IOA $B$ if the behaviors of $A$ are a subset of the behaviors of $B$. For
stabilization, we weaken this definition and require only that $A$ eventually exhibit a
behavior of $B$. Formally, we say that $A$ stabilizes to $B$ if every behavior of $A$ has a
suffix which is a behavior of $B$. Note that this definition (based on a definition by Nancy
Lynch) is formulated in terms of external behavior, as opposed to a (somewhat circular)
state-based definition.
For time complexity, we use the timed IOA model of  (see  for formal details).
Informally, we assume that every internal or output action that is continuously enabled
occurs in 1 unit of time. We say that $A$ stabilizes to $B$ in time $t$ if every behavior of
$A$ has a suffix that occurs within time $t$ and is a behavior of $B$. The stabilization time
from $A$ to $B$ is the smallest $t$ such that $A$ stabilizes to $B$ in time $t$. For any
automaton $N$, we also consider the automaton that is identical to $N$ except that the time
associated with each action is now $x$ time units instead of 1 time unit.
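The definitions of enabled actions, the restricted automaton $N|L$, and stabilization can be illustrated on finite data. The following is only a sketch under simplifying assumptions: behaviors are finite lists (the paper's behaviors may be infinite), and the names `IOA`, `restrict`, `stabilization_index` and the specification `alternates` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOA:
    """Finite I/O automaton; only the parts used in the text above."""
    states: frozenset
    transitions: frozenset   # triples (s, a, s') forming the relation R
    initial: frozenset       # the set I of initial states

    def enabled(self, s, a):
        # a is enabled in s if some (s, a, s') belongs to R
        return any(src == s and act == a for (src, act, _dst) in self.transitions)

    def restrict(self, legal):
        # N|L: the same automaton with the initial states replaced by L
        legal = frozenset(legal)
        if not legal or not legal <= self.states:
            raise ValueError("L must be a non-empty subset of the states")
        return IOA(self.states, self.transitions, legal)


def stabilization_index(behavior, is_behavior_of_b):
    """Smallest k such that the suffix behavior[k:] is accepted, else None."""
    for k in range(len(behavior)):
        if is_behavior_of_b(behavior[k:]):
            return k
    return None


def alternates(seq):
    # Hypothetical specification B: no action occurs twice in a row.
    return all(x != y for x, y in zip(seq, seq[1:]))
```

Here `stabilization_index` plays the role of "has a suffix which is a behavior of $B$"; the acceptance test for $B$ is passed in as a function because the paper leaves $B$ abstract.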
Network Model. For the remainder of this paper we fix an underlying network topology,
modeled by a directed symmetric graph $G = (V, E)$ with unique node identifiers. Each node
$v \in V$ represents a processor, and each directed edge represents a unidirectional
communication link. We denote the number of network nodes by $n = |V|$, and the network
diameter is denoted $d = \mathrm{diam}(G)$. Each node and link is modeled by an IOA. Below,
we describe verbally the links and node automata. Formal definitions are omitted from this
extended abstract.
Footnote: The IOA model specifies fairness in terms of equivalence classes; here we assume each action is in
a separate class.
Footnote: A graph $G = (V, E)$ is called symmetric if for all $(u, v) \in E$ we have that $(v, u) \in E$.
In our model, links have bounded storage, i.e., only a bounded number of outstanding
packets are stored on each link at any instant. The justification for this assumption is
twofold: first, not much can be done with unbounded links in a stabilizing setting , and
secondly, real links are inherently bounded anyway. In this paper we abstract this property
by postulating that a link can store at any given instant at most one outstanding packet.
Formally, the link from $u$ to $v$ stores at most one packet from some packet alphabet at any
instant. The external interface to the link (see Figure 1) includes an input action
$\mathrm{SEND}_{u,v}(p)$ (interpreted as “send packet $p$ from $u$ to $v$”), an output action
$\mathrm{RECEIVE}_{u,v}(p)$ (interpreted as “deliver packet $p$ at $v$”), and an output action
$\mathrm{FREE}_{u,v}$ (interpreted as “the $(u, v)$ link is currently free”). If a
$\mathrm{SEND}_{u,v}(p)$ occurs when the link is empty, the effect is that $p$ is stored in the
link; while $p$ is stored, $\mathrm{RECEIVE}_{u,v}(p)$ is enabled, and when it is taken, its
effect is to set the link to empty. While the link is empty, $\mathrm{FREE}_{u,v}$ is enabled. If
$\mathrm{SEND}_{u,v}(p)$ occurs when the link is not empty, there is no change of state
(intuitively, the incoming packet is just dropped). We note that by our timing assumptions, a
packet stored in a link will be delivered in one unit of time.
Footnote: Our convention for action subscripts is that the first represents the sender and the second represents
the receiver.
Fig. 1. Schematic representation of a single link, connecting queued node automaton $u$ to
$v$. The link from $v$ to $u$ is symmetric and is not shown.
A node automaton $u$ has, for each neighbor $v$, output actions $\mathrm{SEND}_{u,v}(p)$ to send
packets to $v$, input actions $\mathrm{RECEIVE}_{v,u}(p)$ to receive packets from $v$, and an
input action $\mathrm{FREE}_{u,v}$ to obtain indications of the $(u, v)$-link state. In this
paper, we use a special node automaton which we call a queued node automaton. A queued node
automaton $u$ has the following discipline for sending packets. It has, for each neighbor $v$,
a bounded output queue called queue[$v$] and a boolean flag free[$v$] (see Figure 1). Whenever
a $\mathrm{FREE}_{u,v}$ is received from the link at $u$, the free[$v$] flag is set. A
$\mathrm{SEND}_{u,v}(p)$ action is only performed if $p$ is the head of queue[$v$] and
free[$v$] is set; its effect is to clear free[$v$].
Queued node automata allow us to easily superimpose a local checking process. The
local checking process at each node needs access to the state of the node and also requires
the sending of control packets. (The requirement of access to state rules out the possibility
of formalizing the local checking process as a separate automaton.) Queued node automata
use a particular discipline for sending data packets on a link; this discipline makes it easy
to multiplex data and control packets on each link. Any node automaton requires some
discipline anyway to deal with bounded links. Thus the use of a particular discipline is not
overly restrictive and makes it easy to add local checking.
For the given graph $G = (V, E)$, we define the automaton for $G$ by the composition
of node automata for each $u \in V$ and link automata for each edge $(u, v) \in E$. Our object
of study in this paper, called hereafter network automaton for $G$, is an automaton for $G$ in
which all node automata are queued node automata.
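The link and queued-node discipline described above can be sketched in a few lines. This is an illustrative model only, not the paper's formal automata: the class names `Link` and `QueuedSender`, the queue capacity of 4, and the choice to dequeue the head packet when it is sent are assumptions made for the example.

```python
from collections import deque


class Link:
    """Unit-capacity link from u to v: holds at most one packet."""

    def __init__(self):
        self.packet = None            # None models the empty link

    def send(self, p):
        # Input action SEND(p): stored only if the link is empty,
        # otherwise the incoming packet is silently dropped.
        if self.packet is None:
            self.packet = p

    def free_enabled(self):
        # Output action FREE is enabled exactly when the link is empty.
        return self.packet is None

    def receive(self):
        # Output action RECEIVE(p): deliver the stored packet and empty the link.
        if self.packet is None:
            raise RuntimeError("RECEIVE is not enabled on an empty link")
        p, self.packet = self.packet, None
        return p


class QueuedSender:
    """Sending side of a queued node automaton for one neighbor."""

    def __init__(self, link, capacity=4):   # capacity is an assumed bound
        self.link = link
        self.queue = deque()
        self.capacity = capacity
        self.free = False                   # the free[v] flag

    def enqueue(self, p):
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(p)
        return True

    def on_free(self):
        # Input action FREE from the link sets the flag.
        self.free = True

    def try_send(self):
        # SEND only if the flag is set and there is a head packet;
        # sending clears the flag (dequeueing here is an assumption).
        if self.free and self.queue:
            self.free = False
            self.link.send(self.queue.popleft())
            return True
        return False
```

Because the flag is cleared on every send, a node that starts in a legal state never overwrites a packet in transit; from an arbitrary initial state the first send may still be dropped, which is exactly the case the link's drop rule covers.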
3 Local Checkability
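In this section we formalize the notion of local checkability. For the remainder of this
section, fix a network automaton $N$ for a given graph $G = (V, E)$. We start by defining
the notion of subsystems.
Definition 1. The $(u, v)$-link subsystem of $N$ is the composition of the
automata for nodes $u$, $v$, and edges $(u, v)$, $(v, u)$.
For a state $s \in S(N)$ and node $u \in V$, let $s|u$ denote the projection of $s$ onto the
node automaton of $u$; similarly, for $(u, v) \in E$, let $s|(u, v)$ denote $s$ projected onto the
automaton for link $(u, v)$. When $N$ is in state $s$, the $(u, v)$ subsystem state is characterized
by the 4-tuple $(s|u, s|(u, v), s|(v, u), s|v)$. We now define the notions of predicates and
local predicates.
Definition 2. A predicate of $N$ is a subset of the states of $N$. A local predicate
$\mathcal{L}_{u,v}$ of $N$, for $(u, v) \in E$, is a subset of the states of the
$(u, v)$ subsystem. A state $s$ is said to satisfy a local predicate $\mathcal{L}_{u,v}$ if
$(s|u, s|(u, v), s|(v, u), s|v) \in \mathcal{L}_{u,v}$.
We shall also use the following standard definition.
Definition 3. A local predicate $\mathcal{L}_{u,v}$ of $N$ is called stable if for all transitions
$(s, a, s')$ of $N$, we have that if $s$ satisfies $\mathcal{L}_{u,v}$ then so does $s'$.
We now arrive at our definition of local checkability.
Definition 4. Let $L = \{\mathcal{L}_{u,v}\}$ be a set of local predicates, and let
$\mathcal{L}$ be any predicate of $N$. A network automaton $N$ is locally checkable for
$\mathcal{L}$ using $L$ if the following conditions hold.
(1) For all states $s$, if $s$ satisfies $\mathcal{L}_{u,v}$ for all $(u, v) \in E$, then $s \in \mathcal{L}$.
(2) There exists $s \in S(N)$ such that $s$ satisfies $\mathcal{L}_{u,v}$ for all $(u, v) \in E$.
(3) Each $\mathcal{L}_{u,v} \in L$ is stable.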
Certainly the most intriguing condition in Definition 4 above is (3). This condition
is introduced so that periodic verification of local predicates can still be useful for fault
detection. More specifically, it is aimed to rule out the case of an “evasive violation:” there
are examples in which a (global) predicate can be expressed as a conjunction of local
predicates, and such that under a certain schedule, whenever a local predicate is checked
it turns out to be true; however, it may be the case that the global predicate never holds
in this execution! Intuitively, the problem stems from the fact that a fault could “travel”
through the network, escaping detection every time local predicates are verified. Imposing
the stability condition guarantees that if a local predicate is known to hold, it will continue
to hold through the rest of the execution.
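To make the roles of local predicates and stability concrete, here is a small sketch. It is not taken from the paper: the state layout (each node stores a `(root, dist)` pair), the particular edge predicate, and the function names are assumptions chosen only to illustrate checking a conjunction of per-edge predicates and testing stability over an explicit, finite transition list.

```python
def edge_ok(state, u, v):
    """Hypothetical local predicate for edge (u, v): the two endpoints agree
    on the root and their recorded distances differ by at most one."""
    root_u, dist_u = state[u]
    root_v, dist_v = state[v]
    return root_u == root_v and abs(dist_u - dist_v) <= 1


def violated_edges(state, edges, local_pred=edge_ok):
    # Edges whose local predicate fails; an empty list means every local
    # predicate holds, i.e. the conjunction of all local predicates holds.
    return [(u, v) for (u, v) in edges if not local_pred(state, u, v)]


def is_stable(pred, transitions):
    """A predicate is stable if no transition (s, a, s') leads from a state
    satisfying it to a state violating it."""
    return all(pred(t) for (s, _a, t) in transitions if pred(s))
```

A corrupted `dist` value at one node shows up as a violation on an incident edge, which is the kind of event that would trigger a reset request in the transformation described later.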
Superficially, the stability requirement in Def. 4 may seem too strong a condition. We
support our definition by the observation that many protocols, called locally extensible, can
be made to have this property by means of a simple transformation. Informally, a protocol
is said to be locally extensible if any correct state of any pair of overlapping subsystems
can be extended to a globally correct state. Details can be found in .
4 Internal Reset
In this section we give a specification for an internal reset protocol in terms of observable
behaviors. Intuitively, the goal of any reset protocol is to provide, upon request, a consistent
signal common to all nodes in the network. The time point corresponding to the signal can
be used to locally restart the protocol that we wish to reset. An internal reset protocol has
to satisfy additional conditions regarding its output before termination. The basic idea in
the specification below is an analogy to a pair of nodes connected by a data link protocol.
Essentially, an internal reset protocol generalizes the guarantees of a data link protocol 
to the whole network.
4.1 Interface of Reset Protocols
We now define the interface of reset protocols (this interface is common to both external
and internal resets). Let us start with some intuition. It is desired that a reset protocol be
superimposed on any other network protocol, say $P$, which we think of as a “user.” The
reset may be invoked at any node, and its effect is to output signals at all the nodes in a
consistent way. The notion of consistency is expressed in terms of the messages sent and
received by the nodes. We therefore assume that the reset protocol has control over the
messages sent and received by the user.
Motivated by this consideration, we define the external interface for a reset protocol at
node $u$ as shown in Figure 2. The request action allows a local user at a node to invoke
the reset protocol, and the signal action is used by the reset to provide a consistent time point
to the user. The reset protocol also regulates all message traffic of $P$ to and from the node.
To avoid confusion, we use the term messages for what is generated by, and destined for, the
application $P$. We assume that these messages are drawn from some message alphabet. In
addition, reset modules can communicate among themselves; everything that reset protocols send
and receive (including what they relay to and from $P$) is drawn from the links alphabet.
These are called packets. Intuitively, messages
output by users are relayed by the reset modules between network nodes using packets.
Formally, a $\mathrm{SENDM}_{u,v}(m)$ action allows the user at node $u$ to send a message to
neighbor $v$. A $\mathrm{RECEIVEM}_{v,u}(m)$ action allows the reset service to deliver a message
$m$ from node $v$ to the user at node $u$. The $\mathrm{FREEM}_{u,v}$ action indicates that the
reset service is ready to accept another message from $u$ to node $v$. Thus the external
interface between a reset service and $P$-users mimics the link interface (cf. Figure 1) with
packets replaced by messages. The reset service at node $u$, however, offers two additional
actions: an input action $\mathrm{REQUEST}_u$, used to enable the user to request a reset, and an
output action $\mathrm{SIGNAL}_u$, that informs the user that a reset has been completed at that node.
Fig. 2. Interface specification for reset service. Figure labels: reset module at node $u$; user protocol $P$ at node $u$.
4.2 Behavior Specification for Reset Protocols
(The main difference between this specification and external reset  is in the consistency
requirement below.) Our specification is parameterized by the response time of the protocol,
denoted $R$. This is convenient because different reset protocols have different response times.
Before describing the behaviors of reset, we impose a well-formedness condition on
the behaviors of the user $P$. Intuitively, we rule out cases where $P$ injects messages when
it is not sure that the link is free. Formally, a behavior is well-formed if between any two
$\mathrm{SENDM}_{u,v}$ events there is a $\mathrm{FREEM}_{u,v}$ event.
To specify the requirements of an internal reset protocol, we define properties we call
timeliness, causality and consistency. An internal reset protocol is required to be timely,
causal, and consistent. We define these properties below.
Intuitively, a behavior is timely if, in the absence of reset requests, the reset protocol
relays messages and “free” events to and from the node in constant time. Formally, we have
the following definition.
Definition 5 (Timeliness). A behavior is timely if for all $(u, v) \in E$ the following holds.
(1) At any point in the behavior, either $\mathrm{FREEM}_{u,v}$ occurs within constant time, or else SIGNAL [...]
(2) Every $\mathrm{SENDM}_{u,v}(m)$ event in the behavior is followed by $\mathrm{RECEIVEM}_{u,v}(m)$ in constant time,
or else SIGNAL [...]
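The well-formedness condition on user behaviors stated before Definition 5 is easy to check on a finite trace. The sketch below assumes a hypothetical event encoding, tuples such as `("SENDM", u, v, m)` and `("FREEM", u, v)`; neither the encoding nor the function name comes from the paper.

```python
def is_well_formed(behavior):
    """True if, for every ordered pair (u, v), some FREEM(u, v) event occurs
    between any two SENDM(u, v) events of the finite trace."""
    awaiting_free = set()                 # pairs with a SENDM not yet followed by FREEM
    for event in behavior:
        kind, u, v = event[0], event[1], event[2]
        if kind == "SENDM":
            if (u, v) in awaiting_free:
                return False              # second send without an intervening FREEM
            awaiting_free.add((u, v))
        elif kind == "FREEM":
            awaiting_free.discard((u, v))
    return True
```

Events of other kinds on the same pair, and events on other pairs, are ignored by the check, matching the per-link phrasing of the condition.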
Intuitively, a behavior is causal if reset signals are only caused by reset requests and
reset requests result in reset signals. Formally, we have the following definition.
Definition 6 (Causality). A behavior is causal if the following holds.
(1) For any signal event there is some request event that occurs within the preceding
$O(R)$ time units.
(2) For any $\mathrm{REQUEST}_u$ event, there is a $\mathrm{SIGNAL}_u$ event that occurs within the following
$O(R)$ time units.
(3) For any signal event, there are signal events that occur at all $v \in V$ within the
preceding and following $O(R)$ time units.
Note that condition (1) of the causality definition guarantees termination: if reset requests
stop, all signal events will stop in $O(R)$ time.
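The three causality conditions can be tested on a finite timed trace once the $O(R)$ bound is replaced by a concrete number. The sketch below is an assumption-laden illustration: events are hypothetical triples `(time, kind, node)`, `bound` stands in for the hidden constant times $R$, and condition (3) is read as "every node has a signal event within `bound` time units before or after".

```python
def causality_violations(events, nodes, bound):
    """Return the sorted list of condition numbers (1, 2, 3) violated by a
    finite timed trace of (time, kind, node) events."""
    signals = [(t, n) for (t, kind, n) in events if kind == "SIGNAL"]
    requests = [(t, n) for (t, kind, n) in events if kind == "REQUEST"]
    violated = set()
    for t, _node in signals:
        # (1) some request occurred within the preceding `bound` time units
        if not any(t - bound <= tr <= t for (tr, _n) in requests):
            violated.add(1)
        # (3) every node signals within `bound` time units of this signal
        for v in nodes:
            if not any(n == v and abs(ts - t) <= bound for (ts, n) in signals):
                violated.add(3)
    for t, u in requests:
        # (2) the requesting node signals within the following `bound` time units
        if not any(n == u and t <= ts <= t + bound for (ts, n) in signals):
            violated.add(2)
    return sorted(violated)
```

On a finite trace, a request close to the end may be reported under condition (2) simply because the trace was cut off, so such reports need to be read with the trace horizon in mind.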
The intuition for the property of consistency of behaviors is harder to capture. As
mentioned above, we would like to extend guarantees made by data link protocols to
networks. Fix a behavior. Our first step is to define correspondence between send and
receive events: for any $\mathrm{RECEIVEM}_{u,v}(m)$ event $i$, we define its corresponding send
event to be the first $\mathrm{SENDM}_{u,v}(m)$ event in the behavior such that there is no other
RECEIVEM [...] $i$. Next, we define the notion of signal intervals at a node $u$: these are
the subsequences of the behavior demarcated by $\mathrm{SIGNAL}_u$ events; if the number of
$\mathrm{SIGNAL}_u$ events is finite, then $u$’s final interval is the infinite interval that
begins with the last $\mathrm{SIGNAL}_u$ event.
For a given signal interval at a node $u$, we define, for each neighbor $v$ of $u$, the sequence
which consists of the messages $u$ sends to $v$ during the interval, and the sequence
which consists of the messages $u$ receives from $v$ during the interval.
With these concepts, we arrive at the central definition of the mating relation among
signal intervals.
Definition 7. Given a behavior, let $\mathcal{I}_u$ denote the set of signal intervals at $u \in V$, and let
$I$ denote the set of all signal intervals. A mating relation is a reflexive, symmetric
relation over $I$ that satisfies the following conditions for any [...]
(1) [...]
(2) For all [...] $u$, there exists at most one [...] is a prefix of [...]
Note that only final intervals enjoy a “full guarantee” in the sense that the sequence of
messages received is equal to the sequence of messages transmitted. Mating non-final
intervals are allowed to “lose” the tail of the sequence.
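Signal intervals and the prefix guarantee described in the note above can be illustrated on a finite trace. The following sketch is an interpretation, not the paper's definition: the event tuples, the function names, and in particular the reading that the messages received in one interval must form a prefix of the messages sent in the mated interval (with equality for final intervals) are assumptions based on the surrounding prose.

```python
def signal_intervals(behavior, u):
    """Split a finite behavior into the signal intervals at node u,
    demarcated by ("SIGNAL", u) events."""
    intervals = [[]]
    for event in behavior:
        if event == ("SIGNAL", u):
            intervals.append([])
        else:
            intervals[-1].append(event)
    return intervals


def sent(interval, u, v):
    # Messages u sends to v in the interval (first subscript is the sender).
    return [e[3] for e in interval if e[:3] == ("SENDM", u, v)]


def received(interval, u, v):
    # Messages u receives from v in the interval.
    return [e[3] for e in interval if e[:3] == ("RECEIVEM", v, u)]


def is_prefix(a, b):
    return a == b[:len(a)]


def may_mate(interval_u, interval_v, u, v, final=False):
    """Assumed pairwise condition: what each side received is a prefix of what
    the other side sent; final intervals require equality."""
    r_uv, s_vu = received(interval_u, u, v), sent(interval_v, v, u)
    r_vu, s_uv = received(interval_v, v, u), sent(interval_u, u, v)
    if final:
        return r_uv == s_vu and r_vu == s_uv
    return is_prefix(r_uv, s_vu) and is_prefix(r_vu, s_uv)
```

In the usage below, a message sent before the reset at `u` is never delivered, and the tail of the post-reset sequence is lost, so the two later intervals may mate as non-final intervals but not as final ones.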
We can now define the two flavors of consistency we consider.
Definition 8 (Consistency). A behavior is called weakly consistent if every RECEIVEM
event in it has a corresponding send event, and there exists a mating relation over the set
of signal intervals of the behavior. A behavior is called strongly consistent if the behavior is weakly
consistent and, in addition, the mating relation is transitive.
Note that even for weak consistency, there is a transitive mating relation between final
with intervals in the same class: this seems to be the essence of network synchronization.
To specify an internal reset, we might require that any well-formed behavior be timely,
consistent, and causal. Unfortunately, it appears that no implementation can stabilize to this
set of behaviors! Messages stored in the initial state can result in executions in which all
suffixes have some receive event that does not correspond to a send; this violates mating.
Thus we settle for the following definition.
Definition 9 (Internal Reset). A protocol $R$ is a weak (strong) internal reset if any
well-formed behavior of $R$ is a suffix of some timely, causal and weakly (resp., strongly)
consistent behavior.
We remark that one difficulty with this definition is the hardness of proving that an IOA
stabilizes to behaviors that are suffixes of behaviors of a second IOA.
We note that the stabilizing reset protocols of  and  appear to be strong internal
resets. The protocol of Katz and Perry  requires a stable set of paths to a fixed leader,
and it stabilizes in [...] time. The protocol of Arora and Gouda  requires a stable
directed spanning tree, and its stabilization time (given a tree) is $O(d)$. The reset protocol
of Awerbuch, Patt-Shamir and Varghese  is designed as an external reset, but in fact it
also satisfies the requirements of weak internal resets (see ). The stabilization time of
this protocol is $O(n)$, but it enjoys the advantage that it does not require precomputed structures. This
can be used, for example, to get a stabilizing spanning tree protocol with [...], at the
expense of increasing the stabilization time to [...] the space requirement to [...]. Lastly,
we remark that Awerbuch et al. describe in  a stabilizing spanning
tree protocol that stabilizes in $O(d)$ time; it appears that that protocol can be combined
with the reset protocol of  to yield an $O(d)$ strong internal reset protocol.
5 Self-Stabilization by Local Checking and Global Reset
In this section we state and sketch a proof of our main result. Basically, it says that any
protocol that is locally checkable for some global property can be transformed into an
equivalent protocol that stabilizes to a variant of the protocol in which the desired property
holds in its initial state. This transformation increases the time complexity of the original
protocol as follows. First, the stabilization time of the resulting protocol is $O(R)$, where
$R$ is the response time parameter of the internal reset protocol used (see Definition 6); and
secondly, the behaviors of the transformed automaton are slowed down by a constant factor
(due to the overhead of local checking).
We start this section with a statement of the main theorem.
Theorem 10. Let $N$ be any network automaton that is locally checkable for some predicate
$\mathcal{L}$. Then $N$ can be transformed into a network automaton that stabilizes to the
behaviors of $N|\mathcal{L}$, slowed down by a constant factor, in time $O(R)$, where
$R$ is the response time of any stabilizing internal reset protocol.
Footnote: In , relevant properties of suffixes are extracted and used to prove that the reset protocol of 
is a weak reset in the above sense. This approach works but is messy; we prefer here to concentrate
on what is needed for global correction.
Due to lack of space, we do not give a full proof here; below, we describe the transformation
and outline its analysis. Details can be found in .
The transformation. Let $R$ be an automaton for internal reset. Our first step is to compose
the node automata with $R$ (this requires us to first rename the packet sending actions of the
node automata in Figure 1 to be message sending actions as in Figure 2).
Next, recall that by the assumption of local checkability of $N$ for $\mathcal{L}$, there exists a set
$L = \{\mathcal{L}_{u,v}\}$ of local predicates for $\mathcal{L}$. We now add a periodic checking process to each
$(u, v)$ subsystem to check whether the local predicate $\mathcal{L}_{u,v}$ holds. This is implemented by having
the lower ID node (say $u$) periodically initiate a snapshot of the $(u, v)$ subsystem. This
is a slight modification of the general Chandy-Lamport snapshot protocol . More
specifically, this is done as follows (see Figure 3). Node $u$ sends a “SnapRequest” message to $v$,
which is responded to by $v$ with a “SnapResponse” message. When the response arrives,
the snapshot is composed at $u$, and is then checked to see if $\mathcal{L}_{u,v}$ holds; if not, $u$ makes a
reset request to the internal reset protocol $R$.
Fig. 3. Correct snapshots after nodes $u$ and $v$ have each performed a signal event. Time increases [...]
If there are user messages to be sent, one user message is sent between every two
invocations of the snapshot; this slows down the communication of the user by a constant
factor. The snapshot is made stabilizing by numbering SnapRequest and SnapResponse
packets with a 4-valued counter, and by retransmitting SnapRequests until a SnapResponse
with matching counter is received. The counter is incremented mod 4 on every invocation.
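The counter discipline described above can be sketched as follows. This is a minimal illustration rather than the paper's code: the class and function names, the packet tuples, and the decision to increment the counter at the start of each invocation are assumptions; only the 4-valued counter, the retransmission of SnapRequests, and the matching of responses by counter come from the text.

```python
class SnapshotInitiator:
    """Initiator side (the lower ID node) of the periodic pairwise snapshot."""

    def __init__(self, counter=0):
        self.counter = counter % 4     # may start with an arbitrary value
        self.waiting = False

    def start(self):
        # New invocation: advance the counter mod 4 and emit a request.
        self.counter = (self.counter + 1) % 4
        self.waiting = True
        return ("SnapRequest", self.counter)

    def retransmit(self):
        # Repeat the outstanding request until a matching response arrives.
        return ("SnapRequest", self.counter) if self.waiting else None

    def on_response(self, packet):
        kind, counter, neighbor_state = packet
        if kind != "SnapResponse" or not self.waiting or counter != self.counter:
            return None                # stale or unexpected packet: ignore it
        self.waiting = False
        return neighbor_state          # used to compose and check the snapshot


def respond(packet, local_state):
    """Responder side: echo the request's counter together with the local state."""
    _kind, counter = packet
    return ("SnapResponse", counter, local_state)
```

A stale response left over from a corrupted initial state carries a non-matching counter and is ignored, while the retransmitted request eventually elicits a fresh response.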
When a $\mathrm{SIGNAL}_u$ action is taken by the internal reset protocol, node $u$ initializes the local
state of $N$ to some prespecified initial state $s_u$ and in addition, initializes its snapshot
variables and counters.
Finally, we rename messages to packets again since the transformation so far produces
node automata that send messages.
Footnote: We choose the initial states $s_u$ such that if $s|u = s_u$ for all nodes $u$ and all links are empty in state
$s$, then $s \in \mathcal{L}$. The definition of local checkability implies the existence of such a state.
Sketch of Analysis. First, the specific local checking process outlined above can be shown
to be stabilizing: in [28, 29] it is proven that after the fifth invocation, all snapshots produce [...]
and we assume that we are given a correct internal reset protocol. The main difficulty in
proving that the transformation works is to show termination. We need to prove that in
$O(R)$ time, the local checking process will stop making reset requests. We start by arguing
that eventually, the SnapRequest and SnapResponse are matched correctly. Define an ISI
(for Initialized Signal Interval) at node $u$ to be a signal interval that starts with a $\mathrm{SIGNAL}_u$
event. In any behavior, all but possibly the first signal interval at a node are ISIs. Suppose
signal intervals keep recurring at node $u$. Then eventually any SnapResponse packet that is
received on the $(v, u)$ link (see Figure 3) is sent in an ISI at $v$ (say $I_v$), and is received in an ISI
at $u$ (say $I_u$). If this SnapResponse is accepted as a matching response, the mating property
can be used to deduce that the corresponding SnapRequest was sent in $I_u$ and received in
$I_v$: this follows from the fact that by the code, all snapshot variables and counters are
initialized at the start of an ISI.
Thus any snapshot that completes after this point will have been completely executed
within an ISI at $u$ and an ISI at $v$. But all communication between two ISIs is exactly what
might have occurred in some asynchronous execution of the $(u, v)$ subsystem in which $u$
is initialized at the start of $I_u$, $v$ is initialized at the start of $I_v$ and the two links are empty.
In such an asynchronous execution, $\mathcal{L}_{u,v}$ holds at the start; also, because $\mathcal{L}_{u,v}$ is stable, it
will continue to hold regardless of any messages received from other subsystems that may
still be “incorrect,” and the snapshot will not detect a violation. Thus if signal intervals
keep recurring, the local checking process will eventually
stop making reset requests. Hence, the causality property of reset implies that there will be
a final signal interval at all nodes, which corresponds to a behavior of $N$. This argument
completes the proof of Theorem 10.
We remark that the proof hinges on two crucial points. First, it is important that
each local predicate is stable: since a weak internal reset does not guarantee a transitive
mating relation, it is possible for a subsystem to receive “inconsistent” messages from other
adjacent subsystems during non-final intervals. The stability guarantees that such events
will not trigger further inconsistencies. Note, however, that stability is required anyway in
order to do local checking via snapshots: it is not an extra condition required for the global
correction.
The second crucial point is the mating relation for signal intervals; without it, a snapshot
initiated by $u$ could span two signal intervals at $v$, if there is another $\mathrm{SIGNAL}_v$ event between the
receipt of the SnapRequest and the sending of the SnapResponse. This in turn could lead to
persistent incorrect snapshots and the possibility of non-termination. Pairwise consistency [...]
for more general notions of locality (e.g., 3 node subsystems). A careful argument is more
involved: we defer it to the final paper.
The main result of this paper is Theorem 10, which shows how locally checkable
protocols can be made stabilizing automatically, using local checking and an internal reset. We
have used this result to rigorously prove the correctness of a spanning tree protocol .
Our protocol computes the tree as a shortest paths tree rooted at the minimum ID node. This
is the same idea used in the widely deployed IEEE 802.1 spanning tree protocol . The
main problem in the basic approach is that fictitious IDs may spoil the computation. The
802.1 protocol overcomes this problem by using timers. To get rid of fictitious IDs even
in the worst case, the timeout periods must always be large. By contrast, our protocol uses reset,
and its stabilization time is proportional to actual network delays (which are in most cases
significantly smaller than the worst possible). In  we show that if the local predicates have
a certain structure, then local checking can be done by having each node periodically send
its state to its neighbors (without the need to implement local snapshots). The spanning tree
algorithm has this structure and so the resultant protocol is quite simple.
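As an illustration of checking by periodic state exchange, the following minimal Python sketch shows one plausible set of local predicates for a shortest paths tree rooted at the minimum ID node. It is a sketch under stated assumptions, not the predicates used in the paper: the names NodeState, node_ok, link_ok, and needs_reset are hypothetical, and each node is assumed to hold the most recent state received from each of its neighbors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NodeState:
    node_id: int
    root: int               # ID this node believes is the tree root
    dist: int               # believed hop distance to that root
    parent: Optional[int]   # neighbor on the path to the root (None at the root)


def node_ok(s: NodeState) -> bool:
    """Checks that use only the node's own state (assumed predicate)."""
    if s.root > s.node_id:                 # the root must have the minimum ID
        return False
    if s.root == s.node_id:                # this node claims to be the root
        return s.dist == 0 and s.parent is None
    return s.dist > 0 and s.parent is not None


def link_ok(me: NodeState, nb: NodeState) -> bool:
    """Hypothetical local predicate for one link, checked at `me`
    against the state that neighbor `nb` last sent."""
    if me.root != nb.root:                 # neighbors must agree on the root
        return False
    if abs(me.dist - nb.dist) > 1:         # distances differ by at most one hop
        return False
    if me.parent == nb.node_id and me.dist != nb.dist + 1:
        return False                       # the parent must be one hop closer
    return True


def needs_reset(me: NodeState, neighbors: Sequence[NodeState]) -> bool:
    """True if `me` should issue a reset request."""
    if not node_ok(me):
        return True
    if me.parent is not None and all(nb.node_id != me.parent for nb in neighbors):
        return True                        # the parent is not an actual neighbor
    return any(not link_ok(me, nb) for nb in neighbors)
```

For example, on a path 1 - 2 - 3 in which every node believes in a fictitious root 0, node 1 (with distance 1 and parent 2) finds that its parent is not one hop closer to the root, so needs_reset returns True at node 1 and a reset request would be issued there.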
Another application is topology update. Many existing networks [19, 20] use sequence
numbers to broadcast topology information to all nodes. If the counter being used ever gets
to the maximum value, a large timeout is used for recovery. We propose that instead of these
large timeouts, global reset can be used. Similarly, the AN-1 network  uses a simple
“large counter” to reset the network after topology changes . This simple reset is more
efficient than the stabilizing reset protocols but is vulnerable to counter errors. The AN-1
designers have suggested  that stabilizing reset could be used to reset the simple reset
protocol when the local predicates of the simple reset protocol are violated.
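The proposal for topology update can be pictured with the following minimal Python sketch. It is an assumption-laden illustration rather than the algorithm of any of the cited networks: TopologyTable, receive, on_reset, max_seq, and request_reset are hypothetical names, and the reset request is modeled as a plain callback into an assumed reset protocol.

```python
from typing import Callable, Dict, Tuple


class TopologyTable:
    """Per-node table of the newest topology update seen from each source."""

    def __init__(self, max_seq: int, request_reset: Callable[[], None]) -> None:
        self.max_seq = max_seq                  # largest counter value (assumed)
        self.request_reset = request_reset      # hook into the reset protocol (assumed)
        self.latest: Dict[int, Tuple[int, tuple]] = {}

    def receive(self, source: int, seq: int, links: tuple) -> bool:
        """Return True if the update is newer than the stored one and is accepted."""
        if seq >= self.max_seq:
            # The counter is saturated: instead of waiting for a large
            # timeout, ask for a global reset.
            self.request_reset()
            return False
        stored = self.latest.get(source)
        if stored is None or seq > stored[0]:
            self.latest[source] = (seq, links)
            return True
        return False                            # old or duplicate update

    def on_reset(self) -> None:
        """Called when the reset protocol reinitializes this node."""
        self.latest.clear()
```

In this sketch a saturated counter is treated as a violated local predicate, so recovery time depends on the reset protocol rather than on a worst-case timeout.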
We believe that all of the above provides strong indications that the idea of self-
stabilization by local checking and global reset is a viable practical technique, as well as a
convenient theoretical tool.
86-0078, ARPA/Army contract DABT63-93-C-0038, ARO contract DAAL03-86-K-0171,
IBM. The second author is also supported by NSF contract 9225124-CCR, AFOSR-ONR
author was done while at MIT. The fourth author was partially supported by NSF Presidential
Young Investigator Award CCR-91-58478 and funds from Texas A&M University
College of Engineering.
We would like to thank Nancy Lynch, Mark Tuttle and Shay Kutten for their crucial
comments and suggestions.
1. Yehuda Afek, Baruch Awerbuch, and Eli Gafni. Applying static network protocols to dynamic
networks. In Proc. 28th IEEE Symp. on Foundations of Computer Science, October 1987.
2. Anish Arora and Mohamed G. Gouda. Distributed reset. In Proc. 10th Conf. on Foundations of
Software Technology and Theoretical Computer Science, pages 316–331. Springer-Verlag (LNCS
3. Baruch Awerbuch and Shimon Even. Reliable broadcast protocols in unreliable networks.
Networks, 16(4):381–396, Winter 1986.
4. Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-Shamir, and George Varghese. Time
optimal self-stabilizing synchronization. In Proc. 25th ACM Symp. on Theory of Computing,
5. Yehuda Afek, Shay Kutten, and Moti Yung. Memory-efficient self-stabilization on general
networks. In Proc. 4th Workshop on Distributed Algorithms, pages 15–28, Italy, September
1990. Springer-Verlag (LNCS 486).
6. Baruch Awerbuch, Boaz Patt-Shamir, and George Varghese. Self-stabilization by local checking
and correction. In Proc. 32nd IEEE Symp. on Foundations of Computer Science, October 1991.
7. Baruch Awerbuch and George Varghese. Distributed program checking: a paradigm for building
self-stabilizing distributed protocols. In Proc. 32nd IEEE Symp. on Foundations of Computer
Science, October 1991.
8. Baruch Awerbuch and Rafail Ostrovsky. Memory-efficient and self-stabilizing network RESET.
In Proc. 13th ACM Symp. on Principles of Distributed Computing, August 1994.
9. J.E. Burns and J. Pachl. Uniform self-stabilizing rings. ACM Transactions on Programming
Languages and Systems, 11(2):330–344, 1989.
10. K. Mani Chandy and Leslie Lamport. Distributed snapshots: Determining global states of
distributed systems. ACM Trans. on Comput. Syst., 3(1):63–75, February 1985.
11. Edsger W. Dijkstra. Self stabilization in spite of distributed control. Comm. of the ACM,
12. Shlomi Dolev, Amos Israeli, and Shlomo Moran. Self-stabilization of dynamic systems assuming
only read/write atomicity. In Proc. 10th ACM Symp. on Principles of Distributed Computing,
13. Shlomi Dolev, Amos Israeli, and Shlomo Moran. Resource bounds for self-stabilizing message
driven protocols. In Proc. 11th ACM Symp. on Principles of Distributed Computing, Aug. 1991.
14. Steven G. Finn. Resynch procedures and a fail-safe network protocol. IEEE Trans. on Commun.,
COM-27(6):840–845, June 1979.
15. L. Kleinrock and F. Kamoun. Hierarchical routing for large networks; performance evaluation
and optimization. Computer Networks, 1:155–174, 1977.
16. Shmuel Katz and Kenneth Perry. Self-stabilizing extensions for message-passing systems. In
Proc. 10th ACM Symp. on Principles of Distributed Computing, August 1990.
17. Nancy A. Lynch and Mark R. Tuttle. An introduction to input/output automata. CWI Quarterly,
18. M. Merritt, F. Modugno, and M.R. Tuttle. Time constrained automata. In CONCUR 91, pages
19. John McQuillan, Ira Richer, and Eric Rosen. The new routing algorithm for the arpanet. IEEE
Trans. on Commun., 28(5):711–719, May 1980.
20. Radia Perlman. Fault tolerant broadcast of routing information. Computer Networks, Dec. 1983.
21. Radia Perlman. An algorithm for distributed computation of a spanning tree in an extended LAN.
In Proceedings of the 9th Data Communication Symposium, pages 44–53, September 1985.
22. Radia Perlman, George Varghese, and Anthony Lauck. Reliable broadcast of information in a
wide area network. US Patent 5,085,428, February 1992.
23. Thomas Rodeheffer and Michael Schroeder. Automatic reconfiguration in the Autonet.
Proceedings of the 14th Symposium on Operating Systems Principles, November 1993.
24. Thomas Rodeheffer and Michael Schroeder. Personal communication.
25. M. Schroeder, A. Birrell, M. Burrows, H. Murray, R. Needham, T. Rodeheffer, E. Satterthwaite,
and C. Thacker. Autonet: a high-speed, self-configuring local area network using point-to-point
links. Technical Report 59, Digital Systems Research Center, April 1990.
26. John M. Spinelli. Reliable communication. Ph.D. thesis, MIT, Lab. for Information and Decision
Systems, December 1988.
27. A. Tanenbaum. Computer Networks. Prentice Hall, 2nd. edition, 1989.
28. George Varghese. Self-stabilization by local checking and correction.
MIT/LCS/TR-583, Massachusetts Institute of Technology, 1992.
29. George Varghese. Self-stabilization by counter flushing. In Proc. 13th ACM Symp. on Principles
of Distributed Computing, August 1994.