To appear in Proc. 8th WDAG, October 1994
Self-Stabilization by Local Checking and Global Reset
Baruch Awerbuch, Boaz Patt-Shamir, George Varghese, and Shlomi Dolev
Dept. of Computer Science, Johns Hopkins University
Lab. for Computer Science, MIT
Dept. of Computer Science, Washington University
Dept. of Computer Science, Texas A&M University
School of Computer Science, Carleton University
Abstract. We describe a method for transforming asynchronous network protocols
into protocols that can sustain any transient fault, i.e., become self-stabilizing. We
combine the known notion of local checking with a new notion of internal reset, and
prove that given any self-stabilizing internal reset protocol, any locally-checkable
protocol can be made self-stabilizing. Our proof is constructive in the sense that
we provide explicit code. The method applies to many practical network problems,
including spanning tree construction, topology update, and virtual circuit setup.
A network protocol is called self-stabilizing (or stabilizing for short) if when started from
an arbitrary state, it eventually exhibits the desired behavior. In the context of computer
networks, a self-stabilizing system may have an initial state with arbitrary messages at the
links and arbitrary corruption of the state variables at the nodes. The practical appeal of
stabilizing protocols is that they are simpler (i.e., they avoid a slew of mechanisms to deal
with a catalog of anticipated faults), and they are more robust (e.g., they can recover from
transient faults such as memory corruption as well as common faults such as link and node
Since the pioneering work of Dijkstra [11], the theory of self-stabilization has been
extensively studied (e.g., [9, 16, 12, 2, 5]). While most of the work was directed at self-
stabilization of specific tasks, some work was devoted to designing general algorithmic
transformers that take a protocol as input, and produce as their output a self-stabilizing
version of that protocol. These transformers typically exhibit trade-offs between their
generality (i.e., the range of input protocols they can transform) and the efficiency of
the resulting protocols. One such general transformation is given by Katz and Perry [16],
where they show how to compile an arbitrary asynchronous protocol into a stabilizing
equivalent. Briefly, the idea in [16] is that a leader node periodically takes “snapshots” of
the global network state, and resets the system if some inconsistency is detected. We call
this method global checking and correction. Due to its generality, this transformation is
expensive in terms of space and communication; another drawback of this approach is that
it requires an additional self-stabilizing mechanism that maintains routes that connect all
nodes to some leader.
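The global checking and correction scheme described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the construction of Katz and Perry: the snapshot is modeled as a dictionary that has already been collected at the leader, and the names `global_check_and_correct`, `is_legal`, `initial_state`, and `all_agree` are hypothetical.

```python
from typing import Callable, Dict

# Assumption: a snapshot maps each node ID to that node's local state.
Snapshot = Dict[int, int]


def global_check_and_correct(
    snapshot: Snapshot,
    is_legal: Callable[[Snapshot], bool],
    initial_state: Callable[[int], int],
) -> Snapshot:
    """One leader step: keep a legal global state, otherwise reset all nodes."""
    if is_legal(snapshot):
        return dict(snapshot)
    # Inconsistency detected: every node is returned to its initial state.
    return {node: initial_state(node) for node in snapshot}


def all_agree(snapshot: Snapshot) -> bool:
    """Toy global predicate (hypothetical): all nodes hold the same value."""
    return len(set(snapshot.values())) <= 1
```

Note that the predicate is evaluated over the entire network state at a single node, which is where the space and communication cost of this approach comes from.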
Afek, Kutten and Yung [5] suggested that global inconsistency could sometimes be
detected by checking the states of neighbors — i.e., by local means. Using the idea of
to maintain diffusing computations. Their reset protocol requires an underlying stabilizing
The idea of local detection of faults is formalized in [6, 7, 28] under the name of
local checking. In [6, 28], the class of locally correctable protocols is also defined; these are
a transformer that uses local checking and local correction is described. The transformer of
 is efficient, but it can be applied only to protocols that are both locally checkable and
locally correctable. Unfortunately, many interesting network protocols can be shown to be
locally checkable, but not locally correctable.
In this paper, motivated on the one hand by the inefficiency of the transformer of ,
and by the narrowness of the transformer of  on the other hand, we introduce a new
algorithmic transformer that can be used to make a wide class of protocols self-stabilizing.
The idea is to combine local checking and global correction: bad states are detected by
a local checking mechanism, and a global correction action (called “reset”) is used to recover
from faults. We contend that local checking and global reset is the right balance in many
practical situations. First, we argue that global detection mechanisms such as the
self-stabilizing snapshot [16] incur unnecessarily large overhead (in terms of time, space and
communication) practically always, since networks are fairly failure-free. Local checking
detects faults quickly, and it can be done, as we show in this paper, with only a small
increase in communication cost. Secondly, as mentioned above, there are many protocols
that are locally checkable but not locally correctable (e.g., spanning tree construction and
topology update [27, 20, 21, 22]). In these cases we are forced to use other techniques —
Even though resetting an entire network may seem drastic and inefficient, there is
evidence that this is not the case. For instance, consider routing protocols. The stabilization
which is the time it takes for many protocols to compute their results anyhow, even after
being started in a good state. Empirical results also support the claim that resets perform
quite well in practice. Specifically, DEC SRC’s AN-1 network  employs a variant of
global reset for dealing with topology changes (by making a reset request whenever a
link fails or comes up).
(The AN-1 reset is performed using a version of Finn’s unbounded counter protocol [14].)
The AN-1 designers found that the protocol recovered from link
failures very fast . The reason is that usually the routing protocol only operates for a
small fraction of the time at a node; the remaining processing is devoted to forwarding
data. During a reset, however, no data forwarding is done; all processing and bandwidth are
devoted to the reset. The moral from the AN-1 experience is that reset schemes work well
for small-sized networks; for larger networks, the same approach should work if the routing
protocol is hierarchical [15] and each level is reset independently.
The main result of this paper is a precise description and statement of the method of
local checking and global reset. We provide formalization, analysis, and code. We believe
that in doing so we contribute something that will help both theoreticians and practitioners.
We remark that the ideas of local checking and global reset are not new; for instance, the
stabilizing spanning tree protocol of  uses local detection, and Arora and Gouda [2] use
802.1 protocol overcomes this problem by using timers. To get rid of fictitious IDs even
in worst cases, the timeout periods are always large. By contrast, our protocol uses reset,
and its stabilization time is proportional to actual network delays (which are in most cases
significantly smaller than the worst possible). In  we show that if the local predicates have
a certain structure, then local checking can be done by having each node periodically send
its state to its neighbors (without the need to implement local snapshots). The spanning tree
algorithm has this structure and so the resultant protocol is quite simple.
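As an illustration of local checking by periodic state exchange, the sketch below checks a spanning tree state using only per-node and per-link predicates. The predicates and all names (`TreeState`, `node_ok`, `link_ok`, `local_violations`) are assumptions chosen for this example and are not taken from the paper; the sketch assumes each node stores a claimed root ID, a parent, and a distance to the root, and that the root should be the smallest node ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TreeState:
    root: int               # ID this node believes is the root
    parent: Optional[int]   # None only at the root itself
    dist: int               # claimed hop distance to the root


def node_ok(u: int, s: TreeState, nbrs: List[int]) -> bool:
    """Predicate a node can check on its own state."""
    if s.parent is None:
        return s.root == u and s.dist == 0
    return s.parent in nbrs and s.root < u and s.dist > 0


def link_ok(u: int, su: TreeState, v: int, sv: TreeState) -> bool:
    """Predicate checkable from the two states exchanged over link (u, v)."""
    if su.root != sv.root:
        return False
    if su.parent == v and su.dist != sv.dist + 1:
        return False
    if sv.parent == u and sv.dist != su.dist + 1:
        return False
    return True


def local_violations(
    graph: Dict[int, List[int]], states: Dict[int, TreeState]
) -> List[Tuple[int, int]]:
    """Return violated nodes as (u, u) and violated links as (u, v), u < v."""
    bad = []
    for u, nbrs in graph.items():
        if not node_ok(u, states[u], nbrs):
            bad.append((u, u))
        for v in nbrs:
            if u < v and not link_ok(u, states[u], v, states[v]):
                bad.append((u, v))
    return bad
```

Under these assumed predicates, a fictitious root ID cannot survive: distances must strictly decrease along parent pointers, so some link or node check fails. In a full protocol, any node that observes a violation would issue a reset request instead of waiting for a timeout.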
Another application is topology update. Many existing networks [19, 20] use sequence
numbers to broadcast topology information to all nodes. If the counter being used ever gets
to the maximum value, a large timeout is used for recovery. We propose that instead of these
large timeouts, global reset can be used. Similarly, the AN-1 network  uses a simple
“large counter” to reset the network after topology changes . This simple reset is more
efficient than the stabilizing reset protocols but is vulnerable to counter errors. The AN-1
designers have suggested  that stabilizing reset could be used to reset the simple reset
protocol when the local predicates of the simple reset protocol are violated.
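The proposal to replace large timeouts with a reset when a sequence counter is exhausted can be sketched as follows. This is a hypothetical model, not code from any of the cited networks: `MAX_SEQ`, `TopologySource`, `next_seq`, and `on_reset` are illustrative names, and the reset protocol itself is abstracted into a single callback.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a deliberately tiny bound so the example is easy to follow.
MAX_SEQ = 7


@dataclass
class TopologySource:
    seq: int = 0
    reset_requested: bool = False

    def next_seq(self) -> Optional[int]:
        """Return the sequence number for the next update, or None if exhausted."""
        if self.seq >= MAX_SEQ:
            # Counter is at its maximum: ask for a global reset
            # instead of waiting out a long recovery timeout.
            self.reset_requested = True
            return None
        self.seq += 1
        return self.seq

    def on_reset(self) -> None:
        """Called when the (abstracted) global reset completes at this node."""
        self.seq = 0
        self.reset_requested = False
```

With this structure, recovery time depends on how long the reset takes to complete rather than on a worst-case timeout value.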
We believe that all of the above provides strong indications that the idea of self-
stabilization by local checking and global reset is a viable practical technique, as well as a
convenient theoretical tool.
86-0078, ARPA/Army contract DABT63-93-C-0038, ARO contract DAAL03-86-K-0171,
IBM. The second author is also supported by NSF contract 9225124-CCR, AFOSR-ONR
author was done while at MIT. The fourth author was partially supported by NSF Presidential
Young Investigator Award CCR-91-58478 and funds from Texas A&M University
College of Engineering.
We would like to thank Nancy Lynch, Mark Tuttle and Shay Kutten for their crucial
comments and suggestions.
1. Yehuda Afek, Baruch Awerbuch, and Eli Gafni. Applying static network protocols to dynamic
networks. In Proc. 28th IEEE Symp. on Foundations of Computer Science, October 1987.
2. Anish Arora and Mohamed G. Gouda. Distributed reset. In Proc. 10th Conf. on Foundations of
Software Technology and Theoretical Computer Science, pages 316–331. Springer-Verlag (LNCS
3. Baruch Awerbuch and Shimon Even. Reliable broadcast protocols in unreliable networks.
Networks, 16(4):381–396, Winter 1986.
4. Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-Shamir, and George Varghese. Time
optimal self-stabilizing synchronization. In Proc. 25th ACM Symp. on Theory of Computing,
5. Yehuda Afek, Shay Kutten, and Moti Yung. Memory-efficient self-stabilization on general
networks. In Proc. 4th Workshop on Distributed Algorithms, pages 15–28, Italy, September
1990. Springer-Verlag (LNCS 486).
6. Baruch Awerbuch, Boaz Patt-Shamir, and George Varghese. Self-stabilization by local checking
and correction. In Proc. 32nd IEEE Symp. on Foundations of Computer Science, October 1991.
7. Baruch Awerbuch and George Varghese. Distributed program checking: a paradigm for building
self-stabilizing distributed protocols. In Proc. 32nd IEEE Symp. on Foundations of Computer
Science, October 1991.
8. Baruch Awerbuch and Rafail Ostrovsky. Memory-efficient and self-stabilizing network RESET.
In Proc. 13th ACM Symp. on Principles of Distributed Computing, August 1994.
9. J.E. Burns and J. Pachl. Uniform self-stabilizing rings. ACM Transactions on Programming
Languages and Systems, 11(2):330–344, 1989.
10. K. Mani Chandy and Leslie Lamport. Distributed snapshots: Determining global states of
distributed systems. ACM Trans. on Comput. Syst., 3(1):63–75, February 1985.
11. Edsger W. Dijkstra. Self stabilization in spite of distributed control. Comm. of the ACM,
12. Shlomi Dolev, Amos Israeli, and Shlomo Moran. Self-stabilization of dynamic systems assuming
only read/write atomicity. In Proc. 10th ACM Symp. on Principles of Distributed Computing,
13. Shlomi Dolev, Amos Israeli, and Shlomo Moran. Resource bounds for self-stabilizing message
driven protocols. In Proc. 11th ACM Symp. on Principles of Distributed Computing, Aug. 1991.
14. Steven G. Finn. Resynch procedures and a fail-safe network protocol. IEEE Trans. on Commun.,
COM-27(6):840–845, June 1979.
15. L. Kleinrock and F. Kamoun. Hierarchical routing for large networks; performance evaluation
and optimization. Computer Networks, 1:155–174, 1977.
16. Shmuel Katz and Kenneth Perry. Self-stabilizing extensions for message-passing systems. In
Proc. 10th ACM Symp. on Principles of Distributed Computing, August 1990.
17. Nancy A. Lynch and Mark R. Tuttle. An introduction to input/output automata. CWI Quarterly,
18. M. Merritt, F. Modugno, and M.R. Tuttle. Time constrained automata. In CONCUR 91, pages
19. John McQuillan, Ira Richer, and Eric Rosen. The new routing algorithm for the arpanet. IEEE
Trans. on Commun., 28(5):711–719, May 1980.
20. Radia Perlman. Fault tolerant broadcast of routing information. Computer Networks, Dec. 1983.
21. Radia Perlman. An algorithm for distributed computation of a spanning tree in an extended LAN.
In Proceedings of the 9th Data Communication Symposium, pages 44–53, September 1985.
22. Radia Perlman, George Varghese, and Anthony Lauck. Reliable broadcast of information in a
wide area network. US Patent 5,085,428, February 1992.
23. Thomas Rodeheffer and Michael Schroeder. Automatic reconfiguration in the Autonet.
Proceedings of the 14th Symposium on Operating Systems Principles, November 1993.
24. Thomas Rodeheffer and Michael Schroeder. Personal communication.
25. M. Schroeder, A. Birrell, M. Burrows, H. Murray, R. Needham, T. Rodeheffer, E. Satterthwaite,
and C. Thacker. Autonet: a high-speed, self-configuring local area network using point-to-point
links. Technical Report 59, Digital System Research Center, April 1990.
26. John M. Spinelli. Reliable communication. Ph.D. thesis, MIT, Lab. for Information and Decision
Systems, December 1988.
27. A. Tanenbaum. Computer Networks. Prentice Hall, 2nd. edition, 1989.
28. George Varghese. Self-stabilization by local checking and correction.
MIT/LCS/TR-583, Massachusetts Institute of Technology, 1992.
29. George Varghese. Self-stabilization by counter flushing. In Proc. 13th ACM Symp. on Principles
of Distributed Computing, August 1994.