A Low Latency, Loss Tolerant Architecture and Protocol for Wide Area Group
Communication
Yair Amir, Claudiu Danilov, Jonathan Stanton
Department of Computer Science
Johns Hopkins University
3400 North Charles St.
Baltimore, MD 21218 USA
yairamir, claudiu, jonathan
Group communication systems are proven tools upon
which to build fault-tolerant systems. As the demands for
fault-tolerance increase and more applications require reliable
distributed computing over wide area networks, wide
area group communication systems are becoming very use-
ful. However, building a wide area group communication
system is a challenge. This paper presents the design of
the transport protocols of the Spread wide area group com-
munication system. We focus on two aspects of the system.
First, the value of using overlay networks for application
level group communication services. Second, the requirements
for constructing wide area group communication. We support
our claims with the results of live experiments conducted
over the Internet.
Keywords—Group Communication, Overlay Networks, Re-
liable Multicast, Wide Area Networks, TCP/IP.
There exist some fundamental difficulties with high-
performance group communication over wide-area net-
works. These difficulties include:
The characteristics (loss rates, amount of buffering)
and performance (latency, bandwidth) vary widely in
different parts of the network.
The packet loss rates and latencies are significantly
higher and more variable than on local area networks.
This work was supported in part by grants from the National Security
Agency (NSA) and the Defense Advanced Research Projects Agency
This is the Tech Report version. Cite as TR 00/99 -xx
It is not as easy to implement efficient reliability and
ordering on top of the available wide area multicast
mechanisms as it is on top of local area hardware
broadcast and multicast. Moreover, the available best
effort wide area multicast mechanisms come with significant
limitations.
More and more applications are starting to use group
communication systems to enable fault-tolerance and high-
availability or to create large scale complex interactive ap-
plications. In high-availability application domains such
as stock markets and command and control systems, group
communication systems have been used for years. With the
increased demand for high reliability in standard networked
business services such as web servers and databases, these
uses are becoming more widespread.
A more recent area of interest in group communications
is wide-area distributed simulation and distributed object
applications that require a large number of active objects
which keep state and communicate with low latency re-
quirements. The requirement for many active objects of-
ten necessitates an equally large number of active groups in
the system. Other wide-area applications that could benefit
from group communication are database replication, network
and service monitoring, and collaborative design and
Most existing group communication systems are limited
in their ability to provide wide area services with low la-
tency because their protocols were designed with local area
networks in mind. Because of these limitations, most appli-
cations today either use a TCP based client-server architec-
ture sending one copy of the data to each recipient or use
IP-Multicast for data dissemination coupled with external
reliability and ordering servers. Each of these choices has
limitations, either in performance and scalability or
in complexity and ease of use.
We address these problems by creating a group communication
service, called Spread, which provides high performance
in both local area and wide area networks. Spread
provides the services of traditional group communication
systems, including unreliable and reliable delivery, FIFO,
causal, and total ordering, and membership services with
strong semantics. Two advances allow this: the incorpora-
tion of an overlay network architecture in the group com-
munication system, and the development of a new point-
to-point protocol for the wide area network, tailored to that
Spread creates an overlay network that can impose
any arbitrary network configuration including, for example,
point-to-multi-point, trees, rings, trees-with-subgroups and
any combinations of them to adapt the system to different
networking environments. Coupled with that, a new point-
to-point protocol for the wide area network, the Hop proto-
col, is designed for this environment. The Spread architec-
ture allows multiple protocols to be used on links between
sites and within a site. To validate the usefulness of the
Hop protocol for this environment, we compare it with us-
ing TCP on the wide area links between sites.
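
The overlay description above can be pictured with a small data structure. The following is a minimal sketch, not Spread's actual implementation: the class names (Link, Overlay), the site names, and the protocol labels "hop" and "tcp" are all hypothetical and chosen only to show that each overlay link can carry its own link protocol.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """One overlay link between two sites, with its own link protocol."""
    site_a: str
    site_b: str
    protocol: str  # illustrative labels only, e.g. "hop" or "tcp"


@dataclass
class Overlay:
    """A user-level overlay: sites joined by links in any configuration."""
    links: list = field(default_factory=list)

    def connect(self, site_a, site_b, protocol):
        """Add a link; trees, rings, or mixtures are all just sets of links."""
        self.links.append(Link(site_a, site_b, protocol))

    def neighbors(self, site):
        """Return (neighbor, protocol) pairs for every link touching `site`."""
        result = []
        for link in self.links:
            if link.site_a == site:
                result.append((link.site_b, link.protocol))
            elif link.site_b == site:
                result.append((link.site_a, link.protocol))
        return result


# Hypothetical three-site tree rooted at site_a; the names are made up.
overlay = Overlay()
overlay.connect("site_a", "site_b", "hop")
overlay.connect("site_a", "site_c", "tcp")
```

Keeping the protocol on the link rather than on the site is what would let the same topology be run once with Hop and once with TCP on the wide area links for comparison.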
By targeting Spread at wide area networks we did not
compromise performance over local area networks. Spread
achieves similar performance to the best existing group
communication systems in local area networks with
somewhat higher CPU requirements to achieve the same performance.
Spread is also available in a separate version
specifically tuned for local area networks which has no ad-
Spread is very useful for applications that need the traditional
group communication services, such as causal ordering, and
also need to run over wide area networks. In fact, it is the
first available group communication system to fully support
strong semantics across wide area networks as far as we
know. In addition, other applications may find Spread a
better fit compared with the different reliable IP-Multicast
schemes because of several technical differences:
Scalability with the number of collaboration sessions.
IP-Multicast is very good at supporting a small number
Spread, on the other hand, can support a large num-
ber of different collaboration sessions, each of which
spans the Internet but has only a small number of par-
ticipants. The reason is that Spread utilizes unicast
messages on the wide area network, routing them between
Spread nodes on the overlay network. Therefore,
IP-Multicast related resources are not required on
the network routers.
Scalability with the number of groups. Spread can
scale well with the number of groups used by the ap-
plication without imposing any overhead on network
routers. Group naming and addressing is no longer
a shared resource (the IP address for multicast) but
rather a large space of strings which is unique per collaboration
session.
Routing. All of the current IP-Multicast routing meth-
ods build routing trees in an incremental way. This is
very good for being able to scale to millions of users
on a session. However, since Spread has to main-
tain membership, it requires little additional work to
reconstruct routing trees every time the membership
changes. This provides Spread with the ability to con-
struct optimal routing trees. Note, though, that these
trees are on the overlay network and not on the phys-
ical network. Both IP-Multicast and Spread support
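
To illustrate the routing point, the following is a minimal sketch of recomputing a routing tree on the overlay after a membership change. It assumes an all-pairs shortest path computation in the style of Floyd's algorithm; the text here does not state how Spread computes its trees, so this choice, the function names, the site names, and the link costs are all assumptions made for illustration.

```python
import math


def all_pairs_shortest_paths(cost):
    """Floyd-Warshall over a dict-of-dicts of positive link costs.

    Returns (dist, pred), where pred[s][v] is the node that precedes v
    on a shortest path from s, or None if v is unreachable from s.
    """
    nodes = sorted(cost)
    dist = {u: {v: (0 if u == v else cost[u].get(v, math.inf))
                for v in nodes} for u in nodes}
    pred = {u: {v: (u if v in cost[u] else None)
                for v in nodes} for u in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    pred[i][j] = pred[k][j]
    return dist, pred


def routing_tree(source, members, pred):
    """Directed edges of the shortest-path tree from source to the members."""
    edges = set()
    for member in members:
        node = member
        while node != source:
            parent = pred[source][node]
            if parent is None:
                raise ValueError(f"{member} is unreachable from {source}")
            edges.add((parent, node))
            node = parent
    return edges


# Hypothetical overlay: symmetric link costs (e.g. latency in ms) between sites.
costs = {
    "A": {"B": 10, "C": 50},
    "B": {"A": 10, "C": 10},
    "C": {"A": 50, "B": 10, "D": 5},
    "D": {"C": 5},
}
dist, pred = all_pairs_shortest_paths(costs)
tree = routing_tree("A", {"C", "D"}, pred)
```

In this example the direct A-C link is more expensive than the two-hop path through B, so the tree rooted at A reaches C and D via B. Because every node has a single predecessor with respect to a given source, the edges always form a tree, and the whole computation can simply be rerun when the membership changes.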
The Spread toolkit is available publicly. An early ver-
sion of the system is used by several other organizations
for research and practical projects. The toolkit supports
cross-platform applications and has been ported to sev-
eral Unix platforms as well as Windows and Java environ-
ments. More details on the Spread system can be found at
http://www.spread.org/ along with a white paper and pro-
2 An Overlay Network Architecture for
Wide Area Group Communication
Our goal for a multicast architecture is to facilitate ef-
ficient group communication services for local and wide
area networks. These services include unreliable and reli-
able dissemination of messages to process groups, ordering
guarantees on messages, and membership services. These
services usually adhere to strict semantics such as Virtual
Synchrony and its flavors [16, 8]. This range of services
can be used to more easily develop, or make fault-tolerant,
applications ranging from replicated database servers to
group collaboration tools to streaming multimedia.
to establish the basic message dissemination network and
provide basic membership and ordering services, while user
applications link with a small client library, can reside any-
where in the network, and will connect to the closest dae-
mon to gain access to the group communication services.
There is a small cost to using a daemon-client architecture,
namely extra context switches and inter-process communication.
However, on modern systems this cost is minimal
in comparison with wide-area latencies.
A “site” in Spread consists of a collection of daemons
which can all communicate over a broadcast or multicast
domain. This is usually limited to a local area network. We
IP-Multicast is actively developed to support Internet
wide unreliable multicasting and to scale to millions of
users. Many reliable multicast protocols which use IP-Multicast
have been developed, such as SRM, RMTP,
Local Group Concept (LGC), and HRMP.
The development of reliable protocols over IP-Multicast
has focused on solving scalability problems such as Ack or
Nack implosion and bandwidth limits, and providing use-
applications. Several of these protocols have developed
localized loss recovery protocols. SRM uses randomized
timeouts with backoff to request missed data and send re-
localize the recovery by using the TTL field of IP-Multicast
to request a lost packet from nearer nodes first, and then
expand the request if no one close has it. Several other
variations in localized recovery such as using administra-
tive scope and separate multicast groups for recovery, are
discussed in .
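
The two localized recovery ideas mentioned above, randomized request timers with backoff and TTL-scoped requests that widen when no nearby node has the packet, can be sketched as follows. This is a simplified simulation under stated assumptions, not the algorithm of SRM or of any other cited protocol: the timer formula, the TTL steps, and all names (request_delay, expanding_scope_recover, the node table) are hypothetical.

```python
import random


def request_delay(attempt, base=0.05, rng=random):
    """Randomized request timer with exponential backoff.

    Draws uniformly from [base, 2 * base] and doubles the interval on
    each retry, so receivers that miss the same packet are unlikely to
    send their requests at the same moment.
    """
    return rng.uniform(base, 2 * base) * (2 ** attempt)


def expanding_scope_recover(seq, nodes, ttl_steps=(1, 4, 16, 64)):
    """Ask nearby nodes first, widening the TTL scope until one has `seq`.

    `nodes` maps a node name to (hop_distance, set_of_buffered_seqs).
    Returns (responder, ttl_used), or (None, None) if nobody has it.
    """
    for ttl in ttl_steps:
        in_scope = [(hops, name)
                    for name, (hops, buffered) in nodes.items()
                    if hops <= ttl and seq in buffered]
        if in_scope:
            _, name = min(in_scope)  # nearest responder within this scope
            return name, ttl
    return None, None


# Hypothetical neighborhood: "near" is 2 hops away, "far" is 10 hops away.
nodes = {
    "near": (2, {1, 2}),
    "far": (10, {1, 2, 3}),
}
```

Here a request for packet 2 is satisfied within the second scope by the nearby node, while packet 3 requires widening the scope until the distant node is reached.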
Other reliable multicast protocols like LGC use the dis-
who is the root of some subtree of the main tree. RMTP
also uses “Designated Receivers” (DR) who act as the head
of a virtual subtree to localize recovery of lost packets and
provides reliable transport of a file from one sender to mul-
tiple receivers located around the world. RMTP is based
on the IP-Multicast model, but created user-level multicast
through UDP and modified mrouted software. RMTP did
not examine the tradeoffs in link protocols discussed in this
paper because it handles reliability over the entire tree, with
the DRs only acting as aggregators of global protocol information.
Since Spread already has additional information for
membership and ordering purposes about the exact dissem-
ination of messages and where copies are buffered, Spread
can use more precise local recovery to get the packet from
the nearest source.
HRMP is a reliable multicast protocol which provides
efficient local reliability based on a ring, while using
standard tree-based protocols such as ack trees to provide
reliability between rings. This work theoretically analyzes
be better than protocols utilizing only a ring or a tree. Our
work here focuses on the best protocols to use for reliability
on a wide-area multicast tree and thus is orthogonal to
which local site protocol to use. The Spread system actually
uses a ring protocol for local area networks for many of the
same reasons HRMP does.
We presented an architecture for wide area group com-
munications that was implemented in the Spread system.
This architecture takes advantage of the ability to construct
user level overlay networks to efficiently disseminate reli-
able messages to process groups.
We described Hop, an efficient point-to-point reliable
transport protocol for connecting sites on a wide area multicast
tree. Experiments conducted over the Internet validated
the low latency and high stability of the Hop protocol under
various load and loss conditions.
We would like to thank Michal Miskin-Amir, one of the
creators of Spread. We thank Jithesh Parameswaran for
programming the optimal routing computations in Spread.
We also wish to thank Nabil Adam, Richard Holowczak,
Michael Melliar-Smith, Louise Moser, Alec Peterson, and
Robert Stanton for allowing us to use their systems in our
Communications of the ACM, 39(4), April 1996.
 D. Agarwal, L. E. Moser, P. M. Melliar-Smith, and R. K.
Budhia. The totem multiple-ring ordering and topology
maintenance protocol. ACM Transactions on Computer Sys-
tems, 16(2):93–132, May 1998.
 Y. Amir, D. Dolev, S. Kramer, and D. Malki.
A communication subsystem for high-availability. In Di-
gest of Papers, The 22nd International Symposium on Fault-
Tolerant Computing Systems, pages 76–84, 1992.
 Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. Agarwal, and
P. Ciarfella. The totem single-ring ordering and membership
protocol. ACM Transactions on Computer Systems,
13(4):311–342, November 1995.
 T. Anker, G. V. Chockler, D. Dolev, and I. Keidar. Scal-
able group membership services for novel applications. In
M. Mavronicolas, M. Merritt, and N. Shavit, editors, Pro-
ceedings of the workshop on Networks in Distributed Com-
puting, DIMACS Series in Discrete Mathematics and Theo-
retical Computer Science, 1998.
 K. P. Birman and T. Joseph. Exploiting virtual synchrony in
distributed systems. In 11th Annual Symposium on Operat-
ing Systems Principles, pages 123–138, November 1987.
 K. P. Birman and R. V. Renesse. Reliable Distributed Com-
puting with the Isis Toolkit. IEEE Computer Society Press,
 A. Fekete, N. Lynch, and A. Shvartsman. Specifying and
using a partitionable group communication service. In Proceedings
of the 16th annual ACM Symposium on Principles
of Distributed Computing, pages 53–62, August 1997.
 R. W. Floyd. Algorithm 97 (shortest path). Communications
of the ACM, 5(6):345, 1962.
 S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang.
A reliable multicast framework for light-weight sessions and
application level framing. IEEE/ACM Transactions on Net-
working, 5(6):784–803, December 1997.
Special issue on group communications systems.
 L. Gu and J. Garcia-Luna-Aceves. New error recovery struc-
tures for reliable networking. In Proceedings of the Sixth In-
ternational Conference on Computer Communications and
Networking, September 1997.
 K. Guo and L. Rodrigues. Dynamic light-weight groups.
In Proceedings of 17th International Conference on Dis-
tributed Computing Systems, pages 33–42, May 1997.
 M. Hofmann. A generic concept for large-scale multicast. In
B. Plattner, editor, International Zurich Seminar on Digital
Communications, number 1044 in Lecture Notes in Computer
Science, pages 95–106, February 1996.
 N. Huleihel. Efficient ordering of messages in wide area networks.
Master’s thesis, Institute of Computer Science, The
Hebrew University of Jerusalem, Jerusalem, Israel, 1996.
 J. Lin and S. Paul. Rmtp: A reliable multicast transport pro-
tocol. In Proceedings of IEEE Infocom, pages 1414–1424,
 L. E. Moser, Y. Amir, P. M. Melliar-Smith, and D. A. Agar-
wal. Extended virtual synchrony.
IEEE 14th International Conference on Distributed Com-
puting Systems, pages 56–65, June 1994.
 J. Nonnenmacher and E. W. Biersack. Performance modelling
of reliable multicast transmission. In Proceedings of
INFOCOM 97, April 1997.
 R. V. Renesse, K. Birman, and S. Maffeis. Horus: A flexible
group communication system. Communications of the ACM,
39(4):76–83, April 1996.
 L. E. Rodrigues, H. Fonseca, and P. Verissimo. A dynamic
hybrid protocol for total order in large-scale systems. In
Proceedings of the 16th International Conference on Dis-
tributed Computing Systems, May 1996. Selected portions
 E. Thomopoulos, L. E. Moser, and P. M. Melliar-Smith.
Analyzing the latency of the totem multicast protocols. In
Proceedings of the Sixth International Conference on Computer
Communications and Networks, pages 42–50, September
 B. Whetten, T. Montgomery, and S. Kaplan. A high per-
formance totally ordered multicast protocol. In Theory and
Practice in Distributed Systems, International Workshop,
Lecture Notes in Computer Science, page 938, September
In Proceedings of the