A Low Latency, Loss Tolerant Architecture and Protocol for Wide Area Group Communication
Yair Amir, Claudiu Danilov, Jonathan Stanton
Department of Computer Science
Johns Hopkins University
3400 North Charles St.
Baltimore, MD 21218 USA
?yairamir, claudiu, jonathan
Group communication systems are proven tools upon
which to build fault-tolerant systems. As the demands for
fault-tolerance increase and more applications require reliable
distributed computing over wide area networks, wide
area group communication systems are becoming very use-
ful. However, building a wide area group communication
system is a challenge. This paper presents the design of
the transport protocols of the Spread wide area group com-
munication system. We focus on two aspects of the system.
First, the value of using overlay networks for application
level group communication services. Second, the require-
to construct wide area group communication. We support
our claims with the results of live experiments conducted
over the Internet.
Keywords—Group Communication, Overlay Networks, Re-
liable Multicast, Wide Area Networks, TCP/IP.
There exist some fundamental difficulties with high-
performance group communication over wide-area net-
works. These difficulties include:
The characteristics (loss rates, amount of buffering)
and performance (latency, bandwidth) vary widely in
different parts of the network.
The packet loss rates and latencies are significantly
higher and more variable than on local area networks.
This work was supported in part by grants from the National Security
Agency (NSA) and the Defense Advanced Research Projects Agency
This is the Tech Report version. Cite as TR 00/99 -xx
It is not as easy to implement efficient reliability and
ordering on top of the available wide area multicast
mechanisms as it is on top of local area hardware
broadcast and multicast. Moreover, the available best
effort wide area multicast mechanisms come with sig-
More and more applications are starting to use group
communication systems to enable fault-tolerance and high-
availability or to create large scale complex interactive ap-
plications. In high-availability application domains such
as stock markets and command and control systems, group
communication systems have been used for years. With the
increased demand for high reliability in standard networked
business services such as web servers and databases, these
uses are becoming more widespread.
A more recent area of interest in group communications
is wide-area distributed simulation and distributed object
applications that require a large number of active objects
which keep state and communicate with low latency re-
quirements. The requirement for many active objects of-
ten necessitates an equally large number of active groups in
the system. Other wide-area applications that could benefit
from group communication are database replication, network
and service monitoring, and collaborative design and
Most existing group communication systems are limited
in their ability to provide wide area services with low la-
tency because their protocols were designed with local area
networks in mind. Because of these limitations, most appli-
cations today either use a TCP based client-server architec-
ture sending one copy of the data to each recipient or use
IP-Multicast for data dissemination coupled with external
reliability and ordering servers. Each of these choices has
some limitations either in performance/scalability issues or
in complexity and ease of use.
We address these problems by creating a group communication
service, called Spread, which provides high performance
in both local area and wide area networks. Spread
systems, including unreliable and reliable delivery, FIFO,
causal, and total ordering, and membership services with
strong semantics. Two advances allow this: the incorpora-
tion of an overlay network architecture in the group com-
munication system, and the development of a new point-
to-point protocol for the wide area network, tailored to that
Spread creates an overlay network that can impose
any arbitrary network configuration including for example,
point-to-multi-point, trees, rings, trees-with-subgroups and
any combinations of them to adapt the system to different
networking environments. Coupled with that, a new point-
to-point protocol for the wide area network, the Hop proto-
col, is designed for this environment. The Spread architec-
ture allows multiple protocols to be used on links between
sites and within a site. To validate the usefulness of the
Hop protocol for this environment, we compare it with us-
ing TCP on the wide area links between sites.
By targeting Spread at wide area networks we did not
compromise performance over local area networks. Spread
achieves similar performance to the best existing group
communication systems in local area networks with
somewhat higher CPU requirements to achieve the same performance.
Spread is also available in a separate version
specifically tuned for local area networks which has no ad-
Spread is very useful for applications that need the tra-
ditional group communication services such as causal and
also need to run over wide area networks. In fact, it is the
first available group communication system to fully support
strong semantics across wide area networks as far as we
know. In addition, other applications may find Spread a
better fit compared with the different reliable IP-Multicast
schemes because of several technical differences:
Scalability with the number of collaboration sessions.
IP-Multicast is very good at supporting a small number
Spread, on the other hand, can support a large num-
ber of different collaboration sessions, each of which
spans the Internet but has only a small number of par-
ticipants. The reason is that Spread utilizes unicast
messages on the wide area network, routing them between
Spread nodes on the overlay network. Therefore,
IP-Multicast related resources are not required on
the network routers.
Scalability with the number of groups. Spread can
scale well with the number of groups used by the ap-
plication without imposing any overhead on network
routers. Group naming and addressing is no longer
a shared resource (the IP address for multicast) but
rather a large space of strings which is unique per col-
Routing. All of the current IP-Multicast routing meth-
ods build routing trees in an incremental way. This is
very good for being able to scale to millions of users
on a session. However, since Spread has to main-
tain membership, it requires little additional work to
reconstruct routing trees every time the membership
changes. This provides Spread with the ability to con-
struct optimal routing trees. Note, though, that these
trees are on the overlay network and not on the phys-
ical network. Both IP-Multicast and Spread support
The Spread toolkit is available publicly. An early ver-
sion of the system is used by several other organizations
for research and practical projects. The toolkit supports
cross-platform applications and has been ported to sev-
eral Unix platforms as well as Windows and Java environ-
ments. More details on the Spread system can be found at
http://www.spread.org/ along with a white paper and pro-
2 An Overlay Network Architecture for
Wide Area Group Communication
Our goal for a multicast architecture is to facilitate ef-
ficient group communication services for local and wide
area networks. These services include unreliable and reli-
able dissemination of messages to process groups, ordering
guarantees on messages, and membership services. These
services usually adhere to strict semantics such as Virtual
Synchrony and its flavors [16, 8]. This range of services
can be used to more easily develop, or make fault-tolerant,
applications ranging from replicated database servers to
group collaboration tools to streaming multimedia.
to establish the basic message dissemination network and
provide basic membership and ordering services, while user
applications link with a small client library, can reside any-
where in the network, and will connect to the closest dae-
mon to gain access to the group communication services.
There is a small cost to using a daemon-client architecture,
namely extra context switches and inter-process communication.
However, on modern systems this cost is minimal
in comparison with wide-area latencies.
A “site” in Spread consists of a collection of daemons
which can all communicate over a broadcast or multicast
domain. This is usually limited to a local area network. We
will use the term “site” to refer to this collection of locally
connected daemons as a whole. Each site selects one dae-
mon, based on the current membership of the site, that acts
as gateway, connecting all the members of the site to other
The Spread group communication system architecture
solves five main problems: end-to-end reliability, config-
uration of an overlay network, low-latency forwarding over
high-latency links, the high cost of membership changes,
2.1 End-to-End Reliability
Guaranteed end-to-end reliability is a result of the prop-
erties of two separate protocols in Spread. First, each link
guarantees that all packets sent on it will eventually be reliably
received on the other side as long as there is not a membership
change. Second, the membership protocols detect
recover the necessary state to continue making progress and
to report to the application if any messages were not able to
be reliably delivered because of crashes or partitions. The
three possible cases are:
No failures, no slow receivers. Here the eventual relia-
bility of each link is sufficient to result in all messages
getting to all recipients reliably.
No failures, slow receivers. Here we impose a global
window of outstanding messages which will limit the
speed at which new messages are generated to the rate
the slowest receiver can handle. Combined with link
reliability, this produces end-to-end reliability.
Failures. Here the membership protocol takes over and
recovers the system state, resends messages as needed,
and informs the application of which messages have
guaranteed reliability according to the semantics defined
by the Extended Virtual Synchrony model.
In general, Spread decouples the dissemination and local
reliability mechanisms from the global ordering and stability
protocols. This decoupling allows messages to be forwarded
on the network immediately despite losses or ordering
requirements. The only place where messages are delayed
by Spread is just before delivering them to the clients,
to preserve the semantic guarantees. Decoupling local and
global protocols also permits pruning, where data messages
are only sent to the minimal necessary set of network com-
ponents, without compromising the strong semantic guar-
Spread allows different low level protocols to be used to
provide reliable dissemination of messages, depending on
the configuration of the underlying network. Each proto-
col can have different tuning parameters applied to differ-
ent portions of the network. In particular, Spread integrates
three low-level protocols: one for local area networks called
Ring, and two for the wide area overlay network connect-
ing the local area networks: the standard TCP, and our new
protocol called Hop.
2.2 Overlay Networks
We define “overlay network” as a virtual network con-
structed such that each link connects two edge nodes in an
underlying physical network, such as the Internet. Each virtual
link in the overlay network can translate into several
hops in the underlying network, and the cost attached to a
virtual link is some aggregated cost of the underlying net-
work links over which it travels.
Spread constructs an overlay network between all the
sites that have currently active daemons. This network is
constructed based on information contained in the static
bership list, and any available network cost information.
These sources produce a network that dynamically changes
as daemons are started or crash, and as partitions and
merges occur. The configuration file provides information
about all potential machines in the Spread system, but does
not constrain which members are currently running.
The overlay network is used to calculate source based
optimal routing paths from each source to all other Spread
daemons. Source based routing over the shortest path
produces the best routes when it is feasible to do so. For the
application domain Spread is targeted at, the calculations
associated with source based routing are feasible. After a
membership change a new overlay network is constructed
and the routing trees are recalculated based on the new net-
The membership service provides each daemon with an
weights of the current overlay network. Then, each Spread
daemon independently calculates a shortest path multicast
tree from each site to every other currently connected site.
Since each daemon uses identical weights and an identi-
cal graph they are guaranteed to compute the same routing
trees. The routing calculation uses the Floyd-Warshall
all-pairs shortest path algorithm.
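The routing step described above can be sketched as follows. This is a minimal illustration rather than Spread's implementation: the function names, the dictionary-based graph, and the three-site example with its weights are all hypothetical. It assumes positive link weights and unique shortest paths; with ties, an explicit tie-breaking rule would be needed so that every daemon derives the same tree.

```python
import math

def floyd_warshall(sites, weights):
    """All-pairs shortest paths on a directed overlay graph.

    sites:   list of site names
    weights: dict {(u, v): cost} giving one weight per directed link
    Returns (dist, nxt), where nxt[u][v] is the first hop from u toward v.
    """
    dist = {u: {v: (0.0 if u == v else math.inf) for v in sites} for u in sites}
    nxt = {u: {v: None for v in sites} for u in sites}
    for (u, v), w in weights.items():
        if w < dist[u][v]:
            dist[u][v] = w
            nxt[u][v] = v
    # Iterate in sorted order so every daemon performs the same computation.
    order = sorted(sites)
    for k in order:
        for i in order:
            for j in order:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def multicast_tree(source, sites, nxt):
    """Overlay edges used by the shortest paths from `source` to all sites."""
    edges = set()
    for dest in sites:
        if dest == source or nxt[source][dest] is None:
            continue  # skip the source itself and unreachable sites
        node = source
        while node != dest:
            hop = nxt[node][dest]
            edges.add((node, hop))
            node = hop
    return edges

# Hypothetical three-site overlay with symmetric weights.
sites = ["A", "B", "C"]
weights = {("A", "B"): 1, ("B", "A"): 1,
           ("B", "C"): 1, ("C", "B"): 1,
           ("A", "C"): 5, ("C", "A"): 5}
dist, nxt = floyd_warshall(sites, weights)
tree_from_a = multicast_tree("A", sites, nxt)
```

Because the inputs (graph and weights) are identical at every daemon and the computation is deterministic, each daemon arrives at the same trees without exchanging routing messages.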
When configuring Spread, one can set weights for ev-
ery potential link in every direction. Thus, different over-
lay networks can be constructed based on the needs of the
application and knowledge about the behavior of the un-
derlying network. Actually, since different weights can be
set for every link in both directions, the flexibility exists to
create most practical networks.

Figure 1. Network Testbed
Panels: Multicast Tree Network, Multicast Chain Network, Multicast Fanout Network.
* For clarity, a fanout network is shown only for Mae East. It exists for the other sites as well.

Figure 1 presents a sample set of sites located around the United
States and the overlay networks that were constructed between them based on
different needs. Each site may contain several tens of dae-
mons, each of them serving many clients. The figure shows
the real network used for the experiments reported in Sec-
tion 4. The overlay network labeled “Multicast Tree Net-
work” shows a shared tree network where the tree rooted
at each site is the same. The network labeled “Multicast
work where each site has direct connections to all other sites
(only the tree rooted at Mae East is shown for clarity rea-
As mentioned before, when changes in the membership
occur, such as daemons dying or starting up, the Spread
membership service automatically discovers all live dae-
mons and creates a new membership including all of them.
Then a new overlay network is constructed connecting the
live daemons using minimal cost links. For these experi-
ments the overlay network routing configuration was hard-
wired to the described configurations for the purpose of
evaluating the behavior of the link protocols on these spe-
cific networks. We were not trying in this paper to evaluate
the dynamic reconfiguration algorithms.
The major cost of using an overlay network is that since
the overlay is constructed only between end nodes in the underlying
network, inefficiencies exist in the routing paths.
However, this disadvantage is outweighed by several key
benefits the overlay architecture provides: First, the algo-
rithms used in the overlay network can be easily changed
and do not require changes to basic network infrastructure
(e.g. routers). Second, routers can be made simpler and
faster, while complex protocols and processing can occur
on end nodes where more abundant resources exist. For
example, as a result of the difficulties encountered while deploying
and upgrading IP-Multicast in routers, most of the
work on high level multicast services, such as reliability,
has used an overlay network approach.
2.3 Low-latency Forwarding
Group communication applications are often very la-
protocols themselves, and the high-level applications built
on top of a group communication toolkit. Spread uses two
approaches to solve this problem. First, the global ordering
protocols used in Spread were designed to be as latency-insensitive
as possible. Second, as many sources of added latency
in the dissemination path (for both data messages and
control messages) as possible were removed.
The most significant change was to design a point-to-
point reliability protocol which did not delay packets until
they were in order, but rather could forward them out of
order and let a higher level protocol deal with delivering
them to the application in order. In this way, messages are
delayed only at the receiver and not while in transit. There-
fore, no latency is added because of reliability and routing.
Latency is incurred only due to flow control on the network
and to preserve semantics on the receiver. This protocol is
discussed in depth in Section 3.2.
Spread uses a daemon-client architecture. This architec-
ture has many benefits, the most important for the wide-
area setting being the resultant ability to pay the minimum
necessary price for different causes of group membership
changes. Simple joins and leaves of processes translate
into a single message. A daemon disconnection or connec-
tion does not pay the heavy cost involved in changing wide
area routes. Only network partitions between different local
area components of the network require the heavy cost of a
full-fledged membership change. Luckily, there is a strong
inverse relationship between the frequency of these events
and their cost in a practical system. The process and daemon
membership correspond to the more common model of
“Lightweight Groups” and “Heavyweight Groups”.
Using the Spread architecture, one can envisage millions
of Spread overlay networks concurrently sharing the Internet.
Each overlay network is formed on behalf of one application,
which might support tens of users collaborating on a
project. In contrast, such a scenario cannot be envisaged using
IP-Multicast because of the scalability limits of routers
in handling millions of concurrent groups.
Core based trees alleviate some of the scalability prob-
lems with IP-Multicast groups, however, they do not re-
the same core backbone routers will end up being used by
millions of groups which cross them. The key difference is
that each daemon handles only the groups associated with
its application, and needs no knowledge of any groups used
by other Spread daemon overlay networks.
Of course, there are some applications that are only possible
in a model such as IP-Multicast. For example, large
scale multimedia broadcasts of a limited number of channels
to millions of users are only possible with router support.
Although a wide area membership service is operational
in Spread (which is available for download), the details of
a wide area membership service are beyond the scope of this
A group communication system always involves trade-offs
in performance, scalability, and services. As mentioned
above, Spread supports a very large number of groups active
at any time (such as several thousand), and a very large
number of active Spread configurations at any time (essentially
unlimited). Spread also uses a hierarchical architecture
to scale the number of daemons and users into the tens to
hundreds of daemons and thousands of users. The hierarchy
Spread uses consists of three levels:
Sites. There are between 1 and 100 sites. Each site is
connected to some of the others through point-to-point
WAN links forming a multicast forest (optimal source-
Daemons. Each site can contain between 1 and 50 daemons.
The site uses a ring based protocol to provide
ordering, reliability, and flow control among the dae-
Users. Each daemon can support (in theory) between 1
we have tested up to a couple hundred). The daemon
3 Wide Area Link Protocols
Given the framework above, each link between two daemons
that are directly connected on the overlay network
can be formed by one of several protocols. In this section
we discuss the requirements for a wide area link protocol
and describe the Hop protocol that addresses these require-
Before we dive into wide area protocols, it is worthwhile
to explain what protocol is used within each ’site’. Within
a site, the Spread daemons use the Ring protocol to provide
data dissemination, reliability, and flow control. The Ring
protocol uses the unreliable multicast service provided by
IP, and is based on the same ideas as the Totem system
and achieves similar performance. Each site can include
several tens of daemons who all participate in the Ring pro-
tocol, and who are treated as one entity on the wide-area
The Ring protocol is based on passing a token around
the members of the site, with each member sending new
multicast messages when it holds the token. When the time
to pass the token is very small this can work quite well;
however, in high latency wide-area networks, the time during
which no member can send because the token is in transit
becomes very significant. Additionally, since in a wide-area
network the network medium is not shared as it is in
a LAN, it is possible for multiple daemons to be sending at
the same time which token passing will not allow. Finally,
the Ring protocol becomes more fragile when the chance of
packet loss is non-negligible as losing the packet is costly
to recover from.
Spread couples these rings with a wide area protocol
such as the Hop protocol in order to create a complete opti-
mized network connecting a set of local area networks.
3.1 Requirements for Wide Area Link Protocols
For wide-area links, two protocols are discussed in this
paper: TCP and the Hop protocol. Here we will compare
TCP as the link protocol in a Spread created multicast tree,
with the use of the Hop protocol in a Spread multicast tree.
We will call these two protocol choices: TCP based multi-
cast, and Hop based multicast. To provide the global group
communication services, the link protocols must provide
eventually reliable transport and link-level flow control.
TCP is a very mature protocol that provides both reli-
able transport and flow control. However, TCP also pro-
vides other guarantees, in particular FIFO ordering. TCP
will not deliver any data out of order, which can cause problems
when multiple TCP connections are chained to create
our overlay network. Specifically, since messages are de-
layed until they can be delivered in order, losses will cause
delays in forwarding the data to the next link down the tree.
Furthermore, when the data is finally recovered, a burst of
buffered data is then immediately forwarded down the tree.
These micro-delays and micro-bursts cause end-to-end la-
tency increases and an increased burstiness on the network
which itself causes degraded performance.
These issues call for a design of a protocol which bet-
ter fits the requirements of a wide area link protocol in our
The Hop protocol is designed to provide only the re-
quired services, specifically reliability and flow control.
The Hop protocol forwards packets as soon as they are received,
even if prior packets are missing. The Spread system
provides all the standard group communication orderings at
a higher level, where messages are only delayed just before
being delivered to the application.
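The difference between in-order (TCP-like) forwarding and the non-blocking forwarding of the Hop protocol at an intermediate node can be illustrated with a small model. This is a simplified sketch under stated assumptions: the function name and the arrival times are hypothetical, sequence numbers are contiguous, and flow control and processing time are ignored.

```python
def forward_times(arrivals, in_order):
    """Time at which each packet can be forwarded to the next link.

    arrivals: dict {sequence_number: arrival_time} at the intermediate node
    in_order: True models FIFO delivery (a packet waits for all earlier ones),
              False models forwarding each packet as soon as it arrives.
    """
    times = {}
    latest = 0
    for seq in sorted(arrivals):
        if in_order:
            # A packet cannot leave before every earlier packet has arrived.
            latest = max(latest, arrivals[seq])
            times[seq] = latest
        else:
            times[seq] = arrivals[seq]
    return times

# Hypothetical times in milliseconds: packet 2 is lost and recovered at t=90.
arrivals = {1: 10, 2: 90, 3: 12, 4: 13}
fifo_times = forward_times(arrivals, in_order=True)
hop_times = forward_times(arrivals, in_order=False)
```

In this example the in-order model holds packets 3 and 4 until t=90 and then releases them together with packet 2, which is the micro-delay and micro-burst effect described in the text, while the non-blocking model forwards them at t=12 and t=13.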
Both the TCP based multicast and Hop based multicast
protocols have several advantages over emulated multicast,
where end-to-end links are used between all the applica-
tion instances. Not only can they utilize multicast trees to
avoid sending N copies of the data across the sender’s net-
work, but they can also achieve localized recovery of lost
packets without requiring the original sender to re-send the
data. Localized recovery is crucial for high latency mul-
ticast networks, not only for large, but also for small
groups where the members are widely dispersed in the net-
work. Each unicast link in the Spread multicast tree has
buffers on the sending side to store data until it has success-
fully reached the other end of the link.
Most current reliable multicast protocols have some
form of localized recovery, such as creating virtual local
subgroups along the tree, or using nack avoidance al-
gorithms and expanding ring nacks. All of these tech-
niques create an approximation of recovering the missing
data from the closest node. Spread has an accurate knowl-
edge of the current membership and the structure of the
overlay network. As discussed in Section 2.1 we do not
have one protocol provide end-to-end reliability directly.
Instead, we rely on link reliability, liveness and global flow
control to guarantee end-to-end reliability. Each link is
guaranteed to eventually transfer each packet to the other
side. The source dissemination tree from the sender to each
receiver guarantees that each daemon will eventually receive
all the necessary data from its parent. Failure recovery
and liveness are guaranteed by Spread’s membership proto-
Before we discuss the Hop protocol, we mention that as a
global optimization to allow packing of small messages and
fragmentation of large messages, all of the network protocols
used by Spread (Ring, Hop, TCP) actually operate on
packets constructed of one or more data fragments and control
messages. All control information used by the protocols
is piggybacked on data packets when possible. If no data is
available, control messages are sent as separate packets.
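As a rough illustration of packing and fragmentation, the sketch below fills fixed-size packets with message fragments. The greedy policy, the `pack` function, the fragment tuple layout, and the payload size are assumptions made for the example and are not taken from Spread; header overhead and piggybacked control messages are not modeled.

```python
def pack(messages, max_payload):
    """Pack messages into packets of at most max_payload bytes.

    messages: list of (msg_id, data) pairs, data being bytes
    Returns a list of packets; each packet is a list of fragments
    (msg_id, fragment_index, chunk). Small messages share a packet,
    large messages are split across several packets.
    """
    packets, current, room = [], [], max_payload
    for msg_id, data in messages:
        offset, index = 0, 0
        while offset < len(data):
            if room == 0:
                # Current packet is full: close it and start a new one.
                packets.append(current)
                current, room = [], max_payload
            chunk = data[offset:offset + room]
            current.append((msg_id, index, chunk))
            room -= len(chunk)
            offset += len(chunk)
            index += 1
    if current:
        packets.append(current)
    return packets

# Two small messages and one message larger than a packet (hypothetical sizes).
packets = pack([("a", b"xx"), ("b", b"yyy"), ("c", b"z" * 10)], max_payload=8)
```

Here messages "a" and "b" travel in the first packet together with the first fragment of "c", and the remainder of "c" goes in a second packet.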
3.2 The Hop Protocol
The Hop protocol operates over an unreliable datagram
service such as UDP/IP. The core goal of the Hop protocol
is to provide the lowest latency and highest throughput possible
when transferring packets across wide-area networks.
The key elements of the Hop protocol are:
Non-Blocking: packets are forwarded despite the loss
of packets ordered earlier.
Lazy-Selective-Retransmits: nacks are sent for specific
lost packets after a short delay to avoid requesting data
which was not lost but merely arrived out of order or
is sequenced after lost data.
Rate-based flow control: a rate based flow regula-
tor provides explicit support for high delay-bandwidth
networks. In addition, the rate based regulator can utilize
bandwidth reservation services if such exist in the
The Hop protocol establishes a bidirectional connection
between every two daemons that are directly connected on
the overlay network. These two daemons maintain a list
of counters and a table of open packets which have not yet
been acknowledged. To establish reliable transmission in
the presence of losses, Hop uses selective nacks where the
receiver requests specific data packets (identified by their
sequence number) when loss is detected. The receiver con-
tinues to request lost packets until they are recovered. If a
tocol declares the link failed and the membership protocol
reconfigures the system. This is necessary to eliminate the
“failure to receive” problem that can occur either because
of a networking fault that deletes certain packets or a malicious
attacker who keeps removing one particular packet
from the network.

Sender handling received NACK:
    receive NACK n:
        resend( get_packet(n) )
        if (get_numnacks_received(n) > MaxNACKS)
Figure 2. Sender handling received NACK

Sender handling ACK Timeout:
    timeout ACK_Timer:
        ack_val = get_highest_seq_sent_sofar()
Figure 3. Sender handling ACK Timeout

The receiver has two methods to detect loss. First, when
the receiver receives a packet which is sequenced beyond
the next expected packet, it adds the sequence numbers between
the highest previously received sequence number and
the just arrived packet to a list of proposed lost packets.
This method is shown in Figure 4. The receiver schedules
a nack packet containing the newly lost sequence numbers
to be sent in a short time if the packets do not arrive mean-
while. In real-life experiments we found this delay could
avoid false positive losses of about one percent while still
requesting truly lost packets early enough to receive the re-
transmission and acknowledge it before hitting the maxi-
mum outstanding packet limit. Currently this delay is fixed,
but we are investigating the amount of reordering of packets
on wide area networks and how to set the delay based on the
Second, to detect loss when no further packets are sent,
or when there is a long time before the next packet, the
sender sends a link acknowledgement to the receiver pe-
riodically (based on time and number of packets sent and
received), which specifies the highest sequence value the
sender has sent. This is shown in Figure 3. Sequence num-
bers that are equal to or below the specified value that the
receiver has not yet received are added to the list of missing
packets. After a short delay these sequence numbers are
sent to the sender in a nack packet.
When the sender receives a nack packet it adds the pack-
ets represented by the requested sequence numbers to its
outgoing retransmit queue as shown in Figure 2. The pack-
ets will be sent along with other data when flow control al-
lows. Retransmissions are sent even when the limit on the
number of outstanding packets has been reached. Therefore
the packet will cross the link in a bounded time or else the
link will be declared failed.
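The sender side described above (buffering unacknowledged packets, queueing retransmissions on nacks, and giving up after too many nacks for the same packet) can be sketched as follows. The class, its method names, and the `max_nacks` default are hypothetical; flow control and the actual network send are left out.

```python
class HopSender:
    """Sketch of the sender side of one link (illustrative only)."""

    def __init__(self, max_nacks=3):
        self.max_nacks = max_nacks   # hypothetical value for MaxNACKS
        self.next_seq = 1
        self.unacked = {}            # seq -> payload, kept until acknowledged
        self.nack_counts = {}        # seq -> number of nacks received
        self.retransmit_queue = []   # seqs waiting to be resent
        self.link_failed = False

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq

    def on_nack(self, seqs):
        for seq in seqs:
            if seq not in self.unacked:
                continue             # already acknowledged, nothing to resend
            self.retransmit_queue.append(seq)
            self.nack_counts[seq] = self.nack_counts.get(seq, 0) + 1
            if self.nack_counts[seq] > self.max_nacks:
                # Too many requests for one packet: declare the link failed.
                self.link_failed = True

    def on_ack(self, ack_val):
        # The receiver holds every packet up to and including ack_val.
        for seq in [s for s in self.unacked if s <= ack_val]:
            del self.unacked[seq]
            self.nack_counts.pop(seq, None)

    def link_ack_value(self):
        # Highest sequence number sent so far, carried in the periodic link ack.
        return self.next_seq - 1

sender = HopSender(max_nacks=2)
for payload in (b"p1", b"p2", b"p3"):
    sender.send(payload)
sender.on_nack([2])      # packet 2 is queued for retransmission
sender.on_ack(1)         # buffer for packet 1 is released
```

In a fuller implementation the membership protocol would be notified when `link_failed` becomes true, as the text describes.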
The Hop protocol eliminates duplicates by checking every
received packet against the list of known missing packets.
If the received packet’s sequence number is less than the
highest previously received, either it is in the list of missing
packets or it has already been received. If it is not in the
list, it is a duplicate and is discarded.
To enable the sender to release buffered copies of pack-
ically or after some number of packets, whichever is sooner.
The generation of these acks can be seen in the second part
of Figure 4. These acknowledgements contain the sequence
ceived. This combination of acks and nacks provides a fully
reliable channel with bounded buffers. However, it does not
create any restrictions on the ordering of messages so that
each packet is delivered to the higher layers of Spread as
soon as it is received by the Hop protocol.
The Hop protocol uses rate-based flow control to limit
the rate packets are sent, and a maximum window of out-
standing packets to provide termination guarantees for the
reliability protocol as described above. The rate regulator
is a leaky bucket with both a maximum burst size and an
average rate limitation. All of these parameters are set per
link so that they can be tuned separately for each link in the
Figure 5 presents a case with three hop links where data
flows from site A to sites B, C, and D. Messages are as-
signed different sequence numbers on different links ac-
cording to their arrival at the parent node. The message
labels m1-m4 represent the message identifier and not the
sequence number assigned on each link. Suppose, as shown
in Figure 5.(a), that packet m2 is lost on the link between
sites A and B and subsequently packet m1 is lost on the link
between B and C. Note that the loss of packet m2 does not
preclude B from forwarding packet m3. Figure 5.(b) shows
the nacks for messages m1 and m2 and the concurrent for-
warding of m4. In Figure 5.(c) message m2 is recovered by
B and immediately forwarded to C and D. Then message
m1 is retransmitted to C. Note that to allow this aggressive
behavior, the order of the sequence numbers between sites
B and C was as follows: m1 got link sequence 1, m3 got
link sequence 2, m4 got link sequence 3, and m2 got link
sequence 4. This is the reason a request for m2 was not
triggered by site C.
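The per-link numbering in this scenario can be reproduced with a few lines of code. This is only an illustration of the idea that each link numbers messages in the order they arrive at the parent node; the `LinkForwarder` class is hypothetical, and reusing the original link sequence number on retransmission is an assumption of the sketch.

```python
class LinkForwarder:
    """Assigns link-local sequence numbers in order of arrival at the parent."""

    def __init__(self):
        self.next_seq = 1
        self.seq_of = {}   # message identifier -> link sequence number

    def forward(self, msg_id):
        # A retransmission of a known message keeps its original number.
        if msg_id not in self.seq_of:
            self.seq_of[msg_id] = self.next_seq
            self.next_seq += 1
        return self.seq_of[msg_id]

# Order in which B obtains the messages in Figure 5: m2 arrives last,
# after being recovered from A.
link_b_to_c = LinkForwarder()
for msg_id in ["m1", "m3", "m4", "m2"]:
    link_b_to_c.forward(msg_id)
```

Since m2 receives the highest sequence number on the link from B to C, site C sees no gap caused by m2 and has no reason to request it.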
In Section 4 we evaluate both the Hop protocol and TCP
for their usefulness in providing a link protocol for the
Spread group communication system.
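The rate regulator described earlier in this section, with a maximum burst size and an average rate limit, can be sketched in the common token-bucket formulation. This is one way to realize such a regulator, not necessarily the one used in Spread; the class name, parameter names, and values are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class RateRegulator:
    """Token-bucket style regulator with an average rate and a burst limit."""

    def __init__(self, rate, burst):
        self.rate = rate             # average packets per second (per link)
        self.burst = burst           # maximum packets sent back to back
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # time of the previous call, in seconds

    def try_send(self, now):
        """Return True if a packet may be sent at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill at the average rate, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

regulator = RateRegulator(rate=10, burst=2)
first_three = [regulator.try_send(0.0) for _ in range(3)]
after_wait = regulator.try_send(0.5)
```

With these settings two packets pass immediately, the third is held back, and sending resumes once enough time has elapsed. Because the parameters belong to the regulator instance, each link can carry its own instance with separately tuned values.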
Receiver handling PACKET:
    receive PACKET n:
        if (n > next_expected )
            add_to_missing_list(next_expected .. n - 1)
            queue_send(NACK next_expected .. n - 1, Nack_Delay)
            next_expected = n + 1
        else if (n == next_expected )
            deliver(PACKET n )
            next_expected = n + 1
        else if (n < next_expected )
            if (check_in_missing_list(n) )

Receiver handling PACKET for ACK:
    receive PACKET n:
        if (get_packets_since_last_ack() > MaxPacketsBetweenAck)
            ack_val = get_highest_sequence_all_received_upto()
Figure 4. Receiver handling PACKET
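The receiver behavior described in Section 3.2 (gap detection, duplicate elimination, delayed nacks, and cumulative acks) can be sketched in Python as below. This is an interpretation of the prose under stated assumptions, not the Spread source: the class and method names are hypothetical, timers are replaced by an explicit `nack_due` call, and handing a packet to the upper layer is modeled by appending to a list.

```python
class HopReceiver:
    """Sketch of the receiver side of one link (illustrative only)."""

    def __init__(self):
        self.next_expected = 1
        self.missing = set()    # sequence numbers believed lost
        self.delivered = []     # stands in for delivery to the upper layer

    def on_packet(self, seq):
        """Handle a data packet; return sequence numbers newly found missing."""
        newly_missing = []
        if seq > self.next_expected:
            # Gap detected: everything between is a candidate for a nack.
            newly_missing = list(range(self.next_expected, seq))
            self.missing.update(newly_missing)
            self.next_expected = seq + 1
        elif seq == self.next_expected:
            self.next_expected = seq + 1
        elif seq in self.missing:
            self.missing.discard(seq)   # late arrival or retransmission
        else:
            return []                   # duplicate: discard
        self.delivered.append(seq)      # delivered immediately, not in order
        return newly_missing

    def on_link_ack(self, highest_sent):
        """Handle the sender's periodic link ack carrying its highest seq."""
        newly_missing = list(range(self.next_expected, highest_sent + 1))
        self.missing.update(newly_missing)
        self.next_expected = max(self.next_expected, highest_sent + 1)
        return newly_missing

    def nack_due(self, candidates):
        """After the nack delay, request only what is still missing."""
        return sorted(s for s in candidates if s in self.missing)

    def cumulative_ack(self):
        """Highest sequence number up to which everything was received."""
        if self.missing:
            return min(self.missing) - 1
        return self.next_expected - 1

receiver = HopReceiver()
receiver.on_packet(1)
gap = receiver.on_packet(4)        # packets 2 and 3 are now suspected lost
receiver.on_packet(3)              # 3 was only reordered, arrives in time
to_request = receiver.nack_due(gap)
```

The delay between detecting a gap and calling `nack_due` is what lets reordered packets, such as packet 3 here, arrive without triggering an unnecessary retransmission.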
Figure 5. Hop Protocol Forwarding Scenario. Legend: message that already arrived; message in transit; message that was lost. All data messages are generated by site A.
4 Performance and Results
We have conducted experiments over the Internet to test
the correctness of the implementation and to measure the
performance of the different protocols in practice. Figure 1
presents the layout of our testbed, which consisted of six sites:
Hopkins - at the Computer Science department at
Johns Hopkins University, Maryland.
CNDS - our lab at the Center for Networking and Dis-
tributed Systems at Johns Hopkins.
UCSB - at the ECE department at the University of
California, Santa Barbara.
Mae East - on AboveNet Communications network at
one of the Internet main connecting hubs, Virginia.
OSU - at the math department at Ohio State University.
Rutgers - at the Center for Information Management,
Integration and Connectivity at Rutgers University,
Since the focus of this paper is architectural support and
protocols for the wide area setting, only one computer from
each site participated in the experiments. The computers
involved ranged from Sparc-5 to Ultra-5 workstations run-
ning Solaris, and Pentium II workstations running Linux.
During the tests the computers were also under normal user
load, and no changes were made to their operating system.
Since none of the Spread protocols use wall clock time, no
effort was made to synchronize the system clocks of the machines.
The network characteristics of any particular connection
over the Internet may vary significantly depending on the
time of day, other users, news events, etc. To minimize
the effects of these variations on our experiments, each experiment
was conducted a number of times (between 30 and
200 times depending on the specific experiment). Each set
of measurements for one experiment was run at approxi-
mately the same time (within a few minutes of each other).
Separate experiments (reported in separate graphs or tables)
were run at different times, so comparing results between
tables might not be highly accurate. Because of these
variations, any one data point may vary over time; however,
we believe the aggregate trends of the graphs and results are
4.2 Overlay Networks
For this experiment we created two different overlay networks
between the above six sites by adjusting the weights
of the links in the configuration of Spread. A Fanout network
contains a direct link between every pair of sites so that
every source sends directly to every other site. This was created
by assigning equal weights to every link. A shared-tree
multicast network was created as shown in Figure 1. This
tree was constructed based on measurements of network latency.
The experiment shown in Table 1 was conducted using
TCP as the link protocol in Spread.
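To illustrate how link weights alone can select between a fanout and a tree overlay, the sketch below computes least-weight routes from a source. This section does not show how Spread computes its routes, so Dijkstra's algorithm is used here purely as a stand-in; the site names S1 to S4 and all weights are hypothetical.

```python
import heapq
from itertools import combinations


def routing_tree(links, source):
    """Return {site: parent} along least-weight paths from `source`.

    `links` maps an unordered pair (a, b) to a positive link weight.
    """
    graph = {}
    for (a, b), w in links.items():
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))

    dist = {source: 0}
    parent = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent


sites = ["S1", "S2", "S3", "S4"]

# Equal weights on every link: the source reaches every site directly (fanout).
fanout_links = {pair: 1 for pair in combinations(sites, 2)}
print(routing_tree(fanout_links, "S1"))

# Unequal (for example latency-based) weights: routes collapse into a tree.
tree_links = {
    ("S1", "S2"): 1, ("S2", "S3"): 1, ("S3", "S4"): 1,
    ("S1", "S3"): 5, ("S1", "S4"): 9, ("S2", "S4"): 9,
}
print(routing_tree(tree_links, "S1"))
```

With equal weights every site's parent is the source itself, while the second weight assignment yields the chain S1, S2, S3, S4.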
Table 1. Throughput using TCP and several
Mae East (Kbits/sec)
In tests run on each of the two networks (Fanout and
Tree), for every test, one of the four sites (Mae East, CNDS,
Rutgers, UCSB) was the source of a stream of 10000 reliable
messages of 1024 bytes. The sending application on
that site always made messages available to Spread. The
remaining five sites ran a receiving application
that computed the running time of the test at that site. The
numbers in Table 1 represent the throughput of the slowest
receiving site, measured in kilobits per second. The difference
between the fastest and slowest receiver in most of the
tests was negligible. As in any reliable multicast system, the
maximum sustained throughput is limited to the throughput
of the slowest link.
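The reported metric can be reproduced from per-site running times as in the sketch below. It assumes that one kilobit is 1000 bits (the text does not say whether 1000 or 1024 is meant), and the running times are hypothetical placeholders, not measurements from the experiments.

```python
def throughput_kbits_per_sec(num_messages, message_bytes, elapsed_seconds):
    """Throughput in kilobits per second, assuming 1 kilobit = 1000 bits."""
    return num_messages * message_bytes * 8 / elapsed_seconds / 1000.0


def slowest_receiver_throughput(elapsed_by_site, num_messages=10000, message_bytes=1024):
    """Throughput of the receiver that took longest to receive the stream."""
    slowest = max(elapsed_by_site.values())
    return throughput_kbits_per_sec(num_messages, message_bytes, slowest)


# Hypothetical running times in seconds at three receivers.
elapsed = {"R1": 64.0, "R2": 80.0, "R3": 78.5}
print(slowest_receiver_throughput(elapsed))  # 1024.0
```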
networks are each better than the other for different source
sites. When the CNDS site is the source, the fanout network
provides better throughput. This is probably because CNDS
has extremely high throughput connectivity to the Internet
and thus the first few hops do not form a bottleneck. However,
when Rutgers or UCSB is multicasting, the multicast
tree network yields much better throughput. Even though
the Mae East site is located very close to a major Internet
backbone peering point, providing better connectivity than
almost any typical server, the multicast tree network was
still 15 percent better than the fanout network.
This experiment validates the usefulness, described in
Section 2, of source based routing using the overlay net-
works. For example, while messages generated by CNDS
can be sent through a fanout configuration, messages sent
by UCSB will be sent using a tree configuration.
4.3 Link Protocols
Here we evaluate the tradeoffs of using the Hop protocol
versus using a TCP based link protocol. We started by eval-
uating the overhead latency associated with each protocol
on one link. Then, the latency on a multi-link network was
evaluated with regard to the number of links, the size of the
packets, and the load on the network. Finally, the through-
put of a link and a full network were evaluated under vary-
ing levels of additional packet loss.
All latency tests were done by an application level pro-
and then listens for a response message. A second application
runs on the other site and acts as an echo-response
server, immediately sending anything it receives back to the
sender through Spread. The sender application calculates
round-trip latency times by taking the difference between
the time it received the echo-response and the time it sent
the original message. These latency tests are repeated 30
times back to back and the minimum, average, and maximum
are reported. All results are reported as round-trip
times, which include the time to transfer the message from the
client to the Spread daemon, processing time in the daemon,
network transfer time, the receiving daemon’s processing
time and the transfer to the receiving application,
and a similar reverse path back to the sender. For the tables
and figures reporting ’ping’ results, the standard ’ping’ program
was run between the machines hosting the daemons using 1024 byte
packets. We believe the ping latencies provide us with an
effective lower bound.
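The measurement procedure described above can be sketched as a small helper. The function name and its parameters are hypothetical; `send_and_wait_for_echo` stands for whatever call sends one message through the group communication layer and blocks until the echo returns.

```python
import time


def measure_round_trips(send_and_wait_for_echo, trials=30, clock=time.perf_counter):
    """Run back-to-back echo tests; return (min, avg, max) round-trip time in ms."""
    samples = []
    for _ in range(trials):
        start = clock()
        send_and_wait_for_echo()  # blocks until the echo-response arrives
        samples.append((clock() - start) * 1000.0)
    return min(samples), sum(samples) / len(samples), max(samples)
```

Because both timestamps are taken by the sender, the procedure does not depend on synchronized clocks between sites.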
Table 2. Link Latency (Mae East to UCSB).
Table 2 shows the single link latency for a link between
Mae East and UCSB for 1024 byte messages. Clearly the
ping latency is the best; however, both the TCP link protocol
and the Hop link protocol have minimum times very close
to ping. The Hop protocol is also very stable across all the
tests, with a spread of only 3 milliseconds between the
minimum and maximum latency, the same as ping, while TCP
produced a large spread of over 200 milliseconds.
To more realistically evaluate latency over a wide-area
network, we also constructed an overlay network consisting
of six sites in a chain. This chain is shown in Figure 1,
running from Mae East to UCSB to OSU to Rutgers to
CNDS to Hopkins. We realize this is not a practical
setup, or even an efficient chain. However, using this chain
demonstrates how the protocols interact when packets must
be forwarded many times, and how the performance of the
protocols scales with the diameter of the multicast network.
The experiments reported in Figure 6 and Figure 8 use
the chain network. The sender application is always run
from one of the ends of the chain. The receiver application
is placed on each of the other sites, and 30 to 200 latency
tests each using a 1024 byte reliable message are run. The
results of the tests are averaged and graphed. The ping line
on the graph was calculated by adding the individual ping
times from site to site along the chain.
The results in Figure 6 show how the Hop latency stays
close to the ping latency as the number of hops and distance
traveled increase, while the TCP latency is significantly
higher. This is made clearer in Figure 7, which
graphs the percentage overhead of TCP and Hop in comparison
with ping times. TCP has an overhead of between
38 and 66 percent for every number of links, while Hop has an
overhead of at most 18 percent and as little as 5 percent.
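The ping baseline and the overhead percentage used in these comparisons can be computed as in the following sketch. The overhead formula is an assumption about how the percentage was derived (relative difference from the ping baseline), and all numbers are hypothetical placeholders rather than values from the experiments.

```python
from itertools import accumulate


def cumulative_ping(per_link_ping_ms):
    """Ping baseline after 1, 2, ... links: running sum of per-link ping times."""
    return list(accumulate(per_link_ping_ms))


def overhead_percent(protocol_ms, ping_ms):
    """Latency overhead of a link protocol relative to the ping baseline."""
    return (protocol_ms - ping_ms) / ping_ms * 100.0


# Hypothetical per-link ping times in milliseconds along a five-link chain.
baseline = cumulative_ping([80, 60, 20, 10, 2])
print(baseline)                    # [80, 140, 160, 170, 172]
print(overhead_percent(220, 160))  # 37.5
```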
Figure 8 shows the same chain network with the sender
placed at Hopkins instead of Mae East. Here the improved
end-to-end latency of Hop over TCP as the network latency
increases becomes clearer. When the network ping
latency increases significantly after OSU, the TCP latency
increases even more, while Hop latency stays within a small
percentage of ping latency. Figure 9 shows how the percentage
overhead of Hop decreases substantially as the network
latency increases. When the network latency is small,
for example on the local area networks connecting CNDS
and Hopkins, the application and IPC overhead of Spread
becomes comparable with the actual network latency. In a
working Spread configuration, these local area, low latency
networks would use the Ring protocol instead of TCP or
Hop. The Ring protocol is designed for local area networks
and has excellent performance.
Next, we evaluate the latency for various packet sizes.
The results presented in Figure 10 use the same chain network
with Mae East as the sender and Hopkins as the only
receiver. In this test, reliable messages of varying sizes were
sent by the application. The larger messages are actually
sent as several packets on the physical network; however,
since each message was only sent after the previous
one was received, this does not become a throughput test.
The latency of the Hop protocol increases little
beyond what can be attributed to the size of the message,
whereas TCP shows a significant increase in
latency for packets above 1024 bytes.
Next, we evaluate the latency under load. The results
presented in Figure 11 use the same chain network. In this
test a load application using Spread was flooding the network
from Mae East with a controlled level of messages per
second. Concurrently, the latency test application measured
reliable, 1024 byte message latency between Mae East and
Hopkins. The Hop protocol has almost constant latency as
the background load increases from 0 to 400 kilobits per
second. The stability under load is attributed to the Hop
protocol’s forwarding policy, which does not delay packets
even when there is loss or other application traffic. TCP
latency shows a steady increase as the background load increases,
with a jump between 300 and 400 kilobits per second
where the latency grows to almost a second and a half.
Figure 6. Chain Latency (Mae East). Sites: UCSB, OSU, Rutgers, CNDS, Hopkins.
Figure 7. Protocol Latency Overhead (Mae East). Sites: UCSB, OSU, Rutgers, CNDS, Hopkins.
Figure 8. Chain Latency (Hopkins). Sites: Hopkins, CNDS, Rutgers, OSU, UCSB, Mae East.
Figure 9. Protocol Latency Overhead (Hopkins). Sites: CNDS, Rutgers, OSU, UCSB, Mae East.
Figure 10. Latency for different size of packets. Packet size (bytes): 1024, 2048, 4096, 10240.
Figure 11. Latency under Load. Load axis: 0 to 400.
Table 3. Link Throughput under loss (by additional loss rate).
Table 4. Network Throughput under loss (by additional loss rate).
Finally, we evaluated the Hop protocol’s behavior under
various levels of packet loss. These tests were done on
both a single link between Mae East and UCSB, with results
shown in Table 3, and on the multicast tree shown in Figure 1,
with results reported in Table 4. Packets were dropped
randomly, based on a uniform distribution, on each side
of every link. These losses
were in addition to any actual packet loss which occurred
on the network. All tests were done with a stream of 10000
reliable Spread messages of 1024 bytes by the same testing
application as was used in Section 4.2.
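Artificial loss of this kind can be injected with a few lines of code. The sketch below assumes each packet is dropped independently with a fixed probability drawn from a uniform random number; the names `LossInjector` and `filter_packets` are hypothetical and do not come from the test harness used in the experiments.

```python
import random


class LossInjector:
    """Drops each packet independently with probability `loss_rate`."""

    def __init__(self, loss_rate, seed=None):
        if not 0.0 <= loss_rate <= 1.0:
            raise ValueError("loss_rate must be in [0, 1]")
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)

    def should_drop(self):
        # random() is uniform on [0.0, 1.0), so this is True with
        # probability loss_rate.
        return self.rng.random() < self.loss_rate


def filter_packets(packets, injector):
    """Return the packets that survive the artificial loss."""
    return [p for p in packets if not injector.should_drop()]


survivors = filter_packets(range(10000), LossInjector(0.20, seed=1))
print(len(survivors) / 10000)  # close to 0.80
```

Placing one injector on each side of a link mirrors the setup in which loss is applied at both ends.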
In the link experiment, the throughput decreases by 3 to
10 percent more than the actual loss rate. This seems to us
quite reasonable. In the network experiment using 6 sites
and 5 links the system still maintained 63 percent throughput
even with 20 percent loss on every link. Overall, the
degradation on the whole network is less than double the
loss rate on a single link.
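As a quick check of the last claim for the 20 percent case, assuming the 63 percent figure is relative to the throughput measured without additional loss:

```latex
\[
  \text{degradation} = 100\% - 63\% = 37\% \;<\; 2 \times 20\% = 40\%
\]
```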
We believe that the performance demonstrated by the
above experiments validates the viability and usefulness of
the Hop protocol in real-life system settings.
5 Related Work
Group communication systems in the LAN environment
have a well developed history beginning with ISIS, and
more recent systems such as Transis, Horus, Totem,
and RMP. These systems explored several different
models of Group Communication such as Virtual Synchrony
and Extended Virtual Synchrony. Newer
work in this area focuses on scaling group membership to
wide-area networks.
A few of these systems have added some type of support
for either wide-area group communication or multi-LAN
group communication. The Hybrid paper discusses the
dynamic and costly wide-area setting. The Hybrid system
has each group communication application switch between
a token based and symmetric vector based ordering algorithm
depending on the communication latency between the
applications. While their system provides a total order using
whichever protocol is more efficient for each participant,
Hybrid does not handle partitions in the network, or
provide support for orderings other than total.
The Multiple-Ring Totem protocol allows several
rings to be interconnected by gateway nodes that forward
packets to other rings. This system provides a substantial
performance boost compared to a single ring on large
LAN environments, but keeps the assumptions of low loss
rates and latency and a fairly similar bandwidth between all
nodes, which limit its applicability to wide-area networks. The
latency of the Totem multiple ring protocol has been theo-
retically analyzed in .
The Transis wide-area protocols Pivots and Xports by
Nabil Huleihel provide ordering and delivery guarantees
in a partitionable environment. Both protocols are
based on a hierarchical model of the network, where each
level of the hierarchy is partitioned into small sets of nearby
processes, and each set has a static representative who is
also a member of the next higher level of the hierarchy.
ifies that subtree. The Congress work  approaches the
problem of providing wide-area membership services sepa-
rately from actual multicast and ordering services, and pro-
vides a general membership service that can provide differ-
ent semantic guarantees.
IP-Multicast is being actively developed to support Internet-wide
unreliable multicasting and to scale to millions of
users. Many reliable multicast protocols which use IP-Multicast
have been developed, such as SRM, RMTP,
Local Group Concept (LGC), and HRMP.
The development of reliable protocols over IP-Multicast
has focused on solving scalability problems such as Ack or
Nack implosion and bandwidth limits, and providing use-
applications. Several of these protocols have developed
localized loss recovery protocols. SRM uses randomized
timeouts with backoff to request missed data and send re-
localize the recovery by using the TTL field of IP-Multicast
to request a lost packet from nearer nodes first, and then
expand the request if no one close has it. Several other
variations in localized recovery such as using administra-
tive scope and separate multicast groups for recovery, are
discussed in .
Other reliable multicast protocols like LGC use the dis-
who is the root of some subtree of the main tree. RMTP
also uses “Designated Receivers” (DR) who act as the head
of a virtual subtree to localize recovery of lost packets and
provides reliable transport of a file from one sender to multiple
receivers located around the world. RMTP is based
on the IP-Multicast model, but created user-level multicast
through UDP and modified mrouted software. RMTP did
not examine the tradeoffs in link protocols discussed in this
paper because it handles reliability over the entire tree, with
the DRs only acting as aggregators of global protocol information.
Since Spread already has additional information for
membership and ordering purposes about the exact dissemination
of messages and where copies are buffered, Spread
can use more precise local recovery to get the packet from
the nearest source.
HRMP is a reliable multicast protocol which provides
efficient local reliability based on a ring, while using
standard tree-based protocols such as ack trees to provide
reliability between rings. This work theoretically analyzes
be better than protocols utilizing only a ring or a tree. Our
work here focuses on the best protocols to use for reliability
on a wide-area multicast tree and thus is orthogonal to
which local site protocol to use. The Spread system actually
uses a ring protocol for local area networks for many of the
same reasons HRMP does.
We presented an architecture for wide area group com-
munications that was implemented in the Spread system.
This architecture takes advantage of the ability to construct
user level overlay networks to efficiently disseminate reli-
able messages to process groups.
We described Hop, an efficient point-to-point reliable
transport protocol for connecting sites on a wide area multicast
tree. Experiments conducted over the Internet validated
the low latency and high stability of the Hop protocol under
various load and loss conditions.
We would like to thank Michal Miskin-Amir, one of the
creators of Spread. We thank Jithesh Parameswaran for
programming the optimal routing computations in Spread.
We also wish to thank Nabil Adam, Richard Holowczak,
Michael Melliar-Smith, Louise Moser, Alec Peterson, and
Robert Stanton for allowing us to use their systems in our
Communications of the ACM, 39(4), April 1996.
 D. Agarwal, L. E. Moser, P. M. Melliar-Smith, and R. K.
Budhia. The totem multiple-ring ordering and topology
maintenance protocol. ACM Transactions on Computer Sys-
tems, 16(2):93–132, May 1998.
 Y. Amir, D. Dolev, S. Kramer, and D. Malki.
A communication subsystem for high-availability. In Di-
gest of Papers, The 22nd International Symposium on Fault-
Tolerant Computing Systems, pages 76–84, 1992.
 Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. Agarwal, and
P. Ciarfella. The totem single-ring ordering and member-
ACM Transactions on Computer Systems,
13(4):311–342, November 1995.
 T. Anker, G. V. Chockler, D. Dolev, and I. Keidar. Scal-
able group membership services for novel applications. In
M. Mavronicolas, M. Merritt, and N. Shavit, editors, Pro-
ceedings of the workshop on Networks in Distributed Com-
puting, DIMACS Series in Discrete Mathematics and Theo-
retical Computer Science, 1998.
 K. P. Birman and T. Joseph. Exploiting virtual synchrony in
distributed systems. In 11th Annual Symposium on Operat-
ing Systems Principles, pages 123–138, November 1987.
 K. P. Birman and R. V. Renesse. Reliable Distributed Com-
puting with the Isis Toolkit. IEEE Computer Society Press,
 A. Fekete, N. Lynch, and A. Shvartsman. Specifying and
using a partitionable group communication service. In Proceedings
of the 16th annual ACM Symposium on Principles
of Distributed Computing, pages 53–62, August 1997.
 R. W. Floyd. Algorithm 97 (shortest path). Communications
of the ACM, 5(6):345, 1962.
 S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang.
A reliable multicast framework for light-weight sessions and
application level framing. IEEE/ACM Transactions on Net-
working, 5(6):784–803, December 1997.
Special issue on group communications systems.
 L. Gu and J. Garcia-Luna-Aceves. New error recovery struc-
tures for reliable networking. In Proceedings of the Sixth In-
ternational Conference on Computer Communications and
Networking, September 1997.
 K. Guo and L. Rodrigues. Dynamic light-weight groups.
In Proceedings of 17th International Conference on Dis-
tributed Computing Systems, pages 33–42, May 1997.
 M. Hofmann. A generic concept for large-scale multicast. In
B. Plattner, editor, International Zurich Seminar on Digital
Communications, number 1044 in Lecture Notes in Computer
Science, pages 95–106, February 1996.
 N. Huleihel. Efficient ordering of messages in wide area networks.
Master’s thesis, Institute of Computer Science, The
Hebrew University of Jerusalem, Jerusalem, Israel, 1996.
 J. Lin and S. Paul. RMTP: A reliable multicast transport protocol.
In Proceedings of IEEE Infocom, pages 1414–1424,
 L. E. Moser, Y. Amir, P. M. Melliar-Smith, and D. A. Agar-
wal. Extended virtual synchrony.
IEEE 14th International Conference on Distributed Com-
puting Systems, pages 56–65, June 1994.
 J. Nonnenmacher and E. W. Biersack. Performance modelling
of reliable multicast transmission. In Proceedings of
INFOCOM 97, April 1997.
 R. V. Renesse, K. Birman, and S. Maffeis. Horus: A flexible
group communication system. Communications of the ACM,
39(4):76–83, April 1996.
 L. E. Rodrigues, H. Fonseca, and P. Verissimo. A dynamic
hybrid protocol for total order in large-scale systems. In
Proceedings of the 16th International Conference on Dis-
tributed Computing Systems, May 1996. Selected portions
 E. Thomopoulos, L. E. Moser, and P. M. Melliar-Smith.
Analyzing the latency of the totem multicast protocols. In
Proceedings of the Sixth International Conference on Computer
Communications and Networks, pages 42–50, Septem-
 B. Whetten, T. Montgomery, and S. Kaplan. A high per-
formance totally ordered multicast protocol. In Theory and
Practice in Distributed Systems, International Workshop,
Lecture Notes in Computer Science, page 938, September
In Proceedings of the