Consistent Hashing and Random Trees:
Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
David Karger Eric Lehman Tom Leighton Matthew Levine Daniel Lewin
Rina Panigrahy
We describe a family of caching protocols for distributed networks
that can be used to decrease or eliminate the occurrence of hot spots
in the network. Our protocols are particularly designed for use with
very large networks such as the Internet, where delays caused by
hot spots can be severe, and where it is not feasible for every server
to have complete information about the current state of the entire
network. The protocols are easy to implement using existing net-
work protocols such as TCP/IP, and require very little overhead.
The protocols work with local control, make efficient use of exist-
ing resources, and scale gracefully as the network grows.
Our caching protocols are based on a special kind of hashing
that we call consistent hashing. Roughly speaking, a consistent
hash function is one which changes minimally as the range of the
function changes. Through the development of good consistent
hash functions, we are able to develop caching protocols which do
not require users to have a current or even consistent view of the
network. We believe that consistent hash functions may eventually
prove to be useful in other applications such as distributed name
servers and/or quorum systems.
In this paper, we describe caching protocols for distributed net-
works that can be used to decrease or eliminate the occurrences
of “hot spots”. Hot spots occur any time a large number of clients
wish to simultaneously access data from a single server. If the site
is not provisioned to deal with all of these clients simultaneously,
service may be degraded or lost.
Many of us have experienced the hot spot phenomenon in the
context of the Web. A Web site can suddenly become extremely
popular and receive far more requests in a relatively short time than
This research was supported in part by DARPA contracts N00014-95-
1-1246 and DABT63-95-C-0009, Army Contract DAAH04-95-1-0607, and
NSF contract CCR-9624239
Laboratory for Computer Science, MIT, Cambridge, MA 02139.
email: {karger, e_lehman, ftl, mslevine, danl, rinap}@theory.lcs.mit.edu
A full version of this paper is available at:
http://theory.lcs.mit.edu/~{karger, e_lehman, ftl, mslevine, danl, rinap}
Department of Mathematics, MIT, Cambridge, MA 02139
it was originally configured to handle. In fact, a site may receive so
many requests that it becomes “swamped,” which typically renders
it unusable. Besides making the one site inaccessible, heavy traffic
destined to one location can congest the network near it, interfering
with traffic at nearby sites.
As use of the Web has increased, so has the occurrence and
impact of hot spots. Recent famous examples of hot spots on the
Web include the JPL site after the Shoemaker-Levy 9 comet struck
Jupiter, an IBM site during the Deep Blue-Kasparov chess tour-
nament, and several political sites on the night of the election. In
some of these cases, users were denied access to a site for hours or
even days. Other examples include sites identified as “Web-site-of-
the-day” and sites that provide new versions of popular software.
Our work was originally motivated by the problem of hot spots
on the World Wide Web. We believe the tools we develop may be
relevant to many client-server models, because centralized servers
on the Internet such as Domain Name servers, Multicast servers,
and Content Label servers are also susceptible to hot spots.
Several approaches to overcoming the hot spots have been pro-
posed. Most use some kind of replication strategy to store copies of
hot pages throughout the Internet; this spreads the work of serving
a hot page across several servers. In one approach, already in wide
use, several clients share a proxy cache. All user requests are for-
warded through the proxy, which tries to keep copies of frequently
requested pages. It tries to satisfy requests with a cached copy; fail-
ing this, it forwards the request to the home server. The dilemma
in this scheme is that there is more benefit if more users share the
same cache, but then the cache itself is liable to get swamped.
Malpani et al. [6] work around this problem by making a group
of caches function as one. A user's request for a page is directed
to an arbitrary cache. If the page is stored there, it is returned to
the user. Otherwise, the cache forwards the request to all other
caches via a special protocol called “IP Multicast”. If the page is
cached nowhere, the request is forwarded to the home site of the
page. The disadvantage of this technique is that as the number
of participating caches grows, even with the use of multicast, the
number of messages between caches can become unmanageable.
A tool that we develop in this paper, consistent hashing, gives a
way to implement such a distributed cache without requiring that
the caches communicate all the time. We discuss this in Section 4.
Chankhunthod et al. [1] developed the Harvest Cache, a more
scalable approach using a tree of caches. A user obtains a page by
asking a nearby leaf cache. If neither this cache nor its siblings have
the page, the request is forwarded to the cache's parent. If a page
is stored by no cache in the tree, the request eventually reaches
the root and is forwarded to the home site of the page. A cache
retains a copy of any page it obtains for some time. The advantage
of a cache tree is that a cache receives page requests only from its
children (and siblings), ensuring that not too many requests arrive
simultaneously. Thus, many requests for a page in a short period
of time will only cause one request to the home server of the page,
and won't overload the caches either. A disadvantage, at least in
theory, is that the same tree is used for all pages, meaning that the
root receives at least one request for every distinct page requested
of the entire cache tree. This can swamp the root if the number of
distinct page requests grows too large, meaning that this scheme
also suffers from potential scaling problems.
Plaxton and Rajaraman [9] show how to balance the load
among all caches by using randomization and hashing. In partic-
ular, they use a hierarchy of progressively larger sets of “virtual
cache sites” for each page and use a random hash function to as-
sign responsibility for each virtual site to an actual cache in the
network. Clients send a request to a random element in each set in
the hierarchy. Caches assigned to a given set copy the page to some
members of the next, larger set when they discover that their load
is too heavy. This gives fast responses even for popular pages, be-
cause the largest set that has the page is not overloaded. It also gives
good load balancing, because a machine in a small (thus loaded) set
for one page is likely to be in a large (thus unloaded) set for another.
Plaxton and Rajaraman's technique is also fault tolerant.
The Plaxton/Rajaraman algorithm has drawbacks, however. For
example, since their algorithm sends a copy of each page request
to a random element in every set, the small sets for a popular page
are guaranteed to be swamped. In fact, the algorithm uses swamp-
ing as a feature since swamping is used to trigger replication. This
works well in their model of a synchronous parallel system, where
a swamped processor is assumed to receive a subset of the incom-
ing messages, but otherwise continues to function normally. On the
Internet, however, swamping has much more serious consequences.
Swamped machines cannot be relied upon to recover quickly and
may even crash. Moreover, the intentional swamping of large num-
bers of random machines could well be viewed unfavorably by the
owners of those machines. The Plaxton/Rajaraman algorithm also
requires that all communications be synchronous and/or that mes-
sages have priorities, and that the set of caches available be fixed
and known to all users.
Here, we describe two tools for data replication and use them to
give a caching algorithm that overcomes the drawbacks of the pre-
ceding approaches and has several additional, desirable properties.
Our first tool, random cache trees, combines aspects of the
structures used by Chankhunthod et al. and Plaxton/Rajaraman.
Like Chankhunthod et al., we use a tree of caches to coalesce re-
quests. Like Plaxton and Rajaraman, we balance load by using
a different tree for each page and assigning tree nodes to caches
via a random hash function. By combining the best features of
Chankhunthod et al. and Plaxton/Rajaraman with our own meth-
ods, we prevent any server from becoming swamped with high
probability, a property not possessed by either Chankhunthod et
al. or Plaxton/Rajaraman. In addition, our protocol shows how to
minimize memory requirements (without significantly increasing
cache miss rates) by only caching pages that have been requested a
sufficient number of times.
We believe that the extra delay introduced by a tree of caches
should be quite small in practice. The time to request a page is
multiplied by the tree depth. However, the page request typically
takes so little time that the extra delay is not great. The return of
a page can be pipelined; a cache need not wait until it receives a
whole page before sending data to its child in the tree. Therefore,
the return of a page also takes only slightly longer. Altogether, the
added delay seen by a user is small.
Our second tool is a new hashing scheme we call consistent
hashing. This hashing scheme differs substantially from that used
in Plaxton/Rajaraman and other practical systems. Typical hashing
based schemes do a good job of spreading load through a known,
fixed collection of servers. The Internet, however, does not have
a fixed collection of machines. Instead, machines come and go
as they crash or are brought into the network. Even worse, the
information about what machines are functional propagates slowly
through the network, so that clients may have incompatible “views”
of which machines are available to replicate data. This makes stan-
dard hashing useless since it relies on clients agreeing on which
caches are responsible for serving a particular page. For example,
Feeley et al. [3] implement a distributed global shared memory sys-
tem for a network of workstations that uses a hash table distributed
among the machines to resolve references. Each time a new ma-
chine joins the network, they require a central server to redistribute
a completely updated hash table to all the machines.
Consistent hashing may help solve such problems. Like most
hashing schemes, consistent hashing assigns a set of items to buck-
ets so that each bucket receives roughly the same number of items.
Unlike standard hashing schemes, a small change in the bucket set
does not induce a total remapping of items to buckets. In addi-
tion, hashing items into slightly different sets of buckets gives only
slightly different assignments of items to buckets. We apply con-
sistent hashing to our tree-of-caches scheme, and show how this
makes the scheme work well even if each client is aware of only
a constant fraction of all the caching machines. In [5], Litwin et al.
propose a hash function that allows buckets to be added one at a
time sequentially. However, our hash function allows the buckets to
be added in an arbitrary order. Another scheme that we can improve
on is given by Devine [2]. In addition, we believe that consistent
hashing will be useful in other applications (such as quorum sys-
tems [7] [8] or distributed name servers) where multiple machines
with different views of the network must agree on a common stor-
age location for an object without communication.
In Section 2 we describe our model of the Web and the hot spot
problem. Our model is necessarily simplistic, but is rich enough
to develop and analyze protocols that we believe may be useful in
practice. In Section 3, we describe our random tree method and use
it in a caching protocol that effectively eliminates hot spots under a
simplified model. Independent of Section 3, in Section 4 we present
our consistent hashing method and use it to solve hot spots under a
different simplified model involving inconsistent views.
In Section 5 we show how our two techniques can be effectively
combined. In Section 6 we propose a simple delay model that cap-
tures hierarchical clustering of machines on the Internet. We show
that our protocol can be easily extended to work in this more real-
istic delay model. In Sections 7 and 8 we consider faults and the
behavior of the protocol over time, respectively. In Section 9 we
discuss some extensions and open problems.
In several places we make use of hash functions that map objects
into a range. For clarity we assume that these functions map objects
in a truly random fashion, i.e. uniformly and independently. In
practice, hash functions with limited independence are more plau-
sible since they economize on space and randomness. We have
proven all theorems of this paper with only limited independence
using methods similar to those in [11]. However, in this extended
abstract we only state the degree of independence required for re-
sults to hold. Proofs assuming limited independence will appear in
the full version of this paper.
This section presents our model of the Web and the hot spot prob-
lem.
We classify computers on the Web into three categories. All
requests for Web pages are initiated by browsers. The permanent
homes of Web pages are servers. Caches are extra machines which
we use to protect servers from the barrage of browser requests.
Throughout the paper, the set of caches is
and the number of
caches is
.
Each server is home to a xed set of pages. Caches are also
able to store a number of pages, but this set may change over time
as dictated by a caching protocol. We generally assume that the
content of each page is unchanging, though Section 9 contains a
discussion of this issue. The set of all pages is denoted
.
Any machine can send a message directly to any other with
the restriction that a machine may not be aware of the existence
of all caches; we require only that each machine is aware of a
fraction of the caches for some constant . The two typical types
of messages are requests for pages and the pages themselves. A
machine which receives too many messages too quickly ceases to
function properly and is said to be “swamped”.
measures the time for a message from machine
to arrive at machine . We denote this quantity .In
practice, of course, delays on the Internet are not so simply char-
acterized. The value of
should be regarded as a “best guess” that
we optimize on for lack of better information; the correctness of
a protocol should not depend on values of
(which could actually
measure anything such as throughput, price of connection or con-
gestion) being exactly accurate. Note that we do not make latency
a function of message size; this issue is discussed in Section 3.2.1.
All cache and server behavior and some browser behavior is
specified in our protocol. In particular, the protocol specifies how
caches and servers respond to page requests and which pages are
stored in a cache. The protocol also specifies the cache or server
to which a browser sends each page request. All control must be
local; the behavior of a machine can depend only on messages it
receives.
An adversary decides which pages are requested by browsers.
However, the adversary cannot see random values generated in our
protocol and cannot adapt his requests based on observed delays
in obtaining pages. We consider two models. First, we consider a
static model in which a single “batch” of requests is processed, and
require that the number of page requests be at most
where
is a constant and is the number of caches. We then consider
a temporal model in which the adversary may initiate new requests
for pages at rate at most
; that is, in any time interval ,he
may initiate at most
requests.
The “hot spot problem” is to satisfy all browser page requests while
ensuring that with high probability no cache or server is swamped.
The phrase “with high probability” means “with probability at least
”, where is a confidence parameter used throughout the
paper.
While our basic requirement is to prevent swamping, we also
have two additional objectives. The first is to minimize cache mem-
ory requirements. A protocol should work well without requiring
any cache to store a large number of pages. A second objective is,
naturally, to minimize the delay a browser experiences in obtaining
a page.
In this section we introduce our first tool, random trees. To sim-
plify the presentation, we give a simple caching protocol that would
work well in a simpler world. In particular, we make the following
simplifications to the model:
1. All machines know about all caches.
2.
for all .
3. All requests are made at the same time.
This restricted model is “static” in the sense that there is only
one batch of requests; we need not consider the long-term stability
of the network.
Under these restrictions we show a protocol that has good be-
havior. That is, with high probability no machine is swamped. We
achieve a total delay of
and prove that it is optimal. We
use total cache space which is a fraction of the number of requests,
and evenly divided among the caches. In subsequent sections we
will show how to extend the protocol so as to preserve the good
behavior without the simplifying assumptions.
The basic idea of our protocol is an extension of the “tree of
caches” approach discussed in the introduction. We use this tree to
ensure that no cache has many “children” asking it for a particu-
lar page. As discussed in the introduction, levels near the root get
many requests for a page even if the page is relatively unpopular,
so being the root for many pages causes swamping. Our technique,
similar to Plaxton/Rajaraman's, is to use a different, randomly gen-
erated tree for each page. This ensures that no machine is near the
root for many pages, thus providing good load balancing. Note that
we cannot make use of the analysis given by Plaxton/Rajaraman,
because our main concern is to prevent swamping, whereas they
allow machines to be swamped.
In Section 3.1 below, we define our protocol precisely. In Sec-
tion 3.2, we analyze the protocol, bounding the load on any cache,
the storage each cache uses, and the delay a browser experiences
before getting the page.
We associate a rooted -ary tree, called an abstract tree, with each
page. We use the term nodes only in reference to the nodes of these
abstract trees. The number of nodes in each tree is equal to the
number of caches, and the tree is as balanced as possible (so all
levels but the bottom are full). We refer to nodes of the tree by
their rank in breadth-first search order. The protocol is described
as running on these abstract trees; to support this, all requests for
pages take the form of a 4-tuple consisting of the identity of the re-
quester, the name of the desired page, a sequence of nodes through
which the request should be directed, and a sequence of caches that
should act as those nodes. To determine the latter sequence, that is,
which cache actually does the work for a given node, the nodes are
mapped to machines. The root of a tree is always mapped to the
server for the page. All the other nodes are mapped to the caches
by a hash function
, which must be dis-
tributed to all browsers and caches. In order not to create copies of
pages for which there are few requests, we have another parameter,
, for how many requests a cache must see before it bothers to store
a copy of the page.
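To make the abstract-tree bookkeeping concrete, the following sketch (our own illustration, not part of the protocol specification) numbers the nodes of a balanced d-ary tree in breadth-first order, walks a random leaf up to the root, and maps each node to a cache with a page-specific hash. The branching factor d, the SHA-256 stand-in for the random hash function, and all names are assumptions made for exposition.

```python
import hashlib
import random

def node_to_cache(page, node, caches):
    # Illustrative stand-in for the random hash function mapping a
    # (page, abstract-node) pair to one of the caches.
    digest = hashlib.sha256(f"{page}:{node}".encode()).digest()
    return caches[int.from_bytes(digest[:8], "big") % len(caches)]

def random_leaf_to_root_path(page, d, caches, home_server):
    """Nodes of the page's abstract d-ary tree are numbered 0..C-1 in
    breadth-first order (0 is the root), so node v's parent is (v-1)//d.
    Returns the node ranks on a random leaf-to-root path and the machines
    chosen to act as those nodes."""
    c = len(caches)
    leaves = [v for v in range(c) if d * v + 1 >= c]   # nodes with no children
    path = [random.choice(leaves)]
    while path[-1] != 0:
        path.append((path[-1] - 1) // d)               # walk up to the root
    # The root is always played by the page's home server; other nodes by caches.
    machines = [node_to_cache(page, v, caches) for v in path[:-1]] + [home_server]
    return path, machines

path, machines = random_leaf_to_root_path("/index.html", 3,
                                           [f"cache{i}" for i in range(100)],
                                           "origin.example.com")
```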
Now, given a hash function
, and parameters and , our pro-
tocol is as follows:
Browser When a browser wants a page, it picks a random leaf to
root path, maps the nodes to machines with
, and asks the
leaf node for the page. The request includes the name of the
browser, the name of the page, the path, and the result of the
mapping.
Cache When a cache receives a request, it first checks to see if it
is caching a copy of the page or is in the process of getting
one to cache. If so, it returns the page to the requester (after it
gets its copy, if necessary). Otherwise it increments a counter
for the page and the node it is acting as, and asks the next
machine on the path for the page. If the counter reaches
, it
caches a copy of the page. In either case the cache passes the
page on to the requester when it is obtained.
Server When a server receives a request, it sends the requester a
copy of the page.
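The cache's rule can be summarized in a few lines of code. The sketch below is a minimal, single-threaded rendering of the step just described; the names (such as `q` and `fetch`) and the omission of request coalescing for in-flight fetches are our simplifications, not part of the protocol as stated.

```python
from collections import defaultdict

class Cache:
    def __init__(self, name, q):
        self.name = name
        self.q = q                         # requests seen before caching a page
        self.store = {}                    # page -> contents
        self.counters = defaultdict(int)   # (page, node rank) -> requests seen

    def handle(self, page, path, machines, fetch):
        """`path`/`machines` are the remaining node ranks on the leaf-to-root
        path and the machines acting as them (this cache plays path[0]);
        `fetch(page, path, machines)` asks the next machine on the path."""
        if page in self.store:
            return self.store[page]        # cached copy: answer immediately
        self.counters[(page, path[0])] += 1
        contents = fetch(page, path[1:], machines[1:])   # forward up the tree
        if self.counters[(page, path[0])] >= self.q:
            self.store[page] = contents    # requested often enough: keep a copy
        return contents                    # pass the page back down to the requester
```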
The analysis is broken into three parts. We begin by showing that
the latency in processing a request is likely to be small, under the
assumption that no server is swamped. We then show that no ma-
chine is likely to be swamped. We conclude by showing that no
cache need store too many pages for the protocol to work properly.
The analysis of swamping runs much the same way, except that
the “weights” on our abstract nodes are now the number of requests
arriving at those nodes. As above, the number of requests that hit a
machine is bounded by the weight of nodes mapped to it.
Under our protocol, the delay a browser experiences in obtaining
a page is determined by the height of the tree. If a request is for-
warded from a leaf to the root, the latency is twice the length of
the path,
. If the request is satisfied with a cached copy,
the latency is only less. If a request stops at a cache that is wait-
ing for a cache copy, the latency is still less since a request has
already started up the tree. Note that
can probably be made large
in practice, so this latency will be quite small.
Note that in practice, the time required to obtain a large page is
not multiplied by the number of steps in a path over which it travels.
The reason is that the page can be transmitted along the path in
a pipelined fashion. A cache in the middle of the path can start
sending data to the next cache as soon as it receives some; it need
not wait to receive the whole page. This means that although this
protocol will increase the delay in getting small pages, the overhead
for large pages is negligible. The existence of tree schemes, like the
Harvest Cache, suggests that such a delay is acceptable in practice.
Our bound is optimal (up to constant factors) for any protocol
that forbids swamping. To see this, consider making
requests
for a single page. Look at the graph with nodes corresponding to
machines and edges corresponding to links over which the page
is sent. Small latency implies that this graph has small diameter,
which implies that some node must have high degree, which im-
plies swamping.
The intuition behind our analysis is the following. First we analyze
the number of requests directed to the abstract tree nodes of various
pages. These give “weights” to the tree nodes. We then analyze the
outcome when the tree nodes are mapped by a hash function onto
the actual caching machines: a machine gets as many requests as
the total weight of nodes mapped to it. To bound the projected
weight, we first give a bound for the case where each node is as-
signed to a random machine. This is a weighted version of the
familiar balls-in-bins type of analysis. Our analysis gives a bound
with an exponential tail. We can therefore argue as in [11] that it ap-
plies even when the balls are assigned to bins only
-
way independently. This can be achieved by using a -universal
hash function to map the abstract tree nodes to machines.
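As one concrete way to realize such a function, the sketch below draws a hash from the textbook k-universal family given by a random polynomial of degree k-1 over a prime field; the parameter names and the particular prime are our choices, not the paper's.

```python
import random

def make_k_universal_hash(k, num_bins, prime=(1 << 61) - 1):
    """Return a hash function from a k-universal family: evaluate a random
    degree-(k-1) polynomial over GF(prime), then reduce modulo num_bins."""
    coeffs = [random.randrange(prime) for _ in range(k)]
    def h(x):
        value = 0
        for a in reversed(coeffs):     # Horner's rule
            value = (value * (x % prime) + a) % prime
        return value % num_bins
    return h

# A hash with logarithmic independence, mapping abstract tree nodes to 100 caches.
h = make_k_universal_hash(k=16, num_bins=100)
print(h(7))   # in the protocol the key would encode both the page and the node rank
```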
We will now analyze our protocol under the simplified model.
In this “static” analysis we assume for now that caches have enough
space that they never have to evict pages; this means that if a cache
has already made
requests for a page it will not make another
request for the same page. In Theorem 3.1 we provide high proba-
bility bounds on the number of requests a cache gets, assuming that
all the outputs of the function
are independent and random. The-
orem 3.4 extends our high probability analysis to the case when
is a -way independent function. In particular we show that it suf-
fices to have
logarithmic in the system parameters to achieve the
same high probability bounds as with full independence.
Theorem 3.1 If is chosen uniformly and at random from the
space of functions
then with probability at least
,where is a parameter, the number of requests a given
cache gets is no more than
Note that is the average number of requests per cache
since each browser request could give rise to
requests up the
trees. The
term arises because at the leaf nodes of a page's
tree some cache could occur times (balls-in-bins) and the
adversary could choose to devote all
requests to that page. We
prove the above Theorem in the rest of the section.
We split the analysis into two parts. First we analyze the re-
quests to a cache due to its presence in the leaf nodes of the ab-
stract trees and then analyze the requests due to its presence at the
internal nodes and then add them up.
Due to space limitations, we give a proof that only applies when
. Its extension to small is straightforward but long.
Observe that the requests for each page are being mapped randomly
onto the leaf nodes of its abstract tree. And then these leaf nodes
are mapped randomly onto the set of caches. Look at the collection of
all the leaf nodes and the number of requests (weight) associated
with each one of them. The variance among the “weights” of the
leaf nodes is maximized when all the requests are made for one
page. This is also the case which maximizes the number of leaf
node requests on a cache.
Each page's tree has about
leaf nodes. Since a
machine
has a chance of occurring at a particular leaf node,
with probability
it will occur in leaf nodes. In
fact, since there are at most requests, will occur
times in all those requested pages' trees with probability .
Given an assignment of machines to leaf nodes so that occurs
times in each tree, the expected number of requests
gets is which is . Also, once the as-
signment of machine to leaf nodes is fixed, the number of requests
gets is a sum of independent Bernoulli variables. So by Cher-
noff bounds
gets requests with probability
. So we conclude that gets with
probability at least
. Replacing by and assum-
ing
we can say that the same bound holds with probability
. It is easy to extend this proof so that the bound holds
even for
.
Again we think of the protocol as first running on the abstract trees.
Now no abstract internal node gets more than
requests because
each child node gives out at most
requests for a page. Consider
any arbitrary arrangement of paths for all the
requests up their
respective trees. Since there are only
requests in all we can bound
the number of abstract nodes that get
requests. In fact we will
bound the number of abstract nodes over all trees which receive
between
and requests where .Let
denote the number of abstract nodes that receive between
and requests. Let be the number of requests for page .
Then
. Since each of the requests gives rise to at
most
requests up the trees, the total number of requests is
no more than
.So,
(1)
Lemma 3.2 The total number of internal nodes which receive at
least requests is at most if
Proof (sketch): Look at the tree induced by the request paths, con-
tract out degree 1 nodes, and count internal nodes.
For there can clearly be no more than requests.
The preceding lemma tells us that
, the number of abstract nodes
that receive between
and requests, is at most except for
.For , will be at most . Now the probabil-
ity that machine
assumes a given one of these nodes is .
Since assignments of nodes to machines are independent the prob-
ability that a machine
receives more than of these nodes is at
most . In order for the right hand side
to be as small as we must have .
Note that the latter term will be present only if .So
is with probability at least .
So with probability at least the total number of
requests received by
due to internal nodes will be of the order of
By combining the high probability bounds for internal and leaf
nodes, we can say that a machine gets
requests with probability at least . Replacing by
and ignoring in comparison with we get
Theorem 3.1.
In this section we
show that the high probability bound we have proven for the num-
ber of requests received by a machine
is tight.
Lemma 3.3 There exists a distribution of
requests to pages
so that a given machine
gets
requests with probability at least .
Proof: Full paper.
We now extend our high
probability analysis to functions
that are chosen at random from
a
-universal hash family.
Theorem 3.4 If
is chosen at random from a -universal hash
family then withprobability at least
a given cache receives
no more than
requests.
Proof: The full proof is deferred to the final version of the paper.
This result does not follow immediately from the results of [11],
but involves a similar argument.
Setting we get the following corollary.
Corollary 3.5 The high probability bound proved in theorem 3.1
for the number of requests a cache gets holds even if
is selected
from a
-universal hash family.
In fact, this can be shown to be true for all the bounds that we will
prove later, i.e., it suffices
to be logarithmic in the system size.
In this section, we discuss the amount of storage each cache must
have in order to make our protocol work. The amount of storage
required at a cache is simply the number of pages for which it re-
ceives more than
requests.
Lemma 3.6 The total number of cached pages, over all machines,
is
with probability at least .A
given cache
has cached copies with high proba-
bility.
Proof (sketch): The analysis is very similar to that in proof of The-
orem 3.1. We again play the protocol on the abstract trees. Since a
page is cached only if it requested
times, we assign each abstract
node a weight of one if it gets more than
requests and zero other-
wise. These abstract nodes are then mapped randomly onto the set
of caches. We can bound the total weight received by a particular
cache, which is exactly the number of pages it caches.
In this section we define a new hashing technique called consis-
tent hashing. We motivate this technique by reference to a simple
scheme for data replication onthe Internet. Consider a single server
that has a large number of objects that other clients might want to
access. It is natural to introduce a layer of caches between the
clients and the server in order to reduce the load on the server. In
such a scheme, the objects should be distributed across the caches,
so that each is responsible for a roughly equal share. In addition,
clients need to know which cache to query for a specific object.
The obvious approach is hashing. The server can use a hash func-
tion that evenly distributes the objects across the caches. Clients
can use the hash function to discover which cache stores an object.
Consider now what happens when the set of active caching ma-
chines changes, or when each client is aware of a different set of
caches. (Such situations are very plausible on the Internet.) If the
distribution was done with a classical hash function (for example,
the linear congruential function ax + b (mod p)), such in-
consistencies would be catastrophic. When the range of the hash
function (p
in the example) changed, almost every item would be
hashed to a new location. Suddenly, all cached data is useless be-
cause clients are looking for it in a different location.
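A small back-of-the-envelope computation (our own illustration) makes the point: under the classical rule that item x goes to bucket x mod n, adding a single cache moves almost every item.

```python
items = range(10_000)

def bucket(x, n):
    return x % n          # classical hashing: item x goes to bucket x mod n

unchanged = sum(bucket(x, 20) == bucket(x, 21) for x in items)
print(unchanged / len(items))   # about 0.05: roughly 95% of items move when a 21st cache appears
```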
Consistent hashing solves this problem of different “views.” We
define a view to be the set of caches of which a particular client is
aware. We assume that while views can be inconsistent, they are
substantial: each machine is aware of a constant fraction of the cur-
rently operating caches. A client uses a consistent hash function
to map an object to one of the caches in its view. We analyze and
construct hash functions with the following consistency properties.
First, there is a “smoothness” property. When a machine is added
to or removed from the set of caches, the expected fraction of ob-
jects that must be moved to a new cache is the minimum needed
to maintain a balanced load across the caches. Second, over all the
client views, the total number of different caches to which an object
is assigned is small. We call this property “spread”. Similarly, over
all the client views, the number of distinct objects assigned to a
particular cache is small. We call this property “load”.
Consistent hashing therefore solves the problems discussed
above. The “spread” property implies that even in the presence
of inconsistent views of the world, references for a given object are
directed only to a small number of caching machines. Distributing
an object to this small set of caches will ensure access for all clients,
without using a lot of storage. The “load” property implies that
no one cache is assigned an unreasonable number of objects. The
“smoothness” property implies that smooth changes in the set of
caching machines are matched by a smooth evolution in the loca-
tion of cached objects.
Since there are many ways to formalize the notion of consis-
tency as described above, we will not commit to a precise defini-
tion. Rather, in Section 4.1 we define a “ranged hash function” and
then precisely define several quantities that capture different as-
pects of “consistency”. In Section 4.2 we construct practical hash
functions which exhibit all four to some extent. In Section 4.4, we
discuss other aspects of consistent hashing which, though not ger-
mane to this paper, indicate some of the richness underlying the
theory.
In this section, we formalize and relate four notions of consistency.
Let
be the set of items and be the set of buckets. Let
be the number of items. A view is any subset of the buckets .
A ranged hash function is a function of the form
. Such a function specifies an assignment of items to buckets
for every possible view. That is,
is the bucket to which
item
is assigned in view . (We will use the notation in
place
from now on.) Since items should only be assigned
to usable buckets, we require
for every view .
A ranged hash family is a family of ranged hash functions. A
random ranged hash function is a function drawn at random from
a particular ranged hash family.
In the remainder of this section, we state and relate some rea-
sonable notions of consistency regarding ranged hash families.
Throughout, we use the following notational conventions:
is a
ranged hash family,
is a ranged hash function, is a view, is an
item, and
is a bucket.
Balance: A ranged hash family is balanced if, given a particu-
lar view a set of items, and a randomly chosen function selected
from the hash family, with high probability the fraction of items
mapped to each bucket is
.
The balance property is what is prized about standard hash
functions: they distribute items among buckets in a balanced fash-
ion.
Monotonicity: A ranged hash function is monotone if for all
views
, implies .A
ranged hash family is monotone if every ranged hash function in it
is.
This property says that if items are initially assigned to a set
of buckets and then some new buckets are added to form ,
then an item may move from an old bucket to a new bucket, but not
from one old bucket to another. This reflects one intuition about
consistency: when the set of usable buckets changes, items should
only move if necessary to preserve an even distribution.
Spread: Let
be a set of views, altogether containing
distinct buckets and each individually containing at least
buckets. For a ranged hash function and a particular item ,the
spread is the quantity . The spread of a hash
function is the maximum spread of an item. The spread of
a hash family is
if with high probability, the spread of a random
hash function from the family is
.
The idea behind spread is that there are
people, each of whom
can see at least a constant fraction (
) of the buckets that are
visible to anyone. Each person tries to assign an item
to a bucket
using a consistent hash function. The property says that across the
entire group, there are at most
different opinions about which
bucket should contain the item. Clearly, a good consistent hash
function should have low spread over all items.
Load: Define a set of
views as before. For a ranged hash
function and bucket , the load is the quantity .
The load of a hash function is the maximum load of a bucket. The
load of a hash family is
if with high probability, a randomly cho-
sen hash function has load . (Note that is the set of items
assigned to bucket in view .) The load property is similar to
spread. The same
people are back, but this time we consider a
particular bucket
instead of an item. The property says that there
are at most
distinct items that at least one person thinks be-
longs in the bucket. A good consistent hash function should also
have low load.
Our main result for consistent hashing is Theorem 4.1 which
shows the existence of an efficiently computable monotonic ranged
hash family with logarithmic spread and balance.
We now give a construction of a ranged hash family with good
properties. Suppose that we have two random functions
and .
The function
maps buckets randomly to the unit interval, and
does the same for items. is defined to be the bucket
that minimizes . In other words, is mapped to the
bucket “closest” to
. For reasons that will become apparent, we ac-
tually need to have more than one point in the unit interval associ-
ated with each bucket. Assuming that the number of buckets in the
range is always less than
, we will need points for each
bucket for some constant
. The easiest way to view this is that
each bucket is replicated times, and then maps each
replicated bucket randomly. In order to economize on the space to
represent a function in the family, and on the use of random bits,
we only demand that the functions
and map points -
way independently and uniformly to
. Note that for each point
we pick in the unit interval, we need only pick enough random bits
to distinguish the point from all other points. Thus it is unlikely
that we need more than
log(number of points) bits for each point.
Denote the above described hash family as
.
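A literal (and deliberately inefficient) rendering of this construction is sketched below: every bucket is replicated a logarithmic number of times as pseudorandom points in the unit interval, an item hashes to its own point, and it is assigned to the bucket owning the nearest point. The replication count, the SHA-256 stand-ins for the random functions, and all names are our assumptions.

```python
import hashlib
import math

def point(key):
    # Pseudorandom point in [0, 1) for a key; stands in for the random mappings
    # of replicated buckets and items to the unit interval.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bucket_points(buckets, copies):
    # Each bucket is replicated `copies` times (logarithmically many in the analysis).
    return [(point((b, j)), b) for b in buckets for j in range(copies)]

def assign(item, bpoints):
    # Assign the item to the bucket whose point is closest to the item's point.
    x = point(item)
    return min(bpoints, key=lambda pb: abs(pb[0] - x))[1]

caches = [f"cache{i}" for i in range(10)]
bpoints = bucket_points(caches, copies=max(1, math.ceil(math.log2(len(caches)))))
print(assign("/index.html", bpoints))
```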
Theorem 4.1 The ranged hash family described above has the
following properties:
1. is monotone.
2. Balance: For a fixed view , for
and , and, conditioned on the choice of ,the
assignments of items tobuckets are
-way independent.
3. Spread: If the number of views
for some constant
, and the number of items , then for , is
with probability greater than .
4. Load: If
and are as above, then for , is
with probability greater than .
Proof (sketch): Monotonicity is immediate. When a new bucket
is added, the only items that move are those that are now closest to
one of the new bucket's associated points. No items move between
old buckets. The spread and load properties follow from the obser-
vation that with high probability, a point from every view falls into
an interval of length
. Spread follows by observing that
the number of bucket points that fall in this size interval around
an item point is an upper bound on the spread of that item, since
no other bucket can be closer in any view. Standard Chernoff argu-
ments apply to this case. Load follows by a similar argument where
we count the number of item points that fall in the region “owned”
by a bucket's associated points. Balance follows from the fact that
when
points are randomly mapped to the unit interval,
each bucket is with high probability responsible for no more than
a
fraction of the interval. The key here is to count the number
of combinatorially distinct ways of assigning this large a fraction to
the points associated with a bucket. This turns out to be
polynomial in
. We then argue that with high probability none of
these possibilities could actually occur by showing that in each one
an additional bucket point is likely to fall. We deduce that the ac-
tual length must be smaller than
. All of the above proofs
can be done with only
-way independent mappings.
The following corollary is immediate and is useful in the rest
of the paper.
Corollary 4.2 With the same conditions of the previous theorem,
in any view for and .
In this section we show how the hash family just described can
be implemented efficiently. Specifically, the expected running time
for a single hash computation will be
. The expectation is over
the choice of hash function. The expected running time for adding
or deleting a bucket will be
where is an upper bound
on the total number of buckets in all views.
A simple implementation uses a balanced binary search tree
to store the correspondence between segments of the unit interval
and buckets. If there are
buckets, then there will be
intervals, so the search tree will have depth . Thus, a
single hash computation takes
time. The time for an
addition or removal of a bucket is
since we insert or
delete
points for each bucket.
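The same idea can be written with a flat sorted list instead of a balanced tree, which gives the same logarithmic lookup. The sketch below (our own, with an assumed replication count) binary-searches for the item's point and compares the two neighboring bucket points, treating the interval as a circle so that the ends wrap around.

```python
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, copies=16):
        self.copies = copies     # points per bucket
        self.points = []         # sorted list of (position, bucket)

    def _pos(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add_bucket(self, bucket):
        for j in range(self.copies):
            bisect.insort(self.points, (self._pos((bucket, j)), bucket))

    def remove_bucket(self, bucket):
        self.points = [pb for pb in self.points if pb[1] != bucket]

    def lookup(self, item):
        # Binary-search for the bucket point nearest the item's point.
        x = self._pos(item)
        i = bisect.bisect_left(self.points, (x, ""))
        n = len(self.points)
        candidates = (self.points[i % n], self.points[(i - 1) % n])
        return min(candidates,
                   key=lambda pb: min(abs(pb[0] - x), 1 - abs(pb[0] - x)))[1]

ring = ConsistentHash()
for c in ("cache0", "cache1", "cache2"):
    ring.add_bucket(c)
print(ring.lookup("/index.html"))
```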
The following trick reduces the expected running time of a hash
computation to
. The idea is to divide the interval into roughly
equal length segments, and to keep a separate search
tree for each segment. Thus, the time to compute the hash function
is the time to determine which interval
is in, plus the time
to lookup the bucket in the corresponding search tree. The first
time is always
. Since, the expected number of points in each
segment is
, the second time is in expectation.
One caveat to the above is that as the number of buckets grows,
the size of the subintervals needs to shrink. In order to deal with
this issue, we will use intervals only of length
for some .At
first we choose the largest
such that . Then,
as points are added, we bisect segments gradually so that when we
reach the next power of , we have already divided all the segments.
In this way we amortize the work of dividing search trees over all of
the additions and removals. Another point is that the search trees in
adjacent empty intervals may all need to be updated when a bucket
is added since they may all now be closest to that bucket. Since the
expected length of a run of empty intervals is small, the additional
cost is negligible. For a more complete analysis of the running time
we refer to the complete version of the paper.
In this section, we discuss some additional features of consistent
hashing which, though unnecessary for the remainder of the paper,
demonstrate some of its interesting properties.
To give insight into the monotone property, we will define a
new class of hash functions and then show that this is equivalent to
the class of monotone ranged hash functions.
A
-hash function is a hash function of the familiar form
constructed as follows. With each item ,
associate a permutation
of all the buckets . Define to
be the first bucket in the permutation
that is contained in the
view
. Note that the permutations need not be chosen uniformly
or independently.
Theorem 4.3 Every monotone ranged hash function is a -hash
function and vice versa.
Proof (sketch): For a ranged hash function
, associate item with
the permutation
.
Suppose
is the first element of an arbitrary view in this
permutation. Then
.Since
, monotonicity implies .
The equivalence stated in Theorem 4.3 allows us to reason
about monotonic ranged hash functions in terms of permutations
associated with items.
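This permutation view also suggests a direct, if naive, implementation (ours, with hypothetical names): derive a pseudorandom permutation of all buckets from the item alone and assign the item to the first bucket of that permutation present in the view. Monotonicity is then automatic, since enlarging the view can only move an item to one of the newly added buckets.

```python
import random

def item_permutation(item, buckets):
    # Pseudorandom permutation of all buckets, determined by the item alone.
    rng = random.Random(str(item))
    order = list(buckets)
    rng.shuffle(order)
    return order

def monotone_hash(item, view, buckets):
    # First bucket of the item's permutation that lies in the view.
    view = set(view)
    for b in item_permutation(item, buckets):
        if b in view:
            return b
    raise ValueError("view contains no usable bucket")

buckets = [f"cache{i}" for i in range(8)]
small_view, big_view = buckets[:4], buckets
# Adding buckets never moves an item between old buckets.
assert monotone_hash("obj", big_view, buckets) in (
    monotone_hash("obj", small_view, buckets), *buckets[4:])
```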
Universality: A ranged hash family is universal if restricting
every function in the family to a single view creates a universal
hash family.
This property is one way of requiring that a ranged hash func-
tion be well-behaved in every view. The above condition is rather
stringent; it says that if a view is fixed, items are assigned ran-
domly to the buckets in that view. This implies that in any view
,
the expected fraction of items assigned to
of the buckets is .
Using only monotonicity and this fact about the uniformity of the
assignment, we can determine the expected number of items reas-
signed when the set of usable buckets changes. This relates to the
informal notion of “smoothness”.
Theorem 4.4 Let
be a monotonic, universal ranged hash func-
tion. Let
and be views. The expected fraction of items for
which
is .
Proof (sketch): Count the number of items that move as we add
buckets from
until the view is , and then delete buckets
down to
Note that monotonicity is used only to show an upper bound on
the number of items reassigned to a new bucket; this implies that
one cannot obtain a “more consistent” universal hash function by
relaxing the monotone condition.
We have shown that every monotone ranged hash function can
be obtained by associating each item with a random permutation
of buckets. The most natural monotone consistent hash function
is obtained by choosing these permutations independently and uni-
formly at random. We denote this function by
.
Theorem 4.5 The function is monotonic and universal. For item
and bucket each of the following hold with probability at least
: and
2.
Proof: Monotonicity and universality are immediate; this leaves
spread and load. Define:
We use to denote a list of the buckets in
which are ordered as in .
First, consider spread. Recall that in a particular view, item
is assigned to the first bucket in which is also in the view.
Therefore, if every view contains one of the first
buckets in
then in every view item will be assigned to one of the first
buckets in . This implies that item is assigned to at most
distinct buckets over all the views.
We have to show that with high probability every view contains
one of the first
buckets in . We do this by showing that the
complement has low probability; that is, the probability that some
view contains none of the first
buckets is at most .
The probability that a particular view does not contain the first
bucket in
is at most , since each view contains at least
a
fraction of all buckets. The fact that the first bucket is not
in a view only reduces the probability that subsequent buckets are
not in the view. Therefore, the probability that a particular view
contains none of the first
buckets is at most
. By the union bound, the probability
that even one of the
views contains none of the first buckets is
at most
.
Now consider load. By similar reasoning, every item
in every
view is assigned to one of the first
buckets in
with probability at least . We show below that a fixed
bucket
appears among the first buckets in for
at most
items with probability at least . By the union
bound, both events occur with high probability. This implies that at
most
items are assigned to bucket over all the views.
All that remains is to prove the second statement. The ex-
pected number of items
for which the bucket appears among
the first
buckets in is . Us-
ing Chernoff bounds, we find that bucket
appears among the first
buckets in for at most items with probability
at least
.
A simple approach to constructing a consistent hash function is
to assign random scores to buckets, independently for each item.
Sorting the scores defines a random permutation, and therefore has
the good properties proved in this section. However, finding the
bucket an item belongs in requires computing all the scores. This
could be restrictively slow for large bucket sets.
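For concreteness, such a score-based lookup might look like the few lines below (our illustration): the ranking of buckets by score is exactly the item's permutation, so the properties above carry over, but every lookup must touch every bucket in the view.

```python
import hashlib

def score(item, bucket):
    digest = hashlib.sha256(f"{item}:{bucket}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def score_hash(item, view):
    # Bucket with the highest score = first bucket of the score-induced permutation.
    return max(view, key=lambda b: score(item, b))
```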
In this section we apply the techniques developed in the last sec-
tion to the simple hot spot protocol developed in Section 3. We now
relax the assumption that clients know about all of the caches. We
assume only that each machine knows about a
fraction of the
caches chosen by an adversary. There is no difference in the proto-
col, except that the mapping
is a consistent hash function. This
change will not affect latency. Therefore, we only analyze the ef-
fects on swamping and storage. The basic properties of consistent
hashing are crucial in showing that the protocol still works well.
In particular, the blowup in the number of requests and storage is
proportional to the maximum
and of the hash function.
Theorem 5.1 If is implemented using the -way indepen-
dent consistent hash function of Theorem 4.1 and if each view con-
sists of
caches then withprobability at least
an arbitrary cache gets no more than
requests.
Proof (sketch): We look at the different trees of caches for dif-
ferent views for one page, .Let denote the number
of caches in each tree. We overlay these different trees on one an-
other to get a new tree where in each node, there is a set of caches.
Due to the spread property of the consistent hash function at most
caches appear at any node in this combined tree
with high probability. In fact since there are only
requests, this
will be true for the nodes of all the
trees for the requested pages.
If
denotes the event that appears in the node of the
combined tree for page
then we know from Corollary 4.2 that
the probability of this event is
,where is the load which
is
with high probability. We condition on the event that
and are which happens with high probability.
Since a cache in a node sends out at most
requests, each node
in the combined tree sends out at most
requests. We now adapt
the proof of Theorem 3.1 to this case. In Theorem 3.1 where every
machine was aware of all the
caches, an abstract node was as-
signed to any given machine with probability
. We now assign
an abstract node to a given machine with probability
.So
we have a scenario with caches where each abstract node
caches where eachabstract node
sends out up to
requests to its parent and occurs at each ab-
stract node independently and with probability
.Therest
of the proof is very similar to that of Theorem 3.1.
Using techniques similar to those in the proof of Theorem 5.1 we get
the following lemma. The proof is deferred to the final version of
the paper.
Lemma 5.2 The total number of cached pages, over all machines
is
with probability of .A
given cache has cached copies with high
probability.
So far we assumed that every pair of machines can communicate
with equal ease. In this section we extend our protocol to take
the latency between machines,
, into account. The latency of the
whole request will be the sum of the latencies of the machine-
machine links crossed by the request. For simplicity, we assume
in this section that all clients are aware of all caches.
We extend our protocol to a restricted class of functions
.In
particular, we assume that
is an ultrametric. Formally, an ultra-
metric is a metric which obeys a more strict form of the triangle
inequality:
.
The ultrametric is a natural model of Internet distances, since it
essentially captures the hierarchical nature of the Internet topology,
under which, for example, all machines in a given university are
equidistant, but all of them are farther away from another univer-
sity, and still farther from another continent. The logical point-to-
point connectivity is established atop a physical network, and it is
generally the case that the latency between two sites is determined
by the “highest level” physical communication link that must be
traversed on the path between them. Indeed, another definition of
an ultrametric is as a hierarchical clustering of points. The distance
in the ultrametric between two points is completely determined by
the smallest cluster containing both of the points.
The only modification we make to the protocol is that when a
browser maps the tree nodes to caches, it only uses caches that
are as close to it as the server of the desired page. By doing this,
we ensure that our path to the server does not contain any caches
that are unnecessarily far away in the metric. The mapping is done
using a consistent hash function, which is the vital element of the
solution.
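A sketch of this filtering step, under an assumed latency estimate `delta(a, b)` and hypothetical names, is simply:

```python
def eligible_caches(browser, server, caches, delta):
    # Keep only caches at most as far from the browser as the page's home server;
    # the browser then maps abstract tree nodes onto this set with the
    # consistent hash function instead of onto all caches.
    limit = delta(browser, server)
    return [c for c in caches if delta(browser, c) <= limit]
```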
Clearly, requiring that browsers use “nearby” caches can cause
swamping if thereis only one cache and server near many browsers.
Thus, in order to avoid cases of degenerate ultrametrics where there
are browsers that are not close to any cache, and where there are
clusters in the ultrametric without any caches in them, we restrict
the set of ultrametrics that may be presented to the protocol. The re-
striction is that in any cluster the ratio of the number of caches to the
number of browsers may not fall below
(recall that ).
This restriction makes sense in the real world where caches are
likely to be evenly spread out over the Internet. It is also neces-
sary, as we can prove that a large number of browsers clustered
around one cache can be forced to swamp that cache in some cir-
cumstances.
It is clear from the protocol and the definition of an ultrametric that
the latency will be no more than the depth of the tree,
, times
the latency between the browser and the server. So once again we
need only look at swamping and storage. The intuitionis that inside
each cluster the bounds we proved for the unit distance model ap-
ply. The monotone property on consistent hashing will allow us to
restrict our analysis to
clusters. Thus, summing over these
clusters we have only a
blowup in the bound.
Theorem 6.1 Let be an ultrametric. Suppose that each browser
makes at most one request. Then in the protocol above, an arbi-
trary cache gets no more than
requests with probability at least
where is a parameter.
Proof (sketch): The intuition behind our proof is the following. We
bound the load on a machine
. Consider the ranking of machines
according to their distance from . Suppose asks
for a page from a machine closer to itself than
. Then according
to our modified protocol, it will never involve
in the request. So
we need only consider machine
if it asks for a page at least as far
away from itself as
. It follows from the definition of ultrametrics
that every , , is also used in the revised protocol by .
Intuitively, our original protocol spread load among the ma-
chines so that the probability a machine got on the path for a par-
ticular page requests was
. In our ultrametric pro-
tocol,
plays the protocol on a set of at least machines. So is
on the path of the request from
with probability .
Summing over
, the expected load on is .
Stating things slightly more formally, we consider a set of
nested “virtual” clusters . Note that any
browser in
will use all machines in in the protocol. We
modify the protocol so that such a machine uses only the machines
in
. This only reduces the number of machines it uses. According
to the monotonicity property of our consistent hash functions, this
only increases the load on machine
.
Now we can consider each
separately and apply the static
analysis. The total number of requests arriving in one of the clusters
under the modified protocol is proportional to the number of caches
in the cluster, so our static analysis applies to the cluster. This
gives us a bound of
on the load induced on by .
Sumnming over the
clusters proves the theorem.
Using techniques similar to those in the proof of Theorem 6.1 we get
the following lemma.
Lemma 6.2 The total number of cached pages, over all machines
is
with probability of .A
given cache
has cached copies with high
probability.
Basically, as in Plaxton/Rajaraman, the fact that our protocol uses
random short paths to the server makes it fault tolerant. We con-
sider a model in which an adversary designates that some of the
caching machines may be down, that is, ignore all attempts at com-
munication. Remember that our adversary does not get to see our
random bits, and thus cannot simply designate all machines at the
top of a tree to be down. The only restriction is that a specified
fraction
of the machines in every view must be up. Under our
protocol, no preemptive caching of pages is done. Thus, if a server
goes down, all pages that it has not distributed become inaccessible
to any algorithm. This problem can be eliminated using standard
techniques, such as Rabin's Information Dispersal Algorithm [10].
So we ignore server faults.
Observe that a request is satisfied if and only if all the caches
serving for the nodes of the tree path are not down. Since each
node is mapped to a machine (
-wise) independently, it is trivial
(using standard Chernoff bounds) to lower bound the number of
abstract nodes that have working paths to the root. This leads to the
following lemma:
Lemma 7.1 Suppose that
. With high probability,
the fraction of abstract-tree leaves that have a working path to the
root is
. In particular, if ,this
fraction is a constant.
The modification to the protocol is therefore quite simple.
Choose a parameter
, and simultaneously send requests for the
page. A logarithmic number of requests is sufficient to give a high
probability that one of the requests goes through. This change in the
protocol will of course have an impact on the system. This impact
is described in the full paper.
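To see why a logarithmic number of simultaneous requests suffices, suppose (per Lemma 7.1) that each independently mapped request path reaches the root with some constant probability q. Then all r parallel requests fail only with probability (1-q)^r, so taking r on the order of log(1/delta) pushes the failure probability below any target delta. The sketch below merely solves this inequality; the function name and parameters are ours, not the paper's.

    import math

    def parallel_requests_needed(q: float, delta: float) -> int:
        # Smallest r with (1 - q)**r <= delta: r independent requests,
        # each of whose tree paths works with probability q, all fail
        # with probability at most delta.  (Illustrative helper only.)
        assert 0.0 < q < 1.0 and 0.0 < delta < 1.0
        return math.ceil(math.log(delta) / math.log(1.0 - q))

    # Example: if half of all paths work, a 1-in-1000 failure chance
    # needs only 10 simultaneous requests.
    print(parallel_requests_needed(q=0.5, delta=1e-3))   # prints 10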
Note that since communication is a chancy thing on the Inter-
net, failure to get a quick response from a machine is not a par-
ticularly good indication that it is down. Thus, we focused on the
tolerance of faults, and not on their detection. However, given some
way to decide that a machine is down, our consistent hash functions
make it trivial to reassign the work to other machines. If you de-
cide a machine is down, remove it from your view.
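To make "remove it from your view" concrete, here is a minimal ring-style sketch of consistent hashing (illustrative only: the paper's construction uses k-universal hash functions and maps each cache to several points on the circle, which we omit). The point of the sketch is that deleting a machine from a local view reassigns only the keys that machine owned; every other key keeps its cache.

    import hashlib
    from bisect import bisect

    def _point(s: str) -> int:
        # Map a string to a point on the circle (here, a 128-bit integer).
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class View:
        # A client's local view of the caches, placed as points on a hash ring.
        def __init__(self, machines):
            self.ring = sorted((_point(m), m) for m in machines)

        def cache_for(self, url: str) -> str:
            # The responsible cache is the first one clockwise from the URL's point.
            i = bisect(self.ring, (_point(url), "")) % len(self.ring)
            return self.ring[i][1]

        def remove(self, machine: str):
            # Dropping a machine moves only its keys, to the next point on the ring.
            self.ring = [(p, m) for (p, m) in self.ring if m != machine]

If a URL was not mapped to the removed machine, its cache is unchanged after remove(); this is the monotonicity/consistency property the protocol relies on when views differ or machines are declared down.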
So far, we have omitted any real mention of time from our analy-
ses. We have instead considered and analyzed a single “batch” of
requests, and argued that this batch causes a limited amount of
caching (storage usage) at every machine, while simultaneously ar-
guing that no machine gets swamped by the batch. In this section,
we show how this static analysis carries implications for a temporal
model in which requests arrive over time. Recall that our temporal
model says that browsers issue requests at a certain rate
.
Time is a problematic issue when modeling the Internet, be-
cause the communication protocols for it have no guarantees re-
garding time of delivery. Thus any one request could take arbi-
trarily long. However, we can consider the rate at which servers
receive requests. This seems like an overly simplistic measure, but
the rate at which a machine can receive requests is in fact the statis-
tic that hardware manufacturers advertise. We consider an interval
of time
, and apply our “requests all come at once” analysis to the
requests that come in this interval.
We can write the bounds from the static analysis on
requests
as follows:
cache size cache load
Suppose machines have cache size . Consider a time interval
small enough to make
small enough so that .
In other words, the number of requests that arrive in this interval is
insufficient, according to our static analysis, to use storage exceed-
ing
per machine. Thus once a machine caches a page during this
interval, it keeps it for the remainder of the interval. Thus our static
analysis will apply over this interval. This gives us a bound on how
many requests can arrive in the interval. Dividing by the interval
length, we get the rate at which caches see requests:
.
Plugging in the bounds from Section 3, we get the following:
Theorem 8.1 If our machines have
storage, for
some constant
, then with probability , the bound on the
rate of new requests per cache when we have
machines of size
is .
Observe the tradeoffs implicit in this theorem. Increasing
causes the load to decrease proportionately, but never below
. Increasing increases the load linearly (but re-
duces the number of hops on a request path). Increasing
seems
only to hurt, suggesting that we should always take
.
The above analysis used the rate at which requests were issued
to measure the rate at which connections are established to ma-
chines. If we also assume that each connection lasts for a finite
duration, this immediately translates into a bound on the number of
connections open at a machine at any given time.
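As a simple illustration with made-up numbers: if a cache accepts new requests at a rate of 100 per second and each connection stays open for at most 2 seconds, then no more than roughly 200 connections are ever open at that cache at once.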
This paper has focused on one particular caching problem—that
of handling read requests on the Web. We believe the ideas have
broader applicability. In particular, consistent hashing may be a
useful tool for distributing information from name servers such as
DNS and label servers such as PICS in a load-balanced and fault-
tolerant fashion. Our two schemes may together provide an inter-
esting method for constructing multicast trees [4].
Another important way in which our ideas could be extended
is in handling pages whose information changes over time, due to
either server or client activity. If we augment our protocol to let the
server know which machines are currently caching its page, then
the server can notify such caches whenever the data on its pages
changes. This might work particularly well in conjunction with the
multicast protocols currently under development [4] that broadcast
information from a server to all the client members of a multicast
“group.” Our protocol can be mapped into this model if we assume
that every machine “caching” a page joins a multicast group for that
page. Even without multicast, if each cache keeps track, for each
page it caches, of the at most
other caches it has given the page
to, then notification of changes can be sent down the tree to only
the caches that have copies.
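The bookkeeping in the last sentence could look like the following sketch (class and method names are ours, purely illustrative): each cache records, per page, the caches it has handed a copy to, and an invalidation simply walks down that tree.

    class Cache:
        # Sketch of per-page child tracking for change notification.
        def __init__(self, name: str):
            self.name = name
            self.pages = {}      # page id -> cached contents
            self.children = {}   # page id -> set of Cache objects we served

        def hand_copy(self, page: str, other: "Cache") -> None:
            # Serve another cache's miss and remember who received the copy.
            other.pages[page] = self.pages[page]
            self.children.setdefault(page, set()).add(other)

        def invalidate(self, page: str) -> None:
            # Drop our copy and push the notification to the caches below us.
            self.pages.pop(page, None)
            for child in self.children.pop(page, set()):
                child.invalidate(page)

A server keeping the same structure for the caches it has served directly can then notify exactly the machines holding copies, without any broadcast.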
It remains open how to deal with time when modeling the In-
ternet, because the communication protocols have no guarantees
regarding time of delivery. Indeed, at the packet level, there are not
even guarantees regarding eventual delivery. This suggests model-
ing the Internet as some kind of distributed system. Clearly, in a
model in which there are no guarantees regarding delivery times,
the best one can hope to prove is some of the classical liveness and
safety properties underlying distributed algorithms. It is not clear
what one can prove about caching and swamping in such a model.
We think that there is significant research to be done on the proper
way to model this aspect of the Internet.
We also believe that interesting open questions remain regard-
ing the method of consistent hashing that we present in this paper.
Among them are the following. Is there a
-universal consistent
hash function that can be evaluated efficiently? What tradeoffs
can be achieved between spread and load? Are there “perfect”
consistent hash functions that can be constructed deter-
ministically with the same spread and load bounds we give? On
what other theoretical problems can consistent hashing give us a
handle?
[1] Anawat Chankhunthod, Peter Danzig, Chuck Neerdaels, Michael
Schwartz and Kurt Worrell. A Hierarchical Internet Object Cache. In
USENIX Proceedings, 1996.
[2] Robert Devine. Design and Implementation of DDH: A Distributed
Dynamic Hashing Algorithm. In Proceedings of 4th International Con-
ference on Foundations of Data Organizations and Algorithms, 1993.
[3] M. J. Feeley, W. E. Morgan, F. P. Pighin, A. R. Karlin, H. M. Levy
and C. A. Thekkath. Implementing Global Memory Management in a
Workstation Cluster. In Proceedings of the 15th ACM Symposium on
Operating Systems Principles, 1995.
[4] Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu and
Lixia Zhang. A Reliable Multicast Framework for Light-weight Ses-
sions and Application Level Framing. In Proceedings of SIGCOMM '95.
[5] Witold Litwin, Marie-Anne Neimat and Donovan A. Schneider.
LH* - A Scalable, Distributed Data Structure. ACM Transactions on
Database Systems, Dec. 1996.
[6] Radhika Malpani, Jacob Lorch and David Berger. Making World Wide
Web Caching Servers Cooperate. In Proceedings of World Wide Web
Conference, 1996.
[7] M. Naor and A. Wool. The load, capacity, and availability of quorum
systems. In Proceedings of the 35th IEEE Symposium on Foundations
of Computer Science, pages 214-225, November 1994.
[8] D. Peleg and A. Wool. The availability of quorum systems. Information
and Computation 123(2):210-233, 1995.
[9] Greg Plaxton and Rajmohan Rajaraman. Fast Fault-Tolerant Concur-
rent Access to Shared Objects. In Proceedings of 37th IEEE Sympo-
sium on Foundations of Computer Science, 1996.
[10] M. O. Rabin. Efficient Dispersal of Information for Security, Load Bal-
ancing, and Fault Tolerance. Journal of the ACM 36:335–348, 1989.
[11] Jeanette Schmidt, Alan Siegel and Aravind Srinivasan. Chernoff-
Hoeffding Bounds for Applications with Limited Independence. In
Proc. 4th ACM-SIAM Symposium on Discrete Algorithms, 1993.