Load Balancing of Heterogeneous Workloads in Memcached Clusters
Wei Zhang, Timothy Wood and H. Howie Huang
The George Washington University
Beihang University
Jinho Hwang
IBM T. J. Watson Research Center
K.K. Ramakrishnan
Rutgers University
Abstract
Web services, large and small, use in-memory caches
like memcached to lower database loads and quickly re-
spond to user requests. These cache clusters are typi-
cally provisioned to support peak load, both in terms of
request processing capabilities and cache storage size.
This kind of worst-case provisioning can be very expen-
sive (e.g., Facebook reportedly uses more than 10,000
servers for its cache cluster) and does not take advantage
of the dynamic resource allocation and virtual machine
provisioning capabilities found in modern public and pri-
vate clouds. Further, there can be great diversity in both
the workloads running on a cache cluster and the types
of nodes that compose the cluster, making manual man-
agement difficult. This paper identifies the challenges
in designing large-scale self-managing caches. Rather
than requiring all cache clients to know the key to server
mapping, we propose an automated load balancer that
can perform line-rate request redirection in a far more
dynamic manner. We describe how stream analytic tech-
niques can be used to efficiently detect key hotspots. A
controller then guides the load balancer’s key mapping
and replication level to prevent overload, and automati-
cally starts additional servers when needed.
1 Introduction
In-memory caching has become a popular technique for
enabling highly scalable web applications. These caches
typically store frequently accessed or expensive-to-compute query results to lower the load on database servers
that are typically difficult to scale up. Memcached has
become the standard caching server for a wide range of
applications.
Large web companies like Facebook and Twitter pro-
vision their caching infrastructure to support the peak
load and to hold a majority of data in cache [1]. Given the
dynamic nature of Internet workloads, this kind of static
provisioning is generally very expensive. For example,
the workload handled by the Facebook cache was shown
to have a peak workload about two times higher than the
minimum seen over a 24 hour period [2], and this may
be even more dramatic for websites with a less global
reach. Thus provisioning a cache infrastructure for peak
load can be highly wasteful in terms of hardware and en-
ergy costs.
At the same time, public-facing applications need to
be wary of flash crowds that direct a large workload to
a small portion of the application’s total content. In-
memory caches are crucial for handling this type of load,
but even they may become overloaded if popular data
is not efficiently replicated or the system is unable to
scale up the number of cache nodes in time. Further, one
caching cluster is often multiplexed for several differ-
ent applications, each of which may have distinct work-
load characteristics. The cache must be able to balance
the competing needs of these applications despite differ-
ences in get/set rates, data churn, and the cost of a cache
miss.
Heterogeneity can occur not only within keys and
workloads, but also among the servers that make up the
caching cluster. A wide range of key-value store archi-
tectures (many supporting the same memcached proto-
col) have been proposed, ranging from energy efficient
FPGA designs [3] to high-powered data stores capable
of saturating multiple 10 gigabit NICs [4, 5]. These ap-
proaches provide different trade-offs in the energy ef-
ficiency, throughput, latency, and data volatility of the
cache, suggesting that a heterogeneous deployment of
different server and cache types may offer the best over-
all performance.
While many resource management systems have been
proposed for web applications, the caching tier is of-
ten ignored, leaving it statically partitioned and sized for
worst case workloads. In part, this is because caches are
typically accessed by a distributed set of clients (usually
web servers), so dynamically adjusting the cache setup
requires coordination across a large number of nodes. To
get around this problem, we eschew the traditional ap-
proach where clients know precise key-server mappings,
and instead propose a middlebox-based load balancer ca-
pable of making dynamic adaptations within the caching
infrastructure. Having a centralized load balancer is
made possible by recent advances in high performance
network cards and multi-core processors that allow net-
work functions to be run on commodity servers [6–8].
Our system will be built upon the following components:
• A high speed memcached load balancer that can forward millions of requests per second.
• A hot spot detection algorithm that uses stream data mining techniques to efficiently determine the hottest keys.
• A two-level key mapping system that combines consistent hashing with a lookup table to flexibly control the placement and replication level of hot keys.
• An automated server management system that takes inputs from the load balancers and overall application performance levels to determine the number and types of servers in the caching cluster.
In this paper we describe our preliminary work on de-
signing this dynamically scalable caching infrastructure.
Our architecture provides greater flexibility than exist-
ing approaches that place complexity and intelligence in
either the clients or memcached servers. By removing
the reliance on manual, administrator specified policies,
our self-managing cache cluster can automatically tune
key placement and server settings to provide high perfor-
mance at low cost.
2 Background
In this paper we focus on the memcached in-memory key
value store. Memcached provides a simple put/get/delete
interface and is primarily used to store small (e.g., <
1KB) data values [2]. Clients, such as a PHP web ap-
plication, generally follow the pattern of first requesting
data from a cache node, but querying a database and
loading the relevant entry into the cache if it was not
found.
A memcached client can directly connect to a mem-
cached server, or a proxy can be used to help manage
the mappings of keys to servers. Individual memcached
servers are designed so that they are unaware of each
other. Many distributed key-value stores employ consis-
tent hashing [9] to determine how keys are mapped to
different servers [1, 10, 11]. This provides both an even
key distribution and simplifies the addition and removal of servers into the key space.

Figure 1: As workloads become more skewed (larger θ), the imbalance across nodes rises significantly relative to the load under a uniform workload. (Axes: normalized maximum request rate vs. Zipf workload skew; curves: no replication, 10 replicas.)
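For concreteness, a minimal consistent hashing ring might look like the following sketch (illustrative Python; the class and parameter names are ours, not those of any particular memcached client):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring: each server is hashed to several points
    on a ring, and a key is served by the first server point found clockwise
    from the key's hash."""

    def __init__(self, servers, points_per_server=100):
        self._ring = []  # sorted list of (hash, server) tuples
        for server in servers:
            for i in range(points_per_server):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def lookup(self, key):
        h = self._hash(key)
        # first ring point at or after the key's hash, wrapping around the ring
        idx = bisect.bisect_left(self._ring, (h, ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing([f"server{i}" for i in range(4)])
print(ring.lookup("user:1234"))  # deterministic server choice for this key
```

Adding or removing a server only remaps the keys adjacent to its ring points, which is what makes membership changes cheap.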
To prevent a centralized proxy from becoming a bot-
tleneck, each client machine typically runs its own lo-
cal proxy instance that maintains the mapping of all key
ranges to servers [1, 11]. While this reduces latency, co-
ordinating their consistency can be a problem if there are
a very large number of clients. As a result, the key to
server mapping in these clusters is typically kept rela-
tively static, limiting the flexibility with which the cluster
can be managed.
2.1 Workload Heterogeneity
Workload characteristics can have a large impact on the
performance of a memcached cluster since requests of-
ten follow a heavy tailed, Zipfian distribution. Figure 1
shows how the amount of skew in a Zipfian request dis-
tribution affects the number of key requests that occur
on the most loaded server in a simulated cluster of 100
machines (normalized relative to the number of requests
under a uniform workload). As the workload becomes
more focused on a smaller number of keys, the imbal-
ance across servers can rise significantly, but if the hot
keys can be replicated to even a small number of servers,
the balance is significantly improved.
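The trend in Figure 1 can be reproduced with a short simulation; the sketch below uses assumed parameters (100,000 keys, one million requests, 100 servers), not the paper's exact setup.

```python
import numpy as np

def max_server_load(theta, n_keys=100_000, n_requests=1_000_000,
                    n_servers=100, replicas=1):
    """Request count at the most loaded server, normalized by the per-server
    load of a perfectly uniform workload."""
    rng = np.random.default_rng(0)
    # Zipfian popularity: p(k) proportional to 1 / k^theta
    ranks = np.arange(1, n_keys + 1)
    probs = ranks ** -float(theta)
    probs /= probs.sum()
    requests = rng.choice(n_keys, size=n_requests, p=probs)
    # each key (or each of its replicas) lands on a pseudo-random server
    server_of = rng.integers(0, n_servers, size=(n_keys, replicas))
    chosen = server_of[requests, rng.integers(0, replicas, size=n_requests)]
    loads = np.bincount(chosen, minlength=n_servers)
    return loads.max() / (n_requests / n_servers)

for theta in (1.0, 2.0, 3.0):
    print(theta, max_server_load(theta), max_server_load(theta, replicas=10))
```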
In addition to varied key popularity, analysis of the
Facebook memcached workload [1, 2] shows that differ-
ent applications can have different read/write rates, churn
rates, costs for cache misses, quality-of-service demands,
etc. To handle these heterogeneous workloads, Face-
book breaks their memcached cluster into groups, each
of which services a different application or set of appli-
cations [1]. However, from their descriptions it appears
that this partitioning is done in a manual fashion. This
leaves it susceptible to inefficient allocations under dy-
namic conditions and may not be feasible for companies
with less expertise in memcached cluster management.
Figure 2: On our test system, a Gigabit NIC achieves higher performance for small value sizes, but the network quickly becomes saturated. (Axes: throughput vs. value size in KB; curves: 1 Gb NIC, 10 Gb NIC.)
2.2 Server Heterogeneity
The hardware and software that make up a cache clus-
ter can also be diverse. Not only are there several re-
search [4, 10, 12] and commercial [13, 14] key-value
store software solutions, many of them support the
standard memcached protocol (even recent versions of
MySQL). These alternatives are optimized for different
use cases: e.g., couchbase [13] supports key replication
for high availability and MICA [4] provides extremely
high throughput, but only supports small value sizes.
While many alternatives exist, memcached continues to
be the most popular in-memory cache.
Figure 2 illustrates the performance of memcached us-
ing either an Intel 82599EB ten gigabit NIC or a Broad-
com 5720 gigabit NIC on the same server with dual In-
tel Xeon X5650 CPUs. The ten gigabit NIC appears
to suffer from inefficient processing of small packets,
reducing its maximum throughput for very small sized
objects. However, the gigabit network card quickly
reaches its bandwidth limit when using larger value sizes.
Since many web applications store primarily small ob-
jects [2], the added energy and hardware cost of ten gi-
gabit adapters is not always necessary, suggesting that an
intelligently scheduled mixed deployment could provide
the best performance per dollar spent. Other hardware
options such as low-power Atom CPUs have also been
shown to provide valuable trade-offs when designing a
memcached cluster [15].
3 Memcached Load Balancer Design
This section describes our preliminary design for a scal-
able, in-network dynamic load balancer for memcached
clusters. Our overall architecture is shown in Figure 3.
The load balancer is composed of a Lossy Counter used
to detect key hot spots, a two level Key Director that
stores where a key should be routed, and Key and Server
managers that run the control algorithms to decide how
the system should respond to workload changes. We de-
scribe these components in the following subsections.
3.1 Middlebox Platform
Recent advances in network interface cards (NICs) and
user-level packet processing libraries [6, 7, 16] have
enabled high speed packet processing on commodity
servers. We are using these techniques to build an ef-
ficient load-balancing platform that can direct traffic to a
large number of back-end memcached servers. Further,
the load balancer’s placement within the network path
allows it to observe important statistics about the servers
and their workloads.
Our prior work has demonstrated that even when a
load balancer such as this is run inside a virtual machine,
it is possible to achieve full 10 Gbps line rates [8]. A
typical memcached request packet is approximately 96
bytes¹, so the maximum rate that can be handled by a
single 10 Gbps NIC port is 13 million requests per sec-
ond. Our current prototype can handle approximately 10
million 64-byte requests per second when using a sin-
gle core to run a simplified version of the key redirection
system described below.
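This request rate follows directly from the link speed and the request size (ignoring link-layer framing overhead):

    10 Gbit/s ÷ (96 bytes/request × 8 bits/byte) ≈ 13 × 10^6 requests/s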
As shown in Figure 3, client requests are sent as UDP
packets to the IP of the load balancer, but replies are re-
turned directly from the memcached servers back to the
client. This significantly lowers the processing require-
ments of the load balancer since memcached responses
are often much larger (and thus more expensive to pro-
cess) than requests. The load balancer acts as a “bump in
the wire”, so it does not need to maintain any connection
state, unlike existing proxies such as Twemproxy which
establishes separate socket connections with each client
and each server [11].
3.2 Hot Spot Detection
The load balancer must determine how to efficiently for-
ward requests to memcached servers, while preventing
some of them from becoming overloaded. Currently, our
focus is on handling skewed workloads that cause a small
number of servers to become overloaded. To prevent this,
the load balancer must be able to detect which keys are
causing the greatest load imbalance.
Since memcached does not store much data per key
(e.g., true frequency over time) to limit overheads, we
cannot simply rely on the servers sending this info. We
propose a frequency counting mechanism to build a table
of hot items in the request stream. Once hot keys are de-
tected, requests for them can be either directed to more
powerful servers or to replicas spread across several ma-
chines.

¹This is calculated based on 52 bytes for the MAC, IP, and UDP headers, 8 bytes for the memcached application-level header, plus the median memcached key size for Facebook's ETC pool [2].

Figure 3: Requests are intercepted by the load balancer and directed based on either the forwarding table (hot items) or consistent hashing. Replies return directly back to the clients, minimizing the processing requirements of the load balancer. (The Key Director combines a consistent hashing ring with a forwarding table of hot keys, e.g., k1 → {S1, S14}, k2 → {S3, S6}, k3 → {S2, S8, S9}; a Hot Spot Detector, Key Replicator, and Server Manager sit alongside it, in front of the heterogeneous cache servers.)

Figure 4: Number of (Steady-State) Hot Items with Different Workloads. Here, s is 1% and ε is 0.1%. (Axes: number of hot items vs. Zipf distribution parameter.)

The Lossy Counting algorithm [17] is a one-pass, deterministic algorithm that efficiently calculates frequency counts over data streams, guaranteeing to identify hot items based on two user-defined parameters: a support threshold s and an error rate ε, where ε ≪ s. When the stream length is N, lossy counting returns all keys with frequency at least sN (there are no false negatives), and no item with true frequency less than (s − ε)N is returned.
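For illustration, a compact Python sketch of the Lossy Counting pass (our own rendering of the algorithm from [17], not the load balancer's C implementation):

```python
import math

def lossy_count(stream, s=0.01, epsilon=0.001):
    """One pass over `stream`; returns keys whose true frequency may be >= s*N.
    No key with true frequency >= s*N is missed, and each reported count is
    underestimated by at most epsilon*N."""
    width = math.ceil(1.0 / epsilon)      # bucket width
    counts, errors = {}, {}
    n = 0
    for key in stream:
        n += 1
        bucket = math.ceil(n / width)
        if key in counts:
            counts[key] += 1
        else:
            counts[key] = 1
            errors[key] = bucket - 1      # maximum undercount for this key
        if n % width == 0:                # bucket boundary: prune cold keys
            for k in list(counts):
                if counts[k] + errors[k] <= bucket:
                    del counts[k], errors[k]
    # report keys that may exceed the support threshold
    return {k: c for k, c in counts.items() if c >= (s - epsilon) * n}

hot = lossy_count(["a", "b", "a", "c", "a"] * 1000, s=0.01, epsilon=0.001)
```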
Depending on the workload’s skew, the actual num-
ber of hot items can be very different as shown in Fig-
ure 4. Thus the number of hot items found by the counter
for a given s parameter depends both on the total num-
ber of keys being accessed in the cache and the work-
load distribution. Unfortunately, it is non-trivial to pre-
dict in advance what a workload will look like, and it
may change with time. As a result, we have modified the
Lossy Counting algorithm so that it will adjust itself to
store a specified number of hot keys; we then adapt the
target number of hot keys based on our observations of
the workload during the previous observation window.
Hot Key Analysis: Figure 5 shows the estimated fre-
quency (i.e., request rate) seen by the top keys mea-
sured by the Lossy Counter for a sample workload; the
frequencies are an estimate guaranteed to be at most
εN smaller than the true counts [17]. Clearly, not all
items reported by the counter should be treated identi-
cally since they have very different loads. The goal of
our hot key analysis phase is to determine which of these
potential keys need to be replicated or moved to faster
servers.
The Lossy Counter can be used to separate keys into
groups with similar request rates. Due to the nature of
long tail distributions common in web workloads, the
number of keys in the group with the highest average
frequency will be smaller than the number of keys in the
group with the second highest, and so on. This is shown
in Figure 5 by the increasing width of each step. We con-
sider the workload as a set of groups g_1, g_2, ..., g_i, ..., g_n, ordered such that g_1 is the key group with the highest request frequency, f_1. Our goal is to find the group g_i that splits the groups into two sets: g_1 ... g_i represents the hot keys while g_{i+1} ... g_n are the "regular" keys.
The intuition behind our approach is to select g_i such that |g_i|, the number of keys in the group, is large enough to be evenly distributed. Any subsequent g_j where j > i will have |g_j| > |g_i| since we expect a heavy-tailed workload distribution, and thus can be load balanced with simple consistent hashing.
We can use the common Balls and Bins analysis to
understand how many keys (balls) are expected to be
placed on each server (bin), if keys are uniformly as-
signed to each server. For a given group, we consider
the request rate of all keys to be equivalent, so bound-
ing the number of keys from a group that can be as-
signed to the most loaded server in turn bounds the max-
imum request rate it can achieve. For the case where
there are fewer keys in a group than there are servers
(i.e., |g_j| < #servers / log(#servers), which is common
for the hottest sets of keys) we can adapt the theorem
from Mitzenmacher [18] that bounds the number of balls
assigned to the most loaded bin with high probability
(p = 1 − 1/#servers):

    MaxLoad ≤ f_j × log(#servers) / log(#servers / |g_j|)    (1)

where f_j is the maximum request frequency of keys in g_j.

Figure 5: Frequency of each key measured by a Lossy Counter where s is 1% and ε is 0.1%. (Axes: frequency (x1000) vs. item number.)

The Hot Spot Detector considers each group in order starting with j = 1. If the maximum load exceeds a threshold, then group g_{j+1} is considered for the split point. Once the split point g_i is found, all groups with index less than i are replicated as described in Section 3.3, and the rest are forwarded using consistent hashing.
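To illustrate how the split point could be chosen with the bound in Equation 1, the following sketch walks the groups from hottest to coldest; the group sizes, frequencies, and per-server rate threshold are hypothetical values, not the system's actual configuration.

```python
import math

def find_split_point(groups, n_servers, max_rate_per_server):
    """groups: list of (num_keys, per_key_request_rate) ordered from the hottest
    group g_1 downward. Returns the number of leading groups whose keys should
    be replicated; the remaining groups are left to consistent hashing."""
    for j, (num_keys, freq) in enumerate(groups, start=1):
        if num_keys >= n_servers / math.log(n_servers):
            # group is large enough to spread evenly under consistent hashing
            return j - 1
        # Equation 1: bound on the load at the most loaded server for this group
        max_load = freq * math.log(n_servers) / math.log(n_servers / num_keys)
        if max_load <= max_rate_per_server:
            return j - 1              # g_j and colder groups need no replication
    return len(groups)                # every tracked group is hot

# hypothetical groups from the Lossy Counter: (keys in group, requests/sec per key)
groups = [(3, 20_000), (10, 8_000), (40, 2_000), (200, 400)]
n_hot_groups = find_split_point(groups, n_servers=100, max_rate_per_server=25_000)
```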
Adaptive Sized Lossy Counter: The above analysis
assumes that group g_i is included in the groups of keys
returned by the Lossy Counter. However, if the counter
is not configured with the appropriate support parameter,
s, then the counter will track either too few groups (and
not enough keys will be replicated) or too many groups
(wasting memory and increasing the cost of lookups in
the counter). To prevent this, we adapt the size of the
Lossy Counter during each measurement interval to en-
sure it is tracking the correct number of keys.
The key request stream passes through the lossy count-
ing algorithm for a configurable time window. At the end
of the window, the algorithm compares the number of re-
turned keys with T, which is a target level of hot items.
We adapt T based on the request rates of the groups returned by the counter; if the last group returned by the counter would cause too much skew when left unreplicated, then T must be increased since the optimal g_i is not included in the counter's results.
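As a sketch of this adjustment loop (the step size α, bounds, and interface here are assumptions, not the paper's values):

```python
def adapt_support(s, n_returned, target, alpha=0.001, s_min=1e-4, s_max=0.05):
    """Adjust the lossy counter's support threshold s so the next measurement
    window returns roughly `target` hot keys: lowering s tracks more keys,
    raising s tracks fewer."""
    if n_returned < target:
        s = max(s_min, s - alpha)
    elif n_returned > target:
        s = min(s_max, s + alpha)
    return s

# example: the last window returned 120 keys but we want about 200 tracked
s = adapt_support(s=0.01, n_returned=120, target=200)
```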
Figure 6 illustrates the generic lossy counting algo-
rithm and our autonomic lossy counting algorithm to
show how we can adjust the desired level. Under the
same workload, the generic lossy counting algorithm
shows a steady number of hot items, whereas our autonomic lossy counting algorithm tries to reach the target level; here, the target level is defined as 200.
Lossy Counter Overhead: Since the Lossy Counter
runs within the packet processing path, minimizing its
overhead is critical. Our tests give an average total pro-
cessing time of 367 nanoseconds per key counted (269 ns
for lookup and 98 ns for insertion). Memcached request
latencies are typically in the hundreds of microseconds,
so this processing will add negligible performance impact.

Figure 6: Autonomic Scaling on Number of Hot Items. Workload is a Zipfian distribution; initial s is 1%, ε is 0.1%, and α is 0.1%. (Axes: number of hot items vs. stream length N (x1000); curves: target level, autonomic lossy counting, generic lossy counting.)

However, to achieve full line rate with a single core
on the load balancer, each packet must be examined and
redirected within about 80 nanoseconds. We are inves-
tigating how the counter operations can be done outside
the critical path by separate cores that only sample a por-
tion of the requests passing through the load balancer.
The memory footprint is also very small: only 48 bytes plus 8 bytes (key hash and frequency) plus the key size, per tracked key.
3.3 Request Redirection
Directing keys to servers should be as fast as possible.
Consistent Hashing is well known to effectively balance
load across servers while supporting fast lookups and the
flexibility to add or remove servers. We use consistent
hashing to direct most keys to their destination, but use
a lookup table to provide a more flexible mapping for
hot keys. Incoming requests are first queried against the
lookup table, and if not found there they are handled by
the consistent hash ring.
The Hot Spot Detector provides a set of key/frequency
pairs (k_1, f_1) ... (k_i, f_i), where k_i is the last key in g_i. The
Key Redirector can then either replicate each key propor-
tionally to its request frequency, or select some keys to
be forwarded to servers that are known to be more pow-
erful or less loaded. Recently there have been several
efficient, concurrent hash table data structures proposed
which we are exploring to allow the load balancer to ef-
ficiently check whether an incoming request is for a hot
or regular key [19–21].
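As a sketch of how the two levels fit together (illustrative Python under our own naming, standing in for the middlebox's actual forwarding path), a small hot-key forwarding table is consulted first and everything else falls through to consistent hashing:

```python
import random

class KeyDirector:
    """Two-level request redirection: a forwarding table for hot keys,
    consistent hashing (or any fallback mapping) for everything else."""

    def __init__(self, consistent_hash_lookup):
        self.consistent_hash_lookup = consistent_hash_lookup   # key -> server
        self.forwarding_table = {}                              # hot key -> replicas

    def set_hot_key(self, key, replicas):
        self.forwarding_table[key] = list(replicas)

    def remove_hot_key(self, key):
        self.forwarding_table.pop(key, None)

    def route(self, key):
        replicas = self.forwarding_table.get(key)
        if replicas:
            # hot key: spread requests across its replicas
            return random.choice(replicas)
        # regular key: fall back to the consistent hash ring
        return self.consistent_hash_lookup(key)

# stand-in for the consistent hash ring sketched in Section 2
director = KeyDirector(lambda key: f"server{hash(key) % 100}")
director.set_hot_key("celebrity:42", ["server1", "server3", "server7"])
print(director.route("celebrity:42"), director.route("user:1001"))
```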
3.4 Server Management
The final component of our load balancer is the server
manager. This component aggregates information from
both the hot spot detector and from the memcached
servers themselves. This will allow the load balancer to
also respond to broader workload dynamics that require
servers to be added or removed from the memcached
cluster.
There has been a significant amount of related work on
how to dynamically manage virtual servers in response to
workload changes. For example, VMware’s Distributed
Resource Scheduler can dynamically adjust CPU and
memory allocations or migrate virtual machines. Yet
without insight into statistics such as the cache’s hit rate
and the overall application’s performance, these manage-
ment actions will not be effective. An important dis-
tinction when managing caching servers is that workload
skew can have a significant effect not only on request
rates (as described in the previous sections), but also on
the cache hit rate. Skewed workloads are actually easier
to cache, meaning that an intelligent controller may be
able to safely reduce the number of cache servers with
minimal impact on the cache’s hit rate [22]. Therefore,
we are investigating what new control mechanisms and
algorithms are necessary to provide a QoS management
system that controls a cache cluster based on its internal
behavior and the overall application’s needs.
4 Related Work
There has been a large amount of work on improving the
performance and scalability of individual memcached
servers. Some work proposes using new hardware such
as RDMA [23], high speed NICs [4], and FPGAs [24],
while others have improved memcache’s internal data
structures [1, 20, 21]. These approaches all focus on
maximizing the performance or energy efficiency of a
single cache node.
A large scale analysis of Facebook’s workload was
presented by Atikoglu et al., which illustrates the high
skew and time variation seen by what is probably the
largest memcached deployment [2]. Facebook has also
improved the efficiency of individual nodes and has de-
ployed a key replication system to help balance load and
increase the chance of finding multiple related keys in
one server lookup [1]. Their system relies on individ-
ual clients knowing the mapping of all keys to servers,
which we argue reduces the agility of the system com-
pared to an in-network load balancer like we propose.
They also appear to rely on manual classification of ap-
plication workloads into different pools.
Fan et al. propose using a fast, small cache in front
of a memcached cluster to prevent workload skew across
servers even under adversarial workloads [25]. Our load
balancer could potentially include a cache for fast lo-
cal lookups, although the scalability of such an approach
may be limited. This also increases the complexity of
maintaining consistency. Our approach of forwarding
some requests to high powered servers should provide
a similar load balancing effect. Replication of Mem-
cached keys was proposed by Hong et al. [26]. Their sys-
tem requires modification to both the clients and servers
to maintain state about replicated keys, which we try to
avoid with our transparent middle-box approach.
5 Conclusions and Future Work
Large-scale web applications rely on in-memory caches
such as memcached to reduce the cost of processing
common user requests. However, memcached deploy-
ments are typically statically sized and provisioned for
peak workloads. We are developing a load balancing net-
work middlebox that can automatically detect hotspots
and balance load across memcached servers through re-
quest redirection and replication. Contrary to most re-
source management systems that require software to be
installed either on clients or on the servers being man-
aged, our in-network approach can be transparently de-
ployed without any changes to applications. We believe
that recent advances that allow such middleboxes to run
at high speed even in virtual machines will open up new
possibilities for a wide range of resource management
systems that can be flexibly reconfigured and deployed.
In our ongoing work, we are continuing to extend
our load balancer to optimize the number of servers and
replicas of hot keys. We are also exploring how this
kind of middle-box platform can be used to transparently
monitor and manage other types of data center applica-
tions.
Acknowledgments
We thank the reviewers for their help improving this pa-
per. This work was supported in part by NSF grants
CNS-1253575, CNS-1350766, OCI-0937875, National
Natural Science Foundation of China under Grant No.
61370059 and No. 61232009, and Beijing Natural Sci-
ence Foundation under Grant No. 4122042.
References
[1] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc
Kwiatkowski, Herman Lee, Harry C. Li, Ryan
McElroy, Mike Paleczny, Daniel Peek, and Paul
Saab, “Scaling memcache at facebook,” in Pro-
ceedings of the 10th USENIX conference on Net-
worked Systems Design and Implementation, 2013,
pp. 385–398.
[2] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg,
Song Jiang, and Mike Paleczny, “Workload analy-
sis of a large-scale key-value store,” in Proceedings
of the 12th ACM SIGMETRICS/PERFORMANCE
Joint International Conference on Measurement
and Modeling of Computer Systems, New York,
NY, USA, 2012, SIGMETRICS ’12, pp. 53–64,
ACM.
[3] Maysam Lavasani, Hari Angepat, and Derek
Chiou, “An FPGA-based in-line accelerator for
memcached,” IEEE Computer Architecture Letters,
vol. 99, no. RapidPosts, pp. 1, 2013.
[4] Hyeontaek Lim, Dongsu Han, David G. Andersen,
and Michael Kaminsky, “MICA: a holistic ap-
proach to fast in-memory key-value storage,” in
Proceedings of the 11th USENIX Symposium on
Networked Systems Design and Implementation,
Seattle, WA, 2014, NSDI 14, pp. 429–444, USENIX.
[5] Wei Zhang, Timothy Wood, K.K. Ramakrishnan,
and Jinho Hwang, “Smartswitch: Blurring the line
between network infrastructure & cloud applica-
tions,” in 6th USENIX Workshop on Hot Topics in
Cloud Computing (HotCloud 14), Philadelphia, PA,
June 2014, USENIX Association.
[6] Intel Corporation, “Intel data plane development
kit: Getting started guide,” 2013.
[7] Luigi Rizzo, “netmap: A novel framework for
fast packet I/O,” in Presented as part of the 2012
USENIX Annual Technical Conference, Berkeley,
CA, 2012, pp. 101–112, USENIX.
[8] Jinho Hwang, K.K. Ramakrishnan, and Timothy
Wood, “NetVM: high performance and flexible
networking using virtualization on commodity plat-
forms,” in Symposium on Networked System Design
and Implementation, Apr. 2014, NSDI 14.
[9] David Karger, Alex Sherman, Andy Berkheimer,
Bill Bogstad, Rizwan Dhanidina, Ken Iwamoto,
Brian Kim, Luke Matkins, and Yoav Yerushalmi,
“Web caching with consistent hashing,” Comput.
Netw., vol. 31, no. 11-16, pp. 1203–1213, May
1999.
[10] David G. Andersen, Jason Franklin, Michael
Kaminsky, Amar Phanishayee, Lawrence Tan, and
Vijay Vasudevan, “FAWN: a fast array of wimpy
nodes,” in Proceedings of the ACM SIGOPS 22Nd
Symposium on Operating Systems Principles, New
York, NY, USA, 2009, SOSP ’09, pp. 1–14, ACM.
[11] “Twemproxy: A fast, light-weight
proxy for memcached,” Feb. 2012,
https://blog.twitter.com/2012/twemproxy.
[12] John Ousterhout, Parag Agrawal, David Erick-
son, Christos Kozyrakis, Jacob Leverich, David
Mazières, Subhasish Mitra, Aravind Narayanan,
Diego Ongaro, Guru Parulkar, Mendel Rosenblum,
Stephen M. Rumble, Eric Stratmann, and Ryan
Stutsman, “The case for ramcloud,” Commun.
ACM, vol. 54, no. 7, pp. 121–130, July 2011.
[13] Couchbase, “vbuckets: The core enabling mecha-
nism for couchbase server data distribution,” Tech-
nical Report, 2013.
Redis, “http://redis.io.”
[15] Joseph Issa and Silvia Figueira, “Hadoop and mem-
cached: Performance and power characterization
and analysis,” Journal of Cloud Computing: Ad-
vances, Systems and Applications, vol. 1, no. 1, pp.
10, July 2012.
[16] EunYoung Jeong, Shinae Wood, Muhammad
Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu
Han, and KyoungSoo Park, “mTCP: a highly scal-
able user-level TCP stack for multicore systems,”
in Proceedings of the 11th USENIX Symposium
on Networked Systems Design and Implementation,
Seattle, WA, 2014, NSDI 14, pp. 489–502, USENIX.
Gurmeet Singh Manku and Rajeev Motwani, “Ap-
proximate frequency counts over data streams,” in
Proceedings of the 28th International Conference
on Very Large Data Bases. 2002, VLDB ’02, pp.
346–357, VLDB Endowment.
[18] Michael Mitzenmacher, The Power of Two Choices
in Randomized Load Balancing, Ph.D. thesis, UC
Berkeley, 1996.
[19] Dong Zhou, Bin Fan, Hyeontaek Lim, David G.
Andersen, and Michael Kaminsky, “Scalable,
high performance ethernet forwarding lookup,”
in Proc. 9th International Conference on emerg-
ing Networking EXperiments and Technologies
(CoNEXT), Dec. 2013.
[20] Bin Fan, David G. Andersen, and Michael Kamin-
sky, “MemC3: compact and concurrent memcache
with dumber caching and smarter hashing,” Proc.
10th USENIX NSDI, 2013.
[21] Yandong Mao, Eddie Kohler, and Robert Tappan
Morris, “Cache craftiness for fast multicore key-
value storage,” in Proceedings of the 7th ACM
European Conference on Computer Systems, New
York, NY, USA, 2012, EuroSys ’12, pp. 183–196,
ACM.
[22] Timothy Zhu, Anshul Gandhi, Mor Harchol-Balter,
and Michael A. Kozuch, “Saving cash by us-
ing less cache,” in Workshop on Hot Topics in
Cloud Computing, Berkeley, CA, 2012, HotCloud
12, USENIX.
[23] Patrick Stuedi, Animesh Trivedi, and Bernard Met-
zler, “Wimpy nodes with 10GbE: leveraging one-
sided operations in soft-RDMA to boost mem-
cached,” in USENIX Annual Technical Conference,
Boston, MA, 2012, USENIX ATC 12, pp. 347–353,
USENIX.
[24] Michaela Blott, Kimon Karras, Ling Liu, Kees
Vissers, Jeremia Bär, and Zsolt István, “Achieving
10Gbps line-rate key-value stores with FPGAs,” in
Presented as part of the 5th USENIX Workshop
on Hot Topics in Cloud Computing, Berkeley, CA,
2013, USENIX.
[25] Bin Fan, Hyeontaek Lim, David G. Andersen, and
Michael Kaminsky, “Small cache, big effect: Prov-
able load balancing for randomly partitioned clus-
ter services,” in ACM Symposium on Cloud Com-
puting, New York, NY, USA, 2011, SOCC ’11, pp.
23:1–23:12, ACM.
[26] Yu-Ju Hong and Mithuna Thottethodi, “Under-
standing and mitigating the impact of load imbal-
ance in the memory caching tier,” in ACM Sympo-
sium on Cloud Computing, New York, NY, USA,
2013, SOCC ’13, pp. 13:1–13:17, ACM.