IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING i
Analyzing In-Memory NoSQL Landscape
Masoud Hemmatpour†, Bartolomeo Montrucchio†,
Maurizio Rebaudengo†, and Mohammad Sadoghi ‡
†Dipartimento di Automatica e Informatica, Politecnico di Torino
‡Computer Science Department, University of California Davis
Abstract—In-memory key-value stores have quickly become a key enabling technology to build high-performance applications that
must cope with massively distributed workloads. In-memory key-value stores (also referred to as NoSQL) primarily aim to offer
low-latency and high-throughput data access which motivates the rapid adoption of modern network cards such as Remote Direct
Memory Access (RDMA). In this paper, we present the fundamental design principles for exploiting RDMAs in modern NoSQL
systems. Moreover, we describe a breakdown analysis of state-of-the-art RDMA-based in-memory NoSQL systems regarding indexing, data consistency, and communication protocols. In addition, we compare traditional in-memory NoSQL systems with
their RDMA-enabled counterparts. Finally, we present a comprehensive analysis and evaluation of the existing systems based on a
wide range of conﬁgurations such as the number of clients, real-world request distributions, and workload read-write ratios.
Index Terms—RDMA, memory, key-value store, big data, high performance, cluster, parallel programming.
1 INTRODUCTION
As demand for big data analytics grows every day, com-
panies have become aware of the critical role of real-time
data-driven decision making to gain a competitive edge.
This creates a challenge for companies needing to accelerate (fine-grained) access to massively distributed data, in particular for those dealing with online services. A distributed key-value store partitions data across many compute nodes and offers a flexible data model that achieves higher performance at the cost of weaker consistency models. Starting
in the mid-2000s, numerous commercial key-value stores have emerged, each with its own unique characteristics, such as Google Bigtable, Amazon Dynamo, and Facebook Cassandra, enabling the management of massively distributed data at unprecedented scale, which simply was not feasible with
traditional relational database systems running on commod-
ity hardware. These systems have become critical for large-
scale applications, such as social networks, real-time processing, and recommendation engines, to
achieve higher performance.
Given the rise of key-value stores—broadly classified as NoSQL—over the last decade, there have been two major efforts to accelerate NoSQL platforms using modern hardware. The first approach was to employ storage-class memory, such as solid-state disks (SSDs), focusing on exploiting SSDs as a cache between main memory and disks, as in CaSSanDra, cassandraSSD, Flashstore, Flashcache, and BufferHash. The second approach
has been to capitalize on the ever-increasing size of the
main memory in each machine. These machines can now
be connected through fast optical interfaces to form a massive virtual shared memory space at an affordable cost that continues to decline, as in RAMCloud, Memcached, MICA, SILT, and Redis.
Much research has been carried out to improve communication performance, either by optimizing existing protocols or by inventing new communication standards. A great deal of work on high-performance communication, such as Arsenic Gigabit Ethernet, U-Net, VIA, Myricom/CSPi's Myrinet, and Quadrics's QSNET, has led to modern high-speed networks including InfiniBand, RoCE, iWARP, and Intel's Omni-Path, which support Remote Direct Memory Access
(RDMA). RDMA blurs the boundaries between machines by creating a virtual distributed shared memory among the connected nodes, thereby substantially reducing communication and processing on the host machine. Through RDMA,
clients can now directly access remote memory without
the need to invoke the NoSQL’s traditional client-server
model. This motivates the NoSQL community to invest in
developing purely in-memory key-value stores with RDMA
capability, such as HydraDB , Herd , Pilaf ,
DrTM, and FaRM. RDMA-capable protocols (e.g., InfiniBand) support legacy socket applications through IP over InfiniBand (IPoIB); however, running existing in-memory systems on top of it cannot efficiently exploit the benefits of the infrastructure. Thus, existing in-memory key-value stores strive to reduce latency and achieve higher performance by exploiting RDMA operations.
Recently, we have witnessed the emergence of many
NoSQL systems, such as mongoDB , HBase , VoltDB
, Cassandra , Voldemort , Redis , Memcached
, Pilaf , HERD , HydraDB , FaRM , DrTM
, and Nessie. Therefore, there have been several efforts to evaluate NoSQL systems. Some of these studies performed experimental analyses, while others adopted case studies in their evaluations.
A. Gandini et al. evaluate commonly used NoSQL databases: Cassandra, MongoDB, and HBase. T. Rabl et al. present a comprehensive performance evaluation of six key-value stores: Cassandra, Voldemort, HBase, VoltDB, Redis, and MySQL,
focusing on maximum throughput. J. Klein et al. evaluate the performance of Riak, Cassandra, and MongoDB in a case study of an Electronic Health Record (EHR) system for delivering healthcare support to patients. P. Cudré-Mauroux et al. compare NoSQL systems for the sake of data exchange on the web according to the Resource Description Framework (RDF) standard. J. R. Lourenço et al. evaluate the quality-attribute requirements of NoSQL systems regarding their suitability for the design of enterprise systems. H. Zhang focuses on the design principles and implementation of efficient, high-performance in-memory data management and processing systems. W. Cao evaluates Redis and Memcached; that evaluation focused on performance challenges in internal data structures and memory allocators. B. Atikoglu et al. analyze the workloads of Facebook's Memcached deployment according to size, rate, patterns, and use cases.
To the best of our knowledge, there is no comprehensive study of widely deployed in-memory key-value stores
with RDMA acceleration. We outline a set of principles
for exploiting RDMA in key-value stores along with in-
depth analysis and lessons learned. Our work corroborates
previous findings on RDMA performance issues and highlights the impact of the choice of RDMA operations on the overall performance. Moreover, we compare the RDMA-based systems with legacy ones, namely, those that do not rely on any hardware acceleration in their design, such as storage-class memory or advanced network cards. Furthermore, the following parameters are considered when selecting legacy systems: language (e.g., C), general-purpose design, reliance on a standard communication protocol (i.e., TCP), a single-node setup to analyze throughput, and
support of set/get operations. In our study, we describe
Memcached, a system that can be deployed for caching heavily used objects, and Redis, a system that can be used as a database, cache, and message broker, as legacy systems. We chose Memcached and Redis since these two systems have been extensively studied in the literature and are commonplace in industry. For example, Facebook uses Memcached, and Cisco uses Redis Enterprise as its primary database for IoT solutions.
In this paper, we present a one-of-a-kind comprehensive
study of modern NoSQL systems including HydraDB ,
Pilaf , HERD , FaRM , DrTM , Memcached
, and Redis . We review the key performance chal-
lenges of how to exploit RDMA in NoSQL systems in
Section 2. We provide an in-depth review of modern key-value stores in a unified representation that reveals the architectural differences along with the strengths and weaknesses of
each system in Section 3. We present a comprehensive evalu-
ation methodology and extensive analysis of state-of-the-art
approaches in Section 4; our implementation is open-sourced and publicly available. Finally, we draw conclusions in the closing section.
2 RDMA PERFORMANCE CHALLENGE ANALYSIS
The InfiniBand architecture specification describes the functions, called verbs, used to configure, manage, and operate an InfiniBand adapter. This paper focuses on InfiniBand,
which is a commonly used protocol in commodity servers
. InﬁniBand is an advanced network protocol with low
latency and high bandwidth, as well as with advanced
features such as RDMA, atomic operations, multicast, QoS,
and congestion control. InﬁniBand Network Interface Cards
(NICs) follow two approaches in packet processing: Onload
or Ofﬂoad (e.g., Mellanox ofﬂoading and Qlogic onloading)
. In the Onload approach, packets are processed by the host CPU, while in the Offload approach packets are processed by the NIC processor.
RDMA supports two memory semantics: SEND/RECV (two-sided verbs) and READ/WRITE (one-sided verbs). The READ and WRITE semantics read/write data from/to a remote node by exploiting the DMA engine of the remote machine (i.e., bypassing its CPU and kernel). The SEND and RECV semantics send/receive a message through the CPU of the remote node. To adopt RDMA semantics
as efﬁciently as possible, multiple programming models
have emerged, such as Remote Memory Access (RMA) and
Partitioned Global Address Space (PGAS).
The asynchronous nature of InﬁniBand architecture al-
lows an application to queue up a series of requests to be
executed by the adapter. These queues are created in pairs,
called Queue Pairs (QPs), for send and receive operations. The application submits a Work Queue Element (WQE) to the appropriate queue. The channel adapter executes WQEs in FIFO order. When the channel adapter
completes a WQE, a Completion Queue Element (CQE) is
enqueued on a Completion Queue (CQ).
Many studies demonstrate that adopting one-sided verbs outperforms two-sided verbs by eliminating the overhead of CPU involvement on the remote node; however, there are different opinions on the best strategy. Since a one-sided verb (i.e., READ) can require two round-trip times (RTTs) for a remote memory access, some work advocates using two-sided verbs combined with a local memory access on the remote machine. Moreover, it should be considered that the latency of the two verb classes (i.e., READ/WRITE and SEND/RECV) changes as adapters evolve, so measuring the cost of each semantic is necessary before designing an RDMA-based system.
In high-performance applications, not only is board-to-board communication critical, but core-to-core, CPU-to-CPU, and I/O communications also require careful investigation to explore the performance tradeoffs. For example, hardware message passing among cores, the CPU-to-CPU Intel QuickPath Interconnect (QPI) and AMD HyperTransport (HT), and I/O interconnects such as PCI Express and Intel Data Direct I/O all strive to enhance the communication performance.
Fig. 1 shows the schematic view of a node equipped with
InﬁniBand card , , . It shows the architecture of
request queues in an InﬁniBand channel adapter and the
components of the system. Although the network architec-
ture in the cluster affects the contention for resources, its
impacts are outside the scope of this paper.
Fig. 1. System view of a compute node with InﬁniBand adapter.
Although remote memory access through RDMA operations is quite fast compared to traditional network operations, it is still substantially slower than a local memory access. Thus, better management of the local memory can influence the performance. The main RDMA challenges are
described in the following.
Cache miss An experiment measuring the requester's cache-miss rate, once with RDMA messages using different memory addresses (i.e., the cache is never hit) and once with messages reusing the same memory address (i.e., the cache is always hit), shows that a high cache-miss rate on the requester can reduce the performance.
Data Alignment Since NICs work more efficiently on aligned data, using aligned message sizes can improve the performance of RDMA systems.
NUMA Affinity The distance between processor and data plays a critical role in performance. Generally speaking, better latency can be achieved by confining memory accesses to the local NUMA node. However, an appropriate deployment of processes and data can also exploit the memory bandwidth of other NUMA nodes. Since operating systems delegate the burden of NUMA-related issues to the application, the designer must be aware of the distribution of data in main memory in order to reduce the latency and increase the bandwidth of memory accesses.
The impact of the NUMA afﬁnity on the RDMA applications
has been studied in the literature .
Memory Prefetching Software prefetching is a classi-
cal technique for overcoming the speed gap between the
processor and the memory , . Software prefetching
in RDMA-based applications can improve the performance
. Modern CPUs equipped with automated prefetching (e.g., Smart Prefetch) predict and preload potentially needed data.
2.2 Host Bus communication
PCI Express (PCIe) technology is a ubiquitous, scalable, high-speed, serialized protocol standard for device communication and is mainly a replacement for the PCI-X bus. Both PCIe and PCI-X allow a device to initiate an independent communication, called first-party DMA.
InfiniBand vendors nowadays adopt the PCIe bus family for host communication due to its high-speed serial and dedicated links. PCIe generations have evolved in the speed of the link (i.e., lane speed and number of lanes), encoding, traffic, and packet overheads. PCIe
provides a root-tree based network topology, where all I/Os
are connected, through switches and bridges, to a root
complex. The root complex connects one or more processors
and their associated memory subsystems.
Any movement through PCIe imposes a performance overhead, so it is important to understand the CPU and
InﬁniBand NIC interaction for high-performance applica-
tions. Aside from protocol and trafﬁc overhead, maximum
payload size and maximum read request size may impact the performance of a PCIe system; these parameters might limit the transaction rate over PCIe, and tuning them has an impact on the performance of InfiniBand devices. Moreover, interrupt request affinity on PCIe can improve application scalability and latency.
Profiling PCIe transactions is important to obtain a comprehensive view of the CPU-NIC interaction. Modern CPUs provide a list of events in the Performance Monitoring
Unit (PMU) to measure micro-architectural events of PCIe.
For example, the events PCIeRdCur and PCIeItoM monitor
DMA reads and writes from PCIe, respectively.
The CPU can submit a work request to the NIC either by writing to a memory-mapped I/O (MMIO) register (i.e., BlueFlame in Mellanox adapters) or by sending a list of work requests (i.e., Doorbell). It is recommended to use BlueFlame under light load and Doorbell in high-bandwidth scenarios. Each PCIe device equipped with a DMA engine can access main memory independently: first, the device sends a memory Read Request to the root complex; then, the root complex returns the desired memory through completions with data packets. Comparing the CPU MMIO overhead with the NIC DMA memory access reveals that the best trade-off is obtained by reducing the number of MMIOs.
The cache coherency between the NIC and CPU is a critical issue that is not covered by the InfiniBand specification and is fully vendor specific; it has been shown that there is single cache-line coherency for Mellanox adapters on x86 processors. In addition, the memory ordering of READ/WRITE verbs over PCIe is an important concern that is also vendor specific.
2.3 NIC Memory
The first InfiniBand products (developed by Mellanox) provided memory on the NIC board. However, recently the essential resources have been moved to the host memory and only a cache memory remains on the board, since the memory access time of the host does not have a significant impact on the performance.
The NIC cache memory, particularly in Mellanox adapters, serves several purposes, such as maintaining the page tables of registered memory (to translate virtual to physical addresses) or queue pair (QP) data (i.e., state and elements). Since the adapter has limited resources, optimizing their use can improve the performance. Adopting larger memory pages reduces the number of page-table entries, and hence the fetching of page-table entries from system memory to the NIC (i.e., page faults). Reducing the number of queue pairs can reduce the memory usage. In addition, the work-request submission rate is quite important for avoiding cache misses in the NIC.
2.4 RDMA Features
Choosing the right RDMA features is critical to the scal-
ability and the reliability of the application. In this section,
several RDMA features and their impact on the performance
are described in more detail.
Transport Type RDMA supports reliable and unreliable types of connections. A reliable connection guarantees the delivery and the ordering of packets through acknowledgments from the responder, whereas an unreliable connection guarantees neither the delivery nor the ordering of messages.
RDMA provides two types of transports: connected and unconnected (datagram). In connected mode, each QP is connected to exactly one QP of a remote node, unlike in unconnected (datagram) mode. Since each connection between two nodes requires two QPs, one for the requester and one for the responder, the number of QPs increases 2× with the number of connections. There are different approaches to reduce the number of QPs. In a reliable connection, threads can share QPs to reduce the QP memory footprint; however, sharing QPs reduces CPU efficiency because threads contend for the cache lines of the QP buffers across CPU cores. In Annex A14 of the InfiniBand specification 1.2, the eXtended Reliable Connection (XRC) was introduced, which allows the multiple connections of a process on a node to be reduced to one.
A shared receive queue (SRQ) shares one receive queue among multiple connections, reducing the number of queues. SRQ solves the two-sided communication synchronization problem between the requester and the responder, which previously was addressed by backoff at the requester and by over-provisioning receive WQEs at the responder. SRQ solves this problem because an incoming message on any QP associated with an SRQ can consume the next available WQE to receive the incoming data.
Inline Data Inlining the data into the work queue element (WQE) eliminates the overhead of a DMA memory access for the payload after submitting a WQE, thereby improving performance. However, inline messages are limited in payload size.
Message Size The size of the message can become a bottleneck in two places: the host bus (i.e., PCIe) or the Path Maximum Transfer Unit (PMTU). The maximum payload size and the maximum read request size of the PCIe communication affect the performance of memory accesses from the InfiniBand adapter, as they determine the number of completions with data packets required; a higher read-request size increases the efficiency of packet transfers. When a (reliable or unreliable connected) QP is created, the PMTU is fixed for the queue, and if a message to be sent is larger than the PMTU of the queue, the message is divided into multiple packets. However, if InfiniBand receives a message larger than its port Maximum Transfer Unit (MTU), it silently drops the message.
Reducing the number of cache lines used by a WQE can improve throughput drastically. Roughly speaking, increasing the size of the message increases the communication latency and decreases the performance.
Completion Detection When the InfiniBand adapter completes a work request, it enqueues a CQE in the completion queue. Two main approaches can be adopted to detect the completion of a work request: busy polling and event handling. In the former, the application polls the completion queue to receive a CQE; this approach has high CPU utilization, but the cost per poll is quite low since the operating system is bypassed. In the latter, a notification is delivered when a CQE arrives in the completion queue; this approach is much better in terms of CPU utilization but requires operating system intervention. Busy polling outperforms event handling in all RDMA operations; however, for large message sizes, the two methods converge.
Completion Signaling A work request can be posted signaled or unsignaled. If a work request is signaled, a work completion element is generated once it completes. An unsignaled work request generates no element in the completion queue and consequently incurs no extra overhead. However, posting only unsignaled requests leads to resource depletion, so a signaled work request must be posted periodically to release the resources held by unsignaled work requests. Finding the best period for posting a signaled work request is a challenging task; the send queue depth and the message size can be considered as parameters in finding the best period.
Batching Sending a list of requests from the CPU to the NIC instead of one request at a time is called batching. The advantage of batching is reducing the number of CPU-NIC and network interactions by coalescing requests. However, hardware limitations do not allow batching requests of different QPs. The batching scheme is more appropriate for the datagram transport due to its intuitive multicast support, so it can be used to batch requests over datagram connections to multiple remote QPs. In addition, packing multiple requests into one message is another approach that allows the requester to send several requests to a specific responder and amortize communication overheads.
Atomic operations RDMA intrinsically provides a shared memory region in a distributed environment. Concurrent accesses to the same memory region must be handled in order to avoid race conditions. RDMA supports two primitives that are atomic with respect to other RDMA operations (not only atomic ones) on the same NIC: fetch-and-add and compare-and-swap. These operations rely on an internal locking mechanism of the NIC. The performance of these primitives depends on both the NIC's implementation of atomics and the level of parallelism. The atomicity of RDMA CAS is hardware specific, with different granularities (i.e., IBV_ATOMIC_HCA and IBV_ATOMIC_GLOB). However, there is no concurrency control between local operations and remote RDMA operations. To solve this concurrency problem, concurrent data structures have been proposed, or special hardware instructions providing strong atomicity (i.e., hardware transactional memory in Intel processors) are exploited.
RDMA support Protocols InfiniBand differs from Ethernet in several aspects. There are several alternative protocols supporting RDMA, including RoCE and iWARP. Adopting an appropriate protocol requires awareness of their advantages and drawbacks. iWARP is a complex protocol published by the Internet Engineering Task Force (IETF). It only supports the reliable connected transport (i.e., it does not support multicast) over a generic, non-lossless network; it was designed to map TCP onto RDMA semantics. On the contrary, RoCE is an Ethernet-based RDMA solution published by the InfiniBand Trade Association supporting reliable and unreliable transports over a lossless network. Several studies have compared these protocols over time; the consensus is that iWARP is unable to achieve the same performance as InfiniBand and RoCE.
There are also protocols that use technologies such as Intel DPDK to implement RDMA verbs on top of them. However, they are not as performant as the native RDMA implementation in the NIC: while the Data Plane Development Kit (DPDK) also provides kernel bypass that reduces the reliance on the CPU, Mellanox claims that it does not go far enough.
Wire Speed Mellanox InfiniBand has been produced at five speeds: Single Data Rate (SDR), Double Data Rate (DDR), Quad Data Rate (QDR), Fourteen Data Rate (FDR), and Enhanced Data Rate (EDR), offering 2.5, 5, 10, 14, and 25 Gbps per lane, respectively; enhancing the link speed increases the performance and decreases the latency.
Adapter Connection InfiniBand adapters are based on PCI-X or PCIe. Different studies show that PCIe-based NICs outperform PCI-X ones. InfiniBand host communication has also started to be integrated on the processor chip.
2.5 Application level issues
Aside from all the performance issues presented, an application must be suited to the RDMA design, otherwise the performance gain is negligible. RDMA is appropriate for applications that require shared memory access with low latency, long connection durations, and consistently sized messages. The application-level parameters that impact the performance are presented in the following.
Memory registration Memory registration allows the RDMA device to read and write data from/to a memory region. The cost of memory registration can be divided into three parts: (1) mapping virtual to physical memory, (2) pinning the memory region, and (3) registering the memory region with the NIC driver. Memory registration is a costly operation because of the kernel call and the write operation to the NIC driver. Pre-registering the memory can eliminate this cost at runtime; however, if applications cannot store their data in pre-registered memory, the data needs to be copied into a registered memory region.
The cost of copying memory versus registering new memory depends on the memory size and the power of the host; the cost of memory registration can even exceed that of the RDMA operation itself (i.e., WRITE). Different techniques have therefore been proposed to mitigate this cost. When a new memory region is registered and the corresponding pages are not resident in memory, the cost of page faults is added to the cost of registration, so ensuring the residency of the pages before registering a memory region can decrease its cost. Allocating memory from kernel space (i.e., get_free_pages) and registering it in the kernel (e.g., ib_reg_phys_mr) instead of resorting to user space can decrease the memory registration latency; consequently, the first two steps of memory registration are eliminated, since kernel memory is physically contiguous and never swapped out. Since submitting a work request is not a blocking function, overlapping memory registration with communication can hide the cost of registration; yet comparing the cost of an RTT with the cost of memory registration reveals that this fully depends on the size of the registered region. Parallel memory registration can also hide the cost. This technique is particularly effective when pages are resident in memory.
Data Structures The hash table is a popular data structure in multi-client, multi-threaded server environments due to its fast direct lookups; however, hash collisions are inevitable, which leads to an increased number of probes. A higher number of probes in a hash table naturally increases the cost of a lookup and pollutes the CPU cache with irrelevant keys. Many modern RDMA-based NoSQL systems therefore employ Cuckoo or Hopscotch hashing. These hash tables strive for a constant lookup cost by keeping each key in a bounded neighborhood of its original hash position.
Pipelining Pipelining allows simultaneous (sub)tasks at different stages. For example, a pipeline can be used to overlap memory registration and communication in order to hide the cost of registration. This approach can improve the performance depending on the size of the memory. However, a comparison of a multi-threaded request-handling pipeline with single-threaded request handling shows that the multi-threaded variant harms the performance.
Flow Isolation Latency-sensitive and throughput-sensitive applications may need to share network resources (i.e., the NIC) in a large-scale environment. The application deployment is critical in such a scenario to preserve the performance isolation between the two types of applications: in the presence of both types, a latency-sensitive flow will suffer, so throughput-sensitive and latency-sensitive flows are better kept isolated.
In-bound vs. Out-bound Requests can be categorized as inbound or outbound according to the requester. Sending requests from multiple clients to one server is called inbound, and sending requests from one server to multiple clients is called outbound. Before designing an RDMA-based application, measuring inbound and outbound throughput is important: outbound is bottlenecked by PCIe and the NIC processing power, while inbound is bottlenecked by the NIC processing power and the InfiniBand bandwidth.
2.6 RDMA suitability
There are plenty of studies showing the benefits of RDMA over legacy protocols (i.e., TCP/UDP). However, there is a trade-off in using RDMA. We aim to compare RDMA with legacy protocols to identify the cost of using RDMA in the pursuit of higher performance.
RDMA can perform one operation per Round Trip Time (RTT), whereas traditional protocols provide more flexibility by fulfilling multiple operations; this makes RDMA more suitable for environments with single, quick operations. Furthermore, RDMA is more suitable for small, similarly sized messages. Additionally, exploiting RDMA with dynamic connections is not beneficial due to the heavy initial cost of establishing RDMA connections. Moreover, although RDMA provides high-speed communication, it limits scaling in both distance and size, meaning that the distance among nodes and the number of nodes cannot be arbitrarily large. Generally speaking, programmability in RDMA
is more involved in comparison with legacy protocols, since the application layer is required to manage the concurrency control over the distributed shared memory. RoCE and InfiniBand require a lossless network; this limitation brings management, scalability, and deployment issues, which are the main challenges in using these protocols. iWARP does not require such an assumption; while this is great for generality, it results in more complex NICs, which leads to performance degradation in comparison with RoCE and InfiniBand. In addition, there are software implementations of RDMA, such as SoftRDMA, which use the Data Plane Development Kit (DPDK) to implement RDMA verbs. DPDK is an Intel technology that provides kernel bypass and reduces the reliance on the CPU; however, it is not as performant as the native RDMA implementation in the NIC.
Programmability with legacy protocols (i.e., TCP/IP) is simpler than with RDMA. Moreover, TCP/IP is resilient to packet loss, while RoCE and InfiniBand are not. Generally speaking, legacy protocols will be replaced by RDMA if the problem is addressing volume and scaling up, and not ultra-low latency.
3 NOSQL SYSTEMS
NoSQL systems, in particular in-memory key-value stores, are vital in accelerating memory-centric and disk-centric distributed systems. Modern in-memory key-value stores adopt RDMA to alleviate the communication and remote-processing overheads. However, prior work identifies features that RDMA still lacks to meet the needs of future data-processing applications such as in-memory key-value stores, introducing remote memory copy and conditional read to enable the client to remotely copy the data and to remotely filter the data before delivering it to the client. Moreover, introducing dereferencing in RDMA can accelerate existing systems by exploiting an indirection mechanism and omitting the cost of extra operations. Next, a comprehensive review of state-of-the-art RDMA-based key-value stores, examining indexing, consistency models, and communication protocols, is provided.
Fig. 2. Schematic view of HydraDB.
3.1 HydraDB
HydraDB is a general-purpose in-memory key-value store designed for low-latency and high-availability environments. HydraDB partitions data into instances coupled with a single-threaded, partitionable execution model, called a shard. Each shard is exclusively assigned to a core, which fully exploits the computational power and the cache of that core.
Each shard maintains a cache-friendly hash table holding the memory locations of the key-value pairs. This hash table is not visible to the clients; it is a compact hash table in which each bucket is aligned to 64 bytes and contains 7 slots plus a header pointing to an extended slot, in order to avoid linked-list traversal and its excessive pointer dereferences. Each slot contains a signature of the key-value pair and a pointer. The server processes a request if its signature (i.e., a short hash of the key) matches the requested key. Values are stored with a word-size flag that indicates the validity of the content and prevents clients from reading stale data. Fig. 2 shows the schematic view of HydraDB, including the indexing, the communication protocol, and the operations required for Get and Put transactions.
HydraDB supports single-statement Get and Put transactions. Each client locates key-value pairs according to the consistent hashing algorithm. Clients and servers use WRITE to send and receive requests and responses, because WRITE outperforms the other RDMA communication verbs. In HydraDB, the shard process keeps polling a memory area to detect newly arrived messages (i.e., sustained polling). Each shard uses a single thread to poll the request buffers. In the case of a Get, the shard first finds the corresponding key-value address in the compact hash table. Then, it replies to the client with the address of the key-value pair, so for subsequent requests for the same key the client exploits READ based on the cached address.
The Put operation fully relies on the server: the server finds the key in the compact hash table, then atomically flips the word-size flag of the key-value pair to notify readers about the update. Since a remote read through RDMA may conflict with a local write, the shard exploits out-of-place updates with a lease-based timing mechanism to guarantee data consistency and memory reclamation, respectively.
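To make the bucket layout concrete, the sketch below models HydraDB's compact hash bucket: 7 signature-plus-pointer slots per 64-byte bucket, with a header pointing to an extended bucket on overflow instead of a linked list. All names and the signature width are illustrative, not taken from HydraDB's code.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

SLOTS_PER_BUCKET = 7  # a 64-byte bucket holds 7 slots plus a header slot

def signature(key: bytes) -> int:
    """Short hash of the key kept in the slot (width chosen for illustration)."""
    return int.from_bytes(hashlib.md5(key).digest()[:4], "little")

@dataclass
class Slot:
    sig: int   # signature of the key-value pair
    addr: int  # memory location of the key-value pair

class CompactBucket:
    """One bucket of the compact hash table; overflow goes to an extended
    bucket reached through the header, avoiding per-slot pointer chasing."""
    def __init__(self) -> None:
        self.slots: List[Optional[Slot]] = [None] * SLOTS_PER_BUCKET
        self.extended: Optional["CompactBucket"] = None

    def insert(self, key: bytes, addr: int) -> None:
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = Slot(signature(key), addr)
                return
        if self.extended is None:          # all 7 slots full: extend once
            self.extended = CompactBucket()
        self.extended.insert(key, addr)

    def lookup(self, key: bytes) -> Optional[int]:
        sig = signature(key)
        for slot in self.slots:
            if slot is not None and slot.sig == sig:
                return slot.addr           # server then verifies the full key
        return self.extended.lookup(key) if self.extended else None
```

Because only a short signature is stored, the server must still verify the full key at the returned address, exactly as the text describes.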
Fig. 3. Schematic view of Pilaf system.
3.2 Pilaf
Pilaf is a distributed in-memory key-value store exploiting RDMA operations with the goal of reducing latency and improving the performance of traditional in-memory key-value stores (i.e., Redis and Memcached). As Fig. 3 shows, Pilaf exposes two distinct memory regions to the client: a variable-size extent area and a fixed-size self-verifying hash table. The extent area stores the actual values, while each bucket in the self-verifying hash table keeps the address of the key-value pair, its checksum, its value size, and a checksum of the bucket itself. Pilaf supports single-statement Get and Put transactions; unlike Get, the Put operation is fully server-driven. The Get operation probes the self-verifying hash table through RDMA READ to find the bucket for the key with valid (i.e., in-use) content. Since the address of the value is stored in the bucket, the client can then read the value from the extent area. The Put operation fully relies on the server to avoid write-write race conditions.
The Pilaf client and server use SEND messages to exchange the request and response of a Put operation. Once the Pilaf server receives a Put, it first allocates a new memory location and writes the new value to it. Then it updates the corresponding bucket in the self-verifying hash table and invalidates the previous key-value content. Each bucket is equipped with a checksum over the value to guarantee data consistency: invalidating the previous content notifies clients about the update through the inconsistency between the value they read and its checksum in the corresponding bucket. Moreover, each bucket carries a checksum over the bucket itself to resolve the race condition between a client's read and the server's update of the same bucket. Once a client detects such an inconsistency, it restarts the lookup in the hash table to retrieve the updated address of the key and its content.
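The self-verifying scheme can be sketched as follows. This is an illustration of the idea, not Pilaf's actual layout: CRC32 stands in for Pilaf's CRC64, field widths and function names are assumptions, and the two lambdas in the test stand in for the two RDMA READs.

```python
import struct
import zlib

def make_bucket(addr: int, size: int, value: bytes) -> bytes:
    """Server side: pack value address, value size, and a CRC of the value,
    then append a CRC over the bucket fields themselves (self-verifying)."""
    body = struct.pack("<QQI", addr, size, zlib.crc32(value))
    return body + struct.pack("<I", zlib.crc32(body))

def parse_bucket(raw: bytes):
    """Client side: return (addr, size, value_crc), or None when the bucket
    CRC mismatches, i.e., the server was updating the bucket mid-read."""
    body, tail = raw[:-4], raw[-4:]
    if zlib.crc32(body) != struct.unpack("<I", tail)[0]:
        return None
    return struct.unpack("<QQI", body)

def get(read_bucket, read_extent, max_retries: int = 16):
    """Client Get: one READ for the bucket, one READ for the value; restart
    when either checksum betrays a concurrent server update."""
    for _ in range(max_retries):
        parsed = parse_bucket(read_bucket())
        if parsed is None:
            continue                         # torn bucket: re-read it
        addr, size, value_crc = parsed
        value = read_extent(addr, size)
        if zlib.crc32(value) == value_crc:
            return value                     # consistent snapshot
    raise RuntimeError("too much contention")
```

The two checksums play different roles: the value CRC detects a stale value read from the extent area, while the bucket CRC detects a bucket torn by a concurrent server update.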
3.3 HERD
HERD is an in-memory key-value store designed for efficient use of RDMA operations. HERD adopts a simple lossy associative index and a circular log for storing values (i.e., the MICA back-end data structures), illustrated in Fig. 4.
Fig. 4. Schematic view of HERD system.
The clients write their requests (i.e., Get, Put) to the server using WRITE over an unreliable connection, and the server replies using SEND over an unreliable datagram. These verbs are adopted because WRITE and SEND scale well for inbound and outbound communication, respectively. HERD's designers claim that single-RTT communication (i.e., WRITE, SEND) combined with a memory lookup can outperform multi-RTT communication (i.e., READ).
Although zero packet loss has been observed over 100 trillion packets on unreliable datagrams, there is a belief that unreliable transports may bring unreliability to enterprise applications. The HERD designers later proposed FaSST, a transactional in-memory store with a loss-detection algorithm over unreliable connections. HERD applies several optimizations, such as request windows, prefetching, selective signaling, huge pages, batching, and multi-port requests, in order to enhance performance.
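The MICA-style back end mentioned above can be sketched as a lossy associative index over a circular log. This is an illustrative toy (one-way index, byte-sum hash, made-up sizes), not HERD's implementation: collisions silently evict the previous entry, and a wrapping log silently overwrites the oldest values, which is exactly what makes the index "lossy".

```python
LOG_SIZE = 4096     # circular log capacity in bytes (illustrative)
INDEX_BUCKETS = 64  # index size (illustrative)

class LossyStore:
    """Sketch of a MICA-like back end: a lossy associative index over a
    circular log; old entries are lost on index eviction or log wrap."""
    def __init__(self) -> None:
        self.log = bytearray(LOG_SIZE)
        self.head = 0
        self.index = {}  # bucket -> (key, offset, length); one-way for brevity

    def _bucket(self, key: bytes) -> int:
        # toy hash for the sketch; a real system uses a stronger key hash
        return sum(key) % INDEX_BUCKETS

    def put(self, key: bytes, value: bytes) -> None:
        assert len(value) <= LOG_SIZE
        off, end = self.head, self.head + len(value)
        if end <= LOG_SIZE:
            self.log[off:end] = value
        else:                            # wrap, overwriting the oldest data
            split = LOG_SIZE - off
            self.log[off:] = value[:split]
            self.log[:end - LOG_SIZE] = value[split:]
        self.head = end % LOG_SIZE
        self.index[self._bucket(key)] = (key, off, len(value))  # may evict

    def get(self, key: bytes):
        entry = self.index.get(self._bucket(key))
        if entry is None or entry[0] != key:
            return None                  # never inserted, or lost to eviction
        _, off, n = entry
        if off + n <= LOG_SIZE:
            return bytes(self.log[off:off + n])
        return bytes(self.log[off:]) + bytes(self.log[:off + n - LOG_SIZE])
```

Note that an entry whose log region has been wrapped over returns stale bytes; detecting this (as real systems do, via offsets) is omitted to keep the sketch short.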
3.4 FaRM
FaRM is a distributed in-memory transaction processing system designed to improve the latency and throughput of TCP/IP communication. FaRM exploits a symmetric model in which each machine uses its local memory to store data; this model leverages local memory and the CPU, which is otherwise mostly idle. FaRM adopts two memory areas for storing and handling transactions: a chained associative hopscotch hash table and a key-value store area. FaRM employs a modified version of hopscotch hashing that adds a chain to the bucket. The chain keeps new data in the bucket instead of resizing the table in an overflow situation; however, FaRM attempts to remove this chain by moving the last element of the chain into an available slot. Each bucket in the hopscotch hash table consists of an incarnation, the address of the value, and its size. FaRM stores small values directly in the bucket and larger values in the key-value store area, keeping their addresses in the bucket. Values in FaRM are stored in a structure called an object. Each object consists of a header version (Vobj), a lock (L), an incarnation (I), and cache-line versions (Vc). The incarnation is used to determine validity in case the object has been removed; the lock, header version, and cache-line versions are used to guarantee data consistency. FaRM supports multi-statement Put and Get transactions. Fig. 5 shows the FaRM data model and the
Fig. 5. Schematic view of FaRM system.
interaction of the two transactions. The Get operation uses READ for the lookup process, which is performed by reading consecutive buckets according to the size of the neighborhood (H) in hopscotch hashing (H=6 in FaRM). In addition, the client checks that the lock is clear and that the header version matches all the cache-line versions, to guarantee data consistency; in case of failure, the client retries the read after a backoff time. In a Put transaction, the client fetches the desired key from the shard, then locks the key by sending a request to the shard. The shard sets the lock atomically (i.e., with compare-and-swap) and sends an acknowledgment to the client. Afterwards, the client validates the key by re-reading it and sends the update (i.e., the key and updated value) to the shard. The shard updates the value; finally, the cache-line and header versions are incremented and the object is unlocked. FaRM uses fences after each memory write to guarantee memory ordering in case of concurrent reads.
FaRM uses WRITE and the sustained-polling mechanism to exchange requests and responses, as HydraDB does. However, this approach is incompatible with out-of-order packet delivery: a retransmitted packet from an old message might overwrite memory and cause inconsistency in the execution.
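The version-based validation that a FaRM client performs after an RDMA READ can be sketched as below. This is a simplified illustration (the incarnation and fences are omitted, names are invented): a snapshot is accepted only if the object is unlocked and every cache-line version agrees with the header version.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LINE = 64  # cache-line size in bytes

@dataclass
class FarmObject:
    """Sketch of a FaRM object: header version (Vobj), lock (L), and one
    version (Vc) per cache line of the payload; incarnation omitted."""
    v_obj: int = 0
    locked: bool = False
    v_cache: List[int] = field(default_factory=lambda: [0, 0])
    lines: List[bytes] = field(default_factory=lambda: [b"\x00" * LINE] * 2)

def consistent_read(obj: FarmObject) -> Optional[bytes]:
    """Client-side validation after an RDMA READ: the snapshot is good only
    if the object is unlocked and every cache-line version equals Vobj."""
    if obj.locked:
        return None                      # writer in progress: back off, retry
    if any(vc != obj.v_obj for vc in obj.v_cache):
        return None                      # torn read across cache lines
    return b"".join(obj.lines)

def local_write(obj: FarmObject, lines: List[bytes]) -> None:
    """Server-side update: lock, write the payload, then bump the header and
    all cache-line versions together before unlocking."""
    obj.locked = True
    obj.lines = list(lines)
    obj.v_obj += 1
    obj.v_cache = [obj.v_obj] * len(lines)
    obj.locked = False
```

A reader that races with `local_write` sees either the lock set or a cache line whose version lags `v_obj`, and retries after backoff, matching the protocol described above.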
3.5 DrTM
DrTM and its successor DrTM+R are in-memory key-value systems that exploit the concurrency instructions provided by modern CPUs. DrTM adopts a traditional hash table with collocated memory regions for keys and values, called cluster hashing. Memory is managed in three different areas, called main header, indirect header, and entry, as shown in Fig. 6. The main header includes the incarnation, the key, and its offset. The indirect header has the same structure as the main header and is used when the slots of a bucket in the main header overflow; in this case, the last slot of the bucket points to an available indirect header. The value in DrTM is stored in a structure called an entry, containing the incarnation, value, version, and status. The status represents the state of the key when performing Get and Put.
DrTM supports multi-statement Get and Put transactions. To guarantee data consistency, it uses a lease-based scheme in combination with locks and Hardware Transactional Memory (HTM). At the beginning of a transaction, the executor locks the remote keys through the one-sided atomic RDMA verb (i.e., compare-and-swap) and fetches them; then a local HTM transaction is started. DrTM relies on the strong consistency and atomicity of RDMA and HTM: concurrent RDMA and HTM operations on the same memory abort the HTM transaction. Once the transaction is committed, all remote keys are updated and the locks are released.
Fig. 6. Schematic view of DrTM.
Fig. 7. Schematic view of Memcached.
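The remote-locking step can be sketched as below. HTM itself cannot be shown in portable code, so this illustration (invented names, a mutex standing in for the NIC's atomicity) models only the one-sided compare-and-swap that DrTM issues on a key's state word before entering the local HTM transaction.

```python
import threading

FREE, LOCKED = 0, 1

class RemoteWord:
    """A word of remote memory with an atomic compare-and-swap, standing in
    for the one-sided RDMA CAS verb used to lock remote keys."""
    def __init__(self, value: int = FREE) -> None:
        self.value = value
        self._mtx = threading.Lock()  # stands in for NIC-side atomicity

    def cas(self, expected: int, new: int) -> int:
        with self._mtx:
            old = self.value
            if old == expected:
                self.value = new
            return old  # RDMA CAS reports the previous value

def lock_remote_key(word: RemoteWord) -> bool:
    """Acquire the key's state word before starting the local HTM
    transaction; a failed CAS means another executor holds the key."""
    return word.cas(FREE, LOCKED) == FREE

def unlock_remote_key(word: RemoteWord) -> None:
    word.cas(LOCKED, FREE)
```

In DrTM the symmetric case is handled by HTM itself: an RDMA access to memory that a local HTM transaction is touching aborts that transaction, so the CAS and HTM together serialize conflicting executors.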
3.6 Memcached
Memcached is a legacy in-memory key-value store based on the TCP/IP protocol. It stores keys with their values in an internal hash table, as shown in Fig. 7. It uses a slab allocator to efficiently manage memory according to the sizes of the keys and values. Since Memcached supports multithreaded access to the hash table, the server orchestrates access through locks. Memcached supports single-statement Get and Put transactions. Once the server receives a Get request, it finds the appropriate bucket and acquires the lock; then it replies with the value, or with a miss if the key is absent. Once the server receives a Put request, it acquires the corresponding bucket lock and updates the item. Memcached keeps an expiration time (exptime) and the most recent access time (time) for each key, so that old items can be replaced by new ones in case of memory shortage, according to the Least Recently Used (LRU) algorithm.
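The exptime-plus-LRU replacement policy can be sketched in a few lines. This is a behavioral illustration (an item-count budget instead of Memcached's slab-based byte accounting, invented names), not Memcached's implementation:

```python
import time
from collections import OrderedDict

class LruCache:
    """Sketch of Memcached-style replacement: items carry an expiration time
    and are evicted least-recently-used when the budget is exceeded."""
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = OrderedDict()  # key -> (value, expire_at), LRU first

    def put(self, key, value, exptime: float = 60.0) -> None:
        self.items[key] = (value, time.monotonic() + exptime)
        self.items.move_to_end(key)             # most recently used
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)      # evict the LRU item

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None                         # miss
        value, expire_at = entry
        if time.monotonic() >= expire_at:
            del self.items[key]                 # lazily drop expired items
            return None
        self.items.move_to_end(key)             # refresh recency
        return value
```

As in Memcached, expiration is checked lazily on access rather than by a background sweep, so an expired item occupies memory until it is touched or evicted.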
Fig. 8. Schematic view of Redis.
3.7 Redis
Redis is an in-memory key-value store with the ability to asynchronously store data on disk; flushing data to disk can release memory space. Redis uses the TCP/IP communication scheme and supports single-statement Get and Put transactions; since it is single-threaded, it does not use locks for accessing the data. One of the main advantages of Redis is its support for various value data types, such as lists and sets. Fig. 8 shows the communication and data model of Redis. It uses an LRU algorithm to flush data to disk; however, there is a high probability of data loss in case of a system power-loss event.
3.8 Systems Comparison
Table 1 differentiates the aforementioned systems based on their usage of one-sided or two-sided verbs, the connection type used for sending and receiving requests and responses, client-driven versus server-driven operations, and the architectural model. Client-driven (server-driven) operations are those fully managed by the client (server). The architectural model captures whether the memory model is symmetric or asymmetric, i.e., whether client local memory is used for storing key-values.
Table 2 classifies the systems based on amplification in size and computation. Amplification highlights the extra bytes a client must exchange, or the extra computation it must perform, in Get and Put operations. The word-size flag is the overhead of the Get operation in HydraDB: the HydraDB client must check the flag and the lease time, which adds computation overhead. In Pilaf, the client performs a lookup in the self-verifying hash table to find the address of the value, then reads the value; to guarantee consistency, the client must compute the checksums of the bucket and of the value. On average, the client performs 1.6 probes in the self-verifying hash table plus one more READ to read the value, and the checksum computation is amplified by a factor of 2.6. In FaRM, the header and cache-line versions are amplification added to the value. DrTM requires reading and setting the State in Get and Put operations; moreover, the Get operation requires reading an incarnation (I) and a version.
Table 3 shows the data consistency, indexing, and transaction type of the systems. In-memory key-value stores can be categorized into single-statement systems (also called caching systems) and multi-statement transactional systems. In single-statement systems, such as HydraDB, Pilaf, HERD, Memcached, and Redis, a transaction contains one operation; multi-statement systems, such as FaRM and DrTM, allow multiple operations per transaction.
Each of the aforementioned systems has design characteristics that carry their own drawbacks; these shortcomings are discussed next.
HydraDB exploits a lease mechanism to invalidate remote caches. This method introduces the overhead of synchronizing outdated data items and increases the number of control messages in the network. Furthermore, determining the lease period is a non-trivial task: it is highly dynamic, can change over time, and may vary significantly across different records. Technically speaking, HydraDB uses sustained polling, which violates the RDMA specification.
Pilaf uses cuckoo hashing, which requires multiple reads to probe a key; the network communication overhead is therefore amplified in Pilaf, hindering its maximum sustained throughput. Moreover, the checksum computation in Pilaf brings extra overhead not only for the server but also for the clients.
HERD has a simple design that uses an unreliable communication transport to respond to clients, which may lead to uncertainty or unreliability. Another disadvantage of this system is that it mixes the control and data planes, so data cannot move directly between nodes.
FaRM uses hopscotch hashing, which requires larger reads and consequently increases read amplification. Moreover, it suffers from the same issue as HydraDB in violating the RDMA specification.
DrTM relies on specific hardware features (i.e., hardware transactional memory) that are not available in all CPUs and may change across vendors or CPU generations, so it cannot be seamlessly ported to all hardware.
Memcached and Redis perform approximately the same for set and get operations. However, the impact of page swapping in Memcached is rather high, while memory fragmentation in Redis is high.
4 EXPERIMENTAL EVALUATION
Since our study focuses on the performance challenges of RDMA operations, it allows us to capture the bottlenecks of RDMA-based in-memory key-value stores. In order to study the impact of RDMA design choices on the performance of in-memory key-value stores, we performed a comprehensive set of experiments. We employed Redis 3.2.9, Memcached 1.4.37, and HERD; however, HydraDB, Pilaf, and FaRM were implemented from scratch, since they are not publicly available. All messages in HydraDB and FaRM are exchanged using inline RDMA messages. Since DrTM uses a special, not widely accessible CPU feature, we did not include it in the analysis. FaRM is the only system in our list that supports multi-statement transactions. Redis and Memcached are executed over IPoIB, and the other systems use native RDMA over InfiniBand.
4.1 Experimental Setting
To provide reproducibility and interpretability of the experiments, the deterministic and non-deterministic parameters that can impact the performance are described next.

KV Store  | One-sided / (Two-sided) | Request / (Response) connection type | Client-driven / (Server-driven) operations | Architectural model
HydraDB   | ✓ / (✗) | RC / (RC) | ✗ / (Get, Put)  | Asymmetric
Pilaf     | ✓ / (✓) | RC / (RC) | Get / (Put)     | Asymmetric
HERD      | ✓ / (✓) | UC / (UD) | ✗ / (Get, Put)  | Asymmetric
FaRM      | ✓ / (✗) | RC / (RC) | Get / (Put)     | Symmetric
DrTM      | ✓ / (✗) | RC / (RC) | Get, Put / (✗)  | Symmetric
Memcached | ✗ / (✗) | RC / (RC) | ✗ / (Get, Put)  | Asymmetric
Redis     | ✗ / (✗) | RC / (RC) | ✗ / (Get, Put)  | Asymmetric
TABLE 1. Decoupling the communication of the existing systems.

KV Store  | Size Get / (Put) amplification | Computation Get / (Put) amplification
HydraDB   | Word / (0) | lease validity + check flag / (0)
Pilaf     | (0) / (0) | 2.6 × CRC64 / (0)
HERD      | 0 / (0) | 0 / (0)
FaRM      | Vobj + I + L + (#cache lines − 1) × Vc / (0) | check lock / (0)
DrTM      | 2 × State + I + version / (2 × State) | 0 / (0)
Memcached | 0 / (0) | 0 / (0)
Redis     | 0 / (0) | 0 / (0)
TABLE 2. Decoupling the client amplification of the existing systems.

KV Store  | Data consistency | Indexing | Transaction
HydraDB   | flag & lease | compact hash table | Single
Pilaf     | self-verifying | cuckoo hashing | Single
HERD      | ✗ | lossy associative index | Single
FaRM      | versioning | chained hopscotch hashing | Multiple
DrTM      | lock and HTM | cluster chaining hashing | Multiple
Memcached | lock | chaining hash table | Single
Redis     | ✗ | chaining hash table | Single
TABLE 3. Data consistency and indexing of existing systems.
Deterministic Setting All benchmarks are compiled with gcc version 4.4.7 and run with a 50-second warmup and a 25-second measurement period. In addition, to gain confidence in the results, each experiment is repeated three times. Benchmarks are executed on machines with two AMD Opteron 6276 (Bulldozer) 2.3 GHz sockets, equipped with a 20 Gbps DDR ConnectX Mellanox NIC on PCI-E 2.0 with offload processing. The network topology is a direct connection through an Infiniscale-III Mellanox switch. Each machine has 4 NUMA nodes connected to the two sockets.
Non-deterministic Setting The query distribution (i.e., object popularity) is one of the main parameters in our experiments. A Facebook analysis reported that web requests follow a Zipfian-like distribution, and the majority of in-memory systems present experiments based on the Zipfian distribution with a high α value (α = 0.99), indicating a skewed curve. In this paper, for comprehensiveness, two distributions that closely model real-world traffic are considered: Uniform and Zipfian.
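A Zipfian workload with α = 0.99 can be generated by inverse-CDF sampling, as sketched below. The code is illustrative (not taken from any of the evaluated systems); note that `numpy.random.zipf` requires α > 1, so a hand-rolled sampler is needed for the α = 0.99 setting used here.

```python
import bisect
import random

def zipf_sampler(n_keys: int, alpha: float = 0.99, seed: int = 42):
    """Inverse-CDF sampler for a Zipfian distribution over ranks 1..n_keys;
    works for any alpha > 0, including the 0.99 used in our workloads."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_keys + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    rng = random.Random(seed)

    def sample() -> int:
        idx = bisect.bisect_left(cdf, rng.random())
        return min(idx, n_keys - 1) + 1   # rank 1 is the hottest key
    return sample

sample = zipf_sampler(1000)
draws = [sample() for _ in range(100_000)]
```

Precomputing the CDF makes each draw an O(log n) binary search, which keeps workload generation off the critical path of the benchmark client.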
The number of keys is another important parameter. For example, the server must register a memory region whose size corresponds to the number of keys, which can increase cache misses in the NIC when fetching page-table entries and thereby influence performance. In this paper, we use 4 million keys, which can be seen as the active set. We experiment with key and value sizes of 16 and 32 bytes, respectively, close to real-world workloads. Real-world read-write workload ratios vary from 68% to 99%; thus, we consider various read-write ratios, ranging from read-only to write-only.
4.2 Process Mapping
Each process is mapped to a single core, to avoid context switching, and uses its local NUMA node. Since a NIC (i.e., a Mellanox adapter) installed on PCIe is closer to one of the CPUs on the board, we conducted an experiment on two machines to examine the impact of mapping both sender and receiver once on the CPU closer to the NIC and once on the farther CPU. Fig. 9 shows the bandwidth difference of the SEND operation when the processes are bound to the close or the far CPU. Consequently, we pinned the shard to the CPU closer to the NIC in all experiments. In addition, in order to stress each system, we dedicate one machine as the server, with a single shard, and one machine to multiple clients.
Fig. 9. RDMA Send bandwidth difference on far and close socket to NIC.
4.3 Memory issues
Memory alignment and cache misses are two important memory-related issues affecting RDMA performance. In the following, we perform experiments to investigate their impact.
Fig. 10. Memory alignment performance evaluation.
Fig. 11. Cache-miss impact on the performance when varying the number of memory blocks.
Memory alignment We allocated a memory block aligned to the page boundary (4096 bytes). The client then sends write requests that may land at different offsets within the page. We conducted an experiment measuring throughput while varying the offset alignment, namely, dividing the page into a set of slots such that a record may span multiple slots depending on its size. Fig. 10 illustrates that throughput always improves when the slot offset is aligned to a multiple of 64 bytes, which matches the processor's cache-line size.
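The alignment rule above can be made concrete with a small helper (illustrative only, not part of any benchmark) that lays records out on cache-line-aligned slot offsets within a 4096-byte page:

```python
CACHE_LINE = 64  # bytes; the alignment that maximizes throughput in Fig. 10

def cache_aligned_offsets(page_size: int, record_size: int):
    """Slot offsets within a page such that every record starts on a
    cache-line boundary; the stride is the record size rounded up to a
    whole number of cache lines."""
    stride = -(-record_size // CACHE_LINE) * CACHE_LINE  # ceil to 64 B
    return list(range(0, page_size - record_size + 1, stride))
```

Rounding the stride up wastes a few bytes per slot but guarantees that no record straddles an unnecessary cache line, which is the effect the experiment measures.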
Cache miss We performed one experiment in which each message uses a separate memory area (up to 4), so that the cache is rarely hit, and one in which the cache is always hit, meaning no cached item is ever replaced. As can be seen in Fig. 11, performance is higher when memory is assigned from the same memory block.
4.4 Throughput
Impact of varying the number of clients and workload read-write ratio on throughput Fig. 12 shows the results for different workload read-write ratios under the Uniform distribution. As expected, the throughput of Redis and Memcached is orders of magnitude lower than that of the RDMA-based systems, particularly for read-intensive workloads. HydraDB scales well with an increasing number of clients and almost always outperforms the other systems, thanks to its smaller amplification and its use of READ for the Get statement in read-intensive workloads. FaRM uses hopscotch hashing, and each read requires fetching a neighborhood of six buckets; thus, FaRM suffers from read amplification. Although Pilaf uses READ in the Get statement, it cannot outperform HydraDB and FaRM due to its higher number of READs and the cost of CRC64; in particular, the latency of the CRC64 computation can be even higher than the READ latency.
Fig. 12. Uniform throughput with single shard.
HERD uses WRITE over an unreliable connection for sending requests and SEND for responding. Each client creates one QP to send its requests, while the server creates one QP per client to receive them. The server uses a single QP for responding to all clients, and each client creates one QP per server to receive responses. This mechanism puts more overhead on the clients due to the higher number of QPs. In our experiments, HERD is bottlenecked by the SEND operation. Moreover, we ran HERD without huge pages, which has an impact on its performance.
Throughput impact of varying the workload read-write ratio To highlight the impact of decreasing the workload read-write ratio for both the Zipfian and Uniform distributions, we devised an experiment with 24 clients while varying the read-write ratio. As can be seen in Fig. 13, HydraDB and FaRM scale gracefully as the read-write ratio decreases, due to the higher performance of the WRITE exploited in Put transactions compared with the READ in Get transactions. The performance difference between READ and WRITE stems from the limited number of outstanding READ requests per QP (24 in our case) and the overhead of maintaining their state in the NIC. Moreover, READ uses PCIe non-posted transactions, compared with the cheaper posted transactions used for WRITE. A non-posted transaction is a type of PCIe request in which, unlike a posted transaction, the requester needs a response from the destination device. Pilaf does not scale as the read-write ratio decreases, due to the increased overhead of the two-sided verb in Put operations. In the case of HERD, decreasing the read-write ratio has no noticeable impact, since HERD is already bottlenecked by the two-sided verb.
Throughput impact of the memory access distribution To isolate the impact of the memory access distribution on performance, we performed an experiment with 24 processes reading remote memory using either the Zipfian or the Uniform distribution. We deploy data in a clustered (i.e., in-order) or unclustered (i.e., random) way. We observed that both distributions perform identically for small value sizes, due to the locality of the data within a small portion of memory. However, increasing the value size reduces this locality, and Zipfian memory access outperforms Uniform access (by 2.1×) thanks to its higher data locality, as can be seen in Fig. 14. Furthermore, the Zipfian distribution with clustered deployment achieves higher performance (by 1.4×) than the unclustered setting, again due to higher data locality. Generally speaking, the Zipfian distribution increases the chances of reusing cached data for hot keys; however, it also increases race conditions when updating hot keys. Fig. 15 shows the throughput results for the Zipfian distribution.
Fig. 13. Throughput comparison when varying the read-write ratio for 24 clients.
Fig. 14. Uniform and Zipfian distribution throughput varying the value size for clustered and unclustered insertions.
Fig. 15. Zipfian throughput with single shard.
Fig. 16. Latency Zipfian.
Fig. 17. Latency Uniform.
4.5 Average latency
Figs. 16 and 17 show the latency under the Zipfian and Uniform distributions, respectively. Increasing the number of clients increases the number of requests and, consequently, the latency. On both distributions, the latency of HydraDB and FaRM is about two orders of magnitude lower than that of the legacy systems. Decreasing the read-write ratio reduces the latency of HydraDB and FaRM by 2.5 and 2.3 times, respectively. The drop is due to the lower latency of the WRITE operation compared with the READ used
in Put and Get operations. However, decreasing the read-write ratio increases the latency of Pilaf by up to 2 times, due to the higher latency of two-sided verb messages compared with READ. Since HERD uses the same operations for both Get and Put, decreasing the read-write ratio has no noticeable impact on its latency. Moreover, the latencies of Redis and Memcached are higher than those of the RDMA-based systems because they use IP over InfiniBand instead of native RDMA.
4.6 Value size
The payload size of an RDMA message influences the throughput of READ and WRITE operations, as shown in Fig. 18. Increasing the payload size up to 64 bytes has no impact on the throughput. The throughput degradation beyond 64 bytes (i.e., the cache-line size) is due to the Work Queue Elements (WQEs) occupying multiple cache lines. Increasing the WQE size increases the MMIO and DMA operations for both inline and non-inline messages. The key difference between inline and non-inline RDMA operations lies in how the payload is transferred: unlike an inline message, a non-inline message requires the NIC to fetch the payload with a DMA read operation. In our experiment, the throughput of inline WRITE is 4.5 times higher than that of non-inline WRITE for small payloads, precisely because inline messages eliminate the DMA reads. Although inline messages deliver higher throughput, they impose a limit on the payload size; for example, it must be less than 1 KB on the NIC employed in our evaluation. Moreover, increasing the payload size
causes higher throughput degradation for inline messages than for non-inline messages.
Fig. 18. Payload size impact on the performance with 24 processes on one machine and one remote process.
Fig. 19. Value size impact on the performance of the systems with 8 clients and 50% Get and Put.
We performed the same experiment using IPoIB, in which the clients read and write from/to a server. As can be seen in Fig. 18, these operations are far less sensitive to the payload size: the performance degradation ratio in our experiment is 1.02%.
Since a bigger value in an in-memory key-value system implies a bigger RDMA message payload, we performed an experiment with 8 clients and a 50% read-write ratio to measure the impact of varying the value size on throughput. As Fig. 19 shows, the throughput of the RDMA systems drops by up to 1.7× as the value size increases. Moreover, HydraDB and FaRM are more sensitive to the value size because they use inline WRITE in the experiment. The transition point at which performance begins to drop depends on the NIC and on the value size. Since the performance degradation of IPoIB operations is much smaller than that of the RDMA systems, the legacy systems are less sensitive to the message size.
4.7 Uniformity Ratio
Since the server copes with a large number of simultaneous accesses, serving client requests uniformly is essential. The uniformity ratio is the maximum number of completed requests divided by the minimum number of completed requests among the clients in a specific period of time; a ratio closer to 1 indicates better uniformity, and a larger ratio indicates worse uniformity. We employ the uniformity ratio as an indicator of fairness. As shown in Fig. 20, all RDMA-based systems have a uniformity ratio close
Fig. 20. Uniformity Ratio on 2 machines. Names are summarized to the
ﬁrst two letters.
to 1. However, requests in Redis and Memcached are not served as uniformly as in the RDMA-based systems.
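The metric defined above reduces to a one-line computation; the helper below (illustrative, not part of the benchmark harness) makes the definition precise:

```python
def uniformity_ratio(completed_per_client):
    """Maximum over minimum completed requests across clients in a fixed
    time window; 1.0 means every client was served equally (perfect
    fairness), and larger values mean worse uniformity."""
    return max(completed_per_client) / min(completed_per_client)
```

For example, if one client completes 200 requests while another completes 100 in the same window, the ratio is 2.0, i.e., service was twice as favorable to the first client.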
5 CONCLUSION AND FUTURE DIRECTION
In this paper, we described RDMA performance challenges and reviewed the state-of-the-art in-memory key-value stores. We presented a unified representation of a wide spectrum of NoSQL systems with respect to their indexing, consistency models, and communication protocols. We conducted an extensive evaluation comparing HydraDB, FaRM, Pilaf, HERD, Memcached, and Redis. Our work shows that one-sided RDMA operations combined with hopscotch hashing in FaRM can outperform cuckoo hashing in Pilaf, because of the higher number of probes (i.e., READs) in a cuckoo hash table. However, hopscotch hashing requires reading a fixed-size neighborhood to achieve constant lookup time, which has a substantial impact on throughput compared with the compact hash table in HydraDB. We have observed that exploiting a one-sided verb (i.e., WRITE) for exchanging requests and responses can outperform two-sided verbs; however, increasing the number of machines can degrade performance when using connected QPs. We have further observed that the performance of memory accesses is greatly influenced by whether the data is clustered or unclustered: clustered access achieves higher performance by a factor of 1.4. In addition, we have demonstrated that the latency of the legacy systems is up to two orders of magnitude higher than that of the RDMA-based systems. We have shown that increasing the value size decreases performance, and that systems using inline RDMA are more sensitive to the value size. Finally, we have observed that RDMA-based systems serve requests more uniformly than the legacy systems.
Achieving high performance in key-value stores is not
trivial, and it requires solving several problems. Future
directions in the performance enhancement of key-value stores
lie in improving the network and storage drivers. For
example, sophisticated operations such as dereferencing,
traversing a list, conditional read, on-the-ﬂy operations (e.g.,
compression and decompression), histogram computations,
and results consolidation are possible features in future
RDMA protocols, which can be used to improve key-value
stores. In addition, new storage-class memory such as Intel
Optane can be used in key-value stores, as well as Non-
Volatile Memory Express (NVMe), a state-of-the-art
protocol for high-performance storage. The advantage
of NVMe is that it supports several transport protocols, such
as RDMA (i.e., RoCE and iWARP), Fibre Channel, and TCP.
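As a toy illustration of why offloading such sophisticated operations to the NIC could matter, the following Python sketch (not a real RDMA API; `RemoteMemory` and `chase` are hypothetical stand-ins) counts the network round trips a client pays today when dereferencing a chain of remote pointers with one-sided READs:

```python
class RemoteMemory:
    """Toy stand-in for a server's memory, addressed by byte offsets."""
    def __init__(self, cells):
        self.cells = cells          # offset -> (value, next_offset)
        self.round_trips = 0

    def read(self, offset):
        """Emulate one one-sided READ: each call costs one round trip."""
        self.round_trips += 1
        return self.cells[offset]

def chase(mem, head):
    """Client-side traversal: each pointer must return to the client
    before the next READ can be issued, so cost grows with list length."""
    values, off = [], head
    while off is not None:
        value, off = mem.read(off)
        values.append(value)
    return values

# A 4-node remote linked list: offset 0 -> 16 -> 32 -> 48
mem = RemoteMemory({0: ("a", 16), 16: ("b", 32), 32: ("c", 48), 48: ("d", None)})
print(chase(mem, 0))    # ['a', 'b', 'c', 'd']
print(mem.round_trips)  # 4 round trips for 4 nodes
```

An offloaded "traverse" or "dereference" verb in a future RDMA protocol would collapse this whole chain into a single round trip.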
ACKNOWLEDGMENTS
The authors would like to thank the HPC project within the
Department of Control and Computer Engineering at the
Politecnico di Torino (http://www.hpc.polito.it).
REFERENCES
X. Lu, D. Shankar, and D. K. Panda, “Scalable and Dis-
tributed Key-Value Store-based Data Management Using RDMA-
Memcached,” IEEE Data Eng. Bull., vol. 40, no. 1, pp. 50–61, 2017.
 M. Sadoghi and S. Blanas, “Transaction processing on modern
hardware,” Synthesis Lectures on Data Management, vol. 14, no. 2,
pp. 1–138, 2019.
 F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach,
M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A
distributed storage system for structured data,” ACM Transactions
on Computer Systems, vol. 26, no. 2, 2008.
 G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Laksh-
man, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels,
“Dynamo: amazon’s highly available key-value store,” Special
Interest Group in Operating Systems, vol. 41, no. 6, pp. 205–220, 2007.
 A. Lakshman and P. Malik, “Cassandra: a decentralized struc-
tured storage system,” Special interest Group in Operating Systems,
vol. 44, no. 2, pp. 35–40, 2010.
 R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C.
Li, R. McElroy, M. Paleczny, D. Peek, P. Saab et al., “Scaling
Memcache at Facebook.” in Networked Systems Design and Imple-
mentation, vol. 13, 2013, pp. 385–398.
 D. Dai, X. Li, C. Wang, M. Sun, and X. Zhou, “Sedna: A memory
based key-value storage system for realtime processing in cloud,”
in Cluster Computing Workshops. IEEE, 2012, pp. 48–56.
 X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah,
R. Herbrich, S. Bowers et al., “Practical lessons from predicting
clicks on ads at facebook,” in International Workshop on Data
Mining for Online Advertising. ACM, 2014, pp. 1–9.
 C. Li, Y. Lu, Q. Mei, D. Wang, and S. Pandey, “Click-through
prediction for advertising in twitter timeline,” in International
Conference on Knowledge Discovery and Data Mining. ACM, 2015,
 Y. Huang, B. Cui, W. Zhang, J. Jiang, and Y. Xu, “Tencentrec:
Real-time stream recommendation in practice,” in International
Conference on Management of Data. ACM, 2015, pp. 227–238.
 G. Graefe, “The ﬁve-minute rule twenty years later, and how
ﬂash memory changes the rules,” in International workshop on Data
management on new hardware. ACM, 2007.
 P. Menon, T. Rabl, M. Sadoghi, and H.-A. Jacobsen, “CaSSanDra:
An SSD boosted key-value store,” in 2014 IEEE 30th International
Conference on Data Engineering. IEEE, 2014, pp. 1162–1167.
 B. Debnath, S. Sengupta, and J. Li, “FlashStore: high throughput
persistent key-value store,” Proceedings of the Very Large Database
Endowment, vol. 3, no. 2, pp. 1414–1425, 2010.
 “Flashcache,” accessed: 12-October-2017. [Online]. Available:
 A. Anand, S. Kappes, A. Akella, and S. Nath, “Building cheap
and large cams using bufferhash,” University of Wisconsin Madison
Technical Report TR1651, 2009.
 T. Kgil, D. Roberts, and T. Mudge, “Improving NAND ﬂash based
disk caches,” in International Symposium on Computer Architecture.
IEEE, 2008, pp. 327–338.
 D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and
M. Rosenblum, “Fast crash recovery in RAMCloud,” in Proceed-
ings of the Twenty-Third ACM Symposium on Operating Systems
Principles. ACM, 2011, pp. 29–41.
 “Memcached: High-Performance, Distributed Memory Object
Caching System,” accessed: 12-October-2017. [Online]. Available:
 H. Lim, D. Han, D. G. Andersen, and M. Kaminsky, “MICA:
A Holistic Approach to Fast In-Memory Key-Value Storage,”
in 11th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 14). Seattle, WA: USENIX Association,
2014, pp. 429–444.
 H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky, “SILT: A
Memory-efﬁcient, High-performance Key-value Store,” in Sym-
posium on Operating Systems Principles. New York, NY, USA:
ACM, 2011, pp. 1–13.
 S. Sanﬁlippo, “Redis,” accessed: 12-October-2017. [Online].
D. Sidler, Z. István, and G. Alonso, “Low-latency TCP/IP stack
for data center applications,” in Field Programmable Logic and
Applications. IEEE, 2016, pp. 1–4.
 I. Pratt and K. Fraser, “Arsenic: A user-accessible gigabit ethernet
interface,” in Joint Conference of the IEEE Computer and Communi-
cations Societies, vol. 1. IEEE, 2001, pp. 67–76.
 T. Von Eicken, A. Basu, V. Buch, and W. Vogels, “U-Net: A user-
level network interface for parallel and distributed computing,”
in Special interest Group in Operating Systems, vol. 29, no. 5. ACM,
1995, pp. 40–53.
 D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert,
F. Berry, A. M. Merritt, E. Gronke, and C. Dodd, “The virtual
interface architecture,” IEEE micro, vol. 18, no. 2, pp. 66–76, 1998.
 N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz,
J. N. Seizovic, and W.-K. Su, “Myrinet: A gigabit-per-second local
area network,” IEEE micro, vol. 15, no. 1, pp. 29–36, 1995.
 J. Beecroft, D. Addison, D. Hewson, M. McLaren, D. Roweth,
F. Petrini, and J. Nieplocha, “QSNET/sup II: deﬁning high-
performance network design,” IEEE micro, vol. 25, no. 4, pp. 34–
 “InﬁniBand,” accessed: 12-October-2017. [Online]. Available:
 “iWARP,” accessed: 12-October-2017. [Online]. Available:
 M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz, T. Lovett,
T. Rimmer, K. D. Underwood, and R. C. Zak, “Intel Omni-Path
architecture: Enabling scalable, high performance fabrics,”
in High-Performance Interconnects. IEEE, 2015, pp. 1–9.
 K. Gildea, R. Govindaraju, D. Grice, P. Hochschild, and F. C.
Chang, “Remote direct memory access system and method,”
Aug. 30 2004, uS Patent App. 10/929,943.
 Y. Wang, L. Zhang, J. Tan, M. Li, Y. Gao, X. Guerin, X. Meng,
and S. Meng, “HydraDB: A Resilient RDMA-driven Key-value
Middleware for In-memory Cluster Computing,” in SC’15.
 A. Kalia, M. Kaminsky, and D. G. Andersen, “Using RDMA
efﬁciently for key-value services,” in Special Interest Group on Data
Communication, vol. 44, no. 4. ACM, 2014, pp. 295–306.
C. Mitchell, Y. Geng, and J. Li, “Using One-Sided RDMA
Reads to Build a Fast, CPU-Efﬁcient Key-Value Store,” in Pre-
sented as part of the 2013 USENIX Annual Technical Conference
(USENIX ATC 13). San Jose, CA: USENIX, 2013, pp. 103–114.
 X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen, “Fast In-memory
Transaction Processing Using RDMA and HTM,” in Symposium
on Operating Systems Principles. New York, NY, USA: ACM, 2015,
A. Dragojević, D. Narayanan, O. Hodson, and M. Castro, “FaRM:
Fast remote memory,” in Proceedings of the 11th USENIX Confer-
ence on Networked Systems Design and Implementation, 2014, pp.
 J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. Wasi-ur
Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur et al., “Mem-
cached design on high performance rdma capable interconnects,”
in International Conference on Parallel Processing. IEEE, 2011, pp.
 W. Tang, Y. Lu, N. Xiao, F. Liu, and Z. Chen, “Accelerating Redis
with RDMA Over InﬁniBand,” in International Conference on Data
Mining and Big Data. Springer, 2017, pp. 472–483.
 “MongoDB,” accessed: 12-October-2017. [Online]. Available:
 “HBase,” accessed: 12-October-2017. [Online]. Available:
 “VoltDB,” accessed: 12-October-2017. [Online]. Available:
 B. Cassell, T. Szepesi, B. Wong, T. Brecht, J. Ma, and X. Liu,
“Nessie: A Decoupled, Client-Driven Key-Value Store Using
RDMA,” IEEE Transactions on Parallel and Distributed Systems,
vol. 28, no. 12, pp. 3537–3552, 2017.
 A. Gandini, M. Gribaudo, W. J. Knottenbelt, R. Osman, and
P. Piazzolla, “Performance evaluation of NoSQL databases,” in
European Workshop on Performance Engineering. Springer, 2014,
 J. Klein, I. Gorton, N. Ernst, P. Donohoe, K. Pham, and C. Matser,
“Performance evaluation of NoSQL databases: A case study,” in
Proceedings of the 1st Workshop on Performance Analysis of Big Data
Systems. ACM, 2015, pp. 5–10.
 H. Zhang, G. Chen, B. C. Ooi, K.-L. Tan, and M. Zhang, “In-
memory big data management and processing: A survey,” IEEE
Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp.
P. Cudré-Mauroux, I. Enchev, S. Fundatureanu, P. Groth,
A. Haque, A. Harth, F. L. Keppmann, D. Miranker, J. F. Sequeda,
and M. Wylot, “NoSQL databases for RDF: an empirical evalu-
ation,” in International Semantic Web Conference. Springer, 2013,
J. R. Lourenço, B. Cabral, P. Carreiro, M. Vieira, and J. Bernardino,
“Choosing the right NoSQL database for the job: a quality
attribute evaluation,” Journal of Big Data, vol. 2, no. 1, p. 18, 2015.
 “MySQL,” accessed: 12-October-2017. [Online]. Available:
T. Rabl, S. Gómez-Villamor, M. Sadoghi, V. Muntés-Mulero, H.-
A. Jacobsen, and S. Mankovskii, “Solving big data challenges for
enterprise application performance management,” Proceedings of
the Very Large Database Endowment, vol. 5, no. 12, pp. 1724–1735,
 “Riak,” accessed: 12-October-2017. [Online]. Available:
 W. Cao, S. Sahin, L. Liu, and X. Bao, “Evaluation and analysis of
in-memory key-value systems.” IEEE, 2016, pp. 26–33.
 B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny,
“Workload analysis of a large-scale key-value store,” in Special
Interest Group for the computer systems performance evaluation com-
munity, vol. 40, no. 1. ACM, 2012, pp. 53–64.
 Q. Liu and R. D. Russell, “A performance study of InﬁniBand
fourteen data rate (FDR),” in Proceedings of the High Performance
Computing Symposium. Society for Computer Simulation International.
H. Zhang, B. M. Tudor, G. Chen, and B. C. Ooi, “Efficient In-memory
Data Management: An Analysis,” VLDB Endowment, vol. 7,
no. 10, pp. 833–836, Jun. 2014.
 “Redis Labs,” accessed: 12-October-2017. [Online]. Available:
 “InﬁniBandTM Architecture Speciﬁcation Volume
2,” accessed: 12-October-2017. [Online]. Available:
 “Top500,” accessed: 12-October-2017. [Online]. Available:
 G. Shainer and A. Ancel, “Network-based processing versus
host-based processing: Lessons learned,” in Cluster Computing
Workshops and Posters. IEEE, 2010, pp. 1–4.
 A. Dragojevic, D. Narayanan, and M. Castro, “RDMA Reads: To
Use or Not to Use?” IEEE Data Eng. Bull., vol. 40, no. 1, pp. 3–14,
 S. Sur, M. J. Koop, D. K. Panda et al., “Performance analysis and
evaluation of Mellanox ConnectX InﬁniBand architecture with
multi-core platforms,” in High-Performance Interconnects. IEEE,
2007, pp. 125–134.
 O. Shahmirzadi, “High-Performance Communication Primitives
and Data Structures on Message-Passing Manycores:Broadcast
and Map,” 2014.
 “TILE-G,” accessed: 12-October-2017. [Online]. Available:
http://www.mellanox.com/page/multi core overview?
mtag=multi core overview
 “kalray,” accessed: 12-October-2017. [Online]. Available:
D. Slogsnat, A. Giese, M. Nüssle, and U. Brüning, “An open-
source HyperTransport core,” ACM Transactions on Reconﬁgurable
Technology and Systems, vol. 1, no. 3, 2008.
D. Ziakas, A. Baum, R. A. Maddox, and R. J. Safranek, “Intel
QuickPath interconnect architectural features supporting scalable
system architectures,” in High Performance Interconnects. IEEE,
2010, pp. 1–6.
 “PCI Express,” accessed: 12-October-2017. [Online]. Available:
 “Intel Data Direct I/O Technology,”
accessed: 12-October-2017. [Online]. Avail-
 F. Mietke, R. Rex, R. Baumgartl, T. Mehlan, T. Hoeﬂer, and
W. Rehm, “Analysis of the Memory Registration Process in
the Mellanox InﬁniBand Software Stack,” International European
Conference on Parallel and Distributed Computing, pp. 124–133, 2006.
 D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and
O. Mutlu, “Decoupled direct memory access: Isolating CPU and
IO trafﬁc by leveraging a dual-data-port DRAM,” in Parallel
Architecture and Compilation. IEEE, 2015, pp. 174–187.
R. E. Bryant and D. R. O'Hallaron, Computer
Systems: A Programmer's Perspective. Prentice Hall, Upper Saddle
River, 2003, vol. 2.
B. Lepers, V. Quéma, and A. Fedorova, “Thread and Memory
Placement on NUMA Systems: Asymmetry Matters.” in USENIX
Annual Technical Conference, 2015, pp. 277–289.
 T. Li, Y. Ren, D. Yu, and S. Jin, “Analysis of NUMA effects in
modern multicore systems for the design of high-performance
data transfer applications,” Future Generation Computer Systems,
vol. 74, pp. 41–50, 2017.
 Y. Ren, T. Li, D. Yu, S. Jin, and T. Robertazzi, “Design and
performance evaluation of NUMA-aware RDMA-based end-to-
end data transfer systems,” in Proceedings of the International
Conference on High Performance Computing, Networking, Storage and
Analysis. ACM, 2013.
 A.-H. A. Badawy, A. Aggarwal, D. Yeung, and C.-W. Tseng,
“Evaluating the impact of memory system performance on soft-
ware prefetching and locality optimizations,” in Proceedings of the
15th international conference on Supercomputing. ACM, 2001, pp.
 D. Callahan, K. Kennedy, and A. Porterﬁeld, “Software Prefetch-
ing,” in Proceedings of the Fourth International Conference on Archi-
tectural Support for Programming Languages and Operating Systems,
ser. ASPLOS IV. New York, NY, USA: ACM, 1991, pp. 40–52.
 “AMD SenseMI,” accessed: 12-October-2017. [Online]. Available:
 “3GIOPCIE,” accessed: 12-October-2017. [Online]. Available:
 R. B. Thompson and B. F. Thompson, PC Hardware in a Nutshell,
3rd Edition. Sebastopol, CA, USA: O’Reilly & Associates, Inc.,
 R. Noronha and D. K. Panda, “Can high performance software
DSM systems designed with InﬁniBand features beneﬁt from
PCI-Express?” in Cluster, Cloud and Grid computing, vol. 2. IEEE,
2005, pp. 945–952.
 J. Liu, A. Mamidala, A. Vishnu, and D. K. Panda, “Performance
evaluation of inﬁniband with pci express,” in High Performance
Interconnects, 2004. Proceedings. 12th Annual IEEE Symposium on.
IEEE, 2004, pp. 13–19.
“Understanding PCI Bus, PCI-Express and InfiniBand
Architecture,” accessed: 12-October-2017. [Online]. Available:
 “Understanding Performance of PCI Ex-
press Systems,” accessed: 12-October-2017. [Online].
 “Understanding PCIe Conﬁguration for Maximum Per-
formance,” accessed: 12-October-2017. [Online]. Available:
 “IRQ Afﬁnity,” accessed: 12-October-2017. [Online]. Available:
 A. K. M. Kaminsky and D. G. Andersen, “Design guidelines
for high performance RDMA systems,” in 2016 USENIX Annual
Technical Conference, 2016.
 M. Flajslik and M. Rosenblum, “Network Interface Design for
Low Latency Request-Response Protocols.” in USENIX Annual
Technical Conference, 2013, pp. 333–346.
 “Mellanox Adapters Programmers Reference
Manual,” accessed: 12-October-2017. [Online].
user manuals/Ethernet Adapters Programming Manual.pdf
 S. Sur, A. Vishnu, H.-W. Jin, D. Panda, and W. Huang, “Can
Memory-Less Network Adapters Beneﬁt Next-Generation Inﬁni-
Band Systems?” in High Performance Interconnects, 2005. Proceed-
ings. 13th Symposium on. IEEE, 2005, pp. 45–50.
 P. W. Frey and G. Alonso, “Minimizing the hidden cost of
RDMA,” in International Conference on Distributed Computing Sys-
tems. IEEE, 2009, pp. 553–560.
 A. Kalia, M. Kaminsky, and D. G. Andersen, “FaSST: Fast,
Scalable and Simple Distributed Transactions with Two-Sided
(RDMA) Datagram RPCs.” in Operating Systems Design and Im-
plementation, 2016, pp. 185–201.
A. Dragojević, D. Narayanan, E. B. Nightingale, M. Renzelmann,
A. Shamis, A. Badam, and M. Castro, “No Compromises: Dis-
tributed Transactions with Consistency, Availability, and Perfor-
mance,” in Symposium on Operating Systems Principles. New York,
NY, USA: ACM, 2015, pp. 54–70.
 “Annex A14: Extended Reliable Connected (XRC) Transport
Service,” accessed: 12-October-2017. [Online]. Available:
T. Shanley, InfiniBand Network Architecture. Addison-Wesley
 P. MacArthur and R. D. Russell, “A Performance Study to Guide
RDMA Programming Decisions,” in International Conference on
High Performance Computing and Communication & International
Conference on Embedded Software and Systems. IEEE, 2012, pp.
 “RDMA Aware Networks Programming User
Manual,” accessed: 12-October-2017. [Online].
prod software/RDMA Aware Programming user manual.pdf
 P. MacArthur, Q. Liu, R. D. Russell, F. Mizero, M. Veeraraghavan,
and J. M. Dennis, “An integrated tutorial on InﬁniBand, Verbs
and MPI,” IEEE Communications Surveys & Tutorials, 2017.
 “RoCE vs. iWARP Competitive Analysis,”
accessed: 12-October-2017. [Online]. Avail-
WP RoCE vs iWARP.pdf
 R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy,
S. Ratnasamy, and S. Shenker, “Revisiting Network Support for RDMA.”
 J. Vienne, J. Chen, M. Wasi-Ur-Rahman, N. S. Islam, H. Subra-
moni, and D. K. Panda, “Performance Analysis and Evaluation
of InﬁniBand FDR and 40GigE RoCE on HPC and Cloud Com-
puting Systems,” in High-Performance Interconnects. IEEE, 2012,
 M. Beck and M. Kagan, “Performance Evaluation of the RDMA
over Ethernet (RoCE) Standard in Enterprise Data Centers In-
frastructure,” in Proceedings of the 3rd Workshop on Data Center
- Converged and Virtual Ethernet Switching, ser. DC-CaVES ’11.
International Teletrafﬁc Congress, 2011, pp. 9–15.
H. Subramoni, P. Lai, M. Luo, and D. K. Panda, “RDMA over
Ethernet: A preliminary study,” in Cluster Computing and Workshops,
2009. CLUSTER’09. IEEE International Conference on. IEEE, 2009,
 L. A. Kachelmeier, F. V. Van Wig, and K. N. Erickson, “Com-
parison of High Performance Network Options: EDR Inﬁni-
Band vs. 100Gb RDMA Capable Ethernet,” Los Alamos National
Lab.(LANL), Los Alamos, NM (United States), Tech. Rep., 2016.
 “RDMA over DPDK,” accessed: 12-October-2017. [Online].
2017presentations /114 Urdma PMacArthur.pdf
 “RDMA vs DPDK,” accessed: 12-October-2017.
[Online]. Available: http://www.mellanox.com/related-
 M. J. Koop, W. Huang, K. Gopalakrishnan, and D. K. Panda,
“Performance Analysis and Evaluation of PCIe 2.0 and Quad-
Data Rate InﬁniBand,” in High Performance Interconnects. IEEE,
2008, pp. 85–92.
 J. Liu, A. Mamidala, A. Vishnu, and D. K. Panda, “Evaluating
InﬁniBand performance with PCI express,” IEEE Micro, vol. 25,
no. 1, pp. 20–29, 2005.
 “InﬁniBand PCI PCIE,” accessed: 12-October-2017. [On-
line]. Available: http://www.mellanox.com/pdf/ whitepaper-
s/PCI 3GIO IB WP 120.pdf
 “Intel Xeon Processor D-1500 Product Fam-
ily,” accessed: 12-October-2017. [Online]. Avail-
 “Intel Xeon Phi Processor Knights Landing Architectural
Overview,” accessed: 12-October-2017. [Online]. Avail-
 D. Dalessandro and P. Wyckoff, “Accelerating Web Protocols
Using RDMA,” in Network Computing and Applications, 2007. NCA
2007. Sixth IEEE International Symposium on. IEEE, 2007, pp. 205–
 K. Magoutis, S. Addetia, A. Fedorova, and M. I. Seltzer, “Making
the Most Out of Direct-Access Network Attached Storage.” in
USENIX Conference on File and Storage Technologies, 2003.
 L. Ou, X. He, and J. Han, “An efﬁcient design for fast memory
registration in RDMA,” Journal of Network and Computer Applica-
tions, vol. 32, no. 3, pp. 642–651, 2009.
 R. Pagh and F. F. Rodler, “Cuckoo hashing,” in European Sympo-
sium on Algorithms. Springer, 2001, pp. 121–133.
 M. Herlihy, N. Shavit, and M. Tzafrir, “Hopscotch hashing,” in
International Symposium on Distributed Computing. Springer, 2008,
 D. Dalessandro and P. Wyckoff, “Memory management strategies
for data serving with RDMA,” in High-Performance Interconnects.
IEEE, 2007, pp. 135–142.
 Y. Zhang, J. Gu, Y. Lee, M. Chowdhury, and K. G. Shin, “Per-
formance Isolation Anomalies in RDMA,” in Workshop on Kernel-
Bypass Networks, 2017, pp. 43–48.
 A. Romanow and S. Bailey, “An Overview of RDMA over IP,” in
Proceedings of the First International Workshop on Protocols for Fast
Long-Distance Networks, 2003.
 X. Wu, L. Zhang, Y. Wang, Y. Ren, M. Hack, and S. Jiang,
“zExpander: a key-value cache with both high performance and
fewer misses,” in Proceedings of the Eleventh European Conference
on Computer Systems. ACM, 2016, p. 14.
 K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang,
“Mega-KV: a case for GPUs to maximize the throughput of in-
memory key-value stores,” Proceedings of the Very Large Database
Endowment, vol. 8, no. 11, pp. 1226–1237, 2015.
 C. Barthels, G. Alonso, and T. Hoeﬂer, “Designing Databases
for Future High-Performance Networks.” IEEE Data Eng. Bull.,
vol. 40, no. 1, pp. 15–26, 2017.
 M. Sadoghi, M. Canim, B. Bhattacharjee, F. Nagel, and K. A. Ross,
“Reducing database locking contention through multi-version
concurrency,” Proceedings of the Very Large Database Endowment,
vol. 7, no. 13, pp. 1331–1342, 2014.
 M. Sadoghi, K. A. Ross, M. Canim, and B. Bhattacharjee, “Ex-
ploiting SSDs in operational multiversion databases,” Very Large
Database, vol. 25, no. 5, pp. 651–672, 2016.
 M. Hemmatpour, B. Montrucchio, M. Rebaudengo, and
M. Sadoghi, “Kanzi: A distributed, in-memory key-value store,”
in International Middleware Conference. ACM, 2016, pp. 3–4.
 D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and
D. Lewin, “Consistent hashing and random trees: Distributed
caching protocols for relieving hot spots on the World Wide
Web,” in Proceedings of the twenty-ninth annual ACM symposium
on Theory of computing. ACM, 1997, pp. 654–663.
 Y. Wang, X. Meng, L. Zhang, and J. Tan, “C-Hint: An Effective and
Reliable Cache Management for RDMA-Accelerated Key-Value
Stores,” in Proceedings of the ACM Symposium on Cloud Computing,
ser. SOCC ’14. New York, NY, USA: ACM, 2014, pp. 23:1–23:13.
 Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen, “Fast and general
distributed transactions using RDMA and HTM,” in Proceedings
of the Eleventh European Conference on Computer Systems. ACM,
 “HERD Source code,” accessed: 12-October-2017. [Online].
Available: https://github.com/efﬁcient/rdma bench
 L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “Web
caching and Zipf-like distributions: Evidence and implications,”
in INFOCOM’99. Eighteenth Annual Joint Conference of the IEEE
Computer and Communications Societies. Proceedings. IEEE, vol. 1.
IEEE, 1999, pp. 126–134.
 Q. Huang, K. Birman, R. van Renesse, W. Lloyd, S. Kumar, and
H. C. Li, “An Analysis of Facebook Photo Caching,” in Symposium
on Operating Systems Principles. New York, NY, USA: ACM, 2013,
 B. Fan, D. G. Andersen, and M. Kaminsky, “MemC3: Compact
and Concurrent MemCache with Dumber Caching and Smarter
Hashing.” in Symposium on Networked Systems Design and Imple-
mentation, vol. 13, 2013, pp. 371–384.
 B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny,
“Workload analysis of a large-scale key-value store,” in ACM
SIGMETRICS Performance Evaluation Review, vol. 40, no. 1. ACM,
2012, pp. 53–64.
D. Petrović, T. Ropars, and A. Schiper, “Leveraging Hardware
Message Passing for Efﬁcient Thread Synchronization,” in Sym-
posium on Principles and Practice of Parallel Programming. ACM,
2014, pp. 143–154.
 “Intel Optane,” accessed: 12-October-2017. [Online]. Available:
 “NVMe Key-value store,” accessed: 12-October-
2017. [Online]. Available: http://mvapich.cse.ohio-
sc18 osu booth 2.pdf
 D. Minturn, “NVM Express Over Fabrics,” in 11th Annual Open-
Fabrics International OFS Developers Workshop, 2015.
Masoud Hemmatpour received the M.S. de-
gree in Computer and Communication Network
Engineering and the Ph.D. degree in Control
and Computer Engineering both from Politecnico
di Torino, Italy, in 2015, and 2019, respectively.
His research interest includes High Performance
Computing (HPC) in the context of concurrent
programming and dynamic memory manage-
ment in parallel systems utilizing modern hard-
ware, such as hardware message passing and
Remote Direct Memory Access (RDMA). He is
currently a postdoctoral fellow at Cisco Systems. He focuses on the
control and management planes in cloud-native environments.
Bartolomeo Montrucchio received the M.S.
degree in Electronic Engineering and the Ph.D.
degree in Computer Engineering both from Po-
litecnico di Torino, Italy, in 1998, and 2002, re-
spectively. He is currently an Associate Pro-
fessor of Computer Engineering at the Dipar-
timento di Automatica e Informatica of Politec-
nico di Torino, Italy. His current research inter-
ests include image analysis and synthesis tech-
niques, scientific visualization, and sensor networks.
Maurizio Rebaudengo (M’95, SM’14) received
the M.S. degree in Electronics (1991), and the
Ph.D. degree in Computer Engineering (1995),
both from Politecnico di Torino, Italy. He has been an
IEEE Senior Member since 2014. Currently, he
is a Full Professor at the Dipartimento di Au-
tomatica e Informatica of the same institution.
His research interests include ubiquitous computing,
testing, and dependability analysis.
Mohammad Sadoghi is an Assistant Profes-
sor in the Computer Science Department at the
University of California, Davis. Formerly, he was
an Assistant Professor at Purdue University and
Research Staff Member at IBM T.J. Watson Re-
search Center. He received his Ph.D. from the
University of Toronto in 2013. He leads the Expo-
Lab research group with the aim of pioneering a dis-
tributed ledger that uniﬁes secure transactional
and real-time analytical processing (L-Store), all
centered around a democratic and decentralized
computational model (ResilientDB). He has cofounded a blockchain
company called Moka Blox LLC, the ResilientDB spinoff. He has over
80 publications in leading database conferences/journals and 34 ﬁled
U.S. patents. He served as the Area Editor for Transaction Processing in
Encyclopedia of Big Data Technologies by Springer. He has co-authored
the book “Transaction Processing on Modern Hardware”, Morgan &
Claypool Synthesis Lectures on Data Management, and is currently
co-authoring a book entitled “Fault-tolerant Distributed Transactions on
Blockchain”, also part of the Morgan & Claypool series.