
QueCC: A Queue-oriented, Control-free Concurrency Architecture

Abstract

We investigate a coordination-free approach to transaction processing on emerging multi-socket, many-core, shared-memory architectures to harness their unprecedented available parallelism. We propose a queue-oriented, control-free concurrency architecture, referred to as QueCC, that exhibits minimal contention among concurrent threads by eliminating the overhead of concurrency control from the critical path of the transaction. QueCC operates on batches of transactions in two deterministic phases of priority-based planning followed by control-free execution. We extensively evaluate our transaction execution architecture and compare its performance against seven state-of-the-art concurrency control protocols designed for in-memory, key-value stores. We demonstrate that QueCC can significantly outperform state-of-the-art concurrency control protocols under high contention by up to 6.3×. Moreover, our results show that QueCC can process nearly 40 million YCSB transactional operations per second while maintaining serializability guarantees with write-intensive workloads. Remarkably, QueCC outperforms H-Store by up to two orders of magnitude.
Thamir M. Qadah¹, Mohammad Sadoghi²
Exploratory Systems Lab
¹ Purdue University, West Lafayette
² University of California, Davis
tqadah@purdue.edu, msadoghi@ucdavis.edu
CCS Concepts: Information systems → Data management systems; DBMS engine architectures; Database transaction processing; Parallel and distributed DBMSs; Key-value stores; Main memory engines.
ACM Reference Format:
Thamir M. Qadah, Mohammad Sadoghi. 2018. QueCC: A Queue-oriented, Control-free Concurrency Architecture. In 19th International Middleware Conference (Middleware '18), December 10–14, 2018, Rennes, France. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3274808.3274810
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Middleware '18, December 10–14, 2018, Rennes, France
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5702-9/18/12...$15.00
https://doi.org/10.1145/3274808.3274810
1 Introduction
New multi-socket, many-core hardware architectures with tens or hundreds of cores are becoming commonplace in the market today [13, 24, 29]. This trend is expected to increase exponentially, reaching thousands of cores per box in the near future [14]. However, recent studies have shown that traditional transactional techniques that rely on extensive coordination among threads fail to scale on these emerging hardware architectures; thus, there is an urgent need to develop novel techniques to utilize the power of the next generation of highly parallel modern hardware [23, 25, 35, 40, 41]. There is also a new wave of work on deterministic concurrency techniques, e.g., techniques in which the read and write sets are known a priori. These promising algorithms are motivated from a practical standpoint by examining the predefined stored procedures that are heavily deployed in customer settings [9, 11, 12, 16, 32, 33]. However, many of the existing deterministic approaches do not fundamentally redesign their algorithms for many-core architectures, which is the precise focus of this work: a novel deterministic concurrency control for modern highly parallel architectures.
The main challenge for transaction processing systems built on top of many-core hardware is the increased contention (due to increased parallelism) among many competing cores for shared resources, e.g., failure to acquire highly contended locks (pessimistic) or failure to validate contended tuples (optimistic). The role of concurrency control mechanisms in traditional databases is to determine the interleaving order of operations among concurrent transactions over shared data. But there is no fundamental reason to rely on concurrency control logic during the actual execution, nor is it necessary to force the same thread to be responsible for executing both transaction and concurrency control logic. This important realization has been observed in recent studies [26, 40] and may lead to a complete paradigm shift in how we think about transactions, but we have just scratched the surface. It is essential to note that the two tasks of establishing the order for accessing shared data and actually executing the transaction's logic are completely independent. Hence, these tasks can potentially be performed in different phases of execution by independent threads.
For instance, Ren et al. [26] propose ORTHRUS, which operates based on pessimistic concurrency control, in which transaction executor threads delegate locking functionality to dedicated lock manager threads. Yao et al. [40] propose LADS, which processes batches of transactions by constructing a set of transaction dependency graphs and partitioning them into smaller pieces (e.g., using min-cut algorithms), followed by dependency-graph-driven transaction execution. Both ORTHRUS and LADS rely on explicit message-passing to communicate among threads, which can introduce an unnecessary overhead to transaction execution despite the available shared memory model of a single machine. In contrast, QueCC embraces the shared memory model and applies determinism in a two-phase, priority-based, queue-oriented execution model.
The work proposed in this paper is motivated by a simple yet profound question: is it possible to have concurrent execution over shared data without having any concurrency control? To answer this question, we investigate a deterministic approach to transaction processing geared towards multi-socket, many-core architectures. In particular, we propose QueCC, pronounced Quick, a novel queue-oriented, control-free concurrency architecture that exhibits minimal contention during execution and imposes no coordination among transactions while offering serializable guarantees. The key intuition behind QueCC's design is to eliminate concurrency control by executing a set of batched transactions in two disjoint and deterministic phases of planning and execution, namely, decomposing transactions into (predetermined) priority queues followed by priority-queue-oriented execution. In other words, we impose a deterministic plan of execution on batches of transactions, which eliminates the need for concurrency control during the actual execution of transactions.
1.1 Emergence of Deterministic Key-Value Stores
Early proposals for deterministic execution for transaction processing aimed at data replication (e.g., [15, 17]). The second wave of proposals focused on deterministic execution in distributed environments and on lock-based approaches for concurrency control. For example, H-Store is exclusively tailored for partitionable workloads (e.g., [16]) as it essentially relies on partition-level locks and runs transactions serially within each partition. Calvin and all of its derivatives primarily focused on developing a novel distributed protocol, where essentially all nodes partaking in distributed transactions execute batched transactions on all replicas in a predetermined order known to all. For local in-node concurrency, in Calvin all locks are acquired (in order to avoid deadlock) before a transaction starts, and if not all locks are granted, then the node stalls [33]. In fact, Calvin and QueCC dovetail: the former sequences transactions pre-execution to essentially (almost) eliminate the agreement protocol, while the latter introduces a novel predetermined prioritization and queue-oriented execution model to essentially (almost) eliminate the concurrency protocol.
Serializability. Deterministic data stores guarantee serializable execution of transactions seamlessly. A deterministic transaction processing engine needs to ensure that (a) the order of conflicting operations and (b) the commitment ordering of transactions follow the same order that is determined prior to execution. When those two constraints are satisfied by the execution engine, serializable execution is guaranteed. In fact, from the scheduling point of view, deterministic data stores are less flexible compared to other serializable approaches [27, 39] because there is only one possible serial schedule that is produced by the execution engine. However, this allows the protocol to plan a near-optimal schedule that maximizes the throughput. Furthermore, given the deterministic execution, evaluating and testing the concurrency protocol is dramatically simplified because all non-determinism complexity has been eliminated. Determinism also profoundly simplifies recovery; in fact, normal and recovery routines become identical.
Future of Deterministic In-memory Key-Value Stores. Notably, deterministic data stores have their own advantages and disadvantages, and they may not be optimal for every possible workload [27]. For instance, it is an open question how to support transactions that demand multiple rounds of back-and-forth client-server communication or how to support traditional cursor-based access. Clients must register stored procedures in advance and supply all input parameters at run-time, i.e., the read-set and the write-set of a transaction must be known prior to execution, and the use of non-deterministic functions, e.g., currentTime(), is non-trivial. Notably, there have been several lightweight solutions for efficiently determining the read/write sets (when not known a priori) through a passive, pre-play execution model [9, 11, 12, 16, 32, 33].
1.2 Contributions
In this paper, we make the following contributions:
• we present a rich formalism to model our re-thinking of how transactions are processed in QueCC. Our formalism does not suffer from the traditional data dependency conflicts among transactions because they are seamlessly eliminated by our execution model (Section 2).
• we propose an efficient deterministic, queue-oriented transaction execution model for highly parallel architectures that is amenable to efficient pipelining and offers a flexible and adaptable thread-to-queue assignment to minimize coordination (Section 3).
• we design a novel two-phase, priority-based, queue-oriented planning and execution model that eliminates the need for concurrency control (Section 4).
• we prototype our proposed concurrency architecture within a comprehensive concurrency control testbed, which includes eight modern concurrency techniques, to demonstrate QueCC's effectiveness compared to state-of-the-art approaches based on well-established benchmarks such as TPC-C and YCSB (Section 5).
2 Formalism
Before describing the design and architecture of QueCC, we first present the data and transaction models used by QueCC.
2.1 Data Model
The data model used is the widely adopted key-value storage model. In this model, each record in the database is logically defined as a pair $(k, v)$, where $k$ uniquely identifies a record and $v$ is the value of that record. Internally, we access records by their physical record identifiers (RIDs), i.e., their physical addresses in either memory or on disk.
Operations are modeled as two fundamental types, namely, READ and WRITE operations. Other kinds of operations, such as INSERT, UPDATE, and DELETE, are treated as different forms of the WRITE operation [1].
2.2 Transaction Model
Transactions can be modeled as a directed acyclic graph (DAG) of "sub-transactions" called transaction fragments. Each fragment performs a sequence of operations on a set of records (each internally associated with a RID). In addition to the operations, each fragment is associated with a set of constraints that captures the application integrity. We formally define transaction fragments as follows:
Definition 1. (Transaction fragments):
A transaction fragment $f_i$ is defined as a pair $(S_{op}, C)$, where $S_{op}$ is a finite sequence of READ or WRITE operations on records identified with RIDs that are mapped to the same contiguous RID range, and $C$ is a finite set of constraints that must be satisfied post the fragment execution.
Fragments that belong to the same transaction can have two kinds of dependencies, and such dependencies are based on the transaction's logic. We refer to them as logic-induced dependencies, and they are of two types: (1) data dependencies and (2) commit dependencies [10]. Because these logic-induced dependencies exist among transaction fragments that belong to the same transaction, we call them intra-transaction dependencies to differentiate them from inter-transaction dependencies, which exist between fragments that belong to different transactions. Inter-transaction dependencies are induced by the transaction execution model. Thus, they are also called execution-induced dependencies.
An intra-transaction data dependency between a fragment $f_i$ and another fragment $f_j$, such that $f_j$ is data-dependent on $f_i$, implies that $f_j$ requires some data that is computed by $f_i$. To illustrate, consider a transaction that reads a value $v_i$ of a particular record, say, $r_i$, and updates the value $v_j$ of another record, say, $r_j$, such that $v_j = v_i + 1$. This transaction can be decomposed into two fragments $f_i$ and $f_j$ with a data dependency between $f_i$ and $f_j$ such that $f_j$ depends on $f_i$. We formalize the notion of intra-transaction data dependencies as follows:
Definition 2. (Intra-transaction data dependency):
An intra-transaction data dependency exists between two transaction fragments $f_i$ and $f_j$, denoted as $f_i \xrightarrow{d} f_j$, if and only if both fragments belong to the same transaction and the logic of $f_j$ requires data computed by the logic of $f_i$.
The second type of logic-induced dependency is called an intra-transaction commit dependency. This kind of dependency captures the atomicity of a transaction when some of its fragments may abort due to logic-induced aborts. We refer to such fragments as abortable fragments. Logic-induced aborts are the result of violating integrity constraints defined by applications, which are captured by the set of constraints $C$ for each fragment. Intuitively, if a fragment is associated with at least one constraint that may not be satisfied post the execution of the fragment, then it is abortable.
A formal definition of abortable fragments is as follows:
Definition 3. (Abortable transaction fragments):
A transaction fragment $f_i$ is abortable if and only if $f_i.C \neq \emptyset$.
Using the definition of abortable fragments, intra-transaction commit dependencies are formally defined as follows:
Definition 4. (Intra-transaction commit dependency):
An intra-transaction commit dependency exists between two transaction fragments $f_i$ and $f_j$, denoted as $f_i \xrightarrow{c} f_j$, if and only if both fragments belong to the same transaction and $f_i$ is abortable.
The notion of transaction fragments is similar in spirit to the notion of pieces [10, 30, 37], the notion of actions in DORA [25], and the notion of record actions in LADS [40]. However, unlike those notions, we impose a RID-range restriction on records accessed by fragments and formally model the set of constraints associated with fragments.
Now, we can formally define transactions based on the fragments and their dependencies, as follows:
Definition 5. (Transactions):
A transaction $t_i$ is defined as a directed acyclic graph (DAG) $G_{t_i} := (V_{t_i}, E_{t_i})$, where $V_{t_i}$ is a finite set of transaction fragments $\{f_1, f_2, \ldots, f_k\}$, and $E_{t_i} = \{(f_p, f_q) \mid f_p \xrightarrow{d} f_q \lor f_p \xrightarrow{c} f_q\}$.
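Definitions 1 through 5 can be made concrete with a small sketch. The following Python is illustrative only and is not taken from the QueCC implementation: the `Fragment` class, the `fragment_order` helper, and the representation of operations and constraints as tuples are all assumptions. It models a transaction as a set of fragments plus explicit data and commit edges, checks that every commit edge starts at an abortable fragment (Definition 4), and returns one topological order of the resulting DAG.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Fragment:
    name: str
    ops: tuple                # S_op: sequence of ("READ" | "WRITE", rid) pairs
    constraints: tuple = ()   # C: constraints checked after the fragment runs

    @property
    def abortable(self):
        # Definition 3: abortable iff the constraint set is non-empty.
        return len(self.constraints) > 0


def fragment_order(fragments, data_deps, commit_deps):
    """Return one topological order of the transaction DAG (Definition 5).

    data_deps and commit_deps are sets of (source_name, target_name) edges.
    """
    by_name = {f.name: f for f in fragments}
    for src, _dst in commit_deps:
        if not by_name[src].abortable:
            raise ValueError(f"{src} is not abortable, so it cannot "
                             "be the source of a commit dependency")
    # graphlib expects a mapping from each node to its predecessors.
    preds = {f.name: set() for f in fragments}
    for src, dst in set(data_deps) | set(commit_deps):
        preds[dst].add(src)
    # static_order() raises graphlib.CycleError if the edges are not acyclic.
    return list(TopologicalSorter(preds).static_order())


# Running example: f_i reads record r_i, f_j writes v_j = v_i + 1 to r_j.
f_i = Fragment("f_i", (("READ", 10),))
f_j = Fragment("f_j", (("WRITE", 20),))
order = fragment_order([f_i, f_j], {("f_i", "f_j")}, set())
```

Here `order` places `f_i` before `f_j`, reflecting the data dependency of the running example. The RID values 10 and 20 are arbitrary placeholders.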
In QueCC, there is a third type of dependency that may exist between transaction fragments of different transactions, which is induced by the execution model. Such dependencies are therefore called execution-induced dependencies. Since we are modeling transactions at the level of fragments, we capture them at that level. However, they are called "commit dependencies" by Larson et al. [22] when not considering the notion of transaction fragments. They are the result of speculative reading of uncommitted records. We formally define them as follows:
Definition 6. (Inter-transaction commit dependency):
An inter-transaction commit dependency exists between two transaction fragments $f_i$ and $f_j$, denoted as $f_i \xrightarrow{s} f_j$, if and only if both fragments belong to different transactions and $f_j$ speculatively reads uncommitted data written by $f_i$.
Note that inter-transaction commit dependencies may cause cascading aborts among transactions. This problem can be mitigated by exploiting the idea of "early write visibility", which is proposed by Faleiro et al. [10].
Also, note that execution-induced data dependencies among transactions, used to model conflicts in traditional concurrency control mechanisms, are no longer possible in QueCC because these conflicts are seamlessly resolved and eliminated by the deterministic, priority-based, queue-oriented execution model of QueCC. Non-deterministic data stores that rely on traditional concurrency control mechanisms suffer from non-deterministic aborts caused by their execution model, which employs non-deterministic concurrency control. A notable observation is that deterministic stores eliminate non-deterministic aborts, which improves the efficiency of the transaction processing engine.
3 Priority-based, Queue-oriented Transaction Processing
We first offer a high-level description of our transaction processing architecture. Our proposed architecture (depicted in Figure 1) is geared towards throughput-optimized in-memory key-value stores.
Transaction batches are processed in two deterministic phases. First, in the planning phase, multiple planner threads consume transactions from their respective client transaction queues in parallel and create prioritized execution queues. Each planner thread is assigned a predetermined distinct priority. The idea of priority is essential to the design of QueCC, and it has two advantages. First, it allows planner threads to perform their planning tasks independently and in parallel. By assigning the priority to the execution queue, the ordering of transactions planned by different planner threads is preserved. Second, the priority enables execution threads to decide the order of executing fragments, which leads to correct serializable execution.
The planner thread acts as a local sequencer with a prede-
termined priority for its assigned transactions and spreads
operations of each transaction (e.g., reads and writes) into a
set of queues based on the sequence order.
Each queue represents a distinct set of records, and queues inherit their planner's distinct priority. The goal of the planner is to distribute operations (e.g., READ/WRITE) into a set of almost equal-sized queues. Queues for each planner can be merged or split arbitrarily to keep queue sizes balanced. However, queues across planners can only be combined following the strict priority order of each planner. We introduce the execution-priority invariance, which states that for each record, operations that belong to higher priority queues (created by a higher priority planner) must always be executed before executing any lower priority operations. This execution-priority invariance is the essence of how we capture determinism in QueCC. Since all planners operate at different priorities, they can plan independently in parallel without any contention.
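To make the execution-priority invariance concrete, the following single-threaded Python sketch simulates two planners and priority-ordered execution. It is a simplification under stated assumptions: operations are `(kind, key, value)` tuples, records are mapped to queues by `key % num_queues` (the paper uses RID ranges instead), and the names `plan` and `execute` are hypothetical.

```python
def plan(transactions, num_queues):
    """One planner: spread operations into queues, preserving sequence order."""
    queues = [[] for _ in range(num_queues)]
    for txn in transactions:
        for op in txn:
            # Assumption: a record's queue is chosen by key modulo queue count.
            queues[op[1] % num_queues].append(op)
    return queues


def execute(priority_groups, db):
    """Drain queues so that, per record set, higher priority runs first.

    priority_groups lists each planner's queues, highest priority first.
    Queues with different indices cover disjoint records, so a real system
    could process them on different threads without coordination.
    """
    reads = []
    for q in range(len(priority_groups[0])):
        for group in priority_groups:
            for kind, key, value in group[q]:
                if kind == "WRITE":
                    db[key] = value
                else:  # READ
                    reads.append((key, db.get(key)))
    return reads


planner0 = [[("WRITE", 1, "a"), ("WRITE", 2, "b")]]   # higher priority
planner1 = [[("WRITE", 1, "c"), ("READ", 2, None)]]   # lower priority
db = {}
reads = execute([plan(planner0, 2), plan(planner1, 2)], db)
```

The final state of `db` and the value observed by the read match a serial execution of planner 0's transaction followed by planner 1's, which is the order implied by the planner priorities.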
The execution queues are handed over to a set of execution threads based on their priorities. Each execution thread can arbitrarily select any outstanding queue within a batch and execute its operations without any coordination with other executors. The only criterion that must be satisfied is the execution-priority invariance, implying that if a lower priority queue overlaps with any higher priority queues (i.e., they contain overlapping records), then the operations in all higher priority queues must be executed before the lower priority queue is executed. Depending on the number of operations per transaction and its access patterns, independent operations from a single transaction may be processed in parallel by multiple execution threads without any synchronization among the executors; hence, execution is coordination-free and independent across transactions. Once all the execution queues are processed, the batch is complete, and transactions in the batch are committed except those that violated an integrity constraint. The violations are identified by executing a set of commit threads once each batch is completed.
To ensure recoverability, all parameters required to recreate the execution queues are persisted at the end of the planning phase. A second persistence operation is performed at the end of the execution phase once the batch is fully processed, which is similar to the group commit technique [7].
3.1 Proof of serializability
In this section, we show that QueCC produces serializable execution histories. We use $c(T_i)$ to denote the commit ordering of transaction $T_i$, and $e(f_{ij})$ to denote the completion time for the execution of fragment $f_{ij}$, where $f_{ij}$ belongs to $T_i$. For the sake of this proof, we use the notion of conflicting fragments to have the same meaning as conflicting operations in serializability theory [38]. Without loss of generality, we assume that each fragment accesses a single record, but the same argument applies in general because of the RID range restriction (see Definition 1).
Theorem 1. The transaction execution history produced by QueCC is serializable.
Proof. Suppose that the execution of two transactions $T_i$ and $T_j$ is not serial, and their commit ordering is $c(T_i) < c(T_j)$. Note that their commitment ordering is the same as their ordering when they were planned. Therefore, there exist two conflicting fragments $f_{ip}$ and $f_{jq}$ such that $e(f_{jq}) < e(f_{ip})$, i.e., $f_{jq}$ completes before $f_{ip}$ although $T_i$ is ordered before $T_j$. Because $f_{ip}$ and $f_{jq}$ access the same record, we have the following cases. (Case 1) If $T_i$ and $T_j$ are planned by the same planner thread, their conflicting fragments must be placed in the same execution queue (EQ). Since the commitment ordering is the same as the order in which they were planned, the planner must have placed $f_{ip}$ ahead of $f_{jq}$ in the execution queue, which contradicts the conflicting order. (Case 2) If $T_i$ and $T_j$ are planned by different planner threads, their respective fragments are placed in two different EQs, with the EQ containing $f_{ip}$ having a higher priority than the EQ containing $f_{jq}$. Having $e(f_{jq}) < e(f_{ip})$ implies that the execution-priority invariance is violated, which is also a contradiction.
[Figure 1 diagram not reproduced. Labeled components: Client Transaction Queues, Planning Threads, High Priority Queues, Low Priority Queues, Priority Groups, Execution Queues, Execution Threads, Index (index lookups), direct record access, Main-Memory DB Storage.]
Figure 1. Overview of Priority-based, Queue-oriented Architecture
4 Control-free Architectural Design
In this section, we present the planning and execution techniques introduced by QueCC.
4.1 Deterministic Planning Phase
In the planning phase, our aim is to answer two key questions: how to efficiently produce execution plans and distribute them across execution threads in a balanced manner, and how to efficiently deliver the plans to execution threads?
A planner thread consumes transactions from its dedicated client transaction queue, which eliminates the contention that would result from using a single client transaction queue. Since each planner thread has its own predetermined priority, at this point, transactions are partially ordered based on the planners' priorities. Each planner can independently determine the order within its own partition of the batch. The set of execution queues (EQs) filled by a planner inherits that planner's priority, thus forming a priority group (PG) of EQs. In other words, to represent the priority inheritance of EQs, we associate all EQs planned by a planner with one PG, and each batch is organized into priority groups of EQs, with each group inheriting the priority of its planner. We formally define the notion of a priority group as follows:
Definition 7. (Priority Group):
Given a set of transactions in a batch, $T = \{t_1, t_2, \ldots, t_n\}$, and a set of planner threads $\{pt_1, pt_2, \ldots, pt_k\}$, the planning phase will produce a set of $k$ priority groups $\{pg_1, pg_2, \ldots, pg_k\}$, where each $pg_i$ is a partition of $T$ and is produced by planner thread $pt_i$.
In QueCC, EQs are the main data structure used to represent the workload of transaction fragments. Planners fill EQs with transaction fragments augmented with some additional meta-data during planning and assign EQs to execution threads on batch delivery. EQs have a fixed capacity and are recycled across batches. Under extreme contention, they are dynamically expanded to hold transaction fragments beyond their initial capacity. Planners may physically split or logically merge EQs in order to balance the load given to execution queues. Splitting EQs is costly because it requires copying transaction fragments from one queue to the two new queues that result from the split. The cost of allocating memory for EQs is minimized by maintaining a thread-local pool of EQs, which allows recycling EQs after batch commitment.
We now focus on how each planner produces the priority-
based EQs associated with its PG. Our planning technique is
based on RID value ranges.
Range-based Planning. In our range-based planning approach, each planner starts by partitioning the whole RID space into a number of ranges equal to the number of execution threads. For example, if we have 4 execution threads, then we will initially have 4 range partitions of the whole RID space. Based on the number of transactions accessing each range, that range can be further partitioned progressively into smaller ranges to ensure that they can be assigned to execution threads in a balanced manner (i.e., each execution thread will have the same number of transaction fragments to process). Note that each range is associated with an EQ, and partitioning a range implies splitting its associated EQ as well. Range partitioning is progressive such that the partitioning of a previous batch is reused for future ones, which amortizes the cost of range partitioning across multiple batches and reduces the planning time for subsequent batches.
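The initial partitioning step can be sketched as follows. This is a minimal Python illustration, assuming RIDs are integers in a half-open space `[0, rid_space)`; the function names `initial_ranges` and `find_range` are hypothetical and not part of QueCC.

```python
import bisect


def initial_ranges(rid_space, num_execution_threads):
    """Split [0, rid_space) into contiguous, near-equal half-open ranges."""
    base, extra = divmod(rid_space, num_execution_threads)
    ranges, start = [], 0
    for i in range(num_execution_threads):
        # The first `extra` ranges absorb the remainder, one RID each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def find_range(ranges, rid):
    """Return the index of the range (and hence the EQ) that owns `rid`."""
    starts = [lo for lo, _hi in ranges]
    return bisect.bisect_right(starts, rid) - 1


ranges = initial_ranges(1000, 4)
owner = find_range(ranges, 250)
```

With 4 execution threads and 1000 RIDs, `ranges` holds four ranges of 250 RIDs each, and RID 250 falls into the second range (index 1).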
A range needs to be partitioned if its associated EQ is full. In QueCC, we have an adaptable system configuration parameter that controls the capacity of EQs. When EQs become full during planning, they are split into additional queues. The split algorithm is simple. Given an EQ to split, a planner partitions its associated range in half. Each half of the split range is associated with a new EQ obtained from a thread-local pool of preallocated EQs¹. Based on the new ranges, planners copy transaction fragments from the original EQ into the two new EQs.
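The split step described above can be sketched in a few lines of Python. The `ExecutionQueue` class and its fields are assumptions made for illustration; in particular, this sketch allocates the two new queues directly rather than drawing them from a thread-local pool.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionQueue:
    lo: int        # inclusive lower bound of the RID range
    hi: int        # exclusive upper bound of the RID range
    capacity: int
    fragments: list = field(default_factory=list)  # (rid, payload), planned order

    def is_full(self):
        return len(self.fragments) >= self.capacity


def split(eq):
    """Halve the RID range of `eq` and copy its fragments into two new EQs."""
    mid = (eq.lo + eq.hi) // 2
    left = ExecutionQueue(eq.lo, mid, eq.capacity)
    right = ExecutionQueue(mid, eq.hi, eq.capacity)
    for rid, payload in eq.fragments:
        # Appending in the original order preserves the planned order
        # of fragments within each new queue.
        target = left if rid < mid else right
        target.fragments.append((rid, payload))
    return left, right


eq = ExecutionQueue(0, 100, capacity=4)
eq.fragments = [(10, "a"), (70, "b"), (20, "c"), (90, "d")]
left, right = split(eq)
```

After the split, `left` covers RIDs 0 to 49 and holds the fragments on RIDs 10 and 20, while `right` covers RIDs 50 to 99 and holds the fragments on RIDs 70 and 90.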
A planner needs to determine when a batch is ready. Batches can be considered complete based on time (e.g., complete a batch every 5 milliseconds) or based on counts (e.g., complete a batch every 1000 transactions). The choice of how batches are determined is orthogonal to our techniques. However, in our implementation, we use count-based batches with the batch size being a configurable system parameter. Using count-based batches allows us to easily study the impact of batching. For count-based batches, a planner thread can easily compute the number of transactions in its partition of the batch since the number of planners and the batch size are known parameters. Once the batch is planned and ready, it can be delivered to execution threads for execution.
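As a small worked example of the count-based computation, the sketch below derives each planner's share of a batch from the batch size and the number of planners. How a remainder is distributed when the batch size is not a multiple of the planner count is not specified in the text; giving the extra transactions to the lowest-indexed planners is an assumption, as is the function name `planner_share`.

```python
def planner_share(batch_size, num_planners, planner_index):
    """Number of transactions planner `planner_index` plans in one batch."""
    base, extra = divmod(batch_size, num_planners)
    # Assumption: the first `extra` planners each take one more transaction.
    return base + (1 if planner_index < extra else 0)


shares = [planner_share(1000, 4, i) for i in range(4)]
```

For a batch of 1000 transactions and 4 planners, every planner plans 250 transactions.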
¹ If the pool is empty, a new EQ is dynamically allocated.
Operation Planning. Planning READ and UPDATE operations is straightforward, but special handling is needed for planning INSERT operations. When planning a READ or an UPDATE operation, a planner simply performs an index lookup to find the RID value for the record and its pointer. Based on the RID value, it determines the EQ responsible for the transaction fragment. It checks whether the EQ is full and performs a split if needed. Finally, it inserts the transaction fragment into the EQ. DELETE operations are handled the same way as UPDATE operations from the planning perspective. For INSERT operations, a planner assigns a new RID value to the new record and places the fragment into the respective EQ.
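The planning steps for individual operations can be sketched as follows. This Python is a simplified illustration, not QueCC's code: the `Planner` class, the dictionary standing in for the index, and the `next_rid` counter are all assumptions, and the full-queue check with its split is omitted for brevity.

```python
import bisect


class Planner:
    def __init__(self, index, range_starts, next_rid):
        self.index = index                # hypothetical index: key -> RID
        self.range_starts = range_starts  # sorted lower bounds of RID ranges
        self.queues = [[] for _ in range_starts]  # one EQ per range
        self.next_rid = next_rid          # next unused RID for INSERTs

    def _queue_for(self, rid):
        # The EQ responsible for a RID is the one whose range contains it.
        return self.queues[bisect.bisect_right(self.range_starts, rid) - 1]

    def plan(self, op, key):
        if op == "INSERT":
            # Assign a fresh RID to the new record.
            rid = self.next_rid
            self.next_rid += 1
        else:
            # READ, UPDATE, and DELETE resolve the RID via an index lookup.
            rid = self.index[key]
        self._queue_for(rid).append((op, rid))
        return rid


planner = Planner({"a": 5, "b": 60}, range_starts=[0, 50], next_rid=100)
planner.plan("READ", "a")
planner.plan("UPDATE", "b")
planner.plan("INSERT", "c")
```

In this example, the read lands in the queue for RIDs below 50, while the update and the insert land in the queue for RIDs of 50 and above.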
4.2 Deterministic Execution Phase
Once the batch is delivered, execution threads start processing transaction fragments from their assigned EQs without any need for controlling their access to records. Fragments are executed in the same order in which they are planned within a single EQ. Execution threads try to execute the whole EQ before moving to the next EQ. An execution thread may encounter a transaction fragment that has an intra-transaction data dependency on another fragment that resides in another EQ. Data dependencies exist when intermediate values are required to execute the fragment in hand. Once the intermediate values are computed by the corresponding fragments, they are stored in the transaction's meta-data, which is accessible by all transaction fragments. Data dependencies may trigger EQ switching before the whole EQ is consumed. In particular, an EQ switch occurs if intermediate values required by the fragment in hand are not available.
To illustrate, consider the example transaction from
Section 2, which has the following logic:
$f_i = \{a = \mathrm{read}(k_i)\}$,
$f_j = \{b = a + 1;\ \mathrm{write}(k_j, b)\}$, where keys are
denoted as $k_i$. In this transaction, we have a data
dependency between the two transaction fragments. The WRITE
operation on $k_j$ cannot be performed until the READ
operation on $k_i$ is completed. Suppose that $f_i$ and $f_j$
are placed in two separate EQs, e.g., $EQ_1$ and $EQ_2$
respectively. An attempt to execute $f_j$ before $f_i$ can
happen, which triggers an EQ switch by the attempting
execution thread. Note that this delaying behavior (see
footnote 2) is unavoidable because there is no way for $f_j$
to complete without the completion of $f_i$. This mechanism
of EQ switching ensures that the execution thread will only
wait if data dependencies associated with transaction
fragments at the head of all EQs are not satisfied. Our EQ
switch mechanism is very lightweight and requires only a
single private counter per EQ to keep track of how many
fragments of the EQ are consumed.
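The EQ switching behavior described above can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not the paper's implementation: the `Fragment` class, its fields, and `run_executor` are hypothetical names, and the transaction meta-data is modeled as a plain dictionary.

```python
class Fragment:
    """One transaction fragment (all field names here are hypothetical)."""
    def __init__(self, name, ctx, needs, action):
        self.name = name      # label used only for tracing
        self.ctx = ctx        # transaction meta-data shared by its fragments
        self.needs = needs    # intermediate values required before running
        self.action = action  # callable(ctx, store) doing the actual work

def ready(frag):
    # A fragment can run once every intermediate value it needs is published.
    return all(key in frag.ctx for key in frag.needs)

def run_executor(eqs, store):
    """Drain the EQs of one execution thread, switching on unmet dependencies."""
    consumed = [0] * len(eqs)  # private per-EQ counter of consumed fragments
    trace = []
    while any(consumed[i] < len(eq) for i, eq in enumerate(eqs)):
        progressed = False
        for i, eq in enumerate(eqs):
            while consumed[i] < len(eq) and ready(eq[consumed[i]]):
                frag = eq[consumed[i]]
                frag.action(frag.ctx, store)
                trace.append(frag.name)
                consumed[i] += 1
                progressed = True
            # Leaving the inner loop with fragments left over is an EQ switch.
        if not progressed:
            # In a real system another thread would publish the missing values.
            raise RuntimeError("heads of all EQs have unmet dependencies")
    return trace

def read_ki(ctx, store):
    ctx["a"] = store["ki"]        # f_i: a = read(k_i)

def write_kj(ctx, store):
    store["kj"] = ctx["a"] + 1    # f_j: b = a + 1; write(k_j, b)

store = {"ki": 41, "kj": 0}
ctx = {}
f_i = Fragment("f_i", ctx, needs=[], action=read_ki)
f_j = Fragment("f_j", ctx, needs=["a"], action=write_kj)
# The thread visits the EQ holding f_j first, so it must switch away from it.
trace = run_executor([[f_j], [f_i]], store)
```

In this sketch the thread first finds f_j at the head of its queue, cannot run it, switches to the queue holding f_i, and only then returns to f_j.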
Execution Priority Invariance
Each execution thread (ET) is assigned one or more EQs in
each PG. ETs can execute fragments from multiple PGs. Since
EQs are planned independently by each planner, the following
degenerate case may occur. Consider two planner threads,
say, $pt_0$ and $pt_1$ with their respective PGs (i.e.,
$pg_0$ and $pg_1$), and two execution threads $et_0$ and
$et_1$. A total of four EQs are planned in the batch. Each EQ
is denoted as $EQ_{ij}$ such that $i$ refers to the planner
thread index and $j$ refers to the execution thread index,
according to the assignment. For example, $EQ_{00}$ is
assigned by $pt_0$ to $et_0$, and so forth. Therefore, we
have the following set of EQs: $EQ_{00}$, $EQ_{01}$,
$EQ_{10}$, and $EQ_{11}$. Now for each EQ, there is an
associated RID range $r_{ij}$, and the indices of the ranges
correspond to planner and execution threads, respectively. A
violation of the execution priority invariance occurs under
the following conditions: (1) $et_0$ starts executing
$EQ_{10}$; (2) $et_1$ has not completed the execution of
$EQ_{01}$; (3) a fragment in $EQ_{01}$ updates a record,
while a fragment in $EQ_{10}$ reads the same record (this
implies that $r_{10}$ overlaps with $r_{01}$). Therefore, to
ensure the invariance, an executor checks that all overlapped
EQs from higher priority PGs have completed their processing.
If so, it proceeds with the execution of the EQ in hand;
otherwise, it switches to another EQ. Fully processing all
planned EQs in a batch signifies that all transactions are
executed, and execution threads can start the commit stage
for the whole batch. Notably, at any point during the
execution, the executor thread may act as a commit thread, by
checking commit dependencies of fully executed transactions
as described next.

Footnote 2: Notably, although further processing of a queue
may be delayed, the executor is not blocked and may simply
begin processing another queue.
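The check an executor performs before starting an EQ can be sketched as below. This is only an illustration under assumptions: the `EQ` record, the half-open RID ranges, and the rule that a lower planner index means a higher priority are choices made here for the example, consistent with the scenario above but not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class EQ:
    planner: int   # index i of the planner / priority group (lower = higher priority, assumed)
    executor: int  # index j of the assigned execution thread
    rid_lo: int    # start of the RID range (inclusive)
    rid_hi: int    # end of the RID range (exclusive)
    done: bool = False

def overlaps(a, b):
    # Two half-open ranges overlap when each starts before the other ends.
    return a.rid_lo < b.rid_hi and b.rid_lo < a.rid_hi

def can_execute(eq, batch):
    """True if every overlapping EQ from a higher-priority PG has completed."""
    return all(other.done for other in batch
               if other.planner < eq.planner and overlaps(other, eq))

eq00 = EQ(0, 0, 0, 100)
eq01 = EQ(0, 1, 100, 200)
eq10 = EQ(1, 0, 50, 150)    # overlaps both eq00 and eq01
eq11 = EQ(1, 1, 150, 250)
batch = [eq00, eq01, eq10, eq11]

before = can_execute(eq10, batch)  # eq00 and eq01 are still pending
eq00.done = True
eq01.done = True
after = can_execute(eq10, batch)   # all higher-priority overlapping EQs are done
```

When `can_execute` returns false, the executor would switch to another EQ rather than block, as described above.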
Commit Dependency Tracking
When processing a transaction, execution threads need to
track inter-transaction commit dependencies. When a
transaction fragment speculatively reads uncommitted data
written by a fragment that belongs to another transaction in
the batch, a commit dependency is formed between the two
transactions. This dependency must be checked during
commitment (or as soon as all prior transactions are fully
executed) to ensure that the earlier transaction has
committed. If the earlier transaction is aborted, the later
transaction must abort. This dependency information is stored
in the transaction context. To capture such dependencies,
QueCC uses an approach similar to that of Larson et al. [22]
for dealing with commit dependencies. QueCC maintains the
transaction ID of the last transaction that updated a record
in per-record meta-data. During execution, the transaction ID
is checked and if it refers to a transaction that belongs to
the current batch, a commit dependency counter for the
current transaction is incremented and a pointer to the
current transaction’s context is added to the context of the
other transaction. During the commit stage, when a
transaction is committing, the counters of all dependent
transactions are decremented. When the commit dependency
counter is equal to zero, the transaction is allowed to
commit. Once all execution threads are done with their
assigned work, the batch goes through a commit stage. This
can be done in parallel by multiple threads.
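The counter-and-pointer scheme described above can be sketched as follows. It is a simplified, single-threaded illustration: `TxnContext`, `record_write`, `record_read`, `try_commit`, and `abort` are hypothetical names, and per-record meta-data is modeled as a dictionary.

```python
class TxnContext:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.commit_dep_count = 0  # earlier uncommitted writers we read from
        self.dependents = []       # contexts of transactions that read our writes
        self.state = "executed"    # "executed", "committed", or "aborted"

def record_write(record_meta, writer):
    # Remember the last transaction that updated this record.
    record_meta["last_writer"] = writer.txn_id

def record_read(record_meta, reader, batch):
    """Register a commit dependency if the last writer is in the current batch."""
    writer = batch.get(record_meta.get("last_writer"))
    if writer is None or writer is reader or writer.state == "committed":
        return
    reader.commit_dep_count += 1
    writer.dependents.append(reader)

def try_commit(txn):
    """Commit only when no uncommitted transaction is still depended upon."""
    if txn.state != "executed" or txn.commit_dep_count > 0:
        return False
    txn.state = "committed"
    for dependent in txn.dependents:
        dependent.commit_dep_count -= 1
    return True

def abort(txn):
    """Abort a transaction and cascade to every transaction that read from it."""
    if txn.state == "aborted":
        return
    txn.state = "aborted"
    for dependent in txn.dependents:
        abort(dependent)

batch = {1: TxnContext(1), 2: TxnContext(2)}
t1, t2 = batch[1], batch[2]
meta = {}
record_write(meta, t1)        # t1 updates the record
record_read(meta, t2, batch)  # t2 speculatively reads t1's uncommitted write
early = try_commit(t2)        # refused: t1 has not committed yet
first = try_commit(t1)
second = try_commit(t2)       # allowed once t1 has committed
```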
4.3 QueCC Implementation Details
Plan Delivery After each planner completes its batch
partition and constructs its PG, the PG needs to be delivered
to the execution layer so that execution threads can start
executing transactions. In QueCC, we use a simple lock-free
delivery mechanism using atomic operations. We utilize a
shared data structure called BatchQueue, which is basically a
circular buffer that contains slots for each batch. Each
batch slot contains pointers to partitions of priority
groups, which are set in a latch-free manner using atomic CAS
operations. Priority group partitions are assigned to
execution threads. Figure 2 illustrates an example of
concurrent batch planning and execution of batch $b_{i+1}$
and $b_i$ respectively. In this example, planner threads
denoted as $PT_0$ and $PT_1$ are planning their respective
priority groups for batch $b_{i+1}$; and concurrently,
execution threads $ET_0$ and $ET_1$ are executing EQs from
the previously planned batch (i.e., batch $b_i$).

Figure 2. Example of concurrent batch planning and execution
with 4 worker threads (2 planner threads + 2 execution
threads). Priority groups are color-coded by planners.
Execution threads process transactions from both priority
groups.
Delivering priority group partitions to the execution layer
must be efficient and lightweight. For this reason, QueCC
uses a latch-free mechanism for delivery. The mechanism goes
as follows. Execution threads spin on priority group
partition slots while they are not set (i.e., their value is
zero). Once the priority groups are ready to be delivered,
planner threads merge EQs into priority group partitions such
that the workload is balanced, and each priority group
partition is assigned to one execution thread. Note that we
deterministically assign EQs to execution threads. The
alternative is to make all EQs available to all execution
threads, but this approach has a risk of increasing
contention when there are many execution threads. To achieve
a balanced workload among execution threads, we have a simple
greedy algorithm that keeps track of how many transaction
fragments are assigned to each execution thread. It iterates
over the remaining unassigned EQs until all EQs are assigned.
In each iteration, it assigns an EQ to the worker with the
lowest load.
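The greedy balancing step can be sketched as below. This is one possible reading of the description: the function name `assign_eqs`, the use of a min-heap to find the least-loaded worker, and visiting EQs in their given order are assumptions made for the example.

```python
import heapq

def assign_eqs(eq_sizes, num_executors):
    """Assign each EQ to the execution thread with the lowest current load.

    eq_sizes[k] is the number of fragments in EQ k; the result maps each
    execution thread index to the list of EQ indices assigned to it.
    """
    # Heap entries are (fragments assigned so far, execution thread index).
    heap = [(0, et) for et in range(num_executors)]
    heapq.heapify(heap)
    assignment = {et: [] for et in range(num_executors)}
    for eq_id, size in enumerate(eq_sizes):
        load, et = heapq.heappop(heap)           # least-loaded worker
        assignment[et].append(eq_id)
        heapq.heappush(heap, (load + size, et))  # account for the new fragments
    return assignment

# Four EQs with 5, 3, 8, and 2 fragments, balanced over two execution threads.
assignment = assign_eqs([5, 3, 8, 2], num_executors=2)
```

Because ties are broken by thread index and EQs are visited in a fixed order, the assignment is deterministic for a given plan.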
Once a planner is done creating the execution thread
assignments, it uses atomic CAS operations to set the values
of the slots in the BatchQueue to point to the list of
assigned EQs for each execution thread, which constitutes the
priority group partition assigned to the respective execution
thread.
In the pipelined design, execution threads are either
processing EQs or waiting for their slots to be set by
planner threads. As soon as the slot is set, execution
threads can start processing EQs from the newly planned
batch. On the other hand, for the un-pipelined configuration,
worker threads acting as planner threads will synchronize at
the end of the planning phase. Once the synchronization is
completed, worker threads will act as execution threads and
start executing EQs.
Note that in QueCC, regardless of the number of planner
threads and execution threads, there is zero contention with
respect to the BatchQueue data structure.
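A rough model of the BatchQueue slots is sketched below. Python has no hardware compare-and-swap, so the `_cas` helper emulates one with a lock purely for illustration; the real structure is described as latch-free. The class layout, method names, and the one-slot-per-(batch, planner, executor) indexing are assumptions made for this example.

```python
import threading

class BatchQueue:
    """Circular buffer of delivery slots (a sketch, not the paper's code)."""

    def __init__(self, capacity, planners, executors):
        self.capacity = capacity
        # slots[batch % capacity][planner][executor] holds a PG partition or None.
        self.slots = [[[None] * executors for _ in range(planners)]
                      for _ in range(capacity)]
        self._cas_guard = threading.Lock()  # stands in for an atomic CAS

    def _cas(self, cell, index, expected, new):
        # Emulated compare-and-swap: write `new` only if the slot holds `expected`.
        with self._cas_guard:
            if cell[index] is expected:
                cell[index] = new
                return True
            return False

    def deliver(self, batch_no, planner, executor, eq_list):
        """Planner side: publish a PG partition into an empty slot."""
        cell = self.slots[batch_no % self.capacity][planner]
        return self._cas(cell, executor, None, eq_list)

    def take(self, batch_no, planner, executor):
        """Executor side: return the partition if set (else None) and clear the slot."""
        cell = self.slots[batch_no % self.capacity][planner]
        eq_list = cell[executor]
        if eq_list is None:
            return None  # a real executor would keep spinning here
        self._cas(cell, executor, eq_list, None)
        return eq_list

bq = BatchQueue(capacity=2, planners=2, executors=2)
empty = bq.take(0, 0, 1)                   # nothing delivered yet
ok = bq.deliver(0, 0, 1, ["EQ01"])         # planner 0 publishes for executor 1
dup = bq.deliver(0, 0, 1, ["other"])       # slot still occupied, CAS fails
got = bq.take(0, 0, 1)
```

Under this indexing each slot has a single writer (one planner) and a single reader (one execution thread), which is one way to picture why threads would not contend on the structure.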
RID Management
Our planning is based on record identifiers (RIDs). RIDs can
be physical or logical depending on the storage architecture
being row-oriented or column-oriented. Typically, physical
RIDs are used in row-oriented storage, while logical RIDs are
used in column-oriented storage. As opposed to traditional
disk-oriented data stores, where RIDs are typically physical
and are composed of the disk page identifier and the record
offset, main-memory stores typically use memory pointers as
physical RIDs. On the other hand, logical RIDs can be used as
an optimization to improve performance under contention. In
QueCC, we use logical RIDs from a single space of 64-bit
integers. These logical RIDs, which are used for planning
purposes, are stored alongside index entries.
4.4 Discussion
QueCC supports “speculative write visibility” (SWV) when
executing transaction fragments because it defers commitment
to the end of the batch and allows reading uncommitted data
written within a batch. In general, transaction fragments
that may abort can cause cascading aborts. To ensure
recoverability, QueCC maintains an undo buffer per
transaction, which is populated by the pre-write values of
records (or their fields) being updated. A transaction can
abort only if at least one of its fragments is abortable and
has exercised its abort action.
If a transaction aborts, the original values are recovered
from the undo buffers. This approach makes conservative
assumptions about the abortability of transaction fragments
(i.e., it assumes that all transaction fragments can abort).
The overhead of maintaining undo buffers can be eliminated if
the transaction fragment is guaranteed to commit (i.e., it
does not depend on other fragments). We can maintain
information about the abortability of a transaction fragment
in its respective transaction meta-data. Thus, instead of
populating the undo buffers “blindly”, we can check the
possibility of an abort by looking at the transaction
context, and skip the copying to undo buffers if the
transaction is guaranteed to commit (i.e., passed its commit
point [10]).
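The undo-buffer optimization can be sketched as follows. This is a simplified illustration: the `Txn` class and the `pending_abortable` counter (the number of abortable fragments that have not yet executed) are hypothetical ways of representing "passed its commit point" and are not taken from the paper.

```python
class Txn:
    def __init__(self, txn_id, abortable_fragments):
        self.txn_id = txn_id
        # Abortable fragments that have not executed yet (hypothetical field).
        self.pending_abortable = abortable_fragments
        self.undo = []  # (key, pre-write value) pairs, oldest first

    def guaranteed_to_commit(self):
        # Past the commit point: no remaining fragment can trigger an abort.
        return self.pending_abortable == 0

def write(store, txn, key, value):
    """Update a record in place, backing up the old value only when needed."""
    if not txn.guaranteed_to_commit():
        txn.undo.append((key, store[key]))
    store[key] = value

def abort(store, txn):
    """Restore pre-write values in reverse order of the updates."""
    for key, old_value in reversed(txn.undo):
        store[key] = old_value
    txn.undo.clear()

store = {"x": 1, "y": 2}

t1 = Txn(1, abortable_fragments=1)  # may still abort, so writes are backed up
write(store, t1, "x", 10)
undo_len_t1 = len(t1.undo)
abort(store, t1)                    # rolls x back to its original value

t2 = Txn(2, abortable_fragments=0)  # guaranteed to commit, no backup needed
write(store, t2, "y", 20)
undo_len_t2 = len(t2.undo)
```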
However, QueCC is not limited to only SWV and can support
multiple write visibility policies. Faleiro et al. [10]
introduced a new write visibility policy called “early write
visibility” (EWV), which can improve the throughput of
transaction processing by allowing reads on records only if
their respective writes are guaranteed to be committed with
serializability guarantees. Unlike SWV, which is prone to
cascading aborts, EWV is not. In fact, both EWV and SWV can
be used at the same time by QueCC. A special token is placed
ahead of the original fragment to make ETs adhere to the EWV
policy. If that special token is not placed, then ETs will
follow the SWV course. One major advantage of using EWV in
the context of QueCC is eliminating the process of backing up
copies of records in the undo buffers. Since the transaction
that updated the record is guaranteed to commit, there will
be no potential rollback and the undo action is unnecessary.

Table 1. YCSB workload configurations. Notes: default values
are in parentheses; in partitioned stores, it reflects the
number of partitions; batch size parameters are applicable
only to QueCC; the multi-partition transaction parameter is
applicable only to the partitioned stores.

Parameter Name               Parameter Values
# of worker threads          4, 8, 16, 24, (32)
Zipfian’s theta              0.0, 0.4, 0.8, 0.9, (0.99)
% of write operations        0%, 5%, 20%, (50%), 80%, 95%
Rec. sizes                   50B, (100B), 200B, 400B, 800B, 1KB, 2KB
Operations per txn           1, 10, (16), 20, 32
Batch sizes                  1K, 4K, (10K), 20K, 40K, 80K
% of multi-partition txns.   1%, 5%, 10%, 20%, 50%, 80%, 100%
5 Experimental Analysis
To evaluate QueCC, we have substantially extended an existing
concurrency testbed, referred to as Deneva [12, 23], which is
the successor of DBx1000 [41].
This is a comprehensive framework for evaluating concurrency
control schemes, and it includes many concurrency control
techniques. We compare QueCC to a variant of two-phase
locking [8] (i.e., NO-WAIT [3] as a representative of
pessimistic concurrency control), TicToc [42], Cicada [23],
SILO [35], FOEDUS with MOCC [19, 36], ERMIA with SSI and
SSN [18], and H-Store [16].
5.1 Experimental Setup
We run all of our experiments using a Microsoft Azure G5 VM
instance. This VM is equipped with an Intel Xeon CPU
E5-2698B v3 running at 2GHz, and has 32 cores. The memory
hierarchy includes a 32KB L1 data cache, 32KB L1 instruction
cache, 256KB L2 cache, 40MB L3 cache, and 448GB of RAM. The
operating system is Ubuntu 16.04.3 LTS (xenial). The codebase
is compiled with GCC version 5.4.0 and the -O3 compiler
optimization flag.
The workloads are generated at the server before any
transaction is processed, and are stored in main-memory
buffers. This is done to remove any effects of the network,
and allows us to study concurrency control protocols under
high stress.
Every experiment starts with a warm-up period where
measurements are not collected; followed by a measured
period. Each experiment is run three times, and the average
value is reported in the results of this section.
We focus on evaluating three metrics: throughput, latency,
and abort percentage. The abort percentage is computed as the
ratio of the total number of aborted transactions to the
total number of attempted transactions (i.e., both aborted
and committed transactions).

Table 2. TPC-C workload configurations; default values are in
parentheses.

Parameter Name          Parameter Values
# of worker threads     4, 8, 16, 24, (32)
% of payment txns.      0%, 50%, 100%
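As a small worked form of the abort percentage defined in this section (the symbols $N_{\text{aborted}}$ and $N_{\text{committed}}$ are introduced here only for illustration and do not appear in the original text), the metric can be written as:

```latex
\[
\text{abort percentage} = 100\% \times
  \frac{N_{\text{aborted}}}{N_{\text{aborted}} + N_{\text{committed}}}
\]
```

For instance, a run with 25 aborted and 75 committed transactions has an abort percentage of 25%.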
5.2 Workloads Overview
We have experimented with both YCSB and TPC-C benchmarks.
Below, we briefly discuss the workloads used in our
evaluation.
YCSB [5] is a web-application benchmark that is
representative of web applications used by YAHOO. While the
original workload does not have any transaction semantics,
ours is adapted to have transactional capability by including
multiple operations per transaction. Each operation can be
either a READ or a READ-MODIFY-WRITE operation. The ratio of
READ to WRITE operations can also vary. The benchmark
consists of a single table.
The table in our experiments contains 16 million records.
Table 1 summarizes the various configuration parameters used
in our evaluation, and default values are in parentheses. The
data access patterns can be controlled using the parameter θ
of the Zipfian distribution. For example, a workload with
uniform access has θ = 0.0, while a skewed workload has a
larger value of theta, e.g., θ = 0.99.
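To make the role of θ concrete, the sketch below draws record indices with probability proportional to 1/rank^θ using a simple inverse-CDF lookup. It is only an illustration of the distribution's shape; it is not the generator used by YCSB or by the testbed, and the function name and the small table size are assumptions chosen to keep the example fast.

```python
import bisect
import itertools
import random

def make_zipf_sampler(num_records, theta, seed=0):
    """Return a function that samples record indices in [0, num_records).

    Index k (rank k + 1) is drawn with probability proportional to
    1 / (k + 1) ** theta, so theta = 0.0 gives uniform access.
    """
    weights = [1.0 / (rank ** theta) for rank in range(1, num_records + 1)]
    cdf = list(itertools.accumulate(weights))  # unnormalized cumulative weights
    total = cdf[-1]
    rng = random.Random(seed)

    def sample():
        # Find the first index whose cumulative weight reaches the random point.
        return bisect.bisect_left(cdf, rng.random() * total)

    return sample

skewed = make_zipf_sampler(1000, theta=0.99, seed=1)
uniform = make_zipf_sampler(1000, theta=0.0, seed=1)

# Count how often the single hottest record (index 0) is accessed.
skewed_hits = sum(skewed() == 0 for _ in range(10_000))
uniform_hits = sum(uniform() == 0 for _ in range(10_000))
```

With θ = 0.99 a large share of accesses lands on the first few records, which is the source of the contention studied in the experiments, whereas with θ = 0.0 the hottest record is accessed no more often than any other.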
TPC-C [34] is the industry standard benchmark for evaluating
transaction processing systems. It basically simulates a
wholesale order processing system. Each warehouse is
considered to be a single partition. There are 9 tables and 5
transaction types for this benchmark. The data store is
partitioned by warehouse, which is considered the best
possible partitioning scheme for the TPC-C workload [6].
Similar to previous studies in the literature [12, 41], we
focus on the two main transaction profiles (NewOrder and
Payment) out of the five profiles, which correspond to 88% of
the default TPC-C workload mix [34]. These two profiles are
also the most complex ones. For example, the NewOrder
transaction performs 2 READ operations, 6 to 16
READ-MODIFY-WRITE operations, 7 to 16 INSERT operations, and
about 15% of these operations can access a remote partition.
The Payment transaction, on the other hand, performs 3
READ-MODIFY-WRITE operations, and 1 INSERT operation. One of
the reads uses the last name of the customer, which requires
a little more work to look up the record.
In this paper, we primarily study high-contention workloads
because when there is limited or no contention, then,
generally, the top approaches behave comparably with
negligible differences. This choice also has an important
practical significance [19, 22, 23, 35, 42] because real
workloads are often skewed, thus exhibiting high contention.
Therefore, in the interest of space, we present our detailed
results for high-contention workloads and briefly overview
the results for lower-contention scenarios.

Figure 3. Varying batch sizes and high data access skew
(θ = 0.99). Panels: (a) Throughput; (b) Latency.
Figure 4. Time breakdown when varying number of worker
threads.
5.3 YCSB Experiments
Using YCSB workloads, we start by evaluating the performance
of QueCC with different batch sizes, which is a unique aspect
of QueCC. Subsequently, we compare QueCC with other
concurrency control protocols.
Effect of Batch Sizes
We gear our experiments to study the effect of batch sizes on
throughput and latency for QueCC because it is the only
approach that uses batching. We use a write-intensive
workload, 32 worker threads, a record size of 100 bytes,
Zipfian’s θ = 0.99, and 16 operations per transaction.
We observe that QueCC exhibits low average latency (i.e.,
under 3ms) for batches smaller than 20K transactions
(Figure 3b), which is considered reasonable for many
applications. For the remaining experiments, we use a batch
of size 10K.
Time Breakdown
Figure 4 illustrates the breakdown of time spent in each
phase of QueCC under highly skewed data accesses. Notably,
QueCC continues to achieve high utilization even under an
extreme contention model. For example, even when scaling to
32 worker threads, over 80% of the time is dedicated to
useful work, i.e., the planning and execution phases.
Effect of Data Access Skew
We evaluate the effect of varying record contention using
Zipfian’s θ parameter of the YCSB workload while keeping the
number of worker threads constant. We use 32 worker threads
and assign one to each available core. Figure 5a shows the
throughput results of QueCC compared with other concurrency
control protocols. We use a write-intensive workload which
has 50% READ-MODIFY-WRITE operations per transaction. As
expected, QueCC performs comparably with the best competing
approaches under low-contention scenarios (θ ≤ 0.8).
Remarkably, in high-contention scenarios, QueCC begins to
significantly outperform all the state-of-the-art approaches.
QueCC improves on the next best approach by 3.3× with
θ = 0.99, and has 35% better throughput with θ = 0.9. The
main reason for QueCC’s high throughput is that it eliminates
concurrency-control-induced aborts completely. On the other
hand, the other approaches suffer from excessive transaction
aborts, which lead to wasted computations and complete stalls
for lock-based approaches. This experiment also highlights
the stability and predictability of QueCC with respect to the
degree of contention.

Figure 5. Variable contention (θ) on write-intensive YCSB
workload. Panels: (a) Throughput; (b) Abort Percentage.
Figure 6. Scaling worker threads under write-intensive
workload. Panels: (a) High contention, θ = 0.99; (b) Abort
percentage, θ = 0.99.
Scalability
We evaluate the scalability of QueCC by varying the number of
worker threads while maintaining a skewed, write-intensive
access pattern. We observe that all other approaches scale
poorly under a highly concurrent access scenario (Figure 6a)
despite employing techniques to reduce the cost of contention
(e.g., Cicada). In contrast, QueCC scales well despite the
higher contention due to the increased number of threads. For
instance, QueCC achieves nearly 3× the throughput of Cicada
with 32 worker threads.
These results demonstrate the effectiveness of QueCC’s
concurrency architecture that exploits the untapped
parallelism available in transactional workloads. Figure 6b
shows the abort rate for Cicada, TicToc, and ERMIA as
parallelism increases. This high abort-rate behavior is
caused by the large number of worker threads competing to
read and modify a small set of records (cf. Figure 6). Unlike
QueCC, any non-deterministic scheduling and concurrency
control protocol will be subject to significantly amplified
abort rates when the number of conflicting operations by
competing threads increases.
Effect of Write Operation Percentage
Another factor that contributes to contention is the
percentage of write operations. With read-only workloads,
concurrency control protocols exhibit limited contention even
if the data access is skewed. However, as the number of
conflicting write operations on records increases, the
contention naturally increases, e.g., exclusive locks need to
be acquired for NO-WAIT, more validations fail for SILO and
Cicada, and in general, any approach relying on the
optimistic assumption that conflicts are rare will suffer.
Since QueCC does not perform any concurrency control during
execution, no contention arises from the write operations.

Figure 7. Results for varying the percentage of write
operations in each transaction. Panels: (a) High contention,
θ = 0.99; (b) Abort percentage, θ = 0.99.
Figure 8. Results for varying the size of records under high
contention. Panels: (a) Throughput; (b) Abort Percentage.
In addition to increased contention, write operations
translate into an increased size of undo logging for
recovery. This is an added cost for any in-place update
approach and QueCC is no exception. As we increase the write
percentage, more records are backed up in the undo buffers,
which negatively impacts the overall system throughput. Of
course, the undo-buffer overhead can be mitigated with a
multi-version storage model, which avoids in-place updates.
Nevertheless, QueCC significantly outperforms other
concurrency control protocols by up to 4.5× under
write-intensive workloads, i.e., once the write percentage
exceeds 50%.
Effect of Record Sizes
Having larger record sizes may also negatively affect the
performance of the logging component as shown in Figure 8.
Since the undo log maintains a copy of every modified record,
the logging throughput suffers when large records are used.
One approach to handle the logging is to exploit the notion
of “abortability” of the transaction that last updated the
record, and re-purpose the key principle of EWV [10] (see
footnote 3). Even under logging pressure, which begins to
become one of the dominant factors when the record size
reaches 2KB, QueCC continues to maintain its superiority and
outperforms Cicada by a factor of 3× despite the contention
regulation mechanism employed by Cicada.

Footnote 3: Similarly in QueCC, we check if all fragments of
the last writer transaction have been executed successfully;
if so, we avoid writing to the undo buffers, and we further
avoid adding the commit dependency.

Figure 9. Results for varying the number of operations in
each transaction. Panels: (a) High contention, θ = 0.99;
(b) Abort percentage, θ = 0.99.
Figure 10. Results of multi-partition transactions with
comparison to H-Store. Panels: (a) Throughput; (b) Abort
percentage.
Eect of Transaction Size
So far, each transaction con-
tains a total of 16 operations. Now we evaluate the eect
of varying the number of operations per transaction, essen-
tially the depth of a transaction. Figure 9 shows the results
of having 1
,
10
,
16
,
20, and 32 operations per transaction un-
der high data skew. For these experiments, we report the
throughput in terms of the number of operations completed
or records accessed per second. For all concurrency control
protocols, the throughput is lowest when there is only a
single operation per transaction, which indicates that the
work for ensuring transactional semantics is becoming the
bottleneck.
More interestingly, when increasing the transaction depth,
the probability of conicting access is also increased; thereby,
higher contention and higher abort rates. In contrast, under
higher contention, eCC continues to have zero percent
abort rates. It further benets from improved cache-locality
and yields higher throughput because the smaller subset
of records are handled by the same worker thread. eCC
further exploits intra-transaction parallelism and altogether
improves up to 2
.
7
×
over the next best performing protocol
(Cicada) when increasing the transaction depth.
Comparison to Partitioned Stores
QueCC is not sensitive to multi-partition transactions
despite its per-queue, single-threaded execution model, which
is one of its key distinctions. To establish QueCC’s
resilience to non-partitionable workloads, we devise an
experiment in which we vary the degree of multi-partition
transactions. Figure 10 illustrates that QueCC throughput
virtually remains the same regardless of the percentage of
multi-partition transactions. We observed that QueCC improves
over H-Store by a factor of 4.26× even when there are only 1%
multi-partition transactions in the workload. Remarkably,
with 100% multi-partition transactions, QueCC improves on
H-Store by two orders of magnitude. H-Store is limited to a
thread-to-transaction assignment and resolves conflicting
access at the partition level. For multi-partition
transactions, H-Store is forced to lock each partition
accessed by a transaction prior to starting its execution. If
the partition-level locks cannot be acquired, the transaction
is aborted and restarted after some randomized delay. The
H-Store coarse-grained partition locks offer an elegant model
when assuming a partitionable workload, but it noticeably
limits concurrency when this assumption no longer holds.

Figure 11. Results for 32 worker threads for the TPC-C
benchmark. Number of warehouses = 1. Panels: (a) Throughput,
100% NewOrder; (b) Throughput, 50% NewOrder + 50% Payment;
(c) Throughput, 100% Payment; (d) Abort percentage, 100%
NewOrder; (e) Abort percentage, 50% NewOrder + 50% Payment;
(f) Abort percentage, 100% Payment.
5.4 TPC-C Experiments
In this section, we study QueCC using the industry standard
TPC-C. Our experiments in this section focus on throughput
and abort percentage under high contention with three
different workload mixes.
From a data access skew point of view, the TPC-C benchmark is
inherently skewed towards warehouse records because both
Payment and NewOrder transactions access the warehouse table.
The scale factor for TPC-C is the number of warehouses, but
it also determines the data access skew. As we increase the
number of warehouses, we get less data access skew (assuming
a fixed number of transactions in the generated workload).
Therefore, to induce high contention in TPC-C, we limit the
number of warehouses to 1 in the workload and use all the 32
cores for processing the workload.
Figure 11 captures the throughput and the abort percentage.
With a workload mix of 100% Payment transactions
(Figure 11c), QueCC performs 6.34× better than the other
approaches. With a 50% Payment transaction mix, QueCC
improves by nearly 2.7× over FOEDUS with MOCC. Despite the
skewness towards the single warehouse record (which every
transaction in the workload accesses), QueCC can process
fragments accessing other tables in parallel because it
distributes them among multiple queues and assigns those
queues to different threads. In addition, QueCC performs no
spurious aborts, which contributes to its high performance.
6 Related Work
There has been extensive research on concurrency control
approaches, and there are many excellent publications
dedicated to this topic (e.g., [2, 20, 31]). However,
research interest in concurrency control in the past decade
has been revived due to emerging hardware trends, e.g.,
multi-core and large main-memory machines. We will cover key
approaches in this section.
Novel Transaction Processing Architectures
Arguably one of the first papers that started to question the
status quo for concurrency mechanisms was H-Store [16].
H-Store imagined a simple model, where the workload always
tends to be partitionable, and advocated single-threaded
execution in each partition, thereby dropping the need for
any coordination mechanism within a single partition. Of
course, as expected, its performance degrades when
transactions span multiple partitions.
Unlike H-Store, QueCC relies on a deterministic,
priority-based planning and execution model that not only
eliminates the need for a concurrency mechanism, but also is
not limited to partitionable workloads and can swiftly
readjust and reassign thread-to-queue assignments or
merge/split queues during the planning and/or execution,
where a queue is essentially an ordered set of operations
over a fine-grained partition that is created dynamically.
Unlike the classical execution model, in which each
transaction is assigned to a single thread, DORA [25]
proposed a novel reformulation of transaction processing as a
loosely-coupled publish/subscribe paradigm, decomposes each
transaction through a set of rendezvous points, and relies on
message passing for thread communications. DORA assigns a
thread to a set of records based on how the primary key index
is traversed, often a b-tree index, where essentially the
tree is divided into a set of contiguous disjoint ranges, and
each range is assigned to a thread. The goal of DORA is to
improve cache efficiency using thread-to-data assignment as
opposed to thread-to-transaction assignment. However, DORA
continues to rely on classical concurrency controls to
coordinate data access, while QueCC is fundamentally
different by completely eliminating the need for any
concurrency control through deterministic planning and
execution for a batch of transactions. Notably, QueCC’s
thread-to-queue assignment also substantially improves cache
locality.
Concurrency Control Protocols
The well-understood pessimistic two-phase locking schemes for
transactional concurrency control on single-node systems are
shown to have scalability problems with large numbers of
cores [41]. Therefore, several research proposals focused on
the optimistic concurrency control (OCC) approach (e.g.,
[35, 36, 42, 43]), which was originally proposed by [21]. Tu
et al.’s SILO [35] is a scalable variant of optimistic
concurrency control that avoids many bottlenecks of the
centralized techniques by an efficient implementation of the
validation phase. TicToc [42] improves concurrency by using a
data-driven timestamp management protocol. Both BCC [43] and
MOCC [36] are designed to minimize the cost of false aborts.
All of these CC protocols suffer from non-deterministic
aborts, which results in wasting computing resources and
reducing the overall system’s throughput. On the other hand,
QueCC does not have such a limitation because it
deterministically processes transactions, which eliminates
non-deterministic aborts.
Larson et al. [22] revisited concurrency control for
in-memory stores and proposed a multi-version, optimistic
concurrency control with speculative reads. Sadoghi et al.
[28] introduced a two-version concurrency control that allows
the coexistence of both pessimistic and optimistic
concurrency protocols, all centered around a novel
indirection layer that serves as a gateway to find the latest
version of the record and a lightweight coordination
mechanism to implement blocking and non-blocking concurrency
mechanisms. Cicada by Lim et al. mitigates the costs
associated with multi-versioning and contention by carefully
examining various layers of the system [23]. QueCC is in
sharp contrast with these research efforts: QueCC focuses on
eliminating the concurrency mechanism as opposed to improving
it.
ORTHRUS by Ren et al. [26] uses dedicated threads for
pessimistic concurrency control and message passing
communication between threads. Transaction execution threads
delegate their locking functionality to dedicated concurrency
control threads. In contrast to ORTHRUS, QueCC plans a batch
of transactions in the first phase and executes them in the
second phase using a coordination-free mechanism. LADS by Yao
et al. [40] builds dependency graphs for a batch of
transactions that dictate execution orders. Faleiro et al.
[10] propose PWV, which is based on the “early write
visibility” technique that exploits the ability to determine
the commit decision of a transaction before it completes all
of its operations. In terms of execution, both LADS and PWV
process transactions explicitly by relying on dependency
graphs. On the other hand, QueCC does satisfy transaction
dependencies but its execution model is organized in terms of
prioritized queues. In QueCC, not only do we drop the
partitionability assumption, but we also eliminate any
graph-driven coordination by introducing a novel
deterministic, priority-based queuing execution. Notably, the
idea of “early write visibility” can be exploited by QueCC to
further reduce chances of cascading aborts.
The ability to parallelize transaction processing is limited
by various dependencies that may exist among transaction
fragments. IC3 [37] is a recent proposal for a concurrency
control optimized for multi-core in-memory stores. IC3
decomposes transactions into pieces through static analysis,
and constrains the parallel execution of pieces at run-time
to ensure serializability.
Unlike IC3, QueCC achieves transaction-level parallelism by
using two deterministic phases of planning and execution and
without relying on conflict graphs explicitly.
Deterministic Transaction Processing
All the aforementioned single-version transaction processing
schemes interleave transaction operations
non-deterministically, which leads to fundamentally
unnecessary aborts and transaction restarts. Deterministic
transaction processing (e.g., [9, 33]), on the other hand,
eliminates this class of non-deterministic aborts and allows
only logic-induced aborts (i.e., explicit aborts by the
transaction’s logic). Calvin [33] is designed for distributed
environments and uses determinism to eliminate the cost of
the two-phase commit protocol when processing distributed
transactions, and does not address multi-core optimizations
in the individual nodes. Gargamel [4] pre-serializes possibly
conflicting transactions using a dedicated load-balancing
node in distributed environments. It uses a classifier based
on static analysis to determine conflicting transactions.
Unlike Gargamel, QueCC is centered around the notion of
priority, and is designed for multi-core hardware.
BOHM [
9
] started re-thinking multi-version concurrency
control for deterministic multi-core in-memory stores. In
particular, BOHM process batches of transactions in three
sequential phases (1) a single-threaded sequencing phase to
determine the global order of transactions, (2) a parallel multi-
version concurrency control phase to determine the version
conicts, and (3) a parallel execution phase based on trans-
action dependencies, which optionally performs garbage
collection for unneeded record versions. In sharp contrast,
eCC process batches of transactions in only two deter-
ministic phases, and it has a parallel priority-based queue-
oriented planning and execution phases that do not suer
from additional costs such as garbage collection costs.
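The distinction drawn in this section between non-deterministic aborts and logic-induced aborts can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the function run_batch, the account names, and the transfer workload are hypothetical and do not come from any of the systems discussed. Transactions run in the batch's predetermined order, so the only possible abort is one requested by a transaction's own logic (here, an insufficient balance), and re-running the same batch yields the same outcome.

```python
def run_batch(accounts, batch):
    """Execute transfers in the batch's predetermined order (sketch).

    `batch` is a list of (src, dst, amount) transfers. Returns the final
    balances and the ids of transactions aborted by their own logic.
    """
    balances = dict(accounts)  # work on a copy so the input stays unchanged
    aborted = []
    for txn_id, (src, dst, amount) in enumerate(batch):
        if balances[src] < amount:
            # Logic-induced abort: the transaction itself decides to abort.
            aborted.append(txn_id)
            continue
        balances[src] -= amount
        balances[dst] += amount
    return balances, aborted


accounts = {"a": 10, "b": 0}
batch = [("a", "b", 7), ("a", "b", 5), ("b", "a", 2)]
print(run_batch(accounts, batch))  # ({'a': 5, 'b': 5}, [1])
```

Because the order is fixed before execution, no transaction is ever restarted due to interleaving; the second transfer aborts in every run for the same reason.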
7 Conclusion
In this paper, we presented QueCC, a queue-oriented, control-free
concurrency architecture for high-performance, in-memory,
key-value stores on emerging multi-socket, many-core,
shared-memory architectures. QueCC exhibits minimal contention
among concurrent threads by eliminating the overhead of
concurrency control from the critical path of the transaction.
QueCC operates on batches of transactions in two
deterministic phases of priority-based planning followed by
control-free execution. Instead of the traditional
thread-to-transaction assignment, QueCC uses a novel thread-to-queue
assignment to dynamically parallelize transaction execution
and eliminate bottlenecks under high-contention workloads.
We extensively evaluate QueCC with two popular benchmarks.
Our results show that QueCC can process almost
40 million YCSB operations per second and over 5 million
TPC-C transactions per second. Compared to other concurrency
control approaches, QueCC achieves up to 4.5× higher
throughput for YCSB workloads, and 6.3× higher throughput
for TPC-C workloads.
References
[1] A. Adya, B. Liskov, and P. O’Neil. 2000. Generalized isolation level
definitions. In Proc. ICDE. 67–78.
DOI: https://doi.org/10.1109/ICDE.2000.839388
[2] Arthur J. Bernstein, David S. Gerstl, and Philip M. Lewis. 1999.
Concurrency Control for Step-decomposed Transactions. Inf. Syst. 24, 9
(Dec. 1999), 673–698. http://portal.acm.org/citation.cfm?id=337922
[3] Philip A. Bernstein and Nathan Goodman. 1981. Concurrency Control
in Distributed Database Systems. ACM Comput. Surv. 13, 2 (June 1981),
185–221. DOI: https://doi.org/10.1145/356842.356846
[4] P. Cincilla, S. Monnet, and M. Shapiro. 2012. Gargamel: Boosting DBMS
Performance by Parallelising Write Transactions. In 2012 IEEE 18th
International Conference on Parallel and Distributed Systems. 572–579.
[5] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan,
and Russell Sears. 2010. Benchmarking Cloud Serving Systems
with YCSB. In Proc. SoCC. ACM, 143–154.
DOI: https://doi.org/10.1145/1807128.1807152
[6] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism:
A Workload-driven Approach to Database Replication and Partitioning.
Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 48–57.
DOI: https://doi.org/10.14778/1920841.1920853
[7] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro,
Michael R. Stonebraker, and David A. Wood. 1984. Implementation
Techniques for Main Memory Database Systems. In Proc. SIGMOD.
ACM, 1–8. DOI: https://doi.org/10.1145/602259.602261
[8] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. 1976. The
Notions of Consistency and Predicate Locks in a Database System.
Commun. ACM 19, 11 (Nov. 1976), 624–633.
DOI: https://doi.org/10.1145/360363.360369
[9] Jose M. Faleiro and Daniel J. Abadi. 2015. Rethinking Serializable
Multiversion Concurrency Control. Proc. VLDB Endow. 8, 11 (July
2015), 1190–1201. DOI: https://doi.org/10.14778/2809974.2809981
[10] Jose M. Faleiro, Daniel J. Abadi, and Joseph M. Hellerstein. 2017. High
Performance Transactions via Early Write Visibility. Proc. VLDB Endow.
10, 5 (Jan. 2017), 613–624.
DOI: https://doi.org/10.14778/3055540.3055553
[11] Jose M. Faleiro, Alexander Thomson, and Daniel J. Abadi. 2014. Lazy
Evaluation of Transactions in Database Systems. In Proc. SIGMOD.
ACM, 15–26. DOI: https://doi.org/10.1145/2588555.2610529
[12] Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael
Stonebraker. 2017. An Evaluation of Distributed Concurrency Control.
Proc. VLDB Endow. 10, 5 (Jan. 2017), 553–564.
DOI: https://doi.org/10.14778/3055540.3055548
[13] Hewlett Packard Enterprise. 2017. HPE Superdome Servers.
https://www.hpe.com/us/en/servers/superdome.html. (2017).
[14] Hewlett Packard Labs. 2017. The Machine: A new kind of computer.
http://labs.hpe.com/research/themachine. (2017).
[15] R. Jimenez-Peris, M. Patino-Martinez, and S. Arevalo. 2000.
Deterministic scheduling for transactional multithreaded replicas. In
Proc. IEEE SRDS. 164–173.
DOI: https://doi.org/10.1109/RELDI.2000.885404
[16] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo,
Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden,
Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi.
2008. H-store: A High-performance, Distributed Main Memory
Transaction Processing System. Proc. VLDB Endow. 1, 2 (Aug. 2008),
1496–1499. DOI: https://doi.org/10.14778/1454159.1454211
[17] Bettina Kemme and Gustavo Alonso. 2000. Don’t Be Lazy, Be
Consistent: Postgres-R, A New Way to Implement Database
Replication. In Proc. VLDB. Morgan Kaufmann Publishers Inc., 134–143.
http://dl.acm.org/citation.cfm?id=645926.671855
[18] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis
Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for
Heterogeneous Workloads. In Proceedings of the 2016 International
Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA,
1675–1687. DOI: https://doi.org/10.1145/2882903.2882905
[19] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores
and NVRAM. In Proceedings of the 2015 ACM SIGMOD International
Conference on Management of Data (SIGMOD ’15). ACM, New York,
NY, USA, 691–706. DOI: https://doi.org/10.1145/2723372.2746480
[20] Vijay Kumar (Ed.). 1995. Performance of Concurrency Control
Mechanisms in Centralized Database Systems. Prentice-Hall, Inc., Upper
Saddle River, NJ, USA.
[21] H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for
Concurrency Control. ACM Trans. Database Syst. 6, 2 (June 1981),
213–226. DOI: https://doi.org/10.1145/319566.319567
[22] Per A. Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman,
Jignesh M. Patel, and Mike Zwilling. 2011. High-performance
Concurrency Control Mechanisms for Main-memory Databases. Proc. VLDB
Endow. 5, 4 (Dec. 2011), 298–309.
DOI: https://doi.org/10.14778/2095686.2095689
[23] Hyeontaek Lim, Michael Kaminsky, and David G. Andersen. 2017.
Cicada: Dependably Fast Multi-Core In-Memory Transactions. In Proc.
SIGMOD. ACM, 21–35.
DOI: https://doi.org/10.1145/3035918.3064015
[24] Mellanox Technologies. 2017. Multicore Processors Overview.
http://www.mellanox.com/page/multi_core_overview. (2017).
[25] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and Anastasia
Ailamaki. 2010. Data-oriented Transaction Execution. Proc. VLDB Endow.
3, 1-2 (Sept. 2010), 928–939.
DOI: https://doi.org/10.14778/1920841.1920959
[26] Kun Ren, Jose M. Faleiro, and Daniel J. Abadi. 2016. Design Principles
for Scaling Multi-core OLTP Under High Contention. In Proc. SIGMOD.
ACM, 1583–1598. DOI: https://doi.org/10.1145/2882903.2882958
[27] Kun Ren, Alexander Thomson, and Daniel J. Abadi. 2014. An
Evaluation of the Advantages and Disadvantages of Deterministic
Database Systems. Proc. VLDB Endow. 7, 10 (June 2014), 821–832.
DOI: https://doi.org/10.14778/2732951.2732955
[28] Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee,
Fabian Nagel, and Kenneth A. Ross. 2014. Reducing Database Locking
Contention Through Multi-version Concurrency. Proc. VLDB Endow.
7, 13 (Aug. 2014), 1331–1342.
DOI: https://doi.org/10.14778/2733004.2733006
[29] SGI. 2017. SGI UV 3000 and SGI UV 30.
https://www.sgi.com/products/servers/uv/uv_3000_30.html. (2017).
[30] Dennis Shasha, Francois Llirbat, Eric Simon, and Patrick Valduriez.
1995. Transaction Chopping: Algorithms and Performance Studies.
ACM Trans. Database Syst. 20, 3 (Sept. 1995), 325–363.
DOI: https://doi.org/10.1145/211414.211427
[31] Alexander Thomasian. 1998. Concurrency Control: Methods,
Performance, and Analysis. ACM Comput. Surv. 30, 1 (March 1998), 70–119.
DOI: https://doi.org/10.1145/274440.274443
[32] Alexander Thomson and Daniel J. Abadi. 2015. CalvinFS: Consistent
WAN Replication and Scalable Metadata Management for Distributed
File Systems. In Proc. FAST. USENIX Association, 1–14.
http://portal.acm.org/citation.cfm?id=2750483
[33] Alexander Thomson, Thaddeus Diamond, Shu C. Weng, Kun Ren,
Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed
Transactions for Partitioned Database Systems. In Proc. SIGMOD. ACM, 1–12.
DOI: https://doi.org/10.1145/2213836.2213838
[34] TPCC. TPC-C, On-line Transaction Processing Benchmark.
http://www.tpc.org/tpcc/.
[35] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and
Samuel Madden. 2013. Speedy Transactions in Multicore In-memory
Databases. In SOSP. ACM, 18–32.
DOI: https://doi.org/10.1145/2517349.2522713
[36] Tianzheng Wang and Hideaki Kimura. 2016. Mostly-optimistic
Concurrency Control for Highly Contended Dynamic Workloads on a
Thousand Cores. Proc. VLDB Endow. 10, 2 (Oct. 2016), 49–60.
DOI: https://doi.org/10.14778/3015274.3015276
[37] Zhaoguo Wang, Shuai Mu, Yang Cui, Han Yi, Haibo Chen, and Jinyang
Li. 2016. Scaling Multicore Databases via Constrained Parallel
Execution. In Proc. SIGMOD. ACM, 1643–1658.
DOI: https://doi.org/10.1145/2882903.2882934
[38] Gerhard Weikum and Gottfried Vossen. 2001. Transactional
Information Systems: Theory, Algorithms, and the Practice of Concurrency
Control and Recovery. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA.
[39] Arthur T. Whitney, Dennis Shasha, and Stevan Apter. 1997. High
Volume Transaction Processing without Concurrency Control, Two
Phase Commit, SQL or C++. In HPTS.
[40] C. Yao, D. Agrawal, G. Chen, Q. Lin, B. C. Ooi, W. F. Wong, and M.
Zhang. 2016. Exploiting Single-Threaded Model in Multi-Core
In-Memory Systems. IEEE TKDE 28, 10 (2016), 2635–2650.
DOI: https://doi.org/10.1109/TKDE.2016.2578319
[41] Xiangyao Yu, George Bezerra, Andrew Pavlo, Srinivas Devadas, and
Michael Stonebraker. 2014. Staring into the Abyss: An Evaluation of
Concurrency Control with One Thousand Cores. Proc. VLDB Endow. 8,
3 (Nov. 2014), 209–220.
DOI: https://doi.org/10.14778/2735508.2735511
[42] Xiangyao Yu, Andrew Pavlo, Daniel Sanchez, and Srinivas Devadas.
2016. TicToc: Time Traveling Optimistic Concurrency Control. In
Proc. SIGMOD. ACM, 1629–1642.
DOI: https://doi.org/10.1145/2882903.2882935
[43] Yuan Yuan, Kaibo Wang, Rubao Lee, Xiaoning Ding, Jing Xing, Spyros
Blanas, and Xiaodong Zhang. 2016. BCC: Reducing False Aborts
in Optimistic Concurrency Control with Low Cost for In-memory
Databases. Proc. VLDB Endow. 9, 6 (Jan. 2016), 504–515.
DOI: https://doi.org/10.14778/2904121.2904126
14
... Deferred execution. DRP's use of deferred execution is inspired by prior works [9,13,25,34]. A key difference is that we use lazy evaluation to avoid the rank mismatch problem inherent with tame transactions ( §5), and to allow wild transactions to pipeline lock acquisition during certification without cascading aborts ( §5.2). ...
... For example, lazy transactions [13] and Sloth [9] use lazy evaluation to batch queries together to reduce the number of round trips between clients and database servers, exploit temporal locality, and achieve better load balancing. Calvin [34] and QueCC [25] capture the transactions' control and data flow, create an execution plan, and defer their execution via pushing the operations into per-partion/object queues. ...
Conference Paper
DRP is a new concurrency control protocol for software transactional memory that achieves high throughput, even for skewed workloads that exhibit high contention. DRP builds on prior works that chop transactions into pieces to expose more concurrency opportunities, but unlike these works, DRP performs no static analyses and supports arbitrary workloads. DRP achieves a high degree of concurrency across most workloads and guarantees deadlock freedom, strict serializability, and opacity. We incorporate DRP into the software transactional objects library STO [18] and find that DRP improves STO's throughput on several STAMP benchmarks by up to 3.6x. Additionally, an in-memory multicore database implemented with our modified variant of STO outperforms databases that use OCC or transaction chopping for concurrency control. Specifically, DRP achieves 6.6x higher throughput than OCC when contention is high. Compared to transaction chopping, our DRP achieves 3.3x higher throughput when contention is medium or low. Furthermore, our implementation achieves comparable performance to OCC and transaction chopping at other contention levels.
... As discussed above, all evaluated CC schemes scale poorly, surrendering to conflicts and contention. Even advanced CC schemes with conflict mitigation mechanisms do not reliably withstand many conflicts, like TICTOC or beyond the evaluated ones CICADA [26,37,45,68,73]. Moreover, besides logical contention of transaction conflicts, also physical contention (e.g. on latches) significantly impacts performance and system components outside of the CC scheme strongly affect contention (logical & physical), especially the simple but common inter-transaction parallel execution scheme. ...
Article
Full-text available
In our initial DaMoN paper, we set out the goal to revisit the results of “Starring into the Abyss [...] of Concurrency Control with [1000] Cores” (Yu in Proc. VLDB Endow 8: 209-220, 2014). Against their assumption, today we do not see single-socket CPUs with 1000 cores. Instead, multi-socket hardware is prevalent today and in fact offers over 1000 cores. Hence, we evaluated concurrency control (CC) schemes on a real (Intel-based) multi-socket platform. To our surprise, we made interesting findings opposing results of the original analysis that we discussed in our initial DaMoN paper. In this paper, we further broaden our analysis, detailing the effect of hardware and workload characteristics via additional real hardware platforms (IBM Power8 and 9) and the full TPC-C transaction mix. Among others, we identified clear connections between the performance of the CC schemes and hardware characteristics, especially concerning NUMA and CPU cache. Overall, we conclude that no CC scheme can efficiently make use of large multi-socket hardware in a robust manner and suggest several directions on how CC schemes and overall OLTP DBMS should evolve in future.
... In our ServerlessBFT protocol, we learn from these works and employ the queuing approach to create plans that allow running non-conflicting transactions in parallel [22,24,43,44,53]. However, such a strategy would require us to make changes to the protocol stated in Figure 3. ...
Preprint
Full-text available
With a growing interest in edge applications, such as the Internet of Things, the continued reliance of developers on existing edge architectures poses a threat. Existing edge applications make use of edge devices that have access to limited resources. Hence, they delegate compute-intensive tasks to the third-party cloud servers. In such an edge-cloud model, neither the edge devices nor the cloud servers can be trusted as they can act maliciously. Further, delegating tasks to cloud servers does not free the application developer from server provisioning and resource administration. In this paper, we present the vision of a novel Byzantine Fault- Tolerant Serverless-Edge architecture. In our architecture, we delegate the compute-intensive tasks to existing Serverless cloud infrastructures, which relieve us from the tasks of provisioning and administration. Further, we do not trust the edge devices and require them to order each request in a byzantine fault-tolerant manner. Neither do we trust the serverless cloud, which requires us to spawn multiple serverless cloud instances. To achieve all these tasks, we design a novel protocol, ServerlessBFT. We discuss all possible attacks in the serverless-edge architecture and extensively evaluate all of its characteristics.
... One drawback of this method is that Φ-SR is more common among deterministic certifiers than it is among mainstream databases, and the former group appears to be mostly confined to academia. Certifiers that produce histories in Φ-SR include Calvin [56], Bohm [13], SGSI [54] and QueCC [58]. Of these, Calvin has been adopted (with some variations) among mainstream databases, including the open-source Apple FoundationDB 14 [60], [61] and the closed-source Fauna [59] products. ...
Article
Full-text available
To sidestep reasoning about the complex effects of concurrent execution, many system designers have conveniently embraced strict serializability on the strength of its claims, support from commercial and open-source database communities and ubiquitous levels of industry adoption. Crucially, distributed components are built on this model; multiple schedulers are composed in an event-driven architecture to form larger, ostensibly correct systems. This paper examines the oft-misconstrued position of strict serializability as a composable correctness criterion in the design of such systems. An anomaly is presented wherein a strict serializable scheduler in one system produces a history that cannot be serially applied to even a weak prefix-consistent replica in logical timestamp order. Several solutions are presented under varying isolation properties, including novel isolation properties contributed by this paper. It is further shown that every nondeterministic scheduler is anomaly-prone, every nonconcurrent scheduler is anomaly-free, and that at least one deterministic concurrent scheduler is anomaly-free.
... Achieving fault-tolerant distributed consensus is an age-old problem. Commit protocols such as Two-Phase Commit (Gray 1978), Three-Phase Commit (Skeen 1982) and Easy-Commit Sadoghi 2018, 2020) help in reaching agreement among the participants in a partitioned distributed databases (Qadah and Sadoghi 2018;Qadah et al 2020;Sadoghi and Blanas 2019). However, commit protocols can only handle node failures and are unsafe under message delay or loss. ...
Preprint
Full-text available
A blockchain is a linked list of immutable tamper-proof blocks, which is stored at each participating node. Each block records a set of transactions and the associated metadata. Blockchain transactions act on the identical ledger data stored at each node. Blockchain was first perceived by Satoshi Nakamoto, as a peer-to-peer money exchange system. Nakamoto referred to the transactional tokens exchanged among clients in his system as Bitcoins.
... The low throughput and high latency are the key reasons why BFT algorithms are often ignored. Prior works [8], [9], [10], [11] have shown that the traditional distributed systems can achieve throughputs of the order 100K transactions per second while the initial blockchain applications, such as Bitcoin [12] and Ethereum [13], have throughputs of at most ten transactions per second. Such low throughputs do not * Both authors have equally contributed to this work. ...
... the previously mentioned strategies and is inspired by the parallel transaction execution proposed in [24] and relates to the ideas of [15,12,22]. When a block of transactions is received by the execute-subphase, we first identify all existing conflict dependencies between transactions. ...
Conference Paper
Today's permissioned blockchain systems come in a stand-alone fashion and require the users to integrate yet another full-fledged transaction processing system into their already complex data management landscape. This seems odd as blockchains and traditional DBMSs share considerable parts of their processing stack. Therefore, rather than replacing the established infrastructure, we advocate to "chainify" existing DBMSs by installing a lightweight blockchain layer on top. Unfortunately, this task is challenging: Users might have different black-box DBMSs in operation, which potentially behave differently. To deal with such setups, we introduce a new processing model called Whatever-Voting (WV), pronounced [weave]. It allows us to create a highly flexible permissioned blockchain layer coined chainifyDB that (a) is centered around bullet-proof database technology, (b) can be easily integrated into an existing heterogeneous database landscape, (c) is able to recover deviating organizations, and (d) adds only up to 8.5% of overhead on the underlying database systems while providing an up to 6x higher throughput than Hyperledger Fabric.
... Big data challenges are not characterized only by the large volume of data that has to be processed, but also by a high rate of data production and consumption i.e., high-velocity [30], [45], [36], [37], [44]. Explosion in data volume and velocity is commonplace in a wide range of monitoring applications. ...
Article
Full-text available
Due to recent explosion of data volume and velocity, a new array of lightweight key-value stores have emerged to serve as alternatives to traditional databases. The majority of these storage engines, however, sacrifice their read performance in order to cope with write throughput by avoiding random disk access when writing a record in favor of fast sequential accesses. But, the boundary between sequential vs. random access is becoming blurred with the advent of solid-state drives (SSDs). In this work, we propose our new key-value store, LogStore, optimized for hybrid storage architectures. Additionally, introduce a novel cost-based data staging model based on log-structured storage, in which recent changes are first stored on SSDs, and pushed to HDD as it ages, while minimizing the read/write amplification for merging data from SSDs and HDDs. Furthermore, we take a holistic approach in improving both the read and write performance by dynamically optimizing the data layout, such as deferring and reversing the compaction process, and developing an access strategy to leverage the strengths of each available medium in our storage hierarchy. Lastly, in our extensive evaluation, we demonstrate that LogStore achieves up to 6x improvement in throughput/latency over LevelDB, a state-of-the-art key-value store
... This blockchain holds an ordered record of all transactions and this record is secured against changes using cryptographic primitives. To assure that all replicas agree on the same set of transactions and maintain the same blockchain, new transactions are agreed upon via a consensus protocol (which are fault-tolerant counterparts of the classical two-phase and three-phase commit protocols used in distributed database systems [32,42,43,69,70,72]). From our perspective, these technologies can strengthen data management in three vital directions: ...
Book
Since the introduction of Bitcoin—the first widespread application driven by blockchain—the interest of the public and private sectors in blockchain has skyrocketed. In recent years, blockchain-based fabrics have been used to address challenges in diverse fields such as trade, food production, property rights, identity-management, aid delivery, health care, and fraud prevention. This widespread interest follows from fundamental concepts on which blockchains are built that together embed the notion of trust, upon which blockchains are built. 1. Blockchains provide data transparency. Data in a blockchain is stored in the form of a ledger, which contains an ordered history of all the transactions. This facilitates oversight and auditing. 2. Blockchains ensure data integrity by using strong cryptographic primitives. This guarantees that transactions accepted by the blockchain are authenticated by its issuer, are immutable, and cannot be repudiated by the issuer. This ensures accountability. 3. Blockchains are decentralized, democratic, and resilient. They use consensus-based replication to decentralize the ledger among many independent participants. Thus, it can operate completely decentralized and does not require trust in a single authority. Additions to the chain are performed by consensus, in which all participants have a democratic voice in maintaining the integrity of the blockchain. Due to the usage of replication and consensus, blockchains are also highly resilient to malicious attacks even when a significant portion of the participants are malicious. It further increases the opportunity for fairness and equity through democratization. These fundamental concepts and the technologies behind them—a generic ledger-based data model, cryptographically ensured data integrity, and consensus-based replication—prove to be a powerful and inspiring combination, a catalyst to promote computational trust. 
In this book, we present an in-depth study of blockchain, unraveling its revolutionary promise to instill computational trust in society, all carefully tailored to a broad audience including students, researchers, and practitioners. We offer a comprehensive overview of theoretical limitations and practical usability of consensus protocols while examining the diverse landscape of how blockchains are manifested in their permissioned and permissionless forms.
Chapter
Full-text available
Hybrid OLTP and OLAP
Article
Full-text available
Blockchain Transaction Processing.
Conference Paper
Full-text available
Large scale distributed databases are designed to support commercial and cloud based applications. The minimal expectation from such systems is that they ensure consistency and reliability in case of node failures. The distributed database guarantees reliability through the use of atomic commitment protocols. Atomic commitment protocols help in ensuring that either all the changes of a transaction are applied or none of them exist. To ensure efficient commitment process, the database community has mainly used the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures. This necessitated the development of the non-blocking, three-phase commit (3PC) protocol. However, the database community is still reluctant to use the 3PC protocol, as it acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present Easy Commit which leverages the best of both the worlds (2PC and 3PC), that is, non-blocking (like 3PC) and requires two phases (like 2PC). Easy Commit achieves these goals by ensuring two key observations: (i) first transmit and then commit , and (ii) message redundancy. We present the design of the Easy Commit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of EC protocol, and show that it is nearly as efficient as the 2PC protocol.
Article
Concurrency control for on-line transaction processing (OLTP) database management systems (DBMSs) is a nasty game. Achieving higher performance on emerging many-core systems is difficult. Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores. In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes. We implemented TicToc along with four other concurrency control algorithms in an in-memory, shared-everything OLTP DBMS and compared their performance on different workloads. Our results show that TicToc achieves up to 92% better throughput while reducing the abort rate by 3.3x over these previous algorithms.
Conference Paper
Multi-core in-memory databases promise high-speed online transaction processing. However, the performance of individual designs suffers when the workload characteristics miss their small sweet spot of a desired contention level, read-write ratio, record size, processing rate, and so forth. Cicada is a single-node multi-core in-memory transactional database with serializability. To provide high performance under diverse workloads, Cicada reduces overhead and contention at several levels of the system by leveraging optimistic and multi-version concurrency control schemes and multiple loosely synchronized clocks while mitigating their drawbacks. On the TPC-C and YCSB benchmarks, Cicada outperforms Silo, TicToc, FOEDUS, MOCC, two-phase locking, Hekaton, and ERMIA in most scenarios, achieving up to 3X higher throughput than the next fastest design. It handles up to 2.07 M TPC-C transactions per second and 56.5 M YCSB transactions per second, and scans up to 356 M records per second on a single 28-core machine.
Article
Increasing transaction volumes have led to a resurgence of interest in distributed transaction processing. In particular, partitioning data across several servers can improve throughput by allowing servers to process transactions in parallel. But executing transactions across servers limits the scalability and performance of these systems. In this paper, we quantify the effects of distribution on concurrency control protocols in a distributed environment. We evaluate six classic and modern protocols in an in-memory distributed database evaluation framework called Deneva, providing an apples-to-apples comparison between each. Our results expose severe limitations of distributed transaction processing engines. Moreover, in our analysis, we identify several protocol-specific scalability bottlenecks. We conclude that to achieve truly scalable operation, distributed concurrency control solutions must seek a tighter coupling with either novel network hardware (in the local area) or applications (via data modeling and semantically-aware execution), or both.
Article
In order to guarantee recoverable transaction execution, database systems permit a transaction's writes to be observable only at the end of its execution. As a consequence, there is generally a delay between the time a transaction performs a write and the time later transactions are permitted to read it. This delayed write visibility can significantly impact the performance of serializable database systems by reducing concurrency among conflicting transactions. This paper makes the observation that delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.
Article
Future servers will be equipped with thousands of CPU cores and deep memory hierarchies. Traditional concurrency control (CC) schemes, both optimistic and pessimistic, slow down by orders of magnitude in such environments for highly contended workloads. Optimistic CC (OCC) scales the best for workloads with few conflicts, but suffers from clobbered reads for high conflict workloads. Although pessimistic locking can protect reads, it floods cache-coherence backbones in deep memory hierarchies and can also cause numerous deadlock aborts. This paper proposes a new CC scheme, mostly-optimistic concurrency control (MOCC), to address these problems. MOCC achieves orders of magnitude higher performance for dynamic workloads on modern servers. The key objective of MOCC is to avoid clobbered reads for high conflict workloads, without any centralized mechanisms or heavyweight interthread communication. To satisfy such needs, we devise a native, cancellable reader-writer spinlock and a serializable protocol that can acquire, release and re-acquire locks in any order without expensive interthread communication. For low conflict workloads, MOCC maintains OCC's high performance without taking read locks. Our experiments with high conflict YCSB workloads on a 288-core server reveal that MOCC performs 8× and 23× faster than OCC and pessimistic locking, respectively. It achieves 17 million TPS for TPC-C and more than 110 million TPS for YCSB without conflicts, 170× faster than pessimistic methods.
Conference Paper
Concurrency control for on-line transaction processing (OLTP) database management systems (DBMSs) is a nasty game. Achieving higher performance on emerging many-core systems is difficult. Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores. In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes. We implemented TicToc along with four other concurrency control algorithms in an in-memory, shared-everything OLTP DBMS and compared their performance on different workloads. Our results show that TicToc achieves up to 92% better throughput while reducing the abort rate by 3.3x over these previous algorithms.
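The data-driven timestamp idea in the TicToc abstract can be illustrated with a small single-threaded sketch. The abstract does not spell out the exact rule, so the code below assumes one plausible reading: each item carries a write timestamp `wts` and a read timestamp `rts`, and the commit timestamp is the smallest value that is at least the `wts` of every item read and greater than the `rts` of every item written. The names `Item`, `ReadEntry`, and `try_commit` are hypothetical, and the latching a real multi-threaded implementation would need is omitted.

```python
from dataclasses import dataclass


@dataclass
class Item:
    value: int
    wts: int = 0  # timestamp at which the current version was written
    rts: int = 0  # latest timestamp at which the current version is known valid


@dataclass
class ReadEntry:
    item: Item
    wts: int  # item.wts as observed when the transaction read the item
    rts: int  # item.rts as observed when the transaction read the item


def try_commit(reads, writes):
    """Validate and apply one transaction (assumed rule, no locking).

    reads:  list of ReadEntry captured during execution
    writes: list of (Item, new_value) pairs
    Returns the commit timestamp, or None if the transaction must abort.
    """
    # Lazily derive the commit timestamp from the data items themselves.
    commit_ts = 0
    for r in reads:
        commit_ts = max(commit_ts, r.wts)
    for item, _ in writes:
        commit_ts = max(commit_ts, item.rts + 1)

    # A read whose observed validity interval already covers commit_ts
    # needs no check; otherwise the version must be unchanged, and its
    # read timestamp is extended to commit_ts.
    for r in reads:
        if r.rts < commit_ts:
            if r.item.wts != r.wts:
                return None  # the version read was overwritten
            r.item.rts = max(r.item.rts, commit_ts)

    # Install the writes at the commit timestamp.
    for item, new_value in writes:
        item.value = new_value
        item.wts = commit_ts
        item.rts = commit_ts
    return commit_ts
```

Under these assumptions, a read-only transaction that observed an old version can still commit at that version's timestamp, which places it logically before the later write instead of aborting it. No central timestamp counter appears anywhere in the sketch.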