Permissioned Blockchain Through the Looking Glass:
Architectural and Implementation Lessons Learned
Suyash Gupta, Sajjad Rahnama, Mohammad Sadoghi
Exploratory Systems Lab
Department of Computer Science
University of California, Davis
ABSTRACT
Since the inception of Bitcoin, the distributed and database community has shown interest in the design of efficient blockchain systems. At the core of any blockchain application is a Byzantine-Fault Tolerant (Bft) protocol that helps a set of replicas reach an agreement on the order of a client request. Initial blockchain applications (like Bitcoin) attain very low throughput and are computationally expensive. Hence, researchers moved towards the design of permissioned blockchain systems that employ classical Bft protocols, such as PBFT, to reach consensus. However, existing permissioned blockchain systems still attain low throughputs (of the order of 10K txns/s). As a result, existing works blame this low throughput on the associated Bft protocol and expend resources in developing optimized protocols.

We believe such blame tells only one side of the story. Specifically, we raise a simple question: can a well-crafted system based on a classical Bft protocol outperform a modern protocol? We show that designing such a well-crafted system is possible and illustrate cases where a three-phase protocol can outperform a single-phase protocol. Further, we dissect a permissioned blockchain system and state several factors that affect its performance. We also design a high-throughput permissioned blockchain system, ResilientDB, that employs parallel pipelines to balance tasks at a replica, and we provide guidelines for future designs.
1 INTRODUCTION
Since the inception of blockchain [21, 22, 25], the distributed systems and database community has renewed its interest in the age-old design of Byzantine-Fault Tolerant (Bft) systems. At the core of any blockchain application is a Bft algorithm that ensures all the replicas of this blockchain application reach a consensus, that is, agree on the order for a given client request, even if some of the replicas are byzantine [4, 7, 36, 61].
On looking closely, one can easily detect that these Bft algorithms are the resilient counterparts of the famous two-phase commit and three-phase commit algorithms [19, 26, 27, 53]. This property of resilience interests the database community, as malicious attacks on massive data-stores are common. Not long ago, deadly attacks such as WannaCry and NotPetya disrupted critical data-based services in health care and container shipping [20, 45, 56], while a recent estimate shows that cyberattacks alone have burdened the U.S. economy by $57 to $109 billion in 2016 [44]. However, even a decade after its introduction, and despite the publication of several prominent research works, the major use-case of blockchain technology remains crypto-currency. This leads us to a key observation: Why have blockchain (or Bft) applications seen such slow adoption?
The low throughput and high latency are the key reasons
why Bft algorithms are often ignored. Prior works [26, 49, 50] have shown that traditional distributed systems can achieve throughputs of the order of 100K transactions per second, while the initial blockchain applications, Bitcoin [42] and Ethereum [60], have throughputs of at most ten transactions per second. Such low throughputs do not affect the users of these applications, as these applications were designed with an aim of open membership, that is, anyone can join, and the identities of the participants are kept hidden. Further, these applications aimed to present an alternative currency, which is unregulated by any large corporation. Evidently, there have been several attacks on these open-membership blockchain applications [15, 48, 57].

Figure 1: System throughput on running two different permissioned blockchain systems employing distinct Bft consensus protocols. For this experiment, both systems receive requests from 80K clients.
To improve this situation, the blockchain community moved towards the design of permissioned blockchain applications that advocate closed membership, that is, the identity of each participating replica needs to be known a priori. Through the use of permissioned blockchain applications, the community also hoped to achieve higher throughputs. However, the throughputs of current permissioned blockchain applications are still of the order of 10K transactions per second [3, 4, 43]. Several prior works [4, 25, 36, 61] blame the low throughput and scalability of a permissioned blockchain system on its underlying Bft consensus algorithm. Although these claims are not false, we believe they only represent a one-sided story.
We claim that the low throughput of a blockchain (or Bft) system is due to missed opportunities during its design and implementation. Hence, we want to raise a question: can a well-crafted system based on a classical Bft protocol outperform a modern protocol? Essentially, we wish to show that even a classical Bft protocol perceived as slow, such as PBFT [7], if implemented in a skillfully-optimized blockchain fabric, can outperform a fast, niche-case Bft protocol optimized for fault-free consensus, such as Zyzzyva [36].
We use Figure 1 to illustrate such a possibility. In this figure, we compare the throughput of an optimally designed permissioned blockchain system (ResilientDB) employing the PBFT protocol against the Zyzzyva protocol implemented on a blockchain system that employs a protocol-centric design rather than the system-centric approach. What is more interesting about this figure is that PBFT requires three phases, of which two necessitate quadratic communication among the replicas, while Zyzzyva requires a single linear phase. Despite this, ResilientDB achieves a throughput of 175K transactions per second, easily scales up to 32 replicas, and attains up to 79% more throughput than the other system. At this point, we would like to highlight that several interesting prior works [3, 17, 18, 39, 62] employ this practice of protocol-centric blockchain systems, where there is minimal to no discussion on how a well-crafted implementation can benefit the Bft consensus protocol.
To design such a system, in this paper, we enlist different factors that affect the performance of a permissioned blockchain system and present ways to mitigate the effects of these factors. This allows us to reach the following observations:
• Optimal batching of transactions can help a system gain up to 66× throughput.
• Clever use of cryptographic signature schemes can increase throughput by 103×.
• Employing in-memory storage with blockchains can yield up to 18× throughput gains.
• Decoupling execution from the ordering of client transactions can increase throughput by 9.5%.
• Out-of-order processing of client transactions can help gain 60% more throughput.
• Protocols optimized for fault-free cases can result in a loss of 39× throughput under failures.
In our endeavor, we also design a high-throughput permissioned blockchain system, ResilientDB. With the help of ResilientDB's fluid architecture, we efficiently pipeline the tasks performed by a replica. Further, we extensively parallelize different components of a permissioned blockchain system or database. Through our principle of out-of-order processing, we eliminate any bottleneck that arises due to maintaining order. We also perceive ResilientDB as a reliable test-bed to implement and evaluate newer Bft consensus protocols and blockchain applications.¹ We now enlist our contributions:
• We dissect a permissioned blockchain system and enlist different factors that affect its performance.
• We carefully measure the impact of these factors and present ways to mitigate their effects.
• We design a permissioned blockchain system, ResilientDB, that yields high throughput, incurs low latency, and scales even a slow protocol like PBFT. ResilientDB includes an extensively parallelized and deeply pipelined architecture that efficiently balances the load at a replica.
• We raise eleven different questions and rigorously evaluate our ResilientDB platform in light of these questions.
The remainder of the paper is organized as follows: In Section 2, we discuss existing trends in permissioned blockchain systems and revisit some background details on consensus. In Section 3, we look beyond consensus and discuss various factors that affect the performance of a permissioned blockchain system. In Section 4, we present the design of our ResilientDB and illustrate various efficient design practices that we have adopted. In Section 5, we raise different questions and present an evaluation of our ResilientDB. In Section 6, we enlist our observations and lessons learned that can help in designing efficient permissioned blockchain systems in the future.

¹In order to foster both academic and industry research, our ResilientDB is available at https://resilientdb.com, and the code is open-sourced at https://github.com/resilientdb.

Figure 2: This diagram illustrates a set of replicas, of which some may be malicious or have crashed. One replica is designated as the primary, which leads the consensus among the remaining replicas on a client request it received.
2 TRENDS IN BLOCKCHAIN
To gain more throughput from a permissioned blockchain appli-
cation, system designers and researchers have adopted several
distinct architectures and paradigms. This quest has also shifted
their focus to the details of the underlying consensus protocol,
while the rest of the system acts as a black-box. Before laying
down the foundation for efficient design, we first analyze existing practices in the domain of permissioned blockchain.
2.1 BFT Consensus
Decades of research into the design of secure Bft algorithms have paved the way for resilient permissioned blockchain applications. A Bft consensus protocol states that given a client request² and a set of replicas, some of which could be byzantine, the non-faulty replicas would agree on the order of this client request.

PBFT [7] is often described as the first Bft protocol to allow consensus to be incorporated by practical systems. PBFT employs a simple design where one replica is designated as the primary and the other replicas act as backups. PBFT only guarantees a successful consensus among n replicas if at most f of them are byzantine, where n ≥ 3f + 1.
When the primary replica receives a client request, it assigns it a sequence number and sends a Pre-prepare message to all the backups to execute this request in the sequence order (refer to Figure 3). Each backup replica, on receiving the Pre-prepare message from the primary, shows its agreement to this order by broadcasting a Prepare message. When a replica receives Prepare messages from at least 2f distinct backup replicas, it achieves a guarantee that a majority of the non-faulty replicas are aware of this request. Such a replica marks itself as prepared and broadcasts a Commit message. Next, when this replica receives Commit messages from 2f + 1 distinct replicas, it achieves a guarantee on the order for this request, as a majority of the replicas must have also prepared this request. Finally, this replica executes the request and sends a response to the client.
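As a concrete illustration of the quorum thresholds described above, the following is a minimal C++ sketch, not ResilientDB's actual code, of how a replica might count Prepare and Commit messages per sequence number; the class and function names are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical sketch of PBFT quorum tracking at one replica.
// n = 3f + 1 replicas tolerate at most f byzantine replicas.
struct ConsensusState {
    std::set<uint32_t> prepare_from;  // replica ids that sent Prepare
    std::set<uint32_t> commit_from;   // replica ids that sent Commit
    bool prepared = false;
    bool committed = false;
};

class PbftReplica {
public:
    explicit PbftReplica(uint32_t f) : f_(f) {}

    // Called on a Prepare message for `seq` from `replica_id`.
    // Returns true when the replica becomes prepared (2f matching Prepares).
    bool OnPrepare(uint64_t seq, uint32_t replica_id) {
        auto& st = states_[seq];
        st.prepare_from.insert(replica_id);
        if (!st.prepared && st.prepare_from.size() >= 2 * f_) {
            st.prepared = true;   // now broadcast a Commit message
            return true;
        }
        return false;
    }

    // Called on a Commit message for `seq` from `replica_id`.
    // Returns true when the order is guaranteed (2f + 1 matching Commits).
    bool OnCommit(uint64_t seq, uint32_t replica_id) {
        auto& st = states_[seq];
        st.commit_from.insert(replica_id);
        if (!st.committed && st.commit_from.size() >= 2 * f_ + 1) {
            st.committed = true;  // safe to execute in sequence order
            return true;
        }
        return false;
    }

private:
    uint32_t f_;
    std::map<uint64_t, ConsensusState> states_;
};
```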
More Replicas: It is evident that PBFT requires three phases, of which two necessitate quadratic communication complexity. This led to a plethora of interesting Bft designs. For instance, the Q/U [1] protocol attempts to reduce Bft consensus to a single phase through the use of 5f + 1 replicas, but cannot handle concurrent requests. HQ [13] builds on top of Q/U and permits concurrency only if the transactions are non-conflicting. Cowling et al. [52] introduce a preserializer to order conflicting concurrent transactions but expect the preserializer to be non-malicious.
²A client request denotes a client transaction.

Figure 3: The three phases of the PBFT consensus protocol: the client's Request, followed by Pre-prepare, Prepare, and Commit among the primary and backup replicas, and the Response to the client.
Speculative Execution: Zyzzyva [36] introduces speculative execution to Bft protocols to yield a single-phase, linear Bft protocol. In Zyzzyva's design, as soon as a backup replica receives a request from the primary, it executes the request and sends a response to the client. Hence, a replica does not even wait to confirm that the order is the same across all the replicas. Zyzzyva requires just one phase, so it helps to gauge the maximum throughput that can be attained by a Bft protocol. If the primary is malicious, Zyzzyva depends on its client to help ensure a correct order. If the clients are malicious, then Zyzzyva is unsafe until a good client participates. Further, Zyzzyva's fast case requires a client to receive a response from all 3f + 1 replicas before it marks a request complete. Prior works [9, 10] have shown that just one failure is enough to lead Zyzzyva to very low throughput. Moreover, a recent work [2] showed that Zyzzyva is unsafe. Several other protocols that base their design on Zyzzyva's model face similar limitations [16, 28, 51]. PoE [21] tries to eliminate the limitations of Zyzzyva by providing a two-phase, speculative consensus protocol but requires one phase of quadratic communication among all the replicas.
Trusted Components: Several Bft protocols [5, 8, 12, 32, 38, 58] suggest the use of trusted components to reduce the cost of Bft consensus. These works require only 2f + 1 replicas, as the trusted component helps to guarantee a correct ordering. However, a trusted component can be compromised [47] and can act as a sink for different attacks.
Multiple Primaries: Several protocols [22–24, 61] suggest dedicating multiple replicas as primaries to gain higher throughput. The concept of multiple primaries is fruitful only as long as the system is neither compute-bound nor network-bound. Further, each of these protocols requires coordination among the primaries to ensure a correct order.
2.2 Chain Management
A blockchain is an immutable ledger that consists of a set of blocks. Each block contains the necessary information regarding the executed transaction and the previous block in its chain. The data about the previous block helps any blockchain achieve immutability. The i-th block in the chain can be represented as:

B_i := {k, d, v, H(B_{i-1})}

This block B_i contains the sequence number (k) of the client request, the digest (d) of the request, the identifier of the primary (v) who initiated the consensus, and the hash of the previous block, H(B_{i-1}). Figure 4 illustrates a simple blockchain maintained at each replica. In each blockchain application, every replica independently maintains its copy of the blockchain. Prior to the start of consensus, the blockchain of each replica has no element. Hence, it is initialized with a genesis block [25]. The genesis block is marked as the first block in the chain and contains dummy data. For instance, a genesis block can contain the hash of the identifier of the first primary, H(P).
Figure 4: A formal representation of the blockchain maintained at a replica: the genesis block B0 = {D(1), -, -, -} is followed by B1 = {D(m), 1, 1, H(B0)}, B2 = {D(m'), 2, 1, H(B1)}, ..., Bk = {D(m''), k, 2, H(B(k-1))}.
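To make this block layout concrete, here is a minimal C++ sketch, under the stated definitions rather than ResilientDB's actual data structures, of a block carrying the sequence number, request digest, primary identifier, and hash of the previous block; the hash function is a stand-in for a cryptographic hash such as SHA256.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of the block layout B_i = {k, d, v, H(B_{i-1})}.
struct Block {
    uint64_t sequence;      // k: sequence number of the client request
    std::string digest;     // d: digest of the client request (batch)
    uint32_t primary_id;    // v: identifier of the primary that led consensus
    std::string prev_hash;  // H(B_{i-1}): hash of the previous block
};

// Stand-in hash; a real deployment would use a cryptographic hash here.
std::string HashOf(const std::string& data) {
    return std::to_string(std::hash<std::string>{}(data));
}

// Serialize a block into a single string so it can be hashed.
std::string Serialize(const Block& b) {
    return std::to_string(b.sequence) + "|" + b.digest + "|" +
           std::to_string(b.primary_id) + "|" + b.prev_hash;
}

// Append a new block whose prev_hash is the hash of the current tail; this
// back-link is what makes the ledger immutable. Assumes the chain already
// holds at least the genesis block.
void AppendBlock(std::vector<Block>& chain, uint64_t seq,
                 const std::string& digest, uint32_t primary_id) {
    Block b{seq, digest, primary_id, HashOf(Serialize(chain.back()))};
    chain.push_back(b);
}
```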
2.3 Alternative Blockchain Architectures
To improve the throughput attained by a permissioned blockchain
application, database researchers have also looked at several different architectures and designs.
Commiees.
Several early works on blockchain attempted
to increase the throughput of the open membership (or permis-
sionless) blockchain applications through the use of commit-
tees [
34
,
35
,
46
]. In a committee based design, some replicas from
the set of all the replicas are selected, and only these replicas
perform consensus (or create the next block). In specic, these
systems rely on the assumption that the members of the selected
committee will act non-faulty. Such an assumption undermines
the fault-tolerance capability of the system.
DAG. Since the common data-structure in any blockchain application is the ledger, several systems incorporate a directed acyclic graph (DAG) to record the client transactions [3, 6, 37, 54, 55]. As a blockchain application expects a single order for all the transactions across all the replicas, a DAG-based design allows replicas working on non-conflicting transactions to simultaneously record multiple transactions. However, a DAG-based design requires merging branches of the DAG once there are conflicting transactions, which in turn necessitates regular communication among the replicas.
Sharding. Another approach to extract higher throughput from a blockchain system is to employ sharding [3, 40, 59, 62]. Sharding splits the records accessed by the clients into several distinct partitions, where each partition is maintained by a set of replicas. Although sharding helps an application attain high throughput when client transactions require access to only one partition, multi-partition transactions are expensive, as they can require up to two additional phases to ensure safety.
Probabilistic. Prior research works have also employed probabilistic estimates to yield fast Bft consensus among the replicas [18, 62]. These probabilistic estimates help these works determine an approximate number of replicas necessary for consensus. Often these systems revert to traditional Bft protocols when the probabilistic estimates are not as expected.
3 DISSECTING PERMISSIONED
BLOCKCHAIN
In the previous section, we discussed several approaches researchers have employed to improve the throughput of a blockchain application. Most of these strategies focus on: (i) optimizing the underlying Bft consensus algorithm, and/or (ii) restructuring the way the blockchain is maintained. We believe there is much more to consider in the design of a permissioned blockchain system beyond these strategies. Hence, we identify several other key factors that reduce the throughput and increase the latency of a permissioned blockchain system or database.
Single-threaded Monolithic Design. There is ample opportunity in the design of a permissioned blockchain application to extract parallelism. Several existing permissioned systems provide minimal to no discussion on how they can benefit from the underlying hardware or cores [59, 62]. Due to the sustained reduction in hardware cost (as a consequence of Moore's Law [41]), it is easy for each replica to have at least eight cores. Hence, by parallelizing tasks across different threads, a blockchain application can benefit greatly from the available computational power. Further, it is often straightforward to divide the tasks of a blockchain application, which in turn can act as stages of a pipeline. Such a pipelined architecture facilitates concurrent processing of multiple requests across its stages.
Successive Phases of Consensus. Several interesting works advocate the benefits of performing consensus on one request at a time [3, 31]. We believe aggregating client requests into large batches can help a permissioned blockchain application significantly reduce the costs incurred by successive runs of its consensus protocol. Further, consensus on a single request can increase the communication and computation costs, as for each consensus the system has to create and send digests.
Integrated Ordering and Execution. On receiving a client request, each replica of a permissioned blockchain application has to order and execute that request. Although these tasks share a dependency, it is a useful design practice to separate them at the physical or logical level. At the physical level, distinct replicas can be used for execution, but this adds communication costs. At the logical level, distinct threads are given the task of executing the request, but this requires extra hardware cores to perform the task in parallel. In short, a single entity performing both ordering and execution loses the opportunity to gain from the inherent parallelism.
Strict Ordering. Permissioned blockchain applications rely on Bft protocols, which necessitate ordering of client requests in accordance with linearizability [7, 30]. Although linearizability helps guarantee a safe database state across all the replicas, it is an expensive property to achieve. Hence, we need an approach that provides linearizability but is inexpensive. We observe that permissioned blockchain applications can benefit from delaying the ordering of client requests until execution. This delay ensures that although several client requests are processed in parallel, the result of their execution is in order.
O-Memory Chain Management.
Blockchain applications
work on a large set of records or data. Hence, they require access
to databases to store these records. There is a clear trade-o
when applications store data in-memory or on an o-the-shelf
database. O-memory storage requires several CPU cycles to
fetch data [
29
]. Hence, employing in-memory storage can ensure
faster access, which in turn can lead to high system throughput.
Expensive Cryptographic Practices. It is evident from the preceding sections that, throughout its lifetime, a blockchain application exchanges several types of messages. These messages are exchanged among the participating replicas and the clients, of which some may be byzantine. Hence, any blockchain application requires strong cryptographic constructs that allow a client or a replica to validate any message. These cryptographic constructs find a variety of uses in a blockchain application: (i) to sign a message before sending, (ii) to verify an incoming message, (iii) to generate the digest of a client request, and (iv) to hash a record or data.
Figure 5: The ResilientDB framework: an application layer (e.g., YCSB), the network, a storage layer (blockchain and metadata), an execution layer (threads, queues, and the consensus protocol), and a secure layer (hashing and signing toolkits).

To sign and verify a message, a blockchain application can employ either symmetric-key cryptography or asymmetric-key cryptography [33]. Although symmetric-key signatures, such as Message Authentication Codes (MACs), are faster to generate than asymmetric-key signatures, such as Digital Signatures (DSs), DSs offer the key property of non-repudiation, which is not guaranteed by MACs [33]. Hence, several works suggest using DSs [3, 4, 62]. However, a cleverly designed permissioned blockchain system can skip using DSs for a majority of its communication, which in turn helps increase its throughput. For generating digests or hashes, a blockchain application needs to employ standard hash functions, such as SHA256 or SHA3, which are secure. But hashing is expensive and needs to be used with prudence.
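As an illustration of this trade-off, the sketch below (hypothetical; the crypto primitives are placeholder stubs, not a specific library API) signs replica-to-replica messages with a cheap MAC while client-facing messages carry a digital signature, mirroring the practice described above.

```cpp
#include <string>

// Placeholder primitives: a real deployment would substitute CMAC+AES and
// Ed25519 implementations from a crypto library. These stubs only show
// where each scheme is applied.
std::string CmacSign(const std::string& shared_key, const std::string& msg) {
    return "cmac(" + shared_key.substr(0, 4) + "," + msg + ")";
}
std::string Ed25519Sign(const std::string& priv_key, const std::string& msg) {
    return "ed25519(" + priv_key.substr(0, 4) + "," + msg + ")";
}

enum class Channel { ReplicaToReplica, ClientFacing };

// Replica-to-replica traffic uses MACs over pairwise shared keys; messages
// that may need non-repudiation (client requests/responses) use digital
// signatures.
std::string SignMessage(Channel ch, const std::string& key,
                        const std::string& msg) {
    return (ch == Channel::ReplicaToReplica) ? CmacSign(key, msg)
                                             : Ed25519Sign(key, msg);
}
```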
4 HIGH THROUGHPUT YIELDING
PERMISSIONED BLOCKCHAIN FABRIC
In the previous section, we discussed some key factors that need to be taken into consideration while designing a permissioned blockchain system. We now present our ResilientDB blockchain framework, which incorporates our insights and fulfills the promise of an efficient permissioned blockchain system. ResilientDB presents an extensible architecture, which can be easily employed by several existing permissioned blockchain applications and databases [3, 4, 21]. Further, ResilientDB also acts as a test-bed to implement and test new protocols pertaining to permissioned blockchains.
In Figure 5, we illustrate the overall architecture of ResilientDB.
ResilientDB presents a client-server architecture that is spread
across several layers. At the application layer, we allow multiple
clients to co-exist, each of which creates its own requests. For
this purpose, they can either employ an existing benchmark suite
or design a request suited to the active application. Next, clients
and replicas use the transport layer to exchange messages across
the network. ResilientDB also provides a storage layer where
all the metadata corresponding to a request and the blockchain
is stored. At each replica, there is an execution layer where the
underlying consensus protocol is run on the client request, and
the request is ordered and executed. During ordering, the secure
layer provides any cryptographic support.
Since our aim is to present the design of a high-throughput permissioned blockchain system, we employ the simple yet robust PBFT protocol for reaching consensus among the replicas. This allows us to highlight that, despite an expensive consensus protocol, a blockchain system can achieve high throughput.
4.1 Multi-Threaded Deep Pipeline
ResilientDB lays down a client-server architecture, where each
client transmits its request to a server designated as the primary.
As all the servers are replicas of each other, the primary replica takes up the task of leading the consensus among the replicas and ensuring all the replicas are in the same state. Note that this categorization of a replica as the primary or a backup helps ResilientDB customize its architecture as necessary.

Figure 6: Pipelined and multi-threaded architecture for (a) primary and (b) backup replicas in the ResilientDB fabric.
We use Figures 6a and 6b to illustrate the threaded, pipelined architecture at the primary and backup replicas, respectively. Note that the number of threads shown in these figures is for the sake of illustration and can be increased (or decreased) if necessary. In fact, one of the key goals of this paper is to study the effect of varying these threads on a permissioned blockchain.

With each replica, we associate multiple input and output threads. Specifically, in ResilientDB we balance the tasks assigned to the input-threads by requiring one input-thread to solely receive client requests, while two other input-threads collect messages sent by other replicas. ResilientDB also balances the task of transmitting messages between the two output-threads by assigning an equal number of clients and replicas to each output-thread. To facilitate this division, we associate a distinct queue with each output-thread.
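A minimal sketch of such output balancing, under the assumption (mine, for illustration) that peers are statically mapped to output-threads by their identifier; a real implementation would use thread-safe (e.g., lock-free) queues rather than plain std::queue.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical sketch: each output-thread owns its own queue, and every
// peer (client or replica) is mapped to exactly one of those queues.
struct OutMessage {
    uint32_t dest_id;
    std::string payload;
};

class OutputDispatch {
public:
    explicit OutputDispatch(size_t num_output_threads)
        : queues_(num_output_threads) {}

    // Deterministic mapping: peer id modulo the number of output-threads,
    // spreading clients and replicas evenly across the threads.
    void Enqueue(OutMessage msg) {
        size_t idx = msg.dest_id % queues_.size();
        queues_[idx].push(std::move(msg));
    }

    // Each output-thread drains only its own queue, avoiding contention.
    std::queue<OutMessage>& QueueFor(size_t output_thread_idx) {
        return queues_[output_thread_idx];
    }

private:
    std::vector<std::queue<OutMessage>> queues_;
};
```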
4.2 Transaction Batching
ResilientDB provides both clients and replicas an opportunity to batch their transactions. Using an optimal batching policy can help mask communication and consensus costs. A client can send a burst of transactions as a single request message to the primary replica. For instance, a client batching multiple requests is common in applications such as stock trading, monetary exchanges, and service-level agreements. The primary replica can also aggregate client requests together to significantly reduce the number of times a consensus protocol needs to be run among the replicas.
4.3 Modeling a Primary Replica
To facilitate ecient batching of requests, ResilientDB also asso-
ciates multiple batch-threads at the primary replica. When the
primary replica receives a batch of requests from the client, it
treats it as a single request. The input-thread at the primary
assigns a monotonically increasing sequence number to each
incoming client request and enqueues it into the common queue
for the batch-threads. To prevent contention among the batch-
threads, we design the common queue as lock-free. But why have
a common queue? This allows us to ensure that any enqueued
request is consumed as soon as any batch-thread is available.
Each batch-thread also performs the task of verifying the sig-
nature of client request. If the verication is successful, then it
creates a batch and names it as the
Pre-prepare
message. PBFT
also requires the primary to generate the digest of the client
request and send this digest as part of the
Pre-prepare
mes-
sage. This digest helps in identifying the client request in future
communication. However, hashing (or generating digests) is ex-
pensive, and hashing each request of the batch will not only
act as a computational burden but also reduce the benets of
batching. Hence, each batch-thread rst generates a single string
representation of the whole batch and then hashes this string. As
hashes are computationally hard to forge, so this practice is safe.
Finally, the batch-thread signs and enqueues the corresponding
Pre-prepare message into the queue for an output-thread.
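A minimal sketch of this batch-digest idea, assuming a simple delimiter-based string form of each request; the stand-in hash below would be a cryptographic hash (e.g., SHA256) in a real system, and the names are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: concatenate a canonical string form of every request
// in the batch and hash the concatenation once, instead of hashing each
// request separately.
struct ClientRequest {
    uint64_t txn_id;
    std::string payload;
};

// Stand-in for a cryptographic hash such as SHA256.
std::string HashOf(const std::string& data) {
    return std::to_string(std::hash<std::string>{}(data));
}

std::string BatchDigest(const std::vector<ClientRequest>& batch) {
    std::string concat;
    for (const ClientRequest& req : batch) {
        // A delimiter keeps the string representation unambiguous.
        concat += std::to_string(req.txn_id) + ":" + req.payload + ";";
    }
    return HashOf(concat);  // one hash per batch, not one per request
}
```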
Apart from the client requests, the primary replica also receives Prepare and Commit messages from other replicas. As the system is partially asynchronous, the primary may receive the Prepare and Commit messages from a backup replica X before the Prepare message from a backup Y. How is this possible? The replica X could have received a sufficient number of Prepare messages (that is, 2f) before the primary receives the Prepare from replica Y (the total number of replicas is n = 3f + 1). Hence, to prevent any resource contention, we designate only one worker-thread to process all these messages.
When the input-thread receives a Prepare message, it enqueues that message in the work-queue. The worker-thread dequeues a message and verifies the signature on this message. If the verification is successful, it records this message and continues collecting Prepare messages corresponding to a Pre-prepare message until its count reaches 2f. Once it reaches this count, it creates a Commit message, signs it, and broadcasts it. The worker-thread follows similar steps for a Commit message, except that it needs a total of 2f + 1 messages; once it reaches this count, it informs the execute-thread to execute the client requests.
4.4 Modeling a Backup Replica
ResilientDB associates fewer threads with a backup replica, as it does not need to collect client requests and create batches. When the input-thread at a backup replica receives a Pre-prepare message from the primary, it enqueues it in the work-queue. The worker-thread at a backup dequeues a Pre-prepare message and checks whether the message has a valid signature of the primary. If this is the case, then the worker-thread creates a Prepare message, signs this message, and enqueues it in the queue of an output-thread. Note that this Prepare message includes the digest from the Pre-prepare message and the sequence number suggested by the primary. The output-thread broadcasts this Prepare message on the network. Similar to the primary, each backup replica also collects 2f Prepare messages, creates and broadcasts a Commit message, collects 2f + 1 Commit messages, and informs the execute-thread.
4.5 Out-of-Order Message Processing
The key to the fast ordering of client requests is to allow ordering
of multiple client requests to happen in parallel. ResilientDB
supports parallel ordering of client requests, while ensuring a
single common order across all the replicas.
Example 4.1. In ResilientDB, say a client C sends the primary replica P first request m1 and then request m2. The input-thread at the primary P would assign a sequence number k to request m1 and k + 1 to request m2. However, as the batch-threads can work at varying speeds, it is possible that the consensus for requests m1 and m2 may either overlap, or some replica R may receive 2f + 1 Commit messages for m2 before m1.
In principle, Example 4.1 seems like a challenge for blockchain applications, as blockchain applications require every new block to contain the hash of the previous block in the chain. However, this requirement is implicitly fulfilled through the design of a Bft protocol. Standard Bft protocols assume non-faulty replicas produce the same output on the same input. Further, they only accept a request after they have a guarantee that a majority of other replicas have also accepted the request. For example, in the PBFT protocol, the replica R will not send a Commit message in support of the client request m1 (received through the Pre-prepare message) until it receives 2f identical Prepare messages from distinct replicas, that is, messages that include the digest of m1 and have the same sequence number. Hence, PBFT does not require any request to include the digest of a previous request. This allows us to easily parallelize consensus.
4.6 Ecient Ordered Execution
Although we parallelize consensus, we ensure execution happens in order. For instance, the requests m1 and m2 from Example 4.1 will be executed in sequence order, that is, m1 is executed before m2, irrespective of the order in which their consensus completed. At each replica, we dedicate a separate execution-thread to execute the requests. But the key question remains: how can we reduce the execution-thread's overhead of ordering?

It is evident that the execution-thread has to wait for a notification from the worker-thread. Specifically, we require the worker-thread to create an Execute message and place this message in the appropriate queue for the execution-thread. This Execute message contains the identifiers of the starting and ending transactions of a batch, which need to be executed. Note that we associate a large set of queues with the execution-thread. To determine the number of required queues for the execution-thread, we use the parameter QC:

QC = 2 × Num_Clients × Num_Req

Here, Num_Clients represents the total number of clients in the system, while Num_Req represents the maximum number of requests a client can send without waiting for any response. We assume both of these parameters to be finite, and although QC can be very large, the queues are just logical, so the space complexity remains almost the same as for a single queue. But why is this practice advantageous?
Using this design, our execute-thread no longer has to continuously enqueue and dequeue to check whether the message corresponding to the next transaction in order has arrived. The execute-thread just waits on the queue txn_id % QC, where txn_id is the identifier of the transaction. When the execution-thread finds an Execute message there, it has a guarantee that it contains the next transaction in order. Alternatively, we could have employed hash-maps, but collision-resistant hash functions are expensive to compute and verify [33].
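A minimal sketch of this queue-indexing scheme, under the stated assumption of bounded clients and outstanding requests; the names are hypothetical, the logical queues are simplified to a vector of FIFO queues, and the busy-poll stands in for proper cross-thread signaling.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical sketch: QC = 2 * Num_Clients * Num_Req, and the Execute
// message for transaction txn_id is always placed in (and consumed from)
// queue txn_id % QC, so the execution-thread never scans for the next
// in-order transaction.
struct ExecuteMsg {
    uint64_t start_txn;  // first transaction of the batch to execute
    uint64_t end_txn;    // last transaction of the batch to execute
};

class ExecutionQueues {
public:
    ExecutionQueues(uint64_t num_clients, uint64_t num_req)
        : qc_(2 * num_clients * num_req), queues_(qc_) {}

    // Worker-thread side: enqueue based on the batch's starting transaction.
    void Notify(const ExecuteMsg& msg) {
        queues_[msg.start_txn % qc_].push(msg);
    }

    // Execution-thread side: wait only on the queue that must hold the next
    // in-order transaction (busy-poll shown for brevity; a real system would
    // block on a condition variable).
    ExecuteMsg WaitForNext(uint64_t next_txn) {
        auto& q = queues_[next_txn % qc_];
        while (q.empty()) { /* spin until the worker-thread notifies */ }
        ExecuteMsg msg = q.front();
        q.pop();
        return msg;
    }

private:
    uint64_t qc_;
    std::vector<std::queue<ExecuteMsg>> queues_;
};
```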
Once the execution is complete, the execution-thread creates a Response message and enqueues it in the queue for the output-threads, to be sent to the client. Note that ensuring execution happens in order provides a guarantee that a single common order is established across all the non-faulty replicas.
Block Generation. It is at this stage that we require the execution-thread to create a block representing this batch of requests. Traditional blockchain applications suggest that every new block should include a hash of the previous block. Although the execute-thread can hash the previous block, this process would be resource consuming and can act as a bottleneck. Note that, prior to starting execution, every replica did ensure that it got 2f + 1 identical Commit messages from distinct replicas. This acts as a sufficient proof to guarantee correct order [7, 36, 61]. Hence, we include the signatures of these 2f + 1 Commit messages in the block, instead of computing a hash.
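A sketch of this alternative block format, under the assumption (named here for illustration, not taken from ResilientDB's code) that the 2f + 1 Commit signatures are carried as a commit certificate in place of a freshly computed hash of the previous block.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical block format in which the proof of correct ordering is the
// set of 2f + 1 Commit signatures collected during consensus, rather than
// a hash of the previous block computed at execution time.
struct CommitProof {
    uint32_t replica_id;
    std::string signature;  // signature on the Commit message
};

struct Block {
    uint64_t sequence;                 // sequence number of the batch
    std::string batch_digest;          // digest carried in the Pre-prepare
    uint32_t primary_id;               // primary that led this consensus
    std::vector<CommitProof> commits;  // the 2f + 1 Commit signatures
};

Block MakeBlock(uint64_t seq, const std::string& digest, uint32_t primary_id,
                std::vector<CommitProof> commit_certificate, uint32_t f) {
    // The caller passes at least 2f + 1 distinct signatures; a real system
    // would also validate each signature before accepting the certificate.
    Block b{seq, digest, primary_id, std::move(commit_certificate)};
    return (b.commits.size() >= 2 * f + 1) ? b : Block{};
}
```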
4.7 Checkpointing
ResilientDB also requires its replicas to periodically generate and exchange checkpoints. These checkpoints serve two purposes:
(1) Help a failed replica to update itself to the current state.
(2) Facilitate cleaning of old requests, messages, and blocks.
However, as checkpointing requires the exchange of large messages, we ensure it does not impact the throughput of the system. ResilientDB deploys a separate checkpoint-thread at each replica to collect and process incoming Checkpoint messages. These Checkpoint messages simply include all the blocks generated since the last checkpoint. Specifically, a Checkpoint message is sent only after a replica has executed a fixed number of requests (the checkpoint interval). Once the execute-thread completes executing a batch, it checks whether the sequence number of the batch is a multiple of this interval. If so, it sends a Checkpoint message to all the replicas. When a replica receives 2f + 1 identical Checkpoint messages from distinct replicas, it marks the checkpoint and clears all the data before the previous checkpoint [7, 36].
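A minimal sketch of the checkpoint trigger, assuming a configurable checkpoint interval (the evaluation later uses one checkpoint per 10K transactions); the broadcast call is a placeholder, and the names are hypothetical.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical checkpoint trigger: after executing a batch, the
// execute-thread checks whether the batch's sequence number is a multiple
// of the checkpoint interval and, if so, emits a Checkpoint message.
class Checkpointer {
public:
    explicit Checkpointer(uint64_t interval) : interval_(interval) {}

    void OnBatchExecuted(uint64_t batch_seq) {
        if (batch_seq % interval_ == 0) {
            BroadcastCheckpoint(batch_seq);
        }
    }

private:
    // Placeholder for broadcasting a Checkpoint message (containing the
    // blocks generated since the last checkpoint) to all replicas.
    void BroadcastCheckpoint(uint64_t upto_seq) {
        std::cout << "checkpoint at sequence " << upto_seq << "\n";
    }

    uint64_t interval_;
};
```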
4.8 Buer Pool Management
Until now, our description revolved around how a replica uses messages and transactions. We now highlight how we efficiently store these constructs. In ResilientDB, we designed a base class that represents all the messages. To create a new message type, one simply inherits from this base class and adds the required properties. Although on delivery to the network each message is simply a buffer of characters, this typed representation helps us easily manipulate the required properties. Similarly, we have designed a base class to represent all client transactions. An object of this transaction class includes the transaction identifier, client identifier, and transaction data, among many other properties.

Figure 7: Upper bound measurements of (a) system throughput and (b) latency: (i) the primary responds to the client without execution, and (ii) the primary executes and then replies.

As messages arrive in the system, a replica would need to allocate (malloc or new) space for those messages. Similarly, when a replica receives a client request, it needs to allocate the corresponding transaction objects. When the lifetime of a message ends (or a new checkpoint is established), the memory occupied by that message (or transaction object) needs to be released (free or delete). To avoid such frequent allocations and de-allocations, we adopt the standard practice of maintaining a set of buffer pools. At the system initialization stage, we create a large number of empty objects representing the messages and transactions. So instead of doing a malloc, these objects are extracted from their respective pools and placed back in the pool during the free operation.
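A minimal object-pool sketch in the spirit of this design; the pool pre-allocates message objects at startup and recycles them instead of calling new/delete per message. The class names and structure are illustrative, not ResilientDB's actual classes.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical base message class; concrete message types (Pre-prepare,
// Prepare, Commit, ...) would inherit from it and add their own fields.
struct Message {
    int type = 0;
    virtual ~Message() = default;
};

// A simple pool: objects are created once at initialization and then
// repeatedly borrowed and returned, avoiding per-message new/delete.
class MessagePool {
public:
    explicit MessagePool(size_t capacity) {
        for (size_t i = 0; i < capacity; ++i)
            free_.push_back(std::make_unique<Message>());
    }

    std::unique_ptr<Message> Borrow() {
        if (free_.empty())  // pool exhausted: fall back to allocation
            return std::make_unique<Message>();
        auto msg = std::move(free_.back());
        free_.pop_back();
        return msg;
    }

    void Return(std::unique_ptr<Message> msg) {
        msg->type = 0;  // reset before reuse
        free_.push_back(std::move(msg));
    }

private:
    std::vector<std::unique_ptr<Message>> free_;
};
```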
5 EXPERIMENTAL ANALYSIS
We now experimentally analyze how various parameters affect the throughput and latency of a Permissioned BlockChain (henceforth referred to as Pbc) system. To perform this study, we use our ResilientDB fabric and employ the PBFT protocol for achieving consensus among the replicas. To ensure a holistic evaluation, we attempt to answer the following questions:
(Q1) Can a well-crafted system based on a classical Bft protocol outperform a modern protocol?
(Q2) How much gain in throughput (and latency) can a Pbc achieve from pipelining and threading?
(Q3) Can pipelining help a Pbc become more scalable?
(Q4) What impact does batching of transactions have on a Pbc?
(Q5) Do multi-operation transactions impact the throughput and latency of a Pbc?
(Q6) How does increasing the message size impact a Pbc?
(Q7) What effect do different types of cryptographic signature schemes have on the throughput of a Pbc?
(Q8) How does a Pbc fare with in-memory storage versus storage provided by a standard database?
(Q9) Can an increased number of clients impact the latency of a Pbc, while its throughput remains unaffected?
(Q10) Can a Pbc sustain high throughput on a setup having fewer cores?
(Q11) How impactful are replica failures for a Pbc?
5.1 Evaluation Setup
We employ the Google Cloud infrastructure in the Iowa region to deploy our ResilientDB on each replica and client. For replicas, we use c2 machines with an 8-core Intel Xeon Cascade Lake CPU running at 3.8GHz and 16GB memory, while for clients we use 4-core c2 machines. We invoke up to 80K clients on 4 machines. For each experiment, we first warm up the system for 60 seconds, and then for the next 120 seconds we continuously collect results. We conduct each experiment three times to average out any noise. Further, we utilize batching to create batches of 100 transactions. We generate checkpoints infrequently, once per 10K transactions. For communication between replicas and clients, we employ digital signatures based on ED25519, and for communication among replicas we use a combination of CMAC and AES [33]. Note that we follow this setup throughout this section, unless explicitly stated otherwise.
We use the YCSB [11] benchmark as the workload for client transactions. For creating a transaction, each client indexes a YCSB table with an active set of 600K records. In our evaluation, we require client transactions to contain only write accesses, as a majority of blockchain requests are updates to existing data. During the initialization phase, we ensure each replica has an identical copy of the table. Each client YCSB transaction is generated from a uniform Zipfian distribution.

Figure 8: System throughput and latency on varying the number of replicas participating in the consensus. Here, E denotes the number of execution-threads, while B denotes the number of batch-threads.
5.2 Eect of Threading and Pipelining
In this section, we analyze and attempt to answer questions Q1 to Q3. For this study, we vary the system parameters along two dimensions: (i) we increase the number of replicas participating in the consensus from 4 to 32, and (ii) we expand the pipeline and gradually balance the load among parallel threads.

We first try to gauge the upper-bound performance of our system. In Figures 7a and 7b, we measure the maximum throughput and latency the system can achieve when there is no communication among the replicas nor any consensus protocol. We use the term No Execution to refer to the case where all the clients send their requests to the primary replica and the primary simply responds back to the client. We count every query responded to in the system throughput. We use the term Execution to refer to the case where the primary replica executes each query before responding back to the client. In both of these experiments, we allowed two threads to work independently at the primary replica, that is, no ordering is maintained. Clearly, the system can attain high throughputs (up to 500K txns/s) and has low latency (up to 0.25s).
Figure 9: Saturation level of different threads at (a) the primary replica and (b) a backup replica. A value of 100% implies the thread is completely saturated.

Next, we take two consensus protocols, PBFT and Zyzzyva, and ensure that at least 3f + 1 replicas participate in the consensus. We gradually move our system towards the architecture of Figures 6a and 6b. In Figures 8a and 8b, we show the effects of this gradual increase. We denote the number of execution-threads with the symbol E, and batch-threads with the symbol B. For all these experiments, we used only one worker-thread. The key intuition behind these plots is to continue expanding the stages of the pipeline and the number of threads until the system can no longer increase its throughput. Moreover, through these plots, we want to determine whether PBFT can outperform Zyzzyva. Note that PBFT is a three-phase protocol with two of its phases requiring quadratic communication, while Zyzzyva is a single-phase protocol with a linear amount of communication. So if we can present a case where PBFT outperforms Zyzzyva, then clearly the effects of a well-crafted implementation can be observed.
On close observation of Figure 8a, we see that there are multiple such cases. Further, these plots help confirm our intuition that a multi-threaded, pipelined architecture for a Pbc outperforms a single-threaded design. This is the key reason why our design of ResilientDB employs one execution-thread and two batch-threads, apart from a single worker-thread.

To perform this experiment, PBFT was our target protocol, and we wanted to gradually study its performance. We first modified ResilientDB to ensure there are no additional threads for execution and batching, that is, all tasks are done by one worker-thread (0E 0B). On scaling this system, we realized that this worker-thread was getting saturated. Hence, we partially divided the load by adding an execute-thread (1E 0B). However, we again observed that the worker-thread at the primary was getting saturated. So we had an opportunity to introduce a separate thread to create batches (1E 1B). Although the worker-thread was no longer saturating, the batch-thread was overloaded with the task of creating batches. Hence, we further divided the task of batching among multiple batch-threads (1E 2B) and ensured none of the batch-threads was saturating. Figures 9a and 9b show the saturation level of the different threads at a replica. In this figure, we mark 100% as the maximum saturation for any thread. Using the bar for cumulative saturation, we show the sum of all thread saturations for each experiment. Note that for PBFT 1E 2B, the worker-thread at the backup replicas has started to saturate. But, as the architecture at the non-primary replicas already follows our design, we split no further.
It can be observed that PBFT on our standard pipeline (1E 2B) attains higher throughput than all but one Zyzzyva implementation. The only Zyzzyva implementation that outperforms PBFT is the one (1E 2B) that employs ResilientDB's standard threaded pipeline. Further, even the simpler implementation of PBFT (1E 1B) attains higher throughput than Zyzzyva's 0E 0B and 1E 0B implementations. Note that in the majority of the settings, PBFT incurs less latency than Zyzzyva. This is an effect of Zyzzyva's algorithm, which requires the client to wait for replies from all n replicas, whereas for PBFT the client only needs f + 1 responses.
To summarize: (i) PBFT's throughput (latency) increases (reduces) by 1.39× (58.4%) on moving from the 0E 0B setup to 1E 2B. (ii) Zyzzyva's throughput (latency) increases (reduces) by 1.72× (63.19%) on moving from the 0E 0B setup to 1E 2B. (iii) Throughput gains of up to 1.07× are possible when running PBFT on an efficient setup, in comparison to basic setups for Zyzzyva.
5.3 Eect of Transaction Batching
We now try to answer question Q4 by studying how batching client transactions impacts the throughput and latency of a Pbc. For this study, we require 16 replicas to participate in the consensus, and we increase the size of a batch from 1 to 5000.
Figure 10: System throughput and latency on varying the number of transactions per batch. Here, 16 replicas participate in consensus.

Figure 11: System throughput and latency on varying the number of operations per transaction. Here, B denotes the number of batch-threads and 16 replicas participate in consensus.
We adhere to a standard architecture of one worker-thread, one execute-thread, and two batch-threads for these plots.

Using Figures 10a and 10b, we observe that as the number of transactions in a batch increases, the throughput increases up to a limit (at 1000) and then starts decreasing (at 3000). At smaller batch sizes, more rounds of consensus take place, and hence communication impacts the system throughput; larger batches help reduce the number of consensus rounds. However, when the number of transactions in a batch is increased further, the size of the resulting message and the time taken by a batch-thread to create a batch reduce the system throughput. Hence, any Pbc needs to find an optimal number of client transactions that it can batch. To summarize: batching can lead to up to a 66× increase in throughput and a 98.4% reduction in latency.
5.4 Eect of Multi-Operation Transactions
We now present our attempt at answering question Q5, that is, understanding how multi-operation transactions affect the throughput of the system. We use Figures 11a and 11b and increase the number of operations per transaction from 1 to 50. Note that although multi-operation transactions are common in databases, prior works do not provide any discussion of such transactions. Further, these experiments are orthogonal counterparts of the experiments in the previous section.
Figure 12: System throughput and latency on varying the message size. Here, 16 replicas participate in consensus.
In these gures, we require 16 replicas to participate in con-
sensus. Further, we increase the number of batch-threads from
2to 5, while having one worker-thread and one execute-thread.
It is evident from these gures that as the number of operations
per transaction increases the system throughput decreases. This
decrease is a consequence of batch-threads getting saturated as
they perform task of creating batching and allocating resources
for transaction. Hence, we ran several experiments with dierent
number of batch-threads. An increase in the number of batch-
threads helps the system to increase its throughput, but the gap
reduces signicantly after the transaction becomes too large (at
50 operations). Similarly, more batch-threads help to decrease
the latency incurred by the system.
Alternatively, we also measure the total number of opera-
tions completed in each experiment. Notice that if we base the
throughput on the number of operations executed per second,
then the trend has completely reversed. Indeed, this makes sense
as in fewer rounds of consensus, more operations have been exe-
cuted. To
summarize:
, multi-operation transactions can cause
a decrease (increase) of 93% (13
.
29
×
) throughput (latency), as
measured on the two batch-threads setup. An increase in batch-
threads from two to ve, led to an increase (reduction) in through-
put (latency) of up to 66% (39%).
5.5 Eect of Message Size
We now attempt at answering question Q6 by increasing the
size of the
Pre-prepare
message in each consensus. The key
intuition behind this experiment is to gauge how well a Pbc
system performs when the requests sent by a client are large.
Although each batch includes only 100 client transactions, indi-
vidually, these requests can be large. Hence, these experiments
are aimed at exploiting a dierent system parameter than the
plots of Figure 10.
Figures 12a and 12b depict the variation in throughput and latency as the size of the Pre-prepare message increases. To increase the size of the Pre-prepare message, we add a set of integers (8 bytes each) as a payload to each message. The cardinality of this set is kept equivalent to the desired message size. For these experiments, we use 16 replicas for consensus and employ our standard combination of threads: one worker-thread, one execute-thread, and two batch-threads.

It is evident from these plots that as the message size increases, there is a decrease in the system throughput and an increase in the latency incurred by the client. This happens as the network bandwidth becomes a limitation, and it takes extra time to push more data onto the network. Hence, in this experiment, the system reaches the network bound before any thread can computationally saturate. This leads to all the threads being idle. To summarize: on moving from 8KB to 64KB messages, there was a 52% (1.09×) reduction (increase) in throughput (latency).
Figure 13: System throughput and latency with different signature schemes. Here, 16 replicas participate in consensus.

Figure 14: System throughput and latency for in-memory storage vs. off-memory storage. Here, 16 replicas are used for consensus.
5.6 Eect of Cryptographic Signatures
In this section, we answer question Q7 by studying the impact
of dierent cryptographic signing schemes. The key intuition
behind these experiments is to determine which signing scheme
helps our ResilientDB achieve highest throughput, while prevent-
ing byzantine attacks. For this purpose, we run four dierent
experiments to measure the system throughput and latency when:
(i) no signature scheme is used, (ii) everyone uses digital signa-
tures based on ED25519, (iii) everyone uses digital signatures
based on RSA, and (iv) all replicas use CMAC+AES for signing,
while clients sign their message using ED25519.
Figures 13a and 13b illustrate the throughput attained and latency incurred by ResilientDB for the different configurations. In these experiments, we require 16 replicas to participate in consensus and use our standard architecture of one worker-thread, one execute-thread, and two batch-threads. It is evident that ResilientDB attains maximum throughput when no signatures are employed. However, such a system does not fulfill the minimal requirements of a permissioned blockchain system. Further, using just digital signatures for signing messages is not the best practice. An optimal configuration can require clients to sign their messages using digital signatures, while replicas communicate using MACs. To summarize: (i) cryptography causes at least a 49% (33%) reduction (increase) in throughput (latency); (ii) choosing RSA over the CMAC and ED25519 combination would increase latency by 125×.
5.7 Eect of Memory Storage
We now try to answer question Q8 by studying the trade o
of having in-memory storage versus o-memory storage, on a
Pbc. For testing o-memory storage, we associate SQLite [
14
]
with our ResilientDB architecture. We use SQLite to store and
access the transactional records. As SQLite is external to our
ResilientDB fabric, so we developed API calls to read and write
its tables. Note that until now for all the experiments, we assumed
an in-memory storage, that is, records are written and accessed
in an in-memory key-value data-structure.
Figure 15: System throughput and latency on varying the number of clients. Here, 16 replicas participate in consensus.

Figure 16: System throughput and latency on varying the number of hardware cores. Here, 16 replicas participate in consensus.
Figures 14a and 14b illustrate the impact on system throughput and latency in the two cases. In these experiments, we again run consensus among 16 replicas and conform to our standard thread configuration. For in-memory storage, we require the execute-thread to read/write the key-value data-structure, while for SQLite the execute-thread initiates an API call and waits for the results. It is evident from these plots that access to off-memory storage (SQLite) is quite expensive. Further, as the execute-thread is busy-waiting for a reply, it performs no useful task. To summarize: choosing SQLite over in-memory storage reduces (increases) throughput (latency) by 94% (24×).
5.8 Eect of Clients
In this section, we study the impact of clients on a Pbc system,
and as a result, work towards answering question Q9. We want
to observe how the throughput and latency gets impacted on
increasing the number of clients sending requests to a Pbc. For
this purpose, we vary the number of clients from 4Kto 80K.
We use Figures 15a and 15b to illustrate the eects on through-
put and latency. We employ 16 replicas for consensus and use
our standard thread conguration. Through Figure 15a we con-
clude that on increasing the number of clients, the throughput
for the system increases to some extent (up to 32
K
), and then
it becomes constant. This happens as the system can no longer
process any extra requests, as all the threads are already working
at their maximum capacity. As the number of clients increases,
an increased set of requests have to wait in the queue before
they can be processed. This wait can even cause a slight dip in
throughput (on moving from 64
K
to 80
K
clients). This delay in
processing is a major cause for a linear increase in the latency
incurred by the clients (as shown in Figure 15b). To
summarize:
we observe that an increase in the number of clients from 16
K
to
80
K
helps the system to gain an additional 1
.
44% throughput but
incurs 5×more latency.
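The saturation behaviour described above can be approximated with a simple closed-loop model: each client keeps one request in flight, throughput is capped by the system's capacity, and latency follows from Little's law. The capacity and base-latency constants below are assumed placeholders, not measured values.

```python
# Closed-loop saturation model: throughput flattens at capacity, latency then
# grows roughly linearly with the number of outstanding clients.
BASE_LATENCY = 0.1       # assumed failure-free consensus latency (seconds)
CAPACITY_TPS = 100_000   # assumed peak system throughput (txns/s)

def expected(clients):
    # Every client keeps exactly one request in flight.
    throughput = min(clients / BASE_LATENCY, CAPACITY_TPS)
    latency = clients / throughput          # Little's law: W = L / X
    return throughput, latency

for clients in (4_000, 16_000, 32_000, 64_000, 80_000):
    tput, lat = expected(clients)
    print(f"{clients:>6} clients: ~{tput:>7.0f} txns/s, ~{lat * 1000:6.1f} ms latency")
```

Under this model, moving from 16K to 80K clients leaves throughput at the capacity ceiling while multiplying latency by roughly 5×, matching the trend reported above.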
5.9 Effect of Hardware Cores
We now move towards studying question Q10, that is, what are the effects of the deployed hardware on a Pbc application. Specifically, we deploy our replicas on different Google Cloud machines having 1, 2, 4, and 8 cores.
Figure 17: System throughput and latency on failing non-primary replicas. Here, 16 replicas participate in consensus. (a) System throughput. (b) Latency.
We use Figures 16a and 16b to illustrate the throughput and latency attained by our ResilientDB system on the different machines. For all these experiments, we require 16 replicas to participate in the consensus and employ our standard thread configuration. These figures affirm our claim that if replicas run on a machine with fewer cores, the overall system throughput is reduced (and higher latency is incurred). As our architecture (refer to Figures 6a and 6b) requires several threads, on a machine with fewer cores our threads face resource contention. Hence, ResilientDB attains maximum throughput on the 8-core machines. To summarize: deploying ResilientDB replicas on 8-core machines, in comparison to 1-core machines, leads to an 8.92× increase in throughput.
5.10 Effect of Replica Failures
We now study how simple replica failures affect a Pbc. Essentially, we try to analyze question Q11. The key intuition behind this experiment is to experimentally analyze whether a fast Bft consensus protocol can withstand failures. Specifically, we again take the single-phase Zyzzyva consensus protocol and present a head-on comparison of Zyzzyva against PBFT, while allowing some non-primary (backup) replicas to fail.
In Figures 17a and 17b, we illustrate the impact of the failure of one replica and of five replicas on the two consensus protocols. For this experiment, we require at most 16 replicas to participate in consensus. Note that for n = 16, the maximum number of failures a Bft system can handle is f = 5. Hence, we evaluate both protocols under the minimum and maximum number of simultaneous failures.
At the scale of these graphs, it is easy to conclude that increasing the number of failures leaves the throughput of both protocols unchanged; in reality, there is a small dip in throughput for both PBFT and Zyzzyva. For PBFT this dip is very small, as no phase of PBFT requires more than 2f + 1 messages, so PBFT continues to perform well even when multiple backup replicas fail. Zyzzyva, on the other hand, observes a pronounced reduction in its throughput with just one failure. The key issue with Zyzzyva is that its client needs responses from all the replicas, so a single failure forces the client to wait until it times out. This wait causes a significant reduction in throughput. Note that finding the optimal amount of time a client should wait is a hard problem. Hence, we approximate it by requiring clients to wait for only a short time.
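The following sketch illustrates why a single backup failure is so costly for a Zyzzyva-style client: a PBFT client needs only f + 1 matching replies, whereas Zyzzyva's fast path needs replies from all 3f + 1 replicas, so one crashed replica forces the client to wait for a timeout. The timing constants are illustrative assumptions.

```python
# Why one crashed backup hurts Zyzzyva's fast path but not PBFT (sketch).
N, F = 16, 5
TIMEOUT = 0.5          # assumed client-side timeout (seconds)
ROUND_TRIP = 0.01      # assumed normal reply delay (seconds)

def wait_for_replies(needed, alive):
    """Return how long a client waits to gather `needed` matching replies."""
    if alive >= needed:
        return ROUND_TRIP   # quorum of replies arrives after a normal round trip
    return TIMEOUT          # quorum never forms; client waits out its timer

alive = N - 1               # one backup replica has crashed
print("PBFT client   (needs f+1  =", F + 1, "replies):",
      wait_for_replies(F + 1, alive), "s")
print("Zyzzyva fast  (needs 3f+1 =", N, "replies):",
      wait_for_replies(N, alive), "s")
```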
6 OBSERVATIONS FOR FUTURE
PERMISSIONED BLOCKCHAIN SYSTEMS
Based on the results presented in the previous section, we revisit our discussion on the design of efficient permissioned blockchain systems from Section 3. Before diving deep into our observations for future Pbc systems, we make two high-level conclusions:
(i) A slow classical Bft protocol running on a well-crafted implementation, such as ResilientDB, can easily outperform a fast Bft protocol implemented on a protocol-centric design. For example, our implementation provides the three-phase PBFT protocol with up to 79% more throughput than the single-phase Zyzzyva protocol.
(ii) No single parameter can alone substantially improve the throughput (or reduce the latency) of the underlying Pbc. The key reason our ResilientDB framework attains high throughput and incurs low latency is that it attempts to optimally utilize several parameters.
Threading and Pipelining. Through the extensive discussion in the previous section, we observed the benefits of pipelining and parallelizing tasks. Most of the interesting works on Pbc systems either present new protocols to improve the performance of a Pbc or illustrate novel use-cases for blockchain [3, 4, 18, 62]. However, these works rarely focus on the implementation of a replica itself. These works could significantly increase their throughput by adopting an architecture similar to our ResilientDB. Further, caution needs to be taken while introducing parallelism, as unnecessary threads can cause resource contention or deadlocks. For example, having multiple execution-threads can cause data-conflicts.
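As a rough illustration of the replica architecture argued for here, the following Python sketch wires independent stages (batching, consensus work, execution) together with queues, keeping a single execution thread to avoid data conflicts. The stage names mirror the discussion; the per-stage work is a placeholder, not ResilientDB code.

```python
# Pipelined replica sketch: stages connected by queues, one execution thread.
import queue, threading

batch_q, work_q, exec_q = queue.Queue(), queue.Queue(), queue.Queue()

def batch_thread():
    batch = []
    while True:
        req = batch_q.get()
        if req is None:                 # shutdown sentinel
            break
        batch.append(req)
        if len(batch) == 10:            # assumed batch size
            work_q.put(list(batch))
            batch.clear()
    work_q.put(None)                    # partial batches are dropped in this sketch

def worker_thread():
    while True:
        batch = work_q.get()
        if batch is None:
            break
        exec_q.put(batch)               # placeholder for running consensus on the batch
    exec_q.put(None)

def execute_thread():
    while True:
        batch = exec_q.get()
        if batch is None:
            break
        for req in batch:
            pass                        # placeholder: apply txn to the key-value store
        # a single execution thread means no data conflicts on the store

stages = [threading.Thread(target=t)
          for t in (batch_thread, worker_thread, execute_thread)]
for s in stages:
    s.start()
for i in range(100):
    batch_q.put(f"txn-{i}")
batch_q.put(None)
for s in stages:
    s.join()
print("pipeline drained")
```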
Batching and Multiple Operations. Batching client requests has been a known practice in the database community. Several interesting works [3, 31] have suggested ill-effects of batching transactions in blockchain. Our results show that such observations may not always hold. Optimal use of batching can help reduce the cost of consensus by merging multiple consensus rounds into one. However, over-batching does introduce a communication trade-off. Hence, each Pbc application should determine the optimal set of client requests to batch. Clients can also employ multi-operation transactions; in practice, such a transaction includes at most ten operations. Hence, employing operations per second as the metric for measuring throughput may be a good idea. One can also reduce the size of a batch to save on communication.
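A simple cost model captures the batching trade-off described above: the fixed cost of a consensus round is amortized over the batch, while each additional request adds a small marginal cost. The constants below are assumed placeholders, not measured ResilientDB numbers.

```python
# Amortized per-request cost as a function of batch size (illustrative model).
CONSENSUS_COST = 5.0      # assumed fixed cost per consensus round (ms)
PER_REQUEST_COST = 0.02   # assumed marginal cost per batched request (ms)

def cost_per_request(batch_size):
    return CONSENSUS_COST / batch_size + PER_REQUEST_COST

for b in (1, 10, 100, 1000, 10000):
    print(f"batch={b:>5}: {cost_per_request(b):7.4f} ms per request")
```

The curve flattens quickly: beyond a certain batch size, the fixed consensus cost is already amortized away and larger batches mainly add communication volume.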
Message Size and Payload. Depending on the application targeted by a Pbc, clients may send large requests. For example, a client may require the execution of specific code. In such cases, traditional batching policies may not yield desirable results. If multiple large requests are batched together, the network may consume resources in splitting a message into packets, transmitting these packets, and re-assembling them at the destination. Hence, depending on the application, batching just ten large requests may already allow the system to attain high throughput.
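One way to act on this observation is payload-aware batching: close a batch once its serialized size would exceed a byte budget, rather than after a fixed number of requests. The sketch below illustrates the idea; the byte budget and request sizes are assumed values.

```python
# Payload-aware batching: cap batches by total bytes, not request count.
BATCH_BYTE_BUDGET = 64 * 1024      # assumed per-message byte budget

def make_batches(requests):
    batch, size = [], 0
    for req in requests:
        if batch and size + len(req) > BATCH_BYTE_BUDGET:
            yield batch
            batch, size = [], 0
        batch.append(req)
        size += len(req)
    if batch:
        yield batch

small = [b"a" * 128 for _ in range(1000)]     # many small requests
large = [b"b" * 20_000 for _ in range(30)]    # a few large (code-carrying) requests
print("small-request batch sizes:", [len(b) for b in make_batches(small)])
print("large-request batch sizes:", [len(b) for b in make_batches(large)])
```

Small requests are packed hundreds per batch, while large requests end up only a few per batch, keeping each consensus message within the same transmission budget.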
Cryptographic Signatures. Every blockchain system relies on a cryptographic signature scheme to prevent forgery. Although generating these signatures bottlenecks the system throughput, their use is essential for safety. Further, we observe that MACs are cheaper than DSs, but the latter guarantee non-repudiation. Thus, several works [3, 4, 62] suggest using only DSs. However, it is possible to extract both safety and high throughput. For instance, digital signatures are only necessary for messages that need to be forwarded. Hence, in a Pbc, only clients need to digitally sign their requests. For communication among the replicas, MACs suffice, as in most Bft protocols no replica forwards messages of any other replica. Hence, the property of non-repudiation is implicitly satisfied.
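The following Python sketch, assuming the third-party `cryptography` package, illustrates this split policy: the client digitally signs its request with ED25519 so that a forwarded request remains verifiable by everyone, while replica-to-replica messages carry only a pairwise CMAC tag. The keys and message contents are illustrative.

```python
# Clients sign with ED25519; replicas authenticate each other with CMAC+AES.
from cryptography.hazmat.primitives import cmac
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.hazmat.primitives.ciphers import algorithms

# Client side: ED25519 signature over the request.
client_key = ed25519.Ed25519PrivateKey.generate()
request = b"TRANSFER 10 FROM A TO B"
signature = client_key.sign(request)

# Any replica can verify the forwarded request with the client's public key.
client_key.public_key().verify(signature, request)   # raises on forgery

# Replica side: pairwise shared AES key, one CMAC tag per message.
shared_key = b"\x02" * 16                             # assumed pairwise key

def mac(message, key):
    c = cmac.CMAC(algorithms.AES(key))
    c.update(message)
    return c.finalize()

prepare_msg = b"PREPARE view=1 seq=42"
tag = mac(prepare_msg, shared_key)
assert tag == mac(prepare_msg, shared_key)            # receiver recomputes the tag
print("request signature and message MAC verified")
```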
Memory Storage. Pbc applications need to store client records and other metadata. We observed that the use of in-memory data-structures is better than off-memory storage, such as SQLite. The key reason a Pbc system can avoid frequent access to off-memory storage is that, at all times, at most f replicas can fail. Hence, if persistent storage is required, then it can be performed asynchronously or delayed until periods of low contention.
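A minimal write-behind sketch of this guideline: execution updates the in-memory key-value store immediately and hands the record to a background thread, which persists it to SQLite asynchronously. The schema and queue-based hand-off are illustrative assumptions, not the ResilientDB implementation.

```python
# Write-behind persistence: execution never waits on disk.
import queue, sqlite3, threading

mem_store = {}
flush_q = queue.Queue()

def persister(path="ledger.db"):
    # The SQLite connection lives entirely in this background thread.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (k TEXT PRIMARY KEY, v TEXT)")
    while True:
        item = flush_q.get()
        if item is None:                  # shutdown sentinel
            break
        conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", item)
        conn.commit()
    conn.close()

def execute(key, value):
    mem_store[key] = value        # fast path: in-memory update only
    flush_q.put((key, value))     # durability is handled asynchronously

bg = threading.Thread(target=persister)
bg.start()
for i in range(100):
    execute(f"key{i}", f"value{i}")
flush_q.put(None)
bg.join()
print("in-memory entries:", len(mem_store))
```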
Replica Failures.
We know that failures are common. Either
replicas may fail, or messages may get lost. A Pbc system needs to
be ready to face these situations. Hence, the system design must
not rely on a Bft protocol that works well in non-failure cases but
attains low throughput under simple failures. We observed that
designs employing protocols like Zyzzyva can have negligible
throughput with just one failure.
7 CONCLUSIONS
In this paper, we present a high-throughput yielding permissioned blockchain framework, ResilientDB. By dissecting ResilientDB, we analyze several factors that affect the performance of a permissioned blockchain system. This allows us to raise a simple question: can a well-crafted system based on a classical Bft protocol outperform a modern protocol? We show that the extensively parallel and pipelined design of our ResilientDB fabric allows even PBFT to attain high throughput (up to 175K) and outperform common implementations of Zyzzyva. Further, we perform a rigorous evaluation of ResilientDB and illustrate the impact of different factors such as cryptography, chain management, monolithic design, and so on.
REFERENCES
[1]
Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Re-
iter, and Jay J. Wylie. 2005. Fault-scalable Byzantine Fault-tolerant Services. In
Proceedings of the Twentieth ACM Symposium on Operating Systems Principles.
ACM, 59–74. https://doi.org/10.1145/1095810.1095817
[2]
Ittai Abraham, Guy Gueta, Dahlia Malkhi, Lorenzo Alvisi, Ramakrishna Kotla,
and Jean-Philippe Martin. 2017. Revisiting Fast Practical Byzantine Fault
Tolerance. https://arxiv.org/abs/1712.01367
[3]
Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. 2019.
CAPER: A Cross-application Permissioned Blockchain. Proceedings of the
VLDB Endowment 12, 11 (2019), 1385–1398. https://doi.org/10.14778/3342263.
3342275
[4]
Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstanti-
nos Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gen-
nady Laventman, Yacov Manevich, Srinivasan Muralidharan, Chet Murthy,
Binh Nguyen, Manish Sethi, Gari Singh, Keith Smith, Alessandro Sorniotti,
Chrysoula Stathakopoulou, Marko Vukolić, Sharon Weed Cocco, and Ja-
son Yellick. 2018. Hyperledger Fabric: A Distributed Operating System for
Permissioned Blockchains. In Proceedings of the Thirteenth EuroSys Confer-
ence (EuroSys ’18). ACM, New York, NY, USA, Article 30, 15 pages. https:
//doi.org/10.1145/3190508.3190538
[5]
Johannes Behl, Tobias Distler, and Rüdiger Kapitza. 2017. Hybrids on Steroids:
SGX-Based High Performance BFT. In Proceedings of the Twelfth European
Conference on Computer Systems. ACM, 222–237. https://doi.org/10.1145/
3064176.3064213
[6]
Iddo Bentov, Pavel Hubáček, Tal Moran, and Asaf Nadler. 2017. Tortoise
and Hares Consensus: the Meshcash Framework for Incentive-Compatible,
Scalable Cryptocurrencies.
[7]
Miguel Castro and Barbara Liskov. 1999. Practical Byzantine Fault Toler-
ance. In Proceedings of the Third Symposium on Operating Systems Design and
Implementation. USENIX Association, 173–186.
[8]
Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. 2007.
Attested Append-only Memory: Making Adversaries Stick to Their Word. In
Proceedings of Twenty-rst ACM SIGOPS Symposium on Operating Systems
Principles. ACM, 189–204. https://doi.org/10.1145/1294261.1294280
[9]
Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi,
Mike Dahlin, and Taylor Riche. 2009. Upright Cluster Services. In Proceedings
of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM,
277–290. https://doi.org/10.1145/1629575.1629602
[10]
Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco
Marchetti. 2009. Making Byzantine Fault Tolerant Systems Tolerate Byzantine
Faults. In Proceedings of the 6th USENIX Symposium on Networked Systems
Design and Implementation. USENIX Association, 153–168.
[11]
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and
Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In
Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143–154.
https://doi.org/10.1145/1807128.1807152
[12]
Miguel Correia, Nuno Ferreira Neves, and Paulo Verissimo. 2004. How to
Tolerate Half Less One Byzantine Nodes in Practical Distributed Systems. In
Proceedings of the 23rd IEEE International Symposium on Reliable Distributed
Systems. IEEE, 174–183. https://doi.org/10.1109/RELDIS.2004.1353018
[13]
James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba
Shrira. 2006. HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault
Tolerance. In Proceedings of the 7th Symposium on Operating Systems Design
and Implementation. USENIX Association, 177–190.
[14] SQLite Developers. 2019. SQLite Home Page. https://sqlite.org/
[15]
John R. Douceur. 2002. The Sybil Attack. In Peer-to-Peer Systems, Peter Dr-
uschel, Frans Kaashoek, and Antony Rowstron (Eds.). Springer Berlin Heidel-
berg, Berlin, Heidelberg, 251–260.
[16]
Sisi Duan, Sean Peisert, and Karl N. Levitt. 2015. hBFT: Speculative Byzantine
Fault Tolerance with Minimum Cost. IEEE Transactions on Dependable and
Secure Computing 12, 1 (2015), 58–70. https://doi.org/10.1109/TDSC.2014.
2312331
[17]
Sisi Duan, Michael K. Reiter, and Haibin Zhang. 2018. BEAT: Asynchronous
BFT Made Practical. In Proceedings of the 2018 ACM SIGSAC Conference on
Computer and Communications Security. ACM, 2028–2041. https://doi.org/10.
1145/3243734.3243812
[18]
Yossi Gilad, Rotem Hemo, Silvio Micali, Georgios Vlachos, and Nickolai Zel-
dovich. 2017. Algorand: Scaling Byzantine Agreements for Cryptocurrencies.
In Proceedings of the 26th Symposium on Operating Systems Principles. ACM,
51–68. https://doi.org/10.1145/3132747.3132757
[19]
Jim Gray. 1978. Notes on Data Base Operating Systems. In Operating Sys-
tems, An Advanced Course. Springer-Verlag, 393–481. https://doi.org/10.1007/
3-540- 08755-9_9
[20]
Andy Greenberg. 2018. The Untold Story of NotPetya, the Most
Devastating Cyberattack in History. https://www.wired.com/story/
notpetya-cyberattack- ukraine-russia- code-crashed-the- world/
[21]
Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2019.
Proof-of-Execution: Reaching Consensus through Fault-Tolerant Speculation.
CoRR abs/1911.00838 (2019). arXiv:1911.00838 http://arxiv.org/abs/1911.00838
[22]
Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2019. Brief Announce-
ment: Revisiting Consensus Protocols through Wait-Free Parallelization. In
33rd International Symposium on Distributed Computing (DISC 2019) (Leibniz
International Proceedings in Informatics (LIPIcs)), Vol. 146. Schloss Dagstuhl–
Leibniz-Zentrum fuer Informatik, 44:1–44:3. https://doi.org/10.4230/LIPIcs.
DISC.2019.44
[23] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2019. Revisiting con-
sensus protocols through wait-free parallelization. CoRR abs/1908.01458 (2019).
arXiv:1908.01458 http://arxiv.org/abs/1908.01458
[24]
Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2019. Scaling
Blockchain Databases through Parallel Resilient Consensus Paradigm. CoRR
abs/1911.00837 (2019). arXiv:1911.00837 http://arxiv.org/abs/1911.00837
[25]
Suyash Gupta and Mohammad Sadoghi. 2018. Blockchain Transaction Pro-
cessing. Springer International Publishing, 1–11. https://doi.org/10.1007/
978-3- 319-63962- 8_333-1
[26]
Suyash Gupta and Mohammad Sadoghi. 2018. EasyCommit: A Non-blocking
Two-phase Commit Protocol. In Proceedings of the 21st International Conference
on Extending Database Technology. Open Proceedings, 157–168. https://doi.
org/10.5441/002/edbt.2018.15
[27]
Suyash Gupta and Mohammad Sadoghi. 2019. Efficient and non-blocking
agreement protocols. Distributed and Parallel Databases (13 Apr 2019). https:
//doi.org/10.1007/s10619-019- 07267-w
[28]
James Hendricks, Shafeeq Sinnamohideen, Gregory R. Ganger, and Michael K.
Reiter. 2010. Zzyzx: Scalable fault tolerance through Byzantine locking. In
2010 IEEE/IFIP International Conference on Dependable Systems Networks (DSN).
IEEE, 363–372. https://doi.org/10.1109/DSN.2010.5544297
[29]
John L. Hennessy and David A. Patterson. 2011. Computer Architecture, Fifth
Edition: A Quantitative Approach (5th ed.). Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA.
[30]
Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness
Condition for Concurrent Objects. ACM Trans. Program. Lang. Syst. 12, 3 (July
1990), 463–492. https://doi.org/10.1145/78969.78972
[31]
Zsolt István, Alessandro Sorniotti, and Marko Vukolić. 2018. StreamChain:
Do Blockchains Need Blocks?. In Proceedings of the 2Nd Workshop on Scalable
and Resilient Infrastructures for Distributed Ledgers (SERIAL’18). ACM, New
York, NY, USA, 1–6. https://doi.org/10.1145/3284764.3284765
[32]
Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon
Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus
Stengel. 2012. CheapBFT: Resource-efficient Byzantine Fault Tolerance. In
Proceedings of the 7th ACM European Conference on Computer Systems. ACM,
295–308. https://doi.org/10.1145/2168836.2168866
[33]
Jonathan Katz and Yehuda Lindell. 2014. Introduction to Modern Cryptography
(2nd ed.). Chapman and Hall/CRC.
[34]
Aggelos Kiayias, Alexander Russell, Bernardo David, and Roman Oliynykov.
2017. Ouroboros: A Provably Secure Proof-of-Stake Blockchain Protocol. In
Advances in Cryptology – CRYPTO 2017. Springer International Publishing,
357–388. https://doi.org/10.1007/978-3- 319-63688- 7_12
[35]
Eleftherios Kokoris-Kogias, Philipp Jovanovic, Nicolas Gailly, Ismail Khoffi,
Linus Gasser, and Bryan Ford. 2016. Enhancing Bitcoin Security and Per-
formance with Strong Consistency via Collective Signing. In Proceedings of
the 25th USENIX Conference on Security Symposium. USENIX Association,
279–296.
[36]
Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund
Wong. 2007. Zyzzyva: Speculative Byzantine Fault Tolerance. In Proceedings
of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles. ACM,
45–58. https://doi.org/10.1145/1294261.1294267
[37]
Chenxing Li, Peilun Li, Wei Xu, Fan Long, and Andrew Chi-Chih Yao. 2018.
Scaling Nakamoto Consensus to Thousands of Transactions per Second. https:
//arxiv.org/abs/1805.03870
[38]
Joshua Lind, Oded Naor, Ittay Eyal, Florian Kelbert, Emin Gün Sirer, and Peter
Pietzuch. 2019. Teechain: A Secure Payment Network with Asynchronous
Blockchain Access. In Proceedings of the 27th ACM Symposium on Operating
Systems Principles. ACM, 63–79. https://doi.org/10.1145/3341301.3359627
[39]
J. Liu, W. Li, G. O. Karame, and N. Asokan. 2019. Scalable Byzantine Consensus
via Hardware-Assisted Secret Sharing. IEEE Trans. Comput. 68, 1 (Jan 2019),
139–151. https://doi.org/10.1109/TC.2018.2860009
[40]
Loi Luu, Viswesh Narayanan, Chaodong Zheng, Kunal Baweja, Seth Gilbert,
and Prateek Saxena. 2016. A Secure Sharding Protocol For Open Blockchains.
In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Commu-
nications Security. ACM, 17–30. https://doi.org/10.1145/2976749.2978389
[41]
G. E. Moore. 2006. Cramming more components onto integrated circuits,
Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff.
IEEE Solid-State Circuits Society Newsletter 11, 3 (Sep. 2006), 33–35. https:
//doi.org/10.1109/N-SSC.2006.4785860
[42]
Satoshi Nakamoto. 2009. Bitcoin: A Peer-to-Peer Electronic Cash System.
https://bitcoin.org/bitcoin.pdf
[43]
Faisal Nawab and Mohammad Sadoghi. 2019. Blockplane: A Global-Scale
Byzantizing Middleware. In 35th International Conference on Data Engineering
(ICDE). IEEE, 124–135. https://doi.org/10.1109/ICDE.2019.00020
[44]
The Council of Economic Advisers. 2018. The Cost of Malicious Cyber Activity
to the U.S. Economy. Technical Report. Executive Office of the President of
the United States. https://www.whitehouse.gov/wp-content/uploads/2018/
03/The-Cost- of-Malicious- Cyber-Activity- to-the- U.S.-Economy.pdf
[45]
National Audit Office. 2018. Investigation: WannaCry cy-
ber attack and the NHS. https://www.nao.org.uk/report/
investigation-wannacry- cyber-attack- and-the- nhs/
[46]
Rafael Pass and Elaine Shi. 2016. Hybrid Consensus: Efficient Consensus in
the Permissionless Model. https://eprint.iacr.org/2016/917
[47]
Charles P. Pfleeger, Shari Lawrence Pfleeger, and Jonathan Margulies. 2015.
Security in Computing (5th ed.). Prentice Hall.
[48]
Nathaniel Popper. [n.d.]. Worries Grow That the Price of Bitcoin Is Being
Propped Up. The New York Times, NY, USA.
[49]
Thamir M. Qadah and Mohammad Sadoghi. 2018. QueCC: A Queue-oriented,
Control-free Concurrency Architecture. In Proceedings of the 19th International
Middleware Conference. ACM, 13–25. https://doi.org/10.1145/3274808.3274810
[50]
Mohammad Sadoghi and Spyros Blanas. 2019. Transaction Processing
on Modern Hardware. Morgan & Claypool. https://doi.org/10.2200/
S00896ED1V01Y201901DTM058
[51]
Marco Serafini, Péter Bokor, Dan Dobre, Matthias Majuntke, and Neeraj Suri.
2010. Scrooge: Reducing the costs of fast Byzantine replication in presence of
unresponsive replicas. In 2010 IEEE/IFIP International Conference on Dependable
Systems Networks (DSN). IEEE, 353–362. https://doi.org/10.1109/DSN.2010.
5544295
[52]
Atul Singh, Petros Maniatis, Peter Druschel, and Timothy Roscoe. 2007.
Conflict-free Quorum-based BFT Protocols. Technical Report. Max Planck
Institute for Software Systems. https://www.mpi-sws.org/tr/2007-001.pdf
[53]
Dale Skeen. 1982. A Quorum-Based Commit Protocol. Technical Report. Cornell
University.
[54]
Yonatan Sompolinsky, Yoad Lewenberg, and Aviv Zohar. 2018. SPECTRE: A
Fast and Scalable Cryptocurrency Protocol. https://eprint.iacr.org/2016/1159
[55]
Yonatan Sompolinsky and Aviv Zohar. 2015. Secure High-Rate Transaction
Processing in Bitcoin. In Financial Cryptography and Data Security. Springer
Berlin Heidelberg, 507–527. https://doi.org/10.1007/978-3-662-47854- 7_32
[56]
Symantec. 2018. Internet Security Threat Report, Volume 32. https://www.
symantec.com/content/dam/symantec/docs/reports/istr-23- 2018-en.pdf
[57]
Amie Tsang. [n.d.]. Bitcoin Plunges After Hacking of Exchange in Hong Kong.
The New York Times, NY, USA.
[58]
Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk
Lung, and Paulo Verissimo. 2013. Efficient Byzantine Fault-Tolerance. IEEE
Trans. Comput. 62, 1 (2013), 16–30. https://doi.org/10.1109/TC.2011.221
[59]
Jiaping Wang and Hao Wang. 2019. Monoxide: Scale out Blockchains with
Asynchronous Consensus Zones. In Proceedings of the 16th USENIX Symposium
on Networked Systems Design and Implementation. USENIX Association, 95–
112.
[60]
Gavin Wood. 2016. Ethereum: a secure decentralised generalised transaction
ledger. https://gavwood.com/paper.pdf EIP-150 revision.
[61]
Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai
Abraham. 2019. HotStuff: BFT Consensus with Linearity and Responsiveness.
In Proceedings of the 2019 ACM Symposium on Principles of Distributed
Computing. ACM, 347–356. https://doi.org/10.1145/3293611.3331591
[62]
Mahdi Zamani, Mahnush Movahedi, and Mariana Raykova. 2018. RapidChain:
Scaling Blockchain via Full Sharding. In Proceedings of the 2018 ACM SIGSAC
Conference on Computer and Communications Security. ACM, 931–948. https:
//doi.org/10.1145/3243734.3243853