
# ResilientDB: global scale resilient blockchain fabric


## Abstract

Recent developments in blockchain technology have inspired innovative new designs in resilient distributed and database systems. At their core, these blockchain applications typically use Byzantine fault-tolerant consensus protocols to maintain a common state across all replicas, even if some replicas are faulty or malicious. Unfortunately, existing consensus protocols are not designed to deal with geo-scale deployments in which many replicas spread across a geographically large area participate in consensus. To address this, we present the Geo-Scale Byzantine Fault-Tolerant consensus protocol (GeoBFT). GeoBFT is designed for excellent scalability by using a topological-aware grouping of replicas in local clusters, by introducing parallelization of consensus at the local level, and by minimizing communication between clusters. To validate our vision of high-performance geo-scale resilient distributed systems, we implement GeoBFT in our efficient ResilientDB permissioned blockchain fabric. We show that GeoBFT is not only sound and provides great scalability, but also outperforms state-of-the-art consensus protocols by a factor of six in geo-scale deployments.
Exploratory Systems Lab
Department of Computer Science
University of California, Davis
PVLDB Reference Format:
Sadoghi. ResilientDB: Global Scale Resilient Blockchain Fabric.
PVLDB, 13(6): 868-883, 2020.
DOI: https://doi.org/10.14778/3380750.3380757
1. INTRODUCTION
Recent interest in blockchain technology has renewed development of distributed Byzantine fault-tolerant (Bft) systems that can deal with failures and malicious attacks of some participants [8, 14, 17, 21, 24, 46, 49, 56, 74, 75, 79, 93]. Although these systems are safe, they attain low throughput, especially when the nodes are spread across a wide-area network (or geographically large distances). We believe this contradicts the central promises of blockchain technology:
Both authors have equally contributed to this work.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s).
Proceedings of the VLDB Endowment, Vol. 13, No. 6
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3380750.3380757
Table 1: Real-world inter- and intra-cluster communication costs in terms of the ping round-trip times (which determines latency) and bandwidth (which determines throughput). These measurements are taken in Google Cloud using clusters of n1 machines (replicas) that are deployed in six different regions.

Ping round-trip times (ms):

| From \ To    | O | I  | M  | B   | T   | S   |
|--------------|---|----|----|-----|-----|-----|
| Oregon (O)   | 1 | 38 | 65 | 136 | 118 | 161 |
| Iowa (I)     |   | 1  | 33 | 98  | 153 | 172 |
| Montreal (M) |   |    | 1  | 82  | 186 | 202 |
| Belgium (B)  |   |    |    | 1   | 252 | 270 |
| Taiwan (T)   |   |    |    |     | 1   | 137 |
| Sydney (S)   |   |    |    |     |     | 1   |

Bandwidth (Mbit/s):

| From \ To    | O    | I   | M   | B   | T   | S   |
|--------------|------|-----|-----|-----|-----|-----|
| Oregon (O)   | 7998 | 669 | 371 | 194 | 188 | 136 |
| Iowa (I)     |      | 10004 | 752 | 243 | 144 | 120 |
| Montreal (M) |      |     | 7977 | 283 | 111 | 102 |
| Belgium (B)  |      |     |     | 9728 | 79  | 66  |
| Taiwan (T)   |      |     |     |     | 7998 | 160 |
| Sydney (S)   |      |     |     |     |     | 7977 |
decentralization and democracy, in which arbitrary replicas
at arbitrary distances can participate [33, 44, 49].
At the core of any blockchain system is a Bft consensus protocol that helps participating replicas to achieve resilience. Existing blockchain database systems and data-processing frameworks typically use permissioned blockchain designs that rely on traditional Bft consensus [45, 47, 55, 76, 87, 88]. These permissioned blockchains employ a fully-replicated design in which every replica holds a full copy of the data (the blockchain).
1.1 Challenges for Geo-scale Blockchains
To enable geo-scale deployment of a permissioned blockchain system, we believe that the underlying consensus protocol must distinguish between local and global communication. This belief is easily supported in practice. For example, in Table 1 we illustrate the ping round-trip time and bandwidth measurements. These measurements show that global message latencies are at least 33–270 times higher than local latencies, while the maximum throughput is 10–151 times lower, both implying that communication between regions is several orders of magnitude more costly than communication within regions. Hence, a blockchain system needs to recognize and minimize global communication if it is to attain high performance in a geo-scale deployment.
In the design of geo-scale aware consensus protocols, this
translates to two important properties. First, a geo-scale
aware consensus protocol needs to be aware of the network
topology. This can be achieved by clustering replicas in a
region together and favoring communication within such
clusters over global inter-cluster communication. Second, a
geo-scale aware consensus protocol needs to be decentralized:
no single replica or cluster should be responsible for coordinating all consensus decisions, as such a centralized design limits the throughput to the outgoing global bandwidth and latency of this single replica or cluster.
Existing state-of-the-art consensus protocols do not share these two properties. The influential Practical Byzantine Fault Tolerance consensus protocol (Pbft) [18, 19] is centralized, as it relies on a single primary replica to coordinate all consensus decisions, and requires a vast amount of global communication (between all pairs of replicas). Protocols such as Zyzzyva improve on this by reducing communication costs in the optimal case [9, 62, 63]. However, these protocols still have a highly centralized design and do not favor local communication. Furthermore, Zyzzyva provides high throughput only if there are no failures and requires reliable clients [3, 23]. The recently introduced HotStuff improves on Pbft by simplifying the recovery process on primary failure [94]. This allows HotStuff to efficiently switch primaries for every consensus decision, providing the potential of decentralization. However, the design of HotStuff does not favor local communication, and the usage of threshold signatures strongly centralizes all communication for a single consensus decision to the primary of that round.
Another recent protocol, PoE, provides better throughput than both Pbft and Zyzzyva in the presence of failures, without employing threshold signatures [45]. Unfortunately, PoE also has a centralized design that depends on a single primary. Finally, the geo-aware consensus protocol Steward promises to do better [5], as it recognizes local clusters and tries to minimize inter-cluster communication. However, due to its centralized design and reliance on cryptographic primitives with high computational costs, Steward is unable to benefit from its topological knowledge of the network.
1.2 GeoBFT: Towards Geo-scale Consensus
In this work, we improve on the state-of-the-art by introducing GeoBFT, a topology-aware and decentralized consensus protocol. In GeoBFT, we group replicas in a region into clusters, and we let each cluster make consensus decisions independently. These consensus decisions are then shared via an optimistic low-cost communication protocol with the other clusters, in this way assuring that all replicas in all clusters are able to learn the same sequence of consensus decisions: if we have two clusters C1 and C2 with n replicas each, then our optimistic communication protocol requires only ⌈n/3⌉ messages to be sent from C1 to C2 when C1 needs to share local consensus decisions with C2. Specifically, we make the following contributions:
1. We introduce the GeoBFT consensus protocol, a novel consensus protocol that performs a topological-aware grouping of replicas into local clusters to minimize global communication. GeoBFT also decentralizes consensus by allowing each cluster to make consensus decisions independently.

2. To reduce global communication, we introduce a novel global sharing protocol that optimistically performs minimal inter-cluster communication, while still enabling reliable detection of communication failure.

3. The optimistic global sharing protocol is supported by a novel remote view-change protocol that deals with any malicious behavior and any failures.
Table 2: The normal-case metrics of Bft consensus protocols in a system with z clusters, each with n replicas of which at most f, n > 3f, are Byzantine. GeoBFT provides the lowest global communication cost per consensus decision (transaction) and operates decentralized.

| Protocol            | Decisions | Communication (Local) | Communication (Global) | Centralized |
|---------------------|-----------|-----------------------|------------------------|-------------|
| GeoBFT (our paper)  | z         | O(2zn²)               | O(fz²)                 | No          |
| (single decision)   | 1         | O(4n²)                | O(fz)                  | No          |
| Steward             | 1         | O(2zn²)               | O(z²)                  | Yes         |
| Zyzzyva             | 1         | –                     | O(zn)                  | Yes         |
| Pbft                | 1         | –                     | O(2(zn)²)              | Yes         |
| PoE                 | 1         | –                     | O((zn)²)               | Yes         |
| HotStuff            | 1         | –                     | O(8(zn))               | Partly      |
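As a rough illustration of the Table 2 figures, the following sketch evaluates the normal-case global message counts for a sample deployment. The function name `global_messages` is ours, and the constants inside the O(.) expressions are only meaningful for relative comparison:

```python
# Hypothetical illustration of the normal-case global communication costs
# listed in Table 2, evaluated for a sample deployment.

def global_messages(protocol: str, z: int, n: int, f: int) -> int:
    """Normal-case global message count per round, per Table 2."""
    total = z * n  # replicas across all clusters
    return {
        "GeoBFT":   f * z * z,       # O(f*z^2), for z decisions per round
        "Steward":  z * z,           # O(z^2)
        "Zyzzyva":  total,           # O(zn)
        "Pbft":     2 * total ** 2,  # O(2(zn)^2)
        "PoE":      total ** 2,      # O((zn)^2)
        "HotStuff": 8 * total,       # O(8(zn))
    }[protocol]

# Example: z = 6 clusters of n = 13 replicas each, f = 4 faulty per cluster.
for p in ("GeoBFT", "Steward", "Zyzzyva", "Pbft", "PoE", "HotStuff"):
    print(p, global_messages(p, z=6, n=13, f=4))
```

Even in this small configuration, the quadratic dependence of Pbft and PoE on the total number of replicas zn dwarfs the O(fz²) global cost of GeoBFT.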
Figure 1: Steps in a round of the GeoBFT protocol: (1) local replication: each cluster runs Pbft to select, locally replicate, and certify a client request; (2) inter-cluster sharing: primaries at each cluster share the certified client request with other clusters; (3) ordering and execution: order the certified requests, execute them, and inform local clients.
4. We prove that GeoBFT guarantees safety: it achieves a unique sequence of consensus decisions among all replicas and ensures that clients can reliably detect when their transactions are executed, independent of any malicious behavior by any replicas.

5. We show that GeoBFT guarantees liveness: whenever the network provides reliable communication, GeoBFT continues successful operation, independent of any malicious behavior by any replicas.

6. To validate our vision of using GeoBFT in geo-scale settings, we present our ResilientDB fabric [48] and implement GeoBFT in this fabric.¹

7. We also implemented other state-of-the-art Bft protocols in ResilientDB (Zyzzyva, Pbft, HotStuff, and Steward), and evaluate GeoBFT against these Bft protocols using the YCSB benchmark [25]. We show that GeoBFT achieves up to six times more throughput than existing Bft protocols.
In Table 2, we provide a summary of the complexity of
the normal-case operations of GeoBFT and compare this
to the complexity of other popular Bft protocols.
2. GeoBFT: GEO-SCALE CONSENSUS
We now present our Geo-Scale Byzantine Fault-Tolerant consensus protocol (GeoBFT) that uses topological information to group all replicas in a single region into a single cluster. Likewise, GeoBFT assigns each client to a single cluster. This clustering helps in attaining high throughput and scalability in geo-scale deployments. GeoBFT operates in rounds, and in each round, every cluster will be able to propose a single client request for execution. Next, we sketch the high-level working of such a round of GeoBFT. Each round consists of the three steps sketched in Figure 1: local replication, global sharing, and ordering and execution, which we further detail next.
¹We have open-sourced our ResilientDB fabric at https://resilientdb.com/.
Figure 2: Representation of the normal-case algorithm of GeoBFT running on two clusters. Clients c_i, i ∈ {1, 2}, request transactions T_i from their local cluster C_i. The primary P_Ci ∈ C_i replicates this transaction to all local replicas using Pbft. At the end of local replication, the primary can produce a cluster certificate for T_i. These are shared with other clusters via inter-cluster communication, after which all replicas in all clusters can execute T_i and C_i can inform c_i.
1. At the start of each round, each cluster chooses a single transaction of a local client. Next, each cluster locally replicates its chosen transaction in a Byzantine fault-tolerant manner using Pbft. At the end of successful local replication, Pbft guarantees that each non-faulty replica can prove successful local replication via a commit certificate.

2. Next, each cluster shares the locally-replicated transaction along with its commit certificate with all other clusters. To minimize inter-cluster communication, we use a novel optimistic global sharing protocol. Our optimistic global sharing protocol has a global phase in which clusters exchange locally-replicated transactions, followed by a local phase in which clusters distribute any received transactions locally among all local replicas. To deal with failures, the global sharing protocol utilizes a novel remote view-change protocol.

3. Finally, after receiving all transactions that are locally-replicated in other clusters, each replica in each cluster can deterministically order all these transactions and proceed with their execution. After execution, the replicas in each cluster inform only local clients of the outcome of the execution of their transactions (e.g., confirm execution or return any execution results).
In Figure 2, we sketch a single round of GeoBFT in a
setting of two clusters with four replicas each.
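The three steps above can be sketched as a failure-free, single-process simulation. The `Cluster` class, the stubbed Pbft call, and the certificate tuples below are illustrative stand-ins of our own, not the ResilientDB implementation:

```python
# A failure-free, single-process sketch of one GeoBFT round over z clusters.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    cid: int
    pending: list = field(default_factory=list)   # queued client requests
    received: dict = field(default_factory=dict)  # cid -> (request, certificate)

def local_replication(cluster, rnd):
    # Step 1: the cluster runs Pbft (stubbed here) and certifies one request.
    request = cluster.pending.pop(0)
    certificate = ("commit-cert", cluster.cid, rnd)  # stands in for [<T>_c, rnd]
    return request, certificate

def geo_bft_round(clusters, rnd):
    # Step 1: every cluster certifies one local request independently.
    certified = {c.cid: local_replication(c, rnd) for c in clusters}
    # Step 2: global sharing (stubbed): all clusters learn all certified requests.
    for c in clusters:
        c.received = dict(certified)
    # Step 3: deterministic ordering by cluster id, then execution.
    return [clusters[0].received[cid][0] for cid in sorted(certified)]

clusters = [Cluster(cid=i, pending=[f"T{i}"]) for i in (1, 2)]
print(geo_bft_round(clusters, rnd=0))  # ['T1', 'T2']
```

Because every replica orders the certified requests by cluster identifier, all replicas execute the same sequence without further coordination.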
2.1 Preliminaries
To present GeoBFT in detail, we first introduce the system model we use and the relevant notations.

Let R be a set of replicas. We model a topological-aware system as a partitioning of R into a set of clusters S = {C1, ..., Cz}, in which each cluster Ci, 1 ≤ i ≤ z, is a set of |Ci| = n replicas of which at most f are faulty and can behave in Byzantine, possibly coordinated and malicious, manners. We assume that in each cluster n > 3f.
Remark 2.1. We assumed z clusters with n > 3f replicas each. Hence, n = 3f + j for some j ≥ 1. We use the same failure model as Steward [5], but our failure model differs from the more-general failure model utilized by Pbft, Zyzzyva, and HotStuff [9, 18, 19, 62, 63, 94]. These protocols can each tolerate the failure of up to ⌊zn/3⌋ = ⌊(3fz + zj)/3⌋ = fz + ⌊zj/3⌋ replicas, even if more than f of these failures happen in a single region; whereas GeoBFT and Steward can only tolerate fz failures, of which at most f can happen in a single cluster. E.g., if n = 13, f = 4, and z = 7, then GeoBFT and Steward can tolerate fz = 28 replica failures in total, whereas the other protocols can tolerate 30 replica failures. The failure model we use enables the efficient geo-scale aware design of GeoBFT, without facing well-known communication bounds [32, 35, 36, 37, 41].
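The arithmetic in the remark can be checked with a few lines (the function names are ours):

```python
from math import floor

def geo_bft_tolerance(z: int, f: int) -> int:
    # GeoBFT/Steward: at most f failures per cluster, across z clusters.
    return z * f

def pbft_style_tolerance(z: int, n: int) -> int:
    # Pbft/Zyzzyva/HotStuff over all zn replicas: floor(zn/3).
    return floor(z * n / 3)

# The example from Remark 2.1: n = 13, f = 4, z = 7 (note n > 3f).
print(geo_bft_tolerance(7, 4))      # 28
print(pbft_style_tolerance(7, 13))  # 30
```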
We write f(Ci) to denote the Byzantine replicas in cluster Ci and nf(Ci) = Ci \ f(Ci) to denote the non-faulty replicas in Ci. Each replica R ∈ Ci has a unique identifier id(R), 1 ≤ id(R) ≤ n. We assume that non-faulty replicas behave in accordance with the protocol and are deterministic: on identical inputs, all non-faulty replicas must produce identical outputs. We do not make any assumptions on clients: all clients can be malicious without affecting GeoBFT.
Some messages in GeoBFT are forwarded (for example, the client request and commit certificates during inter-cluster sharing). To ensure that malicious replicas do not tamper with messages while forwarding them, we sign these messages using digital signatures [58, 72]. We write ⟨m⟩u to denote a message m signed by u. We assume that it is practically impossible to forge digital signatures. We also assume authenticated communication: Byzantine replicas can impersonate each other, but no replica can impersonate another non-faulty replica. Hence, on receipt of a message m from replica R ∈ Ci, one can determine that R did send m if R ∉ f(Ci); and one can only determine that m was sent by a non-faulty replica if R ∈ nf(Ci). In the permissioned setting, authenticated communication is a minimal requirement to deal with Byzantine behavior, as otherwise Byzantine replicas can impersonate all non-faulty replicas (which would lead to so-called Sybil attacks) [39]. For messages that are forwarded, authenticated communication is already provided via digital signatures. For all other messages, we use less-costly message authentication codes [58, 72]. Replicas will discard any messages that are not well-formed, have invalid message authentication codes (if applicable), or have invalid signatures (if applicable).
Next, we define the consensus provided by GeoBFT.

Definition 2.2. Let S be a system over R. A single run of any consensus protocol should satisfy the following two requirements:

Termination: Each non-faulty replica in R executes a transaction.
Non-divergence: All non-faulty replicas execute the same transaction.

Termination is typically referred to as liveness, whereas non-divergence is typically referred to as safety. A single round of our GeoBFT consists of z consecutive runs of the Pbft consensus protocol. Hence, in a single round of GeoBFT, all non-faulty replicas execute the same sequence of z transactions.
To provide safety, we do not need any other assumptions on communication or on the behavior of clients. Due to well-known impossibility results for asynchronous consensus [15, 16, 42, 43], we can only provide liveness in periods of reliable bounded-delay communication during which all messages sent by non-faulty replicas will arrive at their destination within some maximum delay.
2.2 Local Replication
In the ﬁrst step of GeoBFT, the local replication step,
each cluster will independently choose a client request to
execute. Let
S
be a system. Each round
ρ
of GeoBFT
starts with each cluster
C ∈ S
replicating a client request
T
of client
cclients
(
C
). To do so, GeoBFT relies on
Pbft [18, 19],
2
a primary-backup protocol in which one
replica acts as the primary, while all the other replicas act as
backups. In Pbft, the primary is responsible for coordinating
the replication of client transactions. We write P
C
to denote
the replica in
C
that is the current local primary of cluster
C
. The normal-case of Pbft operates in four steps which we
sketch in Figure 3. Next, we detail these steps.
First, the primary P_C receives client requests of the form ⟨T⟩c, transactions T signed by a local client c ∈ clients(C). Then, in round ρ, P_C chooses a request ⟨T⟩c and initiates the replication of this request by proposing it to all replicas via a preprepare message. When a backup replica receives a preprepare message from the primary, it agrees to participate in a two-phase Byzantine commit protocol. This commit protocol can succeed if at least n − 2f non-faulty replicas receive the same preprepare message.

In the first phase of the Byzantine commit protocol, each replica R responds to the preprepare message m of the primary by casting a prepare message in support of m. After casting the prepare message, R waits until it receives nf prepare messages in support of m (indicating that at least n − 2f non-faulty replicas support m). Finally, after receiving these messages, R enters the second phase of the Byzantine commit protocol and broadcasts a commit message in support of m. Once a replica R receives nf commit messages in support of m, it has the guarantee that eventually all replicas will commit to ⟨T⟩c.
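The quorum logic of the two commit phases can be sketched as a per-replica message counter. This is a minimal sketch of our own, assuming nf = n − f; networking, signatures, view-changes, and request payloads are omitted:

```python
# Sketch of the per-replica quorum logic in Pbft's two commit phases.
class PbftReplica:
    def __init__(self, n: int, f: int):
        self.n, self.f = n, f
        self.nf = n - f          # quorum size used in both phases
        self.prepares = set()    # ids of replicas that sent prepare for m
        self.commits = set()     # ids of replicas that sent commit for m
        self.phase = "preprepare"

    def on_preprepare(self, my_id: int):
        self.phase = "prepare"
        self.on_prepare(my_id)   # a replica counts its own prepare message

    def on_prepare(self, sender: int):
        self.prepares.add(sender)
        if self.phase == "prepare" and len(self.prepares) >= self.nf:
            self.phase = "commit"     # at least n - 2f non-faulty support m

    def on_commit(self, sender: int):
        self.commits.add(sender)
        if self.phase == "commit" and len(self.commits) >= self.nf:
            self.phase = "committed"  # enough commits for a commit certificate

r = PbftReplica(n=4, f=1)
r.on_preprepare(my_id=0)
for sender in (1, 2, 3):
    r.on_prepare(sender)
for sender in (0, 1, 2):
    r.on_commit(sender)
print(r.phase)  # committed
```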
This protocol exchanges sufficient information among all replicas to enable detection of malicious behavior of the primary and to recover from any such behavior. Moreover, on success, each non-faulty replica R ∈ C will be committed to the proposed request ⟨T⟩c and will be able to construct a commit certificate [⟨T⟩c, ρ]_R that proves this commitment. In GeoBFT, this commit certificate consists of the client request ⟨T⟩c and nf > 2f identical commit messages for ⟨T⟩c signed by distinct replicas. Optionally, GeoBFT can use threshold signatures to represent these nf signatures via a single constant-sized threshold signature [85].

In GeoBFT, we use a Pbft implementation that only uses digital signatures for client requests and commit messages, as these are the only messages that need forwarding. In this configuration, Pbft provides the following properties:
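A commit certificate check along these lines might look as follows. Signature verification is stubbed out, and the tuple layout is an assumption of this sketch, not the ResilientDB wire format:

```python
# Sketch of checking a commit certificate [<T>_c, rho]: it must carry
# nf = n - f > 2f commit messages for the same request and round, signed
# by distinct replicas. Real deployments also verify each signature.

def valid_commit_certificate(cert, n: int, f: int) -> bool:
    request, rnd, commits = cert  # commits: list of (replica_id, request, rnd)
    # Count distinct signers whose commit matches the certified request/round.
    signers = {rid for rid, req, r in commits if req == request and r == rnd}
    return len(signers) >= n - f and n - f > 2 * f

cert = ("T1", 0, [(1, "T1", 0), (2, "T1", 0), (3, "T1", 0)])
print(valid_commit_certificate(cert, n=4, f=1))  # True
```

Since nf > 2f, any forged certificate would need more signatures than the f Byzantine replicas can produce, which is why a forwarded certificate can be checked without re-running consensus.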
Lemma 2.3 (Castro et al. [18, 19]). Let S be a system and let C ∈ S be a cluster with n > 3f. We have the following:
²Other consensus protocols such as Zyzzyva [9, 62, 63] and HotStuff [94] promise to improve on Pbft by sharply reducing communication. In our setting, where local communication is abundant (see Table 1), such improvements are unnecessary, and the costs of Zyzzyva (reliable clients) and HotStuff (high computational complexity) can be avoided.
Figure 3: The normal-case working of round ρ of Pbft within a cluster C: a client c requests transaction T, the primary P_C proposes this request to all local replicas, which prepare and commit this proposal, and, finally, all replicas can construct a commit certificate.
Termination: If communication is reliable, has bounded delay, and a replica R ∈ C is able to construct a commit certificate [⟨T⟩c, ρ]_R, then all non-faulty replicas R′ ∈ nf(C) will eventually be able to construct a commit certificate [⟨T′⟩c′, ρ]_R′.

Non-divergence: If replicas R1, R2 ∈ C are able to construct commit certificates [⟨T1⟩c1, ρ]_R1 and [⟨T2⟩c2, ρ]_R2, respectively, then T1 = T2 and c1 = c2.
From Lemma 2.3, we conclude that all commit certificates constructed by replicas in C for round ρ show commitment to the same client request ⟨T⟩c. Hence, we write [⟨T⟩c, ρ]_C to represent a commit certificate from some replica in cluster C.
To guarantee the correctness of Pbft (Lemma 2.3), we need to prove that both non-divergence and termination hold. From the normal-case working outlined above and in Figure 3, Pbft guarantees non-divergence independent of the behavior of the primary or any malicious replicas.

To guarantee termination when communication is reliable and has bounded delay, Pbft uses view-changes and checkpoints. If the primary is faulty and prevents any replica from making progress, then the view-change protocol enables non-faulty replicas to reliably detect primary failure, recover a common non-divergent state, and trigger primary replacement until a non-faulty primary is found. After a successful view-change, progress is resumed. We refer to these Pbft-provided view-changes as local view-changes. The checkpoint protocol enables non-faulty replicas to recover from failures and malicious behavior that do not trigger a view-change.
2.3 Inter-Cluster Sharing
Once a cluster has completed local replication of a client
request, it proceeds with the second step: sharing the client
request with all other clusters. Let
S
be a system and
C ∈ S
be a cluster. After
C
reaches local consensus on client
request
hTic
in round
ρ
—enabling construction of the commit
certiﬁcate [
hTic, ρ
]
C
that proves local consensus—
C
needs to
exchange this client request and the accompanying proof with
all other clusters. This exchange step requires global inter-
cluster communication, which we want to minimize while
retaining the ability to reliably detect failure of the sender.
However, minimizing this inter-cluster communication is not
as straightforward as it sounds, which we illustrate next:
Example 2.4. Let S be a system with two clusters C1, C2 ∈ S. Consider a simple global communication protocol in which a message m is sent from C1 to C2 by requiring the primary P_C1 to send m to the primary P_C2 (which can then disseminate m in C2). In this protocol, the replicas in C2 cannot determine what went wrong if they do not receive any messages. To show this, we distinguish two cases:

(1) P_C1 is Byzantine and behaves correctly toward every replica, except that it never sends messages to P_C2, while P_C2 is non-faulty.

(2) P_C1 is non-faulty, while P_C2 is Byzantine and behaves correctly toward every replica, except that it drops all messages sent by P_C1.

In both cases, the replicas in C2 do not receive any messages from C1, while both clusters see correct behavior of their primaries with respect to local consensus. Indeed, with this little amount of communication, it is impossible for replicas in C2 to determine whether P_C1 is faulty (and did not send any messages) or P_C2 is faulty (and did not forward any messages).

Figure 4: A schematic representation of the normal-case working of the global sharing protocol used by C1 to send m = (⟨T⟩c, [⟨T⟩c, ρ]_C1) to C2.

The global phase (used by the primary P_C1):
1: Choose a set S of f + 1 replicas in C2.
2: Send m to each replica in S.
The local phase (used by replicas R ∈ C2):
3: event receive m from a replica Q ∈ C1 do
4: Broadcast m to all replicas in C2.

Figure 5: The normal-case global sharing protocol used by C1 to send m = (⟨T⟩c, [⟨T⟩c, ρ]_C1) to C2.
In GeoBFT, we employ an optimistic approach to reduce communication among the clusters. Our optimistic approach consists of a low-cost normal-case protocol that will succeed when communication is reliable and the primary of the sending cluster is non-faulty. To deal with any failures, we use a remote view-change protocol that guarantees eventual normal-case behavior when communication is reliable. First, we describe the normal-case protocol, after which we will describe in detail the remote view-change protocol.
Optimistic inter-cluster sending. In the optimistic case, where participants are non-faulty, we want to send a minimum number of messages while retaining the ability to reliably detect failure of the sender. In Example 2.4, we already showed that sending only a single message is not sufficient. Sending f + 1 messages is sufficient, however.

Let m = (⟨T⟩c, [⟨T⟩c, ρ]_C1) be the message that some replica in cluster C1 needs to send to some replicas in C2. Note that m includes the request replicated in C1 in round ρ, and the commit certificate, which is the proof that such a replication did take place. Based on the observations made above, we propose a two-phase normal-case global sharing protocol. We sketch this normal-case sending protocol in Figure 4 and present the detailed pseudo-code for this protocol in Figure 5. In the global phase, the primary P_C1 sends m to f + 1 replicas in C2. In the local phase, each non-faulty replica R ∈ nf(C2) that receives m forwards m to all replicas in its cluster C2.
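The f + 1 argument can be illustrated with a small simulation; the set-based "network" below is a stand-in of our own, as only the counting argument matters:

```python
import random

# Sketch of the normal-case global sharing protocol of Figure 5: the primary
# of C1 sends m to f + 1 replicas of C2; every non-faulty receiver broadcasts
# m locally. With at most f Byzantine replicas in C2, at least one of the
# f + 1 receivers is non-faulty, so all of C2 learns m.

def global_share(m, c2_ids, byzantine, f):
    delivered = set()
    targets = c2_ids[: f + 1]            # global phase: only f + 1 messages
    for replica in targets:
        if replica not in byzantine:     # local phase: honest receivers
            delivered.update(c2_ids)     # broadcast m inside C2
    return delivered

c2 = [1, 2, 3, 4]          # n = 4 replicas, f = 1
bad = {random.choice(c2)}  # any single replica of C2 may be Byzantine
assert global_share("m", c2, bad, f=1) == set(c2)
print("all replicas of C2 received m")
```

Whichever replica is Byzantine, the pigeonhole argument guarantees at least one honest receiver among the f + 1 targets, and that receiver's local broadcast suffices.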
Proposition 2.5. Let S be a system, let C1, C2 ∈ S be two clusters, and let m = (⟨T⟩c, [⟨T⟩c, ρ]_C1) be the message C1 sends to C2 using the normal-case global sharing protocol of Figure 5. We have the following:

Receipt: If the primary P_C1 is non-faulty and communication is reliable, then every replica in C2 will eventually receive m.

Agreement: Replicas in C2 will only accept client request ⟨T⟩c from C1 in round ρ.
Proof. If the primary P_C1 is non-faulty and communication is reliable, then f + 1 replicas in C2 will receive m (Line 2). As at most f replicas in C2 are Byzantine, at least one of these receiving replicas is non-faulty and will forward this message m to all replicas in C2 (Line 4), proving termination.

The commit certificate [⟨T⟩c, ρ]_C1 cannot be forged by faulty replicas, as it contains signed commit messages from nf > f replicas. Hence, the integrity of any message m forwarded by replicas in C2 can easily be verified. Furthermore, Lemma 2.3 rules out the existence of any other messages m′ = [⟨T′⟩c′, ρ]_C1, proving agreement.
We notice that there are two cases in which replicas in C2 do not receive m from C1: either P_C1 is faulty and did not send m to f + 1 replicas in C2, or communication is unreliable, and messages are delayed or lost. In both cases, non-faulty replicas in C2 initiate remote view-change to force primary replacement in C1 (causing replacement of the primary P_C1).
Remote view-change. The normal-case global sharing protocol outlined will only succeed if communication is reliable and the primary of the sending cluster is non-faulty. To recover from any failures, we provide a remote view-change protocol. Let S = {C1, ..., Cz} be a system. To simplify presentation, we focus on the case in which the primary of cluster C1 fails to send m = (⟨T⟩c, [⟨T⟩c, ρ]_C1) to replicas of C2. Our remote view-change protocol consists of four phases, which we detail next.
First, non-faulty replicas in cluster C2 detect the failure of the current primary P_C1 of C1 to send m. Note that although the replicas in C2 have no information about the contents of message m, they are awaiting arrival of a well-formed message m from C1 in round ρ. Second, the non-faulty replicas in C2 initiate agreement on failure detection. Third, after reaching agreement, the replicas in C2 send their request for a remote view-change to the replicas in C1 in a reliable manner. In the fourth and last phase, the non-faulty replicas in C1 trigger a local view-change, replace P_C1, and instruct the new primary to resume global sharing with C2. Next, we explain each phase in detail.
To be able to detect failure, C2 must assume reliable communication with bounded delay. This allows the usage of timers to detect failure. To do so, every replica R ∈ C2 sets a timer for C1 at the start of round ρ and waits until it receives a message m from C1. If the timer expires before R receives m, then R detects failure of C1 in round ρ. Successful detection will eventually lead to a remote view-change request.

From the perspective of C1, remote view-changes are controlled by external parties. This leads to several challenges
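The timer-based detection could be sketched as follows; the class name and timeout value are illustrative choices of ours, not taken from ResilientDB:

```python
import time

# Sketch of the failure-detection timer: a replica R in C2 starts a timer
# for C1 at the beginning of round rho; if no well-formed message from C1
# arrives before it expires, R marks C1 as failed and moves toward a
# remote view-change.

class RemoteTimer:
    def __init__(self, timeout_s: float):
        self.deadline = time.monotonic() + timeout_s
        self.received = False

    def deliver(self):            # called when m arrives from C1
        self.received = True

    def failure_detected(self):   # polled by the replica
        return not self.received and time.monotonic() > self.deadline

timer = RemoteTimer(timeout_s=0.01)
time.sleep(0.02)                  # no message from C1 arrives in time
print(timer.failure_detected())  # True
```

Using a monotonic clock keeps the deadline immune to wall-clock adjustments; in a real deployment the timeout would be tuned to the inter-cluster round-trip times of Table 1.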
Figure 6: A schematic representation of the remote view-change protocol of GeoBFT running at a system S over R. This protocol is triggered when a cluster C2 ∈ S expects a message from C1 ∈ S, but does not receive this message in time.
not faced by traditional Pbft view-changes (the local view-changes used within clusters, e.g., as part of local replication):

(1) A remote view-change in C1 requested by C2 should only trigger at most a single local view-change in C1, otherwise remote view-changes enable replay attacks.

(2) While replicas in C1 detect failure of P_C1 and initiate local view-change, it is possible that C2 detects failure of C1 and requests remote view-change in C1. In this case, only a single successful view-change in C1 is necessary.

(3) Likewise, several clusters C2, ..., Cz can simultaneously detect failure of C1 and request remote view-change in C1. Also in this case, only a single successful view-change in C1 is necessary.

Furthermore, a remote view-change request for cluster C1 cannot depend on any information only available to C1 (e.g., the current primary P_C1 of C1). Likewise, the replicas in C1 cannot determine which messages (for which rounds) have already been sent by previous (possibly malicious) primaries of C1: remote view-change requests must include this information. Our remote view-change protocol addresses each of these concerns. In Figures 6 and 7, we sketch this protocol and its pseudo-code. Next, we describe the protocol in detail.
Let R ∈ C2 be a replica that detects failure of C1 in round ρ, after having requested v1 remote view-changes in C1. Once replica R detects a failure, it initiates the process of reaching an agreement on this failure among the other replicas of its cluster C2. It does so by broadcasting message DRvc(C1, ρ, v1) to all replicas in C2 (Line 3 of Figure 7).
Next, R waits until it receives DRvc(C1, ρ, v1) messages from nf distinct replicas in C2 (Line 12 of Figure 7). This guarantees that there is agreement among the non-faulty replicas in C2 that C1 has failed. After receiving these nf messages, R requests a remote view-change by sending message ⟨Rvc(C1, ρ, v1)⟩R to the replica Q ∈ C1 with id(R) = id(Q) (Line 13 of Figure 7).
In case some other replica R′ ∈ C2 did receive the expected message m from C1, then R′ would respond with message m in response to the message DRvc(C1, ρ, v) (Line 5 of Figure 7). This allows R to recover in cases where it could not reach an agreement on the failure of C1. Finally, some replica R′ ∈ C2 may detect the failure of C1 later than R. To handle such a case, we require each replica R′ that receives DRvc(C1, ρ, v) messages from f + 1 distinct replicas in C2 to assume that
Initiation role (used by replicas R ∈ C2):
1:  v1 := 0 (number of remote view-changes in C1 requested by R).
2:  event detect failure of C1 in round ρ do
3:    Broadcast DRvc(C1, ρ, v1) to all replicas in C2.
4:    v1 := v1 + 1.
5:  event R receives DRvc(C1, ρ, v1) from R′ ∈ C2 do
6:    if R received (⟨T⟩c, [⟨T⟩c, ρ]C) from Q ∈ C1 then
7:      Send (⟨T⟩c, [⟨T⟩c, ρ]C) to R′.
8:  event R receives DRvc(C1, ρ, v′1) from f + 1 replicas in C2 do
9:    if v1 ≤ v′1 then
10:     v1 := v′1.
11:     Detect failure of C1 in round ρ (if not yet done so).
12: event R receives DRvc(C1, ρ, v1) from nf replicas in C2 do
13:   Send ⟨Rvc(C1, ρ, v1)⟩R to Q ∈ C1, id(R) = id(Q).

Response role (used by replicas Q ∈ C1):
14: event Q receives ⟨Rvc(C1, ρ, v)⟩R from replica R ∈ (R \ C1) do
15:   Broadcast ⟨Rvc(C1, ρ, v)⟩R to all replicas in C1.
16: event Q receives messages ⟨Rvc(C1, ρ, v)⟩Ri, 1 ≤ i ≤ f + 1, such that:
      1. {Ri | 1 ≤ i ≤ f + 1} ⊂ C′, C′ ∈ S;
      2. |{Ri | 1 ≤ i ≤ f + 1}| = f + 1;
      3. no recent local view-change was triggered; and
      4. C′ did not yet request a v-th remote view-change
    do
17:   Detect failure of PC1 (if not yet done so).

Figure 7: The remote view-change protocol of GeoBFT running at a system S over R. This protocol is triggered when a cluster C2 ∈ S expects a message from C1 ∈ S, but does not receive this message in time.
the cluster C1 has failed. This assumption is valid, as one of these f + 1 messages must have come from a non-faulty replica in C2, which must have detected the failure of cluster C1 successfully (Line 8 of Figure 7).
If replica Q ∈ C1 receives a message mRvc = ⟨Rvc(ρ, v)⟩R from R ∈ C2, then Q verifies whether mRvc is well-formed. If mRvc is well-formed, Q forwards mRvc to all replicas in C1 (Line 14 of Figure 7). Once Q receives f + 1 messages identical to mRvc, signed by distinct replicas in C2, it concludes that at least one of these remote view-change requests must have come from a non-faulty replica in C2. Next, Q determines whether it will honor this remote view-change request, which Q will do when no concurrent local view-change is in progress and when this is the first v-th remote view-change requested by C2 (the latter prevents replay attacks). If these conditions are met, Q detects its current primary PC1 as faulty (Line 16 of Figure 7).
When communication is reliable, the above protocol ensures that all non-faulty replicas in C1 will detect failure of PC1. Hence, eventually a successful local view-change will be triggered in C1. When a new primary in C1 is elected, it takes one of the remote view-change requests it received and determines the rounds for which it needs to send requests (using the normal-case global sharing protocol of Figure 5). As replicas in C2 do not know the exact communication delays, they use exponential back-off to determine the timeouts used while detecting subsequent failures of C1.
We are now ready to prove the main properties of remote view-changes.

Proposition 2.6. Let S be a system, let C1, C2 ∈ S be two clusters, and let m = (⟨T⟩c, [⟨T⟩c, ρ]C) be the message C1 needs to send to C2 in round ρ. If communication is reliable and has bounded delay, then either every replica in C2 will receive m or C1 will perform a local view-change.
Proof. Consider the remote view-change protocol of Figure 7. If a non-faulty replica R′ ∈ nf(C2) received m, then any replica in C2 that did not receive m will obtain m from R′ (Line 5). In all other cases, at least f + 1 non-faulty replicas in C2 did not receive m and will timeout. Due to exponential back-off, eventually each of these f + 1 non-faulty replicas will initiate and agree on the same v1-th remote view-change. Consequently, all non-faulty replicas in nf(C2) will participate in this remote view-change (Line 8). As |nf(C2)| = nf, each of these nf replicas R ∈ nf(C2) will send ⟨Rvc(C1, ρ, v)⟩R to some replica Q ∈ C1, id(R) = id(Q) (Line 12). Let S = {Q ∈ C1 | ∃R ∈ nf(C2), id(R) = id(Q)} be the set of replicas in C1 that receive these messages and let T = S ∩ nf(C1). We have |S| = nf > 2f and, hence, |T| > f. Each replica Q ∈ T broadcasts the received requests to all replicas in C1 (Line 14). As |T| > f, this eventually triggers a local view-change in C1 (Line 16).
Finally, we use the results of Proposition 2.5 and Proposition 2.6 to conclude:

Theorem 2.7. Let S = {C1, ..., Cz} be a system over R. If communication is reliable and has bounded delay, then every replica R ∈ R will, in round ρ, receive the set {(⟨Ti⟩ci, [⟨Ti⟩ci, ρ]Ci) | (1 ≤ i ≤ z) ∧ (ci ∈ clients(Ci))} of z messages. These sets all contain identical client requests.
Proof. Consider cluster Ci ∈ S. If PCi behaves reliably, then Proposition 2.5 already proves the statement with respect to (⟨Ti⟩ci, [⟨Ti⟩ci, ρ]Ci). Otherwise, if PCi behaves Byzantine, then Proposition 2.6 guarantees that either all replicas in R will receive (⟨Ti⟩ci, [⟨Ti⟩ci, ρ]Ci) or PCi will be replaced via a local view-change. Eventually, these local view-changes will lead to a non-faulty primary in Ci, after which Proposition 2.5 again proves the statement with respect to (⟨Ti⟩ci, [⟨Ti⟩ci, ρ]Ci).
2.4 Ordering and Execution

Once replicas of a cluster have chosen a client request for execution and have received all client requests chosen by other clusters, they are ready for the final step: ordering and executing these client requests. Specifically, in round ρ, any non-faulty replica that has valid requests from all clusters can move ahead and execute these requests.

Theorem 2.7 guarantees that, after the local replication step (Section 2.2) and the inter-cluster sharing step (Section 2.3), each replica in R will receive the same set of z client requests in round ρ. Let Sρ = {⟨Ti⟩ci | (1 ≤ i ≤ z) ∧ (ci ∈ clients(Ci))} be this set of z client requests received by each replica.
The last step is to put these client requests in a unique order, execute them, and inform the clients of the outcome. To do so, GeoBFT simply uses a pre-defined ordering on the clusters. For example, each replica executes the transactions in the order [T1, ..., Tz]. Once the execution is complete, each replica R ∈ Ci, 1 ≤ i ≤ z, informs the client ci of any outcome (e.g., confirmation of execution or the result of execution). Note that each replica R only informs its local clients. As all non-faulty replicas are expected to act deterministically, execution will yield the same state and results across all non-faulty replicas. Hence, each client ci is guaranteed to receive identical responses from at least f + 1 replicas. As there are at most f faulty replicas per cluster and faulty replicas cannot impersonate non-faulty replicas, at least one of these f + 1 responses must come from a non-faulty replica. We conclude the following:
Theorem 2.8 (GeoBFT is a consensus protocol). Let S be a system over R in which every cluster satisfies n > 3f. A single round of GeoBFT satisfies the following two requirements:

Termination: If communication is reliable and has bounded delay, then GeoBFT guarantees that each non-faulty replica in R executes z transactions.

Non-divergence: GeoBFT guarantees that all non-faulty replicas execute the same z transactions.

Proof. Both termination and non-divergence are direct corollaries of Theorem 2.7.
2.5 Final Remarks

Until now, we have presented the design of GeoBFT using a strict notion of rounds. Only the last step of each round of GeoBFT, which orders and executes client requests (Section 2.4), requires this strict notion of rounds. All other steps can be performed out-of-order. For example, local replication and inter-cluster sharing of client requests for future rounds can happen in parallel with ordering and execution of client requests. Specifically, the replicas of a cluster Ci, 1 ≤ i ≤ z, can replicate the requests for round ρ + 2, share the requests for round ρ + 1 with other clusters, and execute requests for round ρ in parallel. Hence, GeoBFT needs minimal synchronization between clusters.
Additionally, we do not require that every cluster always has client requests available. When a cluster C does not have client requests to execute in a round, the primary PC can propose a no-op request. The primary PC can detect the need for such a no-op request in round ρ when it starts receiving client requests for round ρ from other clusters. As with all requests, such no-op requests also require commit certificates obtained via local replication.
To prevent PC from indefinitely ignoring requests from some or all clients in clients(C), we rely on standard Pbft techniques to detect and resolve such attacks during local replication. These techniques effectively allow clients in clients(C) to force the cluster to process their requests, ruling out the ability of faulty primaries to indefinitely propose no-op requests when client requests are available.
Furthermore, to simplify presentation, we have assumed that every cluster has exactly the same size and that the set of replicas never changes. These assumptions can be lifted, however. GeoBFT can easily be extended to also work with clusters of varying size; this only requires minor tweaks to the remote view-change protocol of Figure 7 (the conditions at Line 16 rely on the cluster sizes, see Proposition 2.6). To deal with faulty replicas that eventually recover, we can rely on the same techniques as Pbft [18, 19]. Full dynamic membership, in which replicas can join and leave GeoBFT via some vetted automatic procedure, is a challenge for any permissioned blockchain and remains an open problem for future work [13, 83].
3. IMPLEMENTATION IN ResilientDB

GeoBFT is designed to enable geo-scale deployment of a permissioned blockchain. Next, we present our ResilientDB fabric [48], a permissioned blockchain fabric that uses GeoBFT to provide such a geo-scale aware high-performance
Figure 8: Architecture of our ResilientDB Fabric. (The figure shows two clusters, each with a local primary and backup replicas; client requests are batched into blocks, proposed and certified via local consensus, exchanged between clusters with minimal communication, executed, appended to the persistent ledger, and acknowledged to the local clients via client responses.)
permissioned blockchain. ResilientDB is especially tuned to enterprise-level blockchains in which (i) replicas can be dispersed over a wide area network; (ii) links connecting replicas at large distances have low bandwidth; (iii) replicas are untrusted but known; and (iv) applications require high throughput and low latency. These four properties are directly motivated by practical properties of geo-scale deployed distributed systems (see Table 1 in Section 1). In Figure 8, we present the architecture of ResilientDB.

At the core of ResilientDB is the ordering of client requests and appending them to the ledger—the immutable append-only blockchain representing the ordered sequence of accepted client requests. The order of each client request will be determined via GeoBFT, which we described in Section 2. Next, we focus on how the ledger is implemented and on other practical details that enable geo-scale performance.
The ledger (blockchain). The key purpose of any blockchain fabric is the maintenance of the ledger: the immutable append-only blockchain representing the ordered sequence of client requests accepted. In ResilientDB, the i-th block in the ledger consists of the i-th executed client request. Recall that in each round ρ of GeoBFT, each replica executes z requests, each belonging to a different cluster Ci, 1 ≤ i ≤ z. Hence, in each round ρ, each replica creates z blocks in the order of execution of the z requests. To assure immutability of each block, the block not only consists of the client request, but also contains a commit certificate. This prevents tampering of any block, as only a single commit certificate can be made per cluster per GeoBFT round (Lemma 2.3). As ResilientDB is designed to be a fully-replicated blockchain, each replica independently maintains a full copy of the ledger. The immutable structure of the ledger also helps when recovering replicas: tampering of its ledger by any replica can easily be detected. Hence, a recovering replica can simply read the ledger of any replica it chooses and directly verify whether the ledger can be trusted (is not tampered with).
Cryptography. The implementation of GeoBFT and the ledger requires the availability of strong cryptographic primitives, e.g., to provide digital signatures and authenticated communication (see Section 2.1). To do so, ResilientDB employs strong cryptographic primitives [10]. Specifically, we use ED25519-based digital signatures to sign our messages and AES-CMAC message authentication codes to implement authenticated communication [58]. Further, we employ SHA256 to generate collision-resistant message digests.
Pipelined consensus. From our experience designing and implementing Byzantine consensus protocols, we know that throughput can be limited by waiting (e.g., due to message latencies) or by computational costs (e.g., costs of signing and verifying messages). To address both issues simultaneously, ResilientDB provides a multi-threaded pipelined architecture for the implementation of consensus protocols. In Figure 9, we have illustrated how GeoBFT is implemented in this architecture.

With each replica, we associate a set of input threads that receive messages from the network. The primary has one input thread dedicated to accepting client requests, which this input thread places on the batch queue for further processing. All replicas have two input threads for processing all other messages (e.g., those related to local replication and global sharing). Each replica has two output threads for sending messages. The ordering (consensus) and execution of client requests is done by the worker, execute, and certify threads.
Request batching. In the design of ResilientDB and GeoBFT, we support the grouping of client requests in batches. Clients can group their requests in batches and send these batches to their local cluster. Furthermore, local primaries can group requests of different clients into a single batch. Each batch is then processed by the consensus protocol (GeoBFT) as a single request, thereby sharing the cost associated with reaching consensus among each of the requests in the batch. Such request batching is a common practice in high-performance blockchain systems and can be applied to a wide range of workloads (e.g., processing financial transactions).
When an input thread at a local primary PC receives a batch of client requests from a client, the thread assigns it a linearly increasing number and places the batch in the batch queue. Via this assigned number, PC determines the order in which this batch needs to be processed (and the following consensus steps are only necessary to communicate this order reliably with all other replicas). Next, the batching thread of PC takes request batches from the batch queue and initiates local replication. To do so, the batching thread initializes the data-structures used by the local replication step, creates a valid pre-prepare message for the batch (see Section 2.2), and puts this message on the output queue. The output threads dequeue these messages from the output queue and send them to all intended recipients.
When an input thread at replica R ∈ C receives a message as part of local replication (e.g., pre-prepare, prepare, commit; see Section 2.2), then it places this message in the work queue. Next, the worker thread processes any incoming messages placed on the work queue by performing the steps of Pbft, the local replication protocol. At the end of local replication, when the worker thread has received nf identical commit messages for a batch, it notifies the certify thread. The certify thread creates a commit certificate corresponding to this notification and places this commit certificate in the output queue, initiating inter-cluster sharing (see Section 2.3).

When an input thread at replica R ∈ C receives a message
Figure 9: The multi-threaded implementation of GeoBFT in ResilientDB. Left, the implementation for local primaries (input, batching, worker, certify, execute, and output threads). Right, the implementation for other replicas (input, worker, certify, execute, and output threads).
as part of global sharing (a commit certificate), then it places this message in the certify queue. Next, the certify thread processes any incoming messages placed on the certify queue by performing the steps of the global sharing protocol. If the certify thread has processed commit certificates for batches in round ρ from all clusters, then it notifies the execute thread that round ρ can be ordered and executed.

The execute thread at replica R ∈ C waits for a notification for all commit certificates associated with the next round. These commit certificates correspond to all the client batches that need to be ordered and executed. When these notifications arrive, the execute thread performs the steps described in Section 2.4.
Other protocols. ResilientDB also provides implementations of four other state-of-the-art consensus protocols, namely, Pbft, Zyzzyva, HotStuff, and Steward (see Section 1.1). Each of these protocols uses the multi-threaded pipelined architecture of ResilientDB and is structured similarly to the design of GeoBFT. Next, we provide some details on each of these protocols.

We already covered the working of Pbft in Section 2.2. Next, Zyzzyva is designed with the optimal case in mind: it requires non-faulty clients and depends on clients to aid in the recovery from any failures. To do so, clients in Zyzzyva require identical responses from all n replicas. If these are not received, the client initiates recovery by forwarding certificates for any requests with sufficiently many (at least nf) responses. This greatly reduces performance when any replicas are faulty. In ResilientDB, the certify thread at each replica processes these recovery certificates.
HotStuff is designed to reduce the communication of Pbft. To do so, HotStuff uses threshold signatures to combine nf message signatures into a single signature. As there is no readily available implementation of threshold signatures for the Crypto++ library, we skip the construction and verification of threshold signatures in our implementation of HotStuff. Moreover, we allow each replica of HotStuff to act as a primary in parallel without requiring the usage of pacemaker-based synchronization [94]. Both decisions give our HotStuff implementation a substantial performance advantage.
Finally, we also implemented Steward. This protocol groups replicas into clusters, similar to GeoBFT. Different from GeoBFT, Steward designates one of these clusters as the primary cluster, which coordinates all operations. To reduce inter-cluster communication, Steward uses threshold signatures. As with HotStuff, we omitted these threshold signatures in our implementation.
4. EVALUATION

To showcase the practical value of GeoBFT, we now use our ResilientDB fabric to evaluate GeoBFT against four other popular state-of-the-art consensus protocols (Pbft, Zyzzyva, HotStuff, and Steward). We deploy ResilientDB on the Google Cloud using N1 machines that have 8-core Intel Skylake CPUs and 16 GB of main memory. Additionally, we deploy 160 k clients on eight 4-core machines having 16 GB of main memory. We equally distribute the clients across all the regions used in each experiment.
In each experiment, the workload is provided by the Yahoo Cloud Serving Benchmark (YCSB) [25]. Each client transaction queries a YCSB table with an active set of 600 k records. For our evaluation, we use write queries, as those are typically more costly than read-only queries. Prior to the experiments, each replica is initialized with an identical copy of the YCSB table. The client transactions generated by YCSB follow a uniform Zipfian distribution. Clients and replicas can batch transactions to reduce the cost of consensus. In our experiments, we use a batch size of 100 requests per batch (unless stated otherwise).
With a batch size of 100, the messages have sizes of 5.4 kB (preprepare), 6.4 kB (commit certificates containing seven commit messages and a preprepare message), 1.5 kB (client responses), and 250 B (other messages). The size of a commit certificate is largely dependent on the size of the preprepare message, while the total size of the accompanying commit messages is small. Hence, the inter-cluster sharing of these certificates is not a bottleneck for GeoBFT: existing Bft protocols send preprepare messages to all replicas irrespective of their region. Further, if the size of commit messages starts dominating, then threshold signatures can be adopted to reduce their cost [85].
To perform geo-scale experiments, we deploy replicas across six different regions, namely Oregon, Iowa, Montreal, Belgium, Taiwan, and Sydney. In Table 1, we present our measurements on the inter-region network latency and bandwidth. We run each experiment for 180 s: first, we allow the system to warm up for 60 s, after which we collect measurement results for the next 120 s. We average the results of our experiments over three runs.
For Pbft and Zyzzyva, centralized protocols in which a single primary replica coordinates consensus, we placed the primary in Oregon, as this region has the highest bandwidth to all other regions (see Table 1). For HotStuff, our implementation permits all replicas to act as both primary and non-primary at the same time. For both GeoBFT and Steward, we group replicas in a single region into a single cluster. In each of these protocols, each cluster has its own local primary. Finally, for Steward, a centralized protocol
in which the primary cluster coordinates the consensus, we placed the primary cluster in Oregon. We focus our evaluation on answering the following four research questions:
(1) What is the impact of geo-scale deployment of replicas in distant clusters on the performance of GeoBFT, as compared to other consensus protocols?

(2) What is the impact of the size of local clusters (relative to the number of clusters) on the performance of GeoBFT, as compared to other consensus protocols?

(3) What is the impact of failures on the performance of GeoBFT, as compared to other consensus protocols?

(4) Finally, what is the impact of request batching on the performance of GeoBFT, as compared to other consensus protocols, and under which batch sizes can GeoBFT already provide good throughput?
4.1 Impact of Geo-Scale Deployment

First, we determine the impact of geo-scale deployment of replicas in distant regions on the performance of GeoBFT and other consensus protocols. To do so, we measure the throughput and latency attained by ResilientDB as a function of the number of regions, which we vary between 1 and 6. We use 60 replicas evenly distributed over the regions, and we select regions in the order Oregon, Iowa, Montreal, Belgium, Taiwan, and Sydney. E.g., if we have four regions, then each region has 15 replicas, and we have these replicas in Oregon, Iowa, Montreal, and Belgium. The results of our measurements can be found in Figure 10.

From the measurements, we see that Steward is unable to benefit from its topological knowledge of the network: in practice, we see that the high computational costs and the centralized design of Steward prevent high throughput in all cases. Both Pbft and Zyzzyva perform better than Steward, especially when run in a few well-connected regions (e.g., only the North-American regions). The performance of these protocols falls when inter-cluster communication becomes a bottleneck, however (e.g., when regions are spread across continents). HotStuff, which is designed to reduce communication compared to Pbft, has reasonable throughput in a geo-scale deployment, and sees only a small drop in throughput when regions are added. The high computational costs of the protocol prevent it from reaching high throughput in any setting, however. Additionally, HotStuff has very high latencies due to its 4-phase design. As evident from Figure 2, HotStuff clients face severe delays in receiving responses to their client requests.
Finally, the results clearly show that GeoBFT scales well with an increase in regions. When running at a single cluster, the consensus costs of GeoBFT remain high, which limits its throughput in this case. Fortunately, GeoBFT is the only protocol that actively benefits from the addition of regions, which GeoBFT uses to increase parallelism of consensus and decrease centralized communication. This added parallelism helps offset the costs of inter-cluster communication; consequently, adding regions only incurs a low latency increase on GeoBFT. Recall that GeoBFT sends only f + 1 messages between any two clusters. Hence, a total of O(zf) inter-cluster messages are sent, which is much less than the number of messages communicated across clusters by other protocols (see Figure 2). As the cost of communication between remote clusters is high (see Figure 1), this explains why other protocols have lower throughput and higher latencies than GeoBFT. Indeed, when operating on several regions, GeoBFT is able to outperform Pbft by a factor of up to 3.1× and outperform HotStuff by a factor of up to 1.3×.
4.2 Impact of Local Cluster Size

Next, we determine the impact of the number of replicas per region on the performance of GeoBFT and other consensus protocols. To do so, we measure the throughput and latency attained by ResilientDB as a function of the number of replicas per region, which we vary between 4 and 15. We have replicas in four regions (Oregon, Iowa, Montreal, and Belgium). The results of our measurements can be found in Figure 11.

The measurements show that increasing the number of replicas has only minimal negative influence on the throughput and latency of Pbft, Zyzzyva, and Steward. As seen in the previous section, the inter-cluster communication cost for the primary to contact individual replicas in other regions (and continents) is the main bottleneck. Consequently, the number of replicas used has only minimal influence. For HotStuff, which does not have such a bottleneck, adding replicas does affect throughput and, especially, latency; this is due to the strong dependence between latency and the number of replicas in the design of HotStuff.

The design of GeoBFT is particularly tuned toward a large number of regions (clusters), and not toward a large number of replicas per region. We observe that increasing the replicas per cluster also allows each cluster to tolerate more failures (increasing f). Due to this, the performance drop-off for GeoBFT when increasing the replicas per region is twofold: first, the size of the certificates exchanged between clusters is a function of f; second, each cluster sends its certificates to f + 1 replicas in each other cluster. Still, the parallelism incurred by running in four clusters allows GeoBFT to outperform all other protocols, even when scaling up to fifteen replicas per region, in which case it is still 2.9× faster than Pbft and 1.2× faster than HotStuff.
4.3 Impact of Failures

In our third experiment, we determine the impact of replica failures on the performance of GeoBFT and other consensus protocols. To do so, we measure the throughput attained by ResilientDB as a function of the number of replicas, which we vary between 4 and 12. We perform the measurements under three failure scenarios: a single non-primary replica failure, up to f simultaneous non-primary replica failures per region, and a single primary failure. As in the previous experiment, we have replicas in four regions (Oregon, Iowa, Montreal, and Belgium). The results of our measurements can be found in Figure 12.

Single non-primary replica failure. The measurements for this case show that the failure of a single non-primary replica has only a small impact on the throughput of most protocols. The only exception is Zyzzyva, for which the throughput plummets to zero, as Zyzzyva is optimized for the optimal non-failure case. The inability of Zyzzyva to effectively operate under any failures is consistent with prior analysis of the protocol [3, 23].

f non-primary replica failures per cluster. In this experiment, we measure the performance of GeoBFT in the worst
(Figures 10-13 share a common legend: GeoBFT, Pbft, Zyzzyva, HotStuff, Steward.)

Figure 10: Throughput and latency as a function of the number of clusters; zn = 60 replicas.

Figure 11: Throughput and latency as a function of the number of replicas per cluster; z = 4.
Figure 12: Throughput as a function of the number of replicas per cluster; z = 4. Left, throughput with one non-primary failure. Middle, throughput with f non-primary failures. Right, throughput with a single primary failure.

Figure 13: Throughput as a function of the batch size; z = 4 and n = 7.
case scenario it is designed for: the simultaneous failure of f replicas in each cluster (fz replicas in total). This is also the worst case Steward can deal with, and is close to the worst case the other protocols can deal with (see Remark 2.1).

The measurements show that the failures have a moderate impact on the performance of all protocols (except for Zyzzyva which, as in the single-failure case, sees its throughput plummet to zero). The reduction in throughput is a direct consequence of the inner workings of the consensus protocols. Consider, e.g., GeoBFT. In GeoBFT, replicas in each cluster first choose a local client request and replicate this request locally using Pbft (see Section 2.2). In each such local replication step, each replica will have two phases in which it needs to receive nf identical messages before proceeding to the next phase (namely, prepare and commit messages). If there are no failures, then each replica only must wait for the nf fastest messages and can proceed to the next phase as soon as these messages are received (ignoring any delayed messages). However, if there are f failures, then each replica must wait for all messages of the remaining non-failed replicas to arrive before proceeding, including the slowest arriving messages. Consequently, the impact of temporary disturbances causing random message delays at individual replicas increases with the number of failed replicas, which negatively impacts performance. Similar arguments also hold for Pbft, Steward, and HotStuff.
Single primary failure. In this experiment, we measure the performance of GeoBFT if a single primary fails (in one of the four regions). We compare the performance of GeoBFT with Pbft under failure of a single primary, which will cause primary replacement via a view-change. For Pbft, we require checkpoints to be generated and transmitted after every 600 client transactions. Further, we perform the primary failure after 900 client transactions have been ordered. For GeoBFT, we fail the primary of the cluster in Oregon once each cluster has ordered 900 transactions. Similarly, each cluster exchanges checkpoints periodically, after locally replicating every 600 transactions. In this experiment, we have excluded Zyzzyva, as it already fails to deal with non-primary failures; HotStuff, as it utilizes rotating primaries and does not have a notion of a fixed primary; and Steward, as it does not provide a readily-usable and complete view-change implementation. As expected, the measurements show that recovery from failure incurs a small reduction in overall throughput in both protocols, as both protocols are able to recover to normal-case operations after failure.
4.4 Impact of Request Batching
We now determine the impact of the batch size—the num-
ber of client transactions processed by the consensus pro-
tocols in a single consensus decision—on the performance
of various consensus protocols. To do so, we measure the
throughput attained by ResilientDB as a function of the
batch size, which we vary between 10 and 300. For this
experiment, we have replicas in four regions (Oregon, Iowa,
Montreal, and Belgium), and each region has seven replicas.
The results of our measurements can be found in Figure 13.
The measurements show a clear distinction between, on
the one hand, Pbft, Zyzzyva, and Steward, and, on the
other hand, GeoBFT and HotStuff. Note that in Pbft,
Zyzzyva, and Steward a single primary residing in a single
region coordinates all consensus. This highly centralized
communication limits throughput, as it is bottlenecked by
the bandwidth of the single primary. GeoBFT—which has
primaries in each region—and HotStuff—which rotates
primaries—both distribute consensus over several replicas
in several regions, removing bottlenecks due to the band-
width of any single replica. Hence, these protocols have suf-
ﬁcient bandwidth to support larger batch sizes (and increase
throughput). Due to this, GeoBFT is able to outperform
Pbft by up-to-6
.
0
×
. Additionally, as the design of GeoBFT
is optimized to minimize global bandwidth usage, GeoBFT
is even able to outperform HotStuff by up-to-1.6×.
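The bandwidth argument admits a simple back-of-the-envelope model: if each consensus decision requires its coordinating primary to ship the batch (plus a fixed protocol overhead) to every replica it serves, throughput is capped by that primary's outgoing bandwidth. All constants below are hypothetical illustrations, not measured values:

```python
def max_throughput(batch, fanout, bw_bytes_per_s,
                   txn_bytes=512, overhead_bytes=4096):
    """Transactions per second sustainable by one coordinator that must
    send each decision (overhead + batch payload) to `fanout` replicas."""
    bytes_per_decision = fanout * (overhead_bytes + batch * txn_bytes)
    return bw_bytes_per_s / bytes_per_decision * batch

bw = 125_000_000  # 1 Gbit/s of outgoing bandwidth, in bytes/s

# A single global primary serving all 27 other replicas (n = 28).
small = max_throughput(batch=10, fanout=27, bw_bytes_per_s=bw)
large = max_throughput(batch=300, fanout=27, bw_bytes_per_s=bw)
# Four regional primaries, each serving only its 6 local peers.
regional = 4 * max_throughput(batch=300, fanout=6, bw_bytes_per_s=bw)

assert small < large < regional  # batching helps; smaller fanout helps more
```

The model captures both effects seen in Figure 13: larger batches amortize the fixed overhead, and distributing the primary role shrinks each coordinator's fanout while multiplying aggregate capacity.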
5. RELATED WORK
Resilient systems and consensus protocols have been widely
studied by the distributed computing community (e.g., [50,
51, 54, 76, 80, 82, 84, 87, 88]). Here, we restrict ourselves to works most relevant to GeoBFT: consensus protocols supporting high-performance or geo-scale aware resilient system designs.
Bft consensus. The consensus problem and related problems such as Byzantine Agreement and Interactive Consistency have been studied in detail since the late seventies [32, 35, 36, 37, 38, 41, 42, 64, 65, 73, 78, 86, 87].
The introduction of the Pbft-powered BFS—a fault-tolerant
version of the networked ﬁle system [52]—by Castro et al. [18,
19] marked the ﬁrst practical high-performance system using
consensus. Since the introduction of Pbft, many consensus
protocols have been proposed that improve on aspects of
Pbft, e.g., Zyzzyva, PoE, and HotStuff, as discussed in
the Introduction. To further improve on the performance
of Pbft, some consensus protocols consider providing less
failure resilience [2, 20, 29, 66, 68, 69], or rely on trusted
components [11, 22, 28, 57, 89, 90]. Some recent proto-
cols propose the notion of multiple parallel primaries [46,
47]. Although such designs are partially decentralized, none of these protocols are fully geo-scale aware, making them unsuitable for the setting we envision for GeoBFT.
Protocols such as Steward, Blink, and Mencius improve on Pbft by partly optimizing for geo-scale aware
deployments [4, 5, 70, 71]. In Section 4, we already showed
that the design of Steward—which depends on a primary
cluster—severely limits its performance. The Blink protocol
improved the design of Steward by removing the need for
a primary cluster, as it requires each cluster to order all
incoming requests [4]. This design comes with high com-
munication costs, unfortunately. Finally, Menicus tries to
reduce communication costs for clients by letting clients only
communicate with close-by replicas [70, 71]. By alternating
the primary among all replicas, this allows clients to propose
requests without having to resort to contacting replicas at
geographically large distances. As such, this design focusses
on the costs perceived by clients, whereas GeoBFT focusses
on the overall costs of consensus.
Sharded Blockchains.
Sharding is an indispensable tool
used by database systems to deal with Big Data [1, 26, 27,
34, 76, 88]. Unsurprisingly, recent blockchain systems such as
SharPer, Elastico, Monoxide, AHL, and RapidChain explore
the use of sharding within the design of a replicated blockchain [6, 7, 30, 67, 92, 95]. To further enable sharded designs, high-performance communication primitives that enable communication between fault-tolerant clusters (shards) have also been proposed [53].
formance issues associated with geo-scale deployments of
permissioned blockchains, however. The main limitation of sharding is the difficulty of efficiently evaluating complex operations across shards [12, 76], and sharded systems achieve high
throughputs only if large portions of the workload access
single shards. For example, in SharPer [7] and AHL [30], two
recent permissioned blockchain designs, consensus on the
cross-shard transactions is achieved either by running Pbft
among the replicas of the involved shards or by starting a
two-phase commit protocol after running Pbft locally within
each shard, both methods with signiﬁcant cross-shard costs.
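The cross-shard overhead can be made concrete by counting messages. One Pbft decision among n replicas costs roughly (n − 1) pre-prepares plus n(n − 1) prepares plus n(n − 1) commits. A quick sketch with hypothetical shard sizes compares a single-shard decision against running Pbft across all replicas of three involved shards, one of the two designs just described:

```python
def pbft_messages(n):
    """Rough message count of one Pbft decision among n replicas:
    pre-prepare broadcast + all-to-all prepare + all-to-all commit."""
    return (n - 1) + n * (n - 1) + n * (n - 1)

local = pbft_messages(4)      # transaction touching one 4-replica shard
cross = pbft_messages(3 * 4)  # transaction spanning three such shards

print(local, cross)  # 27 275
```

Because the cost grows quadratically in the number of participants, a cross-shard decision is far more expensive than three independent local ones, which is why sharded designs only pay off when most transactions stay within a single shard.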
Our ResilientDB fabric enables geo-scale deployment
of a fully replicated blockchain system that does not face
such challenges, and can easily deal with any workload.
Indeed, the usage of sharding is orthogonal to the fully-
replicated design we aim at with ResilientDB. Still, the
integration of geo-scale aware sharding with the design of
GeoBFT—in which local data is maintained in local clusters
only—promises to be an interesting avenue for future work.
Permissionless Blockchains. Bitcoin, the first widely-deployed permissionless blockchain, uses Proof-of-Work (PoW) to
replicate data [49, 74, 93]. PoW requires limited communi-
cation between replicas, can support many replicas, and can
operate in unstructured geo-scale peer-to-peer networks in
which independent parties can join and leave at any time [81].
Unfortunately, PoW incurs a high computational complexity
on all replicas, which has raised questions about the energy
consumption of Bitcoin [31, 91]. Additionally, the complexity
of PoW causes relatively long transaction processing times
(minutes to hours) and signiﬁcantly limits the number of
transactions a permissionless blockchain can handle: in 2017,
it was reported that Bitcoin can only process 7 transactions
per second, whereas Visa already processes 2000 transactions
per second on average [79]. Since the introduction of Bit-
coin, several PoW-inspired protocols and Bitcoin-inspired
systems have been proposed [40, 59, 60, 61, 77, 93], but none
of these proposals come close to providing the performance attained by GeoBFT.
6. CONCLUSIONS AND FUTURE WORK
In this paper, we present our Geo-Scale Byzantine Fault-
Tolerant consensus protocol (GeoBFT), a novel consensus protocol designed for scalability. To achieve this, GeoBFT relies on a topological-aware clustering of replicas in local clusters to minimize costly global communication, while providing parallelization of consensus. As such,
GeoBFT enables geo-scale deployments of high-performance
blockchain systems. To support this vision, we implement
GeoBFT in our permissioned blockchain fabric—Resilient-
DB—and show that GeoBFT is not only correct, but also
attains up to six times higher throughput than existing
state-of-the-art Bft protocols.
7. REFERENCES
[1] 2ndQuadrant. Postgres-XL: Open source scalable SQL
database cluster. URL:
https://www.postgres-xl.org/.
[2] Michael Abd-El-Malek, Gregory R. Ganger, Garth R.
Goodson, Michael K. Reiter, and Jay J. Wylie.
Fault-scalable byzantine fault-tolerant services. In
Proceedings of the Twentieth ACM Symposium on
Operating Systems Principles, pages 59–74. ACM, 2005.
doi:10.1145/1095810.1095817.
[3] Ittai Abraham, Guy Gueta, Dahlia Malkhi, Lorenzo
Alvisi, Ramakrishna Kotla, and Jean-Philippe Martin.
Revisiting fast practical byzantine fault tolerance, 2017.
URL: https://arxiv.org/abs/1712.01367.
[4] Yair Amir, Brian Coan, Jonathan Kirsch, and John
Lane. Customizable fault tolerance for wide-area
replication. In 26th IEEE International Symposium on
Reliable Distributed Systems, pages 65–82. IEEE, 2007.
doi:10.1109/SRDS.2007.40.
[5] Yair Amir, Claudiu Danilov, Danny Dolev, Jonathan
Kirsch, John Lane, Cristina Nita-Rotaru, Josh Olsen,
and David Zage. Steward: Scaling byzantine
fault-tolerant replication to wide area networks. IEEE
Transactions on Dependable and Secure Computing,
7(1):80–93, 2010. doi:10.1109/TDSC.2008.53.
[6] Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. CAPER: A cross-application permissioned blockchain. PVLDB, 12(11):1385–1398, 2019. doi:10.14778/3342263.3342275.
[7] Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. SharPer: Sharding permissioned blockchains over network clusters, 2019. URL: https://arxiv.org/abs/1910.00765v1.
[8] GSM Association. Blockchain for development:
Emerging opportunities for mobile, identity and aid,
2017. URL:
https://www.gsma.com/mobilefordevelopment/wp-
Development.pdf.
[9] Pierre-Louis Aublin, Rachid Guerraoui, Nikola
Knežević, Vivien Quéma, and Marko Vukolić. The next
700 BFT protocols. ACM Transactions on Computer
Systems, 32(4):12:1–12:45, 2015.
doi:10.1145/2658994.
[10] Elaine Barker. Recommendation for key management,
part 1: General. Technical report, National Institute of
Standards & Technology, 2016. Special Publication
800-57 Part 1, Revision 4.
doi:10.6028/NIST.SP.800-57pt1r4.
[11] Johannes Behl, Tobias Distler, and Rüdiger Kapitza.
Hybrids on steroids: SGX-based high performance
BFT. In Proceedings of the Twelfth European
Conference on Computer Systems, pages 222–237.
ACM, 2017. doi:10.1145/3064176.3064213.
[12] Philip A. Bernstein and Dah-Ming W. Chiu. Using
semi-joins to solve relational queries. Journal of the
ACM, 28(1):25–40, 1981.
doi:10.1145/322234.322238
.
[13] Kenneth Birman, André Schiper, and Pat Stephenson.
Lightweight causal and atomic group multicast. ACM
Transactions on Computer Systems, 9(3):272–314, 1991.
doi:10.1145/128738.128742.
[14]
Burkhard Blechschmidt. Blockchain in Europe: Closing
the strategy gap. Technical report, Cognizant
Consulting, 2018. URL: https:
//www.cognizant.com/whitepapers/blockchain-in-
europe-closing-the-strategy-gap-codex3320.pdf.
[15] Eric Brewer. CAP twelve years later: How the “rules”
have changed. Computer, 45(2):23–29, 2012.
doi:10.1109/MC.2012.37.
[16] Eric A. Brewer. Towards robust distributed systems
(abstract). In Proceedings of the Nineteenth Annual
ACM Symposium on Principles of Distributed
Computing, pages 7–7. ACM, 2000.
doi:10.1145/343477.343502.
[17] Michael Casey, Jonah Crane, Gary Gensler, Simon
Johnson, and Neha Narula. The impact of blockchain
technology on ﬁnance: A catalyst for change. Technical
report, International Center for Monetary and Banking
1/1/5/4/115414161/geneva21_1.pdf.
[18]
Miguel Castro and Barbara Liskov. Practical byzantine
fault tolerance. In Proceedings of the Third Symposium
on Operating Systems Design and Implementation,
pages 173–186. USENIX Association, 1999.
[19]
Miguel Castro and Barbara Liskov. Practical byzantine
fault tolerance and proactive recovery. ACM
Transactions on Computer Systems, 20(4):398–461,
2002. doi:10.1145/571637.571640.
[20] Gregory Chockler, Dahlia Malkhi, and Michael K.
Reiter. Backoﬀ protocols for distributed mutual
exclusion and ordering. In Proceedings 21st
International Conference on Distributed Computing
Systems, pages 11–20. IEEE, 2001.
doi:10.1109/ICDSC.2001.918928.
[21]
Christie’s. Major collection of the fall auction season to
be recorded with blockchain technology, 2018. URL:
https:
//www.christies.com/presscenter/pdf/9160/
RELEASE_ChristiesxArtoryxEbsworth_9160_1.pdf.
[22]
Byung-Gon Chun, Petros Maniatis, Scott Shenker, and
John Kubiatowicz. Attested append-only memory:
Making adversaries stick to their word. In Proceedings
of Twenty-ﬁrst ACM SIGOPS Symposium on Operating
Systems Principles, pages 189–204. ACM, 2007.
doi:10.1145/1294261.1294280.
[23] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike
Dahlin, and Mirco Marchetti. Making byzantine fault
tolerant systems tolerate byzantine faults. In
Proceedings of the 6th USENIX Symposium on
Networked Systems Design and Implementation, pages
153–168. USENIX Association, 2009.
[24] Cindy Compert, Maurizio Luinetti, and Bertrand
Portier. Blockchain and GDPR: How blockchain could
address ﬁve areas associated with gdpr compliance.
Technical report, IBM Security, 2018. URL:
https://public.dhe.ibm.com/common/ssi/ecm/61/
en/61014461usen/security-ibm-security-
solutions-wg-white-paper-external-
61014461usen-20180319.pdf.
[25]
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu
Ramakrishnan, and Russell Sears. Benchmarking cloud
serving systems with YCSB. In Proceedings of the 1st
ACM Symposium on Cloud Computing, pages 143–154.
ACM, 2010. doi:10.1145/1807128.1807152.
[26] Oracle Corporation. MySQL NDB cluster: Scalability.
URL: https://www.mysql.com/products/cluster/
scalability.html.
[27] Oracle Corporation. Oracle sharding. URL:
https://www.oracle.com/database/technologies/
high-availability/sharding.html.
[28] Miguel Correia, Nuno Ferreira Neves, and Paulo
Verissimo. How to tolerate half less one byzantine
nodes in practical distributed systems. In Proceedings
of the 23rd IEEE International Symposium on Reliable
Distributed Systems, pages 174–183. IEEE, 2004.
doi:10.1109/RELDIS.2004.1353018.
[29]
James Cowling, Daniel Myers, Barbara Liskov, Rodrigo
Rodrigues, and Liuba Shrira. HQ replication: A hybrid
quorum protocol for byzantine fault tolerance. In
Proceedings of the 7th Symposium on Operating
Systems Design and Implementation, pages 177–190.
USENIX Association, 2006.
[30] Hung Dang, Tien Tuan Anh Dinh, Dumitrel Loghin,
Ee-Chien Chang, Qian Lin, and Beng Chin Ooi.
Towards scaling blockchain systems via sharding. In
Proceedings of the 2019 International Conference on
Management of Data, pages 123–140. ACM, 2019.
doi:10.1145/3299869.3319889.
[31]
Alex de Vries. Bitcoin’s growing energy problem. Joule,
2(5):801–805, 2018.
doi:10.1016/j.joule.2018.04.016.
[32] Richard A. DeMillo, Nancy A. Lynch, and Michael J.
Merritt. Cryptographic protocols. In Proceedings of the
Fourteenth Annual ACM Symposium on Theory of
Computing, pages 383–400. ACM, 1982.
doi:10.1145/800070.802214.
[33] Tien Tuan Anh Dinh, Ji Wang, Gang Chen, Rui Liu,
Beng Chin Ooi, and Kian-Lee Tan. BLOCKBENCH: A
framework for analyzing private blockchains. In
Proceedings of the 2017 ACM International Conference
on Management of Data, pages 1085–1100. ACM, 2017.
doi:10.1145/3035918.3064033.
[34] Microsoft Docs. Sharding pattern. URL:
https://docs.microsoft.com/en-
us/azure/architecture/patterns/sharding.
[35] D. Dolev. Unanimity in an unknown and unreliable
environment. In 22nd Annual Symposium on
Foundations of Computer Science, pages 159–168.
IEEE, 1981. doi:10.1109/SFCS.1981.53.
[36] D. Dolev and H. Strong. Authenticated algorithms for
byzantine agreement. SIAM Journal on Computing,
12(4):656–666, 1983. doi:10.1137/0212045.
[37] Danny Dolev. The byzantine generals strike again.
Journal of Algorithms, 3(1):14–30, 1982.
doi:10.1016/0196-6774(82)90004-9.
[38] Danny Dolev and Rüdiger Reischuk. Bounds on
information exchange for byzantine agreement. Journal
of the ACM, 32(1):191–204, 1985.
doi:10.1145/2455.214112.
[39] John R. Douceur. The sybil attack. In Peer-to-Peer
Systems, pages 251–260. Springer Berlin Heidelberg,
2002. doi:10.1007/3-540-45748-8_24.
[40] Ittay Eyal, Adem Efe Gencer, Emin Gün Sirer, and
Robbert Van Renesse. Bitcoin-NG: A scalable
blockchain protocol. In 13th USENIX Symposium on
Networked Systems Design and Implementation, pages
45–59, Santa Clara, CA, 2016. USENIX Association.
[41]
Michael J. Fischer and Nancy A. Lynch. A lower bound
for the time to assure interactive consistency.
Information Processing Letters, 14(4):183–186, 1982.
doi:10.1016/0020-0190(82)90033-3.
[42] Michael J. Fischer, Nancy A. Lynch, and Michael S.
Paterson. Impossibility of distributed consensus with
one faulty process. Journal of the ACM, 32(2):374–382,
1985. doi:10.1145/3149.214121.
[43] Seth Gilbert and Nancy Lynch. Brewer’s conjecture
and the feasibility of consistent, available,
partition-tolerant web services. SIGACT News,
33(2):51–59, 2002. doi:10.1145/564585.564601.
[44] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. An in-depth look of BFT consensus in blockchain: Challenges and opportunities. In Proceedings of the 20th International Middleware Conference Tutorials, pages 6–10, 2019. doi:10.1145/3366625.3369437.
[45] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. Proof-of-Execution: Reaching Consensus through Fault-Tolerant Speculation, 2019. URL: https://arxiv.org/abs/1911.00838.
[46] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. Brief announcement: Revisiting consensus protocols through wait-free parallelization. In 33rd International Symposium on Distributed Computing (DISC 2019), volume 146 of Leibniz International Proceedings in Informatics (LIPIcs), pages 44:1–44:3. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2019. doi:10.4230/LIPIcs.DISC.2019.44.
[47] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. Scaling blockchain databases through parallel resilient consensus paradigm, 2019. URL: https://arxiv.org/abs/1911.00837.
[48] Suyash Gupta, Sajjad Rahnama, and Mohammad Sadoghi. Revisiting fast practical byzantine fault tolerance, 2019. URL: https://arxiv.org/abs/1911.09208.
[49] Suyash Gupta and Mohammad Sadoghi. Blockchain transaction processing. In Encyclopedia of Big Data Technologies, pages 1–11. Springer International Publishing, 2018. doi:10.1007/978-3-319-63962-8_333-1.
[50] Suyash Gupta and Mohammad Sadoghi. EasyCommit: A non-blocking two-phase commit protocol. In Proceedings of the 21st International Conference on Extending Database Technology, pages 157–168. Open Proceedings, 2018. doi:10.5441/002/edbt.2018.15.
[51] Suyash Gupta and Mohammad Sadoghi. Efficient and non-blocking agreement protocols. Distributed and Parallel Databases, 2019. doi:10.1007/s10619-019-07267-w.
[52] Thomas Haynes and David Noveck. RFC 7530:
Network ﬁle system (NFS) version 4 protocol, 2015.
URL: https://tools.ietf.org/html/rfc7530.
[53] Jelle Hellings and Mohammad Sadoghi. Brief announcement: The fault-tolerant cluster-sending
problem. In 33rd International Symposium on
Distributed Computing (DISC 2019), volume 146 of
Leibniz International Proceedings in Informatics
(LIPIcs), pages 45:1–45:3. Schloss
Dagstuhl–Leibniz-Zentrum fuer Informatik, 2019.
doi:10.4230/LIPIcs.DISC.2019.45.
[54] Jelle Hellings and Mohammad Sadoghi. Coordination-free byzantine replication with minimal
communication costs. In Proceedings of the 23rd
International Conference on Database Theory, volume
155 of Leibniz International Proceedings in Informatics
(LIPIcs). Schloss Dagstuhl–Leibniz-Zentrum fuer
Informatik, 2020.
[55] Maurice Herlihy. Blockchains from a distributed
computing perspective. Communications of the ACM,
62(2):78–85, 2019. doi:10.1145/3209623.
[56] Maged N. Kamel Boulos, James T. Wilson, and
Kevin A. Clauson. Geospatial blockchain: promises,
challenges, and scenarios in health and healthcare.
International Journal of Health Geographics,
17(1):1211–1220, 2018.
doi:10.1186/s12942-018-0144-x.
[57] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. CheapBFT: Resource-efficient byzantine fault
tolerance. In Proceedings of the 7th ACM European
Conference on Computer Systems, pages 295–308.
ACM, 2012. doi:10.1145/2168836.2168866.
[58] Jonathan Katz and Yehuda Lindell. Introduction to
Modern Cryptography. Chapman and Hall/CRC, 2nd
edition, 2014.
[59] Aggelos Kiayias, Alexander Russell, Bernardo David,
and Roman Oliynykov. Ouroboros: A provably secure
Proof-of-Stake blockchain protocol. In Advances in
Cryptology – CRYPTO 2017, pages 357–388. Springer
International Publishing, 2017.
doi:10.1007/978-3-319-63688-7_12.
[60] Sunny King and Scott Nadal. PPCoin: Peer-to-peer
crypto-currency with proof-of-stake, 2012. URL:
https:
//peercoin.net/assets/paper/peercoin-paper.pdf.
[61] Eleftherios Kokoris-Kogias, Philipp Jovanovic, Nicolas
Gailly, Ismail Khoﬃ, Linus Gasser, and Bryan Ford.
Enhancing bitcoin security and performance with
strong consistency via collective signing. In Proceedings
of the 25th USENIX Conference on Security
Symposium, pages 279–296. USENIX Association, 2016.
[62]
Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen
Clement, and Edmund Wong. Zyzzyva: Speculative
byzantine fault tolerance. In Proceedings of
Twenty-ﬁrst ACM SIGOPS Symposium on Operating
Systems Principles, pages 45–58. ACM, 2007.
doi:10.1145/1294261.1294267.
[63]
Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen
Clement, and Edmund Wong. Zyzzyva: Speculative
byzantine fault tolerance. ACM Transactions on
Computer Systems, 27(4):7:1–7:39, 2009.
doi:10.1145/1658357.1658358.
[64] Leslie Lamport. The part-time parliament. ACM
Transactions on Computer Systems, 16(2):133–169,
1998. doi:10.1145/279227.279229.
[65] Leslie Lamport, Robert Shostak, and Marshall Pease.
The byzantine generals problem. ACM Transactions on
Programming Languages and Systems, 4(3):382–401,
1982. doi:10.1145/357172.357176.
[66] Barbara Liskov and Rodrigo Rodrigues. Byzantine
clients rendered harmless. In Distributed Computing,
pages 487–489. Springer Berlin Heidelberg, 2005.
doi:10.1007/11561927_35.
[67]
Loi Luu, Viswesh Narayanan, Chaodong Zheng, Kunal
Baweja, Seth Gilbert, and Prateek Saxena. A secure
sharding protocol for open blockchains. In Proceedings
of the 2016 ACM SIGSAC Conference on Computer
and Communications Security, pages 17–30. ACM,
2016. doi:10.1145/2976749.2978389.
[68] Dahlia Malkhi and Michael Reiter. Byzantine quorum
systems. Distributed Computing, 11(4):203–213, 1998.
doi:10.1007/s004460050050.
[69]
Dahlia Malkhi and Michael Reiter. Secure and scalable
replication in Phalanx. In Proceedings Seventeenth
IEEE Symposium on Reliable Distributed Systems,
pages 51–58. IEEE, 1998.
doi:10.1109/RELDIS.1998.740474.
[70]
Yanhua Mao, Flavio P. Junqueira, and Keith Marzullo.
Mencius: Building eﬃcient replicated state machines
for WANs. In Proceedings of the 8th USENIX
Conference on Operating Systems Design and
Implementation, pages 369–384. USENIX Association,
2008.
[71]
Yanhua Mao, Flavio P. Junqueira, and Keith Marzullo.
Towards low latency state machine replication for
uncivil wide-area networks. In Fifth Workshop on Hot
Topics in System Dependability, 2009.
[72]
Alfred J. Menezes, Scott A. Vanstone, and Paul C. Van
Oorschot. Handbook of Applied Cryptography. CRC
Press, Inc., 1st edition, 1996.
[73] Shlomo Moran and Yaron Wolfstahl. Extended
impossibility results for asynchronous complete
networks. Information Processing Letters,
26(3):145–151, 1987.
doi:10.1016/0020-0190(87)90052-4.
[74] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic
cash system, 2009. URL:
https://bitcoin.org/bitcoin.pdf.
[75] Faisal Nawab and Mohammad Sadoghi. Blockplane: A global-scale byzantizing middleware. In 35th
International Conference on Data Engineering, pages
124–135. IEEE, 2019. doi:10.1109/ICDE.2019.00020.
[76] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Springer New York, 3rd edition, 2011.
[77]
Rafael Pass and Elaine Shi. Hybrid consensus: Eﬃcient
consensus in the permissionless model, 2016. URL:
https://eprint.iacr.org/2016/917.
[78] M. Pease, R. Shostak, and L. Lamport. Reaching
agreement in the presence of faults. Journal of the
ACM, 27(2):228–234, 1980.
doi:10.1145/322186.322188.
[79] Michael Pisa and Matt Juden. Blockchain and
economic development: Hype vs. reality. Technical
report, Center for Global Development, 2017. URL:
https://www.cgdev.org/publication/blockchain-
and-economic-development-hype-vs-reality.
[80] Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr.
Sharma, and Arvind Krishnamurthy. Designing
distributed systems using approximate synchrony in
data center networks. In Proceedings of the 12th
USENIX Conference on Networked Systems Design and
Implementation, pages 43–57. USENIX Association,
2015.
[81]
Bitcoin Project. Bitcoin developer guide: P2P network,
2018. URL: https://bitcoin.org/en/developer-
guide#p2p-network.
[82] Thamir M. Qadah and Mohammad Sadoghi. QueCC: A queue-oriented, control-free concurrency architecture.
In Proceedings of the 19th International Middleware
Conference, pages 13–25, 2018.
doi:10.1145/3274808.3274810.
[83] Aleta Ricciardi, Kenneth Birman, and Patrick
Stephenson. The cost of order in asynchronous systems.
In Distributed Algorithms, pages 329–345. Springer
Berlin Heidelberg, 1992.
doi:10.1007/3-540-56188-9_22.
[84] Mohammad Sadoghi and Spyros Blanas. Transaction Processing on Modern Hardware. Synthesis Lectures on
Data Management. Morgan & Claypool, 2019.
doi:10.2200/S00896ED1V01Y201901DTM058.
[85] Victor Shoup. Practical threshold signatures. In
Advances in Cryptology — EUROCRYPT 2000, pages
207–220. Springer Berlin Heidelberg, 2000.
doi:10.1007/3-540-45539-6_15.
[86] Gadi Taubenfeld and Shlomo Moran. Possibility and
impossibility results in a shared memory environment.
Acta Informatica, 33(1):1–20, 1996.
doi:10.1007/s002360050034.
[87] Gerard Tel. Introduction to Distributed Algorithms.
Cambridge University Press, 2nd edition, 2001.
[88] Maarten van Steen and Andrew S. Tanenbaum.
Distributed Systems. Maarten van Steen, 3rd edition,
2017. URL: https://www.distributed-systems.net/.
[89] Giuliana Santos Veronese, Miguel Correia,
Alysson Neves Bessani, and Lau Cheuk Lung. EBAWA:
Eﬃcient byzantine agreement for wide-area networks.
In 2010 IEEE 12th International Symposium on High
Assurance Systems Engineering, pages 10–19. IEEE,
2010. doi:10.1109/HASE.2010.19.
[90] Giuliana Santos Veronese, Miguel Correia,
Alysson Neves Bessani, Lau Cheuk Lung, and Paulo
Verissimo. Eﬃcient byzantine fault-tolerance. IEEE
Transactions on Computers, 62(1):16–30, 2013.
doi:10.1109/TC.2011.221.
[91] Harald Vranken. Sustainability of bitcoin and
blockchains. Current Opinion in Environmental
Sustainability, 28:1–9, 2017.
doi:10.1016/j.cosust.2017.04.011.
[92] Jiaping Wang and Hao Wang. Monoxide: Scale out
blockchains with asynchronous consensus zones. In
Proceedings of the 16th USENIX Symposium on
Networked Systems Design and Implementation, pages
95–112. USENIX Association, 2019.
[93] Gavin Wood. Ethereum: a secure decentralised
generalised transaction ledger, 2016. EIP-150 revision.
URL: https://gavwood.com/paper.pdf.
[94] Maofan Yin, Dahlia Malkhi, Michael K. Reiter,
Guy Golan Gueta, and Ittai Abraham. HotStuﬀ: BFT
consensus with linearity and responsiveness. In
Proceedings of the 2019 ACM Symposium on Principles
of Distributed Computing, pages 347–356. ACM, 2019.
doi:10.1145/3293611.3331591.
[95] Mahdi Zamani, Mahnush Movahedi, and Mariana
Raykova. RapidChain: Scaling blockchain via full
sharding. In Proceedings of the 2018 ACM SIGSAC
Conference on Computer and Communications Security,
pages 931–948. ACM, 2018.
doi:10.1145/3243734.3243853.
883
... However, these trust-bft protocols enforce sequential consensus of client requests. In comparison, a majority of existing bft protocols employ several fundamental optimizations: pipelining consensus phases, concurrently ordering multiple client requests, and permitting out-of-order consensus [9,22,25,26]. These bft protocols only enforce request execution in order. ...
... We design two variants of our FlexiTrust protocols: Flexi-BFT and Flexi-ZZ, and implement them on ResilientDB fabric [25,26,46]. To gauge their scalability, we test our FlexiTrust variants on a real-world setup of 97 replicas and up to 80 k clients, and compare them against five protocols: three trust-bft and two bft. ...
... Before illustrating our observations, we present the system model, which is also adopted by existing literature [9,10,22,26]. ...
Preprint
Full-text available
The growing interest in secure multi-party database applications has led to the widespread adoption of Byzantine Fault-Tolerant (BFT) consensus protocols that can handle malicious attacks from byzantine replicas. Existing BFT protocols permit byzantine replicas to equivocate their messages. As a result, they need f more replicas than Paxos-style protocols to prevent safety violations due to equivocation. This led to the design of Trust-BFT protocols, which require each replica to host an independent, trusted component. In this work, we analyze the design of existing Trust-BFT and make the following observations regarding these protocols: (i) they adopt weaker quorums, which prevents them from providing service in scenarios supported by their BFT counterparts, (ii) they rely on the data persistence of trusted components at byzantine replicas, and (iii) they enforce sequential ordering of client requests. To resolve these challenges, we present solutions that facilitate the recovery of Trust-BFT protocols despite their weak quorums or data persistence dependence. Further, we present the design of lightweight, fast, and flexible protocols (FlexiTrust), which achieve up to 100% more throughput than their Trust-BFT counterparts.
... The introduction of the cryptocurrency Bitcoin [26] marked the first wide-spread deployment of a permissionless blockchain. The emergence of Bitcoin and other blockchains has fueled the development of new resilient data management systems [3,11,17,27,28,34]. These new systems are attractive for the database community, as they can be used to provide data management systems that are resilient against failures, enable cooperative (federated) data management with many independent parties, and can support data provenance. ...
... First, we observe that the Proof-of-Work style consensus protocols of permissionless blockchains such as Bitcoin and Ethereum suffer from high costs, very low throughputs, and very high latencies, making such permissionless designs impractical for high-performance data management [9,29,37]. Permissioned blockchains, e.g., those based on Pbft-style consensus, are more suitable for high-performance data management: fine-tuned permissioned systems can easily process up-to-hundreds-of-thousands transactions per second, this even in wide-area (Internet) deployments [7,16,17,19]. Still, even the best permissioned consensus protocols cannot provide the low latencies we are looking for, as all reliable consensus protocols require three-or-more subsequent rounds of internal communication before requests can be executed and clients can be informed. ...
... The many phases before execution in these consensus protocols is especially noticeable in practical deployments of consensus: to maximize resilience against disruptions at any location, individual replicas need to be spread out over a wide-area network. Due to this spread-out nature, the message delay will be high and a message delay of 15 ms ≤ ≤ 200 ms is not uncommon [8,17]. ...
Preprint
Full-text available
The introduction of Bitcoin fueled the development of blockchain-based resilient data management systems that are resilient against failures, enable federated data management, and can support data provenance. The key factor determining the performance of such resilient data management systems is the consensus protocol used by the system to replicate client transactions among all participants. Unfortunately, existing high-throughput consensus protocols are costly and impose significant latencies on transaction processing, which rules out their usage in responsive high-performance data management systems. In this work, we improve on this situation by introducing the Proof-of-Execution consensus protocol (PoE), a consensus protocol designed for high-performance low-latency resilient data management. PoE introduces speculative execution, which minimizes latencies by starting execution before consensus is reached, and PoE introduces proof-of-executions to guarantee successful execution to clients. Furthermore, PoE introduces a single-round check-commit protocol to reduce the overall communication costs of consensus. Hence, we believe that PoE is a promising step towards flexible general-purpose low-latency resilient data management systems.
... • The Bft consensus protocol can utilize out-of-order message processing to concurrently order multiple client requests [18,37]. ...
... • To benchmark our serverless-edge architecture, we install ResilientDB's [17][18][19] light-weight and multi-threaded node architecture on each shim node. We require clients to issue YCSB [8,11] transactions and spawn AWS Lambda functions as serverless executors. ...
... Figure 4: Actions performed by various participants of the serverless-edge infrastructure in response to a request suppression attack. ...
Preprint
Full-text available
With a growing interest in edge applications, such as the Internet of Things, the continued reliance of developers on existing edge architectures poses a threat. Existing edge applications make use of edge devices that have access to limited resources. Hence, they delegate compute-intensive tasks to third-party cloud servers. In such an edge-cloud model, neither the edge devices nor the cloud servers can be trusted, as they can act maliciously. Further, delegating tasks to cloud servers does not free the application developer from server provisioning and resource administration. In this paper, we present the vision of a novel Byzantine Fault-Tolerant Serverless-Edge architecture. In our architecture, we delegate the compute-intensive tasks to existing serverless cloud infrastructures, which relieves us from the tasks of provisioning and administration. Further, we do not trust the edge devices and require them to order each request in a byzantine fault-tolerant manner. Neither do we trust the serverless cloud, which requires us to spawn multiple serverless cloud instances. To achieve all these tasks, we design a novel protocol, ServerlessBFT. We discuss all possible attacks in the serverless-edge architecture and extensively evaluate all of its characteristics.
... These solutions do not offer the execution transparency of blockchains, which is key to preventing user manipulation [75]. a) Payment blockchains: Some blockchains are designed for high transaction throughput at large scale, but were not designed to support DApps [50], [28], [80], [65], [54]. This is the case of ResilientDB [54], which exploits topology-awareness to parallelize consensus executions, the Red Belly Blockchain [28], which shares our superblock optimization, and Mir [80], which deduplicates transaction verifications. ...
... Stellar [65] is an in-production blockchain running in a geo-distributed setting, while Algorand [50] introduced the sortition our membership change builds upon. ...
Preprint
The sharing economy is centralizing services, leading to misuses of the Internet. We can list growing damage from data hacks, global outages, and even the use of data to manipulate its owners. Unfortunately, there is no decentralized web where users can interact peer-to-peer in a secure way. Blockchains incentivize participants to individually validate every transaction and impose their block on the network. As a result, the validation of smart contract requests is computationally intensive, while the agreement on a unique state does not make full use of the network. In this paper, we propose Collachain, a new byzantine fault-tolerant blockchain compatible with the largest ecosystem of DApps that leverages collaboration. First, the participants executing smart contracts collaborate to validate the transactions, hence halving the number of validations required by modern blockchains (e.g., Ethereum, Libra). Second, the participants in the consensus collaborate to combine their block proposals into a superblock, hence improving throughput as the system grows to hundreds of nodes. In addition, Collachain offers its users the possibility to interact securely with each other without downloading the blockchain, hence allowing interactions via mobile devices. Collachain is effective at outperforming the Concord and Quorum blockchains, and its throughput peaks at 4500 TPS under a Twitter DApp (Decentralized Application) workload. Finally, we demonstrate Collachain's scalability by deploying it on 200 nodes located in 10 countries over 5 continents.
... Since the introduction of Bitcoin, the concept of blockchain has gained widespread attention [30]. Various studies have approached blockchain from the perspectives of architecture [31][32][33][34][35][36] and application [37][38][39]. Blockchains can be categorized into two types, i.e., permissionless and permissioned. ...
Conference Paper
Full-text available
Permissioned blockchain is increasingly being used as a collaborative platform for sharing data. However, current blockchain-based data sharing is unable to balance privacy protection and query functionality, limiting its application scenarios. Order-preserving encryption/encoding (OPE) allows encrypting data to prevent privacy leakage while still supporting efficient order-oriented queries on ciphertexts. But existing OPE schemes are constrained by limited use cases and inherent performance limitations that make them difficult to be adopted by permissioned blockchain where performance is a major concern. In this paper, we present BlockOPE, an efficient OPE scheme designed around the first study integrating OPE into blockchain systems. By supporting parallel processing with a conflict-reducing design, we argue that BlockOPE is feasible for permissioned blockchain, achieving orders-of-magnitude performance improvement while preserving the ideal OPE security. Additionally, we improve query processing by leveraging an adaptive lightweight client cache. Extensive experiment results and theoretical analysis illustrate the practicability of our approach.
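The order-preserving property at the heart of OPE can be demonstrated with a toy construction. The sketch below is only an illustration of the property, nothing like BlockOPE's actual scheme: it maps a known plaintext domain onto a sorted random sample of a larger ciphertext space, so order comparisons on "ciphertexts" mirror those on plaintexts.

```python
# Toy order-preserving encoding (illustrative only; real OPE schemes such
# as BlockOPE are stateful cryptographic constructions, not lookup tables).
import random

def toy_ope_table(plaintexts, cipher_space=10**6, seed=42):
    rng = random.Random(seed)
    ordered = sorted(set(plaintexts))
    # a sorted random sample preserves order while hiding the raw values
    codes = sorted(rng.sample(range(cipher_space), len(ordered)))
    return dict(zip(ordered, codes))

table = toy_ope_table([30, 5, 17])
# order comparisons carry over to the encoded values:
assert (table[5] < table[17]) == (5 < 17)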
... As such, they often relax the security aspect of the consensus protocol by employing Byzantine fault-tolerant (BFT) alternatives to Proof-of-Work (PoW), such as Practical Byzantine Fault Tolerance (PBFT) [21], or even crash fault-tolerant (CFT) protocols such as Raft [39]. State-of-the-art permissioned blockchains can achieve thousands of transactions per second by improving consensus, network performance, or by adopting database techniques [22,29,55]. However, we note that these systems are designed for traditional enterprise workloads, and are evaluated only on commodity or cluster-grade servers with x86/64 architecture, be it at the edge or in the cloud. ...
Preprint
Full-text available
While state-of-the-art permissioned blockchains can achieve thousands of transactions per second on commodity hardware with x86/64 architecture, their performance when running on different architectures is not clear. The goal of this work is to characterize the performance and cost of permissioned blockchains on different hardware systems, which is important as diverse application domains are adopting them. To this end, we conduct an extensive cost and performance evaluation of two permissioned blockchains, namely Hyperledger Fabric and ConsenSys Quorum, on five different types of hardware covering both x86/64 and ARM architectures, as well as both cloud and edge computing. The hardware nodes include servers with Intel Xeon CPUs, servers with ARM-based Amazon Graviton CPUs, and edge devices with ARM-based CPUs. Our results reveal a diverse profile of the two blockchains across different settings, demonstrating the impact of hardware choices on overall performance and cost. We find that Graviton servers outperform Xeon servers in many settings, due to their powerful CPUs and high memory bandwidth. Edge devices with ARM architecture, on the other hand, exhibit low performance. When comparing the cloud with the edge, we show that the cost of the latter is much smaller in the long run if manpower cost is not considered.
... The RCC paradigm enables any primary-backup consensus protocol to be made concurrent, consequently increasing the resilience of consensus-based systems to failures by significantly lowering the impact of faulty replicas on the system's throughput and operations. The proposed RCC paradigm is tested by integrating it into ResilientDB, a high-performance resilient blockchain platform [24]. ...
Article
Full-text available
Popular blockchains such as Ethereum and several others execute complex transactions in the block through user-defined scripts known as smart contracts. Serial execution of smart contract transactions/atomic units (AUs) fails to harness the multiprocessing power offered by the prevalence of multi-core processors. By adding concurrency to the execution of AUs, we can achieve better efficiency and higher throughput. In this paper, we develop a concurrent miner that proposes a block by executing AUs concurrently using optimistic Software Transactional Memory systems (STMs). It efficiently captures independent AUs in the concurrent bin and dependent AUs in the block graph (BG). Later, we propose a concurrent validator that re-executes the same AUs concurrently and deterministically using the concurrent bin followed by the BG given by the miner to verify the block. We rigorously prove the correctness of concurrent execution of AUs. The performance benchmark shows that the average speedup for the optimized concurrent miner is 5.21×, while the maximum is 14.96× over the serial miner. The optimized validator obtains an average speedup of 8.61× to a maximum of 14.65× over the serial validator. The proposed miner outperforms state-of-the-art concurrent miners by 1.02× to 1.18×, while the proposed validator outperforms state-of-the-art concurrent validators by 1× to 4.46×. Moreover, the proposed efficient BG saves an average of 2.29× more block space when compared with the state-of-the-art.
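The block-graph idea above can be sketched concretely. The code below is an illustration of the general technique, not the paper's actual construction: transactions touching disjoint keys are independent and may run concurrently, while a shared key creates a dependency edge that forces the later transaction to wait.

```python
# Sketch of a block graph over transactions (illustrative names): an edge
# earlier -> later exists when the two transactions touch a common key.
def build_block_graph(txs):
    """txs: list of (tx_id, set_of_keys). Returns dependency map: id -> deps."""
    deps = {tx_id: set() for tx_id, _ in txs}
    for i, (tid, keys) in enumerate(txs):
        for earlier_id, earlier_keys in txs[:i]:
            if keys & earlier_keys:          # shared key => conflict edge
                deps[tid].add(earlier_id)
    return deps

def execution_levels(deps):
    """Group transactions into levels; each level can run in parallel."""
    levels, done = [], set()
    while len(done) < len(deps):
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        levels.append(sorted(ready))
        done.update(ready)
    return levels

deps = build_block_graph([("t1", {"a"}), ("t2", {"b"}), ("t3", {"a", "b"})])
# t1 and t2 are independent; t3 conflicts with both and runs in a later level.
```

A validator that replays the block level by level reaches the same final state deterministically, which is what makes concurrent validation safe.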
... XFT [42]) and mixed fault models (e.g., Hierarchical [7,31]) have been proposed to improve performance in geo-distributed systems. XFT assumes synchronous communication among a majority of replicas for safety, while the hybrid protocols assume a trusted component for safety. ...
Preprint
Full-text available
Byzantine consensus is a critical component in many permissioned Blockchains and distributed ledgers. We propose a new paradigm for designing BFT protocols called DQBFT that addresses three major performance and scalability challenges that plague past protocols: (i) high communication costs to reach geo-distributed agreement, (ii) uneven resource utilization hampering performance, and (iii) performance degradation under varying node and network conditions and high-contention workloads. Specifically, DQBFT divides consensus into two parts: 1) durable command replication without a global order, and 2) consistent global ordering of commands across all replicas. DQBFT achieves this by decentralizing the heavy task of replicating commands while centralizing the ordering process. Under the new paradigm, we develop a new protocol, Destiny, that uses a combination of three techniques to achieve high performance and scalability: using a trusted subsystem to decrease consensus's quorum size, using threshold signatures to attain linear communication costs, and reducing client communication. Our evaluations on a 300-replica geo-distributed deployment reveal that DQBFT protocols achieve significant performance gains over prior art: ≈3x better throughput and ≈50% better latency.
Conference Paper
Full-text available
Since the introduction of Bitcoin---the first wide-spread application driven by blockchains---the interest of the public and private sector in blockchains has skyrocketed. At the core of this interest are the ways in which blockchains can be used to improve data management, e.g., by enabling federated data management via decentralization, resilience against failure and malicious actors via replication and consensus, and strong data provenance via a secured immutable ledger. In practice, high-performance blockchains for data management are usually built in permissioned environments in which the participants are vetted and can be identified. In this setting, blockchains are typically powered by Byzantine fault-tolerant consensus protocols. These consensus protocols are used to provide full replication among all honest blockchain participants by enforcing a unique order of processing incoming requests among the participants. In this tutorial, we take an in-depth look at Byzantine fault-tolerant consensus. First, we take a look at the theory behind replicated computing and consensus. Then, we delve into how common consensus protocols operate. Finally, we take a look at current developments and briefly look at our vision moving forward.
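The resilience arithmetic behind Pbft-style protocols mentioned throughout these works is worth spelling out. The following is standard textbook reasoning, not specific to any one cited paper: tolerating f byzantine replicas requires n ≥ 3f + 1 replicas, with quorums of 2f + 1 so that any two quorums intersect in at least one honest replica.

```python
# Classic BFT resilience bounds (standard results, not tied to one paper).
def min_replicas(f):
    """Minimum replicas to tolerate f byzantine faults."""
    return 3 * f + 1

def quorum_size(f):
    """Quorum size used by Pbft-style protocols."""
    return 2 * f + 1

# Any two quorums of size 2f+1 out of 3f+1 replicas overlap in at least
# f+1 replicas; since at most f are faulty, at least one overlap replica
# is honest, which prevents two conflicting decisions.
f = 2
overlap = 2 * quorum_size(f) - min_replicas(f)
assert overlap == f + 1
```

This intersection argument is the reason a unique order of requests can be enforced even when f replicas behave arbitrarily.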
Conference Paper
Full-text available
We present HotStuff, a leader-based Byzantine fault-tolerant replication protocol for the partially synchronous model. Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) network delay, a property called responsiveness, and with communication complexity that is linear in the number of replicas. To our knowledge, HotStuff is the first partially synchronous BFT replication protocol exhibiting these combined properties. Its simplicity enables it to be further pipelined and simplified into a practical, concise protocol for building large-scale replication services.
Conference Paper
Full-text available
Existing blockchain systems scale poorly because of their distributed consensus protocols. Current attempts at improving blockchain scalability are limited to cryptocurrency. Scaling blockchain systems under general workloads (i.e., non-cryptocurrency applications) remains an open question. This work takes a principled approach to apply sharding to blockchain systems in order to improve their transaction throughput at scale. This is challenging, however, due to the fundamental difference in failure models between databases and blockchain. To achieve our goal, we first enhance the performance of Byzantine consensus protocols, improving individual shards' throughput. Next, we design an efficient shard formation protocol that securely assigns nodes into shards. We rely on trusted hardware, namely Intel SGX, to achieve high performance for both consensus and shard formation protocol. Third, we design a general distributed transaction protocol that ensures safety and liveness even when transaction coordinators are malicious. Finally, we conduct an extensive evaluation of our design both on a local cluster and on Google Cloud Platform. The results show that our consensus and shard formation protocols outperform state-of-the-art solutions at scale. More importantly, our sharded blockchain reaches a high throughput that can handle Visa-level workloads, and is the largest ever reported in a realistic environment.
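The shard-formation step described above can be illustrated with a simple deterministic sketch. This is not the paper's SGX-based protocol, only an illustration of the underlying idea: nodes are ordered by a hash seeded with shared per-epoch randomness, so no single party controls placement, then split evenly into shards.

```python
# Illustrative shard assignment via shared randomness (a sketch; the cited
# work derives its randomness from trusted hardware, not a plain seed).
import hashlib

def assign_shards(node_ids, num_shards, epoch_seed):
    def h(node):
        # seed-determined pseudo-random rank for each node
        return hashlib.sha256(f"{epoch_seed}:{node}".encode()).hexdigest()
    ordered = sorted(node_ids, key=h)
    # round-robin over the shuffled order yields balanced shards
    return {node: i % num_shards for i, node in enumerate(ordered)}

shards = assign_shards(["n1", "n2", "n3", "n4"], 2, "epoch-7")
# every node lands in exactly one of the two shards
```

Because every participant derives the same assignment from the same seed, no extra communication is needed to agree on shard membership within an epoch.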
Conference Paper
Full-text available
The byzantine fault-tolerance model captures a wide range of failures, common in real-world scenarios, such as ones due to malicious attacks and arbitrary software/hardware errors. We propose Blockplane, a middleware that enables making existing benign systems tolerate byzantine failures. This is done by making the existing system use Blockplane for durability and as a communication infrastructure. Blockplane proposes the following: (1) A middleware and communication infrastructure to make an entire benign protocol byzantine fault-tolerant, (2) A hierarchical locality-aware design to minimize the number of wide-area messages, (3) A separation of fault-tolerance concerns to enable designs with higher performance. A byzantine failure model [11] is a model of arbitrary failures that includes, in addition to crashes, unexpected behavior due to software and hardware malfunctions, malicious breaches, and violation of trust between participants. It is significantly more difficult to develop byzantine fault-tolerant protocols compared to benign (non-byzantine) protocols. This poses a challenge to organizations that want to adopt byzantine fault-tolerant software solutions. This challenge is exacerbated by the need of many applications to be globally distributed. With global distribution, the wide-area latency between participants amplifies the performance overhead of byzantine fault-tolerant protocols. To overcome the challenges of adopting byzantine fault-tolerant software solutions, we propose pushing down the byzantine fault-tolerance problem to the communication layer rather than the application/storage layer. Our proposal, Blockplane, is a communication infrastructure that handles the delivery of messages from one node to another. Blockplane exposes an interface of log-commit, send, and receive operations to be used by nodes to both persist their state and communicate with each other.
Blockplane adopts a locality-aware hierarchical design due to our interest in supporting efficient byzantine fault-tolerance in global-scale environments. Hierarchical designs have recently been shown to perform well in global-scale settings [15]. Blockplane optimizes for communication latency by performing as much computation as possible locally and only communicating across the wide-area link when necessary. In the paper, we distinguish between two types of failures. The first is independent byzantine failures, which are akin to traditional byzantine failures that affect each node independently (the failure of one node does not correlate with the failure of another node). The second type of failures is benign
Article
Full-text available
Large-scale distributed databases are designed to support commercial and cloud-based applications. The minimal expectation from such systems is that they ensure consistency and reliability in case of node failures. Distributed databases guarantee reliability through the use of atomic commitment protocols, which ensure that either all the changes of a transaction are applied or none of them are. To ensure an efficient commitment process, the database community has mainly used the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures. This necessitated the development of the non-blocking three-phase commit (3PC) protocol. However, the database community is still reluctant to use the 3PC protocol, as it acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present the EasyCommit protocol, which leverages the best of both worlds (2PC and 3PC): it is non-blocking (like 3PC) and requires two phases (like 2PC). EasyCommit achieves these goals by ensuring two key observations: (i) first transmit and then commit, and (ii) message redundancy. We present the design of the EasyCommit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of the EC protocol and show that it is nearly as efficient as the 2PC protocol. To cater to the needs of geographically large-scale distributed systems, we also design a topology-aware agreement protocol (Geo-scale EasyCommit) that is non-blocking, safe, live, and outperforms the 3PC protocol.
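The classic 2PC decision rule that EasyCommit builds on is compact enough to sketch. The function below is a minimal illustration of textbook two-phase commit, not EasyCommit itself: a transaction commits only if every participant votes yes in the prepare phase.

```python
# Minimal two-phase commit decision logic (textbook 2PC, not EasyCommit):
# phase 1 collects votes, phase 2 commits iff the vote is unanimous.
def two_phase_commit(participants):
    """participants: callables returning True (vote yes) or False (vote no)."""
    votes = [p() for p in participants]      # phase 1: prepare / voting
    return "commit" if all(votes) else "abort"  # phase 2: decision

assert two_phase_commit([lambda: True, lambda: True]) == "commit"
assert two_phase_commit([lambda: True, lambda: False]) == "abort"
```

The blocking problem the abstract describes arises when the coordinator fails between the two phases: participants that voted yes hold locks and cannot safely decide alone, which is precisely what 3PC and EasyCommit address.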