
RCC: Resilient Concurrent Consensus for High-Throughput Secure Transaction Processing

Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi
Moka Blox LLC
Exploratory Systems Lab, Department of Computer Science, University of California, Davis
Abstract—Recently, we saw the emergence of consensus-based database systems that promise resilience against failures, strong data provenance, and federated data management. Typically, these fully-replicated systems are operated on top of a primary-backup consensus protocol, which limits the throughput of these systems to the capabilities of a single replica (the primary). To push throughput beyond this single-replica limit, we propose concurrent consensus. In concurrent consensus, replicas independently propose transactions, thereby reducing the influence of any single replica on performance. To put this idea in practice, we propose our RCC paradigm that can turn any primary-backup consensus protocol into a concurrent consensus protocol by running many consensus instances concurrently. RCC is designed with performance in mind and requires minimal coordination between instances. Furthermore, RCC also promises increased resilience against failures. We put the design of RCC to the test by implementing it in RESILIENTDB, our high-performance resilient blockchain fabric, and comparing it with state-of-the-art primary-backup consensus protocols. Our experiments show that RCC achieves up to 2.75× higher throughput than other consensus protocols and can be scaled to 91 replicas.
Index Terms—High-throughput resilient transaction processing, concurrent consensus, limits of primary-backup consensus.
I. INTRODUCTION
Fueled by the emergence of blockchain technology [2], [3], [4], we see a surge in consensus-based data processing frameworks and database systems [4], [5], [6], [7], [8], [9]. This interest can be easily explained: compared to traditional distributed database systems, consensus-based systems can provide more resilience during failures, can provide strong support for data provenance, and can enable federated data processing in a heterogeneous environment with many independent participants. Consequently, consensus-based systems can prevent disruption of service due to software issues or cyberattacks that compromise part of the system, and can aid in improving the quality of data that is managed by many independent parties, potentially reducing the huge societal costs of cyberattacks and bad data.
A brief announcement of this work was presented at the 33rd International Symposium on Distributed Computing (DISC 2019) [1].
This material is based upon work partially supported by the U.S. Department of Energy, Office of Science, Office of Small Business Innovation Research, under Award Number DE-SC0020455.
At the core of consensus-based systems are consensus protocols that enable independent participants (e.g., different companies) to manage a single common database by reliably and continuously replicating a unique sequence of transactions among all participants. By design, these consensus protocols are resilient and can deal with participants that have crashed, are unable to participate due to local network, hardware, or software failures, or are compromised and act maliciously [10], [11]. As such, consensus protocols can be seen as the fault-resilient counterparts of classical two-phase and three-phase commit protocols [12], [13], [14]. Most practical systems use consensus protocols that follow the classical primary-backup design of PBFT [15] in which a single replica, the primary, proposes transactions by broadcasting them to all other replicas, after which all replicas exchange state to determine whether the primary correctly proposes the same transaction to all replicas and to deal with failure of the primary. Well-known examples of such protocols are PBFT [15], ZYZZYVA [16], SBFT [17], HOTSTUFF [18], POE [19], and RBFT [20], and fully-optimized implementations of these protocols are able to process up to tens of thousands of transactions per second [21].
A. The Limitations of Traditional Consensus
Unfortunately, a close look at the design of primary-backup consensus protocols reveals that their design underutilizes available network resources, which prevents the maximization of transaction throughput: the throughput of these protocols is determined mainly by the outgoing bandwidth of the primary. To illustrate this, we consider the maximum throughput by which primaries can replicate transactions. Consider a system with n replicas of which f are faulty and the remaining nf = n − f are non-faulty. The maximum throughput Tmax of any such protocol is determined by the outgoing bandwidth B of the primary, the number of replicas n, and the size of transactions st: $T_{\max} = B/((n-1)\,s_t)$. No practical consensus protocol will be able to achieve this throughput, as dealing with crashes and malicious behavior requires substantial state exchange. Protocols such as ZYZZYVA [16] can come close; however, by optimizing for the case in which no faults occur, this comes at the cost of their ability to deal with faults efficiently.
For PBFT, the minimum amount of state exchange consists of two rounds in which PREPARE and COMMIT messages are exchanged between all replicas (a quadratic amount, see Example III.1 in Section III). Assuming that these messages have size sm, the maximum throughput of PBFT is $T_{\mathrm{PBFT}} = B/((n-1)(s_t + 3 s_m))$. To minimize overhead, typical implementations of PBFT group hundreds of transactions together, assuring that $s_t \gg s_m$ and, hence, $T_{\max} \approx T_{\mathrm{PBFT}}$.
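To make these bounds concrete, the following sketch evaluates Tmax and TPBFT for an assumed deployment; the bandwidth, replica count, and message sizes below are illustrative values, not measurements from this paper.

```python
# Sketch: bandwidth-based upper bounds on replication throughput (Section I-A).
# All parameter values are illustrative assumptions, not measurements.

def t_max(bandwidth, n, s_t):
    """Proposals/s when the primary only sends each proposal to the n-1 other replicas."""
    return bandwidth / ((n - 1) * s_t)

def t_pbft(bandwidth, n, s_t, s_m):
    """Proposals/s for PBFT: each proposal also costs three sm-sized messages per replica."""
    return bandwidth / ((n - 1) * (s_t + 3 * s_m))

bandwidth = 1e9 / 8            # 1 Gbit/s of outgoing primary bandwidth, in bytes/s
f = 10
n = 3 * f + 1
s_m = 1024                     # 1 KiB state-exchange messages
txn_per_proposal = 100
s_t = txn_per_proposal * 512   # each proposal groups 100 transactions of 512 B

print(f"T_max : {t_max(bandwidth, n, s_t) * txn_per_proposal:,.0f} txn/s")
print(f"T_PBFT: {t_pbft(bandwidth, n, s_t, s_m) * txn_per_proposal:,.0f} txn/s")
```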
The above not only shows a maximum on throughput, but also that primary-backup consensus protocols such as PBFT and ZYZZYVA severely underutilize resources of non-primary replicas: when $s_t \gg s_m$, the primary sends and receives roughly $(n-1)\,s_t$ bytes, whereas all other replicas only send and receive roughly $s_t$ bytes. The obvious solution would be to use several primaries. Unfortunately, recent protocols such as HOTSTUFF [18], SPINNING [22], and PRIME [23] that regularly switch primaries all require that a switch from a primary happens after all proposals of that primary are processed. Hence, such primary switching does load-balance overall resource usage among the replicas, but does not address the underutilization of resources we observe.
B. Our Solution: Towards Resilient Concurrent Consensus
The only way to push the throughput of consensus-based databases and data processing systems beyond the limit Tmax is by better utilizing available resources. In this paper, we propose to do so via concurrent consensus, in which we use many primaries that concurrently propose transactions. We also propose RCC, a paradigm for the realization of concurrent consensus. Our contributions are as follows:
1) First, in Section II, we propose concurrent consensus and show that concurrent consensus can achieve much higher throughput than primary-backup consensus by effectively utilizing all available system resources.
2) Then, in Section III, we propose RCC, a paradigm for turning any primary-backup consensus protocol into a concurrent consensus protocol that is designed for maximizing throughput in all cases, even during malicious activity.
3) Then, in Section IV, we show that RCC can be utilized to make systems more resilient, as it can mitigate the effects of order-based attacks and throttling attacks (which are not prevented by traditional consensus protocols), and can provide better load balancing.
4) Finally, in Section V, we put the design of RCC to the test by implementing it in RESILIENTDB,¹ our high-performance resilient blockchain fabric, and compare RCC with state-of-the-art primary-backup consensus protocols. Our comparison shows that RCC answers the promises of concurrent consensus: it achieves up to 2.75× higher throughput than other consensus protocols, has a peak throughput of 365 ktxn/s, and can be easily scaled to 91 replicas.
II. THE PROMISE OF CONCURRENT CONSENSUS
To deal with the underutilization of resources and the
low throughput of primary-backup consensus, we propose
¹RESILIENTDB is open-sourced and available at https://resilientdb.com.
Fig. 1. Maximum throughput of replication in a system with B = 1 Gbit/s, n = 3f + 1, nf = 2f + 1, sm = 1 KiB, and individual transactions of 512 B. On the left, each proposal groups 20 transactions (st = 10 KiB) and, on the right, each proposal groups 400 transactions (st = 200 KiB). The plots compare Tmax, TPBFT, Tcmax, and TcPBFT as functions of the number of replicas n.
concurrent consensus. Specifically, we design for a system that is optimized for high-throughput scenarios in which an abundance of transactions is available, and we make every replica a concurrent primary that is responsible for proposing and replicating some of these transactions. As we have nf non-faulty replicas, we can expect to always concurrently propose at least nf transactions if sufficient transactions are available.
Such concurrent processing has the potential to drastically improve throughput: in each round, each primary will send out one proposal to all other replicas, and receive nf − 1 proposals from other primaries. Hence, the maximum concurrent throughput is $T_{c\max} = \mathbf{nf} \cdot B/((n-1)\,s_t + (\mathbf{nf}-1)\,s_t)$. In practice, of course, the primaries also need to participate in state exchange to determine the correct operations of all concurrent primaries. If we use PBFT-style state exchange, we end up with a concurrent throughput of $T_{c\mathrm{PBFT}} = \mathbf{nf} \cdot B/((n-1)(s_t + 3 s_m) + (\mathbf{nf}-1)(s_t + 4(n-1) s_m))$. In Figure 1, we have sketched the maximum throughputs Tmax, TPBFT, Tcmax, and TcPBFT. As one can see, concurrent consensus not only promises greatly improved throughput, but also sharply reduces the costs associated with scaling consensus. We remark, however, that these figures provide best-case upper bounds, as they only focus on bandwidth usage. In practice, replicas are also limited by computational power and available memory buffers, which put limits on the number of transactions they can process in parallel and can execute (see Section V-B).
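The following sketch extends the previous calculation to the concurrent bounds Tcmax and TcPBFT; as before, all parameter values are assumptions chosen only to illustrate the relative gain of concurrent consensus.

```python
# Sketch: concurrent-consensus bounds of Section II; parameter values are assumptions.

def t_cmax(bandwidth, n, nf, s_t):
    """nf concurrent primaries: each sends one proposal and receives nf-1 proposals."""
    return nf * bandwidth / ((n - 1) * s_t + (nf - 1) * s_t)

def t_cpbft(bandwidth, n, nf, s_t, s_m):
    """Concurrent PBFT-style state exchange, following the formula above."""
    return nf * bandwidth / ((n - 1) * (s_t + 3 * s_m) + (nf - 1) * (s_t + 4 * (n - 1) * s_m))

bandwidth, f = 1e9 / 8, 10
n, nf = 3 * f + 1, 2 * f + 1
s_m, txn_per_proposal = 1024, 100
s_t = txn_per_proposal * 512

print(f"T_cmax : {t_cmax(bandwidth, n, nf, s_t) * txn_per_proposal:,.0f} txn/s")
print(f"T_cPBFT: {t_cpbft(bandwidth, n, nf, s_t, s_m) * txn_per_proposal:,.0f} txn/s")
```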
III. RCC: RESILIENT CONCURRENT CONSENSUS
The idea behind concurrent consensus, as outlined in the previous section, is straightforward: improve overall throughput by using all available resources via concurrency. Designing and implementing a concurrent consensus system that operates correctly, even during crashes and malicious behavior of some replicas, is challenging, however. In this section, we describe how to design correct consensus protocols that deliver on the promises of concurrent consensus. We do so by introducing RCC, a paradigm that can turn any primary-backup consensus protocol into a concurrent consensus protocol. At its basis, RCC makes every replica a primary of a consensus-instance that replicates transactions among all replicas. Furthermore, RCC provides the necessary coordination between these consensus-instances to coordinate execution and deal with faulty primaries. To assure resilience and maximize throughput, we set the following design goals for RCC:
D1) RCC provides consensus among replicas on the client transactions that are to be executed and the order in which they are executed.
D2) Clients can interact with RCC to force execution of their transactions and learn the outcome of execution.
D3) RCC is a design paradigm that can be applied to any primary-backup consensus protocol, turning it into a concurrent consensus protocol.
D4) In RCC, consensus-instances with non-faulty primaries are always able to propose transactions at maximum throughput (with respect to the resources available to any replica), independent of faulty behavior by any other replica.
D5) In RCC, dealing with faulty primaries does not interfere with the operations of other consensus-instances.
Combined, design goals D4 and D5 imply that instances with non-faulty primaries can propose transactions wait-free: transactions are proposed concurrently with any other activities and do not require any coordination with other instances.
A. Background on Primary-Backup Consensus and PBFT
Before we present RCC, we provide the necessary background and notation for primary-backup consensus. Typical primary-backup consensus protocols operate in views. Within each view, a primary can propose client transactions, which will then be executed by all non-faulty replicas. To assure that all non-faulty replicas maintain the same state, transactions are required to be deterministic: on identical inputs, execution of a transaction must always produce identical outcomes. To deal with faulty behavior by the primary or by any other replicas during a view, three complementary mechanisms are used:
Byzantine commit: The primary uses a Byzantine commit algorithm BCA to propose a client transaction T to all replicas. Next, BCA will perform state exchange to determine whether the primary successfully proposed a transaction. If the primary is non-faulty, then all replicas will receive T and determine success. If the primary is faulty and more than f non-faulty replicas do not receive a proposal or receive different proposals than the other replicas, then the state exchange step of BCA will detect this failure of the primary.
Primary replacement: The replicas use a view-change algorithm to replace the primary of the current view v when this primary is detected to be faulty by non-faulty replicas. This view-change algorithm will collect the state of sufficient replicas in view v to determine a correct starting state for the next view v + 1 and assigns a new primary that will propose client transactions in view v + 1.
Recovery: A faulty primary can keep up to f non-faulty replicas in the dark without being detected, as f faulty replicas can cover for this malicious behavior. Such behavior is not detected and, consequently, does not trigger a view-change. Via a checkpoint algorithm, the at-most-f non-faulty replicas that are in the dark will learn the proposed client transactions
that are successfully proposed to the remaining at least nf − f > f non-faulty replicas (that are not in the dark).

Fig. 2. A schematic representation of the preprepare-prepare-commit protocol of PBFT. First, a client c requests transaction T and the primary P proposes T to all replicas via a PREPREPARE message. Next, replicas commit to T via a two-phase message exchange (PREPARE and COMMIT messages). Finally, replicas execute the proposal and inform the client.
Example III.1. Next, we illustrate these mechanisms in PBFT. At the core of PBFT is the preprepare-prepare-commit Byzantine commit algorithm. This algorithm operates in three phases, which are sketched in Figure 2.
First, the current primary chooses a client request of the form ⟨T⟩c, a transaction T signed by client c, and proposes this request as the ρ-th transaction by broadcasting it to all replicas via a PREPREPARE message m. Next, each non-faulty replica R prepares the first proposed ρ-th transaction it receives by broadcasting a PREPARE message for m. If a replica R receives nf PREPARE messages for m from nf distinct replicas, then it has the guarantee that any group of nf replicas will contain a non-faulty replica that has received m. Hence, R has the guarantee that m can be recovered from any group of nf replicas, independent of the behavior of the current primary. With this guarantee, R commits to m by broadcasting a COMMIT message for m. Finally, if a replica R receives nf COMMIT messages for m from nf distinct replicas, then it accepts m. In PBFT, accepted proposals are then executed and the client is informed of the outcome.
Each replica R participating in preprepare-prepare-commit uses an internal timeout value to detect failure: whenever the primary fails to coordinate a round of preprepare-prepare-commit—which should result in R accepting some proposal—R will detect failure of the primary and halt participation in preprepare-prepare-commit. If f + 1 non-faulty replicas detect such a failure and communication is reliable, then they can cooperate to assure that all non-faulty replicas detect the failure. We call this a confirmed failure of preprepare-prepare-commit. In PBFT, confirmed failures trigger a view-change. Finally, PBFT employs a majority-vote checkpoint protocol that allows replicas that are kept in the dark to learn accepted proposals without help of the primary.
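As a minimal illustration of the quorum logic in Example III.1, the sketch below tracks PREPREPARE, PREPARE, and COMMIT messages at a single replica; message authentication, timeouts, view-changes, and checkpoints are omitted, and all names are ours rather than part of any existing implementation.

```python
# Sketch of the preprepare-prepare-commit quorum logic at a single PBFT replica.
# Networking, signatures, timeouts, view-changes, and checkpoints are omitted.

class PBFTReplicaSketch:
    def __init__(self, n, f, send):
        self.n, self.f, self.nf = n, f, n - f   # nf = n - f is the quorum size
        self.send = send                        # callback: send(message_type, round, payload)
        self.proposal = {}                      # round -> proposal received via PREPREPARE
        self.prepares = {}                      # round -> replicas that sent PREPARE
        self.commits = {}                       # round -> replicas that sent COMMIT
        self.accepted = {}                      # round -> accepted proposal

    def on_preprepare(self, rnd, proposal):
        if rnd not in self.proposal:            # prepare only the first proposal per round
            self.proposal[rnd] = proposal
            self.send("PREPARE", rnd, proposal)

    def on_prepare(self, rnd, sender):
        senders = self.prepares.setdefault(rnd, set())
        senders.add(sender)
        if len(senders) == self.nf and rnd in self.proposal:
            # nf distinct PREPAREs: the proposal is recoverable from any nf replicas.
            self.send("COMMIT", rnd, self.proposal[rnd])

    def on_commit(self, rnd, sender):
        senders = self.commits.setdefault(rnd, set())
        senders.add(sender)
        if len(senders) == self.nf and rnd in self.proposal:
            self.accepted[rnd] = self.proposal[rnd]   # accept: execute and inform the client
```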
B. The Design of RCC
We now present RCC in detail. Consider a primary-backup consensus protocol P that utilizes Byzantine commit algorithm BCA (e.g., PBFT with preprepare-prepare-commit). At the core of applying our RCC paradigm to P is running m, 1 ≤ m ≤ n, instances of BCA concurrently, while providing sufficient coordination between the instances to deal with any malicious behavior. To do so, RCC makes BCA concurrent and uses a checkpoint protocol for per-instance recovery of in-the-dark replicas (see Section III-D). Instead of view-changes, RCC uses a novel wait-free mechanism, that does not involve replacing primaries, to deal with detectable primary failures (see Section III-C). RCC requires the following guarantees on BCA:
Assumption. Consider an instance of BCA running in a system with n replicas, n > 3f.
A1) If no failures are detected in round ρ of BCA (the round is successful), then at least nf − f non-faulty replicas have accepted a proposed transaction in round ρ.
A2) If a non-faulty replica accepts a proposed transaction T in round ρ of BCA, then all other non-faulty replicas that accepted a proposed transaction, accepted T.
A3) If a non-faulty replica accepts a transaction T, then T can be recovered from the state of any subset of nf − f non-faulty replicas.
A4) If the primary is non-faulty and communication is reliable, then all non-faulty replicas will accept a proposal in round ρ of BCA.
With minor fine-tuning, these assumptions are met by PBFT, ZYZZYVA, SBFT, HOTSTUFF, and many other primary-backup consensus protocols, meeting design goal D3.
RCC operates in rounds. In each round, RCC replicates m client transactions (or, as discussed in Section I-A, m sets of client transactions), one for each instance. We write Ii to denote the i-th instance of BCA. To enforce that each instance is coordinated by a distinct primary, the i-th replica Pi is assigned as the primary coordinating Ii. Initially, RCC operates with m = n instances. In RCC, instances can fail and be stopped, e.g., when coordinated by malicious primaries or during periods of unreliable communication. Each round ρ of RCC operates in three steps:
1) Concurrent BCA. First, each replica participates in m instances of BCA, in which each instance is proposing a transaction requested by a client among all replicas.
2) Ordering. Then, each replica collects all successfully replicated client transactions and puts them in the same—deterministically determined—order.
3) Execution. Finally, each replica executes the transactions of round ρ in order and informs the clients of the outcome of their requested transactions.
Figure 3 sketches a high-level overview of running m concurrent instances of BCA.
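The round structure sketched in Figure 3 can be summarized as follows; run_bca_instance, execute, and inform_client are hypothetical placeholders for the mechanisms described in this section, and the per-instance ordering here is the simple one refined in Section IV.

```python
# Sketch of a single RCC round at one replica: m concurrent BCA instances followed
# by deterministic ordering and execution. All helper functions are placeholders.
from concurrent.futures import ThreadPoolExecutor

def rcc_round(rnd, m, run_bca_instance, execute, inform_client):
    # 1) Concurrent BCA: instance i replicates the transaction proposed by primary P_i.
    with ThreadPoolExecutor(max_workers=m) as pool:
        futures = {i: pool.submit(run_bca_instance, i, rnd) for i in range(1, m + 1)}
        accepted = {i: fut.result() for i, fut in futures.items()}

    # 2) Ordering: all replicas order the accepted transactions identically
    #    (here by instance index; Section IV replaces this with a keyed permutation).
    ordered = [accepted[i] for i in sorted(accepted)]

    # 3) Execution: execute in order and inform the clients of the outcomes.
    for txn in ordered:
        inform_client(txn, execute(txn))
```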
To maximize performance, we want every instance to propose distinct transactions, such that every round results in m distinct transactions. In Section III-E, we delve into the details by which primaries can choose transactions to propose.
To meet design goals D4 and D5, individual BCA instances in RCC can continuously propose and replicate transactions: ordering and execution of the transactions replicated in a round by the m instances is done in parallel to the proposal and replication of transactions for future rounds. Consequently,
non-faulty primaries can utilize their entire outgoing network bandwidth for proposing transactions, even if other replicas or primaries act maliciously.

Fig. 3. A high-level overview of RCC running at replica R. Replica R participates in m concurrent instances of BCA (that run independently and continuously output transactions). The instances yield m transactions, which are executed in a deterministic order.
Let ⟨Ti⟩ci be the transaction Ti requested by ci and proposed by Pi in round ρ. After all m instances complete round ρ, each replica can collect the set of transactions S = {⟨Ti⟩ci | 1 ≤ i ≤ m}. By Assumption A2, all non-faulty replicas will obtain the same set S. Next, all replicas choose an order on S and execute all transactions in that order. For now, we assume that the transaction ⟨Ti⟩ci is executed as the i-th transaction of round ρ. In Section IV, we show that a more advanced ordering scheme can further improve the resilience of consensus against malicious behavior. As a direct consequence of Assumption A4, we have the following:
Proposition III.2. Consider RCC running in a system with n replicas, n > 3f. If all m instances have non-faulty primaries and communication is reliable, then, in each round, all non-faulty replicas will accept the same set of m transactions and execute these transactions in the same order.
As all non-faulty replicas will execute each transaction ⟨Ti⟩ci ∈ S, there are nf distinct non-faulty replicas that can inform the client of the outcome of execution. As all non-faulty replicas operate deterministically and execute the transactions in the same order, client ci will receive identical outcomes from nf > f replicas, guaranteeing that this outcome is correct.
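On the client side, this guarantee translates into the usual rule of waiting for more than f identical outcomes; the sketch below illustrates that generic rule and is not tied to any particular client library.

```python
# Sketch: a client accepts an execution outcome once more than f replicas report
# the same result, which guarantees that at least one of them is non-faulty.
from collections import Counter

def accept_outcome(replies, f):
    """replies: iterable of (replica_id, outcome) pairs collected so far."""
    counts = Counter(outcome for _, outcome in replies)
    for outcome, votes in counts.items():
        if votes > f:           # more than f identical replies: outcome is correct
            return outcome
    return None                 # otherwise keep waiting for more replies
```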
In the above, we described the normal-case operations of RCC. As in normal primary-backup protocols, individual instances in RCC can be subject to both detectable and undetectable failures. Next, we deal with these two types of failures.
C. Dealing with Detectable Failures
Consensus-based systems typically operate in an environ-
ment with asynchronous communication: messages can get
lost, arrive with arbitrary delays, and in arbitrary order.
Consequently, it is impossible to distinguish between, on the
one hand, a primary that is malicious and does not send
out proposals and, on the other hand, a primary that does
send out proposals that get lost in the network. As such,
asynchronous consensus protocols can only provide progress
in periods of reliable bounded-delay communication during
which all messages sent by non-faulty replicas will arrive at
their destination within some maximum delay [24], [25].
Recovery request role (used by replica R):
1: event R detects failure of the primary Pi, 1 ≤ i ≤ m, in round ρ do
2:   R halts Ii.
3:   Let P be the state of R in accordance with Assumption A3.
4:   Broadcast FAILURE(i, ρ, P) to all replicas.
5: event R receives f + 1 messages mj = FAILURE(i, ρj, Pj) such that:
     1) these messages are sent by a set S of |S| = f + 1 distinct replicas;
     2) all f + 1 messages are well-formed; and
     3) ρj, 1 ≤ j ≤ f + 1, comes after the round in which Ii started last
   do
6:   R detects failure of Pi (if not yet done so).

Recovery leader role (used by leader Li of P):
7: event Li receives nf messages mj = FAILURE(i, ρj, Pj) such that:
     1) these messages are sent by a set S of |S| = nf distinct replicas;
     2) all nf messages are well-formed; and
     3) ρj, 1 ≤ j ≤ nf, comes after the round in which Ii started last
   do
8:   Propose stop(i; {m1, . . . , mnf}) via P.

State recovery role (used by replica R):
9: event R accepts stop(i; E) from Li via P do
10:  Recover the state of Ii using E in accordance with Assumption A3.
11:  Determine the last round ρ for which Ii accepted a proposal.
12:  Set ρ + 2^f, with f the number of accepted stop(i; E′) operations, as the next valid round number for instance Ii.

Fig. 4. The recovery algorithm of RCC.
To be able to deal with failures, RCC assumes that any failure of non-faulty replicas to receive proposals from a primary Pi, 1 ≤ i ≤ m, is due to failure of Pi, and we design the recovery process such that it can also recover from failures due to unreliable communication. Furthermore, in accordance with the wait-free design goals D4 and D5, the recovery process is designed so that it does not interfere with other BCA instances or other recovery processes. Now assume that primary Pi of Ii, 1 ≤ i ≤ m, fails in round ρ. The recovery process consists of three steps:
1) All non-faulty replicas need to detect failure of Pi.
2) All non-faulty replicas need to reach agreement on the state of Ii: which transactions have been proposed by Pi and have been accepted in the rounds up to ρ.
3) To deal with unreliable communication, all non-faulty replicas need to determine the round in which Pi is allowed to resume its operations.
To reach agreement on the state of Ii, we rely on a separate instance of the consensus protocol P that is only used to coordinate agreement on the state of Ii during failure. This coordinating consensus protocol P replicates stop(i; E) operations, in which E is a set of nf FAILURE messages sent by nf distinct replicas from which all accepted proposals in instance Ii can be derived. We notice that P is—itself—an instance of a primary-backup protocol that is coordinated by some primary Li (based on the current view in which the instance of P operates), and we use the standard machinery of P to deal with failures of that leader (see Section III-A). Next, we shall describe how the recovery process is initiated. The details of this protocol can be found in Figure 4.
When a replica R detects failure of instance Ii, 0 ≤ i < m, in round ρ, it broadcasts a message FAILURE(i, ρ, P), in which P is the state of R in accordance with Assumption A3 (Line 1 of Figure 4). To deal with unreliable communication, R will continuously broadcast this FAILURE message with an exponentially-growing delay until it learns how to proceed with Ii. To reduce communication in the normal-case operations of P, one can send the full message FAILURE(i, ρ, P) to only Li, while sending FAILURE(i, ρ) to all other replicas.
If a replica receives f + 1 FAILURE messages from distinct replicas for a certain instance Ii, then it received at least one such message from a non-faulty replica. Hence, it can detect failure of Ii (Line 5 of Figure 4). Finally, if a replica R receives nf FAILURE messages from distinct replicas for a certain instance Ii, then we say there is a confirmed failure, as R has the guarantee that eventually—within at most two message delays—also the primary Li of P will receive nf FAILURE messages (if communication is reliable). Hence, at this point, R sets a timer based on some internal timeout value (that estimates the message delay) and waits for the leader Li to propose a valid stop-operation or for the timer to run out. In the latter case, replica R detects failure of the leader Li and follows the steps of a view-change in P to (try to) replace Li. When the leader Li receives nf FAILURE messages, it can and must construct a valid stop-operation and reach consensus on this operation (Line 7 of Figure 4). After reaching consensus, each replica can recover to a common state of Ii:
Theorem III.3. Consider RCC running in a system with n replicas. If n > 3f, an instance Ii, 0 ≤ i < m, has a confirmed failure, and the last proposal of Pi accepted by a non-faulty replica was in round ρ, then—whenever communication becomes reliable—the recovery protocol of Figure 4 will assure that all non-faulty replicas will recover the same state, which will include all proposals accepted by non-faulty replicas before-or-at round ρ.
Proof. If communication is reliable and instance Ii has a confirmed failure, then all non-faulty replicas will detect this failure and send FAILURE messages (Line 1 of Figure 4). Hence, all replicas are guaranteed to receive at least nf FAILURE messages, and any replica will be able to construct a well-formed operation stop(i; E). Hence, P will eventually be forced to reach consensus on stop(i; E). Consequently, all non-faulty replicas will conclude on the same state for instance Ii. Now consider a transaction T accepted by non-faulty replica Q in instance Ii. Due to Assumption A3, Q will only accept T if T can be recovered from the state of any set of nf − f non-faulty replicas. As |E| = nf (Line 7 of Figure 4), the set E contains the state of nf − f non-faulty replicas. Hence, T must be recoverable from E.
We notice that the recovery algorithm of RCC, as outlined in Figure 4, only affects the capabilities of the BCA instance that is stopped. All other BCA instances can concurrently propose transactions for current and for future rounds. Hence, the recovery algorithm adheres to the wait-free design goals D4 and D5. Furthermore, we reiterate that we have a separate instance of the coordinating consensus protocol for each instance Ii, 1 ≤ i ≤ m. Hence, recovery of several instances can happen concurrently, which minimizes the time it takes to recover from several simultaneous primary failures and, consequently, minimizes the delay before a round can be executed during primary failures.
Confirmed failures not only happen due to malicious behavior. Instances can also fail due to periods of unreliable communication. To deal with this, we eventually restart any stopped instances. To prevent instances coordinated by malicious replicas from continuously causing recovery of their instances, every failure incurs an exponentially growing restart penalty (Line 12 of Figure 4). The exact round in which an instance can resume operations can be determined deterministically from the accepted history of stop-requests. When all instances have failed in a round due to unreliable communication (which can be detected from the history of stop-requests), any instance is allowed to resume operations in the earliest available round (after which all other instances are also required to resume operations).
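The failure-detection thresholds and the restart penalty described above can be summarized as follows; the names are ours, and the penalty is written as 2 to the number of accepted stop-operations, matching the exponential growth indicated by Line 12 of Figure 4.

```python
# Sketch: failure-detection thresholds and restart penalty of Section III-C.
# The structure mirrors Figure 4; it is not an actual RESILIENTDB interface.

def detected_failure(senders, f):
    """f + 1 FAILURE messages from distinct replicas: at least one is non-faulty."""
    return len(set(senders)) >= f + 1

def confirmed_failure(senders, nf):
    """nf FAILURE messages from distinct replicas: a valid stop(i; E) can be built."""
    return len(set(senders)) >= nf

def next_valid_round(last_accepted_round, accepted_stops):
    """Exponentially growing restart penalty (cf. Line 12 of Figure 4)."""
    return last_accepted_round + 2 ** accepted_stops
```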
D. Dealing with Undetectable Failures
As stated in Assumption A1, a malicious primary Pi of a BCA instance Ii is able to keep up to f non-faulty replicas in the dark without being detected. In normal primary-backup protocols, this is not a huge issue: at least nf − f > f non-faulty replicas still accept transactions, and these replicas can execute and reliably inform the client of the outcome of execution. This is not the case in RCC, however:
Example III.4. Consider a system with n = 3f + 1 = 7 replicas. Assume that primaries P1 and P2 are malicious, while all other primaries are non-faulty. We partition the non-faulty replicas into three sets A1, A2, and B with |A1| = |A2| = f and |B| = 1. In round ρ, the malicious primary Pi, i ∈ {1, 2}, proposes transaction ⟨Ti⟩ci to only the non-faulty replicas in Ai ∪ B. This situation is sketched in Figure 5. After all concurrent instances of BCA finish round ρ, we see that the replicas in A1 have accepted ⟨T1⟩c1, the replicas in A2 have accepted ⟨T2⟩c2, and only the replica in B has accepted both ⟨T1⟩c1 and ⟨T2⟩c2. Hence, only the single replica in B can proceed with execution of round ρ. Notice that, due to Assumption A1, we consider all instances as finished successfully. If n ≥ 10 and f ≥ 3, this example attack can be generalized such that also the replica in B is missing at least a single client transaction.
To deal with in-the-dark attacks of Example III.4, we can run a standard checkpoint algorithm for each BCA instance: if the system does not reach confirmed failure of Pi in round ρ, 1 ≤ i ≤ m, then, by Assumptions A1 and A2, at least nf − f non-faulty replicas have accepted the same transaction T in round ρ of Ii. Hence, by Assumption A3, a standard checkpoint algorithm (e.g., the one of PBFT or one based on delayed replication [26]) that exchanges the state of these at-least-(nf − f) non-faulty replicas among all other replicas is sufficient to assure that all non-faulty replicas eventually accept T. We notice that these checkpoint algorithms can be
run concurrently with the operations of BCA instances, thereby adhering to our wait-free design goals D4 and D5.

Fig. 5. An attack possible when parallelizing BCA: malicious primaries can prevent non-faulty replicas from learning all client requests in a round, thereby preventing timely round execution. The faulty primary Pi, i ∈ {1, 2}, does so by only letting the non-faulty replicas in Ai ∪ B participate in instance Ii.
To reduce the cost of checkpoints, typical consensus systems only perform checkpoints after every x-th round for some system-defined constant x. Due to in-the-dark attacks, applying such a strategy to RCC means choosing between execution latency and throughput. Consequently, in RCC we do checkpoints on a dynamic per-need basis: when replica R receives nf − f claims of failure of primaries (via the FAILURE messages of the recovery protocol) in round ρ and R itself finished round ρ for all its instances, then it will participate in any attempt for a checkpoint for round ρ. Hence, if an in-the-dark attack affects more than f distinct non-faulty replicas in round ρ, then a successful checkpoint will be made and all non-faulty replicas recover from the attack, accept all transactions in round ρ, and execute all these transactions.
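The dynamic per-need checkpoint trigger reduces to a simple predicate, sketched below with illustrative names for the bookkeeping a replica would maintain.

```python
# Sketch: the dynamic per-need checkpoint trigger of Section III-D. A replica joins
# a checkpoint for round rnd once it finished that round itself and has received
# nf - f distinct failure claims for it. All names are illustrative.

def should_checkpoint(rnd, finished_rounds, failure_claims, n, f):
    nf = n - f
    claimants = failure_claims.get(rnd, set())   # replicas claiming some primary failed in rnd
    return rnd in finished_rounds and len(claimants) >= nf - f
```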
Using Theorem III.3 to deal with detectable failures and using checkpoint protocols to deal with replicas in the dark, we conclude that RCC adheres to design goal D1:
Theorem III.5. Consider RCC running in a system with n replicas. If n > 3f, then RCC provides consensus in periods in which communication is reliable.
E. Client Interactions with RCC
To maximize performance, it is important that every instance proposes distinct client transactions, as proposing the same client transaction several times would reduce throughput. We have designed RCC with faulty clients in mind; hence, we do not expect cooperation of clients to assure that they send their transactions to only a single primary.
To be able to do so, the design of RCC is optimized for the case in which there are always many more concurrent clients than replicas in the system. In this setting, we assign every client c to a single primary Pi, 1 ≤ i ≤ m = n, such that only instance Ii can propose client requests of c. For this design to work in all cases, we need to solve two issues, however: we need to deal with situations in which primaries do not receive client requests (e.g., during downtime periods in which only few transactions are requested), and we need to deal with faulty primaries that refuse to propose requests of some clients.
First, if there are fewer concurrent clients than replicas in the system, e.g., when demand for services is low, then RCC still needs to process client transactions correctly, but it can do so without optimally utilizing the resources available, as this would not impact throughput in this case due to the low demands. If a primary Pi, 1 ≤ i ≤ m, does not have transactions to propose in any round ρ and Pi detects that other BCA instances are proposing for round ρ (e.g., as it receives proposals), then Pi proposes a small no-op request instead.
Second, to deal with a primary Pi, 1 ≤ i ≤ m, that refuses to propose requests of some clients, we take a two-step approach. First, we incentivize malicious primaries not to refuse services, as otherwise they will be detected as faulty and lose the ability to propose transactions altogether. To detect failure of Pi, RCC uses standard techniques to enable a client c to force execution of a transaction T. First, c broadcasts ⟨T⟩c to all replicas. Each non-faulty replica R will then forward ⟨T⟩c to the appropriate primary Pi, 1 ≤ i ≤ m. Next, if the primary Pi does not propose any transaction requested by c within a reasonable amount of time, then R detects failure of Pi. Hence, refusal of Pi to propose ⟨T⟩c will lead to primary failure, incentivizing malicious primaries to provide service.
Finally, we need to deal with primaries that are unwilling or incapable of proposing requests of c, e.g., when the primary crashes. To do so, c can request to be reassigned to another instance Ij, 1 ≤ j ≤ m, by broadcasting a request m := SWITCHINSTANCE(c, j) to all replicas. Reassignment is handled by the coordinating consensus protocol P for Ii, which will reach consensus on m. Malicious clients can try to use reassignment to propose transactions in several instances at the same time. To deal with this, we assume that no instance is more than σ rounds behind any other instance (see Section IV). Now, consider the moment at which replica R accepts m and let ρ(m, R) be the maximum round in which any request has been proposed by any instance in which R participates. The primary Pi will stop proposing transactions of c immediately. Any non-faulty replica R will stop accepting transactions of c by Ii after round ρ(m, R) + σ and will start accepting transactions of c by Ij after round ρ(m, R) + 2σ. Finally, Pj will start proposing transactions of c in round ρ(m, Pj) + 3σ.
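The round thresholds used during reassignment can be captured as in the following sketch; σ and the round counter ρ(m, R) are as described above, and the function and field names are ours.

```python
# Sketch: round thresholds after accepting m := SWITCHINSTANCE(c, j) (Section III-E).
# rho is the replica's own ρ(m, R); the new primary P_j evaluates its own ρ(m, P_j).
def switch_schedule(rho, sigma):
    return {
        "old_instance_rejects_c_after": rho + sigma,        # I_i stops accepting c
        "new_instance_accepts_c_after": rho + 2 * sigma,    # I_j starts accepting c
        "new_primary_proposes_for_c_at": rho + 3 * sigma,   # P_j starts proposing for c
    }
```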
IV. RCC: IMPROVING RESILIENCE OF CONSENSUS
Traditional primary-backup consensus protocols rely heavily on the operations of their primary. Although these protocols are designed to deal with primaries that completely fail to propose client transactions, they are not designed to deal with many other types of malicious behavior.
Example IV.1. Consider a financial service running on a traditional PBFT consensus-based system. In this setting, a malicious primary can affect operations in two malicious ways:
1) Ordering attack. The primary sets the order in which transactions are processed and, hence, can choose an ordering that best fits its own interests. To illustrate this, we consider client transactions of the form:
transfer(A, B, n, m) := if amount(A) > n then withdraw(A, m); deposit(B, m).
            Original   First T1, then T2    First T2, then T1
  Balance              T1        T2         T2        T1
  Alice     800        600       600        800       600
  Bob       300        500       200        300       500
  Eve       100        100       400        100       100

Fig. 6. Illustration of the influence of execution order on the outcome: switching around requests affects the transfer of T2.
Let T1 = transfer(Alice, Bob, 500, 200) and T2 = transfer(Bob, Eve, 400, 300). Before processing these transactions, the balance for Alice is 800, for Bob 300, and for Eve 100. In Figure 6, we summarize the results of either first executing T1 or first executing T2. As is clear from the figure, execution of T1 influences the outcome of execution of T2. As primaries choose the ordering of transactions, a malicious primary can choose an ordering whose outcome benefits its own interests, e.g., formulate targeted attacks to affect the execution of the transactions of some clients.
2) Throttling attack. The primary sets the pace at which the system processes transactions. We recall that individual replicas rely on time-outs to detect malicious behavior of the primary. This approach will fail to detect or deal with primaries that throttle throughput by proposing transactions as slowly as possible, while preventing failure detection due to time-outs.
Besides malicious primaries, other malicious entities can also take advantage of a primary-backup consensus protocol:
3) Targeted attack. As the throughput of a primary-backup system is entirely determined by the primary, attackers can send arbitrary messages to the primary. Even if the primary recognizes that these messages are irrelevant for its operations, it has to spend resources (network bandwidth, computational power, and memory) to do so, thereby reducing throughput. Notice that—in the worst case—this can even lead to failure of a non-faulty primary to propose transactions in a timely manner.
Where traditional consensus-based systems fail to deal with these attacks, the concurrent design of RCC can be used to mitigate these attacks.
First, we look at ordering attacks. To mitigate this type of attack, we propose a method to deterministically select a different permutation of the order of execution in every round, in such a way that this ordering is practically impossible to predict or influence by faulty replicas. Note that for any sequence S of k = |S| values, there exist k! distinct permutations. We write P(S) to denote these permutations of S. To deterministically select one of these permutations, we construct a function that maps an integer h ∈ {0, . . . , k! − 1} to a unique permutation in P(S). Then we discuss how replicas will uniformly pick h. As |P(S)| = k!, we can construct the following bijection fS : {0, . . . , k! − 1} → P(S):

$$f_S(i) = \begin{cases} S & \text{if } |S| = 1;\\ f_{S \setminus S[q]}(r) \cdot S[q] & \text{if } |S| > 1,\end{cases}$$

in which q = i div (|S| − 1)! is the quotient and r = i mod (|S| − 1)! is the remainder of integer division by (|S| − 1)!. Using induction on the size of S, we can prove:
Lemma IV.2. fS is a bijection from {0, . . . , |S|! − 1} to all possible permutations of S.
Let S be the sequence of all transactions accepted in round ρ, ordered by increasing instance. The replicas uniformly pick h = digest(S) mod (k! − 1), in which digest(S) is a strong cryptographic hash function that maps an arbitrary value v to a numeric digest value in a bounded range such that it is practically impossible to find another value S′, S ≠ S′, with digest(S) = digest(S′). When at least one primary is non-malicious (m > f), the final value h is only known after completion of round ρ and it is practically impossible to predictably influence this value. After selecting h, all replicas execute the transactions in S in the order given by fS(h).
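A direct transcription of fS, together with a hash-based selection of h, could look as follows; SHA-256 over a simple serialization of S stands in for digest(S), and, for simplicity, the sketch maps the digest onto the full index range {0, . . . , k! − 1} of fS.

```python
# Sketch: the permutation bijection f_S and a hash-based selection of h (Section IV).
# SHA-256 over a simple serialization of S stands in for digest(S).
import hashlib
import math

def f_S(S, i):
    """Map i in {0, ..., |S|! - 1} to a unique permutation of the sequence S."""
    if len(S) == 1:
        return list(S)
    block = math.factorial(len(S) - 1)
    q, r = divmod(i, block)               # q = i div (|S|-1)!, r = i mod (|S|-1)!
    rest = list(S[:q]) + list(S[q + 1:])  # S with S[q] removed
    return f_S(rest, r) + [S[q]]          # permutation of the rest, followed by S[q]

def ordered_execution(S):
    """Derive h from a digest of S and return the transactions in the selected order."""
    digest = int.from_bytes(hashlib.sha256(repr(S).encode()).digest(), "big")
    h = digest % math.factorial(len(S))   # map the digest into the index range of f_S
    return f_S(list(S), h)

# Example: every replica holding the same S derives the same execution order.
print(ordered_execution(["T1", "T2", "T3"]))
```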
To deal with primaries that throttle their instances, non-faulty replicas will detect failure of those instances that lag behind other instances. Specifically, if an instance Ii, 1 ≤ i ≤ m, is σ rounds behind any other instance (for some system-dependent constant σ), then R detects failure of Pi.
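Detecting throttling instances then reduces to a lag check, sketched below with illustrative names.

```python
# Sketch: detecting throttling primaries (Section IV). An instance whose latest round
# lags sigma or more rounds behind the most advanced instance is flagged as failed.
def throttling_instances(latest_round, sigma):
    """latest_round: dict mapping instance index -> last round it proposed in."""
    front = max(latest_round.values())
    return [i for i, rnd in latest_round.items() if front - rnd >= sigma]
```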
Finally, we notice that concurrent consensus and RCC provide, by design, load balancing with respect to the tasks of the primary, by spreading the total workload of the system over many primaries. As such, RCC not only improves performance when bounded by the primary bandwidth, but also when performance is bounded by computational power (e.g., due to costly cryptographic primitives) or by message delays. Furthermore, this load balancing reduces the load on any single primary to propose and process a given amount of transactions, dampening the effects of any targeted attacks against the resources of a single primary.
V. EVALUATION OF THE PERFORMANCE OF RCC
In the previous sections, we proposed concurrent consensus and presented the design of RCC, our concurrent consensus paradigm. To show that concurrent consensus not only provides benefits in theory, we study the performance of RCC and the effects of concurrent consensus in a practical setting. To do so, we measure the performance of RCC in RESILIENTDB—our high-performance resilient blockchain fabric—and compare RCC with the well-known primary-backup consensus protocols PBFT, ZYZZYVA, SBFT, and HOTSTUFF. With this study, we aim to answer the following questions:
Q1) What is the performance of RCC: does RCC deliver on the promises of concurrent consensus and provide more throughput than any primary-backup consensus protocol can provide?
Q2) What is the scalability of RCC: does RCC deliver on the promises of concurrent consensus and provide better scalability than primary-backup consensus protocols?
Q3) Does RCC provide sufficient load balancing of primary tasks to improve performance of consensus by offsetting any high costs incurred by the primary?
Q4) How does RCC fare under failures?
Q5) What is the impact of batching client transactions on the performance of RCC?
First, in Section V-A, we describe the experimental setup.
Then, in Section V-B, we provide a high-level overview of
RESILIENTDB and of its general performance characteristics.
Next, in Section V-C, we provide details on the consensus
protocols we use in this evaluation. Then, in Section V-D, we
present the experiments we performed and the measurements
obtained. Finally, in Section V-E, we interpret these measure-
ments and answer the above research questions.
A. Experimental Setup
To be able to study the practical performance of RCC and
other consensus protocols, we choose to study these protocols
in a full resilient database system. To do so, we implemented
RCC in RESILIENTDB. To generate a workload for the
protocols, we used the Yahoo Cloud Serving Benchmark [27]
provided by the Blockbench macro benchmarks [28]. In the
generated workload, each client transaction queries a YCSB
table with half a million active records and 90% of the trans-
actions write and modify records. Prior to the experiments,
each replica is initialized with an identical copy of the YCSB
table. We perform all experiments in the Google Cloud. In
specific, each replica is deployed on a c2-machine with a 16-
core Intel Xeon Cascade Lake CPU running at 3.8 GHz and
with 32 GB memory. We use up to 320 k clients, deployed on
16 machines.
B. The RESILIENTDB Blockchain Fabric
The RESILIENTDB fabric incorporates secure permissioned
blockchain technologies to provide resilient data processing.
A detailed description of how RESILIENTDB achieves high-
throughput consensus in a practical setting can be found in
Gupta et al. [21], [29], [30], [31], [32]. The architecture of
RESILIENTDB is optimized for maximizing throughput via
multi-threading and pipelining. To further maximize through-
put and minimize the overhead of any consensus protocol,
RESILIENTDB has built-in support for batching of client
transactions.
We typically group 100 txn/batch. In this case, the size of a
proposal is 5400 B and of a client reply (for 100 transactions)
is 1748 B. The other messages exchanged between replicas
during the Byzantine commit algorithm have a size of 250 B.
RESILIENTDB supports out-of-order processing of transac-
tions in which primaries can propose future transactions before
current transactions are executed. This allows RESILIENTDB
to maximize throughput of any primary-backup protocol that
supports out-of-order processing (e.g., PBFT, ZYZZYVA, and SBFT) by maximizing bandwidth utilization at the primary.
In RESILIENTDB, each replica maintains a blockchain
ledger (a journal) that holds an ordered copy of all executed
transactions. The ledger not only stores all transactions, but
also proofs of their acceptance by a consensus protocol. As these proofs are built using strong cryptographic primitives, the ledger is immutable and, hence, can be used to provide strong data provenance.

Fig. 7. Characteristics of RESILIENTDB deployed on the Google Cloud. Left, the maximum performance of a single replica that receives client transactions, optionally executes them (Full), and sends replies. Right, the performance of PBFT with n = 16 replicas that uses no cryptography (None), uses ED25519 public-key cryptography (PK), or uses CMAC-AES message authentication codes (MAC) to authenticate messages.
In our experiments, replicas not only perform consensus, but also communicate with clients and execute transactions. In this practical setting, performance is not fully determined by bandwidth usage due to consensus (as outlined in Section I-A), but also by the cost of communicating with clients, of sequential execution of all transactions, of cryptography, and of other steps involved in processing messages and transactions, and by the available memory limitations. To illustrate this, we have measured the effects of client communication, execution, and cryptography on our deployment of RESILIENTDB.
In Figure 7, left, we present the maximum performance of a single replica that receives client transactions, optionally executes them (Full), and sends replies (without any consensus steps). In this figure, we count the total number of client transactions that are completed during the experiment. As one can see, the system can receive and respond to up to 551 ktxn/s, but can only execute up to 217 ktxn/s.
In Figure 7, right, we present the maximum performance of PBFT running on n = 16 replicas as a function of the cryptographic primitives used to provide authenticated communication. Specifically, PBFT can either use digital signatures or message authentication codes. For this comparison, we compare PBFT using: (1) a baseline that does not use any message authentication (None); (2) ED25519 digital signatures for all messages (DS); and (3) CMAC+AES message authentication codes for all messages exchanged between replicas and ED25519 digital signatures for client transactions. As can be seen from the results, the costs associated with digital signatures are huge, as their usage reduces performance by 86%, whereas message authentication codes only reduce performance by 33%.
C. The Consensus Protocols
We evaluate the performance of RCC by comparing it with
a representative sample of efficient practical primary-backup
consensus protocols:
PBFT [15]: We use a heavily optimized out-of-order implementation that uses message authentication codes.
RCC: Our RCC implementation follows the design outlined in this paper. We have chosen to turn PBFT into a concurrent consensus protocol. We test with three variants: RCCn runs n concurrent instances, RCCf+1 runs f + 1 concurrent instances (the minimum to provide the benefits outlined in Section IV), and RCC3 runs 3 concurrent instances.
ZYZZYVA [16]: As described in Section I-A, ZYZZYVA has an optimal-case path due to which the performance of ZYZZYVA provides an upper bound for any primary-backup protocol (when no failures occur). Unfortunately, the failure handling of ZYZZYVA is costly, making ZYZZYVA unable to deal with any failures efficiently.
SBFT [17]: This protocol uses threshold signatures to minimize communication during the state exchange that is part of its Byzantine commit algorithm. Threshold signatures do not reduce the communication costs for the primary to propose client transactions, which have a major influence on performance in practice (see Section I-A), but can potentially greatly reduce all other communication costs.
HOTSTUFF [18]: As SBFT, HOTSTUFF uses threshold signatures to minimize communication. The state exchange of HOTSTUFF has an extra phase compared to PBFT. This additional phase simplifies changing views in HOTSTUFF, and enables HOTSTUFF to regularly switch primaries (which limits the influence of any faulty replicas). Due to this design, HOTSTUFF does not support out-of-order processing (see Section I-A). As a consequence, HOTSTUFF is more affected by message delays than by bandwidth. In our implementation, we have used the efficient single-phase event-based variant of HOTSTUFF.
D. The Experiments
To be able to answer Questions Q1–Q5, we perform four experiments in which we measure the performance of RCC. In each experiment, we measure the throughput as the number of transactions that are executed per second, and we measure the latency as the time from when a client sends a transaction to the time at which that client receives a response. We run each experiment for 180 s: the first 60 s are warm-up, and measurement results are collected over the next 120 s. We average our results over three runs. The results of all four experiments can be found in Figure 8.
In the first experiment, we measure the best-case performance of the consensus protocols as a function of the number of replicas when all replicas are non-faulty. We vary the number of replicas between n = 4 and n = 91 and we use a batch size of 100 txn/batch. The results can be found in Figure 8, (a) and (b).
In the second experiment, we measure the performance of the consensus protocols as a function of the number of replicas during failure of a single replica. Again, we vary the number of replicas between n = 4 and n = 91 and we use a batch size of 100 txn/batch. The results can be found in Figure 8, (c) and (d).
In the third experiment, we measure the performance of the
consensus protocols as a function of the number of replicas
during failure of a single replica while varying the batch size between 10 txn/batch and 400 txn/batch. We use n = 32 replicas. The results can be found in Figure 8, (e) and (f).

Fig. 8. Evaluating system throughput and average latency incurred by RCC (RCCn, RCCf+1, RCC3) and other consensus protocols (PBFT, ZYZZYVA, SBFT, HOTSTUFF): (a) and (b) scalability with no failures; (c) and (d) scalability with a single failure; (e) and (f) batching with a single failure; (g) and (h) out-of-order processing disabled.
In the fourth and final experiment, we measure the per-
formance of the consensus protocols when outgoing primary
bandwidth is not the limiting factor. We do so by disabling out-
of-order processing in all protocols that support out-of-order
processing. This makes the performance of these protocols
inherently bounded by the message delay and not by network
bandwidth. We study this case by varying the number of
replicas between n= 4 and n= 91 and we use a batch
size of 100 txn/batch. The results can be found in Figure 8,
(g) and (h).
E. Discussion
From the experiments, a few obvious patterns emerge. First, we see that increasing the batch size ((e) and (f)) increases the performance of all consensus protocols (Q5). This is in line with what one can expect (see Section I-A and Section II). As the gains beyond 100 txn/batch are small, we have chosen to use 100 txn/batch in all other experiments.
Second, we see that the three versions of RCC outperform all other protocols, and the performance of RCC with or without failures is comparable ((a)–(d)). Furthermore, we see that adding concurrency by adding more instances improves performance, as RCC3 is outperformed by the other RCC versions. On small deployments with n = 4, . . . , 16 replicas, the strength of RCC is most evident, as our RCC implementations approach the maximum rate at which RESILIENTDB can execute transactions (see Section V-B).
Third, we see that RCC easily outperforms ZYZZYVA, even in the best-case scenario of no failures ((a) and (b)). We also see that ZYZZYVA is—indeed—the fastest primary-backup consensus protocol when no failures happen. This underlines the ability of RCC, and of concurrent consensus in general, to reach throughputs no primary-backup consensus protocol can reach. We also notice that ZYZZYVA fails to deal with failures ((c) and (d)), in which case its performance plummets, a case that the other protocols have no issues dealing with.
Finally, due to the lack of out-of-order processing capabilities in HOTSTUFF, HOTSTUFF is uncompetitive with out-of-order protocols. When we disable out-of-order processing for all other protocols ((g) and (h)), the strength of the simple design of HOTSTUFF shows: its event-based single-phase design outperforms all other primary-backup consensus protocols. Due to the concurrent design of RCC, a non-out-of-order RCC is still able to greatly outperform HOTSTUFF, however, as the non-out-of-order variants of RCC balance the entire workload over many primaries. Furthermore, as the throughput is not bound by any replica resources in this case (and only by network delays), the non-out-of-order variants RCCf+1 and RCCn benefit from increasing the number of replicas, as this also increases the amount of concurrent processing (due to increasing the number of instances).
Summary: RCC implementations achieve up to 2.77×, 1.53×, 38×, and 82× higher throughput than SBFT, PBFT, HOTSTUFF, and ZYZZYVA, respectively, in the single-failure experiments. RCC implementations achieve up to 2×, 1.83×, 33×, and 1.45× higher throughput than SBFT, PBFT, HOTSTUFF, and ZYZZYVA, respectively, in the no-failure experiments.
Based on these observations, we conclude that RCC delivers on the promises of concurrent consensus. RCC provides more throughput than any primary-backup consensus protocol can provide (Q1). Moreover, RCC provides great scalability if throughput is only bounded by the primaries: as the non-out-of-order results show, the load-balancing capabilities of RCC can even offset inefficiencies in other parts of the consensus protocol (Q2, Q3). Finally, we conclude that RCC can efficiently deal with failures (Q4). Hence, RCC meets the design goals D1–D5 that we set out in Section III.
Fig. 9. Evaluating the throughput (txn/s) and latency (s) attained by the three RCC variants RCC-P, RCC-Z, and RCC-S as a function of the number of replicas (n) when there are no failures.
F. Analyzing RCC as a Paradigm
Finally, we experimentally illustrate the ability of RCC to act as a paradigm. To do so, we apply RCC not only to PBFT, but also to ZYZZYVA and SBFT. In Figure 9, we plot the performance of these three variants of RCC: RCC-P (RCC+PBFT), RCC-Z (RCC+ZYZZYVA), and RCC-S (RCC+SBFT). To evaluate the scalability of these protocols, we perform experiments in the optimistic setting with no failures and m = n concurrent instances.
It is evident from these plots that all RCC variants achieve extremely high throughput. As SBFT and ZYZZYVA only require linear communication in the optimistic case, RCC-S and RCC-Z are able to achieve up to 3.33× and 2.78× higher throughput than RCC-P, respectively.
Notice that RCC-S consistently attains equal or higher throughput than RCC-Z, even though ZYZZYVA scales better than SBFT. This phenomenon is caused by the way RCC-Z interacts with clients. Specifically, like ZYZZYVA, RCC-Z requires its clients to wait for responses from all n replicas. Hence, clients have to wait longer before they can place new transactions, and consequently RCC-Z requires more clients than RCC-S to attain maximum performance. Even if we ran RCC-Z with 5 million clients, the largest number at our disposal, we would not see maximum performance. Due to the low single-primary performance of ZYZZYVA, this phenomenon does not prevent ZYZZYVA itself from reaching its maximum performance.
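This client-side effect can be approximated with Little's law, which is our framing rather than an analysis from the paper: the number of concurrent clients needed to sustain a target throughput is roughly that throughput multiplied by the per-client latency, and waiting for responses from all n replicas increases that latency. The target rate and latency values below are hypothetical.

# Hedged estimate based on Little's law: clients ~ throughput * per-client latency.
# The values below are hypothetical and only illustrate the qualitative effect.
def clients_needed(target_txn_per_s: float, per_client_latency_s: float) -> int:
    """Approximate number of concurrent clients needed to sustain the target rate."""
    return round(target_txn_per_s * per_client_latency_s)

TARGET = 300_000  # assumed target throughput (txn/s)
print(clients_needed(TARGET, per_client_latency_s=0.5))  # shorter client wait: ~150,000 clients
print(clients_needed(TARGET, per_client_latency_s=2.0))  # waiting for all replies: ~600,000 clients
# Longer per-client waits translate directly into needing many more clients to
# keep the replicas saturated, matching the observation for RCC-Z.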
VI. RELATED WORK
In Section I-A, we already discussed well-known primary-backup consensus protocols such as PBFT, ZYZZYVA, and HOTSTUFF, and why these protocols underutilize resources. Furthermore, there is abundant literature on consensus and on primary-backup consensus in particular (e.g., [11], [33], [34], [35]). Next, we focus on the few works that deal with either improving throughput and scalability or improving resilience, the two strengths of RCC.
Parallelizing consensus: Several recent consensus designs propose to run several primaries concurrently, e.g., [20], [36], [37], [38]. None of these proposals satisfy all design goals of RCC, however. Specifically, these proposals all fall short with respect to maximizing potential throughput in all cases, as none of them satisfy the wait-free design goals D4 and D5 of RCC.
Fig. 10. Throughput (txn/s) over time (s) of RCC versus MIRBFT during instance failures with m = 11 instances. At (a), primary P1 fails. In RCC, all other instances are unaffected, whereas in MIRBFT all replicas need to coordinate recovery. At (b), recovery is finished. In RCC, all instances can resume work, whereas MIRBFT halts an instance due to recovery. At (c), primaries P1 and P2 fail. In RCC, P2 will be recovered at (d) and P1 at (e) (as P1 failed twice, its recovery in RCC takes twice as long). In MIRBFT, recovery is finished at (d), after which MIRBFT operates with only m = 9 instances. At (e) and (f), MIRBFT decides that the system is sufficiently reliable, and MIRBFT enables the remaining instances one at a time.
Example VI.1. The MIRBFT protocol proposes to run concurrent instances of PBFT in a fashion similar to RCC. The key difference is how MIRBFT deals with failures: MIRBFT operates in global epochs in which a super-primary decides which instances are enabled. During any failure, MIRBFT will switch to a new epoch via a view-change protocol that temporarily shuts down all instances and consequently reduces throughput to zero. This is in sharp contrast to the wait-free design of RCC, in which failures are handled on a per-instance level. In Figure 10, we illustrate these differences in the failure recovery of RCC and MIRBFT.
As is clear from the figure, the fully-coordinated approach of MIRBFT results in substantial performance degradation during failure recovery. Hence, MIRBFT does not meet design goals D4 and D5, which sharply limits the throughput of MIRBFT when compared to RCC.
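To make this architectural difference concrete, the following toy sketch, written under our own simplifying assumptions and not reflecting the actual RCC or MIRBFT implementations, contrasts per-instance recovery with epoch-based global recovery.

# Hedged toy model contrasting the two recovery styles; not the actual protocols.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:
    running: bool = True

@dataclass
class PerInstanceRecovery:                     # RCC-style: wait-free, local recovery
    instances: List[Instance] = field(default_factory=list)

    def on_primary_failure(self, failed: int) -> None:
        # Only the affected instance replaces its primary; all others keep deciding.
        self.instances[failed].running = False
        # ... local primary replacement for this instance only ...
        self.instances[failed].running = True

@dataclass
class EpochBasedRecovery:                      # MirBFT-style: coordinated epoch change
    instances: List[Instance] = field(default_factory=list)

    def on_primary_failure(self, failed: int) -> None:
        # A global view-change pauses every instance until the new epoch starts.
        for inst in self.instances:
            inst.running = False
        # ... the new epoch decides which instances are enabled ...
        for i, inst in enumerate(self.instances):
            inst.running = (i != failed)       # the failed instance may stay disabled

rcc = PerInstanceRecovery([Instance() for _ in range(11)])
rcc.on_primary_failure(0)          # only instance 0 is briefly affected
mir = EpochBasedRecovery([Instance() for _ in range(11)])
mir.on_primary_failure(0)          # every instance pauses during the epoch change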
Reducing malicious behavior: Several works have ob-
served that traditional consensus protocols only address a nar-
row set of malicious behavior, namely behavior that prevents
any progress [20], [22], [23], [39]. Hence, several designs
have been proposed to also address behavior that impedes
performance without completely preventing progress. One
such design is RBFT, which uses concurrent primaries not to
improve performance—as we propose—but only to mitigate
throttling attacks in a way similar to what we described in
Section IV. In practice, the design of RBFT results in poor performance at a high cost.
HOTSTUFF [18], SPINNING [22], and PRIME [23] all propose to minimize the influence of malicious primaries by replacing the primary every round. This would not incur the costs of RBFT, while still reducing, but not eliminating, the ability of faulty replicas to severely reduce throughput. Unfortunately, these protocols follow the design of primary-backup consensus protocols and, as discussed in Section II, such designs are unable to achieve throughputs close to those reached by a concurrent consensus protocol such as RCC.
Concurrent consensus via sharding: Several recent works have proposed to speed up consensus-based systems by incorporating sharding, either at the data level (e.g., [5], [7], [40], [41], [42]) or at the consensus level (e.g., [43]). In these approaches, only a small subset of all replicas, those in a single shard, participates in the consensus on any given transaction, thereby reducing the costs of replicating this transaction and enabling concurrent transaction processing in independent shards. As such, sharded designs can promise huge scalability benefits for easily-sharded workloads. To do so, sharded designs utilize a weaker failure model than the fully-replicated model RCC uses, however. Consider, e.g., a sharded system with z shards of n = 3f + 1 replicas each. In this setting, the system can only tolerate the failure of up to f replicas in any single shard, whereas a fully-replicated system using the same zn replicas could tolerate the failure of any choice of ⌊(zn − 1)/3⌋ replicas. Furthermore, sharded designs typically operate consensus protocols such as PBFT in each shard to order local transactions, which opens the opportunity for concurrent consensus and RCC to achieve even higher performance in these designs.
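A concrete calculation, using illustrative numbers of our own choosing, shows the difference in fault tolerance described above.

# Hedged illustration of the fault-tolerance comparison from the text.
def sharded_tolerance(z: int, f: int) -> str:
    """z shards of n = 3f+1 replicas each tolerate at most f failures per shard."""
    return f"{z} shards of {3 * f + 1} replicas: at most {f} failures in any single shard"

def fully_replicated_tolerance(z: int, f: int) -> str:
    """A fully-replicated system over the same z*n replicas tolerates floor((zn-1)/3) failures."""
    total = z * (3 * f + 1)
    return f"{total} replicas, fully replicated: any {(total - 1) // 3} failures"

print(sharded_tolerance(z=4, f=1))           # 4 shards of 4 replicas: at most 1 per shard
print(fully_replicated_tolerance(z=4, f=1))  # 16 replicas: any 5 failures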
VII. CONCLUSION
In this paper, we proposed concurrent consensus as a major step toward enabling high-throughput and more scalable consensus-based database systems. We have shown that concurrent consensus is in theory able to achieve throughputs that primary-backup consensus systems are unable to achieve. To put the idea of concurrent consensus in practice, we proposed the RCC paradigm that can be used to make normal primary-backup consensus protocols concurrent. Furthermore,
we showed that RCC is capable of making consensus-based
systems more resilient to failures by sharply reducing the
impact of faulty replicas on the throughput and operations
of the system. We have also put the design of the RCC
paradigm to the test by implementing it in RESILIENTDB, our
high-performance resilient blockchain fabric, and comparing it
with state-of-the-art primary-backup consensus protocols. Our
experiments show that RCC is able to fulfill the promises
of concurrent consensus, as it significantly outperforms other
consensus protocols and provides better scalability. As such,
we believe that RCC opens the door to the development of new
high-throughput resilient database and federated transaction
processing systems.
Acknowledgements: We would like to acknowledge Sajjad Rahnama and Patrick J. Liao for their help during the initial stages of this work.
REFERENCES
[1] S. Gupta, J. Hellings, and M. Sadoghi, “Brief announcement: Revisiting consensus protocols
through wait-free parallelization,” in 33rd International Symposium on Distributed Computing
(DISC 2019), vol. 146. Schloss Dagstuhl, 2019, pp. 44:1–44:3.
[2] M. Herlihy, “Blockchains from a distributed computing perspective,” Commun. ACM, vol. 62,
no. 2, pp. 78–85, 2019.
[3] A. Narayanan and J. Clark, “Bitcoin’s academic pedigree,” Commun. ACM, vol. 60, no. 12, pp. 36–45, 2017.
[4] S. Gupta, J. Hellings, and M. Sadoghi, Fault-Tolerant Distributed Transactions on Blockchains,
ser. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2020, (to appear).
[5] M. J. Amiri, D. Agrawal, and A. E. Abbadi, “CAPER: A cross-application permissioned
blockchain,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1385–1398, 2019.
[6] E. Androulaki, A. Barger, V. Bortnikov, C. Cachin, K. Christidis, A. De Caro, D. Enyeart, C. Ferris,
G. Laventman, Y. Manevich, S. Muralidharan, C. Murthy, B. Nguyen, M. Sethi, G. Singh,
K. Smith, A. Sorniotti, C. Stathakopoulou, M. Vukolić, S. W. Cocco, and J. Yellick, “Hyperledger
Fabric: A distributed operating system for permissioned blockchains,” in Proceedings of the
Thirteenth EuroSys Conference. ACM, 2018, pp. 30:1–30:15.
[7] M. El-Hindi, C. Binnig, A. Arasu, D. Kossmann, and R. Ramamurthy, “BlockchainDB: A shared
database on blockchains,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1597–1609, 2019.
[8] S. Nathan, C. Govindarajan, A. Saraf, M. Sethi, and P. Jayachandran, “Blockchain meets database:
Design and implementation of a blockchain relational database,” Proc. VLDB Endow., vol. 12,
no. 11, pp. 1539–1552, 2019.
[9] F. Nawab and M. Sadoghi, “Blockplane: A global-scale byzantizing middleware,” in 35th
International Conference on Data Engineering (ICDE). IEEE, 2019, pp. 124–135.
[10] L. Lao, Z. Li, S. Hou, B. Xiao, S. Guo, and Y. Yang, “A survey of IoT applications in blockchain
systems: Architecture, consensus, and traffic modeling,” ACM Comput. Surv., vol. 53, no. 1, 2020.
[11] C. Cachin and M. Vukolic, “Blockchain consensus protocols in the wild (keynote talk),” in 31st
International Symposium on Distributed Computing, vol. 91. Schloss Dagstuhl, 2017, pp. 1:1–
1:16.
[12] J. Gray, “Notes on data base operating systems,” in Operating Systems, An Advanced Course.
Springer-Verlag, 1978, pp. 393–481.
[13] D. Skeen, “A quorum-based commit protocol,” Cornell University, Tech. Rep., 1982.
[14] S. Gupta and M. Sadoghi, “EasyCommit: A non-blocking two-phase commit protocol,” in
Proceedings of the 21st International Conference on Extending Database Technology. Open
Proceedings, 2018, pp. 157–168.
[15] M. Castro and B. Liskov, “Practical byzantine fault tolerance and proactive recovery,” ACM Trans.
Comput. Syst., vol. 20, no. 4, pp. 398–461, 2002.
[16] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong, “Zyzzyva: Speculative byzantine fault
tolerance,” ACM Trans. Comput. Syst., vol. 27, no. 4, pp. 7:1–7:39, 2009.
[17] G. Golan Gueta, I. Abraham, S. Grossman, D. Malkhi, B. Pinkas, M. Reiter, D.-A. Seredinschi,
O. Tamir, and A. Tomescu, “SBFT: A scalable and decentralized trust infrastructure,” in 49th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE,
2019, pp. 568–580.
[18] M. Yin, D. Malkhi, M. K. Reiter, G. G. Gueta, and I. Abraham, “HotStuff: BFT consensus with
linearity and responsiveness,” in Proceedings of the ACM Symposium on Principles of Distributed
Computing. ACM, 2019, pp. 347–356.
[19] S. Gupta, J. Hellings, S. Rahnama, and M. Sadoghi, “Proof-of-execution: Reaching consensus
through fault-tolerant speculation,” 2019. [Online]. Available: http://arxiv.org/abs/1911.00838
[20] P.-L. Aublin, S. B. Mokhtar, and V. Quéma, “RBFT: Redundant byzantine fault tolerance,” in
2013 IEEE 33rd International Conference on Distributed Computing Systems. IEEE, 2013, pp.
297–306.
[21] S. Gupta, S. Rahnama, and M. Sadoghi, “Permissioned blockchain through the looking glass: Ar-
chitectural and implementation lessons learned,” in 40th International Conference on Distributed
Computing Systems. IEEE, 2020.
[22] G. S. Veronese, M. Correia, A. N. Bessani, and L. C. Lung, “Spin one’s wheels? byzantine fault
tolerance with a spinning primary,” in 2009 28th IEEE International Symposium on Reliable
Distributed Systems. IEEE, 2009, pp. 135–144.
[23] Y. Amir, B. Coan, J. Kirsch, and J. Lane, “Prime: Byzantine replication under attack,” IEEE Trans.
Depend. Secure Comput., vol. 8, no. 4, pp. 564–577, 2011.
[24] M. J. Fischer, N. A. Lynch, and M. S. Paterson, “Impossibility of distributed consensus with one
faulty process,” J. ACM, vol. 32, no. 2, pp. 374–382, 1985.
[25] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-
tolerant web services,” SIGACT News, vol. 33, no. 2, pp. 51–59, 2002.
[26] J. Hellings and M. Sadoghi, “Coordination-free byzantine replication with minimal communica-
tion costs,” in 23rd International Conference on Database Theory (ICDT 2020), vol. 155. Schloss
Dagstuhl, 2020, pp. 17:1–17:20.
[27] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud
serving systems with YCSB,” in Proceedings of the 1st ACM Symposium on Cloud Computing.
ACM, 2010, pp. 143–154.
[28] T. T. A. Dinh, J. Wang, G. Chen, R. Liu, B. C. Ooi, and K.-L. Tan, “BLOCKBENCH: A framework
for analyzing private blockchains,” in Proceedings of the 2017 ACM International Conference on
Management of Data. ACM, 2017, pp. 1085–1100.
[29] S. Rahnama, S. Gupta, T. Qadah, J. Hellings, and M. Sadoghi, “Scalable, resilient and configurable
permissioned blockchain fabric,” Proc. VLDB Endow., vol. 13, no. 12, pp. 2893–2896, 2020.
[30] S. Gupta, J. Hellings, S. Rahnama, and M. Sadoghi, “An in-depth look of BFT consensus in
blockchain: Challenges and opportunities,” in Proceedings of the 20th International Middleware
Conference Tutorials, Middleware. ACM, 2019, pp. 6–10.
[31] ——, “Blockchain consensus unraveled: virtues and limitations,” in Proceedings of the 14th ACM
International Conference on Distributed and Event-based Systems. ACM, 2020, pp. 218–221.
[32] ——, “Building high throughput permissioned blockchain fabrics: Challenges and opportunities,”
Proc. VLDB Endow., vol. 13, no. 12, pp. 3441–3444, 2020.
[33] C. Berger and H. P. Reiser, “Scaling byzantine consensus: A broad analysis,”in Proceedings of the
2nd Workshop on Scalable and Resilient Infrastructures for Distributed Ledgers. ACM, 2018,
pp. 13–18.
[34] T. T. A. Dinh, R. Liu, M. Zhang, G. Chen, B. C. Ooi, and J. Wang, “Untangling blockchain: A
data processing view of blockchain systems,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 7, pp.
1366–1385, 2018.
[35] S. Gupta and M. Sadoghi, Blockchain TransactionProcessing. Springer International Publishing,
2018, pp. 1–11.
[36] C. Stathakopoulou, T. David, and M. Vukolic, “Mir-BFT: High-throughput BFT for blockchains,” 2019. [Online]. Available: http://arxiv.org/abs/1906.05552
[37] M. Eischer and T. Distler, “Scalable byzantine fault-tolerant state-machine replication on hetero-
geneous servers,” Computing, vol. 101, pp. 97–118, 2019.
[38] B. Li, W. Xu, M. Z. Abid, T. Distler, and R. Kapitza, “SAREK: Optimistic parallel ordering in
byzantine fault tolerance,” in 2016 12th European Dependable Computing Conference (EDCC).
IEEE, 2016, pp. 77–88.
[39] A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti, “Making byzantine fault tolerant
systems tolerate byzantine faults,” in Proceedings of the 6th USENIX Symposium on Networked
Systems Design and Implementation. USENIX Association, 2009, pp. 153–168.
[40] J. Hellings, D. P. Hughes, J. Primero, and M. Sadoghi, “Cerberus: Minimalistic multi-shard
byzantine-resilient transaction processing,” 2020. [Online]. Available: https://arxiv.org/abs/2008.
04450
[41] M. J. Amiri, D. Agrawal, and A. El Abbadi, “SharPer: Sharding permissioned blockchains over
network clusters,” 2019. [Online]. Available: https://arxiv.org/abs/1910.00765v1
[42] H. Dang, T. T. A. Dinh, D. Loghin, E.-C. Chang, Q. Lin, and B. C. Ooi, “Towards scaling
blockchain systems via sharding,” in Proceedings of the 2019 International Conference on
Management of Data. ACM, 2019, pp. 123–140.
[43] S. Gupta, S. Rahnama, J. Hellings, and M. Sadoghi, “ResilientDB: Global scale resilient blockchain fabric,” Proc. VLDB Endow., vol. 13, no. 6, pp. 868–883, 2020.