Proof-of-Execution: Reaching Consensus through
Fault-Tolerant Speculation
Suyash Gupta Jelle Hellings Sajjad Rahnama Mohammad Sadoghi
Exploratory Systems Lab
Department of Computer Science
University of California, Davis
ABSTRACT
Multi-party data management and blockchain systems require data sharing among participants. To provide resilient and consistent data sharing, transaction engines rely on Byzantine Fault-Tolerant consensus (BFT), which enables operations during failures and malicious behavior. Unfortunately, existing BFT protocols are unsuitable for high-throughput applications due to their high computational costs, high communication costs, high client latencies, and/or reliance on twin-paths and non-faulty clients.
In this paper, we present the Proof-of-Execution consensus protocol (PoE) that alleviates these challenges. At the core of PoE are out-of-order processing and speculative execution, which allow PoE to execute transactions before consensus is reached among the replicas. With these techniques, PoE manages to reduce the costs of BFT in normal cases, while guaranteeing reliable consensus for clients in all cases. We envision the use of PoE in high-throughput multi-party data-management and blockchain systems. To validate this vision, we implement PoE in our efficient ResilientDB fabric and extensively evaluate PoE against several state-of-the-art BFT protocols. Our evaluation showcases that PoE achieves up to 80% higher throughput than existing protocols in the presence of failures.
1 INTRODUCTION
In federated data management, a single common database is managed by many independent stakeholders (e.g., an industry consortium). In doing so, federated data management can ease data sharing and improve data quality [17, 32, 48]. At the core of federated data management is reaching agreement on any updates to the common database in an efficient manner, enabling fast query processing, data retrieval, and data modifications.
One can achieve federated data management by replicating the common database among all participants, that is, by replicating the sequence of transactions that affect the database to all stakeholders. One can do so using commit protocols designed for distributed databases such as two-phase [22] and three-phase commit [49], or by using crash-resilient replication protocols such as Paxos [39] and Raft [45].
These solutions are error-prone in a federated decentralized environment in which each stakeholder manages its own replicas and replicas of each stakeholder can fail (e.g., due to software, hardware, or network failure) or act maliciously: commit protocols and replication protocols can only deal with crashes. Consequently, recent federated designs propose the usage of Byzantine Fault-Tolerant (BFT) consensus protocols. BFT consensus aims at ordering client requests among a set of replicas, some of which could be Byzantine, such that all non-faulty replicas reach agreement on a common order for these requests [9, 21, 29, 38, 51]. Furthermore, BFT consensus comes with the added benefit of democracy, as BFT consensus gives all replicas an equal vote in all agreement decisions, while the resilience of BFT can aid in dealing with the billions of dollars in losses associated with prevalent attacks on data management systems [44].

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
Akin to commit protocols, the majority of BFT consensus protocols use a primary-backup model in which one replica is designated the primary that coordinates agreement, while the remaining replicas act as backups and follow the protocol [46]. This primary-backup consensus was first popularized by the influential PBFT consensus protocol of Castro and Liskov [9]. The design of PBFT requires at least 3f + 1 replicas to deal with up to f malicious replicas and operates in three communication phases, two of which necessitate quadratic communication complexity. As such, PBFT is considered costly when compared to commit or replication protocols, which has negatively impacted the usage of BFT consensus in large-scale data management systems [8].
The recent interest in blockchain technology has revived interest in BFT consensus, has led to several new resilient data management systems (e.g., [3, 18, 29, 43]), and has led to the development of new BFT consensus protocols that promise efficiency at the cost of flexibility (e.g., [21, 28, 38, 51]). Despite the existence of these modern BFT consensus protocols, the majority of BFT-fueled systems [3, 18, 29, 43] still employ the classical time-tested, flexible, and safe design of PBFT, however.
In this paper, we explore different design principles that can enable implementing a scalable and reliable BFT agreement protocol that shields against malicious attacks. We use these design principles to introduce Proof-of-Execution (PoE), a novel BFT protocol that achieves resilient agreement in just three linear phases. To concoct PoE's scalable and resilient design, we start with PBFT and successively add four design elements:
(I1) Non-Divergent Speculative Execution. In PBFT, when the primary replica receives a client request, it forwards that request to the backups. Each backup, on receiving a request from the primary, agrees to support it by broadcasting a prepare message. When a replica receives prepare messages from the majority of other replicas, it marks itself as prepared and broadcasts a commit message. Each replica that has prepared, and receives commit messages from a majority of other replicas, executes the request. Evidently, PBFT requires two phases of all-to-all communication. Our first ingredient towards faster consensus is speculative execution. In PBFT terminology, PoE replicas execute requests after they get prepared, that is, they do not broadcast commit messages. This speculative execution is non-divergent, as each replica has a partial guarantee (it has prepared) prior to execution.
(I2) Safe Rollbacks and Robustness under Failures. Due to speculative execution, a malicious primary in PoE can ensure that only a subset of replicas prepare and execute a request. Hence, a client may or may not receive a sufficient number of matching responses. PoE ensures that if a client receives a full proof-of-execution, consisting of responses from a majority of the non-faulty replicas, then such a request persists in time. Otherwise, PoE permits replicas to roll back their state if necessary. This proof-of-execution is the cornerstone of the correctness of PoE.
(I3) Agnostic Signatures and Linear Communication. BFT protocols are run among distrusting parties. To provide security, these protocols employ cryptographic primitives for signing messages and generating message digests. Prior works have shown that the choice of cryptographic signature scheme can impact the performance of the underlying system [9, 30]. Hence, we allow replicas to either employ message authentication codes (MACs) or threshold signatures (TSs) for signing [36]. When few replicas are participating in consensus (up to 16), a single phase of all-to-all communication is inexpensive, and using MACs for such setups can make computations cheap. For larger setups, we employ TSs to achieve linear communication complexity. TSs permit us to split a phase of all-to-all communication into two linear phases [21, 51].
(I4) Avoid Response Aggregation. SBFT [21], a recently-proposed protocol, suggests the use of a single replica (designated as the executor) to act as a response aggregator. Specifically, all replicas execute each client request and send their response to the executor. It is the duty of the executor to reply to the client and send a proof that a majority of the replicas not only executed this request, but also outputted the same result. In PoE, we avoid this additional communication between the replicas by allowing each replica to respond directly to the client.
Specifically, we make the following contributions:
(1) We introduce PoE, a novel Byzantine fault-tolerant consensus protocol that uses speculative execution to reach agreement among replicas.
(2) To guarantee failure recovery in the presence of speculative execution and Byzantine behavior, we introduce a novel view-change protocol that can rollback requests.
(3) PoE supports batching and out-of-order processing, is signature-scheme agnostic, and can be made to employ either MACs or threshold signatures.
(4) PoE does not rely on non-faulty replicas, clients, or trusted hardware to achieve safe and efficient consensus.
(5) To validate our vision of using PoE in resilient federated data management systems, we implement PoE and four other protocols (Zyzzyva, PBFT, SBFT, and HotStuff) in our efficient ResilientDB fabric [23–25, 27, 29, 30, 47].
(6) We extensively evaluate PoE against these protocols on a Google Cloud deployment consisting of 91 replicas and 320 k clients under (i) no failure, (ii) backup failure, (iii) primary failure, (iv) batching of requests, (v) zero payload, and (vi) scaling the number of replicas. Further, to prove the correctness of our results, we also stress test PoE and other protocols in a simulated environment. Our results show that PoE can achieve up to 80% more throughput than existing protocols in the presence of failures.
To the best of our knowledge, PoE is the first BFT protocol that achieves consensus in only two phases while being able to deal with Byzantine failures and without relying on trusted clients (e.g., Zyzzyva [38]) or on trusted hardware (e.g., MinBFT [50]). Hence, PoE can serve as a drop-in replacement of PBFT to improve scalability and performance in permissioned blockchain fabrics such as our ResilientDB fabric [27–31], MultiChain [20], and Hyperledger Fabric [4]; in multi-primary meta-protocols such as RCC [26, 28]; and in sharding protocols such as AHL [15].
2 ANALYSIS OF DESIGN PRINCIPLES
To arrive at an optimal design for PoE, we studied practices followed by state-of-the-art distributed data management systems and applied their principles to the design of PoE where possible. In Figure 1, we present a comparison of PoE against four well-known resilient BFT consensus protocols.

ResilientDB is open-sourced at https://github.com/resilientdb.

Protocol         Phases  Messages      Resilience  Requirements
Zyzzyva          1       O(n)          0           Reliable clients and unsafe
PoE (our paper)  3       O(3n)         f           Signature agnostic
PBFT             3       O(n + 2n^2)   f
HotStuff         8       O(8n)         f           Sequential consensus
SBFT             5       O(5n)         0           Optimistic path

Figure 1: Comparison of BFT consensus protocols in a system with n replicas of which f are faulty. The costs given are for the normal-case behavior.
To illustrate the merits of PoE's design, we first briefly look at PBFT. The last phase of PBFT ensures that non-faulty replicas only execute requests and inform clients when there is a guarantee that such a transaction will be recovered after any failures. Hence, clients need to wait for only f + 1 identical responses, of which at least one is from a non-faulty replica, to ensure guaranteed execution. By eliminating this last phase, replicas speculatively execute requests before obtaining recovery guarantees. This impacts PBFT-style consensus in two ways:
(1) First, clients need a way to determine a proof-of-execution, after which they have a guarantee that their requests are executed and maintained by the system. We shall show that such a proof-of-execution can be obtained using nf ≥ 2f + 1 identical responses (instead of f + 1 responses).
(2) Second, as requests are executed before they are guaranteed, replicas need to be able to rollback requests that are dropped during periods of recovery.
PoE's speculative execution guarantees that requests with a proof-of-execution will never rollback and that only a single request can obtain a proof-of-execution per round. Hence, speculative execution provides the same strong consistency (safety) as PBFT in all cases, at a much lower cost under normal operations. Furthermore, we show that speculative execution is fully compatible with other scalable design principles applied to PBFT, e.g., batching and out-of-order processing to maximize throughput, even with high message delays.
Out-of-order execution. Typical BFT systems follow the order-execute model: first, replicas agree on a unique order of the client requests, and only then do they execute the requests in order [9, 21, 29, 38, 51]. Unfortunately, this prevents these systems from providing any support for concurrent execution. A few systems suggest executing prior to ordering, but even such systems need to re-verify their results prior to committing changes [4, 35]. Our PoE protocol lies between these two extremes: the replicas speculatively execute using only partial ordering guarantees. By doing so, PoE can eliminate communication costs and minimize latencies of typical BFT systems, without needing to re-verify results in the normal case.
Out-of-order processing. Although BFT consensus typically executes requests in-order, this does not imply that replicas need to process proposals to order requests sequentially. To maximize throughput, PBFT and other primary-backup protocols support out-of-order processing in which all available bandwidth of the primary is used to continuously propose requests (even when previous proposals are still being processed by the system). By doing so, out-of-order processing can eliminate the impact of high message delays. To provide out-of-order processing, all replicas will process any request proposed as the k-th request whenever k is within some active window bounded by a low-watermark and high-watermark [9]. These watermarks are increased as the system progresses. The size of this active window is, in practice, only limited by the memory resources available to replicas. As out-of-order processing is an essential technique to deliver high throughput in environments with high message delays, we have included out-of-order processing in the design of PoE.
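To make the active-window check concrete, the Python sketch below shows how a replica might decide whether to accept a k-th proposal; the window size and data structures are illustrative assumptions, not part of PoE's specification.

class OutOfOrderWindow:
    """Tracks which sequence numbers a replica may currently accept."""

    def __init__(self, window_size=1000):
        self.low_watermark = 0          # first sequence number still open
        self.window_size = window_size  # bounded by available memory in practice
        self.accepted = {}              # k -> proposal accepted for that slot

    def can_accept(self, k):
        # A proposal for slot k is processed only if k falls inside the
        # active window and no other proposal was accepted for slot k.
        in_window = self.low_watermark <= k < self.low_watermark + self.window_size
        return in_window and k not in self.accepted

    def accept(self, k, proposal):
        if not self.can_accept(k):
            return False
        self.accepted[k] = proposal
        return True

    def advance(self, new_low_watermark):
        # Watermarks are moved forward as the system progresses
        # (e.g., after a checkpoint); old slots are garbage collected.
        for k in list(self.accepted):
            if k < new_low_watermark:
                del self.accepted[k]
        self.low_watermark = max(self.low_watermark, new_low_watermark)

if __name__ == "__main__":
    window = OutOfOrderWindow(window_size=3)
    print(window.accept(0, "txn-a"))  # True: inside [0, 3)
    print(window.accept(2, "txn-c"))  # True: out-of-order, still inside window
    print(window.accept(5, "txn-f"))  # False: beyond the high-watermark
    window.advance(3)
    print(window.accept(5, "txn-f"))  # True: window now covers [3, 6)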
Twin-path consensus. The speculative execution employed by PoE is different from the twin-path model utilized by Zyzzyva [38] and SBFT [21]. These twin-path protocols have an optimistic fast path that works only if none of the replicas are faulty and require aid to determine whether these optimistic conditions hold.
In the fast path of Zyzzyva, primaries propose requests, and backups directly execute such proposals and inform the client (without further coordination). The client waits for responses from all n replicas before marking the request executed. When the client does not receive n responses, it times out and sends a message to all replicas, after which the replicas perform an expensive client-dependent slow-path recovery process (which is prone to errors when communication is unreliable [2]).
The fast path of SBFT can deal with up to c crash-failures using 3f + 2c + 1 replicas and uses threshold signatures to make communication linear. The fast path of SBFT requires a reliable collector and executor to aggregate messages and to send only a single (instead of at least f + 1) response to the client. Due to aggregating execution, the fast path of SBFT still performs four rounds of communication before the client gets a response, whereas PoE only uses two rounds of communication (or three when PoE uses threshold signatures). If the fast path times out (e.g., the collector or executor fails), then SBFT falls back to a threshold-version of PBFT that takes an additional round before the client gets a response. Twin-path consensus is in sharp contrast with the design of PoE, which does not need outside aid (reliable clients, collectors, or executors), and can operate optimally even while dealing with replica failures.
Primary rotation. To minimize the influence of any single replica on consensus, HotStuff opts to replace the primary after every consensus decision. To efficiently do so, HotStuff uses an extra communication phase (as compared to PBFT), which minimizes the cost of primary replacement. Furthermore, HotStuff uses threshold signatures to make its communication linear (resulting in eight communication phases before a client gets responses). The event-based version of HotStuff can overlap phases of consecutive rounds, thereby assuring that consensus of a client request starts in every one-to-all-to-one communication phase. Unfortunately, the primary replacements require that all consensus rounds are performed in a strictly sequential manner, eliminating any possibility of out-of-order processing.
3 PROOF-OF-EXECUTION
In our Proof-of-Execution consensus protocol (PoE), the primary replica is responsible for proposing transactions requested by clients to all backup replicas. Each backup replica speculatively executes these transactions with the belief that the primary is behaving correctly. Speculative execution expedites the processing of transactions in all cases. Finally, when malicious behavior is detected, replicas can recover by rolling back transactions, which ensures correctness without depending on any twin-path model.
3.1 System model and notations
Before providing a full description of our PoE protocol, we present the system model we use and the relevant notations.
A system is a set ℜ of replicas that process client requests. We assign each replica r ∈ ℜ a unique identifier id(r) with 0 ≤ id(r) < |ℜ|. We write F ⊆ ℜ to denote the set of Byzantine replicas that can behave in arbitrary, possibly coordinated and malicious, manners. We assume that non-faulty replicas (those in ℜ \ F) behave in accordance with the protocol and are deterministic: on identical inputs, all non-faulty replicas must produce identical outputs. We do not make any assumptions on clients: all clients can be malicious without affecting PoE. We write n = |ℜ|, f = |F|, and nf = |ℜ \ F| to denote the number of replicas, faulty replicas, and non-faulty replicas, respectively. We assume that n > 3f (nf > 2f).
We assume authenticated communication: Byzantine replicas are able to impersonate each other, but replicas cannot impersonate non-faulty replicas. Authenticated communication is a minimal requirement to deal with Byzantine behavior. Depending on the type of message, we use message authentication codes (MACs) or threshold signatures (TSs) to achieve authenticated communication [36]. MACs are based on symmetric cryptography in which every pair of communicating nodes has a secret key. We expect non-faulty replicas to keep their secret keys hidden. TSs are based on asymmetric cryptography. Specifically, each replica holds a distinct private key, which it can use to create a signature share. Next, one can produce a valid threshold signature given at least nf such signature shares (from distinct replicas). We write s⟨v⟩_i to denote the signature share of the i-th replica for signing value v. Anyone that receives a set T = {s⟨v⟩_j | j ∈ T′} of signature shares for v from |T′| = nf distinct replicas can aggregate T into a single signature ⟨v⟩. This digital signature can then be verified using a public key.
We also employ a collision-resistant cryptographic hash function D(·) that can map an arbitrary value v to a constant-sized digest D(v) [36]. We assume that it is practically impossible to find another value v′, v ≠ v′, such that D(v) = D(v′). We use the notation v||w to denote the concatenation of two values v and w.
Next, we define the consensus provided by PoE.
Definition 3.1. A single run of any consensus protocol should satisfy the following requirements:
Termination. Each non-faulty replica executes a transaction.
Non-divergence. All non-faulty replicas execute the same transaction.
Termination is typically referred to as liveness, whereas non-divergence is typically referred to as safety. In PoE, execution is speculative: replicas can execute and rollback transactions. To provide safety, PoE provides speculative non-divergence instead of non-divergence:
Speculative non-divergence. If nf − f ≥ f + 1 non-faulty replicas accept and execute the same transaction T, then all non-faulty replicas will eventually accept and execute T (after rolling back any other executed transactions).
To provide safety, we do not need any other assumptions on communication or on clients. Due to well-known impossibility results for asynchronous consensus [19], we can only provide liveness in periods of reliable bounded-delay communication during which all messages sent by non-faulty replicas will arrive at their destination within some maximum delay.
3.2 The Normal-Case Algorithm of PoE
PoE operates in views v = 0, 1, . . . . In view v, the replica r with id(r) = v mod n is elected as the primary. The design of PoE relies on authenticated communication, which can be provided using MACs or TSs. In Figure 2, we sketch the normal-case working of PoE for both cases. For the sake of brevity, we will describe PoE built on top of TSs, which results in a protocol with low (linear) message complexity in the normal case. The full pseudo-code for
this algorithm can be found in Figure 3. In Section 3.6, we detail the minimal changes to PoE necessary when switching to MACs.

Figure 2: Normal-case algorithm of PoE using MACs (a) and using TSs (b): Client c sends its request containing transaction T to the primary p, which proposes this request to all replicas. Although replica b is Byzantine, it fails to affect PoE.
Consider a view v with primary p. To request execution of transaction T, a client c signs transaction T and sends the signed transaction ⟨T⟩_c to p. The usage of signatures assures that malicious primaries cannot forge transactions. To initiate replication and execution of T as the k-th transaction, the primary proposes T to all replicas via a propose message.
After the i-th replica r receives a propose message m from p, it checks whether at least nf other replicas received the same proposal m from primary p. This check assures that at least nf − f non-faulty replicas received the same proposal, which will play a central role in achieving speculative non-divergence. To perform this check, each replica supports the first proposal m it receives from the primary by computing a signature share s⟨m⟩_i and sending a support message containing this share to the primary.
The primary p waits for support messages with valid signature shares from nf distinct replicas, which can then be aggregated into a single signature ⟨m⟩. After generating such a signature, the primary broadcasts this signature to all replicas via a certify message.
After a replica r receives a valid certify message, it view-commits to T as the k-th transaction in view v. The replica logs this view-commit decision as VCommit(⟨T⟩_c, v, k). After r view-commits to T, r schedules T for speculative execution as the k-th transaction of view v. Consequently, T will be executed by r after all preceding transactions are executed. We write Execute(⟨T⟩_c, v, k) to log this execution.
After execution, the replica informs the client of the order of execution and of the execution result r (if any) via an inform message. In turn, client c will wait for a proof-of-execution for the transaction T it requested, which consists of identical inform messages from nf distinct replicas. This proof-of-execution guarantees that at least nf − f ≥ f + 1 non-faulty replicas executed T as the k-th transaction, and in Section 3.3 we will see that such transactions are always preserved by PoE when recovering from failures.
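As an illustration of how a client might assemble this proof-of-execution, the following Python sketch collects inform responses and accepts once nf = n − f identical responses from distinct replicas have arrived; the message layout is a simplifying assumption.

from collections import defaultdict

class PoEClient:
    """Collects inform messages until a proof-of-execution is assembled."""

    def __init__(self, n, f):
        self.nf = n - f                       # responses needed for a proof-of-execution
        # (digest, view, k, result) -> set of replica ids that sent it
        self.responses = defaultdict(set)

    def on_inform(self, replica_id, digest, view, k, result):
        """Record one inform message and return the proof once complete."""
        key = (digest, view, k, result)
        self.responses[key].add(replica_id)   # sets de-duplicate repeated senders
        if len(self.responses[key]) >= self.nf:
            # At least nf - f >= f + 1 non-faulty replicas executed the
            # transaction identically, so the request persists after recovery.
            return {"executed_as": k, "view": view, "result": result,
                    "replicas": sorted(self.responses[key])}
        return None

if __name__ == "__main__":
    client = PoEClient(n=4, f=1)              # nf = 3 identical responses required
    assert client.on_inform(0, "d1", 0, 7, "ok") is None
    assert client.on_inform(1, "d1", 0, 7, "ok") is None
    proof = client.on_inform(3, "d1", 0, 7, "ok")
    print(proof)                              # proof-of-execution for the 7th transaction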
If client c does not know the current primary or does not get any timely response for its requests, then it can broadcast its request ⟨T⟩_c to all replicas. The non-faulty replicas will then forward this request to the current primary (if T is not yet executed) and ensure that the primary initiates a successful proposal of this request in a timely manner.
To prove the correctness of PoE in all cases, we will need the following technical safety-related property of view-commits.
Client-role (used by client c to request transaction T):
1: Send ⟨T⟩_c to the primary p.
2: Await receipt of inform messages (⟨T⟩_c, v, k, r) from nf replicas.
3: Consider T executed, with result r, as the k-th transaction.
Primary-role (running at the primary p of view v, id(p) = v mod n):
4: Let view v start after execution of the k-th transaction.
5: event p awaits receipt of message ⟨T⟩_c from client c do
6: Broadcast propose(⟨T⟩_c, v, k) to all replicas.
7: k := k + 1.
8: end event
9: event p receives nf support messages (s⟨h⟩_i, v, k) such that:
(1) each message was sent by a distinct replica, i ∈ {1, . . . , n}; and
(2) all s⟨h⟩_i in this set can be combined to generate signature ⟨h⟩
do
10: Broadcast certify(⟨h⟩, v, k) to all replicas.
11: end event
Backup-role (running at every i-th replica r):
12: event r receives a propose message m := (⟨T⟩_c, v, k) such that:
(1) v is the current view;
(2) m is sent by the primary of v; and
(3) r did not accept a k-th proposal in v
do
13: Compute h := D(⟨T⟩_c || v || k).
14: Compute signature share s⟨h⟩_i.
15: Transmit support(s⟨h⟩_i, v, k) to p.
16: end event
17: event r receives a certify message (⟨h⟩, v, k) from p such that:
(1) r transmitted support(s⟨h⟩_i, v, k) to p; and
(2) ⟨h⟩ is a valid threshold signature
do
18: View-commit T, the k-th transaction of v (VCommit(⟨T⟩_c, v, k)).
19: end event
20: event r logged VCommit(⟨T⟩_c, v, k) and
r has logged Execute(t′, v′, k′) for all 0 ≤ k′ < k do
21: Execute T as the k-th transaction of v (Execute(⟨T⟩_c, v, k)).
22: Let r be the result of execution of T (if there is any result).
23: Send inform(D(⟨T⟩_c), v, k, r) to c.
24: end event
Figure 3: The normal-case algorithm of PoE.
Proposition 3.2. Let r_i, i ∈ {1, 2}, be two non-faulty replicas that view-committed to ⟨T_i⟩_{c_i} as the k-th transaction of view v (VCommit_{r_i}(⟨T_i⟩_{c_i}, v, k)). If n > 3f, then ⟨T_1⟩_{c_1} = ⟨T_2⟩_{c_2}.
Proof. Replica r_i only view-committed to ⟨T_i⟩_{c_i} after r_i received certify(⟨h_i⟩, v, k) from the primary (Line 17 of Figure 3). This message includes a threshold signature ⟨h_i⟩, whose construction requires signature shares from a set S_i of nf distinct replicas. Let X_i = S_i \ F be the non-faulty replicas in S_i. As |S_i| = nf and |F| = f, we have |X_i| ≥ nf − f. The non-faulty replicas in X_i will only send a single support message for the k-th transaction in view v (Line 12 of Figure 3). Hence, if ⟨T_1⟩_{c_1} ≠ ⟨T_2⟩_{c_2}, then X_1 and X_2 must not overlap and nf ≥ |X_1 ∪ X_2| ≥ 2(nf − f) must hold. As n = nf + f, this simplifies to 3f ≥ n, which contradicts n > 3f. Hence, we conclude ⟨T_1⟩_{c_1} = ⟨T_2⟩_{c_2}. □
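To make the counting argument tangible, the LaTeX fragment below instantiates the quorum-intersection bound for the smallest configuration, n = 4 and f = 1; the concrete numbers are ours, the inequality is the one used in the proof.

% Quorum intersection for n = 4, f = 1, nf = 3.
\begin{align*}
  |X_i| &\geq \mathit{nf} - f = 3 - 1 = 2,\\
  X_1 \cap X_2 = \emptyset &\;\Rightarrow\; |X_1 \cup X_2| \geq 2(\mathit{nf} - f) = 4 > 3 = \mathit{nf}.
\end{align*}
% This is impossible: there are only nf = 3 non-faulty replicas, so any two
% certificates for the same round k share a non-faulty replica, and that
% replica supports at most one proposal per round.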
We will later use Proposition 3.2 to show that PoE provides speculative non-divergence. Next, we look at typical cases in which the normal-case of PoE is interrupted:
Example 3.3. A malicious primary can try to affect PoE by not conforming to the normal-case algorithm in the following ways:
(1) By sending proposals for different transactions to different non-faulty replicas. In this case, Proposition 3.2 guarantees that at most a single such proposed transaction will get view-committed by any non-faulty replica.
(2) By keeping some non-faulty replicas in the dark by not sending proposals to them. In this case, the remaining non-faulty replicas can still end up view-committing the transactions as long as at least nf − f non-faulty replicas receive proposals: the faulty replicas in F can take over the role of up to f non-faulty replicas left in the dark (giving the false illusion that the non-faulty replicas in the dark are malicious).
(3) By preventing execution by not proposing a k-th transaction, even though transactions following the k-th transaction are being proposed.
When the network is unreliable and messages do not get delivered (or not on time), then the behavior of a non-faulty primary can match that of the malicious primary in the above example. Indeed, failure of the normal-case of PoE has only two possible causes: primary failure and unreliable communication. If communication is unreliable, then there is no way to guarantee continuous service [19]. Hence, replicas simply assume failure of the current primary if the normal-case behavior of PoE is interrupted, while the design of PoE guarantees that unreliable communication does not affect the correctness of PoE.
To deal with primary failure, each replica maintains a timer for each request. If this timer expires (timeout) and the replica has not been able to execute the request, it assumes that the primary is malicious. To deal with such a failure, replicas will replace the primary. Next, we present the view-change algorithm that performs primary replacement.
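As a concrete illustration of this per-request failure-detection timer, the Python sketch below starts a timer when a request is forwarded to the primary and triggers a view-change request on expiry; the timeout value and callback names are illustrative assumptions.

import threading

class FailureDetector:
    """Per-request timers: if a request is not executed in time,
    the replica suspects the primary and requests a view-change."""

    def __init__(self, timeout_s, request_view_change):
        self.timeout_s = timeout_s
        self.request_view_change = request_view_change  # e.g., broadcast a vc-request
        self.timers = {}

    def start(self, request_id):
        # Started when the replica forwards a client request to the primary.
        timer = threading.Timer(self.timeout_s, self._expired, args=(request_id,))
        self.timers[request_id] = timer
        timer.start()

    def executed(self, request_id):
        # Cancel the timer once the request has been executed locally.
        timer = self.timers.pop(request_id, None)
        if timer is not None:
            timer.cancel()

    def _expired(self, request_id):
        if request_id in self.timers:
            del self.timers[request_id]
            self.request_view_change(request_id)

if __name__ == "__main__":
    detector = FailureDetector(
        timeout_s=0.1,
        request_view_change=lambda rid: print(f"suspect primary, vc-request for {rid}"))
    detector.start("req-42")
    # detector.executed("req-42")  # uncommenting this would cancel the timer
    threading.Event().wait(0.2)    # let the timer fire for the demo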
3.3 The View-Change Algorithm
If PoE observes failure of the primary p of view v, then PoE will elect a new primary and move to the next view, view v + 1, via the view-change algorithm. The goals of the view-change are
(1) to assure that each request that is considered executed by any client is preserved under all circumstances; and
(2) to assure that the replicas are able to agree on a new view whenever communication is reliable.
As described in the previous section, a client will consider its request executed if it receives a proof-of-execution consisting of identical responses from at least nf distinct replicas. Of these nf responses, at most f can come from faulty replicas. Hence, a client can only consider its request executed whenever the requested transaction was executed (and view-committed) by at least nf − f ≥ f + 1 non-faulty replicas in the system. We note the similarity with the view-change algorithm of PBFT, which will preserve any request that is prepared by at least nf − f ≥ f + 1 non-faulty replicas.
The view-change algorithm of PoE consists of three steps. First, failure of the current primary p needs to be detected by all non-faulty replicas. Second, all replicas exchange information to establish which transactions were included in view v and which were not. Third, the new primary p′ proposes a new view. This new view proposal contains a list of the transactions executed in the previous views (based on the information exchanged earlier). Finally, if the new view proposal is valid, then replicas switch to this view; otherwise, replicas detect failure of p′ and initiate a view-change for the next view (v + 2). The communication of the view-change algorithm of PoE is sketched in Figure 4 and the full pseudo-code of the algorithm can be found in Figure 5. Next, we discuss each step in detail.
3.3.1 Failure Detection and View-Change Requests. If a replica r detects failure of the primary of view v, then it halts the normal-case algorithm of PoE for view v and informs all other replicas of this failure by requesting a view-change. The replica r does so by broadcasting a message vc-request(v, E), in which E is a summary of all transactions executed by r (Figure 5, Line 1). Each replica can detect the failure of the primary in two ways:
Figure 4: The current primary b of view v is faulty and needs to be replaced. The next primary, p′, and the replica r1 detected this failure first and request a view-change via vc-request messages. The replica r2 joins these requests, after which all replicas enter view v + 1.
vc-request (used by replica r to request a view-change):
1: event r detects failure of the primary do
2: r halts the normal-case algorithm of Figure 3 for view v.
3: E := {((certify(⟨h⟩, w, k)), ⟨T⟩_c) | w ≤ v and Execute(⟨T⟩_c, w, k) and h = D(⟨T⟩_c || w || k)}.
4: Broadcast vc-request(v, E) to all replicas.
5: end event
6: event r receives f + 1 messages vc-request(v_i, E_i) such that
(1) each message was sent by a distinct replica; and
(2) v_i, 1 ≤ i ≤ f + 1, is the current view
do
7: r detects failure of the primary (join).
8: end event
On receiving nv-propose (used by replica r):
9: event r receives m = nv-propose(v + 1, m_1, m_2, . . . , m_nf) do
10: if m is a valid new-view proposal (similar to creating nv-propose) then
11: Derive the transactions N for the new view from m_1, m_2, . . . , m_nf.
12: Rollback any executed transactions not included in N.
13: Execute the transactions in N not yet executed.
14: Move into view v + 1 (see Section 3.3.3 for details).
15: end if
16: end event
nv-propose (used by replica p′ that will act as the new primary):
17: event p′ receives nf messages m_i = vc-request(v_i, E_i) such that
(1) these messages are sent by a set S, |S| = nf, of distinct replicas;
(2) for each m_i, 1 ≤ i ≤ nf, sent by replica r_i ∈ S, E_i consists of a consecutive sequence of entries ((certify(⟨h⟩, v, k)), ⟨T⟩_c);
(3) v_i, 1 ≤ i ≤ nf, is the current view v; and
(4) p′ is the next primary (id(p′) = (v + 1) mod n)
do
18: Broadcast nv-propose(v + 1, m_1, m_2, . . . , m_nf) to all replicas.
19: end event
Figure 5: The view-change algorithm of PoE.
(1) r times out while expecting normal-case operations toward executing a client request, e.g., when r forwards a client request to the current primary and the current primary fails to propose this request on time.
(2) r receives vc-request messages, indicating that the primary of view v failed, from f + 1 distinct replicas. As at most f of these messages can come from faulty replicas, at least one non-faulty replica must have detected a failure. In this case, r joins the view-change (Figure 5, Line 6).
3.3.2 Proposing the New View. To start view v + 1, the new primary p′ (with id(p′) = (v + 1) mod n) needs to propose a new view by determining a valid list of requests that need to be preserved. To do so, p′ waits until it receives sufficient information. Specifically, p′ waits until it has received valid vc-request messages from a set S ⊆ ℜ of |S| = nf distinct replicas.
An i-th view-change request m_i is considered valid if it includes a consecutive sequence of pairs (c, ⟨T⟩_c), where c is a valid certify message for request ⟨T⟩_c. Such a set S is guaranteed to exist when communication is reliable, as all non-faulty replicas will participate in the view-change algorithm. The new primary collects the set S of |S| = nf valid vc-request messages and proposes them in a new-view nv-propose message to all replicas.
3.3.3 Move to the New View. After a replica r receives an nv-propose message containing a new-view proposal from the new primary p′, r validates the content of this message. From the set of vc-request messages in the new-view proposal, r chooses, for each k, the pair (certify(⟨h⟩, w, k), ⟨T⟩_c) proposed in the most-recent view w. Furthermore, r determines the total number of such requests k_max. Then, r view-commits and executes all k_max chosen requests that happened before view v + 1. Notice that replica r can skip execution of any transaction it already executed. If r executed transactions not included in the new-view proposal, then r needs to rollback these transactions before it can proceed executing requests in view v + 1. After these steps, r can switch to the new view v + 1. In the new view, the new primary p′ starts by proposing the (k_max + 1)-th transaction.
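The following Python sketch shows one way the new-view transaction list could be derived from the nf collected vc-request summaries, picking per sequence number the entry from the most recent view; the data layout is an assumption made for illustration.

def derive_new_view(vc_requests):
    """vc_requests: list of summaries E_i, each a list of
    (view, k, certificate, signed_txn) tuples, one list per replica.
    Returns (chosen, k_max): per sequence number the entry from the
    most recent view, and the number of slots to preserve."""
    chosen = {}  # k -> (view, certificate, signed_txn)
    for summary in vc_requests:
        for view, k, certificate, signed_txn in summary:
            # Keep the entry proposed in the most recent view for slot k.
            if k not in chosen or view > chosen[k][0]:
                chosen[k] = (view, certificate, signed_txn)
    k_max = 1 + max(chosen) if chosen else 0
    return chosen, k_max

if __name__ == "__main__":
    # Three summaries (nf = 3); the third replica saw a newer proposal for slot 1.
    e0 = [(0, 0, "cert-a", "txn-a"), (0, 1, "cert-b", "txn-b")]
    e1 = [(0, 0, "cert-a", "txn-a")]
    e2 = [(0, 0, "cert-a", "txn-a"), (1, 1, "cert-c", "txn-c")]
    chosen, k_max = derive_new_view([e0, e1, e2])
    print(k_max)        # 2: slots 0 and 1 are preserved, the new primary proposes slot 2
    print(chosen[1])    # (1, 'cert-c', 'txn-c'): most recent view wins for slot 1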
When moving into the new view, we see the cost of speculative execution: some replicas can be forced to rollback the execution of transactions:
Example 3.4. Consider a system with non-faulty replica r. When deciding the k-th request, communication became unreliable, due to which only r received a certify message for request ⟨T⟩_c. Consequently, r speculatively executes T and informs the client c. During the view-change, all other replicas, none of which have a certify message for ⟨T⟩_c, provide their local state to the new primary, which proposes a new view that does not include any k-th request. Hence, the new primary will start its view by proposing client request ⟨T′⟩_{c′} as the k-th request, which gets accepted. Consequently, r needs to rollback the execution of T. Luckily, this is not an issue: the client c only got at most f + 1 < nf responses for its request, does not yet have a proof-of-execution, and, consequently, does not consider T executed.
In practice, rollbacks can be supported by, e.g., undoing the operations of a transaction in reverse order, or by reverting to an old state. For the correct working of PoE, the exact implementation of rollbacks is not important as long as the execution layer provides support for them.
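As a minimal sketch of the first option, the Python fragment below executes key-value updates against an undo log and rolls back speculatively executed slots in reverse order; the store interface is an assumption, not part of ResilientDB.

class SpeculativeStore:
    """Key-value store with per-slot undo logs for speculative execution."""

    def __init__(self):
        self.state = {}
        self.undo_log = {}   # k -> list of (key, old_value) in execution order

    def execute(self, k, writes):
        """Speculatively apply writes {key: new_value} as the k-th transaction."""
        log = []
        for key, new_value in writes.items():
            log.append((key, self.state.get(key)))   # remember the old value
            self.state[key] = new_value
        self.undo_log[k] = log

    def rollback(self, k):
        """Undo the k-th transaction by replaying its undo log in reverse."""
        for key, old_value in reversed(self.undo_log.pop(k, [])):
            if old_value is None:
                self.state.pop(key, None)
            else:
                self.state[key] = old_value

if __name__ == "__main__":
    store = SpeculativeStore()
    store.execute(0, {"x": 1})
    store.execute(1, {"x": 2, "y": 7})   # speculative, no proof-of-execution yet
    store.rollback(1)                    # dropped during a view-change
    print(store.state)                   # {'x': 1}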
3.4 Correctness of PoE
First, we show that the normal-case algorithm of PoE provides non-divergent speculative consensus when the primary is non-faulty and communication is reliable.
Theorem 3.5. Consider a system in view v, in which the first k − 1 transactions have been executed by all non-faulty replicas, in which the primary is non-faulty, and communication is reliable. If the primary received ⟨T⟩_c, then the primary can use the algorithm in Figure 3 to ensure that
(1) there is non-divergent execution of T;
(2) c considers T executed as the k-th transaction; and
(3) c learns the result of executing T (if any),
independent of any malicious behavior by faulty replicas.
Proof. Each non-faulty primary follows the algorithm of PoE described in Figure 3 and sends propose(⟨T⟩_c, v, k) to all replicas (Line 6). In response, all nf non-faulty replicas will compute a signature share and send a support message to the primary (Line 15). Consequently, the primary will receive signature shares from nf replicas and will combine them to generate a threshold signature ⟨h⟩. The primary will include this signature ⟨h⟩ in a certify message and broadcast it to all replicas. Each replica will successfully verify ⟨h⟩ and will view-commit to T (Line 17). As the first k − 1 transactions have already been executed, every non-faulty replica will execute T. As all non-faulty replicas behave deterministically, execution will yield the same result r (if any) across all non-faulty replicas. Hence, when the non-faulty replicas inform c, they do so by all sending identical inform messages (D(⟨T⟩_c), v, k, r) to c (Line 20–Line 23). As all nf non-faulty replicas executed T, we have non-divergent execution. Finally, as there are at most f faulty replicas, the faulty replicas can only forge up to f invalid inform messages. Consequently, the client c will receive the message (D(⟨T⟩_c), v, k, r) from at least nf distinct replicas, and will conclude that T is executed, yielding result r (Line 3). □
At the core of the correctness of PoE, under all conditions, is that no replica will rollback requests ⟨T⟩_c for which client c already received a proof-of-execution. We prove this next:
Proposition 3.6. Let ⟨T⟩_c be a request for which client c already received a proof-of-execution showing that T was executed as the k-th transaction of view v. If n > 3f, then every non-faulty replica that switches to a view v′ > v will preserve T as the k-th transaction of view v.
Proof. Client c considers ⟨T⟩_c executed as the k-th transaction of view v when it received identical inform messages for T from a set A of |A| = nf distinct replicas (Figure 3, Line 3). Let B = A \ F be the set of non-faulty replicas in A.
Now consider a non-faulty replica r that switches to view v′ > v. Before doing so, r must have received a valid proposal m = nv-propose(v′, m_1, . . . , m_nf) from the primary of view v′. Let C be the set of nf distinct replicas that provided the vc-request messages m_1, . . . , m_nf and let D = C \ F be the set of non-faulty replicas in C. We have |B| ≥ nf − f and |D| ≥ nf − f. Hence, using a contradiction argument similar to the one in the proof of Proposition 3.2, we conclude that there must exist a non-faulty replica q ∈ (B ∩ D) that executed ⟨T⟩_c, informed c, and requested a view-change.
To complete the proof, we need to show that ⟨T⟩_c was proposed and executed in the last view that proposed and view-committed a k-th transaction and, hence, that q will include ⟨T⟩_c in its vc-request message for view v′. We do so by induction on the difference v′ − v. As the base case, we have v′ − v = 1, in which case no view after v exists yet and, hence, ⟨T⟩_c must be the newest k-th transaction available to q. As the induction hypothesis, we assume that all non-faulty replicas will preserve T when entering a new view w, v < w ≤ w′. Hence, non-faulty replicas participating in view w will not support any k-th transactions proposed in view w. Consequently, no certify messages can be constructed for any k-th transaction in view w. Hence, the new-view proposal for w′ + 1 will include ⟨T⟩_c, completing the proof. □
As a direct consequence of the above, we have
Corollary 3.7 (Safety of PoE). PoE provides speculative non-divergence if n > 3f.
We notice that the view-change algorithm does not deal with minor malicious behavior (e.g., a single replica left in the dark). Furthermore, the presented view-change algorithm will recover all transactions since the start of the system, which will result in unreasonably large messages when many transactions have already been proposed. In practice, both these issues can be resolved by regularly making checkpoints (e.g., after every 100 requests) and only including requests since the last checkpoint in each vc-request message. To do so, PoE uses a standard fully-decentralized PBFT-style checkpoint algorithm that enables the independent checkpointing and recovery of any request that is executed by at least f + 1 non-faulty replicas whenever communication is reliable [9]. Finally, utilizing the view-change algorithm and checkpoints, we prove
Theorem 3.8 (Liveness of PoE). PoE provides termination in periods of reliable bounded-delay communication if n > 3f.
Proof. When the primary is non-faulty, Theorem 3.5 guarantees termination, as replicas continuously accept and execute requests. If the primary is Byzantine and fails to guarantee termination for at most f non-faulty replicas, then the checkpoint algorithm will assure termination of these non-faulty replicas. Finally, if the primary is Byzantine and fails to guarantee termination for at least f + 1 non-faulty replicas, then it will be replaced using the view-change algorithm. For the view-change process, each replica will start with a timeout δ after it receives nf matching vc-request messages and double this timeout after each view-change (exponential backoff). When communication becomes reliable, this mechanism guarantees that all replicas will eventually view-change to the same view at the same time. After this point, a non-faulty replica will become primary in at most f view-changes, after which Theorem 3.5 guarantees termination. □
3.5 Fine-Tuning and Optimizations
To keep the presentation simple, we did not include the following optimizations in the protocol description:
(1) To reach nf signature shares, the primary can generate one itself. Hence, it only needs nf − 1 shares from other replicas.
(2) The propose, support, certify, and inform messages are not forwarded and only need MACs to provide message authentication. Messages carrying threshold signature shares or aggregated signatures need not be additionally signed, as tampering with them would invalidate the threshold signature. Messages that are forwarded, such as vc-request messages, need to be signed, as they must be protected against tampering by the forwarding replica.
Finally, the design of PoE is fully compatible with out-of-order processing, as a replica only supports proposals for a k-th transaction if it has not previously supported another k-th proposal (Figure 3, Line 12) and only executes a k-th transaction if it has already executed all the preceding transactions (Figure 3, Line 20). As the size of the active out-of-order processing window determines how many client requests are being processed at the same time (without a proof-of-execution having been received), the size of the active window also determines the number of transactions that can be rolled back during view-changes.
3.6 Designing PoE using MACs
The design of PoE can be adapted to only use message authentication codes (MACs) to authenticate communication. This will sharply reduce the computational complexity of PoE and eliminate one round of communication, at the cost of higher (quadratic) overall communication costs (see Figure 2).
The usage of only MACs makes it impossible to obtain threshold signatures or reliably forward messages (as forwarding replicas can tamper with the content of unsigned messages). Hence, using MACs requires changes to how client requests are included in proposals (as client requests are forwarded), to the normal-case algorithm of PoE (which uses threshold signatures), and to the view-change algorithm of PoE (which forwards vc-request messages). The changes to the proposal of client requests and to the view-change algorithm can be derived from the strategies used by PBFT to support MACs [9]. Hence, next we only review the changes to the normal-case algorithm of PoE.
Figure 6: Multi-threaded pipelines at different replicas (input network receiving client requests and support/certify messages from clients and replicas; batch creation; worker; checkpoint; execute; output network sending messages to clients and replicas).

Consider a replica r that receives a propose message from the primary p. Next, r needs to determine whether at least nf other replicas received the same proposal, which is required to achieve speculative non-divergence (see Proposition 3.2). When using MACs, r can do so by replacing the all-to-one support and one-to-all certify phases by a single all-to-all support phase. In the support phase, each replica agrees to support the first proposal propose(⟨T⟩_c, v, k) it receives from the primary by broadcasting a message support(D(⟨T⟩_c), v, k) to all replicas. After this broadcast, each replica waits until it receives support messages, identical to the message it sent, from nf distinct replicas. If r receives these messages, it view-commits to T as the k-th transaction in view v and schedules T for execution. We have sketched this algorithm in Figure 2.
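The sketch below illustrates, in Python, the replica-side bookkeeping of this MAC-based support phase: a replica sends its own support message to every peer and view-commits once nf matching support messages from distinct replicas have arrived. HMAC-SHA256 stands in for a generic MAC scheme, and the message format is an assumption.

import hashlib
import hmac
from collections import defaultdict

class MacSupportPhase:
    """All-to-all support phase of the MAC-based PoE variant (sketch)."""

    def __init__(self, n, f, my_id, pairwise_keys, send, view_commit):
        self.nf = n - f
        self.my_id = my_id
        self.keys = pairwise_keys          # replica id -> shared secret key (bytes)
        self.send = send                   # callback: send a message to one replica
        self.view_commit = view_commit     # callback: view-commit (digest, view, k)
        self.supported = {}                # (view, k) -> digest this replica supports
        self.votes = defaultdict(set)      # (digest, view, k) -> replica ids

    def on_propose(self, signed_txn, view, k):
        if (view, k) in self.supported:
            return                         # support only the first k-th proposal in view
        digest = hashlib.sha256(repr((signed_txn, view, k)).encode()).hexdigest()
        self.supported[(view, k)] = digest
        for replica_id, key in self.keys.items():
            tag = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
            self.send(replica_id, ("support", digest, view, k, self.my_id, tag))
        self.on_support(self.my_id, digest, view, k)   # count our own support

    def on_support(self, sender_id, digest, view, k):
        # MAC verification of the incoming message is assumed to have happened.
        self.votes[(digest, view, k)].add(sender_id)
        matches_own = self.supported.get((view, k)) == digest
        if matches_own and len(self.votes[(digest, view, k)]) >= self.nf:
            self.view_commit(digest, view, k)          # schedule speculative execution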
4RESILIENTDB FABRIC
To test our design principles in practical settings, we implement our PoE protocol in our ResilientDB fabric [27–31]. ResilientDB provides its users access to a state-of-the-art replicated transactional engine and fulfills the need for a high-throughput permissioned blockchain fabric. ResilientDB helps us to realize the following goals: (i) implement and test different consensus protocols; (ii) balance the tasks done by a replica through a parallel pipelined architecture; (iii) minimize the cost of communication through batching client transactions; and (iv) enable the use of a secure and efficient ledger. Next, we present a brief overview of our ResilientDB fabric.
ResilientDB lays down a client-server architecture where clients send their transactions to servers for processing. We use Figure 6 to illustrate the multi-threaded pipelined architecture associated with each replica. At each replica, we spawn multiple input and output threads for communicating with the network.
Batching. During our formal description of PoE, we assumed that the propose message from the primary includes a single client request. An effective way to reduce the overall cost of consensus is to aggregate several client requests in a single batch and use one consensus step to reach agreement on all these requests [9, 21, 38]. To maximize performance, ResilientDB facilitates batching requests at both replicas and clients.
At the primary replica, we spawn multiple batch-threads that aggregate client requests into a batch. The input-threads at the primary receive client requests, assign them a sequence number, and enqueue these requests in the batch-queue. In ResilientDB, all batch-threads share a common lock-free queue. When a client request is available, a batch-thread dequeues the request and continues adding it to an existing batch until the batch has reached a pre-defined size. Each batch-thread also hashes the requests in a batch to create a unique digest.
All other messages received at a replica are enqueued by the input-thread in the work-queue to be processed by the single worker-thread. Once a replica receives a certify message from the primary, it forwards the request to the execute-thread for execution. Once the execution is complete, the execute-thread creates an inform message, which is transmitted to the client.
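A minimal Python sketch of such a batch-thread is shown below: it drains a shared queue until a pre-defined batch size is reached and derives a digest over the batched requests. The queue names and the batch size of 100 mirror the description above, but the code itself is illustrative.

import hashlib
import queue

BATCH_SIZE = 100

def run_batch_thread(batch_queue, out_queue, batch_size=BATCH_SIZE):
    """Dequeue client requests and emit (digest, batch) tuples for consensus."""
    batch = []
    while True:
        request = batch_queue.get()          # blocks until a request is available
        if request is None:                  # sentinel used to stop the thread
            break
        batch.append(request)
        if len(batch) == batch_size:
            digest = hashlib.sha256("".join(batch).encode()).hexdigest()
            out_queue.put((digest, batch))   # handed to the propose step
            batch = []

if __name__ == "__main__":
    batch_queue, out_queue = queue.Queue(), queue.Queue()
    for i in range(BATCH_SIZE):
        batch_queue.put(f"txn-{i}")
    batch_queue.put(None)
    run_batch_thread(batch_queue, out_queue)
    digest, batch = out_queue.get()
    print(digest[:16], len(batch))           # digest prefix and batch size (100)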
Ledger Management.
We now explain how we efficiently maintain a blockchain ledger across different replicas. A blockchain is an immutable ledger in which blocks are chained as a linked-list. The i-th block can be represented as B_i := {k, d, v, H(B_{i−1})}, in which k is the sequence number of the client request, d the digest of the request, v the view number, and H(B_{i−1}) the hash of the previous block.

Figure 7: Upper bound on performance when the primary only replies to clients (No exec.) and when the primary executes a request and replies to clients (Exec.).

In ResilientDB, prior to any consensus, we require the first primary replica to create a genesis block [31]. This genesis block acts as the first block in the blockchain and contains some basic data. We use the hash of the identity of the initial primary, as this information is available to each participating replica (eliminating the need for any extra communication to exchange this block).
After the genesis block, each replica can independently create the next block in the blockchain. As stated above, each block corresponds to some batch of transactions. A block is only created by the execute-thread once it completes executing a batch of transactions. To create a block, the execute-thread hashes the previous block in the blockchain and creates a new block. To prove the validity of individual blocks, ResilientDB stores the proof-of-accepting the k-th request in the k-th block. In PoE, such a proof includes the threshold signature sent by the primary as part of the certify message.
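The Python sketch below shows one way such hash-chained blocks could be built by the execute-thread; the exact field layout and the genesis content are assumptions based on the description above.

import hashlib
import json

def block_hash(block):
    """Deterministic SHA256 hash over the serialized block fields."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def genesis_block(initial_primary_id):
    # The genesis block only contains basic data known to every replica,
    # here the hash of the initial primary's identity.
    return {"k": 0, "d": hashlib.sha256(initial_primary_id.encode()).hexdigest(),
            "v": 0, "prev": None}

def append_block(chain, k, digest, view, proof):
    """Create block B_k = {k, d, v, H(B_{k-1})} plus the proof-of-accepting."""
    prev = block_hash(chain[-1])
    block = {"k": k, "d": digest, "v": view, "prev": prev, "proof": proof}
    chain.append(block)
    return block

if __name__ == "__main__":
    chain = [genesis_block("replica-0")]
    append_block(chain, k=1, digest="batch-digest-1", view=0, proof="certify-ts-1")
    append_block(chain, k=2, digest="batch-digest-2", view=0, proof="certify-ts-2")
    print(chain[2]["prev"] == block_hash(chain[1]))   # True: blocks are chained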
5 EVALUATION
We now analyze our design principles in practice. To do so, we evaluate our PoE protocol against four state-of-the-art BFT protocols. There are many BFT protocols we could compare with. Hence, we pick a representative sample: (1) Zyzzyva, as it has the absolute minimal cost in the fault-free case; (2) PBFT, as it is a common baseline (the used design is based on BFTSmart [7]); (3) SBFT, as it is a safer variation of Zyzzyva; and (4) HotStuff, as it is a linear-communication protocol that adopts the notion of rotating leaders. Through our experiments, we want to answer the following questions:
How does PE fare in comparison with the other protocols
under failures?
(Q2) Does PE benets from batching client requests?
(Q3) How does PE perform under zero payload?
(Q4)
How scalable is PE on increasing the number of replicas
participating in the consensus, in the normal-case?
Setup. We run our experiments on the Google Cloud and deploy each replica on a c2 machine having a 16-core Intel Xeon Cascade Lake CPU running at 3.8 GHz with 32 GB memory. We deploy up to 320 k clients on 16 machines. To collect results after reaching a steady state, we run each experiment for 180 s: the first 60 s are warmup, and measurement results are collected over the next 120 s. We average our results over three runs.
Configuration and Benchmarking. For evaluating the protocols, we employed YCSB [13] from Blockbench's macro benchmarks [16]. Each client request queries a YCSB table that holds half a million active records. We require 90% of the requests to be write queries, as the majority of typical blockchain transactions are updates to existing records. Prior to the experiments, each
replica is initialized with an identical copy of the YCSB table. The client requests generated by YCSB follow a Zipfian distribution and are heavily skewed (skew factor 0.9).

Figure 8: System performance using three different signature schemes. In all cases, n = 16 replicas participate in consensus.
Unless explicitly stated, we use the following configuration for all experiments. We perform scaling experiments by varying the number of replicas from 4 to 91. We divide our experiments along two dimensions: (1) Zero Payload or Standard Payload, and (2) Failures or No Failures. We employ batching with a batch size of 100, as the percentage increase in throughput for larger batch sizes is small. Under Zero Payload conditions, all replicas execute 100 dummy instructions per batch, while the primary sends an empty proposal (and not a batch of 100 requests). Under Standard Payload, with a batch size of 100, the size of a propose message is 5400 B, the size of a response message is 1748 B, and other messages are around 250 B. For experiments with failures, we force one backup replica to crash. Additionally, we present an experiment that illustrates the effect of primary failure. We measure throughput as transactions executed per second. We measure latency as the time from when the client sends a request to the time when the client receives a response.
Other protocols: We also implement PBFT, Zyzzyva, SBFT, and HotStuff in our ResilientDB fabric. We refer to Section 2 for further details on the working of Zyzzyva, SBFT, and HotStuff. Our implementation of PBFT is based on the BFTSmart [7] framework with the added benefits of out-of-order processing, pipelining, and multi-threading. In both PBFT and Zyzzyva, digital signatures are used for authenticating messages sent by the clients, while MACs are used for other messages. Both SBFT and HotStuff require threshold signatures for their communication.
5.1 System Characterization
We first determine the upper bounds on the performance of ResilientDB. In Figure 7, we present the maximum throughput and latency of ResilientDB when there is no communication among the replicas. We use the term No Execution to refer to the case where all clients send their requests to the primary replica and the primary simply responds back to the client. We count every query responded to in the system throughput. We use the term Execution to refer to the case where the primary replica executes each query before responding back to the client.
The architecture of ResilientDB (see Section 4) states the use of one worker-thread. In these experiments, we maximize system performance by allowing up to two threads to work independently at the primary replica without ordering any queries. Our results indicate that the system can attain high throughput (up to 500 k txn/s) and incurs low latencies (up to 0.25 s). Notice that if we employ additional worker-threads, our ResilientDB fabric can easily attain higher throughput.
5.2 Effect of Cryptographic Signatures
ResilientDB enables a flexible design where replicas and clients can employ both digital signatures (threshold signatures) and message authentication codes. This helps us to implement PoE and other consensus protocols in ResilientDB.
To achieve authenticated communication using symmetric cryptography, we employ a combination of CMAC and AES [36]. Further, we employ ED25519-based digital signatures to enable asymmetric cryptographic signing. For an efficient threshold signature scheme, we use Boneh–Lynn–Shacham (BLS) signatures [36]. To create message digests and for hashing purposes, we use the SHA256 algorithm.
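To illustrate the relative roles of these primitives, the sketch below signs a client request with ED25519 and authenticates a replica-to-replica message with CMAC-AES, assuming the pyca/cryptography package; it is an illustration of the primitives named above, not ResilientDB's actual implementation.

import os
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.ciphers import algorithms
from cryptography.hazmat.primitives import cmac, hashes

# Client side: requests are signed with ED25519 (asymmetric, verifiable by all).
client_key = Ed25519PrivateKey.generate()
request = b"YCSB-update:record-42"
signature = client_key.sign(request)
client_key.public_key().verify(signature, request)   # raises if tampered

# Replica side: replica-to-replica messages only need a MAC (CMAC + AES),
# using the secret key shared by the communicating pair of replicas.
shared_key = os.urandom(16)                           # 128-bit AES key
tag_gen = cmac.CMAC(algorithms.AES(shared_key))
tag_gen.update(b"support:digest:view=0:k=7")
tag = tag_gen.finalize()

verifier = cmac.CMAC(algorithms.AES(shared_key))
verifier.update(b"support:digest:view=0:k=7")
verifier.verify(tag)                                  # raises if the MAC does not match

# Digests are SHA256, as in ResilientDB.
digest = hashes.Hash(hashes.SHA256())
digest.update(request)
print(digest.finalize().hex())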
Next, we determine the cost of different cryptographic signing schemes. For this purpose, we run three different experiments in which (i) no signature scheme is used (None); (ii) everyone uses digital signatures based on ED25519 (ED); and (iii) all replicas use CMAC+AES for signing, while clients sign their messages using ED25519 (MAC). In these three experiments, we run PBFT consensus among 16 replicas. In Figure 8, we illustrate the throughput attained and latency incurred by ResilientDB for the experiments. Clearly, the system attains its highest throughput when no signatures are employed. However, such a system cannot handle malicious attacks. Further, using just digital signatures for signing messages can prove to be expensive. An optimal configuration can require clients to sign their messages using digital signatures, while replicas can communicate using MACs.
5.3 Scaling Replicas under Standard Payload
In this section, we evaluate the scalability of PoE both under a backup failure and under no failures.
(1) Single Backup Failure. We use Figures 9(a) and 9(b) to illustrate the throughput and latency attained by the system when running different consensus protocols under a backup failure. These graphs affirm our claim that PoE attains higher throughput and incurs lower latency than all other protocols.
In the case of PBFT, each replica participates in two phases of quadratic communication, which limits its throughput. For the twin-path protocols such as Zyzzyva and SBFT, a single failure is sufficient to cause massive reductions in their system throughputs. Notice that the collector in SBFT and the clients in Zyzzyva have to wait for messages from all n replicas, respectively. As predicting an optimal value for timeouts is hard [11, 12], we chose a very small value for the timeout (3 s) for replicas and clients. We justify these values, as the experiments we show later in this section show that the average latency can be as large as 6 s. We note that high timeouts affect Zyzzyva more than SBFT. In Zyzzyva, clients are waiting for timeouts, during which they stop sending requests, which empties the pipeline at the primary, starving it of new requests to propose. To alleviate such issues in real-world deployments of Zyzzyva, clients need to be able to precisely predict the latency to minimize the time the clients need to wait between requests. Unfortunately, this is hard and runs the risk of ending up in the expensive slow path of Zyzzyva whenever the predicted latency is slightly off. In SBFT, the collector may time out waiting for threshold shares for the k-th round while the primary can continue to propose requests for future rounds l, l > k. Hence, in SBFT replicas have more opportunity to occupy themselves with useful work.
HotStuff attains significantly lower throughput due to its se-
quential primary-rotation model, in which each of its primaries
has to wait for the previous primary before proposing the next
request, which leads to a huge reduction in its throughput. In-
terestingly, HotStuff incurs the least average latency among
all protocols. This is a result of the intensive load on the system
when running other protocols. As these protocols process several
requests concurrently (see the multi-threaded architecture in Sec-
tion 4), these requests spend on average more time in the queue
before being processed by a replica. Notice that all out-of-order
consensus protocols employ this trade-off: a small sacrifice on
latency yields higher gains in system throughput.
In case of PoE, its high throughput under failures is a result
of its three-phase linear protocol that does not rely on any twin-
path model. To summarize, PoE attains up to 43%, 72%, 24× and
62× more throughput than Pbft, SBFT, HotStuff, and Zyzzyva,
respectively.
(2) No Replica Failure.
We use Figures 9(c) and 9(d) to il-
lustrate the throughput and latency attained by the system on
running different consensus protocols in fault-free conditions.
These plots help us to bound the maximum throughput that can
be attained by the different consensus protocols in our system.
First, as expected, in comparison to Figures 9(a) and 9(b),
the throughputs for PoE and Pbft are slightly higher. Second,
PoE continues to outperform both Pbft and HotStuff, for the
reasons described earlier. Third, both Zyzzyva and SBFT now
attain higher throughputs as their clients and collector no longer
time out, respectively. The key reason SBFT's gains are limited is
that SBFT requires five phases and becomes computation bound.
Although Pbft is quadratic, it employs MACs, which are cheaper
to sign and verify.
Notice that the differences in the throughputs of PoE and Zyzzyva
are small: PoE has 20% (on 91 replicas) to 13% (on 4 replicas) lower
throughput than Zyzzyva. An interesting observation is that on
91 replicas, Zyzzyva incurs almost the same latency as PoE, even
though it has higher throughput. This happens as clients in PoE
have to wait for only the fastest n − f = 61 replies, whereas a
client of Zyzzyva has to wait for replies from all replicas (even
the slowest ones). To conclude, PoE attains up to 35%, 27% and
21× more throughput than Pbft, SBFT and HotStuff, respectively.
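To make the reply-quorum comparison concrete, the following arithmetic (our own addition, assuming the standard n = 3f + 1 fault-tolerance bound) shows where the value of 61 comes from in the 91-replica deployment:

```latex
% Quorum arithmetic for the 91-replica deployment, assuming n = 3f + 1.
\[
  n = 3f + 1 = 91
  \;\Longrightarrow\;
  f = \frac{n - 1}{3} = 30,
  \qquad
  n - f = 91 - 30 = 61 .
\]
```

Hence, a PoE client can proceed after the fastest 61 replies, whereas a Zyzzyva client in its fast path must hear from all 91 replicas, including the slowest one.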
5.4 Scaling Replicas under Zero Payload
We now measure the performance of the different protocols under
zero payload. In each of these protocols, the primary starts consensus
by sending a Propose message that includes all transactions. As
a result, this message has the largest size and is responsible for
consuming the majority of the bandwidth. A zero payload ex-
periment ensures that each replica executes dummy instructions.
Hence, the primary is no longer a bottleneck.
We again run these experiments for both Single Failure and
Failure-Free cases, and use Figures 9(e) to 9(h) to illustrate our
observations. It is evident from these figures that zero payload
experiments have helped in increasing PoE's gains. PoE attains
up to 85%, 62% and 27× more throughput than Pbft, SBFT and
HotStuff, respectively. In fact, under failure-free conditions,
the throughput attained by PoE is comparable to Zyzzyva. This
is easily explained. First, both PoE and Zyzzyva are linear pro-
tocols. Second, although in failure-free cases Zyzzyva attains
consensus in one phase, its clients need to wait for responses from
all n replicas, which gives PoE an opportunity to cover the gap.
However, SBFT, despite being a linear protocol, does not perform
as well as its linear counterparts: its throughput is impacted by
the delay of its five phases.
5.5 Impact of Batching under Failures
Next, we study the effect of batching client requests on bft pro-
tocols [9, 51]. To do so, we measure performance as a function
of the number of requests in a batch (the batch-size), which we
vary between 10 and 400. For this experiment, we use a system
with 32 available replicas, of which one replica has failed.
We use Figures 9(i) and 9(j) to illustrate, for each consensus
protocol, the throughput and average latency attained by the
system.

Figure 9: Evaluating system throughput and average latency incurred by PoE and other bft protocols. Panels: (a)–(b) scalability under a single failure; (c)–(d) scalability under no failures; (e)–(f) zero payload under a single failure; (g)–(h) zero payload under no failures; (i)–(j) batching under a single failure; (k)–(l) out-of-ordering disabled. Throughput is reported in txn/s and latency in seconds; the x-axis is the number of replicas (n), except in (i)–(j), where it is the batch-size.

For each protocol, increasing the batch-size also increases
throughput, while decreasing the latency. This happens as larger
batch-sizes require fewer consensus rounds to complete the exact
same set of requests, reducing the cost of ordering and executing
the transactions. This not only improves throughput, but also
reduces client latencies as clients receive faster responses for
their requests. Although increasing the batch-size reduces the
number of consensus rounds, the resulting larger messages even-
tually cause a proportional decrease in throughput (or increase
in latency). This is evident from the experiments at higher batch-
sizes: increasing the batch-size beyond 100 gradually curves the
throughput plots towards a limit for PoE, Pbft and SBFT. For
example, on increasing the batch-size from 100 to 400, PoE and
Pbft see an increase in throughput of 60% and 80%, respectively,
while the gap in throughput reduces from 43% to 25%. As in the
previous experiments, Zyzzyva yields a significantly lower
throughput as it cannot handle failures. In case of HotStuff, an
increase in batch-size does increase its throughput, but due to
the scaling of the graph this change seems insignificant.
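The shape of these curves can be captured by a simple back-of-envelope model, which we add purely as an illustration; the constants below are hypothetical and not measured values from our evaluation. If a consensus round costs a fixed overhead plus a per-transaction cost that grows with the batch, throughput rises with the batch-size but saturates at the inverse of the per-transaction cost.

```python
# Back-of-envelope batching model (illustrative only): a consensus round is
# assumed to cost a fixed overhead c_round plus c_txn per batched transaction,
# so throughput = batch / (c_round + batch * c_txn), saturating at 1 / c_txn.
def batched_throughput(batch_size: int,
                       c_round: float = 4e-3,        # hypothetical per-round cost (s)
                       c_txn: float = 5e-6) -> float:  # hypothetical per-txn cost (s)
    return batch_size / (c_round + batch_size * c_txn)

if __name__ == "__main__":
    for batch in (10, 50, 100, 200, 400):
        print(f"batch={batch:4d}  ~{batched_throughput(batch):,.0f} txn/s")
```

Under this model, gains flatten once the per-transaction term dominates the fixed per-round overhead, mirroring the curves in Figures 9(i) and 9(j).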
5.6 Disabling Out-of-Ordering
Until now, we allowed protocols like Pbft, PoE, SBFT and Zyzzyva
to process requests out-of-order. As a result, these protocols
achieve much higher throughputs than HotStuff, which is re-
stricted by its sequential primary-rotation model. In Figures 9(k)
and 9(l), we evaluate the performance of the protocols when there
are no opportunities for out-of-ordering.
In this setting, we require each client to only send its request
when it has accepted a response for its previous query. As Hot-
Stuff pipelines its phases of consensus into a four-phase pipeline,
we allow it to access four client requests (each on a distinct
subsequent replica) at any time. As expected, HotStuff performs
better than all other protocols at the expense of a higher latency,
as it rotates primaries at the end of each consensus, which allows
it to pipeline four requests.

Figure 10: System throughput of PoE and Pbft under instance failures (n = 32). (a) replicas detect the failure of the primary and broadcast vc-request; (b) replicas receive vc-request messages from other replicas; (c) replicas receive nv-propose from the new primary; (d) state recovery.

However, notice that once out-of-
ordering is disabled, throughput drops from 200 ktransactions/s
to just a few thousand transactions/s. Hence, from a practical
standpoint, out-of-ordering is simply crucial. Further, the
difference in latency between the protocols is quite small, and
the visible variation is a result of graph scaling, while the actual
numbers are in the range of 20 ms–40 ms.
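To make the cost of this closed-loop setting concrete, the sketch below (our own illustration, not part of ResilientDB) models a client that keeps at most one request in flight; its throughput is then capped at one transaction per end-to-end latency, independent of how fast the replicas are.

```python
# Closed-loop client model (illustrative only): with a single outstanding
# request, a client completes at most 1 / latency transactions per second.
def closed_loop_throughput(latency_s: float, duration_s: float = 1.0) -> float:
    completed, elapsed = 0, 0.0
    while elapsed + latency_s <= duration_s:
        elapsed += latency_s   # send one request and wait for its committed reply
        completed += 1         # only now may the next request be sent
    return completed / duration_s

# With the 20 ms-40 ms latencies reported above, a single closed-loop client
# completes only 25-50 txn/s; pipelining requests removes this cap.
print(closed_loop_throughput(0.02), closed_loop_throughput(0.04))
```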
5.7 Primary Failure–View Change
In Figure 10, we study the impact of a benign primary failure on
PoE and Pbft. To recover from a primary failure, backup replicas
run the view-change protocol. We skip illustrating view-change
plots for Zyzzyva and SBFT as they already face severe reductions
in throughput under a single backup failure. Further, Zyzzyva has
an unsafe view-change algorithm and SBFT's view-change algo-
rithm is no less expensive than that of Pbft. For HotStuff, we do
not show results as it changes the primary at the end of every
consensus.

Figure 11: System throughput and average latency incurred by PoE and Pbft in a WAN deployment of five regions under a single failure. In the largest deployment, we have 140 replicas spread equally over these regions.

Although single-primary protocols face a momentary loss
in throughput during view-change, these protocols easily cover
this gap through their ability to process messages out-of-order.
For our experiments, we let the primary replica complete con-
sensus for 10 s (or around a million transactions) and then fail.
This causes clients to time out while waiting for responses for
their pending transactions. Hence, these clients forward their
requests to backup replicas.
When a backup replica receives a client request, it forwards
that request to the primary and waits on a timer. Once a replica
times out, it detects a primary failure and broadcasts a vc-request
message to all other replicas, initiating the view-change protocol (a).
Next, each replica waits for a new-view message from the next
primary. In the meantime, a replica may receive vc-request mes-
sages from other replicas (b). Once a replica receives the nv-propose
message from the new primary (c), it moves to the next view.
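The sketch below is a simplified Python illustration of the backup-side steps (a)–(c) described above; the network object and its send, broadcast, and num_replicas members are assumptions of ours, and the quorum checks, certificates, and state recovery of the real view-change protocol are omitted.

```python
import threading

class BackupReplica:
    """Simplified backup-side view-change trigger (illustration only)."""

    def __init__(self, replica_id, network, timeout_s=3.0):
        self.id = replica_id
        self.net = network        # assumed to offer send/broadcast primitives
        self.timeout_s = timeout_s
        self.view = 0
        self.timer = None

    def on_client_request(self, request):
        # Forward the request to the current primary and start a timer.
        self.net.send(self.primary(), request)
        self.timer = threading.Timer(self.timeout_s, self.on_timeout)
        self.timer.start()

    def on_timeout(self):
        # (a) Suspect the primary and ask all replicas to change views.
        self.net.broadcast({"type": "vc-request", "view": self.view + 1})

    def on_vc_request(self, msg):
        # (b) Collect vc-request messages from other replicas (the quorum
        # logic that supports the view change is omitted in this sketch).
        pass

    def on_nv_propose(self, msg):
        # (c) Accept the new primary's proposal and move to the next view.
        if msg["view"] > self.view:
            self.view = msg["view"]
            if self.timer is not None:
                self.timer.cancel()

    def primary(self):
        # Assumption: the primary of a view is chosen round-robin.
        return self.view % self.net.num_replicas
```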
5.8 WAN Scalability
In this section, we use Figure 11 to illustrate the throughputs and
latencies for different PoE and Pbft deployments on a wide-area
network in the presence of a single failure. Specifically, we deploy
clients and replicas across five locations across the globe: Oregon,
Iowa, Montreal, the Netherlands, and Taiwan. Next, we vary the
number of replicas from 20 to 140 by equally distributing these
replicas across the regions.
These plots affirm our existing observations that PoE out-
performs existing state-of-the-art protocols and scales well in
wide-area deployments. Specifically, PoE achieves up to 1.41×
higher throughput and incurs 28.67% less latency than Pbft. We
skip presenting plots for SBFT, HotStuff and Zyzzyva due to
their low throughputs under failures.
5.9 Simulating bft Protocols
To further underline that the message delay, and not the band-
width requirements, becomes the determining factor in the through-
put of protocols in which the primary does not propose requests
out-of-order, we performed a separate simulation of the maxi-
mum performance of PoE, Pbft, and HotStuff. The simulation
makes 500 consensus decisions and processes all message send
and receive steps, but delays the arrival of messages by a pre-
determined message delay. The simulation skips any expensive
computations and, hence, the simulated performance is entirely
determined by the cost of message exchanges. We ran the sim-
ulation with n ∈ {4, 16, 128} replicas, for which the results can
be found in the first three plots of Figure 12. As one can see, if
bandwidth is not a limiting factor, then the performance of proto-
cols that do not propose requests out-of-order is determined by
the number of communication rounds and the message delay.
As both Pbft and PoE have one communication round more
than the two rounds of HotStuff, their performance is roughly
two-thirds that of HotStuff, independent of the number of
replicas or the message delay. Furthermore, doubling the message
delay will roughly halve performance. Finally, we also measured
the maximum performance of protocols that do allow out-of-order
processing of up to 250 consensus decisions. These results can be
found in the last plot of Figure 12. As these results show, out-of-order
processing increases performance by a factor of roughly 200,
even with 128 replicas.
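These observations can be summarized by a simple delay-bound model (our own formulation of the two observations above, not a formula from the simulation itself): without out-of-order processing, a protocol that needs r message rounds of delay d per decision completes at most 1/(r·d) decisions per second.

```latex
% Delay-bound throughput without out-of-order processing.
\[
  T(r, d) \le \frac{1}{r \cdot d},
  \qquad
  \frac{T_{\mathrm{PoE/Pbft}}}{T_{\mathrm{HotStuff}}}
    = \frac{1/(3d)}{1/(2d)} = \frac{2}{3},
  \qquad
  T(r, 2d) = \tfrac{1}{2}\, T(r, d).
\]
```

For example, with d = 10 ms, a three-round protocol is bounded by roughly 33 decisions per second under this model.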
6 RELATED WORK
Consensus is an age-old problem that has received much theoretical
and practical attention (see, e.g., [34, 39, 45]). Further, the use
of rollbacks is common in distributed systems. E.g., the crash-
resilient replication protocol Raft [45] allows primaries to re-
write the log of any replica. In a Byzantine environment, such an
approach would delegate too much power to the primary, as it
can maliciously overwrite transactions that need to be preserved.
The interest in practical bft consensus protocols took off
with the introduction of Pbft [9]. Apart from the protocols that
we already discussed, there are some interesting protocols that
achieve efficient consensus by requiring 5f + 1 replicas [1, 14].
However, these protocols have been shown to work only in the
cases where transactions are non-conflicting [38]. Some other
protocols [10, 50] suggest the use of trusted components to reduce
the cost of consensus. These works require only 2f + 1 replicas,
as the trusted component helps to guarantee a correct ordering.
The safety of these protocols relies on the security of the trusted
component. In comparison, PoE does (i) not require extra replicas,
(ii) not depend on clients, (iii) not require trusted components,
and (iv) not need the two phases of quadratic communication
required by Pbft.
As a promising future direction, Castro [9] also suggested ex-
ploring speculative optimizations for Pbft, which he referred
to as tentative execution. However, this lacked: (i) a formal de-
scription, (ii) a non-divergence safety property, (iii) a specification
of rollback under attacks, (iv) a re-examination of the view-change
protocol, and (v) any actual evaluation.
Consensus for Blockchains: Since the introduction of Bit-
coin [42], the well-known cryptocurrency that led to the coining
of the term blockchain, several new consensus protocols
that cater to cryptocurrencies have been designed [33, 37]. Bit-
coin [42] employs the Proof-of-Work [33] consensus protocol
(PoW), which is computationally intensive, achieves low through-
put, and can cause forks (divergence) in the blockchain: separate
chains can exist on non-faulty replicas, which in turn can cause
double-spending attacks [31]. Due to these limitations, several
other similar algorithms have been proposed, e.g., Proof-of-Stake
(PoS) [37], which is designed such that any replica owning n% of
the total resources gets the opportunity to create n% of the new
blocks. As PoS is resource driven, it can face attacks where repli-
cas are incentivized to work simultaneously on several forks of
the blockchain, without ever trying to eliminate these forks.
There are also a set of interesting alternative designs, such as
ConFlux [40], Caper [3] and MeshCash [6], that suggest the use of
directed acyclic graphs (DAGs) to store a blockchain to improve
the performance of Bitcoin. However, these protocols either rely
on PoW or Pbft for consensus.
Meta-protocols such as RCC [28] and RBFT [5] run multiple
Pbft consensuses in parallel. These protocols also aim at re-
moving the dependence on consensus led by a single primary.
Figure 12: The simulated number of consensus decisions PoE, Pbft, and HotStuff can make as a function of the latency. The first three plots show simulated performance with 4, 16, and 128 replicas; the right-most plot shows PoE∗ and Pbft∗ with 128 replicas. Only the protocols in the right-most plot, marked with ∗, process requests out-of-order.

A recent protocol, PoV [41], provides fast consensus in a
consortium architecture. PoV does this by restricting the ability
to propose blocks among a subset of trusted replicas.
PoE does not face the limitations faced by PoW [33] and
PoS [37]. The use of DAGs [3, 6, 40] and sharding [15, 52] is or-
thogonal to the design of PoE; hence, their use with PoE can reap
further benefits. Further, PoE can be employed by meta-protocols
and does not restrict consensus to any subset of replicas.
7 CONCLUSIONS
We present Proof-of-Execution (PoE), a novel Byzantine fault-
tolerant consensus protocol that guarantees safety and liveness
and does so in only three linear phases. PoE decouples ordering
from execution by allowing replicas to process messages out-of-
order and execute client transactions speculatively. Despite these
properties, PoE ensures that all replicas reach a single unique
order for all transactions. Further, PoE guarantees that if a
client observes identical results of execution from a majority of
the replicas, then it can reliably mark its transaction committed.
Due to speculative execution, PoE may, however, require replicas
to revert executed transactions. To evaluate PoE's design, we
implement it in our ResilientDB fabric. Our evaluation shows
that PoE achieves up to 80% higher throughput than existing
protocols in the presence of failures.
REFERENCES
[1] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. 2005. Fault-
scalable Byzantine Fault-tolerant Services. In Proceedings of the Twentieth ACM Symposium on Operating
Systems Principles. ACM, 59–74. https://doi.org/10.1145/1095810.1095817
[2] Ittai Abraham, Guy Gueta, Dahlia Malkhi, Lorenzo Alvisi, Rama Kotla, and Jean-Philippe Martin. 2017. Re-
visiting Fast Practical Byzantine Fault Tolerance. https://arxiv.org/abs/1712.01367
[3] Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. 2019. CAPER: A Cross-application Permis-
sioned Blockchain. Proc. VLDB Endow. 12, 11 (2019), 1385–1398. https://doi.org/10.14778/3342263.3342275
[4] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro,
David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, Srinivasan Muralidharan,
Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith Smith, Alessandro Sorniotti, Chrysoula
Stathakopoulou, Marko Vukolić, Sharon Weed Cocco, and Jason Yellick. 2018. Hyperledger Fabric: A Dis-
tributed Operating System for Permissioned Blockchains. In Proceedings of the Thirteenth EuroSys Conference.
ACM, 30:1–30:15. https://doi.org/10.1145/3190508.3190538
[5] Pierre-Louis Aublin, Sonia Ben Mokhtar, and Vivien Quéma. 2013. RBFT: Redundant Byzantine Fault Toler-
ance. In Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems. IEEE,
297–306. https://doi.org/10.1109/ICDCS.2013.53
[6] Iddo Bentov, Pavel Hubáček, Tal Moran, and Asaf Nadler. 2017. Tortoise and Hares Consensus: the Meshcash
Framework for Incentive-Compatible, Scalable Cryptocurrencies. https://eprint.iacr.org/2017/300
[7] Alysson Bessani, João Sousa, and Eduardo E.P. Alchieri. 2014. State Machine Replication for the Masses with
BFT-SMART. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE,
355–362. https://doi.org/10.1109/DSN.2014.43
[8] Manuel Bravo, Zsolt István, and Man-Kit Sit. 2020. Towards Improving the Performance of BFT Consensus
For Future Permissioned Blockchains. https://arxiv.org/abs/2007.12637
[9] Miguel Castro and Barbara Liskov. 2002. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM
Trans. Comput. Syst. 20, 4 (2002), 398–461. https://doi.org/10.1145/571637.571640
[10] Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. 2007. Attested Append-Only
Memory: Making Adversaries Stick to Their Word. SIGOPS Oper. Syst. Rev. 41, 6 (2007), 189–204. https://doi.org/10.1145/1323293.1294280
[11] Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, and Taylor Riche.
2009. Upright Cluster Services. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems
Principles. ACM, 277–290. https://doi.org/10.1145/1629575.1629602
[12] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. 2009. Making Byzantine
Fault Tolerant Systems Tolerate Byzantine Faults. In Proceedings of the 6th USENIX Symposium on Networked
Systems Design and Implementation. USENIX, 153–168.
[13] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking
Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM,
143–154. https://doi.org/10.1145/1807128.1807152
[14] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. 2006. HQ Replication:
A Hybrid Quorum Protocol for Byzantine Fault Tolerance. In Proceedings of the 7th Symposium on Operating
Systems Design and Implementation. USENIX, 177–190.
[15] Hung Dang, Tien Tuan Anh Dinh, Dumitrel Loghin, Ee-Chien Chang, Qian Lin, and Beng Chin Ooi. 2019.
Towards Scaling Blockchain Systems via Sharding. In Proceedings of the 2019 International Conference on
Management of Data. ACM, 123–140. https://doi.org/10.1145/3299869.3319889
[16] Tien Tuan Anh Dinh, Ji Wang, Gang Chen, Rui Liu, Beng Chin Ooi, and Kian-Lee Tan. 2017. BLOCKBENCH:
A Framework for Analyzing Private Blockchains. In Proceedings of the 2017 ACM International Conference
on Management of Data. ACM, 1085–1100. https://doi.org/10.1145/3035918.3064033
[17] Wayne W. Eckerson. 2002. Data quality and the bottom line: Achieving Business Success through a Commit-
ment to High Quality Data. Technical Report. The Data Warehousing Institute, 101communications LLC.
[18] Muhammad El-Hindi, Carsten Binnig, Arvind Arasu, Donald Kossmann, and Ravi Ramamurthy. 2019.
BlockchainDB: A Shared Database on Blockchains. Proc. VLDB Endow. 12, 11 (2019), 1597–1609. https://doi.org/10.14778/3342263.3342636
[19] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of Distributed Consensus
with One Faulty Process. J. ACM 32, 2 (1985), 374–382. https://doi.org/10.1145/3149.214121
[20] Gideon Greenspan. 2015. MultiChain Private Blockchain–White Paper. https://www.multichain.com/download/MultiChain-White-Paper.pdf
[21] Guy Golan Gueta, Ittai Abraham, Shelly Grossman, Dahlia Malkhi, Benny Pinkas, Michael Reiter, Dragos-
Adrian Seredinschi, Orr Tamir, and Alin Tomescu. 2019. SBFT: A Scalable and Decentralized Trust Infras-
tructure. In 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE,
568–580. https://doi.org/10.1109/DSN.2019.00063
[22] Jim Gray. 1978. Notes on Data Base Operating Systems. In Operating Systems, An Advanced Course. Springer-
Verlag, 393–481. https://doi.org/10.1007/3-540-08755-9_9
[23] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2019. An In-Depth Look of BFT
Consensus in Blockchain: Challenges and Opportunities. In Proceedings of the 20th International Middleware
Conference Tutorials, Middleware. ACM, 6–10. https://doi.org/10.1145/3366625.3369437
[24] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Blockchain consensus un-
raveled: virtues and limitations. In Proceedings of the 14th ACM International Conference on Distributed and
Event-based Systems. ACM, 218–221. https://doi.org/10.1145/3401025.3404099
[25] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Building High Throughput
Permissioned Blockchain Fabrics: Challenges and Opportunities. Proc. VLDB Endow. 13, 12 (2020), 3441–3444.
https://doi.org/10.14778/3415478.3415565
[26] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2019. Brief Announcement: Revisiting Consen-
sus Protocols through Wait-Free Parallelization. In 33rd International Symposium on Distributed Computing
(DISC 2019), Vol. 146. Schloss Dagstuhl, 44:1–44:3. https://doi.org/10.4230/LIPIcs.DISC.2019.44
[27] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2021. Fault-Tolerant Distributed Transactions on
Blockchain. Morgan & Claypool. https://doi.org/10.2200/S01068ED1V01Y202012DTM065
[28] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2021. RCC: Resilient Concurrent Consensus for
High-Throughput Secure Transaction Processing. In 37th IEEE International Conference on Data Engineering.
IEEE. to appear.
[29] Suyash Gupta, Sajjad Rahnama, Jelle Hellings, and Mohammad Sadoghi. 2020. ResilientDB: Global Scale
Resilient Blockchain Fabric. Proc. VLDB Endow. 13, 6 (2020), 868–883. https://doi.org/10.14778/3380750.
3380757
[30] Suyash Gupta, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Permissioned Blockchain Through the Look-
ing Glass: Architectural and Implementation Lessons Learned. In Proceedings of the 40th IEEE International
Conference on Distributed Computing Systems.
[31] Suyash Gupta and Mohammad Sadoghi. 2019. Blockchain Transaction Processing. In Encyclopedia of Big
Data Technologies. Springer, 1–11. https://doi.org/10.1007/978-3-319-63962-8_333-1
[32] Thomas N. Herzog, Fritz J. Scheuren, and William E. Winkler. 2007. Data Quality and Record Linkage Tech-
niques. Springer. https://doi.org/10.1007/0-387-69505-2
[33] Markus Jakobsson and Ari Juels. 1999. Proofs of Work and Bread Pudding Protocols. In Secure Information
Networks: Communications and Multimedia Security IFIP TC6/TC11 Joint Working Conference on Communications and Multimedia Security (CMS'99). Springer, 258–272. https://doi.org/10.1007/978-0-387-35568-9_18
[34] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. 2011. Zab: High-Performance Broadcast for
Primary-Backup Systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable
Systems & Networks. IEEE, 245–256. https://doi.org/10.1109/DSN.2011.5958223
[35] Manos Kapritsos, Yang Wang, Vivien Quema, Allen Clement, Lorenzo Alvisi, and Mike Dahlin. 2012. All
about Eve: Execute-Verify Replication for Multi-Core Servers. In Proceedings of the 10th USENIX Conference
on Operating Systems Design and Implementation. USENIX, 237–250.
[36] Jonathan Katz and Yehuda Lindell. 2014. Introduction to Modern Cryptography (2nd ed.). Chapman and
Hall/CRC.
[37] Sunny King and Scott Nadal. 2012. PPCoin: Peer-to-Peer Crypto-Currency with Proof-of-Stake. https://www.peercoin.net/whitepapers/peercoin-paper.pdf
[38] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. 2009. Zyzzyva: Spec-
ulative Byzantine Fault Tolerance. ACM Trans. Comput. Syst. 27, 4 (2009), 7:1–7:39. https://doi.org/10.1145/
1658357.1658358
[39] Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (2001), 51–58. https://doi.org/10.1145/
568425.568433 Distributed Computing Column 5.
[40] Chenxing Li, Peilun Li, Dong Zhou, Wei Xu, Fan Long, and Andrew Yao. 2018. Scaling Nakamoto Consensus
to Thousands of Transactions per Second. https://arxiv.org/abs/1805.03870
[41] Kejiao Li, Hui Li, Han Wang, Huiyao An, Ping Lu, Peng Yi, and Fusheng Zhu. 2020. PoV: An Efficient
Voting-Based Consensus Algorithm for Consortium Blockchains. Front. Blockchain 3 (2020), 11. https://doi.org/10.3389/fbloc.2020.00011
[42] Satoshi Nakamoto. 2009. Bitcoin: A Peer-to-Peer Electronic Cash System. https://bitcoin.org/bitcoin.pdf
[43] Faisal Nawab and Mohammad Sadoghi. 2019. Blockplane: A Global-Scale Byzantizing Middleware. In 35th
International Conference on Data Engineering (ICDE). IEEE, 124–135. https://doi.org/10.1109/ICDE.2019.
00020
[44] The Council of Economic Advisers. 2018. The Cost of Malicious Cyber Activity to the U.S. Economy. Technical
Report. Executive Office of the President of the United States. https://www.whitehouse.gov/wp-content/uploads/2018/03/The-Cost-of-Malicious-Cyber-Activity-to-the-U.S.-Economy.pdf
[45] Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Pro-
ceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. USENIX, 305–320.
[46] M. Tamer Özsu and Patrick Valduriez. 2020. Principles of Distributed Database Systems. Springer. https://doi.org/10.1007/978-3-030-26253-2
[47] Sajjad Rahnama, Suyash Gupta, Thamir Qadah, Jelle Hellings, and Mohammad Sadoghi. 2020. Scalable,
Resilient and Configurable Permissioned Blockchain Fabric. Proc. VLDB Endow. 13, 12 (2020), 2893–2896.
https://doi.org/10.14778/3415478.3415502
[48] Thomas C. Redman. 1998. The Impact of Poor Data Quality on the Typical Enterprise. Commun. ACM 41, 2
(1998), 79–82. https://doi.org/10.1145/269012.269025
[49] Dale Skeen. 1982. A Quorum-Based Commit Protocol. Technical Report. Cornell University.
[50] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Verissimo.
2013. Efficient Byzantine Fault-Tolerance. IEEE Trans. Comput. 62, 1 (2013), 16–30. https://doi.org/10.1109/
TC.2011.221
[51] Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. 2019. HotStuff: BFT
Consensus with Linearity and Responsiveness. In Proceedings of the ACM Symposium on Principles of Dis-
tributed Computing. ACM, 347–356. https://doi.org/10.1145/3293611.3331591
[52] Mahdi Zamani, Mahnush Movahedi, and Mariana Raykova. 2018. RapidChain: Scaling Blockchain via Full
Sharding. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security.
ACM, 931–948. https://doi.org/10.1145/3243734.3243853