Proof-of-Execution: Reaching Consensus through Fault-Tolerant Speculation


Abstract and Figures

Since the introduction of blockchains, several new database systems and applications have tried to employ them. At the core of such blockchain designs are Byzantine Fault-Tolerant (BFT) consensus protocols that enable designing systems that are resilient to failures and malicious behavior. Unfortunately, existing BFT protocols seem unsuitable for usage in database systems due to their high computational costs, high communication costs, high client latencies, and/or reliance on trusted components and clients. In this paper, we present the Proof-of-Execution consensus protocol (PoE) that alleviates these challenges. At the core of PoE are out-of-order processing and speculative execution, which allow PoE to execute transactions before consensus is reached among the replicas. With these techniques, PoE manages to reduce the costs of BFT in normal cases, while still providing reliable consensus toward clients in all cases. We envision the use of PoE in high-performance resilient database systems. To validate this vision, we implement PoE in our efficient ResilientDB blockchain and database framework. ResilientDB helps us to implement and evaluate PoE against several state-of-the-art BFT protocols. Our evaluation shows that PoE achieves up to $86\%$ more throughput than existing BFT protocols.
Proof-of-Execution: Reaching Consensus through
Fault-Tolerant Speculation
Suyash Gupta Jelle Hellings Sajjad Rahnama Mohammad Sadoghi
Exploratory Systems Lab
Department of Computer Science
University of California, Davis
ABSTRACT
Multi-party data management and blockchain systems require data sharing among participants. To provide resilient and consistent data sharing, transaction engines rely on Byzantine Fault-Tolerant consensus (BFT), which enables operations during failures and malicious behavior. Unfortunately, existing BFT protocols are unsuitable for high-throughput applications due to their high computational costs, high communication costs, high client latencies, and/or reliance on twin-paths and non-faulty clients. In this paper, we present the Proof-of-Execution consensus protocol (PoE) that alleviates these challenges. At the core of PoE are out-of-order processing and speculative execution, which allow PoE to execute transactions before consensus is reached among the replicas. With these techniques, PoE manages to reduce the costs of BFT in normal cases, while guaranteeing reliable consensus for clients in all cases. We envision the use of PoE in high-throughput multi-party data-management and blockchain systems. To validate this vision, we implement PoE in our efficient ResilientDB fabric and extensively evaluate PoE against several state-of-the-art BFT protocols. Our evaluation shows that PoE achieves up to 80% higher throughput than existing BFT protocols in the presence of failures.
1 INTRODUCTION
In federated data management a single common database is managed by many independent stakeholders (e.g., an industry consortium). In doing so, federated data management can ease data sharing and improve data quality [17, 32, 48]. At the core of federated data management is reaching agreement on any updates on the common database in an efficient manner, which enables fast query processing, data retrieval, and data modifications.
One can achieve federated data management by replicating the common database among all participants, that is, by replicating the sequence of transactions that affect the database to all stakeholders. One can do so using commit protocols designed for distributed databases such as two-phase [22] and three-phase commit [49], or by using crash-resilient replication protocols such as Paxos [39] and Raft [45].
These solutions are error-prone in a federated decentralized environment in which each stakeholder manages its own replicas and replicas of each stakeholder can fail (e.g., due to software, hardware, or network failure) or act maliciously: commit protocols and replication protocols can only deal with crashes. Consequently, recent federated designs propose the usage of Byzantine Fault-Tolerant (BFT) consensus protocols. BFT consensus aims at ordering client requests among a set of replicas, some of which could be Byzantine, such that all non-faulty replicas reach agreement on a common order for these requests [9, 21, 29, 38, 51]. Furthermore, BFT consensus comes with the added benefit of democracy, as BFT consensus gives all replicas an equal vote in all agreement decisions, while the resilience of BFT can aid in dealing with the billions of dollars in losses associated with prevalent attacks on data management systems [44].

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org.
Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
Akin to commit protocols, the majority of BFT consensus protocols use a primary-backup model in which one replica is designated the primary that coordinates agreement, while the remaining replicas act as backups and follow the protocol [46]. This primary-backup BFT consensus was first popularized by the influential PBFT consensus protocol of Castro and Liskov [9]. The design of PBFT requires at least $3\mathbf{f} + 1$ replicas to deal with up to $\mathbf{f}$ malicious replicas and operates in three communication phases, two of which necessitate quadratic communication complexity. As such, PBFT is considered costly when compared to commit or replication protocols, which has negatively impacted the usage of BFT consensus in large-scale data management systems [8].
The recent interest in blockchain technology has revived interest in BFT consensus, has led to several new resilient data management systems (e.g., [3, 18, 29, 43]), and has led to the development of new BFT consensus protocols that promise efficiency at the cost of flexibility (e.g., [21, 28, 38, 51]). Despite the existence of these modern BFT consensus protocols, however, the majority of BFT-fueled systems [3, 18, 29, 43] still employ the classical, time-tested, flexible, and safe design of PBFT.
In this paper, we explore different design principles that can enable implementing a scalable and reliable agreement protocol that shields against malicious attacks. We use these design principles to introduce Proof-of-Execution (PoE), a novel BFT protocol that achieves resilient agreement in just three linear phases. To concoct PoE's scalable and resilient design, we start with PBFT and successively add four design elements:
(I1) Non-Divergent Speculative Execution. In PBFT, when the primary replica receives a client request, it forwards that request to the backups. Each backup, on receiving a request from the primary, agrees to support it by broadcasting a prepare message. When a replica receives prepare messages from the majority of other replicas, it marks itself as prepared and broadcasts a commit message. Each replica that has prepared, and receives commit messages from a majority of other replicas, executes the request. Evidently, PBFT requires two phases of all-to-all communication. Our first ingredient towards faster consensus is speculative execution. In PBFT terminology, PoE replicas execute requests after they get prepared, that is, they do not broadcast commit messages. This speculative execution is non-divergent as each replica has a partial guarantee (it has prepared) prior to execution.
(I2) Safe Rollbacks and Robustness under Failures. Due to speculative execution, a malicious primary in PoE can ensure that only a subset of replicas prepare and execute a request. Hence, a client may or may not receive a sufficient number of matching responses. PoE ensures that if a client receives a full proof-of-execution, consisting of responses from a majority of the non-faulty replicas, then such a request persists in time. Otherwise, PoE permits replicas to roll back their state if necessary. This proof-of-execution is the cornerstone of the correctness of PoE.
Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2021.27
(I3) Agnostic Signatures and Linear Communication. BFT protocols are run among distrusting parties. To provide security, these protocols employ cryptographic primitives for signing the messages and generating message digests. Prior works have shown that the choice of cryptographic signature scheme can impact the performance of the underlying system [9, 30]. Hence, we allow replicas to either employ message authentication codes (MACs) or threshold signatures (TSs) for signing [36]. When few replicas are participating in consensus (up to 16), a single phase of all-to-all communication is inexpensive and using MACs for such setups can make computations cheap. For larger setups, we employ TSs to achieve linear communication complexity. TSs permit us to split a phase of all-to-all communication into two linear phases [21, 51].
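The scheme choice described in (I3) can be read as a simple configuration rule. The following Python sketch is illustrative only: the names `Scheme`, `MAC_REPLICA_LIMIT`, and `choose_scheme` are hypothetical, and the only fact taken from the text is that MACs are considered cheap for setups of up to 16 replicas while TSs are used for larger ones.

```python
from enum import Enum


class Scheme(Enum):
    MAC = "message authentication codes"
    TS = "threshold signatures"


# Cut-off taken from the text: one all-to-all phase with MACs is
# described as inexpensive for up to 16 replicas.
MAC_REPLICA_LIMIT = 16


def choose_scheme(n: int) -> Scheme:
    """Pick a signing scheme for n replicas (hypothetical helper)."""
    if n < 4:
        # n > 3f with at least one tolerated fault (f >= 1) implies n >= 4.
        raise ValueError("need at least 4 replicas to tolerate one Byzantine fault")
    return Scheme.MAC if n <= MAC_REPLICA_LIMIT else Scheme.TS
```

In a real deployment the cut-off would presumably be tuned by measurement rather than hard-coded.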
(I4) Avoid Response Aggregation. SBFT [21], a recently proposed BFT protocol, suggests the use of a single replica (designated as the executor) to act as a response aggregator. Specifically, all replicas execute each client request and send their response to the executor. It is the duty of the executor to reply to the client and send a proof that a majority of the replicas not only executed this request, but also produced the same result. In PoE, we avoid this additional communication between the replicas by allowing each replica to respond directly to the client.
Specifically, we make the following contributions:
(1) We introduce PoE, a novel Byzantine fault-tolerant consensus protocol that uses speculative execution to reach agreement among replicas.
(2) To guarantee failure recovery in the presence of speculative execution and Byzantine behavior, we introduce a novel view-change protocol that can roll back requests.
(3) PoE supports batching and out-of-order processing, is signature-scheme agnostic, and can be made to employ either MACs or threshold signatures.
(4) PoE does not rely on non-faulty replicas, clients, or trusted hardware to achieve safe and efficient consensus.
(5) To validate our vision of using PoE in resilient federated data management systems, we implement PoE and four other BFT protocols (Zyzzyva, PBFT, SBFT, and HotStuff) in our efficient ResilientDB¹ fabric [23-25, 27, 29, 30, 47].
(6) We extensively evaluate PoE against these protocols on a Google Cloud deployment consisting of 91 replicas and 320 k clients under (i) no failure, (ii) backup failure, (iii) primary failure, (iv) batching of requests, (v) zero payload, and (vi) scaling the number of replicas. Further, to corroborate our results, we also stress test PoE and other protocols in a simulated environment. Our results show that PoE can achieve up to 80% more throughput than existing BFT protocols in the presence of failures.
To the best of our knowledge, PoE is the first BFT protocol that achieves consensus in only two phases while being able to deal with Byzantine failures and without relying on trusted clients (e.g., Zyzzyva [38]) or on trusted hardware (e.g., MinBFT [50]). Hence, PoE can serve as a drop-in replacement of PBFT to improve scalability and performance in permissioned blockchain fabrics such as our ResilientDB fabric [27-31], MultiChain [20], and Hyperledger Fabric [4]; in multi-primary meta-protocols such as RCC [26, 28]; and in sharding protocols such as AHL [15].
2 ANALYSIS OF DESIGN PRINCIPLES
To arrive at an optimal design for PoE, we studied practices followed by state-of-the-art distributed data management systems and applied their principles to the design of PoE where possible. In Figure 1, we present a comparison of PoE against four well-known resilient consensus protocols.

Protocol          Phases  Messages      Resilience  Requirements
Zyzzyva           1       O(n)          0           Reliable clients and unsafe
PoE (our paper)   3       O(3n)         f           Sign. agnostic
PBFT              3       O(n + 2n^2)   f
HotStuff          8       O(8n)         f           Sequential consensus
SBFT              5       O(5n)         0           Optimistic path

Figure 1: Comparison of BFT consensus protocols in a system with n replicas of which f are faulty. The costs given are for the normal-case behavior.

¹ ResilientDB is open-sourced at https://github.com/resilientdb.
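To get a feel for the asymptotic costs listed in Figure 1, one can evaluate the expressions inside the O(·) bounds for a concrete number of replicas. The Python sketch below does exactly that; treating the bounds as exact message counts is an assumption made purely for illustration (constant factors and message sizes are ignored), and the function name `normal_case_messages` is hypothetical.

```python
def normal_case_messages(n: int) -> dict[str, int]:
    """Evaluate the Figure 1 cost expressions literally for n replicas.

    These are the terms inside the O(.) bounds, not measured message counts.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return {
        "Zyzzyva": n,
        "PoE": 3 * n,
        "PBFT": n + 2 * n * n,
        "HotStuff": 8 * n,
        "SBFT": 5 * n,
    }


if __name__ == "__main__":
    for protocol, count in normal_case_messages(91).items():
        print(f"{protocol:10s} {count}")
```

Under this reading, only PBFT grows quadratically with the number of replicas; all other listed protocols stay linear.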
To illustrate the merits of PoE's design, we first briefly look at PBFT. The last phase of PBFT ensures that non-faulty replicas only execute requests and inform clients when there is a guarantee that such a transaction will be recovered after any failures. Hence, clients need to wait for only $\mathbf{f} + 1$ identical responses, of which at least one is from a non-faulty replica, to ensure guaranteed execution. By eliminating this last phase, replicas speculatively execute requests before obtaining recovery guarantees. This impacts PBFT-style consensus in two ways:
(1) First, clients need a way to determine proof-of-execution after which they have a guarantee that their requests are executed and maintained by the system. We shall show that such a proof-of-execution can be obtained using $\mathbf{nf} \geq 2\mathbf{f} + 1$ identical responses (instead of $\mathbf{f} + 1$ responses).
(2) Second, as requests are executed before they are guaranteed, replicas need to be able to roll back requests that are dropped during periods of recovery.
PoE's speculative execution guarantees that requests with a proof-of-execution will never be rolled back and that only a single request can obtain a proof-of-execution per round. Hence, speculative execution provides the same strong consistency (safety) as PBFT in all cases, at much lower cost under normal operations. Furthermore, we show that speculative execution is fully compatible with other scalable design principles applied to PBFT, e.g., batching and out-of-order processing to maximize throughput, even with high message delays.
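The difference between the two client-side thresholds can be made concrete with a little arithmetic. The following Python sketch assumes the system tolerates the largest possible number of faults, that is, f is the largest integer with n > 3f; the function name `quorums` and the dictionary keys are hypothetical.

```python
def quorums(n: int) -> dict[str, int]:
    """Client thresholds for n replicas, assuming the maximal f with n > 3f."""
    if n < 1:
        raise ValueError("n must be positive")
    f = (n - 1) // 3   # largest f such that n > 3f
    nf = n - f         # number of non-faulty replicas
    return {
        "f": f,
        "nf": nf,
        "pbft_client_responses": f + 1,   # at least one is non-faulty
        "poe_client_responses": nf,       # size of a proof-of-execution
        "nonfaulty_in_proof": nf - f,     # at most f of the nf can be faulty
    }
```

For example, with n = 4 a PBFT client waits for 2 identical responses, whereas a PoE client waits for 3, of which at least 2 come from non-faulty replicas.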
Out-of-order execution. Typical BFT systems follow the order-execute model: first replicas agree on a unique order of the client requests, and only then do they execute the requests in order [9, 21, 29, 38, 51]. Unfortunately, this prevents these systems from providing any support for concurrent execution. A few BFT systems suggest executing prior to ordering, but even such systems need to re-verify their results prior to committing changes [4, 35]. Our PoE protocol lies between these two extremes: the replicas speculatively execute using only partial ordering guarantees. By doing so, PoE can eliminate communication costs and minimize latencies of typical BFT systems, without needing to re-verify results in the normal case.
Out-of-order processing. Although BFT consensus typically executes requests in-order, this does not imply that replicas need to process proposals to order requests sequentially. To maximize throughput, PBFT and other primary-backup protocols support out-of-order processing in which all available bandwidth of the primary is used to continuously propose requests (even when previous proposals are still being processed by the system). By doing so, out-of-order processing can eliminate the impact of high message delays. To provide out-of-order processing, all replicas will process any request proposed as the $k$-th request whenever $k$ is within some active window bounded by a low-watermark and high-watermark [9]. These watermarks are increased as the system progresses. The size of this active window is, in practice, only limited by the memory resources available to replicas. As out-of-order processing is an essential technique to deliver high throughputs in environments with high message delays, we have included out-of-order processing in the design of PoE.
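The active window can be sketched as a small data structure. The Python class below is a simplified illustration, not the mechanism of any particular system: the half-open interval, the rule that the low-watermark advances as soon as the lowest outstanding round completes, and the name `ActiveWindow` are all assumptions.

```python
class ActiveWindow:
    """Accept proposals for round k only while low <= k < low + size."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        self.low = 0                 # low-watermark
        self.size = size
        self._done: set[int] = set() # completed rounds at or above low

    @property
    def high(self) -> int:
        """High-watermark (exclusive)."""
        return self.low + self.size

    def accepts(self, k: int) -> bool:
        return self.low <= k < self.high

    def complete(self, k: int) -> None:
        """Mark round k as finished and slide the window if possible."""
        if not self.accepts(k):
            raise ValueError(f"round {k} is outside the active window")
        self._done.add(k)
        while self.low in self._done:
            self._done.remove(self.low)
            self.low += 1
```

With a window of size 3, rounds 0, 1, and 2 can be in flight at once; round 3 becomes acceptable only after round 0 completes.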
Twin-path consensus. Speculative execution employed by PoE is different from the twin-path model utilized by Zyzzyva [38] and SBFT [21]. These twin-path protocols have an optimistic fast path that works only if none of the replicas are faulty and require aid to determine whether these optimistic conditions hold.
In the fast path of Zyzzyva, primaries propose requests, and backups directly execute such proposals and inform the client (without further coordination). The client waits for responses from all $\mathbf{n}$ replicas before marking the request executed. When the client does not receive $\mathbf{n}$ responses, it times out and sends a message to all replicas, after which the replicas perform an expensive client-dependent slow-path recovery process (which is prone to errors when communication is unreliable [2]).
The fast path of SBFT can deal with up to $\mathbf{c}$ crash-failures using $3\mathbf{f} + 2\mathbf{c} + 1$ replicas and uses threshold signatures to make communication linear. The fast path of SBFT requires a reliable collector and executor to aggregate messages and to send only a single (instead of at least $\mathbf{f} + 1$) response to the client. Due to aggregating execution, the fast path of SBFT still performs four rounds of communication before the client gets a response, whereas PoE only uses two rounds of communication (or three when PoE uses threshold signatures). If the fast path times out (e.g., the collector or executor fails), then SBFT falls back to a threshold version of PBFT that takes an additional round before the client gets a response. Twin-path consensus is in sharp contrast with the design of PoE, which does not need outside aid (reliable clients, collectors, or executors), and can operate optimally even while dealing with replica failures.
Primary rotation. To minimize the influence of any single replica on BFT consensus, HotStuff opts to replace the primary after every consensus decision. To do so efficiently, HotStuff uses an extra communication phase (as compared to PBFT), which minimizes the cost of primary replacement. Furthermore, HotStuff uses threshold signatures to make its communication linear (resulting in eight communication phases before a client gets responses). The event-based version of HotStuff can overlap phases of consecutive rounds, thereby assuring that consensus of a client request starts in every one-to-all-to-one communication phase. Unfortunately, the primary replacements require that all consensus rounds are performed in a strictly sequential manner, eliminating any possibility of out-of-order processing.
3 PROOF-OF-EXECUTION
In our Proof-of-Execution consensus protocol (PoE), the primary replica is responsible for proposing transactions requested by clients to all backup replicas. Each backup replica speculatively executes these transactions with the belief that the primary is behaving correctly. Speculative execution expedites processing of transactions in all cases. Finally, when malicious behavior is detected, replicas can recover by rolling back transactions, which ensures correctness without depending on any twin-path model.
3.1 System model and notations
Before providing a full description of our PoE protocol, we present the system model we use and the relevant notations.
A system is a set $\mathfrak{R}$ of replicas that process client requests. We assign each replica $\mathrm{r} \in \mathfrak{R}$ a unique identifier $\mathrm{id}(\mathrm{r})$ with $0 \leq \mathrm{id}(\mathrm{r}) < |\mathfrak{R}|$. We write $\mathcal{F} \subseteq \mathfrak{R}$ to denote the set of Byzantine replicas that can behave in arbitrary, possibly coordinated and malicious, manners. We assume that non-faulty replicas (those in $\mathfrak{R} \setminus \mathcal{F}$) behave in accordance with the protocol and are deterministic: on identical inputs, all non-faulty replicas must produce identical outputs. We do not make any assumptions on clients: all clients can be malicious without affecting PoE. We write $\mathbf{n} = |\mathfrak{R}|$, $\mathbf{f} = |\mathcal{F}|$, and $\mathbf{nf} = |\mathfrak{R} \setminus \mathcal{F}|$ to denote the number of replicas, faulty replicas, and non-faulty replicas, respectively. We assume that $\mathbf{n} > 3\mathbf{f}$ ($\mathbf{nf} > 2\mathbf{f}$).
We assume authenticated communication: Byzantine replicas are able to impersonate each other, but replicas cannot impersonate non-faulty replicas. Authenticated communication is a minimal requirement to deal with Byzantine behavior. Depending on the type of message, we use message authentication codes (MACs) or threshold signatures (TSs) to achieve authenticated communication [36]. MACs are based on symmetric cryptography in which every pair of communicating nodes has a secret key. We expect non-faulty replicas to keep their secret keys hidden. TSs are based on asymmetric cryptography. Specifically, each replica holds a distinct private key, which it can use to create a signature share. Next, one can produce a valid threshold signature given at least $\mathbf{nf}$ such signature shares (from distinct replicas). We write $s\langle v \rangle_i$ to denote the signature share of the $i$-th replica for signing value $v$. Anyone that receives a set $T = \{ s\langle v \rangle_j \mid j \in T' \}$ of signature shares for $v$ from $|T'| = \mathbf{nf}$ distinct replicas can aggregate $T$ into a single signature $\langle v \rangle$. This digital signature can then be verified using a public key.
We also employ a collision-resistant cryptographic hash function $D(\cdot)$ that can map an arbitrary value $v$ to a constant-sized digest $D(v)$ [36]. We assume that it is practically impossible to find another value $v'$, $v \neq v'$, such that $D(v) = D(v')$. We use the notation $v \| w$ to denote the concatenation of two values $v$ and $w$.
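As a concrete illustration of a digest over concatenated values, the following Python sketch uses SHA-256 from the standard library as a stand-in for the hash function D. Both the choice of SHA-256 and the length-prefixed encoding of each value are assumptions; the text does not prescribe a particular hash function or encoding. The length prefix is there so that different splits of the same bytes do not collide trivially.

```python
import hashlib


def encode(value: bytes) -> bytes:
    """Length-prefix a value so that concatenations are unambiguous."""
    return len(value).to_bytes(8, "big") + value


def digest(*values: bytes) -> bytes:
    """Constant-sized digest of the concatenation of the given values."""
    h = hashlib.sha256()
    for value in values:
        h.update(encode(value))
    return h.digest()
```

Without the length prefix, the pairs (b"ab", b"c") and (b"a", b"bc") would hash to the same digest, which is usually undesirable when the fields carry different meanings (transaction, view, sequence number).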
Next, we define the consensus provided by PoE.
Definition 3.1. A single run of any consensus protocol should satisfy the following requirements:
Termination. Each non-faulty replica executes a transaction.
Non-divergence. All non-faulty replicas execute the same transaction.
Termination is typically referred to as liveness, whereas non-divergence is typically referred to as safety. In PoE, execution is speculative: replicas can execute and roll back transactions. To provide safety, PoE provides speculative non-divergence instead of non-divergence:
Speculative non-divergence. If $\mathbf{nf} - \mathbf{f} \geq \mathbf{f} + 1$ non-faulty replicas accept and execute the same transaction $T$, then all non-faulty replicas will eventually accept and execute $T$ (after rolling back any other executed transactions).
To provide safety, we do not need any other assumptions on communication or on clients. Due to well-known impossibility results for asynchronous consensus [19], we can only provide liveness in periods of reliable bounded-delay communication during which all messages sent by non-faulty replicas will arrive at their destination within some maximum delay.
3.2 The Normal-Case Algorithm of PoE
PoE operates in views $v = 0, 1, \dots$. In view $v$, the replica $\mathrm{r}$ with $\mathrm{id}(\mathrm{r}) = v \bmod \mathbf{n}$ is elected as the primary. The design of PoE relies on authenticated communication, which can be provided using MACs or TSs. In Figure 2, we sketch the normal-case working of PoE for both cases. For the sake of brevity, we will describe PoE built on top of TSs, which results in a protocol with low, linear message complexity in the normal case. The full pseudo-code for this algorithm can be found in Figure 3. In Section 3.6, we detail the minimal changes to PoE necessary when switching to MACs.

Figure 2: Normal-case algorithm of PoE, shown for (a) PoE using MACs and (b) PoE using TSs. Client $c$ sends its request containing transaction $T$ to the primary $\mathrm{p}$, which proposes this request to all replicas. Although replica $\mathrm{b}$ is Byzantine, it fails to affect PoE.
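The primary-selection rule stated above is a one-line computation. A minimal Python sketch follows; the function name `primary_of_view` is hypothetical, and replica identifiers are assumed to be the integers 0 to n - 1, as in the system model.

```python
def primary_of_view(view: int, n: int) -> int:
    """Identifier of the primary of a view: id = view mod n."""
    if view < 0 or n <= 0:
        raise ValueError("view must be non-negative and n positive")
    return view % n


if __name__ == "__main__":
    # With four replicas, the primary role rotates 0, 1, 2, 3, 0, ...
    print([primary_of_view(v, 4) for v in range(6)])
```

Because the mapping is deterministic, every replica can compute the primary of any view locally, without extra communication.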
Consider a view $v$ with primary $\mathrm{p}$. To request execution of transaction $T$, a client $c$ signs transaction $T$ and sends the signed transaction $\langle T \rangle_c$ to $\mathrm{p}$. The usage of signatures assures that malicious primaries cannot forge transactions. To initiate replication and execution of $T$ as the $k$-th transaction, the primary proposes $T$ to all replicas via a propose message.
After the $i$-th replica $\mathrm{r}$ receives a propose message $m$ from $\mathrm{p}$, it checks whether at least $\mathbf{nf}$ other replicas received the same proposal $m$ from primary $\mathrm{p}$. This check assures that at least $\mathbf{nf} - \mathbf{f}$ non-faulty replicas received the same proposal, which will play a central role in achieving speculative non-divergence. To perform this check, each replica supports the first proposal $m$ it receives from the primary by computing a signature share $s\langle m \rangle_i$ and sending a support message containing this share to the primary.
The primary $\mathrm{p}$ waits for support messages with valid signature shares from $\mathbf{nf}$ distinct replicas, which can then be aggregated into a single signature $\langle m \rangle$. After generating such a signature, the primary broadcasts this signature to all replicas via a certify message.
After a replica $\mathrm{r}$ receives a valid certify message, it view-commits to $T$ as the $k$-th transaction in view $v$. The replica logs this view-commit decision as $\mathrm{VCommit}_{\mathrm{r}}(\langle T \rangle_c, v, k)$. After $\mathrm{r}$ view-commits to $T$, $\mathrm{r}$ schedules $T$ for speculative execution as the $k$-th transaction of view $v$. Consequently, $T$ will be executed by $\mathrm{r}$ after all preceding transactions are executed. We write $\mathrm{Execute}_{\mathrm{r}}(\langle T \rangle_c, v, k)$ to log this execution.
After execution, $\mathrm{r}$ informs the client of the order of execution and of the execution result $r$ (if any) via an inform message. In turn, client $c$ will wait for a proof-of-execution for the transaction $T$ it requested, which consists of identical inform messages from $\mathbf{nf}$ distinct replicas. This proof-of-execution guarantees that at least $\mathbf{nf} - \mathbf{f} \geq \mathbf{f} + 1$ non-faulty replicas executed $T$ as the $k$-th transaction and in Section 3.3, we will see that such transactions are always preserved by PoE when recovering from failures.
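The client-side wait for a proof-of-execution can be sketched as a small vote counter. The Python class below is an illustration under explicit assumptions: incoming messages are taken to be already authenticated (so `replica_id` can be trusted), a proof is reached once n - f distinct replicas report the same (digest, view, k, result) tuple, and the names `ProofCollector` and `on_inform` are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class ProofCollector:
    """Collect inform messages until nf identical ones arrive."""

    def __init__(self, n: int, f: int) -> None:
        if n <= 3 * f:
            raise ValueError("the protocol assumes n > 3f")
        self.nf = n - f
        # Maps an (digest, view, k, result) tuple to the replicas that sent it.
        self._votes: dict[tuple, set[int]] = defaultdict(set)

    def on_inform(self, replica_id: int, digest: bytes, view: int,
                  k: int, result: bytes) -> Optional[tuple]:
        """Record one inform message; return the proven tuple, if any."""
        key = (digest, view, k, result)
        self._votes[key].add(replica_id)   # duplicates from one replica count once
        if len(self._votes[key]) >= self.nf:
            return key
        return None
```

With n = 4 and f = 1, three matching messages from distinct replicas are needed; a conflicting message from a fourth replica does not contribute to the proof.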
If client $c$ does not know the current primary or does not get any timely response for its requests, then it can broadcast its request $\langle T \rangle_c$ to all replicas. The non-faulty replicas will then forward this request to the current primary (if $T$ is not yet executed) and ensure that the primary initiates successful proposal of this request in a timely manner.
To prove correctness of PoE in all cases, we will need the following technical safety-related property of view-commits.
Client-role (used by client c to request transaction T):
1: Send ⟨T⟩_c to the primary p.
2: Await receipt of messages inform(⟨T⟩_c, v, k, r) from nf replicas.
3: Consider T executed, with result r, as the k-th transaction.

Primary-role (running at the primary p of view v, id(p) = v mod n):
4: Let view v start after execution of the k-th transaction.
5: event p awaits receipt of message ⟨T⟩_c from client c do
6:     Broadcast propose(⟨T⟩_c, v, k) to all replicas.
7:     k := k + 1.
8: end event
9: event p receives nf messages support(s⟨h⟩_i, v, k) such that:
       (1) each message was sent by a distinct replica, i ∈ {1, ..., n}; and
       (2) all s⟨h⟩_i in this set can be combined to generate signature ⟨h⟩
   do
10:    Broadcast certify(⟨h⟩, v, k) to all replicas.
11: end event

Backup-role (running at every i-th replica r):
12: event r receives message m := propose(⟨T⟩_c, v, k) such that:
       (1) v is the current view;
       (2) m is sent by the primary of v; and
       (3) r did not accept a k-th proposal in v
   do
13:    Compute h := D(⟨T⟩_c || v || k).
14:    Compute signature share s⟨h⟩_i.
15:    Transmit support(s⟨h⟩_i, v, k) to p.
16: end event
17: event r receives messages certify(⟨h⟩, v, k) from p such that:
       (1) r transmitted support(s⟨h⟩_i, v, k) to p; and
       (2) ⟨h⟩ is a valid threshold signature
   do
18:    View-commit T, the k-th transaction of v (VCommit_r(⟨T⟩_c, v, k)).
19: end event
20: event r logged VCommit_r(⟨T⟩_c, v, k) and
       r has logged Execute_r(t', v, k') for all 0 ≤ k' < k do
21:    Execute T as the k-th transaction of v (Execute_r(⟨T⟩_c, v, k)).
22:    Let r be the result of execution of T (if there is any result).
23:    Send inform(D(⟨T⟩_c), v, k, r) to c.
24: end event
Figure 3: The normal-case algorithm of PoE.
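The flow of Figure 3 can be exercised with a toy, single-process simulation. The Python sketch below is a loose illustration, not the ResilientDB implementation: threshold signatures are replaced by a mock (a share is just a pair of replica identifier and digest, with no cryptographic security), SHA-256 stands in for the digest function, sequence numbers are assumed to start at 0 within the view, and all names (`Backup`, `combine`, `run_round`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def digest(txn: str, view: int, k: int) -> str:
    """Stand-in for h := D(<T>_c || v || k); SHA-256 is an assumption."""
    return hashlib.sha256(f"{txn}|{view}|{k}".encode()).hexdigest()


def combine(shares, nf):
    """Mock aggregation (NOT cryptography): needs nf distinct, matching shares."""
    ids = {rid for rid, _ in shares}
    digests = {h for _, h in shares}
    if len(ids) >= nf and len(digests) == 1:
        return ("TS", next(iter(digests)))
    return None


@dataclass
class Backup:
    rid: int
    view: int = 0
    supported: dict = field(default_factory=dict)  # k -> (digest, txn)
    committed: dict = field(default_factory=dict)  # k -> txn
    executed: list = field(default_factory=list)   # txns in execution order

    def on_propose(self, txn, view, k):
        # Lines 12-15: support only the first k-th proposal of the current view.
        if view != self.view or k in self.supported:
            return None
        h = digest(txn, view, k)
        self.supported[k] = (h, txn)
        return (self.rid, h)  # mock signature share

    def on_certify(self, cert, view, k):
        # Lines 17-18: view-commit only what this replica itself supported.
        if view != self.view or k not in self.supported:
            return
        h, txn = self.supported[k]
        if cert == ("TS", h):
            self.committed[k] = txn
        # Lines 20-23: execute once every preceding round is executed.
        while len(self.executed) in self.committed:
            self.executed.append(self.committed[len(self.executed)])


def run_round(backups, txn, view, k, nf):
    """Primary role (lines 5-11): propose, gather nf shares, certify."""
    shares = [s for b in backups if (s := b.on_propose(txn, view, k)) is not None]
    cert = combine(shares, nf)
    if cert is not None:
        for b in backups:
            b.on_certify(cert, view, k)
    return cert
```

Running round 1 before round 0 shows out-of-order processing with in-order execution: the later transaction is view-committed first but is only executed after its predecessor.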
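Proposition 3.2. Let $\mathrm{r}_i$, $i \in \{1, 2\}$, be two non-faulty replicas that view-committed to $\langle T_i \rangle_{c_i}$ as the $k$-th transaction of view $v$ ($\mathrm{VCommit}_{\mathrm{r}_i}(\langle T_i \rangle_{c_i}, v, k)$). If $\mathbf{n} > 3\mathbf{f}$, then $\langle T_1 \rangle_{c_1} = \langle T_2 \rangle_{c_2}$.
Proof. Replica $\mathrm{r}_i$ only view-committed to $\langle T_i \rangle_{c_i}$ after $\mathrm{r}_i$ received a message certify$(\langle h \rangle, v, k)$ from the primary $\mathrm{p}$ (Line 17 of Figure 3). This message includes a threshold signature $\langle h \rangle$, whose construction requires signature shares from a set $S_i$ of $\mathbf{nf}$ distinct replicas. Let $X_i = S_i \setminus \mathcal{F}$ be the non-faulty replicas in $S_i$. As $|S_i| = \mathbf{nf}$ and $|\mathcal{F}| = \mathbf{f}$, we have $|X_i| \geq \mathbf{nf} - \mathbf{f}$. The non-faulty replicas in $X_i$ will only send a single support message for the $k$-th transaction in view $v$ (Line 12 of Figure 3). Hence, if $\langle T_1 \rangle_{c_1} \neq \langle T_2 \rangle_{c_2}$, then $X_1$ and $X_2$ must not overlap and $\mathbf{nf} \geq |X_1 \cup X_2| \geq 2(\mathbf{nf} - \mathbf{f})$ must hold. As $\mathbf{n} = \mathbf{nf} + \mathbf{f}$, this simplifies to $3\mathbf{f} \geq \mathbf{n}$, which contradicts $\mathbf{n} > 3\mathbf{f}$. Hence, we conclude $\langle T_1 \rangle_{c_1} = \langle T_2 \rangle_{c_2}$.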
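The counting step of the proof can be checked mechanically. The Python sketch below tests whether two disjoint sets of nf - f non-faulty supporters could fit among the nf non-faulty replicas, which is the only way two conflicting proposals could both be certified; the function name `conflicting_commits_possible` is hypothetical.

```python
def conflicting_commits_possible(n: int, f: int) -> bool:
    """True if two disjoint sets of nf - f non-faulty replicas fit in nf replicas.

    This mirrors the inequality nf >= 2(nf - f) from the proof; it is a
    counting check only, not a model of the protocol.
    """
    nf = n - f
    return 2 * (nf - f) <= nf


if __name__ == "__main__":
    # Whenever n > 3f, the two sets must overlap, so no conflict is possible.
    assert not any(
        conflicting_commits_possible(n, f)
        for f in range(0, 30)
        for n in range(3 * f + 1, 3 * f + 50)
    )
    # At the boundary n = 3f (f >= 1) the counting argument no longer rules it out.
    assert all(conflicting_commits_possible(3 * f, f) for f in range(1, 30))
```

The boundary case shows why the assumption n > 3f cannot be weakened for this argument.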
We will later use Proposition 3.2 to show that PoE provides speculative non-divergence. Next, we look at typical cases in which the normal-case of PoE is interrupted:
Example 3.3. A malicious primary can try to affect PoE by not conforming to the normal-case algorithm in the following ways:
(1) By sending proposals for different transactions to different non-faulty replicas. In this case, Proposition 3.2 guarantees that at most a single such proposed transaction will get view-committed by any non-faulty replica.
(2) By keeping some non-faulty replicas in the dark by not sending proposals to them. In this case, the remaining non-faulty replicas can still end up view-committing the transactions as long as at least $\mathbf{nf} - \mathbf{f}$ non-faulty replicas receive proposals: the faulty replicas in $\mathcal{F}$ can take over the role of up to $\mathbf{f}$ non-faulty replicas left in the dark (giving the false illusion that the non-faulty replicas in the dark are malicious).
(3) By preventing execution by not proposing a $k$-th transaction, even though transactions following the $k$-th transaction are being proposed.
When the network is unreliable and messages do not get delivered (or not on time), then the behavior of a non-faulty primary can match that of the malicious primary in the above example. Indeed, failure of the normal-case of PoE has only two possible causes: primary failure and unreliable communication. If communication is unreliable, then there is no way to guarantee continuous service [19]. Hence, replicas simply assume failure of the current primary if the normal-case behavior of PoE is interrupted, while the design of PoE guarantees that unreliable communication does not affect the correctness of PoE.
To deal with primary failure, each replica maintains a timer for each request. If this timer expires (timeout) and the replica has not been able to execute the request, it assumes that the primary is malicious. To deal with such a failure, replicas will replace the primary. Next, we present the view-change algorithm that performs primary replacement.
3.3 The View-Change Algorithm
If PoE observes failure of the primary $\mathrm{p}$ of view $v$, then PoE will elect a new primary and move to the next view, view $v + 1$, via the view-change algorithm. The goals of the view-change are
(1) to assure that each request that is considered executed by any client is preserved under all circumstances; and
(2) to assure that the replicas are able to agree on a new view whenever communication is reliable.
As described in the previous section, a client will consider its request executed if it receives a proof-of-execution consisting of identical inform responses from at least $\mathbf{nf}$ distinct replicas. Of these $\mathbf{nf}$ responses, at most $\mathbf{f}$ can come from faulty replicas. Hence, a client can only consider its request executed whenever the requested transaction was executed (and view-committed) by at least $\mathbf{nf} - \mathbf{f} \geq \mathbf{f} + 1$ non-faulty replicas in the system. We note the similarity with the view-change algorithm of PBFT, which will preserve any request that is prepared by at least $\mathbf{nf} - \mathbf{f} \geq \mathbf{f} + 1$ non-faulty replicas.
The view-change algorithm of PoE consists of three steps. First, failure of the current primary $\mathrm{p}$ needs to be detected by all non-faulty replicas. Second, all replicas exchange information to establish which transactions were included in view $v$ and which were not. Third, the new primary proposes a new view. This new view proposal contains a list of the transactions executed in the previous views (based on the information exchanged earlier). Finally, if the new view proposal is valid, then replicas switch to this view; otherwise, replicas detect failure of the new primary and initiate a view-change for the next view ($v + 2$). The communication of the view-change algorithm of PoE is sketched in Figure 4 and the full pseudo-code of the algorithm can be found in Figure 5. Next, we discuss each step in detail.
3.3.1 Failure Detection and View-Change Requests. If a replica detects failure of the primary of view 𝑣, then it halts the normal-case algorithm of PoE for view 𝑣 and informs all other replicas of this failure by requesting a view-change. The replica does so by broadcasting a message vc-request(𝑣, 𝐸), in which 𝐸 is a summary of all transactions executed by the replica (Figure 5, Line 1). Each replica can detect the failure of the primary in two ways:
Figure 4: The current primary b of view 𝑣 is faulty and needs to be replaced. The next primary, p, and the replica r1 detected this failure first and request view-change via vc-request messages. The replica r2 joins these requests.
vc-request (used by replica r to request view-change):
1: event r detects failure of the primary do
2:     r halts the normal-case algorithm of Figure 3 for view 𝑣.
3:     𝐸 := {(certify(⟨ℎ⟩, 𝑤, 𝑘), ⟨𝑇⟩𝑐) | 𝑤 ≤ 𝑣 and Execute(⟨𝑇⟩𝑐, 𝑤, 𝑘) and ℎ = D(⟨𝑇⟩𝑐 || 𝑤 || 𝑘)}.
4:     Broadcast vc-request(𝑣, 𝐸) to all replicas.
5: end event
6: event r receives f + 1 messages vc-request(𝑣𝑖, 𝐸𝑖) such that
       (1) each message was sent by a distinct replica; and
       (2) 𝑣𝑖, 1 ≤ 𝑖 ≤ f + 1, is the current view
   do
7:     r detects failure of the primary (join).
8: end event

On receiving nv-propose (used by replica r):
9: event r receives 𝑚 = nv-propose(𝑣 + 1, 𝑚1, 𝑚2, ..., 𝑚nf) do
10:     if 𝑚 is a valid new-view proposal (similar to creating nv-propose) then
11:         Derive the transactions 𝑁 for the new view from 𝑚1, 𝑚2, ..., 𝑚nf.
12:         Rollback any executed transactions not included in 𝑁.
13:         Execute the transactions in 𝑁 not yet executed.
14:         Move into view 𝑣 + 1 (see Section 3.3.3 for details).
15:     end if
16: end event

nv-propose (used by replica p′ that will act as the new primary):
17: event p′ receives nf messages 𝑚𝑖 = vc-request(𝑣𝑖, 𝐸𝑖) such that
       (1) these messages are sent by a set 𝑆, |𝑆| = nf, of distinct replicas;
       (2) for each 𝑚𝑖, 1 ≤ 𝑖 ≤ nf, sent by replica r𝑖 ∈ 𝑆, 𝐸𝑖 consists of a consecutive sequence of entries (certify(⟨ℎ⟩, 𝑣, 𝑘), ⟨𝑇⟩𝑐);
       (3) 𝑣𝑖, 1 ≤ 𝑖 ≤ nf, is the current view 𝑣; and
       (4) p′ is the next primary (id(p′) = (𝑣 + 1) mod n)
   do
18:     Broadcast nv-propose(𝑣 + 1, 𝑚1, 𝑚2, ..., 𝑚nf) to all replicas.
19: end event
Figure 5: The view-change algorithm of PoE.
(1) The replica times out while expecting normal-case operations toward executing a client request. E.g., when the replica forwards a client request to the current primary, and the current primary fails to propose this request on time.
(2) The replica receives vc-request messages, indicating that the primary of view 𝑣 failed, from f + 1 distinct replicas. As at most f of these messages can come from faulty replicas, at least one non-faulty replica must have detected a failure. In this case, the replica joins the view-change (Figure 5, Line 6).
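The two detection rules can be sketched as a small state machine. This is a minimal Python illustration under our own assumptions: the class and method names are hypothetical, message authentication is omitted, and a return value of True stands for "halt the normal case and broadcast a vc-request now".

```python
from dataclasses import dataclass, field


@dataclass
class FailureDetector:
    """Tracks when a replica should request a view-change for its current view."""
    f: int                      # maximum number of faulty replicas
    view: int                   # current view of this replica
    requested: bool = False     # True once this replica asked for a view-change
    senders: set = field(default_factory=set)  # distinct replicas that sent vc-request

    def on_timeout(self) -> bool:
        """Rule (1): the local timer for a client request expired."""
        return self._request()

    def on_vc_request(self, sender: int, view: int) -> bool:
        """Rule (2): join once f + 1 distinct replicas report failure of this view."""
        if view != self.view:
            return False  # messages for other views are ignored in this sketch
        self.senders.add(sender)
        if len(self.senders) >= self.f + 1:
            return self._request()
        return False

    def _request(self) -> bool:
        if self.requested:
            return False  # request a view-change at most once per view
        self.requested = True
        return True
```

With f = 1, a single vc-request is not enough to join, as it might come from a faulty replica; a second one from a different replica triggers the join.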
3.3.2 Proposing the New View. To start view 𝑣 + 1, the new primary p′ (with id(p′) = (𝑣 + 1) mod n) needs to propose a new view by determining a valid list of requests that need to be preserved. To do so, p′ waits until it receives sufficient information. Specifically, p′ waits until it has received valid vc-request messages from a set 𝑆 of |𝑆| = nf distinct replicas.
An 𝑖-th view-change request 𝑚𝑖 is considered valid if it includes a consecutive sequence of pairs (𝑐, ⟨𝑇⟩𝑐), where 𝑐 is a valid certify message for request ⟨𝑇⟩𝑐. Such a set 𝑆 is guaranteed to exist when communication is reliable, as all non-faulty replicas will participate in the view-change algorithm. The new primary p′ collects the set 𝑆 of |𝑆| = nf valid vc-request messages and proposes them in a new-view message nv-propose to all replicas.
3.3.3 Move to the New View. After a replica r receives an nv-propose message containing a new-view proposal from the new primary p′, r validates the content of this message. From the set of vc-request messages in the new-view proposal, r chooses, for each 𝑘, the pair (certify(⟨ℎ⟩, 𝑤, 𝑘), ⟨𝑇⟩𝑐) proposed in the most-recent view 𝑤. Furthermore, r determines the total number of such requests, 𝑘max. Then, r view-commits and executes all 𝑘max chosen requests that happened before view 𝑣 + 1. Notice that replica r can skip execution of any transaction it already executed. If r executed transactions not included in the new-view proposal, then r needs to rollback these transactions before it can proceed executing requests in view 𝑣 + 1. After these steps, r can switch to the new view 𝑣 + 1. In the new view, the new primary p′ starts by proposing the (𝑘max + 1)-th transaction.
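The selection rule of this step can be sketched as follows. The Python code below is an illustration only: the function names and the tuple layout (view, sequence number, transaction) are our assumptions, and validation of the certify messages and of the nv-propose message itself is omitted.

```python
def derive_new_view(vc_requests):
    """For each sequence number k, keep the entry proposed in the most-recent view.

    vc_requests: list of summaries E_i, each a list of (view w, sequence k, txn).
    Returns a dict mapping k to the chosen transaction.
    """
    chosen = {}
    for entries in vc_requests:
        for w, k, txn in entries:
            if k not in chosen or w > chosen[k][0]:
                chosen[k] = (w, txn)
    return {k: txn for k, (w, txn) in chosen.items()}


def plan_move(executed, new_view):
    """Compare local execution with the new-view proposal.

    executed: dict k -> txn already executed by this replica.
    Returns (rollback, execute): sequence numbers to undo (newest first)
    and sequence numbers to execute (oldest first).
    """
    rollback = sorted(
        (k for k, txn in executed.items() if new_view.get(k) != txn),
        reverse=True,
    )
    execute = sorted(k for k, txn in new_view.items() if executed.get(k) != txn)
    return rollback, execute
```

In this sketch, 𝑘max corresponds to `len(derive_new_view(...))` when the chosen sequence numbers are consecutive, and transactions that already match the proposal are skipped rather than re-executed.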
When moving into the new view, we see the cost of speculative execution: some replicas can be forced to rollback execution of transactions.
Example 3.4. Consider a system with non-faulty replica r. When deciding the 𝑘-th request, communication became unreliable, due to which only r received a certify message for request ⟨𝑇⟩𝑐. Consequently, r speculatively executes 𝑇 and informs the client 𝑐. During the view-change, all other replicas (none of which have a certify message for ⟨𝑇⟩𝑐) provide their local state to the new primary, which proposes a new view that does not include any 𝑘-th request. Hence, the new primary will start its view by proposing a different client request ⟨𝑇′⟩𝑐′ as the 𝑘-th request, which gets accepted. Consequently, r needs to rollback execution of 𝑇. Luckily, this is not an issue: the client 𝑐 only got at most f + 1 < nf responses for its request, does not yet have a proof-of-execution, and, consequently, does not consider 𝑇 executed.
In practice, rollbacks can be supported by, e.g., undoing the operations of a transaction in reverse order, or by reverting to an old state. For the correct working of PoE, the exact working of rollbacks is not important as long as the execution layer provides support for rollbacks.
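As an illustration of the first option (undoing operations in reverse order), the following Python sketch keeps an undo log per transaction. It is a hypothetical key-value execution layer, not the execution layer of ResilientDB; all names are assumptions.

```python
class UndoLogStore:
    """Key-value store that can roll back the most recently executed transaction."""

    def __init__(self):
        self.state = {}
        self.undo_log = []  # one list of (key, old_value, existed) per transaction

    def execute(self, writes):
        """Apply a transaction given as an ordered list of (key, value) writes."""
        undo = []
        for key, value in writes:
            # Remember the previous value (and whether the key existed at all).
            undo.append((key, self.state.get(key), key in self.state))
            self.state[key] = value
        self.undo_log.append(undo)

    def rollback_last(self):
        """Undo the writes of the last executed transaction in reverse order."""
        undo = self.undo_log.pop()
        for key, old_value, existed in reversed(undo):
            if existed:
                self.state[key] = old_value
            else:
                del self.state[key]
```

Rolling back several speculative transactions amounts to calling `rollback_last` repeatedly, newest transaction first.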
3.4 Correctness of PoE
First, we show that the normal-case algorithm of PE provides
non-divergent speculative consensus when the primary is non-
faulty and communication is reliable.
T 3.5. Consider a system in view
𝑣
, in which the rst
𝑘
1transactions have been executed by all non-faulty replicas, in
which the primary is non-faulty, and communication is reliable. If
the primary received
𝑇𝑐
, then the primary can use the algorithm
in Figure 3 to ensure that
(1) there is non-divergent execution of 𝑇;
(2) 𝑐considers 𝑇executed as the 𝑘-th transaction; and
(3) 𝑐learns the result of executing 𝑇(if any),
this independent of any malicious behavior by faulty replicas.
P.
Each non-faulty primary would follow the algorithm
of PE described in Figure 3 and send
( 𝑇𝑐, 𝑣, 𝑘 )
to
all replicas (Line 6). In response, all
nf
non-faulty replicas will
compute a signature share and send a  message to the
primary (Line 15). Consequently, the primary will receive signa-
ture shares from
nf
replicas and will combine them to generate a
threshold signature
. The primary will include this signature
in a  message and broadcast it to all replicas. Each
replica will successfully verify
and will view-commit to
𝑇
(Line 17). As the rst
𝑘
1transactions have already been exe-
cuted, every non-faulty replica will execute
𝑇
. As all non-faulty
replicas behave deterministically, execution will yield the same
result
𝑟
(if any) across all non-faulty replicas. Hence, when the
non-faulty replicas inform
𝑐
, they do so by all sending identical
messages
(D( 𝑇𝑐), 𝑣, 𝑘 , 𝑟 )
to
𝑐
(Line 20–Line 23). As all
nf
non-faulty replicas executed
𝑇
, we have non-divergent execution.
Finally, as there are at most
f
faulty replicas, the faulty replicas
can only forge up to
f
invalid  messages. Consequently,
the client
𝑐
will only receive the message
(D( 𝑇𝑐), 𝑣, 𝑘 , 𝑟 )
from at least
nf
distinct replicas, and will conclude that
𝑇
is exe-
cuted yielding result 𝑟(Line 3).
At the core of the correctness of PE, under all conditions,
is that no replica will rollback requests
𝑇𝑐
for which client
𝑐
already received a proof-of-execution. We prove this next:
P 3.6. Let
𝑇𝑐
be a request for which client
𝑐
al-
ready received a proof-of-execution showing that
𝑇
was executed
as the
𝑘
-th transaction of view
𝑣
. If
n>
3
f
, then every non-faulty
replica that switches to a view
𝑣>𝑣
will preserve
𝑇
as the
𝑘
-th
transaction of view 𝑣.
P.
Client
𝑐
considers
𝑇𝑐
executed as the
𝑘
-th transac-
tion of view
𝑣
when it received identical -messages for
𝑇
from a set
𝐴
of
|𝐴|=nf
distinct replicas (Figure 3, Line 3). Let
𝐵=𝐴\ F be the set of non-faulty replicas in 𝐴.
Now consider a non-faulty replica
that switches to view
𝑣>𝑣
. Before doing so,
must have received a valid proposal
𝑚=
(𝑣, 𝑚1, ..., 𝑚nf )
from the primary of view
𝑣
. Let
𝐶
be
the set of
nf
distinct replicas that provided messages
𝑚1, . . . , 𝑚nf
and let
𝐷=𝐶\ F
be the set of non-faulty replicas in
𝐶
. We
have
|𝐵| ≥ nf f
and
|𝐷| ≥ nf f
. Hence, using a contradiction
argument similar to the one in the proof of Proposition 3.2, we
conclude that there must exists a non-faulty replica
∈ (𝐵𝐷)
that executed 𝑇𝑐, informed 𝑐, and requested a view-change.
To complete the proof, we need to show that ⟨𝑇⟩𝑐 was proposed and executed in the last view that proposed and view-committed a 𝑘-th transaction and, hence, that this non-faulty replica in 𝐵 ∩ 𝐷 will include ⟨𝑇⟩𝑐 in its vc-request message for view 𝑣′. We do so by induction on the difference 𝑣′ − 𝑣. As the base case, we have 𝑣′ − 𝑣 = 1, in which case no view after 𝑣 exists yet and, hence, ⟨𝑇⟩𝑐 must be the newest 𝑘-th transaction available to this replica. As the induction hypothesis, we assume that all non-faulty replicas will preserve 𝑇 when entering a new view 𝑤, 𝑣 < 𝑤 ≤ 𝑤′. Hence, non-faulty replicas participating in view 𝑤 will not support any 𝑘-th transactions proposed in view 𝑤. Consequently, no certify messages can be constructed for any 𝑘-th transaction in view 𝑤. Hence, the new-view proposal for 𝑤′ + 1 will include ⟨𝑇⟩𝑐, completing the proof.
As a direct consequence of the above, we have
Corollary 3.7 (Safety of PoE). PoE provides speculative non-divergence if n > 3f.
We notice that the view-change algorithm does not deal with minor malicious behavior (e.g., a single replica left in the dark). Furthermore, the presented view-change algorithm will recover all transactions since the start of the system, which will result in unreasonably large messages when many transactions have already been proposed. In practice, both these issues can be resolved by regularly making checkpoints (e.g., after every 100 requests) and only including requests since the last checkpoint in each vc-request message. To do so, PoE uses a standard fully-decentralized Pbft-style checkpoint algorithm that enables the independent checkpointing and recovery of any request that is executed by at least f + 1 non-faulty replicas whenever communication is reliable [9]. Finally, utilizing the view-change algorithm and checkpoints, we prove
Theorem 3.8 (Liveness of PoE). PoE provides termination in periods of reliable bounded-delay communication if n > 3f.
P.
When the primary is non-faulty, Theorem 3.5 guar-
antees termination as replicas continuously accept and execute
requests. If the primary is Byzantine and fails to guarantee ter-
mination for at most
f
non-faulty replicas, then the checkpoint
algorithm will assure termination of these non-faulty replicas.
Finally, if the primary is Byzantine and fails to guarantee termi-
nation for at least
f+
1non-faulty replicas, then it will be replaced
using the view-change algorithm. For the view-change process,
each replica will start with a timeout
𝛿
after it receives
nf
match-
ing  and double this timeout after each view-change
(exponential backo). When communication becomes reliable,
this mechanism guarantees that all replicas will eventually view-
change to the same view at the same time. After this point, a
non-faulty replica will become primary in at most
f
view-changes,
after which Theorem 3.5 guarantees termination.
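The exponential backoff used in this proof can be illustrated with a short Python sketch. The function names and the notion of a fixed message-delay bound are assumptions made for the example; the proof itself only requires that the timeout eventually exceeds the (unknown) delay.

```python
def view_change_timeout(delta: float, view_changes: int) -> float:
    """Timeout after the given number of consecutive view-changes (doubling each time)."""
    return delta * (2 ** view_changes)


def view_changes_until(delta: float, delay_bound: float) -> int:
    """Number of doublings until the timeout reaches a hypothetical delay bound.

    Assumes delta > 0; otherwise the timeout would never grow.
    """
    count = 0
    while view_change_timeout(delta, count) < delay_bound:
        count += 1
    return count
```

For instance, starting from 𝛿 = 0.5 s, three doublings suffice to cover a delay bound of 3 s, as the timeouts grow as 0.5, 1, 2, 4 seconds.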
3.5 Fine-Tuning and Optimizations
To keep presentation simple, we did not include the following
optimizations in the protocol description:
(1) To reach nf signature shares, the primary can generate one itself. Hence, it only needs nf − 1 shares of other replicas.
(2) The propose, support, inform, and nv-propose messages are not forwarded and only need MACs to provide message authentication. The certify messages need not be signed, as tampering with them would invalidate the threshold signature. The vc-request messages need to be signed, as they need to be forwarded without tampering.
Finally, the design of PoE is fully compatible with out-of-order processing, as a replica only supports proposals for a 𝑘-th transaction if it had not previously supported another 𝑘-th proposal (Figure 3, Line 12) and only executes a 𝑘-th transaction if it has already executed all the preceding transactions (Figure 3, Line 20). As the size of the active out-of-order processing window determines how many client requests are being processed at the same time (without receiving a proof-of-execution), the size of the active window determines the number of transactions that can be rolled back during view-changes.
3.6 Designing PoE using MACs
The design of PE can be adapted to only use message authen-
tication codes (
MAC
s) to authenticate communication. This will
sharply reduce the computational complexity of PE and elim-
inate one round of communication, this at the cost of higher
quadratic overall communication costs (see Figure 2).
The usage of only
MAC
s makes it impossible to obtain threshold
signatures or reliably forward messages (as forwarding replicas
can tamper with the content of unsigned messages). Hence, us-
ing
MAC
s requires changes to how client requests are included
in proposals (as client requests are forwarded), to the normal-
case algorithm of PE (which uses threshold signatures), and to
the view-change algorithm of PE (which forwards 
messages). The changes to the proposal of client requests and to
the view-change algorithm can be derived from the strategies
used by P to support
MAC
s [
9
]. Hence, next we only review
the changes to the normal-case algorithm of PE.
Figure 6: Multi-threaded pipelines at different replicas. [Diagram components: Client Requests; Support & Certify; Input (network messages from clients and replicas); Batch Creation; Worker; Checkpoint; Execute; Output (network messages to clients and replicas).]
Consider a replica r that receives a propose message from the primary. Next, r needs to determine whether at least nf other replicas received the same proposal, which is required to achieve speculative non-divergence (see Proposition 3.2). When using MACs, r can do so by replacing the all-to-one support and one-to-all certify phases by a single all-to-all support phase. In the support phase, each replica agrees to support the first proposal propose(⟨𝑇⟩𝑐, 𝑣, 𝑘) it receives from the primary by broadcasting a message support(D(⟨𝑇⟩𝑐), 𝑣, 𝑘) to all replicas. After this broadcast, each replica waits until it receives support messages, identical to the message it sent, from nf distinct replicas. If r receives these messages, it view-commits to 𝑇 as the 𝑘-th transaction in view 𝑣 and schedules 𝑇 for execution. We have sketched this algorithm in Figure 2.
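The all-to-all support phase can be sketched as follows. This Python code is a simplified illustration with hypothetical names: MAC verification and networking are omitted, digests are opaque strings, and we assume that a replica's own broadcast is also delivered to itself and counted toward the nf identical messages.

```python
from collections import defaultdict


class MacSupportPhase:
    """Per-replica bookkeeping for the MAC-based support phase."""

    def __init__(self, n: int, f: int):
        self.nf = n - f
        self.supported = {}            # (view, k) -> digest this replica supports
        self.votes = defaultdict(set)  # (view, k, digest) -> distinct senders
        self.committed = {}            # (view, k) -> view-committed digest

    def on_propose(self, view, k, digest):
        """Support only the first proposal for slot (view, k).

        Returns the support message to broadcast, or None if another
        proposal for this slot was already supported.
        """
        if (view, k) in self.supported:
            return None
        self.supported[(view, k)] = digest
        return ("support", digest, view, k)

    def on_support(self, sender, digest, view, k) -> bool:
        """Record a support message; return True when the slot becomes view-committed."""
        self.votes[(view, k, digest)].add(sender)
        matches_own = self.supported.get((view, k)) == digest
        enough = len(self.votes[(view, k, digest)]) >= self.nf
        if matches_own and enough and (view, k) not in self.committed:
            self.committed[(view, k)] = digest
            return True
        return False
```

Counting distinct senders per (view, 𝑘, digest) mirrors the requirement that the nf support messages are identical to the one the replica sent itself.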
4 RESILIENTDB FABRIC
To test our design principles in practical settings, we implement our PoE protocol in our ResilientDB fabric [27–31]. ResilientDB provides its users access to a state-of-the-art replicated transactional engine and fulfills the need of a high-throughput permissioned blockchain fabric. ResilientDB helps us to realize the following goals: (i) implement and test different consensus protocols; (ii) balance the tasks done by a replica through a parallel pipelined architecture; (iii) minimize the cost of communication through batching client transactions; and (iv) enable use of a secure and efficient ledger. Next, we present a brief overview of our ResilientDB fabric.
ResilientDB lays down a client-server architecture where clients send their transactions to servers for processing. We use Figure 6 to illustrate the multi-threaded pipelined architecture associated with each replica. At each replica, we spawn multiple input and output threads for communicating with the network.
Batching. During our formal description of PoE, we assumed that the propose message from the primary includes a single client request. An effective way to reduce the overall cost of consensus is by aggregating several client requests in a single batch and using one consensus step to reach agreement on all these requests [9, 21, 38]. To maximize performance, ResilientDB facilitates batching requests at both replicas and clients.
At the primary replica, we spawn multiple batch-threads that aggregate client requests into a batch. The input-threads at the primary receive client requests, assign them a sequence number, and enqueue these requests in the batch-queue. In ResilientDB, all batch-threads share a common lock-free queue. When a client request is available, a batch-thread dequeues the request and adds it to the current batch, continuing until the batch has reached a pre-defined size. Each batch-thread also hashes the requests in a batch to create a unique digest.
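The batching step can be sketched in a few lines of Python. This is a single-threaded illustration with hypothetical names, not the ResilientDB implementation (which uses multiple batch-threads and a lock-free queue); it only shows how sequenced requests are grouped and hashed with SHA256 into a batch digest.

```python
import hashlib
from collections import deque


def make_batches(requests, batch_size):
    """Group (sequence number, payload bytes) requests into fixed-size batches.

    Returns (batches, leftover): each batch is (list of requests, hex digest);
    leftover holds requests that do not yet fill a complete batch.
    """
    queue = deque(requests)
    batches = []
    while len(queue) >= batch_size:
        batch = [queue.popleft() for _ in range(batch_size)]
        hasher = hashlib.sha256()
        for seq, payload in batch:
            # Length-prefix each payload so distinct batches cannot collide by concatenation.
            hasher.update(seq.to_bytes(8, "big"))
            hasher.update(len(payload).to_bytes(4, "big"))
            hasher.update(payload)
        batches.append((batch, hasher.hexdigest()))
    return batches, list(queue)
```

A batch of 100 requests, as used in the evaluation, would correspond to `batch_size=100`.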
All other messages received at a replica are enqueued by the input-thread in the work-queue to be processed by the single worker-thread. Once a replica receives a certify message from the primary, it forwards the request to the execute-thread for execution. Once the execution is complete, the execute-thread creates an inform message, which is transmitted to the client.
Ledger Management. We now explain how we efficiently maintain a blockchain ledger across different replicas. A blockchain is an immutable ledger, where blocks are chained as a linked-list. An 𝑖-th block can be represented as 𝐵𝑖 := {𝑘, 𝑑, 𝑣, 𝐻(𝐵𝑖−1)}, in which 𝑘 is the sequence number of the client request, 𝑑 the digest of the request, 𝑣 the view number, and 𝐻(𝐵𝑖−1) the hash of the previous block. In ResilientDB, prior to any consensus, we require the first primary replica to create a genesis block [31]. This genesis block acts as the first block in the blockchain and contains some basic data. We use the hash of the identity of the initial primary, as this information is available to each participating replica (eliminating the need for any extra communication to exchange this block).
Figure 7: Upper bound on performance when primary only replies to clients (No exec.) and when primary executes a request and replies to clients (Exec.). [Two bar charts: throughput (txn/s, axis up to 6·10^5) and latency (s, axis up to 0.6).]
After the genesis block, each replica can independently create the next block in the blockchain. As stated above, each block corresponds to some batch of transactions. A block is only created by the execute-thread once it completes executing a batch of transactions. To create a block, the execute-thread hashes the previous block in the blockchain and creates a new block. To prove the validity of individual blocks, ResilientDB stores the proof-of-accepting the 𝑘-th request in the 𝑘-th block. In PoE, such a proof includes the threshold signature sent by the primary as part of the certify message.
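The block structure described above can be sketched as a hash-linked list. The Python code below is illustrative only: the field encoding, the function names, and the identifier string used for the genesis block are assumptions, and the proof-of-accepting (the threshold signature) stored in each block is omitted.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    k: int          # sequence number of the client request (batch)
    digest: str     # digest d of the request
    view: int       # view number v
    prev_hash: str  # hash H(B_{i-1}) of the previous block

    def hash(self) -> str:
        data = f"{self.k}|{self.digest}|{self.view}|{self.prev_hash}".encode()
        return hashlib.sha256(data).hexdigest()


def genesis(primary_id: str) -> Block:
    """Genesis block derived from the identity of the initial primary."""
    return Block(0, hashlib.sha256(primary_id.encode()).hexdigest(), 0, "")


def append_block(chain, k, digest, view):
    """Create the next block by hashing the current last block."""
    chain.append(Block(k, digest, view, chain[-1].hash()))


def verify(chain) -> bool:
    """Check that every block references the hash of its predecessor."""
    return all(chain[i].prev_hash == chain[i - 1].hash() for i in range(1, len(chain)))
```

Because every replica can compute the same genesis block locally, no communication is needed to agree on the start of the chain; tampering with any block breaks the link stored in its successor.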
5 EVALUATION
We now analyze our design principles in practice. To do so, we evaluate our PoE protocol against four state-of-the-art BFT protocols. There are many BFT protocols we could compare with. Hence, we pick a representative sample: (1) Zyzzyva, as it has the absolute minimal cost in the fault-free case; (2) Pbft, as it is a common baseline (the used design is based on BFTSmart [7]); (3) SBFT, as it is a safer variation of Zyzzyva; and (4) HotStuff, as it is a linear-communication protocol that adopts the notion of rotating leaders. Through our experiments, we want to answer the following questions:
(Q1) How does PoE fare in comparison with the other protocols under failures?
(Q2) Does PoE benefit from batching client requests?
(Q3) How does PoE perform under zero payload?
(Q4) How scalable is PoE on increasing the number of replicas participating in the consensus, in the normal-case?
Setup. We run our experiments on the Google Cloud, and deploy each replica on a c2 machine having a 16-core Intel Xeon Cascade Lake CPU running at 3.8 GHz with 32 GB memory. We deploy up to 320 k clients on 16 machines. To collect results after reaching a steady-state, we run each experiment for 180 s: the first 60 s are warmup, and measurement results are collected over the next 120 s. We average our results over three runs.
Configuration and Benchmarking. For evaluating the protocols, we employed YCSB [13] from Blockbench's macro benchmarks [16]. Each client request queries a YCSB table that holds half a million active records. We require 90% of the requests to be write queries as the majority of typical blockchain transactions are updates to existing records. Prior to the experiments, each replica is initialized with an identical copy of the YCSB table. The client requests generated by YCSB follow a Zipfian distribution and are heavily skewed (skew factor 0.9).
Figure 8: System performance using three different signature schemes. In all cases, n = 16 replicas participate in consensus. [Two bar charts over None, ED, and MAC: throughput (txn/s) and latency (s).]
Unless explicitly stated, we use the following configuration for all experiments. We perform scaling experiments by varying replicas from 4 to 91. We divide our experiments in two dimensions: (1) Zero Payload or Standard Payload, and (2) Failures or Non-Failures. We employ batching with a batch size of 100 as the percentage increase in throughput on larger batch sizes is small. Under Zero Payload conditions, all replicas execute 100 dummy instructions per batch, while the primary sends an empty proposal (and not a batch of 100 requests). Under Standard Payload, with a batch size of 100, the size of the primary's proposal message is 5400 B, of a response message to a client is 1748 B, and of other messages is around 250 B. For experiments with failures, we force one backup replica to crash. Additionally, we present an experiment that illustrates the effect of primary failure. We measure throughput as transactions executed per second. We measure latency as the time from when the client sends a request to the time when the client receives a response.
Other protocols: We also implement Pbft, Zyzzyva, SBFT, and HotStuff in our ResilientDB fabric. We refer to Section 2 for further details on the working of Zyzzyva, SBFT, and HotStuff. Our implementation of Pbft is based on the BFTSmart [7] framework with the added benefits of out-of-order processing, pipelining, and multi-threading. In both Pbft and Zyzzyva, digital signatures are used for authenticating messages sent by the clients, while MACs are used for other messages. Both SBFT and HotStuff require threshold signatures for their communication.
5.1 System Characterization
We first determine the upper bounds on the performance of ResilientDB. In Figure 7, we present the maximum throughput and latency of ResilientDB when there is no communication among the replicas. We use the term No Execution to refer to the case where all clients send their requests to the primary replica and the primary simply responds back to the client. We count every query responded to in the system throughput. We use the term Execution to refer to the case where the primary replica executes each query before responding back to the client.
The architecture of ResilientDB (see Section 4) specifies the use of one worker-thread. In these experiments, we maximize system performance by allowing up to two threads to work independently at the primary replica without ordering any queries. Our results indicate that the system can attain high throughputs (up to 500 ktxn/s) and can reach low latencies (up to 0.25 s). Notice that if we employ additional worker-threads, our ResilientDB fabric can easily attain higher throughput.
5.2 Effect of Cryptographic Signatures
ResilientDB enables a flexible design where replicas and clients can employ both digital signatures (threshold signatures) and message authentication codes. This helps us to implement PoE and other consensus protocols in ResilientDB.
To achieve authenticated communication using symmetric cryptography, we employ a combination of CMAC and AES [36]. Further, we employ ED25519-based digital signatures to enable asymmetric cryptographic signing. For an efficient threshold signature scheme, we use Boneh–Lynn–Shacham (BLS) signatures [36]. To create message digests and for hashing purposes, we use the SHA256 algorithm.
Next, we determine the cost of different cryptographic signing schemes. For this purpose, we run three different experiments in which (i) no signature scheme is used (None); (ii) everyone uses digital signatures based on ED25519 (ED); and (iii) all replicas use CMAC+AES for signing, while clients sign their messages using ED25519 (MAC). In these three experiments, we run Pbft consensus among 16 replicas. In Figure 8, we illustrate the throughput attained and latency incurred by ResilientDB for these experiments. Clearly, the system attains its highest throughput when no signatures are employed. However, such a system cannot handle malicious attacks. Further, using just digital signatures for signing messages can prove to be expensive. An optimal configuration can require clients to sign their messages using digital signatures, while replicas can communicate using MACs.
5.3 Scaling Replicas under Standard Payload
In this section, we evaluate the scalability of PoE both under backup failure and under no failures.
(1) Single Backup Failure. We use Figures 9(a) and 9(b) to illustrate the throughput and latency attained by the system on running different consensus protocols under a backup failure. These graphs affirm our claim that PoE attains higher throughput and incurs lower latency than all other protocols.
In case of Pbft, each replica participates in two phases of quadratic communication, which limits its throughput. For the twin-path protocols such as Zyzzyva and SBFT, a single failure is sufficient to cause massive reductions in their system throughputs. Notice that the collector in SBFT and the clients in Zyzzyva, respectively, have to wait for messages from all n replicas. As predicting an optimal value for timeouts is hard [11, 12], we chose a very small value for the timeout (3 s) for replicas and clients. We justify this value, as the experiments we show later in this section show that the average latency can be as large as 6 s. We note that high timeouts affect Zyzzyva more than SBFT. In Zyzzyva, clients are waiting for timeouts during which they stop sending requests, which empties the pipeline at the primary, starving it of new requests to propose. To alleviate such issues in real-world deployments of Zyzzyva, clients need to be able to precisely predict the latency to minimize the time the clients need to wait between requests. Unfortunately, this is hard and runs the risk of ending up in the expensive slow path of Zyzzyva whenever the predicted latency is slightly off. In SBFT, the collector may time out waiting for threshold shares for the 𝑘-th round while the primary can continue to propose requests for future rounds 𝑙, 𝑙 > 𝑘. Hence, in SBFT replicas have more opportunity to occupy themselves with useful work.
HotStuff attains significantly lower throughput due to its sequential primary-rotation model, in which each of its primaries has to wait for the previous primary before proposing the next request. Interestingly, HotStuff incurs the least average latency among all protocols. This is a result of the intensive load on the system when running other protocols. As these protocols process several requests concurrently (see the multi-threaded architecture in Section 4), these requests spend on average more time in the queue before being processed by a replica. Notice that all out-of-order consensus protocols employ this trade-off: a small sacrifice on latency yields higher gains on system throughput.
In case of PoE, its high throughput under failures is a result of its three-phase linear protocol that does not rely on any twin-path model. To summarize, PoE attains up to 43%, 72%, 24×, and 62× more throughput than Pbft, SBFT, HotStuff, and Zyzzyva, respectively.
(2) No Replica Failure. We use Figures 9(c) and 9(d) to illustrate the throughput and latency attained by the system on running different consensus protocols in fault-free conditions. These plots help us to bound the maximum throughput that can be attained by different consensus protocols in our system.
First, as expected, in comparison to Figures 9(a) and 9(b), the throughputs for PoE and Pbft are slightly higher. Second, PoE continues to outperform both Pbft and HotStuff, for the reasons described earlier. Third, both Zyzzyva and SBFT now attain higher throughputs as their clients and collector, respectively, no longer time out. The key reason SBFT's gains are limited is that SBFT requires five phases and becomes computation bound. Although Pbft is quadratic, it employs MACs, which are cheaper to sign and verify.
Notice that the differences in throughput of PoE and Zyzzyva are small. PoE has 20% (on 91 replicas) to 13% (on 4 replicas) less throughput than Zyzzyva. An interesting observation is that on 91 replicas, Zyzzyva incurs almost the same latency as PoE, even though it has higher throughput. This happens as clients in PoE have to wait for only the fastest nf = 61 replies, whereas a client for Zyzzyva has to wait for replies from all replicas (even the slowest ones). To conclude, PoE attains up to 35%, 27%, and 21× more throughput than Pbft, SBFT, and HotStuff, respectively.
5.4 Scaling Replicas under Zero Payload
We now measure the performance of different protocols under zero payload. In any BFT protocol, the primary starts consensus by sending a proposal message that includes all transactions. As a result, this message has the largest size and is responsible for consuming the majority of the bandwidth. A zero payload experiment ensures that each replica executes dummy instructions. Hence, the primary is no longer a bottleneck.
We again run these experiments for both Single Failure and Failure-Free cases, and use Figures 9(e) to 9(h) to illustrate our observations. It is evident from these figures that zero payload experiments have helped in increasing PoE's gains. PoE attains up to 85%, 62%, and 27× more throughput than Pbft, SBFT, and HotStuff, respectively. In fact, under failure-free conditions, the throughput attained by PoE is comparable to Zyzzyva. This is easily explained. First, both PoE and Zyzzyva are linear protocols. Second, although in failure-free cases Zyzzyva attains consensus in one phase, its clients need to wait for responses from all n replicas, which gives PoE an opportunity to cover the gap. However, SBFT, despite being a linear protocol, does not perform as well as its linear counterparts. Its throughput is impacted by the delay of five phases.
5.5 Impact of Batching under Failures
Next, we study the effect of batching client requests on BFT protocols [9, 51]. To answer this question, we measure performance as a function of the number of requests in a batch (the batch-size), which we vary between 10 and 400. For this experiment, we use a system with 32 available replicas, of which one replica has failed.
We use Figures 9(i) and 9(j) to illustrate, for each consensus
protocol, the throughput and average latency attained by the sys-
tem. For each protocol, increasing the batch-size also increases
Figure 9: Evaluating system throughput and average latency incurred by PoE and other bft protocols (PoE, Pbft, SBFT, HotStuff, Zyzzyva). Panels: (a), (b) Scalability (Single Failure); (c), (d) Scalability (No Failures); (e), (f) Zero Payload (Single Failure); (g), (h) Zero Payload (No Failures); (i), (j) Batching (Single Failure); (k), (l) Out-of-ordering disabled. Each pair of panels shows throughput (txn/s) and latency (s). Panels (i) and (j) are plotted against the batch size (10, 50, 100, 200, 400); all other panels are plotted against the number of replicas n (4, 16, 32, 64, 91).
throughput, while decreasing the latency. This happens because larger batch-sizes require fewer consensus rounds to complete the exact same set of requests, reducing the cost of ordering and executing the transactions. This not only improves throughput, but also reduces client latencies as clients receive faster responses for their requests. Although increasing the batch-size reduces the number of consensus rounds, the larger message size causes a proportional decrease in throughput (or increase in latency). This is evident from the experiments at higher batch-sizes: increasing the batch-size beyond 100 gradually curves the throughput plots towards a limit for PoE, Pbft and SBFT. For example, on increasing the batch size from 100 to 400, PoE and Pbft see an increase in throughput of 60% and 80%, respectively, while the gap in throughput reduces from 43% to 25%. As in the previous experiments, Zyzzyva yields a significantly lower throughput as it cannot handle failures. In the case of HotStuff, an increase in batch size does increase its throughput, but due to the large scale of the graph this change seems insignificant.
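The relation between batch-size and the number of consensus rounds can be illustrated with a minimal sketch. The function name `consensus_rounds` and the workload of one million requests are assumptions made for illustration; only the batch sizes are taken from the experiment above.

```python
def consensus_rounds(num_requests: int, batch_size: int) -> int:
    """Rounds needed when every consensus round orders one batch of requests."""
    if num_requests < 0 or batch_size <= 0:
        raise ValueError("num_requests must be >= 0 and batch_size must be > 0")
    # Integer ceiling division: a partially filled last batch still costs a round.
    return (num_requests + batch_size - 1) // batch_size


if __name__ == "__main__":
    NUM_REQUESTS = 1_000_000  # hypothetical workload, not a number from the paper
    for batch_size in (10, 50, 100, 200, 400):  # batch sizes of Figures 9(i) and 9(j)
        print(batch_size, consensus_rounds(NUM_REQUESTS, batch_size))
```

In this sketch, going from a batch size of 100 to 400 cuts the number of rounds by a factor of four, whereas the measured throughput of PoE grows by only 60%. This is in line with the observation that larger messages offset part of the benefit of running fewer rounds.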
5.6 Disabling Out-of-Ordering
Until now, we allowed protocols like Pbft, PoE, SBFT and Zyzzyva to process requests out-of-order. As a result, these protocols achieve much higher throughputs than HotStuff, which is restricted by its sequential primary-rotation model. In Figures 9(k) and 9(l), we evaluate the performance of the protocols when there are no opportunities for out-of-ordering.
In this setting, we require each client to only send its request when it has accepted a response for its previous query. As HotStuff pipelines its phases of consensus into a four-phase pipeline, we allow it to access four client requests (each on a distinct subsequent replica) at any time. As expected, HotStuff performs better than all other protocols at the expense of a higher latency as it rotates primaries at the end of each consensus, which allows
Figure 10: System throughput (txn/s) of Pbft and PoE over time (s) under instance failures (n = 32). (a) replicas detect failure of primary and broadcast vc-request; (b) replicas receive vc-request from others; (c) replicas receive nv-propose from new primary; (d) state recovery.
it to pipeline four requests. However, notice that once out-of-ordering is disabled, throughput drops from 200 k transactions/s to only a few thousand transactions/s. Hence, from a practical standpoint, out-of-ordering is crucial. Further, the difference in latency between the protocols is quite small: the visible variation is a result of graph scaling, while the actual numbers are in the range of 20 ms to 40 ms.
5.7 Primary Failure–View Change
In Figure 10, we study the impact of a benign primary failure on PoE and Pbft. To recover from a primary failure, backup replicas run the view-change protocol. We skip illustrating view-change plots for Zyzzyva and SBFT as they already face a severe reduction in throughput for a single backup failure. Further, Zyzzyva has an unsafe view-change algorithm and SBFT's view-change algorithm is no less expensive than that of Pbft. For HotStuff, we do not
Figure 11: System throughput (txn/s) and average latency (s) incurred by PoE and Pbft, plotted against the number of replicas per region (4, 7, 13, 19, 28), in a WAN deployment of five regions under a single failure. In the largest deployment, we have 140 replicas spread equally over these regions.
show results as it changes primary at the end of every consensus. Although single-primary protocols face a momentary loss in throughput during view-change, these protocols easily cover this gap through their ability to process messages out-of-order.
For our experiments, we let the primary replica complete consensus for 10 s (or around a million transactions) and then fail. This causes clients to time out while waiting for responses for their pending transactions. Hence, these clients forward their requests to backup replicas.
When a backup replica receives a client request, it forwards that request to the primary and waits on a timer. Once a replica times out, it detects a primary failure and broadcasts a vc-request message to all other replicas, initiating the view-change protocol (a). Next, each replica waits for a new-view message from the next primary. In the meantime, a replica may receive vc-request messages from other replicas (b). Once a replica receives an nv-propose message from the new primary (c), it moves to the next view.
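The steps (a) to (c) described above can be sketched as a small event-driven state machine. This is a minimal sketch and not the ResilientDB implementation: the class `BackupReplica`, its method names, the in-memory `outbox`, and the round-robin choice of the next primary are all assumptions made for illustration. Validation of the contents of nv-propose and the state recovery of step (d) are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class BackupReplica:
    replica_id: int
    n: int                                   # total number of replicas
    view: int = 0
    timer_running: bool = False
    vc_requests_seen: set = field(default_factory=set)
    outbox: list = field(default_factory=list)  # (destination, message) pairs

    def primary_of(self, view: int) -> int:
        # Assumption: primaries are assigned round-robin over replica ids.
        return view % self.n

    def on_client_request(self, request: str) -> None:
        # Forward the request to the current primary and start waiting on a timer.
        self.outbox.append((self.primary_of(self.view), ("forward", request)))
        self.timer_running = True

    def on_timeout(self) -> None:
        # (a) The timer expired: suspect the primary and broadcast vc-request.
        if not self.timer_running:
            return
        self.timer_running = False
        self.vc_requests_seen.add(self.replica_id)
        for dest in range(self.n):
            if dest != self.replica_id:
                self.outbox.append((dest, ("vc-request", self.view)))

    def on_vc_request(self, sender: int, view: int) -> None:
        # (b) Record vc-request messages of other replicas for the current view.
        if view == self.view:
            self.vc_requests_seen.add(sender)

    def on_nv_propose(self, sender: int, new_view: int) -> None:
        # (c) Move to the next view only if its primary sent nv-propose.
        if new_view == self.view + 1 and sender == self.primary_of(new_view):
            self.view = new_view
            self.vc_requests_seen.clear()
```

A replica with id 2 in a system of four replicas would, in this sketch, forward a client request to replica 0, broadcast three vc-request messages after its timer expires, and enter view 1 once replica 1 sends nv-propose.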
5.8 WAN Scalability
In this section, we use Figure 11 to illustrate the throughputs and latencies for different PoE and Pbft deployments on a wide-area network in the presence of a single failure. Specifically, we deploy clients and replicas across five locations around the globe: Oregon, Iowa, Montreal, the Netherlands, and Taiwan. Next, we vary the number of replicas from 20 to 140 by distributing these replicas equally across the regions.
These plots affirm our existing observations that PoE outperforms existing state-of-the-art protocols and scales well in wide-area deployments. Specifically, PoE achieves up to 1.41× higher throughput and incurs 28.67% less latency than Pbft. We skip presenting plots for SBFT, HotStuff and Zyzzyva due to their low throughputs under failures.
5.9 Simulating bft Protocols
To further underline that the message delay, and not bandwidth requirements, becomes a determining factor in the throughput of protocols in which the primary does not propose requests out-of-order, we performed a separate simulation of the maximum performance of PoE, Pbft, and HotStuff. The simulation makes 500 consensus decisions and processes all message send and receive steps, but delays the arrival of messages by a pre-determined message delay. The simulation skips any expensive computations and, hence, the simulated performance is entirely determined by the cost of message exchanges. We ran the simulation with n ∈ {4, 16, 128} replicas, for which the results can be found in the first three plots of Figure 12. As one can see, if bandwidth is not a limiting factor, then the performance of protocols that do not propose requests out-of-order will be determined by the number of communication rounds and the message delay. As both Pbft and PoE have one communication round more than the two rounds of HotStuff, their performance is roughly two-thirds that of HotStuff, independent of the number of replicas or the message delay. Furthermore, doubling the message delay will roughly halve performance. Finally, we also measured the maximum performance of protocols that do allow out-of-order processing of up to 250 consensus decisions. These results can be found in the last plot of Figure 12. As these results show, out-of-order processing increases performance by a factor of roughly 200, even with 128 replicas.
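The ratios reported above follow from a simple delay-bound model, sketched below. This is not the simulator used for Figure 12: the function names and the model itself (one decision costs `rounds` message delays, and nothing else) are assumptions. The absolute numbers it produces are therefore not expected to match the plots; only the ratios are of interest.

```python
def sequential_rate(rounds: int, delay_s: float) -> float:
    """Decisions per second when a decision must finish before the next one starts."""
    return 1.0 / (rounds * delay_s)


def out_of_order_rate(rounds: int, delay_s: float, window: int) -> float:
    """Idealized upper bound with `window` decisions in flight at the same time."""
    return window * sequential_rate(rounds, delay_s)


# Round counts as stated in the text: HotStuff has two, Pbft and PoE one more.
ROUNDS = {"HotStuff": 2, "Pbft": 3, "PoE": 3}

if __name__ == "__main__":
    for delay_ms in (10, 20, 40):  # message delays on the x-axis of Figure 12
        delay_s = delay_ms / 1000
        rates = {name: sequential_rate(r, delay_s) for name, r in ROUNDS.items()}
        print(delay_ms, rates)
```

In this model, the rate of a three-round protocol is always two-thirds of the rate of a two-round protocol, and doubling the delay halves every rate. A window of 250 concurrent decisions bounds the speed-up by a factor of 250, which is compatible with the measured factor of roughly 200.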
6 RELATED WORK
Consensus is an age-old problem that has received much theoretical and practical attention (see, e.g., [34, 39, 45]). Further, the use of rollbacks is common in distributed systems. E.g., the crash-resilient replication protocol Raft [45] allows primaries to rewrite the log of any replica. In a Byzantine environment, such an approach would delegate too much power to the primary, as it can maliciously overwrite transactions that need to be preserved.
The interest in practical bft consensus protocols took off with the introduction of Pbft [9]. Apart from the protocols that we already discussed, there are some interesting protocols that achieve efficient consensus by requiring 5f+1 replicas [1, 14]. However, these protocols have been shown to work only in the cases where transactions are non-conflicting [38]. Some other bft protocols [10, 50] suggest the use of trusted components to reduce the cost of bft consensus. These works require only 2f+1 replicas as the trusted component helps to guarantee a correct ordering. The safety of these protocols relies on the security of the trusted component. In comparison, PoE does (i) not require extra replicas, (ii) not depend on clients, (iii) not require trusted components, and (iv) not need the two phases of quadratic communication required by Pbft.
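To make the replica requirements above concrete, the following sketch computes how many faulty replicas f a system of n replicas can tolerate under each requirement. The helper `max_faults` is hypothetical. The 3f+1 row is an assumption based on the standard bft setting used by Pbft, which the statement that PoE needs no extra replicas refers to; the 5f+1 and 2f+1 rows are taken from the text.

```python
def max_faults(n: int, replicas_per_fault: int) -> int:
    """Largest f such that n >= replicas_per_fault * f + 1."""
    if n < 1 or replicas_per_fault < 1:
        raise ValueError("n and replicas_per_fault must be positive")
    return (n - 1) // replicas_per_fault


if __name__ == "__main__":
    n = 32  # the deployment size used in several experiments of Section 5
    print("3f+1 (standard bft setting):", max_faults(n, 3))
    print("5f+1 (protocols of [1, 14]):", max_faults(n, 5))
    print("2f+1 (trusted components [10, 50]):", max_faults(n, 2))
```

With 32 replicas, a 5f+1 protocol tolerates fewer failures than a 3f+1 protocol, whereas a 2f+1 protocol tolerates more, at the price of trusting an additional component.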
As a promising future direction, Castro [9] also suggested exploring speculative optimizations for Pbft, which he referred to as tentative execution. However, this proposal lacked: (i) a formal description, (ii) a non-divergence safety property, (iii) a specification of rollback under attacks, (iv) a re-examination of the view-change protocol, and (v) any actual evaluation.
Consensus for Blockchains: Since the introduction of Bitcoin [42], the well-known cryptocurrency that led to the coining of the term blockchain, several new bft consensus protocols that cater to cryptocurrencies have been designed [33, 37]. Bitcoin [42] employs the Proof-of-Work [33] consensus protocol (PoW), which is computationally intensive, achieves low throughput, and can cause forks (divergence) in the blockchain: separate chains can exist on non-faulty replicas, which in turn can cause double-spending attacks [31]. Due to these limitations, several other similar algorithms have been proposed, e.g., Proof-of-Stake (PoS) [37], which is designed such that any replica owning n% of the total resources gets the opportunity to create n% of the new blocks. As PoS is resource driven, it can face attacks where replicas are incentivized to work simultaneously on several forks of the blockchain, without ever trying to eliminate these forks.
There is also a set of interesting alternative designs such as ConFlux [40], Caper [3] and MeshCash [6] that suggest the use of directed acyclic graphs (DAGs) to store a blockchain to improve the performance of Bitcoin. However, these protocols rely on either PoW or Pbft for consensus.
Meta-protocols such as RCC [28] and RBFT [5] run multiple Pbft consensuses in parallel. These protocols also aim at removing dependence on the consensus led by a single primary. A recent protocol, PoV [41], provides fast bft consensus in a
Figure 12: The simulated number of consensus decisions PoE, Pbft, and HotStuff can make as a function of the latency. Only the protocols in the right-most plot (PoE and Pbft) process requests out-of-order. Panels: simulated performance with 4, 16, and 128 replicas, followed by simulated performance with 128 replicas and out-of-order processing. Each panel plots throughput (decisions/s) against latency (10, 20, and 40 ms).
consortium architecture. PoV does this by restricting the ability to propose blocks to a subset of trusted replicas.
PoE does not face the limitations faced by PoW [33] and PoS [37]. The use of DAGs [3, 6, 40] and sharding [15, 52] is orthogonal to the design of PoE. Hence, their use with PoE can reap further benefits. Further, PoE can be employed by meta-protocols and does not restrict consensus to any subset of replicas.
7 CONCLUSIONS
We present Proof-of-Execution (PoE), a novel Byzantine fault-tolerant consensus protocol that guarantees safety and liveness and does so in only three linear phases. PoE decouples ordering from execution by allowing replicas to process messages out-of-order and execute client transactions speculatively. Despite these properties, PoE ensures that all the replicas reach a single unique order for all the transactions. Further, PoE guarantees that if a client observes identical results of execution from a majority of the replicas, then it can reliably mark its transaction committed. Due to speculative execution, however, PoE may require replicas to revert executed transactions. To evaluate PoE's design, we implement it in our ResilientDB fabric. Our evaluation shows that PoE achieves up to 80% higher throughput than existing bft protocols in the presence of failures.
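The client-side rule stated above, under which a transaction is marked committed once sufficiently many replicas report identical results, can be sketched as follows. The function `committed_result` and its parameters are hypothetical. The quorum size is left as the parameter `threshold`, since this section only speaks of a majority and the exact value is fixed by the protocol description earlier in the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


def committed_result(responses: Dict[int, Hashable], threshold: int) -> Optional[Hashable]:
    """Return the result reported identically by at least `threshold` replicas.

    `responses` maps a replica id to the execution result that replica sent.
    Returns None when no result has reached the threshold yet.
    """
    if not responses:
        return None
    # The most frequent result is the only one that can have reached the threshold.
    result, count = Counter(responses.values()).most_common(1)[0]
    return result if count >= threshold else None
```

For example, with responses from four replicas of which three agree, the call returns the agreed result for a threshold of three and `None` for a threshold of four, in which case the client keeps waiting.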
REFERENCES
[1] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. 2005. Fault-scalable Byzantine Fault-tolerant Services. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles. ACM, 59–74. https://doi.org/10.1145/1095810.1095817
[2] Ittai Abraham, Guy Gueta, Dahlia Malkhi, Lorenzo Alvisi, Rama Kotla, and Jean-Philippe Martin. 2017. Revisiting Fast Practical Byzantine Fault Tolerance. https://arxiv.org/abs/1712.01367
[3] Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. 2019. CAPER: A Cross-application Permissioned Blockchain. Proc. VLDB Endow. 12, 11 (2019), 1385–1398. https://doi.org/10.14778/3342263.3342275
[4] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, Srinivasan Muralidharan, Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith Smith, Alessandro Sorniotti, Chrysoula Stathakopoulou, Marko Vukolić, Sharon Weed Cocco, and Jason Yellick. 2018. Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains. In Proceedings of the Thirteenth EuroSys Conference. ACM, 30:1–30:15. https://doi.org/10.1145/3190508.3190538
[5] Pierre-Louis Aublin, Sonia Ben Mokhtar, and Vivien Quéma. 2013. RBFT: Redundant Byzantine Fault Tolerance. In Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems. IEEE, 297–306. https://doi.org/10.1109/ICDCS.2013.53
[6] Iddo Bentov, Pavel Hubáček, Tal Moran, and Asaf Nadler. 2017. Tortoise and Hares Consensus: the Meshcash Framework for Incentive-Compatible, Scalable Cryptocurrencies. https://eprint.iacr.org/2017/300
[7] Alysson Bessani, João Sousa, and Eduardo E.P. Alchieri. 2014. State Machine Replication for the Masses with BFT-SMART. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 355–362. https://doi.org/10.1109/DSN.2014.43
[8] Manuel Bravo, Zsolt István, and Man-Kit Sit. 2020. Towards Improving the Performance of BFT Consensus For Future Permissioned Blockchains. https://arxiv.org/abs/2007.12637
[9] Miguel Castro and Barbara Liskov. 2002. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. Comput. Syst. 20, 4 (2002), 398–461. https://doi.org/10.1145/571637.571640
[10] Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. 2007. Attested Append-Only Memory: Making Adversaries Stick to Their Word. SIGOPS Oper. Syst. Rev. 41, 6 (2007), 189–204. https://doi.org/10.1145/1323293.1294280
[11] Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, and Taylor Riche. 2009. Upright Cluster Services. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 277–290. https://doi.org/10.1145/1629575.1629602
[12] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. 2009. Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation. USENIX, 153–168.
[13] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143–154. https://doi.org/10.1145/1807128.1807152
[14] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. 2006. HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault Tolerance. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX, 177–190.
[15] Hung Dang, Tien Tuan Anh Dinh, Dumitrel Loghin, Ee-Chien Chang, Qian Lin, and Beng Chin Ooi. 2019. Towards Scaling Blockchain Systems via Sharding. In Proceedings of the 2019 International Conference on Management of Data. ACM, 123–140. https://doi.org/10.1145/3299869.3319889
[16] Tien Tuan Anh Dinh, Ji Wang, Gang Chen, Rui Liu, Beng Chin Ooi, and Kian-Lee Tan. 2017. BLOCKBENCH: A Framework for Analyzing Private Blockchains. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1085–1100. https://doi.org/10.1145/3035918.3064033
[17] Wayne W. Eckerson. 2002. Data quality and the bottom line: Achieving Business Success through a Commitment to High Quality Data. Technical Report. The Data Warehousing Institute, 101communications LLC.
[18] Muhammad El-Hindi, Carsten Binnig, Arvind Arasu, Donald Kossmann, and Ravi Ramamurthy. 2019. BlockchainDB: A Shared Database on Blockchains. Proc. VLDB Endow. 12, 11 (2019), 1597–1609. https://doi.org/10.14778/3342263.3342636
[19] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of Distributed Consensus with One Faulty Process. J. ACM 32, 2 (1985), 374–382. https://doi.org/10.1145/3149.214121
[20] Gideon Greenspan. 2015. MultiChain Private Blockchain–White Paper. https://www.multichain.com/download/MultiChain-White-Paper.pdf
[21] Guy Golan Gueta, Ittai Abraham, Shelly Grossman, Dahlia Malkhi, Benny Pinkas, Michael Reiter, Dragos-Adrian Seredinschi, Orr Tamir, and Alin Tomescu. 2019. SBFT: A Scalable and Decentralized Trust Infrastructure. In 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 568–580. https://doi.org/10.1109/DSN.2019.00063
[22] Jim Gray. 1978. Notes on Data Base Operating Systems. In Operating Systems, An Advanced Course. Springer-Verlag, 393–481. https://doi.org/10.1007/3-540-08755-9_9
[23] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2019. An In-Depth Look of BFT Consensus in Blockchain: Challenges and Opportunities. In Proceedings of the 20th International Middleware Conference Tutorials, Middleware. ACM, 6–10. https://doi.org/10.1145/3366625.3369437
[24] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Blockchain consensus unraveled: virtues and limitations. In Proceedings of the 14th ACM International Conference on Distributed and Event-based Systems. ACM, 218–221. https://doi.org/10.1145/3401025.3404099
[25] Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Building High Throughput Permissioned Blockchain Fabrics: Challenges and Opportunities. Proc. VLDB Endow. 13, 12 (2020), 3441–3444. https://doi.org/10.14778/3415478.3415565
[26] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2019. Brief Announcement: Revisiting Consensus Protocols through Wait-Free Parallelization. In 33rd International Symposium on Distributed Computing (DISC 2019), Vol. 146. Schloss Dagstuhl, 44:1–44:3. https://doi.org/10.4230/LIPIcs.DISC.2019.44
[27] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2021. Fault-Tolerant Distributed Transactions on Blockchain. Morgan & Claypool. https://doi.org/10.2200/S01068ED1V01Y202012DTM065
[28] Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2021. RCC: Resilient Concurrent Consensus for High-Throughput Secure Transaction Processing. In 37th IEEE International Conference on Data Engineering. IEEE. to appear.
[29] Suyash Gupta, Sajjad Rahnama, Jelle Hellings, and Mohammad Sadoghi. 2020. ResilientDB: Global Scale Resilient Blockchain Fabric. Proc. VLDB Endow. 13, 6 (2020), 868–883. https://doi.org/10.14778/3380750.3380757
[30] Suyash Gupta, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Permissioned Blockchain Through the Looking Glass: Architectural and Implementation Lessons Learned. In Proceedings of the 40th IEEE International Conference on Distributed Computing Systems.
[31] Suyash Gupta and Mohammad Sadoghi. 2019. Blockchain Transaction Processing. In Encyclopedia of Big Data Technologies. Springer, 1–11. https://doi.org/10.1007/978-3-319-63962-8_333-1
[32] Thomas N. Herzog, Fritz J. Scheuren, and William E. Winkler. 2007. Data Quality and Record Linkage Techniques. Springer. https://doi.org/10.1007/0-387-69505-2
[33] Markus Jakobsson and Ari Juels. 1999. Proofs of Work and Bread Pudding Protocols. In Secure Information Networks: Communications and Multimedia Security IFIP TC6/TC11 Joint Working Conference on Communications and Multimedia Security (CMS’99). Springer, 258–272. https://doi.org/10.1007/978-0-387-35568-9_18
[34] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. 2011. Zab: High-Performance Broadcast for Primary-Backup Systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks. IEEE, 245–256. https://doi.org/10.1109/DSN.2011.5958223
[35] Manos Kapritsos, Yang Wang, Vivien Quema, Allen Clement, Lorenzo Alvisi, and Mike Dahlin. 2012. All about Eve: Execute-Verify Replication for Multi-Core Servers. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. USENIX, 237–250.
[36] Jonathan Katz and Yehuda Lindell. 2014. Introduction to Modern Cryptography (2nd ed.). Chapman and Hall/CRC.
[37] Sunny King and Scott Nadal. 2012. PPCoin: Peer-to-Peer Crypto-Currency with Proof-of-Stake. https://www.peercoin.net/whitepapers/peercoin-paper.pdf
[38] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. 2009. Zyzzyva: Speculative Byzantine Fault Tolerance. ACM Trans. Comput. Syst. 27, 4 (2009), 7:1–7:39. https://doi.org/10.1145/1658357.1658358
[39] Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (2001), 51–58. https://doi.org/10.1145/568425.568433 Distributed Computing Column 5.
[40] Chenxing Li, Peilun Li, Dong Zhou, Wei Xu, Fan Long, and Andrew Yao. 2018. Scaling Nakamoto Consensus to Thousands of Transactions per Second. https://arxiv.org/abs/1805.03870
[41] Kejiao Li, Hui Li, Han Wang, Huiyao An, Ping Lu, Peng Yi, and Fusheng Zhu. 2020. PoV: An Efficient Voting-Based Consensus Algorithm for Consortium Blockchains. Front. Blockchain 3 (2020), 11. https://doi.org/10.3389/fbloc.2020.00011
[42] Satoshi Nakamoto. 2009. Bitcoin: A Peer-to-Peer Electronic Cash System. https://bitcoin.org/bitcoin.pdf
[43] Faisal Nawab and Mohammad Sadoghi. 2019. Blockplane: A Global-Scale Byzantizing Middleware. In 35th International Conference on Data Engineering (ICDE). IEEE, 124–135. https://doi.org/10.1109/ICDE.2019.00020
[44] The Council of Economic Advisers. 2018. The Cost of Malicious Cyber Activity to the U.S. Economy. Technical Report. Executive Office of the President of the United States. https://www.whitehouse.gov/wp-content/uploads/2018/03/The-Cost-of-Malicious-Cyber-Activity-to-the-U.S.-Economy.pdf
[45] Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. USENIX, 305–320.
[46] M. Tamer Özsu and Patrick Valduriez. 2020. Principles of Distributed Database Systems. Springer. https://doi.org/10.1007/978-3-030-26253-2
[47] Sajjad Rahnama, Suyash Gupta, Thamir Qadah, Jelle Hellings, and Mohammad Sadoghi. 2020. Scalable, Resilient and Configurable Permissioned Blockchain Fabric. Proc. VLDB Endow. 13, 12 (2020), 2893–2896. https://doi.org/10.14778/3415478.3415502
[48] Thomas C. Redman. 1998. The Impact of Poor Data Quality on the Typical Enterprise. Commun. ACM 41, 2 (1998), 79–82. https://doi.org/10.1145/269012.269025
[49] Dale Skeen. 1982. A Quorum-Based Commit Protocol. Technical Report. Cornell University.
[50] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Verissimo. 2013. Efficient Byzantine Fault-Tolerance. IEEE Trans. Comput. 62, 1 (2013), 16–30. https://doi.org/10.1109/TC.2011.221
[51] Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. 2019. HotStuff: BFT Consensus with Linearity and Responsiveness. In Proceedings of the ACM Symposium on Principles of Distributed Computing. ACM, 347–356. https://doi.org/10.1145/3293611.3331591
[52] Mahdi Zamani, Mahnush Movahedi, and Mariana Raykova. 2018. RapidChain: Scaling Blockchain via Full Sharding. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 931–948. https://doi.org/10.1145/3243734.3243853
The fourth edition of this classic textbook provides major updates. This edition has completely new chapters on Big Data Platforms (distributed storage systems, MapReduce, Spark, data stream processing, graph analytics) and on NoSQL, NewSQL and polystore systems. It also includes an updated web data management chapter that includes RDF and semantic web discussion, an integrated database integration chapter focusing both on schema integration and querying over these systems. The peer-to-peer computing chapter has been updated with a discussion of blockchains. The chapters that describe classical distributed and parallel database technology have all been updated. The new edition covers the breadth and depth of the field from a modern viewpoint. Graduate students, as well as senior undergraduate students studying computer science and other related fields will use this book as a primary textbook. Researchers working in computer science will also find this textbook useful. This textbook has a companion web site that includes background information on relational database fundamentals, query processing, transaction management, and computer networks for those who might need this background. The web site also includes all the figures and presentation slides as well as solutions to exercises (restricted to instructors).