RingBFT: Resilient Consensus over Sharded Ring Topology
Sajjad Rahnama Suyash Gupta Rohan Sogani Dhruv Krishnan Mohammad Sadoghi
Exploratory Systems Lab
University of California Davis
ABSTRACT
The recent surge in federated data-management applications has brought forth concerns about the security of underlying data and the consistency of replicas in the presence of malicious attacks. A prominent solution in this direction is to employ a permissioned blockchain framework that is modeled around traditional Byzantine Fault-Tolerant (Bft) consensus protocols. Any federated application expects its data to be globally scattered to achieve faster access. But, prior works have shown that traditional Bft protocols are slow, and this led to the rise of sharded-replicated blockchains.
Existing Bft protocols for these sharded blockchains are efficient if client transactions require access to a single shard, but face performance degradation if there is a cross-shard transaction that requires access to multiple shards. However, cross-shard transactions are common, and to resolve this dilemma, we present RingBFT, a novel meta-Bft protocol for sharded blockchains. RingBFT requires shards to adhere to the ring order, and follow the principle of process, forward, and re-transmit while ensuring the communication between shards is linear. Our evaluation of RingBFT against state-of-the-art sharding Bft protocols illustrates that RingBFT achieves up to 25x higher throughput, easily scales to nearly 500 globally distributed nodes, and achieves a peak throughput of 1.2 million txns/s.
1 INTRODUCTION
Recent works have illustrated a growing interest in federated data management [10, 21, 28, 70, 79, 80]. In a federated system, a common database is maintained by several parties. These parties need to reach a consensus on the fate of every transaction that is committed to the database. Such a database managed by multiple parties raises concerns for data privacy, data quality, resistance against adversaries, and database availability and consistency [40, 56, 68].
A recent solution to guarantee secure federated data management is through the use of permissioned blockchain technology [5, 40, 55, 56]. Permissioned blockchain applications employ age-old database principles to facilitate a democratic and failure-resilient consensus among several participants. Such a democratic system allows all parties to maintain a copy of the common database (acting as a replica) and cast a vote on the fate of any transaction. Hence, at the core of any permissioned blockchain application runs a Byzantine Fault-Tolerant (Bft) consensus protocol that aims to order all the client transactions in the same manner across all the replicas, despite any malicious attacks. Once a transaction is ordered, it is recorded in a block, which is appended to the blockchain. Each new block also includes the hash of the previous block, which makes the blockchain immutable.
At a closer look, Bft consensus protocols are resilient counterparts of crash-fault tolerant protocols such as Paxos and Raft [51, 64]. As the name suggests, these protocols ensure that the participating replicas reach a safe consensus under crash failures.
[Figure 1: Comparing scalability of different Bft protocols (RingBFT, RingBFT_X, Pbft, Sbft, HotStuff, Rcc, PoE, Zyzzyva). The x-axis gives the number of nodes (4, 16, 32) and the y-axis the total throughput (txn/s). In this figure, we depict throughputs of single-primary, multiple-primaries, geographically-scalable, and sharding Bft protocols. For RingBFT, we require each shard to have the number of replicas stated on the x-axis.]
However, in federated systems, byzantine attacks are possible and malicious participants may wish: (i) to exclude transactions of some clients, (ii) to make the system unavailable to clients, and (iii) to make replicas inconsistent. Hence, the use of a Bft protocol is in order.
In this paper, we present a novel meta-Bft protocol, RingBFT, that aims to be secure against byzantine attacks, achieves high throughputs, and incurs low latencies. Our RingBFT protocol explores the landscape of sharded-replicated databases, and helps to scale permissioned blockchains, which in turn helps in designing efficient federated data-management systems. RingBFT aims to make consensus inexpensive even when transactions require access to multiple shards. In the rest of this section, we motivate the need for our design choices. To highlight the need for RingBFT, we will be referring to Figure 1, which illustrates the throughput attained by the system when employing different Bft consensus protocols.
1.1 Challenges for Ecient BFT Consensus
Existing permissioned blockchain applications employ traditional Bft protocols to achieve consensus among their replicas [3, 4, 11, 49]. Over the past two decades, these Bft protocols have undergone a series of evolutions to guarantee resilience against byzantine attacks, while ensuring high throughputs and low latency. The seminal work by Castro and Liskov [11, 12] led to the design of the first practical Bft protocol, Pbft, which advocates a primary-backup paradigm where the primary initiates the consensus and all the backups follow the primary's lead. Pbft achieves consensus among the replicas in three phases, of which two require quadratic communication complexity. Following this, several exciting primary-backup protocols, such as Zyzzyva [50], Sbft [31], and PoE [34], have been proposed that try to yield higher throughputs from Bft consensuses.
We use Figure 1 to illustrate the benefits of these optimizations over Pbft. Prior works [2, 37, 61] have illustrated that these single-primary protocols are essentially centralized and prevent scaling the system to a large number of replicas.
arXiv:2107.13047v1 [cs.DB] 27 Jul 2021
An emerging solution to balance load among replicas is to employ multi-primary protocols like Honeybadger [61] and Rcc [35, 36] that permit all replicas to act as primaries by running multiple consensuses concurrently. However, multi-primary protocols also face scalability limitations as, despite concurrent consensuses, each transaction requires communication between all the replicas. Moreover, if the replicas are separated by geographically large distances, then these protocols incur low throughput and high latencies due to low bandwidth and high round-trip time. This led to the design of topology-aware protocols, such as Steward [2] and Geobft [37], which cluster replicas based on their geographical distances. For instance, Geobft expects each cluster to first locally order its client transaction by running the Pbft protocol, and then exchange this ordered transaction with all the other clusters. Although Geobft is highly scalable, it necessitates total replication, which forces communicating large messages among geographically distant replicas.
1.2 The Landscape for Sharding
To mitigate the costs associated with replicated databases, a common strategy is to employ the sharded-replicated paradigm [65]. In a sharded-replicated database, the data is distributed across a set of shards where each shard manages a unique partition of the data. Further, each shard replicates its partition of data to ensure availability under failures. If each transaction accesses only one shard, then these sharded systems can fetch high throughputs as consensus is restricted to a subset of replicas.
AHL [19] was the first permissioned blockchain system to employ principles of sharding. AHL's seminal design helps to scale blockchain systems to hundreds of replicas across the globe and achieve high throughputs for single-shard transactions. To tackle cross-shard transactions that require access to data in multiple shards, AHL designates a set of replicas as a reference committee, which globally orders all such transactions. Following AHL's design, Sharper [4] presents a sharding protocol that eliminates the need for a reference committee for ordering cross-shard transactions, but necessitates global communication among all replicas of all the participating shards.
Why RingBFT? Decades of research in the database community have illustrated that cross-shard transactions are common [16, 22, 41, 63, 76, 83]. In fact, the heavy presence of these cross-shard transactions has led to the development of several concurrency control [8, 9, 41] and commit protocols [32, 39, 71]. Hence, in this paper, we present our RingBFT protocol that significantly reduces the costs associated with cross-shard transactions.
Akin to AHL and Sharper, RingBFT assumes that the read-write sets of each transaction are known prior to the start of consensus. Given this, RingBFT guarantees consensus for each cross-shard transaction in at most two rotations around the ring. In RingBFT, we envision concurrent execution of transactions; thus, each shard may participate in concurrent rotational flows, where each rotation maps to the processing of a transaction. For each cross-shard transaction, RingBFT follows the principle of process, forward, and re-transmit. This implies that each shard performs consensus on the transaction and forwards it to the next shard. This flow continues until each shard is aware of the fate of the transaction. However, the real challenge with cross-shard transactions is to manage conflicts and to prevent deadlocks, which RingBFT achieves by requiring cross-shard transactions to travel in ring order. Despite all of this, RingBFT ensures communication between the shards is linear. This minimalistic design has allowed RingBFT to achieve unprecedented gains in throughput and has allowed us to scale Bft protocols to nearly 500 nodes. The benefits of our RingBFT protocol are visible from Figure 1, where we run RingBFT in a system of 9 shards with each shard having 4, 16, and 32 replicas. Further, we show the throughput with 0% (RingBFT) and 15% (RingBFT_X) cross-shard transactions. We now list our contributions.
(1) We present a novel meta-Bft protocol for sharded-replicated permissioned blockchain systems that requires participating shards to adhere to the ring order. We term RingBFT as "meta" because it can employ any single-primary protocol within each shard.
(2) Our RingBFT protocol presents a scalable consensus for cross-shard transactions that neither depends on any centralized committee nor requires all-to-all communication.
(3) We show that the cross-shard consensus provided by RingBFT is safe and live, despite any byzantine attacks.
(4) We evaluate RingBFT against two state-of-the-art Bft protocols for permissioned sharded systems, AHL [19] and Sharper [4]. Our results show that RingBFT outperforms these protocols, easily scales to 428 globally-distributed nodes, and achieves up to 25x and 21x higher throughputs than AHL and Sharper, respectively.
2 CROSS-SHARD DILEMMA
For any sharded system, ordering a single-shard transaction is trivial, as such a transaction requires access to only one shard. This implies that achieving consensus on a single-shard transaction just requires running a standard Bft protocol. Further, single-shard transactions support parallelism, as each shard can order its transactions in parallel, without any communication between shards.
On the other hand, cross-shard transactions are complex. Not only do they require communication between shards, but their fate also depends on the consent of each of the involved shards. Further, two or more cross-shard transactions can conflict if they require access to the same data. Such conflicts can cause one or more transactions to abort or, worse, can create a deadlock. Hence, we need an efficient protocol to order these cross-shard transactions, which ensures that the system is both safe and live.
Designated Commiee (AHL).
One way to order cross-shard
transactions is to designate a set of replicas with this task. AHL [
19
]
denes a reference committee that assigns an order to each cross-
shard transaction, which requires running Pbft protocol among all
the members of the reference committee. Next, reference committee
members run the Two-phase commit (2pc) protocol with all the
replicas of involved shards. Notice that the 2pc protocol requires:
(1) each shard to send a vote to the reference committee, (2) refer-
ence committee collects these votes and takes a decision (abort or
commit), and (3) each shard implements the decision. Firstly, this
solution requires each shard to run the Pbft protocol to decide on
the vote. Secondly, reference committee needs to again run Pbft
to reach a common decision. Finally, these multiple phases of 2PC
RingBFT: Resilient Consensus over Sharded Ring Topology
require all-to-all communication between the replicas of each shard
and the replicas of reference committee.
Initiator Shard (Sharper). Another way to process a cross-shard transaction is to designate one of the involved shards as the initiator shard. Sharper [4] employs this approach by requiring each cross-shard transaction to be managed by the primary replica of one of the involved shards. This initiator primary proposes the transaction to the primaries of other shards. Next, these primaries propose this transaction within their own shards. Following this, there is an all-to-all communication between replicas of all the involved shards.
3 SYSTEM MODEL
To explain our RingBFT protocol in detail, we first lay down some notations and assumptions. Our system comprises a set 𝔖 of shards, where each shard S provides a replicated service. Specifically, each shard S manages a unique partition of the data, which is replicated by a set ℜS of replicas.
In each shard S, there are F ⊂ ℜS byzantine replicas, of which NF = ℜS \ F are non-faulty replicas. We expect non-faulty replicas to follow the protocol and act deterministically, that is, on identical inputs, all non-faulty replicas must produce identical outputs. We write z = |𝔖| to denote the total number of shards and n = |ℜS|, f = |F|, and nf = |NF| to denote the number of replicas, faulty replicas, and non-faulty replicas, respectively, in each shard.
Fault-Tolerance Requirement. Traditional Bft protocols such as Pbft, Zyzzyva, and Sbft expect a totally replicated system where the total number of byzantine replicas is less than one-third of the total replicas in the system. In our sharded-replicated model, we adopt a slightly weaker setting where at each shard the total number of byzantine replicas is less than one-third of the total replicas in that shard. Specifically, at each shard S, we have n ≥ 3f + 1. Notice that this requirement is in accordance with existing works in the byzantine sharding space [4, 19, 81, 82].
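The per-shard fault-tolerance arithmetic above can be sketched in a few lines. This is a minimal illustration (not part of the paper): the largest tolerated f for a shard of n replicas under n ≥ 3f + 1, and the resulting quorum size nf = n − f that replicas wait for.

```python
# Illustrative sketch of RingBFT's per-shard fault-tolerance bound:
# each shard of n replicas tolerates f byzantine replicas as long as
# n >= 3f + 1 holds within that shard.

def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1 holds for a shard of n replicas."""
    return (n - 1) // 3

def quorum(n: int) -> int:
    """nf = n - f: the number of matching messages a replica waits for."""
    return n - max_faulty(n)

for n in (4, 16, 32):  # per-shard replica counts used in Figure 1
    print(n, max_faulty(n), quorum(n))
```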
Cross-Shard Transactions. Each shard S ∈ 𝔖 can receive a single-shard or cross-shard transaction. A single-shard transaction for S leads to intra-shard communication, that is, all the messages necessary to order this transaction are exchanged among the replicas of S. On the other hand, a cross-shard transaction requires access to data from a subset of shards (henceforth we use the abbreviation cst to refer to a cross-shard transaction). We denote this subset of shards as 𝔗, where 𝔗 ⊆ 𝔖, and refer to it as the involved shards. Each cst can be termed simple or complex. A simple cst is a collection of fragments where each shard can independently run consensus and execute its fragment. On the other hand, a complex cst includes dependencies, that is, an involved shard may require access to data from other involved shards to execute its fragment.
Deterministic Transactions. We define a deterministic transaction as a transaction for which the data-items it will read/write are known prior to the start of the consensus [69, 76]. Given a deterministic transaction, a replica can determine which data-items accessed by this transaction are present in its shard.
Ring Order. We assume shards in the set 𝔖 are logically arranged in a ring topology. Specifically, each shard S ∈ 𝔖 has a position in the ring, which we denote by id(S), 1 ≤ id(S) ≤ |𝔖|. RingBFT employs these identifiers to specify the flow of a cst, or ring order. For instance, a simple ring policy can be that each cst is processed by the involved shards in the increasing order of their identifiers. RingBFT can also adopt other, more complex permutations of these identifiers for determining the flow across the ring.
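The simple ring policy just described can be sketched as follows. This is an illustration only; the shard names and identifier values are hypothetical, not the paper's.

```python
# Illustrative sketch of the simple ring policy: a cst is processed by
# its involved shards in increasing order of their ring identifiers id(S).

ring_id = {"S": 1, "U": 2, "V": 3, "W": 4}  # hypothetical id(S) values

def flow_order(involved):
    """Order in which the involved shards process a cross-shard txn."""
    return sorted(involved, key=lambda s: ring_id[s])

print(flow_order({"W", "U", "V"}))  # ['U', 'V', 'W']
```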
Authenticated Communication. We assume that each message exchanged among clients and replicas is authenticated. Further, we assume that byzantine replicas are unable to impersonate non-faulty replicas. Notice that authenticated communication is a minimal requirement to deal with byzantine behavior. For intra-shard communication, we employ cheap message authentication codes (MACs), while for cross-shard communication we employ digital signatures (DS) to achieve authenticated communication. MACs are a form of symmetric cryptography where each pair of communicating nodes shares a secret key. We expect non-faulty replicas to keep their secret keys hidden. DS follow asymmetric cryptography. Specifically, prior to signing a message, each replica generates a pair of public-key and private-key. The signer keeps the private-key hidden and uses it to sign a message. Each receiver authenticates the message using the corresponding public-key.
In the rest of this manuscript, if a message 𝑚 is signed by a replica r using DS, we represent it as ⟨𝑚⟩r to explicitly identify replica r. Otherwise, we assume that the message employs a MAC. To ensure message integrity, we employ a collision-resistant cryptographic hash function 𝐻(·) that maps an arbitrary value 𝑣 to a constant-sized digest 𝐻(𝑣) [47]. We assume that there is a negligible probability of finding another value 𝑣′, 𝑣′ ≠ 𝑣, such that 𝐻(𝑣) = 𝐻(𝑣′). Further, we refer to a message as well-formed if a non-faulty receiver can validate the DS or MAC, verify the integrity of the message digest, and determine that the sender of the message is also its creator.
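To make the primitives concrete, here is a standard-library Python sketch of a pairwise MAC (for intra-shard messages) and a collision-resistant digest 𝐻(𝑣). This is our own illustration, not the paper's implementation; digital signatures need asymmetric-key libraries and are omitted.

```python
# Sketch of the two symmetric primitives: an HMAC-based MAC shared by a
# pair of replicas, and a constant-sized digest H(v) used for integrity.

import hashlib
import hmac

secret = b"pairwise-shared-key"  # hypothetical key shared by two replicas

def mac(message: bytes) -> bytes:
    """Symmetric authentication tag for an intra-shard message."""
    return hmac.new(secret, message, hashlib.sha256).digest()

def digest(v: bytes) -> str:
    """Constant-sized digest H(v), e.g. for a Preprepare message."""
    return hashlib.sha256(v).hexdigest()

m = b"Prepare(delta, k)"
assert hmac.compare_digest(mac(m), mac(m))  # receiver re-computes the tag
print(digest(b"T")[:16])                    # 64-hex-char digest, truncated
```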
4 RINGBFT CONSENSUS PROTOCOL
To achieve ecient consensus in sharded-replicated databases, we
employ our RingBFT protocol. While designing our RingBFT pro-
tocol, we set following goals:
(G1) Inexpensive consensus of single-shard transactions.
(G2)
Flexibility of employing dierent existing consensus proto-
cols for intra-shard consensus.
(G3)
Deadlock-free two-ring consensus of deterministic cross-
shard transactions.
(G4)
Cheap communication between globally-distributed shards.
Next, we dene the safety and liveness guarantees provided by
our RingBFT protocol.
Denition 4.1.
Let
𝔖
be a system of shards and
S
be a set of
replicas in some shard
S𝔖
. Each run of a consensus protocol in
this system should satisfy the following requirements:
Involvement Each S𝔖processes a transaction if S.
Termination
Each non-faulty replica in
S
executes a transaction.
Non-divergence
(intra-shard) All non-faulty replicas in
S
exe-
cute the same transaction.
Consistence (cross-shard) Each non-faulty replica in 𝔖executes
a conicting transaction in same order.
In traditional replicated systems, non-divergence implies safety, while termination implies liveness. For a sharded-replicated system like RingBFT, we need stronger guarantees. If a transaction requires access to only one shard, safety is provided by involvement and non-divergence, while termination sufficiently guarantees liveness. For a cross-shard transaction, to guarantee safety, we also need consistence apart from involvement and non-divergence, while liveness is provided using involvement and termination.
RingBFT guarantees safety in an asynchronous setting. In such a setting, messages may get lost, delayed, or duplicated, and up to f replicas in each shard may act byzantine. However, RingBFT can only provide liveness during periods of synchrony. Notice that these assumptions are no harder than those required by existing protocols [4, 11, 19].
[Figure 2: An illustration of how RingBFT manages single-shard transactions. Each of the three shards S, U, and V receives transactions T1, T2, and T3 from their respective clients c1, c2, and c3 to execute. Each shard independently runs Pbft consensus and sends responses to the respective clients.]
4.1 Single-Shard Consensus
Ordering and executing single-shard transactions is trivial. For this task, RingBFT employs one of the many available primary-backup consensus protocols and runs them at each shard. In the rest of this section, without loss of generality, we assume that RingBFT employs the Pbft consensus protocol to order single-shard transactions. We use the following example to explain RingBFT's single-shard consensus.
Example 4.2. Assume a system that comprises three shards S, U, and V. Say client c1 sends T1 to S, c2 sends T2 to U, and client c3 sends T3 to V. On receiving the client transaction, the primary of each shard initiates the Pbft consensus protocol among its replicas. Once each replica successfully orders the transaction, it sends a response to the client. Such a flow is depicted in Figure 2.
It is evident from Example 4.2 that there is no communication among the shards. This is the case because each transaction requires access to data available inside only one shard. Hence, ordering single-shard transactions for shard S requires running the Pbft protocol among the replicas of S without any synchronization with other shards. For the sake of completeness, we next present, in brief, the single-shard consensus based on the Pbft protocol.
Request. When a client c wants to execute a transaction T, it creates a ⟨T⟩c message and sends it to the primary p of the shard S that has access to the corresponding data.
Pre-prepare. When p receives the message 𝑚 := ⟨T⟩c from the client, it checks if the message is well-formed. If this is the case, p creates a message Preprepare(𝑚, Δ, 𝑘) and broadcasts it to all the replicas of shard S. This Preprepare message includes: (1) the sequence number 𝑘 that specifies the order for this transaction, and (2) the digest Δ = 𝐻(⟨T⟩c) of the client transaction, which will be used in future communication to reduce the data communicated across the network.
Prepare. When a replica r of shard S receives a Preprepare message from its primary, it checks if the message is well-formed. If this is the case, the replica r agrees to support p's order for 𝑚 by sending Prepare(Δ, 𝑘) to all the replicas of S.
Commit. When r receives identical (and well-formed) Prepare messages from at least nf replicas of S, it achieves a weak guarantee that a majority of non-faulty replicas have also agreed to support p's order for 𝑚. Hence, it marks this request as prepared, creates a Commit(Δ, 𝑘) message, and broadcasts this message.
Reply. When r receives identical (and well-formed) Commit messages from at least nf replicas of S, it achieves a strong guarantee that a majority of non-faulty replicas have also prepared this request. Hence, it executes transaction T after all the preceding 𝑘 − 1 transactions have been executed, and replies to the client c.
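The replica-side quorum logic of the Prepare, Commit, and Reply phases above can be condensed into the following illustrative sketch. Message transport, view changes, and authentication checks are omitted; the class and method names are our own, and `nf` is the quorum size from Section 3.

```python
# Condensed sketch of a Pbft replica's quorum counting: reaching nf
# Prepare messages yields the weak guarantee (broadcast Commit), and
# reaching nf Commit messages yields the strong guarantee (execute, reply).

class Replica:
    def __init__(self, nf: int):
        self.nf = nf
        self.prepares = {}    # (digest, k) -> count of Prepare messages
        self.commits = {}     # (digest, k) -> count of Commit messages
        self.prepared = set()
        self.executed = set()

    def on_prepare(self, delta, k):
        key = (delta, k)
        self.prepares[key] = self.prepares.get(key, 0) + 1
        if self.prepares[key] >= self.nf and key not in self.prepared:
            self.prepared.add(key)   # weak guarantee: broadcast Commit
            return "Commit"

    def on_commit(self, delta, k):
        key = (delta, k)
        self.commits[key] = self.commits.get(key, 0) + 1
        if (self.commits[key] >= self.nf and key in self.prepared
                and key not in self.executed):
            self.executed.add(key)   # strong guarantee: execute, reply
            return "Reply"

r = Replica(nf=3)
for _ in range(3):
    r.on_prepare("H(T)", 1)
for _ in range(3):
    out = r.on_commit("H(T)", 1)
print(out)  # "Reply"
```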
4.2 Cross-Shard Consensus: Process and Forward
In this section, we illustrate how RingBFT guarantees consensus of every deterministic cross-shard transaction (cst) in at most two rotations across the ring. To order a cst, RingBFT requires shards to adhere to the ring order, and follow the principle of process, forward, and re-transmit while ensuring the communication between shards is linear. We use the following example to illustrate what we mean by following the ring order.
Example 4.3. Assume a system that comprises four shards S, U, V, and W, where the ring order has been defined as S → U → V → W. Say client c1 wants to process a transaction TS,U,V that requires access to data from shards S, U, and V, and client c2 wants to process a transaction TU,V,W that requires access to data from shards U, V, and W (refer to Figure 3). In this case, client c1 sends its transaction to the primary of shard S, while c2 sends its transaction to the primary of U. On receiving TS,U,V, replicas of S process the transaction and forward it to replicas of U. Next, replicas of U process TS,U,V and forward it to replicas of V. Finally, replicas of V process TS,U,V and send it back to replicas of S, which reply to client c1. A similar flow takes place while ordering transaction TU,V,W.
Although Example 4.3 illustrates RingBFT's design, it is unclear how multiple concurrent cst are ordered in a deadlock-free manner. Specifically, we wish to answer the following questions regarding the design of our RingBFT protocol.
(Q1) Can a shard concurrently order multiple cst?
(Q2) How does RingBFT handle conflicting transactions?
(Q3) Can shards running the RingBFT protocol deadlock?
(Q4) How much communication is required between two shards?
To answer these questions, we first present the transactional flow of a cross-shard transaction undergoing RingBFT consensus, following which we lay down the steps of our RingBFT consensus protocol.
[Figure 3: An illustration of how RingBFT concurrently orders two cross-shard transactions TS,U,V and TU,V,W across four shards. The prescribed ring order is S → U → V → W.]
4.2.1 Cross-shard Transactional Flow. RingBFT assumes shards are arranged in a logical ring. For the sake of explanation, we assume the ring order of lowest to highest identifier. For each cst, we denote one shard as the initiator shard, which is responsible for starting consensus on the client transaction. How do we select the initiator shard? Of all the involved shards a cst accesses, the shard with the lowest identifier in ring order is denoted as the initiator shard.
We also claim that RingBFT guarantees consensus for each deterministic cst in at most two rotations across the ring. This implies that for achieving consensus on a deterministic cst, each involved shard S needs to process it at most two times. Notice that if a cst is simple, then a single rotation around the ring is sufficient to ensure that each involved shard S safely executes its fragment.
Prior to presenting our RingBFT consensus protocol that safely orders each cst, we sketch the flow of a cst in Figure 4. In this figure, we assume a system of four shards: S, U, V, and W, where id(S) < id(U) < id(V) < id(W). The client creates a transaction TS,U,W that requires access to data in shards S, U, and W, and sends this transaction to the primary pS of S. On receiving this transaction, pS initiates the Pbft consensus protocol (local replication) among its replicas. If the local replication is successful, then all the replicas of S lock the corresponding data. This locking of data-items in the ring order helps in preventing deadlocks. Next, replicas of S forward the transaction to replicas of shard U. Notice that only linear communication takes place between replicas of S and U. Hence, to handle any failures, replicas of U share this message among themselves. Next, replicas of U also follow similar steps and forward the transaction to W. As W is the last shard in the ring of involved shards, it goes ahead and executes the cst if all the dependencies are met. Finally, replicas of shards S and U also execute the transaction, and replicas of S send the result of execution to the client.
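The single-rotation flow of a simple cst described above can be sketched as a sequence of per-shard steps. This is an illustrative trace only; the step labels and function names are ours, under the assumption that the involved shards are already listed in ring order.

```python
# Illustrative trace of a simple cst's single rotation: each involved
# shard (in ring order) locally replicates the transaction, locks its
# data, and forwards it to the next involved shard via a linear message;
# the rotation closes back at the initiator, which replies to the client.

def simple_cst_flow(involved):
    """involved: involved shard names already sorted in ring order."""
    log = []
    for i, shard in enumerate(involved):
        log.append(f"{shard}: local replication + lock")
        nxt = involved[(i + 1) % len(involved)]
        log.append(f"{shard} -> {nxt}: forward (linear communication)")
    log.append(f"{involved[0]}: reply to client")
    return log

for step in simple_cst_flow(["S", "U", "W"]):
    print(step)
```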
4.3 Cross-Shard Consensus Algorithm
We use Figure 5 to present RingBFT's algorithm for ordering cross-shard transactions. Next, we discuss these steps in detail.
4.3.1 Client Request. When a client c wants to process a cross-shard transaction T, it creates a ⟨T⟩c message and sends it to the primary of the first shard in ring order. As part of this transaction, the client c specifies the information regarding all the involved shards, such as their identifiers and the necessary read-write sets for each shard.
4.3.2 Client Request Reception. When the primary pS of shard S receives a client request T, it first checks if the message is well-formed. If this is the case, then pS checks if, among the set of involved shards, S is the first shard in ring order. If this condition is met, then pS assigns this request a linearly increasing sequence number 𝑘, calculates the digest Δ, and broadcasts a Preprepare message to all the replicas ℜS of its shard. In the case when S is not the first shard in the ring order, pS forwards the transaction to the primary of the appropriate shard.
4.3.3 Pre-prepare Phase. When a replica r ∈ ℜS receives the Preprepare message from pS, it checks if the request is well-formed. If this is the case, and if r has not agreed to support any other request from pS as the 𝑘-th request, then it broadcasts a Prepare message in its shard S.
4.3.4 Prepare Phase. When a replica r receives identical Prepare messages from nf distinct replicas, it gets an assurance that a majority of non-faulty replicas are supporting this request. At this point, each replica r broadcasts a Commit message to all the replicas in S. Once a transaction passes this phase, the replica r marks it prepared.
4.3.5 Commit and Data Locking. When a replica r receives well-formed identical Commit messages from nf distinct replicas in S, it checks if it also prepared this transaction at the same sequence number. If this is the case, RingBFT requires each replica r to lock all the read-write sets that transaction T needs to access in shard S.
In RingBFT, we allow replicas to process and broadcast Prepare and Commit messages out of order, but require each replica to acquire locks on data in transactional sequence order. This out-of-ordering helps replicas to continuously perform useful work by concurrently participating in the consensus of several transactions. To achieve these tasks, each replica r tracks the maximum sequence number (𝑘max), which indicates the sequence number of the last transaction to lock data. If the sequence number 𝑘 for a transaction T is greater than 𝑘max + 1, we store the transaction in a list 𝜋 until the transaction at 𝑘max + 1 has acquired the locks. Once the (𝑘max + 1)-th transaction has acquired locks, we gradually release transactions from 𝜋 until there is a transaction that wishes to lock already locked data-fragments. We illustrate this through the following example.
Example 4.4. Assume the following notation for four transactions and the data-fragments they access at shard S: T1,𝑎, T2,𝑏, T3,𝑎, and T4,𝑐. For instance, T1,𝑎 implies that the transaction at sequence 1 requires access to data-item 𝑎. Next, due to out-of-order message processing, assume a replica r in S receives nf Commit messages for T2,𝑏, T3,𝑎, and T4,𝑐 before T1,𝑎. Hence, 𝜋 = {T2,𝑏, T3,𝑎, T4,𝑐}. Once r receives nf Commit messages for T1,𝑎, it locks data-item 𝑎 and extracts T2,𝑏 from 𝜋. As T2,𝑏 wishes to lock a distinct data-item, r continues processing T2,𝑏. Next, r moves to T3,𝑎, but it cannot process T3,𝑎 due to lock-conflicts. Hence, it places T3,𝑎 back in 𝜋 and stops processing transactions in 𝜋 until the lock is available for T3,𝑎.
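The lock-ordering logic of Example 4.4 can be sketched as a short, single-replica simulation. This is a minimal illustration, assuming a simplified interface: `Replica`, `on_commit`, and the field names below are ours, not RingBFT's implementation.

```python
class Replica:
    """Minimal sketch of RingBFT's sequence-ordered lock acquisition."""

    def __init__(self):
        self.k_max = 0        # sequence number of the last transaction to lock data
        self.locks = set()    # currently locked data-items
        self.pending = {}     # pi: transactions waiting for their turn (seq -> item)
        self.forwarded = []   # sequence numbers whose data is locked (ready to Forward)

    def on_commit(self, seq, item):
        """Called once this replica holds nf Commit messages for (seq, item)."""
        self.pending[seq] = item
        # Release transactions from pi in sequence order while locks are free.
        while (self.k_max + 1) in self.pending:
            nxt = self.pending[self.k_max + 1]
            if nxt in self.locks:          # lock conflict: place back and wait
                break
            self.locks.add(nxt)
            self.forwarded.append(self.k_max + 1)
            del self.pending[self.k_max + 1]
            self.k_max += 1

# Example 4.4: Commit quorums for T2,b, T3,a, and T4,c arrive before T1,a.
r = Replica()
for seq, item in [(2, "b"), (3, "a"), (4, "c"), (1, "a")]:
    r.on_commit(seq, item)
# T1,a and then T2,b acquire their locks; T3,a conflicts with the lock on "a",
# so T3,a and T4,c remain in pi.
```

Running the example leaves transactions 1 and 2 forwarded and transactions 3 and 4 queued, matching the behavior walked through above.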
Notice that if the client transaction T is a single-shard transaction, it requires access to data-items in only this shard. In such a case, this commit phase is the final phase of consensus, and each replica executes T and replies to the client once the lock for the corresponding data-item is available.
[Figure 4: Representation of the normal-case flow of RingBFT in a system of four shards (S, U, V, W), where the client c sends a cross-shard transaction TS,U,W that requires access to data in three shards: S, U, and W. In Round 1, each involved shard runs local Pbft consensus on TS,U,W (local request, local replication, and local sharing) and forwards it along the ring; in Round 2, the involved shards execute TS,U,W, globally share the results, and the client receives a response.]
4.3.6 Forward to Next Shard via Linear Communication. Once a replica r in S locks the data corresponding to cst T, it sends a Forward message to only one replica q of the next shard in ring order. As one of the key goals of RingBFT is to ensure that the communication between two shards is linear, we design a communication primitive that builds on top of the optimal bound for communication between two shards [37, 44]. We define RingBFT's cross-shard communication primitive as follows:

Linear Communication Primitive. In a system 𝔖 of shards, where each shard S, U ∈ 𝔖 has at most f byzantine replicas, if each replica in shard S communicates with a distinct replica in shard U, then at least f+1 non-faulty replicas from S will communicate with f+1 non-faulty replicas in U.
Our linear communication primitive guarantees that reliably communicating a message 𝑚 between two shards requires only sending a linear number of messages, in comparison to protocols like AHL and Sharper, which require quadratic communication. Using this communication primitive, to communicate a message 𝑚 from shard S to shard U, we need to exchange only n messages.
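The primitive can be sanity-checked by brute force over all placements of byzantine replicas in two shards. This sketch assumes the identifier-matched pairing RingBFT uses (replica i of S contacts replica i of U); the function `good_links` is ours.

```python
from itertools import combinations

def good_links(n, byz_S, byz_U):
    """Replica i of S sends to replica i of U (matched identifiers).
    Count links whose endpoints are both non-faulty."""
    return sum(1 for i in range(n) if i not in byz_S and i not in byz_U)

# Check the primitive for the smallest shard size: n = 3f + 1 with f = 1.
n, f = 4, 1
worst = min(
    good_links(n, set(bs), set(bu))
    for bs in combinations(range(n), f)
    for bu in combinations(range(n), f)
)
# Even for the worst placement of byzantine replicas in both shards,
# at least n - 2f = f + 1 links connect two non-faulty replicas.
assert worst >= f + 1
```

The worst case is achieved when the byzantine sets of the two shards are disjoint, leaving exactly n − 2f = f + 1 fully non-faulty links.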
So, how does RingBFT achieve this task? We require each replica of S to initiate communication with the replica of U having the same identifier. Hence, replica r of shard S sends a Forward message to replica q in shard U such that id(r) = id(q). By transmitting a Forward message, r is requesting q to initiate consensus on Tc. For q to support such a request, it needs a proof that Tc was successfully ordered in shard S. Hence, r includes the DS on Commit messages from nf distinct replicas (Figure 5, Line 16).
4.3.7 Execution and Final Rotation. Once a client request has been ordered on all the involved shards, we call it one complete rotation around the ring. This is a significant event because it implies that all the necessary data-fragments have been locked by each of the involved shards. If a cst is simple, then each shard can independently execute its fragment without any further communication between the shards. In the case a cst is complex, at the end of the first rotation, the replicas of the first shard in ring order (S) will receive a Forward message from the replicas of the last shard in ring order. Next, the replicas of S will attempt to execute the parts of the transaction that are their responsibility. Post execution, replicas of S send Execute messages to the replicas of the next shard using our communication primitive. Notice that the Execute message includes updated write sets (Σ), which help in resolving any dependencies during execution. Finally, when the execution is completed across all the shards, the first shard in ring order replies to the client.
5 UNCIVIL EXECUTIONS
In previous sections, we discussed transactional flows under the assumption that the network is stable and replicas follow the stated protocol. However, any Byzantine Fault-Tolerant protocol should provide safety under asynchronous settings and liveness in periods of synchrony, even if up to f replicas are byzantine. RingBFT offers safety in an asynchronous environment. To guarantee liveness during periods of synchrony, RingBFT offers several recovery protocols, such as checkpoint, retransmission, and view-change, to counter malicious attacks. The first step in recovery against any attack is detection. To do so, we require each replica r to employ a set of timers. When a timer at a replica r expires, then
Initialization:
// 𝑘max := 0 (maximum sequence number in shard S)
// Σ := ∅ (set of data-fragments of each shard)
// 𝜋 := ∅ (list of pending transactions at a replica)

Client-role (used by client c to request transaction T):
1: Send Tc to the primary pS of shard S.
2: Await receipt of messages Response(⟨Tc⟩, 𝑘, 𝑟) from f+1 replicas of S.
3: Consider T executed, with result 𝑟, as the 𝑘-th transaction.

Primary-role (running at the primary pS of shard S):
4: event pS receives Tc do
5:   if id(S) = FirstInRingOrder() then
6:     Calculate digest Δ := 𝐻(⟨Tc⟩).
7:     Broadcast Preprepare(⟨Tc⟩, Δ, 𝑘) in shard S (order at sequence 𝑘).
8:   else
9:     Send Tc to the primary pU of shard U, where U ∈ 𝔖 and id(U) = FirstInRingOrder().

Non-Primary Replica-role (running at the replica r of shard S):
10: event r receives Preprepare(⟨Tc⟩, Δ, 𝑘) from pS such that:
      the message is well-formed, and r did not accept a 𝑘-th proposal from pS do
11:   Broadcast Prepare(Δ, 𝑘) to replicas in S.

Replica-role (running at the replica r of shard S):
12: event r receives well-formed Prepare(Δ, 𝑘) messages from nf replicas in S do
13:   Broadcast ⟨Commit(Δ, 𝑘)⟩r to replicas in S.
14: event r receives nf messages 𝑚 := ⟨Commit(Δ, 𝑘)⟩q such that:
      each message 𝑚 is well-formed and is sent by a distinct replica q ∈ S do
15:   Let U be the shard to forward to, such that id(U) = NextInRingOrder().
16:   𝐴 := set of DS of these nf messages.
17:   if 𝑘 = 𝑘max + 1 then  // Forward to next shard.
18:     Lock the data-fragment corresponding to Tc.
19:     Send ⟨Forward(⟨Tc⟩, 𝐴, 𝑚, Δ)⟩r to replica o, where o ∈ U and id(r) = id(o).
20:   else
21:     Store ⟨Forward(⟨Tc⟩, 𝐴, 𝑚, Δ)⟩r in 𝜋.
22:   while 𝜋 ≠ ∅ do  // Pop out waiting transactions.
23:     Extract the transaction at 𝑘max + 1 from 𝜋 (if any).
24:     if the corresponding data-fragment is not locked then
25:       𝑘max := 𝑘max + 1
26:       Follow lines 18 and 19.
27:     else
28:       Store the extracted transaction back in 𝜋 and exit the loop.

// Locally share any message from the previous shard.
29: event r receives message 𝑚 := ⟨message-type⟩q such that:
      𝑚 is well-formed and sent by replica q, where
      id(U) = PrevInRingOrder(), q ∈ U, and id(r) = id(q) do
30:   Broadcast 𝑚 to all replicas in S.

// Forward message from the previous shard.
31: event r receives f+1 messages 𝑚 := ⟨Forward(⟨Tc⟩, 𝐴, 𝑚, Δ)⟩q such that:
      each 𝑚 is well-formed, and set 𝐴 includes valid DS from nf replicas for 𝑚 do
32:   if the data-fragment corresponding to Tc is locked then  // Second rotation.
33:     Execute the data-fragment of Tc and add it to the log.
34:     Push the result to set Σ.
35:     Release the locks on the corresponding data-fragment.
36:     Let V be the shard to forward to, such that id(V) = NextInRingOrder().
37:     Send ⟨Execute(Δ, Σ)⟩r to replica o, where o ∈ V and id(r) = id(o).
38:   else if r = pS then  // Primary initiates consensus.
39:     Broadcast Preprepare(⟨Tc⟩, Δ, 𝑘) in shard S (order at sequence 𝑘).

40: event r receives 𝑚 := ⟨Execute(Δ, Σ)⟩q such that:
      𝑚 is sent by replica q, where q ∈ U and id(r) = id(q) do
41:   if Tc is already executed then  // Reply to client.
42:     Send client c the result 𝑟.
43:   else
44:     Follow lines 33 to 37.

Figure 5: The normal-case algorithm of RingBFT.
r initiates an appropriate recovery mechanism. Specifically, each replica r sets the following timers:

• Local Timer – to track successful replication of a transaction in its shard.
• Transmit Timer – to re-transmit a successfully replicated cross-shard transaction to the next shard.
• Remote Timer – to track replication of a cross-shard transaction in the previous shard.

Each of these timers is initiated at the occurrence of a distinct event, and its timeout leads to running a specific recovery mechanism. When a local timer expires, the corresponding replica initiates replacement of the primary of its shard (view-change), while a remote timer timeout requires the replica to inform the previous shard in ring order about the insufficient communication. This brings us to the following observation regarding the consensus offered by RingBFT:

Proposition 5.1. If the network is reliable and the primary of each shard is non-faulty, then the byzantine replicas in the system cannot affect the consensus protocol.
Notice that Proposition 5.1 holds implicitly, as no step in Figure 5 depends on the correct working of non-primary byzantine replicas; in each shard S, local replication of each transaction is managed by the primary of S, and communication between any two shards S and U involves all the replicas. This implies that we need to only consider cases where the network is unreliable and/or the primary is byzantine. We know that RingBFT guarantees safety even under unreliable communication and requires a reliable network only for assuring liveness. Hence, we will illustrate mechanisms to tackle attacks by byzantine primaries. Next, we illustrate how RingBFT resolves all the possible attacks it encounters.
(A1) Client Behavior and Attacks. In the case the primary is byzantine and/or the network is unreliable, the client is the key entity at loss. The client requested the primary to process its transaction, but due to an ongoing byzantine attack, the client did not receive sufficient responses. Clearly, the client cannot wait indefinitely to receive valid responses. Hence, we require each client c to start a timer when it sends its transaction T to the primary pS of shard S. If the timer expires prior to c receiving at least f+1 identical responses, c broadcasts T to all the replicas r ∈ S of shard S.

When a non-primary replica r receives a transaction from c, it forwards that transaction to pS and waits on a timer for pS to initiate consensus on T. During this time, r expects pS to start consensus on at least one transaction from c; otherwise, it initiates the view-change protocol. Notice that a byzantine client can always forward its request to all the replicas of some shard to blame a non-faulty primary. Such an attack will not succeed: if c sends to r an already executed request, r simply replies with the stored response. Moreover, if r belongs to some shard S that is not the first shard in ring order, then r ignores the client transaction.
(A2) Faulty Primary and/or Unreliable Network. A faulty primary can prevent successful consensus of a client transaction. Such a primary can be trivially detected, as at most f non-faulty replicas would have successfully committed the transaction (received at least nf Commit messages).

An unreliable network can cause messages to get lost or indefinitely delayed. Such an attack is difficult to detect, and non-faulty replicas may blame the primary.

Each primary represents a view of a shard. Hence, the term view-change is often used to imply primary replacement. Notice that each shard in RingBFT is a replicated system. Further, RingBFT is a meta-protocol, which employs existing Bft protocols, such as Pbft, to run consensus. These properties allow RingBFT to use the accompanying view-change protocol. Specifically, in this paper, we use Pbft's view-change protocol (for MAC-based authentication) to detect and replace a faulty primary [12].

A replica r ∈ S initiates the view-change protocol to replace its primary pS in response to a timeout. As discussed earlier in this section, there are two main causes for such timeouts: (i) r does not receive nf identical Commit messages from distinct replicas, and (ii) pS fails to propose a request from client c.
(A3) Malicious Primary. A malicious primary p can ensure that up to f non-faulty replicas in its shard S are unable to make progress (kept in the dark). Under such conditions, the affected non-faulty replicas will request a view-change, but they will not be successful, as the next primary may not receive sufficient ViewChange messages (from at least nf replicas) to initiate a new view. Further, the remaining f+1 non-faulty replicas will not support such ViewChange requests, as it is impossible for them to distinguish between this set of f non-faulty replicas and the actual f byzantine replicas.

To ensure these replicas in the dark make progress, traditional protocols periodically send checkpoint messages. These checkpoint messages include all client transactions and the corresponding nf Commit messages since the last checkpoint.
5.1 Cross-Shard Attacks
Until now, we have discussed attacks that can be resolved by replicas of any shard independent of the functioning of other shards. However, the existence of cross-shard transactions unravels new attacks, which may span multiple shards. We use the term cross-shard attacks to denote attacks that thwart the successful consensus of a cst. First, we describe such attacks, and then we present solutions to recover from them.

In RingBFT, we know that the consensus of each cst follows a ring order. Specifically, for a cross-shard transaction T, each of its involved shards S, U ∈ 𝔖 first runs a local consensus and then communicates the data to the next shard in ring order. Earlier in this section, we observed that if at least f+1 non-faulty replicas of any shard are unable to reach consensus on T, then that shard will undergo a local view-change. Hence, we are interested in those cross-shard attacks where the involved shards are neither able to trigger a local view-change by themselves, nor able to execute the transaction and reply to the client. This can only occur when all the involved shards of a cross-shard transaction T have either successfully completed consensus on T, or are unable to initiate consensus on T. Next, we describe these attacks.
Assume 𝑆 and 𝑈 represent the sets of replicas in shards S and U, respectively.

(C1) No Communication. Under a no-communication attack, the replicas in 𝑆 are unable to send any messages to the replicas of 𝑈.

(C2) Partial Communication. Under a partial-communication attack, at least f+1 replicas in 𝑈 receive fewer than f+1 Forward messages from replicas in 𝑆.
Both of these attacks could occur solely due to an unreliable network that causes message loss or indefinite message delays. Further, a malicious primary can collude with an adversarial network to increase the frequency of such attacks. In either case, to
Replica-role (running at the replica q of shard U):
1: event the remote timer of q expires such that:
     q has received at most f ⟨Forward(⟨Tc⟩, 𝐴, 𝑚, Δ)⟩r messages from replicas r ∈ S, where id(S) = PrevInRingOrder() do
2:   Send ⟨RemoteView(⟨Tc⟩, Δ)⟩q to replica o, where o ∈ S and id(q) = id(o).
3: event r receives message 𝑚 := ⟨RemoteView(⟨Tc⟩, Δ)⟩q such that:
     𝑚 is well-formed and sent by replica q, where
     id(U) = NextInRingOrder(), q ∈ U, and id(r) = id(q) do
4:   Broadcast 𝑚 to all replicas in S.
5: event r receives f+1 ⟨RemoteView(⟨Tc⟩, Δ)⟩q messages do
6:   Initiate the local view-change protocol.
Figure 6: The remote view-change algorithm of RingBFT.
recover from such cross-shard attacks, all the involved shards may need to communicate among themselves.

5.1.1 Message Retransmission. In RingBFT, to handle a no-communication attack, the affected replicas of the preceding shard retransmit their original message to the next shard in ring order. Specifically, when a replica r of shard S successfully completes the consensus on transaction T, it sets the transmit timer for this request prior to sending the Forward message to replica q of shard U (the next shard in ring order). When the transmit timer of r expires, it again sends the Forward message to q.
5.1.2 Remote View Change. A partial-communication attack could be due to either a byzantine primary or an unreliable network. If the primary pS of shard S is byzantine, then it can ensure that at most f non-faulty replicas locally replicate a cross-shard transaction T that accesses S and U. As a result, replicas of the next shard U will receive at most f Forward messages. Another case is where the network is unreliable, and under such conditions, replicas of U may again receive at most f Forward messages.
From Figure 5, we know that when replica q of shard U receives a Forward message from replica r of shard S such that id(r) = id(q), then q broadcasts this Forward message to all the replicas in U. At this point, RingBFT also requires replica q to start the remote timer. If any replica q in shard U does not receive identical Forward messages from f+1 distinct replicas of shard S prior to the timeout of its remote timer, then q detects a cross-shard attack and sends a RemoteView message to the replica r of shard S, where id(r) = id(q). Following this, r broadcasts the received RemoteView message to all the replicas in S. Finally, when any replica r of shard S receives RemoteView messages from f+1 replicas of U, it supports the view-change request and initiates the view-change protocol. We illustrate this process in Figure 6.
Triggering of Timers. In RingBFT, we know that for each cross-shard transaction, each replica r of S sets three distinct timers. Although each timer helps in recovering against a specific attack, there needs to be an order in which they expire. As local timers lead to detecting a local malicious primary, we expect a local timer to have the shortest duration. Further, a remote timer helps to detect a lack of communication, due to which it has a longer duration than the local timer. Similarly, we require the duration of the transmit timer to be the longest.
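Only the relative ordering of the three timeouts matters; a small sketch with assumed, purely illustrative durations (the paper fixes the ordering local < remote < transmit, not concrete values):

```python
# Illustrative durations in seconds (assumptions, not values from the paper).
LOCAL, REMOTE, TRANSMIT = 1.0, 3.0, 9.0
assert LOCAL < REMOTE < TRANSMIT  # the ordering RingBFT requires

def triggered(elapsed):
    """Recovery mechanisms whose timers have expired after `elapsed` seconds
    without progress on a cross-shard transaction."""
    timers = [
        (LOCAL, "local view-change (replace primary)"),
        (REMOTE, "remote view-change request to previous shard"),
        (TRANSMIT, "retransmission of Forward to next shard"),
    ]
    return [name for duration, name in timers if elapsed >= duration]

# The local timer is always the first to fire.
print(triggered(2.0))
```

With these assumed durations, after 2 seconds only the local view-change has been triggered; all three mechanisms fire only if the stall persists past the transmit timeout.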
6 RINGBFT GUARANTEES
We now prove the safety, liveness, and no-deadlock guarantees provided by our RingBFT protocol. First, we prove the correctness of consensus for single-shard transactions.

Proposition 6.1. Let r𝑖, 𝑖 ∈ {1, 2}, be two non-faulty replicas in shard S that committed to ⟨T𝑖⟩c𝑖 as the 𝑘-th transaction sent by p. If n > 3f, then ⟨T1⟩c1 = ⟨T2⟩c2.
Proof. Replica r𝑖 only committed to ⟨T𝑖⟩c𝑖 after r𝑖 received identical Commit(Δ, 𝑘) messages from nf distinct replicas in S. Let 𝑋𝑖 be the set of these nf replicas and 𝑌𝑖 = 𝑋𝑖 \ F be the non-faulty replicas in 𝑋𝑖. As |F| = f, we have |𝑌𝑖| ≥ nf − f. We know that each non-faulty replica only supports one transaction from primary p as the 𝑘-th transaction, as it sends only one Prepare message. This implies that, if ⟨T1⟩c1 ≠ ⟨T2⟩c2, the sets 𝑌1 and 𝑌2 must not overlap. Hence, |𝑌1 ∪ 𝑌2| ≥ 2(nf − f). As 𝑌1 ∪ 𝑌2 contains only non-faulty replicas, |𝑌1 ∪ 𝑌2| ≤ n − f = nf, and the above inequality simplifies to 3f ≥ n, which contradicts n > 3f. Thus, we conclude ⟨T1⟩c1 = ⟨T2⟩c2.
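The quorum-intersection argument above can be checked mechanically for the smallest configuration. This brute-force sketch is illustrative (the names `nf` and `min_common_honest` are ours), not part of RingBFT.

```python
from itertools import combinations

# Check Proposition 6.1 for n = 3f + 1 with f = 1: any two Commit quorums of
# size n - f share at least one non-faulty replica, so two non-faulty replicas
# cannot commit different k-th transactions.
n, f = 4, 1
nf = n - f              # quorum size
replicas = set(range(n))

min_common_honest = min(
    len((set(X1) & set(X2)) - set(faulty))
    for faulty in combinations(replicas, f)   # every possible byzantine set
    for X1 in combinations(replicas, nf)      # every quorum seen by r1
    for X2 in combinations(replicas, nf)      # every quorum seen by r2
)
# Every pair of quorums intersects in at least one honest replica.
assert min_common_honest >= 1
```

For n = 4, f = 1, any two quorums of size 3 overlap in at least 2 replicas, of which at least one is honest; the exhaustive minimum is exactly 1.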
Theorem 6.2. No Deadlock: In a system 𝔖 of shards, where S, U ∈ 𝔖 and S ≠ U, no two replicas r ∈ S and q ∈ U that order two conflicting transactions T1 and T2, both of which access data in S and U, will execute T1 and T2 in different orders.
Proof. We know that RingBFT associates an identifier with each shard and uses this identifier to define a ring order. Let id(S) < id(U), and let the ring order be defined from lowest to highest identifier. Assume that the conflicting transactions T1 and T2, both of which access S and U, are in a deadlock at these shards. This implies that each non-faulty replica r ∈ S has locked some data-item for T1 that is required by T2, while each non-faulty replica q ∈ U has locked some data-item for T2 that is required by T1, or vice versa.

As each transaction T𝑖, 𝑖 ∈ {1, 2}, accesses S and U in ring order, each transaction T𝑖 was initiated at S. This implies that the primary of S would have assigned these transactions distinct sequence numbers 𝑘𝑖, 𝑖 ∈ {1, 2}, such that 𝑘1 < 𝑘2 or 𝑘1 > 𝑘2 (𝑘1 = 𝑘2 is not possible, as it would be detected as a byzantine attack). During the commit phase, each replica r will put the transaction with the larger sequence number in the 𝜋 list (Figure 5, Line 23), while the transaction with the smaller sequence number locks the corresponding data-item and is forwarded to the next shard U. The transaction present in the 𝜋 list is only extracted once the data-item is unlocked. Hence, we have a contradiction; that is, shards S and U will not suffer a deadlock.
Theorem 6.3. Safety: In a system 𝔖 of shards, where each shard S ∈ 𝔖 has at most f byzantine replicas, each replica r follows the Involvement, Non-divergence, and Consistence properties. Specifically, all the replicas of S execute each transaction in the same order, and every conflicting cross-shard transaction is executed by all the replicas of all the involved shards in the same order.
Proof. Using Proposition 6.1, we have already illustrated that RingBFT safely replicates a single-shard transaction, despite a malicious primary and/or unreliable network. Specifically, any non-faulty replica r ∈ S will only commit a single-shard transaction if it receives Commit messages from nf distinct replicas in S. When a non-faulty replica receives fewer than nf Commit messages, eventually its local timer will expire and it will participate in the view-change protocol. Post the view-change protocol, any request that was committed by at least one non-faulty replica will persist across views.

Similarly, we can show that each cross-shard transaction is also safely replicated across all replicas of all the involved shards. In RingBFT, each cross-shard transaction T is processed in ring order by all the involved shards. Let S, U ∈ 𝔖 with id(S) < id(U), and let the ring order be based on lowest to highest identifier. Replicas of shard U will only start consensus on T if they receive Forward messages from f+1 distinct replicas of S. Further, each of these Forward messages includes DS from nf distinct replicas of S on identical Commit messages corresponding to T, which guarantees that T was replicated in S. If the network is unreliable and/or the primary of shard S is byzantine, then replicas of U will receive fewer than f+1 Forward messages. In such a case, either the remote timer at the replicas of U will expire, or one of the two timers (local timer or transmit timer) at the replicas of S will expire. In any case, following the specific recovery procedure, the replicas of U will receive a sufficient number of Forward messages.
Theorem 6.4. Liveness: In a system 𝔖 of shards, where each shard S ∈ 𝔖 has at most f byzantine replicas, if the network is reliable, then each replica r follows the Involvement and Termination properties. Specifically, all the replicas continue making progress, and good clients continue receiving responses for their transactions.
Proof. In the case of a single-shard transaction, if the primary is non-faulty, then each replica will continue processing client transactions. If the primary is faulty and prevents a request from replicating by allowing at most f replicas to receive Commit messages, then such a primary will be replaced through the view-change protocol, following which a new primary will ensure that the replicas continue processing subsequent transactions. Notice that there can be at most f such faulty primaries, so the system will eventually make progress. If the primary is malicious, then it can keep up to f non-faulty replicas in the dark, which will continue making progress through periodic checkpoints.

In the case of a cross-shard transaction, there is nothing extra that a faulty primary pS can do beyond preventing local replication of the transaction. If pS does that, then, as discussed above, pS will be replaced. Further, during any communication between two shards, the primary has no extra advantage over other replicas in the system. Moreover, the existence of the transmit and remote timers helps the replicas of all the involved shards keep track of any malicious act by primaries.
7 DESIGN & IMPLEMENTATION
RingBFT aims to scale permissioned blockchains to hundreds of replicas through efficient sharding. To argue the benefits of our RingBFT protocol, we need to first implement it in a permissioned blockchain fabric. For this purpose, we employed a state-of-the-art permissioned blockchain fabric, ResilientDB [33, 34, 36–38, 40, 67]. In our prior works, we illustrated how ResilientDB offers an optimal system-centric design that eases implementing novel Bft
[Figure 7: The parallel-pipelined architecture provided by the ResilientDB fabric for efficiently implementing RingBFT. Input threads receive client requests and consensus messages from the network and place them into dedicated work queues (client requests; prepare and commit; commit certificate; checkpoint), which are drained by batching, certify, worker, checkpoint, and execute threads; output threads send the resulting messages back to the network.]
consensus protocols. Further, ResilientDB presents an architecture that allows even classical protocols like Pbft to achieve high throughputs and low latencies.

In this section, we briefly describe ResilientDB's architecture and explain the design decisions we took to implement RingBFT.

Network Layer. ResilientDB provides a network layer to manage communication among clients and replicas. The network layer provides TCP/IP capabilities through Nanomsg-NG to communicate messages. To facilitate uninterrupted processing of millions of messages, at each replica, ResilientDB offers multiple input and output threads to communicate with the network.

Pipelined Consensus. Once a message is received from the network, the key challenge is to process it efficiently. If all the ensuing consensus tasks were performed sequentially, the resulting system output would be abysmally low. Moreover, such a system would be unable to utilize the available computational and network capabilities. Hence, ResilientDB associates with each replica a parallel-pipelined architecture, which we illustrate in Figure 7.

When an input thread receives a message from the network, it places it in a specific work queue based on the type of the message. As depicted in Figure 7, ResilientDB provides dedicated threads for processing each type of message.
Blockchain. To securely record each successfully replicated transaction, we also implement an immutable ledger, the blockchain. For systems running fully-replicated Bft consensus protocols like Pbft and Zyzzyva, the blockchain is maintained as a single linked-list of all transactions, where each replica stores a copy of the blockchain. However, in the case of sharding protocols like RingBFT, each shard maintains its own blockchain. As a result, no single shard can provide the complete state of all the transactions. Hence, we refer to the ledger maintained at each shard as a partial blockchain.
Let 𝔖 be the system of z = |𝔖| shards. Say we use the representation S1, S2, ..., S𝑖 ∈ 𝔖, where 1 ≤ 𝑖 ≤ z, to denote the shards in 𝔖. In this sharded system, we represent the blockchain ledger maintained by replicas of S𝑖 as 𝔏S𝑖. Hence, the complete state of the system can be expressed as:

    𝔏S1 ∪ 𝔏S2 ∪ ... ∪ 𝔏S𝑖 ∪ ... ∪ 𝔏S𝑧    (1)

Further, we know that each ledger 𝔏S𝑖 is a linked list of blocks:

    𝔏S𝑖 = {𝔅1, 𝔅2, ..., 𝔅𝑘}    (2)

where chaining is guaranteed by requiring each block to include the hash of the previous block:

    𝔅𝑘 = {𝑘, Δ, pS𝑖, 𝐻(𝔅𝑘−1)}    (3)
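The chaining in Equation (3) can be sketched as follows. This is a minimal illustration: `append_block`, the tuple layout, and the "genesis" sentinel are our simplifications, not ResilientDB's storage format.

```python
import hashlib

def block_hash(block):
    """Hash a block tuple (k, digest, primary, prev_hash) deterministically."""
    return hashlib.sha256(repr(block).encode()).hexdigest()

def append_block(ledger, k, digest, primary):
    """Append block B_k = {k, digest, primary, H(B_{k-1})} to a partial ledger."""
    prev = block_hash(ledger[-1]) if ledger else "genesis"
    ledger.append((k, digest, primary, prev))

# A partial blockchain for one shard, starting from an agreed genesis sentinel.
ledger_S1 = []
append_block(ledger_S1, 1, "batch-1-digest", "p_S1")
append_block(ledger_S1, 2, "batch-2-digest", "p_S1")
# Block 2 stores H(B_1), so tampering with block 1 breaks the chain.
assert ledger_S1[1][3] == block_hash(ledger_S1[0])
```

Because each block embeds the hash of its predecessor, verifying a suffix of the chain transitively commits to every earlier block.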
In ResilientDB, for ecient processing, we follow existing liter-
ature and require the primary
pS𝑖
of shard
S𝑖
to aggregate transac-
tions in a batch and perform consensus on this batch. Hence, each
𝑘
-th block
𝔅𝑘
in
𝔏S𝑖
represents a batch of transactions that replicas
of
S𝑖
successfully committed at sequence
𝑘
. Note: we expect each
block to include all the transactions that access the same shards.
If a block includes cross-shard transactions, then such a block is
appended to the ledger of all the involved shards
. In specic, if a
block
𝔅
includes a transaction
T
, such that
S𝑖,S𝑗
, then
𝔅
𝔏S𝑖
and
𝔅𝔏S𝑗
. Notice that the order in which these blocks appear
in each individual chain can be dierent. However, if two blocks
𝔅𝑥
and
𝔅𝑦
include conicting transactions that access intersecting
set of shards, and consensus on
𝔅𝑥
happens before
𝔅𝑦
, then in each
ledger 𝔅𝑥is appended before 𝔅𝑦.
Depending on the choice of storage, each block can include
either all the transactional information or the Merkle Root [60] of
all transactions in the block. A Merkle Root (
Δ
) helps to optimize the
size of each block, and is generated by assuming all the transactions
in a batch as leaf nodes, followed by a pair-wise hashing up till the
root. To initialize each blockchain, every replica adds an agreed
upon dummy block termed as the genesis block [40].
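The pair-wise hashing that produces a Merkle root can be sketched as follows; the odd-leaf handling shown (promoting the last node) is one common convention and an assumption here, not a detail fixed by the paper.

```python
import hashlib

def merkle_root(txns):
    """Compute a Merkle root by pair-wise hashing leaf transactions up to the root."""
    level = [hashlib.sha256(t.encode()).hexdigest() for t in txns]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest())
        if len(level) % 2:          # odd leaf: carried up unchanged (assumed convention)
            nxt.append(level[-1])
        level = nxt
    return level[0]

root = merkle_root(["tx1", "tx2", "tx3", "tx4"])
# Changing any single transaction changes the root.
assert root != merkle_root(["tx1", "tx2", "tx3", "tx5"])
```

A block can then carry this single 32-byte digest instead of the full batch, while any replica holding the transactions can recompute and verify it.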
8 EVALUATION
In this section, we evaluate our RingBFT protocol. To do so, we deploy ResilientDB on the Google Cloud Platform (GCP) in fifteen regions across five continents, namely: Oregon, Iowa, Montreal, Netherlands, Taiwan, Sydney, Singapore, South Carolina, North Virginia, Los Angeles, Las Vegas, London, Belgium, Tokyo, and Hong Kong. In any experiment involving fewer than 15 shards, the shards are chosen in the order we have listed them above. We deploy each replica on a 16-core N1 machine having Intel Broadwell CPUs with a 2.2 GHz clock and 32 GB RAM. For deploying clients, we use the 4-core variants having 16 GB RAM. For each experiment, we equally distribute the clients across all regions.

Benchmark. To provide a workload for our experiments, we use the Yahoo Cloud Serving Benchmark (YCSB) [15, 23]. Each client transaction queries a YCSB table with an active set of 600k records. For our evaluation, we use write queries, as the majority of blockchain transactions are updates to existing records. Prior to each experiment, each replica initializes an identical copy of the YCSB table. During our experiments, each client transaction follows a uniform Zipfian distribution.
Existing Protocols. In all our experiments, we compare the performance of RingBFT against two other state-of-the-art sharding Bft protocols, AHL [19] and Sharper [4]. In Section 2, we highlighted the key properties of these protocols. Like RingBFT, both AHL and Sharper employ Pbft to achieve consensus on single-shard transactions. Hence, all three protocols have identical implementations for replicating single-shard transactions. For achieving consensus on cross-shard transactions, we follow the respective algorithms and modify ResilientDB appropriately.
WAN Bandwidth and Round-Trip Costs. As the majority of experiments take place in a geo-scaled WAN environment spanning multiple countries, the available bandwidth and round-trip costs between two regions play a crucial role. Prior works [2, 37] have illustrated that if the available bandwidth is low and round-trip costs are high, then protocols dependent on a subset of replicas face performance degradation. In the case of AHL, the reference committee is responsible for managing cross-shard consensus, while for Sharper, the primary of the coordinating shard leads the cross-shard consensus. Hence, both of these protocols observe low throughputs and high latencies in proportion to the available bandwidth and round-trip costs. Although RingBFT requires cross-shard communication in the form of Forward and Execute messages, the system is comparably less burdened, as all the replicas participate equally in this communication process.
Standard Seings.
Unless explicitly stated, we use the following
settings for all our experiments. We run with a mixture of single-
shard and cross-shard transactions, of which 30% are cross-shard
transaction. Each cross-shard accesses all the 15 regions, and in each
shard we deploy 28 replicas, that is, a total of 420 globaly distributed
replicas. In these experiments, we allow up to 50K clients to send
transactions. Further, we require clients and replicas to employ
batching, and create batches of transactions of size 100.
The size of dierent messages communicated during RingBFT
consensus is:
Preprepare
(5408B),
Prepare
(216B),
Commit
(269B),
Forward (6147B), Checkpoint (164B), and Execute (1732B)
Through our experiments, we want to answer the following:
(Q1) What is the effect of increasing the number of shards on consensus provided by RingBFT?
(Q2) How does varying the number of replicas per shard affect the performance of RingBFT?
(Q3) What is the impact of increasing the percentage of cross-shard transactions on RingBFT?
(Q4) How does batching affect the system performance?
(Q5) What is the effect of varying the number of involved shards in a cross-shard transaction on RingBFT?
(Q6) What is the impact of varying the number of clients on consensus provided by RingBFT?
(Q7) How do a faulty primary and view change affect the performance of RingBFT?
8.1 Scaling Number of Shards
For our first set of experiments, we study the effect of scaling the
number of shards. Specifically, we deploy 3, 5, 7, 9, 11, and 15
shards, and require clients to send cross-shard transactions that
access all of these shards, while keeping other parameters at the
standard setting. We use Figures 8 (I) and (II) to illustrate the
throughput and latency metrics.
RingBFT achieves 4× higher throughput than AHL and Sharper in the
15-shard setting. An increase in the number of shards only increases
the length of the ring while keeping the amount of communication
between two shards constant. As a result, for RingBFT, we observe an
increase in latency, as there is an increase in the time to go around
the ring. From 3 shards to 15 shards, the latency increases from 1.17s
to 6.82s. Notice that the throughput for RingBFT is nearly constant,
since the size of shards and the amount of communication among shards
are constant; moreover, an increase in the number of shards increases
the number of shards that can perform consensus on single-shard
transactions in parallel. Although increasing the number of shards
proportionally increases the number of involved shards per transaction,
the linear communication pattern of RingBFT prevents throughput
degradation.
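The latency trend above is close to linear in the ring length. The back-of-the-envelope fit below is our own illustration (not an analysis from the paper), using only the two endpoints reported in this section: 1.17s at 3 shards and 6.82s at 15 shards.

```python
# Rough linear fit of the latency numbers reported above: RingBFT's
# latency grows with ring length while its throughput stays nearly flat.
shards = [3, 15]
latency = [1.17, 6.82]  # seconds, endpoints reported in the experiment

# Slope: extra latency per additional shard on the ring.
per_shard = (latency[1] - latency[0]) / (shards[1] - shards[0])
base = latency[0] - per_shard * shards[0]

print(f"~{per_shard:.2f}s of latency per additional shard in the ring")
for s in [3, 5, 7, 9, 11, 15]:
    print(f"{s:2d} shards -> predicted latency ~{base + per_shard * s:.2f}s")
```

The slope comes out to roughly 0.47s per shard, consistent with one extra rotation hop per added shard; the intermediate points (5, 7, 9, 11 shards) are interpolations under this assumed linear model, not measured values.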
In the case of AHL, the consensus on cross-shard transactions is led by
the reference committee, which essentially centralizes the
communication and affects the system performance. In contrast, Sharper
scales better because there is no single reference committee leading
all cross-shard consensuses. However, even Sharper sees a fall in
throughput due to two rounds of communication between all replicas of
all the involved shards. For a system where all the shards are globally
scattered, quadratic communication complexity and communication between
all the shards impact the scalability of the system.
8.2 Scaling Number of Replicas per Shard
We now study the effects of varying different parameters within a
single shard. Our next set of experiments aims to increase the amount
of replication within a single shard. Specifically, we allow each shard
to have 10, 16, 22, and 28 replicas. We use Figures 8 (III) and (IV) to
illustrate the throughput and latency metrics.
These plots reaffirm our theory that RingBFT ensures up to 16× higher
throughput and 11× lower latency than the other two protocols. As the
number of replicas in each shard increases, there is a corresponding
decrease in throughput for RingBFT. This decrease is not surprising
because RingBFT employs the Pbft protocol for local replication, which
necessitates two phases of quadratic communication complexity. This in
turn increases the size (and as a result the cost) of Forward messages
communicated between shards.
In the case of AHL, the existence of the reference committee acts as a
performance bottleneck to the extent that 30% cross-shard transactions
involving all the 15 shards outweigh the benefits of reduced
replication (10 or 16 replicas). Sharper also observes a drop in its
performance due to its reliance on Pbft, and is unable to scale at
smaller configurations due to expensive all-to-all communication
between the replicas of the involved shards.
To summarize: RingBFT achieves up to 4× and 16× higher throughput than
Sharper and AHL, respectively.
8.3 Varying Percentage of Cross-Shard Txns.
For our next study, we allow client workloads to have 0%, 5%, 10%, 15%,
30%, 60%, and 100% cross-shard transactions. We use Figures 8 (V) and
(VI) to illustrate the throughput and latency metrics.
When the workload contains 0% cross-shard transactions, it simply
indicates a system where all the transactions access only one shard. In
this case, all three protocols attain the same throughput and latency,
as all of them employ Pbft for reaching consensus on single-shard
transactions. They achieve 1.2 million txn/s throughput among 500 nodes
in 15 globally distributed regions. With a small (5%) introduction of
cross-shard transactions in the workload, there is a significant
decrease for all the protocols. The amount of decrease is in accordance
with the reasons we discussed in previous sections. However, RingBFT
continues to outperform the other protocols. In the extreme case of a
100% cross-shard workload, RingBFT achieves 4× and 18× higher
throughput, and 3.3× and 7.8× lower latency, than Sharper and AHL,
respectively.
Figure 8: Measuring system throughput and average latency on running
different Bft sharding consensus protocols (RingBFT, Sharper, and AHL).
Panels: (I) Impact of Shards (Throughput); (II) Impact of Shards
(Latency); (III) Impact of Nodes per Shard (Throughput); (IV) Impact of
Nodes per Shard (Latency); (V) Impact of Cross-Shard Workload Rate
(Throughput); (VI) Impact of Cross-Shard Workload Rate (Latency); (VII)
Impact of Batch Size (Throughput); (VIII) Impact of Batch Size
(Latency); (IX) Impact of Involved Shards (Throughput); (X) Impact of
Involved Shards (Latency); (XI) Impact of Inflight Transactions
(Throughput); (XII) Impact of Inflight Transactions (Latency).
8.4 Varying the Batch Size
In our next set of experiments, we study the impact of batching
transactions on the system performance. We require the three protocols
to run consensus on batches of client transactions with sizes 10, 50,
100, 500, 1K, and 1.5K. We use Figures 8 (VII) and (VIII) to illustrate
the throughput and latency metrics.
As the number of transactions in a batch increases, there is a
proportional decrease in the number of consensus instances. For
example, for 5000 transactions, batch sizes of 10 and 100 require 500
and 50 instances of consensus, respectively. However, larger batches
also cause an increase in latency due to the increased cost of
communication and the time for processing all the transactions in the
batch. Hence, we observe an increase in throughput on moving from small
batches of 10 transactions to large batches of 1K transactions. On
further increase (after 1.5K), the system throughput hits saturation
and eventually decreases, as the benefits of batching are overshadowed
by increased communication costs.
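The arithmetic behind this trade-off is straightforward; the sketch below simply reproduces the figures quoted above, computing how many consensus instances a fixed workload needs at each batch size.

```python
# The batching trade-off quantified: for a fixed workload, larger batches
# mean fewer consensus instances (less communication), at the cost of
# larger messages and longer per-batch processing time.
transactions = 5000
for batch_size in [10, 50, 100, 500, 1000]:
    instances = transactions // batch_size
    print(f"batch={batch_size:5d} -> {instances:4d} consensus instances")
```

For 5000 transactions this yields 500 instances at batch size 10 and 50 instances at batch size 100, matching the example in the text; the saturation beyond 1.5K-sized batches is a property of the deployment (network bandwidth and queuing), not of this arithmetic.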
Starting from a batch size of 10, increasing the batch size improves
RingBFT's throughput by up to 27× because, with less communication and
fewer messages, we process more transactions. This trend lasts until
the system reaches its saturation point in terms of communication and
computation, which for RingBFT is a batch size of 1.5K. Once the system
has filled its network bandwidth, adding more transactions to a batch
will not increase the throughput, as the system cannot process more,
and sending these large batches becomes a bottleneck. Ideally, the
throughput should remain constant after this point, but due to
implementation details and queuing, it drops slightly.
Ideally, we would expect the latency to also decrease with an increase
in batch size. However, for RingBFT, more transactions in a batch
implies more time spent processing the transactions around the ring.
This causes an increase in latency for the client.
To summarize: using the optimal batch size improves the throughput of
RingBFT, Sharper, and AHL by 27×, 45×, and 3×, respectively.
8.5 Varying Number of Involved Shards
We now keep the number of shards fixed at 15 and require all clients to
create transactions that access a subset of these shards. Specifically,
clients send transactions that access 1, 3, 6, 9, and 15 shards. As our
selected order for shards gives no preference to their proximity to
each other (to prevent any bias), our clients select consecutive shards
in order when generating the workload.
We use Figures 8 (IX) and (X) to illustrate the throughput and latency
metrics. As expected, all three protocols observe a drop in performance
as the number of involved shards increases. However, RingBFT still
outperforms the other two protocols, and as we increase the number of
involved shards, the performance gap between RingBFT and the other two
protocols widens: with three shards involved, RingBFT has a 4%
performance gap, which increases to 4× with 15 shards involved.
Figure 9: RingBFT's throughput under the primary failure of three
shards out of nine. (s=10) primaries fail; (s=20) replicas timeout and
send view-change messages; (s=30) new primaries start the new view;
(s=35) the system's throughput starts increasing and returns back to
normal at s=55.
8.6 Varying Number of Clients
Each system can reach optimal latency only if it is not overwhelmed by
incoming client requests. In this section, we study this impact by
varying the number of incoming client transactions through a gradual
increase in the number of clients from 5K to 20K. We use Figures 8 (XI)
and (XII) to illustrate the resulting throughput and latency metrics.
As we increase the number of clients transmitting transactions, we
observe a 15-20% increase in throughput, reaching the saturation point.
Adding more clients beyond this point causes a decrease of between 7%
and 9%, which is a result of various queues being full with incoming
requests, which in turn causes a replica to perform extensive memory
management. For similar reasons, there is a significant increase in
latency, as the time to process each request increases proportionally:
we observed latency increases of 32.75s, 58.21s, and 59.64s in RingBFT,
Sharper, and AHL, respectively. Despite this, RingBFT scales better
than the other protocols even when the system is overwhelmed by
clients.
8.7 Impact of Primary Failure
Next, we evaluate the effect of replacing a faulty primary in different
shards. For this experiment, we run with 9 shards and allow the
workload to consist of 30% cross-shard transactions. We use Figure 9 to
show the throughput attained by RingBFT when the primaries of the first
three shards fail, and the replicas run the view-change protocol to
replace the faulty primaries. The primaries of these shards fail at
10s, and the system's average throughput starts decreasing while other
shards are processing their clients' requests. RingBFT observes a 15.1%
decrease in throughput, and post view change, it again observes an
increase in throughput.
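RingBFT inherits Pbft's view-change mechanism for replication within each shard. The sketch below shows the standard timeout-driven trigger in a simplified form; the rotation rule and quorum size follow Pbft, but the class structure and message names are our illustrative assumptions.

```python
# Simplified Pbft-style view change inside one shard (illustrative):
# replicas that suspect the primary time out and broadcast ViewChange;
# once 2f+1 ViewChange messages for the next view are gathered, the new
# primary (chosen by round-robin rotation) installs the view.

def primary_of(view, n):
    return view % n  # round-robin rotation among the n replicas

class Replica:
    def __init__(self, n, f):
        self.n, self.f, self.view = n, f, 0
        self.view_change_msgs = {}  # new_view -> set of voting replicas

    def on_timeout(self):
        """Suspect the current primary; vote to move to the next view."""
        return ("ViewChange", self.view + 1)

    def on_view_change(self, sender, new_view):
        voters = self.view_change_msgs.setdefault(new_view, set())
        voters.add(sender)
        if len(voters) >= 2 * self.f + 1:   # quorum of 2f+1 reached
            self.view = new_view
            return primary_of(new_view, self.n)  # new primary id
        return None

# A shard of n=4 replicas (f=1) replaces its failed primary (replica 0).
r = Replica(n=4, f=1)
new_primary = None
for sender in range(3):                      # 2f+1 = 3 ViewChange votes
    res = r.on_view_change(sender, 1)
    if res is not None:
        new_primary = res
print("new primary:", new_primary)
```

The throughput dip measured in Figure 9 corresponds to the window between the timeout firing and the quorum of view-change messages being collected; shards whose primaries did not fail keep processing single-shard transactions throughout.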
8.8 Impact of Complex Cross-Shard Transactions
Until now, we have experimented with simple csts where, for a given
cst, each shard could independently execute its data-fragment. However,
a sharded system may encounter a complex cst where each shard may
require access to data (and needs to check constraints) present in
other shards while executing its data-fragment. These data-access
dependencies require each shard to read data from other shards. Our
RingBFT protocol performs this task by requiring each shard to send its
read-write sets along with the Forward message. In this section, we
study the cost of communicating the read-write sets of a complex cst on
our RingBFT protocol.
We use Figure 10 to illustrate the throughput and latency metrics on
varying the number of data-access dependencies from 0 to 64,
distributed randomly across 15 shards. These figures illustrate that
our RingBFT protocol provides reasonable throughput and latency even
for a cst with extensive dependencies. Note that we are unable to
experiment with Sharper and AHL, as they do not provide any discussion
on the implementation and consensus of a complex cst.

Figure 10: RingBFT's throughput and latency on encountering complex
cross-shard transactions with dependencies varying from 0 to 64.
Panels: (I) Impact of Remote Reads (Throughput); (II) Impact of Remote
Reads (Latency).
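The cost measured above comes from piggybacking read-write sets on the Forward message. The structure below is a hypothetical rendering of that payload, meant only to show why each additional remote read enlarges the forwarded message; the field names and types are our assumptions, not RingBFT's wire format.

```python
# Hypothetical shape of a Forward message for a complex cross-shard
# transaction: alongside the transaction payload, the sending shard
# attaches the read-write sets of its data-fragment so the next shard in
# the ring can resolve data-access dependencies without extra round
# trips. All field names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class ForwardMsg:
    txn_id: str
    src_shard: int
    dst_shard: int
    payload: bytes
    read_set: dict = field(default_factory=dict)   # key -> value read locally
    write_set: dict = field(default_factory=dict)  # key -> value to be written

# Shard 2 forwards T9 to shard 3, exposing what it read and wrote; every
# additional remote-read dependency adds an entry to these sets, which
# is the message-size cost studied in this section.
msg = ForwardMsg("T9", src_shard=2, dst_shard=3, payload=b"...",
                 read_set={"acct_17": 120}, write_set={"acct_17": 95})
print(len(msg.read_set) + len(msg.write_set), "dependency entries attached")
```

Because the sets ride on a message that is sent anyway, the dependency count increases message size rather than message count, which is consistent with the modest degradation shown in Figure 10.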
9 RELATED WORK
In Section 1, we presented an overview of different types of Bft
protocols. Further, we have extensively studied the architecture of the
state-of-the-art permissioned sharding Bft protocols, AHL and Sharper.
We now summarize other works in the space of Byzantine Fault-Tolerant
consensus.
Traditional Bft consensus. Consensus problems such as Byzantine
Agreement and Interactive Consistency have been studied in the
literature in great detail [20, 24-27, 29, 30, 51, 52, 62, 66, 74, 75].
With the introduction of Pbft-powered BFS, a fault-tolerant version of
the networked file system [42], by Castro et al. [11, 12], there has
been an unprecedented interest in the design of high-performance Bft
consensus protocols. This has led to the design of several consensus
protocols that optimize different aspects of Pbft, e.g., Zyzzyva, Sbft,
and PoE, as discussed in the Introduction. To further improve on the
performance of Pbft, some consensus protocols provide less failure
resilience [1, 13, 18, 54, 58, 59], focus on theoretical frameworks to
support weaker consistency and isolation semantics such as dirty reads
and committed reads [45], or rely on trusted
components [6, 14, 17, 46, 77, 78]. None of these protocols are able to
scale to hundreds of replicas scattered across the globe, and this is
where RingBFT steps in.
Permissionless Sharded Blockchains. The permissionless space includes
several sharding Bft consensus protocols, such as Conflux [53],
Elastico [57], MeshCash [7], Monoxide [81], Nightshade [72],
OmniLedger [48], RapidChain [82], and Spectre [73]. All of these
protocols require each of their shards to run either the Proof-of-Work
or Proof-of-Stake protocol during some phase of the consensus. As a
result, these protocols offer a magnitude lower throughput than both
AHL and Sharper, which are included in our evaluation.
In our recent sharding work, we have developed a comprehensive
theoretical framework to study a wide range of consistency models and
isolation semantics (e.g., dirty reads, committed reads,
serializability) and communication patterns (e.g., centralized vs.
distributed) [45]. We have further developed a hybrid sharding protocol
intended for the permissionless setting, optimized for the widely used
unspent transaction model [43].
10 CONCLUSIONS
In this paper, we presented RingBFT, a novel Bft protocol for sharded
blockchains. For a single-shard transaction, RingBFT performs as
efficiently as any state-of-the-art sharding Bft consensus protocol.
However, existing sharding Bft protocols face a severe fall in
throughput when they have to achieve consensus on a cross-shard
transaction. RingBFT resolves this situation by requiring each shard to
participate in at most two rotations around the ring. Specifically,
RingBFT expects each shard to adhere to the prescribed ring order, and
to follow the principle of process, forward, and re-transmit, while
ensuring that the communication between shards is linear. We implement
RingBFT on our efficient ResilientDB fabric, and evaluate it against
state-of-the-art sharding Bft protocols. Our results illustrate that
RingBFT achieves up to 25× higher throughput than the most recent
sharding protocols and easily scales to nearly 500 globally distributed
nodes.
REFERENCES
[1]
Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter,
and Jay J. Wylie. 2005. Fault-scalable Byzantine Fault-tolerant Services. In Pro-
ceedings of the Twentieth ACM Symposium on Operating Systems Principles. ACM,
59–74. https://doi.org/10.1145/1095810.1095817
[2]
Yair Amir, Claudiu Danilov, Jonathan Kirsch, John Lane, Danny Dolev, Cristina
Nita-Rotaru, Josh Olsen, and David Zage. 2006. Scaling Byzantine Fault-Tolerant
Replication to Wide Area Networks. In International Conference on Dependable
Systems and Networks (DSN’06). 105–114. https://doi.org/10.1109/DSN.2006.63
[3]
Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. 2019. CAPER:
A Cross-application Permissioned Blockchain. Proc. VLDB Endow. 12, 11 (2019),
1385–1398. https://doi.org/10.14778/3342263.3342275
[4]
Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. 2019. SharPer:
Sharding Permissioned Blockchains Over Network Clusters. https://arxiv.org/
abs/1910.00765v1
[5]
L. Aniello, R. Baldoni, E. Gaetani, F. Lombardi, A. Margheri, and V. Sassone. 2017.
A Prototype Evaluation of a Tamper-Resistant High Performance Blockchain-
Based Transaction Log for a Distributed Database. In 2017 13th European Depend-
able Computing Conference. 151–154. https://doi.org/10.1109/EDCC.2017.31
[6]
Johannes Behl, Tobias Distler, and Rüdiger Kapitza. 2017. Hybrids on Steroids:
SGX-Based High Performance BFT. In Proceedings of the Twelfth European Con-
ference on Computer Systems. ACM, 222–237. https://doi.org/10.1145/3064176.
3064213
[7]
Iddo Bentov, Pavel Hubáček, Tal Moran, and Asaf Nadler. 2017. Tortoise and
Hares Consensus: the Meshcash Framework for Incentive-Compatible, Scalable
Cryptocurrencies. https://eprint.iacr.org/2017/300
[8]
P. A. Bernstein and N. Goodman. 1981. Concurrency Control in Distributed
Database Systems. ACM Comput. Surv. 13, 2 (1981), 185–221.
[9]
P. A. Bernstein and N. Goodman. 1983. Multiversion Concurrency Control -
Theory and Algorithms. ACM TODS 8, 4 (1983), 465–483.
[10]
Matthias Butenuth, Guido v. Gösseln, Michael Tiedge, Christian Heipke, Udo
Lipeck, and Monika Sester. 2007. Integration of heterogeneous geospatial data in
a federated database. ISPRS Journal of Photogrammetry and Remote Sensing 62, 5
(2007), 328 – 346. https://doi.org/10.1016/j.isprsjprs.2007.04.003 Theme Issue:
Distributed Geoinformatics.
[11]
Miguel Castro and Barbara Liskov. 1999. Practical Byzantine Fault Tolerance. In
Proceedings of the Third Symposium on Operating Systems Design and Implemen-
tation. USENIX, USA, 173–186.
[12]
Miguel Castro and Barbara Liskov. 2002. Practical Byzantine Fault Tolerance
and Proactive Recovery. ACM Trans. Comput. Syst. 20, 4 (2002), 398–461. https:
//doi.org/10.1145/571637.571640
[13]
Gregory Chockler, Dahlia Malkhi, and Michael K. Reiter. 2001. Backoff protocols
for distributed mutual exclusion and ordering. In Proceedings 21st International
Conference on Distributed Computing Systems. IEEE, 11–20. https://doi.org/10.
1109/ICDSC.2001.918928
[14]
Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. 2007.
Attested Append-only Memory: Making Adversaries Stick to Their Word. In Pro-
ceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles.
ACM, 189–204. https://doi.org/10.1145/1294261.1294280
[15]
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell
Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of
the 1st ACM Symposium on Cloud Computing. ACM, 143–154. https://doi.org/10.
1145/1807128.1807152
[16]
J. C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost,
JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter
Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexan-
der Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao,
Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang,
and Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Database. In
10th USENIX Symposium on Operating Systems Design and Implementation (OSDI
12). USENIX Association, 261–264.
[17]
Miguel Correia, Nuno Ferreira Neves, and Paulo Verissimo. 2004. How to Tolerate
Half Less One Byzantine Nodes in Practical Distributed Systems. In Proceedings
of the 23rd IEEE International Symposium on Reliable Distributed Systems. IEEE,
174–183. https://doi.org/10.1109/RELDIS.2004.1353018
[18]
James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba
Shrira. 2006. HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault
Tolerance. In Proceedings of the 7th Symposium on Operating Systems Design and
Implementation. USENIX Association, 177–190.
[19]
Hung Dang, Tien Tuan Anh Dinh, Dumitrel Loghin, Ee-Chien Chang, Qian Lin,
and Beng Chin Ooi. 2019. Towards Scaling Blockchain Systems via Sharding. In
Proceedings of the 2019 International Conference on Management of Data. ACM,
123–140. https://doi.org/10.1145/3299869.3319889
[20]
Richard A. DeMillo, Nancy A. Lynch, and Michael J. Merritt. 1982. Cryptographic
Protocols. In Proceedings of the Fourteenth Annual ACM Symposium on Theory of
Computing. ACM, 383–400. https://doi.org/10.1145/800070.802214
[21]
A. Deshpande and J. M. Hellerstein. 2002. Decoupled query optimization for
federated database systems. In Proceedings 18th International Conference on Data
Engineering. 716–727. https://doi.org/10.1109/ICDE.2002.994788
[22]
C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N.
Verma, and M. Zwilling. 2013. Hekaton: SQL Server’s Memory-optimized OLTP
Engine. ACM, 1243–1254. https://doi.org/10.1145/2463676.2463710
[23]
Tien Tuan Anh Dinh, Ji Wang, Gang Chen, Rui Liu, Beng Chin Ooi, and Kian-Lee
Tan. 2017. BLOCKBENCH: A Framework for Analyzing Private Blockchains. In
Proceedings of the 2017 ACM International Conference on Management of Data.
ACM, 1085–1100. https://doi.org/10.1145/3035918.3064033
[24]
D. Dolev. 1981. Unanimity in an unknown and unreliable environment. In
22nd Annual Symposium on Foundations of Computer Science. IEEE, 159–168.
https://doi.org/10.1109/SFCS.1981.53
[25]
Danny Dolev. 1982. The Byzantine generals strike again. Journal of Algorithms
3, 1 (1982), 14–30. https://doi.org/10.1016/0196-6774(82)90004- 9
[26]
Danny Dolev and Rüdiger Reischuk. 1985. Bounds on Information Exchange for
Byzantine Agreement. J. ACM 32, 1 (1985), 191–204. https://doi.org/10.1145/
2455.214112
[27]
D. Dolev and H. Strong. 1983. Authenticated Algorithms for Byzantine Agreement.
SIAM J. Comput. 12, 4 (1983), 656–666. https://doi.org/10.1137/0212045
[28]
Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg,
and Wolfgang Lehner. 2012. SAP HANA Database: Data Management for Modern
Business Applications. SIGMOD Rec. 40, 4 (Jan. 2012), 45–51. https://doi.org/10.
1145/2094114.2094126
[29]
Michael J. Fischer and Nancy A. Lynch. 1982. A lower bound for the time
to assure interactive consistency. Inform. Process. Lett. 14, 4 (1982), 183–186.
https://doi.org/10.1016/0020-0190(82)90033- 3
[30]
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility
of Distributed Consensus with One Faulty Process. J. ACM 32, 2 (1985), 374–382.
https://doi.org/10.1145/3149.214121
[31]
Guy Golan Gueta, Ittai Abraham, Shelly Grossman, Dahlia Malkhi, Benny Pinkas,
Michael Reiter, Dragos-Adrian Seredinschi, Orr Tamir, and Alin Tomescu. 2019.
SBFT: A Scalable and Decentralized Trust Infrastructure. In 2019 49th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
IEEE, 568–580. https://doi.org/10.1109/DSN.2019.00063
[32] Jim Gray. 1978. Notes on Data Base Operating Systems.
[33]
Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2020.
Building High Throughput Permissioned Blockchain Fabrics: Challenges and
Opportunities. Proc. VLDB Endow. 13, 12 (2020), 3441–3444. https://doi.org/10.
14778/3415478.3415565
[34]
Suyash Gupta, Jelle Hellings, Sajjad Rahnama, and Mohammad Sadoghi. 2021.
Proof-of-Execution: Reaching Consensus through Fault-Tolerant Speculation. In
Proceedings of the 24th International Conference on Extending Database Technology.
[35]
Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2019. Brief Announce-
ment: Revisiting Consensus Protocols through Wait-Free Parallelization. In 33rd
International Symposium on Distributed Computing (DISC 2019), Vol. 146. Schloss
Dagstuhl, 44:1–44:3. https://doi.org/10.4230/LIPIcs.DISC.2019.44
[36]
Suyash Gupta, Jelle Hellings, and Mohammad Sadoghi. 2021. RCC: Resilient
Concurrent Consensus for High-Throughput Secure Transaction Processing. In
37th IEEE International Conference on Data Engineering (ICDE). arXiv:1911.00837
http://arxiv.org/abs/1911.00837
[37]
Suyash Gupta, Sajjad Rahnama, Jelle Hellings, and Mohammad Sadoghi. 2020.
ResilientDB: Global Scale Resilient Blockchain Fabric. Proc. VLDB Endow. 13, 6
(2020), 868–883. https://doi.org/10.14778/3380750.3380757
[38]
Suyash Gupta, Sajjad Rahnama, and Mohammad Sadoghi. 2020. Permissioned
Blockchain Through the Looking Glass: Architectural and Implementation
Lessons Learned. In Proceedings of the 40th IEEE International Conference on
Distributed Computing Systems.
[39]
Suyash Gupta and Mohammad Sadoghi. 2018. EasyCommit: A Non-blocking
Two-phase Commit Protocol (EDBT).
[40]
Suyash Gupta and Mohammad Sadoghi. 2019. Blockchain Transaction Processing.
In Encyclopedia of Big Data Technologies. Springer, 1–11. https://doi.org/10.1007/
978-3- 319-63962-8_333- 1
[41]
R. Harding, D. Van Aken, A. Pavlo, and M. Stonebraker. 2017. An Evaluation
of Distributed Concurrency Control. Proc. VLDB Endow. 10, 5 (2017), 553–564.
https://doi.org/10.14778/3055540.3055548
[42]
Thomas Haynes and David Noveck. 2015. RFC 7530: Network File System (NFS)
Version 4 Protocol. https://tools.ietf.org/html/rfc7530
[43]
Jelle Hellings, Daniel P. Hughes, Joshua Primero, and Mohammad Sadoghi. 2020.
Cerberus: Minimalistic Multi-shard Byzantine-resilient Transaction Processing.
https://arxiv.org/abs/2008.04450
[44]
Jelle Hellings and Mohammad Sadoghi. 2019. The fault-tolerant cluster-sending
problem. https://arxiv.org/abs/1908.01455
[45]
Jelle Hellings and Mohammad Sadoghi. 2021. ByShard: Sharding in a Byzantine
Environment. Proc. VLDB Endow. (2021).
[46]
Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle,
Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel.
2012. CheapBFT: Resource-efficient Byzantine Fault Tolerance. In Proceedings of
the 7th ACM European Conference on Computer Systems. ACM, 295–308. https:
//doi.org/10.1145/2168836.2168866
[47]
Jonathan Katz and Yehuda Lindell. 2014. Introduction to Modern Cryptography
(2nd ed.).
[48]
Eleftherios Kokoris-Kogias, Philipp Jovanovic, Linus Gasser, Nicolas Gailly, Ewa
Syta, and Bryan Ford. 2018. OmniLedger: A Secure, Scale-Out, Decentralized
Ledger via Sharding. In 2018 IEEE Symposium on Security and Privacy (SP). 583–
598. https://doi.org/10.1109/SP.2018.000-5
[49]
Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund
Wong. 2007. Zyzzyva: Speculative Byzantine Fault Tolerance. SIGOPS Oper. Syst.
Rev. 41, 6 (2007), 45–58. https://doi.org/10.1145/1323293.1294267
[50]
Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund
Wong. 2010. Zyzzyva: Speculative Byzantine Fault Tolerance. ACM Trans. Comput.
Syst. 27, 4, Article 7 (2010), 39 pages. https://doi.org/10.1145/1658357.1658358
[51] Leslie Lamport. 1998. The Part-time Parliament. (1998).
[52]
Leslie Lamport, Robert Shostak, and Marshall Pease. 1982. The Byzantine Gen-
erals Problem. ACM Transactions on Programming Languages and Systems 4, 3
(1982), 382–401. https://doi.org/10.1145/357172.357176
[53]
Chenxing Li, Peilun Li, Dong Zhou, Wei Xu, Fan Long, and Andrew Yao. 2018.
Scaling Nakamoto Consensus to Thousands of Transactions per Second. https:
//arxiv.org/abs/1805.03870
[54]
Barbara Liskov and Rodrigo Rodrigues. 2005. Byzantine Clients Rendered Harm-
less. In Distributed Computing. Springer Berlin Heidelberg, 487–489. https:
//doi.org/10.1007/11561927_35
[55]
Marta Lokhava, Giuliano Losa, David Mazières, Graydon Hoare, Nicolas Barry,
Eli Gafni, Jonathan Jove, Rafał Malinowsky, and Jed McCaleb. 2019. Fast and
Secure Global Payments with Stellar. In Proceedings of the 27th ACM Symposium
on Operating Systems Principles. ACM, 80–96. https://doi.org/10.1145/3341301.
3359636
[56]
Y. Lu, X. Huang, Y. Dai, S. Maharjan, and Y. Zhang. 2020. Blockchain and Federated
Learning for Privacy-Preserved Data Sharing in Industrial IoT. IEEE Transactions
on Industrial Informatics 16, 6 (2020), 4177–4186. https://doi.org/10.1109/TII.2019.
2942190
[57]
Loi Luu, Viswesh Narayanan, Chaodong Zheng, Kunal Baweja, Seth Gilbert, and
Prateek Saxena. 2016. A Secure Sharding Protocol For Open Blockchains. In
Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications
Security. ACM, 17–30. https://doi.org/10.1145/2976749.2978389
[58]
Dahlia Malkhi and Michael Reiter. 1998. Byzantine quorum systems. Distributed
Computing 11, 4 (1998), 203–213. https://doi.org/10.1007/s004460050050
[59]
Dahlia Malkhi and Michael Reiter. 1998. Secure and scalable replication in Phalanx.
In Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems. IEEE,
51–58. https://doi.org/10.1109/RELDIS.1998.740474
[60]
Ralph C. Merkle. 1988. A Digital Signature Based on a Conventional Encryption
Function. In Advances in Cryptology — CRYPTO ’87. Springer, 369–378. https:
//doi.org/10.1007/3-540- 48184-2_32
[61]
Andrew Miller, Yu Xia, Kyle Croman, Elaine Shi, and Dawn Song. 2016. The Honey
Badger of BFT Protocols. In Proceedings of the 2016 ACM SIGSAC Conference on
Computer and Communications Security (CCS ’16). ACM, 31–42. https://doi.org/
10.1145/2976749.2978399
[62]
Shlomo Moran and Yaron Wolfstahl. 1987. Extended impossibility results for
asynchronous complete networks. Inform. Process. Lett. 26, 3 (1987), 145–151.
https://doi.org/10.1016/0020-0190(87)90052- 4
[63]
F. Nawab, D. Agrawal, and A. El Abbadi. 2016. The Challenges of Global-scale Data
Management. In Proceedings of the 2016 International Conference on Management
of Data. ACM, 2223–2227. https://doi.org/10.1145/2882903.2912571
[64]
Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable
Consensus Algorithm. In ATC.
[65] M. Tamer Özsu and Patrick Valduriez. 2020. Principles of Distributed Database Systems. Springer. https://doi.org/10.1007/978-3-030-26253-2
[66] M. Pease, R. Shostak, and L. Lamport. 1980. Reaching Agreement in the Presence of Faults. J. ACM 27, 2 (1980), 228–234. https://doi.org/10.1145/322186.322188
[67] Sajjad Rahnama, Suyash Gupta, Thamir Qadah, Jelle Hellings, and Mohammad Sadoghi. 2020. Scalable, Resilient and Configurable Permissioned Blockchain Fabric. Proc. VLDB Endow. 13, 12 (2020), 2893–2896. https://doi.org/10.14778/3415478.3415502
[68] Thomas C. Redman. 1998. The Impact of Poor Data Quality on the Typical Enterprise. Commun. ACM 41, 2 (1998), 79–82.
[69] Kun Ren, Dennis Li, and Daniel J. Abadi. 2019. SLOG: Serializable, Low-Latency, Geo-Replicated Transactions. Proc. VLDB Endow. 12, 11 (July 2019), 1747–1761. https://doi.org/10.14778/3342263.3342647
[70] Amit P. Sheth and James A. Larson. 1990. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Comput. Surv. 22, 3 (Sept. 1990), 183–236. https://doi.org/10.1145/96602.96604
[71] Dale Skeen. 1982. A Quorum-Based Commit Protocol. Technical Report. Cornell
University.
[72] Alex Skidanov and Illia Polosukhin. 2019. Nightshade: Near Protocol Sharding Design. https://near.org/downloads/Nightshade.pdf
[73] Yonatan Sompolinsky, Yoad Lewenberg, and Aviv Zohar. 2016. SPECTRE: A Fast and Scalable Cryptocurrency Protocol. https://eprint.iacr.org/2016/1159
[74] Gadi Taubenfeld and Shlomo Moran. 1996. Possibility and impossibility results in a shared memory environment. Acta Informatica 33, 1 (1996), 1–20. https://doi.org/10.1007/s002360050034
[75] Gerard Tel. 2001. Introduction to Distributed Algorithms (2nd ed.). Cambridge University Press.
[76] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, 1–12. https://doi.org/10.1145/2213836.2213838
[77] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, and Lau Cheuk Lung. 2010. EBAWA: Efficient Byzantine Agreement for Wide-Area Networks. In 2010 IEEE 12th International Symposium on High Assurance Systems Engineering. IEEE, 10–19. https://doi.org/10.1109/HASE.2010.19
[78] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Verissimo. 2013. Efficient Byzantine Fault-Tolerance. IEEE Trans. Comput. 62, 1 (2013), 16–30. https://doi.org/10.1109/TC.2011.221
[79] Hoang Tam Vo, Ashish Kundu, and Mukesh K. Mohania. 2018. Research Directions in Blockchain Data Management and Analytics. In Proceedings of the 21st International Conference on Extending Database Technology. OpenProceedings.org, 445–448. https://doi.org/10.5441/002/edbt.2018.43
[80] Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In 8th Biennial Conference on Innovative Data Systems Research. www.cidrdb.org.
[81] Jiaping Wang and Hao Wang. 2019. Monoxide: Scale out Blockchains with Asynchronous Consensus Zones. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, 95–112.
[82] Mahdi Zamani, Mahnush Movahedi, and Mariana Raykova. 2018. RapidChain: Scaling Blockchain via Full Sharding. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 931–948. https://doi.org/10.1145/3243734.3243853
[83] Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017. The End of a Myth: Distributed Transactions Can Scale. Proc. VLDB Endow. 10, 6 (2017), 685–696. https://doi.org/10.14778/3055330.3055335