Cerberus: Minimalistic Multi-shard Byzantine-resilient
Transaction Processing
Jelle Hellings, Daniel P. Hughes,
Joshua Primero, Mohammad Sadoghi
Exploratory Systems Lab, Department of Computer Science
University of California, Davis, CA, 95616-8562, USA
Radix DLT Ltd, Argyle Works, 29-31 Euston Road, London, NW1 2SD
Abstract
To enable high-performance and scalable blockchains, we need to step away from traditional
consensus-based fully-replicated designs. One direction is to explore the usage of sharding in
which we partition the managed dataset over many shards that—independently—operate as
blockchains. Sharding requires an efficient fault-tolerant primitive for the ordering and execution
of multi-shard transactions, however.
In this work, we seek to design such a primitive suitable for distributed ledger networks
with high transaction throughput. To do so, we propose Cerberus, a set of minimalistic
primitives for processing single-shard and multi-shard UTXO-like transactions. Cerberus aims
at maximizing parallel processing at shards while minimizing coordination within and between
shards. First, we propose Core-Cerberus, which uses strict environmental requirements to enable
simple yet powerful multi-shard transaction processing. In our intended UTXO-environment,
Core-Cerberus will operate perfectly with respect to all transactions proposed and approved
by well-behaved clients, but does not provide any guarantees for other transactions.
To also support more general-purpose environments, we propose two generalizations of Core-
Cerberus: we propose Optimistic-Cerberus, a protocol that does not require any additional
coordination phases in the well-behaved optimistic case, while requiring intricate coordination
when recovering from attacks; and we propose Pessimistic-Cerberus, a protocol that adds
sufficient coordination to the well-behaved case of Core-Cerberus, allowing it to operate in
general-purpose fault-tolerant environments without significant costs to recover from attacks.
Finally, we compare the three protocols, showing their potential scalability and high transaction
throughput in practical environments.
1 Introduction
The advent of blockchain applications and technology has rejuvenated the interest of companies, governments, and developers in resilient distributed fully-replicated systems and the distributed ledger technology (DLT) that powers them. Indeed, in the last decade we have seen a surge of interest in reimagining systems and building them using DLT networks. Examples can be found in the financial and banking sector [15, 36, 47], IoT [41], health care [28, 37], supply chain tracking, advertising, and in databases [3, 5, 23, 44, 45]. This wide interest is easily explained, as blockchains promise to improve resilience, while enabling the federated management of data by many participants.
To illustrate this, we look at the financial sector. Current traditional banking infrastructure
is often rigid, slow, and creates substantial frictional costs. It is estimated that the yearly cost
of transactional friction alone is $71 billion [8] in the financial sector, creating a strong desire for
alternatives. This sector is a perfect match for DLT, as it enables systems that manage digital assets
and financial transactions in more flexible, fast, and open federated infrastructures that eliminate the friction caused by individual private databases maintained by banks and financial services providers. Consequently, it is expected that a large part of the financial sector will move towards DLT [18].

Figure 1: A sharded design in which two resilient blockchains each hold only a part of the data. Local decisions within a cluster are made via traditional Pbft consensus, whereas multi-shard transactions are processed via Cerberus (proposed in this work). [Figure: one shard (replicas A1-A4) runs Pbft over objects o1,...,o10 and another shard (replicas B1-B4) runs Pbft over objects o11,...,o20; single-shard requests (e.g., on o3, o5 or on o12, o17) are handled via Pbft, while a multi-shard request on o2, o14 is handled via Cerberus.]
At the core of DLT is the replicated state maintained by the network in the form of a ledger of
transactions. In traditional blockchains, this ledger is fully replicated among all participants using
consensus protocols [14, 35, 41, 43]. For many practical use-cases, one can choose to use either
permissionless consensus solutions that are operated via economic self-incentivization through cryptocurrencies (e.g., Nakamoto consensus [42, 51]), or permissioned consensus solutions that require vetted participation (e.g., Pbft [16]). Unfortunately, the consensus protocols utilized by today's DLT networks are severely limited in their ability to provide the high transaction throughput that is needed to address practical needs, e.g., in the financial and banking sector.
On the one hand, we see that permissionless solutions can easily scale to thousands of participants,
but are severely limited in their transaction processing throughput. E.g., in Ethereum, a popular
public permissionless DLT platform, the rapid growth of decentralized finance applications [12] has
caused its network fees to rise precipitously as participants bid for limited network capacity [7],
while Bitcoin can only process a few transactions per second [47]. On the other hand, permissioned
solutions can reach much higher throughputs, but still lack scalability as their performance is bound
by the speed of individual participants.
In this paper, we focus on a fundamental solution to significantly increase the throughput of DLT
that may apply to either permissionless or permissioned networks. While this paper primarily discusses this solution through the lens of permissioned networks, similar techniques apply to permissionless DLT with the necessary extensions for these kinds of networks, such as self-incentivization, Sybil attack protection, and tolerance of validator set churn. Such permissionless networks are the focus of Radix and were the impetus for its original creation of the Cerberus concept that this paper discusses.
A direction one can take to improve on the limited throughput of a DLT network is to incorporate sharding in its design: instead of operating a single fully-replicated consensus-based DLT network, one can partition the data in the DLT network among several shards that each have the potential to operate mostly independently on their data, while only requiring cooperation between shards to process transactions that affect data on several shards. In such a sharded design, transactions that only affect objects within a single shard can be processed via normal consensus (e.g., Pbft). Transactions that affect objects within several shards require additional coordination, however. The choice of protocol for such multi-shard transaction processing greatly determines the scalability benefits of sharding and the overhead costs incurred by sharding. We have sketched a basic sharded design in Figure 1.
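To make this concrete, the following small Python sketch (ours, not part of the original Cerberus design) shows one possible way to assign objects to shards by hashing their identifiers; the design only assumes that some well-defined assignment exists, and the function name and hashing choice are hypothetical.

import hashlib

# Sketch: one possible deterministic object-to-shard assignment.
def shard_of(object_id: str, num_shards: int) -> int:
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A transaction touching o2 and o14 becomes multi-shard when they map to different shards.
print(shard_of("o2", 2), shard_of("o14", 2))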
To provide multi-shard transaction processing with high throughput in practical environments with a large number of shards, including permissionless networks, Radix proposed Cerberus—a technique for performing multi-shard transactions. In this paper, we propose and analyze a family of multi-shard transaction processing protocol variants using the original Cerberus concept. To be able to adapt to the needs of specific use-cases, we propose three variants of Cerberus: Core-Cerberus, Optimistic-Cerberus, and Pessimistic-Cerberus.¹
First, we propose Core-Cerberus (CCerberus), a design specialized for processing UTXO-like
transactions. CCerberus is a simplified variant of Cerberus and uses the strict environmental
assumptions on UTXO-transactions to its advantage to yield a minimalistic design that does as little
work as possible per involved shard. Even with this minimalistic design, CCerberus will operate
perfectly with respect to all transactions proposed and approved by well-behaved clients (although
it may fail to process transactions originating from malicious clients).
Next, to also support more general-purpose environments, we propose two generalizations of CCerberus, namely Optimistic-Cerberus and Pessimistic-Cerberus, which each deal with the strict environmental assumptions of CCerberus, while preserving the minimalistic design of CCerberus. In the design of Optimistic-Cerberus (OCerberus), we assume that malicious behavior is rare and we optimize the normal-case operations. We do so by keeping the normal-case operations as minimalistic as possible. In particular, compared to CCerberus, OCerberus does not require any additional coordination phases in the well-behaved optimistic case, while still being able to lift the environmental assumptions of CCerberus. In doing so, OCerberus does require intricate coordination when recovering from attacks. In the design of Pessimistic-Cerberus (PCerberus), we assume that malicious behavior is common and we add sufficient coordination to the normal-case operations of CCerberus to enable a simpler and localized recovery path, allowing PCerberus to recover from attacks at lower cost, at the expense of increased complexity in normal-case operation. We believe both variants may be productive directions for different network deployment situations, depending on the desired trade-offs.
To demonstrate the strengths of each of the Cerberus protocols, we show that Cerberus can provide serializable transaction execution for UTXO-like transactions. Furthermore, we show that each of the protocol variants has excellent scalability in practice, even when exclusively dealing with
multi-shard workloads.
Organization First, in Section 2, we present the terminology and notation used throughout this
paper. Then, in Section 3, we specify the correctness criteria by which we evaluate our Cerberus
multi-shard transaction processing protocols. Then, in Sections 4, 5, and 6, we present the three vari-
ants of Cerberus, namely Core-Cerberus (CCerberus), Optimistic-Cerberus (OCerberus),
and Pessimistic-Cerberus (PCerberus). In Section 7, we further analyze the practical strengths,
properties, and performance of Cerberus. Then, in Section 8, we discuss related work, while we
conclude on our findings in Section 9.
2 Preliminaries
Before we proceed with our detailed presentation of Cerberus, we first introduce the system model,
the sharding model, the data model, the transaction model, and the relevant terminology and
notation used throughout this paper.
Sharded fault-tolerant systems If S is a set of replicas, then we write G(S) to denote the non-faulty good replicas in S that always operate as intended, and we write F(S) = S \ G(S) to denote the remaining replicas in S that are faulty and can act Byzantine, deviate from the intended operations, or even operate in coordinated malicious manners. We write n_S = |S|, g_S = |G(S)|, and f_S = |S \ G(S)| = n_S − g_S to denote the number of replicas in S, good replicas in S, and faulty replicas in S, respectively.

¹The ideas underlying Cerberus were outlined in an earlier whitepaper of our Radix team, available at https://www.radixdlt.com/wp-content/uploads/2020/03/Cerberus-Whitepaper-v1.0.pdf.
Let R be a set of replicas. In a sharded fault-tolerant system over R, the replicas are partitioned into sets shards(R) = {S_0, ..., S_z} such that the replicas in S_i, 0 ≤ i ≤ z, operate as an independent Byzantine fault-tolerant system. As each S_i operates as an independent fault-tolerant system, we require n_{S_i} > 3 f_{S_i}, a minimal requirement to enable Byzantine fault-tolerance in an asynchronous environment [20, 21]. We assume that every shard S ∈ shards(R) has a unique identifier id(S).
We assume asynchronous communication: messages can get lost, arrive with arbitrary delays, and in arbitrary order. Consequently, it is impossible to distinguish between, on the one hand, a replica that is malicious and does not send out messages and, on the other hand, a replica that does send out proposals that get lost in the network. As such, Cerberus can only provide progress in periods of reliable bounded-delay communication during which all messages sent by good replicas will arrive at their destination within some maximum delay [25, 27]. Further, we assume that communication is authenticated: on receipt of a message m from replica r ∈ R, one can determine that r did send m if r ∈ G(R). Hence, faulty replicas are able to impersonate each other, but are not able to impersonate good replicas. To provide authenticated communication under practical assumptions, we can rely on cryptographic primitives such as message authentication codes, digital signatures, or threshold signatures [38, 48].
Assumption 2.1. Let shards(R) be a sharded fault-tolerant system. We assume coordinating adversaries that can—at will—choose and control any replica r ∈ S in any shard S ∈ shards(R), as long as, for each shard S, the adversaries only control up to f_S replicas in S.
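As a minimal illustration of this bookkeeping, the sketch below (our own, with hypothetical names) computes n_S, g_S, and f_S for a shard and checks the resilience condition n_S > 3 f_S; in a real deployment the set of faulty replicas is of course unknown and only bounded by assumption.

from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    ident: str
    replicas: frozenset   # all replicas in S
    faulty: frozenset     # F(S); only known here for illustration

    @property
    def n(self) -> int:
        return len(self.replicas)

    @property
    def f(self) -> int:
        return len(self.faulty)

    @property
    def g(self) -> int:
        return self.n - self.f   # g_S = n_S - f_S

    def tolerates_byzantine_faults(self) -> bool:
        # Minimal requirement for Byzantine fault-tolerance under asynchrony.
        return self.n > 3 * self.f

shard = Shard("S0", frozenset({"r1", "r2", "r3", "r4"}), frozenset({"r4"}))
assert shard.tolerates_byzantine_faults() and shard.g > 2 * shard.f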
Object-dataset model We use the object-dataset model in which data is modeled as a collection of objects. Each object o has a unique identifier id(o) and a unique owner owner(o). In the following, we assume that all owners are clients of the system that manages these objects. The only operations that one can perform on an object are construction and destruction. An object cannot be recreated, as the attempted recreation of an object o will result in a new object o′ with a distinct identifier (id(o) ≠ id(o′)).
Object-dataset transactions Changes to object-dataset data are made via transactions requested by clients. We write ⟨τ⟩_c to denote a transaction τ requested by a client c. We assume that all transactions are UTXO-like transactions: a transaction τ first produces resources by destructing a set of input objects and then consumes these resources in the construction of a set of output objects. We do not rely on the exact rules regarding the production and consumption of resources, as they are highly application-specific. Given a transaction τ, we write Inputs(τ) and Outputs(τ) to denote the input objects and output objects of τ, respectively, and we write Objects(τ) = Inputs(τ) ∪ Outputs(τ).

Assumption 2.2. Given a transaction τ, we assume that one can determine Inputs(τ) and Outputs(τ) a priori. Furthermore, we assume that every transaction has inputs. Hence, |Inputs(τ)| ≥ 1.
Owners of objects o can express their support for transactions τ that have o as their input. To provide this functionality, we can rely on cryptographic primitives such as digital signatures [38].

Assumption 2.3. If an owner is well-behaved, then an expression of support cannot be forged or provided by any other party. Furthermore, a well-behaved owner of o will only express its support for a single transaction τ with o ∈ Inputs(τ), as only one transaction can consume the object o, and the owner will only do so after the construction of o.
Multi-shard transactions Let o be an object. We assume that there is a well-defined function shard(o) that maps object o to the single shard S ∈ shards(R) that is responsible for maintaining o. Given a transaction τ, we write

shards(τ) = {shard(o) | o ∈ Objects(τ)}

to denote the shards that are affected by τ. We say that τ is a single-shard transaction if |shards(τ)| = 1 and is a multi-shard transaction otherwise. We assume the following:

Assumption 2.4. Let D(S) be the dataset maintained by shard S. We have o ∈ D(S) only if shard(o) = S.
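The sketch below (our own, with hypothetical field names) captures this transaction model and the derived set shards(τ); it assumes some shard-assignment function like the one sketched in the introduction.

from dataclasses import dataclass

@dataclass(frozen=True)
class Transaction:
    client: str
    inputs: frozenset    # identifiers of objects destructed by the transaction
    outputs: frozenset   # identifiers of objects constructed by the transaction

    def objects(self):
        # Objects(tau) = Inputs(tau) union Outputs(tau)
        return self.inputs | self.outputs

def shards_of(tx: Transaction, shard_of, num_shards: int) -> set:
    # shards(tau) = { shard(o) | o in Objects(tau) }; a single-shard transaction has one element.
    return {shard_of(o, num_shards) for o in tx.objects()}

tx = Transaction(client="c", inputs=frozenset({"o2", "o14"}), outputs=frozenset({"o21"}))
assert len(tx.inputs) >= 1   # Assumption 2.2: every transaction has inputs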
3 Correctness of multi-shard transaction processing
Before we introduce Cerberus, we put forward the correctness requirements we want to maintain
in a multi-shard transaction system in which each shard is itself a set of replicas operated as a
Byzantine fault-tolerant system. We say that a shard S performs an action if every good replica in G(S) performs that action. Hence, any processing decision or execution step performed by S requires the usage of a consensus protocol to coordinate the replicas in S:
Fault-tolerant primitives At the core of resilient systems are consensus protocols [14, 16, 40, 41]
that coordinate the operations of individual replicas in the system, e.g., a Byzantine fault-tolerant
system driven by Pbft [16] or HotStuff [53], or a crash fault-tolerant system driven by Paxos [40].
As these systems are fully-replicated, each replica holds exactly the same data, which is determined
by the sequence of transactions—the journal—agreed upon via consensus:
Definition 3.1. A consensus protocol coordinates decision making among the replicas of a resilient cluster S by providing reliable ordered replication of decisions. To do so, consensus protocols provide the following guarantees:

1. If good replica r ∈ S makes a ρ-th decision, then all good replicas r′ ∈ S will make a ρ-th decision (whenever communication becomes reliable).

2. If good replicas r, q ∈ S make ρ-th decisions, then they make the same decisions.

3. Whenever a good replica learns that a decision D needs to be made, then it can force consensus on D.
Let τ be a transaction processed by a sharded fault-tolerant system. Processing of τ does not imply execution: the transaction could be invalid (e.g., the owners of affected objects did not express their support) or the transaction could have inputs that no longer exist. We say that the system commits to τ if it decides to apply the modifications prescribed by τ, and we say that the system aborts τ if it decides to not do so. Using this terminology, we put forward the following requirements for any sharded fault-tolerant system:

R1. Validity. The system must only process transaction τ if, for every input object o ∈ Inputs(τ) with a well-behaved owner owner(o), the owner owner(o) supports the transaction.

R2. Shard-involvement. A shard S only processes transaction τ if S ∈ shards(τ).

R3. Shard-applicability. Let D(S) be the dataset maintained by shard S at time t. The shards shards(τ) only commit to execution of transaction τ at t if τ consumes only existing objects. Hence, Inputs(τ) ⊆ ⋃{D(S) | S ∈ shards(τ)}.

R4. Cross-shard-consistency. If shard S commits (aborts) transaction τ, then all shards S′ ∈ shards(τ) eventually commit (abort) τ.

R5. Service. If client c is well-behaved and wants to request a valid transaction τ, then the sharded system will eventually process ⟨τ⟩_c. If τ is shard-applicable, then the sharded system will eventually execute ⟨τ⟩_c.

R6. Confirmation. If the system processes ⟨τ⟩_c and c is well-behaved, then c will eventually learn whether τ is committed or aborted.
We notice that shard-involvement is a local requirement, as individual shards can determine whether they need to process a given transaction. In contrast, shard-applicability and cross-shard-consistency are global requirements, as assuring these requirements requires coordination between the shards affected by a transaction.
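To illustrate the difference, the sketch below phrases shard-involvement (R2) and shard-applicability (R3) as predicates. This is our own reading, with hypothetical parameter names; the R3 check needs the datasets of all affected shards, which is what makes it a global requirement.

def involved(shard_id, affected) -> bool:
    # R2: a shard S only processes tau if S is in shards(tau).
    return shard_id in affected

def applicable(inputs, datasets, affected) -> bool:
    # R3: commit only if every input object still exists in some affected shard,
    # i.e., Inputs(tau) is a subset of the union of D(S) over S in shards(tau).
    available = set().union(*(datasets[s] for s in affected)) if affected else set()
    return set(inputs) <= available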
4 Core-Cerberus: simple yet efficient transaction processing
The core idea of Cerberus is to minimize the coordination necessary for multi-shard ordering
and execution of transactions. To do so, Cerberus combines the semantics of transactions in the
object-dataset model with the minimal coordination required to assure shard-applicability and cross-
shard consistency. This combination results in the following high-level three-step approach towards
processing any transaction τ:
1. Local inputs. First, every affected shard S ∈ shards(τ) locally determines whether it has all inputs from S that are necessary to process τ.

2. Cross-shard exchange. Then, every affected shard S ∈ shards(τ) exchanges these inputs with all other shards in shards(τ), thereby pledging to use their local inputs in the execution of τ.

3. Decide outcome. Finally, every affected shard S ∈ shards(τ) decides to commit τ if all affected shards were able to provide all local inputs necessary for execution of τ.
Next, we describe how these three high-level steps are incorporated by Cerberus into normal consensus steps at each shard. Let shard S ∈ shards(R) receive client request ⟨τ⟩_c. The good replicas in S will first determine whether τ is valid and applicable. If τ is not valid or S ∉ shards(τ), then the good replicas discard τ. Otherwise, if τ is valid and S ∈ shards(τ), then the good replicas utilize consensus to force the primary P(S) to propose in some consensus round ρ the message m(S, τ)_ρ = (⟨τ⟩_c, I(S, τ), D(S, τ)), in which I(S, τ) = {o ∈ Inputs(τ) | S = shard(o)} is the set of objects maintained by S that are input to τ and D(S, τ) ⊆ I(S, τ) is the set of currently-available inputs at S. Only if I(S, τ) = D(S, τ) will S pledge to use the local inputs I(S, τ) in the execution of τ.
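A minimal sketch of this local inputs step, in our own (hypothetical) helper names: the shard computes I(S, τ) from the transaction and D(S, τ) from its current dataset, and only pledges when both sets coincide.

def local_inputs_proposal(shard_id, tx_inputs, shard_of, dataset):
    # I(S, tau): inputs of the transaction that shard S is responsible for.
    I = {o for o in tx_inputs if shard_of(o) == shard_id}
    # D(S, tau): the subset of I(S, tau) that is currently available at S.
    D = {o for o in I if o in dataset}
    pledged = (I == D)   # only then does S pledge its local inputs to tau
    return I, D, pledged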
The acceptance of m(S, τ)_ρ in round ρ by all good replicas completes the local inputs step. Next, during execution of τ, the cross-shard exchange and decide outcome steps are performed. First, the cross-shard exchange step. In this step, S broadcasts m(S, τ)_ρ to all other shards in shards(τ). To assure that the broadcast arrives, we rely on a reliable primitive for cross-shard exchange, e.g., via an efficient cluster-sending protocol [29, 32]. Then, the replicas in S wait until they receive messages m(S′, τ)_ρ′ = (⟨τ⟩_c, I(S′, τ), D(S′, τ)) from all other shards S′ ∈ shards(τ).

After cross-shard exchange comes the final decide outcome step. After S receives m(S′, τ)_ρ′ from all shards S′ ∈ shards(τ), it decides to commit whenever I(S′, τ) = D(S′, τ) for all S′ ∈ shards(τ). Otherwise, it decides abort. If S decides commit, then all good replicas in S destruct all objects in D(S, τ) and construct all objects o ∈ Outputs(τ) with S = shard(o). Finally, each good replica informs c of the outcome of execution. If c receives, from every shard S′′ ∈ shards(τ), identical
outcomes from g_{S′′} − f_{S′′} distinct replicas in S′′, then it considers τ to be successfully executed. In Figure 2, we sketched the working of CCerberus.

Figure 2: The message flow of CCerberus for a 3-shard client request ⟨τ⟩_c that is committed. [Figure: client c and shards S1, S2, S3; each shard runs consensus on ⟨τ⟩_c (Local Inputs), performs Cross-Shard Exchange via cluster-sending, waits for the commit/abort decision (Decide Outcome), and informs the client.]
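The decide outcome step is a pure function of the exchanged messages: commit exactly when every affected shard pledged all of its local inputs. The sketch below is our reading of this step together with the client-side confirmation rule; all names are illustrative and not taken from the paper.

def decide_outcome(exchanged) -> str:
    # exchanged maps each affected shard S' to the pair (I(S', tau), D(S', tau)).
    return "commit" if all(I == D for (I, D) in exchanged.values()) else "abort"

def client_accepts(outcomes, g, f) -> bool:
    # The client considers tau successfully executed once every affected shard
    # reports identical outcomes from at least g_S - f_S distinct replicas.
    for shard_id, reported in outcomes.items():
        threshold = g[shard_id] - f[shard_id]
        if not any(reported.count(value) >= threshold for value in set(reported)):
            return False
    return True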
The cross-shard exchange step of CCerberus at S involves waiting for other shards S′. Hence, there is the danger of deadlocks if the other shards S′ never perform the cross-shard exchange step:

Example 4.1. Consider distinct transactions ⟨τ1⟩_{c1} and ⟨τ2⟩_{c2} that both affect objects o1 in S1 and o2 in S2. Hence, we have Inputs(τ1) = Inputs(τ2) = {o1, o2} with shard(o1) = S1 and shard(o2) = S2. We assume that S1 processes τ1 first and S2 processes τ2 first. Shard S1 will start by sending m(S1, τ1)_{ρ1} = (⟨τ1⟩_{c1}, {o1}, {o1}) to S2. Next, S1 will wait, during which it receives τ2. At the same time, S2 follows similar steps for τ2 and sends m(S2, τ2)_{ρ2} = (⟨τ2⟩_{c2}, {o2}, {o2}) to S1. As a result, S1 is waiting for information on τ1 from S2, while S2 is waiting for information on τ2 from S1.
To assure that the above example does not lead to a deadlock, we employ two techniques.
1. Internal propagation. To deal with situations in which some shards S ∈ shards(τ) did not receive ⟨τ⟩_c (e.g., due to network failure or due to a faulty client that fails to send ⟨τ⟩_c to some shards), we allow each shard to learn τ from any other shard. In particular, any shard S ∈ shards(τ) will start consensus on ⟨τ⟩_c after receiving cross-shard exchange related to ⟨τ⟩_c.

2. Concurrent resolution. To deal with concurrent transactions, as in Example 4.1, we allow each shard to accept and execute transactions for different rounds concurrently. To assure that such concurrent execution does not lead to inconsistent state updates, each replica implements the following first-pledge and ordered-commit rules. Let τ be a transaction with S ∈ shards(τ) and r ∈ S. The first-pledge rule states that S pledges o, constructed in round ρ, to transaction τ only if τ is the first transaction proposed after round ρ with o ∈ Inputs(τ). The ordered-commit rule states that S can abort τ in any order, but will only commit a τ that is accepted in round ρ after previous rounds finished execution.
Next, we apply the above two techniques to the situation of Example 4.1:
Example 4.2. While S1 is waiting for τ1, it receives cross-shard exchange related to ⟨τ2⟩_{c2}. Hence, in a future round ρ′1 > ρ1, it can propose and accept ⟨τ2⟩_{c2}. By the first-pledge rule, S1 already pledged o1 to the execution of τ1. Hence, it cannot pledge any objects to the execution of τ2. Consequently, S1 will eventually be able to send m(S1, τ2)_{ρ′1} = (⟨τ2⟩_{c2}, {o1}, ∅) to S2. Likewise, S2 will eventually be able to send m(S2, τ1)_{ρ′2} = (⟨τ1⟩_{c1}, {o2}, ∅) to S1. Consequently, both shards decide abort on τ1 and τ2, which they can do in any order due to the ordered-commit rule.
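The first-pledge rule amounts to simple per-object bookkeeping: once an object is pledged to a transaction, later proposals see it as unavailable, which is exactly what produces the two abort decisions above. The sketch below is our own illustration, not the paper's implementation.

class PledgeTable:
    def __init__(self):
        self.pledged_to = {}   # object identifier -> transaction holding the pledge

    def pledge(self, tx_id, local_inputs):
        # Return D(S, tau): the local inputs this shard can still pledge to tx_id.
        available = set()
        for o in local_inputs:
            if self.pledged_to.setdefault(o, tx_id) == tx_id:
                available.add(o)   # o was free, or already pledged to this very transaction
        return available

table = PledgeTable()
assert table.pledge("tau1", {"o1"}) == {"o1"}   # S1 pledges o1 to tau1 first
assert table.pledge("tau2", {"o1"}) == set()    # tau2 receives an empty pledge, leading to abort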
Finally, we notice that abort decisions at shard S on a transaction τ can often be made without waiting for all shards S′ ∈ shards(τ). Shard S can decide abort after it detects I(S, τ) ≠ D(S, τ) or after it receives the first message (⟨τ⟩_c, I(S′′, τ), D(S′′, τ)) with I(S′′, τ) ≠ D(S′′, τ), S′′ ∈ shards(τ). For efficiency, we allow S to abort in these cases.
Theorem 4.3. If, for all shards S, g_S > 2 f_S, and Assumptions 2.1, 2.2, 2.3, and 2.4 hold, then Core-Cerberus satisfies Requirements R1–R6 with respect to all transactions that are not requested by malicious clients and that do not involve objects with malicious owners.
Proof. Let τ be a transaction. As good replicas in S discard τ if it is invalid or if S ∉ shards(τ), CCerberus provides validity and shard-involvement. Next, shard-applicability follows directly from the decide outcome step.

If a shard S commits or aborts transaction τ, then it must have completed the decide outcome and cross-shard exchange steps. As S completed cross-shard exchange, all shards S′ ∈ shards(τ) must have exchanged the necessary information to S. By relying on cluster-sending for cross-shard exchange, S′ requires cooperation of all good replicas in S′ to exchange the necessary information to S. Hence, we have the guarantee that these good replicas will also perform cross-shard exchange to any other shard S′′ ∈ shards(τ). As such, every shard S′′ ∈ shards(τ) will receive the same information as S, complete cross-shard exchange, and make the same decision during the decide outcome step, providing cross-shard consistency.

Finally, due to internal propagation and concurrent resolution, every valid transaction τ will be processed by CCerberus as soon as it is sent to any shard S ∈ shards(τ). Hence, every shard in shards(τ) will perform the necessary steps to eventually inform the client. As all good replicas r ∈ S′, S′ ∈ shards(τ), will inform the client of the outcome for τ, the majority of these inform-messages come from good replicas, enabling the client to reliably derive the true outcome. Hence, CCerberus provides service and confirmation.
Notice that in the object-dataset model in which we operate, each object can be constructed once and destructed once. Hence, each object o can be part of at most two committed transactions: the first of which will construct o as an output, and the second of which has o as an input and will consume and destruct o. This is independent of any other operations on other objects. As such, these two transactions cannot happen concurrently. Consequently, we only have concurrent transactions on o if the owner owner(o) expresses support for several transactions that have o as an input. By Assumption 2.3, the owner owner(o) must be malicious in that case. As such, transactions of well-behaved clients and owners will never abort.

In the design of CCerberus, we take advantage of this observation that aborts are always due to malicious behavior by clients and owners of objects: to minimize coordination while allowing graceful resolution of concurrent transactions, we do not undo any pledges of objects to the execution of any transactions. This implies that objects that are involved in malicious transactions can get lost for future usage, while not affecting any transactions of well-behaved clients.
5 Optimistic-Cerberus: robust transaction processing
In the previous section, we introduced CCerberus, a minimalistic and efficient multi-shard transac-
tion processing protocol that relies on practical properties of UTXO-like transactions. Although the
design of CCerberus is simple yet effective, we see two shortcomings that limits its use cases. First,
CCerberus operates under the assumption that any issues arising from concurrent transactions is
due to malicious behavior of clients. As such, CCerberus chooses to lock out objects affected by
such malicious behavior for any future usage. Second, CCerberus requires consecutive consensus
and cluster-sending steps, which increases its transaction processing latencies. Next, we investigate
how to deal with these weaknesses of CCerberus without giving up on the minimalistic nature of
CCerberus.
To do so, we propose Optimistic-Cerberus (OCerberus), which is optimized for the optimistic
case in which we have no concurrent transactions, while providing a recovery path that can recover
from concurrent transactions without locking out objects. At the core of OCerberus is assuring
that any issues due to malicious behavior, e.g., concurrent transactions, are detected in such a
way that individual replicas can start a recovery process. At the same time, we want to minimize
transaction processing latencies. To bridge between these two objectives, we integrate detection and
cross-shard coordination within a single consensus round that runs at each affected shard.
Let ⟨τ⟩_c be a multi-shard transaction, let S ∈ shards(τ) be a shard with primary P(S), and let m(S, τ)_{v,ρ} be the round-ρ proposal of P(S) of view v of S. To enable detection of concurrent transactions, OCerberus modifies the consensus-steps of the underlying consensus protocol by applying the following high-level idea:

A replica r ∈ S, S ∈ shards(τ), only accepts proposal m(S, τ)_{v,ρ} = (⟨τ⟩_c, I(S, τ), D(S, τ)) for some transaction τ if it gets confirmation that replicas in each other shard S′ ∈ shards(τ) are also accepting proposals for τ. Otherwise, replica r detects failure.
Next, we illustrate how to integrate the above idea in the three-phase design of Pbft, thereby
turning Pbft into a multi-shard aware consensus protocol:
1. Global preprepare. Primary P(S) must send m(S, τ)_{v,ρ} to all replicas r′ ∈ S′, S′ ∈ shards(τ). Replica r ∈ S only finishes the global preprepare phase after it receives a global preprepare certificate consisting of a set M = {m(S′′, τ)_{v′′,ρ′′} | S′′ ∈ shards(τ)} of preprepare messages from all primaries of shards affected by τ.

2. Global prepare. After r ∈ S, S ∈ shards(τ), finishes the global preprepare phase, it sends prepare messages for M to all other replicas r′ ∈ S′, S′ ∈ shards(τ). Replica r ∈ S only finishes the global prepare phase for M after, for every shard S′ ∈ shards(τ), it receives a local prepare certificate consisting of a set P(S′) of prepare messages for M from g_{S′} distinct replicas in S′. We call the set {P(S′′) | S′′ ∈ shards(τ)} a global prepare certificate.

3. Global commit. After replica r ∈ S, S ∈ shards(τ), finishes the global prepare phase, it sends commit messages for M to all other replicas r′ ∈ S′, S′ ∈ shards(τ). Replica r ∈ S only finishes the global commit phase for M after, for every shard S′ ∈ shards(τ), it receives a local commit certificate consisting of a set C(S′) of commit messages for M from g_{S′} distinct replicas in S′. We call the set {C(S′′) | S′′ ∈ shards(τ)} a global commit certificate.
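The local certificates in these phases are per-shard quorums of matching messages. The sketch below, with hypothetical names, shows how a replica might track whether it holds a local prepare (or commit) certificate from every affected shard, i.e., a global certificate; it abstracts away message contents and signatures.

from collections import defaultdict

class PhaseTracker:
    def __init__(self, affected, g):
        self.affected = set(affected)   # shards(tau)
        self.g = g                      # g_S per shard identifier
        self.votes = defaultdict(set)   # shard identifier -> distinct senders seen

    def record(self, shard_id, sender):
        if shard_id in self.affected:
            self.votes[shard_id].add(sender)

    def local_certificate(self, shard_id) -> bool:
        # A local certificate needs g_S matching messages from distinct replicas of S.
        return len(self.votes[shard_id]) >= self.g[shard_id]

    def global_certificate(self) -> bool:
        # A global certificate needs one local certificate per affected shard.
        return all(self.local_certificate(s) for s in self.affected)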
To minimize inter-shard communication, one can utilize threshold signatures and cluster-sending
to carry over local prepare and commit certificates between shards via a few constant-sized messages.
The above three-phase global-Pbft protocol already takes care of the local input and cross-shard exchange steps. Indeed, a replica r ∈ S that finishes the global commit phase has accepted global preprepare certificate M, which contains all information of other shards to proceed with execution. At the same time, r also has confirmation that M is prepared by a majority of all good replicas in each shard S′ ∈ shards(τ) (which will eventually be followed by execution of τ within S′). With these ingredients in place, only the decide outcome step remains.
The decide outcome step at shard S is entirely determined by the global preprepare certificate M. Shard S decides to commit whenever I(S′, τ) = D(S′, τ) for all (⟨τ⟩_c, I(S′, τ), D(S′, τ)) ∈ M. Otherwise, it decides abort. If S decides commit, then all good replicas in S destruct all objects in D(S, τ) and construct all objects o ∈ Outputs(τ) with S = shard(o). Finally, each good replica informs c of the outcome of execution. If c receives, from every shard S′ ∈ shards(τ), identical outcomes from g_{S′} − f_{S′} distinct replicas in S′, then it considers τ to be successfully executed. In Figure 3, we sketched the working of OCerberus.
Due to the similarity between OCerberus and CCerberus, it is straightforward to use the
details of Theorem 4.3 to prove that OCerberus provides validity, shard-involvement, and shard-
applicability. Next, we will focus on how OCerberus provides cross-shard-consistency. As a first
step, we illustrate the ways in which the normal-case of OCerberus can fail (e.g., due to malicious
behavior of clients, failing replicas, or unreliable communication).
Figure 3: The message flow of OCerberus for a 3-shard client request ⟨τ⟩_c that is committed. [Figure: client c and shards S1, S2, S3; the preprepare, prepare, and commit phases realize the local inputs and cross-shard exchange steps via global consensus, after which each shard decides commit/abort and informs the client.]
Example 5.1. Consider a transaction τ proposed by client c and affecting shard S ∈ shards(τ). First, we consider the case in which P(S) is malicious and tries to set up a coordinated attack. To have maximum control over the steps of OCerberus, the primary sends the message m(S, τ)_{v,ρ} to only g_{S′′} − f_{S′′} good replicas in each shard S′′ ∈ shards(τ). By doing so, P(S) can coordinate the faulty replicas in each shard to assure failure of any phase at any replica r ∈ S′, S′ ∈ shards(τ):

1. To prevent r from finishing the global preprepare phase (and start the global prepare phase) for an M with m(S, τ)_{v,ρ} ∈ M, P(S) simply does not send m(S, τ)_{v,ρ} to r.

2. To prevent r from finishing the global prepare phase (and start the global commit phase) for M, P(S) instructs the faulty replicas in F(S′) to not send prepare messages for M to r. Hence, r will receive at most g_{S′} − f_{S′} prepare messages for M from replicas in S′, assuring that it will not receive a local prepare certificate P(S′) and will not finish the global prepare phase for M.

3. Likewise, to prevent r from finishing the global commit phase (and start execution) for M, P(S) instructs the faulty replicas in F(S′) to not send commit messages to r. Hence, r will receive at most g_{S′} − f_{S′} commit messages for M from replicas in S′, assuring that it will not receive a local commit certificate C(S′) and will not finish the global commit phase for M.
None of the above attacks can be attributed to faulty behavior of P(S). First, unreliable communication can result in the same outcomes for r. Furthermore, even if communication is reliable and P(S) is good, we can see the same outcomes:

1. The client c can be malicious and not send τ to S. At the same time, all other primaries P(S′′) of shards S′′ ∈ shards(τ) can be malicious and not send anything to S either. In this case, P(S) will never be able to send any message m(S, τ)_{v,ρ} to r, as no replica in S is aware of τ.

2. If any primary P(S′′) of S′′ ∈ shards(τ) is malicious, then it can prevent some replicas in S′ from starting the global prepare phase, thereby preventing these replicas from sending prepare messages to r. If P(S′′) prevents sufficient replicas in S′ from starting the global prepare phase, r will be unable to finish the global prepare phase.

3. Likewise, any malicious primary P(S′′) of S′′ ∈ shards(τ) can prevent replicas in S′ from starting the global commit phase, thereby assuring that r will be unable to finish the global commit phase.
To deal with malicious behavior, OCerberus needs a robust recovery mechanism. We cannot
simply build that mechanism on top of traditional view-change approaches: these traditional view-
change approaches require that one can identify a single source of failure (when communication is
reliable), namely the current primary. As Example 5.1 already showed, this property does not hold
for OCerberus. To remedy this, the recovery mechanism of OCerberus has components that perform local view-changes and components that perform global state recovery. The pseudo-code for this recovery protocol can be found in Figure 4. Next, we describe the working of this recovery protocol in detail.

1: event r ∈ S is unable to finish round ρ of view v do
2:    if r finished in round ρ the global prepare phase for M, but is unable to finish the global commit phase then
3:        Let P be the global prepare certificate of r for M.
4:        if r has a local commit certificate C(S′′) for M then
5:            for S′ ∈ shards(τ) do
6:                if r did not yet receive a local commit certificate C(S′) then
7:                    Broadcast ⟨VCGlobalSCR : M, P, C(S′′)⟩ to all replicas in S′.
8:        else
9:            Detect the need for local state recovery of round ρ of view v (Figure 5).
10:   else
11:       Detect the need for local state recovery of round ρ of view v (Figure 5).
12:   (Eventually repeat this event if r remains unable to finish round ρ.)
13: event r′ ∈ S′ receives a message ⟨VCGlobalSCR : M, P, C(S′′)⟩ from r ∈ S do
14:   if r′ did not reach the global commit phase for M then
15:       Use M, P, and C(S′′) to reach the global commit phase for M.
16:   else
17:       Send a commit message for M to r.

Figure 4: The view-change global short-cut recovery path that determines whether r already has the assurance that the current transaction will be committed. If this is the case, then r requests only the missing information to proceed with execution. Otherwise, r requires at least local recovery (Figure 5).
Let r ∈ S be a replica that determines that it cannot finish a round ρ of view v.

First, r determines whether it already has a guarantee on which transaction it has to execute in round ρ. This is the case when the following conditions are met: r finished the global prepare phase for M with m(S, τ)_{v,ρ} ∈ M and has received a local commit certificate C(S′′) for M from some shard S′′ ∈ shards(τ). In this case, r can simply request all missing local commit certificates directly, as C(S′′) can be used to prove to any involved replica r′ ∈ S′, S′ ∈ shards(τ), that r′ also needs to commit to M. To request such missing commit certificates of S′, replica r sends out VCGlobalSCR messages to all replicas in S′ (Line 7 of Figure 4). Any replica r′ that receives such a VCGlobalSCR message can use the information in that message to reach the global commit phase for M and, hence, provide r with the requested commit messages (Line 13 of Figure 4).
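In code, the short-cut reduces to one condition: a replica that holds a global prepare certificate and at least one local commit certificate already knows the outcome and only needs to collect the remaining commit certificates. The sketch below paraphrases Lines 1–7 of Figure 4 in Python under these assumptions; the certificate objects and the broadcast callback are hypothetical placeholders.

def global_shortcut(has_global_prepare, commit_certs, affected, broadcast):
    # commit_certs maps shard identifiers to the local commit certificates already received.
    if not has_global_prepare or not commit_certs:
        return "local-state-recovery"            # fall back to Figure 5
    proof = next(iter(commit_certs.values()))    # any C(S'') proves that M must commit
    for shard_id in affected:
        if shard_id not in commit_certs:
            broadcast(shard_id, ("VCGlobalSCR", proof))
    return "await-missing-commits"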
If r does not have a guarantee itself on which transaction it has to execute in round ρ, then it needs to determine whether any other replica (either in its own shard or in any other shard) has already received and acted upon such a guarantee. To initiate such local and global state recovery, r simply detects the current view as faulty. To do so, r broadcasts a VCRecoveryRQ message to all other replicas in S that contains all information r collected on round ρ in view v (Line 4 of Figure 5). Other replicas q ∈ S that already have guarantees for round ρ can help r by providing all missing information (Line 6 of Figure 5). On receipt of this information, r can proceed with the round (Line 7 of Figure 5). If no replicas can provide the missing information, then eventually all good replicas will detect the need for local recovery, either by themselves (Line 1 of Figure 5) or after receiving VCRecoveryRQ messages of at least f_S + 1 distinct replicas in S, of which at least a single replica must be good (Line 10 of Figure 5).

Finally, if a replica r receives g_S VCRecoveryRQ messages, then it has the guarantee that at least g_S − f_S ≥ f_S + 1 of these messages come from good replicas in S. Hence, due to Line 10 of Figure 5,
all g_S good replicas in S will send VCRecoveryRQ and, when communication is reliable, also receive these messages. Consequently, at this point, r can start the new view by electing a new primary and awaiting the NewView proposal of this new primary (Line 12 of Figure 5). If r is the new primary, then it starts the new view by proposing a NewView. As other shards could have already made final decisions depending on local prepare or commit certificates of S for round ρ, we need to assure that such certificates are not invalidated. To figure out whether such final decisions have been made, the new primary will query other shards S′ for their state whenever the NewView message contains global preprepare certificates for transactions τ, S′ ∈ shards(τ), but not a local commit certificate to guarantee execution of τ (Line 17 of Figure 5).

1: event r ∈ S detects the need for local state recovery of round ρ of view v do
2:    Let M be the latest global preprepare certificate accepted for round ρ by r (if any).
3:    Let S be M and any prepare and commit certificates for M collected by r.
4:    Broadcast ⟨VCRecoveryRQ : v, ρ, S⟩.
5: event q ∈ S receives message ⟨VCRecoveryRQ : v, ρ, S⟩ of r ∈ S and q has
      1. started the global prepare phase for M with m(S, τ)_{w,ρ} ∈ M;
      2. a global prepare certificate P for M;
      3. a local commit certificate C(S′′) for M
   do
6:    Send ⟨VCLocalSCR : M, P, C(S′′)⟩ to r ∈ S.
7: event r ∈ S receives a message ⟨VCLocalSCR : M, P, C(S′′)⟩ from q ∈ S do
8:    if r did not reach the global commit phase for M then
9:        Use M, P, and C(S′′) to reach the global commit phase for M.
10: event r ∈ S receives messages ⟨VCRecoveryRQ : v_i, ρ, S_i⟩, 1 ≤ i ≤ f_S + 1, from distinct replicas in S do
11:   r detects the need for local state recovery of round ρ of view min{v_i | 1 ≤ i ≤ f_S + 1}.
12: event r ∈ S receives messages ⟨VCRecoveryRQ : v, ρ, S_i⟩, 1 ≤ i ≤ g_S, from distinct replicas in S do
13:   if id(r) ≠ (v + 1) mod n_S then
14:       (r awaits the NewView message of the new primary, Line 15 of Figure 6.)
15:   else
16:       Broadcast ⟨NewView : ⟨VCRecoveryRQ : v, ρ, S_i⟩ | 1 ≤ i ≤ g_S⟩ to all replicas in S.
17:       if there exists an S_i that contains global preprepare certificate M, but no S_j contains a local commit certificate for M then
18:           r initiates global state recovery of round ρ (Line 1 of Figure 6).

Figure 5: The view-change local short-cut recovery path that determines whether some q can provide r with the assurance that the current transaction will be committed. If this is the case, then r only needs this assurance; otherwise S requires a new view (Figure 6).
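The two thresholds in this process (f_S + 1 messages to adopt the detection, g_S messages to move to the new view) can be captured in a few lines. The sketch below is our own summary of Lines 10–16 of Figure 5; it ignores message payloads and view numbers.

def recovery_status(senders, f_s, g_s) -> str:
    # senders: distinct replicas of S from which VCRecoveryRQ messages were received.
    if len(senders) >= g_s:
        return "start-new-view"          # elect a new primary and await its NewView (Line 12)
    if len(senders) >= f_s + 1:
        return "detect-local-recovery"   # at least one good replica detected a failure (Line 10)
    return "wait"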
The new-view process has three stages. First, the new primary p proposes the new view via a NewView message (Line 12 of Figure 5). If necessary, the new primary p also requests the relevant global state from any relevant shard (Line 1 of Figure 6). The replicas in other shards will respond to this request with their local state (Line 9 of Figure 6). The new primary collects these responses
and sends them to all replicas in S via a NewViewGlobal message.

Then, after p sends the NewView message to r ∈ S, r determines whether the NewView message contains sufficient information to recover round ρ (Line 16 of Figure 6), contains sufficient information to wait for any relevant global state (Line 18 of Figure 6), or to determine that the new primary must propose for round ρ (Line 21 of Figure 6). If r determines it needs to wait for any relevant global state, then r will wait for this state to arrive via a NewViewGlobal message. Based on the received global state, r determines to recover round ρ (Line 23 of Figure 6), or determines that the new primary must propose for round ρ (Line 26 of Figure 6).

1: event p ∈ S initiates global state recovery of round ρ using ⟨NewView : V⟩ do
2:    Let T be the transactions with global preprepare certificates for round ρ of S in V.
3:    Let 𝒮 be the shards affected by transactions in T.
4:    Broadcast ⟨VCGlobalStateRQ : v, ρ, V⟩ to all replicas in ⋃𝒮.
5:    for S′ ∈ 𝒮 do
6:        Wait for VCGlobalStateR messages for V from g_{S′} distinct replicas in S′.
7:        Let W(S′) be the set of received VCGlobalStateR messages.
8:    Broadcast ⟨NewViewGlobal : V, {W(S′) | S′ ∈ 𝒮}⟩ to all replicas in S.
9: event r′ ∈ S′ receives message ⟨VCGlobalStateRQ : v, ρ, V⟩ from p ∈ S do
10:   if r′ has a global preprepare certificate M with m(S, τ)_{w,ρ} ∈ M and reached the global commit phase for M then
11:       Let P be the global prepare certificate for M.
12:       Send ⟨VCGlobalStateR : v, ρ, V, M, P⟩ to p.
13:   else
14:       Send ⟨VCGlobalStateR : v, ρ, V⟩ to p.
15: event r ∈ S receives a valid ⟨NewView : V⟩ message from replica p do
16:   if there exists a ⟨VCRecoveryRQ : v_i, ρ, S_i⟩ ∈ V that contains a global preprepare certificate M with m(S, τ)_{w,ρ} ∈ M, a global prepare certificate P for M, and a local commit certificate C(S′′) for M then
17:       Use M, P, and C(S′′) to reach the global commit phase for M.
18:   else if there exists a ⟨VCRecoveryRQ : v_i, ρ, S_i⟩ ∈ V that contains a global preprepare certificate M, but no ⟨VCRecoveryRQ : v_j, ρ, S_j⟩ ∈ V contains a local commit certificate for M then
19:       r detects the need for global state recovery of round ρ (Line 22 of Figure 6).
20:   else
21:       (p must propose for round ρ.)
22: event r ∈ S receives a valid ⟨NewViewGlobal : V, W⟩ from p ∈ S do
23:   if any message in W is of the form ⟨VCGlobalStateR : v, ρ, V, M, P⟩ then
24:       Select ⟨VCGlobalStateR : v, ρ, V, M, P⟩ ∈ W with latest view w, m(S, τ)_{w,ρ} ∈ M.
25:       Use M and P to reach the global commit phase for M.
26:   else
27:       (p must propose for round ρ.)

Figure 6: The view-change new-view recovery path that recovers the state of the previous view based on a NewView proposal of the new primary. As part of the new-view recovery path, the new primary can construct a global new-view that contains the necessary information from other shards to reconstruct the local state.
Next, we shall prove the correctness of the view-change protocol outlined in Figures 4, 5, and 6.
First, using a standard quorum argument, we prove that in a single round of a single view of S, only a single global preprepare message affecting S can get committed by any other affected shards:

Lemma 5.2. Let τ1 and τ2 be transactions with S ∈ (shards(τ1) ∩ shards(τ2)). If g_S > 2 f_S and there exist shards S_i ∈ shards(τ_i), i ∈ {1, 2}, such that good replicas r_i ∈ G(S_i) reached the global commit phase for global preprepare certificate M_i with m(S, τ_i)_{v,ρ} ∈ M_i, then τ1 = τ2.

Proof. We prove this property using contradiction. We assume τ1 ≠ τ2. Let P_i(S) be the local prepare certificate provided by S for M_i and used by r_i to reach the global commit phase, let S_i ⊆ S be the g_S replicas in S that provided the prepare messages in P_i(S), and let T_i = S_i \ F(S) be the good replicas in S_i. By construction, we have |T_i| ≥ g_S − f_S. As all replicas in T1 ∪ T2 are good, they will only send out a single prepare message per round ρ of view v. Hence, if τ1 ≠ τ2, then T1 ∩ T2 = ∅, and we must have 2(g_S − f_S) ≤ |T1 ∪ T2|. As all replicas in T1 ∪ T2 are good, we also have |T1 ∪ T2| ≤ g_S. Hence, 2(g_S − f_S) ≤ g_S, which simplifies to g_S ≤ 2 f_S, a contradiction. Hence, we conclude τ1 = τ2.
Next, we use Lemma 5.2 to prove that any global preprepare certificate that could have been accepted by any good affected replica is preserved by OCerberus:

Proposition 5.3. Let τ be a transaction and m(S, τ)_{v,ρ} be a preprepare message. If, for all shards S′, g_{S′} > 2 f_{S′}, and there exists a shard S′ ∈ shards(τ) such that g_{S′} − f_{S′} good replicas in S′ reached the global commit phase for M with m(S, τ)_{v,ρ} ∈ M, then every successful future view of S will recover M and assure that the good replicas in S reach the commit phase for M.

Proof. Let v′ ≥ v be the first view in which a global preprepare certificate M with m(S, τ)_{v′,ρ} ∈ M satisfied the premise of this proposition. Using induction on the number of views after the first view v′, we will prove the following two properties on M:

1. every good replica that participates in view w, v′ < w, will recover M upon entering view w and reach the commit phase for M; and

2. no replica will be able to construct a local prepare certificate of S for any global preprepare certificate M′ ≠ M with m(S, τ′)_{w,ρ} ∈ M′, v′ < w.
The base case is view v′ + 1. Let Q ⊆ G(S′) be the set of g_{S′} − f_{S′} good replicas in S′ that reached the global commit phase for M. Each replica r ∈ Q has a local prepare certificate P(S) consisting of g_S prepare messages for M provided by replicas in S. We write Q(r) ⊆ G(S) to denote the at least g_S − f_S good replicas in S that provided such a prepare message to r.

Consider any valid new-view proposal ⟨NewView : V⟩ for view v′ + 1. If the conditions of Line 16 of Figure 6 hold for some global preprepare certificate M′ with m(S, τ′)_{w,ρ} ∈ M′, then we recover M′. As there is a local commit certificate for M′ in this case, the premise of this proposition holds on M′. As v′ is the first view in which the premise of this proposition holds, we can use Lemma 5.2 to conclude that w = v′, M′ = M, and, hence, that the base case holds if the conditions of Line 16 of Figure 6 hold. Next, we assume that the conditions of Line 16 of Figure 6 do not hold, in which case M can only be recovered via global state recovery. The first step in global state recovery is proving that the condition of Line 18 of Figure 6 holds. Let T ⊆ G(S) be the set of at least g_S − f_S good replicas in S whose VCRecoveryRQ message is in V and let r ∈ Q. We have |Q(r)| ≥ g_S − f_S and |T| ≥ g_S − f_S. Hence, by a standard quorum argument, we conclude Q(r) ∩ T ≠ ∅. Let q ∈ (Q(r) ∩ T). As q is good and sent prepare messages for M, it must have reached the global prepare phase for M. Consequently, the condition of Line 18 of Figure 6 holds and, to complete the proof, we only need to prove that any well-formed NewViewGlobal message will recover M.

Let ⟨NewViewGlobal : V, W⟩ be any valid global new-view proposal for view v′ + 1. As q reached the global prepare phase for M, any valid global new-view proposal must include messages from S′ ∈ shards(τ). Let U ⊆ S′ be the replicas in S′ of whom messages VCGlobalStateR are included in W. Let V′ = U \ F(S′). We have |Q| ≥ g_{S′} − f_{S′} and |V′| ≥ g_{S′} − f_{S′}. Hence, by a standard quorum argument, we conclude Q ∩ V′ ≠ ∅. Let q′ ∈ (Q ∩ V′). As q′ reached the global commit phase for M, it will meet the conditions of Line 25 of Figure 6 and provide both M and a global prepare certificate for M. Let M′ be any other global preprepare certificate in W accompanied by a global prepare certificate. Due to Line 24 of Figure 6, the global preprepare certificate for the newest view of S will be recovered. As v′ is the newest view of S, M′ will only prevent recovery of M if it is also a global preprepare certificate for view v′ of S. In this case, Lemma 5.2 guarantees that M′ = M. Hence, any replica r will recover M upon receiving ⟨NewViewGlobal : V, W⟩.
Now assume that the induction hypothesis holds for all views j, v′ < j ≤ i. We will prove that the induction hypothesis holds for view i + 1. Consider any valid new-view proposal ⟨NewView : V⟩ for view i + 1 and let M′ with m(S, τ′)_{w,ρ} ∈ M′ be any global preprepare certificate that is recovered due to the new-view proposal ⟨NewView : V⟩. Hence, M′ is recovered via either Line 17 of Figure 6 or Line 25 of Figure 6. In both cases, there must exist a global prepare certificate P′ for M′. As ⟨NewView : V⟩ is valid, we must have w ≤ i. Hence, we can apply the second property of the induction hypothesis to conclude that w ≤ v′. If w = v′, then we can use Lemma 5.2 to conclude that M′ = M. Hence, to complete the proof, we must show that w = v′. First, the case in which M′ is recovered via Line 17 of Figure 6. Due to the existence of a global commit certificate C for M′, M′ satisfies the premise of this proposition. By assumption, v′ is the first view for which the premise of this proposition holds. Hence, w ≥ v′, in which case we conclude M′ = M. Last, the case in which M′ is recovered via Line 25 of Figure 6. In this case, M′ is recovered via some message ⟨NewViewGlobal : V, W⟩. Analogous to the proof for the base case, V will contain a message VCRecoveryRQ from some replica q ∈ Q(r). Due to Line 2 of Figure 5, q will provide information on M. Consequently, a prepare certificate for M will be obtained via global state recovery, and we also conclude M′ = M.
Lemma 5.2 and Proposition 5.3 are technical properties that assure that no transaction that could be committed by any replica will ever get lost by the system. Next, we bootstrap these technical properties to prove that all good replicas can always recover such could-be-committed transactions.
Proposition 5.4. Let τ be a transaction and m(S, τ)_{v,ρ} be a preprepare message. If, for all shards S′, g_{S′} > 2 f_{S′}, and there exists a shard S′ ∈ shards(τ) such that g_{S′} − f_{S′} good replicas in S′ reached the global commit phase for M with m(S, τ)_{v,ρ} ∈ M, then every good replica in S will accept M whenever communication becomes reliable.

Proof. Let r ∈ S be a good replica that is unable to accept M. At some point, communication becomes reliable, after which r will eventually trigger Line 1 of Figure 4. We have the following cases:

1. If r meets the conditions of Line 4 of Figure 4, then r has a local commit certificate C(S′′), S′′ ∈ shards(τ). This local commit certificate certifies that at least g_{S′′} − f_{S′′} good replicas in S′′ finished the global prepare phase for M. Hence, the conditions for Proposition 5.3 are met for M and, hence, any shard in shards(τ) will maintain or recover M. Replica r can use C(S′′) to prove this situation to other replicas, forcing them to commit to M, and provide any commit messages r is missing (Line 13 of Figure 4).

2. If r does not meet the conditions of Line 4 of Figure 4, but some other good replica q ∈ S does, then q can provide all missing information to r (Line 6 of Figure 5). Next, r uses this information (Line 7 of Figure 5), after which it meets the conditions of Line 4 of Figure 4.

3. Otherwise, if the above two cases do not hold, then all g_S good replicas in S are unable to finish the commit phase. Hence, they perform a view-change. Due to Proposition 5.3, this view-change will succeed and put every replica in S into the commit phase for M. As all good replicas in S are in the commit phase, each good replica in S will be able to make a local commit certificate C(S) for M, after which they meet the conditions of Line 4 of Figure 4.
Finally, we use Proposition 5.4 to prove cross-shard-consistency.

Theorem 5.5. Optimistic-Cerberus maintains cross-shard-consistency.

Proof. Assume a single good replica r ∈ S commits or aborts a transaction τ. Hence, it accepted some global preprepare certificate M with m(S, τ)_{v,ρ} ∈ M. Consequently, r has local commit certificates C(S′) for M of every S′ ∈ shards(τ). Hence, at least g_{S′} − f_{S′} good replicas in S′ reached the global commit phase for M, and we can apply Proposition 5.4 to conclude that any good replica r′′ ∈ S′′, S′′ ∈ shards(τ), will accept M. As r′′ bases its commit or abort decision for τ on the same global preprepare certificate M as r, they will both make the same decision, completing the proof.
As already argued, it is straightforward to use the details of Theorem 4.3 to prove that OCer-
berus provides validity, shard-involvement, and shard-applicability. Via Theorem 5.5, we proved cross-shard-consistency. We cannot prove service and confirmation, however. The reason for this is
simple: even though OCerberus can detect and recover from accidental faulty behavior and acci-
dental concurrent transactions, OCerberus is not designed to gracefully handle targeted attacks.
Example 5.6. Recall the situation of Example 4.1. Next, we illustrate how OCerberus deals
with these concurrent transactions. We again consider distinct transactions ⟨τ1⟩_{c1} and ⟨τ2⟩_{c2} with
Inputs(τ1) = Inputs(τ2) = {o1, o2} and with shard(o1) = S1 and shard(o2) = S2. We assume that
S1 processes τ1 first and S2 processes τ2 first.
The primary P(S1) will propose τ1 by prepreparing m(S1, τ1)_{v1,ρ1}. In doing so, P(S1) sends
m(S1, τ1)_{v1,ρ1} to all replicas in S1 ∪ S2. Next, the replicas in S1 will wait for a message
m(S2, τ1)_{v2,ρ2} from S2. At the same time, P(S2) proposed τ2 by sending out m(S2, τ2)_{v2,ρ2},
and the replicas in S2 will wait for a message m(S1, τ2)_{v1,ρ1}. Hence, the replicas in S1 will never
receive m(S2, τ1)_{v2,ρ2} and the replicas in S2 will never receive m(S1, τ2)_{v1,ρ1}. Consequently, no
replica will finish global preprepare, the consensus round will fail for all replicas, and all good replicas
will initiate a view-change. As no replica reached the global prepare phase, transactions τ1 and τ2 do
not need to be recovered during the view-change. After the view-changes, both S1 and S2 can process
other transactions (or retry τ1 or τ2), but if they both process τ1 and τ2 again, the system will again
initiate a view-change.
As said before, OCerberus is optimistic in the sense that it is optimized for the situation in
which faulty behavior (including concurrent transactions) is rare. Still, in all cases, OCerberus
maintains cross-shard consistency. Moreover, in the optimistic case in which shards have
good primaries and no concurrent transactions exist, progress is guaranteed whenever communication
is reliable:
Proposition 5.7. If, for all shards S, g_S > 2f_S, and Assumptions 2.1, 2.2, 2.3, and 2.4 hold,
then Optimistic-Cerberus satisfies Requirements R1–R6 in the optimistic case.
If the optimistic assumption does not hold, then this can result in coordinated attempts to prevent
OCerberus from making progress. At the core of such attacks is the ability for malicious clients and
malicious primaries to corrupt the operations of shards coordinated by good primaries, as already
shown in Example 5.1. To reduce the impact of targeted attacks, one can opt to make primary
election non-deterministic, e.g., by using shard-specific distributed coins to elect new primaries in
individual shards [11, 13].
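As an illustration of such non-deterministic primary election, consider the following minimal Python sketch. It assumes a shard-local random beacon whose value is agreed upon by all good replicas in the shard; the coin_value parameter and the replica names are hypothetical and not part of Cerberus itself.

def elect_primary(view: int, coin_value: int, replicas: list[str]) -> str:
    """Pick the primary of a new view from an agreed-upon random coin value,
    instead of the usual deterministic round-robin over the view number."""
    return replicas[(view + coin_value) % len(replicas)]

# Example: all good replicas in a seven-replica shard derive the same,
# unpredictable primary for view 42 once the shared coin value is known.
replicas = [f"r{i}" for i in range(7)]
print(elect_primary(42, coin_value=1234567, replicas=replicas))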
As a final note, we remark that we have presented OCerberus with a per-round checkpoint
and recovery method. In this simplified design, the recovery path only has to recover at-most a
single round. Our approach can easily be generalized to a more typical multi-round checkpoint and
recovery method, however. Furthermore, we believe that the way in which OCerberus extends
Pbft can easily be generalized to other consensus protocols, e.g., HotStuff.
[Figure: message-flow diagram. The client c sends ⟨τ⟩_c to the shards S1, S2, and S3; each shard performs Local Inputs (consensus on ⟨τ⟩_c, followed by destruction), Cross-Shard Exchange (cluster-sending), and Decide Outcome (consensus on Commit/Abort, followed by construction or rollback), after which the shards inform c.]
Figure 7: The message flow of PCerberus for a 3-shard client request ⟨τ⟩_c that is committed.
6 Pessimistic-Cerberus: transaction processing under attack
In the previous section, we introduced OCerberus, a general-purpose minimalistic and efficient
multi-shard transaction processing protocol. OCerberus is designed with the assumption that
malicious behavior is rare, due to which it can minimize coordination in the normal-case while
requiring intricate coordination when recovering from attacks. As an alternative to the optimistic
approach of OCerberus, we can apply a pessimistic approach to CCerberus that gracefully recovers
from concurrent transactions and is geared towards minimizing the influence of malicious behavior
altogether. Next, we explore such a pessimistic design via Pessimistic-Cerberus (PCerberus).
The design of PCerberus builds upon the design of CCerberus by adding additional coordination
to the cross-shard exchange and decide outcome steps. As in CCerberus, the acceptance
of m(S, τ)_ρ in round ρ by all good replicas completes the local inputs step. Before cross-shard
exchange, the replicas in S destruct the objects in D(S, τ), thereby fully pledging these objects
to τ until the commit or abort decision. Then, S performs cross-shard exchange by broadcasting
m(S, τ)_ρ to all other shards in shards(τ), while the replicas in S wait until they receive messages
m(S′, τ)_{ρ′} = (⟨τ⟩_c, I(S′, τ), D(S′, τ)) from all other shards S′ ∈ shards(τ).
After cross-shard exchange comes the final decide outcome step. After S receives m(S′, τ)_{ρ′} from
all shards S′ ∈ shards(τ), the replicas force a second consensus step that determines the round ρ′′
at which S decides commit (whenever I(S′, τ) = D(S′, τ) for all S′ ∈ shards(τ)) or abort. If S
decides commit, then, in round ρ′′, all good replicas in S construct all objects o ∈ Outputs(τ) with
S = shard(o). If S decides abort, then, in round ρ′′, all good replicas in S reconstruct all objects in
D(S, τ) (rollback). Finally, each good replica informs c of the outcome of execution. If c receives,
from every shard S′ ∈ shards(τ), identical outcomes from g_{S′} − f_{S′} distinct replicas in S′, then it
considers τ to be successfully executed. In Figure 7, we sketched the working of PCerberus.
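To make these steps concrete, the following Python sketch shows one shard's handling of a multi-shard transaction: destruct (pledge) the locally-owned inputs, exchange (⟨τ⟩_c, I(S, τ), D(S, τ)) with the other involved shards, and then decide commit (constructing the local outputs) or abort (rolling back the destructed objects). Consensus and cluster-sending are abstracted away, and all names are illustrative rather than part of a reference implementation.

def process_multi_shard_transaction(stored, local_inputs, local_outputs, exchanged):
    """One shard's view of PCerberus for a multi-shard transaction.

    stored:        set of objects currently stored by this shard
    local_inputs:  inputs of the transaction owned by this shard, I(S, txn)
    local_outputs: outputs of the transaction owned by this shard
    exchanged:     {other shard: (I, D)} received via cross-shard exchange
    """
    # Local inputs step (after the first consensus step): determine which of
    # the claimed inputs still exist and destruct them, pledging them to the
    # transaction until the commit or abort decision.
    I = set(local_inputs)
    D = {o for o in I if o in stored}
    stored -= D

    # Decide outcome step (second consensus step): commit only if every
    # involved shard could destruct exactly the inputs the transaction claims.
    commit = all(i == d for (i, d) in list(exchanged.values()) + [(I, D)])

    if commit:
        stored |= set(local_outputs)   # construct the local outputs
    else:
        stored |= D                    # rollback: reconstruct the destructed objects
    return ("commit" if commit else "abort"), stored

In Example 6.1 below, this check fails for both transactions, as one of the exchanged (I, D) pairs has D = ∅ while I is non-empty, so both shards abort.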
We notice that processing a multi-shard transaction via PCerberus requires two consensus steps
per shard. In some cases, we can eliminate the second step, however. First, if τ is a multi-shard
transaction with S ∈ shards(τ) and the replicas in S accept (⟨τ⟩_c, I(S, τ), D(S, τ)) with I(S, τ) ≠
D(S, τ), then the replicas can immediately decide abort upon this acceptance. Second, if τ is a
single-shard transaction with shards(τ) = {S}, then the replicas in S can immediately decide commit
or abort whenever they accept (⟨τ⟩_c, I(S, τ), D(S, τ)). Hence, in both cases, processing τ at S only
requires a single consensus step at S.
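The two fast-path cases above amount to a simple local check on the accepted proposal, as sketched below in Python; the function name is illustrative.

def needs_second_consensus_step(num_shards: int, I: set, D: set) -> bool:
    """Return True if the decide outcome step requires its own consensus step.

    A single-shard transaction, or a multi-shard transaction whose accepted
    proposal already shows I(S, txn) != D(S, txn), can be decided immediately."""
    if num_shards == 1:
        return False   # single-shard: decide commit or abort right away
    if I != D:
        return False   # a claimed input was already gone: decide abort right away
    return True        # otherwise, wait for the cross-shard exchange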
Next, we illustrate how PCerberus deals with concurrent transactions.
Example 6.1. Recall the situation of Example 4.1. Next, we illustrate how PCerberus deals
with these concurrent transactions. We again consider distinct transactions ⟨τ1⟩_{c1} and ⟨τ2⟩_{c2} with
Inputs(τ1) = Inputs(τ2) = {o1, o2} and with shard(o1) = S1 and shard(o2) = S2. We assume that
S1 processes τ1 first and S2 processes τ2 first.
Shard S1 will start by destructing o1 and sending (⟨τ1⟩_{c1}, {o1}, {o1}) to S2. At the same time,
S2 follows similar steps for τ2 and sends (⟨τ2⟩_{c2}, {o2}, {o2}) to S1. While S1 is waiting for
information on τ1 from S2, it receives τ2 and starts processing it. Shard S1 directly determines that
o1 no longer exists. Hence, it sends (⟨τ2⟩_{c2}, {o1}, ∅) to S2. Likewise, S2 will start processing τ1,
sending (⟨τ1⟩_{c1}, {o2}, ∅) to S1 as a result.
After the above exchange, both S1 and S2 conclude that transactions τ1 and τ2 must be aborted,
which they eventually both do, after which o1 is restored in S1 and o2 is restored in S2.
We notice that this situation leads to both transactions being aborted. Furthermore, we see
that even though transactions get aborted, individual replicas can all determine whether their shard
performed the necessary steps and, hence, whether their primary operated correctly. Next, we prove
the correctness of PCerberus:
Theorem 6.2. If, for all shards S, g_S > 2f_S, and Assumptions 2.1, 2.2, 2.3, and 2.4 hold, then
Pessimistic-Cerberus satisfies Requirements R1–R6.
Proof. Let τ be a transaction. As good replicas in S discard τ if it is invalid or if S ∉ shards(τ),
PCerberus provides validity and shard-involvement. Next, shard-applicability follows directly from
the decide outcome step.
If a shard S commits or aborts transaction τ, then it must have completed the decide outcome
and cross-shard exchange steps. As S completed cross-shard exchange, all shards S′ ∈ shards(τ)
must have exchanged the necessary information to S. By relying on cluster-sending for cross-shard
exchange, S′ requires cooperation of all good replicas in S′ to exchange the necessary information
to S. Hence, we have the guarantee that these good replicas will also perform cross-shard exchange
to any other shard S′′ ∈ shards(τ). Hence, every shard S′′ ∈ shards(τ) will receive the same
information as S, complete cross-shard exchange, and make the same decision during the decide
outcome step, providing cross-shard consistency.
A client can force service on a transaction τ by choosing a shard S ∈ shards(τ) and sending
τ to all good replicas in G(S). By doing so, the normal mechanisms of consensus can be used by
the good replicas in G(S) to force acceptance of τ in S and, hence, to bootstrap acceptance of
τ in all shards S′ ∈ shards(τ). Due to cross-shard consistency, every shard in shards(τ) will perform
the necessary steps to eventually inform the client. As all good replicas r ∈ S′, S′ ∈ shards(τ),
will inform the client of the outcome for τ, the majority of these inform-messages come from good
replicas, enabling the client to reliably derive the true outcome. Hence, PCerberus provides service
and confirmation.
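The confirmation step used at the end of this proof can be made concrete with a small client-side check: the client waits, for every shard S ∈ shards(τ), for identical outcomes from g_S − f_S distinct replicas. The following Python sketch assumes the client knows these per-shard thresholds; all names are illustrative.

from collections import Counter

def confirmed_outcome(responses, thresholds):
    """Derive the outcome of a transaction from per-shard replica responses.

    responses:  shard name -> list of outcomes reported by distinct replicas
    thresholds: shard name -> required number of identical outcomes (g_S - f_S)
    Returns the agreed outcome, or None if some shard has not yet confirmed it."""
    outcomes = set()
    for shard, answers in responses.items():
        outcome, count = Counter(answers).most_common(1)[0]
        if count < thresholds[shard]:
            return None        # not enough identical answers from this shard yet
        outcomes.add(outcome)
    # Cross-shard consistency guarantees that all shards report the same decision.
    return outcomes.pop() if len(outcomes) == 1 else None

# Example: two shards of seven replicas each, so g_S - f_S = 5 - 2 = 3.
print(confirmed_outcome(
    {"S1": ["commit", "commit", "commit"],
     "S2": ["commit", "commit", "commit", "abort"]},
    {"S1": 3, "S2": 3}))       # -> commit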
7 The strengths of Cerberus
Having introduced the three variants of Cerberus in Sections 4, 5, and 6, we will now analyze
the strengths and performance characteristics of each of the variants. First, we will show that
Cerberus provides serializable execution [6, 9]. Second, we look at the ability of Cerberus to
maximize per-shard throughput by supporting out-of-order processing. Finally, we compare the
costs, the attainable performance, and the scalability of the three protocols.
7.1 The ordering of transactions in Cerberus
The data model utilized by CCerberus, OCerberus, and PCerberus guarantees that any object
o can only be involved in at-most two committed transactions: one that constructs o and another
one that destructs o. Assume the existence of such transactions τ1 and τ2 with o ∈ Outputs(τ1) and
o ∈ Inputs(τ2). Due to cross-shard-consistency (Requirement R4), the shard shard(o) will have to
execute both τ1 and τ2. Moreover, due to shard-applicability (Requirement R3), the shard shard(o)
will execute τ1 strictly before τ2. Now consider the relation

→ := {(τ, τ′) | (the system committed to τ and τ′) ∧ (Outputs(τ) ∩ Inputs(τ′) ≠ ∅)}.
Obviously, we have (τ1, τ2) ∈ →. Next, we will prove that all committed transactions are executed in a
serializable ordering. As a first step, we prove the following:
Lemma 7.1. If we interpret transactions as nodes and → as an edge relation, then the resulting
graph is acyclic.
Proof. The proof is by contradiction. Let G be the graph-interpretation of →. We assume that
graph G is cyclic. Hence, there exist transactions τ0, . . . , τ_{m−1} such that (τ_i, τ_{i+1}) ∈ →, 0 ≤ i <
m − 1, and (τ_{m−1}, τ0) ∈ →. By the definition of →, we can choose objects o_i, 0 ≤ i < m, with
o_i ∈ (Outputs(τ_i) ∩ Inputs(τ_{(i+1) mod m})). Due to cross-shard-consistency (Requirement R4), the
shard shard(o_i), 0 ≤ i < m, executed transactions τ_i and τ_{(i+1) mod m}.
Consider o_i, 0 ≤ i < m, and let t_i be the time at which shard shard(o_i) executed τ_i and
constructed o_i. Due to shard-applicability (Requirement R3), we know that shard shard(o_i) executed
τ_{(i+1) mod m} strictly after t_i. Moreover, also shard shard(o_{(i+1) mod m}) must have executed
τ_{(i+1) mod m} strictly after t_i, and we derive t_i < t_{(i+1) mod m}. Hence, we must have t0 < t1 < · · · <
t_{m−1} < t0, a contradiction. Consequently, G must be acyclic.
To derive a serializable execution order for all committed transactions, we simply construct
a directed acyclic graph in which transactions are nodes and → is the edge relation. Next, we
topologically sort this graph to derive the desired ordering. Hence, we conclude:
Theorem 7.2. A sharded fault-tolerant system that uses the object-dataset data model, processes
UTXO-like transactions, and satisfies Requirements R1–R5 commits transactions in a serializable
order.
We notice that Cerberus only provides serializability for committed transactions. As we have
seen in Example 6.1, concurrent transactions are not executed in a serializable order, as they are
aborted. It is this flexibility in dealing with aborted transactions that allows all variants of Cerberus
to operate with minimal and fully-decentralized coordination between shards, while still
providing strong isolation for all committed transactions.
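The construction used for Theorem 7.2 can be written down directly: build the graph of committed transactions with an edge from τ to τ′ whenever Outputs(τ) ∩ Inputs(τ′) ≠ ∅, and topologically sort it (which is possible by Lemma 7.1). The following Python sketch uses only the standard library; the transaction names and object sets are illustrative.

from graphlib import TopologicalSorter  # Python 3.9+

def serializable_order(committed):
    """Topologically sort committed transactions along the -> relation.

    committed: transaction name -> (inputs, outputs), both sets of object names.
    There is an edge t1 -> t2 whenever Outputs(t1) intersects Inputs(t2); by
    Lemma 7.1 this graph is acyclic, so a topological order always exists."""
    ts = TopologicalSorter()
    for t2, (inputs2, _) in committed.items():
        predecessors = [t1 for t1, (_, outputs1) in committed.items()
                        if t1 != t2 and outputs1 & inputs2]
        ts.add(t2, *predecessors)
    return list(ts.static_order())

# Example: t1 constructs o3, which t2 consumes; hence t1 must precede t2.
print(serializable_order({"t1": ({"o1", "o2"}, {"o3"}),
                          "t2": ({"o3"}, {"o4"})}))  # -> ['t1', 't2']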
7.2 Out-of-order processing in Cerberus
In normal consensus-based systems, the latency for a single consensus decision is ultimately deter-
mined by the message delay δ. E.g., with the three-phase design of Pbft, it will take at least 3δ
before a transaction that arrives at the primary is executed by all replicas. To minimize the influ-
ence of message delay on throughput, some consensus-based systems support out-of-order decision
making in which the primary is allowed to maximize bandwidth usage by continuously proposing
transactions for future rounds (while previous rounds are processed by the replicas). To illustrate
this, one can look at fine-tuned implementations of Pbft running at replicas that have sufficient
memory buffers available [16, 30]. In this setting, replicas can work on several consensus rounds at
the same time by allowing the primary to propose for rounds within a window of rounds.
As the goal of Cerberus is to maximize performance—both in terms of latency (OCerberus)
and in terms of throughput—we have designed Cerberus to support out-of-order processing (if
provided by the underlying consensus protocol, in the case of CCerberus and PCerberus). The
only limitation to these out-of-order processing capabilities is with respect to transactions affecting
a shared object: such transactions must be proposed strictly in-order, as otherwise the set of pledged
inputs cannot be correctly determined by the good replicas. This is not a limitation for the normal-
case operations, however, as such concurrent transactions only happen due to malicious behavior.
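To make the out-of-order mechanism and its single limitation concrete, the sketch below shows a primary that proposes transactions for future rounds within a fixed window, while delaying any transaction whose inputs overlap with an in-flight proposal so that such transactions are proposed strictly in order. The window size and data structures are illustrative and not tied to any specific Pbft implementation.

class WindowedPrimary:
    """Propose transactions for rounds (last_executed, last_executed + window],
    keeping transactions that touch the same object strictly in proposal order."""

    def __init__(self, window: int = 16):
        self.window = window
        self.last_executed = 0          # highest round already executed by the replicas
        self.next_round = 1             # next round this primary proposes in
        self.pledged = set()            # inputs of proposals that are still in flight

    def try_propose(self, txn_id: str, inputs: set):
        in_window = self.next_round <= self.last_executed + self.window
        # Transactions affecting an object pledged to an in-flight proposal must
        # wait, so that the set of pledged inputs stays well-defined.
        if not in_window or inputs & self.pledged:
            return None
        round_number = self.next_round
        self.next_round += 1
        self.pledged |= inputs
        return round_number             # in a real system: send the preprepare

    def on_round_executed(self, round_number: int, inputs: set):
        self.last_executed = max(self.last_executed, round_number)
        self.pledged -= inputs          # executed inputs are no longer in flight

# Example: with a window of two rounds, a third proposal must wait for progress.
primary = WindowedPrimary(window=2)
print(primary.try_propose("t1", {"o1"}),
      primary.try_propose("t2", {"o2"}),
      primary.try_propose("t3", {"o3"}))  # -> 1 2 None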
7.3 A comparison of the three Cerberus variants
Finally, we compare the practical costs of the three Cerberus multi-shard transaction processing
protocols. First, in Figure 8, we provide a high-level comparison of the costs of each of the protocols
                      Normal-case complexity
Protocol name   Consensus   Exchange   Phases   Concurrent Transactions   View-changes
CCerberus       s           1          4        Objects pledged           Single-shard
OCerberus       s           3          3        View-change & Abort       Multi-shard
PCerberus       2s          1          7        Normal-case Abort         Single-shard
Figure 8: Comparison of the three Cerberus protocols for processing a transaction that affects s
shards. We compare the normal-case complexity, how they deal with concurrent transactions (due
to malicious clients), and how they deal with malicious primaries.
to process a single transaction τ that affects s = |shards(τ)| distinct shards. For the normal-case
behavior, we compare the complexity in the number of consensus steps per shard and the number of
cross-shard exchange steps between shards (which together determine the maximum throughput),
and the number of consecutive communication phases (which determines the minimum latency).
Next, we compare how the three protocols deal with malicious behavior by clients and by replicas.
If no clients behave maliciously, then all transactions will commit. In all three protocols, malicious
behavior by clients can lead to the existence of concurrent transactions that affect the same object.
Upon detection of such concurrent transactions, all three protocols will abort. The consequences of
such an abort are different in the three protocols.
In CCerberus, objects affected by aborted transactions remain pledged and cannot be reused.
In practice, this loss of objects can provide an incentive for clients to not behave maliciously, but
does limit the usability of CCerberus in non-incentivized environments. OCerberus is optimized
with the assumption that conflicting concurrent transactions are rare. When conflicts occur, they
can lead to the failure of a global consensus round, which can lead to a view-change in one or more
affected shards (even if none of the primaries is faulty). Finally, PCerberus deals with concurrent
transactions by aborting them via the normal-case of the protocol. To be able to do so, PCerberus
does require additional consensus steps, however.
The three Cerberus protocols are resilient against malicious replicas: only malicious primaries
can affect the normal-case operations of these protocols. If malicious primaries behave sufficiently
maliciously to affect the normal-case operations, their behavior is detected, and the primary is replaced.
In both CCerberus and PCerberus, dealing with a malicious primary in a shard can be done
completely in isolation of all other shards. In OCerberus, which is optimized with the assumption
that failures are rare, the failure of a primary while processing a transaction τ can lead to view-changes
in all shards affected by τ.
Finally, we illustrate the performance of Cerberus. To do so, we have modeled the maximum
throughput of each of these protocols in an environment where each shard has seven replicas (of
which two can be faulty) and each replica has a bandwidth of 100 Mbit/s. We have chosen to optimize
CCerberus, OCerberus, and PCerberus to minimize processing latencies over minimizing
bandwidth usage (e.g., we do not batch requests and the cross-shard exchange steps do not utilize
threshold signatures; with these techniques in place, we can boost throughput by a constant factor
at the cost of the per-transaction processing latency). In Figure 9, we have visualized the maximum
attainable throughput for each of the protocols as a function of the number of shards. In Figure 10,
we have visualized the number of per-shard steps performed by the system (for CCerberus and
OCerberus, this is equivalent to the number of per-shard consensus steps; for PCerberus, this is
half the number of per-shard consensus steps). As one can see from these figures, all three protocols
have excellent scalability: increasing the number of shards will increase the overall throughput of
the system. Sharding does come with clear overheads, however: increasing the number of shards also
increases the number of shards affected by each transaction, thereby increasing the overall number
of consensus steps. This is especially true for very large transactions that affect many objects (and
that can, therefore, affect many distinct shards).
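The shape of these curves can be reproduced with a simple back-of-the-envelope model. The following Python sketch is purely illustrative and is not the exact model behind Figures 9 and 10: the per-step cost, the transaction size, and the assumption that objects are assigned to uniformly random shards are all assumptions made for this example.

def expected_shards(objects_per_txn: int, num_shards: int) -> float:
    """Expected number of distinct shards touched by a transaction when each of
    its objects is placed on a uniformly random shard (a modeling assumption)."""
    return num_shards * (1.0 - (1.0 - 1.0 / num_shards) ** objects_per_txn)

def max_throughput(num_shards: int, objects_per_txn: int, protocol: str,
                   replicas_per_shard: int = 7, bandwidth_bps: float = 100e6,
                   txn_bytes: int = 512) -> float:
    """Rough system-wide transactions per second.

    Assumes each consensus step costs roughly replicas_per_shard * txn_bytes of
    primary bandwidth, and that PCerberus needs two consensus steps per involved
    shard while CCerberus and OCerberus need one."""
    steps_per_shard_per_txn = 2 if protocol == "PCerberus" else 1
    step_cost_bits = replicas_per_shard * txn_bytes * 8
    steps_per_second_per_shard = bandwidth_bps / step_cost_bits
    involved_shards = expected_shards(objects_per_txn, num_shards)
    total_steps_per_second = num_shards * steps_per_second_per_shard
    return total_steps_per_second / (involved_shards * steps_per_shard_per_txn)

# Throughput grows with the number of shards, but sub-linearly: larger systems
# spread the objects of each transaction over more distinct shards.
for shards in (1, 16, 256, 4096):
    print(shards, round(max_throughput(shards, objects_per_txn=4, protocol="CCerberus")))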
[Figure: six log-log panels, one per transaction size (2, 4, 8, 16, 32, and 64 obj/txn), each plotting the throughput (txn/s) of CCerberus, OCerberus, and PCerberus against the number of shards (2^0 up to 2^14).]
Figure 9: Throughput of the three Cerberus protocols as a function of the number of shards.
[Figure: two panels plotting, for 2, 4, 8, 16, 32, and 64 obj/txn, the total number of consensus steps and the number of consensus steps per shard against the number of shards (2^0 up to 2^14).]
Figure 10: Amount of work, in terms of consensus steps, for the shards involved in processing the
transactions.
8 Related Work
Distributed systems are typically employed to either increase reliability (e.g., via consensus-based
fault-tolerance) or to increase performance (e.g., via sharding). Consequently, there is abundant
literature on such distributed systems, distributed databases, and sharding (e.g., [46, 49, 50]) and
on consensus-based fault-tolerant systems (e.g., [10, 14, 19, 31, 49]). Next, we shall focus on the few
works that deal with sharding in fault-tolerant systems.
Several recent system papers have proposed specialized systems that combine sharding with
consensus-based resilient systems. Examples include AHL [17], Caper [3], Chainspace [1], and
SharPer [4], which all use sharding for data management and transaction processing. Systems
such as AHL and Caper are designed with single-shard workloads in mind, as they rely on central-
ized orderers to order and process multi-shard transactions, whereas systems such as Chainspace
and SharPer are closer to the decentralized design of Cerberus. Specifically, Chainspace uses
a consensus-based commit protocol that performs three consecutive consensus and cross-shard ex-
change steps that resemble the two-step approach of PCerberus (although the details of the re-
covery path are rather different). In comparison, Cerberus greatly improves on the design of
Chainspace by reducing the number of consecutive consensus steps necessary to process transac-
tions and by introducing out-of-order transaction processing capabilities. Finally, SharPer inte-
grates global consensus steps in a consensus protocol in a similar manner as OCerberus. Their
focus is mainly on a crash-tolerant Paxos protocol, however, and they do not fully explore the
details of a full Byzantine fault-tolerant recovery path.
A few fully-replicated consensus-based systems utilize sharding at the level of consensus decision
making to improve consensus throughput [2, 22, 26, 29]. In these systems, only a small subset
of all replicas, those in a single shard, participate in the consensus on any given transaction, thereby
reducing the costs to replicate this transaction without improving storage and processing scalability.
Finally, the recently-proposed delayed-replication algorithm aims at improving scalability of resilient
systems by separating fault-tolerant data storage from specialized data processing tasks [33], the
latter of which can be distributed over many participants.
Recently, there has also been promising work on sharding and techniques supporting sharding
for permissionless blockchains. Examples include techniques to enable sidechains, blockchain re-
lays, and atomic swaps [23, 24, 34, 39, 52], which each enable various forms of cooperation between
blockchains (including simple cross-chain communication and cross-chain transaction coordination).
Unfortunately, these permissionless techniques are several orders of magnitude slower than compa-
rable techniques for traditional fault-tolerant systems, making them incomparable with the design
of Cerberus discussed in this work.
9 Conclusion
In this paper, we introduced Core-Cerberus, Optimistic-Cerberus, and Pessimistic-Cerberus,
three fully distributed approaches towards multi-shard fault-tolerant transaction processing. The
design of these approaches is geared towards processing UTXO-like transactions in sharded dis-
tributed ledger networks with minimal cost, while maximizing performance. By using the properties
of UTXO-like transactions to our advantage, both Core-Cerberus and Optimistic-Cerberus are
optimized for cases in which malicious behavior is expected to be rare, in which case they are able
to provide serializable transaction processing with only a single consensus step per affected shard,
whereas Pessimistic-Cerberus is optimized to efficiently deal with a broad range of malicious behavior
at the cost of a second consensus step during normal operations.
The core ideas of Cerberus are not tied to any particular underlying consensus protocol. In
this work, we have chosen to build Cerberus on top of Pbft, as our experience shows that well-tuned
implementations of this protocol that use out-of-order processing can outperform most other
protocols in raw throughput [30]. Combining other consensus protocols with Cerberus will result in
other trade-offs between maximum throughput, per-transaction latency, bandwidth usage, and (for
protocols that do not support out-of-order processing) vulnerability to message delays. Fully applying
the ideas of Cerberus to other consensus protocols in a fine-tuned manner remains an open problem,
however. E.g., we are very interested in seeing whether incorporating Cerberus into the more-
resilient four-phase design of HotStuff can sharply reduce the need for multi-shard view-changes
in OCerberus (at the cost of higher per-transaction latency).
References
[1] Mustafa Al-Bassam, Alberto Sonnino, Shehar Bano, Dave Hrycyszyn, and George
Danezis. Chainspace: A sharded smart contracts platform, 2017. URL:
http://arxiv.org/abs/1708.03778.
[2] Yair Amir, Claudiu Danilov, Danny Dolev, Jonathan Kirsch, John Lane, Cristina Nita-Rotaru,
Josh Olsen, and David Zage. Steward: Scaling byzantine fault-tolerant replication to wide
area networks. IEEE Transactions on Dependable and Secure Computing, 7(1):80–93, 2010.
doi:10.1109/TDSC.2008.53.
[3] Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. CAPER: A cross-
application permissioned blockchain. Proc. VLDB Endow., 12(11):1385–1398, 2019.
doi:10.14778/3342263.3342275.
[4] Mohammad Javad Amiri, Divyakant Agrawal, and Amr El Abbadi. SharPer: Sharding permis-
sioned blockchains over network clusters, 2019. URL: https://arxiv.org/abs/1910.00765v1.
[5] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis,
Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich,
Srinivasan Muralidharan, Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith
Smith, Alessandro Sorniotti, Chrysoula Stathakopoulou, Marko Vukolić, Sharon Weed Cocco,
and Jason Yellick. Hyperledger Fabric: A distributed operating system for permissioned
blockchains. In Proceedings of the Thirteenth EuroSys Conference, pages 30:1–30:15. ACM,
2018. doi:10.1145/3190508.3190538.
[6] Vijayalakshmi Atluri, Elisa Bertino, and Sushil Jajodia. A theoretical formulation
for degrees of isolation in databases. Inform. Software Tech., 39(1):47–53, 1997.
doi:10.1016/0950-5849(96)01109-3.
[7] Paddy Baker and Omkar Godbole. Ethereum fees soaring to 2-year high: Coin metrics. Coin-
Desk, 2020.
[8] Guillaume Bazot. Financial intermediation cost, rents, and productivity: An international
comparison. Technical report, European Historical Economics Society, 2018.
[9] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil, and Patrick
O'Neil. A critique of ANSI SQL isolation levels. SIGMOD Rec., 24(2):1–10, 1995.
doi:10.1145/568271.223785.
[10] Christian Berger and Hans P. Reiser. Scaling byzantine consensus: A broad analysis. In Pro-
ceedings of the 2nd Workshop on Scalable and Resilient Infrastructures for Distributed Ledgers,
pages 13–18. ACM, 2018. doi:10.1145/3284764.3284767.
[11] Gabi Bracha and Ophir Rachman. Randomized consensus in expected O(n² log n) operations.
In Distributed Algorithms, pages 143–150. Springer Berlin Heidelberg, 1992.
doi:10.1007/BFb0022443.
[12] Christopher Brookins. DeFi boom has saved bitcoin from plummeting. Forbes, 2020.
[13] Christian Cachin, Klaus Kursawe, Frank Petzold, and Victor Shoup. Secure and efficient asyn-
chronous broadcast protocols. In Advances in Cryptology — CRYPTO 2001, pages 524–541.
Springer Berlin Heidelberg, 2001. doi:10.1007/3-540-44647-8_31.
[14] Christian Cachin and Marko Vukolic. Blockchain consensus protocols in the wild (keynote talk).
In 31st International Symposium on Distributed Computing, volume 91, pages 1:1–1:16. Schloss
Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017. doi:10.4230/LIPIcs.DISC.2017.1.
[15] Michael Casey, Jonah Crane, Gary Gensler, Simon Johnson, and Neha Narula.
The impact of blockchain technology on finance: A catalyst for change. Techni-
cal report, International Center for Monetary and Banking Studies, 2018. URL:
https://www.cimb.ch/uploads/1/1/5/4/115414161/geneva21_1.pdf.
[16] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery.
ACM Trans. Comput. Syst., 20(4):398–461, 2002. doi:10.1145/571637.571640.
[17] Hung Dang, Tien Tuan Anh Dinh, Dumitrel Loghin, Ee-Chien Chang, Qian Lin, and
Beng Chin Ooi. Towards scaling blockchain systems via sharding. In Proceedings of
the 2019 International Conference on Management of Data, pages 123–140. ACM, 2019.
doi:10.1145/3299869.3319889.
[18] Nikhilesh De. CFTC chair: ‘a large part’ of financial system could end up in blockchain format.
CoinDesk, 2020.
[19] Tien Tuan Anh Dinh, Rui Liu, Meihui Zhang, Gang Chen, Beng Chin Ooi, and Ji Wang.
Untangling blockchain: A data processing view of blockchain systems. IEEE Trans. Knowl.
Data Eng., 30(7):1366–1385, 2018. doi:10.1109/TKDE.2017.2781227.
[20] D. Dolev. Unanimity in an unknown and unreliable environment. In 22nd Annual Symposium on
Foundations of Computer Science, pages 159–168. IEEE, 1981. doi:10.1109/SFCS.1981.53.
[21] Danny Dolev. The byzantine generals strike again. J. Algorithms, 3(1):14–30, 1982.
doi:10.1016/0196-6774(82)90004-9.
[22] Michael Eischer and Tobias Distler. Scalable byzantine fault-tolerant state-machine replication
on heterogeneous servers. Computing, 101:97–118, 2019. doi:10.1007/s00607-018-0652-3.
[23] Muhammad El-Hindi, Carsten Binnig, Arvind Arasu, Donald Kossmann, and Ravi Rama-
murthy. BlockchainDB: A shared database on blockchains. Proc. VLDB Endow., 12(11):1597–
1609, 2019. doi:10.14778/3342263.3342636.
[24] Ethereum Foundation. BTC Relay: A bridge between the bitcoin blockchain & ethereum smart
contracts, 2017. URL: http://btcrelay.org.
[25] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed
consensus with one faulty process. J. ACM, 32(2):374–382, 1985. doi:10.1145/3149.214121.
[26] Yossi Gilad, Rotem Hemo, Silvio Micali, Georgios Vlachos, and Nickolai Zeldovich. Algorand:
Scaling byzantine agreements for cryptocurrencies. In Proceedings of the 26th Symposium on Op-
erating Systems Principles, SOSP, pages 51–68. ACM, 2017. doi:10.1145/3132747.3132757.
[27] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consis-
tent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, 2002.
doi:10.1145/564585.564601.
[28] William J. Gordon and Christian Catalini. Blockchain technology for healthcare: Facilitating
the transition to patient-driven interoperability. Computational and Structural Biotechnology
Journal, 16:224–230, 2018. doi:10.1016/j.csbj.2018.06.003.
[29] Suyash Gupta, Sajjad Rahnama, Jelle Hellings, and Mohammad Sadoghi. ResilientDB:
Global scale resilient blockchain fabric. Proc. VLDB Endow., 13(6):868–883, 2020.
doi:10.14778/3380750.3380757.
[30] Suyash Gupta, Sajjad Rahnama, and Mohammad Sadoghi. Permissioned blockchain through
the looking glass: Architectural and implementation lessons learned. In 40th International
Conference on Distributed Computing Systems. IEEE, 2020.
[31] Suyash Gupta and Mohammad Sadoghi. Blockchain Transaction Processing, pages 1–11.
Springer International Publishing, 2018. doi:10.1007/978-3-319-63962-8_333-1.
[32] Jelle Hellings and Mohammad Sadoghi. Brief announcement: The fault-tolerant
cluster-sending problem. In 33rd International Symposium on Distributed Computing
(DISC 2019), pages 45:1–45:3. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2019.
doi:10.4230/LIPIcs.DISC.2019.45.
[33] Jelle Hellings and Mohammad Sadoghi. Coordination-free byzantine replication with min-
imal communication costs. In 23rd International Conference on Database Theory (ICDT
2020), volume 155, pages 17:1–17:20. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2020.
doi:10.4230/LIPIcs.ICDT.2020.17.
[34] Maurice Herlihy. Atomic cross-chain swaps. In Proceedings of the 2018 ACM
Symposium on Principles of Distributed Computing, pages 245–254. ACM, 2018.
doi:10.1145/3212734.3212736.
[35] Maurice Herlihy. Blockchains from a distributed computing perspective. Commun. ACM,
62(2):78–85, 2019. doi:10.1145/3209623.
[36] Matt Higginson, Johannes-Tobias Lorenz, Björn Münstermann, and Peter Braad Olesen. The
promise of blockchain. Technical report, McKinsey & Company, 2017.
[37] Maged N. Kamel Boulos, James T. Wilson, and Kevin A. Clauson. Geospatial blockchain:
promises, challenges, and scenarios in health and healthcare. International Journal of Health
Geographics, 17(1):1211–1220, 2018. doi:10.1186/s12942-018-0144-x.
[38] Jonathan Katz and Yehuda Lindell. Introduction to Modern Cryptography. Chapman and
Hall/CRC, 2nd edition, 2014.
[39] Jae Kwon and Ethan Buchman. Cosmos whitepaper: A network of distributed ledgers, 2019.
URL: https://cosmos.network/cosmos-whitepaper.pdf.
[40] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):51–58, 2001. Distributed
Computing Column 5. doi:10.1145/568425.568433.
[41] Laphou Lao, Zecheng Li, Songlin Hou, Bin Xiao, Songtao Guo, and Yuanyuan Yang. A survey
of iot applications in blockchain systems: Architecture, consensus, and traffic modeling. ACM
Comput. Surv., 53(1), 2020. doi:10.1145/3372136.
[42] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2009. URL:
https://bitcoin.org/bitcoin.pdf.
[43] Arvind Narayanan and Jeremy Clark. Bitcoin’s academic pedigree. Commun. ACM, 60(12):36–
45, 2017. doi:10.1145/3132259.
[44] Senthil Nathan, Chander Govindarajan, Adarsh Saraf, Manish Sethi, and Praveen Jayachan-
dran. Blockchain meets database: Design and implementation of a blockchain relational
database. Proc. VLDB Endow., 12(11):1539–1552, 2019. doi:10.14778/3342263.3342632.
[45] Faisal Nawab and Mohammad Sadoghi. Blockplane: A global-scale byzantizing middleware.
In 35th International Conference on Data Engineering (ICDE), pages 124–135. IEEE, 2019.
doi:10.1109/ICDE.2019.00020.
[46] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Springer,
2020. doi:10.1007/978-3-030-26253-2.
[47] Michael Pisa and Matt Juden. Blockchain and economic development: Hype vs. reality. Tech-
nical report, Center for Global Development, 2017.
[48] Victor Shoup. Practical threshold signatures. In Advances in Cryptology — EUROCRYPT
2000, pages 207–220. Springer Berlin Heidelberg, 2000. doi:10.1007/3-540-45539-6_15.
[49] Gerard Tel. Introduction to Distributed Algorithms. Cambridge University Press, 2nd edition,
2001.
[50] Maarten van Steen and Andrew S. Tanenbaum. Distributed Systems. Maarten van Steen, 3rd
edition, 2017. URL: https://www.distributed-systems.net/.
[51] Gavin Wood. Ethereum: a secure decentralised generalised transaction ledger, 2016. EIP-150
revision. URL: https://gavwood.com/paper.pdf.
[52] Gavin Wood. Polkadot: vision for a heterogeneous multi-chain framework, 2016. URL:
https://polkadot.network/PolkaDotPaper.pdf.
[53] Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abra-
ham. HotStuff: BFT consensus with linearity and responsiveness. In Proceedings of
the ACM Symposium on Principles of Distributed Computing, pages 347–356. ACM, 2019.
doi:10.1145/3293611.3331591.