Blockplane:
A Global-Scale Byzantizing Middleware
Faisal Nawab
Department of Computer Science and Engineering
University of California, Santa Cruz
fnawab@ucsc.edu
Mohammad Sadoghi
Exploratory Systems Lab
Department of Computer Science
University of California, Davis
msadoghi@ucdavis.edu
Abstract—The byzantine fault-tolerance model captures a wide range of failures—common in real-world scenarios—such as ones due to malicious attacks and arbitrary software/hardware errors. We propose Blockplane, a middleware that enables existing benign systems to tolerate byzantine failures. This is
done by making the existing system use Blockplane for durability
and as a communication infrastructure. Blockplane proposes the
following: (1) A middleware and communication infrastructure
to make an entire benign protocol byzantine fault-tolerant, (2) A
hierarchical locality-aware design to minimize the number of
wide-area messages, (3) A separation of fault-tolerance concerns
to enable designs with higher performance.
I. INTRODUCTION
A byzantine failure model [11] is a model of arbitrary
failures that includes—in addition to crashes—unexpected be-
havior due to software and hardware malfunctions, malicious
breaches, and violation of trust between participants. It is
significantly more difficult to develop byzantine fault-tolerant
protocols compared to benign (non-byzantine) protocols. This
poses a challenge to organizations that want to adopt byzantine
fault-tolerant software solutions. This challenge is exacerbated by the need of many applications to be globally distributed.
With global distribution, the wide-area latency between partic-
ipants amplifies the performance overhead of byzantine fault-
tolerant protocols.
To overcome the challenges of adopting byzantine fault-
tolerant software solutions, we propose pushing down the
byzantine fault-tolerance problem to the communication layer
rather than the application/storage layer. Our proposal, Block-
plane, is a communication infrastructure that handles the deliv-
ery of messages from one node to another. Blockplane exposes
an interface of log-commit,send, and receive operations to
be used by nodes to both persist their state and communicate
with each other.
Blockplane adopts a locality-aware hierarchical design
due to our interest in supporting efficient byzantine fault-
tolerance in global-scale environments. Hierarchical designs
have recently been shown to perform well in global-scale
settings [15]. Blockplane optimizes for communication latency by performing as much computation as possible locally and only communicating across wide-area links when necessary.
In the paper, we distinguish between two types of failures.
The first is independent byzantine failures that are akin to
traditional byzantine failures which affect each node indepen-
dently (the failure of one node does not correlate with the
failure of another node). The second type of failures is benign
geo-correlated failures. In geo-correlated failures, the nodes
that are deployed on the same datacenter (or datacenters close
to each other) may experience a failure together. In this case,
the failure of a node is no longer independent from other
nodes. This distinction between the two types of failures is
important due to the frequency of datacenter-scale outages [6].
To summarize, the paper proposes the following:
• An approach to make an entire benign protocol tolerate byzantine failures through Blockplane.
• A locality-aware design to reduce wide-area communication.
• A separation of fault-tolerance concerns between byzantine failures and benign geo-correlated failures.
The rest of this paper begins with a background in Sec-
tion II. We propose Blockplane in Sections III to VII. Sec-
tion VIII presents the experimental evaluation. Section IX
presents an overview of related work. The paper concludes
in Section X.
II. BACKGROUND
Blockplane is a permissioned blockchain system [7], [14]
that targets applications where: (1) Participants are globally-
distributed, and (2) Byzantine failures need to be tolerated. We
distinguish between byzantine failures that model independent
arbitrary behavior of nodes and geo-correlated failures that
model a benign outage of a whole datacenter. To clarify the distinction between the two types of failures, we introduce the notations f_i to represent the number of tolerated independent byzantine failures and f_g to denote the number of tolerated geo-correlated failures. (When the notation f is used without a subscript, it should be interpreted as the number of tolerated independent byzantine failures f_i.)
Byzantine Agreement. Byzantine agreement is the prob-
lem of reaching consensus between nodes in the presence
of byzantine failures. This includes benign crash failures,
hardware/communication malfunctions, software errors, and
malicious breaches. The PBFT protocol is a widely-known
leader-based byzantine agreement protocol. To tolerate f byzantine failures, 3f + 1 nodes are needed (i.e., n = 3f + 1).
We discuss the normal-case operation of PBFT next, since it
is used in the design of Blockplane.
In PBFT, the leader drives the commitment of new com-
mands through three consecutive phases: pre-prepare, prepare,
and commit. The life-cycle of committing a command begins
with a user sending a request to commit a command to the
leader. The leader then broadcasts a pre-prepare message to all other nodes. A node that receives a pre-prepare message accepts it if it is authentic and the node has not accepted another request with the same sequence number.
Fig. 1. An illustration of a Blockplane deployment in participant A distinguishing between user-space and Blockplane-space and showing an example of Local Logs in a scenario with two participants. Each log contains log-commit and communication records.
If a node accepts a pre-prepare message from the leader, it
proceeds to the next phase and the node broadcasts a prepare
message to all other nodes, including the leader. Each node
waits to collect 2f prepare messages in addition to the pre-
prepare sent by the leader. Once these messages are received,
the node enters the prepared phase. The significance of being
prepared is that the node now knows that all non-faulty nodes
agree on the contents of the message sent by the leader for
that view and sequence number. This is because even with f
faulty nodes, f + 1 other nodes have sent prepare messages with these contents—and they intersect with every other group of f + 1 non-faulty nodes.
Once a node enters the prepared state, it broadcasts a
commit message to all other nodes. A prepared node waits
for 2f + 1 commit messages (including its own) to enter the
commit phase. Once in the commit phase, the node—assuming
it is not faulty—considers the request committed and logs
this information in its local storage. Then, it responds with a
reply message to the client. The client waits for identical reply
messages from f + 1 nodes before it considers the command committed (because up to f nodes might be faulty).
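As a rough illustration of these quorum sizes, the following is a minimal sketch (in Python, with hypothetical names) of the thresholds a PBFT node and client would check; it is not the full protocol.

# Minimal sketch of PBFT quorum thresholds (hypothetical helper, not the full protocol).
def pbft_thresholds(f: int) -> dict:
    """Quorum sizes for tolerating f independent byzantine failures."""
    return {
        "nodes_needed": 3 * f + 1,       # n = 3f + 1 nodes in total
        "prepared_quorum": 2 * f,        # matching prepare messages needed (plus the leader's pre-prepare)
        "commit_quorum": 2 * f + 1,      # commit messages (including the node's own) to commit
        "client_replies": f + 1,         # identical replies a client waits for
    }

# Example: with f = 1, a deployment needs 4 nodes, a node is prepared after
# 2 matching prepare messages, commits after 3 commit messages, and a client
# accepts a result after 2 identical replies.
print(pbft_thresholds(1))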
III. BLOCKPLANE SYSTEM AND PROGRAMMING MODEL
In this section, we present the system and programming
models of Blockplane.
A. Motivation
The goal of Blockplane is to provide a framework for
efficient and accessible byzantine fault-tolerance in wide-
area environments. The system model is hierarchical, where
nodes are grouped together to form local units and the units
communicate together globally. The aim of this hierarchy is
to mask byzantine failures locally within the unit. By masking
byzantine failures locally, the global coordination can utilize
benign (non-byzantine) protocols. The efficiency here is that
the global, wide-area communication pattern mimics that of the benign protocol rather than the more expensive byzantine protocol (an example of applying this to consensus is presented in Section VI-E and correctness is discussed in Section VII). The programming model aims
to provide accessibility to programmers by proposing a new
abstraction that—unlike the traditional SMR abstraction—exposes both commitment and communication interfaces.
This allows programmers to follow a design pattern that is
closer to the distributed benign protocol rather than adapting
the distributed protocol to an SMR-style of programming.
Additionally, the separate interfaces for commitment and communication allow the Blockplane infrastructure to handle
these two requests differently, leading to a specialized, more
efficient design for each.
B. System Model and Notation
The system model consists of nparticipants. Participants
are on different datacenters. We will use the terms participant,
datacenter, and site interchangeably. The communication be-
tween a pair of participants incurs wide-area communication
latency. Participants communicate using a message passing-
based distributed protocol, denoted P. All communication
is handled by Blockplane. Blockplane establishes and uti-
lizes communication channels between nodes that can incur
message drops, corruption and reordering. Blockplane utilizes
existing approaches to detect data corruption and reordering
such as the TCP protocol.
Each participant maintains 3f + 1 nodes for Blockplane (each node runs an instance of Blockplane and an instance of P). Launching and managing the Blockplane nodes is performed by the application administrator. For example, if f equals 1, then each participant maintains 4 nodes for Blockplane. The set of Blockplane nodes for participant i is denoted N_i. The set of Blockplane nodes corresponding to a participant, N_i, is called a Blockplane unit, or unit for short.
The set of nodes and their public keys are known to all nodes.
Each Blockplane unit, N_i, maintains a log of events that represents an SMR log of the corresponding participant. The Local Log of participant i is denoted L_i. The jth event in L_i is denoted L_i[j]. There are two types of events in the log:
(1) Log-Commit records: A log-commit record (or commit
record for short) represents an event that changes the state
of the corresponding participant’s protocol, P. A participant
writes a commit record via an interface instruction called
log-commit that takes an arbitrary string as input. The
participant uses log-commit records to persist its state on
Blockplane nodes to enable recovery in the case of failure.
(2) Communication records: A communication record
represents a message that is sent from the corresponding
participant to another participant. The participant writes
a communication record by using an interface instruction
called send that takes an arbitrary string message and the
destination participants. Blockplane also provides a receive
interface instruction to receive any incoming messages.
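To make the two record types concrete, the following is a minimal sketch (in Python, with hypothetical type names) of how a Local Log and its two record types could be represented; the paper does not prescribe a particular data layout.

from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical representations of the two Local Log record types (sketch only).
@dataclass
class LogCommitRecord:
    payload: bytes            # arbitrary state-change value passed to log-commit

@dataclass
class CommunicationRecord:
    message: bytes            # arbitrary message passed to send
    destinations: List[str]   # destination participant identifiers

@dataclass
class LocalLog:
    entries: List[Union[LogCommitRecord, CommunicationRecord]] = field(default_factory=list)

    def log_commit(self, payload: bytes) -> int:
        """Append a state-change record; returns its log position."""
        self.entries.append(LogCommitRecord(payload))
        return len(self.entries) - 1

    def send(self, message: bytes, destinations: List[str]) -> int:
        """Append a communication record to be delivered by the communication daemon."""
        self.entries.append(CommunicationRecord(message, destinations))
        return len(self.entries) - 1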
Figure 1 shows an example of a Blockplane deployment.
We distinguish between the user-space and the Blockplane-
space. The user-space is the abstraction that is seen by the
system developer who uses Blockplane. This includes an
abstraction of a Local Log and a single copy of an up-to-date
state. User-space is where the system developer’s code (and
verification routines that we will introduce later) reside. The
system code uses Blockplane through the user-level interface
functions such as log-commit and other shown interfaces. The
Blockplane-space is the underlying infrastructure provided by
Blockplane. System developers use the user-level interfaces
and are not exposed to Blockplane-space complexities. In Blockplane-space, there are 3f + 1 Blockplane nodes, each
with a copy of the log and program state. There are various
Blockplane-level functions that are not exposed to the system
developer. However, we show some of them in the figure since they are presented later in the paper as part of the Blockplane design.
Blockplane-space also includes the communication daemon
that is responsible for processing communication records and
delivering them to their destinations. The figure also shows an
example of the contents of the Local Logs in two participants.
Each log consists of log-commit and communication records.
C. Programming Model
Blockplane exposes two types of interfaces: First, an in-
terface for log-commit records that includes a log-commit
instruction and a read instruction. This interface is similar to an SMR interface and should be used in the same way SMR
interfaces are used. The log-commit instruction guarantees
that the committed value will survive the set fault-tolerance
level. Also, it guarantees that it is ordered after all previously
committed values in the Local Log. The read instruction
allows the participant to read the committed records to recover
from failures or update replicas. Like SMR-based systems in general, a system that uses Blockplane must be
deterministic (i.e., an event has a deterministic effect on the
state of the system) and all copies of the protocol, P, in the
participant’s Blockplane unit must start with an identical initial
state.
The other type of interface exposed by Blockplane is the
communication interface that consists of send and receive
instructions. The protocol developer of P must use this inter-
face for any communication between participants.
The protocol P itself can be written as a benign protocol
that does not tolerate byzantine failures. To use Blockplane, the
protocol developer must use the commit and communication
interfaces:
Definition 1: A protocol that uses Blockplane must use the
log-commit and communication interfaces for all the following
cases in the protocol:
• If an event changes the state of the protocol, then the event must be committed to the Local Log.
• If an event may lead to one or more communication events, then the event must be committed to the Local Log.
• Any communication between participants must be handled through the communication interface.
Additionally, the developer must write verification routines
for log-commit and communication instructions. These ver-
ification routines are going to be used by the Blockplane
replicas to enable them to verify whether a record is a valid
state transition. (More details about how verification routines are used by Blockplane are presented in Section IV-B.) The
intuition of verification routines is the following: A Blockplane
node may propose committing a record to represent a state
change in the protocol (e.g., changing the value of a state
Algorithm 1: An example of a program using Blockplane (the code
in red is what is added or modified to use Blockplane)
1: c := a counter initially set to 0
2: on event UserRequest (in: destination) {
3: log-commit(request info)
4: send (to: destination)
5: }
6: on event StartServer () {
7: while (true)
8: receive ()
9: log-commit(increment-counter)
10: c++
11: }
variable). However, for this record to be committed (as we will
see in Section IV-B), other Blockplane nodes must attest to
whether the proposed record is a valid state transition given the
current state. The verification routines written by the developer
will be used for this purpose.
Summary and discussion. In summary, given this inter-
face of Blockplane, a system developer is not exposed to
the complexities of tolerating byzantine failures. Rather, a
developer only needs to modify the system to use the log-
commit and communication interfaces and implement veri-
fication routines. The intention of exposing both a commit
and a communication interface—unlike SMR that exposes
a commit interface only—is two-fold: (1) to make the pro-
gramming pattern closer to ones used in distributed algorithms
which model the communication between different entities
as direct or broadcast messages. SMR, on the other hand,
would require transforming distributed algorithms that rely
on a direct or broadcast message abstraction to utilize an
ordered-log coordination abstraction. (2) to enable Blockplane
to handle commit and communication requests differently, which leads to opportunities to enhance locality as we detail later. These opportunities are more challenging with an SMR abstraction. However, Blockplane’s abstractions are new and
more complex than traditional SMR (since it requires dealing
with two types of interfaces and potentially more complicated
verification routines.) These factors will affect the potential
for adoption in real scenarios, compared to existing methods
that are familiar to developers. Also, Blockplane might not
offer an advantage compared to byzantine SMR for protocols
that are easily transformed to use SMR-based coordination
and for applications where nodes are not separated by wide-
area networks. Another obstacle to adopting Blockplane—
as well as SMR systems like PBFT—is that the application
must be deterministic. Otherwise, an ordered log of events is
not guaranteed to produce the same output from a common
initial state. The complexity of verification routines depends on the application; for example, a transaction processing application would have verification routines to check whether a transaction can commit. Byzantine SMR systems share this complexity; however, Blockplane verification routines may be
more complex due to the need to verify both commit and
communication events.
Example. The following is an example of using the Block-
plane programming interface (Algorithm 1 shows the modified
algorithms in the example.) The example is based on a simple
distributed counting protocol, P. In the counting protocol, each
participant maintains a counter that is initially set to 0. A user
can trigger an event to send a message from a participant A to
another participant B. When a participant receives a message,
it increments the counter. In this simple example, the only
state that a participant needs to maintain is the counter value.
Therefore, the protocol P calls the log-commit instruction
whenever a message is received. This commits the event to
the SMR log. In the case of a failure, the protocol P reads the
log using read instructions to recover the state of the counter.
The other change that is needed is to use Blockplane’s send
and receive instructions to send and receive messages.
In addition to using the commit and communication in-
structions, the system developer must also provide verification
routines for each log-commit and send instruction. In the case
of the program in Algorithm 1, three verification routines must
be written, for the log-commit and send instructions in the
UserRequest event and for the log-commit instruction in StartServer. The verification routines are provided to Blockplane
in the form of callbacks. (When these callbacks are used is
discussed in Section IV-B.) The following is a description of
possible verification routines for these three instructions:
• The log-commit instruction in the UserRequest event: the verification routine in this case may verify that the user request is from a trusted user/source.
• The send instruction in the UserRequest event: the verification routine in this case validates that the corresponding user request has actually been received and was not processed before. (This is to avoid the case of a malicious node trying to send messages to other participants without having received a user request to perform that communication.)
• The log-commit instruction in StartServer: the verification routine in this case validates that the corresponding received message has actually been received from another participant. (We provide more details about verifying received messages later in Section IV-C. The intuition of this verification routine is that the Blockplane node checks whether the received message has been signed by f + 1 nodes in the source participant.) The purpose of this verification routine is to prevent a malicious node from proposing to commit a record that increments the counter without having received a message from another participant.
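To illustrate, the following is a minimal sketch (in Python, with hypothetical names) of how the three verification routines above could be written for the counter program of Algorithm 1; it is not the actual Blockplane API.

# Hypothetical verification callbacks for the counter program of Algorithm 1 (sketch, not the Blockplane API).
TRUSTED_SOURCES = {"alice", "bob"}     # assumed set of trusted users/sources
processed_requests: set = set()        # request ids this node has already processed

def verify_user_request_commit(user_id: str) -> bool:
    # log-commit in UserRequest: the request must come from a trusted user/source.
    return user_id in TRUSTED_SOURCES

def verify_user_request_send(request_id: str, committed_request_ids: set) -> bool:
    # send in UserRequest: a matching user request must be in the Local Log and not processed before.
    return request_id in committed_request_ids and request_id not in processed_requests

def verify_increment_commit(num_valid_source_signatures: int, f: int) -> bool:
    # log-commit in StartServer: the received message must carry f + 1 signatures
    # from the source participant (see the receive verification routine in Section IV-C).
    return num_valid_source_signatures >= f + 1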
We present an example of using Blockplane to byzantize a
more complex protocol in Section VI-E.
IV. BLOCKPLANE DESIGN FOR INDEPENDENT FAILURES
In this section, we present the design of Blockplane. We
focus here on tolerating byzantine failures that affect machines
independently. In Section V, we augment this design to tolerate
benign geo-correlated failures.
A. Blockplane Operation Overview
Assume that the number of tolerated failures f_i is 1. In this case, each participant maintains 4 Blockplane nodes. Each Blockplane node holds a replica of the protocol P and a copy of the Local Log.
The log-commit instruction takes an arbitrary value as
input. The log-commit instruction causes the value to be
appended to the Local Log in a fault-tolerant manner. At this point, the replicas maintained by Blockplane nodes can read the new committed record and incorporate it into their state.
Fig. 2. An example of the state of Blockplane nodes.
Similar to committing an event, the send instruction ap-
pends the message and destination to the Local Log in a way
that tolerates byzantine failures. After it is appended, f_i + 1 Blockplane nodes sign the message, denoting that they testify for its correctness. The set of signatures is called a proof. Then, the message with its proof is sent to the Blockplane nodes in the destination and it is appended to the destination’s Local Log. When the protocol P at the destination calls the receive
instruction it returns the next received unread message in its
Local Log.
Figure 2 shows an example of Blockplane state. Consider
a Blockplane scenario that is running the counter program in
Algorithm 1. The figure shows the state of participants A and B. Each participant has four Blockplane nodes, each labeled X-i, where X is the participant and i is the node’s number. The figure shows the state of the node, hence the value of c. Also, it shows the instance of the Local Log. In the log, black entries denote commit records and blue entries denote communication records. The scenario shows the state produced after a request from A is sent to B. Nodes at A have the state of the request and the send event, and nodes at B have the state of receiving the request and processing the counter increment operation.
B. Committing to the Local Log
Committing to the Local Log (called local commit) is the
most common operation in Blockplane. It is used in three
types of events: (1) Logging a state change using log-commit,
(2) Declaring an event that may lead to communication with
other participants using log-commit, and (3) Communication
events using send. Whenever log-commit or send is called, a commit routine commits the event to the Local Log. We
call the instruction to the commit routine local-commit. The
instruction takes an arbitrary string as input.
In Blockplane, local commit is performed using the PBFT
byzantine fault-tolerant protocol. Specifically, when a commit
routine starts, the event value to be committed is sent to the
current PBFT leader. Then, the current leader commits the
event to the next available log position in the Local Log using
the PBFT protocol. (Reads, recovery, and failure cases are
handled in the same way as PBFT.) PBFT requires responses
from 2f + 1 nodes to make progress.
In our deployment of PBFT as the component for local
commitment, we make some changes to accommodate the
Blockplane protocol. The first change is that every value has
a type annotation that represents the type of the record. The
type of the record can be a commit record or a communication
record. The other change is in the voting process in the
commit phase. When a node enters the prepared phase, rather
Algorithm 2: Blockplane communication algorithms
1: on send (in: message) {
2: local-commit (in: message, communication record annotation)
3: }
4:
5: on start-communication-daemon (in: destination) {
6: p = first entry in the Local Log
7: loop {
8: if p is a communication record to destination then
9: if p was sent to destination then
10: continue
11: end if
12: P = p with a pointer to the previous communication record to destination
13: Get signatures for the validity of P from f + 1 local nodes
14: Send P and f + 1 signatures to Blockplane nodes in destination
15: end if
16: p++
17: }
18: }
than broadcasting a PBFT commit message immediately, the
corresponding verification routine is called. (The verification
routine is described in Section III-C.) The verification routine
checks the validity of the value.
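The following is a minimal sketch (in Python, with hypothetical names) of this modification: before broadcasting the PBFT commit message, a prepared node invokes the registered verification routine for the record's type. It illustrates the control flow only.

# Sketch of the modified prepared phase (hypothetical names, illustration only).
def on_prepared(record, verification_routines, broadcast_commit, reject):
    """Called once a node has collected the pre-prepare and 2f matching prepare messages."""
    verify = verification_routines[record.record_type]   # routine registered by the developer
    if verify(record):
        broadcast_commit(record)    # proceed with the normal PBFT commit phase
    else:
        reject(record)              # refuse to vote for an invalid state transition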
C. Communication
Send instruction. Algorithm 2 shows the steps in the
send instruction and the communication daemon responsible
for delivering the message to the other participant. A send
instruction takes as input a message (string of bytes) and a
destination. When the send instruction is called, the message
is committed to the local log, via local-commit.
A communication daemon takes care of communicating
messages that were committed to the Local Log. Each partic-
ipant runs a communication daemon that continuously reads
the log to detect new send entries. For every other participant,
a separate communication daemon handles its communication.
Algorithm 2 shows a simplified pseudo code of the operation
of a communication daemon. First, the daemon maintains a
pointer p, that points initially to the first entry in the Local
Log. Then it reads the entries in the Local Log one-by-one
until it finds an entry that is a communication record to the
daemon’s destination. If that communication record has been
already sent to the destination, then it is skipped. Otherwise,
the communication daemon constructs a transmission record,
P, which consists of the contents of the message in addition
to a pointer to the previous communication record to the same
destination.
When P is ready, the communication daemon collects f + 1
signatures from local Blockplane nodes. A Blockplane node
signs the transmission record if it agrees that its contents and
meta-information are accurate.
At any point in time, it is possible that there are more
than one communication daemon. In such a case, the same
communication record might be sent more than once to the
intended participant. This, however, is not problematic because the receiving end will verify the validity of the message and drop duplicates. (We present more details about this in the rest of this section in the description of the receive instruction.) In fact, running more than one communication daemon is necessary, because a single communication daemon might be faulty (e.g., malicious) and may only pretend to send messages. Running more than one communication daemon might stress network I/O with a lot of redundancy. For this reason, we deploy a communication daemon reserve, or reserve for short. A reserve is a collection of f + 1 Blockplane nodes. A reserve node periodically sends
requests to nodes at other participants asking them about the
most recent communication record they have received. If there
is a substantial gap between the two participants, then the
reserve node transforms to a regular communication daemon.
This is because a significant gap between the two nodes may
signal that the current communication daemon is maliciously
delaying the communication of messages.
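As an illustration, the following is a minimal sketch (in Python, with hypothetical names and an arbitrary gap threshold) of the reserve node's periodic check; the paper does not fix a specific threshold or query protocol.

import time

# Hypothetical reserve-node loop (sketch only; the gap threshold is an assumption).
GAP_THRESHOLD = 100  # number of unsent communication records considered "substantial"

def reserve_loop(query_remote_received_position, local_latest_comm_position,
                 become_active_daemon, period_seconds=5.0):
    while True:
        remote_pos = query_remote_received_position()   # ask destination nodes what they last received
        local_pos = local_latest_comm_position()        # latest communication record in the Local Log
        if local_pos - remote_pos > GAP_THRESHOLD:
            # A large gap may signal a faulty or malicious communication daemon;
            # the reserve node takes over as a regular communication daemon.
            become_active_daemon()
            return
        time.sleep(period_seconds)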
Receive instruction. When a node receives a transmission
record P from another participant’s communication daemon,
it tries to commit the received message to its participant’s
Local Log using log-commit. However, in order to commit the
received record, enough nodes in the destination participant
must participate in committing it. Blockplane provides a special verification routine for received messages. This verification routine, called the receive verification routine, checks the following:
• The transmission record has f + 1 signatures from the source.
• The transmission record has not been received before.
• There are no previous transmission records that were not received. This is checked by verifying that the previous log position—if any—in the received transmission record has already been received.
A node receiving the transmission record calls the local-
commit instruction to commit it. Blockplane nodes use the
receive verification routine to ensure the conditions above
apply. After the received message is committed to the Local
Log, nodes read the entry and process the received message
the same way other records are read from the log.
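The following is a minimal sketch (in Python, with hypothetical names) of the checks performed by the receive verification routine; signature checking is abstracted behind an assumed helper.

# Hypothetical receive verification routine (sketch only).
def verify_received(transmission_record, f: int, valid_signature, received_positions: set) -> bool:
    """Return True if the received transmission record may be committed to the Local Log."""
    # 1. The record carries f + 1 valid signatures from nodes of the source participant.
    valid_sigs = [s for s in transmission_record.signatures
                  if valid_signature(s, transmission_record.source)]
    if len(valid_sigs) < f + 1:
        return False
    # 2. The record has not been received before.
    if transmission_record.position in received_positions:
        return False
    # 3. No earlier transmission record from the same source was skipped:
    #    the previous position referenced by the record must already be received.
    prev = transmission_record.previous_position
    if prev is not None and prev not in received_positions:
        return False
    return True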
In Blockplane, receiving messages is implemented by ex-
posing an interface called receive which takes the source
participant as input (we omit the input in the discussions
and pseudo code if it can be implied.) This is intended to
make the communication interface similar to ones typically
in use in networked systems. The Blockplane nodes, as they
are reading the Local Log and incorporating new events to
the node’s state, they process received transmission records in
the following way: Each node maintains a reception buffer
for every other participant. When a received transmission
record is encountered in the Local Log, it is appended to the
corresponding reception buffer. When a receive instruction is
called, the corresponding reception buffer is polled and the
next transmission record is returned.
Fig. 3. The normal-case operation of commitment and communication in Blockplane: (a) the communication needed to commit a value using local-commit in participant 0; (b) the communication needed to communicate a message from participant 0 to participant 1, where the send and receive instructions are involved.
In addition to the transmission records, a Blockplane node may receive requests for information from reserve nodes at other participants. A node responds to these requests with the log position of the most recently committed transmission record received from the participant corresponding to the reserve node. (Note that the returned log position is the one that was sent along with the transmission record and not the one at the receiver’s Local Log.) A reserve node must account for the possibility of malicious nodes at the destination participant. Therefore, it should send requests to at least f + 1 nodes. The response with the smallest log position is guaranteed to be true. This is because at least one of the f + 1 nodes is non-faulty. In the worst case, the non-faulty node is the one that responded with the smallest log position. However, it is also possible that a faulty node would respond with a very small log position to make the reserve node suspect that its participant’s communication daemon is faulty. To overcome this case, the reserve node can send the request to more than f + 1 nodes in the destination participant. If any group of f + 1 nodes agrees that the log position is larger than a certain number, then this can be taken as the current received log position. The reserve node chooses the set of f + 1 nodes that maximizes that lowest log position.
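Concretely, picking the group of f + 1 responses that maximizes the lowest reported position amounts to taking the (f + 1)-th largest response. A minimal sketch (in Python, with a hypothetical helper name):

# Sketch: estimate the destination's received log position from (possibly faulty) responses.
def estimated_received_position(responses: list, f: int) -> int:
    """responses: log positions reported by more than f nodes at the destination participant.
    Returns the largest value x such that at least f + 1 responses are >= x,
    i.e., a position that some group of f + 1 nodes (hence at least one honest node) vouches for."""
    assert len(responses) >= f + 1
    return sorted(responses, reverse=True)[f]

# Example: with f = 1 and responses [900, 850, 10] (the 10 may come from a faulty node),
# the estimate is 850, since two nodes agree the position is at least 850.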
D. Normal-Case Performance
In this section, we present the communication patterns
resulting from calling the user-level log-commit instruction
and then the communication pattern resulting from calling the
user-level send instruction.
Figure 3(a) shows the communication needed to commit a
record to the Local Log in participant 0 when the user-level
code calls log-commit. (All this communication is performed
by Blockplane-level code and is not implemented by the
program developer in user-level code.) Committing a record
involves the three phases of the PBFT protocol: pre-prepare,
prepare, and commit phases. Finally, reply messages are
sent from the replicas to the user that called local-commit.
Committing a record to the local log takes four intra-datacenter
communication latencies.
Figure 3(b) shows the communication involved in sending
a message from participant 0 to participant 1. This commu-
nication is triggered by a send instruction in user-level code
and proceeds in four steps:
(1) The first step occurs when the send instruction is called.
The send instruction commits the message by calling the
commit-local instruction. The message is forwarded to the
leader (this step is not shown in the figure to account for
the typical case where the instructions are called at the
leader). The message is then committed to the Local Log
of participant 0 in the same way we show in Figure 3(a).
(2) After the message is committed to the Local Log, it
becomes readable to the communication daemon. Assume
that the communication daemon is co-located with the
current leader. The communication daemon constructs a
transmission record P and then collects f + 1 signatures
for its validity.
(3) The transmission record and the collected signatures are
sent to the destination, participant 1. The transmission record
can be sent to one or more nodes in participant 1.
(4) After receiving the transmission record, a node in
participant 1 calls local-commit to commit it to the Local
Log. Committing to the Local Log in participant 1 proceeds
in the same way we describe in Figure 3(a). The special
Blockplane verification routine (the receive verification rou-
tine) ensures that the received transmission record is valid.
V. GEO-CORRELATED FAILURES
In the previous sections, we showed the Blockplane design that tolerates f_i byzantine failures. The type of failures considered
in the previous section is the typical byzantine failure that
affects nodes independently (i.e., a failure does not affect
more than one node). In this section, we introduce the notion
of a geo-correlated benign failure, where nodes in the same
geographic locality experience a crash failure due to a natural
disaster. Remember that we distinguish between the two types
of failures: f_i (used interchangeably with f) denotes the number of tolerated independent byzantine failures and f_g
denotes the number of tolerated geo-correlated benign failures.
Modeling geo-correlated failures is important because it
captures cases where a single failure affects multiple nodes
collectively. We focus on geo-correlated failures that are due to (non-byzantine) datacenter-scale outages [6].
To overcome geo-correlated failures, Blockplane nodes must
coordinate with f_g + 1 participants (out of a chosen set of 2f_g + 1 participants) in order to commit or communicate with
other participants. In the case of committing a new value
using local-commit, the following changes are introduced:
participants maintain mirrors of each others’ states on 3f_i + 1
nodes (these nodes can co-locate with the Blockplane nodes
used for local commitment.) If a participant A wants to commit a new value, it must collect proofs from f_g other participants.
Obtaining these proofs starts after locally committing the value
(in the same manner introduced in the previous section.)
The committed value is then sent to f_g other participants.
Each participant would locally commit the value on its mirror
of participant A. If the value commits, then the participant
responds with f_i + 1 proofs from its local nodes. These responses are maintained by all nodes in participant A as an
annotation of the proved entry.
The same changes are applied to the send instruction.
Additionally, the transmission record would also include the
proofs from other participants and a node receiving a trans-
mission record would only accept it if the proofs of the source
participant and the other f_g participants are valid.
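The following is a minimal sketch (in Python, with hypothetical names) of the extra step a participant performs to tolerate f_g geo-correlated failures after an entry is locally committed; the choice of which f_g participants to contact is assumed to be fixed by configuration.

# Sketch of collecting cross-participant proofs for geo-correlated fault-tolerance (hypothetical names).
def commit_with_geo_proofs(entry, f_i: int, f_g: int, local_commit, mirror_participants, annotate_entry):
    """Locally commit an entry, then gather proofs from f_g other participants before it counts as committed."""
    local_commit(entry)                                   # PBFT-based local commitment (Section IV-B)
    proofs = {}
    for participant in mirror_participants[:f_g]:         # f_g participants out of the chosen 2f_g + 1
        # Each participant commits the value on its mirror and returns f_i + 1 signatures as a proof.
        proof = participant.commit_on_mirror(entry)
        assert len(proof) >= f_i + 1
        proofs[participant.name] = proof
    annotate_entry(entry, proofs)                         # all local nodes record the proofs with the entry
    return proofs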
Recovery in the case of a failure is done in a similar way
to recovery in primary-copy replication. A secondary that suspects the failure of the current primary starts processing requests and uses other participants as its secondaries. (The new secondaries must be ones in the originally chosen set of 2f_g + 1 participants.) This transfer of the primary role does not threaten the integrity of data since any entry must be replicated to f_g + 1 participants prior to commitment, which ensures intersection between any two primaries.
VI. DESIGN CONSIDERATIONS AND EXAMPLES
A. Read Operations
Committing to the Local Log is performed to persist a state
change or communication event. Read operations, on the other
hand, do not generally have to be committed or persisted.
However, in some instances, an application may want stronger
guarantees for its read operations, which may lead to having
to commit some read operations. There are different read
strategies in Blockplane (Here, we refer to operations that
read an entry of the Local Log.) The default read strategy
in Blockplane is a read-1 strategy, where the read is served
from the closest node to the client. The node provides a proof
of the entry’s validity which is the set of commit messages
corresponding to the entry. Another read strategy is to wait for
2f + 1 identical responses from different nodes. This strategy
overcomes the scenario where a malicious node returns that an
entry is not committed when in reality it is committed. The last
read strategy, and strongest, is a linearizable read that requires
committing the read operation to the log in the same way an
entry is committed.
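As an illustration, the following is a minimal sketch (in Python, with hypothetical helper names) contrasting the three read strategies; it only sketches the client-side quorum logic.

# Sketch of the three read strategies (hypothetical helpers; client-side view only).
from collections import Counter

def read_1(closest_node, position):
    """Default strategy: read from the closest node, which also returns a proof
    (the set of commit messages corresponding to the entry)."""
    entry, proof = closest_node.read_with_proof(position)
    return entry, proof

def read_quorum(nodes, position, f: int):
    """Wait for 2f + 1 identical responses; tolerates a malicious node that denies an entry exists."""
    responses = [n.read(position) for n in nodes]
    value, count = Counter(responses).most_common(1)[0]
    return value if count >= 2 * f + 1 else None

def read_linearizable(local_log, position):
    """Strongest strategy: commit the read operation itself to the Local Log, like any other entry."""
    local_log.log_commit(("read", position))
    return local_log.entries[position]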
B. Recovery from failures
Within a participant replication group, there are two types
of nodes that can fail, replica nodes and leader nodes. When
a replica node fails, operation is not brought to a halt because
enough other replicas are responsive to the leader’s requests.
When the replica becomes non-faulty again, it reads the state
of the Local Log from other nodes to catch up with the current
state. A failure of the leader node causes operation to stop until
a new leader is elected. In Blockplane, a leader controls the
Local Log of a single participant using the PBFT protocol. To
elect a leader, we use the same method used in PBFT, which is
a view-based leader election. A view is like an epoch, where all
nodes start at view 0. Each view has a predetermined leader
(e.g., the leader can be the node with id equal to the view
number modulo the number of nodes). If nodes suspect that
the current view’s leader is faulty, then they start moving to the
next view. This is repeated until a non-faulty leader is found.
(More details can be found in PBFT [4].)
Another type of failure is the failure of a whole participant.
If f_g > 0, Blockplane recovers from this failure by starting
to process the requests for the failed participant at one of
the secondary participants. This is akin to how primary-copy
replication handles the failure of a primary.
C. Batching and Group Commit
To increase throughput, Blockplane utilizes batching and
group commit, which are typical approaches used in many data
management and transaction processing systems. Blockplane
utilizes batching in a similar manner to SMR-based systems,
where transactions (or requests) are batched together. At any
given point in time, a leader only attempts to commit a single
batch and does not start the next one until the current one
is committed. The transactions in a batch are ordered in a
way that preserves any dependencies between them, i.e., if a transaction t1 reads from t2, then t2 is ordered before t1. The
leader and replicas perform the validation routines for each
transaction and vote positively to commit a batch only if all
the transactions are validated successfully. Once the batch is
committed, its transactions are applied according to their order
in the batch.
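The following is a minimal sketch (in Python, with hypothetical names) of forming a dependency-preserving batch and committing it as one unit; the dependency relation is assumed to be given and acyclic.

# Sketch of dependency-preserving batching (hypothetical names).
from graphlib import TopologicalSorter

def commit_batch(transactions, reads_from, log_commit, verify):
    """transactions: iterable of transaction ids; reads_from[t] = set of transactions t reads from.
    A transaction is ordered after everything it reads from, and the batch commits only
    if every transaction passes its verification routine."""
    graph = {t: reads_from.get(t, set()) for t in transactions}
    ordered = list(TopologicalSorter(graph).static_order())   # writers come before their readers
    if not all(verify(t) for t in ordered):
        return False                       # replicas vote to commit only if every transaction validates
    log_commit(("batch", ordered))         # the whole batch becomes a single Local Log entry
    return True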
D. Performance and Monetary Costs
Byzantizing a system using Blockplane incurs performance
and monetary costs due to the additional resources and com-
munication needed to run the protocols. In terms of storage and
compute resources, Blockplane requires 3f_i additional nodes
for each participant. In terms of communication, Blockplane
adds the local commitment overhead for all commit and
communication requests that incur three phases of communi-
cation across f_g + 1 participants. Additionally, the application
developers need to perform non-trivial effort to transform
their application to use Blockplane API and write verification
routines. All these overheads—and other smaller ones that are
covered in the paper—increase monetary costs as well.
Compared to an SMR approach, many of these overheads exist at varying degrees. For example, byzantine SMR requires a smaller number of nodes and less communication; however, it incurs higher latency, which may decrease throughput. Given
these added costs, Blockplane targets applications that wish to
tolerate byzantine failures. This is especially the case for ap-
plications that handle finances and mission critical operations,
such as e-commerce and banking applications.
E. Paxos Example
In this section, we present how the paxos protocol [10]
can be augmented to use Blockplane. We choose paxos for
this section due to its popularity and because agreement is
one of the problems that has been studied extensively with
the byzantine fault-tolerance model, thus allowing us to com-
pare the transformed paxos protocol with existing specialized
byzantine agreement protocols. However, Blockplane can be
applied to distributed protocols in general, given that they are
deterministic and start from the same initial state.
The paxos protocol has two main routines: Leader Elec-
tion and Replication. In the Leader Election routine, a node
attempts to become a leader by getting a majority of votes
from other nodes. In the Replication phase, a leader commits
a new value by getting a majority of votes accepting it. In the
following we show the details of the two routines and how
they can be implemented using the Blockplane interface.
Algorithm 3 shows the routines for Leader Election and
Replication. The state of the protocol consists of three vari-
ables: (1) r, which is the proposal number, initially set to some
unique number. (2) l, which is a boolean variable denoting
whether the participant is a leader. (3) max-val, which is a
value used in the paxos protocol as we describe next.
Algorithm 3: Paxos routines using Blockplane (code in
red corresponds to Blockplane interfaces).
1: r := proposal number, initially set to U
2: l := am I a leader, initially false
3: max-val := maximum accepted value, initially null
4: on LeaderElection () {
5: log-commit(Leader Election)
6: for every m ∈ M, where M is at least a majority of participants
7: send (msg: paxos-prepare(r), to: m)
8: while responses pending
9: responses receive ()
10: if responses include a majority of positive votes
11: l = true
12: max-val = maximum-accepted-value(responses)
13: log-commit (l, max-val)
14: else
15: r = next unique proposal number
16: log-commit (r)
17: }
18:
19: on Replication (in: value) {
20: log-commit(Replication, value)
21: if l == false
22: return
23: for every m ∈ M, where M is at least a majority of participants
24: send (msg: paxos-propose(r,value), to: m)
25: while responses pending
26: responses ← receive()
27: if responses include a majority of positive votes
28: log-commit (value committed)
29: else
30: r = next unique proposal number
31: l = false
32: log-commit (r, l)
33: }
The Leader Election routine starts off by committing that the
event has started. Then, a paxos prepare message, denoted paxos-prepare, is sent to at least a majority of participants
using the send instruction. Responses, in the form of paxos-
promise messages, are collected via the receive instruction.
Once collected, the node checks whether there is a majority of
positive votes. If this is the case, then the node declares that
it is the new leader by changing the l variable to true. Also,
the received paxos-promise messages may include previous
values that were accepted by the participant. If that is the
case, the variable max-val is updated with the value with the
largest proposal number. (This value must be the one used in
a subsequent Replication phase.) Then, the new values of l
and max-val are committed using log-commit. If a majority
of paxos-promise messages was not attained, then the node
updates the proposal number to the next available unique
proposal number and commits that change using log-commit.
The Replication routine also starts with committing that
the event has started. Then, the node verifies that it is the
leader before proceeding. The node then sends a paxos-
propose message to at least a majority of nodes with the
proposal number and value to be committed. The node receives
responses, in the form of paxos-accept messages, using the
receive instruction. If a majority of nodes votes positively,
the value is considered committed and the node commits that
event using log-commit. Otherwise, the node is no longer a
leader and updates the proposal number to the next available
unique proposal number. It also commits this event using log-
commit.
For brevity, we do not show the other algorithms such as
the ones to react to receiving paxos-prepare and paxos-
propose messages in addition to the verification and recovery
code. However, similar changes are applied to them to use
Blockplane.
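For illustration only (the paper omits these handlers), the following is a hedged sketch, in Python, of how an acceptor might react to a paxos-prepare message when using Blockplane; all names beyond the log-commit and send interfaces are assumptions.

# Hedged sketch (not shown in the paper) of an acceptor-side handler for paxos-prepare.
def on_paxos_prepare(r_received, state, log_commit, send, source):
    """state holds: promised_r (highest proposal promised), accepted_r, accepted_val."""
    if r_received > state.promised_r:
        state.promised_r = r_received
        # Persist the promise before replying, as required by Definition 1.
        log_commit(("promise", r_received))
        send(("paxos-promise", r_received, state.accepted_r, state.accepted_val), to=source)
    else:
        send(("paxos-reject", state.promised_r), to=source)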
VII. CORRECTNESS
In this section, we provide a proof sketch of Blockplane’s
correctness. (The proof sketches consider the basic design
of Blockplane as well as the geo-correlated fault-tolerance
extensions.) We discuss three properties in the following
lemmas:
Lemma 1: All honest (non-malicious) nodes corresponding
to a participant agree on the content of any entry in the Local
Log.
Proof sketch. The safety can be proven by contradiction.
Assume that two honest nodes, a and b, disagree on the value of an entry in L_A. This means that a and b have each collected a set of f_i + 1 reply messages from f_g + 1 participants. This means that both a and b have collected a set of f_i + 1 reply messages from at least one common participant P (this is because the set of Local Log replication across participants is limited to 2f_g + 1 participants). At least one of the reply messages collected by a is from an honest node and at least one of the reply messages collected by b is from an honest node (potentially different from the one collected by a). Two honest nodes in P cannot disagree about the contents of an entry because all the honest nodes in P followed the PBFT protocol. This disagreement is a contradiction.
Lemma 2: An honest node in a participant A only receives
a communication message from participant B if honest nodes
in participant B agree about its content and order.
Proof sketch. An honest node a runs the receive verification routine. The received message correctness (that it is agreed upon by the honest nodes of the sender participant B) can be proven by contradiction. Assume that the nodes in the sender B do not agree about the content of the sent message. However, the node a has a set of (f_g + 1)(f_i + 1) signatures from B, at least one of which is from an honest node b. Since the honest nodes in B follow the PBFT protocol in committing to their Local Log, the signature from b is a guarantee that all nodes in B agree on the content of the message. This is a contradiction, and proves the agreement of nodes in B on the content of the message.
The other part we need to prove is that the message is
received in order, i.e., there are no earlier messages that were
maliciously delayed or dropped. This is guaranteed by tracking
the earlier log positions of the sender by the receiver.
Lemma 3: A benign protocol that uses Blockplane to com-
mit SMR logs and communicate with others cannot make
illegal state transitions (an illegal state transition is an entry
in a Local Log that is not the result of correct execution of
the protocol with respect to all the previous Local Log entries.)
Proof sketch. The intuition of this proof is that Blockplane
masks any byzantine failures locally, and thus allows following
the benign protocol’s communication pattern globally (such is
the case in the presented example in Section VI-E.) This can be
proven by contradiction. Assume that an illegal state transition
of the benign protocol at a participant A has been made. This
illegal state transition is caused by one of two cases:
• An event that is processed in A: in this case, malicious nodes in A, A_m, have caused the illegal state transition. However, for A_m to influence the content of the Local Log, there would have to be more than f_i of them, which is a contradiction.
• A malicious message received from another participant B: the message includes signatures from f_i + 1 nodes in B. Since the message is malicious, the number of malicious nodes in B must be higher than f_i, which is a contradiction.
        C    O    V    I
C       0   19   61  130
O      19    0   79  132
V      61   79    0   70
I     130  132   70    0
TABLE I. The average round-trip times in milliseconds for every pair of the 4 used datacenters.
In addition to the properties above, Blockplane inherits the
availability characteristics of PBFT within a participant (due
to the Local Commit protocol) and inherits the availability
characteristics of primary-copy replication across participants
when f_g > 0.
VIII. EXPERIMENTAL EVALUATION
We present a performance evaluation of Blockplane in this
section. The evaluation is conducted across four Amazon AWS
datacenters in Virginia (V), Oregon (O), California (C), and
Ireland (I). The Round-Trip Times (RTTs) between the four
datacenters range between 19ms and 132ms (Table I.) In each
datacenter we use Amazon EC2 m5.xlarge instances. Each machine runs Linux and has 4 virtualized CPUs and 16 GB of memory.
The communication bandwidth between two nodes in the same
datacenter—as measured using iperf—is 640 MB/s.
For the evaluation, we compare with (non-byzantine)
Paxos [10] and PBFT [4], and a hierarchical variant of
PBFT. The prototypes developed for this evaluation aim to
highlight the communication overhead effect on the overall
normal-case performance. Some design elements that dealt
with aspects beyond the evaluation are not implemented, such
as recovery, independent code bases via techniques such as
n-version programming, and creating and checking signatures
and digests. These design aspects either do not play a role in
normal-case operation scenarios we evaluate or are negligible
compared to the wide-area latency cost and thus their absence
is not going to affect the validity of the presented results.
In the evaluations, unless we mention otherwise, we set the
fault-tolerance level f_i to 1 and f_g to 0. We also present eval-
uations with geo-correlated byzantine fault-tolerance where fg
is set to 1. In the Blockplane evaluations, we run 4 machines in
each datacenter to represent the nodes of a single organization.
In all experiments, we batch commands together to form big
batches. Each experiment is the average of committing 1000
batches after a warm-up period of committing 100 batches.
The size of a batch is 1000 bytes. (We vary the size of the
batches in some of the evaluations.) The contents of each batch are an arbitrary set of commands.
A. Local Commit Performance
In this section, we present a set of experiments to test the
performance of local commitment, which is the performance
of running the log-commit instruction. The log-commit instruction triggers a three-round byzantine protocol inside the datacenters. Thus, there is no wide-area communication involved in this set of experiments. Figure 4 shows the results of these experiments, which were conducted in the datacenter in Virginia. In the experiments, we vary the size of the batches between 1 KB and 2000 KBs. The network bandwidth between two nodes is 640 MB/s. Since we have four nodes in the deployment, a leader must send the payload to at least two other nodes. Therefore, we cannot expect the throughput to be larger than 320 MB/s. Due to other communications (to prepare, commit and apply), the practical throughput becomes even lower.
Fig. 4. The performance of local commitment while varying the block size: (a) latency, (b) throughput.
Figure 4(a) shows the latency of local commitment. With
small batch sizes (up to 100 KBs), the latency is within 1 ms.
However, larger batch sizes incur larger latency. Committing
1000 KBs batches incurs a latency of 4.5 ms and committing
2000 KBs batches incurs a latency of 8.2 ms. This signals
that there is stress on the system’s resources (e.g., network I/O) with batch sizes bigger than or equal to 1000 KBs.
This is validated by the throughput results that we show in
Figure 4(b). As we increase the batch size, the throughput
increases. For small batch size, the increase is more significant.
Increasing the batch size from 1 KB to 100 KBs results in a
60x increase in throughput. However, increasing the batch size
from 100 KBs to 1000 KBs results in only a 160% increase
in throughput. Beyond this point, the increase in throughput
is minimal. Increasing the batch size from 1000 KBs to 2000
KBs results in a 10% increase in throughput. This mirrors
the effect we observe on latency, where stressing the system
resources leads to increasing latency and reaching a throughput
plateau.
In summary, with the right choice of batch sizes, Blockplane
local commitment can achieve low latencies around 1 ms while
reaching a throughput of up to 83 MB/s.
Number of nodes (f_i)   4 (1)   7 (2)   10 (3)   13 (4)
Throughput (MB/s)          83      51       28       25
Latency (ms)              1.2     1.9      3.5        4
TABLE II. Local commitment performance while varying the number of nodes and f_i.
Another set of experiments measures the scalability of local
commitment by varying the number of nodes from 4 to 13
(corresponding to fi values from 1 to 4). The results are shown
in Table II for a batch size of 100 KB, which yielded the best
balance of throughput and latency in the previous experiment.
As we increase the number of
nodes (and consequently fi), the pressure on the network I/O
increases because the batches need to be sent to more replicas.
This decreases the throughput of local commitment from 83 MB/s
with 4 nodes to 25 MB/s with 13 nodes. The increase in the
number of nodes and the pressure on the network I/O also
increase the latency of committing a batch from 1.2 ms with 4
nodes to 4 ms with 13 nodes. In summary, increasing the
resilience of local commitment incurs a significant overhead and
should only be done when necessary.

Fig. 5. The performance of committing with geo-correlated
byzantine fault-tolerance (latency in ms for each scenario).
B. Geo-Correlated Fault-Tolerance
In this section, we test the performance of Blockplane
with geo-correlated byzantine fault-tolerance by varying the
tolerance level, fg, from 1 to 3. Figure 5 shows the results
of this set of experiments. The x-axis denotes the label of
the scenario and the level of geo-correlated byzantine fault-
tolerance. For example, the label C(1) denotes the latency
of committing at California with the level fg set to 1. Each
datacenter is represented with three data points, one for each
fg level from 1 to 3. Increasing the level of fg always leads
to an increase in latency. This is because a higher fg level
requires coordinating with more datacenters. However, the
magnitude of this increase varies from one datacenter to
another. The magnitude of the latency increase depends on many
factors, the most important of which is the wide-area latency
between datacenters. For example, increasing the level of fg
from 1 to 2 in California leads to a 176% increase in latency,
whereas increasing the level of fg from 1 to 2 in Virginia
leads to only a 13% increase in latency.
This difference in magnitude can be inferred from observing
the RTT latencies in Table I. California incurs a low latency
(19 ms) to communicate with its closest datacenter (Oregon)
compared to the latency (61 ms) to communicate with the
second closest datacenter (Virginia). On the other hand, the
RTT latencies between Virginia and all other datacenters are
close to each other (between 61 ms and 79 ms). Likewise,
the cost of achieving the same level of fg differs from one
datacenter to another for the same reason. The latency of
commitment depends on the RTT latency between the datacenter
and its fg closest datacenters. For fg equal to 1, California
and Virginia perform much better than the other datacenters.
For fg equal to 2, all datacenters achieve similar latencies
between 64 and 80 ms, except Ireland, which incurs a latency of
135 ms. For fg equal to 3, all datacenters incur a latency over
135 ms, except Virginia, which achieves a latency of 80 ms.
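A rough way to reason about these numbers is that commitment with level fg must be acknowledged by the fg closest remote datacenters, so the dominant cost is the RTT to the fg-th closest one. The sketch below encodes this rule of thumb; the RTT values are illustrative approximations loosely based on the numbers quoted in the text, not the exact Table I measurements.

```python
# Rough latency model for geo-correlated commitment: latency is
# dominated by the RTT to the fg-th closest remote datacenter.
# RTT values (ms) are illustrative, not the exact Table I numbers.
RTT_MS = {
    ("California", "Oregon"): 19, ("California", "Virginia"): 61,
    ("California", "Ireland"): 135, ("Oregon", "Virginia"): 64,
    ("Oregon", "Ireland"): 136, ("Virginia", "Ireland"): 79,
}

def rtt(a, b):
    return RTT_MS.get((a, b)) or RTT_MS.get((b, a))

def commit_latency_estimate(origin, fg, datacenters):
    remote_rtts = sorted(rtt(origin, d) for d in datacenters if d != origin)
    return remote_rtts[fg - 1]  # wait for the fg-th closest acknowledgment

dcs = ["California", "Oregon", "Virginia", "Ireland"]
print(commit_latency_estimate("California", 1, dcs))  # ~19 ms
print(commit_latency_estimate("California", 2, dcs))  # ~61 ms
```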
In summary, the effect of the level of geo-correlated byzan-
tine fault-tolerance on performance depends on the topology
of the network and the datacenter where the commitment occurs.
This means that placement is an important aspect of deployment
to achieve good performance. Also, knowing the sufficient level
of fault-tolerance is essential in order to avoid unnecessary
latency overhead.

Fig. 6. The performance of communication between participants
(latency in ms for each pair of datacenters).
C. Communication Performance
In this section, we perform a set of experiments to test
the performance of Blockplane’s communication interface
(i.e., the send and receive instructions.) Figure 6 shows the
results of sending a message from one datacenter to another
datacenter through the Blockplane interface. We show six data
points, one for each pair of datacenters. The latency for each
data point is the time to send a message through the send
interface, receive it using the receive instruction, and
finally acknowledge the receipt of the message back at the
source. The latency of communication varies across pairs of
datacenters. This is because the RTT latencies between pairs
of datacenters differ, and the RTT latency between a
pair of datacenters directly influences the latency of sending a
message through Blockplane. For example, sending a message
from California to Oregon requires 23.4 ms. Sending a mes-
sage between the following pairs of datacenters, California-
Virginia, Oregon-Virginia, and Virginia-Ireland, incurs a la-
tency between 64 ms and 80 ms. The highest communication
latency (over 135 ms) is incurred between the following pairs
of datacenters: California-Ireland and Oregon-Ireland.
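To make the measured quantity concrete, the sketch below outlines how one such data point could be obtained, assuming a hypothetical Python client wrapper around the send and receive instructions (all names, including address, await_ack, and msg.id, are illustrative and not Blockplane's actual API; both ends are collapsed into one process purely for illustration).

```python
# Sketch of measuring one data point in Figure 6 with hypothetical
# Blockplane client wrappers. Blockplane locally commits the message
# at the source before it crosses the wide area, and again at the
# destination before delivery, so the measured latency is roughly:
#   local_commit(src) + WAN one-way + local_commit(dst) + ack one-way.
import time

def measure_send_receive(src_client, dst_client, payload=b"ping"):
    start = time.perf_counter()
    src_client.send(dst_client.address, payload)  # commit locally, then ship over WAN
    msg = dst_client.receive()                    # commit locally, then deliver
    src_client.await_ack(msg.id)                  # acknowledgment back at the source
    return (time.perf_counter() - start) * 1000   # latency in ms
```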
An important observation to make is the overhead that
Blockplane adds to normal communication (communication
without byzantine fault-tolerance.) This overhead is caused
by the need to locally commit the communication records at
both the source and destination. We expect that the overhead
is not going to be significant because the local commit
overhead (intra-datacenter latency) is negligible in comparison
to the wide-area latency between datacenters. We quantify this
overhead by comparing the RTT latencies in Table I with
the numbers we obtained in Figure 6. The overhead incurred
by Blockplane communication is between 1% and 7% for
all pairs of datacenters except for California-Oregon where
the overhead is 23%. The reason for this is that the RTT
latency between California and Oregon is the lowest (19 ms
only). Therefore, the effect of the additional intra-datacenter
communication is more significant in comparison to other
pairs of datacenters.
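The overhead figures quoted here follow directly from comparing the measured Blockplane latency with the raw RTT, as in the following snippet (using the California-Oregon numbers quoted above).

```python
# Communication overhead relative to the raw wide-area RTT:
# overhead = (Blockplane latency - raw RTT) / raw RTT
def overhead(blockplane_ms, rtt_ms):
    return (blockplane_ms - rtt_ms) / rtt_ms

print(f"{overhead(23.4, 19):.0%}")  # California-Oregon: ~23%
```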
In summary, the performance of Blockplane communication
varies across pairs of datacenters depending on the RTT
latency between each pair. Also, Blockplane communication
introduces an overhead that is caused by the additional intra-
datacenter communication to commit communication records.
The overhead is more significant between nearby pairs of
datacenters (such as California and Oregon where the overhead
is 23%) but is negligible between pairs of datacenters with
higher RTT latency (the overhead can be as low as 1% for some
pairs of datacenters).

Fig. 7. The performance of Blockplane-paxos in comparison with
paxos, PBFT, and Hierarchical PBFT (latency in ms per
datacenter).
D. Performance of a Global Consensus Use Case: Byzantized
Paxos
In this section, we evaluate the use of Blockplane to trans-
form a non-byzantine protocol and make it tolerate byzantine
failures. Specifically, we take the paxos protocol and augment
it with the Blockplane commit and communication interfaces.
Therefore, we transform a typical deployment of paxos on
four datacenters where there is a single machine on each
datacenter to a Blockplane deployment on four datacenters
where there are four Blockplane machines in each datacenter.
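As an illustration of what this transformation looks like in code, the following sketch shows a paxos leader's Replication (accept) phase expressed against a hypothetical Blockplane client: durability goes through the log-commit instruction and all messaging goes through send and receive, while the paxos logic itself is untouched. All names are illustrative; this is not the evaluated implementation.

```python
# Sketch of a byzantized paxos Replication (accept) phase, using a
# hypothetical Blockplane client `bp` for durability and messaging.
def replicate(bp, slot, ballot, value, followers):
    # Durably record the leader's decision via the local byzantine log.
    bp.log_commit({"slot": slot, "ballot": ballot, "value": value})

    # Ship accept requests to the other datacenters through Blockplane.
    for f in followers:
        bp.send(f, {"type": "accept", "slot": slot,
                    "ballot": ballot, "value": value})

    # Wait for a majority of acknowledgments (the leader counts itself).
    acks = 1
    majority = (len(followers) + 1) // 2 + 1
    while acks < majority:
        reply = bp.receive()
        if reply["type"] == "accepted" and reply["slot"] == slot \
                and reply["ballot"] == ballot:
            acks += 1
    return True  # value is chosen for this slot
```

A crash-only paxos would write to local disk and use raw sockets at these points; the byzantized version simply swaps those calls for Blockplane's commit and communication interfaces.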
In the rest of this section, we present the performance of
the byzantized paxos protocol using Blockplane that we call
Blockplane-paxos. We also compare with three other protocols.
The first is paxos, which provides a baseline to measure the
byzantizing overhead incurred by Blockplane-paxos. The second
is PBFT, which provides a benchmark to compare the byzantized
paxos against a protocol that is specifically designed to solve
byzantine agreement. Finally, we compare with a third variant
that we call Hierarchical PBFT, which uses PBFT in a
hierarchical way similar to Blockplane but without the API
separation.
Figure 7 shows the results of this set of experiments. Each
data point represents the latency of the Replication phase of
paxos if the leader is at the corresponding datacenter in the
x-axis label. Paxos requires polling the votes of a majority to
perform the Replication phase. This means that the expected
latency at each datacenter is the RTT latency between that dat-
acenter and the closest majority to that datacenter. The results
in the figure confirm this expectation, where the Replication
phase latency is within 10% of the RTT latency to the majority
as derived from Table I. The performance of Blockplane-paxos
should also be close to the latency of an RTT to the majority.
This is because—like paxos—the leader needs to hear from
the majority. However, in Blockplane-paxos there is additional
overhead caused by intra-datacenter communication to locally
commit records for commit and communication operations.
This overhead varies across datacenters from almost no change
in latency (for the case of Ireland) to 33% overhead in the
case of California. The Blockplane-paxos overhead in Virginia
and Oregon is within 10–13%. The magnitude of the change
is affected by the original majority latency. If the majority
latency is small, then the effect of intra-datacenter latency
would be more significant. Otherwise, the overhead of intra-
datacenter communication would be masked by the large wide-
area communication latency.
We also compare with PBFT [4]. In this scenario, there
are four PBFT nodes, one at each datacenter.

Fig. 8. The performance of reacting to two types of node
failures: (a) backup failure, (b) primary failure (latency in ms
per batch number).

This means
that—like Blockplane-paxos in this set of experiments—the
number of tolerated byzantine failures is 1. PBFT requires
three rounds of communication across three out of the four
nodes to reach agreement. However, this does not mean that
the latency would simply be three times the RTT to three
out of the four nodes. This is because each node broadcasts
its messages to all nodes, making the end-to-end latency
depend not only on the RTTs from the leader, but also the
RTTs between the replicas. In the figure, we show the latency
to reach agreement for each datacenter. The PBFT latency
varies from 102 ms (for California) to 157 ms (for Ireland).
PBFT's latency is 16–78% higher than that of Blockplane-paxos
with the same level of fault-tolerance. The main reason for the
latency improvement achieved by Blockplane-paxos is that only a
single round of communication crosses datacenters; the rest of
the communication is localized within datacenters to
mask the effect of byzantine failures. Although Blockplane-
paxos achieves a better latency, it requires a larger number of
nodes.
The idea of using hierarchy and locality-aware computation
can be used without the overhead of Blockplane API sep-
aration and communication. We quantify the overhead of
Blockplane by comparing with a PBFT deployment that uses the
same communication patterns as Blockplane-paxos: PBFT is used
locally within each datacenter, and the SMR logs of PBFT are
used to communicate the events to be committed globally using
paxos. We call this Hierarchical PBFT. Hierarchical PBFT has
similar communication patterns to Blockplane-paxos but
without the overhead of API separation and communication.
Also, Hierarchical PBFT has the same wide-area communi-
cation patterns as paxos but with added local communication.
Therefore, we expect the latency of hierarchical PBFT to be
between the latencies of paxos and Blockplane-paxos. The
figure shows that this is generally the case in our experiments.
Blockplane is intended to be a general framework to byzan-
tize global-scale systems while providing efficient latency
characteristics through locality. This set of experiments using
the use-case of paxos (and global-scale consensus) shows that
using Blockplane may lead to comparable performance to
specialized byzantine protocols in global-scale environments.
Also, the evaluation showed that the overhead of using Block-
plane to byzantize paxos can be modest. However, increasing
the level of fault-tolerance may lead to more significant
overhead.
E. Reacting to Failure
We evaluate the performance characteristics of how Block-
plane reacts to node failures. Figure 8 shows the results of
two failure cases of participants. In this set of experiments we
set both fi and fg to 1. The first failure case is of a backup
participant (Figure 8(a).) In this scenario, the participant in
datacenter California is the primary and the other datacenters
are backups (each datacenter also has four nodes for local
commitment using PBFT.) In the first failure case, the primary
in California commits 100 batches. Up until Batch
45, the backup in Oregon is active, and since it is the closest
datacenter to California, it enables committing each batch in
one RTT from California to Oregon, which is around 20–40 ms.
We then emulate a failure of the backup in
Oregon by shutting down its servers. In this case, the primary
has to wait to hear from the next closest datacenter (Virginia),
which increases the latency to 60–80 ms. The second
case emulates the failure of the primary (California in this
case), which is shown in Figure 8(b). In this case, the primary
fails after committing 70 batches. This triggers one of the
backups to take over and become the new primary. In this case,
Virginia becomes the primary and commits batches 71 to 160.
The transition increases the latency since Virginia
needs more time to replicate to another backup compared to
California. Also, the primary transition can cause some batches
to experience higher latency than usual, such as the two batches
between batches 71 and 80 that incur a latency of
around 250 ms.
IX. RELATED WORK
Byzantine Agreement. Byzantine fault-tolerant protocols
date back to the early 1980s [11], [17]. A notable mile-
stone is the development of Practical Byzantine Fault Tolerance
(PBFT) [4], which is used by Blockplane to commit requests in
the Local Log. There has been a resurgence of byzantine fault-
tolerance protocols in the decade following the publication of
PBFT [1], [5], [8], [9], [12], [18]. These protocols offer various
performance trade-offs in terms of metrics such as the number
of communication rounds needed to commit a command and
the number of nodes needed to tolerate f failures. For example,
Q/U [1] requires the use of 5f + 1 nodes rather than the 3f + 1
nodes required by PBFT and other protocols [4], [5], [8].
Blockplane differs from earlier byzantine agreement ap-
proaches in terms of its abstraction, architecture, and algo-
rithms. In terms of abstraction, Blockplane acts as a middle-
ware with both a commitment and communication interface,
unlike other byzantine services that rely on the SMR abstrac-
tion of commitment only. Blockplane’s abstraction enables
more flexibility to the programmer to express—through the
communication interface—distributed protocols that rely on
coordination. Also, having the communication interface dis-
tinct from the commit interface allows Blockplane to optimize
the communications performed to minimize wide-area com-
munication in ways that are unattainable if all requests (both
commit and communication) are treated similarly. In terms
of architecture, Blockplane is different from many byzantine
protocols in that it has a hierarchical architecture that groups
neighboring nodes together. This allows Blockplane’s algo-
rithms to be designed in a way that limits wide-area communi-
cation when possible. Steward [2] is a byzantine protocol that
enables hierarchical consensus in a similar way to Blockplane.
Blockplane differs in that its algorithms propose a two-
dimensional data structure for coordination—rather than the
one-dimensional structure in SMR—to leverage this hierarchy.
Database Application Fault-tolerance Middleware. There
have been a number of works on developing frameworks and
middleware to make database applications fault-tolerant. This
includes ones that target tolerating benign failures, such as
the Phoenix Project [3], which uses redo logging to persist the
state of applications. Closer to our work are frameworks
that target tolerating byzantine failures, such as Mitra [13]
and Fireplug [16]. Unlike Mitra, Blockplane targets wide-
area replication scenarios. Unlike Blockplane, Fireplug only
considers graph databases.
X. CONCLUSION
In this paper, we propose Blockplane. Blockplane is a
hierarchical middleware solution that is intended to transform
systems to make them tolerate byzantine failures. In addition to
transforming systems to tolerate byzantine failures, Blockplane
aims to also reduce the wide-area communication incurred
by the transformed systems in global-scale multi-organization
coordination applications. It does so by a hierarchical, locality-
aware approach where communication is localized as much as
possible. Specifically, each participant is augmented with a
number of local Blockplane nodes that perform the durability
and communication tasks on behalf of the application. The
local nodes run local byzantine fault-tolerance protocols to
mask the failures locally. Then, inter-datacenter communica-
tion is only performed when necessary for communication or
to tolerate datacenter-scale failures.
XI. ACKNOWLEDGEMENTS
This research is supported in part by the NSF under grant
CNS-1815212.
REFERENCES
[1] M. Abd-El-Malek, G. R. Ganger, G. R. Goodson, M. K. Reiter, and J. J. Wylie. Fault-scalable
byzantine fault-tolerant services. ACM SIGOPS Operating Systems Review, 39(5):59–74, 2005.
[2] Y. Amir, C. Danilov, D. Dolev, J. Kirsch, J. Lane, C. Nita-Rotaru, J. Olsen, and D. Zage.
Steward: Scaling byzantine fault-tolerant replication to wide area networks. IEEE Transactions
on Dependable and Secure Computing, 7(1):80–93, 2010.
[3] R. S. Barga and D. B. Lomet. Phoenix project: Fault-tolerant applications. SIGMOD Record,
31(2):94–100, 2002.
[4] M. Castro, B. Liskov, et al. Practical byzantine fault tolerance. In OSDI, volume 99, pages 173–
186, 1999.
[5] J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. Hq replication: A hybrid quorum
protocol for byzantine fault tolerance. In Proceedings of the 7th symposium on Operating systems
design and implementation, pages 177–190. USENIX Association, 2006.
[6] H. S. Gunawi, M. Hao, R. O. Suminto, A. Laksono, A. D. Satria, J. Adityatama, and K. J. Eliazar.
Why does the cloud stop computing?: Lessons from hundreds of service outages. In Proceedings
of the Seventh ACM Symposium on Cloud Computing, pages 1–16. ACM, 2016.
[7] S. Gupta and M. Sadoghi. Blockchain Transaction Processing, pages 1–11. Springer International
Publishing, Cham, 2018.
[8] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong. Zyzzyva: speculative byzantine fault
tolerance. In ACM SIGOPS Operating Systems Review, volume 41, pages 45–58. ACM, 2007.
[9] R. Kotla and M. Dahlin. High throughput byzantine fault tolerance. In Dependable Systems and
Networks, 2004 International Conference on, pages 575–584. IEEE, 2004.
[10] L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, December 2001.
[11] L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Transactions on
Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.
[12] J. Li and D. Mazières. Beyond one-third faulty replicas in byzantine fault tolerant systems. In NSDI, 2007.
[13] A. F. Luiz, L. C. Lung, and M. Correia. Mitra: Byzantine fault-tolerant middleware for transaction
processing on replicated databases. ACM SIGMOD Record, 43(1):32–38, 2014.
[14] A. A. Mamun, T. Li, M. Sadoghi, and D. Zhao. In-memory blockchain: Toward efficient and
trustworthy data provenance for HPC systems. In IEEE International Conference on Big Data,
Big Data 2018, Seattle, WA, USA, December 10-13, 2018, pages 3808–3813, 2018.
[15] F. Nawab, D. Agrawal, and A. El Abbadi. Dpaxos: Managing data closer to users for low-latency
and mobile applications. In SIGMOD, 2018.
[16] R. Neiheiser, D. Presser, L. Rech, M. Bravo, L. Rodrigues, and M. Correia. Fireplug: Flexible
and robust n-version geo-replication of graph databases. In 2018 International Conference on
Information Networking (ICOIN), pages 110–115. IEEE, 2018.
[17] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of
the ACM (JACM), 27(2):228–234, 1980.
[18] J. Yin, J.-P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin. Separating agreement from
execution for byzantine fault tolerant services. ACM SIGOPS Operating Systems Review,
37(5):253–267, 2003.