The Journal of Systems and Software 112 (2016) 96–109
Performance optimization for state machine replication based on
application semantics: A review
Wenbing Zhao
Department of Electrical Engineering and Computer Science, Cleveland State University, 2121 Euclid Avenue, Cleveland, OH 44115, USA. E-mail address: wenbing@ieee.org
article info
Article history:
Received 21 July 2015
Revised 2 November 2015
Accepted 3 November 2015
Available online 12 November 2015
Keywords:
Application semantics
State machine replication
Fault tolerance
Byzantine agreement
abstract
The pervasiveness of cloud-based services has significantly increased the demand for highly dependable sys-
tems. State machine replication is a powerful way of constructing highly dependable systems. However, state
machine replication requires replicas to run deterministically and to process requests sequentially according
to a total order. In this article, we review various techniques that have been used to engineer fault tolerance
systems for better performance. Common to most such techniques is the customization of fault tolerance
mechanisms based on the application semantics. By incorporating application semantics into fault tolerance
design, we could enable concurrent processing of requests, reduce the frequency of distributed agreement
operations, and control application nondeterminism. We start this review by making a case for considering
application semantics for state machine replication. We then present a classification of various approaches
to enhancing the performance of fault tolerance systems. This is followed by the description of various fault
tolerance mechanisms. We conclude this article by outlining potential future research in high performance
fault tolerance computing.
© 2015 Elsevier Inc. All rights reserved.
1. Introduction
The pervasiveness of cloud-based services has significantly in-
creased the dependability expectations of many computer systems
not only from business partners, but also from many millions of end-
users. Such systems must be made resilient against various hardware
failures and possibly against cyber attacks as well. State machine
replication has been shown to be an effective technique to help achieve this
goal (Zhao, 2014c). Highly efficient fault tolerance algorithms have
been developed to tolerate both crash faults and Byzantine faults, the
most seminal of which are Paxos (Lamport, 1998, 2001) and PBFT
(Castro and Liskov, 2002). However, state machine replication re-
quires replicas to run deterministically and to process requests se-
quentially according to a total order. We argue that these constraints
could impede the adoption of fault tolerance techniques in practice,
for example:
Practical systems often involve nondeterministic operations when
they execute clients’ requests, and states of the replicas could di-
verge if the application nondeterminism is not controlled.
Executing requests at the replicated server sequentially according
to a total order often results in unacceptably low system through-
put and high end-to-end latency.
To overcome these issues, the application semantics must be con-
sidered, as demonstrated by a large number of fault tolerance systems
(Burrows, 2006; Castro and Liskov, 2002; Chandra et al., 2007; Cowl-
ing et al., 2006; Kotla et al., 2009; Li et al., 2012; MacCormick et al.,
2004; Moraru et al., 2013; Zhao, 2009; Zhao et al., 2009). For example,
most of them have focused on building fault tolerant storage systems
such as file systems and database systems, where a lighter weight
mechanism is designed to handle read-only requests while update
requests are totally ordered. The identification of read-only requests
requires the knowledge of the application semantics of a particu-
lar application. In some systems, such as a networked file system
(NFS), some requests would trigger the access of the local physical
clock, which constitutes a nondeterministic operation. In Castro and
Liskov (2002), the application semantics of NFS was used to identify
this nondeterministic operation and to design a control mechanism
accordingly.
The need for incorporating application semantics is further show-
cased by recent works on fault tolerance systems designed to oper-
ate in wide-area networks (Li et al., 2012; Moraru et al., 2013). In Li
et al. (2012), a hybrid replica consistency model is used to reduce the
end-to-end latency and enhance the system throughput for wide-
area fault tolerance systems based on state machine replication. In
this hybrid model, commutative requests (referred to as Blue oper-
ations) are executed locally and the operations are asynchronously
disseminated to other replicas using an eventual consistency model.
On the other hand, a strong consistency model is used for requests
that have inter-dependencies (referred to as Red operations). Further-
more, some operations that are not directly commutative are con-
verted into commutative operations (the converted operations are
referred to as shadow operations) for even better performance. The
determination on whether or not two requests are commutative, and
the transformation of some pairs of requests into commutative re-
quests, all require the knowledge of application semantics.
In Moraru et al. (2013), a new variation of the Paxos consensus al-
gorithm, referred to as Egalitarian Paxos (or EPaxos), was proposed
for state machine replication. EPaxos requires every request to in-
dicate whether or not it is read-only, and to provide the set of re-
quests on which it depends for each update request. Total ordering
is needed for conflicting requests only and non-conflicting requests
are executed concurrently, thereby, significantly increasing the sys-
tem throughput. It is apparent that application semantics is essential
for EPaxos to work properly.
We start this review by elaborating why considering appli-
cation semantics for state machine replication is necessary. We
then propose a classification of various approaches to building
practical fault tolerance systems. This is followed by the descrip-
tion of the actual performance engineering mechanisms under
the classification framework. We conclude this article by outlin-
ing potential future research in high performance fault tolerance
computing.
This article is based on our previous work in attempting to classify
existing approaches to designing practical Byzantine fault tolerance
systems by incorporating application semantics. We are not aware
of other work that aims to provide a systematic review of this sub-
ject. In Zhao (2014c), we coined the term “application-aware Byzan-
tine fault tolerance” to refer to this line of work, and compiled known
research works roughly based on their complexity. In Zhao (2014b),
we proposed our first classification framework. In this article, we
have expanded the classification to include both conservative Byzan-
tine fault tolerance and optimistic Byzantine fault tolerance. Further-
more, we have widened the scope of the classification to include state
machine replication with both the crash fault and Byzantine fault
models.
The ultimate goal of our research is to provide a guideline on de-
signing high performance practical fault tolerance systems by iden-
tifying when a distributed agreement is needed, and on what opera-
tions. This is analogous to the challenge of building a secure system,
where one must know when and where to use security primitives,
such as encryption and digital signatures. We believe that it is time
to treat distributed agreement algorithms as basic building blocks for
dependable systems the same way as security primitives to secure
systems. The use of security primitives alone does not warrant a se-
cure system. Similarly, the use of distributed agreement in a naive
manner does not necessarily enhance the dependability of a system.
It is essential to know exactly when to use the tool and on what oper-
ations. This cannot be accomplished without considering application
semantics.
The contributions of this review include
We present a strong argument that it is not practical to treat the
server application as a black box when replicating it for fault tol-
erance, and that one cannot ignore the application semantics in
dependability design. In addition to performance overhead and
replica consistency issues, we identify scenarios when the repli-
cated system may deadlock if requests are executed sequentially.
We propose a classification framework for various performance
engineering approaches to fault tolerance, together with the de-
scription of the mechanisms used in these approaches. They could
serve as a guideline on designing practical dependable systems.
We outline potential future research directions that could reduce
the cost of application semantics discovery and the maintenance
of custom fault tolerance implementations.
2. Why application semantics matters
Fault tolerance algorithms designed for state machine replication
concern only the total ordering of requests to be delivered to the
replicated server replicas. Hence, such algorithms can be used by any
application as long as the replicated component acts deterministi-
cally as a state machine, i.e., given the same request delivered in the
same total order, all replicas would go through the same state transi-
tions (if any), and produce exactly the same reply. However, this does
not mean that we should treat each application as a black box and
employ a fault tolerance algorithm as it is by totally ordering all re-
quests and executing them sequentially according to the total order.
In the following, we present three major motivations for exploiting
application semantics in fault tolerance.
2.1. Reduce runtime overhead
There are two main types of runtime overhead introduced by state
machine replication based fault tolerance algorithms
1. Communication and processing delays for each remote invoca-
tion due to the need for total ordering of requests, which would
impact the end-to-end latency.
2. The loss of concurrency degrees at the replicated server due to
the sequential execution of requests, which impacts the sys-
tem throughput (i.e., how many requests can be handled by the
replicated server per unit of time).
By exploiting application semantics, we can introduce the follow-
ing optimizations:
Reducing the end-to-end latency by not totally ordering all re-
quests. There is no need to totally order read-only requests
(i.e., requests that do not modify the server state). In addition, for
some stateless session-oriented applications, source ordering of
requests may be sufficient (Chai and Zhao, 2013).
Enabling concurrent processing at the server replicas for some
requests. Non-conflicting requests can be executed concurrently
(Kotla and Dahlin, 2004; Raykov et al., 2011). Doing source order-
ing alone for requests also increases the system throughput (Chai
and Zhao, 2013).
Furthermore, deferred agreement for session-oriented applica-
tions (using one distributed agreement instance for a group of re-
quests, typically at the end of the session) could further increase
the system throughput (Zhang et al., 2012).
2.2. Respect causality and avoid deadlocks
General purpose fault tolerance algorithms are designed for sim-
ple client-server applications where clients do not directly interact
with each other and send requests to the replicated server inde-
pendently. For multi-tiered applications with sophisticated interac-
tion patterns (Chai and Zhao, 2012b), the basic assumption for these
fault tolerance algorithms may no longer hold and hence, if used in
a straightforward manner, may lead to two problems: (1) causality
violation, when the total order imposed on two or more requests is
different from their causal order, and (2) deadlocks, when sequen-
tial execution of requests is imposed when concurrent processing
of some requests is mandatory according to the application design
(Zhao et al., 2005a). In these cases, application semantics must be
tapped to discover the causal ordering between requests and iden-
tify what requests must be delivered concurrently. Otherwise, the in-
tegrity and the availability of the system could be lost.
2.3. Control replica nondeterminism
State machine replication requires that replicas behave determin-
istically when processing requests. However, many applications are
involved with some form of nondeterministic operations, such as tak-
ing a timestamp and generating a random number using a pseudo
random number generator. Without considering the application se-
mantics, it is usually not possible to know whether or not nondeter-
ministic operations would be involved in request processing (Zhang
et al., 2011). In addition, multithreaded applications are pervasive and
it is particularly difficult to control the nondeterminism caused by
multithreading. In fact, this is a major hindrance to the adoption of
state machine replication by such applications.
If such nondeterministic operations are not controlled, the states of
the replicas might diverge and the replicas might produce different
replies to the client. Hence, exploiting application semantics not only
helps in minimizing the runtime overhead, but also helps guaran-
tee the safety and integrity of fault tolerance systems (especially for
Byzantine fault tolerance systems).
3. Classification of performance engineering approaches
As shown in Fig. 1, performance engineering approaches for state
machine replication can be divided into two different camps. In one
camp, approaches are developed to achieve conservative fault toler-
ance. In the other camp, the performance improvement is accom-
plished by executing a request before the total order of the request
has been established, hence, the name optimistic fault tolerance. In
conservative fault tolerance, the focus is to ensure the consistency of
message execution order at all nonfaulty replicas. The objective of op-
timistic fault tolerance, on the other hand, is to speculatively execute
requests as early as possible and execute multiple requests concur-
rently whenever possible. As a tradeoff, nonfaulty replicas may be-
come inconsistent temporarily, hence, appropriate mechanisms must
be designed to ensure prompt detection of, and recovery from, in-
consistency of the replicas. Compared with conservative approaches,
optimistic fault tolerance protocols offer better performance at the
cost of poorly masking faults, i.e., they can more easily affect liveness
properties and render the system unusable.
3.1. Conservative fault tolerance
Within the camp of conservative fault tolerance, we can further
divide the approaches based on two different criteria. The first crite-
rion concerns the types of operations performed at each replica. The
second criterion concerns whether or not the context to which the
request belongs is considered.
According to the first criterion, we classify the approaches based
on whether or not the operations on state objects are limited to basic
or complex types.
Basic operation: an operation is basic if it is a read or write opera-
tion on a state object, or the creation or deletion of an object. Fur-
thermore, if the operation is write or create, the value of the object
is set deterministically, i.e., it is either provided by the client that
issues the request, or set by the system deterministically. Note
that the create operation may effectively access two state objects.
When an object factory is used, or the object is created within a
specific hierarchy such as within a particular directory, the cre-
ate operation will inevitably access both the object factory (or the
parent directory) and the object created.
Complex operation: it refers to an operation beyond basic opera-
tions, such as increment and decrement for integer numbers, and
append or truncate for text strings. We also consider an operation
complex if it sets a new value to a state object nondeterministi-
cally, such as using the local clock value or using a value produced
by a pseudo random number generator.
According to the second criterion, if the context is not considered
(or there is no such context), requests from different clients are as-
sumed to be sent independently. On the other hand, if a context is
present, it correlates a number of requests issued by the same or dif-
ferent clients within the same session. If a request is issued within
such a session, the representation of the context is typically included
in the message.
We first discuss the resulting two categories using the first crite-
rion. We then elaborate the implications of the context as used in the
second criterion.
3.1.1. Requests with basic operations
For this category, the relationship between different requests is
determined by the target state objects and the corresponding opera-
tions. Two requests are considered inter-dependent if they access at
least one common state object and the operation on that object from
any of the requests is a write operation. Otherwise, requests are con-
sidered independent.
Note that for a write or a create operation, we assume that the new
value is determined deterministically. Furthermore, different create
operations are independent. Similarly, different delete operations are
independent as well. However, create and delete operations must be
totally ordered with respect to each other (i.e., they are considered
inter-dependent).
For requests that are inter-dependent, they must be delivered and
executed sequentially according to some total order at all nonfaulty
replicas to ensure strong replica consistency. Independent requests,
on the other hand, can be delivered and executed concurrently with
respect to each other.
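To make this rule concrete, the following sketch (with purely illustrative types and names that are not taken from any of the cited systems) tests whether two requests with basic operations are inter-dependent; create and delete are treated as write-like accesses to the objects they touch:

```python
from dataclasses import dataclass
from typing import FrozenSet

# Basic operation kinds on state objects.
READ, WRITE, CREATE, DELETE = "read", "write", "create", "delete"

@dataclass(frozen=True)
class Op:
    kind: str   # one of READ, WRITE, CREATE, DELETE
    obj: str    # identifier of the target state object

@dataclass(frozen=True)
class Request:
    ops: FrozenSet[Op]

def inter_dependent(r1: Request, r2: Request) -> bool:
    """Two requests conflict if they access a common state object and at
    least one of the two accesses is not a plain read (create and delete
    are treated as write-like here, a simplification of the rules above)."""
    for a in r1.ops:
        for b in r2.ops:
            if a.obj != b.obj:
                continue
            if a.kind == READ and b.kind == READ:
                continue
            return True
    return False

# Independent requests may be delivered and executed concurrently;
# inter-dependent requests must follow the agreed total order.
r1 = Request(frozenset({Op(READ, "x"), Op(WRITE, "y")}))
r2 = Request(frozenset({Op(READ, "x")}))
print(inter_dependent(r1, r2))   # False: both only read x
```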
3.1.2. Requests with complex operations
For this category, a request may be involved with basic as well as
complex operations. The dependency among the requests can be ana-
lyzed in a similar manner as that for requests with basic operations by
classifying complex operations into read-oriented and write-oriented
types. However, such analysis is not adequate.
Even if two write-oriented requests access the same state object,
they might not be conflicting with each other depending on the com-
plex operations involved. For example, if both requests are involved
with the increment operation on the same state object, they are con-
sidered commutative and hence, can be delivered and processed con-
currently.
The complexity of the operation could also require additional
mechanisms to ensure strong replica consistency. For example, if ei-
ther the value to be returned to the client, or the new value to be
assigned to the state object is determined based on the reading of the
local clock, or on the output from a pseudo random number genera-
tor, additional mechanisms are required to make sure that the same
value is chosen by all nonfaulty replicas (Zhao, 2008).
3.1.3. Session-oriented interactions
For session-less interactions between clients and server replicas,
requests are assumed to be sent independently. Hence, server repli-
cas may determine a total order of the requests sent by different
clients arbitrarily (usually according to the order in which they are re-
ceived at the primary). For concurrent requests, they may be batched
together for total ordering. Obviously, requests issued by the same
client are ordered according to the order in which they are sent to
preserve causality.
For requests sent within a session, however, they may be corre-
lated. Furthermore, session-oriented applications typically are multi-
tiered. This correlation among the requests imposes the following
constraints on the ordering and handling of requests: (1) the total
order of requests must respect their causality, and (2) some nested
invocations may have to be executed before another request is fully
processed. Issues related to these constraints and the corresponding
solutions will be presented in Sections 5.3.1 and 5.3.2. On the other
hand, the correlation among the requests within a session also of-
fers additional opportunities for optimization such as source ordering
(Chai and Zhao, 2013) and deferred agreement (Zhang et al., 2012),
which will be elaborated in Sections 5.3.3 and 5.3.4.
Fig. 1. Classification of various approaches to performance optimization for state machine replication: conservative fault tolerance is subdivided based on the types of operations on state objects (requests with basic operations vs. requests with complex operations) and based on the relationship between requests (session-less vs. session-oriented interactions); optimistic fault tolerance is subdivided into the single-message level and the multi-message level.
3.2. Optimistic fault tolerance
Optimistic fault tolerance under the crash fault model has been
well studied (Li et al., 2012; Moraru et al., 2013; Saito and Shapiro,
2005). The first form of optimistic Byzantine fault tolerance is re-
ferred to as tentative execution (Castro and Liskov, 2002) or specula-
tive execution (Kotla et al., 2009). This form of optimization allows a
replica to execute a request before knowing that the request has been
totally ordered, and yet the requests are still executed sequentially at
each replica. Hence, we call this form of optimization single-message
optimistic fault tolerance. The potential benefit of this form of opti-
mistic fault tolerance is the reduced end-to-end latency perceived by
clients.
In Chai and Zhao (2014a, b, c); Zhao and Babi (2013), a more
aggressive form of optimistic fault tolerance approach is employed
where replicas execute multiple requests concurrently and synchro-
nize their state periodically or on-demand when an inconsistency
among the replicas has been detected. We call this approach multi-
message optimistic fault tolerance. In this approach, the end-to-end
latency can be reduced and the throughput may be increased signif-
icantly. As a tradeoff, multi-message optimistic fault tolerance en-
sures a weaker form of consistency called eventual replica consis-
tency (Saito and Shapiro, 2005).
Note that the nature of some applications dictates that multi-
message optimistic fault tolerance be used, such as realtime collab-
orative editing applications where multiple users expect to be able
to update the shared document concurrently (Sun et al., 1998). Op-
timistic fault tolerance would also be required when the availability
outweighs the consistency of the system (Brewer, 2012; Ramakrish-
nan, 2012; Saito and Shapiro, 2005).
4. System model
The asynchronous distributed system model is assumed when
establishing the safety property of all performance engineering
mechanisms described in Sections 5 and 6. The liveness prop-
erty is ensured only when the system is behaving partially
synchronously.
For client-server applications, the server is replicated with suffi-
cient replication degrees. For multi-tiered applications, the middle-
tier servers or the backend server is replicated in the same way as
the client-server applications except that a replica may issue nested
requests to, and receive the corresponding replies from, other com-
ponents in the system.
For state machine replication with the crash fault model, we assume that n = 2f + 1 replicas are used to tolerate up to f faulty replicas. For state machine replication with the Byzantine fault model, we assume that n = 3f + 1 replicas are used to tolerate up to f faulty replicas. 3f + 1 is the optimal number of replicas for the Byzantine fault model (Castro and Liskov, 2002), as explained in the following. A replica has to make a decision when it has collected inputs from n − f replicas because up to f replicas might be faulty. Among the n − f inputs, up to f of them might have come from faulty replicas. Hence, only n − 2f of them may be sent from nonfaulty replicas. We want to ensure that the majority of inputs are from nonfaulty replicas. Hence, n − 2f must be greater than f, which means that n > 3f. Therefore, the smallest total number of replicas needed for a system to tolerate up to f faulty replicas is 3f + 1.
In addition, under the Byzantine fault model, messages exchanged must be protected by an unforgeable security token to prevent a faulty replica from impersonating another nonfaulty replica. In most cases, a message authentication code could be used as the security token. For view-change related messages, a digital signature is typically used as the security token for stronger protection.
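The quorum-size reasoning above can be restated as a single chain of inequalities; this is only a compact rendering of the argument already given in the text:

```latex
% Of the n - f inputs a replica waits for, up to f may come from faulty
% replicas, so at least n - 2f come from nonfaulty replicas. Requiring that
% the nonfaulty inputs outnumber the faulty ones:
\[
  (n - f) - f \;=\; n - 2f \;>\; f
  \quad\Longrightarrow\quad n > 3f
  \quad\Longrightarrow\quad n \ge 3f + 1 .
\]
```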
We assume that a replica is either stateless or stateful. A state-
less replica simply processes a request based on some predefined ap-
plication logic and it does not maintain any state for any client across
different replicas. A stateful replica consists of one or more state ob-
jects, and the processing of a request may result in the change of one
or more state objects.
Furthermore, we assume that the application is designed to exe-
cute concurrent requests with multiple threads, such as in a thread
pool, for maximum throughput as well as for the capability of han-
dling multiple concurrent clients. It is not interesting to consider
applications that use the single-threaded execution model in this
review, because such applications fit the state machine replication
requirement naturally.
Unless otherwise stated, we assume that replicas are directly in-
volved with distributed consensus to ensure the agreement of the
message total ordering and other values if necessary, although the
task of distributed agreement (Byzantine agreement in particular)
can be separated out and provided as a service by a separate cluster
(Clement et al., 2009; Yin et al., 2003; Chai and Zhao, 2012a).
5. Conservative fault tolerance mechanisms
In this section, we describe the mechanisms for conservative fault
tolerance in three parts. In the first two parts, we elaborate the mech-
anisms designed for basic operations, and those to accommodate
complex operations, respectively. These mechanisms are applicable
to both session-less and session-oriented applications. In the third
part, we present the mechanisms designed specifically for session-
oriented applications, regardless of the type of operations (i.e., basic or
complex).
5.1. Requests with basic operations
Research on requests with basic operations is heavily influenced
by techniques developed for transaction processing (Gray and Reuter,
1992), in particular, the concurrency control theories. This is not sur-
prising because the problem layout for requests with basic operations
is rather similar to that for transaction processing. It is intuitive to
identify and handle read-only requests differently from requests that
modify the server state so that the end-to-end latency for read-only
requests is minimized. Requests that might modify one or more state
objects are analyzed for potential dependency based on the type of
operations (e.g., read or write) and target state objects. In a conserva-
tive approach introduced in Section 5.1.2, a partial order is imposed
on the requests that allows independent requests to be processed in
parallel. A special case is requests partitioning, where requests that
operate on disjoint state objects can be separated based on some sim-
ple measure and handled concurrently without further dependency
analysis.
5.1.1. Read-only requests
It is well-known that the majority of operations are read-only.
Hence, it has long been a design principle to optimize for read-only
operations. This is the case for replicated database systems (Gray
and Reuter, 1992), group communication systems (Birman, 1985), and
state machine replication with the crash fault model in the pre-Paxos
era (Felber, 2001). Interestingly, it is not straightforward to optimize
read-only operations for state machine replication based on the clas-
sic Paxos algorithm as pointed out by Chandra et al. (2007). By de-
fault, read-only requests are totally ordered with respect to all other
requests in Paxos to ensure that the most up-to-date state is returned.
One may attempt to allow the read from the primary replica alone
without going through an instance of Paxos. However, doing so may
risk of reading stale data if another replica has been elected to be
the new primary and the previous primary is not aware of the situ-
ation. In Chandra et al. (2007), a master lease mechanism (Gray and
Cheriton, 1989) is used to ensure that no other replica could be pro-
moted to the primary role while the current primary is still holding
the lease. The master lease mechanism enables the primary to serve
read-only requests directly without having to run an instance of Paxos.
The first mechanism designed for read-only requests in the con-
text of Byzantine fault tolerance is described by Castro and Liskov
(2002). A read-only request can be delivered at a replica immedi-
ately without being totally ordered first. To ensure that it reads from
the updates that are causally preceding the read, the client collects
2f + 1 matching reply messages before it delivers the reply. In case
the client fails to collect 2f + 1 matching replies, it resends the
request as a non-read-only request. Note that here we assume that
the total number of replicas n is 3f + 1. If n > 3f + 1, the number
of matching reply messages would increase accordingly (Malkhi and
Reiter, 1998).
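As an illustration of this client-side rule, the sketch below (hypothetical names; it assumes n = 3f + 1 and a callback that re-issues the request through the total-ordering protocol) delivers a read reply only when 2f + 1 replicas have returned the same value:

```python
from collections import Counter

def deliver_read_reply(replies, f, resend_as_ordered):
    """Client-side handling of a read-only request, as described above:
    accept a reply only if 2f + 1 replicas returned the same value;
    otherwise fall back to a totally ordered (non-read-only) request.
    `replies` maps replica id -> reply value; `resend_as_ordered` is a
    callback that re-issues the request through the agreement protocol."""
    counts = Counter(replies.values())
    value, votes = counts.most_common(1)[0] if counts else (None, 0)
    if votes >= 2 * f + 1:
        return value
    return resend_as_ordered()

# Example with f = 1 (n = 3f + 1 = 4 replicas): three matching replies suffice.
replies = {0: "v5", 1: "v5", 2: "v5", 3: "v4"}
print(deliver_read_reply(replies, f=1, resend_as_ordered=lambda: "ordered-read"))
```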
Fig. 2. The architecture of the parallelizer framework for concurrent execution of independent requests: the parallelizer sits between the agreement layer and the executor (the server application); ordered requests are placed in a to-be-executed queue via insert(), idle worker threads obtain non-conflicting requests (according to the concurrency matrix) via next_request(), and remove_request() is called when execution completes.
5.1.2. Concurrent execution of independent requests
The first mechanism to enable concurrent execution of requests
in the context of Byzantine fault tolerance is introduced by Kotla and
Dahlin (2004). The centerpiece of the mechanism is a predefined
concurrency matrix that can be used to determine whether or not
two requests are independent based on the operation and argument
(representing the state object to be accessed) specified in the request.
The message delivery is controlled by a parallelizer component that
is sandwiched between the Byzantine agreement layer (for message
total ordering) and the server application (referred to as executor by
Kotla and Dahlin (2004)), as shown in Fig. 2.
The parallelizer dynamically tracks the dependency of requests
that have been totally ordered according to the concurrency matrix.
The concurrency matrix is populated based on application-specific
rules. Furthermore, it is assumed that a particular thread model is
used where threads from a thread-pool ask the parallelizer for the
next requests to execute.
The method insert() provided by the parallelizer is called when a request has been totally ordered. The inserted request is then added to the "to-be-executed" queue. The next_request() method is called when a worker thread wants to retrieve the next request to execute. The remove_request() method is invoked when a worker thread completes the execution of a request. When remove_request() is called, the request is removed from the "being-executed" queue.
When the next_request() method is called, the parallelizer first checks if the "to-be-executed" queue is empty. If true, the parallelizer blocks the call. Otherwise, the parallelizer decides on which request to return according to the following rules (a sketch of this dispatch logic is given below):
If no message is in the "being-executed" queue, the first request in the "to-be-executed" queue is returned.
If there are one or more messages in the "being-executed" queue, the first request in the "to-be-executed" queue is examined based on the concurrency matrix to determine whether any request in the "being-executed" queue conflicts with it. The message is returned only if it does not conflict with any request in the "being-executed" queue.
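A minimal sketch of this dispatch logic follows; the conflicts() predicate stands in for a lookup in the application-specific concurrency matrix, and all names are illustrative rather than taken from Kotla and Dahlin (2004):

```python
import threading
from collections import deque

class Parallelizer:
    """Sketch of the parallelizer dispatch rules described above."""

    def __init__(self, conflicts):
        self.conflicts = conflicts      # stands in for the concurrency matrix
        self.to_be_executed = deque()   # totally ordered, not yet running
        self.being_executed = []        # currently running requests
        self.cv = threading.Condition()

    def insert(self, request):
        # Called when a request has been totally ordered.
        with self.cv:
            self.to_be_executed.append(request)
            self.cv.notify_all()

    def next_request(self):
        # Called by an idle worker thread; blocks until the head of the
        # to-be-executed queue conflicts with nothing that is running.
        with self.cv:
            while True:
                if self.to_be_executed:
                    head = self.to_be_executed[0]
                    if all(not self.conflicts(head, r) for r in self.being_executed):
                        self.to_be_executed.popleft()
                        self.being_executed.append(head)
                        return head
                self.cv.wait()

    def remove_request(self, request):
        # Called when a worker thread finishes executing a request.
        with self.cv:
            self.being_executed.remove(request)
            self.cv.notify_all()
```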
Requests partitioning can be considered as a special form of con-
current execution of independent requests. This is because requests
that belong to different partitions are naturally independent, hence,
they can be executed concurrently. Requests can be partitioned if they
operate on disjoint state objects. Typically, requests can be separated
based on some simple measure, such as an identifier contained in the
request, without complicated dependency analysis. Requests that be-
long to different partitions can be executed concurrently and may be
handled by different groups of server replicas for scalability. This ap-
proach has been employed in Farsite (Adya et al., 2002), as well
as in Byzantine fault tolerant Web services coordination (Chai et al.,
2013; Zhang et al., 2012).
In Farsite (Adya et al., 2002), requests partitioning is used to en-
hance the scalability of a Byzantine fault tolerant storage service.
The basis for the requests partitioning is the pathname contained in
the client’s requests. Based on the pathname, a request is forwarded
to the designated group of replicas for processing. To minimize the
load on the root directory group (hence, to facilitate scalability), each
client caches the mapping of the pathnames and their corresponding
directory groups. When sending a request, the client identifies the
target directory group by finding the longest-matching prefix in the
cached mappings.
In Web services coordination (Chai et al., 2013; Zhang et al., 2012),
it was recognized that requests for different atomic transactions or
business activities are processed by different coordinator objects.
Hence, requests arriving at the replicated coordination service can be
partitioned based on atomic transactions or business activities.
5.1.3. Transaction processing
The concurrency degree of requests execution can be further
enhanced by treating the operations for each request as part of
an atomic transaction (typically via software transactional memory
(Shavit and Touitou, 1995)) and employing some concurrency control
mechanism developed for transaction processing systems (Gray and
Reuter, 1992). This is because in this approach, non-conflicting oper-
ations may be executed in parallel instead of sequentially for inter-
dependent requests. In essence, the granularity of concurrency con-
trol is reduced from the request level (with each request consisting
of several operations on a number of state objects) to the state object
level.
To ensure strong replica consistency, a concurrency control mech-
anism must produce a serializable schedule for inter-dependent re-
quests according to some total order at all nonfaulty replicas. The
timestamp-based concurrency control may be adapted for this pur-
pose by using the sequence number as the timestamp. Because a re-
quest is totally ordered at all nonfaulty replicas, they see the same
sequence number for the same request, guaranteeing an identical serial-
izable schedule for all transactions (mapped from the requests) at all
nonfaulty replicas. This approach has been used to enable concurrent
Byzantine fault tolerance (Zhang and Zhao, 2012) based on a lock-free
software transactional memory library (Brito et al., 2009).
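The following sketch illustrates the basic idea under the stated assumption that the agreed sequence number is used as the transaction timestamp; the class and method names are hypothetical and the abort/retry policy is omitted:

```python
class Abort(Exception):
    """Raised when a request must be aborted and re-run under a later
    sequence number to keep the schedule serializable."""

class TimestampedObject:
    """Basic timestamp-ordering concurrency control (sketch): the agreed
    total-order sequence number of a request serves as its transaction
    timestamp, so all nonfaulty replicas make identical accept/abort
    decisions. The names are illustrative, not from the cited libraries."""

    def __init__(self, value=None):
        self.value = value
        self.read_ts = -1     # largest timestamp that has read this object
        self.write_ts = -1    # largest timestamp that has written this object

    def read(self, ts):
        if ts < self.write_ts:
            raise Abort(ts)   # a request ordered later already wrote the object
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.read_ts or ts < self.write_ts:
            raise Abort(ts)   # would invalidate an access by a later request
        self.write_ts = ts
        self.value = value

# Requests with sequence numbers 5 and 7 touching the same object:
x = TimestampedObject(0)
x.write(5, 10)
x.write(7, 20)    # allowed: 7 is ordered after 5
```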
Note that the transaction processing model described here and in
Section 5.3.4 deviates from the traditional state machine replication
technique. Nevertheless, both rely on distributed consensus or reli-
able totally ordered multicast to ensure replica consistency.
5.2. Requests with complex operations
In addition to the mechanisms designed for handling requests
with basic operations, current research has focused on the following
two aspects of complex operations: (1) commutative requests and (2)
replica nondeterminism.
5.2.1. Commutative requests
Even if two requests access the same object and at least one of
them engages in a write-oriented operation, they might not be con-
flicting with each other depending on the complex operations in-
volved. For example, if both requests are involved with the incre-
ment operation on the same state object, they are considered com-
mutative and hence, can be delivered and processed concurrently.
Apparently, requests that update different state objects are naturally
commutative as well. Hence, all independent requests as defined in
Section 5.1.2 are commutative, i.e., commutative requests form a su-
perset of independent requests.
If some requests are commutative while others are conflicting
with each other, the mechanism described in Section 5.1.2 can be
easily modified to accommodate commutative requests. More specif-
ically, the entries corresponding to commutative requests will be la-
beled as non-conflicting. No additional changes are required.
One way to systematically ensure that all requests to a service are
commutative is to implement the service using commutative (or con-
vergent) replicated data types (Shapiro et al., 2011). In this case, it
is possible to synchronize the state of the replicas only when it is
necessary or periodically (Chai and Zhao, 2014a). We refer to this
approach as multi-message optimistic fault tolerance, which is re-
viewed in Section 6.2.
Note that the difference between independent requests and com-
mutative requests lies in the technique used to make the determination.
If one performs a dependency check between two re-
quests only by examining whether or not they access the same state
object, then only independent requests can be identified. More so-
phisticated mechanisms, which inevitably involve appli-
cation semantics, are needed to identify commutative requests that
update the same state object(s). In addition, it is worth noting that
some non-commutative operations can be rendered commutative by
using shadow operations (Li et al., 2012), and by operational transfor-
mation (Ellis and Gibbs, 1989).
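As a small illustration, a commutativity-aware conflict check might treat pairs of increments on the same object as non-conflicting while still ordering an increment against a plain write; the operation names below are hypothetical:

```python
def conflicts(op1, op2):
    """Illustrative commutativity-aware conflict check: operations on
    different objects never conflict; on the same object, pairs of reads
    and pairs of increments commute, while all other pairs are ordered."""
    (kind1, obj1), (kind2, obj2) = op1, op2
    if obj1 != obj2:
        return False
    if kind1 == kind2 == "read":
        return False
    if kind1 == kind2 == "increment":
        return False      # increments commute: the final value is the same
    return True

print(conflicts(("increment", "counter"), ("increment", "counter")))  # False
print(conflicts(("increment", "counter"), ("write", "counter")))      # True
```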
5.2.2. Replica nondeterminism
Some simple location-specific nondeterminism can be masked by
using a wrapper function. The wrapper translates location-specific
values such as process id to a group-wide id, which would ensure
strong replica consistency. This practice has long been used to build
fault tolerance systems (Zhao et al., 2013) and is used by Castro et al.
(2003) as an extension to PBFT. The use of a wrapper does not intro-
duce any additional communication step.
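A minimal sketch of such a wrapper is shown below; it maps OS-assigned file descriptors to group-wide identifiers handed out in request-delivery order, and is purely illustrative rather than the interface used in the cited systems:

```python
import os

class IdWrapper:
    """Sketch of a wrapper masking simple location-specific nondeterminism:
    local identifiers (here, OS-assigned file descriptors) are translated to
    deterministic group-wide ids assigned in the same (request-delivery)
    order at every replica."""

    def __init__(self):
        self.next_group_id = 0
        self.local_to_group = {}     # local fd -> group-wide id
        self.group_to_local = {}     # group-wide id -> local fd

    def open(self, path, flags=os.O_RDONLY):
        fd = os.open(path, flags)    # local, replica-specific value
        gid = self.next_group_id     # deterministic, group-wide value
        self.next_group_id += 1
        self.local_to_group[fd] = gid
        self.group_to_local[gid] = fd
        return gid                   # only the group-wide id is exposed

    def read(self, gid, n):
        return os.read(self.group_to_local[gid], n)
```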
In Zhang et al. (2011), a classification of replica nondeterminism
beyond simple wrappable types is provided and a set of mechanisms
for controlling replica nondeterminism is presented. As shown in
Fig. 3, three criteria are used to classify different approaches: the
wrapability of nondeterministic operations, the determinability of
nondeterministic operations and their associated values before a re-
quest is executed, and the verifiability of nondeterministic values
proposed by a replica.
Wrapable nondeterminism: the nondeterministic values of this
type at different replicas can be made consistent using a pre-
defined wrapper function. The wrapper function uses the group
identity together with a deterministic local algorithm to compute
various local identifiers (such as process ids and file descriptors)
that would otherwise be different at different replicas (Castro
et al., 2003).
Pre-determinable nondeterminism: although the nondeterministic
variables of this type can be determined before a request is ex-
ecuted, the nature of the operations dictates that they cannot be
set according to some predefined deterministic algorithm, such as
the use of random numbers. Usually, such values should be deter-
mined collectively by replicas.
Fig. 3. A classification of nondeterminism types: wrapable nondeterminism; pre-determinable nondeterminism (verifiable or non-verifiable); and post-determinable nondeterminism (verifiable or non-verifiable).
Post-determinable nondeterminism: the nondeterministic values
of this type cannot be determined before a request is executed.
Hence, such values would have to be initially recorded during the
execution of a request at the primary, and the values will then be
propagated to backup replicas for their verification and adoption.
For example, it is virtually impossible to predefine the thread in-
terleaving for a multi-threaded application prior to the execution
of a request. The only practical way is to record such interleaving
at the primary and enforce the same interleaving at the backups.
The last two types can additionally be divided according to the
verifiability of the values proposed by a replica
Verifiable nondeterminism: the nondeterministic values proposed
by a replica can be easily verified.
Nonverifiable nondeterminism: the nondeterministic values pro-
posed by a replica cannot be easily verified. For instance, it is diffi-
cult to verify a random number proposed by a replica. In this case,
the final values to be adopted for each request would have to be
collectively determined by the majority of nonfaulty replicas. This
makes it impossible for a faulty replica to dictate a wrong value
for the group of replicas.
In summary, as shown in Fig. 3, there are five types of ap-
plication nondeterminism: wrapable nondeterminism, verifiable
pre-determinable nondeterminism, nonverifiable pre-determinable
nondeterminism, verifiable post-determinable nondeterminism, and
nonverifiable post-determinable nondeterminism.
Some of the nondeterminism can be controlled by adding one
more communication step to the Byzantine agreement step for the
request to guarantee that the same nondeterministic decisions are
made by nonfaulty replicas. Other nondeterministic decisions must
be controlled via an extra Byzantine agreement instance. For a detailed
description of the mechanisms, readers are referred to Zhang
et al. (2011).
5.3. Session-oriented interactions
A unique aspect of session-oriented interactions is the correlation
of requests that belong to the same session. In this section, we explain
the two potential issues caused by the correlation of requests as we
identified in Section 3.1.3 in detail, and describe the corresponding
mechanisms to overcome them. We also elaborate how to improve
the system performance by exploiting the correlation among requests
within a session.
Fig. 4. A scenario in which two requests might be causally related. In this example, m3 might causally depend on m1; hence, m3 must be ordered and delivered after m1 has been processed.
5.3.1. Preventing causality violation
The most obvious constraint due to requests correlation is that the
replicas can no longer impose a total order according to the order in
which the requests are received because doing so might violate the
causal ordering among the requests. Consider the example scenario
shown in Fig. 4. Suppose process A sends a request m1 to the server replicas and subsequently issues a request m2 to process B (i.e., A sends m2 prior to receiving the reply for m1). Subsequently, process B issues a request m3 to the server replicas. It is apparent that m1 must be ordered ahead of m3 because m3 might depend on m1.
The causality violation issue might arise only when clients com-
municate directly with each other without the knowledge of the
replicated server, as shown in Fig. 4. In multi-tiered session-oriented
applications, this may happen frequently. For example, in a Web ser-
vices atomic transaction or business activity, participants often com-
municate with each other as well as with the coordination service
(Chai et al., 2013; Zhang et al., 2012). In these applications, the causal-
ity of requests is often reinforced at the application level, e.g., a re-
quest is queued until the one that precedes it is received (Chai et al.,
2013; Zhang et al., 2012). Therefore, for applications that have provi-
sions for causality detection and enforcement, the causality violation
issue goes away if we allow concurrent delivery of requests.
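When causal dependencies are made explicit in the messages (an assumption of this sketch, which is not the mechanism of any particular cited system), the ordering layer can simply hold a request back until everything it depends on has been delivered:

```python
from collections import deque

class CausalDeliveryQueue:
    """Hold back a request until every request it causally depends on
    (identified by explicit dependency ids carried in the message) has
    been delivered. Illustrative only."""

    def __init__(self):
        self.delivered = set()
        self.pending = deque()   # (request_id, depends_on, payload)

    def receive(self, request_id, depends_on, payload):
        self.pending.append((request_id, set(depends_on), payload))
        return self._drain()

    def _drain(self):
        ready, progress = [], True
        while progress:
            progress = False
            for entry in list(self.pending):
                rid, deps, payload = entry
                if deps <= self.delivered:
                    self.pending.remove(entry)
                    self.delivered.add(rid)
                    ready.append(payload)
                    progress = True
        return ready

q = CausalDeliveryQueue()
print(q.receive("m3", depends_on=["m1"], payload="m3"))  # []: m1 not yet delivered
print(q.receive("m1", depends_on=[], payload="m1"))      # ['m1', 'm3']
```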
Fig. 5. A scenario that could lead to deadlock without concurrent processing of requests to the replicated servers: Client 1 invokes UpdateA() on Service A and Client 2 invokes UpdateB() on Service B; each service issues a nested invocation on the other (NestedB() from Service A, NestedA() from Service B), and under sequential delivery both nested invocations would be queued. The sequence diagram on the right additionally illustrates how to break the deadlock using steps represented in dotted lines.
5.3.2. Avoiding deadlocks
In multi-tiered applications, nested remote invocations are com-
mon. If two or more replicated servers issue nested invocations to
each other in response to requests sent from their clients, the nested
requests must be delivered and executed before the initial requests
are fully processed. Dictating a sequential order in the execution of
the requests from the clients and those for nested invocations could
lead to deadlocks. To see why, consider the scenario shown in Fig. 5
where two clients (Client 1 and Client 2) concurrently issue requests
to two correlated replicated services (Service A and Service B), re-
spectively. As the result of the remote invocation UpdateA() at Service A, a nested invocation NestedB() is issued by Service A on Service B. Similarly, a nested invocation NestedA() is issued by Service B on Service A as the result of the invocation UpdateB() from Client 2. According to the state machine replication requirement, all requests on a replica must be serialized, i.e., NestedA() must be delivered and executed after UpdateA() at Service A, and NestedB() must be delivered and executed after UpdateB() at Service B. Hence, NestedA() and NestedB() will be queued. However, doing so would prevent Service A from completing the invocation UpdateA() from Client 1 because it blocks waiting for the reply for its nested invocation NestedB() on Service B, and similarly, this would prevent Service B from completing the invocation UpdateB() from Client 2.
In principle, the deadlock problem can be resolved with concur-
rent delivery of requests that have causal relationship, as illustrated
on the right side of Fig. 5 using dotted lines. This inevitably violates
the basic requirement of state machine replication, and it is not triv-
ial to allow concurrent invocations without compromising the replica
consistency and without inter-replica communication. A mechanism
was introduced by Zhao et al. (2005a) to accomplish this task, which
requires the interception of all mutex acquisition and release opera-
tions and the use of a set of data structures to keep track of the con-
currency control policy, messages sent and received, and active and
blocked threads.
The main challenge is to identify exactly when the next request
could be delivered. The complexity of this problem is due to
A replica might receive more than one message to be handled by
the same thread.
A thread might hold on to a mutex until a sequence of nested in-
vocations have been completed.
In either case, it is not safe to deliver a new request to the replica
even if all threads are blocked because in the first case, some of the
threads might be waiting for a nested reply, and in the second case,
uncontrolled concurrent multithreading might occur, which could in-
duce replica inconsistency.
Fig. 6. Normal operation of the source-order based protocol: the client sends its request to the middle-tier replicas R0–R2; each replica executes the request, issues a nested request to the backend server, receives the nested reply, and sends the reply, which the client delivers.
To deal with the first case, a new request can be delivered only when all threads in the replica have become quiescent, i.e., all requests have been delivered and no thread is holding a mutex. The mechanism to handle the second case is much more complicated because a new request must be delivered to a replica when it is not quiesced. The basic idea is to make sure that only one thread is active at a time. For details, readers are referred to (Zhao et al., 2005a).
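A simplified sketch of the quiescence test for the first case is given below; it only tracks outstanding requests and held mutexes, and omits the remaining bookkeeping (blocked threads, nested replies) described by Zhao et al. (2005a):

```python
class QuiescenceTracker:
    """Sketch of the quiescence test for the first case above: a new request
    may be dispatched only when every previously delivered request has
    completed and no application thread still holds a mutex."""

    def __init__(self):
        self.outstanding_requests = 0   # delivered but not yet completed
        self.held_mutexes = set()       # mutex ids currently held by threads

    def on_deliver(self):
        self.outstanding_requests += 1

    def on_complete(self):
        self.outstanding_requests -= 1

    def on_acquire(self, mutex_id):
        self.held_mutexes.add(mutex_id)      # recorded by intercepting acquire

    def on_release(self, mutex_id):
        self.held_mutexes.discard(mutex_id)  # recorded by intercepting release

    def quiescent(self):
        return self.outstanding_requests == 0 and not self.held_mutexes
```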
5.3.3. Source ordering
For some multi-tiered applications with a replicated middle-tier
server and a backend server managing all persistent state, it is argued
by Chai and Zhao (2013) that maintaining source ordering of requests
is sufficient either due to the commutativity of the operations at the
replicated middle-tier server, or due to the constraint imposed by the
database server. Furthermore, only 2f + 1 middle-tier server replicas
are required when up to f of them could be Byzantine faulty. A client,
as well as the backend server, must collect f + 1 matching reply mes-
sages or nested request messages to deliver the reply or request, as
shown in Fig. 6.
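A sketch of this acceptance rule is shown below (illustrative names; it assumes 2f + 1 middle-tier replicas and is applied both by the client to replies and by the backend server to nested requests):

```python
from collections import Counter

def accept_when_f_plus_1_match(messages, f):
    """With 2f + 1 middle-tier replicas and source ordering, a message is
    acted upon once f + 1 replicas have sent identical copies, which
    guarantees that at least one copy came from a nonfaulty replica.
    `messages` maps replica id -> message body."""
    counts = Counter(messages.values())
    for body, votes in counts.most_common():
        if votes >= f + 1:
            return body
    return None   # keep waiting for more copies

# Example with f = 1: two matching nested requests out of three replicas suffice.
print(accept_when_f_plus_1_match({0: "debit(10)", 1: "debit(10)"}, f=1))
```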
5.3.4. Deferred agreement
A session by definition defines the beginning and the ending of
an interaction between a group of participants. For some requests re-
ceived at the replicated server, their total order can be determined
right before the end of the session, thereby reducing the runtime
overhead. Deferred agreement differs from the batching mechanism
used in general purpose fault tolerance systems (Moser et al., 1996;
Castro and Liskov, 2002) because batching is done to a group of re-
quests that arrive concurrently, while deferred agreement applies to
requests received over time during a session.
Many session-oriented applications adopt the transaction pro-
cessing model where a group of operations within the session (i.e., the
transaction) are executed atomically (i.e., either they are all success-
ful or as if none of them has been executed). In this case, some of
the operations can be delivered immediately during the session, and
a sanity check on potential conflicts can be performed at the end of
the session, which constitutes a deferred agreement on the group of
operations within the session. Deferred agreement also applies to dis-
tributed transactions where the two-phase commit (2PC) protocol is
used.
Fig. 7. Deferred agreement mechanism in the context of a distributed transaction: upon the initiator's request to commit, the coordinator replicas R0–R3 perform the registration update phase, the 2PC prepare phase, a Byzantine agreement on the registration records and the transaction outcome, and then the 2PC notification of the committed outcome.
In the context of deferred agreement, the total ordering of indi-
vidual requests is no longer important. What has to be agreed upon
by all nonfaulty replicas is the collection of requests that have been
deferred. For a distributed transaction, the votes collected in the first
phase of the two-phase commit and the corresponding transaction
outcome should also be agreed upon by nonfaulty replicas. The agree-
ment of all such information can be accomplished by using a single
instance of distributed agreement, as shown in Fig. 7.
For the deferred agreement mechanism to work, a transaction
participant must ensure that its registration request has reached
at least 2f + 1 out of 3f + 1 transaction coordinator replicas (Zhang
et al., 2012). When a replica receives a registration request, it adds
the participant to its registration record. When the initiator requests
the coordinator to commit a distributed transaction, an extra phase
of message exchange is carried out prior to the start of 2PC. In this
phase, which is referred to as the registration update phase, each co-
ordinator replica broadcasts its registration record to all other repli-
cas. This phase ends at a replica when it has collected 2f update mes-
sages from other replicas. The objective of this phase is to ensure that
every nonfaulty replica has the registration records of all nonfaulty
participants. At the end of this phase, each replica computes a super-
set of registration records.
At the conclusion of the first phase of 2PC, a coordinator replica
determines the outcome of the transaction (i.e.,commit or rollback).
The decision certificate is used as the evidence for every transaction
outcome proposed by a replica. The decision certificate is a critical
data structure that contains the registrations and the votes from par-
ticipants collected in the first phase of 2PC. According to 2PC, a trans-
action can be committed only when all participants have voted to
commit the transaction in the first phase of 2PC. Hence, the regis-
tration records matter because a faulty replica could try to abort a
transaction when it should not by excluding a participant from the
membership of the transaction.
In the ensuing consensus phase, replicas build an agreement on
both the decision certificate and the outcome of the transaction.
When an agreement is reached, a replica notifies all participants
about the transaction outcome, and then informs the initiator about
the outcome of the transaction.
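The sketch below illustrates, in simplified form, the registration-record merging and the commit rule that feed into the decision certificate; it is an illustration of the description above, not the actual protocol code of Zhang et al. (2012):

```python
def merge_registrations(local_record, records_from_others):
    """Registration update phase (sketch): after collecting registration
    records from 2f other replicas, each replica keeps the union, so every
    nonfaulty replica knows of all nonfaulty participants."""
    merged = set(local_record)
    for record in records_from_others:
        merged |= set(record)
    return merged

def propose_outcome(registered_participants, votes):
    """First phase of 2PC (sketch): a replica proposes commit only if every
    registered participant voted to commit; the registrations and votes
    form the decision certificate on which the replicas then agree."""
    certificate = {"registrations": set(registered_participants),
                   "votes": dict(votes)}
    all_committed = all(votes.get(p) == "commit" for p in registered_participants)
    outcome = "commit" if all_committed else "rollback"
    return outcome, certificate

participants = merge_registrations({"A"}, [{"A", "B"}, {"B"}])
print(propose_outcome(participants, {"A": "commit", "B": "commit"}))
```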
6. Optimistic fault tolerance mechanisms
There are two levels of approaches to optimistic fault tolerance.
In a relatively conservative approach, it is assumed that replicas
are rarely faulty (especially for the primary) and there usually ex-
ists a single leader. Hence, a replica tentatively accepts the ordering
for a request before a full distributed consensus has been reached
(for server-side speculation), and a client tentatively accepts a re-
ply as soon as it receives the first reply from any of the replicas (for
client-side speculation in a Byzantine fault tolerance system). In this
approach, the speculation is done for one request at a time, hence,
we refer to this approach as single-message optimistic fault toler-
ance. In a more aggressive approach, it is assumed that a very small
fraction of requests are conflicting, and a replica delivers and exe-
cutes a request as soon as it receives one. Naturally, a replica could
have executed a number of requests before the system engages in a
consistency check among replicas. Hence, we refer to this approach
as multi-message optimistic fault tolerance. The former approach is
often used in Byzantine fault tolerance systems because Byzantine
agreement typically takes a number of communication steps. The lat-
ter approach has been used in fault tolerance systems with the crash
model (Li et al., 2012; Moraru et al., 2013; Saito and Shapiro, 2005) as
well as those designed to tolerate Byzantine faults (Chai and Zhao,
2014a, b, c; Zhao and Babi, 2013). In both approaches, a recovery
mechanism is necessary to resolve replica inconsistency as the result
of wrong speculation.
Note that single-message optimistic fault tolerance does not de-
pend on the application semantics and hence, it can be used in any
system. We include this line of research here for completeness.
By utilizing application semantics, more aggressive optimiza-
tion can be performed, which is reviewed in Section 6.2.
6.1. Single-message optimistic fault tolerance
The fundamental assumption for single-message optimistic fault
tolerance is that there often exists a single leader among the repli-
cas and replicas are rarely faulty. Based on this assumption, a replica
could deliver and tentatively execute a request before the total order
of the message has been established in server-side speculation, and a
client could tentatively accept the first reply it receives in client-side
speculation.
6.1.1. Server-side speculation
The first single-message level optimistic approach for Byzantine
fault tolerance was proposed by Castro and Liskov (2002). In this approach, a replica tentatively executes a request immediately after the
request has been prepared, i.e., when 2f + 1 replicas have accepted
(or sent) a pre-prepare message for the request, and the replica has collected
2f matching prepare messages (including its own, in the case of a
backup replica). Tentative execution has a low risk of executing the re-
quest in a wrong order because unless the primary becomes faulty,
the order for the request at the prepared stage (i.e., the prepared order) would be identical to the total order. The request might be executed in a wrong order only if the primary fails and the new primary
is unaware of the prepared order for the request. The benefit of ten-
tative execution is the elimination of one communication step in the
critical path of the end-to-end latency.
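The following sketch illustrates the tentative-execution condition in simplified form; message authentication, digests, and view/sequence-number bookkeeping are omitted, so it should be read as an illustration of the prepared predicate rather than as PBFT code.

    # Simplified illustration of tentative execution: a replica executes a
    # request once the request is "prepared" locally, i.e., it has the
    # pre-prepare and 2f matching prepare messages for the same view and
    # sequence number.  Authentication and digests are omitted.
    def prepared(has_pre_prepare: bool, num_matching_prepares: int, f: int) -> bool:
        return has_pre_prepare and num_matching_prepares >= 2 * f

    def maybe_tentatively_execute(replica_state, request, has_pre_prepare,
                                  num_matching_prepares, f, execute):
        if prepared(has_pre_prepare, num_matching_prepares, f):
            # Execute before the commit phase completes; the result may be
            # rolled back if a view change assigns a different order.
            reply = execute(replica_state, request)
            return ("tentative", reply)
        return ("not-ready", None)

    # Example usage with a trivial state machine that appends requests.
    state = []
    status, _ = maybe_tentatively_execute(
        state, "op1", has_pre_prepare=True, num_matching_prepares=2,
        f=1, execute=lambda s, r: s.append(r))
    assert status == "tentative" and state == ["op1"]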
In speculative execution introduced in Zyzzyva (Kotla et al., 2009),
a replica executes a request as soon as it has generated (for the pri-
mary) or received (for a backup) the pre-prepare message for the re-
quest, as shown in Fig. 8.

Fig. 8. Speculative execution in Zyzzyva.

This more aggressive approach could cut two
communication steps from the end-to-end latency in the most optimal case. However, as a tradeoff, the client must wait until it has collected 3f + 1 matching replies (i.e., from all replicas) before it accepts and delivers the reply. If the client fails to collect matching replies from all replicas, but manages to collect at least 2f + 1 matching replies, an additional round of communication between the client and the replicas becomes necessary (as shown in Fig. 8). In the latter case, Zyzzyva has no advantage over PBFT as far as the end-to-end latency is concerned. In addition, the view change protocol must also be modified in Zyzzyva due to the more aggressive speculation.
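The sketch below captures this client-side acceptance rule, assuming replies can be compared for equality once authenticated; the extra commit round is only indicated, not implemented.

    # Client-side acceptance rule in Zyzzyva-style speculation (schematic).
    # n = 3f + 1 replicas.  The client delivers a reply immediately only if
    # all 3f + 1 speculative replies match; with between 2f + 1 and 3f
    # matching replies it must run an extra round with the replicas.
    from collections import Counter

    def client_decision(replies: list, f: int):
        n = 3 * f + 1
        if not replies:
            return "wait"
        value, count = Counter(replies).most_common(1)[0]
        if count == n:
            return ("deliver", value)          # fast path: fewest communication steps
        if count >= 2 * f + 1:
            return ("commit-round", value)     # extra round needed, PBFT-like latency
        return "wait"                          # keep collecting (or retransmit on timeout)

    assert client_decision(["r"] * 4, f=1) == ("deliver", "r")
    assert client_decision(["r", "r", "r", "x"], f=1) == ("commit-round", "r")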
In both tentative execution and speculative execution, a recovery
mechanism is needed to resolve the issue of wrong speculation. In
PBFT (Castro and Liskov, 2002), the prepared record for a request that has been tentatively executed must be indicated explicitly when it is disseminated in a view change. If the order as determined by
the new primary is different from the order executed tentatively in
the previous view, the replica must be rolled back to the latest sta-
ble checkpoint, and all logged requests ordered after the checkpoint
must be re-executed, including those that have not been executed
tentatively. In Zyzzyva (Kotla et al., 2009), no recovery mechanism
was provided explicitly. We believe the recovery mechanism defined
for tentative execution can also be used to recover from a wrong spec-
ulation. In summary, the penalty for a wrong tentative or speculative
execution can be quite high, depending on the frequency of check-
pointing.
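A minimal sketch of this recovery step is shown below, assuming the replica keeps a stable checkpoint and a log of the requests ordered after it; the main point is that the whole suffix is replayed in the order chosen after the view change.

    # Recovery from wrong speculation (schematic): restore the latest
    # stable checkpoint and re-execute all logged requests in the order
    # determined by the new primary.
    import copy

    def recover(checkpoint_state, new_order, execute):
        state = copy.deepcopy(checkpoint_state)
        for request in new_order:          # total order chosen after the view change
            execute(state, request)
        return state

    state = recover(checkpoint_state=[], new_order=["a", "b", "c"],
                    execute=lambda s, r: s.append(r))
    assert state == ["a", "b", "c"]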
6.1.2. Client-side speculation
In addition to server-side speculation, a client could also use spec-
ulation to reduce the end-to-end latency, as presented by Wester
et al. (2009). In client-side speculation, a client accepts and deliv-
ers the first reply received instead of doing so only when it has col-
lected f + 1 or more matching replies for Byzantine fault tolerance.
This scheme is particularly useful in Byzantine fault tolerance sys-
tems operating in wide-area networks. To further reduce the end-to-
end latency, the primary replica immediately executes a request and
generates an early reply to the client as soon as it has issued a pre-
prepare message, i.e., client-side speculation is typically used in conjunc-
tion with server-side speculation.
Obviously, if the client speculation is wrong, e.g., if the first re-
ply that the client received is sent by a faulty replica, the client
must be rolled back to a previously recorded correct state. This requires periodic checkpointing at the client. A client could detect a wrong speculation by continuing to collect more replies. If it collects f + 1 matching replies that are different from the one it has delivered speculatively, the client learns that its speculation was wrong, rolls back its state using the last checkpoint, and replays the logged committed replies received since that checkpoint.
Another issue is that the client might issue new requests after it
has delivered a reply speculatively. Because these requests are gener-
ated based on the speculative state at the client, they might corrupt
the state of server replicas, if the reply delivered speculatively turns
out to be wrong. To address this concern, the client keeps track of
all speculative replies and piggybacks such replies with every specu-
lative request. Upon receiving a speculative request, a server replica
compares its logged replies with the piggybacked speculative replies.
If the two do not match, the replica drops the speculative request be-
cause it knows that the client would have to be rolled back.
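The sketch below illustrates, with hypothetical message fields, how a server replica might validate the speculative replies piggybacked on an incoming request against its own reply log.

    # Schematic server-side check for client speculation (in the style of
    # Wester et al., 2009): a request carries the replies the client has
    # delivered speculatively; if any of them disagrees with the replica's
    # own log, the request is dropped because the client will have to be
    # rolled back.
    def accept_speculative_request(piggybacked_replies: dict, reply_log: dict) -> bool:
        for request_id, speculative_reply in piggybacked_replies.items():
            logged = reply_log.get(request_id)
            if logged is not None and logged != speculative_reply:
                return False   # the client speculated on a wrong reply
        return True

    log = {"req-1": "ok"}
    assert accept_speculative_request({"req-1": "ok"}, log) is True
    assert accept_speculative_request({"req-1": "bad"}, log) is False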
6.2. Multi-message optimistic fault tolerance
In multi-message optimistic fault tolerance, a replica executes a
request as soon as it is received. The replica consistency is ensured
by synchronizing the state of the replicas periodically, or on-demand
whenever a conflict is detected. It is easy to understand that the checkpointing and redo mechanisms used for recovery in single-message optimistic fault tolerance will not, by themselves, work for multi-message optimistic fault tolerance, because the checkpoints produced at the synchronization point are guaranteed to be inconsistent if replicas have executed some requests in different orders.
6.2.1. A general purpose replication algorithm for commutative requests
Raykov et al. proposed a general purpose algorithm (Raykov
et al., 2011) that can support a mixture of commutative and non-
commutative requests. The algorithm ensures that commutative re-
quests are not totally ordered and it may reduce the end-to-end la-
tency to as few as two communication steps for such requests. This
algorithm requires the use of 5f + 1 replicas when f of them could be
Byzantine faulty.
Fig. 9 illustrates the operation of the algorithm. A client broad-
casts its request to server replicas and waits until either one of the
following conditions is satisfied:
• It has received 4f + 1 matching speculative replies.
• It has received f + 1 matching regular replies.
A server replica speculatively executes a request immediately if
no other request that has been received but not yet delivered is con-
flicting with the current request, and a speculative reply is sent to the
client. If a conflicting request is found, a recovery consensus phase is
started to ensure the total ordering of conflicting requests. During the
recovery consensus phase, each replica i launches a Byzantine agree-
ment instance on the following two sets of requests it has received in
the current round and on the total ordering of the sets:
• NCSet_i: a set of requests received by replica i that are not conflicting with each other (i.e., they are commutative).
• CSet_i: a set of conflicting requests received by replica i. For any message m in CSet_i, there is at least one message m′ in NCSet_i that is conflicting with m.
Fig. 9. Two forms of operations in the replication algorithm for commutative requests.

A replica i waits until it has received 4f + 1 sets of Byzantine-agreed NCSet_j and CSet_j from different replicas to build the final set of commutative requests NCSet and the final set of conflicting requests CSet, according to the following rules:
• If a message is present in the majority of the NCSet_j collected, the message is included in NCSet.
• All other messages in any of the NCSet_j collected, as well as the messages in any CSet_j that are not included in the final NCSet, are included in the final CSet.
The recovery consensus phase is needed to identify, and recover
from, wrong speculations. In this phase, a request that is speculatively
executed in the wrong order will be rolled back and re-executed in
the correct total order. All conflicting requests are executed normally
according to the total order established. A normal reply is sent to the
client for each request that has been totally ordered.
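The two merge rules can be expressed compactly as follows; the sketch assumes the 4f + 1 Byzantine-agreed pairs have already been collected and uses plain Python sets in place of the actual request records.

    # Building the final NCSet and CSet from 4f + 1 Byzantine-agreed pairs
    # (NCSet_j, CSet_j), following the two rules stated above.
    from collections import Counter

    def merge_sets(nc_sets: list, c_sets: list):
        collected = len(nc_sets)                       # expected to be 4f + 1
        counts = Counter(m for s in nc_sets for m in s)
        final_nc = {m for m, c in counts.items() if c > collected // 2}
        # Everything else seen in any NCSet_j or CSet_j goes to the final CSet.
        others = set().union(*nc_sets, *c_sets)
        final_c = others - final_nc
        return final_nc, final_c

    nc, c = merge_sets(nc_sets=[{"m1"}, {"m1"}, {"m1"}, {"m1", "m2"}, {"m1"}],
                       c_sets=[{"m3"}, set(), set(), set(), {"m3"}])
    assert nc == {"m1"} and c == {"m2", "m3"}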
6.2.2. Application-specific optimistic fault tolerance
As shown in Fig. 10, three types of applications have been iden-
tified to fit optimistic fault tolerance well (Zhao, 2015) because their
operations are either commutative, or can be rendered commutative
via operational transformation (Ellis and Gibbs, 1989).
The first type is realtime collaborative editing applications (Sun
et al., 1998). As we mentioned previously, this type of applications
must allow different users to update a shared document concur-
rently, which makes it impossible to use conservative fault toler-
ance or single-message optimistic fault tolerance. For this type of
applications under the crash-fault model, replica consistency can be
achieved by using a vector timestamp to impose a total order on all
requests, and by applying operational transformation (Ellis and Gibbs, 1989) to out-of-order requests. In the presence of Byzantine faults, this mechanism is insufficient to ensure replica consistency because the vector timestamp could be tampered with by a faulty replica. To
cope with this challenge, during a round of state synchronization,
replicas compare the list of requests that have been executed together
with their vector timestamps. If a discrepancy is detected, the prob-
lematic requests are identified and undone, which is followed by a
redo step with the correct order (Zhao and Babi, 2013).
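The sketch below illustrates the undo/redo idea at the granularity of one round of state synchronization, assuming an agreed-upon reference log is available after the Byzantine agreement step; the operational transformation applied to redone requests is omitted.

    # Schematic recovery during state synchronization for Byzantine fault
    # tolerant collaborative editing (in the style of Zhao and Babi, 2013):
    # compare the locally executed log against the agreed reference log,
    # undo the divergent suffix, and redo it in the agreed order.
    def synchronize(local_log: list, agreed_log: list, undo, redo):
        prefix = 0                                   # longest common prefix
        while (prefix < len(local_log) and prefix < len(agreed_log)
               and local_log[prefix] == agreed_log[prefix]):
            prefix += 1
        for entry in reversed(local_log[prefix:]):   # undo divergent operations
            undo(entry)
        for entry in agreed_log[prefix:]:            # redo in the agreed order
            redo(entry)
        return agreed_log[:]                         # local log now matches

    doc = ["x", "y"]                                 # effect of the locally executed log
    synchronize(local_log=["x", "y"], agreed_log=["x", "z", "y"],
                undo=lambda e: doc.pop(), redo=lambda e: doc.append(e))
    assert doc == ["x", "z", "y"]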
The second type of applications includes those that are con-
structed with conflict-free (or commutative) replicated data types
(CRDTs) (Shapiro et al., 2011). Even though all requests are commu-
tative in this type of applications, and hence, their execution order is
not important, state synchronization is still needed because different
replicas might execute different sets of requests, which could lead to
the divergence of replica state. The purpose of the state synchroniza-
tion is to ensure that each replica executes the same set of requests, which achieves eventual replica consistency.
Fig. 10. Three types of applications that fit optimistic fault tolerance well.
To facilitate faster recovery, undo operations could be used to
selectively cancel out the effect of the execution of some requests
(Chai and Zhao, 2014a). An undo operation can be either provided by
the application, or be constructed based on the application seman-
tics. With the facility of undo operations, a replica can be brought to
a desirable state from any inconsistent state by doing selective undos followed by redos, without the need of rolling back to a previous state.

Fig. 11. State synchronization using 2f + 1 rounds of Byzantine agreement.
The third type of applications consists of complex event stream
processing applications (Etzion and Niblett, 2010). This type of ap-
plications typically generates alerts based on a group of events and
events within the group are typically commutative. Hence, multi-
message optimistic Byzantine fault tolerance is a good fit. Doing
so also helps satisfy the soft-realtime requirement of such appli-
cations. Totally ordering all events clearly would not work because the runtime overhead for doing so might be excessive for these applications.
For all three types of applications, in addition to periodic state
synchronization, a round of on-demand state synchronization could
be triggered on the detection of inconsistency among replicas by a
client (Chai and Zhao, 2014b). The on-demand state synchronization
works closely with the output-commit mechanism. In a distributed system, when one component generates an output for another component, it must commit to the output, which is referred to as output commit (Strom and Yemini, 1985). To satisfy this requirement, a client does not accept the output from a replicated component until it has received 2f + 1 valid and matching outputs from the replicas. If the first 2f + 1 messages received do not completely match, the client sets a timer and attempts to collect more output messages towards 2f + 1 matching messages before the timer expires. When this attempt fails, the client sends a synchronization request to all replicas, which triggers a round of on-demand state synchronization.
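A sketch of this client-side output-commit rule is shown below; the reply-collection loop, timer handling, and message validation are deliberately simplified.

    # Client-side output commit (schematic): deliver an output only after
    # 2f + 1 valid, matching replies; if a quorum cannot be formed before
    # the timer expires, request an on-demand state synchronization.
    import time
    from collections import Counter

    def collect_output(receive_reply, request_sync, f, timeout=1.0):
        deadline = time.time() + timeout
        replies = []
        while time.time() < deadline:
            r = receive_reply()                 # returns a reply or None
            if r is not None:
                replies.append(r)
                value, count = Counter(replies).most_common(1)[0]
                if count >= 2 * f + 1:
                    return ("deliver", value)
        request_sync()                          # trigger on-demand synchronization
        return ("sync-requested", None)

    # Example: three matching replies with f = 1 lead to delivery.
    pending = iter(["v", "v", "v"])
    outcome = collect_output(lambda: next(pending, None),
                             lambda: None, f=1, timeout=0.1)
    assert outcome == ("deliver", "v")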
For both periodic state synchronization and on-demand state syn-
chronization, an instance of Byzantine agreement on the state is in-
volved. There are two approaches to accomplishing this task. In one
approach (Zhang et al., 2012), this is done with an update phase
followed by an instance of Byzantine agreement on the superset of
records to be agreed upon by all nonfaulty replicas. Similar to what
has been shown in Fig. 7, during the update phase, all replicas broad-
cast their records to other replicas. Then, the primary replica builds a
superset of the records collected during the update phase for Byzan-
tine agreement. Although this approach is efficient in that only a single instance of Byzantine agreement is used in each state synchronization, it does depend on a particular replica (i.e., the primary) to propose a superset. If the primary is faulty, then a view change may be inevitable, which could delay the completion of the round of state synchronization.
An alternative approach is to use an external Byzantine agreement cluster and 2f + 1 instances of Byzantine agreement to synchronize the state without the update phase (Chai and Zhao, 2014a, 2014b). In this approach, as shown in Fig. 11, the superset (that represents the state to be synchronized) is built deterministically based on the identical set of totally ordered 2f + 1 state messages at each replica. Hence, a faulty replica cannot disrupt the superset formation as it could in the former approach. As a tradeoff, 2f additional instances of Byzantine agreement are needed.
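The sketch below shows the deterministic superset construction assumed by this approach: every replica applies the same rule to the identical, totally ordered set of 2f + 1 state messages, so all nonfaulty replicas arrive at the same synchronization point. The record format is hypothetical.

    # Deterministic superset formation (schematic): every replica receives
    # the same totally ordered list of 2f + 1 state messages from the
    # external Byzantine agreement cluster and builds the superset with the
    # same deterministic rule, so no single replica can bias the result.
    def build_superset(ordered_state_messages: list) -> list:
        superset = set()
        for msg in ordered_state_messages:      # each msg carries a set of records
            superset |= set(msg["records"])
        return sorted(superset)                 # deterministic representation

    msgs = [{"records": {"r1", "r2"}}, {"records": {"r2", "r3"}}, {"records": {"r1"}}]
    assert build_superset(msgs) == ["r1", "r2", "r3"]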
7. Future research directions
While application semantics is essential for the design of practical
fault tolerance systems, it may be costly to discover relevant appli-
cation semantics. It may also be difficult to maintain the implemen-
tation of fault tolerance mechanisms when the application logic is
changed.
These issues can be addressed on several fronts: (1) use structured data; (2) use standard, document-based remote communication interfaces; and (3) separate application logic execution from data management.
7.1. Structured data
Structured data has long been used in transaction processing. In
transaction processing, data types and operations on the data types
in an application are predefined with one or more schemas (Gray and
Reuter, 1992). As such, the state of an application and the operations
on the state are readily available without the need of analyzing the
source code. In addition to traditional transaction processing applica-
tions, general-purpose applications can be developed using software
transactional memory (STM) (Shavit and Touitou, 1995), which offers
many of the benefits of traditional transaction processing. A particular benefit of using transaction processing or STM is the elimination of manual thread synchronization, which is a major obstacle to developing highly concurrent fault tolerance systems. Structured data enables the adoption of well-defined concurrency control algorithms, which is also conducive to developing highly efficient fault tolerance solutions.
An exciting new research direction related to structured data is
the use of CRDTs for developing fault tolerant applications (Shapiro
et al., 2011). The use of CRDTs eliminates the need for totally ordering
requests, hence, it could drastically reduce the number of distributed
agreement instances needed in an application. A number of CRDTs are introduced by Shapiro et al. (2011), including various types of counters,
registers, sets, maps, graphs, and sequences. It is shown that these
CRDTs can be used to build practical applications. In Chai and Zhao
(2014a), the Two-Phase Set (2P-Set) was used to implement a shop-
ping cart service. A 2P-Set consists of two grow-only sets, one for
adding objects to the shopping cart, the other for removing objects
from the shopping cart. The latter set is often referred to as the tomb-
stone set. The 2P-Set ensures that the “add” and “remove” operations
on the same shopping cart are commutative (the “add” and “remove”
operations on different shopping carts are apparently commutative).
In both cases, the use of structured data also makes it easier to de-
velop undo operations, which are essential to enable fast state syn-
chronization in optimistic fault tolerance.
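As an illustration of the shopping cart example, the sketch below gives a minimal 2P-Set; the method names are ours, and a production CRDT implementation (Shapiro et al., 2011) would also handle serialization and garbage collection of the tombstone set.

    # Minimal 2P-Set sketch for a shopping cart: one grow-only set for added
    # items and one grow-only "tombstone" set for removed items.  Adds and
    # removes on the same cart commute; an item removed once can never be
    # re-added, which is the standard 2P-Set restriction.
    class TwoPhaseSet:
        def __init__(self):
            self.added = set()
            self.removed = set()        # tombstone set

        def add(self, item):
            self.added.add(item)

        def remove(self, item):
            self.removed.add(item)

        def contents(self):
            return self.added - self.removed

        def merge(self, other):
            # State-based merge: union both grow-only sets.
            self.added |= other.added
            self.removed |= other.removed

    cart_a, cart_b = TwoPhaseSet(), TwoPhaseSet()
    cart_a.add("book"); cart_b.add("pen"); cart_b.remove("book")
    cart_a.merge(cart_b); cart_b.merge(cart_a)
    assert cart_a.contents() == cart_b.contents() == {"pen"}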
7.2. Standard document-based interfaces
Document-based remote communication interfaces facilitate the
development of loosely coupled distributed systems, which by itself is conducive to fault tolerance. If the interface conforms to standards,
such as various Web services standards (Little and Wilkinson, 2007;
Freund and Little, 2007; Feingold and Jeyaraman, 2007), the interac-
tions between different components of an application are predefined
in an interface file. Hence, the application semantics can be discov-
ered without cumbersome source code analysis or expert knowledge.
7.3. Separation of application logic from data management
The separation of application logic execution from data man-
agement, which is popular in many Web-based applications (often
reflected via the three-tier architecture), enables the use of state-
less design in application servers running at the middle tier, with
persistent data stored by dedicated database management systems
(Zhao et al., 2005b; Zhao, 2014a). As pointed out by Chai and Zhao
(2013), such systems can be rendered Byzantine fault tolerant with-
out requiring deep knowledge of application semantics.
8. Conclusion
In this article, we presented an overview of various performance
engineering techniques for practical fault tolerance systems based on
state machine replication. Most such techniques depend on the inte-
gration of application semantics into fault tolerance design. We first
elaborated that application semantics is not only needed for achieving better performance, but also for ensuring the correctness of the systems. We then provided a classification of these approaches,
which is followed by the description of the mechanisms used in these
approaches. Finally, we outlined potential future research directions
related to performance engineering of fault tolerance systems. When
designing a distributed system for fault tolerance, we advocate the
use of structured data and standard, document-based remote com-
munication interfaces, and the separation of application logic from
data management. For such systems, it is easier to discover applica-
tion semantics and maintain fault tolerance solutions.
Acknowledgments
I sincerely thank the reviewers for their invaluable suggestions on
how to improve an earlier version of this article. This work was sup-
ported in part by NSF grant CNS-0821319 and by a graduate faculty
travel award from Cleveland State University.
References
Adya, A., Bolosky, W.J., Castro, M., Cermak, G., Chaiken, R., Douceur, J.R., Howell, J.,
Lorch, J.R., Theimer, M., Wattenhofer, R.P., 2002. Farsite: federated, available, and
reliable storage for an incompletely trusted environment. In: Proceedings of the
5th Symposium on Operating Systems Design and Implementation, pp. 1–14.
Birman, K.P., 1985. Replication and fault-tolerance in the isis system. Proceedings of
the 10th ACM Symposium on Operating Systems Principles. ACM, pp. 79–86.
Brewer, E.A., 2012. Pushing the cap: strategies for consistency and availability. IEEE
Computer 45 (2), 23–29.
Brito, A., Fetzer, C., Felber, P., 2009. Multithreading-enabled active replication for event
stream processing operators. In: Proceedings of the 28th IEEE International Sym-
posium on Reliable Distributed Systems. IEEE, pp. 22–31.
Burrows, M., 2006. The chubby lock service for loosely-coupled distributed systems.
Proceedings of the 7th symposium on Operating systems design and implementa-
tion. USENIX Association, Berkeley, CA, USA, pp. 335–350.
Castro, M., Liskov, B., 2002. Practical byzantine fault tolerance and proactive recovery.
ACM Trans Comput Syst 20 (4), 398–461.
Castro, M., Rodrigues, R., Liskov, B., 2003. Base: using abstraction to improve fault tol-
erance. ACM Trans. Comput. Syst. 21 (3), 236–269.
Chai, H., Zhang, H., Zhao, W., Melliar-Smith, P.M., Moser, L.E., 2013. Toward trustworthy coordination for web service business activities. IEEE Trans. Services Comput. 6 (2), 276–288.
Chai, H., Zhao, W., 2012a. Byzantine fault tolerance as a service. In: Kim, T.h, Mo-
hammed, S., Ramos, C., Abawajy, J., Kang, B.H, Slezak, D. (Eds.), Computer Appli-
cations for Web, Human Computer Interaction, Signal and Image Processing, and
Pattern Recognition, Vol. 342 of Communications in Computer and Information Science. Springer, Berlin Heidelberg, pp. 173–179.
Chai, H., Zhao, W., 2012b. Interaction patterns for byzantine fault tolerance computing.
In: Kim, T.h, Mohammed, S., Ramos, C., Abawajy, J., Kang, B.H, Slezak, D. (Eds.),
Computer Applications for Web, Human Computer Interaction, Signal and Image Processing, and Pattern Recognition, Vol. 342 of Communications in Computer and Information Science. Springer, Berlin Heidelberg, pp. 180–188.
Chai, H., Zhao, W., 2013. Byzantine fault tolerance for session-oriented multi-tiered
applications. Int. J. of Web Science 2 (1/2), 113–125.
Chai, H., Zhao, W., 2014a. Byzantine fault tolerance for services with commutative op-
erations. Proceedings of the IEEE International Conference on Services Computing.
IEEE, Anchorage, Alaska, USA, pp. 219–226.
Chai, H., Zhao, W., 2014b. Byzantine fault tolerant event stream processing for auto-
nomic computing. Proceedings of the 12th IEEE International Conference on De-
pendable, Autonomic and Secure Computing. IEEE, pp. 109–114.
Chai, H., Zhao, W., 2014c. Towards trustworthy complex event processing. Proceedings
of the 5th IEEE International Conference on Software Engineering and Service Sci-
ence. IEEE, pp. 758–761.
Chandra, T.D., Griesemer, R., Redstone, J., 2007. Paxos made live: an engineering per-
spective. Proceedings of the 26th annual ACM symposium on Principles of dis-
tributed computing. PODC ’07. ACM, New York, NY, USA, pp. 398–407.
Clement, A., Kapritsos, M., Lee, S., Wang, Y., Alvisi, L., Dahlin, M., Riche, T., 2009. Up-
right cluster services. Proceedings of the ACM symposium on Operating systems
principles. ACM, New York, NY, USA, pp. 277–290.
Cowling, J., Myers, D., Liskov, B., Rodrigues, R., Shrira, L., 2006. Hq replication: a hybrid
quorum protocol for byzantine fault tolerance. Proceedings of the 7th symposium
on Operating systems design and implementation. USENIX Association, pp. 177–
190.
Ellis, C.A., Gibbs, S.J., 1989. Concurrency control in groupware systems. Proceedings
of the ACM SIGMOD international conference on Management of data. ACM, New
York, NY, USA, pp. 399–407.
Etzion, O., Niblett, P., 2010. Event Processing in Action. Manning Publications.
Feingold, M., Jeyaraman, R., 2007. Web services coordination, version 1.1. OASIS stan-
dard.
Felber, P., 2001. Lightweight fault tolerance in corba. Proceedings of the 3rd Interna-
tional Symposium on Distributed Objects and Applications. IEEE, pp. 239–247.
Freund, T., Little, M., 2007. Web Services Business Activity Version 1.1. OASIS standard.
Gray, C., Cheriton, D., 1989. Leases: an efficient fault-tolerant mechanism for dis-
tributed file cache consistency. Proceedings of the 12th ACM symposium on Op-
erating systems principles. ACM, New York, NY, USA, pp. 202–210.
Gray, J., Reuter, A., 1992. Transaction Processing: Concepts and Techniques, 1st edition
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Kotla, R., Alvisi, L., Dahlin, M., Clement, A., Wong, E., 2009. Zyzzyva: Speculative byzan-
tine fault tolerance. ACM Trans. Comput. Syst. (TOCS) 27 (4), 7.
Kotla, R., Dahlin, M., 2004. High throughput byzantine fault tolerance. In: Proceedings
of International Conference on Dependable Systems and Networks, pp. 575–584.
Lamport, L., 1998. The part-time parliament. ACM Trans. Comput. Syst. 16 (2), 133–169.
Lamport, L., 2001. Paxos made simple. ACM SIGACT News (Distrib. Comput. Column) 32
(4), 18–25.
Li, C., Porto, D., Clement, A., Gehrke, J., Preguiça, N.M., Rodrigues, R., 2012. Making geo-
replicated systems fast as possible, consistent when necessary. Proceedings of the
10th USENIX Symposium on Operating Systems Design and Implementation. Hol-
lywood, CA, pp. 265–278.
Little, M., Wilkinson, A., 2007. Web services atomic transactions version 1.1. OASIS stan-
dard.
MacCormick, J., Murphy, N., Najork, M., Thekkath, C.A., Zhou, L., 2004. Boxwood: ab-
stractions as the foundation for storage infrastructure. In: Proceedings of the
USENIX Symposium on Operating Systems Design and Implementation, pp. 105–
120.
Malkhi, D., Reiter, M., 1998. Byzantine quorum systems. Distrib. Comput. 11 (4), 203–
213.
Moraru, I., Andersen, D.G., Kaminsky, M., 2013. There is more consensus in egalitar-
ian parliaments. Proceedings of the 24th ACM Symposium on Operating Systems
Principles. ACM, pp. 358–372.
Moser, L.E., Melliar-Smith, P.M., Agarwal, D.A., Budhia, R.K., Lingley-Papadopoulos, C.A.,
1996. Totem: a fault-tolerant multicast group communication system. Commun.
ACM 39 (4), 54–63. http://doi.acm.org/10.1145/227210.227226.
Ramakrishnan, R., 2012. Cap and cloud data management. IEEE Computer 45 (2), 43–
49.
Raykov, P., Schiper, N., Pedone, F., 2011. Byzantine fault-tolerance with commutative commands. In: Fernández Anta, A., Lipari, G., Roy, M. (Eds.), Principles of Distributed Systems, Vol. 7109 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pp. 329–342.
Saito, Y., Shapiro, M., 2005. Optimistic replication. ACM Comput. Surv 37 (1), 42–81.
Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M., 2011. Conflict-free replicated data types. In: Défago, X., Petit, F., Villain, V. (Eds.), Stabilization, Safety, and Security of Distributed Systems, Vol. 6976 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pp. 386–400.
Shavit, N., Touitou, D., 1995. Software transactional memory. In: Proceedings of the
14th ACM Symposium on Principles of Distributed Computing, pp. 204–213.
Strom, R., Yemini, S., 1985. Optimistic recovery in distributed systems. ACM Trans. Com-
put. Syst. 3 (3), 204–226.
Sun, C., Jia, X., Zhang, Y., Yang, Y., Chen, D., 1998. Achieving convergence, causality
preservation, and intention preservation in real-time cooperative editing systems.
ACM Trans. Comput.-Hum. Interact. 5 (1), 63–108.
Wester, B., Cowling, J.A., Nightingale, E.B., Chen, P.M., Flinn, J., Liskov, B., 2009. Tolerat-
ing latency in replicated state machines through client speculation. In: Proceedings
of the Networked Systems Design and Implementation, pp. 245–260.
Yin, J., Martin, J.P., Venkataramani, A., Alvisi, L., Dahlin, M., 2003. Separating agreement from execution for byzantine fault tolerant services. In: Proceedings of the ACM Symposium on Operating Systems Principles, Bolton Landing, NY, pp. 253–267.
Zhang, H., Chai, H., Zhao, W., Melliar-Smith, P.M., Moser, L.E., 2012. Trustworthy coordination for web service atomic transactions. IEEE Trans. Parallel Distrib. Syst. 23 (8), 1551–1565.
Zhang, H., Zhao, W., 2012. Concurrent byzantine fault tolerance for software-
transactional-memory based applications. Int. J. Future Comput. Commun. 1 (1),
47–50.
Zhang, H., Zhao, W., Melliar-Smith, P.M., Moser, L.E., 2011. Design and implementation
of a byzantine fault tolerance framework for non-deterministic applications. IET
Software 5, 342–356.
Zhao, W., 2008. Integrity-preserving replica coordination for byzantine fault tolerant
systems. Proceedings of the IEEE International Conference on Parallel and Dis-
tributed Systems. Melbourne, Victoria, Australia, pp. 447–454.
Zhao, W., 2009. Design and implementation of a Byzantine fault tolerance framework
for web services. J. Syst. Softw. 82 (6), 1004–1015.
Zhao, W., 2014a. A novel approach to building intrusion tolerant systems. Int. J. Per-
formability Eng. 10 (2), 123.
Zhao, W., 2014b. Application-aware byzantine fault tolerance. Proceedings of the 12th
IEEE International Conference on Dependable, Autonomic and Secure Computing.
IEEE, pp. 45–50.
Zhao, W., 2014c. Building Dependable Distributed Systems. Wiley-Scrivener.
Zhao, W., 2015. Optimistic byzantine fault tolerance. Int. J. Parallel Emergent Distrib. Syst. (ahead-of-print), 1–14. http://www.tandfonline.com/doi/abs/10.1080/17445760.2015.1078802.
Zhao, W., Babi, M., 2013. Byzantine fault tolerant collaborative editing. Proceedings of
the IET International Conference on Information and Communications Technolo-
gies. IET, pp. 233–240.
Zhao, W., Melliar-Smith, P.M., Moser, L.E., 2013. Low latency fault tolerance system. The
Comput J 56 (6), 716–740.
Zhao, W., Moser, L., Melliar-Smith, P.M., 2005a. Deterministic scheduling for multi-threaded replicas. In: Proceedings of the IEEE International Workshop on Object-oriented Real-time Dependable Systems, Sedona, AZ, pp. 74–81.
Zhao, W., Moser, L.E., Melliar-Smith, P.M., 2005b. Unification of transactions and repli-
cation in three-tier architectures based on CORBA. IEEE Trans. Dependable Secure
Comput. 2 (1), 20–33.
Zhao, W., Zhang, H., Chai, H., 2009. A lightweight fault tolerance framework for web
services. Web Intel. Agent Syst. 7, 255–268.
Wenbing Zhao is currently an Associate Professor at the Department of Electrical En-
gineering and Computer Science, Cleveland State University. He earned his Ph.D. at
University of California, Santa Barbara, under the supervision of Dr. Moser and Dr.
Melliar-Smith, in 2002. Dr. Zhao has authored a research monograph titled: Building
Dependable Distributed Systems published by Scrivener Publishing. Furthermore, Dr.
Zhao published over 100 peer-reviewed papers in the area of fault tolerant and depend-
able systems (three of them won the best paper award), computer vision and motion
analysis, and physics. Dr. Zhao's research is supported in part by the US National Science
Foundation, the Ohio Bureau of Workers’ Compensation, and by Cleveland State Uni-
versity. Dr. Zhao is currently serving on the technical program committee for numerous
international conferences and is a member of editorial board for PeerJ Computer Sci-
ence, International Journal of Performability Engineering, International Journal of Web
Science, International Journal of Distributed Systems and Technologies, and several in-
ternational journals of the International Academy, Research, and Industry Association.
Dr. Zhao is a senior member of IEEE and an ABET program evaluator.
Software errors are a major cause of outages and they are increasingly exploited in malicious attacks. Byzantine fault tolerance allows replicated systems to mask some software errors but it is expensive to deploy. This paper describes a replication technique, BASE, which uses abstraction to reduce the cost of Byzantine fault tolerance and to improve its ability to mask software errors. BASE reduces cost because it enables reuse of off-the-shelf service implementations. It improves availability because each replica can be repaired periodically using an abstract view of the state stored by correct replicas, and because each replica can run distinct or nondeterministic service implementations, which reduces the probability of common mode failures. We built an NFS service where each replica can run a different off-the-shelf file system implementation, and an object-oriented database where the replicas ran the same, nondeterministic implementation. These examples suggest that our technique can be used in practice---in both cases, the implementation required only a modest amount of new code, and our performance results indicate that the replicated services perform comparably to the implementations that they reuse.