Resilient Stream Processing in Edge Computing
Jinlai Xu†, Balaji Palanisamy†, and Qingyang Wang‡
†School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15260 USA
‡Computer Science and Engineering, Louisiana State University, Baton Rouge, LA 70803, USA
Email: jinlai.xu@pitt.edu, bpalan@pitt.edu, qywang@csc.lsu.edu
Abstract—The proliferation of Internet-of-Things (IoT) devices
is rapidly increasing the demands for efficient processing of
low latency stream data generated close to the edge of the
network. A large number of IoT applications require continuous
processing of data streams in real-time. Examples include virtual
reality applications, connected autonomous vehicles and smart
city applications. Although current distributed stream processing
systems offer various forms of fault tolerance, existing schemes do
not understand the dynamic characteristics of edge computing in-
frastructures and the unique requirements of edge computing ap-
plications. Optimizing fault tolerance techniques to meet latency
requirements while minimizing resource usage becomes a critical
dimension of resource allocation and scheduling when dealing
with latency-sensitive IoT applications in edge computing. In this
paper, we present a novel resilient stream processing framework
that achieves system-wide fault tolerance while meeting the
latency requirements for edge-based applications. The proposed
approach employs a novel resilient physical plan generation for
stream queries and optimizes the placement of operators to
minimize the processing latency during recovery and reduces
the overhead of checkpointing. We implement a prototype of the
proposed techniques in Apache Storm and evaluate it in a real
testbed. Our results demonstrate that the proposed approach is
highly effective and scalable while ensuring low latency and low-
cost recovery for edge-based stream processing applications.
Index Terms—Edge Computing, Stream Processing, Fault Tolerance, Scheduling
I. INTRODUCTION
The proliferation of Internet-of-Things (IoT) devices is
rapidly increasing the demands for efficient processing of low
latency stream data generated near the edge of the network.
IHS Markit forecasts that the number of IoT devices will increase to more than 125 billion by 2030 [1]. In general,
applications in the IoT era have strict demands for low latency
computing. For example, virtual reality applications that use
head-tracked systems require latencies less than 16 ms to
achieve perceptual stability [2]. Connected autonomous vehi-
cle applications (collision warning, autonomous driving, traffic
efficiency, etc.) have latency requirements from 10ms to 100ms
[3]. Edge computing provides an effective infrastructure to
support low latency computing for IoT applications. By mov-
ing computation near the end-devices, edge computing enables
low-latency processing of data for many IoT devices and
infrastructures including smart home hubs (e.g. Google Home,
Amazon Echo), road-side units [4], and micro datacenters [5]
deployed at the edge of the network near the IoT devices.
With the recent advancements in distributed stream process-
ing systems (e.g. Apache Storm [6], and Apache Flink [7]),
stream data processing becomes an integral component of low-
latency data analytic systems. Compared to traditional batch
processing systems (e.g. Apache Hadoop, Apache Spark),
stream processing systems can support continuous data pro-
cessing and produce results in real-time with high throughput
and low (or bounded) response time (latency). Many IoT
application scenarios require continuous processing of stream
data. For instance, in smart cities applications, the traffic events
need to be processed in real-time to assist drivers, and the
automated driver assistant system needs to make real-time
decisions in the car. However, with IoT stream processing
applications, reliability is a critical factor besides strict latency
requirements. For example, connected vehicles need to make
real-time decisions by analyzing the environment information.
If there is a failure in the application, it may result in delayed
decisions or even lead to accidents in some cases. Most
existing fault-tolerance solutions focus on: (i) checkpointing
[8] which takes the states of the application at regular intervals
and (ii) replaying that tracks the completion of each tuple to
make sure every tuple contributes to the final result [9]. Such
solutions primarily lack knowledge of the characteristics of the
edge computing environments including the need for ensuring
low latency processing. In some cases, the recovery may take
minutes [10] which may violate the application requirements.
Another class of existing work has focused on employing
active replication to stream processing systems [11]. While this
approach achieves nearly zero recovery time when there is a
failure, the overhead of active replication is significantly higher
than the checkpointing approach, sometimes resulting in even
two times the original resource usage cost. As mentioned, most existing works primarily focus on optimizing resource utilization [12], [13] and do not optimize for latency.
In this paper, we present a novel resilient stream processing
framework that achieves system-wide fault tolerance while
meeting the latency requirement for the applications in the
edge computing environment. The proposed approach employs
a novel resilient physical plan generation for the stream queries
that carefully considers the fault tolerance resource budget and
the risk of each operator in the query to partially actively
replicate the high-risk operators in order to minimize the
recovery time when there is a failure. The proposed techniques
also consider the placement of the backup components (e.g.
active replication) to further optimize the processing latency
during recovery and reduce the overhead of checkpointing
delays. We extensively evaluate the performance of our tech-
niques by implementing a prototype on Apache Storm [6] on a
cluster test-bed in CloudLab [14]. Our results demonstrate that
the proposed approach is highly effective and scalable while
ensuring low latency and low-cost recovery for edge-based
stream processing applications.
[Fig. 1: An example illustrating the comparison of (a) resilience-unaware scheduling and (b) the proposed resilience-aware approach. The word-count application (source, splitter, counter, sink) is deployed across smart gateways and a micro datacenter together with the acker, the checkpoint store, a replication of the counter, the tuple status table, and the timestamped state checkpoints.]
II. BACKGROUND AND MOTIVATION
In this section, we discuss state-of-the-art stream processing fault tolerance solutions and illustrate the challenges in supporting resilient and fault-tolerant stream processing in edge computing. As the number of IoT devices increases,
large amounts of data get generated near the edge of the
network in real-time. Traditional cloud solutions for IoT may
lead to long processing times and may not be suitable for
latency-sensitive IoT applications such as virtual reality (VR)
applications (that require less than 16 milliseconds latency to
achieve perceptual stability) [2], and applications for smart
cities such as connected vehicles (e.g., collision warning,
autonomous driving and traffic efficiency with latency require-
ment around 10 to 100 milliseconds) [3] and intelligent online
traffic control systems [5]. Stream processing in an edge com-
puting environment provides a promising approach to meet
strict latency requirements while processing huge amounts of
data in real-time. Fault tolerance is an important aspect of
edge computing as many IoT applications require both high
accuracy and timeliness of results. As edge infrastructures
consist of several unreliable devices and components in a
highly dynamic environment, end-to-end failures are more of
a norm than exception [15]. Thus, to support reliable delivery
of low latency stream processing over edge computing, we
need a highly fault-tolerant stream processing solution that
understands the properties of the edge computing environment
for meeting both the latency and fault tolerance requirements.
Checkpointing [8] and replication [11] represent two clas-
sical techniques for fault-tolerant stream processing. The
idea behind replication is to withstand the failure by using
additional backup resources. Replication approaches include
both active replication and standby replication. When there
is a failure, the backup resources will be used to handle
the workload impacted due to the failure. The difference
between active replication and standby replication is that
in active replication, both the primary and the replica run
simultaneously and process the same input and produce the
output. However, in standby mechanisms, the standby does
not generate outputs. In hot standby mechanisms, the standby
processes the tuples simultaneously similar to the primary but
with cold standby, the standby only synchronizes the state with
the primary operator passively.
The checkpointing mechanisms on the other hand period-
ically store the state of the operators in persistent storage
to create timestamped snapshots of the application. Often,
the checkpoint mechanism works with the replay mechanism
that caches the input tuples at the source (or upstream)
operator (e.g. the source operator shown in Figure 1). As
shown in Figure 1, the stream processing application has
four operators, a source operator which fetches the stream
from the data provider outside of the system, a splitter that
splits the sentence into words, a counter that counts the
words, and a sink that stores the result. It also includes fault
tolerance components namely the acker, the checkpoint store,
and active replication of the counter operator. In order to
explain the checkpointing and replay mechanisms, we present
the simplified content of the tuples’ status, the intermediate
results, and the checkpointing information in Figure 1 with
the timestamped tuples illustrated using numbered yellow
triangles. When there is a tuple failure, the source operator
re-sends the tuple to downstream operators for reprocessing.
The replay mechanism tracks the processing status of each
tuple (e.g., the acker shown in Figure 1 is used to track this
information). When a failure happens, the state of the failed
operator is restored and the unprocessed (unacknowledged)
tuples are replayed as shown in the example in Figure 1.
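To make the interplay between checkpointing, acknowledgment tracking, and replay concrete, the following minimal Python sketch models a single stateful word-count operator protected by both mechanisms. It is a simplified illustration of the behavior described above rather than the implementation used by Storm or Flink; all class and method names are hypothetical.

    class CheckpointedCounter:
        """Toy model of a stateful counter protected by checkpointing and replay."""

        def __init__(self, checkpoint_store):
            self.state = {}                  # word -> count
            self.checkpoint_store = checkpoint_store
            self.unacked = {}                # tuple_id -> sentence awaiting acknowledgment

        def process(self, tuple_id, sentence):
            self.unacked[tuple_id] = sentence
            for word in sentence.split():
                self.state[word] = self.state.get(word, 0) + 1

        def ack(self, tuple_id):
            # Called by the acker once the tuple is fully processed downstream.
            self.unacked.pop(tuple_id, None)

        def checkpoint(self):
            # Called when the checkpoint barrier arrives: commit the current state.
            self.checkpoint_store["counter"] = dict(self.state)

        def recover(self, replayed_tuples):
            # Restore the latest checkpoint, then replay the unacknowledged tuples
            # cached at the source (or upstream) operator.
            self.state = dict(self.checkpoint_store.get("counter", {}))
            self.unacked = {}
            for tuple_id, sentence in replayed_tuples:
                self.process(tuple_id, sentence)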
Hybrid methods employ a combination of both check-
pointing and replication. They are referred to as adaptive
checkpointing and replication techniques. Adaptive checkpoint
and replications schemes have been proposed in several do-
mains [12], [13], [16]–[19]. The goal of combining active
replication and checkpoint mechanisms is to achieve seam-
less recovery compared to pure checkpointing. When active
replication is applied correctly, the recovery time will be zero
(the input of the failed operator is handled by the replica).
Combining active replication and checkpointing also decreases the resource overhead, as it mostly relies on checkpointing, which needs only limited resources (transferring and storing the checkpoints, as shown in Figure 1), compared to fully active replication, which needs nearly twice the resource usage of the original.
While adaptive checkpointing and replication is a promising
approach, its application in edge computing is challenged in
several aspects. The heterogeneous nature of both physical
nodes and the network components in an edge computing
environment significantly challenges the placement of the
replicas that directly influences the performance and cost. For
example, in Figure 1b, if the active replication of the counter is placed on a node far from the splitter and sink operators, say with a latency of 200 ms between smart gateway 3 and the micro datacenter, the latency to transmit the stream from the replica to the sink operator will be very high. We note that this will not influence the performance in the fail-free condition, as the application eliminates the duplicate results at the end and the output from the primary will always be generated earlier than the output from the replica in this condition. However, when the primary counter operator fails, the output of the active replication becomes the valid output, and the increased latency can make the results not useful as they may contain out-of-date information. For instance,
let us consider that the latency requirement is less than 200ms
and the normal processing from the source to the sink needs
100ms. In the fail-free condition, the latency requirement will
be met. However, if the primary operator fails, the output
comes only from the replica during the recovery phase and
a bad placement of the replica can drastically increase the
latency to more than 200ms which may violate the latency
requirement and make the results not useful.
Most current state-of-the-art distributed stream processing systems (e.g. Apache Flink [7], Apache Storm [6]) optimize the
performance of an application by placing the operators in a
single worker (single process) and in a single physical node or
a set of adjacent physical nodes to improve the data locality
by reducing the overhead of copying or transmitting the data
across the processes or nodes. Therefore, if the placement
mechanism is unaware of the fault tolerance mechanisms and
requirements, the fault tolerance properties achieved by the
mechanisms may be poor. For example, in the worst case as
shown in Figure 1a, when the active replica is placed along
with the primary operator in the same worker 3 (the dotted
line in the figure indicates the boundary of a worker), the
active replica will also fail because of the influence of the
primary failure which makes the active replication scheme
ineffective. While optimizing for performance and data locality
is important, careful decisions on where to place the backup
resources (e.g. the active replication, the checkpoint store)
while closely considering their roles in the application is vital
to achieving the desired resilience properties. In the example shown in Figure 1b, we place the active replication on a different node near the node where the primary is placed, so that the failure of either one does not affect the other.
Here, careful tradeoffs between data locality and minimizing
correlated failure probabilities are essential to ensuring both
high resiliency and performance in terms of throughput and
latency. In the next section, we discuss the system design
of our proposed fault-tolerant stream processing mechanisms optimized for the edge computing environment.
[Fig. 2: System Overview — the logic plan is translated into a resilient physical plan (with acker and checkpoint components), guided by cluster information from monitors on the smart gateways and the micro datacenter, failure prediction, recovery cost estimation, and the preliminary operator placement, followed by resilience-aware scheduling.]
III. SYSTEM DESIGN
The proposed fault tolerance mechanism for edge comput-
ing consists of two phases namely (i) resilient physical plan
generation, and (ii) resilience-aware scheduling and failure
handling. In the resilient physical plan generation (Figure 2),
we decide the physical plan of the stream processing appli-
cation by first translating the user code into a logic plan and
then, based on the preliminary operator placement result, we
add the necessary backup components (e.g. active replication,
and checkpoint store) considering the recovery cost for each
operator. Then, when the physical plan is going to be deployed
on the cluster, the system needs to decide the placement of
both the operators and the backup components and handle the
failure when the application fails. Next, we discuss the details
of various components (Figure 2) of the proposed resilient
stream processing system.
A. Resilient physical plan generation
A stream processing application can be represented as a
Directed Acyclic Graph (DAG) which captures the processing
graph provided in the user-defined program. A logic plan
illustrates the logic of the stream processing application repre-
sented by the DAG. Thus, a logic plan consists of vertices and
edges where the vertices represent the operators defined by the
user and the edges are the streams between the operators as
shown in Figure 2. We use the notation $G_{logic}(V_{logic}, E_{logic})$ to represent the logic plan. A physical plan extends the logic plan with more detailed configurations including the level of parallelism for each operator, the configuration of the backup operators, etc. The resilient physical plan is represented as a graph $G_{phy}(V_{phy}, E_{phy})$. When generating the resilient physical plan, the system needs to decide several parameters, including which operators are actively replicated.
We present an example in Figure 3. Besides the logic plan,
the resilient physical plan generation includes many other
components: (i) the acker is an operator tracking whether
the tuples have been completely processed in the application,
(ii) the checkpoint store provides the services to store the
checkpoint in the volatile memory or in the persistent storage,
(iii) the state management for each stateful operator (e.g.
the counter operator) indicating the checkpoint mechanism
and the parameters of the checkpoint mechanism (e.g. the
checkpoint interval), and (iv) the fault tolerance mechanism for each operator.
[Fig. 3: Resilient Physical Plan Example — the logic plan (source, splitter, counter, sink) extended with a checkpoint source and checkpoint stream, a checkpoint store (checkpoint interval: 10 seconds), an acker, the operator states, the primary path, and an active replication of the counter.]
For example, the counter operator is protected
by both the active replication, the checkpoint mechanism and
the event replaying (the acker feedback loop). We denote the
fault-tolerant physical plan including the backup components
as $V_{phy} = V_{logic} \cup \{i_{ckstore}, i_{cksource}\} \cup V_{acker} \cup V_{active}$, where $i_{acker} \in V_{acker}$ is an acker operator in the acker set, $i_{ckstore}$ is the checkpoint store, which can be a local database or an in-memory key-value store, $i_{cksource}$ is a source operator which generates the checkpoint stream based on the checkpoint interval configuration (we discuss the checkpoint stream in detail later), and $V_{active}$ is a copy of the subset of $V_{logic}$ that defines the set of operators in the logic plan that are actively replicated. The details of how to configure the above components are described in Section IV. Also, in the physical plan, the parallelism of each operator $i$ can be set by the user to determine how many instances are run to process the input of the operator, which directly influences the decision of active replication. Here we use $i_k$ to indicate the $k$-th task of an operator $i$, and the parallelism is denoted as $K_i$, a predefined parameter.
The checkpoint stream is a fault tolerance component which
will pass through all the operators in the logic plan in the same
sequence defined by the logic plan. When one of the operators
receives the checkpoint tuple (barrier), the stateless operator
will forward it to the downstream operators, and the stateful
operator will halt the stream processing and perform the
checkpointing by committing the current state to a checkpoint
store. The checkpoint stream is for synchronizing the state
over the whole stream processing application to ensure that
the checkpointing is done across the application as an atomic
operation. When the checkpoint tuple (barrier) passes all the
operators in the logic plan, an acknowledgment will be made
by the acker to inform the checkpoint source that the current
checkpointing is done. Next, we discuss the notion of operator
and backup component placement and illustrate how the failure
is handled by the backup components.
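The barrier-forwarding behavior of the checkpoint stream can be summarized in a short sketch, assuming simplified operator objects with a stateful flag, a state field, and a receive method; this is an illustration of the protocol described above, not the API of any specific engine.

    def handle_tuple(operator, tup, downstream, checkpoint_store, acker):
        """Simplified handling of data tuples and checkpoint barriers in one operator."""
        if tup.get("type") == "barrier":
            if operator.stateful:
                # Stateful operators pause normal processing and commit their state
                # atomically with respect to the barrier position in the stream.
                checkpoint_store.commit(operator.name, operator.state, tup["checkpoint_id"])
            for op in downstream:
                op.receive(tup)              # forward the barrier along the logic plan
            if not downstream:
                # The barrier reached a sink: acknowledge the checkpoint so the
                # checkpoint source knows this snapshot is complete.
                acker.ack_checkpoint(tup["checkpoint_id"])
        else:
            operator.process(tup)            # ordinary data tuple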
B. Scheduling and failure handling
After the resilient physical plan is determined, the system needs to decide the placement for each component in the physical plan and to handle failures of the application. The scheduling decision
can be illustrated as a mapping between the physical plan
graph $G_{phy}$ and the cluster graph, denoted by $G_{res}(V_{res}, E_{res})$, where $V_{res}$ indicates the nodes in the cluster and $E_{res}$ indicates the virtual links connecting them. An example is shown in Figure 2, in which two smart gateways connect to a micro datacenter. Here, we use $i$ to represent a component in the physical plan, $i \in V_{phy}$, and $v$ to indicate a node in the cluster, $v \in V_{res}$. We use $c_v$ to indicate the idle resource capacity of a node, which is the full capacity minus the resource usage on that node. For the mapping between the physical plan and the cluster, we use $X = \{x^i_v \mid i \in V_{phy}, v \in V_{res}\}$, where $x^i_v = 1$ indicates that component $i$ is deployed on node $v$, and $x^i_v = 0$ otherwise. After the
physical plan is scheduled to the cluster, the stream processing
runs continuously until it is shut down. When there is a failure in state-of-the-art stream processing systems (e.g. Apache Flink [7], Apache Storm [6]), the state of the application is
typically backed up by checkpointing and the input data is
backed up using a replay mechanism. Thus during recovery,
a few steps can restore the application to its normal status.
However, this process can be time consuming. The recovery
time varies from application to application but usually, it can
be divided into two parts: (i) the time for detection of the fault
and (ii) the time for restoring the computation. For the first
component, most distributed stream processing systems use
heartbeat [20] to detect task failures and node failures. The
heartbeat is a kind of signal sent between the monitor (e.g. a
master node) and the monitored tasks. When there is a timeout
of the heartbeat, the monitor will assume that the monitored
task has failed and will trigger the recovery mechanism (e.g., restarting the failed task). For the second part, the application needs to recover from the failure in order to restore the state before the failure. In distributed stream processing systems, the state of the operator is restored from the latest checkpoint and the tuples are replayed to gradually recover the state of the operator to the point before the failure happened. Thus, with
the above two delays, the application will not produce any
output until all the operators are synchronized to the state
before the failure which will introduce high recovery latency
and processing latency. The latency can cause several issues
including violating the user’s latency requirement and in some
cases, the peak workload during recovery may cause other
operators to fail consecutively.
Active replication is one of the most promising supplementary techniques for regular checkpointing-based fault tolerance. The primary and the secondary (replica) run at the same time and both produce results, so the application fails only when both fail simultaneously; otherwise the failure is seamlessly masked. There is no heartbeat detection latency and no restoring latency at the time of failure, and the total recovery time for this mechanism is nearly zero in most conditions. However, if we replicate all operators with active replication, it results in nearly twice the original resource usage cost to handle the workload. Therefore, we need a mechanism that estimates the risk of each operator based on its estimated recovery time and adopts a cost-effective approach to selective replication.
C. Recovery time estimation
If the application is only backed up by the checkpoint
mechanism or when the failed operator is not replicated by
active replication, the recovery time can be significant. We first
estimate the recovery time to obtain the risk of each operator
to determine which operators need to be replicated. We note
that the recovery time contains two components namely (i)
fault detection and (ii) computation restoring.
Fault detection time is determined by the heartbeat interval,
the latency between the monitor and the fail task and the
configuration of the heartbeat timeout. We assume that the failed task is $i$ and the monitor task is $m$, which can be either a node manager (e.g., the supervisor in Apache Storm), a cluster master (e.g., the Nimbus in Apache Storm), or a cluster coordinator (e.g., a ZooKeeper cluster in Apache Storm). If the monitor is a node manager, the failure of task $i$ can be detected by the timeout of the heartbeat, which is denoted as $\tau_{hbtimeout}$ (the timeout can be set by a parameter, for example five seconds). Therefore, in the worst case, the time to detect the fault is the sum of $\tau_{hbtimeout}$ and the heartbeat interval (the fault happens immediately after acknowledging the last heartbeat). As the heartbeat timeout is often significantly larger than the heartbeat interval, we ignore the heartbeat interval and only consider the influence of the heartbeat timeout, $\tau_{hbtimeout}$, in the recovery. If there is a node $v$ failure, the time to detect the failure also needs to include the latency between the failed node and the monitor node, which is denoted as $l(v, m)$. The second part, namely the time to restore the computation, is more challenging to estimate as it is related to many aspects including the length of the unacknowledged tuple queue, the size of the state, and the average processing time of the failed operator.
When restoring the computation, the recovery time can be divided into three phases, namely (i) restarting the task and loading the program into memory, (ii) retrieving the latest checkpoint from the checkpoint store, and (iii) reprocessing the unacknowledged tuples to restore the computational state before the failure. For the first component, restarting and loading the program, we assume it takes a constant time $\tau_{restart}$. For the second part, the time to retrieve the checkpoint is denoted as $\tau_{checkpoint}(size(s_i))$, which is determined by the size of the state $size(s_i)$, where $s_i$ denotes the latest state of an operator $i$, and by the latency between the operator and the checkpoint store, denoted as $l(v, v')$, where $v$ is the node on which operator $i$ is placed and $v'$ is the node where the checkpoint store is placed. For the third part, we need to know how many input tuples of operator $i$ are not yet acknowledged. We assume that this is a function $q_i(\lambda_i)$ denoting the input buffer of unacknowledged tuples, which is related to the input rate $\lambda_i$ of operator $i$. Each tuple takes $d_i$ on average to be fully processed, and we can estimate the replaying time as $\tau_{replay} = d_i q_i(\lambda_i)$. With the above steps, we can estimate the overall recovery time if operator $i$ fails without an active replication by adding the detection time and the restoring time as shown below:
$\tau_i(X) = \tau_{hbtimeout} + l(v, m) + \tau_{restart} + \tau_{checkpoint}(size(s_i)) + l(v, v') + \tau_{replay}$   (1)
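Equation (1) translates directly into a small helper; the inputs are assumed to come from the monitors and from profiling, and we model $\tau_{checkpoint}(size(s_i))$ here as the state size divided by an assumed transfer rate.

    def estimate_recovery_time(hb_timeout, lat_to_monitor, restart_time,
                               state_size, transfer_rate, lat_to_ckstore,
                               per_tuple_time, unacked_count):
        """Estimate the recovery time of an operator without an active replica (Eq. 1)."""
        detection = hb_timeout + lat_to_monitor                       # tau_hbtimeout + l(v, m)
        restore_state = state_size / transfer_rate + lat_to_ckstore   # tau_checkpoint + l(v, v')
        replay = per_tuple_time * unacked_count                       # tau_replay = d_i * q_i(lambda_i)
        return detection + restart_time + restore_state + replay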
Based on the above recovery time estimation, we can estimate the risk of each operator under the operator placement decision and further optimize it by either adding more active replications or migrating some of the operators to reduce the risk, which we discuss in detail in Section IV. In the next subsection, we introduce the method we use to predict failures, an important component for accurate recovery cost estimation.
D. Failure prediction
Our method is based on the accurate prediction of the
failure. We summarize the failure modes as follows: (i) task failures, in which a specific task fails (e.g., due to memory
issues), (ii) node failures in which a node or the supervisor
deployed on it fails causing all tasks running on it to fail,
(iii) data failures which refer to the loss of data that may
occur due to data dropping on a congested network device or
a communication timeout due to a high latency network.
For the task and node failures, there are many related works
[21] that predict such failures accurately with up to 99%
accuracy. For example, Zhang et al. [22] proposed a system
failure prediction method based on Long Short-Term Memory
(LSTM) that can achieve 90.9% recall with a correct prediction
of 73 minutes before the failure occurs. There are also some
recent efforts in the IoT domain [23] which predict the failure of IoT devices. For data loss, we do not handle it
through active replication but through the replaying techniques
in stream processing [9] which we discuss in Section IV.
With the above observations, we can leverage the failure
prediction to help us predict the risk of each operator. We
assume the failure probability is $p_i(t)$ for a task $i$ and $p_v(t)$ for a node $v$ in a time-slot $t$. We assume that a node failure causes all the tasks (operators) placed on it to fail. Thus, if a preliminary operator placement $X_0 = \{x^i_v \mid i \in V_{logic}, v \in V_{res}\}$ is determined as shown in Figure 2, we can combine the respective task and node failure probabilities into a uniform failure probability $\rho_i(t) = p_i(t) + p_v(t) - p_i(t)p_v(t)$ for the node $v$ with $x^i_v = 1$, under the assumption that task failures and node failures are independent events. The failure probability may change due to workload or environment changes, but it can be updated by the prediction algorithm before each time-slot using the monitors deployed on each node, as shown in Figure 2. Within each time-slot, we can decide a resilient physical plan based on the risk estimated from the recovery time and the failure probability predicted by the prediction algorithms; the plan consists of the original operators, their placement, the fault tolerance components' configuration, and the placement of those components. We discuss the details in the next section.
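The combined per-operator failure probability, and the risk value it feeds in the next section ($a_i(t, X) = \tau_i(X)\rho_i(t)$), amount to a few lines; the probabilities themselves are assumed to come from the prediction models cited above.

    def combined_failure_probability(p_task, p_node):
        """rho_i(t) = p_i(t) + p_v(t) - p_i(t) * p_v(t), assuming independent failures."""
        return p_task + p_node - p_task * p_node

    def operator_risk(recovery_time, p_task, p_node):
        """Expected recovery cost a_i(t, X) = tau_i(X) * rho_i(t), used to rank operators."""
        return recovery_time * combined_failure_probability(p_task, p_node)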
IV. RESILIENT STREAM PROCESSING
In this section, we describe the proposed algorithms for
resilient stream processing in edge computing that leverage
both checkpointing and active replication techniques. We
assume that the user can specify a fault tolerance resource
budget which can be the amount of the additional compu-
tational resources (e.g. CPU, memory). We transform the
budget to quantify the extra resources to be used to handle
the fault (e.g. the resource amount can be calculated using
the resource unit price). Then the multi-dimension resource
amount (CPU, memory, bandwidth, etc.) can be converted
into a one-dimension amount using methods such as the
dominant resource described in [24]. We use Cto denote the
additional resources that can be used to run the fault tolerance
components (e.g. the checkpoint store and active replication).
With the budget configuration, the resilient physical plan
can be generated by considering the risk and failure probability
of the operator to selectively replicate some of the operators
which have higher recovery costs (e.g. recovery time) and
higher failure probabilities. Thus, in the physical plan generation, we estimate the recovery cost by combining the risk (recovery time) with the failure probability: $a_i(t, X) = \tau_i(X)\rho_i(t)$, which can also be seen as the expected recovery cost of operator $i$ in time-slot $t$. For simplicity, we
assume that the basic physical plan, which is directly generated from the logic plan $G_{logic}$, and the operator placement are already decided by parsing the user's program and computed by a scheduler. We use $X_0$ to denote the original operator placement decision generated by the default scheduler (e.g., the default resource-aware scheduler in Apache Storm [6]).
Our algorithm will use the determined operator placement
decision to further optimize the configuration and placement
for the fault tolerance components. We note that the placement
of the operators can influence the performance of the stream
processing application and therefore, jointly optimizing the
configuration and placement of both the operators and the
fault tolerance components can be an interesting direction
of future research. In this paper, we primarily focus on the
fault tolerance aspect and its influence on the applications.
We divide the proposed fault tolerance solution into two
phases: (i) checkpoint related component configuration (ii)
active replication related configuration. For the checkpoint
related component, we need to decide: (i) where to place
the checkpoint store, (ii) how many ackers we need to use
to track the completion of each tuple and where to place
them. For the active replication, we need to decide: (i) which
operators to be actively replicated, (ii) where to place these
active replications. In the rest of this section, we illustrate
our proposed solutions to achieve fault tolerance in edge
computing environments by leveraging both checkpointing
and active replication while carefully considering the resource
budget and latency requirement.
A. Checkpoint
The checkpointing mechanism periodically takes snapshots
of the state of the whole stream processing application, and
when there is a failure, the checkpoints can be used to
restore the state of the application to a state before the failure
happens. However, merely restoring the state is not sufficient
to guarantee the correctness of the processing as the input
data that do not contribute to the restored state should be
replayed in order to make sure that they also contribute to
the final output. We leverage two techniques here to guarantee
the correctness when failure happens: (i) checkpointing which
includes the snapshot mechanism, the state committing by the
stateful operators, and the checkpoint storage, and (ii) input
data replaying which includes the tuple tracking mechanism
and the replaying mechanism.
As we want to ensure that all the operators are backed up,
we enable the checkpointing mechanism across the application
as the basic fault tolerance mechanism before applying the
active replication. We take the logic plan of the application $G_{logic}(V_{logic}, E_{logic})$ and add the checkpointing-related components to it, including a checkpoint store $i_{ckstore}$, which is responsible for storing the checkpoints, a checkpoint source $i_{cksource}$, which generates the stream that synchronizes the snapshot status across the application, and a set of ackers $V_{acker}$ that track the completion of all the tuples. Besides, we
need to decide the placement for the above-mentioned compo-
nents. As the checkpoint source only influences the snapshot
step by generating synchronization signals and tracking the
checkpointing step, the influence of the placement of it is not
very significant. Here, we can simply collocate it with one
of the source operators in the logic plan. For the checkpoint
store, we need to consider the network connectivity to all the
stateful operators which commit checkpoint information to it.
If the operator fails without an active replication, it needs
to communicate with the checkpoint store to fetch the latest
state promptly. Besides, checkpoints will be committed to it periodically by the stateful operators and therefore, we need to select an appropriate node to which all the stateful operators can commit their checkpoints without waiting a long time for the acknowledgment. In our work, we assume that a single-node checkpoint store is capable of handling the checkpoints of the stream processing application with a resource requirement $c_{ckstore}$. For the highly geo-
distributed application, we can employ geo-distributed key-
value stores [25] to implement a more scalable checkpoint
store. We have the initial operator placement decision $X_0$. We use $V_{stateful}$ to denote the set of all the stateful operators in the application, and we compute the objective function:
$\min_{v'} \sum_{i \in V_{stateful}} l(v, v') \quad \text{s.t. } x^i_v = 1,\ x^{i_{ckstore}}_{v'} = 1$   (2)
The problem can be solved by traversing all the nodes, and the computational complexity is $O(|V_{res}||V_{stateful}|)$.
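A brute-force solution of Equation (2) simply scores every candidate node by the total latency to the nodes hosting the stateful operators; the placement map and latency function are assumed to be provided by the preliminary scheduler and the monitors.

    def place_checkpoint_store(nodes, stateful_ops, placement, latency):
        """Pick the node v' minimizing the sum of l(v, v') over stateful operators (Eq. 2).

        placement: operator -> node hosting it (from X_0)
        latency:   (node, node) -> measured latency
        """
        def total_latency(candidate):
            return sum(latency[(placement[op], candidate)] for op in stateful_ops)

        return min(nodes, key=total_latency)   # O(|V_res| * |V_stateful|)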
Next, for the acker which tracks the completion of each
tuple, we need to determine the following: (i) the minimal
number of ackers that can handle the tracking of the appli-
cation, (ii) the placement of the ackers to minimize the gap
between the accomplishment and the acknowledgment of the
tuples, which in turn minimizes the unnecessary replaying of
the tuples when there is a failure. We assume that the capacity of an acker is $o_{acker}$ with a resource requirement $c_{acker}$, which indicates that one acker can handle the tracking of at most $o_{acker}$ tuples per unit time. Therefore, the number of ackers can be calculated as $|V_{acker}| = \frac{\sum_{i \in V_{logic}} \lambda_i}{o_{acker}}$. The placement can be decided similarly to the checkpoint store: we minimize the weighted distance between the ackers and the operators:
$\min \sum_{i_{acker} \in V_{acker}} \sum_{i \in V_{logic}} \frac{\lambda_i\, l(v, v')}{|V_{acker}|} \quad \text{s.t. } x^i_v = 1,\ x^{i_{acker}}_{v'} = 1$   (3)
Traversing all the combinations has a computational complexity ranging from $O(|V_{res}||V_{logic}|)$ to $O(|V_{res}|^{|V_{acker}|}|V_{logic}|)$, determined by how many nodes are used for placing the ackers. The result of the above problem is a placement decision for the checkpointing components, which we denote as $X_{checkpoint}$, consisting of the placement of the checkpoint source, the checkpoint store, and the ackers.
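The acker sizing and placement can be sketched in the same style; the sketch below rounds the acker count up and uses a greedy one-node-at-a-time placement as a simple heuristic in place of the exhaustive search over all combinations discussed above.

    import math

    def size_and_place_ackers(nodes, ops, input_rate, placement, latency, acker_capacity):
        """Decide how many ackers to run and where to place them (cf. Eq. 3)."""
        total_rate = sum(input_rate[op] for op in ops)
        num_ackers = max(1, math.ceil(total_rate / acker_capacity))

        def weighted_latency(candidate):
            # Input-rate-weighted distance from the candidate node to every operator.
            return sum(input_rate[op] * latency[(placement[op], candidate)] for op in ops)

        # Greedy heuristic: use the num_ackers nodes with the smallest weighted latency.
        chosen = sorted(nodes, key=weighted_latency)[:num_ackers]
        return num_ackers, chosen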
B. Active Replication
The checkpointing mechanism backs up the application
entirely by periodically snapshotting the state of the whole
application. However, the recovery time is significant if there
is a failure. The restarting of the failed task, the restoring of
the state, and the replaying of the unacknowledged tuples incur
significant time. Therefore, we add the active replication to the
operator which has a higher failure probability and a longer
estimated recovery time by considering the fault tolerance
budget defined by the user.
Algorithm 1: Select and place active replication
Input:  Logic plan: $G_{logic}(V_{logic}, E_{logic})$;
        Operator placement: $X_0$;
        Checkpoint component placement: $X_{checkpoint}$;
        User fault tolerance budget for active replications: $C_{active} = C - c_{ckstore} - |V_{acker}| c_{acker}$;
        Time-slot: $t$;
Output: Operator set to be replicated: $V_{active}$;
        Active replication placement: $X_{active}$;
1:  Initialize the operator set $V_{active} = \emptyset$ and the placement $X_{active} = \emptyset$;
2:  Sort the operators in $V_{logic}$ by their risk $a_i(t, X)$ in descending order;
3:  for each operator $i \in V_{logic}$ do
4:      if $c_i(\lambda_i) \le C_{active}$ then
5:          $V_{active} = V_{active} \cup \{i\}$;
6:          Update $C_{active} = C_{active} - c_i(\lambda_i)$;
7:  for each active replication $i' \in V_{active}$ do
8:      Get the node $v$ which hosts the primary operator $i$ of the active replication $i'$;
9:      Sort the neighbor nodes $v' \in V_{res}$ of $v$ by their distance $l(v, v')$ in ascending order;
10:     for each neighbor node $v' \in V_{res}$ do
11:         if $c_i(\lambda_i) \le c_{v'}$ then
12:             $X_{active} = X_{active} \cup \{x^{i'}_{v'} = 1\}$;
13:             Update $c_{v'} = c_{v'} - c_i(\lambda_i)$;
14:             break;
15: For the remaining replications which do not find a placement, use a network-aware mechanism [26] to place them;
There are two decisions we need to make when dealing
with active replication: (i) which operators need to be actively
replicated, and (ii) where to place the active replication to
minimize the latency when there is a failure. For the selective
active replication, we consider a user-defined budget $C$, which represents the resources that can be used by the fault tolerance components. The resource requirement for each operator is $c_i(\lambda_i)$, which is related to the input rate. We assume that the resource requirement is a non-decreasing function of the input rate $\lambda_i$. We also assume that an active replication executes exactly the same logic as its primary operator, so that the resource requirement of the active replication of an operator $i$ is also $c_i(\lambda_i)$. The active replication selection method is shown in Algorithm 1, lines 1-6. The algorithm selects the operators to be replicated based on the initial operator placement decision $X_0$ and their recovery cost until the fault tolerance budget is reached. The output is the operator set selected to be actively replicated, $V_{active}$. The computational complexity is determined by the sorting, which is $O(|V_{logic}| \log |V_{logic}|)$.
After the actively replicated operators are selected, we need to decide their placement so as to also minimize the latency during a failure. Instead of reasoning about all the operators in the application to estimate the performance and schedule the active replication, we assume that the physical plan and the original placement $X_0$ already meet the service level agreement (SLA) with the user; thus, we focus on how to minimize the performance (especially latency) gap between the fail-free condition and the fault condition. Algorithm 1, lines 7-15, details our proposed algorithm to place the active replications. The problem is solved by placing each replication on the node that is the nearest capable neighbor of its primary operator, to minimize the influence of network latency and other impacts on the active replication when there is a failure. The method fetches the placement information of the primary operator and tries to place the replication on one of the neighbors. It is worth noting that the neighbor information can be obtained by clustering the nodes in a latency space [26] or by a predefined cluster architecture as used in our experiments. The algorithm traverses the neighbors in increasing order of distance (network latency) until a node with enough capacity to place the replication is found. The computational complexity is determined by the outer loop and the sorting, and is $O(|V_{active}||V_{res}| \log |V_{res}|)$.
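Algorithm 1 can be expressed compactly as a sketch; the risk values $a_i(t, X)$, resource demands $c_i(\lambda_i)$, idle capacities, and latency-sorted neighbor lists are assumed to be precomputed, and the fallback network-aware placement [26] is abbreviated to a placeholder.

    def select_and_place_replicas(ops, risk, demand, budget,
                                  primary_node, neighbors, capacity):
        """Greedy selection and nearest-neighbor placement of active replicas (Algorithm 1)."""
        selected, placement = [], {}

        # Lines 1-6: replicate the riskiest operators that still fit in the remaining budget.
        for op in sorted(ops, key=lambda o: risk[o], reverse=True):
            if demand[op] <= budget:
                selected.append(op)
                budget -= demand[op]

        # Lines 7-14: place each replica on the nearest neighbor with enough idle capacity.
        for op in selected:
            for node in neighbors[primary_node[op]]:   # assumed sorted by latency l(v, v')
                if demand[op] <= capacity[node]:
                    placement[op] = node
                    capacity[node] -= demand[op]
                    break
            else:
                placement[op] = None   # line 15: fall back to network-aware placement [26]

        return selected, placement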
V. EVALUATION
We evaluate our fault-tolerant stream processing system in
an edge computing experimental testbed. The evaluation is
designed to measure the performance improvement of our
method in comparison with both the baseline and state-of-the-art
solutions. In the evaluation, we study the influence of the fault
tolerance component placement and analyze the performance
impact of using checkpointing as the only fault tolerance
mechanism. Finally, we study the overhead of applying various
fault tolerance solutions.
A. Implementation and experimental setup
We implement the system on top of Apache Storm [6]
(v2.0.0). It can also be implemented on other distributed
stream processing engines such as Apache Flink [7]. We extend the DefaultResourceAwareScheduler (DRA) in Storm to implement our algorithms for the physical plan and scheduling optimizations. The physical plan generation and the scheduling decision are implemented outside of the scheduler. The scheduler takes the physical plan and the scheduling decision as input and schedules the tasks based on them.
We deploy a testbed on CloudLab [14] with nodes organized
in three tiers. We use a cluster of ten m510 servers in CloudLab and simulate the three-tier architecture on an OpenStack [27] cluster. The third tier contains fourteen
m1.medium instances (2 vCPUs and 4 GB memory) that
act as the smart gateways with relatively low computing
capacity corresponding to the leaf nodes of the architecture.
The second tier has five m1.xlarge instances (8 vCPUs and 16
GB memory) and each of them functions as a micro datacenter.
The first tier contains one m1.2xlarge instance (16 vCPUs
and 32 GB memory) acting as the computing resource used in the cloud datacenter.
[Fig. 4: Accident Detection Application — a source fed by the sensors, a filter (speed ≤ ε), an aggregate operator grouping car locations by tumbling windows, and a DB sink writing to a database table.]
[Fig. 5: Throughput under (a) the fail-free condition and (b) the fault condition (failure injected at 30 s).]
The network bandwidth, latency and
topology are configured by dividing the nodes into virtual LANs (local area networks) and adding policies to the ports of each node, which are enforced using the Neutron module of OpenStack and emulated with the Linux traffic control (tc) tool.
We deploy the Storm Nimbus service (acting as the master
node) on the m1.2xlarge instance and one Storm Supervisor
service (acting as the slave node) on each node respectively.
For the checkpoint store, we use a single node Redis service.
The default network is configured with 100 Mbps of bandwidth and 20 ms latency between the gateways and the micro datacenters, and 100 Mbps of bandwidth with 50 ms latency between the cloud datacenter and the micro datacenters.
We also place a stream generator on each smart gateway to
emulate the input stream. The input stream comes to an MQTT
(Message Queuing Telemetry Transport) [28] service deployed
on each smart gateway. The default stream rate is set to be 100
tuples per source (smart gateway) per second (which is 1400
tuples per second in total).
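For reference, each per-gateway stream generator can be emulated with a few lines using an MQTT client library such as paho-mqtt; the topic name and payload fields below are illustrative rather than the exact ones used in our testbed.

    import json
    import time
    import paho.mqtt.client as mqtt

    def generate_stream(broker_host, rate=100, topic="sensors/cars"):
        """Publish synthetic car-sensor tuples to the local MQTT broker at a fixed rate."""
        client = mqtt.Client()
        client.connect(broker_host, 1883)
        seq = 0
        while True:
            payload = {"ts": time.time(), "car_id": seq % 50,
                       "sensor_id": 1, "speed": 0.0 if seq % 97 == 0 else 30.0}
            client.publish(topic, json.dumps(payload))
            seq += 1
            time.sleep(1.0 / rate)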
B. Application
The application we use in the experiment is to detect
accidents in linear roads as shown in Figure 4. The sensors
gather the position and the speed of each car passing them.
Then the sensor data is filtered using the condition speed < ε, where ε is a small number indicating the error range of the sensor. After that, the tuples are aggregated by the location ID
and time window. When a car is detected to be not moving
in a particular continuous time window, it is treated as a
broken car and the position will be reported to the next
operation. In the end, the position is reported and stored in
a database table. We implement the application based on the
API provided by Apache Storm. For the windowed aggregator
in the application, we enable the window persistence so that
the tuples in every window will be stored in the checkpoint
store periodically. The timeout parameter for tracking the
completion of the tuple is set to be 5s. To generally accept
the out of order tuples, we enable the lag parameter in the
windowed aggregator as one second (the default setting is zero
which means the lag tuples will be dropped immediately),
which means that an out-of-order tuple can be accepted within a one-second time window; otherwise it is dropped.
[Fig. 6: Latency — (a) overall and (b) during recovery.]
[Fig. 7: Success rate. Fig. 8: Throughput.]
We set the window size to one second and the default fault
tolerance budget to 10% of the original application. The
default parallelism of the filter operator is set to 14, which is the same as the number of sources, and the parallelism of the aggregator is set to 2.
For each experiment, we run the application and the stream
generators for one minute and let the application run another
thirty seconds to let it fully process the input. The tuples
which are not processed in the additional thirty seconds are
considered as failed tuples.
C. Algorithm
We compare our methods in different combinations. We di-
vide the proposed method into two parts: (i) Resilient Physical
Plan Generation (RPPG) and (ii) Resilience-Aware Scheduling
(RAS). For the physical plan generation, we compare with: (i)
ck-only, which only applies the checkpoint to achieve fault
tolerance, and (ii) full-rep, which applies full active replication
to protect all of the operators. For the scheduling optimization,
we compare RAS with DRA which is the default scheduler
Apache Storm uses as described above.
D. Experiment Results
We first evaluate the performance to compare fail-free and
fixed failure conditions as shown in Figure 5. We can see
that when there is no failure, the throughput fluctuates near
the input rate for all four mechanisms. However, when we inject a failure at 30 s, we can see the difference
as shown in Figure 5b. The ck-only+DRA mechanism has a
throughput gap after the failure injection as the primary needs
time to recover from the failure. After the recovery is done
within about 10 seconds, the throughput gradually becomes
normal. For the RPPG+DRA mechanism, it uses our proposed
mechanism to generate the physical plan but uses the DRA
to schedule the tasks. Here, we can see that there is also a
drop after the failure but the throughput is about half of the
input rate, unlike ck-only+DRA which does not output anything during the recovery.
[Fig. 9: Resource utilization — (a) CPU, (b) memory, and (c) network.]
As RPPG+DRA replicates some of the
operators and the scheduling places the primary and replication
on one node, the failure of the primary influences some of
the replications (if they are placed in one worker). For full-
rep+DRA and RPPG+RAS, the throughput does not change
significantly during the primary recovering from failure. In this
experiment, we can see that applying only the adaptive fault
tolerance does not solve the problem entirely. We also need
to place the components appropriately to avoid the influence
of the correlated failures to further decrease the influence of
the failure on the application.
Next, we evaluate the latency of the application by applying
the same four mechanisms. We change the input rates in
these experiments as shown in Figure 6. The bars illustrate
the average latency and the ticks represent the 90% confi-
dence interval of the latency. We can see that our method RPPG+RAS performs similarly to full-rep+DRA over the overall runtime, with an average latency of around 2.5 seconds as shown in Figure 6a. However, ck-only+DRA performs similarly to RPPG+DRA, with latencies around or above 5 seconds. When comparing the latency during recovery, the difference becomes larger, as shown in Figure 6b. We can see that the average latency during recovery increases to more than 5 seconds for all the mechanisms except RPPG+RAS, even for full-rep+DRA. The reason is that a bad placement of the replication reduces the effectiveness of the application when there is a failure. This experiment shows the advantage of our method on the latency metric: it achieves both lower latency and higher stability compared with the other three mechanisms, whether considering the latency distribution over the overall runtime or only during recovery.
In addition, we compare success rates in Figure 7. We
observe that the result is similar to the one shown in Figure 5.
When there is a failure, our methods RPPG+RAS and full-
rep+DRA are not influenced and hence, they obtain 100%
success rate for different input rates. Here, RPPG+DRA ob-
tains only around 95% success rate which is higher than the
checkpoint only mechanism that has around 80% success rate.
Next, we compare the throughput of the mechanisms in
Figure 8 when the input rate increases. We can see that
RPPG+RAS and full-rep+DRA achieve similar throughput
when input rate increases and it matches the input rate.
However, the RPPG+DRA and ck-only+DRA achieve lower
throughput than the input rate which may lead to either loss
of data or delayed output.
Finally, we evaluate the resource utilization of the four
mechanisms as shown in Figure 9. We can see that full-
rep+DRA uses significantly more resources than the other
three techniques in all the CPU, memory, and network usages.
The CPU resource utilization shown in Figure 9a increases
from 12.5 cores to 19 cores when the input rate increases.
The rest of the techniques are similar to each other with
CPU utilization ranging from around 10 cores to around 13.5
cores. Overall, the CPU resource usage of full-rep+DRA is up to 40% higher than that of the other three mechanisms. The
result is similar in memory and network usage as shown in
Figure 9b and 9c. Comparing our method RPPG+RAS to
the ck-only+DRA and RPPG+DRA, the CPU usage is similar
but the network usage is higher as RAS schedules the active
replications to different nodes which increases the network
usage but decreases the influence of correlated failures.
In summary, the proposed method RPPG+RAS combines generating an appropriate resilient physical plan, which covers the operators using fewer resources than full replication, with consideration of the application's latency requirement. It achieves latency during recovery similar to that of full replication, and better performance during failure than RPPG+DRA, which only optimizes the physical plan and lacks the scheduling optimization.
VI. RELATED WORK
Fault tolerance is a well-studied topic in the context of cloud
computing and Big Data techniques. In stream processing
systems, fault tolerance mechanisms have primarily focused
on developing two kinds of solutions namely (i) checkpointing
and relaying techniques [9] that have low resource overhead
and higher recovery time and (ii) active replication techniques
[11] that incur high resource cost and lower recovery time.
These solutions do not optimize for latency and recovery time
requirements that are critical in edge-based IoT applications.
Su and Zhou [19] proposed a hybrid solution employing both
checkpointing and active replication by selectively choosing
the operators to be actively replicated using a minimal comple-
tion tree. A key limitation of this approach is that the operators
that are actively replicated may not fail simultaneously which
leads to higher resource usage cost. Heinze et al. [12] proposed
an adaptive mechanism that enables the operator to switch
between active and reserved status. The adaptive mechanism
proposed by Upadhyaya et al. [18] provides fault tolerance for
database queries by optimizing recovery latency. The hybrid
solution proposed by Zhang et al. [17] allows the operator
to change from passive backup to active replication when
failure happens. Martin et al. [13] proposed an adaptive hybrid
mechanism, which can alternate between six fault tolerance
schema based on user-defined recovery time and cost. The
adaptive hybrid mechanism for HPC systems proposed by
Subasi et al. [16] selects partial tasks to be replicated using
active replication. None of the above mentioned adaptive
and hybrid mechanisms jointly consider the resilient physical
plan generation, recovery cost, processing latency requirement
and operator scheduling during failure as considered in our
paper. As a result, these techniques will result in longer
recovery times during failure when applied to edge-based
stream processing systems.
VII. CONCLUSION
Edge computing provides a promising approach for efficient
processing of low latency stream data generated close to
the edge of the network. Although current distributed stream
processing systems offer some form of fault tolerance, existing
schemes are not optimized for edge computing environments
where applications have strict latency and recovery time re-
quirements. In this paper, we present a novel resilient stream
processing framework that achieves system-wide fault toler-
ance while meeting the latency requirement for edge-based
applications. The proposed approach employs a novel resilient
physical plan generation for stream queries and optimizes the
placement of operators to minimize the processing latency
during recovery and reduce the overhead of checkpointing de-
lays. The proposed techniques are evaluated by implementing
a prototype in Apache Storm [6] and the results demonstrate
the effectiveness and scalability of the approach.
ACKNOWLEDGMENT
This work is partially supported by an IBM Faculty award
for Balaji Palanisamy.
REFERENCES
[1] “The internet of things: A movement, not a market.” [Online].
Available: https://ihsmarkit.com/Info/1017/Internet-of-things.html
[2] M. Satyanarayanan, “The emergence of edge computing,” Computer,
vol. 50, no. 1, pp. 30–39, 2017.
[3] Z. Amjad, A. Sikora, B. Hilt, and J.-P. Lauffenburger, “Low latency
v2x applications and network requirements: Performance evaluation,”
in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp.
220–225.
[4] S.-I. Sou and O. K. Tonguz, “Enhancing vanet connectivity through
roadside units on highways,” IEEE transactions on vehicular technology,
vol. 60, no. 8, pp. 3586–3602, 2011.
[5] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its
role in the internet of things,” in Proceedings of the first edition of the
MCC workshop on Mobile cloud computing. ACM, 2012, pp. 13–16.
[6] “Apache storm,” accessed August 24, 2020. [Online]. Available:
https://storm.apache.org/
[7] “Apache flink,” accessed August 24, 2020. [Online]. Available:
https://flink.apache.org/
[8] H. Wang, L.-S. Peh, E. Koukoumidis, S. Tao, and M. C. Chan,
“Meteor shower: A reliable stream processing system for commodity
data centers,” in 2012 IEEE 26th International Parallel and Distributed
Processing Symposium. IEEE, 2012, pp. 1180–1191.
[9] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and
K. Tzoumas, “Apache flink: Stream and batch processing in a single
engine,” Bulletin of the IEEE Computer Society Technical Committee
on Data Engineering, vol. 36, no. 4, 2015.
[10] X. Liu, A. Harwood, S. Karunasekera, B. Rubinstein, and R. Buyya,
“E-storm: Replication-based state management in distributed stream
processing systems,” in 2017 46th International Conference on Parallel
Processing (ICPP). IEEE, 2017, pp. 571–580.
[11] J.-H. Hwang, S. Cha, U. Cetintemel, and S. Zdonik, “Borealis-r: a
replication-transparent stream processing system for wide-area moni-
toring applications,” in Proceedings of the 2008 ACM SIGMOD inter-
national conference on Management of data, 2008, pp. 1303–1306.
[12] T. Heinze, M. Zia, R. Krahn, Z. Jerzak, and C. Fetzer, “An adaptive
replication scheme for elastic data stream processing systems,” in
Proceedings of the 9th ACM International Conference on Distributed
Event-Based Systems, 2015, pp. 150–161.
[13] A. Martin, T. Smaneoto, T. Dietze, A. Brito, and C. Fetzer, “User-
constraint and self-adaptive fault tolerance for event stream processing
systems,” in 2015 45th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks. IEEE, 2015, pp. 462–473.
[14] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide,
L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang,
G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and
P. Mishra, “The design and operation of CloudLab,” in Proceedings of
the USENIX Annual Technical Conference (ATC), Jul. 2019, pp. 1–14.
[Online]. Available: https://www.flux.utah.edu/paper/duplyakin-atc19
[15] L. Li, Z. Jin, G. Li, L. Zheng, and Q. Wei, “Modeling and analyzing
the reliability and cost of service composition in the iot: A probabilistic
approach,” in 2012 IEEE 19th International Conference on Web Services.
IEEE, 2012, pp. 584–591.
[16] O. Subasi, O. Unsal, and S. Krishnamoorthy, “Automatic risk-based
selective redundancy for fault-tolerant task-parallel hpc applications,”
in Proceedings of the Third International Workshop on Extreme Scale
Programming Models and Middleware, 2017, pp. 1–8.
[17] Z. Zhang, Y. Gu, F. Ye, H. Yang, M. Kim, H. Lei, and Z. Liu, “A
hybrid approach to high availability in stream processing systems,” in
2010 IEEE 30th International Conference on Distributed Computing
Systems. IEEE, 2010, pp. 138–148.
[18] P. Upadhyaya, Y. Kwon, and M. Balazinska, “A latency and fault-
tolerance optimizer for online parallel query plans,” in Proceedings of
the 2011 ACM SIGMOD International Conference on Management of
data, 2011, pp. 241–252.
[19] L. Su and Y. Zhou, “Passive and partially active fault tolerance for
massively parallel stream processing engines,” IEEE Transactions on
Knowledge and Data Engineering, vol. 31, no. 1, pp. 32–45, 2017.
[20] M. K. Aguilera, W. Chen, and S. Toueg, “Heartbeat: A timeout-free
failure detector for quiescent reliable communication,” in International
Workshop on Distributed Algorithms. Springer, 1997, pp. 126–140.
[21] F. Salfner, M. Lenk, and M. Malek, “A survey of online failure prediction
methods,” ACM Computing Surveys (CSUR), vol. 42, no. 3, pp. 1–42,
2010.
[22] K. Zhang, J. Xu, M. R. Min, G. Jiang, K. Pelechrinis, and H. Zhang,
“Automated it system failure prediction: A deep learning approach,” in
2016 IEEE International Conference on Big Data (Big Data). IEEE,
2016, pp. 1291–1300.
[23] V. Belenko, V. Chernenko, V. Krundyshev, and M. Kalinin, “Data-driven
failure analysis for the cyber physical infrastructures,” in 2019 IEEE
International Conference on Industrial Cyber Physical Systems (ICPS).
IEEE, 2019, pp. 1–5.
[24] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and
I. Stoica, “Dominant resource fairness: Fair allocation of multiple
resource types.” in Nsdi, vol. 11, no. 2011, 2011, pp. 24–24.
[25] Y. Zhang, R. Power, S. Zhou, Y. Sovran, M. K. Aguilera, and J. Li,
“Transaction chains: achieving serializability with low latency in geo-
distributed storage systems,” in Proceedings of the Twenty-Fourth ACM
Symposium on Operating Systems Principles, 2013, pp. 276–291.
[26] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and
M. Seltzer, “Network-aware operator placement for stream-processing
systems,” in 22nd International Conference on Data Engineering
(ICDE’06). IEEE, 2006, pp. 49–49.
[27] O. Sefraoui, M. Aissaoui, and M. Eleuldj, “Openstack: toward an open-
source solution for cloud computing,” International Journal of Computer
Applications, vol. 55, no. 3, pp. 38–42, 2012.
[28] U. Hunkeler, H. L. Truong, and A. Stanford-Clark, “Mqtt-s—a pub-
lish/subscribe protocol for wireless sensor networks,” in 2008 3rd
International Conference on Communication Systems Software and
Middleware and Workshops (COMSWARE’08). IEEE, 2008, pp. 791–
798.