Optimally Self-Healing IoT Choreographies
JAN SEEGER, TU München
ARNE BRÖRING, Siemens AG
GEORG CARLE, TU München
In the industrial Internet of ings domain, applications are moving from the Cloud into the edge, closer to
the devices producing and consuming data. is means applications move from the scalable and homogeneous
cloud environment into a constrained heterogeneous edge network. Making edge applications reliable enough
to fulll Industrie 4.0 use cases is still an open research challenge. Maintaining operation of an edge system
requires advanced management techniques to mitigate the failure of devices. is paper tackles this challenge
with a twofold approach: (1) a policy-enabled failure detector that enables adaptable failure detection and (2)
an allocation component for the ecient selection of failure mitigation actions. We evaluate the parameters
and performance of our failure detection approach and the performance of an energy-ecient allocation
technique, and present a vision for a complete system as well as an example use case.
CCS Concepts: • Networks → Network manageability; Programmable networks; • Theory of computation → Description logics;
Additional Key Words and Phrases: IOT, optimization, failure detection
ACM Reference format:
Jan Seeger, Arne Bröring, and Georg Carle. 2016. Optimally Self-Healing IoT Choreographies. 1, 1, Article 1 (January 2016), 18 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
IoT application deployments are currently mostly cloud-based, with a central processing component provided with data via remote sensors and actuators. This centralized cloud structure limits the possible applications for latency and confidentiality reasons. As a response, processing is moving back into the edge of the network, or into the sensors and actuators themselves. Centralized cloud systems are being supplanted by edge systems in latency- and privacy-sensitive applications. Industrial applications have stringent requirements in these fields, with low latency necessary for monitoring and control applications, and confidentiality necessary for economic reasons.
Shi et al. [29] define the term “edge” as “any computing and network resources along the path between data sources and cloud data centers”. We use this definition, but focus on edge networks that are close to the data producers and consumers and fully under the control of the application owner. Here, applications consist of multiple chained tasks, which can be distributed over several edge nodes to enable collaboration, e.g., to process a complex algorithm or AI pipeline in a distributed and coordinated way. A key challenge is the heterogeneity of industrial edge networks, which consist of various kinds of nodes (e.g., industrial PCs, HMI units, network switches, or field devices
such as simple sensors and actuators). These nodes have varying computational and communication capacities. Hence, operating a distributed application reliably and efficiently requires intelligent management approaches.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2016 ACM. XXXX-XXXX/2016/1-ART1 $15.00
, Vol. 1, No. 1, Article 1. Publication date: January 2016.
arXiv:1907.04611v1 [cs.NI] 10 Jul 2019
In this paper, we focus on (1) efficiently detecting failures of devices and software components using accrual-based failure detection augmented with policies, and (2) automatically mitigating failures by finding an optimal allocation of application tasks, e.g., towards minimized energy consumption of the system. This work describes the latest findings on our research agenda to enable distributed IoT choreographies. Our path began with the introduction of the “Recipe” concept for defining IoT application templates [31], continued with our work on improving the runtime management of such recipes by handling them as service choreographies [26], and most recently defined a mechanism for the dynamic and resilient management of IoT choreographies [27].
A “Recipe” describes an application template as a graph of abstract IoT tasks (e.g., device services such as “stream video” or “notify user”) where data flows along the edges of the graph, and tasks are executed when they have received enough input data. An example recipe use case implementing a vibration analysis for rotating machinery is described in more detail in Section 5.
Recipe tasks are generic, and a concrete recipe is derived by “instantiating” these generic tasks with concrete implementations that are available in the system. Using this mechanism, parts of the application can be replaced automatically when a failure is detected. What was missing in our previous work is an evaluation of the “quality” of these replacements. So far, the first replacement available was chosen without regard to its effect on the properties of the system. Relevant properties are, for example, the end-to-end latency of the application or the energy usage. By optimizing the placement of tasks on devices, we can optimize the properties of a system even when devices fail. Building on our previous work, we present here a thorough examination of our failure detector and an evaluation of its memory and processing usage, as well as a detailed description and evaluation of our mechanism to optimize the assignment of tasks to devices for minimal energy usage.
2 BACKGROUND & RELATED WORK
In this section, we describe the context of our work. Section 2.1 introduces relevant works in the field of composing services (and IoT devices) into applications. Section 2.2 provides an overview of mechanisms for detecting failures in distributed systems. Section 2.3 presents related work on optimal placements of system operators.
2.1 IoT Composition
The composition of web services has been extensively researched [28]. Service composition can be classified into two types, service orchestration and service choreography, based on the manner in which the participant services interact [28]. Productive solutions for web services composition typically follow the orchestration approach.
In the IoT domain, there are established composition systems with a broad user community such as “If This Then That”¹ and Node-RED². These tools use simple composition techniques that are executed centrally as orchestrations. These platforms target mainstream users and lack systematic engineering support, which leads, e.g., to widely duplicated recipes, as shown by Ur [33].
Giang et al. [9] focus on application-level distributed choreographies by building on Node-RED as a visual programming tool. However, they do not address the configuration of critical automation systems and their need for failure detection and recovery.
1hp://i.com
2hp://nodered.org
Fig. 1. Example of a recipe combining multiple device services for object detection (Source: [25]).
Khan et al. [12] propose a reliable infrastructure for IoT compositions, but focus on communication of data instead of application-level orchestrations. Thuluva et al. [32] employ Semantic Web technologies to enable low-effort engineering of industrial IoT applications. However, they do not focus on the runtime aspects and dynamic reconfiguration or failure detection. Focusing on building automation, Ruta et al. [20] present a multi-agent framework that uses semantic technologies and automated reasoning to enable device discovery and orchestration of IoT components. Their approach does not address failure handling or mitigation.
This work builds on our previous works [26, 27, 31] that present an IoT composition as a “Recipe”, i.e., separate from its implementation. A semi-automated service composition and instantiation tool assists the user in creating the composition. IoT choreographies are described by a directed graph of connected abstract application components, with data flowing along the edges between these components. During the instantiation phase, each abstract component is replaced with a concrete one. The approach of this work extends this concept by rerunning the instantiation algorithm when a failure is detected, to find another component that can fulfill the functionality of the failed component.
Figure 1 shows an example of a recipe that combines multiple services of devices in an intrusion detection system. The green boxes are ingredients that need to be replaced with concrete components when the recipe is instantiated. Ingredients are connected via their outputs and inputs. In this example, video and audio streams are connected to analytics components that feed into an aggregating intrusion detector and finally a component that is able to send notifications. The recipe designer can further specify application-level constraints (e.g., minimum video frame rate) on the interactions between sensors and analysis services [25].
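The recipe idea described above can be illustrated with a minimal sketch. It assumes that a recipe is a set of ingredient names plus directed edges, and it replaces the semantic matching of the cited works with a plain dictionary lookup that picks the first available candidate. The names `Recipe`, `instantiate`, `offerings` and all device identifiers are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Recipe:
    """A recipe: abstract ingredients plus directed data-flow edges."""
    ingredients: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, sink) pairs

def instantiate(recipe, offerings, failed=frozenset()):
    """Bind each ingredient to the first candidate that has not failed.

    `offerings` maps an ingredient name to candidate device services
    (a stand-in for semantic matching, which is NOT modelled here).
    Returns None if some ingredient cannot be instantiated.
    """
    binding = {}
    for ingredient in sorted(recipe.ingredients):
        candidates = [c for c in offerings.get(ingredient, []) if c not in failed]
        if not candidates:
            return None
        binding[ingredient] = candidates[0]
    return binding

# Hypothetical intrusion-detection recipe in the spirit of Figure 1.
recipe = Recipe(
    ingredients={"video", "audio", "video_analytics",
                 "audio_analytics", "detector", "notifier"},
    edges=[("video", "video_analytics"), ("audio", "audio_analytics"),
           ("video_analytics", "detector"), ("audio_analytics", "detector"),
           ("detector", "notifier")],
)
offerings = {
    "video": ["camera-1", "camera-2"],
    "audio": ["mic-1"],
    "video_analytics": ["edge-pc-1"],
    "audio_analytics": ["edge-pc-1"],
    "detector": ["edge-pc-1", "edge-pc-2"],
    "notifier": ["gateway-1"],
}
first = instantiate(recipe, offerings)
# Rerunning the instantiation after a failure yields a replacement binding.
healed = instantiate(recipe, offerings, failed={"camera-1"})
```

Rerunning `instantiate` with the failed device excluded mirrors the re-instantiation step; choosing the first candidate corresponds to the earlier behavior that did not yet weigh the quality of a replacement.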
2.2 Failure Detection
Failure detection is an essential building block for distributed systems. Without a suitable failure detector, distributed applications are generally not guaranteed to complete successfully. Chandra et al. describe the theory of failure detection in [3] and define two properties of failure detectors: completeness and accuracy. Completeness describes the property of a failure detector to correctly detect failure, while accuracy describes the capability of a detector to not detect failure on nodes functioning correctly.
Our research is based on φ-accrual failure detection, which is described in detail by Défago et al. [7]; more recent implementations include Satzger et al. [24] and Liu et al. [16]. Accrual failure detectors calculate the probability of a node having failed from the distribution of inter-arrival times of received failure detection messages. Accrual failure detectors are thus strongly complete (there is some time after which all failed processes are permanently marked failed by all other processes) and eventually strongly accurate (there is some time after which correct processes are not marked failed by other correct processes).
In IoT environments, failures have to be detected on all involved devices and services in order to ensure reliability. Kodeswaran et al. [13] present a system for efficient failure management in smart home environments based on tracking the most performed activities. This knowledge is used to predict future degradation and failures of involved devices. However, their work is restricted to the smart home domain and not directly usable in general IoT or industrial IoT cases.
The Gaia framework [5] allows building pervasive systems and also includes failure detection. However, the implemented failure detector is part of a central controller, i.e., the availability of the controller has to be ensured for the failure detection to work. Our approach runs on the distributed nodes and does not require communication with a central controller to detect a failure.
In [6], an unreliable failure detector is presented that enables the definition of an impact factor for the involved nodes. This allows tuning the performance of the failure detector for specific application needs.
Guclu et al. [10] present a distributed failure detector that builds on trust management. Their method evaluates the trustworthiness of the data from neighboring nodes. However, their approach is limited to homogeneous networks and similarly structured data. This makes it inapplicable to our scenarios of heterogeneous edge and industrial IoT environments.
2.3 Optimal Allocation
Eciently allocating application tasks of a recipe to available edge devices is comparable to a
widely studied research problem in the distributed systems eld: the optimal operator placement of
distributed stream processing applications, or the optimal selection of networked devices for tasks
or computations of a chained process or workow. When either a hardware or soware failure
occurs, an application component has failed, and this failure needs to be mitigated.
Tasks have dierent parameters, depending on the optimization target. e result of the task
allocation problem is an allocation, an assignment of tasks to devices, that fullls the constraints,
and improves the performance of the system in some metric.
An overview of existing allocation approaches for stream processing is given in [15].
Based on Constraint Programming, Haubenwaller & Vandikas [11] describe an approach for the efficient distribution of actors (processing tasks) to IoT devices. The approach resembles the Quadratic Assignment Problem and is NP-hard, resulting in long computation times when scaling up. Samie et al. [22] present another Constraint Programming-based approach that takes into account the bandwidth limitations and minimizes the energy consumption of IoT nodes. The system optimizes computation offloading from an IoT node to a gateway; however, it does not consider composed computations that can be distributed to multiple devices.
A Game Theory-based approach is presented in [23] that aims at the joint optimization of radio and computational resources of mobile devices. The system seeks a local optimum for multiple users; however, it only aims at deciding whether to fully offload a computation or to fully process it on the device.
Based on Non-linear Integer Programming, Sahni et al. [21] present their Edge Mesh algorithm for task allocation that optimizes overall energy consumption and considers data distribution, task dependency, embedded device constraints, and device heterogeneity. However, only basic evaluation and experimentation are done and no performance comparison has been performed.
Based on Integer Linear Programming (ILP), Mohan & Kangasharju [17] propose a task assignment solver that first minimizes the processing cost and secondly optimizes the network cost, which stems from the assumption that Edge resources may not be highly processing-capable. An intermediary step reduces the sub-problem space by combining tasks and jobs with the same associated costs. This reduces the overall processing costs.
Cardellini et al. [2] describe a comprehensive ILP-based framework for optimally placing operators of distributed stream processing applications, while being flexible enough to be adjusted to other application contexts. Different optimization goals are considered, e.g., application response time and availability. They propose their solution as a unified general formulation of the optimal placement problem and provide a strong theoretical foundation. The framework is flexible so that it can be extended by adding further constraints or shifted to other optimization targets. Hence, we utilize their framework and extend it by incorporating further constraints for our optimization goal, the overall energy usage.
3 A FAILURE DETECTOR FOR SELF-HEALING IOT CHOREOGRAPHIES
In this section, we describe our failure detector PE-FD and its properties in comparison to other failure detectors, and present our policy concept to tune the failure detector for specific application requirements.
3.1 The PE-FD Failure Detector
Failure detection is a crucial functionality for a distributed system that should operate reliably. Without information on the current state of components, a mitigator cannot decide what (if any) action to take to continue operation of the system. In this work, we focus on crash failures [30], where devices and software work correctly until they fail permanently. For efficient failure detection in IoT Choreographies, we have developed our Policy-Enabled Failure Detector (PE-FD). It is based on the principle of φ-accrual failure detection [7] and augmented with support for “policies”, where parameters of the failure detection algorithm are adjusted according to application requirements.
In general, φ-accrual failure detectors are unreliable, meaning that errors in their output are permissible, but after some point, their output is always correct. In contrast to “traditional” failure detectors (e.g., heartbeat-based or adaptive [1, 4]), φ-accrual detectors compute a suspicion function $\varphi$ that describes the probability of a node having crashed. This probability is computed by estimating the distribution of the inter-arrival times of heartbeats, and computing the probability of a new heartbeat arriving after the current time. With the current time being $t_{now}$, and the time of the last timestamp’s arrival being $t_{last}$, the suspicion is thus given by the formula $\varphi(t_{now}) = -\log_{10}(P_{later}(t_{now} - t_{last}))$, where $P_{later}$ is computed from the inferred distribution of timestamp inter-arrival times.
Failure detectors dier in the parameters and implementation of estimating
Plater
, either storing
all inter-arrival times for every heartbeat and using the empirical distribution function, or assuming
a certain distribution for inter-arrival times and estimating the parameters of the distribution. Our
, Vol. 1, No. 1, Article 1. Publication date: January 2016.
1:6 Jan Seeger, Arne Br¨
oring, and Georg Carle
PE-FD failure detector computes suspicion in constant time and space by taking advantage of the
one-sided Chebyshev inequality together with empirical estimators for both ϕand µ.
ρn=
n
Õ
i=1
xi=ρn1+xi(1)
κn=
n
Õ
i=1
x2
i=κn1+x2
n(2)
µn=1
nρn(3)
σ2
n=1
n1
n
Õ
i=1
(Xiµ)2=1
n1κnnµ2
n(4)
Plater =P[X>now] ≤ σ2
σ2+(Tnow µ)2(5)
Plater =0|now <mu (6)
In Equations (1) and (2), we define two helper variables that store the sum of the timestamps and the sum of the squares of the timestamps. We can then compute the mean $\mu$ by dividing by the number of timestamps, and derive $\sigma^2$ via Equation (4). To calculate mean and variance, we thus need to store three variables ($\rho$, $\kappa$ and $n$), independent of the number of timestamps received. The suspicion can then be calculated without any additional information. To combat overflow and numerical instability, $\rho$ and $\kappa$ are periodically reset after a number of timestamps have been received. We call this number $\omega_{max}$ the “learning window”. However, after resetting, we need a certain number of heartbeats to regain a good estimate for the distribution parameters. Thus, we introduce a parameter $\omega_{min}$ that is the minimum number of heartbeats that needs to be received until the new estimate is used.
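import collections
import numpy as np

# Assumed state container (not defined in the original listing):
# rho = sum of delays, kappa = sum of squared delays, n = sample count.
IotaState = collections.namedtuple("IotaState", ["rho", "kappa", "n"])

def updateState(state, delay):
    # Constant-time update of the running sums, Equations (1) and (2).
    return IotaState(state.rho + delay, state.kappa + delay**2, state.n + 1)

def suspicion(state, time):
    # The variance estimate needs at least two samples.
    if state.n == 0 or state.n == 1:
        return 0
    mu = state.rho / state.n
    if time < mu:
        return 0
    # "sigma" holds the variance estimate of Equation (4).
    sigma = (state.kappa - state.n * (mu * mu)) / (state.n - 1)
    return -np.log10(sigma / (sigma + (time - mu)**2))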
Listing 1. Suspicion calculation and state update for the PE-FD failure detector.
A Python implementation of the suspicion algorithm is included in Listing 1. It can be seen that the calculation of the next state takes a constant number of computations (three additions and one multiplication). The necessary operations to calculate a suspicion value are three divisions, three multiplications and four additions. The number of computations is independent of the chosen parameters. With extremely large values for the learning window, the variables $\rho$ and $\kappa$ could overflow, but this is not a realistic constraint for 32-bit variables.
φ-accrual failure detectors can be converted into binary failure detectors by choosing a threshold $u$, and marking a process as failed when the suspicion rises above this threshold. We describe the choices for $u$ in the next section, and the behavior of our PE-FD failure detector is evaluated in detail in Section 6.1.
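As a worked example of the conversion to a binary detector, the sketch below evaluates the Chebyshev-based suspicion for given estimates and compares it against a threshold. The function names and the numeric values (mean 1.0 s, variance 0.01 s², thresholds) are illustrative assumptions; the variance is assumed to be strictly positive.

```python
import math

def chebyshev_suspicion(mu, variance, elapsed):
    """Suspicion from the one-sided Chebyshev bound on P_later.

    Assumes variance > 0; below the mean the suspicion is zero.
    """
    if elapsed < mu:
        return 0.0
    p_later = variance / (variance + (elapsed - mu) ** 2)
    return -math.log10(p_later)

def is_failed(mu, variance, elapsed, u):
    """Binary verdict: failed once the suspicion exceeds the threshold u."""
    return chebyshev_suspicion(mu, variance, elapsed) > u

# Heartbeats arrive every 1.0 s on average with variance 0.01 s^2.
# After 2.0 s of silence: P_later <= 0.01 / 1.01, so phi = log10(101).
phi = chebyshev_suspicion(mu=1.0, variance=0.01, elapsed=2.0)
```

For these values the suspicion is slightly above 2, so a threshold of `u = 1.0` would mark the node as failed, while the same threshold would not trigger after only 1.1 s of silence (suspicion of about 0.3).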
3.2 Policies for application-tuned failure detection
The PE-FD failure detector provides a number of adjustable parameters:
• The minimum number of heartbeats required for an estimate ($\omega_{min}$)
• The maximum number of heartbeats until the current estimate is reset ($\omega_{max}$)
• The heartbeat period ($t_{heartbeat}$)
• The suspicion threshold ($u$)
$\omega_{min}$ and $\omega_{max}$ are called the “learning window” in combination. These parameters give our algorithm a wide range of adaptability. By adjusting these parameters based on policies that take the structure and requirements of applications into account, failure detection can be improved over a “one-size-fits-all” approach.
The minimum and maximum number of heartbeats $\omega_{min}$ and $\omega_{max}$ are relevant for nodes with changing network conditions. For example, a mobile node can benefit from a lower maximum and minimum number of heartbeats, so the failure detection algorithm can adapt to changing network conditions with fewer received heartbeats. The complementary adjustment is possible as well: for wired nodes, increasing the size of the learning window allows them to ignore transient failures and keep the application working.
When failure detection is used in a web service composition such as those described in Section 2.1, the structure of the composition can be inspected to modify the suspicion threshold and the heartbeat period. When a task is central to an application, and no replacement is available, the heartbeat period should be set low to allow quick detection of failures. The suspicion threshold $u$ should be set relatively high, so as not to cause false positives.
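The policy rules described above could be encoded along the following lines. This is only a sketch: the `Policy` container, the `select_policy` function, its two boolean inputs, and every numeric value are hypothetical placeholders and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    omega_min: int           # heartbeats needed before an estimate is used
    omega_max: int           # heartbeats until the estimate is reset
    heartbeat_period: float  # seconds between heartbeats
    threshold: float         # suspicion threshold u

def select_policy(mobile, critical_without_replacement):
    """Pick detector parameters from two coarse application traits.

    All numbers are illustrative placeholders, not values from the paper.
    """
    # Mobile nodes: small learning window to follow changing conditions.
    # Wired nodes: large learning window to ride out transient failures.
    omega_min, omega_max = (5, 50) if mobile else (20, 1000)
    if critical_without_replacement:
        # Detect quickly, but demand high suspicion to avoid false positives.
        period, threshold = 0.5, 3.0
    else:
        period, threshold = 2.0, 2.0
    return Policy(omega_min, omega_max, period, threshold)

mobile_critical = select_policy(mobile=True, critical_without_replacement=True)
wired_default = select_policy(mobile=False, critical_without_replacement=False)
```

In a real deployment, the two boolean inputs would presumably be derived from the structure of the recipe and the network description rather than set by hand.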
4 OPTIMAL MITIGATION OF FAILURES IN IOT CHOREOGRAPHIES
In this section, we describe how our failure detector is combined with a task assignment approach to form an optimal self-healing procedure for IoT choreographies, and we describe the details of allocating tasks to optimize the overall energy usage.
4.1 Optimal Self-Healing Procedure
IoT applications are becoming more prevalent in various domains, e.g., smart homes and buildings, industrial manufacturing, transportation, or healthcare. Such IoT applications often consist of multiple tasks that interact in the form of a dataflow graph, where components exchange data along directed edges. A simple but popular execution engine for such applications is “IFTTT” (Section 2.1). With the recipe concept (Section 2.1, [31]), we have defined a schema for the expression of such dataflow graphs. These recipes are executed in a distributed fashion, and can be dynamically replaced and reconfigured (see [26]). With the growing scale of IoT device deployments and applications, such dynamic replacement and reconfiguration will become more important. Additionally, IoT applications are penetrating more and more crucial areas (e.g., patient monitoring or optimization of industrial processes), i.e., failures may have a large impact and the ability to reconfigure the system dynamically is crucial.
In our previous work, the chosen replacement component was not evaluated for its quality, besides fulfilling the obvious functional requirements. Hence, we aim here at evaluating the replacement of devices with regard to a defined quality metric and thereby improve the operating parameters of the application and the network. This is especially important with long-running and resource-constrained processes.
Combined with the failure detection presented in Section 3, the steps of our optimal self-healing procedure for an IoT orchestration are as follows:
(1) Detect software or device failure
(2) Find functionally matching replacement devices
(3) Optimally assign application tasks to available network nodes
(4) Reconfigure application with replacement device and software
Everything in that application is allocated. We can efficiently detect software or device failure (1) via the failure detection algorithm described in Section 3, and tune the detection according to the application requirements with policies as described in Section 3.2. The semantic matching algorithm in [26, 31] can then find a functionally matching replacement component (2). Then, in step (3), we evaluate the placement of recipe components on nodes of the modified graph via the allocation algorithm described in Section 4.2. Step (4), the reconfiguration of the instantiated recipe, then happens as described in [26].
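The four steps above can be sketched as a small control loop. Step (1) is assumed to have already produced the identifier of the failed component; the callables `find_replacements`, `allocate` and `reconfigure` are hypothetical hooks standing in for the semantic matcher, the allocator and the configurator, whose real interfaces are not specified here.

```python
def self_heal(failed, applications, find_replacements, allocate, reconfigure):
    """Run steps (2) to (4) for every application that used `failed`.

    `applications` maps an application name to the set of components it
    uses. The three callables are hypothetical hooks (assumptions).
    Returns the new allocation per affected application, or None where
    no functionally matching replacement exists.
    """
    results = {}
    for app, components in applications.items():
        if failed not in components:
            continue  # application is unaffected by this failure
        candidates = find_replacements(app, failed)   # step (2)
        if not candidates:
            results[app] = None
            continue
        allocation = allocate(app, candidates)        # step (3)
        reconfigure(app, allocation)                  # step (4)
        results[app] = allocation
    return results

# Usage with stub hooks and invented component names.
applied = []
apps = {"vibration": {"sensor-a", "pc-1"}, "other": {"pc-2"}}
result = self_heal(
    "sensor-a",
    apps,
    find_replacements=lambda app, failed: ["sensor-b"],
    allocate=lambda app, candidates: {"sense": candidates[0]},
    reconfigure=lambda app, allocation: applied.append((app, allocation)),
)
```

Only the application that contained the failed component is touched; the stub allocator simply takes the first candidate, whereas the allocation algorithm of Section 4.2 would choose among candidates by energy use.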
4.2 Energy-optimal Task Assignment
As described in Section 2.3, we have based our approach for allocating recipe tasks on [2]. Cardellini et al. evaluate configuring a system for optimal response time and optimal availability. They formulate the allocation of operators as an ILP problem, which they hand over to an IBM CPLEX³ solver to find the optimal solution. We have followed this approach to optimize the energy usage of an IoT application formulated as a recipe.
We define optimality of the allocation by total energy use over one execution of the recipe. Energy during recipe execution is consumed in two phases: “device energy” is consumed by a device when executing a task, and “network energy” is consumed by the device when sending the result of the calculation over the network. The optimal configuration of the network is the assignment of tasks to devices that results in the lowest total consumption of energy and satisfies the constraints. The constraints concern the requirements that an assignment must satisfy: each task must be allocated exactly once, and resource requirements for assigned tasks must not exceed the resources of the node. This problem is a form of the quadratic assignment problem, and thus NP-hard. We have developed and evaluated a heuristic that reduces the problem to a non-quadratic assignment problem, which we describe in Section 4.3.
We dene the energy-optimal task assignment as follows: e recipe
Grcp
consists of a set of tasks
Vrcp
connected by directed edges
Ercp
. e network
Gnet
that tasks can be evaluated on consists
of a set of nodes
Gnet
connected by a set of undirected links
Enet
. e result of the allocation is a
matrix X=Vrcp ×Vnet where X[t,n]=1 if and only if task tis allocated to node n.
Tasks, nodes and links have properties that are relevant for the energy consumption of the
application once allocated. ese parameters are described in Table 1.
St
,
Pn
,
Rn
and
Cn
are dened
as multiples of some reference node. e resources of a node are expressed as a single scalar, but
additional resource requirements can easily be introduced into the model.
For calculating the network energy, we need to know whether a link between two tasks is assigned
to a link between two nodes. For this, we introduce a matrix
Y=Vrcp ×Vrcp ×Vnet ×Vnet
, where
Y[t1,t2,n1,n2]=
1 if and only if the communication between task
t1
and task
t2
is allocated on the
network link between nodes
n1
and
n2
. is corresponds to
X[t1,n1]=
1
X[t2,n2]
. Unfortunately,
this is not a linear constraint, and thus we need to linearize the formulation.
3hps://www.ibm.com/analytics/cplex-optimizer
Symbol          Description
$R_t$           Resources required for the evaluation of task $t \in V_{rcp}$.
$O_t$           Output of task $t \in V_{rcp}$ for a single received input.
$S_t$           Computation time required for completing task $t \in V_{rcp}$ once.
$P_n$           Processing power of node $n \in V_{net}$.
$R_n$           Resources available on node $n \in V_{net}$.
$C_n$           Energy consumption of node $n \in V_{net}$ for one unit of computation.
$T_l$           Energy use for the transfer of one data packet over link $l \in E_{net}$.
$D(n_1, n_2)$   Energy cost of the shortest path between $n_1$ and $n_2$.
Table 1. Parameters of energy-aware allocation algorithm.
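$$\forall t_1, t_2 \in V_{rcp}: \forall n_1, n_2 \in V_{net}: Y[t_1,t_2,n_1,n_2] \le X[t_1,n_1] \tag{7}$$
$$\forall t_1, t_2 \in V_{rcp}: \forall n_1, n_2 \in V_{net}: Y[t_1,t_2,n_1,n_2] \le X[t_2,n_2] \tag{8}$$
$$\forall t_1, t_2 \in V_{rcp}: \forall n_1, n_2 \in V_{net}: Y[t_1,t_2,n_1,n_2] \ge X[t_1,n_1] + X[t_2,n_2] - 1 \tag{9}$$
$$\forall t \in V_{rcp}: \sum_{n \in V_{net}} X[t,n] = 1 \tag{10}$$
$$\forall n \in V_{net}: \sum_{t \in V_{rcp}} X[t,n] \cdot R_t \le R_n \tag{11}$$
$$\sum_{t \in V_{rcp}} \sum_{n \in V_{net}} C_n \cdot (S_t / P_n) \cdot X[t,n] \le \text{device energy} \tag{12}$$
$$\sum_{(t_1,t_2) \in E_{rcp}} \sum_{n_1,n_2 \in V_{net}} O_{t_1} \cdot D(n_1,n_2) \cdot Y[t_1,t_2,n_1,n_2] \le \text{network energy} \tag{13}$$
$$\text{network energy} + \text{device energy} \le \text{total energy} \tag{14}$$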
We follow the formulation presented in [2] and define an ILP model as shown in Equations 7 to 14. Equations 7 to 9 describe the linearization of the network matrix $Y$. Equations 10 and 11 express the “only allocated once” and “resources not exceeded” constraints. Equations 12 and 13 calculate device and network energy as described above. Finally, we calculate the total energy use of the assignment by adding both energies in Equation 14. The objective of the optimization is the minimization of the total used energy.
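That the three inequalities (7) to (9) indeed force $Y$ to equal the logical AND of the two $X$ entries can be checked by enumerating all binary combinations. The helper name `feasible_y` below is an illustrative assumption.

```python
from itertools import product

def feasible_y(x1, x2):
    """Binary values of y permitted by constraints (7) to (9)."""
    return [y for y in (0, 1)
            if y <= x1              # constraint (7)
            and y <= x2             # constraint (8)
            and y >= x1 + x2 - 1]   # constraint (9)

# For every binary combination, exactly one y remains: y = x1 AND x2.
table = {(x1, x2): feasible_y(x1, x2) for x1, x2 in product((0, 1), repeat=2)}
```

Each of the four entries of `table` contains a single value, namely the product of the two inputs, so the linear constraints leave the solver no freedom in choosing $Y$ once $X$ is fixed.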
We implemented this model in a Python⁴ script using the PuLP⁵ linear programming library. We can then find a solution for the problem using the CPLEX solver, which uses a branch-and-bound approach [19]. For a discussion of the benchmark results, see Section 6.2.
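To make the objective concrete without depending on a solver, the sketch below evaluates the energy terms of Equations 12 and 13 for explicit allocations and finds the best one by exhaustive search. This brute-force search is a stand-in for illustration only; it is not the ILP/CPLEX approach used in the paper and only works for tiny instances. The dictionary layout, the function names and all numeric values are assumptions.

```python
from itertools import product

def total_energy(alloc, tasks, edges, nodes, dist):
    """Device energy (cf. Eq. 12) plus network energy (cf. Eq. 13)."""
    device = sum(nodes[n]["C"] * tasks[t]["S"] / nodes[n]["P"]
                 for t, n in alloc.items())
    network = sum(tasks[t1]["O"] * dist[alloc[t1]][alloc[t2]]
                  for t1, t2 in edges)
    return device + network

def feasible(alloc, tasks, nodes):
    """Resource constraint (cf. Eq. 11); Eq. 10 holds by construction."""
    used = {n: 0 for n in nodes}
    for t, n in alloc.items():
        used[n] += tasks[t]["R"]
    return all(used[n] <= nodes[n]["R"] for n in nodes)

def brute_force(tasks, edges, nodes, dist):
    """Try every task-to-node assignment and keep the cheapest feasible one."""
    best, best_energy = None, float("inf")
    names = sorted(tasks)
    for choice in product(sorted(nodes), repeat=len(names)):
        alloc = dict(zip(names, choice))
        if not feasible(alloc, tasks, nodes):
            continue
        energy = total_energy(alloc, tasks, edges, nodes, dist)
        if energy < best_energy:
            best, best_energy = alloc, energy
    return best, best_energy

# Invented toy instance: two tasks, two nodes, one data-flow edge.
tasks = {"sense": {"R": 1, "O": 4, "S": 1}, "analyze": {"R": 2, "O": 1, "S": 4}}
edges = [("sense", "analyze")]
nodes = {"sensor": {"R": 1, "P": 1, "C": 1}, "pc": {"R": 3, "P": 4, "C": 2}}
dist = {"sensor": {"sensor": 0, "pc": 3}, "pc": {"sensor": 3, "pc": 0}}
best, energy = brute_force(tasks, edges, nodes, dist)
```

In this toy instance, placing both tasks on `pc` avoids all network energy and costs 2.5 units. The search space grows as $|V_{net}|^{|V_{rcp}|}$, which is why an ILP solver or the heuristic of Section 4.3 is needed for realistic problem sizes.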
4.3 A Linear Heuristic for Energy-Optimized Allocation
The quadratic assignment problem described in the previous section is NP-hard and thus compute-intensive. The culprit for this is the network cost calculation and the linearization of $Y$, which results in a large number of constraints. By approximating the network energy, we can get a faster solution, which is however no longer optimal. We quantify the loss of optimality and the speedup in Section 6.2.
4hps://www.python.org/
5hps://pythonhosted.org/PuLP/
Congurator
Knowledge
base
Allocator
PE-FD PE-FD
Network
(2)
(4)
(3)
(5) (1)
Fig. 2. System model of an optimally self-healing IoT choreography
By removing the $Y$ matrix and the associated constraints, we create a linear problem that can be evaluated effectively by the simplex method [18]. Our approach approximates the energy required for sending a packet of data by taking the average of a node’s links. We introduce the parameter $\hat{T}_n = \frac{1}{|outgoing(n)|} \sum_{e \in outgoing(n)} T_e$ that describes the average transmission cost of a node’s links.
$$\sum_{t \in V_{rcp}} \sum_{n \in V_{net}} C_n \cdot (S_t / P_n) \cdot X[t,n] + O_t \cdot \hat{T}_n \cdot X[t,n] \le \text{total energy} \tag{15}$$
The complete model combines constraints (10) and (11) with constraint (15). By transforming the QAP into a linear problem, we greatly increase the speed of finding a solution, and make the optimization feasible for on-line usage.
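The approximated objective can be evaluated for a given allocation as sketched below. The sketch assumes that links are undirected and stored as a dictionary keyed by node pairs, that `outgoing(n)` means all links incident to `n`, and that a node without links has an average cost of zero. These choices, the function names and the example values are assumptions made for illustration.

```python
def average_link_cost(links, node):
    """T-hat_n: mean transfer energy over the links attached to `node`.

    Returns 0.0 for an isolated node (an assumption of this sketch).
    """
    costs = [cost for (a, b), cost in links.items() if node in (a, b)]
    return sum(costs) / len(costs) if costs else 0.0

def heuristic_energy(alloc, tasks, nodes, links):
    """Left-hand side of Eq. 15 for a given task-to-node allocation."""
    total = 0.0
    for t, n in alloc.items():
        compute = nodes[n]["C"] * tasks[t]["S"] / nodes[n]["P"]
        transfer = tasks[t]["O"] * average_link_cost(links, n)
        total += compute + transfer
    return total

# Invented toy instance with three nodes and two links.
links = {("sensor", "pc"): 3, ("pc", "gateway"): 1}
nodes = {"sensor": {"P": 1, "C": 1}, "pc": {"P": 4, "C": 2},
         "gateway": {"P": 2, "C": 1}}
tasks = {"sense": {"O": 4, "S": 1}, "analyze": {"O": 1, "S": 4}}
cost = heuristic_energy({"sense": "sensor", "analyze": "pc"}, tasks, nodes, links)
```

Because every term depends on a single (task, node) pair, the objective no longer needs the four-dimensional $Y$ matrix. The price is that the actual communication partner of a task is ignored, which is where the loss of optimality comes from.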
5 SYSTEM MODEL & APPLICATION EXAMPLE
This section presents our implementation of the system for the optimal self-healing of IoT choreographies and describes an example use case for applying the developed system.
Figure 2 shows the integrated system combining our optimal allocator and the PE-FD failure detector. The system consists of three main components, as well as the devices in the network. A knowledge base stores the knowledge about the system, such as available devices, applications and links between devices. One possible storage mechanism for such data would be a semantic triple store such as Apache Jena⁶. With this semantic store, the system can take advantage of semantic reasoning and translation, as described in [25]. The configurator controls the creation of the system and is responsible for configuring devices into a choreography. It is not involved in the regular operation of the system, only in administrative actions (such as reconfiguring applications when devices fail). The allocator is the component responsible for running the allocation algorithms described in Section 4. Finally, the network contains devices that communicate via heterogeneous network links. These devices are running an engine that supports configuration by the configurator.
When a device or soware component fails, this is detected by the PE-FD failure detection
algorithm, and the congurator is informed by the devices that have detected the failure (1). e
congurator retrieves the applications that the failed component was part of from the knowledge
base and nds replacement for these devices that are available in the network (2). en, the set of
6hps://jena.apache.org/
[Figure 3 blocks: Vibration sensing; Data preprocessing; KPI calculation; Postprocessing; Signal-based fault detection; Maintenance decision]
Fig. 3. Wireless vibration analysis use case (based on: Krügel et al. [14])
applications and available devices is passed to the allocator (3), which computes an allocation for tasks and devices, and returns the resulting allocation to the configurator (4), which then applies the new configuration to devices in the network (5).
As motivated above, the usage of such a system for optimally self-healing IoT choreographies is increasingly important. As an example use case for this system, we discuss here the vibration analysis of rotating machinery via vibration sensors as described by Krügel et al. [14]. The structure of the application is shown in Figure 3. Vibration sensors sense the vibrations of a rotating machine (such as an engine or fan). The vibration information is preprocessed for the analysis and fed into a reduced-complexity model of the machine. From this model, the key performance indicators (KPIs) are derived. These KPIs are forces acting on the machine parts. The forces are then postprocessed, and finally, a maintenance decision is made. In parallel, signal-based fault detection based on flags can detect faults.
It is easy to see that these components of the vibration analysis have different requirements for processing power and required resources. The sensor nodes running the analysis are battery powered and wireless, since they need to be non-invasive and placed on machines without introducing extra infrastructure. As such, using an energy-optimal allocation for tasks is important to maximize the runtime of the analysis.
6 EVALUATION
In this section we rst evaluate our failure detector, PE-FD (Section 6.1), as well as the mitigator
component for optimal allocation of tasks (Section 6.2).
6.1 Evaluation of PE-FD Algorithm
By adjusting the threshold u, we can modify the behavior of the failure detection algorithm. A lower threshold u leads to faster detection of failures, while increasing u reduces the number of false positives. Figure 4 shows the behavior of an example failure detection run. We chose normally distributed timestamp arrival times: we generated timestamp inter-arrival times with a mean of 20 seconds and a variance of 5 for 3000 seconds, and then increased the mean of the distribution to 50 seconds for another 3000 seconds. This might happen when the sending node switches into an energy saving mode, or changes its network connection to one with increased latency. We then calculated the detection time (first correct “failed” verdict) and mistake rate (incorrect “failed” verdicts) for thresholds u from 0.1 to 2.0. As seen in Figure 4, increasing the threshold decreases the false positive rate, but increases the detection time.
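The synthetic workload for this experiment can be reproduced roughly as follows. This is a sketch under stated assumptions: the function name `arrival_times`, the fixed seed, keeping the variance at 5 in the second phase, clamping negative gaps to zero and starting the second phase exactly at 3000 seconds are choices made here, not details given in the text. The detector itself is not reproduced.

```python
import math
import random


def arrival_times(mean, variance, duration, start=0.0, rng=random):
    """Timestamps whose inter-arrival gaps are normally distributed.

    Generation stops before the first timestamp that would fall after
    start + duration.
    """
    std = math.sqrt(variance)
    times, now = [], start
    while True:
        gap = max(rng.gauss(mean, std), 0.0)  # assumption: a gap cannot be negative
        if now + gap > start + duration:
            return times
        now += gap
        times.append(now)


rng = random.Random(42)  # fixed seed so the run is repeatable
phase1 = arrival_times(20, 5, 3000, rng=rng)                # mean 20 s for 3000 s
phase2 = arrival_times(50, 5, 3000, start=3000.0, rng=rng)  # mean 50 s for 3000 s
timestamps = phase1 + phase2
print(len(phase1), len(phase2))
```

Feeding `timestamps` to a detector and sweeping the threshold would then give one data point per threshold for each of the two plots in Figure 4.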
Selecting such a threshold can be done with application-level policies based on reconfiguration policies for a node. A node for which replacements are available can be configured with a lower threshold, since replacing it on a false positive will still allow the system to function, and the node will be replaced faster on a “true” failure.
(a) False positive rate vs. threshold (b) Detection time (s) vs. threshold
Fig. 4. Behavior of PE-FD with varying thresholds
Fig. 5. Detection time vs. timestamp interval.
Adjusting the timestamp interval has an impact on the detection time. By lowering the timestamp interval, the detection time is decreased at an increased cost of network traffic.
Additionally, lowering the timestamp interval can drastically decrease battery life for energy-starved nodes, as waking up and sending packets consumes a large amount of energy.
An example of this behavior over 10 runs of PE-FD can be seen in Figure 5. For a period of 1000 seconds, we generated timestamp times that were normally distributed around the sending interval, with a variance of 1 to account for network delays. We configured PE-FD with a threshold of 0.8 and an infinite learning window. We then sampled the suspicion function every 5 seconds and measured the detection time (i.e. the first “true” detection) for the resulting suspicion values.
We thus see that it is advantageous to adjust the timestamp interval of the algorithm dynamically, trading off between the importance of high detection speed (which might be mandated by QoS requirements made by the user) and network and battery efficiency. Policies to adjust the timestamp interval should take into account the “kind” of node they are operating on (wireless or wired network, battery or mains powered) to achieve optimal results.
e nal parameter remaining is the size of the learning window. We evaluated dierent
congurations of PE-FD with varying window sizes. e timestamps for this experiment were
generated with a normal distribution. We sampled timestamp inter-arrival times from
N(
20
,
1
)
for 1000 seconds, from
N(
50
,
1
)
for 1000 seconds, and again from
N(
20
,
1
)
for 500 seconds. is
resulted in a total of 95 timestamps. Figure 6shows the eect of
ωmax
on mistake rate and detection
(a) Detection time vs. ωmax. (b) Mistake rate vs. ωmax.
Fig. 6. Eect of learning window on parameters.
time. e graph shows an interesting trend at an
ωmax
of 50. e timestamp generation generated
approximately 50 timestamps (1000 seconds / 20 seconds interval) with a 20 second delay. is
means the learning window of the
ωmax
=50 conguration was reset right as the distribution was
changing. Also, 50 is the largest conguration smaller than the “period” of timestamp changes. is
means that this conguration adapts most slowly to changes. We can see this in the strong growth
of both mistake rate and detection time. Smaller
ωmax
congurations adapt faster, while larger
ωmax
congurations “smear” across the two timestamp distributions and learn a “mixed” distribution
with a mean between 50 and 20, and a higher variance. We see this in the graph by the detection
time decreasing with larger
ωmax
. e majority of incorrect suspicions are generated at the change
from
N(
20
,
1
)
to
N(
50
,
1
)
, at which time there were only 50 timestamps. us, the congurations
with
ωmax
greater than 50 perform the same as the
ωmax
=50 congurations. Since the timestamp
distribution decreased in average delay, generally, the change from 50 to 20 generated almost no
suspicion, as average timestamp times decreased, and thus, the suspicion calculated was set to
zero for most samplings of the suspicion function (see Equation (6).). e general takeaway is
that seing the learning window is dicult. If periodic changes in the timestamp distribution are
expected, care should be taken to select a learning window smaller than the period of change. If no
periodic changes are expected, a large learning window should decrease false positives.
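The timestamp counts used in this discussion follow from dividing each phase duration by its mean inter-arrival time. The short sketch below redoes that arithmetic and shows why window sizes above 50 cannot differ from 50 at the first distribution change; the variable names and the list of candidate window sizes are illustrative assumptions.

```python
# (mean inter-arrival time in s, phase duration in s) for the three phases
phases = [(20, 1000), (50, 1000), (20, 500)]

# Expected number of timestamps per phase: duration divided by mean gap.
counts = [duration // mean for mean, duration in phases]
total = sum(counts)

# At the first change only counts[0] samples exist, so a learning window
# can hold at most that many values regardless of its configured size.
effective_window = {w: min(w, counts[0]) for w in (10, 25, 50, 100)}

print(counts, total, effective_window)
```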
6.2 Evaluation of Mitigator
To evaluate the performance of the mitigator, we have built a Python-based evaluation framework. We evaluate the performance of the mitigator by generating a random network and a random recipe, and letting the allocator find the optimal allocation.
We generate the network with two classes of nodes: wireless nodes are connected via an energy-inefficient wireless connection, and wired nodes are connected via an energy-efficient wired connection. In our configuration, 60% of the nodes are wired nodes, and the remaining 40% are wireless nodes. Nodes are connected to each other with a certain probability. That probability is 0.8 for wired-wired connections, 0.5 for wireless-wireless connections and 0.4 for wireless-wired connections. Wired connections use 0.2 units of energy, while wireless connections use 0.8 units of energy. Nodes have a varying amount of resources uniformly distributed between a lower bound of 1 and an upper bound of 8 resource units. Nodes also have a varying processing speed between
Fig. 7. “Long” (left) and “wide” (right) recipes.
(a) CPU time for the optimal allocation algorithm vs. the number of nodes. Each experiment with n nodes was measured 5 times with 3 to n-1 tasks.
(b) CPU time for the optimal allocation algorithm vs. the number of tasks. Each experiment with n tasks was measured 5 times with 5 to 20 nodes.
Fig. 8. Runtime for optimal allocation.
1 and 3 times the speed of a reference processor. Finally, nodes use from 0.5 to 1.5 times as much energy as a reference processor for a single unit of computation.
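A network with these properties could be generated along the following lines. This is a sketch, not the authors' framework: the function name and data layout are hypothetical, the wired share is applied as an exact count, resources are drawn as integers, and a link between a wired and a wireless node is assumed to cost as much as a wireless link, since the text does not state these details.

```python
import itertools
import random

WIRED_SHARE = 0.6
LINK_PROBABILITY = {
    ("wired", "wired"): 0.8,
    ("wireless", "wireless"): 0.5,
    ("wired", "wireless"): 0.4,
}


def random_network(n_nodes, rng):
    """Return (nodes, links) for a random evaluation network."""
    n_wired = round(WIRED_SHARE * n_nodes)
    kinds = ["wired"] * n_wired + ["wireless"] * (n_nodes - n_wired)
    rng.shuffle(kinds)
    nodes = {
        i: {
            "kind": kind,
            "resources": rng.randint(1, 8),          # resource units (assumed integer)
            "speed": rng.uniform(1.0, 3.0),          # relative to reference processor
            "energy_factor": rng.uniform(0.5, 1.5),  # energy per unit of computation
        }
        for i, kind in enumerate(kinds)
    }
    links = {}
    for a, b in itertools.combinations(range(n_nodes), 2):
        pair = tuple(sorted((nodes[a]["kind"], nodes[b]["kind"])))
        if rng.random() < LINK_PROBABILITY[pair]:
            # Assumption: only wired-wired links get the cheaper wired cost.
            links[(a, b)] = 0.2 if pair == ("wired", "wired") else 0.8
    return nodes, links


nodes, links = random_network(10, random.Random(1))
print(len(nodes), len(links))
```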
For the recipe, we generate two classes of recipes with a certain number of tasks, a “wide” recipe and a “long” recipe. In a “wide” recipe, two tasks are designated the “start” and “end” tasks, and every other task needs input from the start task and sends output to the end task. In a “long” recipe, tasks are linked serially. Figure 7 shows two example recipes. Each recipe task has resource requirements randomly distributed between 1 and 8, an output factor randomly distributed between 0.5 and 1.5, and a computation size of 1 or 2.
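The two recipe shapes can be sketched as edge lists over task indices. The function names and the attribute keys below are hypothetical, and drawing resource requirements as integers is an assumption.

```python
import random


def long_recipe(n_tasks):
    """Tasks linked serially: 0 -> 1 -> ... -> n_tasks - 1."""
    return [(i, i + 1) for i in range(n_tasks - 1)]


def wide_recipe(n_tasks):
    """Task 0 is the start task and task n_tasks - 1 the end task;
    every other task reads from the start and writes to the end."""
    start, end = 0, n_tasks - 1
    edges = []
    for task in range(1, n_tasks - 1):
        edges.append((start, task))
        edges.append((task, end))
    return edges


def task_attributes(n_tasks, rng):
    """Random per-task requirements in the ranges given in the text."""
    return {
        task: {
            "resources": rng.randint(1, 8),          # assumed integer
            "output_factor": rng.uniform(0.5, 1.5),
            "size": rng.choice([1, 2]),
        }
        for task in range(n_tasks)
    }


long_edges = long_recipe(5)
wide_edges = wide_recipe(5)
attributes = task_attributes(5, random.Random(7))
print(long_edges)
print(wide_edges)
```

For the same number of tasks, the long recipe forms a single chain while the wide recipe fans out and back in, so the two shapes stress the network cost term in different ways.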
As expected, the optimal allocation algorithm scales very badly (non-polynomially). In Figure 8, we see the runtime of the algorithm for varying problem sizes. The shaded area shows the variance with the non-shown parameter (different recipe sizes for the network node graph, differing network sizes for the recipe node graph). The time needed for finding the optimal allocation becomes unwieldy very quickly.
In comparison, our heuristic presented in Section 4.3 finds a solution much more quickly. Figure 9 shows the runtime of the heuristic for different network and recipe sizes. For the slowest case for the full allocation, the heuristic takes 8 seconds of CPU time, while the solver consumes 864104 seconds (about 10 days) of CPU time for finding the optimal allocation. The allocation evaluation was executed on an Amazon EC2 m4.10xlarge machine with 40 virtual cores and 160 GiB of memory. Peak memory use was 51 GiB.
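For orientation, the two CPU times quoted for the slowest case can be related by simple arithmetic; the ratio is derived here from the figures above and is not a number reported separately by the authors.

```latex
\begin{align*}
\frac{864104\ \text{s}}{86400\ \text{s/day}} &\approx 10.0\ \text{days}, \\
\frac{864104\ \text{s}}{8\ \text{s}} &\approx 1.08 \times 10^{5}.
\end{align*}
```

In this case the heuristic is therefore roughly five orders of magnitude faster than the optimal solver.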
(a) CPU time of the allocation heuristic vs. the number of nodes. Each experiment with n nodes was measured 5 times with 3 to n-1 tasks.
(b) CPU time of the allocation heuristic vs. the number of tasks. Each experiment with n tasks was measured 5 times with 5 to 20 nodes.
Fig. 9. Heuristic runtime
Fig. 10. Energy consumption of heuristic solution scaled against optimal solution.
However, the heuristic loses about 30% of energy efficiency compared to the optimal algorithm. As seen in Figure 10, 50% of the solutions lie in the 0.6 to 0.8 range.
In Figure 11, the performance of the heuristic as related to the size of the recipe can be seen. The performance of the heuristic decreases with larger networks. This can be explained by the network links being longer, as the difference between the node-local transmission energy $\hat{T}_n$ and the actual transmission energy $T_{(n, n_2)}$ grows larger with a growing network. The diagram also shows that the long recipe is harder to allocate than the wide recipe.
This improves on the results shown in [2] for the sampling heuristic, where a sampling factor of 30% (i.e. only about one third of all nodes were considered) led to a runtime of 5 seconds, but a solution quality of 40% for the sequential application.
7 CONCLUSIONS & FUTURE WORK
Today, IoT applications are increasingly executed in edge environments to avoid latency and privacy
issues associated with a cloud-based execution. To enable the execution of complex applications
Fig. 11. Performance of heuristic vs. number of tasks and recipe type.
on the edge, we need to split them into separate tasks and execute them on multiple devices. An example of such a complex application has been described above: the vibration analysis of rotating machinery on the manufacturing shop floor. Running such distributed applications reliably is a challenge.
We present in this work a system that supports the self-healing of such IoT choreographies. This system consists mainly of two contributions: (1) a novel failure detector concept that supports a wide range of parameters for application-specific and policy-based configuration, together with some guidelines for the selection of these parameters; and (2) an ILP formulation for optimal task allocation with regard to energy, together with a heuristic that makes on-line computation of allocations feasible. We have evaluated both the PE-FD failure detector and the performance of the allocation algorithm.
In the future, we plan to formalize the policies described textually in Section 3.2, possibly in the form of semantic rules in combination with a reasoner. In conjunction with formally described application requirements, this will allow us to automatically infer a tuned failure detector parameterization without manual configuration.
Further, we aim to extend our allocation benchmarking framework to evaluate other heuristics for allocation, taking into account resource distribution in the network. These heuristics will likely perform better on large networks, where the simple “outgoing” heuristic fails.
Additionally, realizing and evaluating the use case as described in Section 5 will be required to gain a better understanding of the use of allocation and failure detection in industrial automation systems. Also, we aim to integrate our system with the Node-RED framework for the easy creation of applications. With approaches such as Distributed Node-RED [9] and the traction Node-RED is gaining in automation communities, this will be a promising field to apply the techniques described in this paper.
ACKNOWLEDGMENTS
This work is part of the SEMIoTICS project⁷ that develops a pattern-driven framework to guarantee secure and dependable behavior in IoT environments [8]. It received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 780315.
REFERENCES
[1] M. Bertier, O. Marin, and P. Sens. 2002. Implementation and performance evaluation of an adaptable failure detector. In Proceedings International Conference on Dependable Systems and Networks. 354–363. https://doi.org/10.1109/DSN.2002.1028920
7hps://www.semiotics-project.eu/
[2] Valeria Cardellini, Vincenzo Grassi, Francesco Lo Presti, and Matteo Nardelli. 2016. Optimal operator placement for distributed stream processing applications. In Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM Press, 69–80. https://doi.org/10.1145/2933267.2933312
[3] Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43, 2 (March 1996), 225–267. https://doi.org/10.1145/226643.226647
[4] Wei Chen, S. Toueg, and M. K. Aguilera. 2002. On the quality of service of failure detectors. IEEE Trans. Comput. 51, 1 (Jan. 2002), 13–32. https://doi.org/10.1109/12.980014
[5] Shiva Chetan, Anand Ranganathan, and R. Campbell. 2005. Towards fault tolerance pervasive computing. IEEE Technology and Society Magazine 24, 1 (2005), 38–44. https://doi.org/10.1109/MTAS.2005.1407746
[6] Anubis Graciela De Moraes Rossetto, Carlos O. Rolim, Valderi Leithardt, Guilherme A. Borges, Cláudio F.R. Geyer, Luciana Arantes, and Pierre Sens. 2015. A new unreliable failure detector for self-healing in ubiquitous environments. In Proceedings - International Conference on Advanced Information Networking and Applications, AINA. https://doi.org/10.1109/AINA.2015.201
[7] X. Défago, N. Hayashibara, R. Yared, and T. Katayama. 2004. The ϕ Accrual Failure Detector. In Reliable Distributed Systems, IEEE Symposium on (SRDS). 66–78. https://doi.org/10.1109/RELDIS.2004.1353004
[8] K. Fysarakis, G. Panoudakis, N. Petroulakis, O. Soultatos, A. Bröring, and T. Marktscheffel. 2019. Architectural Patterns for Secure IoT Orchestrations. In Global Internet of Things Summit (GIoTS 2019), 17.-21. June 2019, Aarhus, DK. IEEE.
[9] N. K. Giang, M. Blackstock, R. Lea, and V. C. M. Leung. 2015. Developing IoT applications in the Fog: A Distributed Dataflow approach. In 2015 5th International Conference on the Internet of Things (IOT). 155–162. https://doi.org/10.1109/IOT.2015.7356560
[10] Sila Ozen Guclu, Tanir Ozcelebi, and Johan Lukkien. 2016. Distributed Fault Detection in Smart Spaces Based on Trust Management. Procedia Computer Science 83 (Jan. 2016), 66–73. https://doi.org/10.1016/j.procs.2016.04.100
[11] Andreas Moregård Haubenwaller and Konstantinos Vandikas. 2015. Computations on the edge in the internet of things. Procedia Computer Science 52 (2015), 29–34.
[12] W. Z. Khan, M. Y. Aalsalem, M. K. Khan, M. S. Hossain, and M. Atiquzzaman. 2017. A reliable Internet of Things based architecture for oil and gas industry. In 2017 19th International Conference on Advanced Communication Technology (ICACT). 705–710. https://doi.org/10.23919/ICACT.2017.7890184
[13] Palanivel A. Kodeswaran, Ravi Kokku, Sayandeep Sen, and Mudhakar Srivatsa. 2016. Idea: A System for Efficient Failure Management in Smart IoT Environments. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys ’16). ACM, New York, NY, USA, 43–56. https://doi.org/10.1145/2906388.2906406
[14] S. Krügel, J. Maierhofer, T. Thümmel, and D. J. Rixen. 2019. Rotor Model Reduction for Wireless Sensor Node Based Monitoring Systems. 13th International Conference on Dynamics of Rotating Machines (2019).
[15] G. T. Lakshmanan, Y. Li, and R. Strom. 2008. Placement Strategies for Internet-Scale Data Stream Systems. IEEE Internet Computing 12, 6 (Nov. 2008), 50–60. https://doi.org/10.1109/MIC.2008.129
[16] Jiaxi Liu, Zhibo Wu, Jian Dong, Jin Wu, and Dongxin Wen. 2018. An energy-efficient failure detector for vehicular cloud computing. PLOS ONE 13, 1 (Jan. 2018), e0191577. https://doi.org/10.1371/journal.pone.0191577
[17] Nitinder Mohan and Jussi Kangasharju. 2016. Edge-Fog cloud: A distributed cloud for Internet of Things computations. In 2016 Cloudification of the Internet of Things (CIoT). IEEE, 1–6.
[18] J. A. Nelder and R. Mead. 1965. A Simplex Method for Function Minimization. Comput. J. 7, 4 (Jan. 1965), 308–313. https://doi.org/10.1093/comjnl/7.4.308
[19] G. Terry Ross and Richard M. Soland. 1975. A branch and bound algorithm for the generalized assignment problem. Mathematical Programming 8, 1 (Dec. 1975), 91–103. https://doi.org/10.1007/BF01580430
[20] M. Ruta, F. Scioscia, G. Loseto, and E. Di Sciascio. 2014. Semantic-Based Resource Discovery and Orchestration in Home and Building Automation: A Multi-Agent Approach. IEEE Transactions on Industrial Informatics 10, 1 (Feb. 2014), 730–741. https://doi.org/10.1109/TII.2013.2273433
[21] Yuvraj Sahni, Jiannong Cao, Shigeng Zhang, and Lei Yang. 2017. Edge Mesh: A new paradigm to enable distributed intelligence in Internet of Things. IEEE access 5 (2017), 16441–16458.
[22] Farzad Samie, Vasileios Tsoutsouras, Lars Bauer, Sotirios Xydis, Dimitrios Soudris, and Jörg Henkel. 2016. Computation offloading and resource allocation for low-power IoT edge devices. In 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT). IEEE, 7–12.
[23] Stefania Sardellitti, Gesualdo Scutari, and Sergio Barbarossa. 2015. Joint optimization of radio and computational resources for multicell mobile-edge computing. IEEE Transactions on Signal and Information Processing over Networks 1, 2 (2015), 89–103.
[24] Benjamin Satzger, Andreas Pietzowski, Wolfgang Trumler, and Theo Ungerer. 2007. A New Adaptive Accrual Failure Detector for Dependable Distributed Systems. In Proceedings of the 2007 ACM Symposium on Applied Computing (SAC ’07). ACM, New York, NY, USA, 551–555. https://doi.org/10.1145/1244002.1244129
[25] J. Seeger, A. Bröring, M.-O. Pahl, and E. Sakic. [n.d.]. Rule-Based Translation of Application-Level QoS Constraints into SDN Configurations for the IoT. In EuCNC 2019, Valencia, Spain. IEEE.
[26] Jan Seeger, Rohit A. Deshmukh, and Arne Bröring. 2018. Running Distributed and Dynamic IoT Choreographies. In 2018 IEEE Global Internet of Things Summit (GIoTS) Proceedings, Vol. 2. IEEE, Bilbao, Spain, 33–38. http://arxiv.org/abs/1802.03159 arXiv: 1802.03159.
[27] J. Seeger, R. A. Deshmukh, V. Sarafov, and A. Bröring. 2019. Dynamic IoT Choreographies. IEEE Pervasive Computing 18, 1 (Jan. 2019), 19–27. https://doi.org/10.1109/MPRV.2019.2907003
[28] Quan Z. Sheng, Xiaoqiang Qiao, Athanasios V. Vasilakos, Claudia Szabo, Scott Bourne, and Xiaofei Xu. 2014. Web services composition: A decade’s overview. Information Sciences 280 (Oct. 2014), 218–238. https://doi.org/10.1016/j.ins.2014.04.054 WOS:000339132700014.
[29] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu. 2016. Edge Computing: Vision and Challenges. IEEE Internet of Things Journal 3, 5 (Oct. 2016), 637–646. https://doi.org/10.1109/JIOT.2016.2579198
[30] Andrew S. Tanenbaum and Maarten van Steen. 2007. Distributed systems - principles and paradigms, 2nd Edition. Pearson Education.
[31] Aparna Saisree Thuluva, Arne Bröring, Ganindu P. Medagoda, Hettige Don, Darko Anicic, and Jan Seeger. 2017. Recipes for IoT Applications. In Proceedings of the Seventh International Conference on the Internet of Things (IoT ’17). ACM, New York, NY, USA, 10:1–10:8. https://doi.org/10.1145/3131542.3131553
[32] Aparna Saisree Thuluva, Kirill Dorofeev, Monika Wenger, Darko Anicic, and Sebastian Rudolph. 2017. Semantic-Based Approach for Low-Effort Engineering of Automation Systems. In On the Move to Meaningful Internet Systems. OTM 2017 Conferences (Lecture Notes in Computer Science). Springer, Cham, 497–512. https://doi.org/10.1007/978-3-319-69459-7_33
[33] Blase Ur, Melwyn Pak Yong Ho, Stephen Brawner, Jiyun Lee, Sarah Mennicken, Noah Picard, Diane Schulze, and Michael L. Littman. 2016. Trigger-Action Programming in the Wild: An Analysis of 200,000 IFTTT Recipes. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 3227–3231. https://doi.org/10.1145/2858036.2858556 event-place: San Jose, California, USA.
... The framework is flexible so that it can be extended by adding further constraints or shifted to other optimization targets. Finally, our previous work [54] has leveraged Cardellini's framework and has extended it by incorporating further constraints for the optimization goal, namely the overall energy usage of the application. ...
... Specifically, we consider the problem of allocating the application components to the available end devices. As first presented in [54], we extend the Integer Linear Programming (ILP) based framework defined by Cardellini et al. [53] (Section II-C). In [54] the goal was to minimize the overall energy consumption needed for executing an IoT application. ...
... As first presented in [54], we extend the Integer Linear Programming (ILP) based framework defined by Cardellini et al. [53] (Section II-C). In [54] the goal was to minimize the overall energy consumption needed for executing an IoT application. The formulated ILP model is described below. ...
Article
Full-text available
An Intelligent IoT Environment (iIoTe) is comprised of heterogeneous devices that can collaboratively execute semiautonomous IoT applications, examples of which include highly automated manufacturing cells or autonomously interacting harvesting machines. Energy efficiency is key in such edge environments, since they are often based on an infrastructure that consists of wireless and battery-run devices, e.g., e-tractors, drones, Automated Guided Vehicle (AGV)s and robots. The total energy consumption draws contributions from multiple iIoTe technologies that enable edge computing and communication, distributed learning, as well as distributed ledgers and smart contracts. This paper provides a state-of-the-art overview of these technologies and illustrates their functionality and performance, with special attention to the tradeoff among resources, latency, privacy and energy consumption. Finally, the paper provides a vision for integrating these enabling technologies in energyefficient iIoTe and a roadmap to address the open research challenges.
... The framework is flexible so that it can be extended by adding further constraints or shifted to other optimization targets. Our previous work [54] has leveraged Cardellini's framework and has extended it by incorporating further constraints for the optimization goal, namely the overall energy usage of the application. ...
... Specifically, we consider the problem of allocating the application components to the available end devices. As first presented in [54], we extend the Integer Linear Programming (ILP) based framework defined by Cardellini et al. [53] (Section II-C). In [54] the goal was to minimize the overall energy consumption needed for executing an IoT application. ...
... As first presented in [54], we extend the Integer Linear Programming (ILP) based framework defined by Cardellini et al. [53] (Section II-C). In [54] the goal was to minimize the overall energy consumption needed for executing an IoT application. The formulated ILP model is described below. ...
Preprint
Full-text available
An Intelligent IoT Environment (iIoTe) is comprised of heterogeneous devices that can collaboratively execute semi-autonomous IoT applications, examples of which include highly automated manufacturing cells or autonomously interacting harvesting machines. Energy efficiency is key in such edge environments, since they are often based on an infrastructure that consists of wireless and battery-run devices, e.g., e-tractors, drones, Automated Guided Vehicle (AGV)s and robots. The total energy consumption draws contributions from multipleiIoTe technologies that enable edge computing and communication, distributed learning, as well as distributed ledgers and smart contracts. This paper provides a state-of-the-art overview of these technologies and illustrates their functionality and performance, with special attention to the tradeoff among resources, latency, privacy and energy consumption. Finally, the paper provides a vision for integrating these enabling technologies in energy-efficient iIoTe and a roadmap to address the open research challenges
... Self-repair or self-healing is defined as a property of systems that are able to identify and diagnose problems that appear during their operation and to determine and propose solution strategies in an autonomous way [8]. More specifically, self-healing provides reliability to a system through responsibility and awareness of the environment. ...
Article
Full-text available
Due to technological advances, Internet of Things (IoT) systems are becoming increasingly complex. They are characterized by being multi-device and geographically distributed, which increases the possibility of errors of different types. In such systems, errors can occur anywhere at any time and fault tolerance becomes an essential characteristic to make them robust and reliable. This paper presents a framework to manage and detect errors and malfunctions of the devices that compose an IoT system. The proposed solution approach takes into account both, simple devices such as sensors or actuators, as well as computationally intensive devices which are distributed geographically. It uses knowledge graphs to model the devices, the system’s topology, the software deployed on each device and the relationships between the different elements. The proposed framework retrieves information from log messages and processes this information automatically to detect anomalous situations or malfunctions that may affect the IoT system. This work also presents the ECO ontology to organize the IoT system information.
... The use of self-healing [22] -the ability of a system to automatically detect, diagnose and repair system defects at both hardware and software levels -to attain fault-tolerance on IoT systems has been suggested by several authors [1,4,20,35,37,38]. In previous work [14] a set of patterns to achieve fault-tolerance in IoT systems by adding self-healing mechanisms was introduced, along with a reference implementation in Node-RED. ...
Conference Paper
The widespread use of Internet-of-Things (IoT) across different application domains leads to an increased concern regarding their dependability, especially as the number of potentially mission-critical systems becomes considerable. Fault-tolerance has been used to reduce the impact of faults in systems, and their adoption in IoT is becoming a necessity. This work focuses on how to exercise fault-tolerance mechanisms by deliberately provoking its malfunction. We start by describing a proof-of-concept fault-injection add-on to a commonly used publish/subscribe broker. We then present several experiments mimicking real-world IoT scenarios, focusing on injecting faults in systems with (and without) active self-healing mechanisms and comparing their behavior to the baseline without faults. We observe evidence that fault-injection can be used to (a) exercise in-place fault-tolerance apparatus, and (b) detect when these mechanisms are not performing nominally, providing insights into enhancing in-place fault-tolerance techniques.
... IntellIoT develops a mechanism for optimized allocation of workloads to computing resources (i.e., mapping of IoT applications to devices). It consists of a flexible algorithmic framework that builds on prior work [23] and is adjustable to different optimality criteria at runtime. Further, it needs to be dynamically adapting to network changes based on high-level application requirements [4]; i.e., establishing closed-loop infrastructure management. ...
Chapter
Traditional IoT setups are cloud-centric and typically focused around a centralized IoT platform to which data is uploaded for further processing. Next generation IoT applications are incorporating technologies such as artificial intelligence, augmented reality, and distributed ledgers to realize semi-autonomous behaviour of vehicles, guidance for human users, and machine-to-machine interactions in a trustworthy manner. Such applications require more dynamic IoT environments, which can operate locally without the necessity to communicate with the Cloud. In this paper, we describe three use cases of next generation IoT applications and highlight associated challenges for future research. We further present the IntellIoT framework that comprises the required components to address the identified challenges.
... The use of self-healing [22] (the ability of a system to automatically detect, diagnose and repair system defects at both hardware and software levels) to attain fault-tolerance in IoT systems has been suggested by several authors [1,4,20,35,37,38]. In previous work [14], a set of patterns to achieve fault-tolerance in IoT systems by adding self-healing mechanisms was introduced, along with a reference implementation in Node-RED. ...
Preprint
... It applies a control loop combining recovery techniques (using checkpointing) and monitoring as well as failure notification and reconfiguration. Current research in this area includes the work of Seeger et al. who introduce an approach for failure detection and mitigation strategies of IoT (edge) devices for complex software tasks [119]. The mitigation strategies include an optimal task allocation strategy for distributed task execution on IoT devices. ...
Preprint
Internet-of-Things (IoT) ecosystems tend to grow in both scale and complexity, as they consist of a variety of heterogeneous devices that span multiple architectural IoT layers (e.g., cloud, edge, sensors). Further, IoT systems increasingly demand the resilient operability of services as they become part of critical infrastructures. This leads to a broad variety of research works that aim to increase the resilience of these systems. In this paper, we create a systematization of knowledge about existing scientific efforts to make IoT systems resilient. In particular, we first discuss the taxonomy and classification of resilience and resilience mechanisms and subsequently survey state-of-the-art resilience mechanisms that have been proposed by research work and are applicable to IoT. As part of the survey, we also discuss questions that focus on the practical aspects of resilience, e.g., which constraints resilience mechanisms impose on developers when designing resilient systems by incorporating a specific mechanism into IoT systems.
Article
Ensuring the seamless operation of cloud computing services is paramount for meeting user demands and ensuring business continuity. Fault‐tolerant self‐healing techniques play a crucial role in enhancing the reliability and availability of cloud platforms, minimizing downtime and ensuring uninterrupted service delivery. This article systematically categorizes and analyzes existing research on fault‐tolerant self‐healing techniques published between 2005 and 2024. We provide a comprehensive technical taxonomy organizing self‐healing techniques based on fault tolerance processes, encompassing considerations for both reliability and availability. Additionally, we evaluate applications of proactive self‐healing techniques, highlighting their achievements, and limitations in enhancing service continuity. Strategies to address identified weaknesses are discussed, alongside future research challenges and open issues in the domain of cloud resilience. Through this analysis, the article contributes to understanding self‐healing techniques in cloud computing, offering insights into their effectiveness in ensuring service continuity. The findings aim to guide future research efforts in developing more robust and resilient cloud infrastructures, ultimately enhancing overall service reliability and availability. By emphasizing the importance of fault tolerance and self‐healing techniques, this article lays the foundation for advancing the state‐of‐the‐art in cloud computing.
Article
Availability of components in online systems cannot be guaranteed due to the unstable nature of the web (updates, changes, etc.). A well-designed system must take this fact into account in order to ensure the availability of services, which is a very difficult challenge due to the confidentiality and autonomy of each service component. An interesting solution is to tolerate these problems at the composite level through a recovery mechanism called self-healing. In this work, we propose a formal approach that models a business process (web service composition) as timed automata of the daTA type, while ensuring quality of service with respect to the functional and non-functional needs of the system (in this case, the QoS represents the response time). The main objective of this project is to create a system that compares two web service compositions pairwise to decide whether they are equivalent, in order to ensure a working self-healing system.
Article
The Internet of Things is growing at a dramatic rate and extending into various application domains. We have designed, implemented, and evaluated a resilient and decentralized system to enable dynamic IoT choreographies. We applied it to maintaining the functionality of building automation systems so that new devices can appear and vanish on-the-fly.
Conference Paper
IoT systems are growing larger and larger and are becoming suitable for basic automation tasks. One of the features IoT automation systems can provide is dealing with a dynamic system, i.e., devices leaving and joining the system during operation. Additionally, IoT automation systems operate in a decentralized manner. Current commercial automation systems have difficulty providing these features. Integrating new devices into an automation system takes manual intervention. Additionally, automation systems also require central entities to orchestrate the operation of participants. With smarter sensors and actors, we can move control operations into software deployed on a decentralized network of devices, and provide support for dynamic systems. In this paper, we present a framework for automation systems that demonstrates these two properties (distributed and dynamic). We represent applications as semantically described data flows that are run in a decentralized manner on participating devices and connected at runtime via rules. This allows integrating new devices into applications without manual interaction and removes central controllers from the equation. This approach provides similar features to current automation systems (central engineering, multiple instantiation of applications), but enables distributed and dynamic operation. We demonstrate satisfactory performance of the system via a quantitative evaluation.
Article
Failure detectors are one of the fundamental components for maintaining the high availability of vehicular cloud computing. In vehicular cloud computing, many RSUs are deployed along the road to improve connectivity. Many of them are equipped with solar batteries due to the unavailability or excessive expense of wired electrical power, so it is important to reduce the battery consumption of RSUs. However, the existing failure detection algorithms are not designed to reduce the battery consumption of RSUs. To solve this problem, a new energy-efficient failure detector, 2E-FD, has been proposed specifically for vehicular cloud computing. 2E-FD not only provides acceptable failure detection service, but also reduces the battery consumption of RSUs. Comparative experiments show that our failure detector has better performance in terms of speed, accuracy and battery consumption.
Article
The Internet of Things (IoT) is bringing an increasing number of connected devices that have a direct impact on the growth of data and energy-hungry services. These services rely on Cloud infrastructures for storage and computing capabilities, transforming their architecture into a more distributed one based on edge facilities provided by Internet Service Providers (ISPs). Yet, between the IoT device, communication network and Cloud infrastructure, it is unclear which part is the largest in terms of energy consumption. In this paper, we provide end-to-end energy models for Edge Cloud-based IoT platforms. These models are applied to a concrete scenario: data stream analysis produced by cameras embedded on vehicles. The validation combines measurements on real test-beds running the targeted application and simulations on well-known simulators for studying the scaling-up with an increasing number of IoT devices. Our results show that, for our scenario, the edge Cloud part embedding the computing resources consumes 3 times more than the IoT part comprising the IoT devices and the wireless access point.
Conference Paper
Industry 4.0, also referred to as the fourth industrial revolution, aims at mass customized production with low cost and shorter production time. Automation Systems (ASs) used in the manufacturing processes should be flexible to meet the constantly changing needs of mass customized production. Low-effort engineering of an Automation System (AS) is an important requirement towards this goal. Secondly, transparency and interoperability of ASs across different domains open a new class of applications. In order to address these challenges, we propose a low-effort approach to engineer, configure and re-engineer an AS by employing Web of Things and Semantic Web Technologies. The approach allows for creating a semantic specification for a new functionality or an application. It automatically checks whether a target AS can run a new functionality. We developed an engineering tool with a graphical user interface for our approach that enables an engineer to easily interact with an AS when discovering its functionality and when engineering, configuring and deploying new functionality on it.
Conference Paper
The Internet of Things (IoT) is on the rise. More and more physical devices and their virtual shadows emerge and become accessible through IoT platforms. Marketplaces are being built to enable and monetize the access to IoT offerings, i.e., data and functions offered by platforms, things, and services. In order to maximize the usefulness of such IoT offerings, we need mechanisms that allow their efficient and flexible composition. This paper describes a novel approach for such compositions. The approach is based on the notion of Recipes that define workflows on how their ingredients, i.e., instances of IoT offerings, shall interact with each other. Furthermore, the paper presents a novel user interface that enables users to create and instantiate recipes by selecting their ingredients. An example from the smart mobility domain guides the reader through the paper, illustrates our approach, and serves as a proof of concept.
Article
In recent years, there has been a paradigm shift in the Internet of Things (IoT) from centralized cloud computing to edge computing (or fog computing). Developments in ICT have resulted in a significant increase in the communication and computation capabilities of embedded devices, and this will continue in the coming years. However, existing paradigms do not utilize low-level devices for any decision-making process. In fact, gateway devices are also utilized mostly for communication interoperability and some low-level processing. In this paper, we propose a new computing paradigm, named Edge Mesh, which distributes the decision-making tasks among Edge devices within the network instead of sending all the data to a centralized server. All the computation tasks and data are shared using a mesh network of Edge devices and routers. Edge Mesh provides many benefits including distributed processing, low latency, fault tolerance, better scalability, better security and privacy, etc. These benefits are useful for critical applications which require higher reliability, real-time processing, mobility support, and context awareness. We first give an overview of existing computing paradigms to establish the motivation behind Edge Mesh. Then, we describe in detail the Edge Mesh computing paradigm, including the proposed software framework, research challenges, and benefits of Edge Mesh. We also describe the task management framework and present a preliminary study on the task allocation problem in Edge Mesh. Different application scenarios, including Smart Home, Intelligent Transportation System, and Healthcare, are presented to illustrate the significance of the Edge Mesh computing paradigm.
Conference Paper
Internet of Things typically involves a significant number of smart sensors sensing information from the environment and sharing it to a cloud service for processing. Various architectural abstractions, such as Fog and Edge computing, have been proposed to localize some of the processing near the sensors and away from the central cloud servers. In this paper, we propose Edge-Fog Cloud which distributes task processing on the participating cloud resources in the network. We develop the Least Processing Cost First (LPCF) method for assigning the processing tasks to nodes which provide the optimal processing time and near optimal networking costs. We evaluate LPCF in a variety of scenarios and demonstrate its effectiveness in finding the processing task assignments.
Conference Paper
Anomaly detection systems deployed for monitoring in oil and gas industries are mostly WSN-based systems or SCADA systems, both of which suffer from noteworthy limitations. WSN-based systems are non-homogeneous or incompatible systems. They lack coordinated communication and transparency among regions and processes. On the other hand, SCADA systems are expensive, inflexible, not scalable, and provide data with long delay. In this paper, a novel IoT-based architecture is proposed for oil and gas industries to make data collection from connected objects simple, secure, robust, reliable and quick. Moreover, we suggest how this architecture can be applied to any of the three categories of operations: upstream, midstream and downstream. This can be achieved by deploying a set of IoT-based smart objects (devices) and cloud-based technologies in order to reduce complex configurations and device programming. Our proposed IoT architecture supports the functional and business requirements of the upstream, midstream and downstream oil and gas value chain of geologists, drilling contractors, operators, and other oil field services. Using our proposed IoT architecture, inefficiencies and problems can be identified and resolved sooner, ultimately saving time and money and increasing business productivity.