Characterization of Failure Dynamics
in SDN Controllers
Petra Vizarreta, Poul Heegaard, Bjarne Helvik, Wolfgang Kellerer, and Carmen Mas Machuca
Chair of Communication Networks, Technical University of Munich
Department of Information Security and Communication Technology, Norwegian University of Science and Technology
petra.vizarreta@lkn.ei.tum.de, {poulh,bjarne}@ntnu.no, {wolfgang.kellerer,cmas}@tum.de
Abstract—With Software Defined Networking (SDN) the con-
trol plane logic of forwarding devices, switches and routers, is
extracted and moved to an entity called the SDN controller, which
acts as a broker between the network applications and the physical
network infrastructure. Failures of the SDN controller inhibit
the network's ability to respond to new application requests and
to react to events coming from the physical network. Despite the
huge impact that a controller has on the network performance
as a whole, a comprehensive study of its failure dynamics is
still missing in the state-of-the-art literature. The goal of this
paper is to analyse, model and evaluate the impact that different
controller failure modes have on its availability. A model in the
formalism of Stochastic Activity Networks (SAN) is proposed
and applied to a case study of a hypothetical controller based on
commercial controller implementations. In the case study we show
how the proposed model can be used to estimate the controller's
steady-state availability, quantify the impact of different failure
modes on controller outages, as well as the effects of software
ageing, and the impact of software reliability growth on the transient
behaviour.
I. INTRODUCTION
Software Defined Networking (SDN) is a novel network ar-
chitecture concept, where inherently distributed control plane
logic of forwarding devices, switches and routers, is extracted
and moved to an entity called SDN controller. SDN controllers
provide an integrated interface towards the forwarding devices,
which significantly simplifies the network management and
augments the network programmability, as illustrated in Fig. 1.
The controller also has a global overview of the network
state and can react faster to the events from the network
environment, such as congestion of the links or a failure
of the forwarding devices, leading to an improved network
performance [1]. Despite all the benefits it offers, its availability
is still a major concern and a key blocker for the
widespread adoption of SDN in commercial telecom and
industrial networks [2]. Ros et al. [3] showed that more than
one controller must be deployed in the network to achieve
the "5-nines" availability. However, in the case of software
components, such as SDN controllers, a simple replication is
an inefficient way to improve the availability, since the root
cause of the failure is often shared among the replicas, e.g.
a bug in the code. Different controller failure modes have
to be treated differently, since they occur with a different
frequency, and have different impact on the controller outages.
Understanding the failure dynamics in SDN controllers is an
important step towards being able to predict the performance
[Figure: network applications such as network monitoring, traffic
engineering and bandwidth on demand interact with the SDN
controllers, which manage the physical network.]
Fig. 1: SDN architecture.
of the whole system. The goal of this paper is to analyse,
model and evaluate the impact of different failure modes on the
controller’s availability. We analyse the fault reports of the two
major open source controllers, ONOS [4] and OpenDaylight
[5], and describe their corresponding detection and recovery
schemes. A comprehensive model of the controller in the for-
malism of Stochastic Activity Network (SAN) is provided to
capture all relevant factors and their interdependencies. A case
study of a hypothetical controller, based on the parameters of
commercial controllers, is used to demonstrate how different
controller availability attributes can be evaluated based on such
model.
The rest of the paper is organized as follows. Section II
provides an overview of the related work on dependability
modelling in SDN. In Sections III different failure modes
of an SDN controllers are presented and classified according
to a common terminology of software reliability. A dynamic
model of SDN controller based on stochastic activity networks
is presented in the Section IV, while Section V discuss the
possible applications of the proposed model. We conclude the
paper with a summary and an outlook for the future work.
II. RELATED WORK
Despite the huge impact that a controller has on the
network performance as a whole, a comprehensive study of its
failure dynamics is still missing in the state-of-the-art literature.
Ros et al. [3] have shown that in order to achieve the
availability of five nines, the forwarding devices are required
to connect to at least 2 controllers, for all wide area net-
works included in the study. The controller availability has
been considered as a static random Weibull variable, and
different failure modes are not considered. Studies in [6],
[7] distinguish between permanent and transient hardware
and software failures. Nguyen et al. [6] used hierarchical
models based on reliability graphs and stochastic reward nets,
whereas Nencioni et al. [7] based their model on Markov
chains to capture dependencies between data and control plane.
However, only steady state availability analysis was presented
and the impact of different software failure modes and their
corresponding recovery procedures were neglected. The impact
of the controller workload on its failure rate was studied
by Longo et al. [8]. When one of the controllers fails, the
load of the remaining controllers, modelled as the number of
devices assigned to each controller, changes accordingly. Such
changes in operating conditions are modelled as continuous
phase type distributions. Although the model captured well
the accumulative nature of a certain class of software failures,
deterministic and intermittent failures were neglected.
In the following section we provide a thorough classification
of different controller failures, their corresponding recovery
mechanisms and their impact on the system performance.
III. CLASSIFICATION OF CONTROLLER FAILURES
In SDN-based networks the entire control plane logic re-
sides in the controllers. Controller failures obstruct the net-
work's ability to serve new requests coming from the network
applications, and to react to the events coming from the physical
network, such as routing of unknown packet flows or re-
routing of the traffic after node or link failures. Different
failure modes of the controller occur with different frequency
and have different recovery times, and as such will have
different impact on the services provided to other layers [9].
In this section, we provide an overview of the different failure
modes according to the standard terminology of software
availability [10]. All examples in the following sections are
taken from actual bug reports of the corresponding open source
SDN controllers [4], [5], [11].
A. Deterministic failures (Bohrbugs)
Deterministic failures, also called Bohrbugs in the context
of software availability, represent the class of faults that are
easily reproducible and manifest themselves consistently under
the specific input sequence that led to a failure. This kind of
fault can be activated by a rare but valid input sequence,
or by accessing a rarely used code path that has not been
thoroughly tested in the test and debugging phase. Examples
of Bohrbugs in the commercial controllers are:
• FlowEntry.life() is in milliseconds, but gets created in seconds
• Path Computation Element (PCE) able to create tunnel with negative bandwidth
• Java exception during attempt to create intent with bandwidth/latency
• Random selection of VLAN ids in the Intent Framework
• Topology crash when using big topology (e.g., 20x20 torus)
After a system crash the whole application must be reloaded
from the last check-pointed state, while in the case of a
hanging failure only a single process must be restarted. System
crashes are reported, confirmed and resolved based on their
criticality and priority. Through the debugging process the
software is continuously improved, a concept known as software
reliability growth [12].
B. Non-deterministic failures (Mandelbugs)
Non-deterministic failures occur due to the specific com-
bination of error conditions and their relative timing, which
makes them extremely difficult to reproduce and identify the
exact root cause of the failure, hence the name Mandelbugs
[10]. Such failures are usually related to timing and syn-
chronization issues, such as data race after occurrence of the
specific sequence of failures, simultaneous access to the shared
data store, etc. Examples of Mandelbugs in the commercial
controllers are:
• Distributed database locking in ONOS
• Concurrency issue in NetconfDevice
• Floodlight did not discover inter-switch link; LLDP packets were dropped due to interference with data plane traffic
• Race condition when adding an RPC implementation with an output
This class of failures can often be mitigated by retrying the
operation in a different execution environment, e.g., by changing
the scheduling timers of the concurrent processes [9], [13].
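As an illustration of this retry-based mitigation, the following is a minimal Python sketch; the flaky operation and the back-off parameters are hypothetical, and the randomized delay stands in for the "different execution environment" that perturbs process scheduling:

```python
import random
import time

def retry_with_jitter(operation, max_attempts=3, base_delay=0.05):
    """Retry a flaky operation, perturbing timing between attempts.

    Changing the relative timing of concurrent processes (here via a
    randomized back-off) alters their interleaving, which is often enough
    to avoid re-triggering a Mandelbug such as a race condition.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Randomized delay changes the scheduling environment.
            time.sleep(base_delay * attempt * random.uniform(0.5, 1.5))

# Hypothetical flaky operation: fails on the first call, succeeds afterwards.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient race")
    return "ok"

print(retry_with_jitter(flaky))  # -> ok
```

Note that this only helps against non-deterministic failures; a Bohrbug would fail identically on every retry.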
C. Faults related to software ageing1
Ageing-related faults are a subset of Mandelbugs, and reflect
the gradual degradation of the system performance, due to
memory leaks, data corruption, accumulation of numerical
errors, etc. [10]. A common characteristic of such failures is
that they accumulate over time and can be prevented
by an occasional restart or reboot and cleaning of the internal
system state. Examples of SDN controller failures related to
the software ageing are:
• Flows still reported in oper data store after they have been deleted from both config and network
• A thread calling a default OVSDB configuration DefaultOvsdbClient.insertConfig() is blocked permanently, if OVSDB connection is closed while configuring OVS
• pce-delete-path is not working; not able to delete all the logs, when multiple tunnels are created
• Timed out flows not removed from operational space

1 Note that in the literature the term software ageing is sometimes used to describe the software fitness on the longer time scale, showing how it fails to meet the new system requirements, as they change with time.
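The preventive restart mentioned above can be sketched as a simple rejuvenation policy; the leak size per operation and the usage threshold are hypothetical numbers chosen only to illustrate the mechanism:

```python
LEAK_PER_OP = 3        # hypothetical MB leaked per operation
REJUV_THRESHOLD = 100  # restart once usage crosses this many MB

def run_with_rejuvenation(num_ops):
    """Count the proactive restarts needed to keep a leaky process healthy."""
    usage, restarts = 0, 0
    for _ in range(num_ops):
        usage += LEAK_PER_OP          # ageing: leaked memory accumulates
        if usage >= REJUV_THRESHOLD:  # rejuvenate before a failure occurs
            usage = 0                 # a restart clears the internal state
            restarts += 1
    return restarts

print(run_with_rejuvenation(1000))  # -> 29
```

The policy trades many short, planned restarts against rarer but longer unplanned outages, which is exactly the trade-off explored in the ageing analysis later in the paper.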
D. External failures
An SDN controller is a software component that needs an
operating system and the supporting hardware to run on. The
failures of the operating system and computing hardware do
not depend on the controller software, but must be taken into
account when modelling the availability of the whole system.
IV. MODEL
Stochastic Activity Networks (SAN) are a stochastic exten-
sion of Petri Nets, and represent a powerful tool for depend-
ability modelling [14]. In the SAN formalism, the combination
of markings in the places represents the model state, and the
activities, which can be timed or instantaneous, change the
system state upon firing. Tools such as Möbius automatically
translate SAN models to Markov chains [15] and solve them
using numerical methods. The proposed SAN model of a con-
troller is presented in Fig. 2. The model captures the effects of
software reliability growth, software ageing, different software
failure modes, as well as the external failures of the operating
system (OS) and the computing hardware (HW).
A. Controller software failures
We assume that the instantaneous software failure rate de-
pends on the software maturity and the state of the controller.
1) Software maturity model: When the controller software
is introduced to the market it might still contain the bugs that
have not been detected or resolved during software testing
phase.

[Figure: the SAN model contains places for the controller software
states (sw_ok, sw_prob), the failure types and their detected
counterparts (transient_err, hanging_proc, ctrl_crash), the bug pool
(active_bugs, bug_detected, unresolved_bugs, resolved_bugs) and the
external OS and HW states (os_failed, os_crash, hw_failed,
under_repair, spare_hw), connected by activities such as sw_fail,
sw_age, retry, restart, reload, catch_except, response_timer,
heartbeat, detect, debug, try_again, os_fail, os_reboot, os_repair,
hw_fail, hw_replace and hw_repair.]
Fig. 2: The SDN controller modelled as a Stochastic Activity Network (SAN).

During the operational phase end users report problems,
which are verified and removed by the developers; this
leads to software reliability growth. In the case of the open
source SDN controllers, such as ONOS and OpenDaylight, this
process is transparent and logged in their corresponding bug
trackers. From such bug reports, parameters like the number
of resolved bugs in the stable release, the detection rate of the
new bugs and the average time to debug can be derived. It can
also be noted from the reports that some bugs are reopened
several times, suggesting that the debugging process is not
always successful. We assume that initially a finite number
N_bugs of residual bugs is present in the controller software
code. The bugs are detected with the rate λ_detect per bug,
and are resolved with the rate μ_debug. The success probability
of the debugging process is p_debug. If a bug is not successfully
resolved, it is returned to the pool of detected bugs.
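The bug-pool dynamic just described can be approximated with a coarse time-stepped simulation, using the baseline rates from Table I; the discretization scheme is an illustrative simplification of the SAN submodel, not the solution method used in the paper:

```python
import random

random.seed(1)

N_BUGS = 60            # initial residual bugs (Table I)
DETECT_RATE = 1 / 60   # per-bug detection rate, 1/days
DEBUG_RATE = 1 / 60    # debug completion rate, 1/days
P_DEBUG = 0.99         # probability that a fix is successful

def simulate_growth(horizon_days, dt=0.5):
    """Time-stepped simulation of the active/detected/resolved bug pools."""
    active, detected, resolved = N_BUGS, 0, 0
    for _ in range(int(horizon_days / dt)):
        # Each residual bug is detected independently with rate λ_detect.
        found = sum(random.random() < DETECT_RATE * dt for _ in range(active))
        active -= found
        detected += found
        # Debugging works on one reported bug at a time.
        if detected and random.random() < DEBUG_RATE * dt:
            if random.random() < P_DEBUG:
                detected -= 1
                resolved += 1
            # An unsuccessful fix leaves the bug in the detected pool.
    return active, detected, resolved

a, d, r = simulate_growth(400)
print(a, d, r)
```

Because bugs only move between pools and are never created, the three counters always sum to N_bugs, mirroring the conservation of tokens in the SAN places.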
2) Controller state: When the controller is initiated, restarted
or reloaded it starts from the state sw_ok. The software
failure rate in this state depends on the number of residual
bugs, represented as the number of markings in the place
active_bugs, and the baseline software failure rate, as in the
Jelinski-Moranda model [16]. According to this model, the
failure rate in the i-th failure interval, i.e., after (i − 1)
corrected bugs, is:

λ_ok(t_i) = ϕ_sw_fail (N_bugs − (i − 1))
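As a minimal numeric illustration of this formula, using the baseline values from Table I, the hazard rate can be evaluated directly:

```python
PHI_SW_FAIL = 1 / 7   # baseline per-bug failure rate, 1/days (Table I)
N_BUGS = 60           # initial number of residual bugs (Table I)

def jm_failure_rate(i):
    """Jelinski-Moranda hazard rate in the i-th failure interval,
    i.e., after (i - 1) bugs have been corrected."""
    return PHI_SW_FAIL * (N_BUGS - (i - 1))

# The rate drops linearly as bugs are removed:
print(jm_failure_rate(1))   # all 60 bugs still present
print(jm_failure_rate(31))  # half of the bugs removed, half the rate
```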
During the continuous operation, software ageing effects are
accumulated and the controller performance is degraded. A
common way to model this effect is to assume that the risk of
failure increases after a certain resource utilization threshold
has been reached [17]. The time to reach this threshold (called
application’s base longevity interval) depends on the controller
load. In the model, the software ageing is denoted as a
transition from the state sw_ok to sw_prob. The rate of
ageing is denoted as λ_sw_age, and the failure rate due to
the ageing process is λ_age_fail.
In addition to the ageing-related failures, failures due to
unresolved bugs may still be activated. Since we assume that
both failure mechanisms have negative exponential distribu-
tions, the combined failure process may be expressed as a
single failure process with the rate:

λ_prob(t) = λ_ok(t) + λ_age_fail
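The summation of rates follows from the fact that the first of two competing exponential processes is itself exponential with the sum of the rates. A small Monte Carlo sketch, with a hypothetical bug-activation rate alongside the Table I ageing failure rate, confirms this:

```python
import random

random.seed(0)

LAM_OK = 0.2          # hypothetical bug-activation rate, 1/days
LAM_AGE_FAIL = 1 / 7  # ageing failure rate, 1/days (Table I)

def time_to_failure():
    """First failure is the minimum of the two competing processes."""
    t_bug = random.expovariate(LAM_OK)
    t_age = random.expovariate(LAM_AGE_FAIL)
    return min(t_bug, t_age)

samples = [time_to_failure() for _ in range(200_000)]
mean = sum(samples) / len(samples)

# For competing exponentials the combined rate is the sum of the rates:
print(1 / mean, LAM_OK + LAM_AGE_FAIL)
```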
3) Detection and recovery from software failures: We dif-
ferentiate between three types of software failures depending on
their corresponding recovery process. Transient errors, such as
synchronization and timing issues, can often be resolved by
retrying an operation in a different execution environment [13].
Hanging processes can be detected by the controller software
itself, and are resolved by a process restart. Both OpenDaylight
[5] and ONOS [4] implement ϕ-accrual failure detector [18]
based on the heartbeats for the detection of controller crashes.
After the controller crash, the whole software application must
be reloaded from the last saved checkpoint (system snapshot).
The distribution over the different software failure modes
depends on the controller state. When the controller is in the
highly robust state (sw_ok) the majority of failures are expected
to lead to a crash, while in the failure-prone state (sw_prob)
the majority of failures are expected to be transient and
resolved by a restart.
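The φ-accrual detection principle can be sketched as follows. This is a simplified toy version that assumes exponentially distributed heartbeat inter-arrival times, whereas the detector of [18] implemented in ONOS and OpenDaylight estimates a distribution over observed inter-arrival history; the threshold value is hypothetical:

```python
import math

class PhiAccrualDetector:
    """Simplified phi-accrual failure detector: suspicion level phi is
    -log10 of the probability that the heartbeat is merely late."""

    def __init__(self, threshold=8.0):
        self.threshold = threshold
        self.intervals = []
        self.last_beat = None

    def heartbeat(self, now):
        if self.last_beat is not None:
            self.intervals.append(now - self.last_beat)
        self.last_beat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # Exponential assumption: P(still alive) = exp(-t/mean),
        # so phi = -log10(P) = t / (mean * ln 10).
        return (now - self.last_beat) / (mean * math.log(10))

    def suspect(self, now):
        return self.phi(now) > self.threshold

d = PhiAccrualDetector()
for t in range(0, 100, 10):  # regular heartbeats every 10 s
    d.heartbeat(t)
print(d.suspect(95))   # shortly after a beat: not suspected
print(d.suspect(400))  # long silence: crash suspected
```

Unlike a binary timeout, the accrual output lets the recovery logic pick different φ thresholds for cheap actions (retry) versus expensive ones (full reload).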
B. External failures
The SAN models of the operating system and computing hard-
ware have been adapted from [19], [7]. The OS fails with a rate
λ_os_fail. Some OS failures can be successfully resolved with an
OS reboot, while others require an OS repair, involving a
complete OS reload. The reboot rate is μ_os_reboot and
the repair rate is μ_os_repair. The success probability of the OS
reboot is p_os_reboot. The HW fails with a rate λ_hw_fail, and is
replaced with a spare component if one is available. Initially
N_spare_hw spare components are available. The hardware replacement
rate is μ_hw_replace and the hardware repair rate is μ_hw_repair.
V. CASE STUDY
Next, we present the case study on an SDN controller,
whose model is based on realistic parameters. We show how
the proposed model can be used to estimate the controller
steady state availability, identify the most relevant parameters,
analyse downtime distribution, and to study the impact of
software reliability growth on the transient behaviour.
A. Model parameters
Model parameters are based on actual SDN controllers, or
on the studies of software components of similar complexity,
when the data was not available. Parameters related to the
software failure rates [7], [20], software reliability growth [4],
[5], software ageing [21], failure type distribution [22] and
recovery procedures [20], [23] are presented in Table I.
TABLE I: Controller software failures [7], [20]–[23]

Parameter            Description                      Baseline value
N_bugs               Initial number of active bugs    60
p_debug              Debugging success rate           0.99
λ_bug_detect^-1      Bug detection rate               60 days
μ_debug^-1           Debug rate                       60 days
ϕ_sw_fail^-1         Baseline software failure rate   7 days
λ_sw_age^-1          Rate of software ageing          1 day
λ_age_fail^-1        Ageing failure rate              7 days
p_retry (ok/prob)    Failures recovered by retry      0.15, 0.15
p_restart (ok/prob)  Failures recovered by restart    0.15, 0.70
p_reload (ok/prob)   Failures requiring reload        0.15, 0.15
μ_catch^-1           Catch the exception              1 msec
μ_timeout^-1         Detect hanging process           1 sec
μ_heartbeat^-1       Detect controller crash          10 sec
μ_retry^-1           Retry the operation              0.5 sec
μ_proc_restart^-1    Process restart                  5 min
μ_reload^-1          Restart controller and reload    30 min
The parameters related to the availability of operating
system and computing hardware [7], [19] are summarized in
Table II.
TABLE II: Failures of external components [7], [19]

Parameter        Description                      Baseline value
λ_os_fail^-1     Mean time between OS failures    60 days
p_os_reboot      Success of OS reboot             0.9
μ_os_reboot^-1   OS reboot time                   10 min
μ_os_repair^-1   OS repair time                   1 h
λ_hw_fail^-1     Mean time between HW failures    6 months
μ_hw_replace^-1  HW replace time                  2 hours
μ_hw_repair^-1   HW repair time                   24 hours
N_spare_hw       Spare computing hardware         1
B. Steady State Availability (SSA)
We first evaluate the steady-state availability of the controller.
The steady-state availability of the entire controller system, as
well as the contributions of the different system sub-components,
software (SW), operating system (OS) and computing hardware
(HW), are presented in Table III.
TABLE III: Steady state availability of the controller system
Component Controller SW OS HW
Availability 0.99889 0.99956 0.99981 0.99951
It can be seen from the results that a single controller can
provide availability of only two nines. At least two controllers
are needed to achieve "5-nines" availability. Availability of
the controller software alone was on the same level as the
availability of the operating system and computing hardware
in this case study.
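The "5-nines" claim can be checked with a back-of-the-envelope replication model; note that the independence assumption is optimistic for software replicas, since, as argued in the introduction, Bohrbugs are shared by all copies of the same code:

```python
A_SINGLE = 0.99889  # single-controller steady-state availability (Table III)

def replicated_availability(k):
    """Availability when at least one of k independently failing
    replicas must be up: 1 - (joint probability all k are down)."""
    return 1 - (1 - A_SINGLE) ** k

print(replicated_availability(1))  # roughly two nines
print(replicated_availability(2))  # exceeds five nines under independence
```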
C. Sensitivity analysis
We performed a sensitivity analysis to determine which of
the parameters have the highest impact on the steady state
availability. All the parameters that have an impact on the
steady state availability were varied ±50% of their baseline
values. The parameters, sorted by their impact in decreasing
order, are presented in Fig. 3.
The most important parameters are hardware failure and re-
placement rate, followed by the rate of process restart, success
of OS reboot and software ageing failure rate. The factors with
the least impact are the software failure detection rates and rate
of the retry (shortest software recovery procedure). Note that
the parameters related to the software reliability growth do not
impact the steady state availability.
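The one-at-a-time ±50% sweep used here can be sketched on a deliberately simplified stand-in model; the two parameter names and the MTTF/(MTTF+MTTR) availability formula are illustrative only, not the paper's SAN model:

```python
# Hypothetical two-parameter model, used only to illustrate the
# one-at-a-time +/-50% sensitivity sweep.
BASELINE = {"mttf_hw_hours": 6 * 30 * 24,  # ~6 months MTBF (Table II)
            "mttr_hw_hours": 2.0}          # HW replace time (Table II)

def availability(p):
    return p["mttf_hw_hours"] / (p["mttf_hw_hours"] + p["mttr_hw_hours"])

def sensitivity(param):
    """Spread of the model output when one parameter is varied +/-50%."""
    values = []
    for factor in (0.5, 1.5):
        p = dict(BASELINE)
        p[param] *= factor  # all other parameters stay at baseline
        values.append(availability(p))
    return max(values) - min(values)

# Rank parameters by the availability spread they induce:
ranked = sorted(BASELINE, key=sensitivity, reverse=True)
print(ranked)
```

One-at-a-time sweeps ignore parameter interactions, which is why the figure in the paper should be read as a ranking rather than an exact decomposition of variance.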
D. Failure frequency and downtime contribution
Around 50 failures per year, with a total duration of 9.68
hours on average, are expected. The contribution of different
failure types, in terms of their frequency and the contribution
to the controller downtime is presented in Fig. 4.
It has been observed that software failures are the most
frequent, accounting for 84% of all the failures, but contribute
to only 38% of the controller downtime. On the other hand,
hardware failures represent less than 4% of all the failures,
yet contribute more than 44% of the controller's downtime.
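From the reported annual figures, the implied average outage duration and the overall availability can be cross-checked:

```python
FAILURES_PER_YEAR = 50          # expected failure count from the model
DOWNTIME_HOURS_PER_YEAR = 9.68  # expected total annual downtime

# Average outage duration implied by the two figures above:
mean_outage_min = DOWNTIME_HOURS_PER_YEAR * 60 / FAILURES_PER_YEAR

# Annual downtime translated back to a steady-state availability:
availability = 1 - DOWNTIME_HOURS_PER_YEAR / (365 * 24)

print(mean_outage_min)  # roughly 11.6 minutes per outage
print(availability)     # consistent with the ~0.99889 of Table III
```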
Fig. 3: Sensitivity analysis of the controller steady-state
availability. All parameters are varied in the range of ±50%
of their baseline values from Tables I–III.
E. Downtime distribution
The controller downtime distribution is presented in Fig. 5. It
can be observed that 80% of the failures resulted in a downtime
lower than 10 minutes (shaded area), with a median of 3.5 min.
The relatively short duration of the downtime is due to the high
frequency of software failures, whose recovery procedures
(retry, restart, reload) are much faster than the recovery from
hardware failures. Duration of the controller outages has to
be taken into account when designing the appropriate fault
tolerance mechanism.
F. Software ageing
The sensitivity analysis showed that ageing failure rate
has a big impact on the availability of the controller. Yet,
software ageing rate and ageing failure rate are the most
uncertain parameters, since they depend on many factors, such
as controller utilization rate, platform on which it operates and
software implementation. Therefore, we have explored a wide
Fig. 4: Failure frequency and contribution to controller down-
time for different failure modes.
Fig. 5: Downtime distribution of SDN controller system. 80%
of the failures resulted in downtime lower than 10 minutes
(shaded area).
range of software ageing parameters, varying the ageing rate
between 30 min and 1 week and ageing failure rate between 2
hours and 100 days. The controller availability for different
combinations of the parameters is presented in Fig. 6.
It can be seen that the impact of the ageing failure rates
depends greatly on the rate of ageing. When software ageing
is fast, ageing failures will have much higher impact on the
availability.
G. Software reliability growth
During the life cycle of controller software, the remaining
bugs in the operational software are detected and removed.
This leads to the availability improvement, assuming that new
bugs are not introduced during the software fix. The software
maturity model presented in Fig. 2 is based on the bug track
reports of ONOS and OpenDaylight [4], [5]. The number of
detected and resolved bugs over time in the controller model
has been compared to ONOS Avocet (v1.0), as shown in Fig. 7.
Fig. 6: Impact of the software ageing parameters (ageing rate
and ageing failure rate) on the controller availability.
Fig. 7: Number of detected and resolved bugs over time in
SDN controller model compared to bug track report of ONOS
Avocet (v1.0) [4].
It can be observed that the proposed model provides a good fit
to the commercial controller. For the purpose of this case study,
the trivial and minor bugs were removed from the bug report,
since they do not have an impact on the controller availability.
The effect of software reliability growth on the availability
of a single controller instance can be observed in the transient
behaviour of the proposed model. The impact of the initial
number of bugs and of the debugging success on the controller
availability in the first two years (730 days) of its operation
is presented in Fig. 8. The initial number of residual
bugs has a much higher impact on the controller availability
than the success of debugging. We observe that the steady state
is reached after approximately 400 days for the model with
N_bugs = 60 bugs, while for the model with N_bugs = 600 bugs
it took almost two years. Such reports can be used to
determine the optimal time of a controller software release (for
developers) and the optimal time for adoption of new releases
(for users), based on the desired level of software reliability.
Fig. 8: Availability of SDN controller in the first 2 years
(730 days) of its operation.
VI. CONCLUSION AND FUTURE WORK
In this paper, the failure dynamics of SDN controllers have been
analysed, modelled and evaluated. We have presented the most
typical controller failures, analysed their root causes and the
typical detection and recovery techniques. A comprehensive
model based on the Stochastic Activity Networks (SAN),
including several failure modes (controller software as well as
the external components) is proposed. The controller model
also captures the effects of software ageing and software
reliability growth, which have not been considered so far in
the state-of-the art literature.
We presented a case study to demonstrate the impact of
different failure modes on the controller availability. The
parameters of the model in the case study were based on the
commercial controllers, whenever possible, and the studies on
the systems of similar complexity when the data was not avail-
able. We have shown that a single controller instance is not
sufficient to achieve "5-nines" availability. Sensitivity
analysis has been performed to identify the most important
controller parameters w.r.t. its availability.
The study has shown big differences in the frequency and
the downtime impact of the different failure modes consid-
ered in the model. We have observed that software accounts
for 84% of all the failures, but contributes only one third
of the controller's downtime. The analysis of the downtime
distribution has shown that more than 80% of the failures have
a downtime below 10 minutes, with a median of 3.6 minutes.
We leave it for future work to study the impact of such a
downtime distribution on the network services and the user-
perceived performance. Software ageing has been identified as
one of the most important, yet most uncertain, factors in the
controller's availability. We have studied how the relationship
between the ageing rate and software ageing failures influences
the impact factor of the ageing. We have also observed the
effect of software reliability growth on the availability of the
single controller instances. The proposed software reliability
growth model is based on the ONOS Avocet controller. We leave it
for future work to include other commercially available
controllers.
ACKNOWLEDGMENT
This work has received funding from EU Horizon 2020
research and innovation programme under grant agreement
No 671648 (VirtuWind), COST Action CA15127 Resilient
communication services protecting end-user applications from
disaster-based failures (RECODIS) and CELTIC EUREKA
project SENDATE-PLANETS (Project ID C2015/3-1) and is
partly funded by the German BMBF (Project ID 16KIS0473).
REFERENCES
[1] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh,
S. Venkata, J. Wanderer, J. Zhou, M. Zhu et al., “B4: Experience with
a globally-deployed software defined WAN,” ACM SIGCOMM Computer
Communication Review, vol. 43, no. 4, pp. 3–14, 2013.
[2] J. Vestin, A. Kassler, and J. Akerberg, “Resilient software defined
networking for industrial control networks,” in 2015 10th International
Conference on Information, Communications and Signal Processing
(ICICS). IEEE, 2015, pp. 1–5.
[3] F. J. Ros and P. M. Ruiz, “Five nines of southbound reliability in
software-defined networks,” in Proceedings of the third workshop on
Hot topics in software defined networking. ACM, 2014, pp. 31–36.
ON.Lab, “ONOS: Open Network Operating System,”
http://onosproject.org/, 2017.
[5] Linux Foundation, “Opendaylight.” [Online]. Available:
https://www.opendaylight.org/
[6] T. A. Nguyen, T. Eom, S. An, J. S. Park, J. B. Hong, and D. S. Kim,
“Availability modeling and analysis for software defined networks,” in
Dependable Computing (PRDC), 2015 IEEE 21st Pacific Rim Interna-
tional Symposium on. IEEE, 2015, pp. 159–168.
[7] G. Nencioni, B. E. Helvik, A. J. Gonzalez, P. E. Heegaard, and
A. Kamisinski, “Availability modelling of software-defined backbone
networks,” in Dependable Systems and Networks Workshop, 2016 46th
Annual IEEE/IFIP International Conference on. IEEE, 2016, pp. 105–
112.
[8] F. Longo, S. Distefano, D. Bruneo, and M. Scarpa, “Dependability
modeling of software defined networking,” Computer Networks, vol. 83,
pp. 280–296, 2015.
[9] M. Grottke and K. S. Trivedi, “Fighting bugs: Remove, retry, replicate,
and rejuvenate,” Computer, vol. 40, no. 2, 2007.
[10] K. S. Trivedi, M. Grottke, and E. Andrade, “Software fault mitigation
and availability assurance techniques,” International Journal of System
Assurance Engineering and Management, vol. 1, no. 4, pp. 340–350,
2010.
[11] C. Scott, A. Wundsam, B. Raghavan, A. Panda, A. Or, J. Lai, E. Huang,
Z. Liu, A. El-Hassany, S. Whitlock et al., “Troubleshooting blackbox
sdn control software with minimal causal sequences,” ACM SIGCOMM
Computer Communication Review, vol. 44, no. 4, pp. 395–406, 2015.
[12] N. Ullah and M. Morisio, “An empirical analysis of open source software
defects data through software reliability growth models,” in EUROCON,
2013 IEEE. IEEE, 2013, pp. 460–466.
[13] F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, “Rx: treating bugs as
allergies—a safe method to survive software failures,” in Acm sigops
operating systems review, vol. 39, no. 5. ACM, 2005, pp. 235–248.
[14] W. H. Sanders and J. F. Meyer, “Stochastic activity networks: Formal
definitions and concepts,” in Lectures on Formal Methods and Perfor-
mance Analysis. Springer, 2001, pp. 315–343.
[15] D. Daly, D. D. Deavours, J. M. Doyle, P. G. Webster, and W. H.
Sanders, “Möbius: An extensible tool for performance and dependability
modeling,” in International Conference on Modelling Techniques and
Tools for Computer Performance Evaluation. Springer, 2000, pp. 332–
336.
[16] Z. Jelinski and P. B. Moranda, “Software reliability research,” Statistical
Computer Performance Evaluation, pp. 465–484, 1972.
[17] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software rejuvena-
tion: Analysis, module and applications,” in Fault-Tolerant Computing,
1995. FTCS-25. Digest of Papers., Twenty-Fifth International Sympo-
sium on. IEEE, 1995, pp. 381–390.
[18] N. Hayashibara, X. Defago, R. Yared, and T. Katayama, “The φ
accrual failure detector,” in Reliable Distributed Systems, 2004.
Proceedings of the 23rd IEEE International Symposium on. IEEE,
2004, pp. 66–78.
[19] D. S. Kim, F. Machida, and K. S. Trivedi, “Availability modeling and
analysis of a virtualized system,” in Dependable Computing, 2009.
PRDC’09. 15th IEEE Pacific Rim International Symposium on. IEEE,
2009, pp. 365–371.
[20] S. A. Vilkomir, D. L. Parnas, V. B. Mendiratta, and E. Murphy,
“Availability evaluation of hardware/software systems with several re-
covery procedures,” in 29th Annual International Computer Software
and Applications Conference (COMPSAC’05), vol. 1. IEEE, 2005, pp.
473–478.
[21] W. Xie, Y. Hong, and K. S. Trivedi, “Software rejuvenation policies for
cluster systems under varying workload,” in Dependable Computing,
2004. Proceedings. 10th IEEE Pacific Rim International Symposium on.
IEEE, 2004, pp. 122–129.
[22] S. Chandra and P. M. Chen, “Whither generic recovery from application
faults? a fault study using open-source software,” in Dependable Systems
and Networks, 2000. DSN 2000. Proceedings International Conference
on. IEEE, 2000, pp. 97–106.
[23] V. B. Mendiratta, “Reliability analysis of clustered computing systems,”
in Software Reliability Engineering, 1998. Proceedings. The Ninth
International Symposium on. IEEE, 1998, pp. 268–272.