Characterization of Failure Dynamics
in SDN Controllers
Petra Vizarreta∗, Poul Heegaard†, Bjarne Helvik†, Wolfgang Kellerer∗, and Carmen Mas Machuca∗
∗Chair of Communication Networks, Technical University of Munich
†Department of Information Security and Communication Technology, Norwegian University of Science and Technology
petra.vizarreta@lkn.ei.tum.de, {poulh,bjarne}@ntnu.no, {wolfgang.kellerer,cmas}@tum.de
Abstract—With Software Defined Networking (SDN), the control plane logic of forwarding devices, switches and routers, is extracted and moved to an entity called the SDN controller, which acts as a broker between the network applications and the physical network infrastructure. Failures of the SDN controller inhibit the network's ability to respond to new application requests and to react to events coming from the physical network. Despite the strong impact that the controller has on the network performance as a whole, a comprehensive study of its failure dynamics is still missing in the state-of-the-art literature. The goal of this paper is to analyse, model and evaluate the impact that different controller failure modes have on its availability. A model in the formalism of Stochastic Activity Networks (SAN) is proposed and applied to a case study of a hypothetical controller based on commercial controller implementations. In the case study we show how the proposed model can be used to estimate the controller steady-state availability, quantify the impact of different failure modes on controller outages, and assess the effects of software ageing and of software reliability growth on the transient behaviour.
I. INTRODUCTION
Software Defined Networking (SDN) is a novel network ar-
chitecture concept, where inherently distributed control plane
logic of the forwarding devices, switches and routers, is extracted and moved to an entity called the SDN controller. SDN controllers provide an integrated interface towards the forwarding devices,
which significantly simplifies the network management and
augments the network programmability, as illustrated in Fig. 1.
The controller also has a global overview of the network
state and can react faster to events from the network environment, such as link congestion or forwarding device failures, leading to an improved network
performance [1]. Despite all the benefits it offers, controller availability is still a major concern and a key blocker for the widespread adoption of SDN in commercial telecom and industrial networks [2]. Ros et al. [3] showed that more than one controller must be deployed in the network to achieve "5-nines" availability. However, in the case of software components such as SDN controllers, simple replication is an inefficient way to improve availability, since the root cause of a failure is often shared among the replicas, e.g., a bug in the code. Different controller failure modes have to be treated differently, since they occur with different frequencies and have a different impact on controller outages. Understanding the failure dynamics of SDN controllers is an important step towards being able to predict the performance of the whole system.
Fig. 1: SDN architecture. Network applications (e.g., network monitoring, traffic engineering, bandwidth on demand) interact with the physical network through the SDN controllers.
The goal of this paper is to analyse, model and evaluate the impact of different failure modes on the
controller’s availability. We analyse the fault reports of the two
major open source controllers, ONOS [4] and OpenDaylight
[5], and describe their corresponding detection and recovery
schemes. A comprehensive model of the controller in the formalism of Stochastic Activity Networks (SAN) is provided to
capture all relevant factors and their interdependencies. A case
study of a hypothetical controller, based on the parameters of
commercial controllers, is used to demonstrate how different controller availability attributes can be evaluated with such a model.
The rest of the paper is organized as follows. Section II
provides an overview of the related work on dependability
modelling in SDN. In Section III, different failure modes of an SDN controller are presented and classified according to a common terminology of software reliability. A dynamic model of the SDN controller, based on Stochastic Activity Networks, is presented in Section IV, while Section V demonstrates possible applications of the proposed model in a case study. We conclude the paper with a summary and an outlook on future work.
II. RELATED WORK
Despite the strong impact that the controller has on the network performance as a whole, a comprehensive study of its failure dynamics is still missing in the state-of-the-art literature.
Ros et al. [3] have shown that, in order to achieve five-nines availability, the forwarding devices are required to connect to at least two controllers, for all wide area networks included in their study. The controller availability was treated as a static Weibull-distributed random variable, and different failure modes were not considered. The studies in [6], [7] distinguish between permanent and transient hardware and software failures. Nguyen et al. [6] used hierarchical models based on reliability graphs and stochastic reward nets, whereas Nencioni et al. [7] based their model on Markov chains to capture the dependencies between the data and control planes. However, only a steady-state availability analysis was presented, and the impact of different software failure modes and their corresponding recovery procedures was neglected. The impact of the controller workload on its failure rate was studied by Longo et al. [8]. When one of the controllers fails, the load of the remaining controllers, modelled as the number of devices assigned to each controller, changes accordingly. Such changes in operating conditions are modelled as continuous phase-type distributions. Although the model captures well the accumulative nature of a certain class of software failures, deterministic and intermittent failures were neglected.
In the following section, we provide a thorough classification of the different controller failures, their corresponding recovery mechanisms, and their impact on the system performance.
III. CLASSIFICATION OF CONTROLLER FAILURES
In SDN-based networks the entire control plane logic re-
sides in the controllers. Controller failures obstruct the network's ability to serve new requests coming from the network applications and to react to events coming from the physical
network, such as routing of unknown packet flows or re-
routing of the traffic after node or link failures. Different
failure modes of the controller occur with different frequencies and have different recovery times, and as such have a different impact on the services provided to other layers [9].
In this section, we provide an overview of the different failure
modes according to the standard terminology of software
availability [10]. All examples in the following sections are
taken from actual bug reports of the corresponding open source
SDN controllers [4], [5], [11].
A. Deterministic failures (Bohrbugs)
Deterministic failures, also called Bohrbugs in the context
of software availability, represent the class of faults that are
easily reproducible and manifest themselves consistently under
the specific input sequence that led to a failure. This kind of fault can be activated by a rare but valid input sequence, or by accessing a rarely used code path that has not been tested thoroughly in the test and debugging phase. Examples
of Bohrbugs in the commercial controllers are:
• FlowEntry.life() is in milliseconds, but gets created in seconds
• Path Computation Element (PCE) able to create tunnel with negative bandwidth
• Java exception during attempt to create intent with bandwidth/latency
• Random selection of VLAN ids in the Intent Framework
• Topology crash when using big topology (e.g., 20x20 torus)
After a system crash, the whole application must be reloaded from the last check-pointed state, while in the case of a hanging failure only a single process must be restarted. System crashes are reported, confirmed and resolved based on their criticality and priority. Through the debugging process the software is continuously improved, a concept known as software reliability growth [12].
B. Non-deterministic failures (Mandelbugs)
Non-deterministic failures occur due to a specific combination of error conditions and their relative timing, which makes them extremely difficult to reproduce and makes the exact root cause hard to identify, hence the name Mandelbugs [10]. Such failures are usually related to timing and synchronization issues, such as a data race after a specific sequence of failures, simultaneous access to the shared data store, etc. Examples of Mandelbugs in the commercial controllers are:
• Distributed database locking in ONOS
• Concurrency Issue in NetconfDevice
• Floodlight did not discover an inter-switch link; LLDP packets were dropped due to interference with data plane traffic
• Race condition when adding an RPC implementation with an output
This class of failures can often be mitigated by retrying the operation in a different execution environment, e.g., by changing the scheduling timers of the concurrent processes [9], [13].
C. Faults related to software ageing¹
Ageing-related faults are a subset of Mandelbugs, and reflect
the gradual degradation of the system performance, due to
memory leaks, data corruption, accumulation of numerical
errors, etc. [10]. A common characteristic of such failures is that they accumulate over time and can be prevented by an occasional restart or reboot and a cleaning of the internal system state. Examples of SDN controller failures related to software ageing are:
• Flows still reported in the oper data store after they have been deleted from both config and network
• A thread calling a default OVSDB configuration, DefaultOvsdbClient.insertConfig(), is blocked permanently if the OVSDB connection is closed while configuring OVS
• pce-delete-path is not working; not able to delete all the logs when multiple tunnels are created
• Timed out flows not removed from the operational space
¹Note that in the literature the term software ageing is sometimes also used to describe software fitness on a longer time scale, i.e., how the software fails to meet new system requirements as they change over time.
D. External failures
An SDN controller is a software component that needs an
operating system and the supporting hardware to run on. The
failures of the operating system and computing hardware do
not depend on the controller software, but must be taken into
account when modelling the availability of the whole system.
IV. MODEL
Stochastic Activity Networks (SAN) are a stochastic extension of Petri Nets and represent a powerful tool for dependability modelling [14]. In the SAN formalism, the combination of markings in the places represents the model state, and the activities, which can be timed or instantaneous, change the system state upon firing. Tools such as Möbius automatically translate SAN models to Markov chains [15] and solve them using numerical methods. The proposed SAN model of the controller is presented in Fig. 2. The model captures the effects of
software reliability growth, software ageing, different software
failure modes, as well as the external failures of the operating
system (OS) and the computing hardware (HW).
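As a minimal illustration of the numerical step that such tools automate, the sketch below enumerates a toy state space and solves the resulting continuous-time Markov chain for its steady-state distribution; the two-state up/down chain and its rates are illustrative only and are not the proposed controller model.

```python
import numpy as np

# Toy illustration of the numerical step automated by tools such as Mobius:
# once the reachable markings of a SAN are enumerated, the model becomes a
# continuous-time Markov chain (CTMC) whose steady-state distribution pi
# solves pi * Q = 0 with sum(pi) = 1. The chain below is a simple up/down
# model with assumed rates, not the controller SAN of Fig. 2.

failure_rate = 1.0 / (7 * 24)  # assumed: one failure every 7 days (per hour)
repair_rate = 1.0 / 0.5        # assumed: 30 min mean recovery (per hour)

# Generator matrix Q; state 0 = up, state 1 = down.
Q = np.array([[-failure_rate, failure_rate],
              [repair_rate, -repair_rate]])

# Replace one balance equation with the normalization condition and solve.
A = np.vstack([Q.T[:-1], np.ones(len(Q))])
b = np.zeros(len(Q))
b[-1] = 1.0
pi = np.linalg.solve(A, b)

print(f"steady-state availability (probability of the up state): {pi[0]:.6f}")
```

Conceptually, the same linear system, with a much larger generator matrix, is behind the steady-state results reported in Section V.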
A. Controller software failures
We assume that the instantaneous software failure rate de-
pends on the software maturity and the state of the controller.
1) Software maturity model: When the controller software
is introduced to the market, it might still contain bugs that have not been detected or resolved during the software testing
phase. During the operational phase, end users report problems, which are verified and removed by the developers; this leads to software reliability growth.
Fig. 2: The SDN controller modelled as a Stochastic Activity Network (SAN).
In the case of the open
source SDN controllers, such as ONOS and OpenDaylight, this
process is transparent and logged in their corresponding bug
trackers. From such bug reports, parameters like the number
of resolved bugs in the stable release, the detection rate of the
new bugs and the average time to debug can be derived. It can
also be noted from the reports that some bugs are reopened
several times, suggesting that the debugging process is not
always successful. We assume that a finite number N_bugs of residual bugs is initially present in the controller software code. The bugs are detected with rate λ_detect per bug and are resolved with rate μ_debug. The success probability of the debugging process is p_debug. If a bug is not successfully resolved, it is returned to the pool of detected bugs.
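To give a feeling for the dynamics of this sub-model, the following sketch integrates the mean number of bugs in the active, detected and resolved pools over time, using the Table I baseline rates; it is a rough mean-value approximation under the stated exponential assumptions, not the exact SAN solution.

```python
# Hedged mean-value sketch of the software maturity sub-model described
# above: active bugs are detected at rate lambda_detect per bug, detected
# bugs are debugged at rate mu_debug, and a fraction (1 - p_debug) of
# debugging attempts fails and returns the bug to the detected pool.
# The numbers follow the Table I baselines and are illustrative.

N_BUGS = 60
LAMBDA_DETECT = 1.0 / 60.0   # per bug per day
MU_DEBUG = 1.0 / 60.0        # per detected bug per day
P_DEBUG = 0.99

def expected_bug_counts(t_days, dt=0.1):
    """Forward-Euler integration of the mean number of bugs in each pool."""
    active, detected, resolved = float(N_BUGS), 0.0, 0.0
    t = 0.0
    while t < t_days:
        det_flow = LAMBDA_DETECT * active * dt
        fix_flow = P_DEBUG * MU_DEBUG * detected * dt  # net successful fixes
        active -= det_flow
        detected += det_flow - fix_flow
        resolved += fix_flow
        t += dt
    return active, detected, resolved

for day in (30, 180, 365, 730):
    a, d, r = expected_bug_counts(day)
    print(f"day {day:4d}: active={a:5.1f} detected={d:5.1f} resolved={r:5.1f}")
```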
2) Controller state: When the controller is initiated, restarted or reloaded, it starts from the state sw_ok. The software failure rate in this state depends on the number of residual bugs, represented as the number of markings in the place active_bugs, and on the baseline software failure rate, as in the Jelinski-Moranda model [16]. According to this model, the failure rate after i corrected bugs is:
\[ \lambda_{ok}(t_i) = \varphi_{sw\_fail} \, \big(N_{bugs} - (i-1)\big) \]
During continuous operation, software ageing effects accumulate and the controller performance degrades. A common way to model this effect is to assume that the risk of failure increases after a certain resource utilization threshold has been reached [17]. The time to reach this threshold (called the application's base longevity interval) depends on the controller load. In the model, software ageing is denoted as a transition from the state sw_ok to sw_prob. The rate of ageing is denoted as λ_sw_age, and the failure rate due to the ageing process is λ_age_fail.
In addition to the ageing-related failures, failures due to unresolved bugs may still be activated. Since we assume that both failure mechanisms have negative exponential distributions, the combined failure process may be expressed as a single failure process with the rate:
\[ \lambda_{prob}(t) = \lambda_{ok}(t) + \lambda_{age\_fail} \]
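The two state-dependent rates can be written down directly; in the sketch below the per-bug rate φ_sw_fail and the ageing failure rate are set to the Table I baselines and should be read as illustrative values.

```python
# Sketch of the state-dependent software failure rates defined above.
# phi_sw_fail is the per-bug baseline failure rate of the Jelinski-Moranda
# model and lambda_age_fail the additional rate in the degraded state
# sw_prob; the numerical values below are illustrative assumptions.

PHI_SW_FAIL = 1.0 / 7.0      # baseline failure rate contribution per bug, 1/day
LAMBDA_AGE_FAIL = 1.0 / 7.0  # ageing-related failure rate, 1/day
N_BUGS = 60                  # initial number of residual bugs

def lambda_ok(i: int, n_bugs: int = N_BUGS, phi: float = PHI_SW_FAIL) -> float:
    """Failure rate in state sw_ok after i bugs have been corrected."""
    return phi * (n_bugs - (i - 1))

def lambda_prob(i: int) -> float:
    """Failure rate in the ageing-degraded state sw_prob: both failure
    mechanisms are exponential, so their rates simply add."""
    return lambda_ok(i) + LAMBDA_AGE_FAIL

print(lambda_ok(1), lambda_prob(1))    # rates with all residual bugs present
print(lambda_ok(60), lambda_prob(60))  # rates after 60 corrections
```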
3) Detection and recovery from software failures: We differentiate between three types of software failures, depending on their corresponding recovery process. Transient errors, such as synchronization and timing issues, can often be resolved by retrying an operation in a different execution environment [13]. Hanging processes can be detected by the controller software itself and are resolved by a process restart. Both OpenDaylight [5] and ONOS [4] implement a ϕ-accrual failure detector [18] based on heartbeats for the detection of controller crashes. After a controller crash, the whole software application must be reloaded from the last saved checkpoint (system snapshot).
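For illustration, the sketch below captures the ϕ-accrual idea referenced above [18]: the detector keeps a window of heartbeat inter-arrival times, fits a simple distribution (a normal one here), and converts the time since the last heartbeat into a suspicion level ϕ. The window size, the distributional assumption and the threshold are illustrative choices, not the actual ONOS or OpenDaylight settings.

```python
import math
from collections import deque

# Minimal sketch of a phi-accrual style failure detector [18]: the suspicion
# level phi = -log10(P(heartbeat still to come)) grows with the time elapsed
# since the last heartbeat, relative to the observed inter-arrival history.
# Window size, normal model and threshold are illustrative assumptions.

class PhiAccrualDetector:
    def __init__(self, window: int = 100, threshold: float = 8.0):
        self.intervals = deque(maxlen=window)
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now: float) -> None:
        # Record the inter-arrival time of the newly received heartbeat.
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)
        elapsed = now - self.last_heartbeat
        # Probability that the next heartbeat arrives later than `elapsed`,
        # under the assumed normal model of inter-arrival times.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))

    def suspect(self, now: float) -> bool:
        return self.phi(now) > self.threshold
```

A monitor would call heartbeat() on every received heartbeat and poll suspect() periodically to decide when the controller should be declared crashed.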
The distribution between the different software failure modes depends on the controller state. When the controller is in the highly robust state (sw_ok), the majority of failures are expected to lead to a crash, while in the failure-prone state (sw_prob) the majority of failures are expected to be transient and resolved by a restart.
B. External failures
The SAN models of the operating system and the computing hardware have been adapted from [19], [7]. The OS fails with rate λ_os_fail. Some OS failures can be successfully resolved by an OS reboot, while others require an OS repair, involving a complete OS reload. The reboot rate is μ_os_reboot and the repair rate is μ_os_repair. The success probability of the OS reboot is p_os_reboot. The HW fails with rate λ_hw_fail and is replaced with a spare component, if one is available. Initially, N_spare_hw spare components are available. The hardware replacement rate is μ_hw_replace and the hardware repair rate is μ_hw_repair.
V. CASE STUDY
Next, we present a case study of an SDN controller whose model is based on realistic parameters. We show how the proposed model can be used to estimate the controller steady-state availability, identify the most relevant parameters, analyse the downtime distribution, and study the impact of software reliability growth on the transient behaviour.
A. Model parameters
The model parameters are based on actual SDN controllers or, when data was not available, on studies of software components of similar complexity. Parameters related to the
software failure rates [7], [20], software reliability growth [4],
[5], software ageing [21], failure type distribution [22] and
recovery procedures [20], [23] are presented in Table I.
TABLE I: Controller software failures [7], [20]–[23]
Parameter              Description                       Baseline value
N_bugs                 Initial number of active bugs     60
p_debug                Debugging success rate            0.99
λ_bug_detect^-1        Bug detection rate                60 days
μ_debug^-1             Debug rate                        60 days
φ_sw_fail^-1           Baseline software failure rate    7 days
λ_sw_age^-1            Rate of software ageing           1 day
λ_age_fail^-1          Ageing failure rate               7 days
p_retry (ok/prob)      Failures recovered by retry       0.15, 0.15
p_restart (ok/prob)    Failures recovered by restart     0.15, 0.70
p_reload (ok/prob)     Failures requiring reload         0.15, 0.15
μ_catch^-1             Catch the exception               1 msec
μ_timeout^-1           Detect hanging process            1 sec
μ_heartbeat^-1         Detecting controller crash        10 sec
μ_retry^-1             Retry the operation               0.5 sec
μ_proc_restart^-1      Process restart                   5 min
μ_reload^-1            Restart controller and reload     30 min
The parameters related to the availability of the operating system and the computing hardware [7], [19] are summarized in Table II.
TABLE II: Failures of external components [7], [19]
Parameter          Description                      Baseline value
λ_os_fail^-1       Mean time between OS failures    60 days
p_os_reboot        Success of OS reboot             0.9
μ_os_reboot^-1     OS reboot time                   10 min
μ_os_repair^-1     OS repair time                   1 h
λ_hw_fail^-1       Mean time between HW failures    6 months
μ_hw_replace^-1    HW replace time                  2 hours
μ_hw_repair^-1     HW repair time                   24 hours
N_spare_hw         Spare computing hardware         1
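For numerical experiments with the model, it is convenient to keep the baseline values of Tables I and II in machine-readable form. The encoding below (all mean times in hours, six months approximated as 180 days) is an assumption of this sketch rather than part of the published model.

```python
# Baseline parameters of Tables I and II, expressed as mean times in hours.
# The dictionary layout is an assumption of this sketch; the values are the
# baseline values quoted in the tables.

baseline_hours = {
    # software model (Table I)
    "bug_detect": 60 * 24,     # mean time to detect a residual bug
    "debug": 60 * 24,          # mean time to debug a detected bug
    "sw_fail": 7 * 24,         # baseline software failure time
    "sw_age": 1 * 24,          # mean time until ageing degrades the state
    "age_fail": 7 * 24,        # mean time to an ageing-related failure
    "catch": 1e-3 / 3600,      # catch the exception
    "timeout": 1.0 / 3600,     # detect a hanging process
    "heartbeat": 10.0 / 3600,  # detect a controller crash
    "retry": 0.5 / 3600,       # retry the operation
    "proc_restart": 5.0 / 60,  # process restart
    "reload": 0.5,             # controller reload
    # external components (Table II)
    "os_fail": 60 * 24,
    "os_reboot": 10.0 / 60,
    "os_repair": 1.0,
    "hw_fail": 6 * 30 * 24,    # 6 months, approximated as 180 days
    "hw_replace": 2.0,
    "hw_repair": 24.0,
}

probabilities = {"p_debug": 0.99, "p_os_reboot": 0.9,
                 "p_retry_ok": 0.15, "p_restart_ok": 0.15, "p_reload_ok": 0.15,
                 "p_retry_prob": 0.15, "p_restart_prob": 0.70, "p_reload_prob": 0.15}

counts = {"n_bugs": 60, "n_spare_hw": 1}

# Rates (per hour) are simply the reciprocals of the mean times, e.g.:
rate_sw_fail_per_h = 1.0 / baseline_hours["sw_fail"]
```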
B. Steady State Availability (SSA)
The steady-state availability of the entire controller system, as well as the contributions of the different sub-components, i.e., software (SW), operating system (OS) and computing hardware (HW), are presented in Table III.
TABLE III: Steady state availability of the controller system
Component Controller SW OS HW
Availability 0.99889 0.99956 0.99981 0.99951
It can be seen from the results that a single controller can provide an availability of only two nines; at least two controllers are needed to achieve "5-nines" availability. In this case study, the availability of the controller software alone is on the same level as the availability of the operating system and of the computing hardware.
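A back-of-the-envelope check of the replication argument, assuming independent replicas of which at least one must be up; as noted in the introduction, software faults shared among replicas weaken the independence assumption, so this is an optimistic bound.

```python
# Availability of a cluster of n independent controller replicas of which
# at least one must be up, based on the single-instance value of Table III.
# Replica independence is an optimistic simplification (shared software
# faults are ignored here).

A_SINGLE = 0.99889  # steady-state availability of one controller (Table III)

for n in (1, 2, 3):
    cluster_availability = 1.0 - (1.0 - A_SINGLE) ** n
    print(f"{n} controller(s): {cluster_availability:.7f}")

# With n = 2 the unavailability drops to (1 - 0.99889)^2, about 1.2e-6,
# i.e. already beyond the "5-nines" target under this assumption.
```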
C. Sensitivity analysis
We performed a sensitivity analysis to determine which of the parameters have the highest impact on the steady-state availability. All the parameters that affect the steady-state availability were varied by ±50% of their baseline values. The parameters, sorted by their impact in decreasing order, are presented in Fig. 3.
The most important parameters are the hardware failure and replacement rates, followed by the process restart rate, the success probability of the OS reboot, and the software ageing failure rate. The factors with the least impact are the software failure detection rates and the retry rate (the shortest software recovery procedure). Note that the parameters related to software reliability growth do not impact the steady-state availability.
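The one-at-a-time sweep behind Fig. 3 is easy to reproduce once an availability function is at hand. In the sketch below, availability(params) is only a placeholder (two independent components in series) standing in for the SAN model solved in Möbius, and the parameter subset is illustrative.

```python
# Generic one-at-a-time sensitivity sweep: each parameter is varied to 50%
# and 150% of its baseline while the others are held fixed, and the spread
# of the resulting availability is recorded. The availability() function is
# a placeholder for solving the SAN model, not the model itself.

def availability(params: dict) -> float:
    # Placeholder: two independent components in series (NOT the SAN model).
    a_sw = params["mttf_sw_h"] / (params["mttf_sw_h"] + params["mttr_sw_h"])
    a_hw = params["mttf_hw_h"] / (params["mttf_hw_h"] + params["mttr_hw_h"])
    return a_sw * a_hw

baseline = {"mttf_sw_h": 7 * 24, "mttr_sw_h": 0.5,
            "mttf_hw_h": 6 * 30 * 24, "mttr_hw_h": 24.0}

impact = {}
for name, value in baseline.items():
    results = [availability({**baseline, name: value * factor})
               for factor in (0.5, 1.5)]
    impact[name] = max(results) - min(results)

for name, spread in sorted(impact.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} availability spread: {spread:.6f}")
```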
D. Failure frequency and downtime contribution
Around 50 failures per year are expected, with a total duration of 9.68 hours on average. The contribution of the different failure types, in terms of their frequency and their share of the controller downtime, is presented in Fig. 4.
Software failures are the most frequent, accounting for 84% of all failures, but contribute only 38% of the controller downtime. On the other hand, while hardware failures represent less than 4% of all failures, they contribute more than 44% of the controller's downtime.
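These numbers are consistent with the steady-state results, as the following quick check shows (one calendar year taken as 8766 hours).

```python
# Quick consistency check of the figures above: 9.68 hours of cumulative
# downtime per year matches the steady-state availability of Table III.

downtime_h_per_year = 9.68
hours_per_year = 8766  # 365.25 days

availability = 1.0 - downtime_h_per_year / hours_per_year
mean_outage_min = downtime_h_per_year * 60 / 50  # ~50 failures per year

print(f"implied availability: {availability:.5f}")      # ~0.99890
print(f"mean outage duration: {mean_outage_min:.1f} min")  # ~11.6 min
```

The implied mean outage of roughly 11.6 minutes is much longer than the median discussed in the next subsection, which reflects the long tail contributed by the infrequent but slow hardware recoveries.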
Fig. 3: Sensitivity analysis of the controller steady-state availability. All parameters are varied in the range of ±50% of their baseline values from Tables I–III.
E. Downtime distribution
The controller downtime distribution is presented in Fig. 5. It can be observed that 80% of the failures result in a downtime lower than 10 minutes (shaded area), with a median of 3.5 min. The relatively short duration of the downtime is due to the high frequency of software failures, whose recovery procedures (retry, restart, reload) are much faster than the recovery from hardware failures. The duration of controller outages has to be taken into account when designing an appropriate fault tolerance mechanism.
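To illustrate how such a skewed distribution arises from the recovery times of Tables I and II, the following Monte Carlo sketch samples outage durations from an assumed mixture of recovery mechanisms. The mixture weights are rough illustrative guesses loosely guided by the failure-frequency split of Fig. 4, not the exact output of the SAN model, so the resulting percentiles only qualitatively resemble Fig. 5.

```python
import random

# Illustrative Monte Carlo sketch of an outage-duration distribution: each
# outage is assigned a recovery mechanism and an exponentially distributed
# duration with the mean taken from Tables I and II. The mixture weights
# are assumptions of this sketch.

random.seed(1)

# (mean duration in minutes, assumed weight)
recovery_mix = [
    (0.5 / 60, 0.25),   # retry             (0.5 s)
    (5.0,      0.45),   # process restart   (5 min)
    (30.0,     0.14),   # controller reload (30 min)
    (10.0,     0.10),   # OS reboot         (10 min)
    (60.0,     0.02),   # OS repair         (1 h)
    (120.0,    0.04),   # HW replacement    (2 h)
]

samples = []
for _ in range(100_000):
    r, acc = random.random(), 0.0
    for mean_min, weight in recovery_mix:
        acc += weight
        if r <= acc:
            samples.append(random.expovariate(1.0 / mean_min))
            break

samples.sort()
print(f"median downtime:    {samples[len(samples) // 2]:.1f} min")
print(f"80th percentile:    {samples[int(0.8 * len(samples))]:.1f} min")
print(f"share under 10 min: {sum(s < 10 for s in samples) / len(samples):.2f}")
```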
F. Software ageing
The sensitivity analysis showed that the ageing failure rate has a big impact on the availability of the controller. Yet, the software ageing rate and the ageing failure rate are the most uncertain parameters, since they depend on many factors, such as the controller utilization, the platform on which the controller operates and the software implementation. Therefore, we have explored a wide range of software ageing parameters, varying the ageing rate between 30 min and 1 week and the ageing failure rate between 2 hours and 100 days. The controller availability for different combinations of the parameters is presented in Fig. 6. It can be seen that the impact of the ageing failure rate depends greatly on the rate of ageing: when software ageing is fast, ageing failures have a much higher impact on the availability.
Fig. 4: Failure frequency and contribution to controller downtime for different failure modes.
Fig. 5: Downtime distribution of the SDN controller system. 80% of the failures resulted in a downtime lower than 10 minutes (shaded area).
G. Software reliability growth
During the life cycle of the controller software, the remaining bugs in the operational software are detected and removed. This leads to an availability improvement, assuming that no new bugs are introduced by the software fixes. The software maturity model presented in Fig. 2 is based on the bug track reports of ONOS and OpenDaylight [4], [5]. The number of detected and resolved bugs over time in the controller model has been compared to ONOS Avocet (v1.0), as shown in Fig. 7.
Fig. 6: Impact of the software ageing parameters (ageing rate
and ageing failure rate) on the controller availability.
Fig. 7: Number of detected and resolved bugs over time in the SDN controller model, compared to the bug track report of ONOS Avocet (v1.0) [4].
It can be observed that the proposed model provides a good fit to the commercial controller. For the purpose of this case study, trivial and minor bugs were removed from the bug report, since they do not have an impact on the controller availability.
The effect of software reliability growth on the availability of a single controller instance can be observed in the transient behaviour of the proposed model. The impact of the initial number of bugs and of the debugging success on the controller availability in the first two years (730 days) of its operation is presented in Fig. 8. The initial number of residual bugs has a much higher impact on the controller availability than the success of debugging. We observe that the steady state is reached after approximately 400 days for the model with N_bugs = 60 bugs, while for the model with N_bugs = 600 bugs it takes almost two years. Such reports can be used to determine the optimal time for a controller software release (for developers) and the optimal time for the adoption of a new release (for users), based on the desired level of software reliability.
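As a simple instance of such a release-timing criterion, the expected number of still-undetected residual bugs decays exponentially with the per-bug detection rate of Table I; the sketch below computes when it falls below an assumed target.

```python
import math

# Sketch of a reliability-growth based release/adoption criterion: with a
# per-bug detection rate lambda_detect, the expected number of still
# undetected residual bugs decays as N * exp(-lambda_detect * t).
# The target threshold below is an assumption of this sketch.

LAMBDA_DETECT = 1.0 / 60.0   # per bug per day (Table I baseline)
TARGET_RESIDUAL = 5.0        # assumed acceptable number of undetected bugs

def days_until_target(n_bugs: int, target: float = TARGET_RESIDUAL) -> float:
    """Time until the expected undetected-bug count drops below `target`."""
    return math.log(n_bugs / target) / LAMBDA_DETECT

for n in (60, 600):
    print(f"N_bugs = {n:4d}: ~{days_until_target(n):.0f} days until "
          f"{TARGET_RESIDUAL:.0f} expected undetected bugs remain")
```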
Fig. 8: Availability of SDN controller in the first 2 years
(730 days) of its operation.
VI. CONCLUSION AND FUTURE WORK
In this paper, the failure dynamics of SDN controllers have been analysed, modelled and evaluated. We have presented the most typical controller failures and analysed their root causes and the typical detection and recovery techniques. A comprehensive model based on Stochastic Activity Networks (SAN), including several failure modes (of the controller software as well as of the external components), has been proposed. The controller model also captures the effects of software ageing and software reliability growth, which have not been considered so far in the state-of-the-art literature.
We presented a case study to demonstrate the impact of different failure modes on the controller availability. The parameters of the model in the case study were based on commercial controllers whenever possible, and on studies of systems of similar complexity when the data was not available. We have shown that a single controller instance is not sufficient to achieve "5-nines" availability. A sensitivity analysis has been performed to identify the most important controller parameters with respect to its availability.
The study has shown big differences between the failure modes considered in the model, both in their frequency and in their contribution to the downtime. We have observed that software accounts for 84% of all failures, but contributes only about one third of the controller's downtime. The analysis of the downtime distribution has shown that more than 80% of the failures have a downtime below 10 minutes, with a median of 3.6 minutes. We leave the study of the impact of such a downtime distribution on the network services and the user-perceived performance for future work. Software ageing has been identified as one of the most important, and yet most uncertain, factors in the controller's availability. We have studied how the relationship between the ageing rate and the ageing failure rate influences the overall impact of ageing. We have also observed the effect of software reliability growth on the availability of a single controller instance. The proposed software reliability growth model is based on the ONOS Avocet controller; we leave the inclusion of other commercially available controllers for future work.
ACKNOWLEDGMENT
This work has received funding from EU Horizon 2020
research and innovation programme under grant agreement
No 671648 (VirtuWind), COST Action CA15127 Resilient
communication services protecting end-user applications from
disaster-based failures (RECODIS) and CELTIC EUREKA
project SENDATE-PLANETS (Project ID C2015/3-1) and is
partly funded by the German BMBF (Project ID 16KIS0473).
REFERENCES
[1] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh,
S. Venkata, J. Wanderer, J. Zhou, M. Zhu et al., “B4: Experience with
a globally-deployed software defined wan,” ACM SIGCOMM Computer
Communication Review, vol. 43, no. 4, pp. 3–14, 2013.
[2] J. Vestin, A. Kassler, and J. Akerberg, “Resilient software defined
networking for industrial control networks,” in 2015 10th International
Conference on Information, Communications and Signal Processing
(ICICS). IEEE, 2015, pp. 1–5.
[3] F. J. Ros and P. M. Ruiz, “Five nines of southbound reliability in
software-defined networks,” in Proceedings of the third workshop on
Hot topics in software defined networking. ACM, 2014, pp. 31–36.
[4] ON.Lab, “ONOS: Open Network Operating System,” http://onosproject.org/, 2017.
[5] Linux Foundation, “OpenDaylight.” [Online]. Available: https://www.opendaylight.org/
[6] T. A. Nguyen, T. Eom, S. An, J. S. Park, J. B. Hong, and D. S. Kim,
“Availability modeling and analysis for software defined networks,” in
Dependable Computing (PRDC), 2015 IEEE 21st Pacific Rim Interna-
tional Symposium on. IEEE, 2015, pp. 159–168.
[7] G. Nencioni, B. E. Helvik, A. J. Gonzalez, P. E. Heegaard, and
A. Kamisinski, “Availability modelling of software-defined backbone
networks,” in Dependable Systems and Networks Workshop, 2016 46th
Annual IEEE/IFIP International Conference on. IEEE, 2016, pp. 105–
112.
[8] F. Longo, S. Distefano, D. Bruneo, and M. Scarpa, “Dependability
modeling of software defined networking,” Computer Networks, vol. 83,
pp. 280–296, 2015.
[9] M. Grottke and K. S. Trivedi, “Fighting bugs: Remove, retry, replicate,
and rejuvenate,” Computer, vol. 40, no. 2, 2007.
[10] K. S. Trivedi, M. Grottke, and E. Andrade, “Software fault mitigation
and availability assurance techniques,” International Journal of System
Assurance Engineering and Management, vol. 1, no. 4, pp. 340–350,
2010.
[11] C. Scott, A. Wundsam, B. Raghavan, A. Panda, A. Or, J. Lai, E. Huang,
Z. Liu, A. El-Hassany, S. Whitlock et al., “Troubleshooting blackbox
sdn control software with minimal causal sequences,” ACM SIGCOMM
Computer Communication Review, vol. 44, no. 4, pp. 395–406, 2015.
[12] N. Ullah and M. Morisio, “An empirical analysis of open source software
defects data through software reliability growth models,” in EUROCON,
2013 IEEE. IEEE, 2013, pp. 460–466.
[13] F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, “Rx: Treating bugs as allergies—a safe method to survive software failures,” in ACM SIGOPS Operating Systems Review, vol. 39, no. 5. ACM, 2005, pp. 235–248.
[14] W. H. Sanders and J. F. Meyer, “Stochastic activity networks: Formal definitions and concepts,” in Lectures on Formal Methods and Performance Analysis. Springer, 2001, pp. 315–343.
[15] D. Daly, D. D. Deavours, J. M. Doyle, P. G. Webster, and W. H. Sanders, “Möbius: An extensible tool for performance and dependability modeling,” in International Conference on Modelling Techniques and Tools for Computer Performance Evaluation. Springer, 2000, pp. 332–336.
[16] Z. Jelinski and P. B. Moranda, “Software reliability research,” Statistical
Computer Performance Evaluation, pp. 465–484, 1972.
[17] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software rejuvena-
tion: Analysis, module and applications,” in Fault-Tolerant Computing,
1995. FTCS-25. Digest of Papers., Twenty-Fifth International Sympo-
sium on. IEEE, 1995, pp. 381–390.
[18] N. Hayashibara, X. Defago, R. Yared, and T. Katayama, “The φ accrual failure detector,” in Reliable Distributed Systems, 2004. Proceedings of the 23rd IEEE International Symposium on. IEEE, 2004, pp. 66–78.
[19] D. S. Kim, F. Machida, and K. S. Trivedi, “Availability modeling and
analysis of a virtualized system,” in Dependable Computing, 2009.
PRDC’09. 15th IEEE Pacific Rim International Symposium on. IEEE,
2009, pp. 365–371.
[20] S. A. Vilkomir, D. L. Parnas, V. B. Mendiratta, and E. Murphy,
“Availability evaluation of hardware/software systems with several re-
covery procedures,” in 29th Annual International Computer Software
and Applications Conference (COMPSAC’05), vol. 1. IEEE, 2005, pp.
473–478.
[21] W. Xie, Y. Hong, and K. S. Trivedi, “Software rejuvenation policies for
cluster systems under varying workload,” in Dependable Computing,
2004. Proceedings. 10th IEEE Pacific Rim International Symposium on.
IEEE, 2004, pp. 122–129.
[22] S. Chandra and P. M. Chen, “Whither generic recovery from application
faults? a fault study using open-source software,” in Dependable Systems
and Networks, 2000. DSN 2000. Proceedings International Conference
on. IEEE, 2000, pp. 97–106.
[23] V. B. Mendiratta, “Reliability analysis of clustered computing systems,”
in Software Reliability Engineering, 1998. Proceedings. The Ninth
International Symposium on. IEEE, 1998, pp. 268–272.