246 IEEE TRANSACTIONS ON RELIABILITY, VOL. 50, NO. 3, SEPTEMBER 2001
Techniques for Fast Simulation of Models of Highly
Dependable Systems
Victor F. Nicola, Perwez Shahabuddin, and Marvin K. Nakayama
Abstract—With the ever-increasing complexity and require-
ments of highly dependable systems, their evaluation during
design and operation is becoming more crucial. Realistic models of
such systems are often not amenable to analysis using conventional
analytic or numerical methods. Therefore, analysts and designers
turn to simulation to evaluate these models. However, accurate
estimation of dependability measures of these models requires that
the simulation frequently observes system failures, which are rare
events in highly dependable systems. This renders ordinary sim-
ulation impractical for evaluating such systems. To overcome this
problem, simulation techniques based on importance sampling
have been developed, and are very effective in certain settings.
When importance sampling works well, simulation run lengths
can be reduced by several orders of magnitude when estimating
transient as well as steady-state dependability measures. This
paper reviews some of the importance-sampling techniques that
have been developed in recent years to estimate dependability
measures efficiently in Markov and non-Markov models of highly
dependable systems.
Index Terms—Highly dependable system, importance sampling,
Markov chain, simulation, steady-state dependability measure,
transient dependability measure.
ACRONYMS¹
BFB      balanced failure biasing
BLBLR    balance over links BLR
BLBLRC   BLBLR with cuts
BLR      balanced likelihood ratio
BRE      bounded RE
CLT      central limit theorem
CTMC     continuous-time MC
DTMC     discrete-time MC
GSMP     generalized semi-Markov process
i.i.d.   s-independent and identically distributed
IS       importance sampling
MC       Markov chain
MSDIS    measure-specific dynamic IS
MTBF     mean time between failures
MTTF     mean time to failure
Manuscript received November 1, 1998; revised June 6, 2000. This work was
supported in part by the US National Science Foundation under Grants DMI-
9625297, DMI-9624469, and DMI-9900117.
Responsible Editor: C. Alexopoulos
V. F. Nicola is with Telematics Systems and Services, Department of Elec-
trical Engineering, University of Twente, Enschede, The Netherlands.
P. Shahabuddin is with Columbia University, New York, NY 10027 USA
(e-mail: Perwez@ieor.columbia.edu).
M. Nakayama is with the Department of Computer and Information Science,
New Jersey Institute of Technology, Newark, NJ, USA.
Publisher Item Identifier S 0018-9529(01)11169-3.
¹The singular and plural of an acronym are always spelled the same.
NHPP     nonhomogeneous Poisson process
pdf      probability density function
r.v.     random variable
RE       relative error
RP       repair person
SAVE     system availability estimator
TH       time horizon
TRR      total effort reduction ratio
VRR      variance reduction ratio.
I. INTRODUCTION
HIGH dependability requirements of today’s critical and/or
commercial systems often lead to complicated and costly
designs. The ability to predict relevant dependability measures
for such complex systems is essential, not only to guarantee high
levels of dependability during system operation but also to improve
the cost-effectiveness during system design and development.
Several measures are commonly used for assessing the dependability
of a system, and the choice of the particular dependability
measures used to evaluate a particular system depends
on the intended operation and the environment of such a
system. For example, mission-oriented systems are often evaluated
using transient measures, such as system reliability (probability
that the system is operational during the entire mission
time). Given that the system is initially in an operational state,
MTTF is the mean time to the first system failure; this is another
measure of interest for mission-oriented systems. On the
other hand, MTBF is the mean time between subsequent system
failures in steady-state. MTBF and the steady-state availability
(fraction of time the system is operational in the long run) are
often used for evaluating continuously operating systems.
Fault-tolerance and recovery techniques are frequently used
in the design of complex systems to enhance their dependability.
As a consequence, very high reliability/availability
requirements of systems can now be sustained. However,
the performance of continuously operating systems can be
degraded/upgraded due to load surges or reconfigurations
after failures/repairs. In other words, the performance level
of degradable/repairable systems is changing with time in
response to internal or external events. To evaluate these
systems properly, there is a need for measures that combine
performance and reliability/availability aspects. Such measures
were first introduced in [74], and were termed “performability”
measures. An example of such a measure is the distribution (or
s-expectation) of cumulative performance in a given interval
of time. A special case of this measure is the distribution (or
0018–9529/01$10.00 © 2001 IEEE
s-expectation) of interval availability, which is the fraction
of time the system is operational (regardless of performance)
during a given interval of time. The distribution of interval
availability (or guaranteed availability [48]) is a relevant
attribute of continuously operating systems, because it gives
the probability that the system is operational for more than
a specified fraction of a given interval of time. For example,
one might be interested in computing the probability that the
system is unavailable for more than 0.1% of the time in 1 year
of system-operation.
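To make the last measure concrete, the following is a minimal sketch of how the probability that interval unavailability exceeds a threshold could be estimated by ordinary simulation. It assumes a single repairable unit with exponential operational and repair periods; the function names and the parameter values (mean operational period of 1000 h, mean repair of 1 h, a 1-year interval of 8760 h) are hypothetical and chosen only so that the event is not rare.

```python
import random

def unavailable_fraction(rng, mean_up, mean_repair, horizon):
    """Fraction of [0, horizon] that one repairable unit spends failed."""
    t, down = 0.0, 0.0
    while t < horizon:
        t += rng.expovariate(1.0 / mean_up)          # operational period
        if t >= horizon:
            break
        repair = rng.expovariate(1.0 / mean_repair)  # repair period
        down += min(repair, horizon - t)             # clip at the end of the interval
        t += repair
    return down / horizon

def prob_unavailability_exceeds(threshold, mean_up, mean_repair, horizon,
                                replications=2000, seed=0):
    """Naive estimate of P(interval unavailability > threshold)."""
    rng = random.Random(seed)
    hits = sum(
        unavailable_fraction(rng, mean_up, mean_repair, horizon) > threshold
        for _ in range(replications)
    )
    return hits / replications

# Hypothetical values: unavailable for more than 0.1% of one year (8760 h).
print(prob_unavailability_exceeds(0.001, mean_up=1000.0, mean_repair=1.0,
                                  horizon=8760.0))
```

With these illustrative values the event occurs in a sizeable fraction of replications, so ordinary simulation is adequate; with a much more reliable unit the same event would become rare.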
A. Numerical Evaluation of Dependability Measures
Researchers have long been aware of the importance and
necessity of developing techniques and tools to evaluate
highly dependable systems effectively. Most of the efforts are
limited to analytic or numerical solutions, usually restricted to
Markov (less often, semi-Markov) models. For a more detailed
discussion on performability measures and state-of-the-art
techniques for their evaluation, see [23]. The applicability of
these techniques, however, is quickly hindered by practical
problems, such as state-space explosion and/or the inadequacy
of Markov or semi-Markov representations of real systems.
Because the number of states in Markov models usually
grows exponentially with the number of system-components,
and because of storage and computational limitations, only
relatively small systems can be analyzed using numerical
solution techniques. Several techniques have been proposed
and, if applicable, can help to reduce the state-space of large
Markov models. For example, exact lumping [45], [84], or
approximations obtained by truncation and bounding [76],
are used. However, even for a moderately-sized system, the
corresponding Markov model can be “stiff”² (usually when
transition rates are of different orders of magnitude), leading
to difficulties when using numerical solvers [92]. Behavioral
decomposition [9] and iterative decomposition/aggregation
techniques [19] are among several techniques that can help
overcome “stiffness” of Markov models.
B. Effective Simulation
When conventional analytic/numerical methods are no
longer feasible, analysts often turn to computer simulation,
with the obvious advantages of flexible representation of
complex systems at the desired level of abstraction and low
storage requirements. However, the accurate estimation of
dependability measures using simulation requires frequent
observations of the system-failure event, which by definition
are rare events in highly dependable systems. This renders
conventional (ordinary) simulation impractical for evaluating
such systems [30]. To attack this problem, in recent years, there
have been considerable and successful efforts to develop fast
simulation techniques based on IS [41], [51]. The basic idea
is quite simple: simulate the system using new probability-dynamics
(different from the original probability-dynamics of the
system), so as to increase the probability of typical sequences
of events leading to system failure. For example, in a redundant
system with 2 components, accelerating the failure of component #2
while component #1 is being repaired typically increases
the probability of another component failure, which
would lead to system failure. The obtained measure in a
given observation (a sample path of a simulation trial) is then
multiplied by a correction factor called the “likelihood ratio”
to yield an s-unbiased estimate of the measure. This factor
is the ratio of the probabilities (likelihoods) of the sample
path in the original and modified systems, respectively; its
computation is straightforward and can be done recursively
at simulation event times. Appropriate and careful choice of
the new underlying probability dynamics of the simulated
system can yield an appreciable reduction in the variance of the
resulting estimate, which implies appreciable reduction in the
simulation time needed to achieve a specified precision. Also,
the new probability dynamics should be easy to implement.
²A stochastic process is “stiff” when it contains 2 essentially different types of
transitions, slow and rapid [66, ch. 8]. Highly dependable systems consisting of
highly dependable components fit this description because, typically, the component
lifetimes are very long, whereas repairs take only a short time to complete.
For a fixed run-length, ordinary simulation produces esti-
mates with RE (a constant times the coefficient of variation of
the estimate) that tends to infinity as the probability of the rare
event tends to zero. An “effective” heuristic for IS is one that,
for a fixed run-length, produces estimates with a RE that re-
mains bounded as the probability of the rare event tends to zero.
However, BRE is an asymptotic property, and in practice, even
if an IS heuristic possesses this property, the amount of simu-
lation effort required to achieve a given precision can still be
large. Also, the BRE property might not ensure a variance re-
duction relative to ordinary simulation for many types of highly
dependable systems (e.g., systems with an appreciable level of
redundancies) whose parameters fall in the practical range; it
only guarantees that as the event of interest becomes rarer, the
s-expected amount of simulation effort remains bounded by a
constant (in contrast to ordinary simulation where this effort
tends to infinity), but the bound can be large.
C. This Work
This paper reviews some of the recent IS techniques devel-
oped for the efficient estimation of transient and steady-state
dependability measures in Markov and non-Markov models of
highly dependable systems.³ Parts of [53] also review some IS
techniques for the simulation of dependability measures, with
emphasis on the underlying mathematical ideas needed to establish
their theoretical properties; thus, it is more suitable for
researchers. This paper presents a comprehensive and less mathematical
treatment of the subject; therefore, it is more suited for
reliability practitioners, and requires only a basic understanding
of probability and statistics.
There are two main ways in which a system can be made
highly dependable in a cost-effective manner.
1) Use components that are “highly reliable” and have “low”
built-in redundancies in the system. Examples of these
are computer systems where the main components (e.g.,
processors) fail rarely.
2) Build “significant” redundancies in the system and use
components that are just “reliable” instead of “highly reliable.”
(The distinction is clearer when some examples
are examined later in the paper.)
There might also be a third way: Use “unreliable” components
but have “very high” built-in redundancies in the system. Examples
are more difficult to find in practice.
³Preliminary versions of some parts of this review have appeared in [80] and
[100].
Much of the recent research work on effective simulation of
highly dependable systems has been done for systems that fall
in categories 1 and 2, and this paper mainly covers those.
The focus in this paper is on “dynamic” systems (systems that
change over time), in contrast to “static” systems. An example
of a static system is a 2-terminal reliability network with s-independent
components and no repairs (strictly speaking, there can
be repairs as long as they do not create s-dependencies among
components). See [69], [70], [95] for fast simulation methods
for such systems.
Section II formally describes the wide class of systems for
which these IS techniques are designed, and reviews the basic
idea of IS.
Section III discusses IS techniques for estimating depend-
ability measures in Markov models. “Markov” implies that all
failure, repair, and other underlying distributions in the system
are exponential, so that it can be modeled by a CTMC. Some
work is reviewed on the estimation of derivatives with respect
to model parameters (e.g., component failure rates) for various
steady-state and transient measures in these models. This work
is of much interest, because it can be used to identify system-
components that might need improvement and to optimize sys-
tems.
Section IV considers the estimation of dependability mea-
sures for models in which the failure and repair times are
not exponentially distributed. Because these types of system
can no longer be directly modeled as a MC, they are called
“non-Markov models.” A mathematical framework for studying
such systems is the GSMP; see [37] for a formal development
of GSMP. The general theory of IS for discrete-event systems
(without discussing the particular changes of measures for
specific models) is in [37], [41]. For the IS heuristics discussed
in this paper, some empirical studies have been presented in the
literature, and many of these methods are provably effective.
In both Markov and non-Markov models, the concern is estimation of
• transient measures, such as system unreliability, s-distribution
and s-expectation of interval unavailability,
• steady-state measures, such as steady-state unavailability
and MTBF.
Although MTTF is in fact a transient measure, for regenerative
models it can be represented as a ratio of 2 s-expectations of
regenerative-cycle-based quantities that can be estimated using
the regenerative method of simulation. Thus MTTF is included
in discussions of steady-state measures.
Section V discusses ongoing work and directions for future
research.
D. Related Work and Software
IS can be applied, not only for estimating dependability mea-
sures of reliability systems, but for estimating buffer-overflow
probabilities in queuing systems and networks [18], [28], [91],
[96], [107]. Applications to communication systems are of par-
ticular interest [4], [15], [113], [67]. The IS techniques used in
this setting are often based on the theory of large deviations. A
survey on existing techniques is in [53].
An approach, other than IS, based on “fault-injection” is
used in [75] to speed up steady-state simulations involving
rare (failure) events in communication systems. The method
assumes knowledge of the frequency of the “rare failure event”
and exploits the fact that, except for relatively short periods
after failures, the system is operating normally in a failure-free
environment. Fault-injection is used to obtain an accurate
estimate of the performance measure of interest during periods
affected by the failure. This estimate is appropriately combined
with an accurate estimate under failure-free environment (with
no rare events) to yield an overall steady-state estimate of the
dependability measure.
Another method to simulate rare sample paths is to use the
technique of “splitting” sample paths. Splitting for rare-event
simulation was originally discussed in [62] in the context of es-
timating rare particle transmission probabilities in physics [51].
Since then, it continues to be an active area of research in that
field [24]. Variations of this technique for steady-state rare-event
estimation in stochastic service systems seem to have been first
done in [6], [7], and later in [57] (see [14] for a related idea);
a variation for transient rare-event estimation in stochastic ser-
vice systems is in [65]. It was revisited in [110], [111], [112]
for estimating probabilities of rare events in computer and com-
munication systems; the version of the technique used in these
papers was called “RESTART.” Some of the most recent ver-
sions/implementations of the technique are in [29], [35], [43],
[52].
The basic idea behind the splitting technique is explained
here. The goal typically is to estimate some performance measure
that is “associated with” visiting some set of states of the
state space of the stochastic process, and this set is visited only
rarely. For example, compute the probability of a buffer overflow,
where the rare set corresponds to states in which the buffer content
has reached its capacity. In ordinary simulation, the stochastic
process being simulated spends a lot of time in regions of the
state space that are “far away” from the interesting rare set
(regions from where the chance of entering the rare set is extremely
low). In one version of splitting, a region of the state space that is
“closer” to the rare set is defined. Each time the process enters
this region from the “far away” region, many identical copies
of the process are generated. Each of the split copies is simulated
until the process exits back into the “far away” region.
From there on, only one of the split copies is continued until
another entrance into the “closer” region. This yields more
instances of the stochastic process spending time in the “closer”
region where the rare event is more likely to occur. The idea can
be extended: instead of just 2 regions, use multiple regions of
slowly increasing degrees of rarity. Reference [35] describes a
unifying class of models and implementation conditions under
which this type of multi-level splitting is provably effective for
steady-state rare-event simulation. Related work is in [33], [34].
The method of splitting has also been used and analyzed in contexts
other than rare-event simulation, e.g., [73].
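The multi-level idea can be sketched on a toy model. The example below is an illustration under stated assumptions, not the RESTART algorithm or any of the cited implementations: the process is a downward-biased random walk on the integers, the rare set is a (hypothetical) level `top`, each integer level is treated as a region boundary, and every path that reaches a new level for the first time is replaced by a fixed number `split` of copies. The estimate divides the number of paths that reach `top` by the number of initial paths times the total splitting factor. The gambler's-ruin formula supplies the exact value for comparison.

```python
import random

def exact_hit_probability(p_up, top):
    """Gambler's-ruin formula: P(reach `top` before 0 | start at 1), p_up != 0.5."""
    r = (1.0 - p_up) / p_up
    return (1.0 - r) / (1.0 - r ** top)

def reaches_next_level(rng, level, p_up):
    """Run the walk from `level` until it hits level + 1 (True) or 0 (False)."""
    x = level
    while 0 < x <= level:
        x += 1 if rng.random() < p_up else -1
    return x == level + 1

def splitting_estimate(p_up, top, n_start, split, seed=0):
    """Fixed-factor multi-level splitting estimate of P(reach `top` before 0)."""
    rng = random.Random(seed)
    particles, stages = n_start, 0
    for level in range(1, top):
        survivors = sum(reaches_next_level(rng, level, p_up)
                        for _ in range(particles))
        if level + 1 < top:
            particles = survivors * split   # split every path entering a new level
            stages += 1
        else:
            particles = survivors           # paths that reached the rare set
    return particles / (n_start * split ** stages)

# Hypothetical parameters chosen for illustration only.
print(exact_hit_probability(0.3, 10), splitting_estimate(0.3, 10, 20000, 2))
```

A splitting factor of 2 is used here because the chance of climbing one more level before returning to 0 is somewhat below 1/2 in this toy walk, so the population of paths stays roughly stable from level to level.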
There are a few software-based modeling tools which use
rare-event simulation techniques for dependability evaluation.
SAVE [45] is a software package that consists of a high-level
modeling language that can be used to specify the model of
interest. From this specification and Markov assumptions on
the lifetime and repair-time distributions, the detailed Markov
chain is derived. It is then solved for dependability measures
using either numerical (nonsimulation) or simulation methods.
A recent version of SAVE [8] incorporates the IS technique,
BFB (as described in Section III-A) at the MC level to estimate
dependability measures efficiently. Another software package
where IS is used is ULTRASAN [20]. In ULTRASAN, the
high-level modeling construct of stochastic activity networks
is used to specify the model of interest. Again, from this
specification, the detailed stochastic process is derived and
solved for performance/dependability measures of interest, using
either numerical (nonsimulation) MC methods or simulation
methods. In recent versions of ULTRASAN [89], [90] an
“IS governor” has been incorporated. Here, instead of the
IS heuristic being built-in as in SAVE, one can choose and
specify the IS change of measure at the stochastic activity
network level. The RESTART version of the splitting method
has also been implemented in ASTRO [112].
II. BACKGROUND
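Notation
$k$                 number of types of components
$n_i$               number of components of type $i$, $i = 1, \ldots, k$
$N_i(t)$            number of operational components of type $i$ at time $t$
$N(t)$              vector $(N_1(t), \ldots, N_k(t))$
$N$                 stochastic process $\{N(t): t \ge 0\}$
$X(t)$              state of the system at time $t$
$X$                 stochastic process $\{X(t): t \ge 0\}$
$S$                 state space of $X$
$F$                 subset of failure states in $S$
$T_F$               time to first system-failure
$P_f$               probability under measure $f$
$E_f$               s-expectation under measure $f$
$\mathrm{Var}_f$    variance under measure $f$
$U(t)$              system unreliability at time $t$
$1\{A\}$            indicator function of event $A$
$\Rightarrow$       convergence in distribution
$\mathcal{N}(a, b)$ s-normal distribution with mean $a$, variance $b$
RE                  relative error of an estimator
$\omega$            a sample path of a stochastic process, in the set $\Omega$ of all $\omega$
$f(\omega)$         pdf of $\omega$ under measure $f$
$L(\omega)$         likelihood ratio at $\omega$.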
A. Highly Dependable Systems
This section discusses the broad class of highly dependable
systems that can be described by SAVE [45] (basically, a gen-
eralized Machine Repairman Model). These models consist of
multiple types of components, where each component can be in
1 of 4 states:
• operational,
• failed,
• spare,
• dormant.
The first 3 of these states are self-explanatory. An operational
component becomes dormant if its operation depends upon the
operation of some other component and that other component
fails. For example, a processor might not be operational unless
its power supply is also operational; therefore, if the power
supply fails, then the processor is dormant. In SAVE, different
(exponential) failure rates can be specified for the operational,
spare, and dormant states. The SAVE modeling language is
also used to describe operational/repair dependencies among
components (e.g., the operation/repair of a component depends
on some other components being operational), as well as
failure propagation (e.g., the failure of a component causes
some other components to fail with given probabilities). The
system is operational if certain combinations of components are
operational. Unlike SAVE, in non-Markov models (Section IV)
general failure and repair distributions are allowed. Also, there
is a set of RP who repair failed components according to some
reasonably arbitrary service (priority or nonpriority) discipline.
To simplify the presentation, systems are considered in which
each component is either operational or failed. (Unless otherwise
specified, the results also apply to the more general models
in the SAVE modeling language.) Section II-B briefly reviews
the basic idea of IS and shows how (when applied appropriately)
it could appreciably speed up simulations involving rare
events. For illustration, consider estimating the system unreliability;
however, the same concepts also apply to other dependability
measures.
B. Importance Sampling
Consider a system with $k$ component-types. Each component
is subject to failure and repair.
All components are operational at time 0: $N_i(0) = n_i$, for all $i$.
All components are “new” at time 0.
In general, $X(t)$ contains the information $N(t)$, but other
information might be needed, e.g., the queuing of failed components
waiting to be repaired and the remaining lifetimes and repair
times of components when using distributions other than
exponential.
There is some subset $F$ of the state space $S$ such that the
system is failed at time $t$ if $X(t) \in F$.
System unreliability is
$$U(t) = E_f[1\{T_F \le t\}] = P_f(T_F \le t), \qquad (1)$$
where $T_F$ is the time to first system-failure and $t$ is the TH.
The subscript $f$ denotes the original probability measure: the
underlying original probability distributions governing the dynamics
of the system.
In a highly reliable system, for a sufficiently small $t$, the
event $\{T_F \le t\}$ is rare.
In ordinary (naive) simulation generate $n$ i.i.d. replications of
$X$ from time 0 to time $t$ to obtain samples of $1\{T_F \le t\}$:
say, $I_1, \ldots, I_n$. Then
$$\hat{U}_n(t) = \frac{1}{n} \sum_{j=1}^{n} I_j$$
is an s-unbiased estimator of $U(t)$. The variance of this estimator is
$$\mathrm{Var}_f[\hat{U}_n(t)] = \frac{U(t)\,(1 - U(t))}{n}.$$
From the CLT,
$$\frac{\hat{U}_n(t) - U(t)}{\sqrt{U(t)\,(1 - U(t))/n}} \Rightarrow \mathcal{N}(0, 1) \quad \text{as } n \to \infty.$$
The RE of $\hat{U}_n(t)$ is
$$\mathrm{RE} = 2.576\, \frac{\sqrt{U(t)\,(1 - U(t))/n}}{U(t)},$$
which is the relative half width of the 99% s-confidence interval
derived from the CLT approximation. For a fixed $n$, the
RE $\to \infty$ as $U(t) \to 0$. This is the main problem when using ordinary
simulation to evaluate highly dependable systems. The goal of
IS is to overcome this inherent difficulty.
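The growth of the RE can be checked numerically. The sketch below only evaluates the definition of RE as 2.576 times the coefficient of variation of the naive estimator of a rare-event probability, and inverts it to obtain the number of replications needed for a target RE; the function names and the listed probabilities are hypothetical illustrations.

```python
import math

Z_99 = 2.576  # normal quantile for a two-sided 99% confidence interval

def relative_error(u, n):
    """RE of the naive estimator of a probability u from n replications."""
    return Z_99 * math.sqrt(u * (1.0 - u) / n) / u

def replications_needed(u, target_re):
    """Smallest n for which the naive estimator reaches the target RE."""
    return math.ceil(Z_99 ** 2 * (1.0 - u) / (target_re ** 2 * u))

for u in (1e-2, 1e-4, 1e-6, 1e-8):
    print(f"u={u:.0e}  RE(n=10^6)={relative_error(u, 10**6):.3f}  "
          f"n for RE=0.1: {replications_needed(u, 0.1):.3e}")
```

The required number of replications grows roughly in proportion to the reciprocal of the rare-event probability, which is why ordinary simulation becomes impractical for highly dependable systems.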
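Notation
$g$          another probability measure
$\omega$     a sample path (of a replication) in the set $\Omega$ of all
             possible sample paths of $X$ taking the system from
             time 0 to time $t$
$g(\omega)$  pdf of $\omega$ according to $g$
$$U(t) = E_f[1\{T_F \le t\}]
       = \int_{\Omega} 1\{T_F \le t\}(\omega)\, \frac{f(\omega)}{g(\omega)}\, g(\omega)\, d\omega
       = E_g[1\{T_F \le t\}\, L], \qquad L(\omega) = \frac{f(\omega)}{g(\omega)}. \qquad (2)$$
The only condition imposed on $g$ is: $g(\omega) > 0$ whenever
$1\{T_F \le t\}(\omega)\, f(\omega) > 0$.
Thus the system can be simulated using $g$ to obtain i.i.d. samples of
$1\{T_F \le t\}\, L$: say, $I_1 L_1, \ldots, I_n L_n$.
An s-unbiased estimate of $U(t)$ is
$$\frac{1}{n} \sum_{j=1}^{n} I_j L_j.$$
The variance of $1\{T_F \le t\}\, L$ is
$$\mathrm{Var}_g[1\{T_F \le t\}\, L] = E_g[1\{T_F \le t\}\, L^2] - U(t)^2.$$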
One measure of effectiveness of any new simulation algorithm
is the VRR: ratio of the variance using ordinary simulation
to that using the new simulation algorithm; in this case:
$$\mathrm{VRR} = \frac{\mathrm{Var}_f[1\{T_F \le t\}]}{\mathrm{Var}_g[1\{T_F \le t\}\, L]},$$
where the numerator is the per-sample variance under the original
measure $f$ and the denominator is the per-sample variance of the
IS estimate under the new measure $g$.
The VRR gives the ratio of the number of samples using ordinary
simulation to that using the new algorithm so as to achieve
the same RE. However, this measure of effectiveness does not
consider the effort (e.g., CPU time) required to simulate each
sample under the two methods. Hence a fairer measure of effectiveness
is the TRR: ratio of (the product of the variance and
the effort per sample using ordinary simulation) to (that using
the new simulation algorithm), [42]. The TRR gives the ratio of
the total effort using ordinary simulation to that using the new
algorithm so as to achieve the same RE.
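A small worked illustration of the difference between the two ratios follows. The symbols $\sigma_f^2$, $\sigma_g^2$ (per-sample variances under ordinary simulation and under the new algorithm) and $c_f$, $c_g$ (effort per sample under each) are introduced only for this example, and the numbers are hypothetical.

```latex
\[
\mathrm{VRR} = \frac{\sigma_f^2}{\sigma_g^2}, \qquad
\mathrm{TRR} = \frac{\sigma_f^2\, c_f}{\sigma_g^2\, c_g}
             = \mathrm{VRR} \cdot \frac{c_f}{c_g}.
\]
\[
\text{If } \mathrm{VRR} = 10^{4} \text{ and } c_g = 5\, c_f, \text{ then }
\mathrm{TRR} = 10^{4} \cdot \tfrac{1}{5} = 2 \times 10^{3}.
\]
```

In this hypothetical case the new algorithm needs $10^4$ times fewer samples, but because each sample costs 5 times as much, the total effort is reduced by a factor of 2000 rather than $10^4$.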
The main challenge in IS is to find a robust new probability
measure $g$ that can be implemented in a computationally efficient
manner such that
$\mathrm{Var}_g[1\{T_F \le t\}\, L] \ll \mathrm{Var}_f[1\{T_F \le t\}]$:
$$E_g[1\{T_F \le t\}\, L^2] = E_f[1\{T_F \le t\}\, L] \ll E_f[1\{T_F \le t\}] = U(t). \qquad (3)$$
Appreciable variance reduction from (3) is obtained if
$$L(\omega) \ll 1 \quad \text{whenever } 1\{T_F \le t\}(\omega) = 1. \qquad (4)$$
Choosing $g$ such that (4) is satisfied is usually very difficult
because it involves each sample path. But the general intuition
one obtains is that $g$ should be chosen to appreciably increase
the probability of the rare event $\{T_F \le t\}$. At the same time one
has to be very careful; choosing an arbitrary (but not suitable)
$g$ that increases the probability of the rare event can lead to a
substantial increase in variance.
For highly dependable systems, try to come up with IS techniques
that are “effective” (see Section I-B): techniques whose
RE remains bounded (implying that VRR $\to \infty$) as the probability
of the rare event tends to zero. This property has been
established at least empirically (and, in many cases, also theoretically)
for most of the IS techniques in this paper. However,
as mentioned before, this does not always guarantee efficient
simulation of systems with high redundancies.
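The mechanics of IS (simulate under a new measure, accumulate the likelihood ratio at each event, weight the indicator by it) can be sketched on a deliberately small model. Everything in the example is an assumption made for illustration: a system of 3 identical components with failure rate `lam` each, one repair person with repair rate `mu`, system failure when all 3 components are failed, and a fixed biasing probability of 0.5 for choosing a failure over a repair. The quantity estimated is the probability that the system fails before returning to the all-operational state, given that 1 component has just failed; this is a simpler relative of the unreliability, chosen because its exact value is available for comparison. It is not presented as any of the specific heuristics reviewed in this paper.

```python
import random

def exact_gamma(lam, mu):
    """Exact P(all 3 fail before all are repaired | 1 component failed)."""
    p1 = 2 * lam / (2 * lam + mu)   # failure probability with 1 failed
    p2 = lam / (lam + mu)           # failure probability with 2 failed
    return p1 * p2 / (1.0 - p1 * (1.0 - p2))

def is_samples(lam, mu, n, bias=0.5, seed=1):
    """n i.i.d. samples of indicator * likelihood ratio under the biased measure."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        failed, ratio = 1, 1.0
        while 0 < failed < 3:
            up = 3 - failed
            p_fail = up * lam / (up * lam + mu)      # original transition probability
            if rng.random() < bias:                  # new measure: fail w.p. `bias`
                failed += 1
                ratio *= p_fail / bias               # recursive likelihood-ratio update
            else:
                failed -= 1
                ratio *= (1.0 - p_fail) / (1.0 - bias)
        samples.append(ratio if failed == 3 else 0.0)
    return samples

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

lam, mu = 1e-3, 1.0                                  # hypothetical rates
gamma = exact_gamma(lam, mu)
estimate, var_is = mean_and_variance(is_samples(lam, mu, 20000))
vrr = gamma * (1.0 - gamma) / var_is                 # naive variance / IS variance
print(f"exact={gamma:.4e}  IS estimate={estimate:.4e}  empirical VRR={vrr:.3g}")
```

On every path that reaches system failure the likelihood ratio is much smaller than 1, which is the condition in (4); the empirical VRR printed at the end compares the per-sample variance of the naive indicator with that of the weighted samples.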
III. FAST SIMULATION OF MARKOV MODELS
Notation
collection of all (measurable) subsets of
DTMC embedded on
generic states from the state space
transition probability matrix of the DTMC
(whenis a CTMC)
,