Techniques for fast simulation of models of highly dependable systems
ABSTRACT With the ever-increasing complexity and requirements of highly
dependable systems, their evaluation during design and operation is
becoming more crucial. Realistic models of such systems are often not
amenable to analysis using conventional analytic or numerical methods.
Therefore, analysts and designers turn to simulation to evaluate these
models. However, accurate estimation of dependability measures of these
models requires that the simulation frequently observes system failures,
which are rare events in highly dependable systems. This renders
ordinary simulation impractical for evaluating such systems. To overcome
this problem, simulation techniques based on importance sampling have
been developed, and are very effective in certain settings. When
importance sampling works well, simulation run lengths can be reduced by
several orders of magnitude when estimating transient as well as
steady-state dependability measures. This paper reviews some of the
importance-sampling techniques that have been developed in recent years
to estimate dependability measures efficiently in Markov and non-Markov
models of highly dependable systems.

IEEE TRANSACTIONS ON RELIABILITY, VOL. 50, NO. 3, SEPTEMBER 2001
Victor F. Nicola, Perwez Shahabuddin, and Marvin K. Nakayama
Index Terms: Highly dependable system, importance sampling,
Markov chain, simulation, steady-state dependability measure,
transient dependability measure.
ACRONYMS1
BFB      balanced failure biasing
BLBLR    balance over links BLR
BLBLRC   BLBLR with cuts
BLR      balanced likelihood ratio
BRE      bounded RE
CLT      central limit theorem
CTMC     continuous-time MC
DTMC     discrete-time MC
GSMP     generalized semi-Markov process
i.i.d.   independent and identically distributed
IS       importance sampling
MC       Markov chain
MSDIS    measure-specific dynamic IS
MTBF     mean time between failures
MTTF     mean time to failure
NHPP     nonhomogeneous Poisson process
pdf      probability density function
r.v.     random variable
RE       relative error
RP       repair person
SAVE     system availability estimator
TH       time horizon
TRR      total effort reduction ratio
VRR      variance reduction ratio.
Manuscript received November 1, 1998; revised June 6, 2000. This work was
supported in part by the US National Science Foundation under Grants
DMI-9625297, DMI-9624469, and DMI-9900117.
Responsible Editor: C. Alexopoulos
V. F. Nicola is with Telematics Systems and Services, Department of
Electrical Engineering, University of Twente, Enschede, The Netherlands.
P. Shahabuddin is with Columbia University, New York, NY 10027 USA
(e-mail: Perwez@ieor.columbia.edu).
M. Nakayama is with the Department of Computer and Information Science,
New Jersey Institute of Technology, Newark, NJ, USA.
Publisher Item Identifier S 0018-9529(01)11169-3.
1 The singular and plural of an acronym are always spelled the same.
I. INTRODUCTION
High-dependability requirements of today's critical and/or
commercial systems often lead to complicated and costly
designs. The ability to predict relevant dependability measures
for such complex systems is essential, not only to guarantee high
levels of dependability during system operation but also to
improve the cost-effectiveness during system design and
development.
Several measures are commonly used for assessing the
dependability of a system, and the choice of the particular
dependability measures used to evaluate a particular system
depends on the intended operation and the environment of such a
system. For example, mission-oriented systems are often
evaluated using transient measures, such as system reliability
(probability that the system is operational during the entire mission
time). Given that the system is initially in an operational state,
MTTF is the mean time to the first system failure; this is
another measure of interest for mission-oriented systems. On the
other hand, MTBF is the mean time between subsequent system
failures in steady-state. MTBF and the steady-state availability
(fraction of time the system is operational in the long run) are
often used for evaluating continuously operating systems.
Fault-tolerance and recovery techniques are frequently used
in the design of complex systems to enhance their
dependability. As a consequence, very high reliability/availability
requirements of systems can now be sustained. However,
the performance of continuously operating systems can be
degraded/upgraded due to load surges or reconfigurations
after failures/repairs. In other words, the performance level
of degradable/repairable systems is changing with time in
response to internal or external events. To evaluate these
systems properly, there is a need for measures that combine
performance and reliability/availability aspects. Such measures
were first introduced in [74], and were termed “performability”
measures. An example of such a measure is the distribution (or
expectation) of cumulative performance in a given interval
of time. A special case of this measure is the distribution (or
expectation) of interval availability, which is the fraction
of time the system is operational (regardless of performance)
during a given interval of time. The distribution of interval
availability (or guaranteed availability [48]) is a relevant
attribute of continuously operating systems, because it gives
the probability that the system is operational for more than
a specified fraction of a given interval of time. For example,
one might be interested in computing the probability that the
system is unavailable for more than 0.1% of the time in 1 year
of system operation.
0018-9529/01$10.00 © 2001 IEEE
A. Numerical Evaluation of Dependability Measures
Researchers have long been aware of the importance and
necessity of developing techniques and tools to evaluate
highly dependable systems effectively. Most of the efforts are
limited to analytic or numerical solutions, usually restricted to
Markov (less often, semi-Markov) models. For a more detailed
discussion on performability measures and state-of-the-art
techniques for their evaluation, see [23]. The applicability of
these techniques, however, is quickly hindered by practical
problems, such as state-space explosion and/or the inadequacy
of Markov or semi-Markov representations of real systems.
Because the number of states in Markov models usually
grows exponentially with the number of system components,
and because of storage and computational limitations, only
relatively small systems can be analyzed using numerical
solution techniques. Several techniques have been proposed
and, if applicable, can help to reduce the state-space of large
Markov models. For example, exact lumping [45], [84], or
approximations obtained by truncation and bounding [76],
are used. However, even for a moderately sized system, the
corresponding Markov model can be “stiff”2 (usually when
transition rates are of different orders of magnitude), leading
to difficulties when using numerical solvers [92]. Behavioral
decomposition [9] and iterative decomposition/aggregation
techniques [19] are among several techniques that can help
overcome “stiffness” of Markov models.
B. Effective Simulation
When conventional analytic/numerical methods are no
longer feasible, analysts often turn to computer simulation,
with the obvious advantages of flexible representation of
complex systems at the desired level of abstraction and low
storage requirements. However, the accurate estimation of
dependability measures using simulation requires frequent
observations of the system-failure event, which by definition
is a rare event in highly dependable systems. This renders
conventional (ordinary) simulation impractical for evaluating
such systems [30]. To attack this problem, in recent years, there
have been considerable and successful efforts to develop fast
simulation techniques based on IS [41], [51]. The basic idea
is quite simple: simulate the system using new probability
dynamics (different from the original probability dynamics of the
system), so as to increase the probability of typical sequences
of events leading to system failure. For example, in a redundant
system with 2 components, accelerating the component #2
failure while component #1 is being repaired typically
increases the probability of another component failure, which
would lead to system failure. The obtained measure in a
given observation (a sample path of a simulation trial) is then
multiplied by a correction factor called the “likelihood ratio”
to yield an unbiased estimate of the measure. This factor
is the ratio of the probabilities (likelihoods) of the sample
path in the original and modified systems, respectively; its
computation is straightforward and can be done recursively
at simulation event times. Appropriate and careful choice of
the new underlying probability dynamics of the simulated
system can yield an appreciable reduction in the variance of the
resulting estimate, which implies appreciable reduction in the
simulation time needed to achieve a specified precision. Also,
the new probability dynamics should be easy to implement.
2 A stochastic process is “stiff” when it contains 2 essentially
different types of transitions, slow and rapid [66, ch. 8]. Highly
dependable systems consisting of highly dependable components fit this
description because, typically, the component lifetimes are very long,
whereas repairs take only a short time to complete.
For a fixed run length, ordinary simulation produces
estimates with RE (a constant times the coefficient of variation of
the estimate) that tends to infinity as the probability of the rare
event tends to zero. An “effective” heuristic for IS is one that,
for a fixed run length, produces estimates with an RE that
remains bounded as the probability of the rare event tends to zero.
However, BRE is an asymptotic property, and in practice, even
if an IS heuristic possesses this property, the amount of
simulation effort required to achieve a given precision can still be
large. Also, the BRE property might not ensure a variance
reduction relative to ordinary simulation for many types of highly
dependable systems (e.g., systems with an appreciable level of
redundancies) whose parameters fall in the practical range; it
only guarantees that as the event of interest becomes rarer, the
expected amount of simulation effort remains bounded by a
constant (in contrast to ordinary simulation where this effort
tends to infinity), but the bound can be large.
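The likelihood-ratio correction described in this section can be illustrated with a minimal Python sketch. It is not one of the techniques reviewed in this paper; it assumes a single component with an exponential lifetime and estimates the probability that it fails within a mission time. The function names and the parameters `lam` (original failure rate), `lam_is` (accelerated failure rate used for sampling), and `t` (mission time) are hypothetical names introduced only for this illustration.

```python
import math
import random


def exact_probability(lam, t):
    """P(lifetime <= t) for an exponential lifetime with rate lam."""
    return -math.expm1(-lam * t)


def naive_estimate(lam, t, n, rng):
    """Ordinary simulation: fraction of sampled lifetimes that end by time t."""
    hits = sum(1 for _ in range(n) if rng.expovariate(lam) <= t)
    return hits / n


def is_estimate(lam, lam_is, t, n, rng):
    """Importance sampling: sample lifetimes with the accelerated rate lam_is
    and weight each observed failure by the likelihood ratio f(x) / g(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam_is)  # lifetime under the new dynamics
        if x <= t:
            original_pdf = lam * math.exp(-lam * x)
            sampling_pdf = lam_is * math.exp(-lam_is * x)
            total += original_pdf / sampling_pdf
    return total / n


if __name__ == "__main__":
    rng = random.Random(1)
    lam, t = 1e-6, 1.0
    print("exact :", exact_probability(lam, t))
    print("naive :", naive_estimate(lam, t, 20000, rng))
    print("IS    :", is_estimate(lam, 1.0, t, 20000, rng))
```

With a failure probability of about 1e-6, the naive estimate from 20000 replications is almost always exactly zero, whereas the weighted estimate is typically close to the exact value. The accelerated rate of 1.0 is an arbitrary choice made for the illustration, not a recommended setting.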
C. This Work
This paper reviews some of the recent IS techniques
developed for the efficient estimation of transient and steady-state
dependability measures in Markov and non-Markov models of
highly dependable systems.3 Parts of [53] also review some IS
techniques for the simulation of dependability measures, with
emphasis on the underlying mathematical ideas needed to
establish their theoretical properties; thus, it is more suitable for
researchers. This paper presents a comprehensive and less
mathematical treatment of the subject; therefore, it is more suited for
reliability practitioners, and requires only a basic understanding
of probability and statistics.
3 Preliminary versions of some parts of this review have appeared in [80]
and [100].
There are two main ways in which a system can be made
highly dependable in a cost-effective manner.
1) Use components that are “highly reliable” and have “low”
built-in redundancies in the system. Examples of these
are computer systems where the main components (e.g.,
processors) fail rarely.
2) Build “significant” redundancies in the system and use
components that are just “reliable” instead of “highly
reliable.” (The distinction is clearer when some examples
are examined later in the paper.)
There might also be a third way: Use “unreliable” components
but have “very high” built-in redundancies in the system.
Examples are more difficult to find in practice.
Much of the recent research work on effective simulation of
highly dependable systems has been done for systems that fall
in categories 1 and 2, and this paper mainly covers those.
The focus in this paper is on “dynamic” systems (systems that
change over time), in contrast to “static” systems. An example
of a static system is a 2-terminal reliability network with
independent components and no repairs (strictly speaking, there can
be repairs as long as they do not create dependencies among
components). See [69], [70], [95] for fast simulation methods
for such systems.
Section II formally describes the wide class of systems for
which these IS techniques are designed, and reviews the basic
idea of IS.
Section III discusses IS techniques for estimating
dependability measures in Markov models. “Markov” implies that all
failure, repair, and other underlying distributions in the system
are exponential, so that it can be modeled by a CTMC. Some
work is reviewed on the estimation of derivatives with respect
to model parameters (e.g., component failure rates) for various
steady-state and transient measures in these models. This work
is of much interest, because it can be used to identify system
components that might need improvement and to optimize
systems.
Section IV considers the estimation of dependability
measures for models in which the failure and repair times are
not exponentially distributed. Because these types of system
can no longer be directly modeled as an MC, they are called
“non-Markov models.” A mathematical framework for studying
such systems is the GSMP; see [37] for a formal development
of GSMP. The general theory of IS for discrete-event systems
(without discussing the particular changes of measures for
specific models) is in [37], [41]. For the IS heuristics discussed
in this paper, some empirical studies have been presented in the
literature, and many of these methods are provably effective.
In both Markov and non-Markov models, the concern is
estimation of
• transient measures, such as system unreliability,
distribution and expectation of interval unavailability,
• steady-state measures, such as steady-state unavailability
and MTBF.
Although MTTF is in fact a transient measure, for regenerative
models it can be represented as a ratio of 2 expectations of
regenerative-cycle-based quantities that can be estimated using
the regenerative method of simulation. Thus MTTF is included
in discussions of steady-state measures.
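The ratio representation of MTTF mentioned above can be written out as a short derivation. This is a sketch under assumed notation: the symbols $T_F$ (time to first system failure), $\tau_0$ (time of first return to the regeneration state), and $T_F'$ (time to first system failure measured from that return) are introduced here for illustration, and the argument uses only the regenerative property.

```latex
% Assumed notation (introduced for this sketch):
%   T_F    = time to first system failure,
%   \tau_0 = time of first return to the regeneration state,
%   T_F'   = time to first system failure measured from \tau_0; by
%            regeneration it is distributed as T_F and is independent
%            of what happened before \tau_0.
\begin{align*}
T_F &= \min(T_F, \tau_0) + I(\tau_0 < T_F)\, T_F', \\
E[T_F] &= E[\min(T_F, \tau_0)] + P(\tau_0 < T_F)\, E[T_F], \\
\text{MTTF} = E[T_F] &= \frac{E[\min(T_F, \tau_0)]}{P(T_F < \tau_0)}.
\end{align*}
```

Both the numerator and the denominator are expectations of quantities observed within a single regenerative cycle, and the denominator is the probability of a rare event when the system is highly dependable.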
Section V discusses ongoing work and directions for future
research.
D. Related Work and Software
IS can be applied, not only for estimating dependability
measures of reliability systems, but also for estimating buffer-overflow
probabilities in queuing systems and networks [18], [28], [91],
[96], [107]. Applications to communication systems are of
particular interest [4], [15], [113], [67]. The IS techniques used in
this setting are often based on the theory of large deviations. A
survey on existing techniques is in [53].
An approach, other than IS, based on “fault-injection” is
used in [75] to speed up steady-state simulations involving
rare (failure) events in communication systems. The method
assumes knowledge of the frequency of the “rare failure event”
and exploits the fact that, except for relatively short periods
after failures, the system is operating normally in a failure-free
environment. Fault-injection is used to obtain an accurate
estimate of the performance measure of interest during periods
affected by the failure. This estimate is appropriately combined
with an accurate estimate under failure-free environment (with
no rare events) to yield an overall steady-state estimate of the
dependability measure.
Another method to simulate rare sample paths is to use the
technique of “splitting” sample paths. Splitting for rare-event
simulation was originally discussed in [62] in the context of
estimating rare particle transmission probabilities in physics [51].
Since then, it continues to be an active area of research in that
field [24]. Variations of this technique for steady-state rare-event
estimation in stochastic service systems seem to have been first
done in [6], [7], and later in [57] (see [14] for a related idea);
a variation for transient rare-event estimation in stochastic
service systems is in [65]. It was revisited in [110], [111], [112]
for estimating probabilities of rare events in computer and
communication systems; the version of the technique used in these
papers was called “RESTART.” Some of the most recent
versions/implementations of the technique are in [29], [35], [43],
[52].
The basic idea behind the splitting technique is explained
here. The goal typically is to estimate some performance
measure that is “associated with” visiting some set of states of the
state-space of the stochastic process, and the set is visited only
rarely. For example, compute the probability of a buffer
overflow, where the set corresponds to states in which the buffer
content has reached its capacity. In ordinary simulation, the
stochastic process being simulated spends a lot of time in regions of
the state-space that are “far away” from the interesting rare set
(regions from where the chance of entering the rare set is extremely
low). In one version of splitting, a region of the state-space that is
“closer” to the rare set is defined. Each time the process enters
this region from the “far away” region, many identical copies
of the process are generated. Each of the split copies is
simulated until the process exits back into the “far away” region.
From there on, only one of the split copies is continued until
another entrance into the “closer” region. This gives more
instances of the stochastic process spending time in the “closer”
region where the rare event is more likely to occur. The idea can
be extended: instead of just 2 regions, use multiple regions of
slowly increasing degrees of rarity. Reference [35] describes a
unifying class of models and implementation conditions under
which this type of multilevel splitting is provably effective for
steady-state rare-event simulation. Related work is in [33], [34].
The method of splitting has also been used and analyzed in
contexts other than rare-event simulation, e.g., [73].
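The multilevel idea can be sketched in Python on a toy problem. This is an illustration under stated assumptions, not the RESTART algorithm or any implementation from the cited references: the process is a simple random walk with downward drift, the rare event is reaching a high level `top` before returning to 0, and each intermediate level is treated as one region boundary. Because this walk always enters level k+1 in the same state, launching a fixed number of fresh paths from each level is equivalent to cloning the paths that reached it (a fixed-effort variant of splitting). All function and parameter names are hypothetical.

```python
import random


def reach_next_level(start, p_up, rng):
    """Run the walk from `start` until it hits start + 1 (True) or 0 (False)."""
    x = start
    while 0 < x < start + 1:
        x += 1 if rng.random() < p_up else -1
    return x == start + 1


def splitting_estimate(top, p_up, trials_per_level, seed=0):
    """Estimate P(walk started at 1 reaches `top` before 0) as a product of
    per-level conditional probabilities, each estimated by simulation."""
    rng = random.Random(seed)
    estimate = 1.0
    for level in range(1, top):
        hits = sum(reach_next_level(level, p_up, rng)
                   for _ in range(trials_per_level))
        if hits == 0:
            return 0.0  # no path reached the next level
        estimate *= hits / trials_per_level
    return estimate


def exact_probability(top, p_up):
    """Gambler's-ruin formula for the same probability (requires p_up != 0.5)."""
    r = (1.0 - p_up) / p_up
    return (r - 1.0) / (r ** top - 1.0)


if __name__ == "__main__":
    print("splitting:", splitting_estimate(10, 0.3, 5000, seed=1))
    print("exact    :", exact_probability(10, 0.3))
```

Each stage estimates a probability that is not small, so a moderate number of paths per level suffices even though the product is rare. In models where the entrance state into a region is random, the copies would have to be started from the recorded entrance states instead.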
There are a few software-based modeling tools which use
rare-event simulation techniques for dependability evaluation.
SAVE [45] is a software package that consists of a high-level
modeling language that can be used to specify the model of
interest. From this specification and Markov assumptions on
the lifetime and repair-time distributions, the detailed Markov
chain is derived. It is then solved for dependability measures
using either numerical (non-simulation) or simulation methods.
A recent version of SAVE [8] incorporates the IS technique,
BFB (as described in Section III-A) at the MC level to estimate
dependability measures efficiently. Another software package
where IS is used is ULTRASAN [20]. In ULTRASAN, the
high-level modeling construct of stochastic activity networks
is used to specify the model of interest. Again, from this
specification, the detailed stochastic process is derived and
solved for performance/dependability measures of interest, using
either numerical (non-simulation) MC methods or simulation
methods. In recent versions of ULTRASAN [89], [90] an
“IS governor” has been incorporated. Here, instead of the
IS heuristic being built-in as in SAVE, one can choose and
specify the IS change of measure at the stochastic activity
network level. The RESTART version of the splitting method
has also been implemented in ASTRO [112].
II. BACKGROUND
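Notation
$N$                   number of types of components
$n_i$                 number of components of type $i$, $1 \le i \le N$
$N_i(t)$              number of operational components of type $i$ at time $t$
$\mathbf{N}(t)$       vector $(N_1(t), \ldots, N_N(t))$
$\mathbf{N}$          stochastic process $\{\mathbf{N}(t): t \ge 0\}$
$X(t)$                state of the system at time $t$
$X$                   stochastic process $\{X(t): t \ge 0\}$
$S$                   state space of $X$
$F$                   subset of failure states in $S$
$T_F$                 time to first system failure
$P_f(\cdot)$          probability under measure $f$
$E_f[\cdot]$          expectation under measure $f$
$\mathrm{Var}_f[\cdot]$  variance under measure $f$
$U(t)$                system unreliability at time $t$
$I(A)$                indicator function of event $A$
$\Rightarrow$         convergence in distribution
$\mathcal{N}(a, b)$   normal distribution with mean $a$, variance $b$
RE$(\cdot)$           relative error of an estimator
$\omega$              a sample path of a stochastic process
$\Omega$              the set of all $\omega$
$f(\omega)$           pdf of $\omega$ under measure $f$
$L(\omega)$           likelihood ratio at $\omega$ on $\Omega$.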
A. Highly Dependable Systems
This section discusses the broad class of highly dependable
systems that can be described by SAVE [45] (basically, a
generalized Machine Repairman Model). These models consist of
multiple types of components, where each component can be in
1 of 4 states:
• operational,
• failed,
• spare,
• dormant.
The first 3 of these states are self-explanatory. An operational
component becomes dormant if its operation depends upon the
operation of some other component and that other component
fails. For example, a processor might not be operational unless
its power supply is also operational; therefore, if the power
supply fails, then the processor is dormant. In SAVE, different
(exponential) failure rates can be specified for the operational,
spare, and dormant states. The SAVE modeling language is
also used to describe operational/repair dependencies among
components (e.g., the operation/repair of a component depends
on some other components being operational), as well as
failure propagation (e.g., the failure of a component causes
some other components to fail with given probabilities). The
system is operational if certain combinations of components are
operational. Unlike SAVE, in non-Markov models (Section IV)
general failure and repair distributions are allowed. Also, there
is a set of RP who repair failed components according to some
reasonably arbitrary service (priority or non-priority) discipline.
To simplify the presentation, systems are considered in which
each component is either operational or failed. (Unless
otherwise specified, the results also apply to the more general models
in the SAVE modeling language.) Section II-B briefly reviews
the basic idea of IS and shows how (when applied
appropriately) it could appreciably speed up simulations involving rare
events. For illustration, also consider estimating the system
unreliability; however, the same concepts also apply to other
dependability measures.
B. Importance Sampling
Consider a system with $N$ component types. Each
component is subject to failure and repair.
All components are operational at time 0: $N_i(0) = n_i$, for all $i$.
All components are “new” at time 0.
In general, $X(t)$ contains the information $\mathbf{N}(t)$, but other
information might be needed, e.g., the queuing of failed
components waiting to be repaired and the remaining lifetimes and
repair times of components when using distributions other than
exponential.
There is some subset $F$ of the state space $S$ such that the
system is failed at time $t$ if $X(t) \in F$.
System unreliability is
$$U(t) = P_f(T_F \le t), \qquad (1)$$
where $T_F$ is the time to first system failure and $t$ is the TH.
The subscript $f$ denotes the original probability measure: the
underlying original probability distributions governing the
dynamics of the system.
In a highly reliable system, for a sufficiently small $t$, the
event $\{T_F \le t\}$ is rare.
In ordinary (naive) simulation generate $n$ i.i.d. replications of
$X$ from time 0 to time $t$ to obtain samples of $I(T_F \le t)$,
say, $I_1, \ldots, I_n$. Then
$$\hat{U}_n(t) = \frac{1}{n} \sum_{j=1}^{n} I_j$$
is an unbiased estimator of $U(t)$. The variance of this estimator
is
$$\mathrm{Var}_f[\hat{U}_n(t)] = \frac{U(t)\,(1 - U(t))}{n}.$$
From the CLT,
$$\sqrt{n}\,\bigl(\hat{U}_n(t) - U(t)\bigr) \Rightarrow \mathcal{N}\bigl(0,\; U(t)(1 - U(t))\bigr)$$
as $n \to \infty$. The RE of $\hat{U}_n(t)$ is
$$\mathrm{RE}\bigl(\hat{U}_n(t)\bigr) = 2.576\, \frac{\sqrt{U(t)(1 - U(t))/n}}{U(t)},$$
which is the relative half-width of the 99% confidence interval
derived from the CLT approximation. For a fixed $n$, the
RE $\to \infty$ as $U(t) \to 0$. This is the main problem when using ordinary
simulation to evaluate highly dependable systems. The goal of
IS is to overcome this inherent difficulty.
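The growth of the RE for a fixed run length can be made concrete with a short Python sketch. It assumes the relative half-width of a 99% confidence interval from the CLT approximation, with the normal quantile taken as approximately 2.576; the function names and the parameter `gamma` (the rare-event probability being estimated) are hypothetical names used only here.

```python
import math

Z_99 = 2.576  # approximate normal quantile for a two-sided 99% interval


def relative_error(gamma, n):
    """Relative half-width of the 99% confidence interval when a probability
    gamma is estimated by the sample mean of n i.i.d. indicator samples."""
    return Z_99 * math.sqrt(gamma * (1.0 - gamma) / n) / gamma


def samples_needed(gamma, target_re):
    """Smallest n (up to rounding) for which relative_error(gamma, n) is at
    most target_re under the same approximation."""
    return math.ceil(Z_99 ** 2 * (1.0 - gamma) / (gamma * target_re ** 2))


if __name__ == "__main__":
    for gamma in (1e-2, 1e-4, 1e-6, 1e-8):
        print(gamma, relative_error(gamma, 10 ** 6), samples_needed(gamma, 0.1))
```

For small `gamma` the required number of replications is roughly proportional to 1/`gamma`, which is why ordinary simulation becomes impractical as the failure event gets rarer.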
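Notation
$g$           another probability measure
$\omega$      a sample path (of a replication) in the set $\Omega$ of all
              possible sample paths of $X$ taking the system from
              time 0 to time $t$
$g(\omega)$   pdf of $\omega$ according to $g$
Writing $I(\omega)$ for the value of $I(T_F \le t)$ on the sample path
$\omega$, and $L(\omega) = f(\omega)/g(\omega)$ for the likelihood ratio,
$$U(t) = E_f[I(T_F \le t)] = \int_\Omega I(\omega)\, f(\omega)\, d\omega
       = \int_\Omega I(\omega)\, \frac{f(\omega)}{g(\omega)}\, g(\omega)\, d\omega
       = E_g[I(T_F \le t)\, L]. \qquad (2)$$
The only condition imposed on $g$ is: $g(\omega) > 0$
whenever $I(\omega)\, f(\omega) > 0$.
Thus the system can be simulated using $g$ to obtain i.i.d.
samples of $I(T_F \le t)\, L$: $I_1 L_1, \ldots, I_n L_n$.
An unbiased estimate of $U(t)$ is
$$\hat{U}_n^{g}(t) = \frac{1}{n} \sum_{j=1}^{n} I_j L_j.$$
The variance of $\hat{U}_n^{g}(t)$ is
$$\mathrm{Var}_g[\hat{U}_n^{g}(t)] = \frac{E_g[I(T_F \le t)\, L^2] - U(t)^2}{n}.$$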
One measure of effectiveness of any new simulation
algorithm is the VRR: ratio of the variance using ordinary simulation
to that using the new simulation algorithm; in this case:
$$\mathrm{VRR} = \frac{\mathrm{Var}_f[I(T_F \le t)]}{\mathrm{Var}_g[I(T_F \le t)\, L]}
             = \frac{U(t)\,(1 - U(t))}{E_g[I(T_F \le t)\, L^2] - U(t)^2},$$
where $f$ is the original measure, $g$ is the IS measure, and $L$ is the
likelihood ratio.
The VRR gives the ratio of the number of samples using
ordinary simulation to that using the new algorithm so as to achieve
the same RE. However, this measure of effectiveness does not
consider the effort (e.g., CPU time) required to simulate each
sample under the two methods. Hence a fairer measure of
effectiveness is the TRR: ratio of (the product of the variance and
the effort per sample using ordinary simulation) to (that using
the new simulation algorithm) [42]. The TRR gives the ratio of
the total effort using ordinary simulation to that using the new
algorithm so as to achieve the same RE.
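The two ratios can be computed directly from per-sample variances and per-sample efforts. The following Python sketch only restates the definitions above; the function and argument names are hypothetical, and the numbers in the usage lines are made up for illustration rather than taken from any experiment.

```python
def variance_reduction_ratio(var_ordinary, var_new):
    """VRR: per-sample variance under ordinary simulation divided by the
    per-sample variance under the new algorithm."""
    return var_ordinary / var_new


def total_effort_reduction_ratio(var_ordinary, effort_ordinary,
                                 var_new, effort_new):
    """TRR: (variance x effort per sample) under ordinary simulation divided
    by the same product under the new algorithm."""
    return (var_ordinary * effort_ordinary) / (var_new * effort_new)


if __name__ == "__main__":
    # Illustrative numbers only: the new algorithm has 1000 times smaller
    # variance but each sample costs twice as much CPU time.
    print(variance_reduction_ratio(1e-6, 1e-9))
    print(total_effort_reduction_ratio(1e-6, 1.0, 1e-9, 2.0))
```

In the made-up usage lines, the total effort needed for a given RE falls by a factor of about 500 rather than 1000, which is the distinction between TRR and VRR drawn above.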
The main challenge in IS is to find a robust new probability
measure $g$ that can be implemented in a computationally
efficient manner such that:
$$E_g[I(T_F \le t)\, L^2] = E_f[I(T_F \le t)\, L] \ll E_f[I(T_F \le t)] = U(t). \qquad (3)$$
Appreciable variance reduction from (3) is obtained if
$$L(\omega) \ll 1 \quad \text{whenever} \quad I(\omega) = 1. \qquad (4)$$
Choosing $g$ such that (4) is satisfied is usually very difficult
because it involves each sample path. But the general intuition
one obtains is that $g$ should be chosen to appreciably increase
the probability of the rare event $\{T_F \le t\}$. At the same time one
has to be very careful; choosing an arbitrary (but not suitable)
$g$ that increases the probability of the rare event can lead to a
substantial increase in variance.
For highly dependable systems, try to come up with IS
techniques that are “effective” (see Section I-B): techniques whose
RE remains bounded (implying that VRR $\to \infty$) as the
probability of the rare event tends to zero. This property has been
established at least empirically (and, in many cases, also
theoretically) for most of the IS techniques in this paper. However,
as mentioned before, this does not always guarantee efficient
simulation of systems with high redundancies.
III. FAST SIMULATION OF MARKOV MODELS
Notation
collection of all (measurable) subsets of
DTMC embedded on
generic states from the state space
transition probability matrix of the DTMC (when is a CTMC)
system state in which all components are operational
time to first return to state
sample path between two successive entries to a subset of states
system failure time in a regenerative cycle, or cycle
steady-state unavailability of the system
estimator of
original and IS probability measures of
IS probability measure under BFB
variance of a ratio estimator; probability measures and are used to
estimate the numerator and denominator, respectively
failure biasing parameter
transition probability matrix of the DTMC under BFB
failure, repair rates of component type
parameter in the failure rate of component type
failure rarity parameter
exact asymptotic order of magnitude
“distance” of state from the failure set
“criticality” of the transition
set of component types having failure rates of the th largest order of
magnitude
stack of likelihood ratios associated with failure events of
components in
likelihood ratio on top of
2-dimensional vector, whose entries are the number of operational
(respectively, currently under repair) components of type
set of states in which components are failed
total system failure time in
expected interval unavailability
total transition rate out of state under the original probability
measure
IS pdf used to sample a random holding time when in state
total time in state in a regenerative cycle
total time in states other than , from the beginning of a regenerative
cycle until either the system fails or the end of the cycle
{system fails during a regenerative cycle}
upper bound for
lower bound for
generic parameter (e.g., a component failure rate)
partial derivative operator with respect to
hitting time of state
hitting time of set
partial derivative of the likelihood ratio with respect to .
Most of the approaches in the following sections are
appropriate for highly dependable Markov systems consisting of
highly reliable components (i.e., component failure rates are
much smaller than the repair rates) that satisfy:
Assumption A: Each state, other than the state in which all
components are up, has at least one repair transition possible.
Assumption A is satisfied by systems of the type in [44], [45].
For systems with repair-unit sharing,4 let the state of the system
at a given time be the vector of the numbers of operational
components of each type at that time; both are defined in
Section II-B. For systems with more general repair disciplines,
add a list of components either waiting for or undergoing repair
at each RP. The state process is a CTMC when all failure and
repair times are exponentially distributed, and the
methodologies in this section are independent of the definition
of the state.
Unless stated otherwise, let the system start in the state in which
all components are operational. One can simulate a CTMC by
generating the next state visited using the transition probability
matrix of the embedded DTMC and then generating the
exponentially distributed holding time in that state with the
appropriate rate. When estimating steady-state measures, instead
of sampling the holding times in a state, use the expected
holding time in that state [25], [26], [56].
CTMC are regenerative processes, where entrance to any
fixed state constitutes a system regeneration. Let the
regeneration epochs be the entrances to the state in which all
components are operational. As in Section II-B, the quantity of
interest in the transient case is the time to first system failure.
A. Steady-State Measures
For estimating steady-state measures, the regenerative
method of simulation is often used, and it is usually sufficient
to simulate the embedded process at transition times, as
described in Section II. Many steady-state measures can be
expressed by a ratio of regenerative-cycle-based quantities
[21], e.g.,
(5)
The ordinary way of estimating unavailability is to run some
regenerative cycles and collect samples of and . Then one can
estimate and by their respective sample means.
However, most samples of are zero, thus one often uses IS to
try to obtain more precise estimates of . Then (as in Section II-B):
. The problem is to find a so that , which implies that
simulation with is much more efficient.
1) Failure Biasing: As mentioned in Section I, the implementation
of IS involves failure biasing [71], in which the basic
idea is to take the system along typical sample paths to failure,
more frequently. All states of the MC, other than , have both
failure and repair transitions.
• A failure transition is a transition from one state to another,
corresponding to the failure of at least one component.
• A repair transition is a transition from one state to another
corresponding to the repair of at least one component.
We do not allow a single transition to correspond to some components
failing and other components being repaired. Typically,
the total probability of repair transitions is close to 1, and the
total probability of failure transitions is close to 0. In failure biasing,
the total probability of failure transitions is increased to
some value , the failure-biasing parameter; thus the total probability
of repair transitions is decreased to . Empirical studies
4 The repair discipline in which the RP works on all failed components
simultaneously, with the effort devoted to each component proportional
to the repair rate of that component.
252IEEE TRANSACTIONS ON RELIABILITY, VOL. 50, NO. 3, SEPTEMBER 2001
suggest that we should choose . (Setting too
close to 1, e.g., , can sometimes lead to a variance
increase or even infinite variance.) Thus failure biasing enables
the system to go along paths to system failure more often.
However, just making the rare event occur more often might
not always work. How the rare event happens (the sequence of
events that lead to the rare event) plays a crucial role. Under
the original probability measure, some sample paths to system
failure are more likely than others. For IS to be effective,
• All the most likely (in terms of order of magnitude of their
probabilities under the original measure ) sample paths
should be made more probable under the new measure .
• Secondary sample paths (those paths with probability
under that are at least an order of magnitude smaller
than the probability of the most likely ones) also need to
be made more probable under but not as much as the
most likely paths.
If an IS distribution does not assign enough probability to a
likely path to system failure, then the resulting variance can be
worse than that of ordinary simulation. (In mathematical terms,
this means that will be large, because, for a sample
path for which is large relative to , the is large [81].) In
the original version of failure biasing, called “simple failure
biasing” here, the relative probabilities under the new measure of
individual failure (repair) transitions with respect to each other
remain unchanged. In systems where the failure transition
probabilities are of different orders of magnitude (e.g., unbalanced
systems), this can deprive a path of a high enough probability
under IS, thus causing inefficient estimation.
2) Balanced Failure Biasing: BFB [47], [98] overcomes the
problem in Section III-A1 by making all failure transitions
occur with equal probabilities (this is also done in state ). This
ensures that all paths get sufficient probability, though it also
wastes some probability by giving certain paths more weight
than necessary. This can degrade a simulation’s performance
when there are large redundancies in the system. IS schemes that
try to minimize this waste include “failure-distance biasing” and
“BLR methods.”
is used on the sample paths of , in which is used until
system failure, and is used after that.
3) MSDIS: In (5), one can use different probability measures
(and thus different regenerative cycles) to estimate
and . This approach is called MSDIS [46], [47]. When
implementing MSDIS, we typically use IS to estimate
and use ordinary simulation to estimate , because it
provides accurate estimates of without using IS. Hence,
one can run regenerative cycles using to get the sample
tuples of ; one can run regenerative cycles using
to get the samples of . Then is estimated , and
(6)
The asymptotic variance of this estimator (large and ) is [47]
(7)
which when estimated (by replacing , and in (7) by their
respective simulation estimates) can be used to construct 99%
confidence intervals.
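As an illustration of failure biasing (a minimal sketch, not the implementation used in [3] or [47]), the Python fragment below estimates the probability that a regenerative cycle ends in system failure. The model is hypothetical: n identical components with failure rate lam, one repair person with rate mu, and system failure when k components are down. All function names and parameter values are assumptions made for this example. Because every state here has a single failure transition, simple and balanced failure biasing coincide.

```python
import random


def failure_prob(i, n, lam, mu):
    """Original probability that the next event in state i (i failed units) is a failure."""
    rate_fail = (n - i) * lam
    return rate_fail / (rate_fail + mu)


def exact_gamma(n, k, lam, mu):
    """Exact P(reach k failed units before returning to 0), birth-death formula."""
    total, prod = 1.0, 1.0
    for i in range(1, k):
        p = failure_prob(i, n, lam, mu)
        prod *= (1.0 - p) / p
        total += prod
    return 1.0 / total


def estimate_gamma_is(n, k, lam, mu, bias, cycles, rng):
    """Importance-sampling estimate of the same probability using failure biasing.

    In every state with both failure and repair transitions, a failure is
    chosen with probability `bias` instead of its (small) original probability.
    """
    acc = 0.0
    for _ in range(cycles):
        # A cycle starts in state 0, whose only transition is a failure (prob. 1).
        state, lr = 1, 1.0
        while 0 < state < k:
            p = failure_prob(state, n, lam, mu)
            if rng.random() < bias:
                lr *= p / bias                      # failure transition
                state += 1
            else:
                lr *= (1.0 - p) / (1.0 - bias)      # repair transition
                state -= 1
        if state == k:
            acc += lr                               # indicator times likelihood ratio
    return acc / cycles
```

For this small model the exact value from `exact_gamma` provides a reference against which the biased estimate can be checked, for example with `estimate_gamma_is(4, 3, 1e-3, 1.0, 0.5, 20000, random.Random(12345))`.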
Another quantity of interest is the MTTF defined by .
For regenerative systems, the MTTF can be expressed as a ratio
of regenerative-cycle-based quantities [47], [64], [103], [108]:
(8)
A sample of [or a sample of ] can be
obtained from 1 regenerative cycle. Hence, again use MSDIS to
estimate by separately estimating each term of the ratio
[47], [103]. In this case, the rare-event problem occurs in
estimating the denominator of the ratio. Hence, use to estimate
the denominator and to estimate the numerator.
To estimate and , one can use other heuristic IS
measures instead of BFB.
4) Mathematical Analysis of Failure Biasing: Mathematical
analysis of failure biasing techniques began in [97], [98]. This
analysis is used to study the increase in simulation efficiency
obtained by using these techniques, or for proving BRE properties
of these techniques. In [97], [98], the failure rate of component
type is assumed to be of the form , where is
a small parameter (rarity parameter) and are positive and
constants. This enables modeling a situation in which components
have small failure rates (components are highly reliable).
Prior to [97], [98], there was other work [32] that studied the
asymptotic behavior (nonsimulation aspects) of systems with
highly reliable components. However, this earlier work assumed
that , for all , which does not allow the modeling of
systems in which component failure rates are of different orders
of magnitude. The use of the exponents facilitates this
modeling. This paper assumes that the repair rates are constants
and the failure-propagation probabilities (probabilities used to
determine if the failure of certain components causes others to
fail simultaneously) are either constants or are of the same general
form as the failure rates: a constant multiplied by raised
to some power. The simulation analysis in [3], [77]–[79], [81],
[97]–[99], [115], [102], [105] deals with the asymptotic behavior
of the simulation efficiency for small . The simulation
technique for highly dependable systems is formally said to have
BRE if the RE remains bounded as .
References [97], [98] show that BFB has the BRE property
when estimating and . This leads to the
BRE property of the MSDIS approach (using BFB) to estimate
the steady-state unavailability and the MTTF. It was shown that,
for fixed numbers , of regenerative cycles, the RE in the
estimation of using standard regenerative simulation is
for some constant ; whereas the RE using the MSDIS
scheme is . [A function is , if there exist
constants , such that for all sufficiently small .]
NICOLA et al.: TECHNIQUES FOR FAST SIMULATION OF MODELS253
References [97], [98] also show that simple failure biasing
has the BRE property for the special class of balanced systems
(systems in which the failure transition probabilities are of the
same order of magnitude; e.g., when for all , and the
failure propagation probabilities are independent of ). Using
a counterexample, it was shown that the BRE property might
not hold when simple failure biasing is used for unbalanced
systems. More general conditions (on the system) under which
any failure biasing method (or any more general IS scheme)
does or does not give BRE are in [79], [81]. Although it seems
difficult to check these conditions except in very simple cases,
they provide insight into how IS should be implemented. Some
additional results are in [109].
5) Failure-Distance Biasing: Failure-distance biasing [12]
attempts to refine failure-biasing schemes to make the system
go mainly along the most likely paths to system failure (for
balanced systems with no failure propagation, the most likely paths
are those with the least number of transitions): there is no
important waste of probabilities on paths that are not most likely.
As in failure biasing, the total failure transition probability is
increased to . However, now, the way in which is allocated to
the individual failure transitions from a state depends
on the “distance” from state to some failure state. To do this,
compute, for each state , the : the minimum number of
failing components whose failure in would bring the system
to a state in which the system is failed. The failure distance
for a state is 0. The “criticality” of a failure transition
is defined as . Failure-distance
biasing is implemented by partitioning the set of failure transitions
from the current state based on the criticalities of the
individual transitions: each set contains all failure transitions from
having a particular criticality. Each set is assigned a portion of
the failure-biasing probability , with sets having larger criticalities
getting larger portions of . Failure transitions within the
same set occur with their original relative probabilities (simple
failure-distance biasing) or with equal probabilities (balanced
failure-distance biasing).
Exact computation of the failure distances assumes a description
of the structure function of the system [5] and requires
determining all the minimal cutsets corresponding to that structure
function. The latter is NP-hard [94]. Hence the users need
to limit the number of minimal cutsets considered. An efficient
algorithm for computing and maintaining the data structures of
the failure distances is in [12].
It follows directly from [98] that balanced failure-distance
biasing also has the BRE property, but [81] presents an example
showing that simple failure-distance biasing might not have
this property. Experiments on examples in [12] seem to support
the intuition that failure-distance-based biasing schemes should
have better simulation efficiency than the usual biasing schemes
(but with an important implementation overhead). However,
the amount of efficiency improvement, if any, on a particular
system in practice depends on whether each computed distance
from a state correctly reflects its true proximity to the set .
The distance defined in this subsection seems to reflect the
actual proximity only for the class of balanced systems with no
failure propagation (the structure function, by definition, does
not consider any failure propagation). It appears to be difficult
and computationally expensive to compute a distance reflecting
the actual proximity for the general case. Even for the balanced
case with no failure propagation, only an approximation to the
failure distance is computed, because the users need to limit
the number of minimal cutsets considered.
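A minimal sketch of the failure-distance bookkeeping, assuming the (possibly truncated) list of minimal cutsets is given as sets of component names. This is a simplification for illustration and not the data structure of [12]. The criticality used below, the decrease in failure distance caused by a failure transition, is an assumed definition; the function names are hypothetical.

```python
def failure_distance(failed, cut_sets):
    """Minimum number of additional component failures needed to complete a cutset."""
    return min(len(cut - failed) for cut in cut_sets)


def partition_by_criticality(failed, candidates, cut_sets):
    """Group the candidate failure transitions of the current state by criticality.

    failed:     set of currently failed components
    candidates: components that can fail next
    Returns a dict mapping criticality -> list of components.
    """
    d_now = failure_distance(failed, cut_sets)
    groups = {}
    for comp in candidates:
        d_next = failure_distance(failed | {comp}, cut_sets)
        groups.setdefault(d_now - d_next, []).append(comp)
    return groups


# Hypothetical system with two minimal cutsets.
cuts = [frozenset({"a", "b"}), frozenset({"c", "d", "e"})]
groups = partition_by_criticality({"a"}, ["b", "c", "d"], cuts)
```

Each group would then receive its share of the failure-biasing probability, with groups of larger criticality receiving larger shares.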
6) Balanced Likelihood Ratio Methods: References [2], [3],
[105] show, experimentally, that the methods in this section
might not work well for systems that have an important degree
of redundancy. The “BLR methods” [2], [3], [105] are approaches
for effectively simulating such systems. They attempt
to cancel terms of the likelihood ratio within a regenerative cycle
by defining the IS probabilities for events in such a way that the
contribution to the likelihood ratio from a repair event cancels
the contribution to the likelihood ratio from a failure event that
occurred previously in the current cycle.
Some additional terminology is needed to describe the basic
method. Partition the set of component types into
sets , such that contains all component
types with failure rates of the th largest order of magnitude.
Throughout the simulation of a cycle, one stores the event
likelihood ratios associated with component failure events from
in a stack . If , let be the likelihood ratio on
top of ; . The system state is denoted by , where
is the number of components of each type
that are operational, and is the number of
RP currently repairing components of each type. References
[2], [3], [105] consider models for which completely
describes the system state, which is a subset of the class of models
described in this paper, but one can easily apply the method to
the more general setting of models in this paper.
In terms of the algorithm, the BLR method differs from the
failure biasing methods in two respects.
1) Instead of using a fixed for the failure-biasing parameter,
use a that is a function of the current state
and the . In particular,
2) The total (new) probability allotted to repair transitions of
components of type is proportional to ,
instead of their being proportional to (as usually
done in the failure biasing methods).
By doing this, one can ensure the cancellation of likelihood
ratios, and guarantee that the overall likelihood ratio on any
regenerative cycle is always bounded above by 1. This implies
that
thus the variance under the new measure is never greater
than that under the original measure . The method is especially
useful for systems with important redundancies, where the
number of transitions until system failure can be large, leading
to high variabilities in the likelihood ratios when using methods
like BFB. Reference [3] also shows that the resulting estimators
have the BRE property when the particular way in which individual
failure transition probabilities are assigned is similar to
that in BFB.
Failure-distance biasing tries to exploit system structure in
allocating probabilities to transitions, and [2], [105] apply similar
ideas to BLR methods. In particular, they obtain additional
efficiency gains by allocating probabilities to the individual
failure transitions so that those failure transitions corresponding
to component types that lie on minimum cuts are more heavily
weighted. Their algorithm does not need to maintain a list of
all the minimum cuts; it needs only to maintain a list of all the
components in a minimum cut. This can be done in time,
where is the number of links. As with failure-distance biasing,
one might not get any additional efficiency gains in certain
systems that are unbalanced and/or have failure propagation.
This is because in such systems the most likely paths to system
failure might not lie along minimum cuts (the definition of
minimum cut does not consider failure propagation).
References [2], [3], [105] also describe improvements that are
based on using semistationary cycles [106] rather than regenerative
cycles. The simulation method is similar to the cycle
method [88] but the motivation for its use is different. In steady-state
simulations of highly dependable systems, one usually uses
the set of states with all components “up” as the regenerative
state. However, when the BLR method is applied to systems
with high degrees of redundancy, the regenerative cycles can
become very long, leading to inefficient estimation. Thus, [2], [3],
[105] instead consider a set of states with no 1-step transition
probabilities within the set. An example is the set of states
with failed components, where is less than the redundancy of
the system (the least number of components that have to fail for
the system to fail). The process in between two entrances to
is a semistationary cycle, and has properties similar to regenerative
cycles, except that these cycles are not necessarily independent
(thus complicating the construction of confidence
intervals). Also one needs to know the steady-state distribution
on the set of states in at the times of entrances to this set, in
order to apply IS; in general, this is very difficult to compute.
These problems are similar to those in the cycle method (see
Section IV-B).
7) Other IS Methods: Another heuristic for failure biasing
in acyclic models (of nonrepairable systems) is considered in
[31], in which the extent to which one biases the failure transitions
along a path leading to system failure is proportional to the
path’s contribution to the measure being estimated. When applicable,
this heuristic requires more overhead than simple failure
biasing or BFB. Reference [66, ch. 10] describes some efficient
simulation methods for -out-of- :G systems; these methods
combine the IS technique known as forcing (see Section III-B)
with some analytic calculations.
8) Some Empirical Results: Example #1 is a computing
system (originally presented in [47] and then in many papers
thereafter). Consider the unbalanced version of this computing
system. Fig. 1 is the block diagram. It consists of
• 2 sets of processors with 4 processors/set,
• 2 sets of controllers with 2 controllers/set,
• 6 clusters of disks, each consisting of 4 disk units.
Fig. 1. Computing-system example.
In a disk cluster, data are replicated so that one disk can fail
without affecting the system. The “primary” data on a disk are
replicated so that 1/3 is on each of the other 3 disks in the same
cluster. Thus 1 disk in each cluster can be inaccessible without
losing access to the data. It is assumed that when a processor of a
given type fails it has a 0.01 probability of causing the operating
processor of the other type to fail. Each unit in the system has
2 failure modes which occur with equal probability. The failure
rates (per hour) are
• 1/1000 for processors,
• 1/20000 for controllers,
• 1/60000 for disks.
The repair rates (per hour) are
• 1 for all mode 1 failures,
• 1/2 for all mode 2 failures.
This is an unbalanced system with a redundancy of 2. Components
are repaired by a single RP who chooses a component at
random from the set of failed units. The system is operational
if all data are accessible to both processor types, which means
that at least 1 processor of each type, 1 controller in each set,
and 3 out of 4 disk units in each disk cluster are operational.
Operational components continue to fail at given rates when the
system is failed.
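The operational condition stated above can be written as a small predicate. This is only a sketch of the stated rule; the state representation (counts of operational units per group) and the function name are assumptions made for illustration, and failure modes and failure propagation are not modeled.

```python
def is_operational(procs_up, ctrls_up, disks_up):
    """Return True if the computing system of Example #1 is operational.

    procs_up: operational processors per processor type (2 entries, at most 4 each)
    ctrls_up: operational controllers per controller set (2 entries, at most 2 each)
    disks_up: operational disks per cluster (6 entries, at most 4 each)
    """
    return (
        all(n >= 1 for n in procs_up)       # at least 1 processor of each type
        and all(n >= 1 for n in ctrls_up)   # at least 1 controller in each set
        and all(n >= 3 for n in disks_up)   # 3 out of 4 disks in each cluster
    )


# Two failed disks in the same cluster bring the system down, which matches
# the stated redundancy of 2.
two_disks_down = is_operational([4, 4], [2, 2], [4, 4, 2, 4, 4, 4])
```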
To facilitate comparisons between simulation methods on the
same CPU, simulation results (in the MSDIS framework) are
quoted from the latest implementation of these methods [3].
BFB using a total of 200000 cycles (100000 cycles each for
the numerator and the denominator) and gave the
steady-state unavailability estimate of ± 3.8%; the
3.8% is the estimate of the RE corresponding to a 90% confidence
interval.
The corresponding VRR was 167 with a TRR of 415.
The BFB estimate of the MTTF was ± 6.5% with
and .
For the same problem, the most promising MSDIS implementation
of the BLR method without the use of minimum cuts (denoted
by BLBLR in [3]) gave
• for the steady-state unavailability
• for the MTTF.
Fig. 2. Network with redundancies.
Hence, for this example, without the use of additional information
about the system, BFB does better than the BLR method.
For the balanced version of this computing system (the results
of which are in [3]), the improvements obtained by BFB and
BLBLR are similar.
The performance of the failure biasing method and BLR
methods can be improved by using some information about
component types on minimum cuts in the system. The most
recommended MSDIS version of the BLR method with minimum
cuts (denoted by BLBLRC in [3]) gave
• for the steady-state unavailability
• for the MTTF.
There is appreciable improvement over BFB for the MTTF. For
the balanced case of this network, there was appreciable improvement
over BFB for both the MTTF and the unavailability.
Consider the system (see Fig. 2) with important redundancies
that was considered in [3]. The following description is from
[3], with minor modification. The network contains 3 types of
components:
• Type A links contain 3 identical components, which on
average fail every 13 1/3 hours and can be repaired in half
an hour. The type A link fails when 2 components are in
the failed states.
• Type B links contain 1 component, which on average fails
every 40 hours and can be repaired in 1 hour.
• Type C links contain 2 components, which on average fail
every 26 1/3 hours and can be repaired in 2/3 hour. One
component failure on a type C link causes the link to fail.
The system operates as long as there exists a path along operating
links between node 1 and node 20. There are 5 RP, and
repairs make components “good as new.” Upon completing a
repair, an RP selects (uniformly over the failed components in
the network) the next component to repair.
The results are
• , for the BLBLR
• , for the BLBLRC
• , for BFB (worse than ordinary simulation) in the
estimation of the unavailability (estimates of which were of
the order of 10 ).
The BLBLRC methods when applied to another version of
the same network yield orders of magnitude improvement over
BFB when estimating unavailabilities that are of the order of
10 . No comparisons were made with ordinary simulation for
these cases because there were no system failure events with this
method, even in the allotted run time of 10 events. The results
suggest that for systems with appreciable redundancies, BFB is
not at all effective, and the BLR method (with and without cuts)
seems to improve the simulation efficiency.
B. Transient Measures
This section considers the estimation of transient measures
in highly dependable Markov systems. Consider the three measures:
unreliability:
expected interval unavailability:
guaranteed availability:
Research in fast simulation for guaranteed availability is limited
to experiments; see [47].
1) Case 1: Small TH: Small TH means that the TH is small
compared to the expected lifetimes of components. From the
analytic standpoint, it means that the TH is a constant, i.e.,
independent of . The effectiveness of simulation techniques
for small is studied again.
For transient measures, failure biasing (relative to repair)
alone might not be sufficient to observe many system failures,
because it affects only the transitions of the embedded DTMC
and not the random holding times in each state. To see why
this is the case, note that the first component failure in a
system occurs at a very low rate (the sum of failure rates of all
the components). Thus, typically, the first component failure
occurs after time ; thus the chance that the system fails
before the mission time expires is very small. To address this
issue, “forcing” was introduced [71] to modify the random
holding times in particular states. With forcing, the time to first
component failure is sampled conditionally on the fact that it is
less than , i.e., the time to first component failure is sampled
from the distribution:
for (9)
the transition rate out of state under the original measure .
References [99], [115], [101] show that a “combination of
BFB and forcing” gives BRE in estimating the unreliability and
the expected interval unavailability. From a modeling viewpoint,
this implies that for small TH, the simulation can be very
efficient. This agrees with experimental results [47], [99], [115],
[101].
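A minimal sketch of forcing, assuming the holding time in the fully operational state is exponential with a known rate and that the horizon is fixed. Conditioning an exponential time on being at most the horizon gives the distribution function (1 - exp(-rate x)) / (1 - exp(-rate horizon)) on [0, horizon], which the fragment inverts. The function and parameter names are hypothetical.

```python
import math
import random


def sample_forced_exponential(rate, horizon, rng):
    """Sample an Exp(rate) time conditioned on being at most `horizon` (inverse CDF)."""
    u = rng.random()
    mass = -math.expm1(-rate * horizon)      # P(T <= horizon) = 1 - exp(-rate * horizon)
    return -math.log1p(-u * mass) / rate


def forcing_likelihood_ratio(rate, horizon):
    """Original density divided by forced density; constant on [0, horizon]."""
    return -math.expm1(-rate * horizon)


rng = random.Random(7)
first_failure_time = sample_forced_exponential(1e-4, 10.0, rng)
weight = forcing_likelihood_ratio(1e-4, 10.0)
```

Every forced sample carries the same weight, so the likelihood ratio contributed by forcing does not add variability by itself; it multiplies whatever ratio failure biasing contributes on the rest of the path.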
Another technique for estimating transient dependability
measures is to combine failure biasing with “conditioning”
[47]. Conditioning is applied by simulating the embedded
DTMC until the system fails; failure biasing is used to generate
the transitions. Random holding times are generated for
each of the states visited, except for those states having slow
transition rates (e.g., the “fully operational” state, which has
no repairs taking place). Then for each generated sample path,
one can analytically compute the conditional probability that
the system fails before time , given the path of the embedded
DTMC and the sum of the holding times in the states that
do not have slow transition rates. This computation involves
calculating the convolution of exponentially distributed r.v.,
corresponding to the visits of the “conditioned out” states. The
technique is guaranteed to reduce variance, but requires more
computation. Experimental results and comparisons with the
forcing technique are in [47].
2) Case 2: Moderate and Large TH: Even though, for small
TH, the IS-based simulation of transient measures has the BRE
property, it becomes inefficient for moderate and large TH. A
moderate TH implies: is of the same order (of magnitude) as
the expected time to first component failure. Any TH that is at
least 1 order larger is termed “large.” For moderate TH, tuning
the value of the failure biasing parameter through experimentation
can yield efficient estimates [85], but it is difficult to provide
guidelines for how should be set in general. For large
TH, irrespective of the value of , the estimates using failure biasing
are always poor, because the variance of the IS estimator
increases with the variance of the likelihood ratio. The larger
is, the more transitions there are in , and the variance of
the likelihood ratio grows approximately exponentially with the
number of transitions [39].
In estimating unavailability and MTTF, the expressions used
in this paper are in terms of regenerative-cycle-based quantities,
which are estimated using the regenerative method of simulation.
Since in highly dependable systems, regenerative cycles
typically contain a small number of transitions, the use of
IS does not lead to a likelihood ratio with a large variance.
A similar approach can be used in the context of transient
measures. Though transient measures cannot be expressed exactly
in terms of regenerative-cycle-based quantities, it is possible
to develop bounds that are expressed in terms of regenerative-cycle-based
quantities. Thus, when the direct application
of IS to estimate the transient measure itself is inefficient, it is
possible to estimate the bounds efficiently. For highly dependable
systems these bounds are close to the transient measure in
the sense explained in this section, [99], [115], [101], [102].
The is exponentially distributed with rate , which is equal
to the sum of all component failure rates in state . From its
definition, . Let
• if the system does not fail in a regenerative cycle,
• time between the first system failure in a cycle and
the end of the cycle if the system does fail:
. Hence .
When the highly dependable system consists of highly reliable
components, then most regenerative cycles consist of a single
component failure transition followed by a component repair
transition. Because component repair times are typically much
smaller than component failure times, the regenerative cycle
time consists mainly of the first component failure time,
( implies ), which is
exponentially distributed with rate . The number of regenerative
cycles until system failure is geometrically distributed with
probability . The geometric sum (with acceptance
probability ) of exponentially distributed r.v. (each of
which has rate ) is exponentially distributed (with rate ).
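The last statement, that a geometric sum of exponentially distributed r.v. is again exponentially distributed with rate equal to the acceptance probability times the common rate, can be checked numerically. The fragment below is a sketch with hypothetical names and with parameter values chosen only to keep the run short.

```python
import math
import random


def geometric_sum_of_exponentials(rate, accept_prob, rng):
    """Sum Exp(rate) variables until a Bernoulli(accept_prob) trial succeeds."""
    total = rng.expovariate(rate)
    while rng.random() >= accept_prob:
        total += rng.expovariate(rate)
    return total


def exponential_tail(rate, x):
    """P(X > x) for an exponential r.v. with the given rate."""
    return math.exp(-rate * x)


rng = random.Random(2024)
samples = [geometric_sum_of_exponentials(2.0, 0.1, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)                  # compare with 1 / (0.1 * 2.0)
sample_tail = sum(x > 5.0 for x in samples) / len(samples)  # compare with exponential_tail(0.2, 5.0)
```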
Thus, is approximately exponentially distributed with rate
. Let
(10)
then . If TH are modeled by , ( corresponds
to a small TH), then for (corresponding to moderate and
large TH), it has been shown [101], [102] that as and
thus we have an upper bound. Similarly, let
(11)
then , for all .
Also, as for , as
i.e., the lower bound is close to the unreliability for moderate
and large TH.
Both and are in terms of regenerative-cycle-based
quantities. Hence for estimating and , use an MSDIS-type
procedure in which is used to estimate the quantities
associated with rare events like , and
the original probability measure to estimate other quantities
like and .
There are other bounds on the unreliability (in terms of
regenerative-cycle-based quantities), like the ones in [11], [63].
These bounds are close for large , whereas the bounds in [101],
[102] are close for both moderate and large . Bounds for the
expected interval unavailability were developed in [99], [115]
and are (as for the unreliability bounds) close to the actual measure
for moderate and large .
3) Estimation of the Laplace Transform Function: An approach
for estimating the actual transient measure (instead of
estimating close bounds) for large TH is outlined in [13]. Instead
of estimating the transient measure, the “Laplace transform
function” of the transient measure is estimated (the transient
measure is a function of ). Then a Laplace transform inversion
method is used to estimate the transient measure for
any given . The advantage of this approach is that the Laplace
transform function of the transient measure can be expressed
exactly in terms of Laplace transform functions of regenerative-cycle-based
quantities, which can be estimated very efficiently
using IS (if necessary).
For example, consider the unreliability. For any function ,
the Laplace transform function is:
Let:
and
Then the Laplace transform of the unreliability is [13]:
(12)
Both and are regenerative-cycle-based quantities.
For any fixed , the can be efficiently estimated using
IS, and the can be efficiently estimated using ordinary
simulation. Then, the method is: estimate for some values
of [by estimating and ], and then use a Laplace
transform inversion algorithm to obtain for a given . A
similar method for estimating the interval unavailability is in
[13]. This transform approach [13] is a bit tedious to implement,
but yields good experimental results.
C. Estimation of Derivatives
Performance measures of a system are (complicated) functions
of the system parameters, such as the component failure
and repair rates. Thus, one can compute derivatives of performance
measures with respect to these parameters. This section
reviews work in this area for highly dependable Markov systems.
An example is determining the derivative of the MTTF
with respect to a particular component’s failure rate. The derivative
information is useful when designing systems, because this
knowledge can help the designer identify system parts that need
improvement.
First consider estimating derivatives of the MTTF. Recall the
ratio expression in (8) for the MTTF; then differentiate it with
respect to some system parameter (e.g., some component’s
failure rate).
(13)
derivative operator with respect to .
Thus, estimating requires estimating each of these
4 quantities in (13). A central limit theorem for the resulting
estimator of is derived in [83]; confidence intervals
for the derivative can be formed. Section III-A discussed
estimating and ; thus the focus
here is on estimating their derivatives.
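Because (8) expresses the MTTF as a ratio, the four quantities mentioned above are the numerator, the denominator, and their two derivatives, combined by the quotient rule. A minimal sketch follows, with hypothetical names; the inputs would be simulation estimates, and the numbers in the usage line are arbitrary.

```python
def ratio_derivative(num, den, d_num, d_den):
    """Derivative of num/den by the quotient rule, given the four estimated quantities."""
    return (d_num * den - num * d_den) / den ** 2


# Check on a known function: num(t) = t**2, den(t) = 1 + t, evaluated at t = 2.
value = ratio_derivative(4.0, 3.0, 4.0, 1.0)
```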
One simulationbased approach for estimating derivatives is
the likelihoodratio derivative method [38], [93], which is now
briefly described. Focus on estimating
derivative with respect to . To estimate
requires simulating the embedded DTMC,
Let
andbe the hitting time to
the embedded DTMC, respectively (the numbers of transitions
until hitting
and , respectively); then
and its
, only
.
and the cycle length of
The(original)transitionprobabilitymatrixof
Then, under certain regularity conditions [38], [93],
(under)is.
The
to estimate
using the original measure
is determined within a single regenerative cycle. Thus,
, generate
, and collect observations:
regenerative cycles
of
The standardsimulation estimator of
is
Similarly, estimate
One drawback of the likelihood-ratio derivative method is that it
yields derivative estimators with large variances in many settings.
Specifically, theoretical and empirical work [36], [93] shows that
the variances of derivative estimators:
• are typically much larger than those of the respective
performance-measure estimators,
• grow linearly in the expected number of events in an
observation.
When regenerative simulation is used, an observation corresponds
to a regenerative cycle, which typically consists of very few
transitions for highly dependable Markov systems. Thus, the
likelihood-ratio method seems to be well-suited for these types of
systems.
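
To make the likelihood-ratio derivative method concrete, the following is a minimal sketch under stated assumptions, not the procedure of [38], [93] as implemented by the authors. It assumes a hypothetical 4-state embedded DTMC (state 0 = all components up, state 3 = system failed) in which each transition from states 1 and 2 is a failure with probability p = theta/(theta + mu) and a repair otherwise. The function names, the chain, and the parameter values are illustrative assumptions. Each regenerative cycle yields the indicator of hitting the failed state before returning to state 0, together with the accumulated sum of derivative-to-probability ratios of the transitions taken.

```python
import random


def lr_derivative_estimate(theta, mu, n_cycles, seed=0):
    """Estimate gamma = P(hit state 3 before returning to state 0) and its
    derivative with respect to theta, using ordinary simulation and the
    likelihood-ratio derivative method on a hypothetical 4-state DTMC."""
    rng = random.Random(seed)
    p = theta / (theta + mu)  # failure-transition probability
    # d/dtheta of log(transition probability) for the two transition types.
    score_fail = mu / (theta * (theta + mu))  # d log p / d theta
    score_repair = -1.0 / (theta + mu)        # d log(1 - p) / d theta
    sum_ind = 0.0
    sum_ind_score = 0.0
    for _ in range(n_cycles):
        # The transition 0 -> 1 has probability 1, so its score is 0.
        state, score = 1, 0.0
        while state not in (0, 3):
            if rng.random() < p:
                state += 1
                score += score_fail
            else:
                state -= 1
                score += score_repair
        hit = 1.0 if state == 3 else 0.0
        sum_ind += hit
        sum_ind_score += hit * score
    return sum_ind / n_cycles, sum_ind_score / n_cycles


def gamma_exact(theta, mu):
    """Closed form of gamma for this particular toy chain."""
    p = theta / (theta + mu)
    return p * p / (1.0 - p + p * p)


if __name__ == "__main__":
    gamma_hat, dgamma_hat = lr_derivative_estimate(0.5, 1.0, 100_000, seed=1)
    print(gamma_hat, dgamma_hat)
```

The parameter values here are deliberately not rare-event values, so that ordinary simulation suffices for the check; for highly dependable systems the same observations would be collected under IS and weighted by the likelihood ratio.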
Theoretical studies in [77], [78] established that when estimating
derivatives with respect to certain system parameters (e.g., failure
rates of certain components) using ordinary simulation, the ratio of
the RE of the estimate of $\partial\gamma$ and the estimate of
$\gamma$ remains bounded. This occurs when the parameter corresponds
to one of the largest (in absolute value) sensitivities, where the
sensitivity with respect to a parameter $\theta$ is defined as the
product of $\theta$ and the derivative with respect to $\theta$.
Sensitivities measure how relative changes in a parameter value
affect the overall performance. Thus, for parameters $\theta$
corresponding to the largest sensitivities, one can estimate the
derivative with respect to $\theta$ and the performance measure with
about the same relative accuracy. However, the REs of both these
estimators go to infinity as the system unreliability tends to zero
(see Section II-B); therefore IS must be used. The derivatives with
respect to parameters that do not correspond to the largest
sensitivities might not be estimated as efficiently as the
performance measure when using ordinary simulation; [78] gives an
example illustrating this.
This section implements IS by simulating $n$ regenerative cycles
using another probability measure and collecting observations
$(I_m, S_m, L_m)$, $m = 1, \ldots, n$, of the triplet $(I, S, L)$,
where $I = 1(\tau_F < \tau_0)$, $S$ is the sum over the cycle of the
ratios $\partial P / P$ for the transitions taken, and $L$ is the
likelihood ratio. Then the IS estimator of $\partial\gamma$ is

$$\widehat{\partial\gamma}_{IS} = \frac{1}{n} \sum_{m=1}^{n} I_m S_m L_m.$$

When BFB is applied, then the estimator of $\partial\gamma$ has BRE
[78]. Necessary and sufficient conditions for BRE of derivative
estimators obtained using other failure biasing methods and more
general IS schemes are established in [79], [81].
Reference [82] shows that even though the numerator $\xi$ in the
MTTF ratio formula can be estimated with BRE using ordinary
simulation, its derivative estimators can have unbounded RE.
Consequently, if
• $\xi$ and its derivative are estimated using ordinary simulation,
• BFB is applied to the estimation of the denominator $\gamma$ and
its derivative,
• all 4 terms are estimated mutually independently (using
measure-specific IS),
then the resulting estimator of the derivative of the MTTF can have
unbounded RE. On the other hand, if BFB is also used to estimate
$\partial\xi$, then its estimator has BRE and so does the resulting
estimator of the derivative of the MTTF.
Experimental work in [83] seems to indicate that derivatives of the
MTTF and the steady-state unavailability for large systems can be
estimated efficiently using BFB. When estimating derivatives of the
unreliability using BFB and forcing (see Section III-B), the
empirical results show that the REs of the estimators are typically
small when the TH is small, but they grow as the TH increases [101].
This is analogous to the behavior of (non-derivative) estimators of
the unreliability, as discussed in Section III-B. Estimation of
derivatives of the unreliability for large TH, using the bounding
method in Section III-B, is treated in [101].
Reference [79] presents an example of a system showing that when
estimating $\gamma$ and its derivatives using simple failure
biasing, estimators of derivatives with respect to certain component
failure rates can have BRE, while the performance-measure estimator
does not. Thus, it is possible to estimate a derivative more
efficiently than the performance measure when using simple failure
biasing.
IV. FAST SIMULATION OF NON-MARKOV MODELS
Notation
$N(t)$   NHPP
$\lambda(t)$   intensity rate function of the NHPP, $t \ge 0$
$\beta$   upper bound for the intensity rate function
$N_\beta(t)$   time-homogeneous Poisson process with rate $\beta$
$T_n$   time of event $n$ of $N_\beta$
$G_i(x), g_i(x)$   Cdf, pdf of the lifetime of component $i$, evaluated at $x$
$h_i(x)$   hazard rate of the lifetime of component $i$, evaluated at $x$
$r_i(x)$   hazard rate of the repair time of component $i$, evaluated at $x$
$\lambda_i(t), \lambda_i'(t)$   failure rate of component $i$ at time $t$ without, with IS
$\mu_i(t), \mu_i'(t)$   repair rate of component $i$ at time $t$ without, with IS
$\lambda_F(t), \lambda_F'(t)$   total failure rate of all components at time $t$ without, with IS
$\mu_R(t), \mu_R'(t)$   total repair rate of all components at time $t$ without, with IS
$e(t), e'(t)$   total event rate at time $t$ without, with IS
$L_F(t), L_R(t)$   likelihood ratio of failure, repair events at time $t$
$L_P(t), L(t)$   likelihood ratio of pseudo, all events at time $t$
$N_i(t)$   number of component $i$ failures by time $t$
$O(t)$   set of operational components at time $t$
$T_{ij}$   time component $i$ fails at its failure #$j$
$Z$   length of a generic $A$-cycle
$N_F$   number of system failures in an $A$-cycle in the steady-state
$D, DL$   total system failure time, $D$ multiplied by the likelihood ratio on a generic (biased) cycle
$E_{P,\pi}$   expectation under probability measure $P$ and initial distribution $\pi$
$\mathrm{Var}_{P,\pi}$   variance under probability measure $P$ and initial distribution $\pi$
$b$   number of batches used in the batch means method
$\bar{Z}$   generic batch mean of $Z$ and its estimator
$\overline{DL}$   generic batch mean of $DL$ and its estimator
$\mathrm{Cov}_{P,P',\pi}$   covariance under probability measures: $P$ for the first r.v., $P'$ for the second r.v., and under initial distribution $\pi$.
This section uses IS to estimate dependability measures when the
failure and repair times of components might not be exponentially
distributed (under certain assumptions), and is based on [40], [54],
[55], [87], [88].
Except for some technical and implementation details, most of the IS
heuristics developed for Markov models also apply to non-Markov
models. One approach to implement failure biasing (or forcing) in
discrete-event systems is to reschedule failure events by sampling
from new accelerated failure distributions [85] (see footnote 5).
Heuristics and their implementation, as well as experimental results
demonstrating the effectiveness of the techniques to estimate
steady-state and transient measures, are in [85], [86].
Another approach to IS in discrete-event systems (briefly described
in this section) is based on the uniformization method of
simulation. The method requires the underlying (uniformized)
distributions to have bounded hazard-rate functions. A closely
related method which avoids generation of pseudo events is
exponential transformation; here, the time to the next failure event
is sampled directly from an exponential distribution [87]. Failure
biasing or forcing is effected by increasing the failure rate
(relative to the repair rate or the mission time, respectively). The
latter two techniques are somewhat simpler to implement than the
technique in [85], because failure events need not be rescheduled
and are generated using only the exponential distribution.
The next paragraph briefly describes the uniformization method of
simulation, which is used in this section as a basis of our approach
to IS in non-Markov models.
5 There is considerable freedom in the choice of the new distributions.
Uniformization-Based Sampling: Uniformization (or thinning) is a
simple technique for sampling (simulating) the event times of
certain stochastic processes including NHPP, renewal processes, or
Markov processes in continuous time on either discrete or continuous
state spaces [22], [26], [49], [58], [72], [104]. It is described
for a NHPP $\{N(t)\}$ with intensity function $\lambda(t)$. Assume
that $\lambda(t) \le \beta$ for all $t \ge 0$ for some finite
constant $\beta$. Then the event times of $\{N(t)\}$ can be sampled
by thinning the $\{N_\beta(t)\}$ process (a time-homogeneous Poisson
process with rate $\beta$ and event times $T_n$) as follows:
For each $n \ge 1$, include (accept) $T_n$ as an event time in
$\{N(t)\}$ with probability $\lambda(T_n)/\beta$; otherwise the
point is not included (is rejected).
Rejected events are sometimes called pseudo events.
Throughout it is assumed that all rate functions are
left-continuous: $\lambda(t) = \lambda(t^-)$; thus if an event
occurs at some random time $T$, then $\lambda(T)$ is the event rate
just prior to time $T$.
Renewal processes can be simulated using uniformization, provided
that $\lambda(t)$ is the hazard rate of the inter-event time
distribution at time $t$. Uniformization can be generalized to cases
in which the process being thinned is not a time-homogeneous Poisson
process [72]. For example, at time $T_n$, set
$T_{n+1} = T_n + E_n$, where $E_n$ has an exponential distribution
with rate $\beta_n$. The point $T_{n+1}$ is then accepted with
probability $\lambda(T_{n+1})/\beta_n$. This requires only that
$\lambda(t) \le \beta_n$, for all $t > T_n$.
Section IV-A describes how the uniformization method of simulation
can be combined with IS to develop an effective technique for
estimating transient measures in non-Markov models of highly
dependable systems.
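
The thinning procedure for an NHPP described above can be sketched in a few lines. This is an illustrative sketch only: the function name `sample_nhpp_by_thinning`, the example intensity 1 + sin(t), and the bound beta = 2 are assumptions chosen for the demonstration and do not come from the paper. Candidate points are generated from a time-homogeneous Poisson process with rate beta, and each is accepted with probability intensity(t)/beta; the rejected points are the pseudo events.

```python
import math
import random


def sample_nhpp_by_thinning(intensity, beta, horizon, rng):
    """Sample the event times of an NHPP on (0, horizon] by thinning a
    time-homogeneous Poisson process with rate beta.

    Requires 0 <= intensity(t) <= beta on (0, horizon].
    Returns (accepted event times, number of pseudo events)."""
    events = []
    pseudo_events = 0
    t = 0.0
    while True:
        t += rng.expovariate(beta)  # next point of the rate-beta process
        if t > horizon:
            break
        rate = intensity(t)
        if rate < 0.0 or rate > beta:
            raise ValueError("intensity must lie in [0, beta]")
        if rng.random() < rate / beta:
            events.append(t)        # accepted: a real event
        else:
            pseudo_events += 1      # rejected: a pseudo event
    return events, pseudo_events


if __name__ == "__main__":
    rng = random.Random(7)
    times, n_pseudo = sample_nhpp_by_thinning(
        lambda t: 1.0 + math.sin(t), 2.0, 2.0 * math.pi, rng)
    print(len(times), n_pseudo)
```

The tighter the bound beta is to the intensity, the fewer pseudo events are generated; a loose bound keeps the sampling correct but wastes candidate points.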
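A. Transient Measures
Consider the problem of estimating the unreliability,
$P\{$time to system failure $\le t\}$, for some fixed value of $t$.
To simplify the notation, let there be 1 component of each type
(although more general situations can be handled). With $G_i$, $g_i$
the Cdf, pdf of the lifetime of component $i$, the hazard rate [5]
of component $i$ is then $h_i(x) = g_i(x)/[1 - G_i(x)]$, which we
assume is well defined and finite. Let $r_i(x)$ be the hazard rate
of the repair time of component $i$, and let
$\lambda_i(t) = h_i(a_i(t))$, $a_i(t) \equiv$ age of component $i$
at time $t$. [If component $i$ is not operational at time $t$, then
$\lambda_i(t) = 0$.]
$\mu_i(t) = r_i(b_i(t))$, $b_i(t) \equiv$ elapsed repair time on
component $i$ at time $t$. [If component $i$ is not being repaired
at time $t$, then $\mu_i(t) = 0$.]
There are a variety of ways to use IS in simulations of such a
system. Begin with a direct analog of forcing and BFB. This method
is based on uniformization. If components are highly reliable, then
[…]. If the failure and repair rates are bounded, then the system
can be simulated (without IS) by uniformization as follows.
Assume $e(t) = \lambda_F(t) + \mu_R(t) \le \beta$, for all times
$t$, where $\lambda_F(t) = \sum_i \lambda_i(t)$ and
$\mu_R(t) = \sum_i \mu_i(t)$ are the total failure and repair rates;
$\beta$ is a constant. Then a Poisson process with rate $\beta$ is
simulated. Let an event in this Poisson process occur at time $T$.
That event is accepted as a component $i$ failure event with
probability $\lambda_i(T)/\beta$, and is accepted as a component $i$
repair event with probability $\mu_i(T)/\beta$. However, because it
might be that $e(T) < \beta$, another possibility exists: a pseudo
event (neither a failure nor a repair occurs). This occurs with
probability $1 - e(T)/\beta$. The probability of a failure event is
$\lambda_F(T)/\beta$, and of a repair event is $\mu_R(T)/\beta$.
For highly reliable components, $\lambda_F(t) \ll \mu_R(t)$ whenever
repairs are ongoing, thus the probability of a failure is very
small. To accelerate failures, simply change the acceptance
probabilities of the various event types, which is equivalent to
changing component failure and repair rates to, say,
• $\lambda_i'(t)$ [such that […]],
• $\mu_i'(t)$ [such that […]].
The likelihood ratio (at time $t$) is:

$$L(t) = L_F(t)\, L_R(t)\, L_P(t) \qquad (14)$$

where $L_F(t)$, $L_R(t)$, $L_P(t)$ are the likelihood ratios of the
failure, repair, and pseudo events up to time $t$.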
These likelihood ratios have a simple form. For example, let
component $i$ fail on its own (not through failure propagation)
$N_i(t)$ times in $(0, t)$, at times $T_{ij}$, $j = 1, \ldots, N_i(t)$.
Then

$$L_F(t) = \prod_i \prod_{j=1}^{N_i(t)} \frac{\lambda_i(T_{ij})}{\lambda_i'(T_{ij})} \qquad (15)$$

Equation (15) assumes that the failure propagation probabilities are
sampled from their given distributions. However, IS can be applied
to these as well. The terms $L_R(t)$ and $L_P(t)$ can be expressed
similarly. The likelihood ratio can be computed (updated)
recursively at uniformization event times during the simulation.
The analog of BFB with forcing is accomplished as follows. If no
repairs are ongoing (e.g., in the state where all components are
operational), let $\lambda_F'(t) = \lambda'$, for some constant
$\lambda'$. (In practice [87], $\lambda'$ could be chosen such that
$1 - e^{-\lambda' t} = 0.8$. This means that, with probability 0.8,
some component fails before the TH $t$ expires.) If repairs are
ongoing, let $e'(t) = e(t)$: the total event rate is the same as
without IS. Then, let $\lambda_F'(t) = \rho\, e(t)$, for some
constant $\rho$: given that the event is real, make it a failure
event with probability $\rho$. (In practice [87], $\rho$ is usually
set in the range from 0.3 to 0.5.) Given a failure event, pick an
operating component $i$ to fail with probability $1/|O(t)|$, where
$O(t)$ is the set of operational components at time $t$. Under
appropriate technical conditions, it can be shown that such a
heuristic for IS (which is the analog of forcing and failure
biasing) results in estimates having BRE [55]. In particular, let
[…], and let there exist a small positive parameter $\epsilon$ such
that […], where […] and […]. If IS is done such that, for all […],
[…] [when […]], […] (when component $i$ is undergoing repair), then
under some additional minor assumptions (including that the failure
propagation probabilities are independent of $\epsilon$), the
estimates of the unreliability have BRE as $\epsilon \to 0$.
Because repair distributions might not have bounded hazard rates
(e.g., discrete and uniform), it is desirable to seek effective IS
methods that do not rely on uniformization for repair events. The
above uniformization-based algorithm can be applied to just failure
events: repair times are sampled directly from their original
distributions, while uniformization is used to simulate failure
events. The likelihood ratio is then the product of the failure
event and pseudo event terms, $L(t) = L_F(t)\, L_P(t)$: it does not
include the repair event term $L_R(t)$. Again, under appropriate
technical conditions, this modification results in BRE [55].
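
The uniformization-based analog of forcing and BFB described above can be sketched on a toy system. This sketch rests on several assumptions that are not from the paper: a hypothetical system of 2 identical components that fails when both are down, with exponential lifetimes and repair times (constant rates `lam` and `mu`, so that the result can be checked against an exact value), and the illustrative function name `unreliability_is`. The forcing probability 0.8 and the choice rho = 0.5 follow the ranges quoted from [87]. The likelihood ratio is updated recursively at each uniformization event as the ratio of the original to the IS acceptance probability of the event that occurred.

```python
import math
import random


def unreliability_is(lam, mu, horizon, beta, rho=0.5, force_prob=0.8,
                     n_runs=20000, seed=0):
    """IS estimate of P(both components down before `horizon`) for a
    hypothetical 2-component system, using uniformization with rate beta."""
    # Forcing: 1 - exp(-lam_forced * horizon) = force_prob.
    lam_forced = -math.log(1.0 - force_prob) / horizon
    if max(2.0 * lam, lam + mu) > beta or lam_forced >= beta:
        raise ValueError("beta must bound all total event rates")
    rng = random.Random(seed)
    total, total_sq = 0.0, 0.0
    for _ in range(n_runs):
        t, n_down, lr, failed = 0.0, 0, 1.0, False
        while True:
            t += rng.expovariate(beta)        # next uniformization event
            if t > horizon:
                break
            # Original acceptance probabilities of failure and repair.
            p_fail = (2 - n_down) * lam / beta
            p_rep = n_down * mu / beta
            if n_down == 0:
                # No repairs ongoing: forced total failure rate.
                q_fail, q_rep = lam_forced / beta, 0.0
            else:
                # Repairs ongoing: same total event rate, a real event is
                # a failure with probability rho.
                e = p_fail + p_rep
                q_fail, q_rep = rho * e, (1.0 - rho) * e
            u = rng.random()
            if u < q_fail:                    # failure event
                lr *= p_fail / q_fail
                n_down += 1
                if n_down == 2:
                    failed = True
                    break
            elif u < q_fail + q_rep:          # repair event
                lr *= p_rep / q_rep
                n_down -= 1
            else:                             # pseudo event
                lr *= (1.0 - p_fail - p_rep) / (1.0 - q_fail - q_rep)
        x = lr if failed else 0.0
        total += x
        total_sq += x * x
    mean = total / n_runs
    var = max(total_sq / n_runs - mean * mean, 0.0)
    rel_err = math.sqrt(var / n_runs) / mean if mean > 0.0 else float("inf")
    return mean, rel_err


if __name__ == "__main__":
    print(unreliability_is(1e-3, 1.0, 10.0, 2.0, seed=1))
```

Because the components are identical and the failing component is picked uniformly among the operating ones, tracking only the number of failed components gives the same likelihood ratio as tracking component identities. With age-dependent (bounded) hazard rates the event loop would stay the same, but `p_fail` and `p_rep` would be recomputed from the component ages at each event.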
Uniformization can be computationally inefficient if there are many
pseudo events. In addition, suppose events from a Poisson process
with rate $\beta$ are accepted as failure events with probability
$p$. Then the time until an accepted event has an exponential
distribution with rate $\beta p$. This suggests sampling the time to
next failure event directly from an exponential distribution with
rate $\beta p$. A generalization of this approach (exponential
transformation) also results in estimates having BRE (under
appropriate technical conditions [55]). The likelihood ratio takes
on a somewhat different form [55], [87].
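
The observation that motivates exponential transformation can be checked numerically. The sketch below is only an illustration of that fact, with assumed names (`time_to_accepted_event`, `compare_means`) and assumed parameter values: it compares the mean time to the first accepted event of a thinned rate-beta Poisson process against the mean of a directly sampled exponential with rate beta * p.

```python
import random


def time_to_accepted_event(beta, p, rng):
    """Time until the first accepted event when events of a Poisson process
    with rate beta are each accepted independently with probability p."""
    t = 0.0
    while True:
        t += rng.expovariate(beta)
        if rng.random() < p:
            return t


def compare_means(beta, p, n, seed=0):
    """Return (mean via thinning, mean via direct exponential sampling)."""
    rng = random.Random(seed)
    thinned = sum(time_to_accepted_event(beta, p, rng) for _ in range(n)) / n
    direct = sum(rng.expovariate(beta * p) for _ in range(n)) / n
    return thinned, direct


if __name__ == "__main__":
    # Both sample means should be close to 1 / (beta * p) = 2.0.
    print(compare_means(5.0, 0.1, 20000, seed=3))
```

Direct sampling needs one random variate per failure event, whereas thinning needs on average 1/p candidate points, which is the computational saving the text refers to.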
Empirical studies testing the effectiveness of these methods are
reported in [54], [55], [87]. Generally, good variance reduction is
obtained if […] is small, say, less than 10 […]. The smaller […] and
[…] are, the greater the […] can be made.
Finally, though no formal study has been done, we anticipate that
these techniques (with minor modifications) apply also for
estimating the expected interval unavailability.
B. Steady-State Measures
Non-Markov models of highly dependable systems might not possess an
explicit regenerative structure. If they do not, then a ratio
representation of steady-state measures in terms of
regenerative-cycle-based quantities, such as in (5), is no longer
possible. This section discusses an approach for the efficient
estimation of steady-state measures, such as system unavailability
and MTBF, in non-Markov non-regenerative models. The approach uses a
representation of steady-state measures in terms of quantities based
on $A$-cycles: a sample path between two successive entries of the
system into some set of states $A$. In the context of highly
dependable systems (as in Section II), choose $A$ to be the state in
which all components are operational. Only when all component
failure-time distributions are exponential (regardless of the
repair-time distributions) does entrance into the set $A$ constitute
a regeneration point, and a ratio representation of the steady-state
unavailability, $U$, in terms of regenerative-cycle-based
quantities, as in (5), is still valid [85]. However, this is no
longer true if component failure times are generally distributed.
Therefore, in general, $A$-cycles are not i.i.d., and one cannot use
classical statistical techniques to estimate the variances of
$A$-cycle-based quantities. Instead, one can use the method of batch
means to estimate the variances of these quantities by grouping
successive $A$-cycle-based quantities into nonoverlapping batches,
and then treating the batch means as i.i.d. observations; this is an
approximation whose validity increases with the batch size.
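
The method of batch means mentioned above can be sketched generically as follows. This is a minimal illustration for a plain sample mean, not the ratio-estimator procedure of [88]; the function name `batch_means` and the AR(1) test sequence standing in for correlated cycle-based observations are assumptions made for the example.

```python
import math
import random


def batch_means(observations, n_batches):
    """Group successive observations into n_batches nonoverlapping batches
    of equal size and treat the batch means as i.i.d.

    Returns (grand mean, estimated standard error of the grand mean).
    Observations beyond n_batches * batch_size are discarded."""
    batch_size = len(observations) // n_batches
    if n_batches < 2 or batch_size < 1:
        raise ValueError("need at least 2 batches of at least 1 observation")
    means = [
        sum(observations[k * batch_size:(k + 1) * batch_size]) / batch_size
        for k in range(n_batches)
    ]
    grand_mean = sum(means) / n_batches
    sample_var = sum((m - grand_mean) ** 2 for m in means) / (n_batches - 1)
    return grand_mean, math.sqrt(sample_var / n_batches)


if __name__ == "__main__":
    # Positively correlated AR(1) sequence as a stand-in for dependent
    # cycle-based observations.
    rng = random.Random(11)
    x, series = 0.0, []
    for _ in range(40000):
        x = 0.9 * x + rng.gauss(0.0, 1.0)
        series.append(x)
    print(batch_means(series, 20))
```

For positively correlated observations, the batch-means standard error is typically much larger than the naive i.i.d. standard error, which underestimates the variability; larger batches make the i.i.d. approximation for the batch means more accurate.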
Let $\pi$ be the initial distribution of the corresponding
(original) stochastic process upon entering the set $A$, after the
stochastic process has reached the steady-state. According to the
definition of $A$, $\pi$ is the steady-state joint distribution of
the components' ages upon entering the state in which all components
are operational; upon entering $A$, at least 1 component has an
age $= 0$.
Under fairly general ergodicity conditions (which also ensure that
the system returns to the set $A$ infinitely often), the ratio
representation of $U$ in terms of $A$-cycle-based quantities is (see
footnote 6):

$$U = \frac{E_{P,\pi}(D)}{E_{P,\pi}(Z)} \qquad (16)$$

where $D$ is the total system failure time in a generic $A$-cycle
and $Z$ is the length of the cycle. The subscripts denote that the
expectation is with respect to the original probability measure $P$
(which governs the behavior of the original system) and the
steady-state initial distribution $\pi$ of the $A$-cycles. A ratio
representation for the MTBF in terms of $A$-cycle-based quantities
is

$$\mathrm{MTBF} = \frac{E_{P,\pi}(Z)}{E_{P,\pi}(N_F)} \qquad (17)$$

where $N_F$ is the number of system failures in an $A$-cycle.
The remainder of this section reviews the estimation of $U$, which
has been considered in [88]. A similar approach to estimate the MTBF
is in [40].
Because system failure is a rare event, ordinary simulation is very
inefficient to estimate $E_{P,\pi}(D)$; this motivates the use of
IS.
$P' \equiv$ a new probability measure to simulate the system.
$\omega \equiv$ a sample path in the original process, on which the
total system downtime is evaluated to be $D(\omega)$.
$P'$ must satisfy the condition $dP'(\omega) \ne 0$, "whenever
$D(\omega)\, dP(\omega) \ne 0$".
With IS, $DL$ is an unbiased estimate of $E_{P,\pi}(D)$
($L \equiv$ likelihood ratio).
An appropriate choice of $P'$ should yield
$\mathrm{Var}_{P',\pi}(DL) \ll \mathrm{Var}_{P,\pi}(D)$, which
implies much better precision in estimating $E_{P,\pi}(D)$.
$E_{P,\pi}(Z)$ can be estimated efficiently using ordinary
simulation. Therefore, the ratio in (16) can be written as

$$U = \frac{E_{P',\pi}(DL)}{E_{P,\pi}(Z)} \qquad (18)$$

The resulting scheme is analogous to MSDIS for estimating the
steady-state unavailability in Markov models [46] (see Section
III-B). First, the system is simulated using the original $P$ for a
sufficiently long time to approximately reach the steady-state. At
that time, the initial distribution upon entry of $A$-cycles is
sufficiently close to $\pi$, and one begins to use the following
splitting technique. For each (steady-state) $A$-cycle, run the
simulation (once or more) starting with the same component failure
ages and using $P'$ to get samples of $D$ and $L$; these are biased
cycles. Then run the same $A$-cycle using the original $P$ to get a
sample of $Z$
6 For details, see [10], [16], [27], [106].