
246 IEEE TRANSACTIONS ON RELIABILITY, VOL. 50, NO. 3, SEPTEMBER 2001

Techniques for Fast Simulation of Models of Highly Dependable Systems

Victor F. Nicola, Perwez Shahabuddin, and Marvin K. Nakayama

Abstract—With the ever-increasing complexity and requirements of highly dependable systems, their evaluation during design and operation is becoming more crucial. Realistic models of such systems are often not amenable to analysis using conventional analytic or numerical methods. Therefore, analysts and designers turn to simulation to evaluate these models. However, accurate estimation of dependability measures of these models requires that the simulation frequently observes system failures, which are rare events in highly dependable systems. This renders ordinary simulation impractical for evaluating such systems. To overcome this problem, simulation techniques based on importance sampling have been developed, and are very effective in certain settings. When importance sampling works well, simulation run lengths can be reduced by several orders of magnitude when estimating transient as well as steady-state dependability measures. This paper reviews some of the importance-sampling techniques that have been developed in recent years to estimate dependability measures efficiently in Markov and non-Markov models of highly dependable systems.

Index Terms—Highly dependable system, importance sampling, Markov chain, simulation, steady-state dependability measure, transient dependability measure.

ACRONYMS¹

BFB balanced failure biasing

BLBLR balance over links BLR

BLBLRC BLBLR with cuts

BLR balanced likelihood ratio

BRE bounded RE

CLT central limit theorem

CTMC continuous-time MC

DTMC discrete-time MC

GSMP generalized semi-Markov process

i.i.d. s-independent and identically distributed

IS importance sampling

MC Markov chain

MSDIS measure-specific dynamic IS

MTBF mean time between failures

MTTF mean time to failure

Manuscript received November 1, 1998; revised June 6, 2000. This work was

supported in part by the US National Science Foundation under Grants DMI-

9625297, DMI-9624469, and DMI-9900117.

Responsible Editor: C. Alexopoulos

V. F. Nicola is with Telematics Systems and Services, Department of Elec-

trical Engineering, University of Twente, Enschede, The Netherlands.

P. Shahabuddin is with Columbia University, New York, NY 10027 USA

(e-mail: Perwez@ieor.columbia.edu).

M. Nakayama is with the Department of Computer and Information Science,

New Jersey Institute of Technology, Newark, NJ, USA.

Publisher Item Identifier S 0018-9529(01)11169-3.

¹ The singular and plural of an acronym are always spelled the same.

NHPP nonhomogeneous Poisson process

pdf probability density function

r.v. random variable

RE relative error

RP repair person

SAVE system availability estimator

TH time horizon

TRR total effort reduction ratio

VRR variance reduction ratio.

I. INTRODUCTION

HIGH dependability requirements of today’s critical and/or commercial systems often lead to complicated and costly designs. The ability to predict relevant dependability measures for such complex systems is essential, not only to guarantee high levels of dependability during system operation but also to improve the cost-effectiveness during system design and development.

Several measures are commonly used for assessing the dependability of a system, and the choice of the particular dependability measures used to evaluate a particular system depends on the intended operation and the environment of such a system. For example, mission-oriented systems are often evaluated using transient measures, such as system reliability (probability that the system is operational during the entire mission time). Given that the system is initially in an operational state, MTTF is the mean time to the first system failure; this is another measure of interest for mission-oriented systems. On the other hand, MTBF is the mean time between subsequent system failures in steady-state. MTBF and the steady-state availability (fraction of time the system is operational in the long run) are often used for evaluating continuously operating systems.

Fault-tolerance and recovery techniques are frequently used in the design of complex systems to enhance their dependability. As a consequence, very high reliability/availability requirements of systems can now be sustained. However, the performance of continuously operating systems can be degraded/upgraded due to load surges or reconfigurations after failures/repairs. In other words, the performance level of degradable/repairable systems is changing with time in response to internal or external events. To evaluate these systems properly, there is a need for measures that combine performance and reliability/availability aspects. Such measures were first introduced in [74], and were termed “performability” measures. An example of such a measure is the distribution (or s-expectation) of cumulative performance in a given interval of time. A special case of this measure is the distribution (or

0018–9529/01$10.00 © 2001 IEEE


s-expectation) of interval availability, which is the fraction

of time the system is operational (regardless of performance)

during a given interval of time. The distribution of interval

availability (or guaranteed availability [48]) is a relevant

attribute of continuously operating systems, because it gives

the probability that the system is operational for more than

a specified fraction of a given interval of time. For example,

one might be interested in computing the probability that the

system is unavailable for more than 0.1% of the time in 1 year

of system-operation.

A. Numerical Evaluation of Dependability Measures

Researchers have long been aware of the importance and

necessity of developing techniques and tools to evaluate

highly dependable systems effectively. Most of the efforts are

limited to analytic or numerical solutions, usually restricted to

Markov (less often, semi-Markov) models. For a more detailed

discussion on performability measures and state-of-the-art

techniques for their evaluation, see [23]. The applicability of

these techniques, however, is quickly hindered by practical

problems, such as state-space explosion and/or the inadequacy

of Markov or semi-Markov representations of real systems.

Because the number of states in Markov models usually

grows exponentially with the number of system-components,

and because of storage and computational limitations, only

relatively small systems can be analyzed using numerical

solution techniques. Several techniques have been proposed

and, if applicable, can help to reduce the state-space of large

Markov models. For example, exact lumping [45], [84], or

approximations obtained by truncation and bounding [76],

are used. However, even for a moderately-sized system, the

corresponding Markov model can be “stiff”² (usually when

transition rates are of different orders of magnitude), leading

to difficulties when using numerical solvers [92]. Behavioral

decomposition [9] and iterative decomposition/aggregation

techniques [19] are among several techniques that can help

overcome “stiffness” of Markov models.

B. Effective Simulation

When conventional analytic/numerical methods are no

longer feasible, analysts often turn to computer simulation,

with the obvious advantages of flexible representation of

complex systems at the desired level of abstraction and low

storage requirements. However, the accurate estimation of

dependability measures using simulation requires frequent

observations of the system-failure event, which by definition

are rare events in highly dependable systems. This renders

conventional (ordinary) simulation impractical for evaluating

such systems [30]. To attack this problem, in recent years, there

have been considerable and successful efforts to develop fast

simulation techniques based on IS [41], [51]. The basic idea

is quite simple: simulate the system using new probability-dynamics (different from the original probability-dynamics of the system), so as to increase the probability of typical sequences of events leading to system failure. For example, in a redundant system with 2 components, accelerating the failure of component #2 while component #1 is being repaired typically increases the probability of another component failure, which would lead to system failure. The measure obtained in a given observation (a sample path of a simulation trial) is then multiplied by a correction factor called the “likelihood ratio” to yield an s-unbiased estimate of the measure. This factor is the ratio of the probabilities (likelihoods) of the sample path in the original and modified systems, respectively; its computation is straightforward and can be done recursively at simulation event times. Appropriate and careful choice of the new underlying probability dynamics of the simulated system can yield an appreciable reduction in the variance of the resulting estimate, which implies appreciable reduction in the simulation time needed to achieve a specified precision. Also, the new probability dynamics should be easy to implement.

² A stochastic process is “stiff” when it contains 2 essentially different types of transitions, slow and rapid [66, ch. 8]. Highly dependable systems consisting of highly dependable components fit this description because, typically, the component lifetimes are very long, whereas repairs take only a short time to complete.

For a fixed run-length, ordinary simulation produces esti-

mates with RE (a constant times the coefficient of variation of

the estimate) that tends to infinity as the probability of the rare

event tends to zero. An “effective” heuristic for IS is one that,

for a fixed run-length, produces estimates with a RE that re-

mains bounded as the probability of the rare event tends to zero.

However, BRE is an asymptotic property, and in practice, even

if an IS heuristic possesses this property, the amount of simu-

lation effort required to achieve a given precision can still be

large. Also, the BRE property might not ensure a variance re-

duction relative to ordinary simulation for many types of highly

dependable systems (e.g., systems with an appreciable level of

redundancies) whose parameters fall in the practical range; it

only guarantees that as the event of interest becomes rarer, the

s-expected amount of simulation effort remains bounded by a

constant (in contrast to ordinary simulation where this effort

tends to infinity), but the bound can be large.
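The growth of the simulation effort under ordinary simulation can be made concrete with a small calculation. The sketch below assumes that each replication yields a 0/1 observation of the rare event and that the RE is the 99% normal-approximation half-width (constant 2.576) divided by the event probability; the function name `replications_needed` and its defaults are illustrative assumptions, not taken from the paper.

```python
import math

def replications_needed(p, target_re, z=2.576):
    """Smallest n with z * sqrt((1 - p) / (n * p)) <= target_re.

    p         -- probability of the rare event (0 < p < 1)
    target_re -- desired relative half-width of the confidence interval
    z         -- normal quantile; 2.576 corresponds to a 99% interval (assumed)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.ceil(z * z * (1.0 - p) / (target_re ** 2 * p))

if __name__ == "__main__":
    # The required run length grows roughly like 1/p as the event gets rarer.
    for p in (1e-2, 1e-4, 1e-6):
        print(p, replications_needed(p, target_re=0.1))
```

For a fixed target RE, the required number of replications is roughly proportional to 1/p, which is why ordinary simulation becomes impractical as the failure event becomes rarer.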

C. This Work

This paper reviews some of the recent IS techniques devel-

oped for the efficient estimation of transient and steady-state

dependability measures in Markov and non-Markov models of

highly dependable systems.³ Parts of [53] also review some IS

techniques for the simulation of dependability measures, with

emphasis on the underlying mathematical ideas needed to es-

tablish their theoretical properties; thus, it is more suitable for

researchers. This paper presents a comprehensive and less mathematical treatment of the subject; therefore, it is more suited for

reliability practitioners, and requires only a basic understanding

of probability and statistics.

There are two main ways in which a system can be made

highly dependable in a cost-effective manner.

1) Use components that are “highly reliable” and have “low”

built-in redundancies in the system. Examples of these

are computer systems where the main components (e.g.,

processors) fail rarely.

³ Preliminary versions of some parts of this review have appeared in [80] and [100].


2) Build “significant” redundancies in the system and use

components that are just “reliable” instead of “highly re-

liable.” (The distinction is clearer when some examples

are examined later in the paper.)

There might also be a third way: Use “unreliable” components but have “very high” built-in redundancies in the system. Examples are more difficult to find in practice.

Much of the recent research work on effective simulation of

highly dependable systems has been done for systems that fall

in categories 1 and 2, and this paper mainly covers those.

The focus in this paper is on “dynamic” systems (systems that change over time), in contrast to “static” systems. An example of a static system is a 2-terminal reliability network with s-independent components and no repairs (strictly speaking, there can be repairs as long as they do not create s-dependencies among components). See [69], [70], [95] for fast simulation methods for such systems.

Section II formally describes the wide class of systems for

which these IS techniques are designed, and reviews the basic

idea of IS.

Section III discusses IS techniques for estimating depend-

ability measures in Markov models. “Markov” implies that all

failure, repair, and other underlying distributions in the system

are exponential, so that it can be modeled by a CTMC. Some

work is reviewed on the estimation of derivatives with respect

to model parameters (e.g., component failure rates) for various

steady-state and transient measures in these models. This work

is of much interest, because it can be used to identify system-

components that might need improvement and to optimize sys-

tems.

Section IV considers the estimation of dependability mea-

sures for models in which the failure and repair times are

not exponentially distributed. Because these types of system

can no longer be directly modeled as a MC, they are called

“non-Markov models.” A mathematical framework for studying

such systems is the GSMP; see [37] for a formal development

of GSMP. The general theory of IS for discrete-event systems

(without discussing the particular changes of measures for

specific models) is in [37], [41]. For the IS heuristics discussed

in this paper, some empirical studies have been presented in the

literature, and many of these methods are provably effective.

In both Markov and non-Markov models, the concern is esti-

mation of

• transient measures, such as system unreliability, s-distribution and s-expectation of interval unavailability,

• steady-state measures, such as steady-state unavailability

and MTBF.

Although MTTF is in fact a transient measure, for regenerative

models it can be represented as a ratio of 2 s-expectations of

regenerative-cycle-based quantities that can be estimated using

the regenerative method of simulation. Thus MTTF is included

in discussions of steady-state measures.

Section V discusses ongoing work and directions for future

research.

D. Related Work and Software

IS can be applied, not only for estimating dependability mea-

sures of reliability systems, but for estimating buffer-overflow

probabilities in queuing systems and networks [18], [28], [91],

[96], [107]. Applications to communication systems are of par-

ticular interest [4], [15], [113], [67]. The IS techniques used in

this setting are often based on the theory of large deviations. A

survey on existing techniques is in [53].

An approach, other than IS, based on “fault-injection” is

used in [75] to speed up steady-state simulations involving

rare (failure) events in communication systems. The method

assumes knowledge of the frequency of the “rare failure event”

and exploits the fact that, except for relatively short periods

after failures, the system is operating normally in a failure-free

environment. Fault-injection is used to obtain an accurate

estimate of the performance measure of interest during periods

affected by the failure. This estimate is appropriately combined

with an accurate estimate under failure-free environment (with

no rare events) to yield an overall steady-state estimate of the

dependability measure.

Another method to simulate rare sample paths is to use the

technique of “splitting” sample paths. Splitting for rare-event

simulation was originally discussed in [62] in the context of es-

timating rare particle transmission probabilities in physics [51].

Since then, it continues to be an active area of research in that

field [24]. Variations of this technique for steady-state rare-event

estimation in stochastic service systems seem to have been first

done in [6], [7], and later in [57] (see [14] for a related idea);

a variation for transient rare-event estimation in stochastic ser-

vice systems is in [65]. It was revisited in [110], [111], [112]

for estimating probabilities of rare events in computer and com-

munication systems; the version of the technique used in these

papers was called “RESTART.” Some of the most recent ver-

sions/implementations of the technique are in [29], [35], [43],

[52].

The basic idea behind the splitting technique is explained here. The goal typically is to estimate some performance measure that is “associated with” visiting some set of states of the state space of the stochastic process, and the set is visited only rarely. For example, compute the probability of a buffer overflow, where the set corresponds to states in which the buffer content has reached its capacity. In ordinary simulation, the stochastic process being simulated spends a lot of time in regions of the state space that are “far away” from the interesting rare set (regions from where the chance of entering the rare set is extremely low). In one version of splitting, a region of the state space that is “closer” to the rare set is defined. Each time the process enters this region from the “far away” region, many identical copies of the process are generated. Each of the split copies is simulated until the process exits back into the “far away” region. From there on, only one of the split copies is continued until another entrance into the “closer” region. This gives more instances of the stochastic process spending time in the “closer” region, where the rare event is more likely to occur. The idea can be extended: instead of just 2 regions, use multiple regions of slowly increasing degrees of rarity. Reference [35] describes a


unifying class of models and implementation conditions under

which this type of multi-level splitting is provably effective for

steady-state rare-event simulation. Related work is in [33], [34].

The method of splitting has also been used and analyzed in con-

texts other than rare-event simulation, e.g., [73].
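The multi-level idea described above can be illustrated on a toy problem. The sketch below is not RESTART or any of the cited implementations; it is a simple fixed-splitting variant, chosen here as an assumption for illustration, applied to a random walk that steps up with probability `p_up` and down otherwise. The rare event is reaching level `top` before returning to 0, the levels 2, ..., `top` - 1 act as the regions of increasing rarity, and all function and parameter names are hypothetical.

```python
import random

def walk_until(start, target, p_up, rng):
    """Run the walk from `start`; return True if it reaches `target` before 0."""
    x = start
    while 0 < x < target:
        x += 1 if rng.random() < p_up else -1
    return x == target

def splitting_estimate(top, p_up, n_roots, n_split, rng):
    """Fixed-splitting estimate of P(walk started at 1 reaches `top` before 0).

    Every copy that first reaches an intermediate level 2..top-1 is replaced by
    `n_split` independent continuations; the final count is divided by the
    total multiplication factor n_split ** (top - 2) to keep the estimate unbiased.
    """
    hits = 0
    for _ in range(n_roots):
        alive = 1  # copies currently positioned at `level`
        for level in range(1, top):
            survivors = sum(
                walk_until(level, level + 1, p_up, rng) for _ in range(alive)
            )
            # split only at intermediate levels, not at the rare set itself
            alive = survivors * n_split if level + 1 < top else survivors
        hits += alive
    return hits / (n_roots * n_split ** (top - 2))

if __name__ == "__main__":
    r = 0.7 / 0.3                      # ratio q/p for the gambler's-ruin formula
    exact = (1 - r) / (1 - r ** 8)     # exact hitting probability from level 1
    print(exact, splitting_estimate(8, 0.3, 20000, 2, random.Random(3)))
```

Because the walk is Markov, the copies created at a level are valid independent continuations from that level, which is what makes the division by the splitting factor give an unbiased estimate.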

There are a few software-based modeling tools which use

rare-event simulation techniques for dependability evaluation.

SAVE [45] is a software package that consists of a high-level

modeling language that can be used to specify the model of

interest. From this specification and Markov assumptions on

the lifetime and repair-time distributions, the detailed Markov

chain is derived. It is then solved for dependability measures

using either numerical (nonsimulation) or simulation methods.

A recent version of SAVE [8] incorporates the IS technique,

BFB (as described in Section III-A) at the MC level to estimate

dependability measures efficiently. Another software package

where IS is used is ULTRASAN [20]. In ULTRASAN, the

high-level modeling construct of stochastic activity networks

is used to specify the model of interest. Again, from this

specification, the detailed stochastic process is derived and

solved for performance/dependability measures of interest, using

either numerical (nonsimulation) MC methods or simulation

methods. In recent versions of ULTRASAN [89], [90] an

“IS governor” has been incorporated. Here, instead of the

IS heuristic being built-in as in SAVE, one can choose and

specify the IS change of measure at the stochastic activity

network level. The RESTART version of the splitting method

has also been implemented in ASTRO [112].

II. BACKGROUND

Notation

number of types of components

number of components of type ,

number of operational components of type

time

vector

stochastic process

state of the system at time

stochastic process

state space of

subset of failure states in

time to first system-failure

probability under measure

-expectation under measure

variance under measure

system unreliability at time

indicator function of event

convergence in distribution

-normal distribution with mean , variance

relative error of an estimator

a sample path

in the set of all

pdf of under measure

likelihood ratio on.

at

RE

of a stochastic process

A. Highly Dependable Systems

This section discusses the broad class of highly dependable

systems that can be described by SAVE [45] (basically, a gen-

eralized Machine Repairman Model). These models consist of

multiple types of components, where each component can be in

1 of 4 states:

• operational,

• failed,

• spare,

• dormant.

The first 3 of these states are self-explanatory. An operational

component becomes dormant if its operation depends upon the

operation of some other component and that other component

fails. For example, a processor might not be operational unless

its power supply is also operational; therefore, if the power

supply fails, then the processor is dormant. In SAVE, different

(exponential) failure rates can be specified for the operational,

spare, and dormant states. The SAVE modeling language is

also used to describe operational/repair dependencies among

components (e.g., the operation/repair of a component depends

on some other components being operational), as well as

failure propagation (e.g., the failure of a component causes

some other components to fail with given probabilities). The

system is operational if certain combinations of components are

operational. Unlike SAVE, in non-Markov models (Section IV)

general failure and repair distributions are allowed. Also, there

is a set of RP who repair failed components according to some

reasonably arbitrary service (priority or nonpriority) discipline.

To simplify the presentation, systems are considered in which each component is either operational or failed. (Unless otherwise specified, the results also apply to the more general models in the SAVE modeling language.) Section II-B briefly reviews

the basic idea of IS and shows how (when applied appropri-

ately) it could appreciably speed-up simulations involving rare

events. For illustration, also consider estimating the system un-

reliability; however, the same concepts also apply to other de-

pendability measures.

B. Importance Sampling

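Consider a system with multiple component-types, where $n_i$ is the number of components of type $i$ and $N_i(t)$ is the number of operational components of type $i$ at time $t$. Each component is subject to failure and repair.

All components are operational at time 0: $N_i(0) = n_i$, for all $i$.

All components are “new” at time 0.

In general, the system state $X(t)$ contains the information $N(t) = (N_1(t), N_2(t), \ldots)$, but other information might be needed, e.g., the queuing of failed components waiting to be repaired and the remaining lifetimes and repair times of components when using distributions other than exponential.

There is some subset $F$ of the state space $S$ such that the system is failed at time $t$ if $X(t) \in F$. With $\tau_F$ the time to first system-failure, the system unreliability is

$$U(t) = E_f[I(\tau_F \le t)] = P_f\{\tau_F \le t\}, \qquad (1)$$

where $t$ is the TH.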


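The subscript $f$ denotes the original probability measure: the underlying original probability distributions governing the dynamics of the system.

In a highly reliable system, for a sufficiently small $t$, the event $\{\tau_F \le t\}$ is rare.

In ordinary (naive) simulation, generate $n$ i.i.d. replications of $X$ from time 0 to time $t$ to obtain samples of $I(\tau_F \le t)$, say, $I_1, I_2, \ldots, I_n$. Then

$$\hat{U}_n(t) = \frac{1}{n} \sum_{j=1}^{n} I_j$$

is an s-unbiased estimator of $U(t)$. The variance of this estimator is

$$\sigma_f^2(\hat{U}_n(t)) = \frac{U(t)\,[1 - U(t)]}{n}.$$

From the CLT,

$$\frac{\hat{U}_n(t) - U(t)}{\sigma_f(\hat{U}_n(t))} \Rightarrow N(0, 1) \quad \text{as } n \to \infty.$$

The RE of $\hat{U}_n(t)$ is

$$\mathrm{RE}(\hat{U}_n(t)) = 2.576\,\frac{\sigma_f(\hat{U}_n(t))}{U(t)} = 2.576 \sqrt{\frac{1 - U(t)}{n\,U(t)}},$$

which is the relative half width of the 99% s-confidence interval derived from the CLT approximation. For a fixed $n$, the RE $\to \infty$ as $U(t) \to 0$. This is the main problem when using ordinary simulation to evaluate highly dependable systems. The goal of IS is to overcome this inherent difficulty.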
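The ordinary estimator and its RE can be sketched for a very small model. The example below is an illustration under stated assumptions, not a model from the paper: a system of 2 identical components in parallel, no repairs, exponential lifetimes with rate `lam`, and system failure when both components have failed. The function name `naive_unreliability` is hypothetical, and the RE uses the sample estimate in place of the unknown unreliability.

```python
import math
import random

def naive_unreliability(lam, t, n, rng, z=2.576):
    """Ordinary simulation of P(system fails by time t) with n replications.

    Returns (estimate, relative_error); the relative error is infinite when
    no system failure was observed in any replication.
    """
    hits = 0
    for _ in range(n):
        # the parallel system fails by t only if both lifetimes end by t
        if max(rng.expovariate(lam), rng.expovariate(lam)) <= t:
            hits += 1
    estimate = hits / n
    if hits == 0:
        return estimate, math.inf
    return estimate, z * math.sqrt((1.0 - estimate) / (n * estimate))

if __name__ == "__main__":
    print(naive_unreliability(1.0, 1.0, 20000, random.Random(1)))   # failures are common
    print(naive_unreliability(1e-4, 1.0, 1000, random.Random(2)))   # failures are rare
```

When failures are common the estimate is accurate with a modest run length; when the failure probability is of order 1e-8, a run of this length almost surely observes no failure at all, so the estimate carries no information.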

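Notation

$f'$ another probability measure

$\omega$ a sample path (of a replication) in the set $\Omega$ of all possible sample paths of $X$ taking the system from time 0 to time $t$

$f(\omega)$ pdf of $\omega$ according to $f$

Writing $I(\omega)$ for the value of $I(\tau_F \le t)$ on the sample path $\omega$, and $L(\omega) = f(\omega)/f'(\omega)$ for the likelihood ratio,

$$U(t) = E_f[I] = \int_\Omega I(\omega) f(\omega)\,d\omega = \int_\Omega I(\omega) \frac{f(\omega)}{f'(\omega)} f'(\omega)\,d\omega = E_{f'}[I \cdot L]. \qquad (2)$$

The only condition imposed on $f'$ is: $f'(\omega) > 0$ whenever $I(\omega) f(\omega) > 0$.

Thus the system can be simulated using $f'$ to obtain i.i.d. samples of $I \cdot L$: $I_1 L_1, I_2 L_2, \ldots, I_n L_n$.

An s-unbiased estimate of $U(t)$ is

$$\hat{U}'_n(t) = \frac{1}{n} \sum_{j=1}^{n} I_j L_j.$$

The variance of $\hat{U}'_n(t)$ is

$$\sigma_{f'}^2(\hat{U}'_n(t)) = \frac{E_{f'}[I \cdot L^2] - U^2(t)}{n}.$$

One measure of effectiveness of any new simulation algorithm is the VRR: ratio of the variance using ordinary simulation to that using the new simulation algorithm; in this case:

$$\mathrm{VRR} = \frac{\sigma_f^2(\hat{U}_n(t))}{\sigma_{f'}^2(\hat{U}'_n(t))}.$$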

The VRR gives the ratio of the number of samples using ordi-

nary simulation to that using the new algorithm so as to achieve

the same RE. However this measure of effectiveness does not

consider the effort (e.g., CPU time) required to simulate each

sample under the two methods. Hence a more fair measure of effectiveness is the TRR: ratio of (the product of the variance and

the effort per sample using ordinary simulation) to (that using

the new simulation algorithm), [42]. The TRR gives the ratio of

the total effort using ordinary simulation to that using the new

algorithm so as to achieve the same RE.
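The two effectiveness measures are simple ratios, as the following sketch shows. The function names and the idea of measuring effort as CPU seconds per sample are assumptions made for illustration.

```python
def vrr(var_ordinary, var_new):
    """Variance reduction ratio: variance under ordinary simulation
    divided by variance under the new algorithm."""
    return var_ordinary / var_new

def trr(var_ordinary, cost_ordinary, var_new, cost_new):
    """Total effort reduction ratio: (variance x effort per sample) under
    ordinary simulation divided by the same product under the new algorithm."""
    return (var_ordinary * cost_ordinary) / (var_new * cost_new)

if __name__ == "__main__":
    # A new algorithm with 1/4 of the variance but twice the cost per sample
    # needs 4 times fewer samples, yet only half the total effort.
    print(vrr(4.0, 1.0), trr(4.0, 1.0, 1.0, 2.0))
```

The example in the `__main__` block shows why the TRR is the fairer measure: a variance reduction can be partly offset by a higher cost per sample.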

The main challenge in IS is to find a robust new probability measure $f'$ that can be implemented in a computationally efficient manner such that the variance of the IS estimator is much smaller than that of the ordinary estimator. With $I$ the indicator of the rare event and $L(\omega) = f(\omega)/f'(\omega)$ the likelihood ratio, this requires:

$$E_{f'}[I \cdot L^2] = E_f[I \cdot L] \ll E_f[I] = U(t). \qquad (3)$$

Appreciable variance reduction from (3) is obtained if

$$L(\omega) \ll 1 \text{ whenever } I(\omega) = 1. \qquad (4)$$

Choosing $f'$ such that (4) is satisfied is usually very difficult because it involves each sample path. But the general intuition one obtains is that $f'$ should be chosen to appreciably increase the probability of the rare event $\{\tau_F \le t\}$. At the same time one has to be very careful; choosing an arbitrary (but not suitable) $f'$ that increases the probability of the rare event can lead to a substantial increase in variance.

For highly dependable systems, try to come up with IS techniques that are “effective” (see Section I-B): techniques whose RE remains bounded (implying that VRR $\to \infty$) as the probability of the rare event tends to zero. This property has been established at least empirically (and, in many cases, also theoretically) for most of the IS techniques in this paper. However, as mentioned before, this does not always guarantee efficient simulation of systems with high redundancies.
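The IS estimator can be illustrated on the simplest possible case. The sketch below assumes a single component with an exponential lifetime of rate `lam` and estimates the probability that it fails by time `t`; the lifetime is sampled with an accelerated rate `lam_is`, and each observed failure is weighted by the likelihood ratio of the two exponential pdfs. The model, the choice of `lam_is`, and the function name are illustrative assumptions, not a heuristic proposed in the paper.

```python
import math
import random

def is_unreliability(lam, lam_is, t, n, rng):
    """IS estimate of P(lifetime <= t) for an Exp(lam) lifetime,
    sampling lifetimes from Exp(lam_is) instead."""
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam_is)  # lifetime drawn under the IS measure
        if x <= t:
            # likelihood ratio of original pdf to IS pdf at x:
            # lam*exp(-lam*x) / (lam_is*exp(-lam_is*x))
            total += (lam / lam_is) * math.exp(-(lam - lam_is) * x)
    return total / n

if __name__ == "__main__":
    lam, t = 1e-6, 1.0
    exact = -math.expm1(-lam * t)  # 1 - exp(-lam * t), computed accurately
    print(exact, is_unreliability(lam, 1.0, t, 20000, random.Random(7)))
```

Under the accelerated rate most replications observe a failure, and every likelihood ratio on the failure event is of order `lam`, which is the situation described by condition (4).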

III. FAST SIMULATION OF MARKOV MODELS

Notation

collection of all (measurable) subsets of

DTMC embedded on

generic states from the state space

transition probability matrix of the DTMC

(when is a CTMC)

,


system state in which all components are operational

time to first return ofto state

sample path between two successive entries to a

subset of states

system failure time in a regenerative-cycle, or

-cycle

steady-state unavailability of the system

estimator of

original and IS probability measures of

IS probability measure under BFB

variance of a ratio estimator; probability measures

andare used to estimate the numerator and

denominator, respectively

failure biasing parameter

transition probability matrix of the DTMC

BFB

failure, repair rates of component type

parameter in the failure rate of component type

failure rarity parameter

exact asymptotic order of magnitude

“distance” of statefrom the failure set

“criticality” of the transition

set of component types having failure rates of the

th largest order of magnitude

stack of likelihood ratios associated with failure

events of components in

likelihood ratio on top of

2-dimensional vector, where

the number of operational (respectively, currently

under repair) components of type ,

set of states in which components are failed

total system failure time in

-expected interval unavailability

total transition rate out of state

probability measure

IS pdf used to sample a random holding time when

in state

total time in statein a regenerative cycle

total time in states other than , from the beginning

of a regenerative cycle until either the system fails

or the end of the cycle

{system fails during a regenerative cycle}

upper bound for

lower bound for

generic parameter (e.g., a component failure rate)

partial derivative operator with respect to

hitting time of state

hitting time of set

partial derivative of the likelihood ratio with respect

to .

Most of the approaches in the following sections are appro-

priate for highly dependable Markov systems consisting of

highly reliable components (i.e., component failure rates are

much smaller than the repair rates) that satisfy:

Assumption A: Each state, other than the state in which all

components are up, has at least one repair transition possible.

Assumption A is satisfied by systems of the type in [44], [45].

-cycle

, on

under

,

(respectively,) is

under the original

For systems with repair-unit sharing,⁴ let $X(t) = N(t)$; they are defined in Section II-B. For systems with more general repair disciplines, add a list of components either waiting-for or undergoing repair at each RP. $X$ is a CTMC when all failure and repair times are exponentially distributed, and the methodologies in this section are independent of the definition of the state.

Unless stated otherwise, let the system start in the state in which all components are operational. One can simulate a CTMC by generating the next state visited using the transition probability matrix of the embedded DTMC, and then generating the exponentially-distributed holding-time in that state with the appropriate rate. When estimating steady-state measures, instead of sampling the holding times in a state, use the s-expected holding time in that state [25], [26], [56].

CTMC are regenerative processes, where entrance to any fixed state constitutes a system regeneration. Let the regeneration epochs be the entrances to the state in which all components are operational. As in Section II-B, $\tau_F$ is the time to first system-failure.

A. Steady-State Measures

For estimating steady-state measures, the regenerative method of simulation is often used, and it is usually sufficient to simulate the embedded process at transition times, as described in Section II. Many steady-state measures can be expressed by a ratio of regenerative-cycle-based quantities [21], e.g., with $Z$ the system failure time in a regenerative cycle and $T$ the length of the cycle, the steady-state unavailability is

$$U = \frac{E_f[Z]}{E_f[T]}. \qquad (5)$$

The ordinary way of estimating unavailability is to run some regenerative cycles and collect samples of $Z$ and $T$. Then one can estimate $E_f[Z]$ and $E_f[T]$ by their respective sample means. However most samples of $Z$ are zero, thus one often uses IS to try to obtain more precise estimates of $E_f[Z]$. Then (as in Section II-B): $E_f[Z] = E_{f'}[Z \cdot L]$. The problem is to find a $f'$ so that the variance of $Z \cdot L$ under $f'$ is much smaller than the variance of $Z$ under $f$, which implies that simulation with $f'$ is much more efficient.

1) Failure Biasing: As mentioned in Section I, the implementation of IS involves failure biasing [71], in which the basic idea is to take the system along typical sample paths to failure, more frequently. All states of the MC, other than the state in which all components are operational, have both failure and repair transitions.

• A failure-transition is a transition from one state to another, corresponding to the failure of at least one component.

• A repair-transition is a transition from one state to another corresponding to the repair of at least one component.

We do not allow a single transition to correspond to some components failing and other components being repaired. Typically, the total probability of repair transitions is close to 1, and the total probability of failure transitions is close to 0. In failure biasing, the total probability of failure transitions is increased to some value $p$, the failure-biasing parameter; thus the total probability of repair transitions is decreased to $1 - p$. Empirical studies suggest that $p$ should not be chosen too close to 1, because such a choice can lead to a variance increase or even infinite variance. Thus failure biasing enables the system to go along paths to system failure more often.

⁴ The repair discipline in which the RP works on all failed components simultaneously, with the effort devoted to each component proportional to the repair rate of that component.

However, just making the rare event occur more often might

not always work. How the rare event happens (the sequence of

events that lead to the rare event) plays a crucial role. Under

the original probability measure, some sample paths to system

failure are more likely than others. For IS to be effective,

• All the most likely (in terms of order of magnitude of their

probabilities under the original measure

should be made more probable under the new measure

• Secondary sample paths (those paths with probability

under

that are at least an order of magnitude smaller

than the probability of the most likely ones) also need to

be made more probable under

most likely paths.

If an IS distribution does not assign enough probability to a

likely path to system failure, then the resulting variance can be

worse than that of ordinary simulation. (In mathematical terms,

this means that

will be large, because, for a sample

path

for which

to

, the

the original version of failure biasing, called “simple failure-bi-

asing” here, the relative probabilities under the new measure of

individual failure (repair) transitions with respect to each other

remain unchanged. In systems where the failure-transition probabilities are of different orders of magnitude (e.g., unbalanced

systems), this can deprive a path of a high enough probability

under IS, thus causing inefficient estimation.

2) Balanced Failure-Biasing: BFB [47], [98] overcomes the

problem in Section III-A-1 by making all failure transitions

occur with equal probabilities (this is also done in state ). This

ensures that all paths get sufficient probability, though it also

wastes some probability by giving certain paths more weight

than necessary. This can degrade a simulation’s performance

when there are large redundancies in the system. IS schemes that try to minimize this waste include “failure-distance biasing” and

“BLR methods.”
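The difference between simple failure biasing and BFB can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `rho` stands for the failure-biasing parameter, the dictionaries of transition probabilities out of one state are hypothetical inputs, and a state with no repair transitions is handled by giving all probability to failures.

```python
def simple_failure_biasing(fail_probs, repair_probs, rho):
    """Raise total failure probability to rho, keeping relative proportions.

    fail_probs / repair_probs: dicts mapping a transition label to its
    original embedded-DTMC probability out of the current state
    (hypothetical inputs). rho is the assumed failure-biasing parameter.
    """
    rho_eff = rho if repair_probs else 1.0  # no repairs: failures get everything
    f_tot = sum(fail_probs.values())
    new = {t: rho_eff * p / f_tot for t, p in fail_probs.items()}
    r_tot = sum(repair_probs.values())
    new.update({t: (1.0 - rho_eff) * p / r_tot for t, p in repair_probs.items()})
    return new


def balanced_failure_biasing(fail_probs, repair_probs, rho):
    """As above, but every failure transition gets the same probability."""
    rho_eff = rho if repair_probs else 1.0
    new = {t: rho_eff / len(fail_probs) for t in fail_probs}
    r_tot = sum(repair_probs.values())
    new.update({t: (1.0 - rho_eff) * p / r_tot for t, p in repair_probs.items()})
    return new


def likelihood_ratio_factor(orig_probs, new_probs, transition):
    """One-step contribution to the likelihood ratio for the chosen transition."""
    return orig_probs[transition] / new_probs[transition]


if __name__ == "__main__":
    # Unbalanced state: the two failure probabilities differ by three orders.
    fail = {"cpu": 1e-3, "disk": 1e-6}
    repair = {"fix": 1.0 - 1.001e-3}
    print(simple_failure_biasing(fail, repair, 0.5))
    print(balanced_failure_biasing(fail, repair, 0.5))
```

Under the simple scheme the rarer failure keeps a probability roughly 1000 times smaller than the other one, whereas the balanced scheme gives both the same share of `rho`.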

is used on the sample paths of

in which is used until system failure, and

that.

3) MSDIS: In (5), one can use different probability mea-

sures (and thus different regenerative cycles) to estimate

and . This approach is called MSDIS [46], [47]. When

implementing MSDIS, we typically use IS to estimate

and use ordinary simulation to estimate

provides accurate estimates of

one can run

regenerative cycles using

tuples

can run

regenerative cycles using

of. Then

. (Setting too

, can sometimes lead to a variance

) sample paths

.

but not as much as the

is large relative

is large [81].) In

is used after

,

, because it

without using IS. Hence,

to get the sample

of

to get the samples

is estimated

, and

(6)

The asymptotic variance of this estimator (large and ) is [47]

(7)

which when estimated (by replacing ,

and in (7) by their respective simulation estimates) can

be used to construct 99%; -confidence intervals.
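The measure-specific idea can be sketched as follows: the numerator and denominator of a ratio are estimated from two independent sets of cycles, and an approximate interval is built from a first-order (delta-method) variance. This is a generic sketch, not necessarily the exact form of (7); the function name, the input lists, and the normal quantile `z` are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def msdis_ratio(num_samples, den_samples, z=1.645):
    """Ratio estimate from two independent samples, with an approximate interval.

    num_samples: per-cycle observations for the numerator, already multiplied
                 by their likelihood ratios (collected under the IS measure).
    den_samples: per-cycle observations for the denominator (collected under
                 the original measure).
    z:           normal quantile; 1.645 gives roughly a 90% two-sided interval.
    """
    n1, n2 = len(num_samples), len(den_samples)
    a, b = mean(num_samples), mean(den_samples)
    ratio = a / b
    # Delta method for a ratio of two independent sample means:
    # Var(a/b) ~ (Var(a) + ratio^2 * Var(b)) / b^2
    var = (variance(num_samples) / n1 + ratio ** 2 * variance(den_samples) / n2) / b ** 2
    half = z * math.sqrt(var)
    return ratio, (ratio - half, ratio + half)


if __name__ == "__main__":
    numerator = [0.0, 2e-6, 0.0, 2e-6]   # hypothetical IS-weighted observations
    denominator = [0.9, 1.1, 1.0, 1.0]   # hypothetical cycle lengths
    print(msdis_ratio(numerator, denominator))
```

Because the two samples are independent, no covariance term appears in the variance expression.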

Another quantity of interest is the MTTF defined by

For regenerative systems, the MTTF can be expressed as a ratio

of regenerative-cycle-based quantities [47], [64], [103], [108]:

,

.

(8)

A sample of

obtained from 1 regenerative cycle. Hence, again use MSDIS to

estimate

by separately estimating each term of the ratio

[47], [103]. In this case, the rare-event problem occurs in esti-

mating the denominator of the ratio. Hence, use

the denominator and

to estimate the numerator.

To estimate

and, one can use other heuristic IS

measures instead of BFB.

4) Mathematical Analysis of Failure Biasing: Mathematical

analysis of failure biasing techniques began in [97], [98]. This

analysis is used to study the increase in simulation efficiency

obtained by using these techniques, or for proving BRE proper-

ties of these techniques. In [97], [98], the failure rate of compo-

nent-type is assumed to be of the form

a small parameter (rarity parameter) and

constants. This enables modeling a situation in which compo-

nents have small failure rates (components are highly reliable).

Prior to [97], [98], there was other work [32] that studied the

asymptotic behavior (nonsimulation aspects) of systems with

highly reliable components. However, this earlier work assumed

that

, for all , which does not allow the modeling of

systems in which component failure rates are of different or-

ders of magnitude. The use of the exponents

modeling. This paper assumes that the repair rates are constants

and the failure-propagation probabilities (probabilities used to

determine if the failure of certain components cause others to

fail simultaneously) are either constants or are of the same gen-

eral form as the failure rates: a constant multiplied by

to some power. The simulation analysis in [3], [77]–[79], [81],

[97]–[99], [115], [102], [105] deals with the asymptotic be-

havior of the simulation efficiency for small . The simulation

technique for highly dependable systems is formally said to have

BRE if the RE remains bounded as

References [97], [98] show that BFB has the BRE property

when estimating

and

BRE property of the MSDIS approach (using BFB) to estimate

the steady-state unavailability and the MTTF. It was shown that, for fixed numbers , of regenerative cycles, the RE in the estimation of

using standard regenerative simulation is

for some constant

; whereas the RE using the MSDIS

scheme is

. [A function

constants

,such that

ficiently small .]

[or a sample of] can be

to estimate

, where is

are positive and

facilitates this

raised

.

. This leads to the

,

is, , if there exist

for all suf-


References [97], [98] also show that simple failure biasing

has the BRE property for the special class of balanced-systems

(systems in which the failure transition probabilities are of the

same order of magnitude; e.g., when

failure propagation probabilities are -independent of ). Using

a counter-example, it was shown that the BRE property might

not hold when simple failure-biasing is used

systems. More general conditions (on the system) under which

any failure biasing method (or any more general IS scheme)

does or does not give BRE are in [79], [81]. Although it seems

difficult to check these conditions except in very simple cases,

they provide insight into how IS should be implemented. Some

additional results are in [109].

5) Failure-Distance Biasing: Failure-distance biasing [12]

attempts to refine failure-biasing schemes to make the system

go mainly along the most likely paths to system failure (for balanced systems with no failure propagation, the most likely paths

are those with the least number of transitions): there is no im-

portant waste of probabilities on paths that are not most likely.

As in failure biasing, the total failure transition probability is increased to . However, now, the way in which

the individual failure transitions

on the “distance” from state

to some failure state. To do this,

compute, for each state

, the

failing components whose failure in

to a state in which the system is failed. The failure distance

for a state

is 0. The “criticality” of a failure transition

is defined as

biasing is implemented by partitioning the set of failure transi-

tions from the current state

based on the criticalities of the individual transitions: each set contains all failure transitions from

having a particular criticality. Each set is assigned a portion of the failure-biasing probability , with sets having larger criticalities getting larger portions of . Failure transitions within the

same set occur with their original relative probabilities (simple

failure-distance biasing) or with equal probabilities (balanced

failure-distance biasing).
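The failure distance described above (the minimum number of further component failures that would bring the system down) can be computed from a list of minimal cutsets. The sketch below makes simplifying assumptions: components are individual labels rather than types with multiplicities, the cutsets are given explicitly, and the function and variable names are hypothetical.

```python
def failure_distance(failed, minimal_cutsets):
    """Minimum number of additional failures needed to complete some cutset.

    failed:          set of labels of currently failed components.
    minimal_cutsets: iterable of sets; the system is failed once every
                     component of at least one cutset has failed.
    """
    return min(len(set(cut) - set(failed)) for cut in minimal_cutsets)


if __name__ == "__main__":
    # Hypothetical system: it fails if A and B fail, or if C, D and E fail.
    cuts = [{"A", "B"}, {"C", "D", "E"}]
    for state in (set(), {"C"}, {"C", "D"}, {"A", "B"}):
        print(sorted(state), failure_distance(state, cuts))
```

If only a subset of the minimal cutsets is supplied, the returned value is an upper bound on the true distance, which is the approximation the text refers to when the number of cutsets has to be limited.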

Exact computation of the failure distances assumes a description of the structure function of the system [5] and requires determining all the minimal cutsets corresponding to that structure function. The latter is NP-hard [94]. Hence the users need

to limit the number of minimal cutsets considered. An efficient

algorithm for computing and maintaining the data structures of

the failure distances is in [12].

It follows directly from [98] that balanced failure-distance

biasing also has the BRE property, but [81] presents an example

showing that simple failure-distance biasing might not have

this property. Experiments on examples in [12] seem to support

the intuition that failure distance based biasing schemes should

have better simulation efficiency than the usual biasing schemes

(but with an important implementation overhead). However,

the amount of efficiency improvement, if any, on a particular

system in practice depends on whether each computed distance

from a state correctly reflects its true proximity to the set

The distance defined in this subsection seems to reflect the

actual proximity only for the class of balanced systems with no

failure propagation (the structure function, by definition, does

not consider any failure propagation). It appears to be difficult

for all , and the

or unbalanced

is allocated to

dependsfrom a state

: the minimum number of

would bring the system

. Failure-distance

.

and computationally expensive to compute a distance reflecting

the actual proximity for the general case. Even for the balanced

case with no failure propagation, only an approximation to the

failure distance is computed, because the users need to limit

the number of minimal cutsets considered.

6) Balanced Likelihood Ratio Methods: References [2], [3],

[105] show, experimentally, that the methods in this section

might not work well for systems that have an important de-

gree of redundancy. The “BLR methods” [2], [3], [105] are ap-

proaches for effectively simulating such systems. They attempt

to cancel terms of the likelihood ratio within a regenerative cycle

by defining the IS probabilities for events in such a way that the

contribution to the likelihood ratio from a repair-event cancels

the contribution to the likelihood ratio from a failure-event that

occurred previously in the current cycle.

Some additional terminology is needed to describe the basic

method. Partition the set of component types

sets

, such that

nent types with failure rates of the th largest order of magni-

tude. Throughout the simulation of a cycle, one stores the event

likelihood ratios associated with component failure events from

in a stack . If , let

top of

; . The system state is denoted by

is the number of components of each type

that are operational, and

RP currently repairing components of each type. References

[2], [3], [105] consider models for which

scribes the system state, which is a subset of the class of models

described in this paper, but one can easily apply the method to

the more general setting of models in this paper.

In terms of the algorithm, the BLR method differs from the

failure biasing methods in two respects.

1) Instead of using a fixed for the failure biasing parameter,

use a

that is a function of the current state

and the . In particular,

into

contains all compo-

be the likelihood ratio on

, where

is the number of

completely de-

2) The total (new) probability allotted to repair transitions of components of type is proportional to , instead of their being proportional to (as usually done in the failure biasing methods).

By doing this, one can ensure the cancellation of likelihood ratios, and guarantee that the overall likelihood ratio on any regenerative cycle is always bounded above by 1. This implies that

thus the variance under the new measure is never greater than that under the original measure . The method is especially useful for systems with important redundancies, where the number of transitions until system failure can be large, leading to high variabilities in the likelihood ratios when using methods


like BFB. Reference [3] also shows that the resulting estimators

have the BRE property when the particular way in which indi-

vidual failure transition probabilities are assigned is similar to

that in BFB.

Failure-distance biasing tries to exploit system structure in

allocating probabilities to transitions, and [2], [105] apply sim-

ilar ideas to BLR methods. In particular, they obtain additional

efficiency gains by allocating probabilities to the individual

failure transitions so that those failure transitions corresponding

to component types that lie on minimum-cuts are more heavily

weighted. Their algorithm does not need to maintain a list of

all the minimum cuts; it needs only to maintain a list of all the

components in a minimum cut. This can be done in

where

is the number of links. As with failure-distance biasing,

one might not get any additional efficiency gains in certain

systems that are unbalanced and/or have failure propagation.

This is because in such systems the most likely paths to system

failure might not lie along minimum cuts (the definition of

minimum cut does not consider failure propagation).

References [2], [3], [105] also describe improvements that are

based on using semi-stationary cycles [106] rather than regen-

erative cycles. The simulation method is similar to the

method [88] but the motivation for its use is different. In steady-state simulations of highly dependable systems, one usually uses

the set of states with all components “up” as the regenerative

state. However, when the BLR method is applied to systems

with high degrees of redundancy, the regenerative cycles can become very long, leading to inefficient estimation. Thus, [2], [3],

[105] instead consider a set of states with no 1-step transition

probabilities within the set. An example is the set of states

with failed components, where

the system (the least number of components that have to fail for

the system to fail). The process in between two entrances to

is a semi-stationary cycle, and has properties similar to regen-

erative cycles, except that these cycles are not necessarily -in-

dependent (thus complicating the construction of -confidence

intervals). Also one needs to know the steady-state distribution

on the set of states in

at the times of entrances to this set, in

order to apply IS; in general, this is very difficult to compute.

These problems are similar to those in the

Section IV-B).

7) Other IS Methods: Another heuristic for failure biasing

in acyclic models (of nonrepairable systems) is considered in

[31], in which the extent to which one biases the failure transitions along a path leading to system failure is proportional to the path’s contribution to the measure being estimated. When applicable, this heuristic requires more overhead than simple failure

biasing or BFB. Reference [66, ch. 10] describes some efficient

simulation methods for

-out-of- :G systems; these methods

combine the IS technique known as forcing (see Section III-B)

with some analytic calculations.

8) Some Empirical Results: Example #1 is a computing

system (originally presented in [47] and then in many papers

thereafter). Consider the unbalanced version of this computing

system. Fig. 1 is the block diagram. It consists of

• 2 sets of processors with 4 processors/set,

• 2 sets of controllers with 2 controllers/set,

• 6 clusters of disks, each consisting of 4 disk units.

time,

-cycle

is less than the redundancy of

-cycle method (see

Fig. 1. Computing-system example.

In a disk cluster, data are replicated so that one disk can fail

without affecting the system. The “primary” data on a disk are

replicated so that 1/3 is on each of the other 3 disks in the same

cluster. Thus 1 disk in each cluster can be inaccessible without

losing access to the data. It is assumed that when a processor of a given type fails it has a 0.01 probability of causing the operating

processor of the other type to fail. Each unit in the system has

2 failure modes which occur with equal probability. The failure

rates (per hour) are

• 1/1000 for processors,

• 1/20000 for controllers,

• 1/60000 for disks.

The repair rates (per hour) are

• 1 for all mode 1 failures,

• 1/2 for all mode 2 failures.

This is an unbalanced system with a redundancy of 2. Compo-

nents are repaired by a single RP who chooses a component at

random from the set of failed units. The system is operational

if all data are accessible to both processor types, which means

that at least 1 processor of each type, 1 controller in each set,

and 3 out of 4 disk units in each disk cluster are operational.

Operational components continue to fail at given rates when the

system is failed.
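The operational condition stated for this example can be written as a small structure function. This is a sketch under assumptions: the argument layout (lists of counts of operational units per processor type, controller set, and disk cluster) and the function name are chosen here for illustration and do not come from the original model description.

```python
def system_operational(processors_up, controllers_up, disks_up):
    """Return True if the computing system of Example #1 is operational.

    processors_up:  operational processors per type (2 types, 4 units each).
    controllers_up: operational controllers per set (2 sets, 2 units each).
    disks_up:       operational disks per cluster (6 clusters, 4 units each).
    """
    return (
        all(n >= 1 for n in processors_up)      # at least 1 processor of each type
        and all(n >= 1 for n in controllers_up)  # at least 1 controller in each set
        and all(n >= 3 for n in disks_up)        # at least 3 of 4 disks per cluster
    )


if __name__ == "__main__":
    print(system_operational([4, 4], [2, 2], [4] * 6))           # all units up
    print(system_operational([4, 4], [2, 2], [4, 4, 2, 4, 4, 4]))  # one cluster lost 2 disks
```

The second call illustrates the stated redundancy of 2: two failures in the same disk cluster are enough to fail the system.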

To facilitate comparisons between simulation methods on the

same CPU, simulation results (in the MSDIS framework) are

quoted from the latest implementation of these methods [3].

BFB using a total of 200000 cycles (100000 cycles each for

the numerator and the denominator) and

steady-state unavailability estimate of

3.8% is the estimate of the RE corresponding to a 90% -confi-

dence interval.

The corresponding VRR was 167 with a TRR of 415.

The BFB estimate of the MTTF was

and

For the same problem, the most promising MSDIS implementation of the BLR method without the use of minimum cuts (denoted by BLBLR in [3]) gave

gave the

3.8%,the

6.5% with

.

•

•

for the steady-state unavailability

for the MTTF.


Fig. 2. Network with redundancies.

Hence, for this example, without the use of additional informa-

tion about the system, BFB does better than the BLR method.

For the balanced version of this computing system (the results

of which are in [3]), the improvements obtained by BFB and

BLBLR are similar.

The performance of the failure biasing method and BLR

methods can be improved by using some information about

component types on minimum cuts in the system. The most

recommended MSDIS version of the BLR method with min-

imum cuts (denoted by BLBLRC in [3]) gave

•

for the steady-state unavailability

•

for the MTTF.

There is appreciable improvement over BFB for the MTTF. For

the balanced case of this network, there was appreciable im-

provement over BFB for both the MTTF and the unavailability.

Consider the system (see Fig. 2) with important redundancies

that was considered in [3]. The following description is from

[3], with minor modification. The network contains 3 types of

components:

• Type A links contain 3 -identical components, which on

average fail every 13(1/3) hours and can be repaired in half

an hour. The type A link fails when 2 components are in

the failed states.

• Type B links contain 1 component, which on average fail

every 40 hours and can be repaired in 1 hour.

• Type C links contain 2 components, which on average fail

every 26(1/3) hours and can be repaired in 2/3 hour. One

component failure on a type C link causes the link to fail.

The system operates as long as there exists a path along oper-

ating links between node 1 and node 20. There are 5 RP, and

repairs make components “good as new.” Upon completing a

repair, a RP selects (uniformly over the failed components in

the network) the next component to repair.

The results are

•

,

•

,

•

,

ordinary simulation) in the estimation of the unavailability

(estimates of which were of the order of 10

The BLBLRC methods when applied to another version of

the same network yield orders of magnitude improvement over

BFB when estimating unavailabilities that are of the order of

10

. No comparisons were made with ordinary simulation for

, for the BLBLR

, for the BLBLRC

, for BFB (worse than

).

these cases because there were no system failure events with this

method, even in the allotted run time of 10 events. The results

suggest that for systems with appreciable redundancies, BFB is

not at all effective, and the BLR method (with and without cuts)

seems to improve the simulation efficiency.

B. Transient Measures

This section considers the estimation of transient measures

in highly dependable Markov systems. Consider the three mea-

sures:

unreliability:

-expected interval unavailability:

guaranteed availability:

Research in fast simulation for guaranteed availability is limited

to experiments; see [47].

1) Case 1: Small TH: Small TH means that the TH is small

compared to the -expected lifetimes of components. From the

analytic standpoint, it means that the TH

-independent of . The effectiveness of simulation techniques

for small

are studied again.

For transient measures, failure biasing (relative to repair)

alone might not be sufficient to observe many system failures,

because it affects only the transitions of the embedded DTMC

and not the random holding times in each state. To see why

this is the case, note that the first component failure in a

system occurs at a very low rate (the sum of failure rates of all

the components). Thus, typically, the first component-failure

occurs after time

; thus the chance that the system fails

before the mission time expires is very small. To address this

issue, “forcing” was introduced [71] to modify the random

holding times in particular states. With forcing, the time to first

component failure is sampled conditionally on the fact that it is

less than , i.e., the time to first component failure is sampled

from the distribution:

is a constant, i.e.,

for(9)

the transition rate out of state under the original measure

.
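Forcing can be sketched as inverse-transform sampling from an exponential distribution conditioned to fall within the time horizon, which is how the text describes the sampling distribution in (9). The names `rate` (total failure rate in the initial state) and `horizon` (the TH) are assumptions introduced for this illustration.

```python
import math
import random


def sample_forced_first_failure(rate, horizon, rng):
    """Draw the first failure time from Exp(rate) conditioned on being <= horizon."""
    u = rng.random()                      # uniform on [0, 1)
    p = -math.expm1(-rate * horizon)      # P(T <= horizon) = 1 - exp(-rate * horizon)
    # Solve (1 - exp(-rate * x)) / p = u for x.
    return -math.log1p(-u * p) / rate


def forcing_likelihood_ratio(rate, horizon):
    """Original density over forced density; it is constant in the sampled time."""
    return -math.expm1(-rate * horizon)


if __name__ == "__main__":
    rng = random.Random(1)
    rate, horizon = 1e-4, 10.0            # hypothetical values
    draws = [sample_forced_first_failure(rate, horizon, rng) for _ in range(5)]
    print(draws, forcing_likelihood_ratio(rate, horizon))
```

Every draw lands inside the horizon, and each sample path is weighted by the same factor, the original probability that the first failure occurs before the horizon.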

References [99], [115], [101] show that a “combination of

BFB and forcing” gives BRE in estimating the unreliability and

the -expected interval unavailability. From a modeling view-

point, this implies that for small TH, the simulation can be very

efficient. This agrees with experimental results [47], [99], [115],

[101].

Another technique for estimating transient dependability

measures is to combine failure biasing with “conditioning”

[47]. Conditioning is applied by simulating the embedded

DTMC until the system fails; failure biasing is used to gen-

erate the transitions. Random holding times are generated for

each of the states visited, except for those states having slow

transition rates (e.g., the “fully operational” state, which has

no repairs taking place). Then for each generated sample path,

one can analytically compute the conditional probability that


the system fails before time , given the path of the embedded

DTMC and the sum of the holding times in the states that

do not have slow transition rates. This computation involves

calculating the convolution of exponentially distributed r.v.,

corresponding to the visits of the “conditioned out” states. The

technique is guaranteed to reduce variance, but requires more

computation. Experimental results and comparisons with the

forcing technique, are in [47].

2) Case 2: Moderate and Large TH: Even though, for small

TH, the IS-based simulation of transient measures has the BRE

property, it becomes inefficient for moderate and large TH. A

moderate TH implies:

is of the same order (of magnitude) as

the -expected time to first component-failure. Any TH that is at

least 1 order larger is termed “large.” For moderate TH, tuning

the value of the failure biasing parameter

tation can yield efficient estimates [85], but it is difficult to pro-

vide guidelines for how

should be set in general. For large

TH, irrespective of the value of , the estimates using failure bi-

asing are always poor, because the variance of the IS estimator

increases with the variance of the likelihood ratio. The larger

is, the more transitions there are in

the likelihood ratio grows approximately exponentially with the

number of transitions [39].

In estimating unavailability and MTTF, the expressions used

in this paper are in terms of regenerative-cycle-based quanti-

ties, which are estimated using the regenerative method of sim-

ulation. Since in highly dependable systems, regenerative cy-

cles typically contain a small number of transitions, the use of

IS does not lead to a likelihood ratio with an important vari-

ance. A similar approach can be used in the context of transient

measures. Though transient measures cannot be expressed ex-

actly in terms of regenerative-cycle-based quantities, it is pos-

sible to develop bounds that are expressed in terms of regener-

ative-cycle-based quantities. Thus, when the direct application

of IS to estimate the transient measure itself is inefficient, it is

possible to estimate the bounds efficiently. For highly depend-

able systems these bounds are close to the transient measure in

the sense explained in this section, [99], [115], [101], [102].

The

is exponentially distributed with rate , which is equal

to the sum of all component failure rates in state

definition,

through experimen-

, and the variance of

. From its

. Let

•

•

if the system does not fail in a regenerative cycle,

time between the first system failure in a cycle and

the end of the cycle if the system does fail:

. Hence.

When the highly dependable system consists of highly reliable

components, then most regenerative cycles consist of a single

component failure transition followed by a component repair

transition. Because component repair times are typically much

smaller than component failure times, the regenerative cycle

time consists mainly of the first component failure time,

(implies

exponentially distributed with rate . The number of regenera-

tive cycles until system failure is geometrically distributed with

probability

ceptance probability ) of exponentially distributed r.v. (each of

which has rate ) is exponentially distributed (with rate

), which is

. The geometric sum (with ac-

).

Thus,

is approximately exponentially distributed with rate

. Let

(10)

then

TH are modeled by

to a small TH), then for

large TH),

it has been shown [101], [102] that

,(

(corresponding to moderate and

corresponds

asand

thus we have an upper bound. Similarly, let

(11)

then

, for all .

,

Also, as for

as

i.e., the lower bound is close to the unreliability for moderate

and large TH.

Both

and are in terms of regenerative-cycle-based

quantities. Hence for estimating

type procedure in which

is used to estimate the quantities

associated with rare events like

the original probability measure

like

.

There are other bounds on the unreliability (in terms of re-

generative-cycle-based quantities), like the ones in [11], [63].

These bounds are close for large , whereas the bounds in [101],

[102] are close for both moderate and large . Bounds for the

-expected interval unavailability were developed in [99], [115]

and (as for the unreliability bounds) close to the actual measure

for moderate and large .

3) Estimation of the Laplace Transform Function: An ap-

proach for estimating the actual transient measure (instead of

estimating close bounds) for large TH is outlined in [13]. In-

stead of estimating the transient measure, the “Laplace trans-

form function” of the transient measure is estimated (the tran-

sient measure is a function of ). Then a Laplace transform in-

version method is used to estimate the transient measure for

any given . The advantage of this approach is that the Laplace

transform function of the transient measure can be expressed

exactly in terms of Laplace transform functions of regenera-

tive-cycle-based quantities, which can be estimated very effi-

ciently using IS (if necessary).

and, use a MSDIS

and

to estimate other quantities

and


For example, consider the unreliability. For any function

the Laplace transform function is:

,

Let:

and

Then the Laplace transform of the unreliability is [13]:

(12)

Both

For any fixed , the

IS, and the

simulation. Then, the method is: estimate

of

[by estimating

transform inversion algorithm to obtain

similar method for estimating the interval unavailability is in

[13]. This transform approach [13] is a bit tedious to implement,

but yields good experimental results.

and are regenerative-cycle-based quantities.

can be efficiently estimated using

can be efficiently estimated using ordinary

for some values

and ], and then use a Laplace

for a given . A

C. Estimation of Derivatives

Performance measures of a system are (complicated) func-

tions of the system parameters, such as the component failure

and repair rates. Thus, one can compute derivatives of perfor-

mance measures with respect to these parameters. This section

reviews work in this area for highly dependable Markov systems. An example is determining the derivative of the MTTF with respect to a particular component’s failure rate. The derivative information is useful when designing systems, because this

knowledge can help the designer identify system parts that need

improvement.

First consider estimating derivatives of the MTTF. Recall the

ratio expression in (8) for the MTTF; then differentiate it with

respect to some system parameter

failure rate).

(e.g., some component’s

(13)

derivative operator with respect to .

Thus, estimating

4 quantities in (13). A central limit theorem for the resulting es-

timator of

is derived in [83]; -confidence intervals

for the derivative can be formed. Section III-A discussed esti-

mating

and

here is on estimating their derivatives.

requires estimating each of these

; thus the focus

One simulation-based approach for estimating derivatives is

the likelihood-ratio derivative method [38], [93], which is now

briefly described. Focus on estimating

derivative with respect to . To estimate

requires simulating the embedded DTMC,

Let

and be the hitting time to

the embedded DTMC, respectively (the numbers of transitions

until hitting

and , respectively); then

and its

, only

.

and the cycle length of

The (original) transition-probability matrix of

Then, under certain regularity conditions [38], [93],

(under)is.

The

to estimate

using the original measure

is determined within a single regenerative cycle. Thus,

, generate

, and collect observations:

regenerative cycles

of

The standard-simulation estimator of

is

Similarly, estimate

One drawback of the likelihood-ratio derivative method is

that it yields derivative estimators with large variances in many

settings. Specifically, theoretical and empirical work [36], [93] shows that the variances of derivative estimators:

• are typically much larger than those of the respective per-

formance-measure estimators,

• grow linearly in the -expected number of events in an

observation.

When regenerative simulation is used, an observation corre-

sponds to a regenerative cycle, which typically consists of very

few transitions for highly dependable Markov systems. Thus,

the likelihood-ratio method seems to be well-suited for these

types of systems.
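The likelihood-ratio derivative idea can be illustrated on a toy embedded DTMC. The sketch below is not the system model of this paper: it assumes a four-state chain (0 = all up, 3 = system failed) in which every failure transition has probability `theta` and every repair transition has probability `1 - theta`, and it estimates the derivative, with respect to `theta`, of the probability of reaching state 3 before returning to state 0. The per-transition score terms are accumulated along each cycle, and a closed-form gambler's-ruin expression is included only as a check.

```python
import random


def lr_derivative_estimate(theta, n_cycles, seed=0):
    """Estimate gamma = P(hit 3 before 0) and d(gamma)/d(theta) by simulation."""
    rng = random.Random(seed)
    sum_hit = 0.0
    sum_deriv = 0.0
    for _ in range(n_cycles):
        # The cycle starts with the first failure (0 -> 1), which has
        # probability 1 here and therefore contributes nothing to the score.
        state, score = 1, 0.0
        while state not in (0, 3):
            if rng.random() < theta:          # failure transition
                state += 1
                score += 1.0 / theta          # d/dtheta of log(theta)
            else:                             # repair transition
                state -= 1
                score -= 1.0 / (1.0 - theta)  # d/dtheta of log(1 - theta)
        hit = 1.0 if state == 3 else 0.0
        sum_hit += hit
        sum_deriv += hit * score
    return sum_hit / n_cycles, sum_deriv / n_cycles


def exact_values(theta):
    """Closed-form gamma and its derivative for this toy chain."""
    r = (1.0 - theta) / theta
    s = 1.0 + r + r * r
    return 1.0 / s, (1.0 + 2.0 * r) / (theta ** 2 * s ** 2)


if __name__ == "__main__":
    print(lr_derivative_estimate(0.3, 200000, seed=1))
    print(exact_values(0.3))
```

Because each cycle contains only a handful of transitions, the accumulated score stays small, which reflects the observation above that short regenerative cycles suit this method.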

Theoretical studies in [77], [78] established that when esti-

mating derivatives with respect to certain system parameters

(e.g., failure rates of certain components) using ordinary simu-

lation, the ratio of the RE of the estimate of

and the estimate of

curs when the parameter corresponds to one of the largest (in

absolute value) sensitivities, where the sensitivity with respect

to a parameter

is defined as the product of

with respect to . Sensitivities measure how relative changes in

a parameter value affect the overall performance. Thus, for pa-

rameters

corresponding to the largest sensitivities, one can estimate the derivative with respect to

in (13).

remains bounded. This oc-

and the derivative

and the performance measure with about the same relative accuracy. However, the RE of

both these estimators go to infinity as the system unreliability

tends to zero (see Section II-B); therefore IS must be used. The

derivatives with respect to parameters that do not correspond to

the largest sensitivities might not be estimated as efficiently as

the performance measure when using ordinary simulation; [78]

gives an example illustrating this.

This section implements IS by simulating

cles using another probability measure and collecting observa-

tions

,

the triplet

, where

Then the IS estimator of

regenerative cy-

of

is the likelihood ratio.

is

When BFB is applied, the estimator of the derivative has BRE [78]. Necessary and sufficient conditions for BRE of derivative estimators obtained using other failure-biasing methods and more general IS schemes are established in [79], [81].

Reference [82] shows that even though the numerator

in the MTTF ratio formula can be estimated

with BRE using ordinary simulation, its derivative estimators

can have unbounded RE. Consequently, if

• the numerator and its derivative are estimated using ordinary simulation,

• BFB is applied to the estimation of the denominator and its derivative,

• all 4 terms are estimated mutually s-independently (using measure-specific IS),

then the resulting estimator of the derivative of the MTTF can have unbounded RE. On the other hand, if BFB is also used to estimate the derivative of the numerator, then its estimator has BRE and so does the resulting estimator of the derivative of the MTTF.

Experimental work in [83] seems to indicate that derivatives of the MTTF and the steady-state unavailability for large systems can be estimated efficiently using BFB. When estimating derivatives of the unreliability using BFB and forcing (see Section III-B), the empirical results show that the RE of the estimators are typically small when the TH is small, but they grow as the TH increases [101]. This is analogous to the behavior of (nonderivative) estimators of the unreliability, as discussed in Section III-B. Estimation of derivatives of the unreliability for large TH, using the bounding method in Section III-B, is treated in [101].

Reference [79] presents an example of a system showing that when estimating a performance measure and its derivatives using simple failure biasing, estimators of derivatives with respect to certain component failure rates can have BRE, while the performance-measure estimator does not. Thus, it is possible to estimate a derivative more efficiently than the performance measure when using simple failure biasing.

IV. FAST SIMULATION OF NON-MARKOV MODELS

Notation

NHPP

intensity rate function of NHPP,

upper bound for intensity rate function

time-homogeneous Poisson process with

rate

time of event of

Cdf, pdf of the lifetime of component , evaluated at

hazard rate of the lifetime of component

evaluated at

hazard rate of the repair-time of component

evaluated at

failure rate of component at time without,

with IS

repair rate of component at time

with IS

total failure rate of all components at time

without, with IS

total repair rate of all components at time

without, with IS

total event rate at time

likelihood ratio of failure, repair events at

time

likelihood ratio of pseudo, all events at time

number of component failures by time

set of operational components at time

time component fails at its failure #

length of a generic -cycle

number of system failures in an

the steady-state

total system failure time multiplied by

the likelihood ratio on a generic (biased)

-cycle

s-expectation under probability measure

and initial distribution

variance under probability measure

initial distribution

number of batches used in the batch means

method

generic batch mean of

generic batch mean of DL and its estimator

covariance under probability measures:

for the first r.v.,for the second r.v., and

under initial distribution .

This section uses IS to estimate dependability measures when the failure and repair times of components might not be exponentially distributed (under certain assumptions), and is based on [40], [54], [55], [87], [88].

Except for some technical and implementation details, most of the IS heuristics developed for Markov models also apply to non-Markov models. One approach to implement failure biasing (or forcing) in discrete-event systems is to reschedule failure events by sampling from new accelerated failure distributions [85].5 Heuristics and their implementation, as well as experimental results demonstrating the effectiveness of the techniques to estimate steady-state and transient measures, are in [85], [86].

5 There is considerable freedom in the choice of the new distributions.

Another approach to IS in discrete-event systems (briefly described in this section) is based on the uniformization method of simulation. The method requires the underlying (uniformized)

distributions to have bounded hazard-rate functions. A closely

related method which avoids generation of pseudo events is

exponential transformation; here, the time to the next failure

event is sampled directly from an exponential distribution [87].

Failure biasing or forcing is effected by increasing the failure

rate (relative to the repair rate or the mission time, respectively).

The latter two techniques are somewhat simpler to implement

than the technique in [85], because failure events need not be

rescheduled and are generated using only the exponential dis-

tribution.

The next paragraph briefly describes the uniformization

method of simulation, which is used in this section as a basis of

our approach to IS in non-Markov models.

Uniformization-Based Sampling: Uniformization (or thin-

ning) is a simple technique for sampling (simulating) the event

times of certain stochastic processes including NHPP, renewal

processes, or Markov processes in continuous time on either

discrete or continuous state spaces [22], [26], [49], [58], [72],

[104]. It is described for an NHPP $\{N(t)\}$ with intensity function $\lambda(t)$. Assume that $\lambda(t) \le \beta$ for all $t \ge 0$ for some finite constant $\beta$. Then the event times of $\{N(t)\}$ can be sampled by thinning the $\{N_\beta(t)\}$ process as follows:

For each $n$, include (accept) $T_n$ as an event-time in $\{N(t)\}$ with probability $\lambda(T_n)/\beta$; otherwise the point is not included (is rejected).

Rejected events are sometimes called pseudo events. Throughout it is assumed that all rate functions are left-continuous: $\lambda(t) = \lambda(t^-)$; thus if an event occurs at some random time $T$, then $\lambda(T)$ is the event rate just prior to time $T$.

Renewal processes can be simulated using uniformization, provided that $\lambda(t)$ is the hazard rate of the inter-event time distribution at time $t$. Uniformization can be generalized to cases in which the process being thinned is not a time-homogeneous Poisson process [72]. For example, at time $T_n$, set $T_{n+1} = T_n + E_n$, where $E_n$ has an exponential distribution with rate $\beta_n$. The point $T_{n+1}$ is then accepted with probability $\lambda(T_{n+1})/\beta_n$. This requires only that $\lambda(t) \le \beta_n$, for all $t \ge T_n$.

Section IV-A describes how the uniformization method of simulation can be combined with IS to develop an effective technique for estimating transient measures in non-Markov models of highly dependable systems.
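The thinning procedure described under Uniformization-Based Sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_nhpp_by_thinning`, the linear intensity `2t`, the bound `beta = 6`, and the horizon of 3 time units are hypothetical choices made for the example and do not come from the paper.

```python
import random


def sample_nhpp_by_thinning(intensity, beta, horizon, rng):
    """Sample NHPP event times on [0, horizon] by thinning a rate-beta Poisson process.

    Assumes intensity(t) <= beta for all t in [0, horizon].
    Returns (accepted event times, number of pseudo events).
    """
    accepted = []
    pseudo = 0
    t = 0.0
    while True:
        t += rng.expovariate(beta)      # next point of the rate-beta Poisson process
        if t > horizon:
            return accepted, pseudo
        if rng.random() < intensity(t) / beta:
            accepted.append(t)          # point accepted as a real event
        else:
            pseudo += 1                 # point rejected: a pseudo event


def intensity(t):
    # Hypothetical intensity function, bounded by 6 on [0, 3].
    return 2.0 * t


rng = random.Random(12345)
horizon, beta = 3.0, 6.0
counts = [
    len(sample_nhpp_by_thinning(intensity, beta, horizon, rng)[0])
    for _ in range(20000)
]
mean_count = sum(counts) / len(counts)
expected_count = horizon ** 2           # integral of 2t over [0, 3]
print(mean_count, expected_count)
```

The mean number of accepted events should be close to the integral of the intensity over the horizon, which gives a quick sanity check on the acceptance rule. A looser bound `beta` leaves the result unchanged but produces more pseudo events.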

A. Transient Measures

Consider the problem of estimating the unreliability

{time to system failure,

To simplify the notation, let there be 1 component of each

type (although more general situations can be handled). The

hazard rate [5] of component

, which we assume is well defined and finite.

,

[If component is not operational at time

,

nent at time . [If component is not being repaired at time

then.

There are a variety of ways to use IS in simulations of such

a system. Begin with a direct analog of forcing and BFB. This

method is based on uniformization. If components are highly

for some fixed value of }.

is then

age of componentat time .

then.]

elapsed repair time on compo-

reliable, then

bounded, then the system can be simulated (without IS) by uni-

formization as follows.

Assume

is a constant.

Then a Poisson process with rate

Let an event in this Poisson process occur at time

That event is accepted as a component

probability

, and is accepted as a component

event with probability

that

, another possibility exists: a pseudo-event (neither

a failure nor a repair occurs). This occurs with probability

. The probability of a failure event is

repair event is

.

For highly reliable components,

ever repairs are ongoing, thus the probability of a failure is

very small. To accelerate failures, simply change the acceptance

probabilities of the various event types, which is equivalent to

changing component failure and repair rates to, say,

•

[such that

•

[such that

The likelihood ratio (at time ) is:

. If the failure and repair rates are

, for all times;

is simulated.

.

failure event with

repair

. However, because it might be

, and of a

when-

],

].

(14)

These likelihood ratios have a simple form. For example, let

component

fails on its own (not through failure propagation)

times in.

(15)

Equation (15) assumes that the failure propagation probabilities

are sampled from their given distributions. However, IS can

be applied to these as well. The terms

be expressed similarly. The likelihood ratio can be computed

(updated) recursively at uniformization event times during the

simulation.
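As a rough illustration of changing the acceptance probabilities at uniformization events and updating the likelihood ratio recursively, the following Python sketch estimates the unreliability of a toy system. Everything specific here is an assumption made for the example and is not taken from the paper: a two-component parallel system without repairs, Weibull lifetimes (shape 2, scale 100), a uniformization rate `beta = 3`, a biased failure probability `q = 0.5` per uniformization event, and a uniform (balanced) choice of the failing component. Because there are no repairs, each component's age equals the current time, and the exact answer is available for comparison.

```python
import math
import random


def hazard(age, shape=2.0, scale=100.0):
    """Weibull hazard rate at a given age (hypothetical lifetime model)."""
    return (shape / scale) * (age / scale) ** (shape - 1.0)


def one_replication(horizon, beta, q, rng, n_comp=2):
    """One IS replication: indicator(system fails by horizon) times likelihood ratio."""
    up = list(range(n_comp))        # operating components (no repairs in this toy model)
    lr = 1.0                        # likelihood ratio, updated at every uniformization event
    t = 0.0
    while True:
        t += rng.expovariate(beta)  # same uniformization rate with and without IS
        if t > horizon:
            return 0.0              # no system failure within the time horizon
        p_fail_each = hazard(t) / beta              # original per-component acceptance probability
        p_pseudo = 1.0 - len(up) * p_fail_each      # original pseudo-event probability
        if rng.random() < q:
            # Biased measure: the event is a failure with probability q,
            # and the failing component is chosen uniformly among those operating.
            lr *= p_fail_each / (q / len(up))
            up.pop(rng.randrange(len(up)))
            if not up:
                return lr           # all components down: system failure
        else:
            lr *= p_pseudo / (1.0 - q)              # pseudo event under the biased measure


rng = random.Random(2001)
horizon, beta, q, n = 1.0, 3.0, 0.5, 100000
samples = [one_replication(horizon, beta, q, rng) for _ in range(n)]
estimate = sum(samples) / n
# Exact unreliability: both independent Weibull lifetimes end before the horizon.
exact = (1.0 - math.exp(-(horizon / 100.0) ** 2)) ** 2
print(estimate, exact)
```

In this toy setting the target probability is about 1e-8, so ordinary simulation with the same number of replications would almost never observe a system failure, whereas the biased acceptance probabilities produce failures in a large fraction of replications.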

The analog of BFB with forcing is accomplished as follows.

If no repairs are ongoing (e.g., in the state where all compo-

nents are operational), let

. (In practice [87], could be chosen such that

. This means that, with probability 0.8, some com-

ponent fails before the TH

expires.) If repairs are ongoing,

let

: the total event rate is the same as without IS.

Then, let

, for some constant : given that the

event is real, make it a failure event with probability . (In prac-

tice [87],

is usually set in the range from 0.3 to 0.5.) Given a

failure event, pick an operating component

ability

. Under appropriate technical conditions, it can

be shown that such a heuristic for IS (which is the analog of

forcing and failure biasing) results in estimates having BRE

[55]. In particular, let

exist a small positive parameter

, where

that, for all

,

andcan

, for some constant

to fail with prob-

, and let there

such that

and. If IS is done such

[when],


(when componentis under-

going repair), then under some additional minor assumptions

(including that the failure propagation probabilities are -inde-

pendent of ), the estimates of

Because repair distributions might not have bounded hazard

rates (e.g., discrete and uniform), it is desirable to seek effec-

tive IS methods that do not rely on uniformization for repair

events. The above uniformization-based algorithm can be ap-

plied to just failure events: repair times are sampled directly

from their original distributions, while uniformization is used

to simulate failure events. The likelihood ratio is then

: it does not include the repair event term

Again, under appropriate technical conditions, this modification

results in BRE [55].

Uniformization can be computationally inefficient if there are many pseudo events. In addition, suppose events from a Poisson process with rate $\beta$ are accepted as failure events with probability $p$. Then the time until an accepted event has an exponential distribution with rate $\beta p$. This suggests sampling the time to next failure event directly from an exponential distribution with rate $\beta p$. A generalization of this approach (exponential transformation) also results in estimates having BRE (under appropriate technical conditions [55]). The likelihood ratio takes on a somewhat different form [55], [87].
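The observation behind exponential transformation, that the time to the first accepted point of a thinned Poisson process is itself exponentially distributed, can be checked numerically. The sketch below is only illustrative: the function names and the values `beta = 4` and `p = 0.05` are hypothetical, and a constant acceptance probability is assumed.

```python
import random


def time_to_accepted_event_by_thinning(beta, p, rng):
    """Generate rate-beta Poisson points until one is accepted with probability p."""
    t = 0.0
    while True:
        t += rng.expovariate(beta)
        if rng.random() < p:
            return t                 # first accepted (failure) event
        # otherwise the point is a pseudo event and sampling continues


def time_to_accepted_event_direct(beta, p, rng):
    """Sample the same time directly, without generating pseudo events."""
    return rng.expovariate(beta * p)


rng = random.Random(7)
beta, p, n = 4.0, 0.05, 20000
thinned = [time_to_accepted_event_by_thinning(beta, p, rng) for _ in range(n)]
direct = [time_to_accepted_event_direct(beta, p, rng) for _ in range(n)]
mean_thinned = sum(thinned) / n
mean_direct = sum(direct) / n
theoretical_mean = 1.0 / (beta * p)
print(mean_thinned, mean_direct, theoretical_mean)
```

Both sample means should be close to the theoretical mean, while the direct version draws a single random variate per failure time instead of roughly `1/p` of them.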

Empirical studies testing the competence of these methods

are reported in [54], [55], [87]. Generally, good variance reduction is obtained if

is small, say, less than 10

smaller

and are, the greater the

Finally, though no formal study has been done, we anticipate

that these techniques (with minor modifications) apply also for

estimating the s-expected interval unavailability.

haveBRE as.

.

.The

can be made.

B. Steady-State Measures

Non-Markov models of highly dependable systems might not

possess an explicit regenerative structure. If they do not, then

a ratio representation of steady-state measures in terms of re-

generative-cycle-based quantities, such as in (5), is no longer

possible. This section discusses an approach for the efficient es-

timation of steady-state measures, such as system unavailability

and MTBF, in non-Markov nonregenerative models. The approach uses a representation of steady-state measures in terms of

quantities based on $A$-cycles: a sample path between two successive entries of the system into some set of states $A$. In the context of highly dependable systems (as in Section II), choose $A$ to be the state in which all components are operational. Only when all component failure-time distributions are exponential (regardless of the repair-time distributions) does entrance into the set $A$ constitute a regeneration point; in that case a ratio representation of the steady-state unavailability in terms of regenerative-cycle-based quantities, as in (5), is still valid [85]. However, this is no longer true if component failure times are generally distributed. Therefore, in general, $A$-cycles are not i.i.d., and one cannot use classical statistical techniques to estimate the variances of $A$-cycle-based quantities. Instead, one can use the method of batch means to estimate the variances of these quantities by grouping successive $A$-cycle-based quantities into nonoverlapping batches, and then treating the batch means as i.i.d. observations; this is an approximation whose validity increases with the batch size.
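The batch-means idea in the paragraph above can be sketched as follows. This is a minimal illustration, not the procedure of [88]: it treats a single correlated sequence rather than the ratio of cycle-based quantities, it uses a normal quantile (1.96) instead of a Student-t quantile, and the AR(1) test sequence, the function name `batch_means_interval`, and all numeric settings are hypothetical.

```python
import math
import random
import statistics


def batch_means_interval(observations, n_batches, z=1.96):
    """Point estimate and approximate 95% half-width from nonoverlapping batch means."""
    batch_size = len(observations) // n_batches
    means = [
        statistics.fmean(observations[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]
    grand_mean = statistics.fmean(means)
    # Batch means are treated as i.i.d.; this is the approximation that
    # improves as the batch size grows.
    half_width = z * statistics.stdev(means) / math.sqrt(n_batches)
    return grand_mean, half_width


# Hypothetical positively correlated sequence standing in for successive
# cycle-based quantities: an AR(1) process with mean 10.
rng = random.Random(42)
mu, phi, n_obs = 10.0, 0.8, 40000
x, obs = mu, []
for _ in range(n_obs):
    x = mu + phi * (x - mu) + rng.gauss(0.0, 1.0)
    obs.append(x)

grand_mean, half_width = batch_means_interval(obs, n_batches=20)
# Half-width that (wrongly) treats the raw observations as i.i.d.
naive_half_width = 1.96 * statistics.stdev(obs) / math.sqrt(n_obs)
print(grand_mean, half_width, naive_half_width)
```

For a positively correlated sequence, the half-width that ignores the dependence is too optimistic, while the batch-means half-width is wider and closer to the true uncertainty.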

Let

be the initial distribution of the corresponding (original)

stochastic process upon entering the set

process has reached the steady-state. According to the definition

of

, is the steady-state joint distribution of the components’

ages upon entering the state in which all components are oper-

ational; upon entering

, at least 1 component has an age

Under fairly general ergodicity conditions (which also ensure

that the system returns to the set

representation of

in terms of

, after the stochastic

0.

infinitely often), the ratio

-cycle-based quantities is:6

(16)

the subscripts denote that the s-expectation is with respect to the

original probability measure

the original system) and the steady-state initial distribution

the

-cycles. A ratio representation for the MTBF in terms of

-cycle-based quantities is

(which governs the behavior of

of

(17)

The remainder of this section reviews the estimation of , which

has been considered in [88]. A similar approach to estimate the

MTBF is in [40].

Because system-failure is a rare event, ordinary simulation is

very inefficient to estimate

IS.

a new probability measure to simulate the system.

a sample path in the original process, on which the

total system down-time is evaluated to be

must satisfy the condition

ever

.

With IS,

an s-unbiased estimate of

(

likelihood ratio).

An appropriate choice of

should yield

, which implies much better precision in estimating

.

can be estimated efficiently using ordinary simula-

tion. Therefore, the ratio estimator in (16) can be written as

; this motivates the use of

.

, “when-

,

(18)

The resulting scheme is analogous to MSDIS for estimating

the steady-state unavailability in Markov models [46] (see Section III-B). First, the system is simulated using the original

a sufficiently long time to approximately reach the steady-state.

At that time, the initial distribution upon entry of

sufficiently close to

, and begin to use the following splitting

technique. For each (steady-state)

(once or more) starting with the same component failure ages

and using

to get samples of

cles. Then run the same

-cycle using the

for

-cycles is

-cycle, run the simulation

and ;theseare -biased-cy-

to get a sample of

6 For details, see [10], [16], [27], [106].