
Abstract—Fault tree theory has been used for years because it provides a concise representation of the failure behavior of general non-repairable fault-tolerant systems. However, traditional fault trees lack accuracy when modeling the dynamic failure behavior of systems with fault-recovery processes. One solution to this problem is behavioral decomposition: the system is divided into several dynamic or static modules, and each module is analyzed separately using a BDD or a Markov chain. In this paper, we present a decomposition scheme in which independent subtrees of a dynamic module are detected and solved hierarchically, which saves computation time when solving the Markov chains without an unacceptable loss of accuracy in the assessed component sensitivities. Finally, we present a software toolkit that implements the enhanced methodology.

Index Terms—Dynamic fault tree, Markov model, Reliability

analysis, Sensitivity analysis

I. INTRODUCTION

For the past forty years, fault trees have been widely used for the reliability analysis of hardware systems. They provide an intuitive and easy-to-specify representation of the failure behavior of a system and have been supported by a rich body of research since the 1960s. Traditional static fault trees use Boolean gates such as AND, OR, and voting gates to represent which combinations of component failures cause the whole system to fail, and they are usually solved with Binary Decision Diagrams (BDD). When fault tree analysis was applied to software and embedded systems in the early 1980s, researchers noted that some dynamic failure mechanisms cannot be modeled by traditional static fault trees. These mechanisms are usually associated with sequence-dependent events, spares and dynamic redundancy management, and priorities of failure events. For this reason, many modelers turned to Markov chains for the reliability assessment of software-involved systems and suffered from their computational complexity. To overcome this difficulty, Dynamic Fault Trees (DFT), first introduced by Dugan [3], add a sequential notion to the traditional fault tree approach and apply the linear-time modularization algorithm [1] to divide the whole fault tree into several independent subtrees. These independent subtrees are then identified as static or dynamic [2]. Finally, the dynamic modules are translated into Markov chains and solved, while the static modules are left to the traditional BDD solution.

Once the Markov models have been built, numerical transient analysis is applied to obtain the transient state probabilities [6, 7]. The two most common methods for computing transient Markov state probabilities are (1) differential-equation-based methods such as the Runge-Kutta method, and (2) Markov-chain-specific probabilistic methods such as the randomization method. The computational complexity of these methods is $O(KN^3) = O(K(n^p)^3) = O(Kn^{3p})$, where N is the number of Markov states, on the order of $n^p$; p is the number of possible statuses of each basic event; n is the number of basic events of the dynamic fault tree; and K is the number of iterations or time steps. K depends on the desired accuracy and the mission time [8].
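The randomization (uniformization) method mentioned above can be sketched in a few lines. This is a minimal illustration, not the solver used in this paper: the function name `transient_probabilities`, the tolerance, the term limit, and the two-state example chain with failure rate `lam` are all assumptions made for the sketch. It also evaluates the Poisson weights naively, so it is only suitable when the product of the uniformization rate and the mission time is small.

```python
import math

def transient_probabilities(Q, p0, t, tol=1e-12, max_terms=10000):
    """Transient state probabilities of a small CTMC by uniformization.

    Q  : generator matrix as a list of rows (each row sums to zero)
    p0 : initial state probability vector
    t  : mission time
    tol, max_terms : truncation controls (assumed defaults for this sketch)
    """
    n = len(Q)
    # Uniformization rate: the largest exit rate of any state.
    rate = max(-Q[i][i] for i in range(n))
    if rate <= 0.0 or t <= 0.0:
        return list(p0)
    # One-step matrix of the uniformized discrete-time chain: P = I + Q / rate.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    weight = math.exp(-rate * t)      # Poisson weight for k = 0
    vec = list(p0)                    # holds p0 * P^k
    result = [weight * x for x in vec]
    covered = weight                  # total Poisson mass accumulated so far
    k = 0
    while 1.0 - covered > tol and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        covered += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result

# Hypothetical two-state chain: state 0 = operational, state 1 = failed.
lam = 1e-4
Q = [[-lam, lam], [0.0, 0.0]]
probs = transient_probabilities(Q, [1.0, 0.0], 10000.0)
print(probs[0], math.exp(-lam * 10000.0))  # the two values should agree closely
```

Each loop iteration is one vector-matrix product, which is where the dependence on the number of states N and on the number of iterations K comes from.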

To reduce the state space of a Markov model, a straightforward approach called the decomposition scheme [9] has been proposed. In this scheme, independent subtrees of a fault tree are detected and solved hierarchically. An independent subtree is replaced by a single event whose probability of occurrence equals the probability of occurrence of the subtree. Once the independent subtrees are separated, they are translated into corresponding Markov models individually, and therefore the computational complexity of solving the overall system Markov model decreases significantly. The drawback of this approach is that once the state space of a Markov model has been reduced, it is difficult to evaluate the component sensitivities of the eliminated basic events. Thus, in Dugan's fault tree analysis algorithm, modularization is not applied to a subtree whose top-level node is a dynamic gate [2, 10].

In this paper, we demonstrate an improved decomposition scheme in which dynamic subtrees can be modularized, so that the state space of the resulting Markov model is reduced. Even so, our approach retains the capability of evaluating the component sensitivities of the eliminated basic events. Section II presents a motivating example. Section III explains how the improved decomposition algorithm works and details each phase using the example from Section II. Section IV shows the theoretical efficiency gain and the measured difference in computation time between the traditional approach and the improved one, and presents the software toolkit that implements our enhanced methodology. Section V concludes the paper.

II. MOTIVATION

To allow a comparison with the traditional dynamic fault tree analysis methodology, we use the same example, a cardiac assist system presented in [12, 13]. This system is designed to treat both electrical and mechanical failures of the heart. Electrical failure can cause the heart to beat abnormally, whereas mechanical heart failure reduces the heart's ability to generate blood pressure.


Reliability and Sensitivity Analysis of Embedded

Systems with Modular Dynamic Fault Trees

Hsiang-Kai Lo*, Chin-Yu Huang*, Yung-Ruei Chang**, Wei-Chih Huang*, and Jun-Ru Chang*

*Department of Computer Science

National Tsing Hua University

Hsinchu, Taiwan

**Institute of Nuclear Energy Research

Atomic Energy Council

Taoyuan, Taiwan


The cardiac assist system consists of two parts: a patient-worn vest and a part implanted in the patient's body. The components on the patient-worn vest include an external TEDTS (Transcutaneous Energy and Data Transmission System) coil and rechargeable batteries. Power and data are transferred between the external and implanted portions of the system by the TEDTS. The implanted portion includes an electronic controller, a mechanical blood pump, an internal TEDTS coil, and a rechargeable battery. Audible and tactile alarms included in the electric controller of the patient vest warn the patient if a fault is detected. Pace leads are attached to the heart and connected to the electric controller for monitoring. The fault tree of the cardiac assist system is shown in Fig. 1.

Note that the CPU module in Fig. 1 contains two dynamic fault tree gates, which are necessary for modeling the fault-tolerant characteristics of the system. The FDEP gate forces the dependent events to occur when the trigger event occurs. The output of the Spare gate becomes true if and only if all of its input events have occurred. Detailed definitions and discussions of the functional behavior of these dynamic gates can be found in [3, 4].

III. DYNAMIC FAULT TREE ANALYSIS

In this section, we describe our fault tree analysis methodology phase by phase. First, we apply a linear-time modularization algorithm, modified from [1], to the fault tree to find the dynamic modules. The approach then replaces independent subtrees of the dynamic modules with basic events without violating failure dependencies and translates the simplified dynamic modules into corresponding Markov models. However, as mentioned in [2], this reduction may prevent an exact solution in sensitivity assessment. To overcome this problem and obtain the component sensitivities of the eliminated basic events, we propose a methodology called sample points pre-fetching, which obtains a set of coefficients by translating both the original fault tree and the reduced one into Markov models. Finally, we obtain the system reliability from the reduced fault tree at a lower time cost, and we obtain approximate component sensitivities of the eliminated basic events by rebuilding the distribution of the state probability vector from the pre-fetched sample points.

A. Modularization & Decomposition

We use the linear-time algorithm proposed by Dutuit and Rauzy [1] to detect the modules of a fault tree, whether it is coherent or not. The algorithm is derived from Tarjan's algorithm [11] for finding the strongly connected components of a graph. It can detect the modules of a fault tree with several hundred gates and basic events within a few milliseconds. Moreover, its simplicity makes it easy to implement.

Fig. 1. Fault tree of the cardiac assist system


The basic principle of the algorithm can be stated as follows. Let e be an internal event, and let d1 and d2 be the dates of the first and second visits of e in a depth-first left-most (DFLM) traversal of the fault tree. Then e is a module if and only if none of its descendants is visited before d1 or after d2 during the traversal [1]. By this definition, the root event and the terminal events (basic events) are always modules, so they are omitted from the output of the algorithm.

Take the fault tree in Fig. 1 as an example. First, we assign a unique ID to each logic gate and event and view them as tree nodes. For convenience, we use the traversal order as the ID of each node (as shown in Fig. 2). Then we apply the modularization algorithm to the fault tree. The visit date of each node is determined by a counter that is initialized to 0 and incremented by 1 at each traversal step. The first and second visit dates of each fault tree node are reported in Table I. From the definition above and Table I, the modules of the fault tree are {2, 5, 9, 11, 15, 17}. Note that node #17 is not identified as the head of a module by the traditional modularization algorithms of dynamic fault tree analysis, because they stop traversing when they meet a dynamic gate (nodes #16 and #22 in this case). To avoid confusion, we refer to modularization inside a dynamic module as “decomposition”.
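The module test described above can be sketched as follows. This is an illustrative reading of the Dutuit and Rauzy principle, not the authors' implementation: the function name `detect_modules` is hypothetical, and the tree structure in `tree` is an assumption inferred from the visit dates in Table I (in particular, the sharing of events 20 and 21 between the two dynamic gates 16 and 22 is assumed so that the dates and the reported modules are reproduced).

```python
def detect_modules(children, root):
    """Return (modules, first, second) for a fault tree given as a DAG.

    children maps each gate to the ordered list of its inputs; nodes that
    never appear as keys are basic events.
    """
    first, second, last = {}, {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        first[node] = counter
        for child in children.get(node, []):
            if child in first:
                # Already visited earlier: only record the date of this later visit.
                last[child] = counter
            else:
                visit(child)
        counter += 1
        second[node] = counter
        last[node] = counter

    visit(root)

    modules = []
    bounds = {}  # node -> (earliest first date, latest visit date) over node and descendants

    def span(node):
        if node in bounds:
            return bounds[node]
        desc_lo, desc_hi = float("inf"), float("-inf")
        for child in children.get(node, []):
            c_lo, c_hi = span(child)
            desc_lo, desc_hi = min(desc_lo, c_lo), max(desc_hi, c_hi)
        is_gate = bool(children.get(node))
        # Module test: no descendant visited before d1 or after d2.
        if is_gate and node != root and desc_lo > first[node] and desc_hi < second[node]:
            modules.append(node)
        bounds[node] = (min(first[node], desc_lo), max(last[node], desc_hi))
        return bounds[node]

    span(root)
    return sorted(modules), first, second

# Assumed structure, inferred from Table I; node IDs follow the traversal order.
tree = {
    1: [2, 9, 15],
    2: [3, 4, 5],
    5: [6, 7, 8],
    9: [10, 11],
    11: [12, 13, 14],
    15: [16, 22],
    16: [17, 20, 21],  # dynamic gate
    17: [18, 19],
    22: [20, 21],      # dynamic gate sharing events 20 and 21 with node 16
}
modules, first, second = detect_modules(tree, 1)
print(modules)
```

Under these assumptions, nodes 16 and 22 fail the test because events 20 and 21 are visited both inside node 16 and again from node 22, while node 17 passes it, which is the behavior described in the text.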

Fig. 2. Simplified fault tree

B. Markov Model Translation

Since there is a dynamic module (node #15), we translate it into a Markov chain for subsequent analysis. Meanwhile, node #17 is identified as an independent static module by our decomposition algorithm, and we replace it with a new basic event (as shown in Fig. 3). Because the sample points pre-fetching algorithm requires it, we translate both the original and the reduced fault tree into Markov models.

The translation algorithm we adopt was proposed in [4]; for the sake of conciseness, we do not describe the process in detail. The resulting Markov models are shown in Fig. 4 and Fig. 5. The bit string labeling each state represents the current component configuration (1 for operational, 0 for failed), and the system status of each state is labeled next to it. The mapping between bits and the corresponding components (or basic events) is given at the top of each figure. The number on each directed edge indicates which component fails.

Fig. 3. Replace a module with a basic event

Fig. 4. The Markov model for original dynamic module

Fig. 5. The Markov model for reduced dynamic module

C. Sample Points Pre-fetching

If we assess the system reliability by solving the Markov model of the reduced dynamic module together with the static modules, the result is exactly the same as that of the traditional dynamic fault tree analysis methodology. However, if we want to perform a component sensitivity analysis on the dynamic module using the reduced Markov model, it is difficult to obtain the sensitivities of some components. The sensitivity measure of component C can be calculated from the Markov state probabilities using:

$$ S(t \mid C) = \frac{Q_C}{Q_C + R_C} - \frac{Q_{\bar{C}}}{Q_{\bar{C}} + R_{\bar{C}}} \tag{1} $$

or

$$ S(t \mid C) = \frac{R_{\bar{C}}}{R_{\bar{C}} + Q_{\bar{C}}} - \frac{R_C}{R_C + Q_C} \tag{2} $$

where $R_C$, $R_{\bar{C}}$, $Q_C$, and $Q_{\bar{C}}$ are the probability sums of these Markov state subsets:

- $R_C$: System is operational with component C failed.
- $R_{\bar{C}}$: System is operational with component C operational.
- $Q_C$: System is failed with component C failed.
- $Q_{\bar{C}}$: System is failed with component C operational.


TABLE I. VISIT DATES RECORD

ID   d1   d2      ID   d1   d2
 1    1   44      12   20   21
 2    2   15      13   22   23
 3    3    4      14   24   25
 4    5    6      15   28   43
 5    7   14      16   29   40
 6    8    9      17   30   35
 7   10   11      18   31   32
 8   12   13      19   33   34
 9   16   27      20   36   37
10   17   18      21   38   39
11   19   26      22   41   42

For example, if we want to calculate the sensitivity of the crossbar switch (Sw) in the original Markov model, we can use either Eq. 1 or Eq. 2 (here we use Eq. 2):

$$ S(t \mid Sw) = \frac{\mathrm{State}_{0111} + \mathrm{State}_{1011} + \mathrm{State}_{1111}}{\mathrm{State}_{0010} + \mathrm{State}_{0011} + \mathrm{State}_{0111} + \mathrm{State}_{1011} + \mathrm{State}_{1111}} \tag{3} $$
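The sensitivity measure can be evaluated directly from a table of state probabilities. The sketch below is only an illustration: the function names, the probability values, the set of operational states, and the choice of the third bit as the crossbar switch are all assumptions (the bit position is chosen so that the state labeled 0010 counts as "switch operational", as in Eq. 3). A conditional term whose conditioning subset has zero probability is treated as 0, which is a convention adopted for the sketch.

```python
def subset_sums(state_probs, system_up, bit):
    """Sum state probabilities into the four subsets used by Eq. 1 and Eq. 2.

    Keys: R = system operational, Q = system failed,
          C = component failed,   notC = component operational.
    """
    sums = {"R_C": 0.0, "R_notC": 0.0, "Q_C": 0.0, "Q_notC": 0.0}
    for state, prob in state_probs.items():
        system = "R" if state in system_up else "Q"
        component = "notC" if state[bit] == "1" else "C"
        sums[system + "_" + component] += prob
    return sums

def _ratio(num, den):
    # Assumed convention: a term conditioned on an empty subset counts as 0.
    return num / den if den > 0.0 else 0.0

def sensitivity_eq1(s):
    return (_ratio(s["Q_C"], s["Q_C"] + s["R_C"])
            - _ratio(s["Q_notC"], s["Q_notC"] + s["R_notC"]))

def sensitivity_eq2(s):
    return (_ratio(s["R_notC"], s["R_notC"] + s["Q_notC"])
            - _ratio(s["R_C"], s["R_C"] + s["Q_C"]))

# Hypothetical state probabilities of the original model at one time point.
state_probs = {
    "1111": 0.90, "0111": 0.03, "1011": 0.02, "0011": 0.01,
    "0010": 0.02, "0001": 0.015, "0000": 0.005,
}
system_up = {"1111", "0111", "1011"}   # assumed operational states
SWITCH_BIT = 2                          # assumed position of the crossbar switch

sums = subset_sums(state_probs, system_up, SWITCH_BIT)
print(sensitivity_eq1(sums), sensitivity_eq2(sums))
```

When both conditioning subsets have nonzero probability, the two forms give the same value, since each conditional failure probability is one minus the corresponding conditional survival probability.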

Unfortunately, the basic event of the crossbar switch is eliminated in the reduced Markov model. We cannot find a one-to-one state mapping between the two models, in particular for three states: $\mathrm{State}_{0010}$, $\mathrm{State}_{0001}$, and $\mathrm{State}_{0000}$. However, we do know that $\mathrm{State}_{000}$ in the reduced Markov model covers these three states. That is:

$$ \mathrm{State}_{000} = \mathrm{State}_{0001} + \mathrm{State}_{0010} + \mathrm{State}_{0000} \tag{4} $$

which also means:

$$ \mathrm{State}_{0001} = a \cdot \mathrm{State}_{000}, \quad \mathrm{State}_{0010} = b \cdot \mathrm{State}_{000}, \quad \mathrm{State}_{0000} = c \cdot \mathrm{State}_{000}, \quad \text{where } a + b + c = 1 \tag{5} $$

We call the values a, b, and c the “proportion coefficients”. If we had the exact proportion coefficients for every component at every time point, we could compute the exact component sensitivity. Therefore, we compute the proportion coefficients by pre-fetching several probability vectors at a fixed time offset, as follows:

$$ a_t = \frac{\mathrm{State}_{0001}(t)}{\mathrm{State}_{000}(t)}, \quad b_t = \frac{\mathrm{State}_{0010}(t)}{\mathrm{State}_{000}(t)}, \quad c_t = \frac{\mathrm{State}_{0000}(t)}{\mathrm{State}_{000}(t)} \tag{6} $$

After pre-fetching, the proportion coefficients at any time point are taken from the nearest sample point that does not exceed that time. Generally, the smaller the offset, the more accurate the proportion coefficients.

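Eq. 3 can then be rewritten with respect to the reduced Markov model as:

$$ S(t \mid Sw) = \frac{\mathrm{State}_{011} + \mathrm{State}_{101} + \mathrm{State}_{111}}{b \cdot \mathrm{State}_{000} + \mathrm{State}_{001} + \mathrm{State}_{011} + \mathrm{State}_{101} + \mathrm{State}_{111}} \tag{7} $$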
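The pre-fetching and lookup steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the probability values are made up, and the sample times are assumed to start at the first offset so that the merged state has nonzero probability at every sample point. The lookup uses the latest sample point that does not exceed the requested time.

```python
from bisect import bisect_right

def prefetch_coefficients(original_samples):
    """Proportion coefficients (a, b, c) of Eq. 6 at each sample point.

    original_samples[i] maps original-model state labels to probabilities
    at the i-th sample time.
    """
    table = []
    for probs in original_samples:
        merged = probs["0001"] + probs["0010"] + probs["0000"]  # State_000, Eq. 4
        table.append((probs["0001"] / merged,
                      probs["0010"] / merged,
                      probs["0000"] / merged))
    return table

def coefficients_at(t, sample_times, table):
    """Coefficients of the latest sample point not later than t."""
    i = bisect_right(sample_times, t) - 1
    if i < 0:
        raise ValueError("t precedes the first sample point")
    return table[i]

def approx_sensitivity(reduced_probs, b):
    """Eq. 7 evaluated on the reduced model with proportion coefficient b."""
    up = reduced_probs["011"] + reduced_probs["101"] + reduced_probs["111"]
    return up / (b * reduced_probs["000"] + reduced_probs["001"] + up)

# Hypothetical sample points (hours) and original-model probabilities.
sample_times = [2000.0, 4000.0]
original_samples = [
    {"0001": 0.01, "0010": 0.02, "0000": 0.01},
    {"0001": 0.03, "0010": 0.03, "0000": 0.04},
]
table = prefetch_coefficients(original_samples)

# Hypothetical reduced-model probabilities at t = 3500 hours.
reduced = {"111": 0.85, "011": 0.04, "101": 0.03, "001": 0.01, "000": 0.07}
a, b, c = coefficients_at(3500.0, sample_times, table)
print(approx_sensitivity(reduced, b))
```

The table is built once from the original model; every later sensitivity query only needs the reduced model's probabilities and a binary search over the sample times.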

Fig. 6 shows the result of applying the sample points pre-fetching approach to the Markov model of the example system for the crossbar switch, with a mission time of 10000 hours and a time offset of 2000 hours.

IV. IMPROVEMENT & IMPLEMENTATION

Theoretically, the key point of the proposed approach is to reduce a dynamic module to a smaller one, which results in a smaller Markov model and a faster analysis process. Meanwhile, we apply an approximation methodology that fetches several sample points to avoid losing accuracy in the sensitivity assessment. In other words, we trade the overhead of sample points pre-fetching for a lower computational complexity of each component sensitivity analysis. The pre-fetching algorithm is applied only once to obtain several transient solutions, and every subsequent sensitivity evaluation at a different time point benefits from it. The pre-fetching algorithm is based directly on the exponential series of the transition matrix. Its computational complexity is:

$$ O\left(\left(\log(m) + \frac{M}{m}\right) \cdot N^3\right) = O\left(\left(\log(m) + \frac{M}{m}\right) \cdot n^{3p}\right) \tag{8} $$

where M is the mission time and m is the time offset between successive fetches.

Fig. 6. Sample points and proportion coefficients

Since the overall complexity depends on the algorithm used by the Markov model solver, we adopt the survey in [8], summarized in Section I, to evaluate the complexity of common Markov solution algorithms. Suppose our approach is applied to a dynamic fault tree whose dynamic modules contain independent subtrees, so that k components can be merged and eliminated. The ratio of the computational complexity of the reduced Markov model (with N' states) to that of the original one is then:

$$ \frac{O(N'^3)}{O(N^3)} = \frac{O((n-k)^{3p})}{O(n^{3p})} \tag{9} $$
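As a rough worked illustration of Eq. 9, assume p = 2 statuses per basic event and a dynamic module with n = 4 basic events from which k = 1 is eliminated. These values are chosen only to resemble the 4-bit and 3-bit state labels of the example; they are not measured figures, and the ratio follows the order-of-magnitude model N ~ n^p from Section I rather than the actual state counts.

```latex
\[
  \frac{(n-k)^{3p}}{n^{3p}}
  = \left(\frac{n-k}{n}\right)^{3p}
  = \left(\frac{3}{4}\right)^{6}
  = \frac{729}{4096}
  \approx 0.178
\]
```

Under these assumptions the reduced model would need roughly 18% of the solution effort of the original one, before the one-time pre-fetching cost is added.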

For comparison with the traditional dynamic fault tree analysis algorithm proposed by Dugan in [2], we set up an experiment to measure the actual analysis time costs. The Markov models generated by both approaches are solved by a solver implemented with CVODE [14], which is also adopted by Dugan [15].

TABLE I. VISIT DATES RECORD


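TABLE II. TIME COST MEASUREMENT RESULT

Mission   Sampling  Pre-fetching  Analysis  Traditional  Proposed algorithm    Efficiency  Overall
time (h)  step (h)  (once)        times     algorithm    sensitivity analysis  gain        MSE
10^3      10000     <0.1 ms        5        0.59 ms      0.46 ms               >5.1%       1.3·10^-2
10^3      10000     <0.1 ms       10        1.43 ms      0.98 ms               >24.4%      1.3·10^-2
10^3      10000     <0.1 ms       20        3.23 ms      2.14 ms               >30.7%      1.3·10^-2
10^3      5000      <0.1 ms        5        0.59 ms      0.46 ms               >5.1%       2.4·10^-4
10^3      5000      <0.1 ms       10        1.43 ms      0.98 ms               >24.4%      2.4·10^-4
10^3      5000      <0.1 ms       20        3.23 ms      2.14 ms               >30.7%      2.4·10^-4
10^3      2000      0.28 ms        5        0.59 ms      0.46 ms               -25.4%      3.1·10^-5
10^3      2000      0.28 ms       10        1.43 ms      0.98 ms               11.9%       3.1·10^-5
10^3      2000      0.28 ms       20        3.23 ms      2.14 ms               25.1%       3.1·10^-5
10^4      10000     0.13 ms        5        0.83 ms      0.57 ms               15.7%       1.2·10^-2
10^4      10000     0.13 ms       10        1.79 ms      1.03 ms               35.2%       1.2·10^-2
10^4      10000     0.13 ms       20        3.97 ms      2.28 ms               39.3%       1.2·10^-2
10^4      5000      0.31 ms        5        0.83 ms      0.57 ms               -6.0%       2.4·10^-4
10^4      5000      0.31 ms       10        1.79 ms      1.03 ms               25.1%       2.4·10^-4
10^4      5000      0.31 ms       20        3.97 ms      2.28 ms               34.8%       2.4·10^-4
10^4      2000      0.71 ms        5        0.83 ms      0.57 ms               -54.2%      3.1·10^-5
10^4      2000      0.71 ms       10        1.79 ms      1.03 ms               0.03%       3.1·10^-5
10^4      2000      0.71 ms       20        3.97 ms      2.28 ms               24.7%       3.1·10^-5
10^5      10000     0.42 ms        5        1.85 ms      1.08 ms               18.9%       1.2·10^-2
10^5      10000     0.42 ms       10        4.03 ms      2.29 ms               32.8%       1.2·10^-2
10^5      10000     0.42 ms       20        8.66 ms      5.23 ms               34.8%       1.2·10^-2
10^5      5000      0.78 ms        5        1.85 ms      1.08 ms               -0.5%       2.4·10^-4
10^5      5000      0.78 ms       10        4.03 ms      2.29 ms               23.8%       2.4·10^-4
10^5      5000      0.78 ms       20        8.66 ms      5.23 ms               30.6%       2.4·10^-4
10^5      2000      1.88 ms        5        1.85 ms      1.08 ms               -60.0%      3.1·10^-5
10^5      2000      1.88 ms       10        4.03 ms      2.29 ms               -0.03%      3.1·10^-5
10^5      2000      1.88 ms       20        8.66 ms      5.23 ms               17.9%       3.1·10^-5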

In this experiment, we use three different time offsets (10000, 5000, and 2000 hours) for the sample points pre-fetching approach in the proposed algorithm. We assume that each profile (with a different mission time) is used for sensitivity assessments at 5, 10, and 20 different user-specified time points (listed as “Analysis Times” in Table II). Table II shows the total time costs (in micro-seconds) of this experiment.

The efficiency gain in Table II is computed by:

$$ \left(1 - \frac{\text{Proposed algorithm's time cost}}{\text{Traditional algorithm's time cost}}\right) \times 100\% \tag{10} $$
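The efficiency gain can be checked against individual Table II entries with a short calculation. The function name `efficiency_gain` is hypothetical, and counting the one-time pre-fetching cost as part of the proposed algorithm's time cost is an assumption; it is made here because it reproduces the tabulated gains for the entries tried below.

```python
def efficiency_gain(traditional_ms, proposed_ms, prefetch_ms=0.0):
    """Efficiency gain in percent, following Eq. 10.

    Assumption: the proposed algorithm's total cost is its sensitivity
    analysis cost plus the one-time pre-fetching cost.
    """
    return (1.0 - (proposed_ms + prefetch_ms) / traditional_ms) * 100.0

# Mission time 10^4 h, sampling step 10000 h, 5 analysis times.
print(round(efficiency_gain(0.83, 0.57, 0.13), 1))
# Mission time 10^3 h, sampling step 2000 h, 5 analysis times.
print(round(efficiency_gain(0.59, 0.46, 0.28), 1))
```

A negative value means that, for so few analysis time points, the pre-fetching overhead outweighs the savings from the smaller Markov model.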

Fig. 7 and Fig. 8 show the differences between the exact (solid line) and the approximated (dotted and dashed lines) sensitivities of the crossbar switch and the system supervisor.

Finally, we have been developing a software toolkit called DyFA (Dynamic Fault-trees Analyzer), which offers end users a friendly interface for laying out dynamic fault trees and automates reliability and sensitivity analysis tasks. Fig. 9 shows DyFA's architecture. It reads both dynamic and static fault trees from either files or the layout interface, and automatically identifies the dynamic modules (if any) and translates them into corresponding Markov models. The decomposition algorithm is then applied to these Markov models, and they are solved with the Markov model solver, which is implemented with CVODE. Finally, the analysis results, including system reliability and component sensitivities, are output to the screen or a printer. Figs. 10-12 show screenshots of several working views of DyFA.

V. CONCLUSION

We have presented an enhanced approach for the sensitivity analysis of dynamic fault tree models with dependencies. The approach first applies a modularization algorithm to the fault tree. It then identifies the independent subtrees of the dynamic modules and replaces them with basic events in order to reduce the state space of the Markov model. In addition, the sample points pre-fetching algorithm is proposed for computing approximate component sensitivities. Finally, the modules are solved hierarchically, which saves time through the reduced Markov model without an unacceptable loss of accuracy in the sensitivity assessment. The approach reduces the time cost of analyzing component sensitivities when the dynamic modules of the fault tree contain one or more independent subtrees. The experimental results show a significant improvement in overall efficiency over the traditional algorithm when a reasonable number of pre-fetched sample points is used. Dynamic fault trees and sensitivity analysis methodology will be the focus of our future work.

ACKNOWLEDGMENT

We would like to express our gratitude for the support of the National Science Council, Taiwan, under Grant NSC 94-2213-E-007-087. This work was also substantially supported by a grant from the Ministry of Economic Affairs (MOEA) of Taiwan (Project No. 94-EC-17-A-01-S1-038).

Fig. 7. Sensitivity of crossbar switch

Fig. 8. Sensitivity of system supervisor