DATA-DRIVEN EXTRACTION AND ANALYSIS OF REPAIRABLE FAULT TREES FROM
TIME SERIES DATA
Parisa Niloofar, a
Sanja Lazarova-Molnar b, a
a Mærsk Mc-Kinney Møller Institute, University of Southern Denmark, Campusvej 55, Odense, 5230,
DENMARK
b Institute of Applied Informatics and Formal Description Methods, Karlsruhe Institute of Technology,
Kaiserstr. 89, Karlsruhe, 76133, Germany
Corresponding author
Email addresses: parni@mmmi.sdu.dk (P. Niloofar), sanja.lazarova-molnar@kit.edu (S. Lazarova-Molnar)
Niloofar and Lazarova-Molnar
ABSTRACT
Fault tree analysis is a probability-based technique for estimating the risk of an undesired top event,
typically a system failure. Traditionally, building a fault tree requires the involvement of knowledgeable
experts from different fields relevant to the system under study. Modern systems, however, integrate
numerous Internet of Things (IoT) devices and generate large amounts of data that can be used
to extract fault trees that reflect the true fault-related behavior of the corresponding systems. This is
especially relevant as systems typically change their behaviors during their lifetimes, rendering initial fault
trees obsolete. For this reason, we are interested in extracting fault trees from data that systems generate
during their lifetimes. We present the DDFTAnb algorithm for learning fault trees of systems using
time series data from observed faults, enhanced with Naïve Bayes classifiers for estimating the future
fault-related behavior of the system for unobserved combinations of basic events, where the state of the top
event is unknown. Our proposed algorithm extracts repairable fault trees from multinomial time series data,
classifies the top event for the unseen combinations of basic events, and then uses proxel-based simulation
to estimate the system’s reliability. Furthermore, we assess the sensitivity of our algorithm to different
levels of data availability. Results indicate that DDFTAnb performs well at low levels of data
availability; however, when sufficient or large amounts of data are available, there is no need to classify the
top event.
Keywords: Classification, data-driven simulation, fault tree analysis, multi-state system, proxel-based
simulation, reliability analysis.
1 NOMENCLATURE
AADL = Architecture Analysis & Design Language
ACC = Accuracy
AltaRica = Altarica Language and Its Semantics
BE = Basic Event
DAG = Directed Acyclic Graph
DDFTA = Data Driven Fault Tree Analysis
DDFTAnb = Data Driven Fault Tree Analysis enhanced with Naïve Bayes classifier
Dij = Disk number ij
FN = False Negative
FP = False Positive
FTA = Fault Tree Analysis
f/h = failures per hour
HiP-HOPS = Hierarchically Performed Hazard Origin & Propagation Studies
IE = Intermediate Event
IFT = Induction of Fault Trees
ILTA = Interpretable Logic Tree Analysis
IoT = Internet of Things
LIFT = Learning Fault Trees from observational data
MAP = Maximum a Posteriori
MBDA = Model Based Dependability Analysis
MCS = Minimal Cut Sets
Mi = Memory number i
MILTA = Multi-Level Interpretable Logic Tree Analysis
MP = Multiprocessor
MTTF = Mean Time to Failure
MTTR = Mean Time to Repair
NB = Naïve Bayes
Pi = Processor number i
Proxel = Probability Elements
PS = Power Supply
RBC = Radio Block Center
RMSE = Root Mean Square Error
SHyFTA = Stochastic Hybrid Fault Tree Automaton
T = Total time
TE = Top Event
TN = True Negative
TP = True Positive
= Unavailability at time step i
= Estimated unavailability at time step i
Δt = Size of a time step
2 INTRODUCTION
Fault Tree Analysis (FTA) is a prominent method in analysing safety and reliability of systems (Vesely et
al. 1981; Lee et al. 1985; Ruijters and Stoelinga 2015). While in most of the real-world cases, it is necessary
to consider both failures and repairs for components of a system, traditional fault trees do not consider
repairable components. Repairable fault trees address this issue and consider information not only about
failure times of basic components, but also about maintenance or repairs within a system.
Multi-state fault trees have the same structure as regular fault trees, but the components or the system
may have more than two functioning levels. If the system and its components either function completely
or fail, reliability analysis of the system takes a binary perspective. Nonetheless, there are systems that
operate at various levels of performance, which usually yields more than two states associated with basic
events (Lisnianski and Levitin 2003). Several studies have been dedicated to analysing these types of systems
(Compare et al. 2017; Barlow and Heidtmann 1984; Nadjafi et al. 2017; Caldarola 1980).
Many extensions of fault trees have been proposed in the literature, each with its own
shortcomings and assumptions. However, even with the emerging availability of data through Internet of
Things (IoT) devices and all existing software tools, fault tree analysis still requires a lot of manual effort
and expert knowledge. The possibility of using data-driven methods to extract information about the
status of a system under study has thus not yet been fully explored. Data-driven approaches are gaining traction
in many areas for their ability to analyse data from a system to derive the system’s behaviour (Huang et al.
2021; Solomatine and Ostfeld 2008). Big data are nowadays collected in a large portion of manufacturing
systems, especially in non-safety-critical systems, where faults are a more common occurrence and do not
have associated catastrophic consequences. Data-driven fault detection, diagnosis and prediction are
well studied using machine learning and data mining methods (Dogan and Birant 2021; Ayvaz and Alpay 2021).
However, completely ignoring human cognitive capabilities and expert knowledge causes a great loss of
information, which might only be compensated for by collecting large amounts of data, which is costly in many
respects. Hence, a systematic method for fusing data and expert knowledge can increase the accuracy of the
reliability analysis of a system (Niloofar and Lazarova-Molnar 2021).
Models extracted from observational data are often used to predict future behaviours of systems under
study for unseen inputs of the models. When it comes to learning/extracting fault tree models from fault
records of the components in a system, one also needs to know how unseen events/failures of components
or combinations of events/failures of components impact the overall system, i.e., whether they lead to
system failures or not. To address the issue of unavailability of data for unseen events or combinations of
events, we present DDFTAnb, which adds a classification functionality to the DDFTA algorithm introduced
by Lazarova-Molnar et al. (2020), in an attempt to forecast the system’s behaviour for unobserved combinations
of basic events. Our proposed extended algorithm, DDFTAnb, first extracts repairable multi-state fault trees
from observational multinomial time series data, then analyses the results to estimate reliability and
maintainability distributions of basic events, and finally estimates the future behaviour of the system for
the unobserved occurrences of combinations of basic events. Data-driven modelling can detect hidden
causes of a system failure that are not evident from expert knowledge alone. However, expert
knowledge can be incorporated into the model as prior information and can also be used for model
validation. The highlights of the paper are:
1. Learning structures of repairable multi-state fault trees using only time series data from
faults and other relevant basic events that contribute to a system failure
2. Working with reliability and maintainability distributions other than exponential
3. Estimating future fault-related behaviours of systems in terms of fault tree structures and
systems’ reliabilities
4. Simulating fault trees using proxel-based simulation, which is especially well suited for
simulating complex stochastic dependencies and different probability distributions.
The rest of the paper is organized as follows. Section 3 presents the literature review. Section 4 provides
the methodology, where DDFTAnb and its classification module are described in detail. In Section 5, two
case studies are demonstrated, and finally, in Section 6, we conclude the paper.
3 LITERATURE REVIEW
Classic FTA is primarily knowledge-driven, rather than data-driven. This can limit the capacity of the
generated fault trees in depicting the true fault-related behaviours of the corresponding systems, especially
as systems often evolve and change their behaviours during their lifetimes, rendering initial fault trees
obsolete. Model building based on experts’ knowledge alone is becoming outdated with evolving system
designs, data collection technologies, and blockchain-based data storage and access frameworks.
Automation of extracting dependability information from system models has led to the field of model-
based dependability analysis (MBDA) (Sharvia et al. 2016; Aizpurua and Muxika 2013; Kabir 2017).
Different tools and techniques have been developed as part of MBDA to automate the generation of
dependability analysis artefacts such as fault trees (Papadopoulos and McDermid 1999; Feiler et al. 2006;
Arnold et al. 2000). Besides model-based approaches, statistical and artificial intelligence methods are
another solution for automating or semi-automating the extraction of dependability information from
systems (Jardine et al. 2006; Zhang et al. 2017).
While model-based methodologies need information about the physical characteristics of the system
to establish an explicit mathematical model, statistical models use historical data to represent
and predict the future behaviour of a system. In addition, artificial intelligence techniques are suitable for
addressing complex and large-scale nonlinear problems and mostly require no statistical assumptions
about the data. Neural computation, evolutionary algorithms, and fuzzy computing, as different
categories of computational intelligence (which is a branch of artificial intelligence), have been applied
to fault detection and classification (Chen et al. 2008; Zheng et al. 2017; Theodoropoulos et al. 2021; Brito
et al. 2022). The application of (explainable) artificial intelligence techniques to fault diagnosis and
machinery monitoring is a subject based on the theory of signal processing and pattern recognition, and
these techniques are mostly used for estimating the remaining useful life (Sikorska et al. 2011; Ayvaz and
Alpay 2021). However, some researchers combine artificial intelligence and statistical approaches with
FTA, as we note in the following.
Lampis and Andrews (2009) applied Bayesian Belief networks to diagnose faults in a system. They
first constructed fault trees to indicate how the component failures can combine to cause unexpected
deviations in the variables monitored by the sensors, and then converted these fault trees into Bayesian
networks for further analysis. Cai et al. (2015) addressed a case study of a subsea pipe ram BOP system by
proposing a new method for real-time reliability analysis through a combination of traditional and dynamic
Bayesian networks. In this study, prior reliability knowledge of the system (failure distributions) is updated
via dynamic Bayesian networks. In FTA, basic components are assumed to be independent, which is a
strong assumption for some dynamic systems. Guo et al. (2021) proposed a reliability analysis model for
dynamic systems with common cause failures based on discrete-time Bayesian networks. They applied their
model to fault diagnosis of a digital safety-level distributed control system of nuclear power plants. These
studies do not use observational/historical data to build or learn the fault tree structure, although some
researchers use data to update or estimate the failure rates (Cai et al. 2015).
Observational data were used to generate FTs with the IFT (Induction of Fault Trees) algorithm (Nolan
et al. 1994), based on standard decision-tree statistical learning. Later, Liggesmeyer and Rothfelder (1998)
coined the term formal risk analysis and developed an approach for automatically generating a fault tree
from finite state machine-based descriptions of a system, where the generated fault tree is complete with
respect to all failures assumed possible. Mukherjee and Chakraborty (2007) described a technique to
automatically generate fault trees using historical maintenance data in text form. Their technique relies on
domain knowledge and linguistic analysis. Majdara and Wakabayashi (2009) presented a new system
modelling approach, composed of components and different types of flows propagating through them,
for computer-aided fault tree generation. Chiacchio et al. (2016) combined the Dynamic Fault Tree
technique and the Stochastic Hybrid Automaton within the Simulink environment, which represented an
important step towards delivering a user-friendly computer-aided tool for dynamic reliability.
They also developed a library called Stochastic Hybrid Fault Tree Automaton (SHyFTA) that allows
accurate dependability analysis of repairable multi-state systems (Chiacchio et al. 2020). Nauta et al. (2018)
introduced LIFT (Learning Fault Trees from observational data) to learn structures of static fault trees from
untimed databases with Boolean event variables; however, their method needs information about
intermediate events. Linard et al. (2019) applied an evolutionary algorithm to learn fault trees from untimed
Boolean basic event variables. Instead of the independence test in the LIFT algorithm, they used a
score-based algorithm to extract fault trees. Furthermore, Waghen and Ouali (2019) proposed interpretable logic
tree analysis (ILTA), which characterizes and quantifies event causality occurring in engineering systems
with the minimum involvement of human experts. Their method integrates two concepts,
knowledge discovery in databases and fault tree analysis, and was later extended to multi-level interpretable
logic tree analysis (MILTA) (Waghen and Ouali 2021). Qian et al. (2021), for the first time, applied association rule
analysis to extract fault trees from the overhead contact system of an electrified railway. They first transformed
the failure records of the overhead contact system into a transaction database, and then converted the association
rules extracted from the data into a fault tree. Lazarova-Molnar et al. (2020) introduced the DDFTA algorithm,
which uses time series data of faults to extract the repairable multi-state fault tree of a system.
The above-mentioned techniques have different requirements; however, except for the work of
Lazarova-Molnar et al. (2020), they cannot extract reliability models from time series data recorded from
multi-state/repairable systems. Also, labelling the top event is not studied in the literature. Time series data
of a system consist of a sequence of status change times for each basic event and the system state. In this
study, we follow the work of Lazarova-Molnar et al. (2020) and add a classification module so that the
algorithm not only extracts repairable multi-state fault trees from observational data, but also makes
predictions about the future reliability state of the system. Being able to label the system state (classify the top
event) becomes more important when the system contains rare events (or components with rare failures),
which is the case for safety-critical systems, or for systems composed of so many components that observing
all the possibilities becomes infeasible and unrealistic.
In the following, we provide background on the relevant concepts and methods that we refer to in this
paper, i.e., repairable multi-state fault trees, Naïve Bayes classifier and proxel-based simulation.
3.1.1 Repairable Multi-State Fault Trees
A fault tree is a Directed Acyclic Graph (DAG) whose leaves are the basic events (typically basic faults),
and whose root represents the top event, which is typically a system failure. The gates in a fault tree represent
the propagation of failure through the tree (Ruijters and Stoelinga 2015). Multi-state fault trees have the
same structure as regular fault trees, except that the components or the system may have more than two
functioning levels. In other words, the state space of the system and its components may be represented by
$\{0, 1, \ldots, M\}$, where 0 indicates a completely failed state, M indicates a perfectly working state, and the
others are degraded states. Repairable fault trees consider both faults and repairs within a system. Hence,
for each basic event that is typically associated with a fault, there are probability distributions that describe
the fault's occurrences and repair times.
There are two essential analysis techniques for fault trees: qualitative analysis and quantitative analysis.
Qualitative analysis considers the structure of the fault tree, while quantitative analysis computes failure
probabilities, reliability, etc. of the system represented by the fault tree. The first step towards computing
the reliability of a system is to extract the structure of the system's underlying fault tree. Once the structure of
the fault tree is extracted, we can use the probability distribution functions of the basic events to calculate
the reliability of the system, the likelihood of a top event occurrence, as well as those of the basic events
that have caused the occurrence of the top event. The results of quantitative analysis give analysts an
indication of system reliability and also help to determine which components or parts of the system are
more critical, so that analysts can put more emphasis on these components or parts by taking the necessary
steps, e.g., including redundant components in the system model (Kabir 2017).
3.1.2 Naïve Bayes Classifier
The Naïve Bayes (NB) classifier is a probability-based supervised learning classification method which is
well studied in the literature. NB is among the simplest Bayesian Network models and has received much
attention due to its simple classification model and excellent classification performance. An early
description can be found in Duda and Hart (1973). Domingos and Pazzani (1996) discuss its feature
independence assumption and explain why Naïve Bayes performs well for classification even with such an
over-simplified assumption.
In this paper, we apply NB to classify the state of the system for the unobserved combinations of basic
events. Basic events, which are always considered independent, are the features in NB, and the top event is
the class variable. We use observed data from fault occurrences related to basic events as a training set to
fit the NB model, and we use the unobserved combinations of basic events as a testing set. NB first
calculates the posterior for the top event and then applies the maximum a posteriori (MAP) decision rule:
the label is the class with the maximum posterior. Those combinations of basic events in the testing set for
which the top event is classified as “failed”, along with those in the training set, where the state of the top
event is “failed”, are considered cut sets. These predicted cut sets are used to extract minimal cut sets that
will construct the predicted behaviour of the system in terms of a fault tree.
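The maximum a posteriori rule described above can be sketched in a few lines. This is a minimal, pure-Python Bernoulli Naïve Bayes, not the authors' implementation: the function names, the Laplace smoothing constant `alpha`, and the small training set are assumptions made for illustration.

```python
from collections import Counter
from math import log

def fit_naive_bayes(rows, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on binary basic-event vectors.

    alpha is a Laplace smoothing constant (an assumption, not from the paper).
    """
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # ones[c][j] = number of training rows of class c in which feature j is 1
    ones = {c: [0] * n_features for c in class_counts}
    for row, c in zip(rows, labels):
        for j, value in enumerate(row):
            ones[c][j] += value
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    likelihoods = {
        c: [(ones[c][j] + alpha) / (class_counts[c] + 2 * alpha)
            for j in range(n_features)]
        for c in class_counts
    }
    return priors, likelihoods

def predict_map(model, row):
    """Return the class with the maximum (log) posterior for one row."""
    priors, likelihoods = model
    scores = {}
    for c in priors:
        score = log(priors[c])
        for j, value in enumerate(row):
            p = likelihoods[c][j]
            score += log(p if value == 1 else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical training data: columns are (BE1, BE2, BE3); label 1 = TE failed.
train_rows = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 1)]
train_labels = [0, 0, 0, 1, 1, 1]
model = fit_naive_bayes(train_rows, train_labels)
label_unseen = predict_map(model, (0, 0, 1))   # an unobserved combination
label_all_ok = predict_map(model, (0, 0, 0))
```

Unobserved combinations labelled 1 would be added to the observed cut sets before the minimal cut sets are extracted.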
A challenging point in data-driven modelling of faults for classification tasks is the imbalanced
proportion of classes, as faults are rarely observed, especially for highly reliable systems. Hence, we are
faced with an imbalanced classification problem where one class of the dependent (response) variable (here,
working state) outnumbers the other class (failed state) by a large proportion. There are many ways to
combat this issue, of which the best is to accumulate more data. This, however, is not possible in our
case. Another approach is to manually balance the classes. One common method of doing this is to
upsample/oversample the minority class or undersample the majority class using resampling
(bootstrapping) techniques. In this study, we upsample the faulty state using bootstrapping to balance the
classes.
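One way to picture the upsampling step is the sketch below, which draws minority-class rows with replacement until both classes have the same size. The function name, the fixed seed, and the choice to balance exactly are assumptions; the rows reuse the BE1, BE4 and TE values of Table 1.

```python
import random

def upsample_minority(rows, labels, minority_label, seed=0):
    """Resample minority-class rows with replacement until classes are balanced.

    Assumes two classes and at least one minority-class row.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_size = sum(1 for y in labels if y != minority_label)
    n_extra = max(0, majority_size - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(rows) + extra, list(labels) + [minority_label] * n_extra

# (BE1, BE4) states and TE labels as listed in Table 1.
rows = [(0, 1), (0, 1), (0, 0), (1, 0), (0, 0),
        (0, 0), (0, 1), (0, 0), (1, 0), (1, 1)]
labels = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = upsample_minority(rows, labels, minority_label=1)
```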
3.1.3 Proxel-Based Simulation
Proxel-based simulation is a state space-based simulation method to compute transient solutions for discrete
stochastic systems. It relies on a user-definable discrete time step and computes the probability of all
possible single state changes (and the case that no change happens at all) during a time step. The target
states along with their probabilities are stored as so-called proxels (short for probability elements). To
account for aging (i.e., non-Markovian) transitions, proxels contain supplementary variables that keep track
of the ages of all active and all race-age transitions. For each proxel created, the algorithm iteratively
computes all successors for each time step. This results in a tree of proxels where all proxels having the
same distance from the tree root belong to the same time step and all leaf proxels represent the possible
states reached at the end of the simulation.
Proxel-based simulation explores all possible future developments of the system in discrete time steps, each
with a computable probability based on the distribution functions that describe the events and on the time
the events have been pending. It determines all possible follow-up states and the
probabilities of the corresponding state transitions. Proxel-based simulation is well known for its ability
to cope with stiff models, which fault models typically are (Lazarova-Molnar and Horton 2003;
Lazarova-Molnar 2005).
4 METHODOLOGY
In this section, we describe the methods and techniques that we developed to enable data-driven
reliability modelling and analysis, i.e., to extract repairable multi-state fault trees from observational data and to
estimate the future reliability state of the system. The overall framework that describes the high-level
workflow of the DDFTAnb algorithm is shown in Figure 1, and more detailed workflows are illustrated in
Figure 2 and Figure 3.
Figure 1: Overall framework of DDFTAnb algorithm with the classification module.
DDFTA, as illustrated in Figure 2, comprises three steps (Lazarova-Molnar et al. 2020): 1)
converting time series data of faults into a truth table with time steps, 2) structure learning and parameter
learning of the fault tree, and 3) estimating reliability measures. To learn the structure of the fault tree, we
extract the minimal cut sets (MCS) from the time series data set, and then use Boolean algebra to build a
fault tree that is intended to be logically equivalent to the true fault tree of the system. For parameter
learning, we estimate the reliability and repair distribution functions of the basic events. These, along with
the fault tree structure, are inputs to the proxel-based simulation in the final step, which is used to calculate
the system's reliability measures in the form of complete transient solutions. The classification module of
the DDFTAnb algorithm is described in Section 4.1.
Figure 2: The process workflow of the DDFTA algorithm.
4.1 Classification Module for DDFTAnb Algorithm
The basic DDFTA approach performs reliability analysis based on observed components’ faults. DDFTA,
however, does not provide a robust solution for very rare events or cases of small amounts of data with low
resolution, where not all possible combinations of basic events have occurred and the corresponding top
event statuses are unknown. The DDFTA approach (Figure 2) begins by converting time series data of
faults to a truth table with time steps. The next steps are structure learning and parameter learning, and the
final step is estimating reliability measures. In this section, the classification module of the advanced
DDFTAnb approach, as illustrated in Figure 3, is described in detail.
The classification module for the DDFTAnb algorithm consists of six steps: 1) dividing the truth table
with time steps into training and testing sets, 2) fitting a Naïve Bayes classifier to the training set, 3)
classifying the top event for the testing set using the fitted Naïve Bayes model, 4) learning the fault
tree structure from the combination of the training and testing data sets, 5) learning the fault tree parameters
from the training set, and finally 6) estimating reliability measures. DDFTAnb with its classification module
is best explained through an illustrative example in the next section.
Figure 3: Workflow of classification module for DDFTAnb algorithm.
4.2 Illustrative example
Assume that time series data on faults for a system with five basic components (BEi, i=1, 2, …, 5) and the
top event (TE) are collected until a specific point in time. The goal is to use these observed time series data
to assess the current reliability of the system and estimate the future structure of the system’s fault tree as
well as its reliability measures. For simplicity and without loss of generality, we assume that the observed
data contain 10 records as in Table 1.
Table 1: Time series data of faults converted into a truth table.
Time       BE1  BE4  TE
17.96968   0    1    0
18.63438   0    1    1
20.1585    0    0    0
21.11844   1    0    0
21.52825   0    0    0
22.12907   0    0    0
23.07983   0    1    0
24.67361   0    0    0
24.74219   1    0    0
25.01376   1    1    1
4.2.1 DDFTA algorithm
According to DDFTA, the first step is to convert the time series data of faults to a truth table with time
stamps. Table 1 shows the time-stamped truth table of the collected data, where 0 indicates the working state
and 1 indicates the failed state.
Structure Learning: To build the structure of the fault tree, we need to extract the minimal cut
sets. The rows in Table 1 where the system has failed (TE has label 1) indicate the cut sets
(Table 2), and since they cannot be reduced to smaller cut sets, they are also minimal cut sets. These
minimal cut sets build the structure of the fault tree (Table 3), which is also shown in Figure 4.
Table 2: Sets of cut sets and minimal cut sets for the truth table data of Table 1.
Cut sets           Minimal cut sets
{BE2, BE3, BE4}    {BE2, BE3, BE4}
{BE1, BE3, BE4}    {BE1, BE3, BE4}
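The reduction from cut sets to minimal cut sets can be sketched as a subset check: a cut set is minimal if no other cut set is a proper subset of it. The helper names and the two five-component rows below are illustrative assumptions chosen to be consistent with Table 2, not the authors' code.

```python
def cut_sets_from_truth_table(rows, names):
    """Collect the set of failed basic events for every row where TE == 1."""
    return [
        {name for name, value in zip(names, states) if value == 1}
        for states, te in rows
        if te == 1
    ]

def minimal_cut_sets(cut_sets):
    """Keep only cut sets that have no proper subset among the cut sets."""
    unique = {frozenset(cs) for cs in cut_sets}
    return [cs for cs in unique if not any(other < cs for other in unique)]

names = ["BE1", "BE2", "BE3", "BE4", "BE5"]
# Illustrative rows: ((BE1..BE5 states), TE). Only the TE == 1 rows yield cut sets.
rows = [
    ((0, 1, 1, 1, 0), 1),
    ((0, 0, 1, 1, 0), 0),
    ((1, 0, 1, 1, 0), 1),
]
cut_sets = cut_sets_from_truth_table(rows, names)
mcs = minimal_cut_sets(cut_sets)

# A superset of an existing cut set is dropped by the reduction.
reduced = minimal_cut_sets(cut_sets + [{"BE1", "BE2", "BE3", "BE4"}])
```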
Table 3: Constructing the fault tree based on the minimal cut sets of Table 2.
Step  Boolean representation
1     TE=(BE1.BE3.BE4)+(BE2.BE3.BE4)
2     TE=(BE1+BE2).(BE3.BE4)
3     TE=IE1.IE2
Figure 4: Fault tree constructed from data of Table 1.
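The factoring in Table 3 can be checked by exhaustive enumeration. The sketch below assumes, as steps 2 and 3 imply, that IE1 = BE1+BE2 and IE2 = BE3.BE4, and compares the sum-of-products form with the factored form on all 16 input combinations.

```python
from itertools import product

def te_sum_of_products(be1, be2, be3, be4):
    """Step 1 of Table 3: TE = (BE1.BE3.BE4) + (BE2.BE3.BE4)."""
    return (be1 and be3 and be4) or (be2 and be3 and be4)

def te_factored(be1, be2, be3, be4):
    """Steps 2 and 3 of Table 3: TE = IE1.IE2 (assumed intermediate events)."""
    ie1 = be1 or be2      # OR gate
    ie2 = be3 and be4     # AND gate
    return ie1 and ie2

equivalent = all(
    bool(te_sum_of_products(*states)) == bool(te_factored(*states))
    for states in product([False, True], repeat=4)
)
```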
To extract the minimal cut sets of a multi-state fault tree, each multi-state event with m (>2) states
is converted into m-1 binary events. Assume a system with three basic events {BE1,
BE2, BE3}, in which BE1 has three states: working (0), failed (1) and idle (2), and the recorded data of
Table 4. Here, CS={{BE2, BE3}, {BE1_1, BE2, BE3}, {BE1_1, BE3}} is the set of cut sets, and
hence the minimal cut sets are MCS={{BE2, BE3}, {BE1_1, BE3}}. Finally, the fault tree equals
TE=(BE2.BE3)+(BE1_1.BE3)=BE3.(BE2+BE1_1), which is also illustrated in Figure 5.
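The conversion of an m-state event into m-1 binary events amounts to one indicator per non-working state. A minimal sketch follows; the function name is hypothetical, while the `BE1_1`, `BE1_2` naming follows Table 4.

```python
def to_binary_events(name, state, n_states):
    """Encode a state in {0, ..., n_states - 1} as n_states - 1 indicator events.

    State 0 (working) maps to all zeros; state k > 0 sets only `name_k` to 1.
    """
    return {f"{name}_{k}": int(state == k) for k in range(1, n_states)}

working = to_binary_events("BE1", 0, 3)
failed = to_binary_events("BE1", 1, 3)
idle = to_binary_events("BE1", 2, 3)
```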
Table 4: Truth table with a multi-state event BE1 (left) turned into a truth table with binary events (right).
Time       BE2  TE  |  Time       BE1_1  BE1_2  TE
17.96968   0    0   |  17.96968   0      0      0
18.63438   1    1   |  18.63438   0      0      1
20.1585    1    1   |  20.1585    0      1      1
21.11844   1    0   |  21.11844   1      0      0
21.52825   1    0   |  21.52825   0      0      0
22.12907   0    0   |  22.12907   0      0      0
23.07983   0    0   |  23.07983   0      1      0
24.67361   0    0   |  24.67361   0      0      0
24.74219   0    0   |  24.74219   1      0      0
25.01376   0    1   |  25.01376   1      0      1
Figure 5: Multi-state fault tree extracted from the data in Table 4.
Parameter Learning: Once the structure of the fault tree is extracted from data, we use it to
calculate the reliability metrics of the constructed fault tree (here, the fault tree of Figure 4). The
first step of the quantitative analysis is to estimate the reliability and maintainability probability
distribution functions of the basic events, based on the time series data. Suppose we are interested
in estimating the reliability distribution of BE1. We calculate the times to failure by looking at the
points in time where the state of the basic event changes from working (label = 0) to failed (label =
1). For example, the first two times to failure for BE1 are:
r1 = 21.11844 - 17.96968 = 3.14876, r2 = 24.74219 - 21.52825 = 3.21394
Similarly, times to repair are calculated by looking at the points in time where the state of the basic
event changes from failed (label = 1) to working (label = 0). Hence, m1 = 21.52825 - 21.11844 =
0.40981. The ri’s and mi’s are then used to estimate not only the parameters of the reliability and
maintainability distributions, but also the types of the distributions themselves, because our algorithm
can cope with distributions other than the common exponential distribution.
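The times to failure and times to repair above can be derived mechanically from the state changes of a basic event. The sketch below uses the Time and BE1 columns of Table 1; the function name is hypothetical, and it assumes the first observation marks the start of a working period.

```python
def times_to_failure_and_repair(times, states):
    """Return (TTFs, TTRs) for one basic event from time-stamped 0/1 states.

    A time to failure runs from the start of a working period to the next
    0 -> 1 change; a time to repair runs from that failure to the next
    1 -> 0 change. Assumes the series starts in the working state.
    """
    ttf, ttr = [], []
    period_start = times[0]
    for prev, curr, t in zip(states, states[1:], times[1:]):
        if prev == 0 and curr == 1:
            ttf.append(t - period_start)
            period_start = t
        elif prev == 1 and curr == 0:
            ttr.append(t - period_start)
            period_start = t
    return ttf, ttr

times = [17.96968, 18.63438, 20.1585, 21.11844, 21.52825,
         22.12907, 23.07983, 24.67361, 24.74219, 25.01376]
be1 = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]
ttf, ttr = times_to_failure_and_repair(times, be1)
```

Up to floating-point rounding, `ttf` holds r1 and r2 and `ttr` holds m1 from the text.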
The R packages gamlss (Rigby and Stasinopoulos 2005) and fitdistrplus (Delignette-Muller
and Dutang 2015) cover a wide range of probability distributions supported on the interval [0, ∞).
Hence, we applied these R packages for the distribution fitting part. MTTF and MTTR for each
basic event are the means of the reliability and maintainability distributions, respectively.
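As a rough illustration of choosing a distribution family from the times to failure, the sketch below fits an exponential and a lognormal distribution by maximum likelihood and compares them by AIC. It is a pure-Python stand-in, not the gamlss or fitdistrplus procedure used in the paper; the two candidate families, the AIC criterion, and the sample values are assumptions made for illustration.

```python
from math import exp, log, pi

def fit_exponential(samples):
    """Maximum-likelihood exponential fit; returns rate, mean, and AIC."""
    n, total = len(samples), sum(samples)
    rate = n / total
    loglik = n * log(rate) - rate * total
    return {"family": "exponential", "rate": rate,
            "mean": 1.0 / rate, "aic": 2 * 1 - 2 * loglik}

def fit_lognormal(samples):
    """Maximum-likelihood lognormal fit; returns mu, sigma2, mean, and AIC."""
    logs = [log(x) for x in samples]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n  # requires non-identical samples
    loglik = -sum(logs) - 0.5 * n * log(2 * pi * sigma2) - 0.5 * n
    return {"family": "lognormal", "mu": mu, "sigma2": sigma2,
            "mean": exp(mu + sigma2 / 2), "aic": 2 * 2 - 2 * loglik}

# Hypothetical times to failure; not taken from the paper.
ttf = [3.1, 3.2, 2.9, 3.4, 3.0, 3.3]
fits = [fit_exponential(ttf), fit_lognormal(ttf)]
best = min(fits, key=lambda f: f["aic"])
mttf = best["mean"]  # MTTF is the mean of the fitted reliability distribution
```

For tightly clustered samples such as these, the exponential family fits poorly, which is the situation in which non-exponential distributions matter.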
For exponential distributions, unavailability of an event is MTTR/(MTTR+MTTF), but for
non-exponential distributions we need more advanced methods to calculate the unavailability of
the system. Unavailability is the probability that the component or system is not operational.
Proxel-based simulation (Lazarova-Molnar and Horton 2003; Niloofar and Lazarova-Molnar 2022)
is not limited to exponential distributions, and can be used to determine the instantaneous
unavailability of basic components with nonexponential distributions and multi-state events.
Assume a binary repairable basic event where the reliability distribution is estimated as an
Exponential distribution function with rate 0.1 and the estimated repair distribution function is
Normal with mean 2 and standard deviation 1. Figure 6 illustrates the first three time steps
of the proxel simulation process for this basic event. Each proxel is a vector with three elements:
State, Age intensity (which tracks the time that each of the possible state changes has been pending)
and Probability.
Assuming that ∆t = 0.1, the detailed calculations of p1, p2 and p3 are given in Figure 6.
[Equations for p1, p2 and p3: not recoverable from the source text.]
Figure 6: First three time steps of proxel simulation.
We then propagate the unavailability of each individual component through the fault tree to
calculate the unavailability of the system. Figure 7 shows the unavailabilities of the basic events and
the top event, for the fault tree from Figure 4.
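The proxel mechanics for the single basic event above (Exponential failure with rate 0.1, Normal(2, 1) repair, ∆t = 0.1) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the probability of a state change within one step is approximated as the instantaneous rate f(τ)/(1 − F(τ)) at the start of the step multiplied by ∆t, proxels with the same state and age are merged, and proxels with negligible probability are dropped. All names are hypothetical.

```python
import math

DT = 0.1  # size of the time step, as in the text

def exp_hazard(age, rate=0.1):
    """Instantaneous rate of an Exponential(rate) failure time (constant in age)."""
    return rate

def normal_hazard(age, mu=2.0, sigma=1.0):
    """Instantaneous rate f(age) / (1 - F(age)) of a Normal(mu, sigma) repair time."""
    z = (age - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2.0))
    return pdf / survival if survival > 0.0 else math.inf

HAZARD = {"W": exp_hazard, "F": normal_hazard}   # pending state change per state
NEXT_STATE = {"W": "F", "F": "W"}

def proxel_unavailability(n_steps, dt=DT, cutoff=1e-15):
    """Return the probability of being in state F after 0, 1, ..., n_steps steps."""
    proxels = {("W", 0): 1.0}      # (state, age in steps) -> probability
    unavailability = [0.0]
    for _ in range(n_steps):
        nxt = {}
        for (state, k), prob in proxels.items():
            p_change = min(1.0, HAZARD[state](k * dt) * dt)
            successors = [
                ((NEXT_STATE[state], 0), prob * p_change),   # state change, age reset
                ((state, k + 1), prob * (1.0 - p_change)),   # no change, age grows
            ]
            for key, p in successors:
                if p > cutoff:     # drop proxels with negligible probability
                    nxt[key] = nxt.get(key, 0.0) + p
        proxels = nxt
        unavailability.append(sum(p for (s, _), p in proxels.items() if s == "F"))
    return unavailability

u = proxel_unavailability(50)      # u[i] is the unavailability after i time steps
```

Under these assumptions the first step moves probability 0.1 × 0.1 = 0.01 into the failed state. Smaller values of ∆t reduce the discretisation error at the cost of more proxels per time step.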
4.2.2 DDFTAnb algorithm
The system’s fault tree along with the corresponding reliability measures are extracted from the observed
time series data of faults using DDFTA algorithm. The observed data in Table 1 is only a portion of what
can happen in a system with five components, and not all possibilities can be considered in fault tree
analysis of the system. For example, the state of the system is unknown when only basic events 4 and 5
occur and other components are working perfectly (row 13 in Table 5). In DDFTAnb’s classification
module, we address the problem of the unobserved combination of basic events.
Step 1: Obtaining training and testing sets is the first step of the classification module. To obtain the
training and testing sets, we need the power set for the 5 binary basic events. The power set for {BE1, BE2,
BE3, BE4, BE5}, i.e., the set of all possible subsets of these basic events, has 2^5 = 32 elements, as shown in
Table 5.
Figure 7: Unavailabilities of basic events along with the top event for the fault tree from Figure 4.
Table 5: The power set for the five binary basic events and the number of occurrences in parentheses.
Row  BE1  BE2  BE3  BE4  BE5  TE (#)     Row  BE1  BE2  BE3  BE4  BE5  TE
1    1    0    1    0    0    0 (1)      17   1    0    0    0    0    NA
2    0    0    1    0    0    0 (2)      18   1    0    0    0    1    NA
3    0    0    1    1    0    0 (2)      19   1    0    0    1    0    NA
4    0    1    1    0    0    0 (2)      20   1    0    0    1    1    NA
5    0    1    1    1    0    1 (1)      21   0    0    0    1    0    NA
6    1    1    1    0    0    0 (1)      22   1    0    1    0    1    NA
7    1    0    1    1    0    1 (1)      23   0    1    0    0    0    NA
8    0    0    0    0    0    0          24   0    1    0    0    1    NA
9    1    1    1    1    1    1          25   1    1    0    0    0    NA
10   1    0    1    1    1    1          26   1    1    0    0    1    NA
11   1    1    1    1    0    1          27   1    1    0    1    0    NA
12   0    1    1    1    1    1          28   1    1    0    1    1    NA
13   0    0    0    1    1    NA         29   0    0    1    1    1    NA
14   0    0    0    0    1    NA         30   1    1    1    0    1    NA
15   0    1    0    1    0    NA         31   0    0    1    0    1    NA
16   0    1    0    1    1    NA         32   0    1    1    0    1    NA
Combinations shown in rows 1 to 7 of Table 5 are observed, and the state of the system (TE) for these
combinations of basic events can be extracted from the truth table in Table 1. The number of
occurrences of each row is indicated in parentheses in the TE column. For example, row 2 corresponds to
the case where all components are working except for the one linked to basic event BE3. We see this
combination in Table 1 at times 22.12907 and 24.67361, with the state of TE as working. Clearly,
the state of TE when all basic events are working and when all of them have failed (rows 8 and 9 of Table 5)
is 0 and 1, respectively. The top event also occurs in rows 10 to 12 because the minimal cut sets {BE1, BE3,
BE4} and {BE2, BE3, BE4} (Table 2) are subsets of these rows. The state of the top event is unknown for
rows 13 to 32, because we have no information on these combinations of basic events. It is worth noting
that at this stage we have the highest percentage of missing values for TE, as not enough data is collected
from the system yet. We take rows 1 to 12 as the training set with TE as the class variable, and rows 13-32
with missing information on TE belong to the testing set.
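The construction of the training and testing sets can be sketched as below. The dictionary of known top-event states transcribes rows 1 to 12 of Table 5; the variable names are illustrative assumptions.

```python
from itertools import product

BASIC_EVENTS = ("BE1", "BE2", "BE3", "BE4", "BE5")

# Combinations with a known top-event state (rows 1 to 12 of Table 5),
# written as (BE1, ..., BE5) -> TE, with 0 = working and 1 = failed.
KNOWN_TE = {
    (1, 0, 1, 0, 0): 0, (0, 0, 1, 0, 0): 0, (0, 0, 1, 1, 0): 0,
    (0, 1, 1, 0, 0): 0, (0, 1, 1, 1, 0): 1, (1, 1, 1, 0, 0): 0,
    (1, 0, 1, 1, 0): 1, (0, 0, 0, 0, 0): 0, (1, 1, 1, 1, 1): 1,
    (1, 0, 1, 1, 1): 1, (1, 1, 1, 1, 0): 1, (0, 1, 1, 1, 1): 1,
}

# Every assignment of working/failed to the five basic events (2^5 = 32).
all_combinations = list(product((0, 1), repeat=len(BASIC_EVENTS)))

# Labelled combinations form the training set, the rest the testing set.
training_set = [(c, KNOWN_TE[c]) for c in all_combinations if c in KNOWN_TE]
testing_set = [c for c in all_combinations if c not in KNOWN_TE]
```

The split yields 12 labelled and 20 unlabelled combinations, matching the 12/32 = 37.5% coverage mentioned in the text.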
Steps 2 and 3: In the next two steps, a Naïve Bayes classifier, a supervised machine learning algorithm,
is fitted to the training set, and the state of TE is classified by applying the fitted model to the testing set.
Once we label the values of TE for these rows, we apply the method explained in Section 3.2.1 to build an
updated structure of the fault tree. The extracted fault tree at this stage is most probably not reliable enough,
because it is estimated using 12/32=37.5% of the data. Classification results of the top event for rows 13-
32 can be found in Table 6.
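To illustrate the classification step, the following is a small from-scratch Bernoulli Naïve Bayes sketch fitted to rows 1 to 12 of Table 5. It is an assumption-laden illustration rather than the implementation used in the paper: it uses Laplace smoothing with alpha = 1 and ignores the occurrence counts, so its predictions are not expected to reproduce every entry of Table 6.

```python
from math import log

TRAIN = [  # rows 1 to 12 of Table 5: ((BE1, ..., BE5), TE)
    ((1, 0, 1, 0, 0), 0), ((0, 0, 1, 0, 0), 0), ((0, 0, 1, 1, 0), 0),
    ((0, 1, 1, 0, 0), 0), ((0, 1, 1, 1, 0), 1), ((1, 1, 1, 0, 0), 0),
    ((1, 0, 1, 1, 0), 1), ((0, 0, 0, 0, 0), 0), ((1, 1, 1, 1, 1), 1),
    ((1, 0, 1, 1, 1), 1), ((1, 1, 1, 1, 0), 1), ((0, 1, 1, 1, 1), 1),
]

def fit_naive_bayes(rows, alpha=1.0):
    """Estimate class priors and P(BE_j = 1 | TE = c) with Laplace smoothing."""
    n_features = len(rows[0][0])
    model = {}
    for c in (0, 1):
        members = [x for x, te in rows if te == c]
        prior = len(members) / len(rows)
        p_failed = [(sum(x[j] for x in members) + alpha) / (len(members) + 2 * alpha)
                    for j in range(n_features)]
        model[c] = (prior, p_failed)
    return model

def classify(model, x):
    """Return the class (0 or 1) with the larger log posterior for combination x."""
    def log_posterior(c):
        prior, p_failed = model[c]
        return log(prior) + sum(log(p if xj else 1.0 - p)
                                for xj, p in zip(x, p_failed))
    return max((0, 1), key=log_posterior)

model = fit_naive_bayes(TRAIN)
predicted_te = classify(model, (1, 1, 0, 1, 0))   # unobserved combination (row 27)
```

The smoothing constant and the treatment of repeated observations are design choices; different choices can change the labels assigned to borderline combinations.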
Table 6: Classification results for the top event using Naïve Bayes classifier.
Row  BE1  BE2  BE3  BE4  BE5  Classified TE
13   0    0    0    1    1    0
14   0    0    0    0    1    0
15   0    1    0    1    0    1
16   0    1    0    1    1    1
17   1    0    0    0    0    0
18   1    0    0    0    1    0
19   1    0    0    1    0    1
20   1    0    0    1    1    1
21   0    0    0    1    0    0
22   1    0    1    0    1    0
23   0    1    0    0    0    0
24   0    1    0    0    1    0
25   1    1    0    0    0    0
26   1    1    0    0    1    0
27   1    1    0    1    0    1
28   1    1    0    1    1    1
29   0    0    1    1    1    0
30   1    1    1    0    1    0
31   0    0    1    0    1    0
32   0    1    1    0    1    0
Step 4: The rows of Table 6 with a classified TE of 1 (rows 15, 16, 19, 20, 27 and 28, shaded in the
original table) are the new cut sets that should be added to the ones in Table 2. Table
7 shows that the new cut sets substantially change the minimal cut sets, which consequently affects
the constructed fault tree, as can be seen in Table 8 and Figure 8.
Table 7: Updated sets of cut sets and minimal cut sets based on the classifications in Table 6.
Cut sets                   Minimal cut sets
{BE2, BE3, BE4}            {BE2, BE4}
{BE1, BE3, BE4}            {BE1, BE4}
{BE2, BE4}
{BE2, BE4, BE5}
{BE1, BE4}
{BE1, BE4, BE5}
{BE1, BE2, BE4}
{BE1, BE2, BE4, BE5}
Table 8: Boolean representation of the fault tree based on the minimal cut sets of Table 7.
Step  Boolean representation
1     TE = (BE1.BE4) + (BE2.BE4)
2     TE = (BE1 + BE2).(BE4)
3     TE = IE1.BE4
Figure 8: Extracted fault tree based on the minimal cut sets of Table 7.
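The reduction from cut sets to minimal cut sets in Table 7, and the factored Boolean form in Table 8, can be checked with a short sketch. The function names are illustrative; a cut set is treated as minimal when no other cut set is a proper subset of it.

```python
def minimal_cut_sets(cut_sets):
    """Keep only the cut sets that have no proper subset among the cut sets."""
    sets = {frozenset(cs) for cs in cut_sets}
    return {cs for cs in sets if not any(other < cs for other in sets)}

# Cut sets listed in Table 7.
cut_sets = [
    {"BE2", "BE3", "BE4"}, {"BE1", "BE3", "BE4"},
    {"BE2", "BE4"}, {"BE2", "BE4", "BE5"},
    {"BE1", "BE4"}, {"BE1", "BE4", "BE5"},
    {"BE1", "BE2", "BE4"}, {"BE1", "BE2", "BE4", "BE5"},
]
mcs = minimal_cut_sets(cut_sets)

def top_event(failed):
    """TE = (BE1 + BE2) . BE4, the factored form from Table 8."""
    return ("BE1" in failed or "BE2" in failed) and "BE4" in failed
```

Every cut set in Table 7 triggers `top_event`, whereas a combination such as {BE3, BE4, BE5} does not, which is consistent with row 29 of Table 6.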
Steps 5 and 6: Since the classifications of the top event obtained from the Naïve Bayes model are not time-stamped, they
cannot be used to update the estimates for the reliability and maintainability distributions. Hence, we use
the estimated distribution functions of the DDFTA algorithm and the extracted fault tree of Figure 8 to
estimate unavailability of the system through proxel-based simulation. The unavailability of the system
changes by the new structure of the fault tree and this change is depicted in Figure 9.
Figure 9: Unavailability of the system changes by the new structure of the fault tree in Figure 8.
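How basic-event unavailabilities combine through the extracted tree TE = (BE1 + BE2).BE4 can be sketched as follows, assuming statistically independent basic events. The gate helpers and the numerical values are hypothetical and serve only to show the arithmetic at a single time step.

```python
def or_gate(*u):
    """Unavailability of an OR gate with independent inputs: 1 - prod(1 - u_i)."""
    available = 1.0
    for ui in u:
        available *= 1.0 - ui
    return 1.0 - available

def and_gate(*u):
    """Unavailability of an AND gate with independent inputs: prod(u_i)."""
    unavailable = 1.0
    for ui in u:
        unavailable *= ui
    return unavailable

# Hypothetical instantaneous unavailabilities at one time step (not from the paper).
u_be1, u_be2, u_be4 = 0.02, 0.05, 0.10

# TE = IE1 . BE4 with IE1 = BE1 + BE2 (Table 8).
u_ie1 = or_gate(u_be1, u_be2)
u_te = and_gate(u_ie1, u_be4)
```

Repeating this evaluation at every time step of the simulation gives a curve of system unavailability over time.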
As more data are recorded, newly observed data can be added to the training set to update the fault tree
analysis and increase the classification accuracy.
4.3 Performance Evaluation
To measure the performance of DDFTA in depicting a system’s behaviour, we assume that the true
behaviour of that system follows a repairable fault tree with a set of reliability and maintainability
distributions as its parameters. We call this fault tree the original fault tree, and in the first simulation step,
time series data are fabricated from this model. In the second step, truth table of the generated data set with
time steps is used as an input to DDFTA algorithm. The structure of the fault tree is learnt and the
unavailability of the system is computed. Finally, using DDFTAnb, future fault tree of the system and its
unavailabilities are estimated. Hence, the performance of the presented method needs to be evaluated in
regard to three aspects: structure learning evaluation, evaluation of reliability measures estimation and
classification evaluation.
4.3.1 Structure Learning Evaluation
To compare the reconstructed fault tree with the original fault tree, we use the 2*2 confusion matrix of
Table 9, that depicts all four possible outcomes.
Table 9: 2×2 confusion matrix that depicts all four possible outcomes in classification.
                              True fault tree
Reconstructed fault tree      Identified             Not identified
Identified                    True Positive (TP)     False Positive (FP)
Not identified                False Negative (FN)    True Negative (TN)
In this confusion matrix, true positive represents the number of sets that are both in the MCS of the
reconstructed fault tree and the true fault tree (correctly identified sets). False positive is the number of sets
in the MCS of the extracted fault tree which are not in the MCS of the true fault tree (incorrectly identified
sets). False negative is the number of incorrectly rejected sets and finally, true negative is the number of
correctly rejected sets. Using the confusion matrix, we calculate the sensitivity, specificity, and accuracy
(ACC):
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Larger values of above-mentioned measures indicate higher performance in structure learning.
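These measures can be computed from two collections of minimal cut sets as sketched below. The total number of candidate sets, needed for the true negatives, is passed in explicitly because the text does not state how candidates are enumerated; the example sets and the count of 7 are hypothetical.

```python
def structure_metrics(reconstructed, true, n_candidates):
    """Sensitivity, specificity and accuracy of a reconstructed set of minimal
    cut sets. n_candidates is the assumed total number of candidate sets."""
    rec = {frozenset(s) for s in reconstructed}
    tru = {frozenset(s) for s in true}
    tp = len(rec & tru)               # correctly identified sets
    fp = len(rec - tru)               # incorrectly identified sets
    fn = len(tru - rec)               # incorrectly rejected sets
    tn = n_candidates - tp - fp - fn  # correctly rejected sets
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / n_candidates,
    }

# Hypothetical example with three basic events A, B, C and the
# 2^3 - 1 = 7 non-empty subsets as candidate sets.
metrics = structure_metrics(
    reconstructed=[{"A", "B"}, {"B", "C"}],
    true=[{"A", "B"}, {"C"}],
    n_candidates=7,
)
```

Because the number of true negatives is usually large compared with the number of minimal cut sets, specificity and accuracy tend to stay close to 1 even when sensitivity is low.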
4.3.2 Reliability Measures Estimation
When the structure of the fault tree is extracted from the data set, the unavailability of the system can be
calculated using proxel-based simulation. Since unavailabilities are calculated as transient solutions for
each time step, we have a vector of instantaneous unavailabilities calculated for the extracted fault tree,
$\{\hat{U}_i\}_{i=1}^{n}$, where n is the total number of time steps. For the original fault tree, there is also an
associated vector of instantaneous unavailabilities, $\{U_i\}_{i=1}^{n}$. Root Mean Square Error (RMSE)
is used to compare these vectors of unavailabilities:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{U}_i - U_i\right)^2} \qquad (1)$$
Better estimation of unavailability leads to a smaller distance between $\{\hat{U}_i\}$ and $\{U_i\}$, hence smaller
values of RMSE. We also report $\hat{U}_n$ and $U_n$ as the final stable unavailability values.
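A minimal sketch of the RMSE comparison between two unavailability vectors follows. The two short vectors are hypothetical and stand in for the per-time-step outputs of the extracted and the original fault tree.

```python
from math import sqrt

def rmse(estimated, original):
    """Root mean square error between two equally long unavailability vectors."""
    if len(estimated) != len(original) or not estimated:
        raise ValueError("vectors must be non-empty and of equal length")
    squared_errors = ((e - o) ** 2 for e, o in zip(estimated, original))
    return sqrt(sum(squared_errors) / len(estimated))

# Hypothetical instantaneous unavailabilities over four time steps.
u_extracted = [0.0, 1.0e-6, 3.0e-6, 6.0e-6]
u_original = [0.0, 2.0e-6, 4.0e-6, 6.4e-6]
error = rmse(u_extracted, u_original)
```

An RMSE of 0 means that the extracted tree reproduces the original unavailability curve at every time step.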
4.3.3 Classification Evaluation
In the classification module, first the training set (observed cut sets) and the testing set (unobserved cut
sets) are prepared. Then, a Naïve Bayes model is fitted to the training set and the fitted model is applied to
classify the top event in the test set. Since the classification module includes extracting the structure and
the unavailability of the learnt fault tree, it is evaluated in regard to structure learning and estimation of the
reliability measures. Hence, the methods of subsections 4.3.1 and 4.3.2 are applied for the classification
module as well.
5 CASE STUDIES
We assess the performance of our algorithm using two repairable fault trees: 1) A fault-tolerant
multiprocessor system shown in Figure 10 (Malhotra and Trivedi 1995); 2) Radio Block Center (RBC) fault
tree (Figure 12) explained in Galileo textual format (Sullivan and Dugan 1996). The general steps in the
experiments are as follows:
1. Generate time series data from the basic events of each original fault tree.
2. Build the timely truth tables based on each generated time series.
3. Obtain training and testing sets using the truth table with time stamps.
4. Learn the fault tree (structure and parameters) from the observed data set using DDFTAnb
algorithm.
5. Compare the MCS of the reconstructed fault tree with that of the original fault tree in terms of
sensitivity, specificity, accuracy (ACC).
6. Use the reconstructed fault tree and the reliability and maintainability distributions to obtain the
reliability measures of the top event as well as those of the basic events.
7. Estimate the structure and reliability measures of the system for the unobserved combinations of
the basic events using DDFTAnb’s classification module.
8. Report the evaluation measures in terms of 95% confidence intervals.
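Step 1 of the experiments, generating time series for a repairable basic event, can be sketched as an alternating renewal process. The function name, the seed, the horizon and the distribution parameters are hypothetical and were chosen only so that several failures occur; they are not the values of the case studies.

```python
import random

def simulate_basic_event(sample_ttf, sample_ttr, horizon, rng):
    """Alternate working (0) and failed (1) periods until `horizon` and
    return the time-stamped state changes as (timestamp, new_state)."""
    t, state, changes = 0.0, 0, [(0.0, 0)]
    while True:
        # Draw the duration of the current period from the matching distribution.
        t += sample_ttf(rng) if state == 0 else sample_ttr(rng)
        if t >= horizon:
            return changes
        state = 1 - state
        changes.append((t, state))

rng = random.Random(42)
series = simulate_basic_event(
    sample_ttf=lambda r: r.expovariate(0.5),            # exponential times to failure
    sample_ttr=lambda r: r.weibullvariate(0.75, 5.0),   # Weibull(scale, shape) repairs
    horizon=100.0,
    rng=rng,
)
```

Merging such series for all basic events and evaluating the original fault tree at every state change would give the time-stamped truth table used in step 2.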
5.1 A Fault-Tolerant Multiprocessor System
Figure 10 shows the fault tree of a fault-tolerant multiprocessor system, which consists of two processors Pi
(i = 1, 2) with private memories Mi (i = 1, 2) and a shared memory M3. A processor and a memory form a
processing unit. Each processing unit is connected to a mirrored disk system Dij (i = 1, 2 and j = 1, 2),
forming a processing subsystem. Both processing subsystems and M3 are connected via an
interconnection Bus N. Bobbio et al. (2001) refine the description of the multiprocessor system by adding
a power supply (PS) component whose failure causes a system failure. The PS is modelled
with three possible modes: working, defective and failed. The first corresponds to nominal
behaviour, the second to a defective working mode in which an abnormal voltage is provided, and the last mode
(failed) to a situation where the PS cannot work at all. The failed mode causes
the whole system to be down. According to the literature, the failure distribution of all components (except
for the PS) is assumed to be exponential with the failure rates given in Table 10, expressed in failures per hour
(f/h). The state-change diagram for the PS is also given in Table 10: the transition from the working to the defective
state follows an exponential distribution with rate 3.0e-05 (Exp(3.0e-05)), the transition to the failed state
follows a Normal(0.25, 0.1) distribution, and the repair transition follows a
Uniform(0.1, 0.2) distribution. For binary events, we add individual repair distributions that are not
limited to the exponential distribution to highlight the ability of our algorithm to cope with non-exponential
probability distribution functions, as well as repairable and multi-state components.
Figure 10: A fault-tolerant multiprocessor system with a multi-state component PS.
Table 10: Reliability and maintainability distribution functions of the basic events in Figure 10.
#  Basic events      Reliability distribution (rate in f/h)  Maintainability distribution
1  Disk Dij          Exp(8.0e-05)                            Weibull(5, 0.75)
2  Proc Pi           Exp(5.0e-07)                            Exp(0.25)
3  Mem Mj            Exp(3.0e-08)                            Weibull(5, 20)
4  Bus N             Exp(2.0e-09)                            Exp(0.006)
5  Power supply PS   (state-change diagram: working to defective Exp(3.0e-05); failure Normal(0.25, 0.1); repair Uniform(0.1, 0.2))
Unavailability values, calculated using proxel-based simulation for the basic events and the system (top
event), are illustrated in Figure 11, and the system unavailability (Un) is 6.422826e-06. The results
of the DDFTA algorithm for the fault tree in Figure 10, considering 10% to 100% data availability, are
shown in Table 11. As we observe more data points, unavailability and RMSE values converge to the true
value 6.422826e-06 and the ideal value of 0, respectively. As expected, the best structure learning
performance occurs with the highest data availability and worsens as the data availability decreases. As
can be seen, the Naïve Bayes classifier, indicated by NB, performs relatively well for small amounts of training
data because it has a low propensity to overfit.
Figure 11: Instantaneous unavailabilities for the fault-tolerant multiprocessor of Figure 10.
Table 11: Results of the DDFTA and DDFTAnb algorithms for the multiprocessor fault tree of Figure 10
considering different levels of data availability. Sen, Spe and ACC are the structure learning measures;
unavailability and RMSE are the parameter learning measures.

DDFTA
Data availability  Sen              Spe              ACC              Unavailability                RMSE
10%                0.1091 ± 0.0238  0.9904 ± 0.0004  0.9881 ± 0.0003  3.2673e-10 ± 1.03e-10         6.3411e-06 ± 1.02e-10
20%                0.1909 ± 0.0178  0.9895 ± 0.0011  0.9874 ± 0.0011  6.35589e-07 ± 1.24e-06        5.7111e-06 ± 1.23e-06
30%                0.2636 ± 0.0178  0.9911 ± 0.0004  0.9891 ± 0.0005  5.0887e-06 ± 1.65e-06         1.2983e-06 ± 1.64e-06
40%                0.3273 ± 0.0394  0.9926 ± 0.0011  0.9908 ± 0.0012  2.6121e-06 ± 2.03e-06         3.7822e-06 ± 2.01e-06
50%                0.4000 ± 0.0713  0.9941 ± 0.0007  0.9925 ± 0.0009  6.4228e-06 ± 1.54e-15         3.5645e-12 ± 1.41e-15
60%                0.7182 ± 0.0178  0.9965 ± 0.0004  0.9956 ± 0.0004  5.1527e-06 ± 1.65e-06         1.2606e-06 ± 1.64e-06
70%                0.6566 ± 0.0376  0.9957 ± 0.0007  0.9948 ± 0.0008  6.4228e-06 ± 8.27e-13         3.1690e-12 ± 7.36e-13
80%                0.8889 ± 0.0248  0.9987 ± 0.0006  0.9984 ± 0.0007  5.7172e-06 ± 1.31e-06         7.0034e-07 ± 1.30e-06
90%                0.9182 ± 0.0178  0.9985 ± 0.0005  0.9983 ± 0.0005  6.4228e-06 ± 1.19e-12         2.4905e-12 ± 1.06e-12
100%               1.0000 ± 0.0000  1.0000 ± 0.0000  1.0000 ± 0.0000  6.42282649198e-06 ± 0.0000    0.0000 ± 0.0000

DDFTAnb
Data availability  Sen              Spe              ACC              Unavailability                RMSE
10%                0.4182 ± 0.0606  0.9928 ± 0.0012  0.9912 ± 0.0013  6.4229090742e-06 ± 3.52e-11   7.6858e-11 ± 3.18e-11
20%                0.5545 ± 0.0675  0.9922 ± 0.0013  0.9910 ± 0.0015  6.4228257002e-06 ± 1.05e-12   7.2144e-13 ± 9.29e-13
30%                0.4545 ± 0.0000  0.9890 ± 0.0005  0.9876 ± 0.0005  6.4228261012e-06 ± 7.82e-13   3.6145e-13 ± 6.93e-13
40%                0.5727 ± 0.0597  0.9892 ± 0.0007  0.9881 ± 0.0007  6.4228248979e-06 ± 1.28e-12   1.4266e-12 ± 1.13e-12
50%                0.5545 ± 0.0675  0.9894 ± 0.0014  0.9882 ± 0.0016  6.4228252955e-06 ± 1.20e-12   1.0729e-12 ± 1.06e-12
60%                0.7545 ± 0.0380  0.9926 ± 0.0006  0.9920 ± 0.0007  6.4228232938e-06 ± 1.05e-12   2.8470e-12 ± 9.29e-13
70%                0.6967 ± 0.0398  0.9914 ± 0.0010  0.9906 ± 0.0010  6.4228229310e-06 ± 8.28e-13   3.1686e-12 ± 7.36e-13
80%                0.9091 ± 0.0000  0.9962 ± 0.0003  0.9960 ± 0.0003  6.4228264911e-06 ± 1.52e-15   7.6663e-16 ± 1.40e-15
90%                0.9818 ± 0.0238  0.9977 ± 0.0003  0.9977 ± 0.0003  6.4228264919e-06 ± 1.96e-17   1.9424e-17 ± 1.57e-17
100%               -----            -----            -----            -----                         -----
5.2 Radio Block Center
Radio Block Center (RBC) is the most important subsystem of the European Railway Traffic
Management System / European Train Control System (Flammini et al. 2005). It is responsible for
guaranteeing a safe outdistancing between trains by managing the information received from the onboard
subsystem and from the interlocking subsystem. In the RBC fault tree illustrated in Figure 12, the line “BUS1”
lambda=4.4444e-6 repair=4 means that the reliability and maintainability distributions of the basic event
“BUS1” are exponential with a failure rate of 4.4444e-6 and a repair rate of 4, respectively. The estimated
unavailability of the system is 6.8699e-12, and the instantaneous unavailabilities are illustrated in Figure
13. The results shown in Table 12 demonstrate that this fault tree is affected by loss of data more than
the multiprocessor example, because even with 90% data availability, DDFTA’s sensitivity is 0.775. We
suspect that the reason for this is that the fault events are rare, and the system is highly reliable.
Toplevel “System”;
“System” or “Power” “WANinterface” “SystemBUS” “GSMRinterface” “TMR”;
“Power” and “PowerSupply1” “PowerSupply2” “PowerSupply3”;
“WANinterface” and “WANcard1” “WANcard2”;
“SystemBUS” and “BUS1” “BUS2”;
“GSMRinterface” and “GSMRCard1” “GSMRCard2”;
“TMR” or “CPUcore” “voter”;
“CPUcore” 2of3 “CPUboard1” “CPUboard2” “CPUboard3”;
“voter” and “FPGA1” “FPGA2”;
1
“BUS1” lambda=4.4444e-6 repair=4;
“BUS2” lambda=4.4444e-6 repair=4;
2
“FPGA1” lambda=3.003e-9 repair=4;
“FPGA2” lambda=3.003e-9 repair=4;
3
“PowerSupply1” lambda=1.8182e-5 repair=6;
“PowerSupply2” lambda=1.8182e-5 repair=6;
“PowerSupply3” lambda=1.8182e-5 repair=6;
4
“WANcard1” lambda=2.5e-6 repair=6;
“WANcard2” lambda=2.5e-6 repair=6;
5
“GSMRCard1” lambda=5.7078e-6 repair=6;
“GSMRCard2” lambda=5.7078e-6 repair=6;
6
“CPUboard1” lambda=7.4074e-6 repair=6;
“CPUboard2” lambda=7.4074e-6 repair=6;
“CPUboard3” lambda=7.4074e-6 repair=6;
Figure 12: Radio Block Center fault tree with six different types of basic events
Figure 13: Unavailability values for the RBC fault tree in Figure 12.
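As a rough cross-check of the reported system unavailability, the RBC fault tree of Figure 12 can be evaluated in steady state, where an exponential component with failure rate lambda and repair rate mu has unavailability lambda/(lambda + mu). The sketch below assumes independent basic events and is not the proxel-based computation used in the paper; the helper names are illustrative.

```python
from math import expm1, log1p

def u_exp(lam, mu):
    """Steady-state unavailability for failure rate lam and repair rate mu."""
    return lam / (lam + mu)

def and_gate(us):
    """All inputs failed (independent inputs)."""
    p = 1.0
    for u in us:
        p *= u
    return p

def or_gate(us):
    """1 - prod(1 - u), evaluated via log1p/expm1 to stay accurate for tiny u."""
    return -expm1(sum(log1p(-u) for u in us))

def two_of_three(u):
    """At least two of three identical, independent components failed."""
    return 3 * u**2 * (1 - u) + u**3

# Basic events, with rates as listed in Figure 12.
bus = u_exp(4.4444e-6, 4)
fpga = u_exp(3.003e-9, 4)
power_supply = u_exp(1.8182e-5, 6)
wan_card = u_exp(2.5e-6, 6)
gsmr_card = u_exp(5.7078e-6, 6)
cpu_board = u_exp(7.4074e-6, 6)

# Gates, following the Galileo description.
power = and_gate([power_supply] * 3)
wan_interface = and_gate([wan_card] * 2)
system_bus = and_gate([bus] * 2)
gsmr_interface = and_gate([gsmr_card] * 2)
tmr = or_gate([two_of_three(cpu_board), and_gate([fpga] * 2)])
system = or_gate([power, wan_interface, system_bus, gsmr_interface, tmr])
```

Under these assumptions the result is about 6.9e-12, close to the estimate of 6.8699e-12 reported in the text; the 2-out-of-3 CPU gate contributes the largest share.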
Table 12: Results of the DDFTA and DDFTAnb algorithms for the RBC fault tree considering different
percentages of data availabilities. Sen, Spe and ACC are the structure learning measures;
unavailability and RMSE are the parameter learning measures.

DDFTA
Data availability  Sen              Spe              ACC              Unavailability        RMSE
10%                0.1250 ± 0.0000  0.9972 ± 3e-04   0.9967 ± 3e-04   1.80e-13 ± 3.54e-13   6.48e-12 ± 3.44e-13
20%                0.2750 ± 0.1200  0.997 ± 5e-04    0.9966 ± 6e-04   1.76e-12 ± 4.82e-13   4.94e-12 ± 4.59e-13
30%                0.3250 ± 0.0980  0.9974 ± 5e-04   0.9971 ± 6e-04   2.61e-12 ± 1.01e-12   4.12e-12 ± 9.74e-13
40%                0.4500 ± 0.1660  0.9981 ± 7e-04   0.9978 ± 8e-04   3.71e-12 ± 1.16e-12   3.06e-12 ± 1.12e-12
50%                0.4750 ± 0.1430  0.9983 ± 7e-04   0.9980 ± 7e-04   4.39e-12 ± 1.29e-12   2.41e-12 ± 1.25e-12
60%                0.5000 ± 0.1550  0.9983 ± 7e-04   0.9981 ± 8e-04   4.20e-12 ± 1.87e-12   2.59e-12 ± 1.81e-12
70%                0.6500 ± 0.1630  0.9990 ± 8e-04   0.9988 ± 8e-04   5.29e-12 ± 8.59e-13   1.53e-12 ± 8.29e-13
80%                0.7500 ± 0.1340  0.9995 ± 6e-04   0.9994 ± 6e-04   5.78e-12 ± 1.19e-12   1.06e-12 ± 1.16e-12
90%                0.7750 ± 0.1429  0.9995 ± 6e-04   0.9994 ± 7e-04   6.09e-12 ± 1.19e-12   7.60e-13 ± 1.16e-12
100%               1.0000 ± 0.0000  1.0000 ± 0.0000  1.0000 ± 0.0000  6.86e-12 ± 0.0000     0.0000 ± 0.0000

DDFTAnb
Data availability  Sen              Spe              ACC              Unavailability        RMSE
10%                0.3500 ± 0.1800  0.9989 ± 4e-04   0.9985 ± 3e-04   1.48e-06 ± 1.78e-06   1.45e-06 ± 1.74e-06
20%                0.3500 ± 0.2616  0.9981 ± 7e-04   0.9978 ± 8e-04   7.40e-07 ± 9.67e-07   7.24e-07 ± 9.47e-07
30%                0.5000 ± 0.1550  0.9982 ± 4e-04   0.9979 ± 5e-04   2.47e-07 ± 4.84e-07   2.41e-07 ± 4.73e-07
40%                0.6250 ± 0.1096  0.9983 ± 5e-04   0.9981 ± 5e-04   9.47e-12 ± 2.46e-12   2.87e-12 ± 1.94e-12
50%                0.6562 ± 0.1049  0.9982 ± 7e-04   0.9981 ± 8e-04   7.64e-12 ± 2.57e-12   2.36e-12 ± 9.93e-13
60%                0.6500 ± 0.1200  0.9983 ± 6e-04   0.9982 ± 6e-04   7.89e-12 ± 3.00e-12   2.15e-12 ± 2.20e-12
70%                0.7250 ± 0.1429  0.9987 ± 7e-04   0.9985 ± 8e-04   7.28e-12 ± 1.32e-12   1.16e-12 ± 7.00e-13
80%                0.8250 ± 0.0600  0.9994 ± 3e-04   0.9993 ± 4e-04   6.51e-12 ± 4.33e-13   3.51e-13 ± 4.21e-13
90%                0.8250 ± 0.0600  0.9995 ± 3e-04   0.9994 ± 3e-04   6.51e-12 ± 4.33e-13   3.51e-13 ± 4.21e-13
100%               -----            -----            -----            -----                 -----
5.3 Discussion
In this paper, we investigate two fault trees as case studies, a fault-tolerant multiprocessor (MP) and the
radio block center (RBC). MP has a lower reliability measure than RBC since the unavailability value for
MP is 6.4228e-06 and that of the RBC equals 6.86e-12. Furthermore, MP has a multi-state event (PS) and
repair rates that follow distributions other than exponential.
In terms of structure learning, comparing the accuracies for the two fault trees indicates that, for lower
data availability, applying the classification module is highly promising (Figure 14). For RBC, as a highly
reliable system, average ACC values when applying the classification module (NB) are higher even for 60%
data availability. Accuracy values of DDFTA with the classification module for MP are higher
only for very low levels of data availability. In general, DDFTA with and without the classification module
has a higher accuracy for the RBC fault tree.
Considering parameter learning, as illustrated in Figure 15, estimated unavailability values converge to
the true unavailability values as data availability increases. For both fault trees, estimated unavailabilities
using DDFTA are lower than the true unavailability value, as opposed to DDFTAnb where the estimated
unavailabilities are higher than the true unavailability value. The reason is that the sets of minimal cut sets
predicted using DDFTA are always subsets of the real set of minimal cut sets. Unavailability values
calculated using DDFTAnb are always higher (for low data availabilities) or equal (full data availability)
to the true unavailability value which makes this algorithm more conservative since it estimates the
reliability of the system lower than it really is. DDFTA calculates lower (for low data availability) or the
same (for full data availability) unavailability values compared to the true unavailability value, which is
risky since it shows the system as more reliable than it truly is.
Figure 14: ACC mean values for MP and RBC fault trees, considering different levels of data
availability, for DDFTA and DDFTAnb.
DDFTAnb is affected by a set of experimental parameters. The structure learning step is affected by
the number of basic events, whether basic events are multistate or binary, repairable/nonrepairable events
and the number of minimal cut sets. Numbers of basic events and multistate or binary events affect the size
of the truth table. For example, a system with 7 binary basic events has 2^7 = 128 possible combinations of
basic events, whereas a system with 13 binary events and one event with three states has 2^13 × 3^1 = 24,576
combinations of basic events. The quantitative analysis part of DDFTAnb is sensitive to
repairable/nonrepairable events, rare events, the size of the time step and the total simulation time. If the total
simulation time is 5 years and the size of the time step is 1 day, then the total number of time steps is
5×365 = 1,825. For the same total simulation time of 5 years, if we take time steps of one month then we
only have 5×12 = 60 time steps to calculate instantaneous unavailabilities.
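The counts quoted above follow from simple products, as the short sketch below shows. The helper names are illustrative.

```python
from math import prod

def n_combinations(states_per_event):
    """Size of the truth table: product of the number of states of each basic event."""
    return prod(states_per_event)

def n_time_steps(total_time, step):
    """Number of time steps of size `step` (same unit as `total_time`)."""
    return round(total_time / step)

seven_binary = n_combinations([2] * 7)            # 7 binary basic events
mixed = n_combinations([2] * 13 + [3])            # 13 binary events, one 3-state event
daily_steps = n_time_steps(5 * 365, 1)            # 5 years in steps of one day
monthly_steps = n_time_steps(5 * 12, 1)           # 5 years in steps of one month
```

The truth table grows exponentially with the number of basic events, whereas the number of time steps grows only linearly with the simulation horizon.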
For a single fault tree, all the above parameters are fixed and the experimenter cannot change them,
except for the total simulation time and the size of the time step. DDFTAnb’s results are not very sensitive
to these parameters in general. However, for rare events DDFTAnb may obtain different unavailabilities
for a fixed fault tree if we choose different total simulation time and size of the time step. Rare events may
require larger total time and smaller time steps, hence larger number of time steps are necessary. Also, state
changes of repairable events define the number of proxels that need to be calculated at each time step.
Table 13 summarizes the computation times, in seconds, on a workstation with 16GB RAM and a 2.8GHz Core i7
processor for the MP and RBC case studies. For the parameter learning step, the number of different types of
basic events and the number of time steps are considered; for the structure learning step, we consider the
number of combinations of basic events and the number of minimal cut sets for both MP and RBC.
Figure 15: Mean estimated unavailabilities of the MP and RBC fault trees, considering different levels of
data availability, for DDFTA and DDFTAnb.
Table 13: Computational times (in seconds) for RBC and MP fault trees.

Parameter learning
Fault tree  # different types    # time steps
            of basic events      50 (T=5, ∆t=0.1)   100 (T=5, ∆t=0.05)   500 (T=5, ∆t=0.01)
MP          5                    12.68              58.07                1529
RBC         6                    22.73              94.17                2067.95

Structure learning
Fault tree  # combinations of basic events   # minimal cut sets   Time (s)
MP          3×2^10 = 3,072                   11                   78.18
RBC         2^14 = 16,384                    8                    3380.05
According to Table 13, the DDFTA and DDFTAnb algorithms are not very efficient for complex systems
with rare events. The main drawback is that the presented algorithm becomes slower as the number of cut
sets or the number of basic components increases with the size of the system. Also, some types of basic
events need a larger T (total time) with a smaller ∆t (size of the time step), which makes the quantitative
analysis more time-consuming. One way to overcome these difficulties is to divide the whole system into its
major subsystems and analyse them using parallel computing methods.
6 CONCLUSION
We presented DDFTAnb algorithm, an efficient and novel algorithm for extracting repairable fault trees
from incomplete multinomial time series data to extract the future fault-related behaviour of a system in
terms of a fault tree and estimate the system’s reliability measures. We extended the work of Lazarova-
Molnar et al. (2020) by providing classification capability to address the issue of missing or unobserved cut
sets in fault occurrences of basic events. Classifying the top event for unseen combinations of events
becomes more critical when the system of interest is highly reliable with significantly rare events, or when
it is composed of significantly many basic components. We demonstrated that our approach has clear
benefits through two case studies.
DDFTAnb can extract and analyse multi-state repairable fault trees, compute reliability metrics for
probability distributions other than the usual exponential probability distribution and estimate the future
reliability of the system. In addition, DDFTAnb is highly recommended in cases when there is insufficient
amount of data. In terms of our case studies, we observed the following: for 10% of data availability,
accuracies of DDFTAnb (DDFTA) are 0.9912 (0.9881) and 0.9985 (0.9967) for MP and RBC, respectively.
However, in cases when there are sufficient or high amounts of data, DDFTA alone has a high performance:
for 90% of data availability, accuracies of DDFTAnb (DDFTA) are 0.9977 (0.9983) and 0.9994 (0.9994)
for MP and RBC, respectively. Moreover, the reliability measure calculated by DDFTA for a system of
interest is higher than the system’s true reliability value, while DDFTAnb calculates a lower reliability
measure than the system’s true reliability.
DDFTAnb can be used to analyse any system whose fault tree can be expressed in terms of its
minimal cut sets, and it has no limitations in this regard. The main limitation of the presented algorithm is that
it becomes slower as the number of cut sets or the number of basic components increases with the size of the
system; parallel computing can be considered as a solution. As future work, we intend to improve the
classification performance of the presented algorithm and extend the tool to model and extract dynamic
fault trees with more types of gates.
REFERENCES
Aizpurua, J. I., and E. Muxika. 2013. "Model-Based Design of Dependable Systems: Limitations and
Evolution of Analysis and Verification Approaches". International Journal on Advances in
Security 6 (1):12-31.
Arnold, A., A. Griffault, G. Point, and A. Rauzy. 2000. "The Altarica Language and Its Semantics".
Fundamenta Informaticae 34 (2-3):109-124.
Ayvaz, S., and K. Alpay. 2021. "Predictive Maintenance System for Production Lines in Manufacturing:
A Machine Learning Approach Using Iot Data in Real-Time". Expert Systems with Applications
173:114598.
Barlow, R. E., and K. D. Heidtmann. 1984. "Computing K-out-of-N System Reliability". IEEE
Transactions on Reliability 33 (4):322-323.
Bobbio, A., L. Portinale, M. Minichino, and E. Ciancamerla. 2001. "Improving the Analysis of
Dependable Systems by Mapping Fault Trees into Bayesian Networks". Reliability Engineering
& System Safety 71 (3):249-260.
Brito, L. C., G. A. Susto, J. N. Brito, and M. A. Duarte. 2022. "An Explainable Artificial Intelligence
Approach for Unsupervised Fault Detection and Diagnosis in Rotating Machinery". Mechanical
Systems and Signal Processing 163:108105.
Cai, B., Y. Liu, Y. Ma, Z. Liu, Y. Zhou, and J. Sun. 2015. "Real-Time Reliability Evaluation
Methodology Based on Dynamic Bayesian Networks: A Case Study of a Subsea Pipe Ram Bop
System". ISA transactions 58:595-604.
Caldarola, L. 1980. "Fault Tree Analysis with Multistate Components". In G. Apostolakis, S. Garribba
and G. Volta (eds.), In Synthesis and Analysis Methods for Safety and Reliability Studies, Boston,
MA: Springer US.
Chen, J., C. Roberts, and P. Weston. 2008. "Fault Detection and Diagnosis for Railway Track Circuits
Using Neuro-Fuzzy Systems". Control Engineering Practice 16 (5):585-596.
Chiacchio, F., J. I. Aizpurua, L. Compagno, and D. D'Urso. 2020. "Shyftoo, an Object-Oriented Monte
Carlo Simulation Library for the Modeling of Stochastic Hybrid Fault Tree Automaton". Expert
Systems with Applications 146:113139.
Chiacchio, F., D. D'Urso, L. Compagno, M. Pennisi, F. Pappalardo, and G. Manno. 2016. "Shyfta, a
Stochastic Hybrid Fault Tree Automaton for the Modelling and Simulation of Dynamic
Reliability Problems". Expert Systems with Applications 47:42-57.
Compare, M., P. Baraldi, I. Bani, E. Zio, and D. Mc Donnell. 2017. "Development of a Bayesian Multi-
State Degradation Model for up-to-Date Reliability Estimations of Working Industrial
Components". Reliability Engineering & System Safety 166:25-40.
Delignette-Muller, M. L., and C. Dutang. 2015. "Fitdistrplus: An R Package for Fitting Distributions".
Journal of statistical software 64 (4):1-34.
Dogan, A., and D. Birant. 2021. "Machine Learning and Data Mining in Manufacturing". Expert Systems
with Applications 166:114060.
Domingos, P., and M. Pazzani. 1996. "Beyond Independence: Conditions for the Optimality of the Simple
Bayesian Classifier". In Proc. 13th Intl. Conf. Machine Learning, 105-112.
Duda, R. O., and P. E. Hart. 1973. Pattern Classification and Scene Analysis. Wiley New York.
Feiler, P. H., B. A. Lewis, and S. Vestal. 2006. "The Sae Architecture Analysis & Design Language
(Aadl) a Standard for Engineering Performance Critical Systems". In 2006 IEEE Conference on
Computer Aided Control System Design, 2006 IEEE International Conference on Control
Applications, 2006 IEEE International Symposium on Intelligent Control, October 4th-6th,
Munich, Germany, 1206-1211.
Flammini, F., N. Mazzocca, M. Iacono, and S. Marrone. 2005. "Using Repairable Fault Trees for the
Evaluation of Design Choices for Critical Repairable Systems". In Ninth IEEE International
Symposium on High-Assurance Systems Engineering (HASE'05), October 12th-14th, Heidelberg,
Germany, 163-172.
Guo, Y., M. Zhong, C. Gao, H. Wang, X. Liang, and H. Yi. 2021. "A Discrete-Time Bayesian Network
Approach for Reliability Analysis of Dynamic Systems with Common Cause Failures".
Reliability Engineering & System Safety 216:108028.
Huang, W., Y. Zhang, Y. Yu, Y. Xu, M. Xu, R. Zhang, G. J. De Dieu, D. Yin, and Z. Liu. 2021.
"Historical Data-Driven Risk Assessment of Railway Dangerous Goods Transportation System:
Comparisons between Entropy Weight Method and Scatter Degree Method". Reliability
Engineering & System Safety 205:107236.
Jardine, A. K., D. Lin, and D. Banjevic. 2006. "A Review on Machinery Diagnostics and Prognostics
Implementing Condition-Based Maintenance". Mechanical Systems and Signal Processing 20
(7):1483-1510.
Kabir, S. 2017. "An Overview of Fault Tree Analysis and Its Application in Model Based Dependability
Analysis". Expert Systems with Applications 77:114-135.
Lampis, M., and J. Andrews. 2009. "Bayesian Belief Networks for System Fault Diagnostics". Quality
and Reliability Engineering International 25 (4):409-426.
Lazarova-Molnar, S. 2005. The Proxel-Based Method: Formalisation, Analysis and Applications, Ph.D.
thesis, Otto-von-Guericke-University, Magdeburg, Germany.
Lazarova-Molnar, S., and G. Horton. 2003. "Proxel-Based Simulation of Stochastic Petri Nets Containing
Immediate Transitions". Electronic Notes in Theoretical Computer Science 85 (4):203-217.
Lazarova-Molnar, S., P. Niloofar, and G. K. Barta. 2020. "Automating Reliability Analysis: Data-Driven
Learning and Analysis of Multi-State Fault Trees". In 30th European Safety and Reliability
Conference and 15th Probabilistic Safety Assessment and Management Conference, November
1st-5th, Venice, Italy, 1805-1812.
Lee, W.-S., D. L. Grosh, F. A. Tillman, and C. H. Lie. 1985. "Fault Tree Analysis, Methods, and
Applications - A Review". IEEE Transactions on Reliability 34 (3):194-203.
Liggesmeyer, P., and M. Rothfelder. 1998. "Improving System Reliability with Automatic Fault Tree
Generation". In Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant
Computing (Cat. No. 98CB36224), 90-99. IEEE.
Linard, A., D. Bucur, and M. Stoelinga. 2019. "Fault Trees from Data: Efficient Learning with an
Evolutionary Algorithm". In 5th International Symposium on Dependable Software Engineering:
Theories, Tools, and Applications, November 27th-29th, Shanghai, China, 19-37.
Lisnianski, A., and G. Levitin. 2003. Multi-State System Reliability: Assessment, Optimization and
Applications. Singapore, Singapore: World Scientific.
Majdara, A., and T. Wakabayashi. 2009. "Component-Based Modeling of Systems for Automated Fault
Tree Generation". Reliability Engineering & System Safety 94 (6):1076-1086.
Malhotra, M., and K. S. Trivedi. 1995. "Dependability Modeling Using Petri-Nets". IEEE Transactions
on Reliability 44 (3):428-440.
Mukherjee, S., and A. Chakraborty. 2007. "Automated Fault Tree Generation: Bridging Reliability with
Text Mining". In 2007 Annual Reliability and Maintainability Symposium, January 22nd-25th,
Orlando, USA, 83-88.
Nadjafi, M., M. A. Farsi, and H. Jabbari. 2017. "Reliability Analysis of Multi-State
Emergency Detection System Using Simulation Approach Based on Fuzzy Failure Rate".
International Journal of System Assurance Engineering and Management 8 (3):532-541.
Nauta, M., D. Bucur, and M. Stoelinga. 2018. "LIFT: Learning Fault Trees from Observational Data". In
15th International Conference on Quantitative Evaluation of Systems, September 4th-7th, Beijing,
China, 306-322.
Niloofar, P., and S. Lazarova-Molnar. 2021. "Fusion of Data and Expert Knowledge for Fault Tree
Reliability Analysis of Cyber-Physical Systems". In 2021 5th International Conference on
System Reliability and Safety (ICSRS), November 24th-26th, Palermo, Italy, 92-97.
Niloofar, P., and S. Lazarova-Molnar. 2022. "Collaborative Data-Driven Reliability Analysis of Multi-
State Fault Trees". Proceedings of the Institution of Mechanical Engineers, Part O: Journal of
Risk and Reliability:1748006X221076290.
Nolan, P. J., M. G. Madden, and P. Muldoon. 1994. "Diagnosis Using Fault Trees Induced from
Simulated Incipient Fault Case Data". In Second International Conference on Intelligent Systems
Engineering, 304-309.
Papadopoulos, Y., and J. A. McDermid. 1999. "Hierarchically Performed Hazard Origin and Propagation
Studies". In International Conference on Computer Safety, Reliability, and Security, September
27th-29th, Toulouse, France, 139-152.
Qian, K., L. Yu, and S. Gao. 2021. "Fault Tree Construction Model Based on Association Analysis for
Railway Overhead Contact System". International Journal of Computational Intelligence Systems
14 (1):96-105.
Rigby, R. A., and D. M. Stasinopoulos. 2005. "Generalized Additive Models for Location, Scale and
Shape". Journal of the Royal Statistical Society: Series C 54 (3):507-554.
Ruijters, E., and M. Stoelinga. 2015. "Fault Tree Analysis: A Survey of the State-of-the-Art in Modeling,
Analysis and Tools". Computer Science Review 15:29-62.
Sharvia, S., S. Kabir, M. Walker, and Y. Papadopoulos. 2016. "Model-Based Dependability Analysis:
State-of-the-Art, Challenges, and Future Outlook". Software Quality Assurance:251-278.
Sikorska, J. Z., M. Hodkiewicz, and L. Ma. 2011. "Prognostic Modelling Options for Remaining Useful
Life Estimation by Industry". Mechanical Systems and Signal Processing 25 (5):1803-1836.
Solomatine, D. P., and A. Ostfeld. 2008. "Data-Driven Modelling: Some Past Experiences and New
Approaches". Journal of Hydroinformatics 10 (1):3-22.
Sullivan, K., and J. B. Dugan. 1996. Galileo User’s Manual & Design Overview, University of Virginia.
https://www.cse.msu.edu/~cse870/Materials/FaultTolerant/manual-galileo.htm, accessed 28
March 2021.
Theodoropoulos, P., C. C. Spandonidis, F. Giannopoulos, and S. Fassois. 2021. "A Deep Learning-Based
Fault Detection Model for Optimization of Shipping Operations and Enhancement of Maritime
Safety". Sensors 21 (16):5658.
Vesely, W. E., F. F. Goldberg, N. H. Roberts, and D. F. Haasl. 1981. "Fault Tree Handbook". Technical
Report NUREG-0492, Nuclear Regulatory Commission Washington DC, USA.
Waghen, K., and M.-S. Ouali. 2019. "Interpretable Logic Tree Analysis: A Data-Driven Fault Tree
Methodology for Causality Analysis". Expert Systems with Applications 136:376-391.
———. 2021. "Multi-Level Interpretable Logic Tree Analysis: A Data-Driven Approach for Hierarchical
Causality Analysis". Expert Systems with Applications 178:115035.
Zhang, W., M.-P. Jia, L. Zhu, and X.-A. Yan. 2017. "Comprehensive Overview on Computational
Intelligence Techniques for Machinery Condition Monitoring and Fault Diagnosis". Chinese
Journal of Mechanical Engineering (English Edition) 30 (4):782-795.
Zheng, J., H. Pan, and J. Cheng. 2017. "Rolling Bearing Fault Detection and Diagnosis Based on
Composite Multiscale Fuzzy Entropy and Ensemble Support Vector Machines". Mechanical
Systems and Signal Processing 85:746-759.