
Efficient Probabilistic Model Checking of Smart Building Maintenance using Fault Maintenance Trees


Abstract

Cyber-physical systems, like Smart Buildings, power plants and data centers have to meet high standards, both in terms of reliability and availability. Such metrics are typically evaluated using Fault trees (FTs) and do not consider maintenance strategies which can significantly improve lifespan and reliability. Fault Maintenance trees (FMTs) -- an extension of FTs that also incorporate maintenance and degradation models, are a novel technique that serve as a good planning platform for balancing total costs and dependability of a system. In this work, we apply the FMT formalism to a Smart Building application. We propose a framework for modelling FMTs using probabilistic model checking and present an algorithm for performing abstraction of the FMT in order to reduce the size of its equivalent Continuous Time Markov Chain. This allows us to apply the probabilistic model checking more efficiently. We demonstrate the applicability of our proposed approach by evaluating various dependability metrics and maintenance strategies of a Heating, Ventilation and Air-Conditioning system's FMT.
Eicient Probabilistic Model Checking of Smart Building
Maintenance using Fault Maintenance Trees
Nathalie Cauchi
Department of Computer Science, University of Oxford
Oxford, United Kingdom
Khaza Anuarul Hoque
Department of Computer Science, University of Oxford
Oxford, United Kingdom
Alessandro Abate
Department of Computer Science, University of Oxford
Oxford, United Kingdom
Mariëlle Stoelinga
FMT Group, University of Twente
Twente, The Netherlands
CCS Concepts: Computer systems organization → Maintainability and maintenance
Keywords: Fault Maintenance Trees, Formal modelling, Probabilistic Model checking, Reliability, Building Automation Systems, PRISM
ACM Reference format:
Nathalie Cauchi, Khaza Anuarul Hoque, Alessandro Abate, and Mariëlle Stoelinga. 2017. Efficient Probabilistic Model Checking of Smart Building Maintenance using Fault Maintenance Trees. In Proceedings of BuildSys ’17, Delft, Netherlands, November 8–9, 2017, 10 pages.
DOI: 10.1145/3137133.3137138
Worldwide, buildings account for approximately 40% of the total energy consumption and 20% of the total CO2 emissions annually [ ]. Efficient Building Automation Systems (BAS) can reduce energy consumption by up to 30% through their optimal operation, continuous commissioning and maintenance [ ]. Constructions employing such technologies are termed Smart Buildings. Such technologies have to adhere to high standards, both in terms of reliability and availability. One way of achieving this is by employing methods to perform preventative and predictive maintenance actions. Diagnostic and fault detection techniques for Smart Building applications have been developed in [ ]. Predictive and preventative maintenance strategies are devised in [ ]. However, these techniques preclude availability and reliability measurements and focus only on synthesis of maintenance policies in the presence of degradation and fault finding. Reliability and availability are typically tackled using Fault Trees (FTs), where the focus is on finding the root causes of a system failure using a top-down approach. FTs do not include maintenance strategies in the analysis, although maintenance is a key element in reducing component failures. [ ] presents the Fault Maintenance Tree (FMT) as an extension of FT encompassing both degradation and maintenance models. The degradation models represent the different levels of component degradation and are known as Extended Basic Events (EBE). The maintenance models incorporate the undertaken maintenance policy, which includes both inspections and repairs. These are modelled using Repair and Inspection modules in the FMT framework.
In the literature, FMTs are analysed using Statistical Model Checking (SMC) [ ], which provides statistical guarantees. In contrast, Probabilistic Model Checking (PMC), based on numerical analysis, provides formal guarantees with higher accuracy when compared with SMC [ ]. However, numerical methods are far more memory intensive and may result in a state space explosion. This limitation of PMC often leaves SMC as the last resort [ ]. In this paper we tackle the FMT analysis using PMC. Our contributions can be summarised as follows:
• We formalise the FMT framework using Continuous Time Markov Chains (CTMCs).
• We formalise the dependability metrics using the extended Continuous Stochastic Logic (CSL) formalism such that they can be computed using the PRISM model checker [ ].
• To mitigate the state space explosion problem, we present an FMT abstraction technique which decomposes a large FMT into an equivalent abstract FMT based on our proposed graph decomposition algorithm. Using our framework, we are able to achieve a 67% reduction in the state space size.
• Finally, we construct an FMT that identifies failure of a Heating, Ventilation and Air-conditioning system (HVAC). We apply the developed framework to the built FMT and evaluate relevant dependability metrics, together with different maintenance strategies using the PRISM model checker.
To the best of our knowledge, this is the first attempt to analyse FMTs using Probabilistic Model Checking and also the first application to Smart Building systems.
This article has the following structure: Section 2 introduces the fault maintenance trees and probabilistic model checking frameworks. This is followed by the developed methodology for modelling FMT using CTMCs and performing model checking in Section 3. The framework is applied to a heating, ventilation and air-conditioning (HVAC) case study which is presented in Section 4.
2.1 Fault maintenance trees framework
Fault trees are directed acyclic graphs (DAG) describing the combinations of component failures that lead to system failures. The leaves in the fault trees are called basic events and denote the system failures. The internal nodes of the graph are called gates and describe the different ways that failures can interact to cause other components to fail. The gates in a fault tree can be of several types and these include the AND gate, OR gate, k/N-gate [14].
Fault maintenance trees (FMT) extend fault trees by including maintenance (all the standard FT gates are also employed by the FMTs). This is achieved by making use of:
• Extended Basic Events - The basic events are modified to incorporate degradation models of the component the leaf represents. The degradation models represent different discrete levels of degradation the components can be in and are a function of time.
• Rate Dependency Events - A new gate introduced in [ ], labelled as RDEP, that accelerates the degradation rates of dependent child nodes and is depicted in Figure 1. When the component connected to the input of the RDEP fails, the degradation rate of the dependent components is accelerated with an acceleration factor $\gamma$.
• Repair and Inspection modules - The repair module (RM) performs cleaning or replacement actions. These actions can be either carried out using fixed time schedules or when enabled by the inspection module (IM). The IM performs periodic inspections and when components fall below a certain degradation threshold a repair or partial replacement is initiated by the IM to be performed by the RM.
Figure 1: RDEP gate with 1 input and $n$ dependent components, also known as children.
2.2 Probabilistic model checking
Model checking is a well-established formal verification technique used to verify the correctness of finite-state systems. Given a formal model of the system to be verified in terms of labelled state transitions and the properties to be verified in terms of temporal logic, the model checking algorithm exhaustively and automatically explores all the possible states in a system to verify if the property is satisfiable or not. Probabilistic model checking deals with systems that exhibit stochastic behaviour and is based on the construction and analysis of a probabilistic model of the system. We make use of CTMCs, having both transition and state labels, to perform stochastic modelling. Properties are expressed in the form of extended Continuous Stochastic Logic (CSL) [11].
Denition 2.1. e tuple
denes a CTMC
which is composed of a set of states
, the initial state
, a nite
set of transition labels
, a nite set of atomic propositions
a labelling function
and the transition rate matrix
. e rate
denes the delay before which
a transition between states
takes place. If
then the probability that a transition between the states
dened as 1
is time. No transitions will trigger
if R(s,s0)=0.
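As a small numerical illustration of Definition 2.1, the sketch below evaluates the probability that a single transition with a given rate fires within time t. The function name and the example rate are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def transition_probability(rate: float, t: float) -> float:
    """Probability that a CTMC transition with this rate fires within time t.

    Implements 1 - exp(-rate * t); a rate of 0 means the transition never fires.
    """
    if rate < 0 or t < 0:
        raise ValueError("rate and t must be non-negative")
    return 1.0 - math.exp(-rate * t)


# Hypothetical example: a transition with rate 0.5 per time unit, observed for 2 time units.
p = transition_probability(0.5, 2.0)
print(f"P(transition within t=2) = {p:.4f}")
```

With a rate of 0 the function returns 0 for every t, matching the statement that no transition triggers in that case.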
e logic of CSL species state-based properties for CTMCs,
built out of propositional logic, a steady-state operator that refers
to the stationary probabilities, and a probabilistic operator for rea-
soning about transient state probabilities. e state formulas are
interpreted over states of a CTMC, whereas the path formulas are
interpreted over paths in a CTMC. For detail about the syntax and
semantics of CSL (which also includes reward formulae), we refer
the interested readers to [
]. Examples of a CSL property with
its natural language translation are: (i) P
complet e
] - “e
probability of the system eventually completing its execution suc-
cessfully is at least 0.95”. (ii) R
] - “What is the expected
reward accumulated before the system successfully terminates?”
In this section, we rst formalise the FMT framework by presenting
the formal syntax and semantics for modelling FMTs using CTMCs.
Next, we list the set of metrics used to analyse the FMT. Finally, we
present the developed framework which allows us to analyse large
FMTs using probabilistic model checking (PMC).
3.1 FMT Syntax
To formalise the syntax of FMTs using CTMCs, we rst dene the
, characterizing each FMT element by type, inputs and rates.
We introduce a new element called DELAY which will be used to
model the deterministic time delays required by the extended basic
events (EBE), repair module (RM) and inspection module (IM). We
restrict the set
to contain the EBE, RDEP gate, OR gate, DELAY,
RM and IM modules since these will be the components used in the
case study presented in Section 4.
Eicient PMC of Smart Building Maintenance using FMTs BuildSys ’17, November 8–9, 2017, Del, Netherlands
Denition 3.1. e set
of FMT elements consists of the follow-
ing tuples. Here,
are natural numbers,
take binary values,
Tde д,Tcl n
Trp lc ,Tr e p ,Toh
Tinsp R0
are deterministic delays, and γR0is a rate.
(EBE ,Td eд,Tc l n ,Tr pl c ,N)
represent the extended basic
events with
discrete degradation levels, each of which
degrade with a time delay equal to
Tde д
. It also takes as
inputs the time taken to restore the EBE to the previous
degradation level
Tcl n
when cleaning is performed and
the time taken to restore the EBE to its initial state
Trp l c
following a replacement action.
(RDE P ,n,γ,in,Td e д)
represents the RDEP gate with
pendent children, acceleration rate
, the input
activates the gate and
Tde д
the degradation rate of the
dependent children.
(OR,n)represents the OR gate with ninputs.
(RM,n,Tr ep ,Toh ,Tin sp ,Tc l n ,Tr pl c ,thresh,trig)
the RM module which acts on
EBEs (in our case, this cor-
responds to all the EBEs in the FMT). e RM can either be
triggered periodically to perform a cleaning action, every
Tr ep
delay, or a replacement action, every
delay, or by
the IM when the delay
has elapsed and the
condition is met. e time to perform a cleaning action
Tcl n
, while the time taken to perform a replacement is
Trp l c
. e
signal ensures that when the component is
not in the degraded states, no unnecessary maintenance
actions are carried out.
(IM,n,Ti ns p ,Tcl n ,Tr pl c ,thresh)
represents the IM module
which acts on
EBEs (in our case, this corresponds to all the
EBEs in the FMT). e IM initiates a repair depending on
the current state of the EBE. Inspections are performed in a
periodic manner, every
. If during an inspection, the
current state of the EBE does not correspond to the new or
failed state (i.e. the degradation level of the inspected EBE
is below a certain threshold), the
thr esh
signal is activated
and is sent to the RM. Once a repair action is performed the
IM moves back to the initial state with a delay equal to
Tcl n
Trp l c
depending on the maintenance action performed.
represents the DELAY module which takes
two inputs representing the deterministic delay
T∈ {Tdeд,
Tcl n ,Tr pl c ,Tr ep ,Toh ,Tins p }
to be approximated using an
Erlang distribution with
number of states. is DELAY
module can be extended by inclusion of a reset transition
label, which when triggered restarts the approximation of
the deterministic delay before it has elapsed. e extended
DELAY module is referred to as (DELAY ,T,N)ex t .
e FMT is dened as a special type of directed acyclic graph
where the vertices
represent the gates and the events
which represent an occurrence within the system, typically the
failure of a subsystem down to an individual component level, and
the edges
which represent the connections between vertices.
Events can either represent the EBEs or intermediate events which
are caused by one or more other events. e event at the top
of the FMT is the top event (TE) and corresponds to the event
being analysed - modelling the failure of the (sub)system under
consideration. e EBE are the leaves of the DAG. For
to be a
well-formed FMT, we take the following assumptions (i) vertices
are composed of the OR, RDEP gates, (ii) there is only one top event,
(iii) RDEP can only be triggered by EBEs and (iv) RM and IM are
not part of the DAG tree but are modelled separately
. is DAG
formulation allows us to propose a framework in Subsection 3.5
such that we can eciently perform probabilistic model checking.
Denition 3.2. A fault maintenance tree is a directed acyclic
graph G=(V,E)composed of vertices Vand edges E.
3.2 Semantics of FMT elements
Next, we provide the CTMC semantics for each FMT element $f \in \mathcal{F}$. These elements are then instantiated based on the underlying FMT structure to form the semantics of the whole FMT in CTMC form.
DELAY. We define the semantics for the $(DELAY, T, N)$ using Figure 2(a) and describe the corresponding CTMC using the set of states given by $D = \{d_0, d_1, \ldots, d_{N+1}\}$, the initial state $d_0$, the set of transition labels $TL = \{trigger, move\}$, the set of atomic propositions $AP = \{T\}$, and the labelling function $L(d_0) = \cdots = L(d_N) = \emptyset$ and $L(d_{N+1}) = \{T\}$. The rate matrix $R$ becomes clear from Figure 2(a):
$$R_{ij} = \begin{cases} \mu & i = 0 \wedge j = 1,\\ \frac{N}{T} & 1 \le i \le N \wedge j = i + 1,\\ 0 & \text{otherwise,} \end{cases} \tag{1}$$
with $i$ representing the current state, $j$ the next state and $\mu$ a fixed large value corresponding to introducing a negligible delay, which is used to trigger all the DELAY modules at the same time (cf. Definition 2.1). In Figure 2(b) we define the semantics of $(DELAY, T, N)_{ext}$. This results in the CTMC described using the state space $D = \{d_0, d_1, \ldots, d_{N+1}\}$, the initial state $d_0$, the set of transition labels $TL = \{trigger, move, reset\}$, the set of atomic propositions $AP = \{T\}$, the labelling function $L(d_0) = \cdots = L(d_N) = \emptyset$ and $L(d_{N+1}) = \{T\}$, and the rate matrix $R$ where
$$R_{ij} = \begin{cases} \mu & i = 0 \wedge j = 1 \ (trigger),\\ \frac{N}{T} & 1 \le i \le N \wedge j = i + 1 \ (move),\\ 1 & \text{for the } reset \text{ transitions},\\ 0 & \text{otherwise,} \end{cases} \tag{2}$$
with $i$ representing the current state and $j$ the next state. In both instances, the deterministic delay is approximated using an Erlang distribution [ ] and all DELAY modules are synchronised to start together using the trigger transition label. The extended DELAY module has the transition label reset, which restarts the Erlang distribution approximation whenever the guard condition is met, at a rate of 1 multiplied by the rate coming from the use of synchronisation with other modules causing the reset to occur (as explained in Subsection 3.3). This is required when a maintenance action is performed which restores the EBE’s state back to the original state and thus restarts the degradation process, before the degradation time has elapsed.
Note, for dierent FMT structure same RM and IM modules are used, thus RM and
IM modules are independent of FMT structure
Remark 1. e basic properties of an Erlang distribution: A ran-
dom variable
has an Erlang distribution with
and a rate
λR+,ZErl anд(k,λ)
, if
Z=Y1+Y2+. . . Yk
is exponentially distributed with rate
. e cumulative den-
sity function of the Erlang distribution is characterised using,
n!exp(λt)(λt)nfor t,λ0 (3)
and for
1, the Erlang distribution simplies to the exponential
distribution. In particular, the sequence
ZkErl anд(k,λk)
to the deterministic value
for large
. us, we can approximate a
deterministic delay
with a random variable
ZkErl anд(k,k
Note, there is a trade-o between the accuracy and the resulting blow-
up in size of the CTMC model for larger values of
(a factor of
increase in the model size) [
]. In this work, the Erlang distribution
will be used to model the xed degradation rates, the maintenance
and inspection signals. is is a similar approach taken in [
] where
degradation phases are approximated by an (k,
)-Erlang distribution.
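The convergence claim in Remark 1 can be checked numerically. The following is a minimal sketch, assuming the standard Erlang cumulative distribution function; the helper names `erlang_cdf` and `delay_cdf` and the chosen delay of 10 time units are hypothetical. It measures how much probability mass of Erlang(k, k/T) falls within 20% of the target delay T as k grows.

```python
import math


def erlang_cdf(t: float, k: int, lam: float) -> float:
    """CDF of Erlang(k, lam): 1 - sum_{n=0}^{k-1} exp(-lam*t) * (lam*t)**n / n!."""
    if t <= 0:
        return 0.0
    x = lam * t
    term = math.exp(-x)          # n = 0 term of the sum
    total = term
    for n in range(1, k):
        term *= x / n            # builds (lam*t)**n / n! incrementally
        total += term
    return 1.0 - total


def delay_cdf(t: float, T: float, k: int) -> float:
    """CDF of the Erlang(k, k/T) approximation of a deterministic delay T."""
    return erlang_cdf(t, k, k / T)


def mass_near_delay(T: float, k: int, tol: float = 0.2) -> float:
    """Probability that the approximated delay lies within +/- tol*T of T."""
    return delay_cdf((1 + tol) * T, T, k) - delay_cdf((1 - tol) * T, T, k)


T = 10.0  # hypothetical deterministic delay
for k in (1, 10, 100):
    print(k, round(mass_near_delay(T, k), 3))
```

The mean of the approximation stays at T for every k, while the mass concentrates around T as k increases, at the price of k extra states per DELAY module.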
(a) CTMC representing DELAY with $N$ states used to approximate a delay equal to $T$, approximated using $Erlang(N, \frac{N}{T})$. The transition labels $TL = \{trigger, move\}$ are shown on each of the transitions. The state labels are not shown and the initial state of the CTMC is pointed to using an arrow labelled with start.
(b) CTMC representing the extended DELAY with $N$ states used to approximate a delay equal to $T$. Delay approximated using $Erlang(N, \frac{N}{T})$. The transition labels $TL = \{trigger, move, reset\}$ are shown on each of the state transitions, while the state labels are not shown.
Figure 2: CTMC for (a) DELAY and (b) DELAY with reset guard.
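To make the structure of the basic DELAY module concrete, the sketch below builds its rate matrix as a nested list. It assumes the layout suggested by Figure 2(a): one trigger transition from d_0 to d_1 at a large rate mu, followed by N move transitions at rate N/T. The function name and the default value of mu are illustrative assumptions, not values from the paper.

```python
def delay_rate_matrix(T: float, N: int, mu: float = 1e6):
    """Rate matrix over states d_0 .. d_{N+1} of a basic DELAY module.

    Assumed layout: trigger d_0 -> d_1 at rate mu (negligible delay),
    then move d_i -> d_{i+1} at rate N/T for i = 1..N.
    """
    if T <= 0 or N < 1:
        raise ValueError("T must be positive and N at least 1")
    size = N + 2
    R = [[0.0] * size for _ in range(size)]
    R[0][1] = mu                     # trigger transition
    for i in range(1, N + 1):
        R[i][i + 1] = N / T          # move transitions (Erlang stages)
    return R


def expected_delay(R) -> float:
    """Expected time from d_1 to the final state: sum of mean stage durations."""
    return sum(1.0 / R[i][i + 1] for i in range(1, len(R) - 1))


R = delay_rate_matrix(T=10.0, N=4)
print(len(R), expected_delay(R))
```

The expected time spent in the N move stages equals T regardless of N, which is what makes the chain usable as a stand-in for the deterministic delay.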
Extended Basic Events (EBE). The EBE are the leaves of the FMT and incorporate the component’s degradation model. EBE are a function of the total number of degradation steps $N$. Figure 3 shows the semantics of the $(EBE, T_{deg}, T_{cln}, T_{rplc}, N = 3)$. The corresponding CTMC is described by the tuple $(S_{EBE}, s_0, TL_{EBE}, AP_{EBE}, L_{EBE}, R_{EBE})$ where $s_0$ is the initial state, $TL_{EBE} = \{degrade_i,\ i \in \{0, \ldots, N\},\ \text{perform clean},\ \text{perform replace}\}$, the atomic propositions $AP_{EBE} = \{new, thresh, failed\}$, the labelling function $L_{EBE}$ (which labels the last state with $\{failed\}$) and
$$R_{EBE} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}.$$
Figure 3: CTMC representing the EBE with $N = 3$ with the transition labels $TL_{EBE} = \{degrade_i,\ i \in \{1, 2, 3\},\ \text{perform clean},\ \text{perform replace}\}$ on each of the state transitions. The perform clean and perform replace transitions have rate 1. The state labels are not shown and the initial state is pointed to by the arrow labelled with start.
The deterministic time delays taken as inputs are modelled using three different DELAY modules:
• an extended DELAY module approximating $T_{deg}$ with the transition label move replaced with $degrade_i$ such that synchronisation between the two CTMCs is performed (explained in Subsection 3.3). When $T_{deg}$ has elapsed the transition labelled with $degrade_N$ is triggered and the EBE moves to the next state at a rate equal to $\frac{N}{T_{deg}} \times 1$.² The reset transition label and corresponding transitions are replicated in the extended DELAY module and replaced with perform clean and perform replace. When the corresponding maintenance action is performed one of the transition labels is triggered and the state of the EBE moves to the previous state (if a cleaning action is carried out) or to the initial state (if a replace action is performed).
• a DELAY module approximating $T_{cln}$ with the transition label move replaced with perform clean. When $T_{cln}$ has elapsed the transition with transition label perform clean is triggered and the EBE moves to the previous state at a rate equal to $\frac{N}{T_{cln}}$.
• a DELAY module approximating $T_{rplc}$ with the transition label move replaced with perform replace. When $T_{rplc}$ has elapsed the transition having the transition label perform replace is triggered and the EBE moves to the initial state at a rate equal to $\frac{N}{T_{rplc}}$.
The transition labels perform clean and perform replace cannot be triggered at the same time and it is assumed that $T_{cln} \neq T_{rplc}$. This is a realistic assumption as only one maintenance action is performed at the same time.
RDEP gate. The RDEP gate has static semantics and is used in combination with the semantics of its $n$ dependent EBEs. When triggered ($in = 1$), i.e. when the associated EBE reaches the state labelled failed, the degradation rate of the $n$ dependent children is accelerated by a factor $\gamma$. We model the $in$ signal using,
$$in = \begin{cases} 1 & L(s_j) = failed,\\ 0 & \text{otherwise,} \end{cases} \tag{4}$$
where $L(s_j)$ is the label of the current state of the associated EBE. Similarly, we map the RDEP gate function using,
$$RA = \begin{cases} \gamma T_{deg_1}, \ldots, \gamma T_{deg_n} & in = 1,\\ T_{deg_1}, \ldots, T_{deg_n} & \text{otherwise,} \end{cases} \tag{5}$$
where $T_{deg_i}$, $i \in 1, \ldots, n$, corresponds to the degradation rate of the $n$ dependent children.³
² This is a direct consequence of synchronisation, where the rate of a synchronised transition is the product of the individual rates. Refer to Subsection 3.3.
OR gate. The OR gate indicates a failure when any of its input nodes has failed. It also does not have semantics itself but is used in combination with the semantics of its $n$ dependent input events (EBEs or intermediate events). We use,
$$OR = \begin{cases} 0 & E_1 = 0 \wedge \cdots \wedge E_n = 0,\\ 1 & \text{otherwise,} \end{cases} \tag{6}$$
where $E_i = 1$, $i \in 1, \ldots, n$, corresponds to when the $n$ events, connected to the OR gate, represent a failure in the system. In the case of EBEs, $E_1 = 1$ occurs when the EBE reaches the failed state.
Repair module (RM). Figure 4(a) shows the semantics of $(RM, n, T_{rep}, T_{oh}, T_{insp}, T_{cln}, T_{rplc}, thresh, trig)$. The CTMC is described using the state space $\{rm_0, rm_1\}$, the initial state $rm_0$, the transition labels $TL_{RM}$ = {inspect, check clean, check replace, trigger clean, trigger replace}, the atomic propositions $AP = \{maintenance\}$, the labelling function $L_{RM}$ and with $R_{RM} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$. For the sake of clarity in Figure 4(a), we used the transition labels check maintenance and trigger maintenance. The transition label check maintenance and corresponding transitions are replicated and the transition labels replaced by check clean or check replace to allow for both types of maintenance checks. Similarly, the transition label trigger maintenance and corresponding transitions are duplicated and the transition labels replaced by trigger clean or trigger replace to allow the initiation of both types of maintenance actions to be performed. Due to synchronisation, only one of the transitions may trigger at any time instance (as explained in Subsection 3.3). The transition labels trigger clean and trigger replace correspond to the transition label trigger within the DELAY module approximating the deterministic delays $T_{cln}$ and $T_{rplc}$ respectively. The deterministic delays which trigger inspect, check clean or check replace correspond to when the time delays $T_{insp}$, $T_{rep}$ and $T_{oh}$ respectively, have elapsed. All these signals are generated using individual DELAY modules with the move transition label for each module replaced using inspect, check clean or check replace respectively. The $thresh$ signal is modelled using,
$$thresh = \begin{cases} 1 & L(s_{j,1}) = thresh \vee \cdots \vee L(s_{j,n}) = thresh,\\ 0 & \text{otherwise,} \end{cases} \tag{7}$$
where $j \in 0, \ldots, N$, $i \in 1, \ldots, n$ and $L(s_{j,i})$ correspond to the label of the current state $j$ of each of the $n$ EBEs. Similarly, we model the $trig$ signal using
$$trig = \begin{cases} 1 & L(s_{j,1}) \neq new \vee \cdots \vee L(s_{j,n}) \neq new,\\ 0 & \text{otherwise.} \end{cases} \tag{8}$$
Both signals act as guards which when triggered determine which transition to perform (cf. Fig. 4(a)).
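The two guard signals are simple Boolean functions of the current EBE labels. The sketch below is one possible reading of the thresh and trig definitions, assuming each EBE exposes exactly one current label drawn from new, thresh and failed; the function names and the list-of-strings encoding are hypothetical.

```python
from typing import Sequence


def thresh_signal(labels: Sequence[str]) -> int:
    """1 if at least one EBE is currently in a state labelled 'thresh', else 0."""
    return 1 if any(label == "thresh" for label in labels) else 0


def trig_signal(labels: Sequence[str]) -> int:
    """1 if at least one EBE has left the state labelled 'new', else 0."""
    return 1 if any(label != "new" for label in labels) else 0


# Hypothetical snapshot of three EBEs.
snapshot = ["new", "thresh", "new"]
print(thresh_signal(snapshot), trig_signal(snapshot))
```

When every EBE is still labelled new, trig evaluates to 0, so a periodic maintenance check does not start an unnecessary repair.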
Note, this eectively results in changing the deterministic delay being modelled by
the DELAY module to a new value if in =1.
Inspection module (IM). The semantics of the $(IM, n, T_{insp}, T_{cln}, T_{rplc}, thresh)$ is depicted in Figure 4(b). The CTMC is defined using the tuple $(\{im_0, im_1\}, im_0, TL_{IM}, AP_{IM}, L_{IM}, R_{IM})$. Here, $TL_{IM}$ = {inspect, perform clean, perform replace} and $R_{IM} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$. The $thresh$ signal corresponds to the same signal used by the RM, given using (7). In Figure 4(b), for clarity, we use the transition label perform maintenance. This transition label and corresponding transitions are duplicated and the transition labels are replaced by either perform clean or perform replace to allow for both types of maintenance actions to be performed when one of them is triggered using synchronisation. The same DELAY modules used in the RM and EBE to represent the deterministic delays are used by the IM. The DELAY modules used to represent the deterministic delays $T_{cln}$ and $T_{rplc}$ have the transition labels perform clean and perform replace. This represents that the maintenance action has completed.
(a) CTMC representing the RM with $TL_{RM}$ = {inspect, check maintenance, perform maintenance} shown on the state transitions. The guard condition $trig = 0/1$ or $thresh = 0/1$ must be satisfied for the corresponding transition to trigger when it is activated via synchronisation with the transition label.
(b) CTMC representing the IM with $TL_{IM}$ = {inspect, perform maintenance} shown on the state transitions. The guard condition $trig = 0$ and $thresh = 1$ must be satisfied for the corresponding transition to trigger when it is activated via synchronisation with the transition label.
Figure 4: CTMC for (a) RM and (b) IM.
3.3 Semantics of FMT
Next, we show how to obtain the semantics of an FMT from the semantics of its elements using the FMT syntax introduced in Subsection 3.1. We define the DAG $G$ by defining the vertices $V$ and the corresponding events. The leaves of the DAG are the events corresponding to the EBE. The events are connected to the vertices $V$, which trigger the corresponding auxiliary function used to represent the semantics of the gates. The events connected to the RM and IM are initiated by triggering the auxiliary functions given using (7) and (8) respectively. Based on the structure of $G$, we compute the corresponding CTMC by applying parallel composition of the individual CTMCs representing the elements of the FMT. The parallel composition formulae are derived from [ ] and defined as follows,
Denition 3.3 (Interleaving Synchronization). e interleaving
synchronous product of
C1||C2=(S1×S2,(s01,s02 ),TL1TL2,AP1
AP2,L1L2,R)where Ris given by:
,and s2
Denition 3.4 (Full Synchronization). e full synchronous prod-
uct of
where Ris given by:
1and s2
For any pair of states, synchronisation is performed either using interleaving or full synchronisation. For full synchronisation, as in Definition 3.4, the rate of a synchronous transition is defined as the product of the rates for each transition. The intended rate is specified in one transition and the rate of the other transition(s) is specified as 1. For instance, the RM synchronises using full synchronisation with the DELAY modules representing $T_{rep}$ and $T_{rplc}$ and therefore, to perform synchronisation between the RM and the DELAY modules, the rates of all the transitions of the RM should have a value of 1 (cf. Fig. 4(a)), while the rates of the DELAY modules represent the actual rates (cf. Fig. 2). The same principle holds for the EBEs and the IM. We refer the reader to Table 1 for the synchronisation between the FMT components and the method employed during the parallel composition.
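The rate-product rule for full synchronisation and the pass-through rule for interleaving can be sketched for two explicitly listed CTMCs. This is a minimal illustration under assumed data structures (a CTMC as a list of states plus a list of `(source, label, rate, target)` tuples); the `compose` function and the toy DELAY and EBE fragments are hypothetical and much smaller than the modules in the paper.

```python
from itertools import product


def compose(c1, c2, sync_labels):
    """Parallel composition of two CTMCs given as (states, transitions).

    Labels in sync_labels use full synchronisation (both move, rates multiply);
    all other labels are interleaved (one component moves, the rate is unchanged).
    """
    states1, trans1 = c1
    states2, trans2 = c2
    states = list(product(states1, states2))
    transitions = []
    for s1, s2 in states:
        for src, label, rate, dst in trans1:
            if src != s1:
                continue
            if label in sync_labels:
                for src2, label2, rate2, dst2 in trans2:
                    if src2 == s2 and label2 == label:
                        transitions.append(((s1, s2), label, rate * rate2, (dst, dst2)))
            else:
                transitions.append(((s1, s2), label, rate, (dst, s2)))
        for src2, label2, rate2, dst2 in trans2:
            if src2 == s2 and label2 not in sync_labels:
                transitions.append(((s1, s2), label2, rate2, (s1, dst2)))
    return states, transitions


# Toy fragments: the DELAY carries the real rate, the EBE uses rate 1.
delay = (["d1", "d2"], [("d1", "perform_clean", 0.5, "d2")])
ebe = (["s0", "s1"], [("s0", "degrade", 0.1, "s1"), ("s1", "perform_clean", 1.0, "s0")])

states, transitions = compose(delay, ebe, sync_labels={"perform_clean"})
for t in transitions:
    print(t)
```

Because the EBE side of the synchronised transition has rate 1, the composed transition keeps the rate specified in the DELAY fragment, which mirrors the convention described above.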
Example. Consider a simple example showing the time signals and synchronisations required for modelling an EBE and the RM and IM. The EBE has a degradation rate equal to $T_{deg}$ and we limit the functionality of the RM and IM by allowing only the maintenance action to perform cleaning. We also need the corresponding DELAY modules generating the degradation rates, $T_{deg}$, and the maintenance rates $T_{cln}, T_{insp}, T_{rep}$. The resulting CTMC is obtained by performing a parallel composition of the components
$$C_{all} = C_{EBE} \,||\, C_{T_{deg}} \,||\, C_{RM} \,||\, C_{IM} \,||\, C_{T_{cln}} \,||\, C_{T_{insp}} \,||\, C_{T_{rep}}.$$
The resulting state space is then
$$S_{all} = S_{EBE} \times S_{T_{deg}} \times S_{RM} \times S_{IM} \times S_{T_{cln}} \times S_{T_{insp}} \times S_{T_{rep}}.$$
The synchronisation between the different components is shown in Figure 5 and proceeds as follows:
• All the DELAY modules (except $T_{cln}$) start at the same time using the trigger transition label.
• When the extended DELAY module generating the $T_{deg}$ delay elapses, the corresponding EBE moves to the next state through synchronisation with the transition label $degrade_N$.
• The clock signals $T_{rep}, T_{insp}$ represent periodic maintenance and inspection actions and when the deterministic delay is reached, through synchronisation with the transition label check clean or inspect, the RM or IM module is triggered (cf. Fig. 4(a) and 4(b)). If the RM triggers a maintenance action, the DELAY representing $T_{cln}$ is triggered using the synchronisation label trigger clean. Once the deterministic delay $T_{cln}$ elapses, the EBE, the extended DELAY module representing $T_{deg}$ (where the reset label within the extended DELAY module is replaced with perform clean) and the IM are reset using the transition label perform clean.
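The size of the product state space in this example grows quickly with the number of Erlang stages. The sketch below computes the size of the Cartesian product for the seven components of the example, assuming an EBE with 4 states (N = 3), two states each for the RM and IM, and k + 2 states per DELAY module; these component sizes and the function names are assumptions made for illustration, and the reachable state space explored by a model checker may be smaller than this product.

```python
from math import prod


def product_state_count(component_sizes):
    """Size of the Cartesian product S_1 x S_2 x ... x S_m."""
    return prod(component_sizes)


def example_component_sizes(k: int, ebe_states: int = 4,
                            rm_states: int = 2, im_states: int = 2):
    """State counts in the order EBE, T_deg, RM, IM, T_cln, T_insp, T_rep."""
    delay_states = k + 2  # states d_0 .. d_{k+1} of one DELAY module
    return [ebe_states, delay_states, rm_states, im_states,
            delay_states, delay_states, delay_states]


for k in (1, 3, 10):
    print(k, product_state_count(example_component_sizes(k)))
```

Since four DELAY modules appear in the product, the count scales with the fourth power of k + 2, which is the blow-up that motivates the abstraction step.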
Figure 5: Block diagram showing the synchronisation connections between one component and the other, together with the corresponding transition labels which trigger synchronisation.
Remark 2. One should note that this results in the requirement of a large state space, which is a function of the number of states used to approximate the deterministic delays. Thus, to counteract this effect we propose an abstraction framework in Subsection 3.5.
3.4 Metrics
We use PRISM to compute the metrics of the model described in Subsection 2.1. The metrics can be expressed using the extended Continuous Stochastic Logic (CSL) as follows:
• Reliability: This can be expressed as the complement of the probability of failure over the time $T$, $1 - P_{=?}[F^{\le T}\ \mathit{failed}]$.
• Availability: This can be expressed as $R_{=?}[C^{\le T}]/T$, which corresponds to the cumulative reward of the total time spent in states labelled with okay and thresh during the time $T$.
• Expected cost: This can be expressed using $R_{=?}[C^{\le T}]$, which corresponds to the cumulative reward of the total costs (operational, maintenance and failure) within the time $T$.
• Expected number of failures: This can be expressed using $R_{=?}[C^{\le T}]$, which corresponds to the cumulative transition reward that counts the number of times the top event enters the failed state within the time $T$.
3.5 Decomposition of FMTs
The use of CTMCs and deterministic time delays results in the requirement of a large state space for modelling the whole FMT (cf. Remark 2). We therefore propose an approach which decomposes the large FMT into an equivalent abstract CTMC which can be analysed using PRISM. The process involves two transformation steps. First, we convert the FMT into the equivalent directed acyclic graph (DAG) and split this graph into a set of smaller sub-graphs. Second, we transform the sub-graphs into the equivalent CTMC by making use of the developed FMT component semantics (cf. Subsec. 3.2), and performing parallel composition of the individual FMT components based on the underlying structure of the sub-graph. The smaller sub-graphs are then sequentially recomposed to generate the higher level abstract FMT. Figure 6 depicts a high-level diagram of the decomposition procedure.

| Component | Synchronised with component | Transition label | Synchronisation method |
| --- | --- | --- | --- |
| DELAY representing $T_{deg}$ | DELAY modules representing $T_{cln}$, $T_{rplc}$, $T_{insp}$ | trigger | Full synchronisation |
| RM | DELAY module representing $T_{rep}$ | trigger clean | Full synchronisation |
| RM | DELAY module representing $T_{oh}$ | trigger replace | Full synchronisation |
| EBE | DELAY representing $T_{deg}$ | $degrade_N$ | Full synchronisation |
| DELAY representing $T_{cln}$ | RM, EBE | check clean | Full synchronisation |
| DELAY representing $T_{rplc}$ | RM, EBE | check replace | Full synchronisation |
| DELAY representing $T_{insp}$ | RM, IM | inspect | Full synchronisation |
| DELAY representing $T_{rep}$ | RM, IM, EBE | perform clean | Full synchronisation |
| DELAY representing $T_{oh}$ | RM, IM, EBE | perform replace | Full synchronisation |
| EBE | RM, IM, all DELAY modules, other EBEs | - | Interleave synchronisation |

Table 1: Synchronisation between the different FMT components and the synchronisation method used.
Figure 6: Overall developed framework for decomposition of FMTs into the equivalent abstract CTMCs.
Conversion of original FMT to the equivalent graph. The FMT is a DAG (cf. Subsection 3) and in this framework we need to apply a transformation to the DAG in the presence of an RDEP gate, such that we can perform the decomposition. The RDEP causes an acceleration of events on dependent child nodes when the input node fails. In order to capture this feature in a DAG, we need to duplicate the input node such that it is connected directly to the RDEP vertex. This allows us to capture when the failure of the input occurs and the corresponding acceleration of the children. This is reasonable as the same RM and IM are used irrespective of the underlying FMT structure.
Graph decomposition. We define modules within the DAG as sub-trees composed of at least two events which have no inputs from the rest of the tree and no outputs to the rest except from its output event [ ]. We can divide the graph into multiple partitions based on the number of modules making up the DAG. We define the following notation to ease the description of the algorithm:
$V_o$ indicates whether the node is the top node of the DAG.
$V_g$ indicates the node where the graph split is performed.
Modules correspond to sub-graphs in the DAG.
We set $V_o$ when we construct the DAG from the FMT and then proceed with executing Algorithm 1. We first identify all the sub-graphs within the whole DAG and label all the top nodes of each sub-graph as $V_T$. We loop through each sub-graph and its immediate child (the sub-graph at the immediate lower level) and, at the point where the sub-graph and child are connected, the two graphs are split and a new node $V_g$ is introduced. Thus, executing Algorithm 1 results in a set of sub-graphs linked together by the labelled nodes $V_g$. For each of the lower level sub-graphs we now proceed to compute the mean time to failure (MTTF). This will serve as an input to the higher-level sub-graphs such that metrics for the abstract equivalent CTMC can be computed.
PMC of sub-graphs. We start from the bottom level sub-graphs and perform the conversion to CTMC using the formal models presented in Subsection 3.2. The formal models have been built into a library of PRISM modules and, based on the underlying components and structure making up the sub-graph, the corresponding individual formal models are converted into the sub-graph's equivalent CTMC by performing parallel composition (cf. Subsec. 3.3). For each sub-graph, we compute the probability of failure $D_e(T)$ at time $T$, from which we calculate the MTTF using
$$\mathrm{MTTF} = \frac{-T}{\ln(1 - D_e(T))}.$$
BuildSys ’17, November 8–9, 2017, Delft, Netherlands N. Cauchi et al.
Algorithm 1: DAG decomposition algorithm
    input:  DAG $G = (V, E)$
    output: Set of sub-graphs with one of the end nodes labelled as $V_g$.
    Identify sub-graphs using 'depth-first' traversal
    Label all top nodes of each sub-graph $i$ as $V_{T_i}$
    forall (select the top node of every sub-graph and the immediate child defined at the immediate lower level) do
        if label $V_T$ already found in one of the leaf nodes of sub-graph then
            Split sub-graph
            Insert new node $V_g$ which will be used as input from connected sub-graph
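The splitting step can be sketched in Python as follows. This is a simplified illustration rather than the procedure of Algorithm 1 itself: it detects modules with a direct (quadratic) reachability check instead of a single depth-first labelling pass, and the graph encoding, the `Vg:` placeholder naming and all function names are assumptions made for the example.

```python
def descendants(graph, root):
    """All nodes reachable from root (excluding root) by depth-first search."""
    seen, stack = set(), list(graph.get(root, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def find_module_roots(graph, top):
    """Gates below `top` whose descendants are only reachable through them."""
    parents = {}
    for node, children in graph.items():
        for child in children:
            parents.setdefault(child, set()).add(node)
    roots = set()
    for node in descendants(graph, top):
        below = descendants(graph, node)
        inside = below | {node}
        if below and all(parents[d] <= inside for d in below):
            roots.add(node)
    return roots


def decompose(graph, top):
    """Split the DAG at every module root; the cut is marked by a 'Vg:' leaf."""
    module_roots = find_module_roots(graph, top)
    sub_graphs = {}
    for root in [top] + sorted(module_roots):
        sub, stack = {}, [root]
        while stack:
            node = stack.pop()
            kids = []
            for child in graph.get(node, []):
                if child in module_roots:
                    kids.append("Vg:" + child)   # input from the lower sub-graph
                else:
                    kids.append(child)
                    if graph.get(child):
                        stack.append(child)
            sub[node] = kids
        sub_graphs[root] = sub
    return sub_graphs


# Hypothetical tree: gates map to their inputs, leaves have no entry.
dag = {"TOP": ["G1", "G2"], "G1": ["e1", "e2"],
       "G2": ["e3", "G3"], "G3": ["e4", "e5"]}
for root, sub in decompose(dag, "TOP").items():
    print(root, sub)
```

A gate that shares a basic event with another branch is not treated as a module and stays inside its parent sub-graph. This reflects the requirement that a module has no inputs from, or outputs to, the rest of the tree other than through its output event.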
The MTTF serves as the input to the higher level sub-graph at time $T$. The new node in the higher-level sub-graph now degrades with a new time delay $T_{deg} = \mathrm{MTTF}$, which is fed into the corresponding DELAY component. This process is repeated for all the different sub-graphs until the top level node $V_o$ is reached.
PMC of final equivalent abstract CTMC. On reaching the top level node $V_o$, we compute the metrics for the equivalent abstract CTMC for a specific time horizon $T$. For different horizons, the previous step of computing the MTTF for the underlying lower level sub-graphs needs to be repeated. Using this technique, we can formally verify larger FMTs while using less memory and computational time, due to the significantly smaller state space of the underlying CTMCs. Next, we proceed with an illustrative example comparing the process of directly modelling the large FMT using CTMCs versus the decompositional modelling procedure. Figure 7 presents the FMT composed of two modules and the corresponding abstracted FMT. The abstract FMT is a pictorial representation of the model represented by the equivalent abstract CTMC obtained using the developed decomposition framework (cf. Fig. 6).

Figure 7: The original FMT and the abstract FMT corresponding to the equivalent abstract CTMC generated by the developed framework. The MTTF for F' is computed based on the probability of failure of the heating coil.

For both the large FMT and the equivalent abstract FMT, a comparison between the total number of states of the resulting CTMC models, the total time to compute the reliability metric and the resulting reliability metric is performed. All computations are run on a 2.3 GHz Intel Core i5 processor with 8 GB of RAM and the resulting statistics are listed in Table 2. The original FMT has a state space with 193543 states, while the equivalent abstract CTMC has a state space with 63937 states. This corresponds to a 67% reduction in the state space size. The total time to compute the reliability metric is a function of the final time horizon and a maximal 73% reduction in computation time is achieved. The accuracy of the reliability metric of the abstract model is also a function of the time horizon: the reliability computed by the abstract FMT differs from that of the original FMT by at most 0.61%.
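| Time horizon (years) | Original FMT: time to compute metric (mins) | Original FMT: reliability | Abstracted FMT: time to compute MTTF (mins) | Abstracted FMT: time to compute metric (mins) | Abstracted FMT: total time (mins) | Abstracted FMT: reliability |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.727 | 0.9842 | 0.142 | 0.181 | 0.223 | 0.9842 |
| 10 | 1.406 | 0.8761 | 0.219 | 0.309 | 0.528 | 0.8769 |
| 15 | 2.489 | 0.3290 | 0.292 | 0.622 | 0.914 | 0.3270 |

Table 2: Comparison between the original large FMT and the abstracted FMT.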
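The state-space and accuracy figures quoted for this example can be re-derived from the reported numbers with a few lines of arithmetic. The sketch below only restates values given in the text and in Table 2; the variable names are illustrative.

```python
# Values copied from the text and Table 2.
states_original, states_abstract = 193543, 63937
reliability = {5: (0.9842, 0.9842), 10: (0.8761, 0.8769), 15: (0.3290, 0.3270)}

# Relative reduction of the state space, in percent.
state_reduction = 100.0 * (states_original - states_abstract) / states_original

# Relative deviation of the abstract reliability from the original, in percent.
deviation = {h: 100.0 * abs(orig - abst) / orig
             for h, (orig, abst) in reliability.items()}

print(f"state space reduction: {state_reduction:.1f}%")
for horizon, dev in deviation.items():
    print(f"horizon {horizon:2d} years: reliability deviation {dev:.2f}%")
```

The largest deviation occurs at the longest horizon, which is consistent with the accuracy of the abstraction depending on the time horizon.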
We apply the FMT framework to a Heating, Ventilation and Air-conditioning (HVAC) system used to regulate a building's internal environment. The HVAC system under consideration for the FMT analysis is presented in Figure 8. It is composed of two circuits: the air flow circuit and the water circuit. The gas boiler heats up the supply water which is fed into the heat pump. The heat pump transfers the supply water into two sections, the supply air heating and cooling coils and the radiators, via the splitter. The rate of water flowing in the heating coil is controlled using a heating coil valve, while the rate of water flow in the radiator is controlled using a separate valve. The outside air is mixed with the extracted room air via the mixer. This is fed into the heating coil, which warms up the input air to the desired supply air temperature. This air is supplied back, at a rate controlled by the Air Handling Unit (AHU) dampers, into the zone via the supply fan. The radiators are directly connected to the water circuit and transfer the heat from the water into the zone. The return water is then passed through the collector and is returned to the boiler.

Based on this HVAC system we construct the corresponding FMT shown in Figure 9. The leaves of the tree are EBEs with discrete degradation rates computed using Table 3, approximated by the Erlang distribution (where $N$ is the number of degradation phases for the Erlang distribution and MTTF is the expected time to failure; cf. Remark 1). We choose an acceleration factor of 2 for the RDEP gate. The system is periodically repaired every 6 months ($T_{rep}$) and a major overhaul with a complete replacement of all components is carried out once every 20 years ($T_{oh}$). Weekly inspections are performed ($T_{insp}$), which return the components back to the previous state. Only cleaning actions are performed when inspections are carried out. The total time to perform a cleaning action is 1 day ($T_{cln}$), while performing a total replacement of components takes 7 days ($T_{rplc}$). The timing signals $\{T_{rep}, T_{oh}, T_{insp}, T_{cln}, T_{rplc}\}$ are all approximated using the Erlang distribution with $N = 3$. All maintenance actions are performed simultaneously on all components.
Eicient PMC of Smart Building Maintenance using FMTs BuildSys ’17, November 8–9, 2017, Del, Netherlands
Figure 8: High level schematic of an HVAC system.
Figure 9: FMT for failure in HVAC system with leaves represented using EBEs (associated RM and IM not shown in figure). The EBEs are labelled to correspond to the component failure they represent using the fault index presented in Table 3.
| Fault Index | Failure Mode | N | MTTF |
| --- | --- | --- | --- |
| 1 | Failure in cooling coil | 4 | 20 |
| 2 | Broken AHU Damper | 2 | 20 |
| 3 | Fan motor failure | 3 | 35 |
| 4 | Obstructed supply fan | 4 | 31 |
| 5 | Fan bearing failure | 6 | 17 |
| 6 | Radiator failure | 4 | 25 |
| 7 | Radiator stuck valve | 2 | 10 |
| 8 | Heater stuck valve | 2 | 10 |
| 9 | Failure in heat pump | 4 | 20 |

Table 3: Extended Basic events in FMT with associated degradation rates (N, MTTF) obtained from [6, 10].
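One way to turn the (N, MTTF) pairs of Table 3 into rates for the degradation phases is sketched below. It assumes the standard Erlang parametrisation, in which N sequential exponential phases with a total mean of MTTF each have rate N / MTTF. Whether the PRISM models use exactly this parametrisation is an assumption, and the time unit is simply that of the MTTF column.

```python
# (N, MTTF) pairs copied from Table 3, keyed by fault index.
TABLE_3 = {1: (4, 20), 2: (2, 20), 3: (3, 35), 4: (4, 31), 5: (6, 17),
           6: (4, 25), 7: (2, 10), 8: (2, 10), 9: (4, 20)}


def phase_rate(phases, mttf):
    """Rate of each exponential phase so that `phases` phases in sequence
    have a total mean of `mttf` (standard Erlang parametrisation)."""
    return phases / mttf


rates = {index: phase_rate(n, mttf) for index, (n, mttf) in TABLE_3.items()}
for index, rate in rates.items():
    n, mttf = TABLE_3[index]
    print(f"fault {index}: N={n}, MTTF={mttf}, rate per phase={rate:.4f}")
```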
4.1 Quantitative results
We make use of the developed framework (cf. Subsec. 3.5) and convert the FMT representing the failure of the HVAC system (cf. Fig. 9) into the equivalent abstract CTMC. The abstracted CTMC has a state space of 62779 states. Using our current computing set-up, the complex CTMC representing the whole FMT was not computable as it results in a state space explosion, which highlights the advantage of the developed framework. The process is performed over six time horizons (in years) with the maintenance policy consisting of periodic cleaning every 6 months, a major overhaul every 20 years and inspections on a weekly basis. For this set-up, the metrics corresponding to the reliability and availability of the HVAC system over the time horizon are computed and are shown in Figure 10. The maximal time taken to compute a metric using the abstract FMT is 1.47 minutes. It is deduced that both the reliability and availability reduce over time and that there is a saturation in the number of maintenance actions which one can perform before the system no longer achieves higher performance in reliability and availability.

Figure 10: Reliability and availability of HVAC over time horizon $N_r$. Panels: (a) Reliability of HVAC system; (b) Availability of HVAC system. Horizontal axis: Time (years).

Next, we compare the total cost of maintenance and the expected number of failures over the time horizon (in years) when considering different maintenance strategies, such that we can identify the maintenance strategy that minimises cost and the number of failures over time. We consider six different maintenance strategies which are listed in Table 4. The total maintenance cost to perform a repair is 100 [GBP], while a replacement costs 5000 [GBP]. We now compute the total expected maintenance costs and the total expected number of failures for each strategy. These are shown in Figure 11. The most effective strategy, which offers a good trade-off between maintenance costs and the expected number of failures, is achieved when repairs are carried out on a yearly basis, replacements are carried out every 20 years and inspections are carried out weekly (corresponding to M1). Furthermore, it can be seen that the frequency of inspections has a large effect on the total number of failures. When the frequency of inspection is low (as in M4 and M5), the expected number of component failures increases significantly. Note that reducing the frequency of repairs also results in an increase in the expected number of failures.

| Strategy index | $T_{rep}$ | $T_{oh}$ | $T_{insp}$ |
| --- | --- | --- | --- |
| M0 | 6 months | 20 years | 1 week |
| M1 | 12 months | 20 years | 1 week |
| M2 | 48 months | 20 years | 1 week |
| M3 | 6 months | 10 years | 1 week |
| M4 | 6 months | 20 years | 2 years |
| M5 | 6 months | 20 years | 5 years |

Table 4: Implemented maintenance strategies.

Figure 11: Comparison between the different maintenance strategies for an HVAC system. Panels: (a) Maintenance costs; (b) Expected number of failures. Horizontal axis: Time (years).
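For a rough feel of how the strategies in Table 4 differ in cost, the scheduled repairs and replacements can simply be counted over a horizon. The sketch below is a back-of-envelope calculation under strong assumptions: every scheduled action is carried out exactly once per period, and inspection-triggered actions, failures and operational costs are ignored. It is not the expected cost computed by the model checker, and the function and variable names are illustrative.

```python
REPAIR_COST, REPLACEMENT_COST = 100, 5000   # GBP, from the text

# (T_rep, T_oh) in months for the strategies of Table 4.
STRATEGIES = {"M0": (6, 240), "M1": (12, 240), "M2": (48, 240),
              "M3": (6, 120), "M4": (6, 240), "M5": (6, 240)}


def scheduled_cost(t_rep, t_oh, horizon_months):
    """Cost of the repairs and replacements scheduled within the horizon."""
    repairs = horizon_months // t_rep
    replacements = horizon_months // t_oh
    return repairs * REPAIR_COST + replacements * REPLACEMENT_COST


# 20-year horizon expressed in months.
costs = {name: scheduled_cost(t_rep, t_oh, 20 * 12)
         for name, (t_rep, t_oh) in STRATEGIES.items()}
print(costs)
```

Since M4 and M5 differ from M0 only in the inspection interval, this simple count cannot tell them apart. Their effect shows up in the expected number of failures, which requires the stochastic model.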
The paper has presented a methodology for applying probabilistic model checking to FMTs. The FMTs are modelled in the form of CTMCs, which simplifies the transformation of FMTs into formal models that can be analysed using PRISM. A novel technique for abstracting the equivalent CTMC model is also presented. The novel decomposition procedure tackles the issue of state space explosion and results in a significant reduction in both the state space size and the total time required to compute metrics. The framework has been applied to an HVAC system and the effect of applying different maintenance strategies has been presented. The presented framework can be further enhanced by adding more gates to the PRISM modules library, including the Priority-AND, INHIBIT and k/N gates, and by incorporating lumping of states as in [ ], such that the state space can be further reduced.
This work has been funded by the AMBI project under Grant No.: 324432, by the Alan Turing Institute, UK, post-doctoral research grant from Fonds de Recherche du Québec - Nature et Technologies (FRQNT) and Malta's ENDEAVOUR Scholarships Scheme.
[1] Vladimir Babishin and Sharareh Taghipour. 2016. Optimal maintenance policy for multicomponent systems with periodic and opportunistic inspections and preventive replacements. Applied Mathematical Modelling 40, 24 (2016), 10480–
[2] Francesca Boem, Riccardo MG Ferrari, Christodoulos Keliris, Thomas Parisini, and Marios M Polycarpou. 2017. A distributed networked approach for fault detection of large-scale systems. IEEE Trans. Automat. Control 62, 1 (2017), 18–33.
[3] Luca Bortolussi and Jane Hillston. 2012. Fluid approximation of CTMC with deterministic delays. In Quantitative Evaluation of Systems (QEST), 2012 Ninth International Conference on. IEEE, 53–62.
[4] Nathalie Cauchi, Karel Macek, and Alessandro Abate. 2017. Model-based predictive maintenance in building automation systems with user discomfort. Energy
[5] European Parliament and Council of the European Union. 2010. Directive 2010/31/EU. (2010).
[6] ASHRAE Handbook. 1996. HVAC systems and equipment. American Society of Heating, Refrigerating, and Air Conditioning Engineers, Atlanta, GA (1996).
[7] Holger Hermanns and Lijun Zhang. 2011. From Concurrency Models to Numbers. In Nato Science for Peace and Security Series. IOS Press.
[8] Khaza Anuarul Hoque, Otmane Ait Mohamed, and Yvon Savaria. 2015. Towards an accurate reliability, availability and maintainability analysis approach for satellite systems based on probabilistic model checking. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium,
[9] Khaza Anuarul Hoque, O Ait Mohamed, Yvon Savaria, and Claude Thibeault. 2014. Probabilistic model checking based DAL analysis to optimize a combined TMR-blind-scrubbing mitigation technique for FPGA-based aerospace applications. In Formal Methods and Models for Codesign (MEMOCODE), 2014 Twelfth ACM/IEEE International Conference on. IEEE, 175–184.
[10] Faisal I Khan and Mahmoud M Haddara. 2003. Risk-based maintenance (RBM): a quantitative approach for maintenance/inspection scheduling and planning. Journal of Loss Prevention in the Process Industries 16, 6 (2003), 561–573.
[11] Marta Kwiatkowska, Gethin Norman, and David Parker. 2007. Stochastic model checking. In International School on Formal Methods for the Design of Computer, Communication and Software Systems. Springer, 220–270.
[12] Marta Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. In Proc. 23rd International Conference on Computer Aided Verification (CAV'11) (LNCS), G. Gopalakrishnan and S. Qadeer (Eds.), Vol. 6806. Springer, 585–591.
[13] ZF Li, Yi Ren, LL Liu, and ZL Wang. 2015. Parallel algorithm for finding modules of large-scale coherent fault trees. Microelectronics Reliability 55, 10 (2015), 1400–1403. Proceedings of the 26th European Symposium on Reliability of Electron Devices, Failure Physics and Analysis, SI: Proceedings of ESREF 2015.
[14] Enno Ruijters, Dennis Guck, Peter Drolenga, and Mariëlle Stoelinga. 2016. Fault maintenance trees: reliability centered maintenance via statistical model checking. In Reliability and Maintainability Symposium (RAMS), 2016 Annual. IEEE,
[15] Ying Yan, Peter B Luh, and Krishna R Pattipati. 2017. Fault Diagnosis of HVAC Air-Handling Systems Considering Fault Propagation Impacts Among Components. IEEE Transactions on Automation Science and Engineering 14, 2 (April 2017),
[16] Olexandr Yevkin. 2015. An efficient approximate Markov chain method in dynamic fault tree analysis. Quality and Reliability Engineering International
[17] Håkan LS Younes, Marta Kwiatkowska, Gethin Norman, and David Parker. 2006. Numerical vs. statistical probabilistic model checking. International Journal on Software Tools for Technology Transfer 8, 3 (2006), 216–228.
... The specific HVAC set-up considered in this work is depicted in Fig. 2, and corresponds to the system found in the "smart buildings" laboratory at the Department of Computer Science, University of Oxford, first examined by Cauchi, Hoque, Abate, and Stoelinga (2017). It also corresponds to HVAC units found commonly in medium-sized buildings (Kim & Katipamula, 2017). ...
... These affect at least one subsystem component from the set-up depicted in Fig. 2, namely a heating coil, a supply fan, or a radiator. In Fig. 6 we present an FMT decomposition of the corresponding fault modes, derived from analyses presented in Cauchi et al. (2017) for this case study. ...
... In Cauchi et al. (2017), maintenance schemes distinguish between inspections, repair checks, and overhauls, which in our setting take place every half, two, and fifteen years respectively. We term this policy "full maintenance," and experiment also with "half maintenance," where the time periods are respectively one, four, and thirty years for inspections, repair checks, and overhauls. ...
Conference Paper
Full-text available
Cyber-physical systems must meet high RAMS-reliability, availability, maintainability, and safety-standards. It is of essence to implement robust maintenance policies that decrease system downtime in a cost-effective way. Power plants and smart buildings are prominent examples where the cost of periodic inspections is high, and should be mitigated without compromising system reliability and availability. Fault Maintenance Trees (FMTs), a novel extension in fault tree analysis, can be used to assess system resilience: FMTs allow reasoning about failures in the presence of maintenance strategies, by encoding fault modes in a comprehensible and "maintenance-friendly" manner. A main concern is how to build a concrete model from the FMT, in order to compute the relevant RAMS metrics via (ideally automatic) analyses. Formal methods offer automated and trustworthy techniques to tackle with such task. In this work, we apply quantitative model checking-a well established formal verification technique-to analyse the FMT of a Heating, Ventilation and AirConditioning unit from a smart building. More specifically , we model the FMT in terms of continuous-time Markov chains and priced time automata, which we respectively analyse using probabilistic and statistical model checking. In this way we are capable of automatically estimating the reliability, availability, expected number of failures, and differentiated costs of the FMT model for various time horizons and maintenance policies. We further contrast the two approaches we use, and identify their advantages and drawbacks.
... Considering the wide adaptation of FT towards threat modelling, several attempts have been made to improve the analysis of FT. Studies, such as Ref. [7][8][9][10] have used different statistical and probabilistic model checking approaches as a tool to enhance the analysis capabilities of underlying FT. Attack trees (AT) as one of the threat models in security engineering, evolved from fault trees [11] and were popularized by Schneier [12] who suggested using them as a way to model security threats and to perform quantitative security assessment using this convenient hierarchical representation utilizing bottom-up single parameter propagation [13]. ...
The complexity of socio-technical systems using Ambient Intelligence (AmI) and the Internet of Things (IoT) is growing exponentially, involving numerous entities, such as humans, infrastructures, and cyber systems. Achieving and maintaining a specified level of security and privacy in such systems is challenging and crucial. Attack Tree is a powerful technique used in safety and reliability engineering. In this paper, we attempted to enhance Attack Tree analysis by transforming it into a Markov Decision Process (MDP) model. We propose an algorithm to transform an Attack Tree into an MDP model. We argue that formal methods, such as probabilistic model checking can significantly improve the security analysis capabilities. Moreover, the mixture of MDP and probabilistic model checking can overcome the limitations of Attack Trees, such as state explosion, scalability, and manual interaction. We used a probabilistic model checker, namely PRISM to model an attack scenario and perform security analysis on it. To demonstrate the significance, we took a real-world use case and performed a probabilistic analysis on it. The results revealed that formal analysis can prove certain properties, which were not possible to verify using attack trees.
... Several FT extensions have been proposed that adds to the expressiveness of FT, for example, the Dynamic Fault Trees (DFTs, [19]), the Fault Maintenance Trees (FMTs, [20]), etc. FMT expands the FT with maintenance and repair policies. They have been used in many practical case studies, for example railway industries [21] and Heating, Ventilation and Air-conditioning (HVAC) [22]. Notably, in our paper, we extend the attackfault framework with the complex maintenance policies, many of these policies are adapted from the FMT framework, thus our work overarches the FMT framework with additional security risk assessment. ...
Full-text available
Modern day industrial control systems are overwhelmingly complex. These systems feature intricate interactions between the cyber and the physical components. At the same time, they need to be trustworthy and deliver their services continuously. Underpinning, a crucial industrial activity to ensure the dependability of such critical systems is through timely maintenance, inspections and repairs. Several strategies exist here: "fix it when it breaks" (reactive maintenance), monitor and maintain a system in pre-established time intervals (preventive maintenance), preventive action based upon detected symptoms of failures condition-based maintenance (CBM), etc. In literature, the question of optimal maintenance frequency have been a subject of intense study. However, most papers, do not take information security aspects into account. This paper provides an automated tool-supported quantitative risk analysis framework, Attack-Fault-Maintenance Trees, AFMTs, that will enable practitioners to make informed choice on: (a) identifying the critical component(s) necessary for uninterrupted systems; (b) a decision support system that will provide informed choices on policy measures, countermeasures and safeguards that will reduce the disruptions; (c) run the "what-if" scenarios to find the optimal trade-offs between system attributes (safety, security, us-ability and maintenance). The front-end of the tool is a domain-specific language geared to represent the system architecture using graphical-constructs. The back-end of the framework remains hidden to the practitioner. It consists of a mathematical engine based on statistical model-checking techniques. A case study of oil-pipeline is used to demonstrate the efficacy of our framework.
... Secondly, in a relatively high-scale VRLE setup case, the AFT might grow too large, and pertinent attack/fault-tree reduction techniques will need to be employed. For instance, to reduce the size of the AFT in high-scale VRLE setup cases, popular techniques in the dynamic fault tree (DFT) domain such as graph rewriting [62], graph partitioning [63], or state space reduction through the bisimulation-based [64] technique can be employed. ...
Full-text available
Social Virtual Reality Learning Environments (VRLE) offer a new medium for flexible and immersive learning environments with geo-distributed users. Ensuring user safety in VRLE application domains such as education, flight simulations, military training is of utmost importance. Specifically, there is a need to study the impact of “immersion attacks” (e.g., chaperone attack, occlusion) and other types of attacks/faults (e.g., unauthorized access, network congestion) that may cause user safety issues (i.e., inducing of cybersickness ). In this article, we present a novel framework to quantify the security, privacy issues triggered via immersion attacks and other types of attacks/faults. By using a real-world social VRLE viz., vSocial and creating a novel attack-fault tree model, we show that such attacks can induce undesirable levels of cybersickness. Next, we convert these attack-fault trees into stochastic timed automata (STA) representations to perform statistical model checking for a given attacker profile. Using this model checking approach, we determine the most vulnerable threat scenarios that can trigger high occurrence cases of cybersickness for VRLE users. Lastly, we show the effectiveness of our attack-fault tree modeling by incorporating suitable design principles such as hardening , diversity , redundancy and principle of least privilege to ensure user safety in a VRLE session.
The emergence of COVID-19 pandemic is causing tremendous impact on our daily lives, including the way people interact with buildings. Leveraging the advances in machine learning and other supporting digital technologies, recent attempts have been sought to establish exciting smart building applications that facilitates better facility management and higher energy efficiency. However, relying on the historical data collected prior to the pandemic, the resulting smart building applications are not necessarily effective under the current ever-changing situation due to the drifts of data distribution. This paper investigates the bidirectional interaction between human and buildings that leads to dramatic change of building performance data distributions post-pandemic, and evaluates the applicability of typical facility management and energy management applications against these changes. According to the evaluation, this paper recommends three mitigation measures to rescue the applications and embedded machine learning algorithms from the data inconsistency issue in the post-pandemic era. Among these measures, incorporating occupancy and behavioural parameters as independent variables in machine learning algorithms is highlighted. Taking a Bayesian perspective, the value of data is exploited, historical or recent, pre- and post-pandemic, under a people-focused view.
Conference Paper
The increasing complexity of space missions, their software architectures, and hardware that has to meet the demands for those missions, imposes numerous new challenges for many engineering disciplines such as reliability engineering. Affected by the ever growing demand for more onboard computation power are the onboard computers. They in return require Fault Detection, Isolation, and Recovery (FDIR) architectures to support their fault tolerant operation in the harsh environment of space. Especially high performance commercial processing units face the challenge of dealing with negative radiation effects, which may significantly degrade their operation. To design performant and fault tolerant onboard computers, it is of high interest to assess the effectiveness of the FDIR architecture in the early phase of system design. This can be achieved using Fault Tree Analysis (FTA). However, to create complete fault trees manually is an error prone and labor intensive task. In this paper, the methodology for assessing the FDIR design of onboard computers in space systems, presented in [1], is refined by introducing a library of FDIR routines. The routines are modeled using fault trees and are composed into a software system fault tree using a basic fault model and a design configuration chosen by the reliability engineer. To assess the configurations, we give a heuristic based on a factor-criteria-metric model. We demonstrate the feasability of our approach on the basis of a case study on the rover of the Martian Moons eXploration (MMX) mission. Several FDIR configurations are studied and fault trees are generated for them. For the chosen case study, we obtain a reduction of up to 80% in terms of modeling effort.
Conference Paper
Abstract: Future space missions will demand greater capabilities regarding the processing of sensor data on onboard computers of satellites than current space technology can provide. Limited downlink bandwidth, high-resolution sensors and more rigid real-time control algorithms, dedicated to increasing satellite autonomy, drive the need for growing onboard computing performance. To overcome these challenges, new high-performance onboard computers are necessary, leading to an increased consideration of Commercial-Off-The-Shelf (COTS) components. The DLR project Scalable Onboard Computing for Space Avionics (ScOSA) targets these challenges with a complex onboard computer design consisting of space-qualified and COTS computing devices, arranged as a heterogeneous SpaceWire-interconnected grid computer in space. However, the utilization of COTS components in the harsh space environment imposes new challenges on the system. Therefore, Fault Detection, Isolation and Recovery (FDIR) mechanisms are important functionalities of systems like ScOSA. These enable the preservation of the demanded dependability levels for an embedded system in space. To ensure this dependability, the FDIR subsystem configuration requires a detailed analysis regarding potential faults in the system. For this purpose, we employed Dynamic Fault Tree (DFT) analysis, a methodology which is used to model faults and their temporal propagation through an onboard computer. With this paper, we contribute a new building block for showing the applicability of DFT analysis and for closing the gap between theory and practical application of DFTs. The quantitative results of the analysis of the contribution of the ScOSA FDIR subsystem to the overall system reliability are taken as a baseline for a discussion on how to effectively improve the system's reliability further. To showcase the methodology, an Earth-observation use case scenario in low Earth orbit is defined, and the processing system of the Xilinx Zynq SoC computing devices, reinforced by FDIR means, is evaluated with a DFT analysis.
This work presents a new methodology for quantifying the discomfort caused by non-optimal temperature regulation in a building automation system as a result of degraded biomass boiler operation. This discomfort is incorporated in a model-based dynamic programming algorithm that computes the optimal maintenance action for cleaning or replacing the boiler. A non-linear cleaning model is used to represent the different cleaning strategies undertaken by contractors. The maintenance strategy minimizes the total operational costs of the boiler, the cleaning costs and the newly defined discomfort costs, over a long-term prediction horizon that captures the short-term daily thermal comfort within the heating zone. The approach has been developed based on real data obtained from a biomass boiler at a Spanish school, and the resulting optimal maintenance strategies are shown to have the potential for significant energy and cost savings.
Computing the probability of the top event or the minimal cut sets of a fault tree is known to be an intractable, NP-hard problem. Modularization can be used to reduce the computational cost of basic operations on fault trees. The linear time algorithm, a very efficient and compact module-detection algorithm, visits the nodes one by one in a top-down, depth-first, left-most traversal of the tree. Its efficiency is therefore limited by visiting the nodes successively and serially, especially when confronting large-scale fault trees. To improve the efficiency of modularizing large-scale fault trees, this paper proposes a new parallel method to find all possible modules. First, we transform the fault tree into a directed acyclic graph (DAG) and treat the terminal basic nodes as entries of the algorithm. Then, according to the rules proposed in this paper, we traverse the graph bottom-up from the terminal nodes and mark the internal nodes in parallel. We can then compare all internal nodes and decide which nodes are modules. Finally, an experiment is carried out to compare the linear and parallel algorithms, and the result shows that the proposed parallel algorithm is efficient at handling large-scale fault trees.
Conference Paper
From navigation to telecommunication, and from weather forecasting to military or entertainment services, satellites play a major role in our daily lives. Satellites in the Medium Earth Orbit (MEO) and geostationary orbit have a life span of 10 years or more. Reliability, Availability and Maintainability (RAM) analysis of a satellite system is a crucial part of its design phase to ensure the highest availability and optimized reliability. This paper shows the formal modeling and verification of RAM-related properties of a satellite system. In a previously reported approach, the time between possible failures and the time between repairs are assumed to follow an exponential distribution, which does not represent a realistic scenario. In contrast, in our work, discrete time delays in the classical Continuous Time Markov Chain (CTMC) are approximated using the Erlang distribution. This is done by approximating a non-exponential holding time with several intermediate states based on a phase-type distribution. The RAM properties are then verified using the PRISM model checker. We compare our modeling results with those obtained with the previously reported approach and demonstrate improved modeling accuracy.
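The Erlang approximation mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `erlang_for_delay` and `erlang_cdf` are hypothetical, and the sketch assumes the standard construction in which a fixed delay d is replaced by k sequential exponential phases, each with rate k/d. The mean holding time stays at d while the variance d^2/k shrinks as k grows.

```python
import math


def erlang_for_delay(delay, k):
    """Return (rate, mean, variance) of k exponential phases approximating a fixed delay.

    Assumption: every phase uses the same rate k / delay (an Erlang-k distribution).
    """
    rate = k / delay
    mean = k / rate            # equals the target delay
    variance = k / rate ** 2   # equals delay**2 / k
    return rate, mean, variance


def erlang_cdf(t, k, rate):
    """Probability that an Erlang(k, rate) holding time has elapsed by time t."""
    # P(T <= t) = 1 - sum_{n=0}^{k-1} exp(-rate*t) * (rate*t)**n / n!
    x = rate * t
    term = math.exp(-x)
    total = term
    for n in range(1, k):
        term *= x / n
        total += term
    return 1.0 - total


if __name__ == "__main__":
    delay = 2.0
    for k in (1, 4, 50):
        rate, mean, variance = erlang_for_delay(delay, k)
        print(k, rate, mean, variance, erlang_cdf(delay, k, rate))
```

With more phases, the distribution concentrates around the target delay, at the cost of k states per delayed transition in the resulting CTMC.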
In a heating, ventilation, and air conditioning system, an air-handling system is a key module. Its components (e.g., air handling unit, air-mixing box, and fans), linked through airflows, condition air to a desired temperature and/or humidity based on comfort or controlled environment requirements. Identifying failure modes and estimating their severities allow maintenance crews to know which faults have occurred and how critical they are, and to be guided in the repair process to improve the system availability. The problem of fault detection and diagnosis in air-handling systems is complex because of fault propagation across components and high false alarm rates caused by uncertainties in system and measurement dynamics. In this paper, to capture fault propagation impacts in an efficient manner, dynamic hidden Markov models are developed to identify failure modes, since they contain state transition matrices depending on other components and do not generate joint states. To filter out false alarms, "coupled statistical process control" techniques are developed by using state transition matrices representing coupling among components. Experimental results show that the method can effectively diagnose faults with high diagnosis accuracy.
In the present paper, a system with components subject to soft and hard failures is considered. It is assumed that hard failures are revealed and fixed immediately and present an additional opportunity for inspection (opportunistic inspection), but soft failures are hidden and only corrected at periodic inspections. The objective is to find the optimal maintenance policy for all components and the optimal periodic inspection for the entire system. Two models are considered in this context. In the first model, hard-type and soft-type components are subject to minimal repair or corrective replacement, and soft-type components undergo opportunistic inspections. In the second model, in addition to the assumptions of the first model, hard-type components may be preventively replaced at periodic inspections. In our models, we base the maintenance decision for the soft-type components on the optimal number of minimal repairs until replacement, and for the hard-type components on the optimal age before replacement. A recursive equation is provided for deriving the required expected values. Hidden failures preclude us from expressing the terms of the objective function in closed form. For this reason, the optimal periodic inspection interval for the system, minimising its total expected life cycle cost, is found for both models using simulation.
Conference Paper
The current trend in infrastructural asset management is towards risk-based (a.k.a. reliability centered) maintenance, promising better performance at lower cost. By maintaining crucial components more intensively than less important ones, dependability increases while costs decrease. This requires good insight into the effect of maintenance on the dependability and associated costs. To gain these insights, we propose a novel framework that integrates fault tree analysis with maintenance. We support a wide range of maintenance procedures and dependability measures, including the system reliability, availability, mean time to failure, as well as the maintenance and failure costs over time, split into different cost components. Technically, our framework is realized via statistical model checking, a state-of-the-art tool for flexible modelling and simulation. Our compositional approach is flexible and extendible. We deploy our framework to two cases from industrial practice: insulated joints, and train compressors.
Networked systems present some key new challenges in the development of fault diagnosis architectures. This paper proposes a novel distributed networked fault detection methodology for large-scale interconnected systems. The proposed formulation incorporates a synchronization methodology with a filtering approach in order to reduce the effect of measurement noise and time delays on the fault detection performance. The proposed approach allows the monitoring of multi-rate systems, where asynchronous and delayed measurements are available. This is achieved through the development of a virtual sensor scheme with a model-based re-synchronization algorithm and a delay compensation strategy for distributed fault diagnostic units. The monitoring architecture exploits an adaptive approximator with learning capabilities for handling uncertainties in the interconnection dynamics. A consensus-based estimator with time-varying weights is introduced, for improving fault detectability in the case of variables shared among more than one subsystem. Furthermore, time-varying threshold functions are designed to prevent false-positive alarms. Analytical fault detectability sufficient conditions are derived and extensive simulation results are presented to illustrate the effectiveness of the distributed fault detection technique.
An approximate Markov chain method for dynamic fault tree analysis is suggested for both reparable and non-reparable systems. The approximation is based on truncation, aggregation and elimination of Markov chain states during the transformation of the dynamic fault tree to the corresponding Markov chain. The method is valid for small probabilities. For reparable systems, it holds if the mean time to repair is much less than the mean time to failure. Several examples are studied. Additional simplification is considered in case the system is in a steady state. Copyright © 2015 John Wiley & Sons, Ltd.
Conference Paper
We compare population models in terms of Continuous Time Markov Chains with embedded deterministic delays (delayed CTMC), in which an exponential timed transition can only update the state of the system after a deterministic delay, and delay differential equations (DDE). We prove a fluid approximation theorem, showing that, when the size of the population goes to infinity, the delayed CTMC converges to a solution of the DDE.