Assessment of catastrophic risk using Bayesian network constructed from domain knowledge and spatial data.
ABSTRACT Prediction of natural disasters and their consequences is difficult due to the uncertainties and complexity of multiple related factors. This article explores the use of domain knowledge and spatial data to construct a Bayesian network (BN) that facilitates the integration of multiple factors and quantification of uncertainties within a consistent system for assessment of catastrophic risk. A BN is chosen due to its advantages such as merging multiple source data and domain knowledge in a consistent system, learning from the data set, inference with missing data, and support of decision making. A key advantage of our methodology is the combination of domain knowledge and learning from the data to construct a robust network. To improve the assessment, we employ spatial data analysis and data mining to extend the training data set, select risk factors, and fine-tune the network. Another major advantage of our methodology is the integration of an optimal discretizer, informative feature selector, learners, search strategies for local topologies, and Bayesian model averaging. These techniques all contribute to a robust prediction of risk probability of natural disasters. In the flood disaster case study, our methodology achieved a better probability of detection of high risk, a better precision, and a better ROC area compared with other methods, using both cross-validation and prediction of catastrophic risk based on historic data. Our results suggest that a BN is a good alternative for risk assessment and a useful decision tool in the management of catastrophic risk.
Risk Analysis
DOI: 10.1111/j.1539-6924.2010.01429.x
Lianfa Li,1,2 Jinfeng Wang,1 Hareton Leung,2 and Chengsheng Jiang1
KEY WORDS: Bayesian network; domain knowledge; risk analysis; spatial data mining
1 State Key Laboratory of Resources and Environmental Information System, Institute of Geographical Sciences and Resources Research, Chinese Academy of Sciences, Beijing, China.
2 Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
∗ Address correspondence to Lianfa Li, LREIS, Institute of Geographical Sciences and Resources Research, Chinese Academy of Sciences, Rm. 1305, No. A11, Road Datun, Anwai, District Chaoyang, Beijing, China 100101; lspatial@gmail.com.

1. INTRODUCTION
Emergency catastrophic events like natural disasters are affected by complex factors that are diverse (natural, environmental, ecological, demographic, and socioeconomic) and may carry a large measure of uncertainty.(1) The interactions of these factors are complex and affected by random fluctuations. Many scholars have studied natural disasters, resulting in complex systematic theories and specialist methods for risk assessment.(1−4) Shi et al.(3) proposed a systematic theory of natural disasters by dividing the risk-related factors into three aspects: inducing factors, environmental factors, and vulnerability. Based on practical surveys, Guo and Chen(5) concluded that there is a monotonically decreasing relationship (the solid line in Fig. 1) between loss risk and predictability of natural disasters, and a monotonically increasing relationship (the dash-dot line in Fig. 1) between loss risk and mitigation delay. As seen from Fig. 1, if predictability of natural disasters is improved (early warning) or timely mitigation actions are carried out, loss can be decreased considerably, thus lowering the risk and saving more lives.
0272-4332/10/0100-0001$22.00/1 © 2010 Society for Risk Analysis
Fig. 1. Relationship between predictability of natural disasters and loss risk and between mitigation delay and loss risk. [Horizontal axis: predictability / mitigation delay; vertical axis: loss risk, ranging from Lmin to Lmax; curves: predictability vs. loss risk and mitigation delay vs. loss risk.]
Generally, the traditional approaches use the following form, or a derivation thereof, to model catastrophic risk:

$$R = \iint C(V)\,P(V \mid A)\,P(A)\,dV\,dA, \qquad (1)$$

where P(A) is the probability of the disaster event, A; P(V | A) is the probability of vulnerability for a certain individual (V) given event A; and C(V) is the damage potential of V. Estimation of risk using Equation (1) depends on multiple factors, and these are subject to uncertainty.(6) Many previous approaches, however, ignore or are limited in quantifying the uncertainty of these factors and their interactions.(7) Building on the traditional approaches, we apply new techniques to enhance the predictability of natural disasters and thus to decrease the potential loss risk (Fig. 1).

Recently, with the development of geographical information science, data mining, and artificial intelligence, new techniques have been used in assessing catastrophic risk.(8−11) There is still a need for more exploration of new methods and applications. As pointed out in NASA's report,(12) some existing methods of data analysis lack consideration of domain knowledge, making it difficult to interpret the results, and relatively few studies merge multiple sources of information (some of which may be significant), although natural disasters are affected by multiple uncertain factors.

As an exploration of pragmatic and efficient methods, this article proposes an uncertainty inference model, a Bayesian network, which is initially constructed according to domain knowledge and then fine-tuned by learning from historic data. Specifically, we explore how to fuse knowledge and spatial data from multiple sources to construct an adaptive network, how to combine components of kernel density analysis, exposure analysis, and vulnerability analysis into a consistent system, and how to use the network to make a robust prediction that overcomes overfitting.
Our model is based on the Bayesian network (BN), which, as a directed acyclic graph, can represent uncertain interdependences between the factors that describe many real-world domains, such as public emergency catastrophic events. BNs have many advantages over traditional probabilistic methods, including merging domain knowledge and multiple-source data within a consistent system, a flexible network structure that is beneficial for searching for a locally optimal solution, and inference under missing data conditions. Furthermore, by adding utility and decision nodes, a BN can easily become a tool for supporting reasonable decision making. With new information (evidence) at hand, the tool can instantly update the risk assessment, and an adaptive decision can be made accordingly. Therefore, it is a good integrated approach to assessing catastrophic risk.
Although BNs have been widely applied in many domains, for example, economy, public health, ecological risk assessment, and mineral exploitation, to the best of our knowledge, there are only a few reports of BNs being applied in catastrophic risk assessment that include debris flow,(13) earthquakes,(14) and avalanches.(7)
In contrast to the small number of existing BN applications in catastrophic risk assessment, which are based mostly on domain knowledge, our study focuses on learning and robust prediction of risk using a BN: we use data mining and learning methods to improve the BN's risk assessment. Through this study, we make the following contributions:
1. We use kernel density analysis (KDA) to preprocess the spatial data set. KDA is an approach to modeling the intensity of a certain event (e.g., how close a vulnerable individual is to a flooded river) or a quantity (e.g., amount of loss) spread across the geospatial landscape.(15) In our study, this method is employed to preprocess the data to obtain the intensity buffer classification of the features (e.g., rivers, roads, or residential areas). This approach considers the influence of spatial distance on the intensity classification of features, which is beneficial for making a robust prediction.
2. We design an optimal discretization algo-
rithm. This is a supervised learning algorithm
that relates the discretization of quantitative
factors to classification of the target variable.
This algorithm is beneficial to improving the
performance of the learned models in par-
ticular when we have little knowledge about
the discretization and its influence on the risk
prediction.
3. We design a generic modeling framework of
a BN according to domain knowledge. Us-
ing data mining techniques such as informa-
tive feature selector, optimal discretizer, and
search strategies, the framework facilitates in-
tegration of multiple quantitative factors and
qualitative factors within a consistent system
to make uncertainty inferences. It combines
domain knowledge and historic data within
an integration platform for multidisciplinary
communication among experts in different
fields (geographers, construction engineers,
knowledge engineers, and economists). The combined use of multiple techniques improves the robustness of our method, and Bayesian model averaging (BMA) can be used to further enhance the robustness of the prediction.
Our method was successfully applied in monitoring
the flood disaster. Using both cross-validation and
historic data to predict the new situations, our BMA
prediction achieved a better probability of detection
of high risk, a better precision, and ROC area than
the other learning methods. Our encouraging result
has implications for using BNs as an approach to as-
sessment and decision making of catastrophic risk.
If nodes of utilities and decision actions are added
to the BN, a better decision-making functionality
can easily be implemented on the basis of robust
prediction by our methodology. Our methodology
of risk assessment, although using the flood as the
study case, is based on a generic modeling frame-
work and it can be easily adjusted and extended
to other types of catastrophic risk such as seismic
and typhoon risk if relevant factors are selected and
relevant domain knowledge is incorporated in the
framework.
2. MODELING FRAMEWORK
OF BAYESIAN NETWORK
This section briefly describes the BN model-
ing framework. Specifically, Section 2.1 introduces
knowledge of the factors associated with catastrophic
risk, Section 2.2 introduces the BN, and Section 2.3
proposes the BN framework.
2.1. Factors Associated with Catastrophic Risk
There are many factors associated with catastrophic risk. As described by Shi et al.,(3) we can divide these factors into three aspects, namely, inducing factors, environmental factors, and vulnerability factors.
1. Disaster-inducing factors: As direct exposure-
related factors, inducing factors are mainly re-
sponsible for occurrence of the hazards and
are closely related to the occurrence of catas-
trophic loss. For instance, heavy rainfall may
induce floods and landslides; extremely in-
tense wind may cause typhoons or cyclones;
and movements of the earth’s crust may in-
duce earthquakes.
2. Environmental factors for breeding disasters:
Environmental factors are relevant to the en-
vironment that breeds the disasters. Such a
factor can be either physical or artificial and
is able to mitigate or aggravate the destructive
power of a hazard. For instance, land with
good water-soil conservation capabilities can
prevent a mudslide or mitigate its destruc-
tive effect, while a flood has more destruc-
tive power on the infrastructure or residences
close to a river floodplain than those further
away from the floodplain.
3. Vulnerability: This is the degree to which a
system or subsystem is likely to experience
and adapt to harm due to exposure to a haz-
ard. Different systems and individuals have
different vulnerability due to their differences
in adaptation to harm. For instance, young
people are less vulnerable to a flood than se-
niors; and a house with a lightweight steel
structure has a better ability to withstand
earthquakes.
Table I gives a brief summary of the three types
of factors for five natural disasters, namely, flood,
typhoon, earthquake, tsunami, and landslide. Basi-
cally, these factors embody the domain knowledge
about the mechanism of disasters, that is, their oc-
currence and effects, which is the source for the sub-
sequent modeling of BN as described in the next
section.
Table I. Division of Factors for Loss Risk in Five Disasters

Typical inducing factors
  Flood: Heavy rainfall, dam collapse, etc.
  Typhoon: Extreme climate events, for example, El Niño
  Earthquake: Release of interstitial fluid pressure
  Tsunami: Earthquake deep at sea
  Landslide: Heavy rainfall
Environmental factors
  Flood: Water, soil, vegetation, elevation, slope, etc.
  Typhoon: Location, close to sea, vegetation, elevation, etc.
  Earthquake: Location, soil, close to fault, etc.
  Tsunami: Location, close to sea, etc.
  Landslide: Water, soil, vegetation, elevation, slope, etc.
Vulnerability
  All five disasters: Materials, structure and the number of stories of buildings; demographic and socioeconomic conditions; age, knowledge, and income of individuals
2.2. Bayesian Network
A BN is a probabilistic graphical model that en-
codes a set of random variables and their proba-
bilistic interdependencies through a directed acyclic
graph (DAG) consisting of nodes and edges. It is a
good method for modeling uncertainties and interac-
tions between related factors inherent in monitoring
and prediction of catastrophic risk. Given below is a
brief introduction to BNs.
Definition 1: Given a set of random variables (rv), V, a BN is an ordered pair $(B_S, B_P)$ such that
1. $B_S = G(V, E)$ is a directed acyclic graph, called the network structure of B, where $E \subseteq V \times V$ is the set of directed edges, representing the probabilistic conditional dependency relationships between rv nodes that satisfy the Markov property, that is, there are no direct dependencies in $B_S$ that are not already explicitly shown via edges, E, and
2. $B_P = \{\gamma_u : \Omega_u \times \Omega_{\pi_u} \to [0, 1] \mid u \in V\}$ is a set of assessment functions, where the state space $\Omega_u$ is the finite set of values of u; $\pi_u$ is the set of parent nodes of u, indicated by $B_S$; if X is a set of variables, $\Omega_X$ is the Cartesian product of all state spaces of the variables in X; and $\gamma_u$ uniquely defines the conditional probability distribution $P(u \mid \pi_u)$ of u given its parent set, $\pi_u$.
BNs are based on Bayes' theorem, that is, inference of the posterior probability (a.k.a. belief) of a hypothesis according to some evidence. In assessment of catastrophic risk, evidence comes from inducing factors and environmental and vulnerability factors (Table I), while the hypothesis refers to the risk that is classified as several states of loss. Let r be such a hypothesis variable of loss risk, with state space $\Omega_r$. The risk can be classified into seven states, that is, $\Omega_r$ = {none, slight, light, moderate, heavy, major, destroyed}, whose damage definitions are given in Table II. In practice, the risk can be classified into two states, for example, $\Omega_r$ = {low loss, high loss} or {no loss, loss}, for convenience of surveying the loss when the training samples involve only two states of loss, as in Wang's(16) survey of earthquake loss and the binary survey of flood loss in our flood study. In our binary classification, the threshold for "high loss" is a damage factor of 10%, that is, if over 10% of the properties or people at a certain place are damaged, this place is classified as "high loss."

Table II. Different States of the Risk and Their Damage Definition
Damage State    Damage Factor Range (%)    Central Damage Factor (%)
None            0                          0
Slight          0–1                        0.5
Light           1–10                       5
Moderate        10–30                      20
Heavy           30–60                      45
Major           60–100                     80
Destroyed       100                        100
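The mapping from a damage factor to the states of Table II, and to the binary classes used in the flood study, can be illustrated with a minimal Python sketch. The function names are hypothetical, and the treatment of shared boundaries (each range is taken to include its upper bound) is an assumption, since the table does not specify it.

```python
# Upper bounds (in %) of the damage factor ranges from Table II.
# Assumption: each range includes its upper bound; the table itself does not
# say which state a shared boundary value (e.g., exactly 10%) belongs to.
_UPPER_BOUNDS = [
    ("none", 0.0),
    ("slight", 1.0),
    ("light", 10.0),
    ("moderate", 30.0),
    ("heavy", 60.0),
]


def damage_state(damage_factor_pct: float) -> str:
    """Map a damage factor in percent to one of the seven damage states."""
    if not 0.0 <= damage_factor_pct <= 100.0:
        raise ValueError("damage factor must lie in [0, 100]")
    if damage_factor_pct == 100.0:
        return "destroyed"
    for state, upper in _UPPER_BOUNDS:
        if damage_factor_pct <= upper:
            return state
    return "major"  # above 60% and below 100%


def binary_loss(damage_factor_pct: float, threshold_pct: float = 10.0) -> str:
    """Binary class: 'high loss' only if the damage factor is over the threshold."""
    return "high loss" if damage_factor_pct > threshold_pct else "low loss"
```

For example, under these assumptions a cell with a damage factor of 20% falls in the "moderate" state and in the "high loss" binary class.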
In a specified BN, given some evidence at hand, we can estimate the posterior probability or belief of the target variable r as the risk probability by calculating the marginal probability:

$$\mathrm{Bel}(r_k) = \sum_{u_i \in V,\, u_i \neq r} p(u_1, u_2, \ldots, r, \ldots, u_n), \qquad (2)$$

where $p(u_1, \ldots, u_n) = \prod_{u_i \in V} p(u_i \mid \pi_{u_i})$ is the joint probability over V.
Fig. 2. Modeling framework of Bayesian network. (A) Disease-symptoms pattern and influence pattern from qualitative factors [nodes: Risk (A), quantitative factors (B), qualitative factors (C)]. (B) Initial network framework for risk assessment, constructed from domain knowledge [nodes: qualitative factors; Risk {"high", "low"}; quantitative factors, Factor 1, ..., Factor n; labels: inducing factors, breeding factors, vulnerability].
In practice, we often use an efficient algorithm of
exact inference or approximate inference rather than
the marginalization of the joint probability to com-
pute Bel in Equation (2).
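For a very small network, the marginalization in Equation (2) can be carried out by brute-force enumeration, which may help to make the definition concrete. The following Python sketch is only an illustration: the three-node network, its state names, and all probabilities are made up, and a normalization over the evidence-consistent assignments is added so that the beliefs sum to one. Real applications would use an exact or approximate inference algorithm instead.

```python
from itertools import product

# Hypothetical network: C (qualitative factor) -> R (risk); R and C -> B
# (a discretized quantitative factor). All numbers are illustrative only.
STATES = {"C": ["sand", "clay"], "R": ["low", "high"], "B": ["normal", "abnormal"]}
PARENTS = {"C": (), "R": ("C",), "B": ("R", "C")}
CPT = {
    "C": {(): {"sand": 0.6, "clay": 0.4}},
    "R": {("sand",): {"low": 0.9, "high": 0.1},
          ("clay",): {"low": 0.7, "high": 0.3}},
    "B": {("low", "sand"): {"normal": 0.8, "abnormal": 0.2},
          ("low", "clay"): {"normal": 0.7, "abnormal": 0.3},
          ("high", "sand"): {"normal": 0.3, "abnormal": 0.7},
          ("high", "clay"): {"normal": 0.2, "abnormal": 0.8}},
}


def joint(assignment):
    """Joint probability as the product of p(u | parents of u) over all nodes."""
    p = 1.0
    for node, parents in PARENTS.items():
        key = tuple(assignment[q] for q in parents)
        p *= CPT[node][key][assignment[node]]
    return p


def belief(target, evidence):
    """Sum the joint over all non-target variables, keeping only assignments
    that agree with the evidence, then normalize over the target states."""
    names = list(STATES)
    scores = {s: 0.0 for s in STATES[target]}
    for values in product(*(STATES[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            scores[assignment[target]] += joint(assignment)
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}


bel = belief("R", {"B": "abnormal"})  # belief in each risk state given evidence
```

The cost of this enumeration grows exponentially with the number of nodes, which is why efficient inference algorithms are preferred in practice.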
2.3. Initial Network Framework
We devise an initial network framework of
BN according to domain knowledge. The domain
knowledge comes from the generalization and classi-
fication of factors associated with natural disasters as
described in Section 2.1. To simplify the implemen-
tation of the domain knowledge in the BN, we use
two relationships to represent the knowledge: one
is the relationship between quantitative independent
factors such as rainfall, elevation, and slope, and the
target variable, that is, loss risk, while the other is
the influence of qualitative factors such as vegetation,
landform, and soil type on the relationship between
quantitative factors and the target variable.
1. When a quantitative factor is closely associ-
ated with the target variable (risk), abnormal
values of the quantitative factor may indicate
higher loss risk (probability). This relation-
ship between quantitative factors and the loss
risk is similar to that between a disease and
the related symptoms of the patient catching
the disease: a certain disease often causes the
patient to have some abnormal test results or
symptoms. Similarly, if a study region has a
high loss risk, it often has a higher or abnor-
mal measurement value of some quantitative
factors. Based on empirical knowledge, such
a relationship becomes a basic pattern for our
Bayesian modeling framework. We call this
relationship the disease-test pattern or disease-
symptom pattern.
2. Another aspect of domain knowledge is the
influence of qualitative factors such as soil
type, geological type, and vegetation on the
relationship between quantitative factors and
loss risk. In this study, we regard these qual-
itative factors as contributing factors to the
disease-symptom pattern relationship. The in-
fluence of qualitative factors on quantitative
factors and loss risk naturally becomes our
second basic pattern for BN modeling and we
call it an influence pattern.
From the above two basic patterns (Fig. 2A), we
can construct the BN modeling framework (Fig. 2B).
We use a simple diverging connection(17)to model
the “disease-symptom” pattern. First, we assume
that the quantitative factors used are independent.
Under this assumption, we can specify this connec-
tion using the loss risk (“disease”) node as root and
quantitative nodes (“symptoms”) as leaves. In this
pattern, each leaf node has an edge directed from
the root node. If we temporarily ignore the influence from qualitative factors, the diverging connection is a typical naïve Bayes structure, which is often used in medical diagnosis. From this connection, we get the likelihood of the target node, r = "high loss":

$$L(m_1, \ldots, m_n \mid R) = P(m_1 \mid R) \cdots P(m_n \mid R), \qquad (3)$$

where R is the target node of loss risk and $m_i$ is the ith node of the independent quantitative factors. Then, using the normalization constant, μ, we get the posterior probability, or belief, of the risk variable:

$$\mathrm{Bel}(R) = \mu\,P(R)\,L(m_1, \ldots, m_n \mid R).$$
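The computation of Equation (3) and of the normalized belief can be sketched in a few lines of Python. This is an illustration under the stated independence assumption only; the function name, the factor names, the discretized values, and all probabilities below are hypothetical.

```python
import math


def naive_bayes_belief(prior, likelihoods, observed):
    """Return Bel(R) = mu * P(R) * product over i of P(m_i | R).

    prior:       {risk_state: P(R = risk_state)}
    likelihoods: {factor: {risk_state: {value: P(m = value | R = risk_state)}}}
    observed:    {factor: observed discretized value}
    """
    unnormalized = {}
    for state, p_r in prior.items():
        # Equation (3): the likelihood factorizes over independent factors.
        like = math.prod(likelihoods[f][state][v] for f, v in observed.items())
        unnormalized[state] = p_r * like
    mu = 1.0 / sum(unnormalized.values())  # normalization constant
    return {state: mu * value for state, value in unnormalized.items()}


# Illustrative numbers only.
prior = {"high": 0.2, "low": 0.8}
likelihoods = {
    "rainfall": {"high": {"heavy": 0.7, "light": 0.3},
                 "low": {"heavy": 0.2, "light": 0.8}},
    "elevation": {"high": {"lowland": 0.6, "upland": 0.4},
                  "low": {"lowland": 0.3, "upland": 0.7}},
}
bel = naive_bayes_belief(prior, likelihoods,
                         {"rainfall": "heavy", "elevation": "lowland"})
```

With these made-up numbers, observing heavy rainfall in a lowland cell raises the belief in "high" loss from the prior of 0.2 to about 0.64.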
To model the influence pattern, we use the fundamental rule of probability calculus:(18)

$$P(A, B \mid C) = P(B \mid A, C)\,P(A \mid C). \qquad (4)$$

If we regard the risk factor as A, the independent quantitative factors as B, and the qualitative factors as C, then according to this rule it is natural to direct edges from the qualitative factor nodes (C) to the quantitative factor (B) and risk (A) nodes to obtain our framework (Fig. 2B). In a BN, an edge from a node Y to a node X represents the probabilistic dependence P(X | Y): the node at the edge's arrow (X) is statistically dependent on the one at the edge's source (Y). According to Equation (4), if C has a statistical influence on A and B (i.e., A and B are statistically conditional on C), this can be specified as the product of two probability dependencies, that is, P(B | A, C) and P(A | C). We can thus direct an edge from C to A to represent the dependence P(A | C), and direct two edges, from A and from C, to B to represent the dependence P(B | A, C). According to the BN principle, such a product of probabilities in Equation (4) can be implemented in the network structure as in Fig. 2A. When B represents several independent quantitative factors and C several qualitative factors, this interdependence between the three factors can be extended to multiple factors under the independence assumption among the quantitative factors (B). This gives us the BN's initial framework (Fig. 2B). Furthermore, to decrease the computational burden, we can use domain knowledge to confirm the interdependency relationships between quantitative factors (B) and qualitative factors (C). Thus, domain knowledge is used to select the links in the final network by removing those pairs that are not interdependent. This removal conforms to the simplification principle of Occam's window(19) for model selection. In Fig. 2, we use dotted lines to denote the relationships finally determined by domain knowledge.

On the other hand, there may be interdependencies between qualitative factors, and these interdependencies can be learned from the data set using various search algorithms. The interdependencies learned from the data are then used to determine the local structure of the network framework, thereby constructing a complete BN.
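The edge directions of the initial framework can be written down as parent sets, as in the following minimal Python sketch. It is an illustration only: the function names are hypothetical, the example factor names and the single influence link are chosen arbitrarily, and edges among qualitative factors (which the text leaves to structure learning) are omitted. The acyclicity check uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def initial_framework(risk, quantitative, qualitative, influence_links):
    """Return {node: set of parent nodes} for the initial framework.

    Edges: every qualitative factor C -> risk A; risk A -> every quantitative
    factor B; and C -> B only for the (C, B) pairs in influence_links, which
    stand for the links confirmed by domain knowledge.
    """
    parents = {risk: set(qualitative)}
    for c in qualitative:
        parents[c] = set()       # qualitative-qualitative edges are learned later
    for b in quantitative:
        parents[b] = {risk}      # disease-symptom pattern
    for c, b in influence_links:
        parents[b].add(c)        # influence pattern
    return parents


def is_acyclic(parents):
    """True if the parent sets describe a directed acyclic graph."""
    try:
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


# Hypothetical example: two quantitative and two qualitative factors.
framework = initial_framework(
    risk="risk",
    quantitative=["rainfall", "elevation"],
    qualitative=["geology", "near_river"],
    influence_links=[("geology", "elevation")],
)
```

Any edges learned afterward among the qualitative factors can be added to the same dictionary and rechecked with `is_acyclic` before the conditional probability tables are estimated.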
In the framework, we must ensure independence
between quantitative factors. Furthermore, since the
predictive-related factors are quantitative (continu-
ous) or qualitative (discrete or categorical), we need
to develop methods to fuse such different types of
variables in the BN. In our methodology, we use
PCA and Quinlan’s information measures to se-
lect independent quantitative factors, an optimal dis-
cretization algorithm to discretize the quantitative
factors, thus enabling the BN to integrate both quali-
tative and quantitative factors within a consistent sys-
tem, search algorithms to obtain the local structure,
and BMA to obtain a robust prediction of the loss
risk.
On the basis of the principles of naïve Bayes,
Occam's window's simplification,(19) and probability
calculus, our BN framework is constructed. Given
its independence assumption and theoretical foun-
dation, our framework, although artificial, is reason-
able and adequately describes the probability re-
lationships between the variables. This framework
combines simple domain knowledge (relevant fac-
tors and their classification) and learning to obtain
a robust assessment of catastrophic risk. The frame-
work is practical and very useful, in particular when
we lack domain knowledge of the probability inter-
dependences among risk-related factors. But, if we
have a clearer knowledge of the interdependence, we
can directly construct the network.
3. BACKGROUND OF THE FLOOD
CASE STUDY
3.1. Study Area
The study area is one part of the basin of the Heihe River that is located between east longitude 96°42′ and 102°04′ and between north latitude 36°45′ and 42°40′ in northwest China. Fig. 3 shows the study area, which includes 11 counties, for example, Jinta, Jia Yuguan, Jiuquan, Gaotai, Zhangye, etc. within Gansu Province.
The study area, unlike the southern regions of
China, does not often experience heavy rainfalls. But
due to worsening local ecological systems and poor
water-soil conservation ability, a few heavy rainfalls
can cause moderate flood disasters as recorded in
history.(20) In the summer (July) of three successive years, 2006, 2007, and 2008, three flood disasters happened in this region.

Fig. 3. The study area.
3.2. Study Goal
The study goal is to use the flood disaster data
set and the related factors to construct a robust BN
learner that can be used to predict the flood’s loss
risk. Our methods are compared with nine other
probabilistic or nonprobabilistic models.
In the case study, we used three data sets of the
same study area from three successive years, July
2006, July 2007, and July 2008.
For model validation, the data from 2006 and 2007 were each used to validate the model using the usual 10×10 cross-validation. In this validation, the data set was randomly divided into 10 buckets of equal size. Nine buckets were used for training and the remaining bucket was used for testing. The procedure was iterated 10 times and the results were averaged. The various methods are compared using scalar measures, that is, the probability of detection (pd), probability of false alarms (pf), precision, and receiver operating characteristic (ROC) area.
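The bucket splitting and three of the scalar measures can be sketched as follows. This is a minimal illustration, not the evaluation code of the study: the function names are hypothetical, and pd, pf, and precision are computed with their standard confusion-matrix definitions (pd = TP/(TP+FN), pf = FP/(FP+TN), precision = TP/(TP+FP)), which are assumed here. The ROC area is omitted.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k random buckets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    buckets = [indices[i::k] for i in range(k)]  # bucket sizes differ by at most 1
    for i, test in enumerate(buckets):
        train = [j for m, bucket in enumerate(buckets) if m != i for j in bucket]
        yield train, test


def scalar_measures(y_true, y_pred, positive=1):
    """Compute pd, pf, and precision for the 'high loss' (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "pd": ratio(tp, tp + fn),         # probability of detection (recall)
        "pf": ratio(fp, fp + tn),         # probability of false alarms
        "precision": ratio(tp, tp + fp),
    }
```

In use, a model would be trained on each `train` index list, evaluated on the matching `test` list with `scalar_measures`, and the per-bucket results averaged; repeating the whole procedure with different seeds gives a repeated cross-validation.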
Furthermore, the 2006 data set was used as the
training data to teach the models to predict the 2007
and 2008 loss risks. Thus, through four comparisons
(the 2006 and 2007 cross-validations, the 2007 and
2008 predictions), we validated our methodology.
3.3. Data Set
This study’s data set is based on the grid for-
mat. In this format, the study region is subdivided
into compartments or cells (pixels) on the basis of
a spatial data set obtained from a rectangular grid.
The grid's resolution is about 500 m. The data set includes data on the three types of relevant factors and the loss survey for 2006, 2007, and 2008, respectively. The involved factors are described as follows.
3.3.1. Predictive Factors
The predictive factors cover the three aspects as
described in Section 2.1.
1. One inducing factor: heavy rainfalls (rf) are
the direct cause of the flood.
2. Six environmental factors: elevation (e), slope
(s), daily mean wind velocity (dmn), daily max-
imum wind velocity (dmax), normalized differ-
ence vegetation index (NDVI, n), and geol-
ogy type (g). These factors correspond to the
physical geographical environment in which
the flood occurs and can aggravate or mitigate
the flood's destructive power: a higher
altitude (e) indicates less influence from the
flood; a larger wind velocity (dmn and dmax)
implies more destruction indirectly related to
the flood; geological conditions (g) have an in-
fluence on the indirect damages of the flood;
Fig. 4. Procedure for risk assessment in our methodology.
NDVI (n) is an indicator of the study area’s
vegetation and also has secondary effects on
the flood disaster.
3. Three vulnerability-related factors: whether
close to residents (re), whether close to roads
(ro), and whether close to rivers (ri). These
factors are associated with the location and
surroundings of the vulnerable individuals
(e.g., human beings, houses, or constructions).
If individuals are closer to the flooded river
(ri), they are more vulnerable to the disaster’s
damage. If a flood is closer to a residential
area (re), this area can be more vulnerable to
the loss. Conversely, if individuals are closer to a road (ro), the road gives them a better chance of escaping the disaster, and they are thus less vulnerable.
Among these factors, the inducing factor, rf, and
the five environmental factors, e, s, n, dmn, and dmax,
are quantitative factors, while the remaining environ-
mental factor, g, and the three vulnerability factors,
ri, ro, and re, are qualitative factors.
3.3.2. The Target Variable
The target variable is the loss caused by the
flood. We obtained the loss data using thematic map-
ping (TM) images in combination with yearbooks,
statistics, and references from practical surveys. Us-
ing TM images, we obtained the submergence depth
and duration of the flood and then we examined and
confirmed the loss situation within the submergence
range referring to other materials and statistics from
practical surveys.(21) We used a binary categorization
variable to indicate whether a unit (a cell in the grid
data set) has a “high loss” or “low loss” risk (“1”
representing “high loss” and “0” representing “low
loss”). Basically, the areas having “high loss” each
have a loss proportion of over 10%.
4. FLOOD RISK ASSESSMENT
The assessment of catastrophic risk is defined by
the following steps:
1. Preprocess the data set using a kernel density
function (Section 4.1).
2. Discretize quantitative factors using the opti-
mal multisplits algorithm (Section 4.2).
3. Select independent quantitative factors us-
ing principal component analysis (PCA) and
Quinlan’s information measures (Section 4.3).
4. Build the BN with discrete or categorical pre-
dictive factors (Section 4.4).
5. Perform a robust prediction of the catas-
trophic risk (Section 4.5).
Figure 4 shows the procedure of the flood risk
assessment. To obtain the grid data set from mul-
tiple heterogeneous sources, we apply various pre-
processing steps, for example, converting the vector
data and resampling the grid data into the target grid
data set at the standardized resolution and projec-
tion. We perform these steps in a GIS environment
such as ArcGIS. The following sections describe
major techniques of the procedure in Fig. 4.
4.1. Using Kernel Density Functions
to Preprocess the Data Set
KDA is a nonparametric unsupervised learning
procedure. The kernel, k, is a probability density
function that is symmetric around the origin and
decreases with increasing distance from the origin.
We can use the normal density function as the
kernel function, $K_\lambda(z, Z_i)$.(15) We then sum the
kernel density values contributed by the sample
(observation) units to obtain the intensity value of
any unit z in the geographical area:

\[
\mathrm{Density}(z) = \frac{1}{n} \sum_{i=1}^{n} Z_i \cdot K_\lambda(z, Z_i), \tag{5}
\]

where n is the number of sample units and $Z_i$ is a
count of a certain type of event or a quantity of the
feature. According to Density(z), we obtain the
classification of the predictive factors.
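As an illustration only, Equation (5) can be sketched in a few lines of Python. The sketch assumes that the kernel is a one-dimensional normal density evaluated on the Euclidean distance between the target unit z and each sample unit, and that each sample carries its location and its value Z_i; the function names and the sample layout are hypothetical and are not taken from the study.

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Normal density of a distance, scaled by the bandwidth (lambda)."""
    u = distance / bandwidth
    return math.exp(-0.5 * u * u) / (bandwidth * math.sqrt(2.0 * math.pi))


def density(z, samples, bandwidth):
    """Eq. (5): (1/n) * sum_i Z_i * K_lambda(z, Z_i).

    z        -- (x, y) location of the unit being scored
    samples  -- list of ((x, y), Z_i) pairs (assumed layout)
    """
    total = 0.0
    for (x, y), value in samples:
        dist = math.hypot(z[0] - x, z[1] - y)
        total += value * gaussian_kernel(dist, bandwidth)
    return total / len(samples)


# Two hypothetical river samples; a unit near them scores higher.
river_samples = [((0.0, 0.0), 1.0), ((1.0, 0.0), 1.0)]
near = density((0.5, 0.0), river_samples, bandwidth=1.0)
far = density((5.0, 0.0), river_samples, bandwidth=1.0)
```

A larger bandwidth flattens the difference between `near` and `far`, which corresponds to the generalization effect of a big λ discussed in this section.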
Kernel density estimation embodies the concept of
spatial correlation: the smaller the spatial distance
(d) between geospatial features, the stronger the
correlation or influence between them.(22) Because this
assumption is reasonable for natural disasters,
accounting for spatial correlation through KDA is
particularly useful when quantifying predictive
factors for risk analysis.
The bandwidth or search radius, λ, is set according
to empirical knowledge and the design goal. A large λ
generalizes over the entire study area, while a small
λ overlocalizes. Since our goal is to reflect the
influence of relevant indicators such as rivers on the
damage to the individuals exposed to the flood, λ can
be set according to the largest influence range of the
relevant indicators in practical disasters. In the
flood disaster, three variables, that is, re, ro, and
ri, are quantified and classified using KDA. Fig. 5
shows the classifications of the three variables via
KDA (Fig. 5A for ri, Fig. 5B for ro, and Fig. 5C for
re).
4.2. Optimal Discretization of Quantitative Factors
This step involves discretizing the quantitative
factors. The discretization will be used first in the
selection of independent predictive factors described
in Section 4.3, and then in BN modeling, as inputs
of the discrete state space Ω_u. We use a supervised
learning algorithm to find the optimal splits to
discretize quantitative factors for the BN to achieve
the data-adaptive prediction of the target variable.
We describe this in two parts: Section 4.2.1
introduces the concept of Quinlan’s measure that is
used later in our algorithm, while Section 4.2.2
presents the discretization algorithm.

Fig. 5. KDA classification of ri, ro, and re.
4.2.1. Quinlan’s Information Measures
Quinlan’s information gain ratio (GR)(23) is used
to measure the contribution of the splits of each
quantitative factor to risk prediction. GR measures
the information gained about the target variable
given the discretization of the variable to be
assessed, normalized by the information that the
discretized variable itself contains. Quinlan’s GR
also measures the contribution of an indicator to the
prediction and thus can be used as a means of feature
selection (Section 4.3).

Table III. Splits of Quantitative Factors Discretized
Quantitative Factor | Discretized Intervals (Splits)
Elevation (e) | [0, 117), [117, 1411.5), [1411.5, 2285), [2285, +∞)
Slope (s) | [0, 1.095), [1.095, 4.1), [4.1, 6.495), [6.495, 17.705), [17.705, +∞)
Rainfall (rf) | [0, 48.0), [48.0, 56.5), [56.5, 57.5), [57.5, 58.5), [58.5, 2515.0), [2515.0, 2764.5), [2764.5, 2850.5), [2850.5, 3897.5), [3897.5, 4901.5), [4901.5, 4994.0), [4994.0, 7524.0), [7524.0, 7689.5), [7689.5, +∞)
Daily mean wind velocity (dmn) | [0, 24.5), [24.5, 29.5), [29.5, 30.5), [30.5, +∞)
Daily max. wind velocity (dmax) | [0, 72.5), [72.5, 73.5), [73.5, 76.5), [76.5, 77.5), [77.5, 80.5), [80.5, 81.5), [81.5, +∞)
NDVI (n) | [0, 0.015), [0.015, 0.075), [0.075, 0.115), [0.115, 0.225), [0.225, 0.305), [0.305, 0.675), [0.675, +∞)
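To make the measure concrete, the following Python sketch computes Quinlan’s gain ratio for one discretized factor, using the usual definition of information gain divided by split information. The function names and the toy data are hypothetical; this is a minimal illustration and not the implementation used in the study.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(bins, labels):
    """Information gain of `labels` given `bins`, divided by split information."""
    n = len(labels)
    groups = {}
    for b, y in zip(bins, labels):
        groups.setdefault(b, []).append(y)
    # Expected class entropy after splitting on the discretized factor.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(bins)  # information carried by the discretization itself
    return gain / split_info if split_info > 0 else 0.0


# Toy example: interval index of a factor and the risk label of each unit.
interval_of_unit = [0, 0, 1, 1]
risk_of_unit = [0, 0, 1, 1]
gr = gain_ratio(interval_of_unit, risk_of_unit)
```

A factor whose intervals separate the risk classes perfectly, as in the toy data, obtains the maximum ratio, while a factor whose intervals are unrelated to risk obtains a ratio of zero.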
4.2.2. Optimal Multisplitting Discretization
Algorithm
The algorithm is designed according to the
“recursion” idea in the algorithm by Fulton et al.(24)
and the minimum description length (MDL) stopping
criterion in Fayyad and Irani’s algorithm.(25) It
recursively finds the optimal splits of a continuous
predictor based on the discretization’s contribution
to the class prediction. Compared with other
supervised methods, this algorithm can achieve the
same or better splits with the number of intervals
adjusted adaptively, although we need to set the
maximum number of intervals.
This algorithm assumes that optimal cut points
fall on the boundary points, which are defined as the
points between two successive attribute values of
the sorted instances that have two different class
labels. This assumption has been theoretically proven
to be reasonable.(26) The algorithm uses GR as the
goodness criterion for discretization. It recursively
searches the boundary points from the largest to the
smallest until the optimal cut points with the
maximum GR are obtained:
\[
GR(k, 1, i) = \max_{1 \le j < i} \bigl( GR(k-1, 1, j) + GR(1, j+1, i) \bigr), \tag{6}
\]
where GR(k, j, i) denotes the maximum GR that re-
sults when the training instances j through i are parti-
tioned into k intervals. The best k-split is the one that
maximizes GR(k,1,N), where N is the cardinality of
the set of values of the continuous predictor.
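The recursion in Equation (6) can be sketched as a small dynamic program. The sketch below is only an illustration under stated assumptions: the per-interval score is a stand-in (minus the count-weighted class entropy of the interval) rather than GR itself, the instances are assumed to be already sorted by the value of the predictor, every position is tried as a cut point instead of only the boundary points, and no MDL stopping rule is applied. All names are hypothetical.

```python
import math
from functools import lru_cache


def interval_score(labels, lo, hi):
    """Stand-in additive score for instances lo..hi (inclusive):
    minus the count-weighted class entropy, so pure intervals score 0."""
    segment = labels[lo:hi + 1]
    n = len(segment)
    score = 0.0
    for c in set(segment):
        p = segment.count(c) / n
        score += n * p * math.log2(p)
    return score


def best_k_split(labels, k):
    """Return (score, cuts) of the best partition of the sorted instances
    into k intervals; each cut is the last index of an interval."""
    n = len(labels)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of instances")

    @lru_cache(maxsize=None)
    def best(parts, i):
        # Best partition of instances 0..i into `parts` intervals.
        if parts == 1:
            return interval_score(labels, 0, i), ()
        candidates = []
        for j in range(parts - 2, i):  # j = last index of the first parts-1 intervals
            left_score, left_cuts = best(parts - 1, j)
            total = left_score + interval_score(labels, j + 1, i)
            candidates.append((total, left_cuts + (j,)))
        return max(candidates)

    return best(k, n - 1)


# Class labels of eight instances sorted by the predictor value.
sorted_labels = [0, 0, 0, 1, 1, 1, 0, 0]
score, cuts = best_k_split(sorted_labels, 3)
```

In this toy case the two cuts fall after the third and sixth instances, which are exactly the boundary points between instances of different classes.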
Although we can set a maximum number of intervals,
the algorithm makes use of Occam’s MDL principle of
information theory as the stopping criterion,(27) thus
adaptively adjusting the number of discretization
intervals. According to the MDL criterion, the
algorithm can decide whether a candidate cut point is
acceptable or whether new partitions are unnecessary.
This algorithm considers the characteristics of
the data such as the variance. If a factor has a big
variance and a split can improve the discretization’s
contribution to classification, the split will be kept.
Even (equal-width) discretization is simple and easy
to use, but some of its splits may be unnecessary
(e.g., only the rainfall beyond a certain threshold can
result in a flood disaster, and the discretization
below such a threshold is meaningless for the flood
risk prediction). We can therefore use this algorithm
to automatically detect and identify such threshold
splits when we have little knowledge of the
risk-related factors and experts cannot give precise
splits. The splits identified by this algorithm can be
adjusted according to domain knowledge if necessary.
In our study, six quantitative factors, namely,
rf, e, s, dmn, dmax, and n were discretized using
this algorithm. Table III gives the splits of these
factors.
Then, the splits were used to discretize the
corresponding factors of the validation and test data
sets to supply discrete versions of the continuous
factors.
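Applying learned splits to new data amounts to a lookup of the half-open interval that contains each value. The following sketch illustrates this for elevation, using the cut points listed for elevation in Table III; the helper name is hypothetical.

```python
from bisect import bisect_right

# Interior cut points for elevation (e), as listed in Table III:
# [0, 117), [117, 1411.5), [1411.5, 2285), [2285, +inf)
ELEVATION_CUTS = [117.0, 1411.5, 2285.0]


def discretize(value, cuts):
    """Index of the half-open interval [c_(k-1), c_k) that contains value."""
    return bisect_right(cuts, value)


# Hypothetical elevations of four validation cells.
validation_elevations = [35.0, 117.0, 2000.0, 3100.0]
elevation_states = [discretize(v, ELEVATION_CUTS) for v in validation_elevations]
```

Because the intervals are closed on the left, a value equal to a cut point (117.0 above) falls into the interval that starts at that cut point.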
4.3. Feature Selection
To obtain independent quantitative factors, we
employ PCA to detect the underlying independence
among quantitative factors (Section 4.3.1) and then
use Quinlan’s GR to confirm the relationship of each
factor to the target variable (risk) to get the set of
independent quantitative factors (Section 4.3.2).

Table IV. Loading and GR of Quantitative Factors
Measures | e | s | rf | dmn | dmax | n
Loading | 0.76 | 0.979 | 0.811 | 0.883 | 0.965 | 0.981
#Component | 5 | 2 | 4 | 1 | 1 | 3
GR | 0.077 | 0.69 | 0.079 | 0.039 | 0.052 | 0.131
4.3.1. PCA to Detect Underlying Factors
PCA(28) is a classic statistical method used to
explain variability among observed variables in terms
of fewer unobserved variables called principal
components (PCs). In this study, we used the commonly
used varimax rotation strategy to make the PCs
distinct. If a PC’s eigenvalue is greater than 1.0, it
will be selected as a predictive factor.
4.3.2. Selection of Quantitative Factors According
to Their Loadings and GR
From PCA, we obtain a subset of PCs with
eigenvalues greater than 1.0. Next, within each PC we
identify the quantitative factors whose loading is
maximum and close to or above 0.8 (a threshold chosen
according to empirical knowledge), and from these
factors we select the one predictive factor whose GR
is relatively large. The loading threshold of 0.8
avoids information loss while selecting independent
PCs: a loading value of 0.8 or more means that the
quantitative factor contains most of the information
of the principal component. If a PC has several
predictive factors whose loading coefficients are
equal to or bigger than 0.8, we select only the one
with the largest GR among them. Thus the features
selected, while maintaining independence, are
informative and beneficial for the prediction of risk.
Using the above selection criteria, we selected
the predictive factors from the quantitative and
qualitative factors in the flood study. Table IV shows
the loading and GR of the quantitative factors and
Table V shows the GR scores of qualitative factors.
In total, we selected nine predictive factors including
five quantitative factors, namely, n, rf, s, e, and dmax,
as well as four qualitative factors, namely, g, re, ro,
and ri.
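The selection rule for the quantitative factors can be illustrated with the values reported in Table IV. The sketch below keeps, for each principal component, the factor with the largest GR among those whose loading passes a threshold. The threshold of 0.75 is an assumption made here to represent “close to or above 0.8” (so that e, with a loading of 0.76, is admitted); the function and variable names are hypothetical.

```python
# (loading, component, GR) per quantitative factor, as read from Table IV.
TABLE_IV = {
    "e": (0.76, 5, 0.077),
    "s": (0.979, 2, 0.69),
    "rf": (0.811, 4, 0.079),
    "dmn": (0.883, 1, 0.039),
    "dmax": (0.965, 1, 0.052),
    "n": (0.981, 3, 0.131),
}


def select_factors(table, min_loading):
    """Keep, per component, the factor with the largest GR among those
    whose loading is at least min_loading."""
    best_by_component = {}
    for name, (loading, component, gr) in table.items():
        if loading < min_loading:
            continue
        current = best_by_component.get(component)
        if current is None or gr > table[current][2]:
            best_by_component[component] = name
    return sorted(best_by_component.values())


# Assumed threshold: "close to or above 0.8" is taken as >= 0.75.
selected = select_factors(TABLE_IV, min_loading=0.75)
```

Under this assumption dmn and dmax compete within the same component, and dmax is kept because of its larger GR, which matches the five quantitative factors reported above.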
Table V. GR of Qualitative Factors Used
Measures | Close to Residences? (re) | Geology Type (g) | Close to River? (ri) | Close to Road? (ro)
GR | 0.066 | 0.056 | 0.132 | 0.0132
4.4. Model Construction and Estimation
of Parameters
This section describes learning of the BN’s lo-
cal structure (Section 4.4.1) and estimation of assess-
ment parameters (Section 4.4.2).
4.4.1. Learning of Local Structure of
Qualitative Factors
Once the independent quantitative factors have
been selected, they are used to construct an initial
network (Fig. 2). We then use the learning algorithms
to learn the local structure of the qualitative factors
from the training data set.
The learning uses a quality score to measure the
network’s quality. There are three kinds of score
measures that bear a close resemblance: the Bayesian
approach, the information criterion, and the minimum
description length. In this study, we used the
Bayesian approach, which uses the a posteriori
probability of the learned structure given the training
instances as a quality measure. The Bayesian approach
can achieve a good effect as it is unaffected by the
specific structure, unlike other measures.(29)
A search algorithm can be applied to the space
of the network structures to find the locally optimal
network with a high-quality score. Table VI shows
various typical algorithms to obtain the topology of
local network. In this table, the methods in bold font
were used in our methodology.
4.4.2. Learning of Assessment Parameters
Once the BN’s structure has been constructed,
the CPT parameters for each node in the BN can be
obtained in two ways.
1. If training data sets are not available, we can
elicit the CPT from domain knowledge by consulting
domain experts, modeling, or using various
yearbooks, statistics, or references.
2. If we have enough data and we do not have
clear knowledge about a disaster, we can learn the
CPT from the data set by using a learning
algorithm (Table VI).

Table VI. Methods for Construction, Inference, and Prediction of BN
Steps | Type | Methods
Structure | Domain knowledge based | Construct BN according to domain or empirical knowledge
Structure | Dependency analysis based | Conditional independence (CI)(17)
Structure | Search scoring based(29): quality measure | Bayesian approach, information criterion approach, and minimum description length approach
Structure | Search scoring based(29): learning methods | Heuristic search strategies: K2, hill climbing (HC), TAN, etc.; general-purpose search strategies: Tabu, simulated annealing (SA), genetic algorithm (GA), etc.
Parameter learning | Domain knowledge based | Reports, statistics, and experienced models
Parameter learning | Distribution based | Dirichlet-based parameter estimator
Parameter learning | With missing data(17) | Expectation maximization, Gibbs sampling
Inference | Exact inference | Joint probability, naïve Bayesian, graph reduction, and polytree(30)
Inference | Approximate inference | Forward simulation, random simulation(30)
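For the data-driven case, a Dirichlet-based estimator of the kind listed in Table VI can be sketched as follows. The sketch assumes complete data and a symmetric Dirichlet prior with a hypothetical pseudo-count `alpha`; the function name, the record layout, and the toy records are illustrative assumptions and not the configuration used in the study.

```python
from collections import Counter


def estimate_cpt(rows, child, parents, child_states, alpha=0.5):
    """Estimate P(child | parents) from complete records, smoothing every
    cell with a symmetric Dirichlet pseudo-count alpha (assumed value)."""
    cell_counts = Counter()
    parent_counts = Counter()
    for row in rows:
        key = tuple(row[p] for p in parents)
        cell_counts[(key, row[child])] += 1
        parent_counts[key] += 1
    cpt = {}
    for key, total in parent_counts.items():
        denom = total + alpha * len(child_states)
        cpt[key] = {
            state: (cell_counts[(key, state)] + alpha) / denom
            for state in child_states
        }
    return cpt


# Toy records: ri = 1 if the cell is close to a river, risk = 1 for "high loss".
records = [
    {"ri": 1, "risk": 1},
    {"ri": 1, "risk": 1},
    {"ri": 1, "risk": 0},
    {"ri": 0, "risk": 0},
]
cpt = estimate_cpt(records, child="risk", parents=["ri"],
                   child_states=[0, 1], alpha=1.0)
```

The pseudo-count keeps every conditional probability strictly positive, so parent configurations that are rare in the training data do not produce zero entries in the CPT.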
4.5. Robust Prediction of the Flood Disaster Risk
To mitigate the sampling bias and model
uncertainties (also avoiding the overfitting problem),
we use BMA and Occam’s window(19,31) to produce a
robust prediction of the flood risk.
Assume r to be the target variable of risk, D to
be the training data set, and $M_k$ to be the kth model
of BN. Then we can get the averaged value of the
probability of the target variable being in a certain
state using BMA:

\[
\mathrm{pr}(r \mid D) = \sum_{k=1}^{K} p(r \mid M_k, D)\, p(M_k \mid D), \tag{7}
\]

where K is the number of models selected and

\[
p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)} \tag{8}
\]

is the posterior weight of model $M_k$, determined by
the Bayes factors, that is, the ratios of marginal
likelihoods (equivalently, of posterior odds to prior
odds) for different models. We use the BN’s inference
algorithms (Table VI) to obtain $p(D \mid M_k)$ and
assume that the prior probability of each model,
$p(M_k)$, is the same.
While BMA can average the predictions of the
models obtained using various learning algorithms
(Table VI), we can also use Occam’s window to select
the qualified models and remove the poor models,
thus improving the computation efficiency. Occam’s
window(19) has two principles: (1) if a model receives
much less support (e.g., a ratio of 20:1) than the
model with maximum posterior probability, then it
should be dropped; (2) complex models that receive
less support than their simpler counterparts should
be dropped.
We use six search algorithms shown in Table VI
(in bold font) to get the local structures of qualitative
factors and use BMA and Occam’s window to aver-
age the qualified models, thus decreasing model bias
and improving the robustness.
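The averaging step can be sketched as follows, assuming equal model priors as stated above. The sketch normalizes the marginal likelihoods as in Equation (8), applies only the first principle of Occam’s window (the 20:1 support ratio; the complexity-based second principle is omitted), and then averages the per-model predictions as in Equation (7). The function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def model_posteriors(log_marginal_likelihoods):
    """Eq. (8) with equal priors: normalize p(D | M_k) over the models."""
    top = max(log_marginal_likelihoods.values())
    weights = {m: math.exp(v - top) for m, v in log_marginal_likelihoods.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def occams_window(posteriors, max_ratio=20.0):
    """Drop models supported max_ratio times less than the best model,
    then renormalize the remaining weights."""
    best = max(posteriors.values())
    kept = {m: p for m, p in posteriors.items() if p * max_ratio >= best}
    total = sum(kept.values())
    return {m: p / total for m, p in kept.items()}


def bma_predict(predictions, posteriors):
    """Eq. (7): posterior-weighted average of p(r | M_k, D)."""
    return sum(predictions[m] * w for m, w in posteriors.items())


# Hypothetical log marginal likelihoods and per-model risk probabilities.
log_ml = {"K2": math.log(0.6), "HC": math.log(0.38), "GA": math.log(0.02)}
risk_by_model = {"K2": 0.9, "HC": 0.8, "GA": 0.1}

weights = occams_window(model_posteriors(log_ml))
averaged_risk = bma_predict(risk_by_model, weights)
```

In this toy example the third model falls outside the window (its support is 30 times smaller than that of the best model), so the averaged probability is formed from the two remaining models only.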
5. EVALUATION
To evaluate our method, we compared it with
other methods (Section 5.1) using scalar performance
measures (Section 5.2).
5.1. Methods Compared
Our methods were compared with other predic-
tion methods. This section provides a simple intro-
duction of these methods.
The methods compared include both nonprob-
abilistic and probabilistic methods. Nonprobabilistic
methods do not output the risk probability of each
predicted instance, but instead output its class la-
bel directly and these methods include J48, RF, RT,
SMO, and Winnow. Probabilistic methods predict
the risk probability distribution of each test instance
and classify the instance according to the distribu-
tion. These methods include LR, NB, RBF, MPer,
and our seven BN methods (six search algorithms
and the BMA averaged prediction). Table VII gives
brief descriptions and references for these methods,
and predictive factors used.
Table VII. Prediction Methods Compared
T | Method | Description and Reference | Predictive Factors Used
NP | J48 | A C4.5 decision tree(23) recursively partitions the training data by means of attribute splits and generates a pruned or unpruned tree using the information-theoretical concept of entropy. | All predictive factors (no discretization): quantitative: rf, e, s, dmn, dmax, n; qualitative: ri, ro, re, g
NP | RF | A forest of random trees(32) is a meta-learning scheme that embodies several base classifiers (CART) that are built independently and participate in a voting procedure to obtain a final class prediction. | Same as J48
NP | RT | A tree that considers K randomly chosen attributes at each node without pruning.(33) | Same as J48
NP | SMO | Sequential minimal optimization algorithm for training a support vector classifier(34) globally transforms nominal attributes into binary ones; multiclass problems are solved using pairwise classification. | Same as J48
NP | Win | Winnow and Balanced Winnow algorithm(35) updates, by repeated corrections, a vector of parameters used to construct its weight vector, whose inner product with the vector of features is the prediction. | Same as J48
P | LR | Logistic regression(36) directly estimates the posterior probabilities by fitting data to a logistic curve. | Independent quantitative factors# (no discretization): rf, e, s, dmax, n
P | NB | Naïve Bayes(37) assumes that the presence of a feature of a class is unrelated to the presence of any other feature. | Same as LR
P | RBF | Normalized Gaussian radial basis function (RBF) network(38) comprises a hidden layer of RBF nodes and an output layer with linear nodes; its output activity is normalized by the total input activity in the hidden layer. | Same as LR
P | MPer | Multilayer perceptron, a back-propagation classifier(38) whose network can be built manually or created by an algorithm and can also be monitored and modified during training time (the nodes in this network are all sigmoid). | Same as LR
P | BN | BNs are constructed based on our framework (Fig. 2) and six search strategies (Table VI) for the local structures among qualitative factors. Among the six search algorithms, K2 uses a hill climbing (HC) algorithm restricted by an order of the variables; HC searches for a locally optimal network by adding, deleting, and reversing arcs without any restriction on the variables’ order; Tan determines the maximum weight spanning tree and returns an NB network augmented with a tree; Tabu uses tabu search to find a well-scoring network and is similar to HC; simulated annealing (SA) and genetic algorithm (GA) are generic probabilistic metaheuristic algorithms (SA: physical mechanism; GA: biological mechanism). All the predictions from the six strategies are averaged to get a robust prediction using BMA. | Independent quantitative factors# (discretization) and qualitative factors: quantitative: rf, e, s, dmax, n; qualitative: ri, ro, re, g
Note: T = type; P = probabilistic methods; NP = nonprobabilistic methods.
#Independent quantitative factors include rf, e, s, dmax, n; see Section 4.3 regarding feature selection.
5.2. Performance Measures
We use four scalar measures, that is, pd, pf, pre-
cision, and ROC area, for the comparison.
1. Pd refers to the probability of detection of
“high loss” risk and it measures the propor-
tion of correctly predicted positive instances
among the actual positive ones. If a method
achieves a higher pd, it can detect more pos-
itive instances (more cell units of “high loss”
risk detected).
2. Pf refers to the probability of false alarms and
a good method has a low pf.
3. Precision refers to the proportion of true
positives among the instances predicted as positive,
but it cannot measure how well the method detects
the actual positive instances. Good precision does
not always mean a good pd. A method with high
precision but a lower pd is less useful since it
cannot detect a significant share of the positive
instances (fewer units of “high loss” risk detected).
4. ROC area is the area between the horizontal
axis and the ROC curve, and it is a comprehensive
scalar value representing the model’s expected
performance. The ROC area is between 0.5 and 1,
where a value close to 0.5 is less precise, while a
value close to 1.0 is more precise. A larger ROC
area indicates better prediction performance.
In terms of risk assessment, security and warning
are the major concerns and a good method should
detect more positive (“high loss” risk) instances (a
high pd and lower pf). Thus pd and pf are the main
scalar measures of performance. Precision is a sec-
ondary measure used with pd. In other words, a good
method should have a high pd and low pf with an ac-
ceptable precision.
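The four measures can be computed from the actual labels, the predicted labels, and the predicted probabilities, as in the following sketch. It takes pf as the proportion of actual negative instances that are predicted as positive, and it computes the ROC area through the equivalent pairwise-ranking form (the probability that a random positive instance receives a higher score than a random negative one, with ties counted as one half). The function names and the toy data are hypothetical.

```python
def pd_pf_precision(actual, predicted):
    """Return (pd, pf, precision) for binary labels, with 1 = "high loss"."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    pd = tp / (tp + fn)          # detected share of the actual positives
    pf = fp / (fp + tn)          # false alarms among the actual negatives
    precision = tp / (tp + fp)   # true positives among predicted positives
    return pd, pf, precision


def roc_area(actual, scores):
    """Area under the ROC curve via pairwise ranking of the scores."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example with seven grid cells.
actual = [1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4]

pd, pf, precision = pd_pf_precision(actual, predicted)
area = roc_area(actual, scores)
```

In the toy data pd and precision happen to coincide, but they generally differ: pd is normalized by the actual positives, whereas precision is normalized by the predicted positives.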
6. RESULTS
This section presents the results that include the
learned topologies of the BNs (Section 6.1) and the
prediction comparisons of the different methods us-
ing both 10 × 10 cross-validation (Section 6.2) and
using the historical data to predict the new situations
(Section 6.3).
6.1. Network Topologies of Local Structure
We constructed the BNs according to the modeling
framework (Fig. 2). Specifically, we first obtained
an initial framework of the BN using the selected
independent quantitative factors and qualitative
factors. On the basis of this framework, we used six
search strategies, namely, K2, hill climbing (HC),
Tan, Tabu, simulated annealing (SA), and genetic
algorithm (GA), to search for a qualified local
structure among the four qualitative factors, namely,
ri, g, re, and ro. The searches help fine-tune the
network structure. Fig. 6 shows partial topologies of
the local structure constructed.
6.2. Performance Comparison Using
Cross-Validation
This section presents the comparison of our
methods (six search strategies and the BMA method)
with nine other methods using 10 × 10 cross-validation.
We used the 2006 and 2007 data, respectively, as the
validation data sets.
Tables VIII and IX, respectively, list the scalar
measures (pd, pf, precision, and ROC area) for all 16
methods. As seen from these tables, our methods
overall have a relatively good pd, precision, and
ROC area, which indicates that they were able to
capture many units (cells) of high risk in the flood
disaster. On the other hand, although some of our
methods, such as BN(Tabu) in Table VIII and BN(HC)
in Table IX, do not have the highest precision, the
difference in precision between them and the non-BN
models is small. In particular, the BMA prediction
shows an improvement in both the 2006 and 2007
validations; the improvement is slight in pd and ROC
area but significant in precision. The averaged BN
prediction using BMA has the best pd (0.876 for 2006;
0.873 for 2007), the best precision (0.783 for 2006;
0.835 for 2007), and the best ROC area (0.924 for
2006; 0.890 for 2007) compared with the non-BN
methods. Across all the compared models, the 2006 and
2007 cross-validations demonstrated that BMA
effectively decreased the model bias and improved the
robustness of the risk prediction.

Fig. 6. Initial framework (a) and partial topologies of local structure (among qualitative factors) searched using the six algorithms. Panels: (a) initial network framework according to domain knowledge, linking risk with the qualitative factors (ri, g, re, ro) and the quantitative factors (e, s, rf, n, dmax); (b) topology searched with K2, HC; (c) topology searched with Tan; (d) topology searched with SA; (e) topology searched with Tabu and GA.

Table VIII. Comparison of Prediction Models in the 2006 Cross-Validation
Model Type | Model | pd | pf | Precision | ROC Area
Probabilistic | BN (BMA) | 0.876 | 0.161 | 0.783 | 0.924
Probabilistic | BN (K2) | 0.875 | 0.188 | 0.693 | 0.917
Probabilistic | BN (HC) | 0.875 | 0.188 | 0.693 | 0.916
Probabilistic | BN (Tan) | 0.836 | 0.143 | 0.739 | 0.922
Probabilistic | BN (Tabu) | 0.837 | 0.175 | 0.699 | 0.911
Probabilistic | BN (AN) | 0.766 | 0.129 | 0.742 | 0.904
Probabilistic | BN (GA) | 0.874 | 0.188 | 0.693 | 0.917
Probabilistic | LR | 0.638 | 0.107 | 0.743 | 0.879
Probabilistic | NB | 0.783 | 0.144 | 0.725 | 0.891
Probabilistic | RBF | 0.797 | 0.171 | 0.693 | 0.879
Probabilistic | MPer | 0.733 | 0.105 | 0.742 | 0.828
Nonprobabilistic | J48 | 0.801 | 0.133 | 0.745 | 0.88
Nonprobabilistic | RF | 0.747 | 0.093 | 0.746 | 0.901
Nonprobabilistic | RT | 0.709 | 0.125 | 0.734 | 0.798
Nonprobabilistic | SMO | 0.774 | 0.127 | 0.747 | 0.823
Nonprobabilistic | Win | 0.711 | 0.167 | 0.711 | 0.772

Table IX. Comparison of Prediction Models in the 2007 Cross-Validation
Model Type | Model | pd | pf | Precision | ROC Area
Probabilistic | BN (BMA) | 0.873 | 0.403 | 0.835 | 0.890
Probabilistic | BN (K2) | 0.825 | 0.229 | 0.606 | 0.881
Probabilistic | BN (HC) | 0.830 | 0.231 | 0.605 | 0.879
Probabilistic | BN (Tan) | 0.771 | 0.161 | 0.670 | 0.889
Probabilistic | BN (Tabu) | 0.792 | 0.214 | 0.613 | 0.871
Probabilistic | BN (AN) | 0.596 | 0.126 | 0.668 | 0.869
Probabilistic | BN (GA) | 0.805 | 0.205 | 0.626 | 0.869
Probabilistic | LR | 0.635 | 0.117 | 0.629 | 0.850
Probabilistic | NB | 0.715 | 0.188 | 0.619 | 0.822
Probabilistic | RBF | 0.691 | 0.183 | 0.618 | 0.819
Probabilistic | MPer | 0.619 | 0.122 | 0.681 | 0.845
Nonprobabilistic | J48 | 0.634 | 0.115 | 0.702 | 0.801
Nonprobabilistic | RF | 0.651 | 0.102 | 0.731 | 0.810
Nonprobabilistic | RT | 0.627 | 0.125 | 0.682 | 0.801
Nonprobabilistic | SMO | 0.659 | 0.118 | 0.706 | 0.771
Nonprobabilistic | Win | 0.474 | 0.142 | 0.586 | 0.665
6.3. Performance Comparison of Prediction
This section presents the comparison of our BN-
based methods with nine other methods using the
2006 data to train the learners used to predict the risk
of the 2007 and 2008 flood disaster in the same area.
Tables X and XI, respectively, list the scalar
measures (pd, pf, precision, and ROC area) of the 2007
and 2008 risk predictions for all 16 methods. These
tables show that, overall, our methods have a
relatively good pd, which indicates that they are
able to predict most units of high risk in the flood
disaster. From these tables, we can see that our
methods also have a reasonable probability of false
alarms (pf). In particular, the BN-based average
prediction using BMA has improved the probability of
detection (pd: 0.828 vs. 0.250–0.740, an improvement
of about 12–200% for 2007; 0.914 vs. 0.349–0.718, an
improvement of about 27–162% for 2008) and precision
(0.851 vs. 0.454–0.640, an improvement of about
32–88% for 2007; 0.805 vs. 0.554–0.763, an
improvement of about 6–45% for 2008). The BMA
prediction also has a slightly better ROC area in
both the 2007 and 2008 predictions (0.881–0.887).
Across all the compared models, the 2007 and 2008
predictions demonstrated that BMA considerably
decreased the model bias and improved the robustness
of the risk prediction.

Table X. Comparison of Prediction Models Using the 2006 Data to Predict the Disaster Risk of the 2007 Flood
Model Type | Model | pd | pf | Precision | ROC Area
Probabilistic | BN (BMA) | 0.828 | 0.205 | 0.851 | 0.887
Probabilistic | BN (K2) | 0.673 | 0.179 | 0.616 | 0.856
Probabilistic | BN (HC) | 0.673 | 0.178 | 0.612 | 0.836
Probabilistic | BN (Tan) | 0.687 | 0.21 | 0.62 | 0.823
Probabilistic | BN (Tabu) | 0.741 | 0.223 | 0.62 | 0.823
Probabilistic | BN (AN) | 0.532 | 0.211 | 0.454 | 0.754
Probabilistic | BN (GA) | 0.673 | 0.179 | 0.6166 | 0.756
Probabilistic | LR | 0.250 | 0.062 | 0.633 | 0.761
Probabilistic | NB | 0.671 | 0.181 | 0.613 | 0.755
Probabilistic | RBF | 0.705 | 0.199 | 0.602 | 0.842
Probabilistic | MPer | 0.479 | 0.142 | 0.591 | 0.826
Nonprobabilistic | J48 | 0.680 | 0.202 | 0.591 | 0.807
Nonprobabilistic | RF | 0.544 | 0.169 | 0.579 | 0.820
Nonprobabilistic | RT | 0.566 | 0.202 | 0.546 | 0.684
Nonprobabilistic | SMO | 0.548 | 0.135 | 0.634 | 0.706
Nonprobabilistic | Win | 0.331 | 0.077 | 0.640 | 0.627

Table XI. Comparison of Prediction Models Using the 2006 Data to Predict the Disaster Risk of the 2008 Flood
Model Type | Model | pd | pf | Precision | ROC Area
Probabilistic | BN (BMA) | 0.914 | 0.123 | 0.805 | 0.881
Probabilistic | BN (K2) | 0.642 | 0.122 | 0.718 | 0.877
Probabilistic | BN (HC) | 0.629 | 0.126 | 0.707 | 0.851
Probabilistic | BN (Tan) | 0.718 | 0.20 | 0.634 | 0.852
Probabilistic | BN (Tabu) | 0.632 | 0.122 | 0.633 | 0.814
Probabilistic | BN (AN) | 0.509 | 0.199 | 0.554 | 0.679
Probabilistic | BN (GA) | 0.645 | 0.123 | 0.722 | 0.870
Probabilistic | LR | 0.617 | 0.233 | 0.616 | 0.814
Probabilistic | NB | 0.645 | 0.124 | 0.722 | 0.728
Probabilistic | RBF | 0.641 | 0.125 | 0.714 | 0.810
Probabilistic | MPer | 0.614 | 0.096 | 0.756 | 0.828
Nonprobabilistic | J48 | 0.52 | 0.084 | 0.750 | 0.733
Nonprobabilistic | RF | 0.473 | 0.086 | 0.725 | 0.830
Nonprobabilistic | RT | 0.420 | 0.079 | 0.721 | 0.775
Nonprobabilistic | SMO | 0.559 | 0.112 | 0.708 | 0.724
Nonprobabilistic | Win | 0.349 | 0.053 | 0.763 | 0.648
Figs. 7 and 8, respectively, show the maps of the
2007 and 2008 BMA prediction of risk probability. In
these two figures, the degree of grayness represents
the probability of high risk. We can see that the re-
gion of higher risk (darker region) is close to rivers
and residential areas and this result is consistent with
the practical situation. From Figs. 7 and 8, we can see