Model averaging versus model selection: estimating design floods with uncertain river flow data


Hydrological Sciences Journal. ISSN: 0262-6667 (Print), 2150-3435 (Online)
To cite this article: Kenechukwu Okoli, Korbinian Breinl, Luigia Brandimarte, Anna Botto, Elena Volpi & Giuliano Di Baldassarre (2018) Model averaging versus model selection: estimating design floods with uncertain river flow data, Hydrological Sciences Journal, 63:13-14, 1913-1926.

© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis.

Accepted author version posted online: 09 November 2018. Published online: 06 December 2018.
Model averaging versus model selection: estimating design floods with
uncertain river flow data
Kenechukwu Okoli, Korbinian Breinl, Luigia Brandimarte, Anna Botto, Elena Volpi and Giuliano Di Baldassarre
Department of Earth Sciences, Uppsala University, Uppsala, Sweden; Centre of Natural Hazards and Disaster Science (CNDS), Uppsala, Sweden; Department of Sustainable Development, Environmental Science and Engineering, Royal Institute of Technology, Stockholm, Sweden; Department of Civil, Environmental and Architectural Engineering, University of Padova, Padova, Italy; Department of Civil Engineering (Scienze dell'Ingegneria Civile), Roma Tre University, Rome, Italy
This study compares model averaging and model selection methods to estimate design floods,
while accounting for the observation error that is typically associated with annual maximum flow
data. Model selection refers to methods where a single distribution function is chosen based on
prior knowledge or by means of selection criteria. Model averaging refers to methods where the
results of multiple distribution functions are combined. Numerical experiments were carried out
by generating synthetic data using the Wakeby distribution function as the parent distribution.
For this study, comparisons were made in terms of relative error and root mean square error
(RMSE) referring to the 1-in-100 year flood. The experiments show that model averaging and
model selection methods lead to similar results, especially when short samples are drawn from
a highly asymmetric parent. Also, taking an arithmetic average of all design flood estimates gives
estimated variances similar to those obtained with more complex weighted model averaging.
Received 9 March 2018; Accepted 13 September 2018

Editor A. Castellarin; Associate Editor S. Vorogushyn

Keywords: model averaging; model selection; design flood; Akaike information criterion
1 Introduction
A common task in applied hydrology is the estimation of the design flood, i.e. a value of river discharge corresponding to a given exceedance probability that is often expressed as a return period in years. Flood risk assessment, floodplain mapping and the design of hydraulic structures are a few examples of applications where estimates of design floods are required. Two common approaches for estimating a design flood are either rainfall-runoff modelling (e.g. Moretti and Montanari 2008, Beven 2012, Breinl 2016) or the fitting of a probability distribution function to a record of annual maximum or peak-over-threshold flows (Viglione et al. 2013, Yan and Moradkhani 2016). The latter approach, which is the focus of this paper, has been referred to in the literature as the "standard approach" to the frequency analysis of floods (Klemeš 1993). The standard approach is affected by various sources of uncertainty, including: the choice of the sampling technique (peak-over-threshold or annual maximum flows), a limited sample size, the selection of a suitable probability distribution function, the method of parameter estimation for the chosen distribution function, and errors in the observed annual peak flows derived from a rating curve (Sonuga 1972, Laio et al. 2009, Di Baldassarre et al. 2012).
It is common practice in any form of modelling or statistical analysis (including flood frequency analysis) to consider a range of models as possible representations of the observed reality. A single model is usually selected based on different criteria, such as: (a) goodness-of-fit statistics, e.g. by using the chi-squared (χ²) test; (b) prior selection of a distribution function as a result of what Chamberlain (1965) referred to as "parental affection" towards a given model; or (c) standardization, such as the log-Pearson Type III distribution used for flood frequency analysis in the USA (US Water Resources Council 1982). In the field of flood frequency analysis, the selection of a single best distribution function represents an implicit assumption that the selected model can adequately describe the frequency of observed and future floods, including the extreme ones. This
assumption departs from the understanding that low and ordinary floods (which usually make up the annual peak flow record) are dominated by different processes compared to extreme floods, which are often the main focus in flood risk management. Therefore, the selection of a single distribution model, which is valid for the whole range of flows, may lead to uncertainty in the design flood estimates. Also, smaller floods are known to influence the smoothing and extrapolation of the largest discharges in the record and in turn may lead to uncertain estimates of the design flood (Klemeš).
Experience from evaluating probability plots of discharge records shows that different distribution functions commonly used in flood frequency analysis give similar fits to data. The reason for this is that a majority of the parametric models used in flood frequency analysis have two or three parameters and are built to preserve the mean and variance of the calibration data (Koutsoyiannis 2004). Hence, there is always a model choice uncertainty when a particular distribution function is selected for estimation purposes. Within the hydrological modelling community, the phenomenon where different models give a similar fit to data has been referred to as "equifinality" (Beven 1993, 2006). In that context, Beven and Binley (1992) developed the generalized likelihood uncertainty estimation (GLUE) to support ensemble predictions of a model output variable. Just like GLUE, other techniques on how to combine estimates from different model structures (and parameter sets) using weights were developed and are generally referred to as model averaging (Hoeting et al. 1999, Burnham and Anderson 2002). Bayesian model averaging (BMA), for example, is used extensively in hydrogeology (Tsai and Li 2008, Ye et al. 2010, Foglia et al. 2013) to quantify predictive uncertainty when diverse conceptual models are used for recharge and/or hydraulic conductivity estimates. The reader is referred to Schöniger et al. (2014) and Volpi et al. (2017) for a detailed discussion on Bayesian model evidence (BME) for hydrological applications, especially when the problem of model selection is addressed using BMA.
Uncertainties present in the record of annual maximum flows are often neglected. For example, flood discharges that are considerably larger than the directly measured discharges, and are therefore derived by extrapolating the rating curve, are subject to major errors, which may in turn impact the estimate of sample statistics such as the skewness (Potter and Walker 1985). Kuczera (1992, 1996) showed that significant uncertainty in the design flood estimate is often caused by errors in discharge data derived from a rating curve. Other studies made use of numerical approaches based on hydraulic modelling or Monte Carlo sampling to quantify the uncertainty in flow data due to rating curve errors (Di Baldassarre and Montanari 2009, Westerberg and McMillan 2015). According to their findings, the uncertainty present in derived discharges may add up to 30% or more.
Given this background, in this study we account for
two sources of uncertainty that can significantly affect
the design flood estimate: errors in the river flow data,
i.e. annual maximum flows derived from a rating
curve, and the choice of distribution function.
We compare model selection (denoted here as MS) with
two different types of model averaging: arithmetic model
averaging (denoted as MM) and weighted model averaging
(MA). Model selection refers to a case where a single best
distribution function is selected based on a selection
criterion; MM describes the averaging by applying the
arithmetic average of all estimated design floods; and MA
refers to the application of a weighted average of design
flood estimates from different probability functions (with
weights based on a selection criterion). We used the Akaike
information criterion (AIC) as a selection criterion for both
MS and MA. The study was conducted in a simulation
framework using the Wakeby distribution as the parent
model for generating synthetic annual maximum flows of
different sample sizes.
The aims of our study are as follows: (a) to simulate the epistemic uncertainty of the real-world scenario; that is, in the real world, the parent distribution is unknown and likely more complex than the simpler distribution functions used for fitting and estimation purposes; (b) to make a systematic assessment and comparison of the performance of alternative methods for estimating design floods (MS vs MA vs MM); and (c) to analyse the effect of flood data errors across the three techniques and the respective candidate distribution functions. The 1-in-100 year flood, i.e. the flood corresponding to a return period of 100 years (hereafter 100-year flood), is selected as the design flood of interest due to its wide use as a design standard in flood risk management (Brandimarte and Di Baldassarre 2012). For example, the current policy in the USA for flood defence design refers to the 100-year flood (Commission on Geosciences Environment and Resources). The analyses presented in this study are built on the assumption of stationarity, which has been widely discussed in hydrology (e.g. Milly et al. 2008, Montanari and Koutsoyiannis 2014, Serinaldi and Kilsby 2015, Luke et al. 2017) and is not further discussed here.
2 Methods
The problem of the MS and MA methods is formulated as follows: a record of a random variable X is available and sampled from an unknown parent distribution g(x). The samples are arranged in ascending order x(1) ≤ x(2) ≤ … ≤ x(n). A set of probability distribution functions, whose general mathematical form can be written as f(xi|θ) with θ as model parameter, are specified as potential candidates for design flood estimation. To implement the MS and MA techniques, we used the Akaike selection criterion, which is a commonly used method for model comparison in hydrology (e.g. Mutua 1994, Strupczewski et al. 2001). MS techniques based on information theory require the estimation of a measure of discrepancy, or amount of information loss, when a model is used to approximate the full reality (Linhart and Zucchini 1986). Akaike (1973) formulated the AIC as an estimator of information loss or gain when a model is fitted to data. The AIC index is expressed as:

AIC = −2 ln L(θ̂) + 2K    (1)

where K is the number of parameters, ln L(θ̂) is the numerical value of the log-likelihood at its maximum point for the selected model, and θ̂ is the maximum likelihood estimator of the model parameters. For a detailed mathematical description, the reader is referred to Linhart and Zucchini (1986) and Burnham and Anderson (2002). A heuristic interpretation of Equation (1) is that the first term (lack of fit) decreases as the number of parameters K, and hence the second term, increases. This shows a distinct property of the AIC in finding a trade-off between bias and variance of an estimator. The AIC is relative and, since the "truth" is not known, it is the relationship between the AIC values of the respective models that indicates the model of choice, not the AIC values per se (Burnham and Anderson 2002).
An extension of the AIC, denoted AICc, was proposed by Sugiura (1978) to correct for bias due to a short sample size n; the AICc index is expressed as:

AICc = AIC + 2K(K + 1)/(n − K − 1)    (2)

Burnham and Anderson (2002) suggested using AICc when the ratio n/K is small (e.g. <40), and the original formulation when the ratio is sufficiently large. We considered both AIC and AICc in this study. The AICc was used for the short samples, which in this application means a sample size of 30 years, and the AIC was used for the larger sample sizes generated in our numerical experiments, as detailed in the following sections. In principle, the model with the minimum AIC (or AICc) is considered the most suitable model.
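As a concrete illustration (a sketch, not the authors' code), the two indices of Equations (1) and (2) can be computed directly from a fitted model's maximised log-likelihood; the numbers below are hypothetical:

```python
def aic(log_lik_max, k):
    """Akaike information criterion (Equation (1)): AIC = -2 ln L(theta_hat) + 2K."""
    return -2.0 * log_lik_max + 2.0 * k

def aicc(log_lik_max, k, n):
    """Small-sample corrected AIC (Equation (2), Sugiura 1978):
    AICc = AIC + 2K(K + 1) / (n - K - 1)."""
    return aic(log_lik_max, k) + 2.0 * k * (k + 1) / (n - k - 1)

# Hypothetical example: a two-parameter model (e.g. Gumbel) fitted to n = 30
# annual maxima with a maximised log-likelihood of -52.1:
print(aic(-52.1, k=2))         # = -2*(-52.1) + 2*2
print(aicc(-52.1, k=2, n=30))  # adds the correction 2*2*3/(30 - 2 - 1)
```

With n = 30 and K = 2 the ratio n/K is well below 40, so AICc would be the appropriate index for this sample.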
2.1 Model selection
The aim of MS is to identify an optimal model from a set of
possible candidates using a selection criterion (such as the
aforementioned AIC). The MS technique can also be seen
as a special case of model averaging (see Section 2.2 for
details), where a weight of 1 is given to one distribution
function and a weight of 0 is assigned to all other models
considered. The efficiency of selecting the right parent model using various model selection techniques, and its effect on design flood estimation, has been discussed in detail in the hydrological literature (e.g. Turkman 1985, Di Baldassarre et al. 2009, Laio et al. 2009).
2.2 Model averaging (MA and MM)
Both model averaging methods (MA and MM) address
the issue of uncertainty in the choice of probability distribution functions by combining all model estimates of
the design flood. Several studies have demonstrated the
use of MA in dealing with model structure uncertainty
(Bodo and Unny 1976, Tung and Mays 1981a, 1981b, Laio et al. 2011, Najafi et al. 2011, Najafi and Moradkhani 2015, Yan and Moradkhani 2016). Model averaging is
similar to the concept of multiple working hypotheses
(Chamberlain 1965), which is thought to cope better with
the unavoidable bias of using a single model.
The weighted MA technique assigns different weights to the distribution functions considered for estimation. In order to compute these weights, the models are first ranked based on their estimated AIC values, followed by the computation of weights for all the distribution functions. The distribution with the minimum AIC is assigned the highest weight. These weights are referred to as Akaike weights wi (Burnham and Anderson 2002):

wi = exp(−Δi/2) / Σ (r = 1 to R) exp(−Δr/2)    (3)

where R is the number of models considered and Δi, called the Akaike difference, represents the discrepancy between the best model (with the minimum AIC) and the i-th model, and is expressed as:

Δi = AICi − AICmin    (4)

A zero value of the Akaike difference (i.e. Δi = 0) points to the best distribution function to be used to fit the data. The arbitrariness in the use of Akaike weights is recognized in this work since, in practice, the "true" design flood is not known and the weighting only gives information about the adequacy of a model to fit the observations, not about the accuracy of the estimated discharge.
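Equations (3) and (4) amount to a softmax over halved AIC differences; a minimal sketch with hypothetical AIC values for the four candidates:

```python
import math

def akaike_weights(aic_values):
    """Akaike weights (Equations (3) and (4)):
    Delta_i = AIC_i - AIC_min;  w_i = exp(-Delta_i/2) / sum_r exp(-Delta_r/2)."""
    aic_min = min(aic_values)
    terms = [math.exp(-0.5 * (a - aic_min)) for a in aic_values]
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical AIC values for the four candidates (EV1, GEV, P3, LN):
w = akaike_weights([212.4, 210.1, 209.8, 211.0])
print([round(x, 3) for x in w])  # the minimum-AIC model (here the third) gets the largest weight
```

Subtracting AICmin before exponentiating is also the numerically stable way to evaluate Equation (3) for large AIC values.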
Let us consider R competing probability distribution functions, denoted f1, …, fR. A posterior predictive distribution of a quantity of interest φ (e.g. a design flood) given the vector of observed data X can be expressed as:

p(φ|X) = Σ (i = 1 to R) p(φ|fi, X) p(fi|X)    (5)

where p(·|X) represents the conditional probability distribution function and p(fi|X) is the posterior probability for a given model. Equation (5) was adapted from Hoeting et al. (1999) and provides a way of averaging the posterior distributions of the design flood under each of the models considered, weighted by the posterior model probability p(fi|X). The posterior model probability represents the degree of fit between a particular distribution function and the data, and can be assigned by expert judgement (Merz and Thieken 2005) or estimated using Bayesian or Akaike techniques, the latter already described above as the Akaike weights wi.
Uncertainty in the parameters of the individual pdfs, and its effect on the accuracy of the estimated design flood, is not considered in this study, as the focus was on evaluating point estimates rather than the posterior probability distribution of the design flood; a simplification of Equation (5) is therefore used:

Q̂T = Σ (i = 1 to R) wi Q̂T,i    (6)

where Q̂T is the estimated design flood for a given return period T and Q̂T,i is the estimate obtained with the i-th candidate model. The estimated model weights wi are assigned to the candidate models, with the model that fits the data best having the highest weight.
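Equation (6) thus reduces model averaging to a weighted sum of the candidate quantile estimates; a sketch with hypothetical quantiles and weights (equal weights reproduce MM):

```python
def averaged_design_flood(quantiles, weights):
    """Equation (6): Q_T = sum_i w_i * Q_T,i.
    With equal weights 1/R this reduces to the arithmetic average used by MM."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to one"
    return sum(w * q for q, w in zip(quantiles, weights))

# Hypothetical 100-year flood estimates (m^3/s) from EV1, GEV, P3 and LN,
# and hypothetical Akaike weights:
q100 = [5.8, 7.2, 6.4, 6.0]
ma = averaged_design_flood(q100, [0.12, 0.18, 0.50, 0.20])  # weighted (MA)
mm = averaged_design_flood(q100, [0.25] * 4)                # arithmetic (MM)
print(ma, mm)
```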
As for MM, a simple arithmetic average is applied over the design flood estimates of all models, i.e. all models have equal weights. Similar to Graefe et al. (2015), we use it as a benchmark.
3 Numerical experiments
3.1 Choice of parent distribution
The Wakeby distribution function was used as the
parent distribution to generate synthetic annual maximum flows of different sample sizes. The synthetic
samples were used for the systematic assessment of
the MS, MA and MM techniques. Various distribution
functions (see Table 1) were then used to fit these
synthetic time series and estimate the 100-year flood.
The Wakeby distribution function is a five-parameter distribution and was defined by Houghton (1977, 1978). The use of the Wakeby distribution first came about as a result of findings by Matalas et al. (1975), who showed that many commonly used distribution functions are not capable of reproducing the instability observed in sample estimates of skewness derived from flow records. In other words, the standard deviation of sample estimates of skewness derived from real-world flow data is higher than that derived from synthetic flow data. Matalas et al. (1975) called this behaviour the "separation effect", a contradiction similar to the Hurst effect. The Wakeby quantile function is described as follows:

x = m + a[1 − (1 − F)^b] − c[1 − (1 − F)^(−d)]    (7)

where F ≡ F(x) = P(X ≤ x) and 0 ≤ F ≤ 1. The density function f ≡ f(x) is defined as:

f = dF/dx = [ab(1 − F)^(b−1) + cd(1 − F)^(−d−1)]^(−1)    (8)

The distribution can be thought of in two parts: a left-hand tail a[1 − (1 − F)^b] (small flows) and a right-hand tail −c[1 − (1 − F)^(−d)] + m (large flows). The letters a, b, c, d and m represent the distribution parameters, x is the flood quantile (or design flood) for a given return
Table 1. Probability distribution functions used in this study as operative models.

Gumbel or EV1, parameters (θ1, θ2):
  F(x; θ) = exp{−exp[−(x − θ1)/θ2]}
Generalized extreme value (GEV), parameters (θ1, θ2, θ3):
  F(x; θ) = exp{−[1 − θ3(x − θ1)/θ2]^(1/θ3)}
Pearson Type III (P3), parameters (θ1, θ2, θ3):
  f(x; θ) = [1/(|θ2|Γ(θ3))] [(x − θ1)/θ2]^(θ3 − 1) exp[−(x − θ1)/θ2]
Lognormal (LN), parameters (θ1, θ2):
  f(x; θ) = [1/(x θ2 √(2π))] exp{−[log x − θ1]^2/(2 θ2^2)}
period T, and F is the non-exceedance probability, i.e. F = 1 − 1/T. If F = 0, then x = m and f = 1/(ab + cd). Note that, since f ≥ 0 for all x, it must hold that ab + cd ≥ 0. For F = 1, the values of x and f depend upon the values of the parameters of the distribution, the upper bound on x being +∞ or m + (a − c). Not all parameterizations of the Wakeby distribution are capable of accounting for the conditions of separation mentioned above. However, in an extensive Monte Carlo experiment, Landwehr and Wallis (1978) found that when b > 1 and d > 0 (i.e. long, stretched upper tails) the Wakeby distribution accounts for the conditions of separation. The parameter combinations (i.e. fixed values) used in this study to define a Wakeby parent are listed in Table 2 and were taken from Landwehr and Matalas (1979). A detailed presentation of parameter limits and valid parameter combinations for the Wakeby distribution is provided by Landwehr and Wallis (1978).
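Because the quantile function of Equation (7) is explicit in F, synthetic annual maxima can be drawn by inverse-transform sampling. A sketch (not the authors' code) using the Wakeby-1 parameters of Table 2:

```python
import random

def wakeby_quantile(F, m, a, b, c, d):
    """Wakeby quantile function (Equation (7)):
    x = m + a[1 - (1-F)^b] - c[1 - (1-F)^(-d)]."""
    return m + a * (1 - (1 - F) ** b) - c * (1 - (1 - F) ** (-d))

def wakeby_sample(n, params, rng=random):
    """Inverse-transform sampling: draw F ~ U(0, 1), map through the quantile."""
    return [wakeby_quantile(rng.random(), *params) for _ in range(n)]

WAKEBY_1 = (0.0, 1.0, 16.0, 4.0, 0.20)  # m, a, b, c, d (Table 2)

# "True" 100-year flood for Wakeby-1: the quantile at F = 1 - 1/100.
print(round(wakeby_quantile(0.99, *WAKEBY_1), 2))  # 7.05, as reported in Section 4

record = wakeby_sample(50, WAKEBY_1)  # one synthetic 50-year record
```

The printed value matches the true design flood of 7.05 m³/s that the paper reports for Wakeby-1, which is a useful check on the parameterization of Equation (7).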
We chose the Wakeby distribution for the following reasons. First, we want to simulate the epistemic uncertainty that affects any design flood estimation exercise, i.e. the understanding that the flood generation processes are complex (and not completely known), while simpler models are commonly used for fitting and estimation purposes. The Wakeby distribution has a higher level of complexity, in the form of more parameters, than the other distribution functions commonly used for estimation purposes. Second, it mimics the upper tail structures typical of flood distributions, which are essential to capture in any synthetic data, i.e. the occasional presence of an outlier (in this case an extreme flood peak), which is not expected, but probable. Third, its quantile function is expressed explicitly in terms of the unknown variable, making the generation of synthetic data straightforward (Hosking and Wallis 1997).
It should be noted that previous numerical studies on flood frequency analysis used more common distribution functions, e.g. the lognormal (Matalas et al. 1975, Slack et al. 1975, Matalas and Wallis 1978), as the parent model. However, our choice was based on the need to simulate the fact that, in the real world, the parent distribution is unknown and likely more complex than the simpler distribution functions.
3.2 Choice of probability distribution functions
Four commonly used distribution functions were selected as operational models (i.e. R = 4) to fit the synthetic flows and to estimate the 100-year event. The distribution functions considered are: (i) the EV1 (Gumbel) distribution; (ii) the generalized extreme value (GEV) distribution; (iii) the generalized gamma or Pearson Type III (P3) distribution; and (iv) the lognormal (LN) distribution.
Table 1 provides their cumulative distribution functions (cdf), F(x; θ), and the probability density functions (pdf), f(x; θ); the latter are shown for those distribution functions whose cdf is not invertible.
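For illustration, the four candidates can be fitted to a flow record and used to estimate the 100-year quantile with SciPy's built-in maximum-likelihood `fit` (an assumption for this sketch: SciPy is available, and its generic MLE is used in place of the Smith estimators the paper applies to P3 and GEV; the data are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical record of annual maximum flows (m^3/s):
data = np.array([2.1, 3.4, 2.8, 5.9, 3.1, 4.2, 2.5, 7.3, 3.8, 2.9,
                 4.6, 3.3, 2.2, 6.1, 3.7, 2.6, 5.2, 3.0, 4.0, 2.4])

candidates = {
    "EV1": stats.gumbel_r,     # Gumbel
    "GEV": stats.genextreme,   # generalized extreme value
    "P3":  stats.pearson3,     # Pearson Type III
    "LN":  stats.lognorm,      # lognormal
}

q100 = {}
for name, dist in candidates.items():
    params = dist.fit(data)                       # maximum-likelihood estimates
    q100[name] = dist.ppf(1 - 1 / 100, *params)   # 100-year quantile
    print(name, round(float(q100[name]), 2))
```

As the paper observes, such fits are often close in the body of the data while diverging in the extrapolated upper tail.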
3.3 Simulation framework
We set up a Monte Carlo simulation framework consisting of the following steps, in which the procedure is repeated for each of the Wakeby parent distributions fully determined by the five sets of parameters reported in Table 2. We also let the sample size n vary by assuming values of 30, 50, 100 and 200 years.

(1) One of the Wakeby pdfs, with a fixed set of parameters (Table 2), is selected as parent distribution g(x). As the parameters are fixed, the "true" design flood value Q100 is the quantile corresponding to a return period of 100 years, which is computed using Equation (7) with F = 1 − 1/100.
(2) The Wakeby cdf described in Equation (7) is used to generate a sample of synthetic annual maximum flows Q of fixed length; these values are considered "true" discharges. Introducing observation error, corrupted discharges Q* are generated using the error model for uncorrelated observation errors (Kuczera 1992) as follows:

Q* = Q(1 + βε)    (9)

where ε denotes a standard Gaussian random variable (i.e. zero mean and standard deviation of 1), Q is the true discharge, and β is a positive-valued coefficient denoting the magnitude of the observation error. Values for β of 0.00, 0.15 and 0.30 (i.e. 0%, 15% and 30%) are the magnitudes of observation error considered in this study, taken from Di Baldassarre et al. (2012). A β value of 0% represents the scenario in which the observed discharge equals the true discharge; thus there is no observation error.
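The error model of Equation (9), as reconstructed here (a multiplicative Gaussian perturbation), can be sketched as follows; the discharge values are hypothetical:

```python
import random

def corrupt_discharges(q_true, beta, rng=random):
    """Observation-error model (Equation (9)): Q* = Q(1 + beta * eps),
    with eps ~ N(0, 1). beta = 0.0, 0.15 or 0.30 in this study's experiments."""
    return [q * (1 + beta * rng.gauss(0.0, 1.0)) for q in q_true]

# A beta of 0 returns the "true" record unchanged; beta = 0.15 adds ~15% noise.
q = [3.1, 4.7, 2.8, 6.0]
print(corrupt_discharges(q, beta=0.0))   # identical to q
noisy = corrupt_discharges(q, beta=0.15)
```

Note that the perturbation is proportional to Q, so larger floods receive larger absolute errors, mimicking rating curve extrapolation.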
Table 2. Wakeby distribution functions; μ, σ, Cv, γ and λ denote mean, standard deviation, coefficient of variation, skewness and kurtosis, respectively.

Distribution   m   a   b     c    d     μ     σ     Cv    γ     λ
Wakeby-1       0   1   16.0  4    0.20  1.94  1.34  0.69  4.14  63.74
Wakeby-2       0   1   7.5   5    0.12  1.56  0.90  0.58  2.01  14.08
Wakeby-3       0   1   1.0   5    0.12  1.18  1.03  0.87  1.91  10.73
Wakeby-4       0   1   16.0  10   0.04  1.36  0.51  0.38  1.10  7.69
Wakeby-5       0   1   1.0   10   0.04  0.92  0.70  0.76  1.11  4.73

Source: Landwehr and Matalas (1979)
(3) Using the corrupted discharges Q*, the parameters of the four pdfs (Table 1) are estimated using the method of maximum likelihood. For the P3 and GEV distributions, maximum likelihood estimators are either not available or not asymptotically efficient in a few non-regular cases; Smith's estimators (Smith 1985) were therefore used instead of maximum likelihood estimators.

(4) The four pdfs are used to estimate the design flood as the quantile corresponding to a return period of 100 years.

(5) The AIC is applied for both MS and MA:
(a) MS: AIC or AICc (depending on the sample size generated, see Section 2) is applied by using Equation (1) or (2), respectively, for the four distribution functions, and the optimal distribution is used to estimate the design flood as the flood quantile corresponding to a return period of T years.
(b) MA: using Equation (4), the Akaike differences Δi (i = 1, 2, …, R) are evaluated and used for the computation of the model weights using Equation (3); the estimated design floods for each of the candidate distribution functions are combined by applying Equation (6).

(6) The arithmetic average (MM) of the design floods estimated using the candidate distribution functions (Step 4) is computed.

(7) A percentage relative error is computed in order to compare the true design flood (derived in Step 1) with the design floods estimated by: each of the four candidate models (as in Step 4), model selection (MS, Step 5(a)), weighted model averaging (MA, Step 5(b)), and arithmetic model averaging (MM, Step 6). Thus, we obtained seven relative error estimates (four candidate distribution functions, MS, MA and MM).
Steps 2–7 are repeated 1000 times, generating 1000 synthetic flow samples from a given parent Wakeby distribution and of a fixed sample size. Generated sample sizes of 30 and 50 years reflect the typical length of historical observations, while samples of length 100 and 200 years represent an optimistic case in hydrology.
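One replicate of Steps 2–7 can be sketched end to end in a few lines. This is a deliberately simplified stand-in (a method-of-moments Gumbel fit replaces the paper's maximum-likelihood fitting of four candidates, and the seed is arbitrary), intended only to show how the pieces connect:

```python
import math
import random
import statistics

def wakeby_q(F, m=0.0, a=1.0, b=16.0, c=4.0, d=0.20):
    """Wakeby-1 parent of Table 2, via the quantile function of Equation (7)."""
    return m + a * (1 - (1 - F) ** b) - c * (1 - (1 - F) ** (-d))

def gumbel_mom_quantile(sample, T):
    """Fit a Gumbel (EV1) by the method of moments (a stand-in for the paper's
    maximum-likelihood fitting) and return the T-year quantile."""
    mu, sigma = statistics.mean(sample), statistics.stdev(sample)
    scale = sigma * math.sqrt(6) / math.pi
    loc = mu - 0.5772 * scale            # 0.5772 ~ Euler-Mascheroni constant
    return loc - scale * math.log(-math.log(1 - 1 / T))

rng = random.Random(1)                                        # arbitrary seed
true_q100 = wakeby_q(1 - 1 / 100)                             # Step 1
q_true = [wakeby_q(rng.random()) for _ in range(50)]          # Step 2: n = 50
q_obs = [q * (1 + 0.15 * rng.gauss(0, 1)) for q in q_true]    # corrupt, beta = 0.15
estimate = gumbel_mom_quantile(q_obs, T=100)                  # Steps 3-4 (one model)
re_pct = 100.0 * (estimate - true_q100) / true_q100           # Step 7
print(f"true Q100 = {true_q100:.2f}, EV1 estimate = {estimate:.2f}, RE = {re_pct:.1f}%")
```

Repeating this 1000 times per parent, sample size, and β, and collecting the relative errors for all seven estimators, reproduces the structure of the experiments summarized in Section 4.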
4 Results

Figures 1 to 5 show the results of the numerical experiments and allow a comparison between the different techniques (four candidate distribution functions, MS, MA and MM). They summarize the performance of MS, MA and MM, and also of the candidate distributions, in estimating the 100-year flood for different statistical characteristics of the underlying parent distribution, different record lengths and levels of observation uncertainty.
Figure 1 shows box plots of percentage relative estimation errors when Wakeby-1 is used as the parent distribution. Observation errors for a given sample size increase from the left to the right panels, while the sample size for a given observation error increases from the top to the bottom panels. In general, a tendency towards underestimation is observed for all techniques, namely MS, MA and MM, and for the individual distribution functions, when the parent is highly skewed, as shown in Figures 1 and 2. For instance, considering Wakeby-1 as the parent model, an error magnitude of 15% and a sample size of 50 years, on average, MA underestimates the true design flood by 19.6%, while MS and MM give an equal underestimation of 22.3%. Major deviations across all techniques and distribution functions appear to be reasonable, as the underlying population was based on a complex parent distribution with five parameters, while the fitting is conducted using distribution functions with only two or three parameters.
Figures 3–5 show the box plots obtained by using Wakeby-3 to Wakeby-5 as parent distributions, which in that order reflect the reduction in skewness of the parent distributions (see Table 2 for the value of skewness of each Wakeby parent). These diagrams show that, in general, all three techniques (MS, MA and MM) tend towards overestimation. For instance, considering Wakeby-3 as the parent model, with an error magnitude of 15% and a sample size of 50 years, on average, MS, MA and MM overestimate the true value by 1.3%, 3.6% and 6.88%, respectively. Looking at the panels of Figures 3–5 from left to right, this overestimation grows with increasing observation errors. This is due to the fact that these errors tend to increase the variance of the sample (see Equation (9)), which in turn leads to an increased variance of the design flood estimates (Di Baldassarre et al. 2012).
Another set of box plots was produced to help understand the influence of the Akaike weights used in MA on the overall accuracy and variance of the design flood estimates. For example, if one considers the centre panel of the first row in Figure 6 (i.e. the case of β = 15% and sample size 30), the interpretation is as follows: on average, the best model
Figure 1. Box plots of percentage relative error for MS, MA, MM and all candidate models, with Wakeby-1 as parent model. The red
line represents the median (50th percentile) and the lower and upper ends of the blue box represent the 25th and 75th percentiles,
respectively. Outliers are represented by red crosses.
Figure 2. Box plots of percentage relative error for MS, MA, MM and all candidate models, with Wakeby-2 as parent model. Symbols
as in Figure 1.
Figure 4. Box plots of percentage relative error for MS, MA, MM and all candidate models, with Wakeby-4 as parent model. Symbols
as in Figure 1.
Figure 3. Box plots of percentage relative error for MS, MA, MM and all candidate models, with Wakeby-3 as parent model. Symbols
as in Figure 1.
Figure 5. Box plots of percentage relative error for MS, MA, MM and all candidate models, with Wakeby-5 as parent model. Symbols
as in Figure 1.
Figure 6. Box plots of Akaike weights for all candidate distribution functions with Wakeby-1 as parent model. Symbols as in Figure 1.
among the candidates is P3, and it accounts for
approximately 50% of the weighted average; LN
and GEV account for 20 and 18% respectively;
while EV1 accounts for 12% of the weighted aver-
age. Thus, P3 is clearly the best distribution func-
tion in terms of Akaike weights when the parent is
Wakeby-1. Figure 1 shows that for the same 15%
error P3 has a highly biased estimate (with less
variance) compared to the GEV, which has less bias
but increased variance. The selection of P3 as the
best distribution that fits the data in Figure 6 is
fairly consistent, as the sample size increases from
30 to 100 for an error of 15%. Note that this
behaviour changes when the sample size increases
to 200, with GEV as the best distribution function
followed by P3. If Wakeby-2 is selected as the parent, Figure 7 suggests that LN is the distribution function that fits the data best for almost all sample sizes and error magnitudes considered. Comparing Figures 2 and 7, it is observed that all models have almost the same accuracy and variance, except for EV1, which is slightly more biased. In summary, the comparison between P3 and GEV (Figs. 1 and 6) presents a scenario where the distribution function with the highest weight has less accuracy but small variance, whereas the comparison of LN with all other distribution functions (Figs. 2 and 7) presents a scenario where distribution functions with smaller weights have almost the same variance as the distribution function with the highest weight. That is, a higher (or lower) Akaike weight for a given distribution function does not necessarily translate into better (or worse) estimates of the design flood. The reason is that the Akaike weight, or any other model selection criterion, refers only to how well a distribution function fits the observed data, not to how good the estimation is. Box plots of Akaike weights for Wakeby-3 to Wakeby-5 can be found in the Supplementary material.
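The Akaike weights discussed above follow directly from the AIC values of the fitted candidate distributions. A minimal sketch of the computation is given below; the AIC values are hypothetical numbers for illustration, not results from the study:

```python
import math

def akaike_weights(aic):
    """Convert a list of AIC values into Akaike weights.

    Each weight is exp(-0.5 * delta_i), normalised over all candidates,
    where delta_i = AIC_i - min(AIC)."""
    delta = [a - min(aic) for a in aic]
    likes = [math.exp(-0.5 * d) for d in delta]
    total = sum(likes)
    return [like / total for like in likes]

# Hypothetical AIC values for the four candidates (EV1, LN, GEV, P3)
# fitted to one synthetic sample
weights = akaike_weights([212.4, 210.1, 210.6, 208.9])
print([round(w, 3) for w in weights])
```

The weights sum to one, and the candidate with the lowest AIC receives the largest weight; as noted above, however, a large weight reflects only descriptive fit, not estimation accuracy.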
Tables 3 and 4 show the root mean square error (RMSE) and average percentage relative error (RE%), respectively, for the three methods and the four candidate distribution functions. Table 3 shows that for Wakeby-1, which has a true design flood of 7.05 m³/s, MA has a slightly better accuracy (±1.71 m³/s) when compared to MS (±1.87 m³/s) and MM (±1.77 m³/s). A similar behaviour is observed (Table 4) for the three techniques in terms of RE%. As skewness reduces, i.e. from Wakeby-2 to Wakeby-5, we see that the three techniques have similar performance in terms of RMSE and average RE%. For instance, for Wakeby-3, with a true design flood of 4.68 m³/s, MS has an accuracy of ±0.68 m³/s when compared to MA and MM, with RMSE values of ±0.77 and ±1.04 m³/s, respectively. However, all three models (MS, MA and MM) have similar values of average RE% (between 4.7 and 5.2%). Table 4 shows that the three techniques have a smaller average RE% than most individual distribution functions, except for the case of Wakeby-1, where GEV has the lowest value. This may be seen as a positive outcome of model selection methods in selecting distribution functions for estimation purposes.

Figure 7. Box plots of Akaike weights for all candidate distribution functions with Wakeby-2 as parent model. Symbols as in Figure 1.
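The two performance measures used above can be sketched as follows; the estimates below are made-up numbers standing in for repeated sampling experiments, and the signed-versus-absolute convention for RE% is an assumption of this sketch:

```python
import math

def rmse(estimates, true_value):
    # Root mean square error of design flood estimates around the true quantile
    return math.sqrt(sum((q - true_value) ** 2 for q in estimates) / len(estimates))

def avg_relative_error_pct(estimates, true_value):
    # One plausible definition of average RE%: mean absolute relative error, in percent
    return 100.0 * sum(abs(q - true_value) / true_value for q in estimates) / len(estimates)

# Synthetic 1-in-100 year flood estimates (m3/s) from repeated experiments
true_q100 = 7.05
estimates = [6.2, 7.8, 7.1, 6.6, 7.5]

print(round(rmse(estimates, true_q100), 3))
print(round(avg_relative_error_pct(estimates, true_q100), 2))
```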
5 Discussion and conclusions
When it comes to flood frequency analysis, the true distribution of floods (which includes the true design flood corresponding to a given return period) is not known a priori. Therefore, the task of model selection methods, which lead to a single best distribution function, and of model averaging methods is driven towards better estimation, rather than the search for the true distribution that generated the data.
In this study, the MM approach assigns equal weights to all candidate distribution functions, without taking into account how well these distribution functions fit the data. The MA approach is different from MM in the sense that it takes into account the individual performance of all distribution functions in fitting the data, assigning higher weights to distribution functions that give better fits.
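The difference between the two averaging schemes can be sketched in a few lines; the quantile estimates and Akaike weights below are hypothetical numbers, not results from the study:

```python
# Hypothetical 1-in-100 year flood estimates (m3/s) from four candidate
# distributions, and their Akaike weights (summing to 1)
estimates = [6.8, 7.3, 7.0, 6.5]
weights = [0.12, 0.20, 0.18, 0.50]

# MM: arithmetic average, i.e. equal weights for all candidates
mm = sum(estimates) / len(estimates)

# MA: weighted average, rewarding distributions that fit the data better
ma = sum(w * q for w, q in zip(weights, estimates))

print(round(mm, 3), round(ma, 3))
```

When the weights are nearly equal, the two averages coincide, which is the situation discussed in the following paragraph.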
There are certainly situations in which the distribution functions are all similar, i.e. they have almost the same AIC values and Akaike weights, which leads MA to estimates similar to those of MM. It seems this behaviour, where candidate distribution functions have similar AIC values, may be the norm rather than the exception, as seen in studies by Mutua (1994) and Strupczewski et al. (2001). This might be the reason why, in our study, MA performs similarly to the MM approach. However, MA can only surpass MM in terms of accuracy of estimates if one or more distributions have sufficient weights and their estimates are close to the true value of the design flood.
The MM approach is usually neglected as a sort of outcast because of its obvious simplicity when compared to Bayesian and Akaike approaches. However, studies in the social sciences have demonstrated that MM can perform well compared to ensemble Bayesian model averaging (e.g. Graefe et al. 2015), and similar conclusions were drawn from studies focusing on operational and financial forecasts (Clark and McCracken 2010, Graefe et al. 2014). By assigning equal weights to all candidate distribution functions when implementing the MM method, one ignores the relative adequacy of fit of individual distributions, thereby deliberately introducing bias by taking into account distribution functions with inadequate fit. The MA approach, however, tends to assign higher weights to models with large degrees of freedom, even though the AIC is formulated to take overfitting into account. The effect of overfitting due to the MA approach may lead to improved accuracy, but at the expense of increased variance of the estimated design floods. Conversely, introducing bias by implementing the MM approach may lead to less overfitting but reduced accuracy.
The trade-off between accuracy and variance can also be observed for some candidate distribution functions and between MA and MM. To illustrate that trade-off, let us consider the top right corner of Figure 1: the GEV model provides good accuracy, but high variance, when compared to the two averaging
approaches MA and MM. The same figure shows
that LN has less variance but also less accuracy
when compared to GEV. However, a comparison
of the distribution of Akaike weights (see Fig. 6,
Table 3. Root mean square error (RMSE, m³/s) for all techniques and distribution functions, for a sample size of 50 and an error magnitude of 15%. The first three value columns refer to MS, MA and MM; the remaining four columns refer to the individual candidate distribution functions.

Distribution  True 1-in-100 year flood (m³/s)  MS    MA    MM    Individual candidates
Wakeby-1      7.05                             1.87  1.71  1.77  2.05  2.19  1.88  2.50
Wakeby-2      4.69                             0.79  0.81  0.82  0.78  0.90  0.87  0.98
Wakeby-3      4.68                             0.68  0.77  1.04  1.69  2.78  0.62  0.99
Wakeby-4      3.02                             0.37  0.36  0.36  0.41  0.46  0.38  0.35
Wakeby-5      3.01                             0.59  0.67  0.83  1.50  1.73  0.59  0.31

Table 4. Average percentage relative error (RE%) for all techniques and distribution functions, for a sample size of 50 and an error magnitude of 15%. Columns as in Table 3.

Distribution  True 1-in-100 year flood (m³/s)  MS    MA    MM    Individual candidates
Wakeby-1      7.05                             5.74  5.83  5.91  26.95  2.72   24.42  34.63
Wakeby-2      4.69                             4.09  4.06  4.06  12.72  8.19   14.96  19.39
Wakeby-3      4.68                             4.70  4.92  5.16  29.36  32.09  0.75   19.09
Wakeby-4      3.02                             2.82  2.82  2.82  10.80  1.88   7.52   9.42
Wakeby-5      3.01                             3.39  3.48  3.58  44.45  24.67  12.47  4.31
top right corner) shows that GEV has less weight compared to LN. This shows that there is no clear relationship between the calculated weights and the accuracy and variability of the estimates, and it therefore means that one must give some thought to the estimation problem before drawing up a list of distribution functions suitable for reliable design flood estimates. Also, one can speculate that the similar performance provided by the MA and MM approaches relates to the fact that none of the candidate distributions deviates too much from the parent distribution.
For water management, selecting a distribution
function with a high variance of the estimated
design flood will complicate the design of an infra-
structure, i.e. there is potential for substantial over-
design if the upper limit of the confidence interval
is considered. An unbiased estimate in flood esti-
mation is desirable, but, given the numerous
sources of uncertainty, engineers and planners
usually do not mind sacrificing accuracy in
exchange for reduced variability in estimations
(Slack et al. 1975). However, this ethos of preferring a distribution function with reduced variability
can be problematic, since the true design flood is
not known in advance; it may lead to an increased
risk of over- or under-design of water-related infra-
structure if the true design flood is outside the
calculated confidence intervals. Furthermore, the
decision on the design flood for a given infrastruc-
ture does not only depend on estimates based on
distribution functions, but also on risk perception
and economic feasibility.
Our study likewise shows that, when facing short sample sizes (30–50 years), which are common in hydrology and water resources engineering applications, model averaging (MA and MM) and model selection (MS) lead to better results than arbitrarily selecting a single distribution function. Moreover, for very large sample sizes (100–200 years), which are rare in real-world applications, our study shows that MS, MA and MM have similar variance also when observation uncertainty is introduced. This is related to the fact that the sample sizes are large enough for a better estimation of parameters (even for highly parameterized distribution functions such as GEV), but may not lead to reduced variance due to over-fitting.
It is important to note that our work is focused purely on the estimation of design floods using statistical techniques. Several limitations, such as the distribution functions considered, have unavoidably influenced our results. Future studies on design flood estimation could be extended to consider the physical processes behind flood generation.

Acknowledgements

This research was carried out within the CNDS (Centre of Natural Hazards and Disaster Science) research school, www. We thank Francesco Laio, two anonymous reviewers and the editor for providing critical comments that helped to improve an earlier version of this paper.
Disclosure statement
No potential conflict of interest was reported by the authors.
Kenechukwu Okoli
References

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: B.N. Petrov and F. Csáki, eds. 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 1971. Budapest: Akadémiai Kiadó, 267–281.

Beven, K., 1993. Reality and uncertainty in distributed hydrological modelling. Advances in Water Resources, 16.

Beven, K., 2006. A manifesto for the equifinality thesis. Journal of Hydrology, 320 (1–2), 18–36. doi:10.1016/j.

Beven, K., 2012. Rainfall-runoff modelling: the primer. 2nd ed. West Sussex: John Wiley & Sons.

Beven, K. and Binley, A., 1992. The future of distributed models: model calibration and uncertainty prediction. Hydrological Processes, 6, 279–298.

Bodo, B. and Unny, T.E., 1976. Model uncertainty in flood frequency analysis and frequency-based design. Water Resources Research, 12 (6), 1109–1117.

Brandimarte, L. and Di Baldassarre, G., 2012. Uncertainty in design flood profiles derived by hydraulic modelling. Hydrology Research, 43 (6), 753. doi:10.2166/nh.2011.086

Breinl, K., 2016. Driving a lumped hydrological model with precipitation output from weather generators of different complexity. Hydrological Sciences Journal, 61 (8).

Burnham, K.P. and Anderson, D.R., 2002. Model selection and multimodel inference. 2nd ed. New York: Springer.

Chamberlain, T.C., 1965. The method of multiple working hypotheses [reprint of 1890 Science article]. Science, 148.

Clark, T.E. and McCracken, M.W., 2010. Averaging forecasts from VARs with uncertain instabilities. Journal of Applied Econometrics, 25 (1), 5–29. doi:10.1002/jae.1127

Di Baldassarre, G., Laio, F., and Montanari, A., 2009. Design flood estimation using model selection criteria. Physics and Chemistry of the Earth, Parts A/B/C, 34 (10–12), 606–611. doi:10.1016/j.pce.2008.10.066

Di Baldassarre, G., Laio, F., and Montanari, A., 2012. Effect of observation errors on the uncertainty of design floods. Physics and Chemistry of the Earth, 42–44, 85–90.

Di Baldassarre, G. and Montanari, A., 2009. Uncertainty in river discharge observations: a quantitative analysis. Hydrology and Earth System Sciences Discussions, 6 (1), 39–61. doi:10.5194/hessd-6-39-2009

Foglia, L., et al., 2013. Evaluating model structure adequacy: the case of the Maggia Valley groundwater system, southern Switzerland. Water Resources Research, 49 (1), 260–282. doi:10.1029/2011WR011779

Graefe, A., et al., 2014. Combining forecasts: an application to elections. International Journal of Forecasting, 30 (1), 43–54. doi:10.1016/j.ijforecast.2013.02.005

Graefe, A., et al., 2015. Limitations of ensemble Bayesian model averaging for forecasting social science problems. International Journal of Forecasting, 31 (3), 943–951.

Hoeting, J.A., et al., 1999. Bayesian model averaging: a tutorial. Statistical Science, 14 (4), 382–417. doi:10.2307/2676803

Hosking, J.R.M. and Wallis, J.R., 1997. Regional frequency analysis: an approach based on L-moments. Cambridge, UK: Cambridge University Press. doi:10.1017/

Houghton, J.C., 1977. Robust estimation of the frequency of extreme events in a flood frequency context. PhD dissertation. Harvard University, Cambridge, MA.

Houghton, J.C., 1978. Birth of a parent: the Wakeby distribution for modeling flood flows. Water Resources Research, 14 (6). doi:10.1029/WR015i005p01288

Klemeš, V., 1986. Dilettantism in hydrology: transition or destiny? Water Resources Research, 22 (9S), 177S–188S.

Klemeš, V., 1993. Probability of extreme hydrometeorological events – a different approach. In: Proceedings of the Yokohama Symposium, Extreme Hydrological Events: Precipitation, Floods and Droughts, Yokohama, Japan, IAHS Publ. 213. Wallingford, UK: IAHS Press, Centre for Ecology and Hydrology, 167–176.

Koutsoyiannis, D., 2004. Statistics of extremes and estimation of extreme rainfall: I. Theoretical investigation. Hydrological Sciences Journal, 49 (4).

Kuczera, G., 1992. Uncorrelated measurement error in flood frequency inference. Water Resources Research, 28 (1).

Kuczera, G., 1996. Correlated rating curve error in flood frequency inference. Water Resources Research, 32 (7).

Laio, F., et al., 2011. Spatially smooth regional estimation of the flood frequency curve (with uncertainty). Journal of Hydrology, 408 (1–2), 67–77. doi:10.1016/j.jhydrol.2011.07.022

Laio, F., Di Baldassarre, G., and Montanari, A., 2009. Model selection techniques for the frequency analysis of hydrological extremes. Water Resources Research, 45 (7), W07416. doi:10.1029/2007WR006666

Landwehr, J.M. and Matalas, N.C., 1979. Estimation of parameters and quantiles of Wakeby distributions 1. Known lower bounds. Water Resources Research, 15 (6), 1361–1372.

Landwehr, J.M. and Wallis, J.R., 1978. Some comparisons of flood statistics in real and log space. Water Resources Research, 14 (5), 902–920.

Linhart, H. and Zucchini, W., 1986. Model selection. Hoboken, NJ: John Wiley.

Luke, A., et al., 2017. Predicting nonstationary flood frequencies: evidence supports an updated stationarity thesis in the United States. Water Resources Research, 53 (7), 5469–5494.

Matalas, N.C., Slack, J.R., and Wallis, J.R., 1975. Regional skew in search of a parent. Water Resources Research, 11 (6).

Matalas, N.C. and Wallis, J.R., 1978. Some comparisons of flood statistics in real and log space. Water Resources Research, 14 (5), 902–920.

Merz, B. and Thieken, A.H., 2005. Separating natural and epistemic uncertainty in flood frequency analysis. Journal of Hydrology, 309 (1–4), 114–132. doi:10.1016/j.

Milly, P.C.D., et al., 2008. Climate change – stationarity is dead: whither water management? Science, 319 (5863).

Montanari, A. and Koutsoyiannis, D., 2014. Modeling and mitigating natural hazards: stationarity is immortal! Water Resources Research, 50 (12), 9748–9756.

Moretti, G. and Montanari, A., 2008. Inferring the flood frequency distribution for an ungauged basin using a spatially distributed rainfall–runoff model. Hydrology and Earth System Sciences Discussions, 5 (1), 1–26.

Mutua, F.M., 1994. The use of the Akaike Information Criterion in the identification of an optimum flood frequency model. Hydrological Sciences Journal, 39 (3), 235–244. doi:10.1080/02626669409492740

Najafi, M.R. and Moradkhani, H., 2015. Multi-model ensemble analysis of runoff extremes for climate change impact assessment. Journal of Hydrology, 525, 352–361.

Najafi, M.R., Moradkhani, H., and Jung, I.W., 2011. Assessing the uncertainties of hydrologic model selection in climate change impact studies. Hydrological Processes, 25. doi:10.1002/hyp.8043

Potter, K.W. and Walker, J.F., 1985. An empirical study of flood measurement error. Water Resources Research, 21 (3), 403–406. doi:10.1029/WR021i003p00403

Schöniger, A., et al., 2014. Model selection on solid ground: rigorous comparison of nine ways to evaluate Bayesian model evidence. Water Resources Research, 50, 5342–5350. doi:10.1002/2012WR013085

Serinaldi, F. and Kilsby, C.G., 2015. Stationarity is undead: uncertainty dominates the distribution of extremes. Advances in Water Resources, 77, 17–36.

Slack, J.R., Wallis, J.R., and Matalas, N.C., 1975. On the value of information to flood frequency analysis. Water Resources Research, 11 (5), 629–647. doi:10.1029/

Smith, R.L., 1985. Maximum likelihood estimation in a class of non-regular cases. Biometrika, 72 (1), 67–90.

Sonuga, J.O., 1972. Principle of maximum entropy in hydrologic frequency analysis. Journal of Hydrology, 17, 177–191.

Strupczewski, W.G., Singh, V.P., and Feluch, W., 2001. Non-stationary approach to at-site flood frequency modelling I. Maximum likelihood estimation. Journal of Hydrology, 248, 123–142.

Sugiura, N., 1978. Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics – Theory and Methods, 7 (1), 13–26.

Tsai, F.T.-C. and Li, X., 2008. Inverse groundwater modeling for hydraulic conductivity estimation using Bayesian model averaging and variance window. Water Resources Research, 44 (9). doi:10.1029/2007WR006576

Tung, Y. and Mays, L.W., 1981a. Optimal risk-based design of flood levee systems. Water Resources Research, 17 (4), 843–852.

Tung, Y.K. and Mays, L.W., 1981b. Risk models for flood levee design. Water Resources Research, 17 (4), 833–841.

Turkman, R.F., 1985. The choice of extremal models by Akaike's information criterion. Journal of Hydrology, 82, 307–315.

US Water Resources Council, 1982. Guidelines for determining flood flow frequency: Bulletin 17B. Hydrology Subcommittee, Office of Water Data Coordination, US Geological Survey, Reston, Virginia. Washington, DC: US Government Printing Office.

Viglione, A., et al., 2013. Flood frequency hydrology: 3. A Bayesian analysis. Water Resources Research, 49 (2), 675–692. doi:10.1029/2011WR010782

Volpi, E., et al., 2017. Sworn testimony of the model evidence: Gaussian mixture importance (GAME) sampling. Water Resources Research, 53 (7), 6133–6158. doi:10.1002/

Westerberg, I.K. and McMillan, H.K., 2015. Uncertainty in hydrological signatures. Hydrology and Earth System Sciences, 19, 3951–3968. doi:10.5194/hess-19-3951-2015

Yan, H. and Moradkhani, H., 2016. Towards more robust extreme flood prediction by Bayesian hierarchical and multimodeling. Natural Hazards, 81, 203–225. doi:10.1007/

Ye, M., et al., 2010. A model-averaging method for assessing groundwater conceptual model uncertainty. Ground Water, 48 (5), 716–728. doi:10.1111/j.1745-
... Statistical tests are commonly used to select the model that best fits the given time series data from a number of candidate probabilistic distribution models. The selection of a single best distribution model (i.e., model selection, MS) represents an implicit assumption that the selected model can adequately describe the frequency of the observed and future flood flow events (Okoli et al., 2018). Despite the well-established practice of using model selection (MS) in the field of flood frequency analysis, the technique itself does not take into account the inherent uncertainties (Okoli et al., 2019). ...
... Despite the good performance of the Gamma (2p) distribution, the current study also considered a modification of the arithmetic average (MM method, Okoli et al., 2018) of the considered candidate models in order to reduce the model uncertainty. The available record length of only 31 hydrological years, the non-usual adoption of the Gamma (2p) distribution for flood peaks (Rizwan et al., 2018) and the good performance of the other three distribution models highlighted the need of the herein proposed hydrological methodology. ...
... In the present investigation, a modification of the original MM method (Okoli et al., 2018), is introduced and proposed, which consists of considering only the distributions that ensured good performance on both goodness-of-fit tests and graphical methods. The modified MM method considered the arithmetic mean of the design flood estimates from the aforementioned four probabilistic distributions: the two-and three-parameter LogNormal (2p and 3p), Gumbel and Gamma (2p). ...
Understanding the risks associated with the likelihood of extreme events and their respective consequences for the stability of hydraulic infrastructures is essential for flood forecasting and engineering design purposes. Accordingly, a hydrological methodology for providing reliable estimates of extreme discharge flows approaching hydraulic infrastructures was developed. It is composed of a preliminary assessment of missing data, quality and reliability for statistically assessing the frequency of flood flows, allied to parametric and non-parametric methods. Model and parameter uncertainties are accounted for by the introduced and proposed modified model averaging (modified MM) approach in the extreme hydrological event's prediction. An assessment of the parametric methods accuracy was performed by using the non-parametric Kernel Density Estimate (KDE) as a benchmark model. For demonstration and validity purposes, this methodology was applied to estimate the design floods approaching the case study ‘new Hintze Ribeiro bridge’, located in the Douro river, one of the three main rivers in Portugal, and having one of Europe's largest river flood flows. Given the obtained results, the modified MM is considered a better estimation method.
... Hydrological modeling aims at obtaining a reliable estimate of extreme discharge flows and their occurrence probabilities [35] that might occur at a given location, namely at a bridge site. To estimate such extreme discharge flows (hereinafter referred as "design floods"), statistical methods, generally referred as "flood frequency analysis", are commonly considered [36-38, and references therein]. ...
... The selection of a single best distribution function (i.e. the model selection, MS) in the field of flood frequency analysis, represents an implicit assumption that the selected model can adequately describe the frequency of observed and future floods, including the extreme events [35]. Nevertheless, any model faces uncertainty and its quantification is crucial for ensuring data quality and usability [45]. ...
... Nevertheless, any model faces uncertainty and its quantification is crucial for ensuring data quality and usability [45]. According to Okoli et al. [35], one of the possibilities to deal with model uncertainty is to use all candidate probability distributions for the estimation of the design floods, where the final estimate is an average of all individual estimates, known as model averaging [46,47]. The model averaging can be performed either by taking the arithmetic mean of the design flood estimates from the candidate probabilistic models (i.e. the arithmetic model averaging, MM) or by attributing weights to the design floods of each individual candidate probability distribution (i.e. the weighted model averaging, MA) depending on how best the probability model fits the data [48,49]. ...
The collapse of bridges inevitably leads to economical losses and may also be responsible for human fatalities. A bridge may fail due to several reasons, with local scouring around its foundation being the most common. Despite decades of scouring research, there are still many uncertainties affecting the design process of bridge piers. The most critical and least explored are the hydrological and hydraulic variables. The recent intensification of floods may also increase the vulnerability of bridges to scour effects. Therefore, the present work aims to propose a risk-based methodology for considering scour at bridge foundations. It is composed of three main steps: (i) assessing extreme hydrological events (hazards); (ii) modeling river behavior through the computation of flow characteristics and bridge scour depths; and (iii) assessing bridge scour risk by associating its scour depth to foundation depth ratio with the priority factor (vulnerability) and assigning a qualitative evaluation of the scour risk rating (level of risk). The hydrological modeling incorporates uncertainty with an averaging approach in the design floods definition. The flow characteristics are simulated with the HEC-RAS model, which also contains a scour module for bridge scour assessment. However, other empirical estimates are considered for simple and pile-supported foundations. This study ends with a qualitative assessment of how the scouring phenomenon affects bridge vulnerability and its safety. The proposed risk-based methodology - validated through a case study, the new Hintze Ribeiro bridge in Portugal - can be potentially incorporated into regular bridge inspection schedules as a useful tool for risk management measures, assisting in catastrophic events’ prevention.
... Model uncertainties can be tackled by using all candidate probability distributions for estimating design floods [13], where the final estimate is an average of all individual estimates. The model averaging can be performed either by taking the arithmetic mean of the design floods from the candidate probabilistic models (MM), or by attributing weights to the design floods of each individual candidate probability distribution (MA), depending on how best the probability model fits the data. ...
... The model averaging can be performed either by taking the arithmetic mean of the design floods from the candidate probabilistic models (MM), or by attributing weights to the design floods of each individual candidate probability distribution (MA), depending on how best the probability model fits the data. A modification of the arithmetic averaging method, presented in Okoli et al. [13] was proposed by Bento et al. [14]. Such method consists of considering only the distributions that ensure good performance on both goodness-of-fit tests and graphical methods in the definition of the design floods (hereinafter referred to as modified MM). ...
Conference Paper
Full-text available
The scouring phenomenon can pose a serious threaten to bridge serviceability and users' safety, as well. In extreme circumstances, it can lead to the bridge's structural collapse. Despite efforts to reduce the scour's unfavorable effects in the vicinity of bridge foundations, this issue remains a significant challenge. Many uncertainties affect the design process of bridge foundations, namely the associated hydrological and hydraulic parameters. Past and recent flood records strengthen bridges' vulnerability by reducing scouring estimation uncertainties. Therefore, the present study applies a semi-quantitative methodology of scour risk assessmentto a Portuguese bridge case study, accounting for those uncertainties. The risk-based methodology comprises three main steps towards the assignment of the bridge's scour risk rating. The methodology constitutes a potential key tool for risk management activities, assisting bridge's owners and managers in decision-making.
... Although it appears to be an easy and well-established task, this procedure may be associated with a high degree of uncertainty (Salinas et al. 2014a(Salinas et al. , 2014b. Several factors can affect the performance of this analysis, such as the quality of the observed data, the choice of the sampling technique (annual maximum series (AMS) or peak-over-threshold (POT)), and the selection of a suitable probability model and its parameter estimation methodology (Gaume 2018, Okoli et al. 2018. Regarding model selection, many probability distributions are used to represent hydrological extreme events, with no general agreement on the best model , Nguyen et al. 2017, Papalexiou 2018). ...
... The common practice is to select a suitable distribution according to its performance on one or more goodness-of-fit statistical tests that assess its descriptive ability (Nguyen et al. 2017, Okoli et al. 2018. This approach, however, is associated with two main issues. ...
The popular approach to select a suitable distribution to characterize extreme rainfall events relies on the assessment of its descriptive performance. This study examines an alternative approach to this task that evaluates, in addition to the descriptive performance of the models, their performance in estimating out-of-sample events (predictive performance). With a numerical experiment and a study case in São Paulo state, Brazil, we evaluated the adequacy of seven probability distributions widely used in hydrological analysis to characterize extreme events in the region and compared the selection process of both popular and altenative frameworks. The results indicate that (1) the popular approach is not capable of selecting distributions with good predictive performance and (2) combining different predictive and descriptive tests can improve the reliability of extreme event prediction. The proposed framework allowed the assessment of model suitability from a regional perspective, identifying the Generalized Extreme Value (GEV) distribution as the most adequate to characterize extreme rainfall events in the region.
... Lin and Kuo (2016) state that AMA is appropriate to use if all the candidate models have similar prediction powers. Okoli et al. (2018) in a study of estimating designs associated with flooding obtained the same performance for AMA and weighted MA when the AIC values of candidate model were almost similar. ...
... The stacking model also showed good performance, especially because the WKS data varied greatly in the different numerical ranges. Furthermore, the stacking model utilizes a more flexible and extensive selection and combination of base classifiers [15,34,35]. In conclusion, the ensemble model for the WKS prediction (i.e., the stacking model) supersedes the three other base models despite the limitations of the dataset size and the considerable differences in the spatial distribution of each sample because of geographic, economic, and political factors. ...
Full-text available
The frequent occurrence of extreme weather and the development of urbanization have led to the continuously worsening climate-related disaster losses. Socioeconomic exposure is crucial in disaster risk assessment. Social assets at risk mainly include the buildings, the machinery and the equipment, and the infrastructure. In this study, the wealth capital stock (WKS) was selected as an indicator for measuring social wealth. However, the existing WKS estimates have not been gridded accurately, thereby limiting further disaster assessment. Hence, the multisource remote sensing and the POI data were used to disaggregate the 2012 prefecture-level WKS data into 1000 m × 1000 m grids. Subsequently, ensemble models were built via the stacking method. The performance of the ensemble models was verified by evaluating and comparing the three base models with the stacking model. The stacking model attained more robust prediction results (RMSE = 0.34, R2 = 0.9025), and its prediction spatially presented a realistic asset distribution. The 1000 m × 1000 m WKS gridded data produced by this research offer a more reasonable and accurate socioeconomic exposure map compared with existing ones, thereby providing an important bibliography for disaster assessment. This study may also be adopted by the ensemble learning models in refining the spatialization of the socioeconomic data.
... In multi-model ensemble learning, outputs received from multiple classifiers are combined to improve the classification accuracy. The ensemble developed by model averaging addresses the issue of uncertainty in the choice of probability distribution functions by combining all model estimates (Okoli et al. 2018). Model averaging technique is used by several researchers to demonstrate its use in dealing with model structure uncertainty (Bodo and Unny 1976;Tung and Mays 1981a, b;Laio et al. 2009;Najafi et al. 2011;Najafi and Moradkhani 2015;Yan and Moradkhani 2016). ...
Full-text available
Avalanche forecasting is carried out using physical as well as statistical models. All these models have certain limitations associated with their mathematical formulation that enable them to perform variably with respect to forecast of an avalanche event and associated danger. To overcome limitations of each individual model, a multi-model decision support system (MM-DSS) has been developed for forecasting of avalanche danger in Chowkibal–Tangdhar (C-T) region of North-West Himalaya. The MM-DSS has been developed for two different altitude zones of the C-T region by integrating four avalanche forecasting models-Hidden Markov model (HMM), nearest neighbour (NN), artificial neural network (ANN) and snow cover model-HIM-STRAT to deliver avalanche forecast with a lead time of three days. Weather variables for these models have been predicted using ANN. Root mean square error of predicted weather variables is computed by using leave one out cross-validation method. Snow and meteorological data of 22 winters (1992–2014) of the lower C-T region and 8 winters (2008–2016) of the higher C-T region have been used to develop avalanche forecasting models for these two sub-regions. All the avalanche forecasting models have been validated by true skill score (TSS), Heidke skill score (HSS), per cent correct (PC), probability of detection (POD), bias and false alarm rate (FAR) using data of five winters (2014–19) for the lower C-T region and three winters (2016–19) for the upper C-T region. In both the C-T regions, for day-1, day-2 and day-3, the HSS of MM-DSS lies between 0.26 and 0.4 and the POD between 0.64 and 0.86.
... Liu and Kuo (2016) state that AMA is appropriate to use if all the candidate models have similar predictive power. Okoli et al. (2018), in a study estimating design floods, obtained the same performance for AMA and weighted MA when the Akaike information criterion (AIC) values of the candidate models were nearly equal. ...
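The comparison between arithmetic and AIC-weighted averaging alluded to here can be sketched as follows; the AIC values and quantile estimates below are hypothetical, chosen only to show that near-equal AIC values make the two averages almost coincide:

```python
import numpy as np

# Hypothetical AIC values for three candidate distributions (nearly equal)
aic = np.array([412.3, 413.1, 412.8])

delta = aic - aic.min()        # AIC differences relative to the best model
w = np.exp(-0.5 * delta)
w /= w.sum()                   # Akaike weights (sum to 1)

# Hypothetical 1-in-100 year quantile estimates from the three models
q_est = np.array([250.0, 262.0, 255.0])

weighted_avg = float(np.sum(w * q_est))    # AIC-weighted model average
arithmetic_avg = float(np.mean(q_est))     # simple arithmetic average (AMA)
# With near-equal AIC values, the two averages differ by less than 1 m3/s here
```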
... Among the methods discussed therein that are appropriate for probabilistic hydrological modelling are PDF combination methods. Simple PDF averaging has been exploited to some degree in hydrological contexts (see e.g., Okoli et al. 2018). ...
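The simple PDF averaging mentioned here can be sketched by mixing candidate densities with equal weights; the component distributions and parameters below are illustrative, not taken from the cited works:

```python
import numpy as np
from scipy import stats

# Two illustrative fitted candidate densities (hypothetical parameters)
components = [(stats.gumbel_r, (100, 30)),      # loc, scale
              (stats.lognorm, (0.3, 0, 105))]   # s, loc, scale
weights = np.full(len(components), 1 / len(components))  # equal weights

def averaged_pdf(x):
    # Mixture density: weighted sum of component PDFs
    return sum(w * dist.pdf(x, *params)
               for w, (dist, params) in zip(weights, components))

# The mixture is itself a valid density: it integrates to ~1
x = np.linspace(1, 400, 2000)
dx = x[1] - x[0]
mass = float(np.sum(averaged_pdf(x)) * dx)
```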
This thesis falls into the scientific areas of stochastic hydrology, hydrological modelling and hydroinformatics. It contributes new practical solutions, new methodologies and large-scale results to predictive modelling of hydrological processes, specifically to solving two interrelated technical problems, with emphasis on the latter. These problems are: (A) hydrological time series forecasting by exclusively using endogenous predictor variables (hereafter referred to simply as “hydrological time series forecasting”); and (B) stochastic process-based modelling of hydrological systems via probabilistic post-processing (hereafter referred to simply as “probabilistic hydrological post-processing”). For the investigation of these technical problems, the thesis forms and exploits a novel predictive modelling and benchmarking toolbox. This toolbox consists of: (i) approximately 6 000 hydrological time series (sourced from larger freely available datasets), (ii) over 45 ready-made automatic models and algorithms, mostly originating from the four major families of stochastic, (machine learning) regression, (machine learning) quantile regression, and conceptual process-based models, (iii) seven flexible methodologies (which, together with the ready-made automatic models and algorithms, constitute the basis of our modelling solutions), and (iv) approximately 30 predictive performance evaluation metrics. Novel model combinations coupled with different algorithmic argument choices result in numerous model variants, many of which could be perceived as new methods. All the utilized models (i.e., the ones already available in open software, as well as those automated and proposed in the context of the thesis) are flexible, computationally convenient and fast; thus, they are appropriate for large-sample (even global-scale) hydrological investigations. Such investigations are implied by the (mainly) algorithmic nature of the methodologies of the thesis.
In spite of this nature, the thesis also provides innovative theoretical supplements to its practical and methodological contribution. Technical problem (A) is examined in four stages. During the first stage, a detailed framework for assessing forecasting techniques in hydrology is introduced. Complying with the principles of forecasting and contrary to the existing hydrological (and, more generally, geophysical) time series forecasting literature (in which forecasting performance is usually assessed within case studies), the introduced framework incorporates large-scale benchmarking. The latter relies on big hydrological datasets, large-scale time series simulation by using classical stationary stochastic models, many automatic forecasting models and algorithms (including benchmarks), and many forecast quality metrics. The new framework is exploited (by utilizing part of the predictive modelling and benchmarking toolbox of the thesis) to provide large-scale results and useful insights on the comparison of stochastic and machine learning forecasting methods for the case of hydrological time series forecasting at large temporal scales (e.g., the annual and monthly ones), with emphasis on annual river discharge processes. The related investigations focus on multi-step ahead forecasting. During the second stage of the investigation of technical problem (A), the work conducted during the previous stage is expanded by exploring the one-step ahead forecasting properties of its methods, when the latter are applied to non-seasonal geophysical time series. Emphasis is put on the examination of two real-world datasets, an annual temperature dataset and an annual precipitation dataset. These datasets are examined in both their original and standardized forms to reveal the most and least accurate methods for long-run one-step ahead forecasting applications, and to provide rough benchmarks for the one-year ahead predictability of temperature and precipitation. 
The third stage of the investigation of technical problem (A) includes both the examination and quantification of the predictability of monthly temperature and monthly precipitation at the global scale, and the comparison of a large number of (mostly stochastic) automatic time series forecasting methods for monthly geophysical time series. The related investigations focus on multi-step ahead forecasting by using the largest real-world data sample ever used in hydrology for assessing the performance of time series forecasting methods. With the fourth (and last) stage of the investigation of technical problem (A), the multiple-case study research strategy is introduced, in its large-scale version, as an innovative alternative to conducting single- or few-case studies in the field of geophysical time series forecasting. To explore three sub-problems associated with hydrological time series forecasting using machine learning algorithms, an extensive multiple-case study is conducted. This multiple-case study is composed of a sufficient number of single-case studies, which exploit monthly temperature and monthly precipitation time series observed in Greece. The explored sub-problems are lagged variable selection, hyperparameter handling, and the comparison of machine learning and stochastic algorithms. Technical problem (B) is examined in three stages. During the first stage, a novel two-stage probabilistic hydrological post-processing methodology is developed by using a theoretically consistent probabilistic hydrological modelling blueprint as a starting point. The usefulness of this methodology is demonstrated by conducting toy model investigations. The same investigations also demonstrate how our understanding of the system to be modelled can guide us to achieve better predictive modelling when using the proposed methodology.
During the second stage of the investigation of technical problem (B), the probabilistic hydrological modelling methodology proposed during the previous stage is validated. The validation is made by conducting a large-scale real-world experiment at monthly timescale. In this experiment, the increased robustness of the investigated methodology with respect to the combined (by this methodology) individual predictors and, by extension, to basic two-stage post-processing methodologies is demonstrated. The ability to “harness the wisdom of the crowd” is also empirically proven. Finally, during the third stage of the investigation of technical problem (B), the thesis introduces the largest range of probabilistic hydrological post-processing methods ever introduced in a single work, and additionally conducts at daily timescale the largest benchmark experiment ever conducted in the field. Additionally, it assesses several theoretical and qualitative aspects of the examined problem and the application of the proposed algorithms to answer the following research question: Why and how to combine process-based models and machine learning quantile regression algorithms for probabilistic hydrological modelling?
... This is not a peculiarity of the examined records but a generalized statistical effect (Koutsoyiannis and Baloutsos 2000). We also applied model selection using the Akaike information criterion corrected for short sample sizes (AICc) (e.g., Burnham and Anderson 2004; Okoli et al. 2018) for the Gumbel and GEV distributions. The AICc analysis can be found in appendix B. Based on the analysis of AICc and the studies by Koutsoyiannis (2004a,b) and Koutsoyiannis and Baloutsos (2000), we used the GEV distribution for all stations (periods, durations, area sizes). ...
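A small-sample AICc comparison between Gumbel and GEV fits of the kind described above can be sketched as follows; the data are synthetic and the helper `aicc` is a hypothetical illustration, not the cited authors' code:

```python
import numpy as np
from scipy import stats

# Synthetic annual maxima drawn from a GEV parent (illustrative parameters)
rng = np.random.default_rng(1)
ams = stats.genextreme.rvs(c=-0.1, loc=50, scale=15, size=30, random_state=rng)

def aicc(dist, data):
    """AIC with the small-sample correction (Burnham and Anderson 2004)."""
    params = dist.fit(data)                      # maximum-likelihood parameters
    k = len(params)                              # number of fitted parameters
    n = len(data)
    ll = np.sum(dist.logpdf(data, *params))      # maximized log-likelihood
    aic = 2 * k - 2 * ll
    return aic + 2 * k * (k + 1) / (n - k - 1)   # correction term vanishes as n grows

scores = {"gumbel": aicc(stats.gumbel_r, ams),
          "gev": aicc(stats.genextreme, ams)}
best = min(scores, key=scores.get)               # lower AICc is preferred
```

Note that for short samples the extra shape parameter of the GEV is penalized more heavily, so the Gumbel can win the AICc comparison even when the parent is GEV.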
We estimate areal reduction factors (ARFs, the ratio of catchment rainfall and point rainfall) varying in space and time using a fixed-area method for Austria and link them to the dominating rainfall processes in the region. We particularly focus on two sub-regions in the West and East of the country, where stratiform and convective rainfall processes dominate, respectively. ARFs are estimated using a rainfall dataset of 306 rain gauges with hourly resolution for five durations between 1 hour and 1 day. Results indicate that the ARFs decay faster with area in regions of increased convective activity than in regions dominated by stratiform processes. Low ARF values occur where and when lightning activity (as a proxy for convective activity) is high, but some areas with reduced lightning activity also exhibit rather low ARFs because, in summer, convective rainfall can occur in any part of the country. ARFs tend to decrease with increasing return period, possibly because the contribution of convective rainfall is higher. The results of this study are consistent with similar studies in humid climates, and provide new insights regarding the relationship of ARFs and dominating rainfall processes.
Non-stationary extreme value analysis (NEVA) can improve the statistical representation of observed flood peak distributions compared to stationary (ST) analysis, but management of flood risk relies on predictions of out-of-sample distributions for which NEVA has not been comprehensively evaluated. In this study, we apply split-sample testing to 1,250 annual maximum discharge records in the United States and compare the predictive capabilities of NEVA relative to ST extreme value analysis using a log-Pearson Type III (LPIII) distribution. The parameters of the LPIII distribution in the ST and non-stationary (NS) models are estimated from the first half of each record using Bayesian inference. The second half of each record is reserved to evaluate the predictions under the ST and NS models. The NS model is applied for prediction by (1) extrapolating the trend of the NS model parameters throughout the evaluation period and (2) using the NS model parameter values at the end of the fitting period to predict with an updated ST model (uST). Our analysis shows that the ST predictions are preferred, overall. NS model parameter extrapolation is rarely preferred. However, if fitting period discharges are influenced by physical changes in the watershed, for example from anthropogenic activity, the uST model is strongly preferred relative to ST and NS predictions. The uST model is therefore recommended for evaluation of current flood risk in watersheds that have undergone physical changes. Supporting information includes a MATLAB program that estimates the (ST/NS/uST) LPIII parameters from annual peak discharge data through Bayesian inference.
Flood information, especially on extreme floods, is necessary for the design of any large hydraulic structure and for flood risk management. Flood mitigation also requires a comprehensive assessment of flood risk and an explicit quantification of the flood uncertainty. In the present study, we use a multimodel ensemble approach based on the Bayesian model averaging (BMA) method to account for model structure and distribution uncertainties. The usefulness of this approach is assessed by a case study over the Willamette River Basin (WRB) in the Pacific Northwest, U.S. Besides the standard log-Pearson Type III distribution, we also identified the generalized extreme value and three-parameter lognormal distributions as potential distributions in the WRB. Three different statistical models, including the Bulletin-17B quantile model, index-flood model, and spatial Bayesian hierarchical model, were considered in the study. The BMA method is then used to assign weights to the different models, where better-performing models receive higher weights. It was found that the major uncertainty in extreme flood prediction is contributed by model structure, while the choice of distribution plays a less important role in the quantification of flood uncertainty. The BMA approach provides a more robust extreme flood prediction than any single model.
What is the “best” model? The answer to this question lies in part in the eyes of the beholder; nevertheless, a good model must blend rigorous theory with redeeming qualities such as parsimony and quality of fit. Model selection is used to make inferences, via weighted averaging, from a set of K candidate models, Mk, k = 1,…,K, and helps identify which model is most supported by the observed data, Ỹ = (ỹ1, …, ỹn). Here, we introduce a new and robust estimator of the model evidence, p(Ỹ|Mk), which acts as the normalizing constant in the denominator of Bayes' theorem and provides a single quantitative measure of relative support for each hypothesis that integrates model accuracy, uncertainty and complexity. However, p(Ỹ|Mk) is analytically intractable for most practical modeling problems. Our method, coined GAussian Mixture importancE (GAME) sampling, uses bridge sampling of a mixture distribution fitted to samples of the posterior model parameter distribution derived from MCMC simulation. We benchmark the accuracy and reliability of GAME sampling by application to a diverse set of multivariate target distributions (up to 100 dimensions) with known values of p(Ỹ|Mk) and to hypothesis testing using numerical modeling of the rainfall-runoff transformation of the Leaf River watershed in Mississippi, USA. These case studies demonstrate that GAME sampling provides robust and unbiased estimates of the evidence at a relatively small computational cost, outperforming commonly used estimators. The GAME sampler is implemented in the MATLAB package of DREAM [Vrugt, 2016] and simplifies considerably scientific inquiry through hypothesis testing and model selection.
This chapter gives results from some illustrative exploration of the performance of information-theoretic criteria for model selection and methods to quantify precision when there is model selection uncertainty. The methods given in Chapter 4 are illustrated and additional insights are provided based on simulation and real data. Section 5.2 utilizes a chain binomial survival model for some Monte Carlo evaluation of unconditional sampling variance estimation, confidence intervals, and model averaging. For this simulation the generating process is known and can be of relatively high dimension. The generating model and the models used for data analysis in this chain binomial simulation are easy to understand and have no nuisance parameters. We give some comparisons of AIC versus BIC selection and use achieved confidence interval coverage as an integrating metric to judge the success of various approaches to inference.