Data-Driven Model for Estimating the Probability of

Riverine Levee Breach Due to Overtopping

Stefan Flynn, M.ASCE1; Soroush Zamanian, S.M.ASCE2; Farshid Vahedifard, F.ASCE3;

Abdollah Shafieezadeh, M.ASCE4; and David Schaaf, M.ASCE 5

Abstract: Breach due to overtopping is the most common failure mode of earthen levees. Historic records and future projections consistently

show exacerbating patterns in the frequency and severity of floods in several regions, which can increase the probability of levee overtopping.

The main objective of this study is twofold: (1) to present a comprehensive data set of levee overtopping events, and (2) to develop a data-

driven model for determining the probability of levee breach due to overtopping that can support risk assessment. For this purpose, we first

assessed available performance data to develop a refined data set of 185 riverine levee overtopping events within the portfolio of levee systems

maintained by the US Army Corps of Engineers. The data set includes several geometric, geotechnical, and hydraulic variables for each

overtopping incident. We then employed the data set along with logistic regression to develop, train and validate a model for calculating the

probability of levee breach due to overtopping. Among several variables and functional forms examined, levee construction history, over-

topping depth, overtopping duration, embankment erosion resistance, and duration of levee hydraulic loading prior to overtopping were found

to be statistically significant, thus were included in the proposed model. The model was validated through k-fold cross validation and tested

against a separate performance data set aside for validation purposes. The data set presented in this study can be used for identifying key

factors controlling overtopping behavior, validation of model results, and providing new insight into the phenomenon of levee overtopping.

The proposed model offers a practical yet robust tool for levee risk analysis that can be readily employed in practice. DOI: 10.1061/(ASCE)GT.1943-5606.0002743. © 2021 American Society of Civil Engineers.

Author keywords: Levees; Levee breach data set; Overtopping; Levee risk analysis; Probabilistic modeling; Logistic regression.

Introduction

Earthen levees are a critical component of flood risk management in

the United States. Over 160,000 km (∼100,000 mi) of levees protect

the safety and economy of flood-prone areas across the United States

(CRS 2017). The US Army Corps of Engineers (USACE) levee

safety portfolio includes over 24,000 km (∼15,000 mi) of docu-

mented levee systems as communicated by the National Levee

Database (NLD). More than 14% of USACE levee systems are

classified as having very high, high, or moderate risk (USACE

2021b). Often, high risk levees are associated with high economic

and potential life loss consequences. With over 2,000 systems

contained within the portfolio, the USACE-maintained portion

of the national levee inventory protects a population of nearly

13 million with a property value exceeding $1.3 trillion

(USACE 2021b). According to the American Society of Civil En-

gineers (ASCE) Report Card, identified damages from the recent

2019 flood event in the Midwest exceeded $20 billion with more

than 1,100 km (∼700 mi) of levee requiring repair (ASCE 2021).

Historic records and projected models consistently show exacer-

bating patterns in the frequency and severity of floods in several

regions (Villarini et al. 2011; Mallakpour and Villarini 2015;

Vahedifard et al. 2016, 2020; USACE 2018, 2021a), which can

increase the probability of levee overtopping and, subsequently,

the risk posed to population and critical economic infrastructure

existing within leveed areas. It is evident that levee performance is

of critical interest to engineers, municipalities, and policy mak-

ers alike.

Levee overtopping is a critical concern for flood risk manage-

ment systems, with breach due to overtopping being the most

common mode of levee failure (Hui et al. 2016; USACE 2018).

In this study, overtopping refers to the static riverine water level

exceeding the elevation of the levee crest. Breach occurs when a

levee gives way, allowing flood water to pass through the barrier

and inundate the leveed area (USACE 2018). In this study breach is

the preferred nomenclature, as opposed to failure, due to the fact

the levee systems are typically not designed to withstand overtop-

ping. Rather, levees are designed to withstand hydraulic loads up to

a specified design height. Failure implies that a levee has failed to

perform as designed, whereas breach implies that a levee failed at

loading beyond its designed capacity.

1Geotechnical Engineer, US Army Corps of Engineers, Rock Island

District, 1500 Rock Island Dr., Rock Island, IL 61201; formerly, Graduate

Student, Richard A. Rula School of Civil and Environmental Engineering,

Mississippi State Univ., Mississippi State, MS 39762. ORCID:

https://orcid.org/0000-0003-1241-3681. Email: Stefan.G.Flynn@usace.army.mil

2Asset Management Analyst, Hazen and Sawyer, 7700 Irvine Center

Dr., Irvine, CA 92615; formerly, Graduate Research Associate, Dept. of

Civil, Environmental, and Geodetic Engineering, Ohio State Univ.,

Columbus, OH 43210. Email: zamanian.2@osu.edu

3CEE Advisory Board Endowed Professor and Professor, Richard A.

Rula School of Civil and Environmental Engineering, Mississippi State

Univ., Mississippi State, MS 39762 (corresponding author). ORCID:

https://orcid.org/0000-0001-8883-4533. Email: farshid@cee.msstate.edu

4Lichtenstein Associate Professor, Dept. of Civil, Environmental, and

Geodetic Engineering, Ohio State Univ., Columbus, OH 43210. Email:

shafieezadeh.1@osu.edu

5Senior Structural Engineer, Risk Management Center, US Army Corps

of Engineers, Lakewood, CO 80228. Email: David.M.Schaaf@usace.army.mil

Note. This manuscript was submitted on April 14, 2021; approved on

October 22, 2021; published online on December 21, 2021. Discussion per-

iod open until May 21, 2022; separate discussions must be submitted for

individual papers. This paper is part of the Journal of Geotechnical and

Geoenvironmental Engineering, © ASCE, ISSN 1090-0241.

© ASCE 04021193-1 J. Geotech. Geoenviron. Eng.

J. Geotech. Geoenviron. Eng., 2022, 148(3): 04021193

Downloaded from ascelibrary.org by Mississippi State Univ Lib on 12/22/21. Copyright ASCE. For personal use only; all rights reserved.

When a levee gives way after being overtopped, the flow

through the breach can be substantial, as a river attempts to drain

through the now-available opening. For a levee to breach due to

overtopping, sustained flow over the embankment must occur long

enough to initiate erosion. Once levee erosion has initiated, the con-

tinuation of erosion unravels the embankment, leading to signifi-

cant material loss, resulting in a breach. However, a levee can be

overtopped without breaching. A levee breach may be avoided if

the embankment is resilient enough to substantially resist erosion

caused by overtopping flows. A non-breach overtopping event

occurs when the levee does not experience significant section loss.

Fig. 1 shows examples of levee overtopping leading to breach

[Fig. 1(a)] and non-breach [Fig. 1(b)], respectively. When breach

does not occur, the consequences related to life safety and eco-

nomic loss are typically reduced.

Logistic regression models, or logit models, have been widely

used for data analysis in which the outcome variable is binary or

dichotomous and follows a Bernoulli distribution, such as is the

case when considering either breach or non-breach. Several studies

have employed logistic regression to assess a wide array of geo-

technical failure mechanisms, such as soil liquefaction and slope

instability (Das et al. 2010; Gandomi et al. 2013; Zhang and Goh

2013; Vahedifard et al. 2017). Logistic regression has also been

deployed to evaluate the performance of levees (Uno et al. 1987;

Flor et al. 2010; Heyer et al. 2010; Danka and Zhang 2015), including

considerations of both the rate of breach and overall breach

propagation. A summary of previous levee performance logit mod-

els is well documented by Heyer and Stamm (2013), in an effort

that considers the benefits and limitations of the use of logit models

in assessing levee failure.
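
As a minimal sketch of the logit form described above, the following Python snippet maps a linear combination of predictors to a probability between 0 and 1. The function name, the intercept, and the coefficient values are hypothetical and chosen only for illustration; they are not the fitted values of the model proposed in this study.

```python
import math


def breach_probability(intercept, coefficients, features):
    """Logistic (logit) model: P = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have equal length")
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and inputs, for illustration only.
p = breach_probability(intercept=0.0, coefficients=[0.5, -0.25], features=[1.0, 2.0])
print(p)  # the linear predictor is 0.5 - 0.5 = 0, so the probability is 0.5
```

A positive coefficient raises the estimated breach probability as its variable increases, and a negative coefficient lowers it.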

As the quantity and quality of levee performance data increase

(e.g., due to advances in remote sensing technologies), so too should

the quantity and quality of models that assist in estimating the risk

of levee breach due to overtopping. Working towards a combined

levee performance data repository is a worthy goal and has been

proposed by many with the intent to increase understanding of

the phenomenon of levee breach due to overtopping. Özer et al.

(2020) provide just one example of such an effort, by introducing

the International Levee Performance Database (ILPD). This work

seeks to create a global data source for levee breach analysis.

Expanding upon the information held within this database and

others like it, such as the NLD, is just one step towards creating

reliable levee performance models both for overtopping breach

and many other failure modes in support of risk assessment.

The primary objective of this study is to develop a data-driven

model using logistic regression for estimating the probability of

levee breach due to overtopping for riverine levees. The proposed

model is best used to evaluate individual cross sections, each of which is a

discretization of the overall levee system. Additionally, a comprehensive

data set documenting riverine levee overtopping event performance

is presented. Within the data set, and in this study, the

term event refers to any documented case of overtopping such that

the event could result in either breach or non-breach. This study

focuses on riverine levee overtopping events due to the fact that the

physics of dynamic (i.e. coastal) overtopping are materially differ-

ent and should be considered accordingly. The proposed model al-

lows for the calculation of breach probability based on independent

variables. The model is validated using k-fold cross validation and

set aside test data, with the goal of becoming a supplementary tool

for levee risk assessment.
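
The k-fold procedure mentioned above can be sketched as follows, assuming a simple shuffled split of the 185 events into k folds. The value k = 5, the random seed, and the function name are assumptions made for illustration; this section of the paper does not state them.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 185 events as in the data set; k = 5 is an assumed value.
for train, test in k_fold_indices(n_samples=185, k=5):
    print(len(train), len(test))
```

Each event appears in exactly one test fold, so every record is used once for validation and k - 1 times for training.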

Factors Affecting Levee Breach Due to Overtopping

Understanding and predicting the occurrence of levee breach due to

levee overtopping requires an understanding of geometric, geotech-

nical, and hydraulic parameters. Geometric factors such as levee

height, width, and slope steepness inherently affect the initiation and

progression of embankment erosion when overtopped. Geometric

factors can be influenced by spatial availability, which refers to

Fig. 1. Levee overtopping leading to (a) breach; and (b) non-breach. (Images courtesy of USACE Rock Island District Digital Photo Library.)

the area needed to construct a levee. Spatial availability may be

constrained by factors such as floodway impacts, real estate rights,

environmental, and/or cultural impacts. Geotechnical properties in-

cluding, but not limited to, particle size, density, and shear strength,

play a critical role in a levee’s ability to withstand surface erosion.

These properties have been shown to have a significant impact on

the breach potential of an overtopped levee (Briaud et al. 2008).

Hydraulic factors such as overtopping depth, duration, and velocity

are critical in determining forces applied to a levee during overtop-

ping events (Isola et al. 2020).

The selection of geometry is often at least partially dependent

upon geotechnical properties. Factors such as the shear strength of

embankment material impact slope stability, thus defining the

allowable steepness of slopes. Erosion resistance properties vary

significantly by soil type and are generally considered when estab-

lishing the cross-sectional dimensions of a levee embankment using

modern standards of care. However, many levees in the United

States were built long before the utilization of geotechnical design

standards was standard practice, which is why it is important to

consider levee design and construction history in assessing

performance.

Geometric constraints placed on levee design by hydraulic con-

ditions may have the most significant impacts, as these constraints

typically control the height of the levee. Hydraulic parameters are

also significant in that they are constantly evolving as more data is

collected and statistically updated. Thus, hydraulic impacts must be

continuously updated to assess the ability of levees to withstand the

most up-to-date maximum projected flood.

With a proper knowledge of how these parameters affect levee

performance, in conjunction with known historical design and con-

struction conditions, this information can be used to assist in under-

standing levee overtopping performance. Employing statistical

models can be valuable in predicting the probability of levee breach

as a function of controlling geometric, geotechnical, and hydraulic

parameters. Where used to estimate system reliability, statistical

models have become an integral component in developing risk

assessment frameworks for various structures and infrastructure

systems, due to advances in computational efficiency and robust

predictive capability (Rahimi et al. 2020; Zamanian et al. 2020;

Dehghani et al. 2021). This progress continues to expand the tool-

box of the practicing engineer and improve the profession’s ability

to perform data-driven analyses that inform decision making when

dealing with large scale performance data.

Levee Risk Assessment

The comprehensive assessment of risk related to levees requires the

consideration of hazard, performance, and consequences as three

components to overall risk formulation (USACE 2018). For levees,

a hazard refers to an environmental load on the levee system, such

as flood or earthquake conditions. Performance refers to the resil-

iency, or reliability, of the levee when a hazard is presented.

Finally, consequences are the potential losses in terms of population

and economic assets that result from hazards affecting levee

performance.

Levee risk assessments are commonly conducted utilizing a

method referred to as semi-quantitative risk assessment (SQRA),

which is a form of risk assessment that utilizes both numerical es-

timates and qualitative descriptions to yield an order of magnitude

risk estimate (USACE 2019). When conducting a levee SQRA, a

primary task is to establish assumed potential failure modes. Each

potential failure mode deemed significant is assessed through the

event tree method, which considers the successive probabilities of

event nodes that must occur sequentially to lead to failure.

The initiating node on the event tree is the hydrologic or seismic

event that presents a hazard to the levee. For hydrologic loading

scenarios, this first node is represented as the annual exceedance

probability (AEP) of the hydraulic load on the levee. The AEP

establishes the anchoring point for the risk associated with a given

failure mode, as it is a quantitatively determined probability of

flood frequency. Subsequent nodes on the event tree, which con-

sider the progression of the failure mode, are then assigned sub-

jective probabilities based on available analysis, circumstantial

evidence, and engineering judgement (O’Leary 2018). The prod-

uct of all nodes on an event tree is equal to the estimated annual

probability of failure (APF), which is represented in the SQRA

process as an order of magnitude estimate. Presenting the assessed

risk as an order of magnitude estimate allows for the consideration

of uncertainty in the elicitation of nodal probability values. As

described, the annual probability of failure can be determined

as follows:

$$\mathrm{APF} = \mathrm{AEP} \times \prod_{i=1}^{n} P(i) \qquad (1)$$

where $P(i)$ = probability of individual node $i$ occurring.
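
As a worked illustration of Eq. (1), the sketch below multiplies an annual exceedance probability by the probabilities of the subsequent event-tree nodes. All numerical values and the function name are hypothetical and serve only to show the arithmetic.

```python
import math


def annual_probability_of_failure(aep, node_probabilities):
    """APF = AEP multiplied by the product of the subsequent node probabilities."""
    return aep * math.prod(node_probabilities)


# Hypothetical inputs: a 1% AEP flood and three subsequent nodes
# (erosion initiates, erosion progresses, breach occurs).
apf = annual_probability_of_failure(aep=0.01, node_probabilities=[0.5, 0.4, 0.5])
order_of_magnitude = round(math.log10(apf))
print(apf, order_of_magnitude)  # approximately 0.001, reported as order 10^-3
```

Reporting only the exponent mirrors the SQRA practice of presenting risk as an order of magnitude estimate.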

When calculating the annual probability of failure of levee

breach due to overtopping, the nodes on the event tree generally

follow the sequence of (1) a hydrologic event occurs which leads

to a given depth of overtopping, (2) erosion of the embankment

initiates, (3) embankment erosion progresses beyond a critical

state, and (4) widespread levee breach occurs. This process, which

considers the levee hazard and performance, is demonstrated in

Fig. 2.

Levee Overtopping Data Set

The data set presented in this study is a subset of the Levee Loading

and Incident Database (LLID), which consists of a collection of

both quantitative and qualitative information that documents past

performance of USACE levees under flood loading (Flynn et al.

2021). The database considers a wide range of flood risk manage-

ment system components, including levees, floodwalls, pump sta-

tions and closure structures. The LLID takes a risk-based approach

to the organization of data, with the data subdivided into categories

based not only on structure type, but also distress mechanisms,

Fig. 2. Levee overtopping breach development.

including, but not limited to, overtopping, internal erosion, and sta-

bility. Levee performance data included within the database com-

prises event information ranging from 1948 to 2019.

The compilation of this data was initiated concurrent with the

formalization of the USACE Levee Safety Program in 2006, fol-

lowing the widespread levee breach events that occurred during

the Hurricane Katrina disaster. As a result of both past and on-going

research and data synthesis, the LLID was created by a team of risk

experts using project design, construction, inspection, and flood

response records, with the initial goal of creating a simplified risk

screening tool. Additionally, data compilation and assessment had

the goal of establishing base failure rates from historic perfor-

mance data.

At the time of this study, the LLID contained information on 214

riverine levee overtopping events which occurred across 124 levee

systems. Of the more than 30 variables included to describe the full

range of data in the database, only physically representative data

were used in this study. The levees considered in this study vary

significantly both compositionally and in terms of how they are

loaded hydraulically. Levees range from agricultural dikes that

have been in place for decades to federally authorized and designed

systems protecting large populations at risk. Flood sources, ranging

from minor tributaries to larger rivers, are captured in the data and

represent a wide range of loading, both in terms of magnitude and

duration. Data for overtopping events are a combination of both

measurement and correlation. Strong emphasis is placed on the fact

that a significant portion of this data is based on human estimation

and recollection, which leads to data being reported in terms of

range, rather than as an exact value in many cases.

The levee overtopping data used in this study include levee sys-

tems over a wide geographical range in the continental United

States, ranging from New York to Washington State, east to west,

and northern Minnesota to Central Texas, north to south. The total

combined length of the levee systems included in this study is ap-

proximately 1,350 km (839 mi), or about 6% of the NLD inventory

in terms of total levee system length (USACE 2021b). The levees

evaluated in this study represent the range of unique factors that

differentiate flood risk management systems in the United States,

reflective of the overall LLID.

Of the 214 levee overtopping events contained within the LLID

overtopping data set, 185 were selected for use in developing a

logistic regression model (see Supplemental Materials for the com-

plete data set used in this study).

Of the 185 selected overtopping events used for the model, 119

events resulted in breach and 66 resulted in non-breach. In the initial

screening of data, event records with significant gaps were excluded

because these data could not be relied upon to provide valuable in-

formation, which led to the use and presentation of only 185 of the

214 known events. Once the final data set for model development was

established, the data were assigned to variables, extrapolated to con-

sider implicit, cumulative effects, and missing data were addressed.

Management of the data is further described in following sections.
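
The counts reported above can be checked with a short calculation. The snippet below derives the number of excluded records and the raw breach fraction in the selected data set. This fraction is only a descriptive statistic of the sample, not an output of the proposed model, and the variable names are illustrative.

```python
# Counts reported in the text.
n_llid_events = 214   # riverine overtopping events in the LLID
n_selected = 185      # events retained for model development
n_breach = 119
n_non_breach = 66

# The breach and non-breach counts should account for every selected event.
assert n_breach + n_non_breach == n_selected

n_excluded = n_llid_events - n_selected
breach_fraction = n_breach / n_selected

print(n_excluded)                 # 29
print(f"{breach_fraction:.3f}")   # 0.643
```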

Selection of Model Variables

While the LLID considers a wide range of parameters and situa-

tional conditions to describe each overtopping event, only the var-

iables that are intuitively deemed to have a physical impact on the

performance of levee systems were selected for this study. In total,

seven variables were considered for model inclusion based on a

review of database metrics. The variables assessed in this study in-

clude those relating to the physical composition of the levee in

terms of geometry, material type, and consideration of construction

quality. Additionally, external loading on the levee was considered

in the form of hydraulic loading of the levee both prior to and dur-

ing overtopping. In summary, these factors are limited in scope to

geometrical, hydraulic, and geotechnical categorization. The con-

sideration of the number of input variables in this study is driven by

the desire to create a screening level approach that informs risk by

utilizing readily available data that can be categorized to fit a user-

friendly model. In addition to numerical variables, some variables

need to be estimated or ascertained by grouping them into ranges.

These data are considered categorical, which are data whose inputs

are grouped into levels such that the model does not require a pre-

cise numerical value. Categorical variables are also appropriate for

descriptive variables that are non-quantitative, and address the

qualitative information contained within the study data set. A sum-

mary of model variables and the associated categorical levels is pro-

vided in Table 1, with a representation of variables in Fig. 3. See

Supplemental Materials for the complete list of variables for all 185

overtopping events used as a basis for this study.

Table 1. Summary of model variables

Code  Variable                                        Type         Level code  Level description
X1    Levee height                                    Numerical    -           m
X2    Landside slope geometry                         Categorical  1           <3H:1V
                                                                   2           ≥3H:1V
X3    Levee construction entity                       Categorical  1           Local
                                                                   2           Federal
X4    Water depth over levee                          Categorical  1           <0.152 m (<0.5 ft)
                                                                   2           0.152–0.305 m (0.5–1 ft)
                                                                   3           >0.305 m (>1 ft)
X5    Duration of overtopping flow prior to breach    Categorical  1           <6 h
                                                                   2           6–24 h
                                                                   3           >24 h
X6    Erosion resistance classification               Categorical  1           Low
                                                                   2           Moderate
                                                                   3           High
X7    Duration of levee loading prior to overtopping  Categorical  1           <3 days
                                                                   2           3–14 days
                                                                   3           >14 days
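
The level boundaries in Table 1 can be applied to raw measurements with a short helper, sketched below. The function names and the treatment of values that fall exactly on a boundary (assigned to the middle level) are assumptions made for illustration. X3 and X6 are omitted because they are assigned from qualitative descriptions rather than from numerical thresholds.

```python
def three_level(value, lower, upper):
    """Return 1 if value < lower, 2 if lower <= value <= upper, otherwise 3."""
    if value < lower:
        return 1
    if value <= upper:
        return 2
    return 3


def encode_event(slope_h_per_v, depth_m, overtop_hours, loading_days):
    """Map raw measurements to the categorical levels listed in Table 1."""
    return {
        "X2": 1 if slope_h_per_v < 3.0 else 2,        # landside slope, H per 1V
        "X4": three_level(depth_m, 0.152, 0.305),     # overtopping depth, m
        "X5": three_level(overtop_hours, 6.0, 24.0),  # overtopping duration, h
        "X7": three_level(loading_days, 3.0, 14.0),   # prior loading, days
    }


# Hypothetical event: 2.5H:1V slope, 0.2 m depth, 30 h overtopping, 2 days loading.
example = encode_event(slope_h_per_v=2.5, depth_m=0.2, overtop_hours=30.0, loading_days=2.0)
print(example)  # {'X2': 1, 'X4': 2, 'X5': 3, 'X7': 1}
```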

Levee Height (X1)

Levee height is a critical component of levee geometry that is often

considered in overtopping analyses and associated risk. Levee

height as a contributing metric has been included by others when

considering levee performance models (Gui et al. 1998; Heyer et al.

2010; Danka and Zhang 2015). Levee height in this study refers to

the average loading height of the overtopped embankment, which is

measured vertically from the riverside toe elevation to the levee

crest elevation. Levee height data is incorporated into analysis

as a continuous numerical variable, using the exact number pre-

sented in the LLID. The loading height for all levees ranges from

0.91 m (3 ft) to 7.32 m (24 ft).

Landside Slope Geometry (X2)

Landside slope geometry refers to the steepness of the landside

levee slope. The landside slope steepness for all levees ranges from

1H:1V to 5H:1V. Slope geometry was considered as a categorical

variable with two levels. Level 1 corresponds to a landside slope

less than 3H:1V, and Level 2 corresponds to a landside slope

greater than or equal to 3H:1V. The delineation between the two

levels of X2 was selected because levees of modern construction

are typically specified to have a minimum slope of

3H:1V to allow for ease of maintenance (USACE 2000). By using

this logic, the authors acknowledge that a partial relationship may

inherently exist between X2 and X3 (Levee Construction Classification),

which will be described in the following section.

Levee Construction Classification (X3)

Levee systems are also categorized by quality of construction and

maintenance associated with the levee embankments. This cat-

egorical variable is classified into two levels. Level 1 includes

“locally constructed/maintained and re-classified federal levees”

and Level 2 includes “federally constructed/improved levees.”

The differences in these two designations consider construction

authorization, quality of original design and construction, available

data, and observed maintenance actions. A “re-classified” federal

levee is one which has known design, construction, or historical

maintenance deficiencies. Although levee construction classification

is easily applied to the data set described within this study

of USACE portfolio levee systems, the categorization can

also be applied to any levee with a known construction history.

The difference in level of care should be considered when classi-

fying levees from different data sources. Level 1 should be used in

any case where a high level of quality control was not undertaken

during the design and construction process.

Overtopping Depth (X4)

Overtopping depth refers to the height of water over the levee while

being overtopped. This variable may also be referred to as “overflow

depth” in related literature; however, overtopping is the preferred

nomenclature to align with USACE common terminology.

Additionally, the dynamics of wave loading are not implicitly con-

sidered. The hydraulic load of overtopping depth has been widely

considered in both numerical (Sharp and McAnally 2012) and stat-

istical models for different types of flood loading, including canal

loading (Lendering et al. 2018), riverine loading (Amabile et al.

2016; Isola et al. 2020), and compound riverine and coastal loading

(Jasim et al. 2020). Data for this variable is based on either physical

measurement or nearby stream gauge data to approximate depth at

the location of the breach. Overtopping depth is a categorical var-

iable in this model with three levels. Level 1 includes overtopping

depths less than 0.152 m (0.5 ft), Level 2 represents overtopping

depths between 0.152 m and 0.305 m (0.5 ft and 1.0 ft), and Level 3

denotes overtopping depths greater than 0.305 m (1 ft). These three

levels can be considered as “minor,” “moderate,” and “major” overtopping,

respectively. Assigning categorical levels for this data was

based on an assessment of relative base data distribution.

It was noted during initial review of the data set that a large

percentage of overtopping events, both breach and non-breach,

had associated overtopping depths of less than 0.305 m (1 ft). The

physical relevance of this observation is that hydraulic load leading

to overtopping is often constrained by other systemic factors that

restrict the ability of the river to rise significantly above the crest of

the levee. Some of these factors might be nearby diversion struc-

tures, lower levels of protection across the flood source channel, or

breach at the site of interest. It is critical to note that overtopping

depth in this data set is recorded for both breach and non-breach

events. The depth considered for breach events assumes that the

levee breached in the range represented by the categorical level,

and therefore was not able to increase in height after breach.

Overtopping Duration (X5)

While overtopping duration takes on two distinct meanings within

the data set, it can be treated as a single variable. When overtopping

leads to breach, the overtopping duration is a measure of the du-

ration that the levee crest elevation is exceeded until the breach

occurs. For events where overtopping does not result in breach,

the overtopping duration is simply the measure of time between

the levee crest elevation being exceeded by flood water and the

flood water returning to an elevation below the levee crest. In either

scenario, the duration represents the total load on the levee, so the

variable is treated proportionally for each case. Data for this var-

iable is approximated based on either physical measurement or

the use of stream gauge data. Overtopping duration is a categorical

variable in this model with three levels. Level 1 corresponds to

overtopping that occurred for less than 6 h, Level 2 considers over-

topping that took place for 6 to 24 h, and Level 3 corresponds to

overtopping events that occurred for more than 24 h.

In terms of physical relevance, Level 1 overtopping duration can

be correlated to flashy, or short-term flood events. Level 2 corre-

sponds to moderate term flood events, or those typically experi-

enced on riverine levees. Level 3 is the long-term case in which

a levee is overtopped for more than 1 day prior to breaching or

does not breach at all.

Erosion Resistance Classification (X6)

Each levee embankment used in this study is categorized by ero-

sion resistance, which is determined by the material descriptions

Fig. 3. Representation of model variables.

contained within the LLID. Erosion resistance refers to the gen-

eral ability of the levee to resist degradation when subject to over-

topping load. Surface erosion is a field of study of its own with

much work having been done to understand how various material

combinations resist hydraulic stresses. Erosion specific to levees

has been studied by many, who have looked at contributing factors

such as soil type, shear stress from the hydraulic load, and soil

shear strength (Briaud et al. 2008; Kamalzare et al. 2013; Ellithy

et al. 2017; Osouli and Safarian Bahri 2018; Sills et al. 2008).

Additionally, many of the previously referenced levee perfor-

mance logistic regression models considered erosion resistance

as an input variable (Flor et al. 2010; Heyer et al. 2010; Danka

and Zhang 2015).

Erosion resistance classification is a categorical variable in this

model, with Levels 1, 2, and 3 defined as having “low,” “moderate,”

and “high” relative erosion resistance, respectively. Material types

range from clays to sand and gravel mixes, where gravel particle

size is generally assumed to be fine to medium. Erosion resistance

classification in this study does not consider any landside slope armoring or the effects of vegetative cover. The erosion classifications within the database are general: they consider both the cohesiveness of the soil and its assumed particle size, based on general material descriptions. These limitations led to the material descriptions provided in Table 2, which are based on a general review of surface and embankment erosion literature and data provided to the authors. The logic used is based on a general trend, and materials encountered within levee embankments that are not listed will need to be considered accordingly.

Duration of Levee Loading Prior to Overtopping (X7)

Duration of levee loading prior to overtopping refers to the length

of time the embankment was subjected to hydraulic load above the

riverside toe prior to overtopping. This variable is used as a general

representation of the effect of saturation of the levee prior to over-

topping. Physically, when the saturation of the levee increases,

porewater pressures within the levee increase and the overall

strength and stability of the levee decreases. Given how the differ-

ing material types and varying levee geometries included in this

study vary physically, this factor does not behave in a linear manner

from a statistical perspective.

Levee loading duration is a categorical variable in this model,

divided into three levels. Level 1 corresponds to flood water exceeding the riverside toe for less than 3 days prior to overtopping; Level 2 duration is 3 to 14 days; and Level 3 duration represents a duration

greater than 14 days. The physical relationship between these levels

corresponds to the general hydrograph of a given river during flood

loading. Level 1 can be considered as flashy loading; Level 2 cor-

responds to a moderately rising river; and Level 3 corresponds to a

slow rising river. Like the overtopping duration (X5), the duration

of levee loading data is approximated based on either physical

measurement or using stream gauge data to determine river levels

at the breach area of interest.

Data Cleaning and Processing

Cumulative Effects of Hydraulic Variables

As previously discussed, the data were selected and evaluated

based on a representation of physical processes. Given this consid-

eration, the cumulative effects of each variable also needed to be

incorporated into the model. Most notably, variables related to

hydraulic loading (X4, X5, and X7) required the consideration of

cumulative effects of intermediate loading given breach or non-

breach results. To account for cumulative effects, the data was man-

ually extrapolated such that if a levee breached at a low level of

loading, it was assumed to fail at the higher levels of loading if

all other variables remain constant. Conversely, if the levee did

not breach at the highest level, it was assumed not to fail at lower

levels if all other variables remain constant. By considering the

implicit loading steps leading up to breach, or non-breach, the

185 events were extrapolated to consider 581 events that were used

for model generation. This is a critical assumption in the model that

leads to a controlled approach in estimating physical factors. All

other variables not directly related to hydraulic loading were con-

sidered independent of cumulative or ordinal effects.
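The extrapolation rule described above can be sketched in a few lines. This is a minimal illustration for a single hydraulic variable, not the authors' procedure: the `OvertoppingEvent` class, the `extrapolate` function, and the restriction to one variable at a time are assumptions made here, and the paper does not state how combinations of X4, X5, and X7 were expanded to reach 581 events.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class OvertoppingEvent:
    """Hypothetical record holding only the hydraulic levels and the outcome."""
    x4: int        # overtopping depth level (1-3)
    x5: int        # overtopping duration level (1-3)
    x7: int        # prior loading duration level (1-3)
    breach: bool   # observed outcome


def extrapolate(event: OvertoppingEvent, variable: str,
                levels: Sequence[int] = (1, 2, 3)) -> List[OvertoppingEvent]:
    """Return the observed event plus the events it implies for one variable.

    A breach at a given level implies breach at every higher level;
    a non-breach implies non-breach at every lower level.
    All other variables are held constant.
    """
    observed = getattr(event, variable)
    if event.breach:
        implied = [lv for lv in levels if lv > observed]
    else:
        implied = [lv for lv in levels if lv < observed]
    return [event] + [replace(event, **{variable: lv}) for lv in implied]


# A breach at the lowest depth level implies breach at Levels 2 and 3.
low_load_breach = OvertoppingEvent(x4=1, x5=2, x7=2, breach=True)
for ev in extrapolate(low_load_breach, "x4"):
    print(ev)
```

Because the number of implied events depends on the observed level and outcome, some events expand more than others, which is consistent with the non-uniform extrapolation noted in the text.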

It should be emphasized that data for at least one variable was

not recorded for 104 of the base 185 levee overtopping events.

Variables with missing information include overtopping depth

(X4), overtopping duration (X5), and duration of levee loading

(X7), which are the same variables considered for cumulative ef-

fects. Of the base 185 events, 30.8% of X4, 49.7% of X5, and 5.4%

of X7 data were missing. Once the data were extrapolated, 10.2% of X4 and 2.4% of X7 were missing measurements in the data set. The variable with the most missing data remained X5 (overtopping duration), with 24.1% of the data set missing information on this variable after extrapolation. The missing data at each step of the data extrapolation process are summarized in Table 3.

Data extrapolation to account for cumulative hydraulic effects had

a minimal effect on the base breach rate of the entire data set. Prior

to data extrapolation, breach events accounted for 63.8% of all

events. After extrapolation, 59.7% of the data set events resulted

in breach. This is because extrapolation of events was not uniform.

Some events were extrapolated to account for more scenarios than

others, given precedent conditions.

Data Imputation

To account for missing data, the k-nearest neighbor (kNN) imputation algorithm in version 6.1.1 of the VIM package for R was used. The kNN imputation process allows for the replacement of missing data with categorical levels based on comparison of similar events in the data set using pattern recognition (Beretta and Santaniello 2016). In this study, kNN imputation uses an extension of the Gower distance (Gower 1971), the most popular distance for mixed-type variables, which handles binary, categorical, ordered, numerical, and semi-numerical variables (Kowarik and Templ 2016; D’Orazio 2021).

Table 2. Material description for erosion resistance classification

| Material description | Erosion resistance classification |
|---|---|
| Sand, silty sand, silty sand with gravel, sand/silt mix, sand/gravel mix, sand/gravel mix with silt, sandy silt, sandy gravel | Low |
| Silt, clayey silt, silt with sand/clay, silt/clay mix with sand, silty loam, silty/clayey loam, sand/silt mix with clay, clayey sand | Moderate |
| Clay, clay/silt mix, clay with sand/silt, zoned embankment with impervious cover, clay enlargement of an existing sand levee | High |

Table 3. Summary of missing data before and after extrapolation

| Level | Portion of missing data: X4 | X5 | X7 |
|---|---|---|---|
| Before extrapolation | 30.8% | 49.7% | 5.4% |
| After extrapolation | 10.2% | 24.1% | 2.3% |

The kNN is derived for a mix of numerical and categorical var-

iables, and the distance between the ith and jth observations is the

weighted mean of the contributions of each variable. The weight

represents the importance of the variable and is selected based

on the relative importance of the variables (Kowarik and Templ

2016). The distance between the ith and jth observations can be

determined as follows:

$$
d_{i,j} = \frac{\sum_{m=1}^{p} w_m \, \delta_{i,j,m}}{\sum_{m=1}^{p} w_m} \tag{2}
$$

where $w_m$ denotes the weight and $\delta_{i,j,m}$ represents the contribution of the mth variable. The shortcomings of this method are predominantly centered on the effects of imputation on data distribution and representation, where it is recognized that data are being generated based on statistical interpolation.
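The weighted distance in Eq. (2) can be sketched as follows. This is an illustration of the classic Gower form only, under stated assumptions: numeric contributions are taken as the absolute difference divided by the variable's range, categorical contributions as 0 for a match and 1 otherwise, and variables missing in either event are skipped. The function name, argument layout, and example values are hypothetical, and the extension implemented in the VIM package treats ordered and semi-numerical variables in ways not reproduced here.

```python
from typing import Optional, Sequence


def gower_distance(a: Sequence, b: Sequence, kinds: Sequence[str],
                   ranges: Sequence[Optional[float]],
                   weights: Sequence[float]) -> float:
    """Weighted mean of per-variable contributions, as in Eq. (2)."""
    numerator = 0.0
    denominator = 0.0
    for m, kind in enumerate(kinds):
        if a[m] is None or b[m] is None:
            continue  # assumption: a missing value contributes no weight
        if kind == "numeric":
            delta = abs(a[m] - b[m]) / ranges[m]   # scaled to [0, 1]
        else:
            delta = 0.0 if a[m] == b[m] else 1.0   # categorical mismatch
        numerator += weights[m] * delta
        denominator += weights[m]
    if denominator == 0.0:
        raise ValueError("no comparable variables between the two events")
    return numerator / denominator


# Hypothetical events: (numeric value, categorical level, categorical level)
event_i = (2.7, 1, 3)
event_j = (3.7, 1, 2)
kinds = ("numeric", "categorical", "categorical")
ranges = (10.0, None, None)      # assumed range of the numeric variable
weights = (1.0, 1.0, 1.0)        # equal importance, for illustration only

print(gower_distance(event_i, event_j, kinds, ranges, weights))
```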

A simplified visual representation of k-value selection for im-

putation in this study is shown in Fig. 4. Random data points are

shown based on training data for kNN imputation, with example

kNN groupings. Each data point represents the tendency of the im-

putation process to either assign a missing categorical variable to

Level 1, Level 2, or Level 3. In this scenario, an X1 value of approximately 2.7 is known for an event, but X7 data is missing. When evaluating the scenario with the three nearest neighbors, a Level 1 response for X7 is predicted for the missing data because

the Level 1 data are more prevalent than the Level 2 and Level 3

data. Following the same logic, if the eight nearest neighbors are

used, which includes the previously assessed three nearest neigh-

bors, then the missing categorical variable is imputed as Level 3.
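The dependence of the imputed level on k can be illustrated with a simple majority vote. The neighbor list below is invented to mimic the scenario described for Fig. 4 and is not taken from the data set; `knn_impute_level` is a hypothetical helper, and ties are resolved here by the order in which levels are encountered, which may differ from the behavior of the software used in the study.

```python
from collections import Counter
from typing import List, Tuple


def knn_impute_level(neighbors: List[Tuple[float, int]], k: int) -> int:
    """Impute a categorical level as the most common level among the k nearest events.

    neighbors: (distance to the event with missing data, known level) pairs.
    """
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    counts = Counter(level for _, level in nearest)
    return counts.most_common(1)[0][0]


# Hypothetical neighbors of an event with missing X7, ordered by distance.
neighbors = [
    (0.05, 1), (0.08, 1), (0.10, 2),                # three nearest: mostly Level 1
    (0.15, 3), (0.18, 3), (0.20, 2), (0.22, 3), (0.25, 3),
]

print(knn_impute_level(neighbors, k=3))  # Level 1 dominates the three nearest
print(knn_impute_level(neighbors, k=8))  # Level 3 dominates the eight nearest
```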

The process of k-value selection is iterative in assessing how

many neighbors need to be considered to minimize the distortion

of the data set while maintaining correlation between observed re-

sults and model prediction. The accuracy metrics considered were

error rates when validating the data using (1) k-fold cross valida-

tion, and (2) test data using a random set of real data that were

excluded from the training set. Both accuracy metrics will be de-

scribed in further detail in subsequent sections. As shown in Fig. 5,

volatility is high for low k-values and attenuates as the k-value

increases.

In addition to assessing the model error convergence, a probabi-

listic distribution of imputed variables needed to be simultaneously

considered, such that the physics of the model relationship re-

mained unchanged, and data were not grossly misrepresented.

As k-values were increased beyond eight within the model, probabi-

listic distribution smoothing to the mean was generally observed,

which led to greater deviation from the original data. Using a

k-value of eight, the imputation of X4 (overtopping depth) and X7 (duration of levee loading prior to overtopping) had maximum probability changes of 3.8% and 2.0%, respectively, with each having an increased probability of the data being categorized as Level 2.

Imputation had the greatest effect on X5, due to this variable having

the most missing data, with the maximum probability change being

8.5%, as shown in Fig. 6. A summary of changes in probability

distribution is shown in Table 4.

The result of imputation of missing X5 data was to force more of the data to Level 2, given three levels, which in this instance is preferable because many of the missing data were correlated to larger river events, i.e., the Mississippi and Missouri Rivers. Level 2 of X5 correlates to moderate duration overtopping (6–24 h). This result aligns well with the data when comparing to similar events; therefore, it is not considered to be a gross misrepresentation of data or trend.

Fig. 4. Two-dimensional kNN imputation visualization.

Fig. 5. kNN error rate sensitivity.

Fig. 6. Probability change due to imputation with k = 8 for (a) X4; (b) X5; and (c) X7.

While changes in the probabilistic distribution of the original

data were less when selecting k-values less than eight, the offset in

error rate was not seen to justify the minor improvements in dis-

tribution changes. For these reasons, a k-value of eight was se-

lected for this data set in support of model creation. With these

checks, it is ensured that the applied data imputation does not alter

the key characteristics of the levee data. Finally, Fig. 7 presents a

graphical representation of all 581 events, which account for

extrapolation and imputation of the base data, used to create

the proposed model.

Development of the Logistic Regression Model

Logistic regression is a statistical method which allows for the uti-

lization of both numerical and categorical independent input

parameters to predict a categorical outcome, which is often binary

(Kleinbaum 1994; Hilbe 2011). Logistic regression utilizes concepts of linear regression, where the logistic (inverse logit) transformation is applied to the outcome of a linear regression to calculate a probability. This

probability output is then used to determine a categorical response

based on a set threshold. Because the nature of the studied data

consist of both numerical and categorical independent variables,

and the dependent variables are reported in two levels, binary lo-

gistic regression was used in this study. This method of statistical

analysis for levee overtopping is appropriate given that the phe-

nomenon of levee overtopping is generally well understood in

terms of what factors contribute, and that the result of overtopping

is binary in terms of breach or non-breach. The proposed logit

model serves to assess the degree of importance of each parameter that contributes to breach due to overtopping.

The logistic regression model developed for this study attempts to predict the probability of a levee breach given a set of one numerical and six categorical variables related to construction, geotechnical, hydraulic, and geometrical factors. These variables are denoted as X1 through X7. In this study, levee breach is defined as Y = 1, and non-breach as Y = 0. The probability of breach, P(Y = 1), is defined by the following general form:

$$
P(Y=1) = \frac{e^{Z}}{1 + e^{Z}} \tag{3}
$$

where

$$
Z = \beta_0 + \sum_{i=1}^{n} \beta_i X_i \tag{4}
$$

and Xi is the observation or predictor, with β0 and βi being the coefficients estimated by the regression model. When considering categorical input variables, the output of the proposed model considers each categorical level as a predictor. So, rather than an Xi input for the categorical variable, a coefficient is calculated for each level of the categorical input variable relative to the base condition (categorical Level 1). In the event that a non-base condition is met, the contribution of the event occurring at that categorical level to the model is βi × 1. In the event that only the base categorical level is met, the contribution is βi × 0. The probability of breach is not calculated as a binary output, but rather a probability of occurrence between 0 and 1. Therefore, a threshold is established to determine if the individual event is likely to result in breach. Breach is assumed to occur if P(Y = 1) is greater than or equal to 0.5, and non-breach is assumed to occur if P(Y = 1) is less than 0.5.

Table 4. Change in probability of X4, X5, and X7 after kNN imputation for k = 8

| Level | X4 (%) | X5 (%) | X7 (%) |
|---|---|---|---|
| 1 | −2.4 | −1.9 | −1.1 |
| 2 | 3.8 | 8.5 | 2.0 |
| 3 | −1.5 | −6.6 | −0.9 |

Fig. 7. Distribution of variables for model creation with extrapolation and imputation.

The first step in evaluating a logistic regression model is to

assess the statistical significance of each variable, which is accom-

plished in the proposed model using the p-value. It is a long-

established practice in logistic regression analysis that a variable

is significant if its p-value is less than 0.05, which reflects 95%

confidence (Fisher 1925). In this study, p-value computation is

done using a base regression equation, which considers the additive

properties of each variable in succession, and is represented by:

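$$
\begin{aligned}
Z ={}& \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_{X_2=2} + \beta_3 \cdot X_{X_3=2} + \beta_4 \cdot X_{X_4=2} \\
&+ \beta_5 \cdot X_{X_4=3} + \beta_6 \cdot X_{X_5=2} + \beta_7 \cdot X_{X_5=3} + \beta_8 \cdot X_{X_6=2} \\
&+ \beta_9 \cdot X_{X_6=3} + \beta_{10} \cdot X_{X_7=2} + \beta_{11} \cdot X_{X_7=3}
\end{aligned}
\tag{5}
$$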

The p-values calculated for each level of the base model are provided in Table 5. It should be noted that if any level of a categorical variable was found to be significant, all levels had to be included to maintain completeness of the model. Significance test results indicate that all variables are significant based on the significance threshold of 0.05, except for X1 (levee height). This result for X1 is intuitive when looking at the distribution of breach and non-breach data relative to an increasing levee height, as shown in Fig. 7. Of the variables with greater significance, variable X2 (landside slope geometry) was reviewed by considering the baseline breach rates to verify its significance, as its p-value was close to the threshold. The p-value of 0.037 corresponds to only a loose relationship with the base events in the overtopping data set: X2 Level 1 was only 4% more likely to breach than X2 Level 2 when evaluating the pre-imputation data. Note that in Eq. (5) there is no reference to Level 1. This is because Level 1 is the base condition for each variable where, if that variable is at Level 1, then its contribution to the base regression equation, or Z in Eq. (5), is zero. In other words, if all variables are at Level 1, then the total value of the base regression equation is equal to the intercept value.

Stepwise regression is a commonly used method to determine if

the addition or exclusion of an individual variable, or a combination

of independent variables, can improve accuracy using covariables

(Steyerberg et al. 1999). Stepwise regression was used for the lo-

gistic regression model to determine if covariables in the form of

two-way interactions of variables and quadratic terms could im-

prove the model accuracy while maintaining the simplicity in

the developed model. The stepwise logistic regression model

was evaluated based on an Akaike’s information criterion (AIC)

accuracy metric for each combination. When using AIC to compare

model fitting, the model with the lowest AIC value represents the

best performance. AIC scores competing models by reducing the

value if information loss is minimized and increasing the value if

the model contains unnecessary complexity (Wagenmakers 2007).
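For reference, the trade-off described above corresponds to the standard definition of AIC. The symbols $q$ (number of estimated parameters) and $\hat{L}$ (maximized likelihood of the fitted model) are introduced here for illustration and do not appear in the original text:

```latex
\mathrm{AIC} = 2q - 2\ln\hat{L}
```

A better fit raises $\hat{L}$ and lowers the AIC, while each additional term raises $q$ and increases it, so an added interaction or quadratic term is retained only if its improvement in fit outweighs the penalty.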

For the presented data, the odds ratio represents the relative change in the odds of breach for a given categorical level of an individual variable. Odds ratios and regression coefficients for each model term are presented in Table 6. In each case, the odds ratio is relative to the base case, or Level 1. When the odds ratio is less than 1, breach is less likely as the categorical level increases. When the odds ratio is greater than 1, breach is more likely as the categorical level increases. This relationship can be formulated as follows:

$$
\text{Odds Ratio} = \frac{\dfrac{P(Y=1)_{X_i=n}}{1 - P(Y=1)_{X_i=n}}}{\dfrac{P(Y=1)_{X_i=1}}{1 - P(Y=1)_{X_i=1}}} \tag{6}
$$

where $P(Y=1)_{X_i=n}$ = probability of breach given variable Xi at Level “n”; and $P(Y=1)_{X_i=1}$ = probability of breach given variable Xi at Level “1.”

Table 5. Significance of model variable

| Description | Variable | Level | p-value (a) |
|---|---|---|---|
| Levee height | X1 | N/A | 0.566 |
| Landside slope geometry | X2 | 2 | 0.037 |
| Levee construction classification | X3 | 2 | 1.79 × 10^−5 |
| Overtopping depth | X4 | 2 | 4.07 × 10^−5 |
| | | 3 | 1.07 × 10^−9 |
| Overtopping duration | X5 | 2 | 0.089 |
| | | 3 | 1.12 × 10^−7 |
| Erosion resistance classification | X6 | 2 | 4.00 × 10^−4 |
| | | 3 | <2 × 10^−16 |
| Duration of levee loading prior to overtopping | X7 | 2 | 0.022 |
| | | 3 | 0.0123 |

(a) All variable significance references to Level 1.

Table 6. Variable odds ratio and coefficient

| Variable | Coefficient (β) | Odds ratio |
|---|---|---|
| Intercept | β0 = 0.93 | 2.53 |
| Levee construction entity (X3) = 2 | β3 = −1.13 | 0.32 |
| Water depth over levee (X4) = 2 | β4 = 0.87 | 2.38 |
| Water depth over levee (X4) = 3 | β5 = 1.27 | 3.55 |
| Duration of overtopping flow prior to breach (X5) = 2 | β6 = 0.08 | 1.08 |
| Duration of overtopping flow prior to breach (X5) = 3 | β7 = 1.85 | 6.37 |
| Erosion resistance classification (X6) = 2 | β8 = −1.74 | 0.18 |
| Erosion resistance classification (X6) = 3 | β9 = −3.93 | 0.02 |
| Duration of levee loading prior to overtopping (X7) = 2 | β10 = −0.42 | 0.66 |
| Duration of levee loading prior to overtopping (X7) = 3 | β11 = 0.17 | 1.18 |
| X4 = 2 and X7 = 2 | β12 = 2.18 | 8.82 |
| X4 = 3 and X7 = 2 | β13 = 2.65 | 14.08 |
| X4 = 2 and X7 = 3 | β14 = 0.60 | 1.83 |
| X4 = 3 and X7 = 3 | β15 = 2.29 | 9.89 |
| X3 = 2 and X6 = 2 | β16 = 1.35 | 3.84 |
| X3 = 2 and X6 = 3 | β17 = −0.51 | 0.60 |

Note: β1 and β2 were not used in the proposed model. Model AIC = 479.19.

For example, X6 at Level 3 has an odds ratio of 0.02. This implies that the odds of breach are 50 (1/0.02) times higher for an overtopped levee with low erosion resistance than for one with high erosion resistance. Conversely, when the X5 variable is at Level 3, the odds ratio is 6.37. This implies that an overtopping duration greater than 24 h leads to odds of breach 6.37 times higher than an overtopping duration less than 6 h. The difference in these results is not in the formulation, but rather that as the level of erosion resistance classification (X6) increases, breach is inherently less likely, whereas the opposite is true for overtopping duration (X5), where an increased level implies increased loading. Where two levels are specified for the odds ratio, both cases reference Level 1 for each variable. To demonstrate, when X3 = 2 and X6 = 3, the odds of breach are 1.67 (1/0.60) times lower compared to the base condition of X3 = 1 and X6 = 1. Conversely, when X4 = 3 and X7 = 3, the odds of breach are 9.89 times higher than the base condition of X4 = 1 and X7 = 1.
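In a logistic regression with indicator-coded levels, the odds ratio of a term in Eq. (6) reduces to the exponential of its coefficient. The short check below applies this to a few (coefficient, odds ratio) pairs copied from Table 6; the dictionary layout and function name are illustrative, and small differences are expected because the published coefficients are rounded to two decimals.

```python
import math

# (coefficient, reported odds ratio) pairs copied from Table 6
TABLE_6_SAMPLE = {
    "Intercept": (0.93, 2.53),
    "X5 = 3": (1.85, 6.37),
    "X6 = 3": (-3.93, 0.02),
    "X4 = 3 and X7 = 3": (2.29, 9.89),
}


def odds_ratio(beta: float) -> float:
    """Odds ratio of an indicator term relative to its base level."""
    return math.exp(beta)


for term, (beta, reported) in TABLE_6_SAMPLE.items():
    print(f"{term}: exp({beta:+.2f}) = {odds_ratio(beta):.2f} (Table 6: {reported})")
```

Because the odds ratio acts on odds rather than on probability, the resulting change in breach probability depends on the starting probability and is smaller when the base probability is already high.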

After the stepwise regression was assessed, X1 was removed

due to lack of significance as observed in the base model. X2

was considered in the stepwise model but was not included in

the final model based on the calculated AIC. As previously dis-

cussed, the rejection of this variable in the final model agrees with

the assessment of base data, which shows that landside slope geom-

etry does not have a significant statistical impact on the base rate of

breach.

The proposed logistic regression model for the LLID overtop-

ping data set is presented as follows:

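$$
\begin{aligned}
Z ={}& \beta_0 + \beta_3 \cdot X_{X_3=2} + \beta_4 \cdot X_{X_4=2} + \beta_5 \cdot X_{X_4=3} + \beta_6 \cdot X_{X_5=2} \\
&+ \beta_7 \cdot X_{X_5=3} + \beta_8 \cdot X_{X_6=2} + \beta_9 \cdot X_{X_6=3} + \beta_{10} \cdot X_{X_7=2} \\
&+ \beta_{11} \cdot X_{X_7=3} + \beta_{12} \cdot X_{X_4=2;\,X_7=2} + \beta_{13} \cdot X_{X_4=3;\,X_7=2} \\
&+ \beta_{14} \cdot X_{X_4=2;\,X_7=3} + \beta_{15} \cdot X_{X_4=3;\,X_7=3} + \beta_{16} \cdot X_{X_3=2;\,X_6=2} \\
&+ \beta_{17} \cdot X_{X_3=2;\,X_6=3}
\end{aligned}
\tag{7}
$$

The proposed model considers the base terms of X3, X4, X5, X6, and X7 and the two-way interactions of X4 with X7 and X3 with X6.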
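A minimal sketch of how Eq. (7) could be evaluated is given below. It transcribes the rounded coefficients from Table 6 and applies Eq. (3) and the 0.5 threshold; the dictionary keys, function names, and the integer level encoding are assumptions made for illustration. Because the coefficients are rounded and the authors' own code is not reproduced here, outputs may differ from values reported in the paper and should be checked against the original implementation before any use.

```python
import math
from typing import Dict

BETA_0 = 0.93  # intercept (Table 6)

# Base terms: (variable, level) -> coefficient; Level 1 is the base condition.
BASE_TERMS = {
    ("x3", 2): -1.13,
    ("x4", 2): 0.87, ("x4", 3): 1.27,
    ("x5", 2): 0.08, ("x5", 3): 1.85,
    ("x6", 2): -1.74, ("x6", 3): -3.93,
    ("x7", 2): -0.42, ("x7", 3): 0.17,
}

# Two-way interactions: (variable, level, variable, level) -> coefficient.
INTERACTIONS = {
    ("x4", 2, "x7", 2): 2.18, ("x4", 3, "x7", 2): 2.65,
    ("x4", 2, "x7", 3): 0.60, ("x4", 3, "x7", 3): 2.29,
    ("x3", 2, "x6", 2): 1.35, ("x3", 2, "x6", 3): -0.51,
}


def linear_predictor(levels: Dict[str, int]) -> float:
    """Z in Eq. (7): each indicator contributes its coefficient when its level is met."""
    z = BETA_0
    for (var, lvl), beta in BASE_TERMS.items():
        if levels[var] == lvl:
            z += beta
    for (var_a, lvl_a, var_b, lvl_b), beta in INTERACTIONS.items():
        if levels[var_a] == lvl_a and levels[var_b] == lvl_b:
            z += beta
    return z


def breach_probability(levels: Dict[str, int]) -> float:
    """P(Y = 1) from Eq. (3), written in the equivalent form 1 / (1 + e^-Z)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(levels)))


def predicts_breach(levels: Dict[str, int], threshold: float = 0.5) -> bool:
    """Apply the classification threshold described in the text."""
    return breach_probability(levels) >= threshold


# Base condition: every variable at Level 1, so Z equals the intercept.
base_case = {"x3": 1, "x4": 1, "x5": 1, "x6": 1, "x7": 1}
print(round(breach_probability(base_case), 3), predicts_breach(base_case))
```

Keeping the coefficients in lookup tables rather than in a long arithmetic expression makes it easier to update the model if the coefficients are re-estimated with additional data.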

Model Validation

The proposed model was validated using cross validation as well as

a test data set where a portion of the base data was set aside to be

used to evaluate the proposed model. Cross validation was con-

ducted using the k-fold cross validation method. In this method,

the data set is divided into “k” evenly distributed groups and subsequently compared against the selected logistic regression model

to determine the model accuracy (Valavi et al. 2019). Each time the

data is redistributed to validate, a fold is created. For this study, a

k-value of 5 was selected to test 20% of the data set against the

remainder of the data set in each fold. Note that k-value here rep-

resents a different value than the k in kNN imputation. In this pro-

cess, each event for validation is selected randomly in each fold.

Cohen’s kappa value, which ranges between −1 and +1, assesses

relative model agreement for multiclass data with an unbalanced

response (Delgado and Tibau 2019). Results of cross validation in-

dicated model accuracy of 80.7% with a standard deviation of 3.4%

and a Cohen’s kappa value of 0.592. While there is no standardized

scale for kappa score, a value between 0.41 and 0.60 has been

considered as moderate agreement, with values of 0.61–0.80 as

being substantial agreement (Landis and Koch 1977).

Set aside data for validation purposes included 37 of 185 events

that were not used in model development. Test data were randomly

chosen, with the condition that each event must not have any miss-

ing variable data. These data were then extrapolated to consider

intermediate, implicit events that had to occur leading up to breach

using the same extrapolation process that was applied to model

data, which yielded 116 test events. As indicated in Table 7, the accuracy of the fitted model based on the randomly selected test data set was 73.3%, with the model predictions from 85 of 116 events in agreement with the observed results. Accurate predictions are highlighted in Table 7 as either a true positive (TP) or true negative (TN).

Validation using the set aside test data set indicates that there is a

slight tendency of the model to predict a false positive (FP), as op-

posed to a false negative (FN). The probability of false positive and

false negative predictions when using the test data are 14.7% and

12.1%, respectively. Fig. 8 presents the calculated breach probability of all real base events used to create the proposed model with

respect to the incident number, as identified in the presented

data set.
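The test-set figures quoted above follow directly from the counts in Table 7, as the sketch below shows. The function name and dictionary keys are illustrative. The kappa value it computes for the test set is included only to show how Cohen's kappa is formed from a confusion matrix; the paper reports kappa for cross validation, not for this test set.

```python
from typing import Dict


def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Accuracy, error shares, and Cohen's kappa from a 2x2 confusion matrix."""
    n = tn + fp + fn + tp
    accuracy = (tn + tp) / n
    # Agreement expected by chance, from the row (observed) and
    # column (predicted) totals of each class.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
    kappa = (accuracy - expected) / (1.0 - expected)
    return {
        "accuracy": accuracy,
        "false_positive_share": fp / n,   # share of all test events
        "false_negative_share": fn / n,
        "kappa": kappa,
    }


# Counts from Table 7.
metrics = confusion_metrics(tn=26, fp=17, fn=14, tp=59)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```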

Application of Logit Model in Levee Risk Assessment

Recall that for assessing the hazard and performance components of risk related to levee breach due to overtopping, the nodes

on the event tree typically follow the sequence of (1) hydraulic

load of interest occurring, (2) initiation of embankment erosion,

(3) propagation of embankment erosion progressing beyond a criti-

cal state, and (4) breach. Because overtopping is assumed to be

occurring, the breach probability calculated by the proposed model

represents steps (2) through (4) of this process. The calculated

probability of breach, P(Y = 1), should be treated as a factor which, when multiplied by the annual exceedance probability (AEP), yields an annual probability of failure (APF). This relationship can be shown as:

$$
\mathrm{APF} = \mathrm{AEP} \times P(Y=1)_{\mathrm{Logit}} \tag{8}
$$

where $P(Y=1)_{\mathrm{Logit}}$ represents the probability of breach as determined by the logit model.

Table 7. Accuracy of model predictions versus observations

| Observation | Model prediction: Non-breach (Y = 0) | Model prediction: Breach (Y = 1) | Σ |
|---|---|---|---|
| Non-breach (Y = 0) | **26 (TN)** | 17 (FP) | 43 |
| Breach (Y = 1) | 14 (FN) | **59 (TP)** | 73 |
| Σ | 40 | 76 | 116 |

Note: Bold values signify correct predictions by the model. TN = true negative; FN = false negative; FP = false positive; and TP = true positive.

This process can be demonstrated through the example event set forth in Table 8. This scenario considers a non-federally constructed clay levee. The levee is assumed to be overtopped by 0.22 m for three hours, with loading above the riverside toe beginning six days prior to overtopping. Given these conditions, the logit model estimates a probability of breach, P(Y = 1), of 0.293, where breach is not considered likely. If we assume the overtopping depth considered in the example has an AEP of 0.1%, then the APF for this scenario can be calculated using Eq. (8) as 2.93 × 10^−4, or 0.0293%. This value would then be combined with either economic or life loss consequences of breach to form the full representation of risk.
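The arithmetic of Eq. (8) for this example can be written as a small helper. The function name and the input validation are assumptions added for illustration; the AEP of 0.1% and the breach probability of 0.293 are the values stated in the example.

```python
def annual_probability_of_failure(aep: float, p_breach: float) -> float:
    """Eq. (8): APF = AEP x P(Y = 1), with both inputs given as fractions."""
    if not (0.0 <= aep <= 1.0 and 0.0 <= p_breach <= 1.0):
        raise ValueError("aep and p_breach must be probabilities between 0 and 1")
    return aep * p_breach


# Example scenario: AEP of 0.1% and a logit-model breach probability of 0.293.
apf = annual_probability_of_failure(aep=0.001, p_breach=0.293)
print(f"APF = {apf:.2e} ({apf * 100:.4f}%)")
```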

Discussion

The LLID overtopping data set can be used for many other pur-

poses that support both USACE and other parties interested in levee

performance. For instance, the information in the overtopping data

set can be used to highlight areas of typical poor performance of

past overtopping locations for various systems. The LLID overtop-

ping data set can also provide valuable information for researchers

looking to develop models related to overtopping reliability or sim-

ilar topics, such as breach propagation. Additionally, the presented

model can be modified such that only specific variables of interest, such as hydraulic loading or erosion resistance, are considered relative to each system, to inform more specific research or programmatic interests. Using the overtopping data set, studies could be considered that focus on regional performance specific to watershed, river, or other specific geographical locations.

While the assessment of levee reliability analysis contributing to

risk assessment is generally thought to require a complex set of

hydrological and geotechnical parameters, one valuable trait of the

proposed model is that the input parameters are relatively easy to

ascertain or estimate. Variables such as levee construction classi-

fication can be categorized based on the known history of the struc-

ture. Erosion resistance classification can be determined by visual

inspection or with minimal sampling. Often, if engineering records exist for a levee, this information is located within design documentation. The hydraulic variables related to overtopping depth, overtopping duration, and duration of levee loading prior to overtopping can

generally be estimated using known river gauge information for

levees within the United States. Furthermore, visual inspection

prior to, or during, the event can often provide enough insight into

the event that the magnitude of these variables can be estimated.

Geometric factors such as levee height and slope steepness were

found to be minimally impactful to the overall model, which was

not an expected result. Given the lack of inclusion of X1 (levee height) and X2 (landside slope geometry), levees are differentiated

largely on composition and hydraulic loading. Thus, geometric fac-

tors are not considered to be highly critical in the evaluation of

cross sections using this method.

A key benefit of the presented model is that a relatively small

number of variables are required. This reduces the complexity of

interactions within the model and eliminates some potential factors

that may have highly variable effects. However, further studies are

suggested to examine whether the inclusion of additional variables

could be beneficial in potential future model updating where the

variables are reasonably easily obtained and do not lead to unnec-

essary model complexity. For example, the proposed model does

not explicitly consider the effect of slope vegetation, overtopping

velocities, cracking, or the initiation point of erosion (i.e., landside

slope or crest), which have been shown to affect the process of soil

erosion and levee overtopping breach. With additional investigation

and testing, erosion resistance properties could be better defined to

align more closely with documented studies. Additionally, flood

fighting and emergency response have likely varied across events

but were not considered in model creation. Considering such var-

iables, if available and applied in a direct manner, can potentially

improve the model accuracy but would need to be accurately as-

sessed in future studies.

Conclusions

Flood risk within the United States poses an existing and increasing

threat due to the increased frequency of extreme weather events and

expanding development within floodplains. It is imperative that

risks related to levee systems are thoroughly evaluated and regu-

larly assessed. Utilizing available tools and evolving methods,

levee data collection should be prioritized such that levee perfor-

mance can be better understood through both statistical and deter-

ministic means.

A comprehensive data set is presented within this study that

documents the performance of known overtopping events within

the United States. The data set, along with any future data addi-

tions, has the potential to allow for further refinement or expansion

of numerical values and categorical levels. The proposed model

uses known ranges of performance information in an effort to help

calibrate the minds of individuals involved with the technical as-

pects of levee risk assessment. One significant advantage of the

proposed model is that it can be updated as additional data is col-

lected and refined. The proposed model can be used by many with a

basic understanding of the limited range of required input variables.

The current study yields promising results when utilizing the proposed model as a screening measure, given the extensive work that has gone into data collection. An accuracy range of 73.3% to 80.7% indicates quality model fitting of data for a process that is inherently variable. Application of the proposed logit model is most appropriate for assessing individual levee system cross sections. When analyzing a full levee system, discretization should be considered where there are significant changes in conditions across the system. The proposed model offers a viable alternative to the subjective, judgment-based components typically relied upon in levee risk assessment.

Fig. 8. Calculated probability of breach for 185 overtopping incidents included in the data set.

Table 8. Example overtopping scenario

| Value/Level | X3 | X4 | X5 | X6 | X7 | P(Y = 1)Logit (a) |
|---|---|---|---|---|---|---|
| Value | Local | 0.22 m | 3 h | Clay | 6 days | 0.293 |
| Level | 1 | 2 | 1 | 3 | 2 | |

(a) Probability of breach as determined by logit model.

Efforts to refine the existing database will improve the model’s reliability. In the logit model’s current state, machine learning is relied upon to fill in gaps where data has not been discovered. This leads to an acknowledged level of uncertainty where significant portions of data are generated rather than measured. As such, continued research that attempts to fill these gaps should be considered. Future data collection efforts should also consider aspects that further the understanding of breach due to overtopping in a way that can predict not only its occurrence but also its magnitude. Data related to vegetation, flow velocity, emergency response, and other considerations not covered in this model may have the potential to improve the model’s accuracy.
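To make concrete what filling data gaps by machine learning can look like, the sketch below implements a nearest-neighbor imputation with a Gower-type distance for mixed numeric and categorical variables. This is an assumption based on the imputation and distance references cited in this paper (Gower 1971; Beretta and Santaniello 2016; Kowarik and Templ 2016), not a reproduction of the authors' code. The records, variable names, and the choice of k are hypothetical.

```python
from statistics import median, mode


def gower_distance(a, b, ranges):
    """Gower-type distance between two records; variables missing in either are skipped."""
    total, count = 0.0, 0
    for key, value_a in a.items():
        value_b = b.get(key)
        if value_a is None or value_b is None:
            continue
        if key in ranges:
            # Numeric variable: absolute difference scaled by the observed range.
            total += abs(value_a - value_b) / ranges[key]
        else:
            # Categorical variable: 0 if the levels match, 1 otherwise.
            total += 0.0 if value_a == value_b else 1.0
        count += 1
    return total / count if count else float("inf")


def knn_impute(records, target, k=3):
    """Fill missing values of `target` from the k nearest complete records."""
    numeric = {key for r in records for key, v in r.items()
               if isinstance(v, (int, float))}
    ranges = {}
    for key in numeric:
        observed = [r[key] for r in records if r[key] is not None]
        span = max(observed) - min(observed)
        ranges[key] = span if span > 0 else 1.0
    donors = [r for r in records if r[target] is not None]
    completed = []
    for r in records:
        if r[target] is not None:
            completed.append(dict(r))
            continue
        nearest = sorted(donors, key=lambda d: gower_distance(r, d, ranges))[:k]
        values = [d[target] for d in nearest]
        # Median for numeric targets, most frequent level for categorical ones.
        fill = median(values) if target in numeric else mode(values)
        completed.append({**r, target: fill})
    return completed


# Hypothetical records for illustration only (not taken from the data set).
records = [
    {"depth_m": 0.10, "duration_h": 2.0, "soil": "clay"},
    {"depth_m": 0.30, "duration_h": 4.0, "soil": "clay"},
    {"depth_m": 0.25, "duration_h": None, "soil": "clay"},
    {"depth_m": 0.90, "duration_h": 30.0, "soil": "sand"},
    {"depth_m": 1.00, "duration_h": 36.0, "soil": "sand"},
]
completed = knn_impute(records, "duration_h", k=2)
print(completed[2])
```

In this toy case the missing duration is filled with 3.0 h, the median of the two most similar records. The value is generated rather than measured, which is the source of the uncertainty acknowledged above.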

Lastly, unified data collection efforts are highly recommended to document the performance not only of USACE levees but also of levees within the United States and abroad that are maintained by other entities. With such data, models such as the one presented in this study can gain efficiency. An increased understanding of levee performance during flood events serves to benefit those who maintain, design, and create policy for flood risk management projects.

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request. The data include those used in the Levee Loading and Incident Database. The code includes the source code used to create the logit model.

Acknowledgments

The authors thank the US Army Corps of Engineers (USACE) for

support related to this research.

Disclaimer

The opinions expressed in this paper are those of the authors and do

not necessarily reflect those of the US Army Corps of Engineers.

Supplemental Materials

The levee overtopping data set is available online in the ASCE

Library (www.ascelibrary.org).

References

Amabile, A., M. Cordão-Neto, F. De Polo, and A. Tarantino. 2016. “Reliability analysis of flood embankments taking into account a stochastic distribution of hydraulic loading.” In Proc., 3rd European Conf. on Unsaturated Soils. Les Ulis, France: EDP Sciences. https://doi.org/10.1051/e3sconf/20160919005.

ASCE. 2021. “A Comprehensive Assessment of America’s Infrastructure.” Accessed March 9, 2021. https://infrastructurereportcard.org/wp-content/uploads/2020/12/National_IRC_2021-report-2.pdf.

Beretta, L., and A. Santaniello. 2016. “Nearest neighbor imputation algorithms: A critical evaluation.” Supplement, BMC Med. Inf. Decis. Making 16 (S3): 197–208. https://doi.org/10.1186/s12911-016-0318-z.

Briaud, J.-L., H.-C. Chen, A. V. Govindasamy, and R. Storesund. 2008. “Levee erosion by overtopping in New Orleans during the Katrina Hurricane.” J. Geotech. Geoenviron. Eng. 134 (5): 618–632. https://doi.org/10.1061/(ASCE)1090-0241(2008)134:5(618).

CRS (Congressional Research Service). 2017. Levee safety and risk: Status and considerations. Version 3. Washington, DC: CRS.

Danka, J., and L. M. Zhang. 2015. “Dike failure mechanisms and breaching parameters.” J. Geotech. Geoenviron. Eng. 141 (9): 04015039. https://doi.org/10.1061/(ASCE)GT.1943-5606.0001335.

Das, I., S. Sahoo, C. van Westen, A. Stein, and R. Hack. 2010. “Landslide susceptibility assessment using logistic regression and its comparison with a rock mass classification system, along a road section in the northern Himalayas (India).” Geomorphology 114 (4): 627–637. https://doi.org/10.1016/j.geomorph.2009.09.023.

Dehghani, N. L., A. B. Jeddi, and A. Shafieezadeh. 2021. “Intelligent hurricane resilience enhancement of power distribution systems via deep reinforcement learning.” Appl. Energy 285 (Mar): 116355. https://doi.org/10.1016/j.apenergy.2020.116355.

Delgado, R., and X.-A. Tibau. 2019. “Why Cohen’s Kappa should be avoided as performance measure in classification.” PLoS One 14 (9): e0222916. https://doi.org/10.1371/journal.pone.0222916.

D’Orazio, M. 2021. “Distances with mixed type variables some modified Gower’s coefficients.” Preprint, submitted January 7, 2021. http://arxiv.org/abs/2101.02481.

Ellithy, S., G. Savant, and J. L. Wibowo. 2017. “Effect of soil mix on overtopping erosion.” In Proc., World Environmental and Water Resources Congress 2017, 34–49. Reston, VA: ASCE. https://doi.org/10.1061/9780784480625.005.

Fisher, R. A. 1925. Statistical methods for research workers. Edinburgh,

Scotland: Oliver and Boyd.

Flor, A., N. Pinter, and J. W. F. Remo. 2010. “Evaluating levee failure susceptibility on the Mississippi River using logistic regression analysis.” Eng. Geol. 116 (1–2): 139–148. https://doi.org/10.1016/j.enggeo.2010.08.003.

Flynn, S. G., F. Vahedifard, and D. M. Schaaf. 2021. “A dataset of levee overtopping incidents.” In Proc., Geo-Extreme 2021: Infrastructure Resilience, Big Data, and Risk, 99–108. Reston, VA: ASCE. https://doi.org/10.1061/9780784483701.010.

Gandomi, A., M. Fridline, and D. Roke. 2013. “Decision tree approach for soil liquefaction assessment.” Sci. World J. 2013: 346285. https://doi.org/10.1155/2013/346285.

Gower, J. C. 1971. “A general coefficient of similarity and some of its properties.” Biometrics 27 (4): 857–871. https://doi.org/10.2307/2528823.

Gui, S., R. Zhang, and X. Xue. 1998. “Overtopping reliability models for river levee.” J. Hydraul. Eng. 124 (12): 1227–1234. https://doi.org/10.1061/(ASCE)0733-9429(1998)124:12(1227).

Heyer, T., H.-B. Horlacher, and J. Stamm. 2010. “Multicriteria stability analysis of river embankments based on past experience.” In Proc., 1st European IAHR Congress. Madrid, Spain: International Association for Hydro-environment Engineering and Research.

Heyer, T., and J. Stamm. 2013. “Levee reliability analysis using logistic

regression models—Abilities, limitations and practical considerations.”

Georisk 7 (2): 77–87. https://doi.org/10.1080/17499518.2013.790734.

Hilbe, J. 2011. “Logistic regression.” In International encyclopedia of statistical science, 15–32. New York: Springer.

Hui, R., E. Jachens, and J. Lund. 2016. “Risk-based planning analysis for a single levee.” Water Resour. Res. 52 (4): 2513–2528. https://doi.org/10.1002/2014WR016478.

Isola, M., E. Caporali, and L. Garrote. 2020. “River levee overtopping: A bivariate methodology for hydrological characterization of overtopping failure.” J. Hydrol. Eng. 25 (6): 04020026. https://doi.org/10.1061/(ASCE)HE.1943-5584.0001929.

Jasim, F. H., F. Vahedifard, A. Alborzi, H. Moftakhari, and A. AghaKouchak. 2020. “Effect of compound flooding on performance of earthen levees.” In Geo-Congress 2020: Engineering, Monitoring, and Management of Geotechnical Infrastructure, 707–716. Reston, VA: ASCE.

Kamalzare, M., T. S. Han, M. McMullan, C. Stuetzle, T. F. Zimmie, B. Cutler, and W. R. Franklin. 2013. “Computer simulation of levee erosion and overtopping.” In Geo-Congress 2013: Stability and Performance of Slopes and Embankments III, 1851–1860. Reston, VA: ASCE.

Kleinbaum, D. G. 1994. Logistic regression: A self-learning text. New York: Springer.

Kowarik, A., and M. Templ. 2016. “Imputation with the R package VIM.”

J. Stat. Software 74 (7): 1–16. https://doi.org/10.18637/jss.v074.i07.

Landis, J. R., and G. G. Koch. 1977. “The measurement of observer agreement for categorical data.” Biometrics 33 (1): 159–174. https://doi.org/10.2307/2529310.

Lendering, K., T. Schweckendiek, and M. Kok. 2018. “Quantifying the failure probability of a canal levee.” Georisk 12 (3): 203–217. https://doi.org/10.1080/17499518.2018.1426865.

Mallakpour, I., and G. Villarini. 2015. “The changing nature of flooding across the central United States.” Nat. Clim. Change 5 (3): 250–254. https://doi.org/10.1038/nclimate2516.

O’Leary, T. 2018. SQRA calculation methodology. RMC-TN-2018-01.

Lakewood, CO: USACE.

Osouli, A., and P. Safarian Bahri. 2018. “Erosion rate prediction model for levee-floodwall overtopping applications in fine-grained soils.” Geotech. Geol. Eng. 36 (4): 2823–2838. https://doi.org/10.1007/s10706-018-0505-z.

Özer, I. E., M. van Damme, and N. Jonkman. 2020. “Towards an international levee performance database (ILPD) and its use for macro-scale analysis of levee breaches and failures.” Water 12 (1): 119.

Rahimi, M., Z. Wang, A. Shafieezadeh, D. Wood, and E. J. Kubatko. 2020. “Exploring passive and active metamodeling-based reliability analysis methods for soil slopes: A new approach to active training.” Int. J. Geomech. 20 (3): 04020009. https://doi.org/10.1061/(ASCE)GM.1943-5622.0001613.

Sharp, J. A., and W. H. McAnally. 2012. “Numerical modeling of surge overtopping of a levee.” Appl. Math. Modell. 36 (4): 1359–1370. https://doi.org/10.1016/j.apm.2011.08.039.

Sills, G. L., N. D. Vroman, R. E. Wahl, and N. T. Schwanz. 2008. “Overview of New Orleans levee failures: Lessons learned and their impact on national levee design and assessment.” J. Geotech. Geoenviron. Eng. 134 (5): 556–565. https://doi.org/10.1061/(ASCE)1090-0241(2008)134:5(556).

Steyerberg, E. W., M. J. C. Eijkemans, and J. D. F. Habbema. 1999. “Stepwise selection in small data sets: A simulation study of bias in logistic regression analysis.” J. Clin. Epidemiol. 52 (10): 935–942. https://doi.org/10.1016/S0895-4356(99)00103-1.

Uno, T., H. Morisugi, T. Sugii, and K. Ohashi. 1987. “Application of a logit model to stability evaluation of river levees.” J. Nat. Disaster Sci. 9 (1): 61–77.

USACE. 2000. Design and construction of levees. Engineer Manual, EM

1110-2-1913. Washington, DC: USACE.

USACE. 2018. Levee portfolio report. Washington, DC: USACE.

USACE. 2019. Interim approach for risk-informed designs for dam and

levee projects. ECB No. 2019-15. Washington, DC: USACE.

USACE. 2021a. Climate hydrology assessment tool v1.0. Washington, DC:

USACE.

USACE. 2021b. National levee database. Washington, DC: USACE.

Vahedifard, F., A. AghaKouchak, and N. H. Jafari. 2016. “Compound hazards yield Louisiana flood.” Science 353 (6306): 1374. https://doi.org/10.1126/science.aai8579.

Vahedifard, F., F. H. Jasim, F. T. Tracy, M. Abdollahi, A. Alborzi, and A. AghaKouchak. 2020. “Levee fragility behavior under projected future flooding in a warming climate.” J. Geotech. Geoenviron. Eng. 146 (12): 04020139. https://doi.org/10.1061/(ASCE)GT.1943-5606.0002399.

Vahedifard, F., S. Sehat, and J. Aanstoos. 2017. “Effects of rainfall, geomorphological and geometrical variables on vulnerability of the lower Mississippi River levee system to slump slides.” Georisk 11 (3): 257–271. https://doi.org/10.1080/17499518.2017.1293272.

Valavi, R., J. Elith, J. J. Lahoz-Monfort, and G. Guillera-Arroita. 2019. “blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models.” Methods Ecol. Evol. 10 (2): 225–232. https://doi.org/10.1111/2041-210X.13107.

Villarini, G., J. A. Smith, M. L. Baeck, and W. F. Krajewski. 2011. “Examining flood frequency distributions in the Midwest U.S.” J. Am. Water Resour. Assoc. 47 (3): 447–463. https://doi.org/10.1111/j.1752-1688.2011.00540.x.

Wagenmakers, E.-J. 2007. “A practical solution to the pervasive problems of p values.” Psychonomic Bull. Rev. 14 (5): 779–804. https://doi.org/10.3758/BF03194105.

Zamanian, S., J. Hur, and A. Shafieezadeh. 2020. “Significant variables for leakage and collapse of buried concrete sewer pipes: A global sensitivity analysis via Bayesian additive regression trees and Sobol’ indices.” Struct. Infrastruct. Eng. 17 (5): 1–13. https://doi.org/10.1080/15732479.2020.1762674.

Zhang, W. G., and A. T. C. Goh. 2013. “Multivariate adaptive regression splines for analysis of geotechnical engineering systems.” Comput. Geotech. 48: 82–95. https://doi.org/10.1016/j.compgeo.2012.09.016.
