
K-means for earthquakes: disaggregation analyses of small events by considering wave components and soil types

Arabian Journal of Geosciences (2024) 17:302
https://doi.org/10.1007/s12517-024-12113-0
ORIGINAL PAPER
EnricoZacchei1,2· ReyolandoBrasil3,4
Received: 29 July 2023 / Accepted: 12 October 2024
© The Author(s) 2024
Abstract
In this paper, the k-means algorithm is used to disaggregate seismic parameters and evaluate their inter-correlations. The goal is to quantify, in a disaggregated way, the weight and effect of each parameter with respect to the others. About 4900 data, divided into 22 categories, were collected from the database. The main divisions concern the wave components along the horizontal and vertical axes and the soil characteristics. The studied seismic zone is the “Norpirenaica oriental,” located in the Pyrenees area between Spain and France and classified as a very high seismic hazard zone. Numerical and analytical analyses were carried out to implement the algorithm. Preliminary analyses and results aim to quantify the role of the horizontal sand stratigraphy, the non-linear effects, the elasticity of the soil, and the energy damping phenomenon. Curves are plotted as stochastic distributions and elastic spectral accelerations. Results show good prediction for vertical spectral accelerations and for distant and relatively strong events. Strictly speaking, the results are valid only for the studied seismogenic zone under the predefined constraints and ranges.
Keywords AI · K-means · Disaggregation analyses · Pyrenees area · Seismic analyses
Introduction
Background
Since the twentieth century, with the gradual development and promotion of artificial intelligence (AI), various algorithms have been employed to simulate input–output relationships for high-precision analyses. In Afshoon et al. (2021), the k-means algorithm is categorized as an “artificial neural network (ANN)” and is thus considered a form of AI, as also shown in Turco et al. (2021) and Salvador and Chan (2004). The k-means algorithm is useful for analyzing many values and dimensions, since it is very difficult for humans to compare items of such complexity reliably without a tool to aid the comparison (e.g., the classical issue of “big data” treatment (Li et al. 2021)). Analyses based on subjective human judgment are often influenced by personal experience, whereas the k-means algorithm, an unsupervised machine learning method, belongs to AI within the broad field of computer science (Turco et al. 2021; Salvador and Chan 2004).
The k-means algorithm has been used in “several branches of science,” as mentioned in Weatherill and Burton (2009) (e.g., chemistry (Almeida et al. 2007), medicine (Symons 1981), and engineering (Chen et al. 2021a)), owing to its simplicity, efficiency, and versatility (Chen et al. 2021a; Tayfur et al. 2018), since it can take several types and forms (Daszykowski et al. 2001; Salvador and Chan 2004; Ji et al. 2022). In Hu and Ma (2021), it was stated that “the greater the number of predictor variables, the harder it is to interpret and isolate each predictor variable’s effect”; thus, as stated in Chen et al. (2021b), “clustering analysis can help identify the data pattern, merge similar components and prepare for predictive model construction.”
In particular, the k-means algorithm has been used to recognize concrete aggregates from images (Chen et al. 2021a), to predict the fracture energy of concrete (Afshoon et al. 2021), to characterize steel fiber/matrix de-bonding separately from concrete matrix cracking sources (Tayfur et al. 2018), and to estimate dam deformations (Hu and Ma 2021; Li et al. 2021; Ji et al. 2022).

Responsible Editor: Issa El-Hussain
* Enrico Zacchei
enricozacchei@gmail.com
1 University of Coimbra, CERIS, Coimbra, Portugal
2 Itecons, Coimbra, Portugal
3 Polytechnic School, University of São Paulo (USP), 380 Prof. Luciano Gualberto, São Paulo, SP, Brazil
4 Center for Engineering, Modeling and Applied Social Sciences, Federal University of ABC (UFABC), Alameda da Universidade s/n, São Bernardo do Campo, SP, Brazil

In seismology and earthquake engineering, the k-means algorithm has been used to define seismic source zones from hypocentre distributions (Weatherill and Burton 2009; Rehman et al. 2014), to classify seismic activity (Sheikhhosseini et al. 2021; Kuyuk et al. 2012) and earthquake ground-motion records according to their frequency content (Yaghmaei-Sabegh 2017), and to process seismic data (Giuseppe et al. 2014; Shang et al. 2018).
The main goals of the k-means algorithm are as follows: (i) to produce groups of cases/variables with a high degree of similarity within each group (called “compactness” in Giuseppe et al. (2014) or “cohesion” in Sheikhhosseini et al. (2021)) and a low degree of similarity between groups (“separation” (Giuseppe et al. 2014)); (ii) to reduce the complexity of the data to obtain useful outputs.
The abovementioned publications provided the inspiration for the present paper. In fact, adopting the k-means algorithm is a possible alternative for seismic analyses (Weatherill and Burton 2009). Also, in Sheikhhosseini et al. (2021), it was recently stated that “there is enormous potential for extending the k-means method and including more geological and geophysical information into the analysis.” As in Ji et al. (2018), this study could be an interesting example of how problems in earthquake engineering can be addressed with state-of-the-art machine learning techniques.
K-means for earthquakes
The use of seismogenic zones (ZSs) is widely accepted by geophysicists and engineers; however, some studies highlight that the model used to define them is often controversial. In Weatherill and Burton (2009), it was stated that “there can be substantial disparity in the way in which seismic sources are characterized,” whereas in Sheikhhosseini et al. (2021), it was stated that “there does not exist a coordinated unique approach for the development of potential seismic source models.” The main reason is the lack of geological and seismological information, as discussed in Zacchei and Brasil (2022) and Zacchei and Lyra (2022).
In the Web of Science database (Web of Science (WoS) database 2023), there are only 5 articles (Yuan 2021; Lee and Kim 2020; Ramdani et al. 2015; Shafapourtehrany et al. 2022; Ji et al. 2018) published between 2015 and 2022 with the words “k-means” and “earthquake” in the title, which indicates the need for further research. In general, to the best of the authors’ knowledge, there are not enough studies on the inter-correlations of the parameters involved in seismic analyses using the k-means algorithm.
In (Ramdani etal. 2015), the k-means algorithm has been
used to classify earthquakes in the Gibraltar Arc and Andean
regions. It was stated that “very little research has been done
on the basis of seismic events analyzed from clustering point
of view” (Ramdani etal. 2015). In (Shafapourtehrany etal.
2022), it was proposed and tested, for Turkey, the ability of
the k-means clustering method to create the training dataset
for earthquake vulnerability analysis. In (Ji etal. 2018), the
non-linear seismic site response for Japan has been classi-
fied by using the k-means algorithm. According to Ji etal.
(2018), this study was the first to apply the machine learning
clustering algorithm to address this problem. In (Lee and Kim
2020), a new search space reduction algorithm using machine
learning techniques for the earthquake source parameter deter-
mination was presented. Finally, in Yuan (2021), a seismic
prediction model based on clustering of global earthquake
data was proposed.
In parallel with the mentioned studies, a single ZS with relatively few events (in Ji et al. (2018), only one event was considered) is considered here in an attempt to obtain homogeneous results. The Pyrenees area has been taken as a case study since it is an area of interest (Garcia-Mayordomo and Insua-Arevalo 2011; Garcia-Fernandez et al. 1989; Beauval et al. 2006). The considered parameters deserve more attention, and this algorithm makes it possible to disaggregate each parameter with respect to the others and to understand its weight/effect on a seismic analysis. These aspects, together with the lack of studies mentioned above, support the novelty of the present study.
Thus, the main goal of this paper is to disaggregate some seismic parameters to identify their inter-correlations and to provide new statistical values and seismic inputs in terms of elastic spectral accelerations. Analytical and numerical analyses have been carried out to develop the k-means algorithm.
In the “Seismic context” section, the studied area is described and some constraints are predefined. In the “Materials and methods” section, the materials (raw data, with the ranges used) and methods (the k-means algorithm) are explained, and preliminary results and considerations are shown. Finally, the “Analyses and results” section presents the analyses and results in terms of new stochastic outputs and seismic inputs. These results could be useful for more refined seismic analyses of the studied ZS.
Seismic context
The seismogenic zone (ZS) studied in this paper is ZS16, called the “Norpirenaica oriental” zone, located in the Pyrenees area between Spain and France. The main seismogenic parameters that characterize this area are as follows: the mean annual rate of exceedance, λc = 0.653; the b-value of the Gutenberg-Richter relation, b = 1.08; and the mean maximum moment magnitude, Mw = 6.50 ± 0.30. The samples collected to define ZS16 are considered abundant, with a very homogeneous distribution, and the style of fault rupture that mainly generates the earthquakes is “normal faulting.” ZS16 is considered a high seismic hazard zone, with events of Mw > 4.0 occurring every 1.50 years. Figure 1 shows the ZS16 zone (light red area) from the Zesis database (IGME 2015), highlighting the 8 events (red stars) considered in this study.
This area was chosen because of its seismic relevance, as explained above, and because a “cross-referencing data” process in the ESM database (Luzi et al. 2020) made it possible to obtain relevant and complete information on these 8 events. Thus, the main criteria were to select a single ZS, so that earthquakes have the same probability of occurrence at any point inside the zone, and to find sufficient processed data (e.g., magnitude, depth, epicentral distance, accelerations) to carry out the analyses. For the selection of the events, two main conditions were verified, since the ZSs are calculated under the following hypotheses: (i) the depth, Δ, of each record is Δ < 30.0 km, and (ii) the style of faulting is “normal faulting” (this information was available only for events 1, 2, and 6).
Materials andmethods
Materials
Collected raw data
From the ESM database (Luzi et al. 2020), 234 data for 22 parameters have been collected (4914 data in total). These data correspond to the different records of each event. For brevity, Table 1 lists only some parameters (9 parameters), indicating their mean values.
The other 13 parameters are as follows: Mw, soil type, wave component, sampling interval, cut-off frequency of the low- and high-pass filters, unprocessed peak ground acceleration (PGA), peak ground velocity (PGV), peak ground displacement (PGD), Housner intensity, IH, spectral acceleration, Sa, at 1.0 s and 3.0 s structural periods, and moment magnitude. The meaning of these parameters is generally well known; thus, they are not explained here, but they can be found in the literature (Zacchei and Brasil 2022; Zacchei and Lyra 2022; Faccioli and Paolucci 2005).
The values in Table 1 correspond to small earthquakes with low HPGA. Note that they represent raw data, without separating two important aspects treated later in this paper, i.e., the wave components and the soil type.
This amount of data should be sufficient to apply the k-means algorithm reliably for the purpose of this paper since, for example, previous credible studies adopted only 150 data (Daszykowski et al. 2001) and 246 data (Afshoon et al. 2021).
Fig. 1 Studied seismogenic zone (ZS16) showing the 8 events (modified from IGME (2015))
Interdependencies and preliminary treatments
To better analyze the collected data, they were separated by wave component (i.e., N, E, Z) and soil type (i.e., A, B), thus obtaining 6 main conditions. N and E refer to the horizontal waveform components orthogonal to each other, north–south and east–west, respectively, whereas Z refers to the vertical waveform component (positive upward) (2012). Soils A and B correspond to rock and very dense sand, respectively, in accordance with Eurocode (2004).
Figure2 shows the value of some parameters for each
seismic event, referring to seismic wave in N axis and soil
A, by using a box and whisker chart where the mean value
is indicated by a “ × ” (note that the parameters have different
Table 1 Mean value (except for “date”) of the collected parameters
HPGAhorizontal peak ground acceleration, VPGAvertical PGA, Sa0.30spectral acceleration at 0.30s structural period. All values refer to the
original list without separation in terms of wave components, soil types, etc. (except where indicated)
a Surface distance between the station where the event was registered and the earthquake epicenter
b Calculated by considering both horizontal waveform components, orthogonal to each other, i.e., N (north–south) and E (east–west)
c It refers to the vertical waveform component Z (positive upward)
Parameter Event 1 Event 2 Event 3 Event 4 Event 5 Event 6 Event 7 Event 8
Date (dd/mm/yy) 11/12/2002 21/01/2003 09/02/2009 30/12/2012 15/11/2007 29/04/2014 17/11/2006 03/05/2008
Local magnitude, ML4.20 4.50 4.10 4.70 4.50 4.40 5.20 3.70
Depth, Δ (km) 2.0 2.0 5.0 15.10 10.0 12.70 14.80 3.40
Epicentral distance, de (km) a29.0 67.30 60.40 49.0 79.60 82.10 93.60 71.20
HPGA (cm/s2) b − 0.54 2.74 − 1.14 − 0.70 − 0.63 − 2.44 0.88 − 5.37
VPGA (cm/s2) c2.70 9.46 0.36 0.99 − 0.70 − 3.24 − 3.04 2.53
Arias intensity, IA (cm/s) 4.70 × 10−3 2.08 × 10−1 3.94 × 10−3 2.57 × 10−2 3.03 × 10−3 3.27 × 10−2 1.74 × 10−1 6.25 × 10−2
Time interval, T90 (s) 7.53 18.38 17.83 17.38 21.53 23.58 23.92 18.76
Sa0.30 (cm/s2) b1.53 3.02 0.68 2.42 0.92 3.15 7.12 1.94
a) b)
c)
Fig. 2 Variability of some parameters of the seismic wave in the N axis and soil A
Arab J Geosci (2024) 17:302 Page 5 of 14 302
dimensions). All parameters have been already explained,
except the hypo-central distance, dh, calculated as
dh =
d2
e
2 ). A long box indicates a great variability of
some parameters highlighting and justifying those to be well
studied (e.g., HPGA, de).
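As a quick numerical check of the hypocentral distance, the following Python sketch applies the Pythagorean relation between epicentral distance and depth to the mean values listed in Table 1. The function name `hypocentral_distance` and the dictionary `TABLE_1` are illustrative choices, not part of the original study, and because the inputs are per-event means the outputs are only indicative.

```python
import math


def hypocentral_distance(d_e_km, depth_km):
    """Hypocentral distance (km) from epicentral distance and depth (both in km)."""
    return math.hypot(d_e_km, depth_km)


# (mean epicentral distance d_e, depth) in km for events 1-8, copied from Table 1
TABLE_1 = {
    1: (29.0, 2.0), 2: (67.30, 2.0), 3: (60.40, 5.0), 4: (49.0, 15.10),
    5: (79.60, 10.0), 6: (82.10, 12.70), 7: (93.60, 14.80), 8: (71.20, 3.40),
}

for event, (d_e, depth) in TABLE_1.items():
    d_h = hypocentral_distance(d_e, depth)
    print(f"Event {event}: d_e = {d_e:.2f} km, depth = {depth:.2f} km, d_h = {d_h:.2f} km")
```

For shallow events the correction is small: for event 1 (de = 29.0 km, Δ = 2.0 km) the sketch gives dh ≈ 29.07 km.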
Some preliminary results are obtained here. Figure 3 shows (non-clustered) combinations of some parameters. Several parameters characterize an earthquake, some of which are very significant, e.g., (i) IA, which is proportional to the total energy input of an infinite set of undamped linear oscillators (T90 is correlated with IA); (ii) PGD, which is a more appropriate input than PGA for designing a structure, although managing displacements is more complicated than managing inertial forces; (iii) the HPGA/VPGA ratio, which can indicate the amplification of the horizontal waves due to site effects (Kuyuk et al. 2012; Ji et al. 2018).
The parameters in Fig. 3a to d have been plotted under the same physical meaning (i.e., velocity and displacement). In fact, IA is calculated as the integral of the squared acceleration over the entire signal duration.
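The definition of IA can be made concrete with a short sketch. It assumes the standard form I_A = π/(2g) ∫ a(t)² dt, where the π/(2g) prefactor and g = 9.81 m/s² are not stated in the text, and it uses a synthetic decaying sinusoid rather than any record from the ESM database. The function name `arias_intensity` is illustrative.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def arias_intensity(acc, dt):
    """Arias intensity (m/s) of an acceleration series acc (m/s^2) sampled every dt seconds."""
    squared = [a * a for a in acc]
    # Trapezoidal rule for the integral of a(t)^2 over the whole record
    integral = sum(0.5 * (squared[i] + squared[i + 1]) * dt
                   for i in range(len(squared) - 1))
    return math.pi / (2.0 * G) * integral


# Synthetic record (hypothetical): 2 Hz sinusoid with exponential decay, 20 s at 100 Hz
dt = 0.01
record = [0.05 * math.exp(-0.2 * i * dt) * math.sin(2.0 * math.pi * 2.0 * i * dt)
          for i in range(2001)]
print(f"I_A = {arias_intensity(record, dt):.3e} m/s")
```

With accelerations in m/s² the result is in m/s; multiplying by 100 gives cm/s, the unit used in Table 1.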
Fig. 3 Correlations between some non-clustered parameters divided into components and soils (panels a, c, e: soil A; panels b, d, f: soil B)
In Fig.3a and b, the linear trends (plotted in logarithmic
scale) with R-squared value R2 = 0.83–0.91 indicate a good
approximation. The difference of the N axis with respect
to E axis for soil B (Fig.3b) is probably due to the sand
horizontal stratigraphy that provides a non-linear response
between IA and PGV, whereas the different response of the Z
axis could be correlated to the elasticity of the soil (Kramer
1996; Lanzo and Silvestri 1999). In Fig.3c and d, a more
dispersion of the value is noted for soil B. Another difference
between soil A and soil B regards the IA values, since for soil
A they are slightly higher than those for soil B, probably due
to an energy damping phenomenon.
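One plausible way to obtain such trends, assuming an ordinary least-squares line fitted to log10-transformed values (the fitting procedure is not stated in the text), is sketched below. The function name `loglog_fit` is illustrative, and the sample values are hypothetical and are not taken from Fig. 3.

```python
import math


def loglog_fit(x, y):
    """Least-squares line through (log10 x, log10 y); returns (slope, intercept, r2)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination computed in log space
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - mean_y) ** 2 for b in ly)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical PGV (cm/s) and IA (cm/s) pairs, for illustration only
pgv = [0.01, 0.03, 0.1, 0.3, 1.0]
ia = [2.0e-5, 1.5e-4, 2.1e-3, 1.6e-2, 0.21]
slope, intercept, r2 = loglog_fit(pgv, ia)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R2 = {r2:.3f}")
```

A straight line in log-log space corresponds to a power law between the two parameters, so the slope is the exponent of that power law.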
Finally, the HPGA/VPGA ratio shows more amplified events (i.e., > 1.0, red dashed line) for soil A (Fig. 3e), but more events with a longer significant duration (from ~13.0 s) are noted for soil B (Fig. 3f). The slightly damped values in Fig. 3f are probably due to the non-linearity of soil B (Kramer 1996; Lanzo and Silvestri 1999).
Methodology
The k‑means algorithm
The concept of “similarity” in cluster analysis is usually taken to mean “proximity,” and elements closer to one another in the input space are considered more similar. To calculate this similarity, the most widely used Euclidean distance (Li et al. 2021), $d_E$, and thus the squared Euclidean distance (i.e., square error), $d_E^2$, is adopted (Morissette and Chartier 2013; Song et al. 2021):

$$d_E = \sqrt{\sum_{i}^{k} \left( c_i - p_i \right)^2} \;\rightarrow\; d_E^2 = \sum_{i}^{k} \left( c_i - p_i \right)^2 \qquad (1)$$

where $c_i(x, y)$ is the center of cluster i (up to the total cluster number k), $p_i(x, y)$ is the point to be compared, and the subscript i = {1, 2, 3, …, k} thus indicates the dimension. Since a 2D space is considered, the points $p_i$ and $c_i$ are defined by two coordinates x and y. For a set, t, of cases $\Pi = \{p_1, p_2, p_3, \ldots, p_n\} \subset \mathbb{R}^d$, where $\mathbb{R}^d$ is the data space of d dimensions, the k-means algorithm tries to find a set of k cluster centers $C = \{c_1, c_2, c_3, \ldots, c_k\} \subset \mathbb{R}^d$ that is a solution of the minimization problem:

$$E = \sum_{i}^{k} \sum_{j}^{n_i} \left\lVert p_{ij} - c_i \right\rVert^2 \qquad (2)$$

$$\min_{n}\{E\} = \min_{n} \sum_{i}^{k} \sum_{j}^{n_i} \left\lVert p_{ij} - c_i \right\rVert^2 \qquad (3)$$

where $n_i$ is the number of cases, with j = {1, 2, 3, …, $n_i$}, included in cluster i and $\sum_{i}^{k} n_i = n$. Thus, the k-means clustering technique is considered a variance (Eq. (2)) minimization technique.
The interpretation of Eq. (3) is to minimize this function by optimizing the assignment of points to clusters and updating the centroids. The goal is to ensure that the sum of squared distances between points and their respective centroids is as small as possible, resulting in tightly packed clusters.
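To illustrate the quantity being minimized, the sketch below evaluates the sum of squared point-to-centroid distances for a fixed assignment of four hypothetical 2D points to two centroids. The names `squared_distance` and `objective` are illustrative, and the points are not taken from the study.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point p and a centroid c."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def objective(points, centers, labels):
    """Sum of squared distances between each point and the centroid of its cluster."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]              # cluster index of each point
good_centers = [(0, 1), (10, 1)]   # each centroid is the mean of its members
bad_centers = [(0, 0), (10, 0)]    # centroids placed on one member of each group

print(objective(points, good_centers, labels))  # 4
print(objective(points, bad_centers, labels))   # 8
```

Placing each centroid at the mean of its members gives the smaller value, which is the behaviour the minimization relies on.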
For each iterated $c_i$, the following condition must be verified (Symons 1981):

$$\frac{\partial E}{\partial c_i} = 0 \;\rightarrow\; c_i^{(t+1)} = \frac{1}{\left| n_i^{t} \right|} \sum_{j}^{n_i^{t}} p_{ij} \qquad (4)$$

Indeed, differentiating Eq. (2) gives $\partial E / \partial c_i = -2 \sum_{j}^{n_i} (p_{ij} - c_i)$, which vanishes when $c_i$ equals the mean of the cases assigned to cluster i. Thus, the average of the elements of each group is taken as the new centroid.
Equation (4) represents the tolerance level between the cluster solutions. As mentioned in Morissette and Chartier (2013) and Symons (1981), mathematically, the k-means algorithm approximates the Gaussian model, with the clusters estimated by maximum likelihood. This model considers a cluster as a probability for each case, based on the mean, μi, standard deviation, ± σi, and its probability density function (PDF). The k-means algorithm is a sub-case that assumes that the clusters have ± σi values and a trend PDF.
Procedure
The main goal of the k-means algorithm is to produce groups with a high degree of similarity and to reduce the complexity of the data. The procedure is divided into the following steps:
1) Collection of the data to be analyzed. This is a key step, since the cases, Π = {p1, p2, p3, …, pn}, must “present a high connectivity among them,” as mentioned in Almeida et al. (2007). In this study, the values shown in Table 1 refer to the same ZS; therefore, they can be correctly used. Thus, it is possible to obtain one datum for each group from several data and some groups. This shows the need to collect data with direct correlations.
2) Choice of the distance function to be used between ci and pi (Eq. (1)). Mathematica software (2019) treats pairs of elements as less similar when their distances dE(xi, yi) are larger. The function dE can be any appropriate distance or dissimilarity function. A dissimilarity dE must satisfy the following conditions: dE(xi, xi) = 0, dE(xi, yi) ≥ 0, and dE(xi, yi) = dE(yi, xi).
3) Definition of the number of clusters, k. Given that the k-means algorithm is an unsupervised (i.e., a classification made by clustering) and non-hierarchical or partitional method (i.e., a construction of clusters where the objects in a cluster are more like one another than objects in different clusters), this step is mandatory (Jain et al. 1999). In fact, the k-means algorithm requires a priori knowledge of the number of clusters k (called the “true number” in Sugar and James (2003)). In contrast, a supervised classification is made by a “discriminant analysis,” and a hierarchical method decomposes the objects into several levels of “nested clusters” (e.g., by dendrograms), as mentioned in Jain et al. (1999) and Daszykowski et al. (2001).
4) Random positioning of ci and definition of its coordinates (first iteration). Here, a further determination of the k number is made; however, Mathematica (2019) provides automatic solutions, computed intrinsically over multiple iterations. Therefore, it is important to highlight that step 3 only concerns the optimum solution for each k, for which convergence is already verified.
5) Estimation of E (Eq. (2)) and $\min_{n}\{E\}$ (Eq. (3)), and verification (Eq. (4)). This step is repeated until the centroids no longer move and a separation of the objects into classes for which the distances are minimized is obtained. If this is not verified, it is necessary to repeat step 4 until convergence.
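The steps above can be sketched in plain Python as a basic Lloyd-type iteration. This is a minimal illustration under stated assumptions, not the Mathematica routine used in the study: the distance is the squared Euclidean distance, the initial centroids are either supplied or drawn at random from the data, and the function name `kmeans`, its `init` and `seed` arguments, and the toy points are all hypothetical.

```python
import random


def squared_distance(p, c):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Basic Lloyd iteration; returns (centers, labels, E)."""
    rng = random.Random(seed)
    # Step 4: initial centroids, given explicitly or drawn at random from the data
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each point joins the cluster with the nearest centroid
        new_labels = [min(range(k), key=lambda i: squared_distance(p, centers[i]))
                      for p in points]
        if new_labels == labels:  # step 5: assignments unchanged, centroids no longer move
            break
        labels = new_labels
        # Update: each centroid becomes the mean of its members
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    energy = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, energy


# Toy data (hypothetical): two well-separated groups of four 2D points
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

centers, labels, energy = kmeans(points, k=2, init=[(0, 0), (11, 11)])
print(sorted(centers), energy)  # [(0.5, 0.5), (10.5, 10.5)] 4.0

# A less favourable start: both initial centroids inside the same group
_, _, energy_poor = kmeans(points, k=2, init=[(0, 1), (1, 0)])
print(energy_poor > energy)  # True
```

The second call stops at a configuration with a larger E than the first, which shows why the result of a single run depends on the initial centroids and why repeated runs are commonly used.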
Analyses andresults
Clustering combinations
As stated in Morissette and Chartier (2013), “the k-means clustering technique will always converge”; however, it is liable to find a local minimum instead of the global one, and thus it may not find the optimal configuration. For this reason, it is necessary not only to define the number of clusters k, which is a mandatory operation in the k-means algorithm (here, as already mentioned, handled automatically by Mathematica software (2019)), but also to estimate the k number a posteriori. Mathematica (2019) is an adequate environment in which to learn, experiment with, and apply the k-means algorithm. Whether through its built-in functions or a custom solution, its computational and visualization capabilities make it well suited to understanding clustering.
In general, all cases could represent useful results; however,
the goal should be to use the k-means algorithm “to produce
clusters that are more equiprobable than the population clus-
ters” (Morissette and Chartier 2013). Also, this algorithm should
“clean” the collected data that come from different studies.
In Salvador and Chan (2004), Milligan and Cooper (1985), and Sugar and James (2003), several methods are explained. In Almeida et al. (2007), a further “visual inspection” is suggested to help establish an optimum k number; as mentioned in Salvador and Chan (2004), a visual inspection (or “manually” (Lee and Kim 2020)) is useful since the majority of the “methods to determine the number of clusters/segments may not work very well in practice.” Also, in Weatherill and Burton (2009), it was stated that “the choice of k may still be an expert decision based on the output from different indices and methods.”
In (Giuseppe etal. 2014), it was mentioned that “there are
no theoretical solutions to a priori define k,” and the best solu-
tion consists in obtaining “a compromise between good data
clustering and acceptable physical interpretability.” In (Ji etal.
2018), it was stated that “there is no standard procedure for
computation of the optimum number of clusters.”
Here, k = 3 is adopted since the data used are already sufficiently homogeneous owing to the collecting and filtering process applied, and in order to separate well the “outliers,” as indicated in Daszykowski et al. (2001), Lee and Kim (2020), Song et al. (2021), and Ji et al. (2022), which can contaminate other clusters. For example, in Fig. 4a, the outlier (blue point) is evident, indicating that k = 3 is adequate; however, this is not always the case.
Figure4a shows some combinations of clustered values for
the N axis component (soil A) developed by using Mathemat-
ica software (2019) as detailed in the “Methodology” section.
Obviously, when the horizontal and vertical axes quantify the
same parameters, all clusters are plotted in a diagonal line as
also shown in Chen etal. (2021b). For brevity, only one case
has been indicated, which corresponds to best cluster 1. In
general, the best cluster is the cluster placed in the middle with
respect to other ones. The general criterium was to choose a
cluster that provides a mean value more like that for raw data.
This is because the goal is to obtain more reliable and refined
results from real registrations.
Figure4b shows the mode (i.e., the value that appears
most often in a set of data values) calculated for 96.0 com-
binations indicating that the cluster 2 could be the more
adequate for these analyses. This should further confirm that
using k = 3.0 is sufficient.
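The selection rule and the mode count can be sketched as follows. This is a simplified one-dimensional illustration: the helper `best_cluster`, the sample values, and the list of winning clusters are hypothetical, whereas the study itself works with pairs of parameters clustered in Mathematica.

```python
from statistics import mean, mode


def best_cluster(raw_values, clusters):
    """1-based index of the cluster whose mean is closest to the raw-data mean."""
    target = mean(raw_values)
    gaps = [abs(mean(cluster) - target) for cluster in clusters]
    return gaps.index(min(gaps)) + 1


# Hypothetical clustered values of one parameter for a single combination
clusters = [[1.0, 2.0], [5.0, 6.0], [20.0, 22.0]]
raw = [value for cluster in clusters for value in cluster]
print(best_cluster(raw, clusters))  # 2

# Hypothetical best-cluster index obtained for several combinations
winners = [2, 1, 2, 3, 2, 2, 1]
print(mode(winners))  # 2
```

Repeating the first step over all combinations and taking the mode of the winners gives a single preferred cluster, in the spirit of Fig. 4b.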
Figure5 shows the scatter diagrams with histograms
where it is possible to count how many values are placed
between a certain interval for each axis. For example, in
the deIA pair, a green scatter area indicates 1 point up
to magenta area with 11 points. The color tone changes
depending on the value number in a specific area. Also, the
horizontal de histograms correspond to the points sum, i.e.,
from left to right, there are 9, 12, 3, 7, and 5 points. For brev-
ity, only some combinations (matrix 3 × 3) have been shown.
Results anddiscussions
Stochastic outputs
As mentioned, the k-means clustering technique can be described as a centroid model in which one vector representing the mean and its standard deviation, μi ± σi, is used to describe each cluster (Morissette and Chartier 2013). Thus, Table 2 lists the obtained μi ± σi values. For brevity, only the N axis component in soil A is shown. For some combinations, e.g., the de–de pair, the results do not vary with respect to the other components under the same soil type since the distance de is the same.
In general, Table2 can be read in (i) vertical way, e.g.,
for de parameter, it is possible to see its trend weighed in
function of other ones. This is useful to obtain a general
estimation of a certain parameter by considering several
combinations; in fact, by accounting for all combinations,
the mean μi should provide the best results; (ii) horizontal
way, e.g., for HPGA parameter, it is possible to evaluate all
parameters only in function of it. Thus, if HPGA parameter
is fixed, other parameters could represent good solutions to
be adopted.
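As a check on the vertical reading of Table 2, the μi ± σi row appears to be the arithmetic mean of the four entries in each column, for both the means and the standard deviations. The sketch below recomputes that row from the values printed in Table 2; treating the aggregation as a simple average is an inference from the numbers, not a rule stated in the text, and the names `TABLE_2` and `column_summary` are illustrative.

```python
from statistics import mean

# (mu, sigma) pairs copied from the columns of Table 2 (N axis, soil A);
# within each column the entries follow the row order de, HPGA, IA, T90
TABLE_2 = {
    "de (km)":      [(82.36, 17.83), (77.60, 63.57), (44.35, 28.58), (87.18, 20.89)],
    "HPGA (cm/s2)": [(0.78, 6.30), (-0.80, 6.30), (-0.80, 6.30), (-0.80, 6.30)],
    "IA (cm/s)":    [(0.03, 0.06), (0.02, 0.05), (0.19, 0.05), (0.04, 0.07)],
    "T90 (s)":      [(28.16, 6.59), (19.64, 10.71), (10.99, 4.86), (17.58, 3.74)],
}


def column_summary(pairs):
    """Arithmetic mean of the mu values and of the sigma values in one column."""
    return mean(mu for mu, _ in pairs), mean(sigma for _, sigma in pairs)


for name, pairs in TABLE_2.items():
    mu, sigma = column_summary(pairs)
    print(f"{name}: {mu:.4f} ± {sigma:.4f}")
```

The recomputed values agree with the printed row (72.87 ± 32.72, −0.41 ± 6.30, 0.07 ± 0.06, 19.09 ± 6.48) to within rounding.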
It is important to highlight that the values in Table 1 were obtained independently of the other parameters, whereas the μi ± σi values were estimated through several coupled analyses. In this sense, the results provide disaggregated values that can be used to evaluate the weight of a certain parameter with respect to the others. This could show the great potential of AI through the k-means algorithm.
Table3 shows μi ± σi outputs of the studied parameters
obtained by considering all combinations, which would
indicate new values for each component to be used to frame
the seismic context of the ZS16 in a more correct way.
This division in terms of components and soils is important
since it allows to obtain outputs in a universal way.
In Table3, it is possible to note some physical behaviors
(some already discussed in Fig.3), in particular for soil A,
given that the rock can schematize an ideal soil without strong
influences of the elasticity, heterogeneity, and non-linearity in
accordance with classical theories (Kramer 1996; Lanzo and
Silvestri 1999). In fact, the ratio between vertical and hori-
zontal PGA for soil A is 0.16 (= 0.53/3.35), which is like the
mean ratio obtained by using European code (i.e., 0.18) (2004).
The PGA values of the horizontal components appear slightly different from each other, in particular for soil A, probably due to the different horizontal frequency of the rock in the two directions. The high value of 3.35 cm/s2 (E axis in soil A) is strictly correlated with lower de and T90, indicating relatively closer, shorter, and more intense events. For the E axis in soil B, these correlations are also verified; however, a low value of PGA was found. This difference is probably related to the fact that rock provides amplified values, whereas a non-linear sand provides damped values due to material damping (Kramer 1996; Lanzo and Silvestri 1999; Ji et al. 2018). Also, for medium–high de values (i.e., 75.0 km (Zacchei and Lyra 2022)), the radiation damping due to the spreading of the energy plays an important role.
It is important to mention that the k-means algorithm only divides the data in a geometrical way. However, under defined constraints (discussed in the “Seismic context” section) and ranges (see Table 1), these divisions can also assume a physical meaning.

Fig. 4 (a) A combination for the N axis in soil A; (b) mode calculated considering 96 combinations

Figures6 and 7 show the PDF curves of the clustered
results (solid curves) and non-clustered results (dashed
curves) for soil A and B, respectively. The formers have
been plotted by using values in Table3, whereas the latter
regard raw data discussed in the “Materials” section. Note
that raw data plotted in Figs.6 and 7, called μ0 ± σ0, are dif-
ferent from the values shown in Table3 since, as mentioned,
were not divided in function of components and soils.
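Assuming the curves are normal densities defined by each μ ± σ pair (consistent with the Gaussian model mentioned in the “Methodology” section, although the plotting details are not given), a PDF can be evaluated as in the sketch below. The two de entries are taken from Table 3; the function name `normal_pdf` is illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# de (km) for soil A, from Table 3: (mu, sigma)
n_axis = (72.87, 32.72)
e_axis = (68.77, 16.01)

for label, (mu, sigma) in (("N axis", n_axis), ("E axis", e_axis)):
    print(f"{label}: peak density = {normal_pdf(mu, mu, sigma):.4f} 1/km")
```

The peak of a normal density is 1/(σ√(2π)), so a smaller σ gives a taller and narrower curve.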
In Fig.6, obviously, the raw data, i.e., μ0 ± σ0, provide
wide curves indicating poor distributions, whereas the
N axis in soil A
de(km)HPGA (cm/s2
)I
A(cm/s)
HPGA (cm/s2)
See Fig. 4a)
IA(cm/s)T90 (s)
Fig. 5 Scatter with histograms (some cases)
Table 2 Clustered values for N axis in soil A (obtained by cluster 1
except where indicated)
a Obtained by cluster 2
b Obtained by cluster 3
N axis in soil A
de (km) HPGA (cm/
s2)
IA (cm/s) T90 (s)
de82.36 ± 17.83 a0.78 ± 6.30 0.03 ± 0.06 28.16 ± 6.59 a
HPGA 77.60 ± 63.57 -0.80 ± 6.30 0.02 ± 0.05 19.64 ± 10.71
IA44.35 ± 28.58 -0.80 ± 6.30 0.19 ± 0.05 b10.99 ± 4.86
T90 87.18 ± 20.89 a-0.80 ± 6.30 0.04 ± 0.07 17.58 ± 3.74 a
μi ± σi72.87 ± 32.72 -0.41 ± 6.30 0.07 ± 0.06 19.09 ± 6.48
Arab J Geosci (2024) 17:302 302 Page 10 of 14
sharpened shape of the PDFs indicates that the σi values are
low; thus, the clustered results tend to a good calibration.
In fact, the most useful information regard σi values, which
directly represent the error of the stochastic distribution. The
highlighted parameters appear on the E axis (Fig.6a) and
Z axis (Fig.6b) in soil A, and the E axis in soil B (Fig.7c).
Also, in general, when μ0i ≈ 1.0, the preliminary
non-clustered value does not suffer alterations; there-
fore, it could be adopted for the seismic analysis. When
μ0i > 1.0 or μ0i < 1.0, the preliminary value could be
over- or under-estimated, respectively. For Fig.7, the same
considerations are valid.
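A quick arithmetic check suggests that the μi ± σi row of Table 2 coincides, within rounding, with the column-wise arithmetic mean of the four rows above it. This is an observation on the tabulated numbers rather than a statement of the authors' procedure; the helper name `column_summary` is hypothetical.

```python
from statistics import mean

# (mu, sigma) pairs copied from Table 2 (N axis in soil A); one list per
# column, ordered by the row labels de, HPGA, IA, T90.
table2 = {
    "de (km)": [(82.36, 17.83), (77.60, 63.57), (44.35, 28.58), (87.18, 20.89)],
    "HPGA (cm/s2)": [(0.78, 6.30), (-0.80, 6.30), (-0.80, 6.30), (-0.80, 6.30)],
    "IA (cm/s)": [(0.03, 0.06), (0.02, 0.05), (0.19, 0.05), (0.04, 0.07)],
    "T90 (s)": [(28.16, 6.59), (19.64, 10.71), (10.99, 4.86), (17.58, 3.74)],
}


def column_summary(pairs):
    """Arithmetic mean of the mu values and of the sigma values in one column."""
    return mean(mu for mu, _ in pairs), mean(sigma for _, sigma in pairs)


summary = {name: column_summary(pairs) for name, pairs in table2.items()}
for name, (mu_i, sigma_i) in summary.items():
    print(f"{name}: {mu_i:.3f} ± {sigma_i:.3f}")
```

The printed means can be compared directly with the last row of Table 2, which is also the first row of Table 3.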
Table 3 μi ± σi results by considering all combinations
| Component and soil | de (km) | PGA (cm/s²) | IA (cm/s) | T90 (s) |
|---|---|---|---|---|
| N axis in soil A (Table 2) | 72.87 ± 32.72 | −0.41 ± 6.30 | 0.07 ± 0.06 | 19.09 ± 6.48 |
| E axis in soil A | 68.77 ± 16.01 | −3.35 ± 7.90 | 0.05 ± 0.14 | 16.96 ± 4.71 |
| Z axis in soil A | 75.94 ± 32.39 | 0.53 ± 2.18 a | 0.09 ± 0.06 | 21.77 ± 7.09 |
| N axis in soil B | 35.73 ± 24.61 | −0.40 ± 1.32 | 0.02 ± 0.01 | 18.18 ± 4.49 |
| E axis in soil B | 36.81 ± 21.26 | 0.17 ± 1.63 | 0.01 ± 0.0 | 15.33 ± 4.58 |
| Z axis in soil B | 47.30 ± 23.0 | −0.30 ± 1.65 a | 0.01 ± 0.01 | 23.93 ± 6.51 |
a It corresponds to VPGA
Fig. 6 PDF curves for different parameters and wave components (soil A)
New seismic inputs
In this section, new seismic inputs in terms of (elastic
and synthetic) spectral acceleration, Sa, are shown.
According to Eurocode (2004), “if the earthquakes that
contribute most to the seismic hazard defined for the site
for the purpose of probabilistic hazard assessment have a
surface-wave magnitude, Ms, not greater than 5.5, it is
recommended that the type 2 spectrum is adopted.” Regarding
the studied ZS16, the seismic characteristics indicate
that a “very wide representation of Mw > 5.0” was
registered (IGME 2015); therefore, the “type 1” spectrum
was adopted here.
Figure8 shows Sa trends in the function of the structural
period, T, for a classical return period of 475years (curves
of this study are represented by mean values).
In Fig.8, it is possible to see that the vertical Sa by Euroc-
ode (2004) (lower red line) provides good results for the site,
whereas the horizontal Sa (upper red line) overestimates the
response for soil A. This could indicate that the clustered
mean curves approximate the vertical Sa curves well.
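For reference, the shape of the type 1 horizontal elastic spectrum of Eurocode 8 (EN 1998-1) can be sketched as below. The sketch assumes that soils A and B of this study map onto Eurocode ground types A and B and uses the recommended parameters of EN 1998-1 Table 3.2 with 5% damping (η = 1.0); the design acceleration `ag` is a hypothetical value chosen only to show the arithmetic, and the code is not the one used to draw Fig. 8.

```python
# Recommended type 1 parameters (S, TB, TC, TD) from EN 1998-1, Table 3.2.
TYPE1_PARAMS = {
    "A": (1.0, 0.15, 0.4, 2.0),
    "B": (1.2, 0.15, 0.5, 2.0),
}


def elastic_sa(T, ag, ground="A", eta=1.0):
    """Horizontal elastic spectral acceleration Se(T), in the units of ag.

    eta = 1.0 corresponds to 5% viscous damping.
    """
    S, TB, TC, TD = TYPE1_PARAMS[ground]
    if T < 0.0:
        raise ValueError("period must be non-negative")
    if T <= TB:  # rising branch
        return ag * S * (1.0 + T / TB * (eta * 2.5 - 1.0))
    if T <= TC:  # constant-acceleration plateau
        return ag * S * eta * 2.5
    if T <= TD:  # constant-velocity branch
        return ag * S * eta * 2.5 * TC / T
    return ag * S * eta * 2.5 * TC * TD / T**2  # constant-displacement branch


# Hypothetical design ground acceleration (cm/s2), for illustration only.
ag = 100.0
periods = [0.0, 0.1, 0.3, 1.0, 3.0]
spectrum_a = [elastic_sa(T, ag, "A") for T in periods]
spectrum_b = [elastic_sa(T, ag, "B") for T in periods]
```

The plateau equals 2.5 × S × ag, so for the same `ag` ground type B gives a higher and wider plateau than ground type A.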
Yaghmaei-Sabegh (2017) gives a simple relation
to estimate the characteristic period, Tc, of an earthquake
ground motion: Tc = 4.30 × (PGV/PGA). Thus, estimating
the Tc trend for each component (here 0.025–0.25 s) is
useful to avoid possible resonance phenomena.
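The relation can be applied directly, as in the minimal sketch below. The function names and the PGV value are hypothetical; the PGA magnitude of 3.35 cm/s² is the one reported for the E axis in soil A, and the 0.025–0.25 s band is the range quoted in the text.

```python
def characteristic_period(pgv, pga):
    """Tc = 4.30 * (PGV / PGA); PGV in cm/s and PGA in cm/s2 give Tc in s."""
    if pga <= 0.0:
        raise ValueError("PGA must be positive (use its absolute value)")
    return 4.30 * pgv / pga


def in_band(T, tc_min=0.025, tc_max=0.25):
    """True if a structural period T (s) lies inside the quoted Tc range."""
    return tc_min <= T <= tc_max


# Hypothetical PGV (cm/s), chosen only to illustrate the arithmetic.
tc = characteristic_period(pgv=0.10, pga=3.35)
stiff_structure_at_risk = in_band(0.10)   # period inside the Tc band
flexible_structure_at_risk = in_band(1.0)  # period outside the Tc band
```

A structural period inside the band would call for closer attention to resonance, whereas a period well outside it would not, at least according to this simple screening.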
Synthetic spectra have been obtained by using four attenuation
relations, i.e., SP96 (Sabetta and Pugliese 1996),
Am96 (Ambraseys et al. 1996), Am05 (Ambraseys et al.
2005), and BT03 (Berge-Thierry et al. 2003). The use of these
relations is justified by the fact that they have been calibrated
for a seismological and geological context like that studied in
this paper (Benito and Gaspar-Escribano 2007). They are
also valid within the range of magnitude and distance considered.
In general, these equations have been widely used
in the South-European area, in particular for the Pyrenees
area, where Am05 (Garcia-Mayordomo and Insua-Arevalo
2011; Beauval et al. 2006) and BT03 (Beauval et al. 2006)
were adopted.
Table4 lists the used parameters for each attenuation
equation (for more details see (Sabetta and Pugliese 1996;
Ambraseys etal. 1996, 2005; Berge-Thierry etal. 2003))
regarding two scenarios: (i) far and relative strong event
(Mw = 4.8, de = 68.77km); (ii) near-weak event (Mw = 3.4,
de = 36.81km). The de values of the E axis component have
been considered for both scenarios since for soil A, it was
found the maximum HPGA value, i.e., HPGA = − 3.35cm/
s2 (Table3). Regarding both magnitudes, the minimum
and maximum values of the 6 events have been used.
Fig. 7 PDF curves for different parameters and wave components (soil B)
Arab J Geosci (2024) 17:302 302 Page 12 of 14
It is noted that “the greatest PGA can come from small
earthquakes in the near-field, even if such earthquakes may
not be especially damaging” (Weatherill and Burton 2009).
As is known, attenuation equations are affected by so-called
“random” and “epistemic” uncertainties (Zacchei and
Lyra 2022; Zacchei and Molina 2022); thus, it is recommended
to use more than one equation to reduce these uncertainties.
Figure9 shows synthetic spectra for two scenarios. For
soil A (Fig.9a), all curves provide similar results for stiff
structures (i.e., T = 0–1.0 s) indicating the real goddess
in using these attenuation equations for ZS16. The ratio
between the maximum Sa and PGA is about 5.0, like to 5.5
in Garcia-Mayordomo and Insua-Arevalo (2011).
For soil B (Fig.9b), these agreements are not verified;
in fact, the 4 attenuation equations have not been calibrated
for very dense sand (they are calibrated for upper and lower
limit, i.e., rock and soft or alluvium sites, respectively), and,
in general, it is difficult to calibrate them for low magni-
tudes. Although it is difficult to estimate well the response
of a very dense sand (soil B), these results would show that
clustered curves provide values in favor of safety.
Fig. 8 Elastic spectra for a soil A and b soil B
Table 4 Values used for the attenuation equations
| Attenuation equation | Magnitude, Mw (soil A) | Magnitude, Mw (soil B) | Distance, de (km) (soil A) | Distance, de (km) (soil B) | Observations a (soil A) | Observations a (soil B) |
|---|---|---|---|---|---|---|
| Am96 (Ambraseys et al. 1996) | 4.1 Ms b | 2.0 Ms | 68.77 c | 36.81 | Stiff soil site, SA = 1.0 | - |
| Am05 (Ambraseys et al. 2005) | 4.8 | 3.4 | 68.77 | 36.81 | Stiff soil site, SA = 1.0; normal faulting, FN = 1.0 | Normal faulting, FN = 1.0 |
| SP96 (Sabetta and Pugliese 1996) | 4.1 Ms | 2.0 Ms | 68.77 | 36.81 | Stiff site | - |
| BT03 (Berge-Thierry et al. 2003) | 4.1 Ms | 2.0 Ms | 68.77 d | 36.81 | Rock site | - |
a Other parameters/factors have been considered null; thus, they are not indicated here
b It corresponds to 4.8 Mw (Scordilis 2006)
c For relatively small magnitudes (i.e., Mw < 6.0), the distance from the station to the surface projection of the fault rupture can be approximated by the de distance (Ambraseys et al. 1996, 2005)
d As shown in Fig. 2a, dh ≈ de
Conclusions
In this paper, the k-means algorithm has been applied to
disaggregate several seismic parameters to understand possible
inter-correlations. As a case study, ZS16, called “Norpirenaica
oriental” and placed in the Pyrenees area between Spain
and France, has been considered. The main conclusions are:
1) 4914 data divided into 22 categories have been treated
to carry out the k-means algorithm. The main divisions
regard the wave components (i.e., N, E, Z) and
soil characteristics (i.e., A, B). This could quantify the
role of the sand horizontal stratigraphy, the non-linear
response, and the elasticity of the soil. Also, the HPGA/
VPGA ratio would quantify the amplification of the
earthquake. Preliminary results show linear trends with
good approximations (R² ≈ 0.87), highlighting some
differences correlated to the sand horizontal stratigraphy,
non-linear response, elasticity of the soil, and energy
damping phenomenon.
2) Clustered results, considering all combinations, provide
new values to be used for seismic analyses in ZS16, in terms
of μi ± σi outputs, in a more refined way. This
disaggregation analysis allows the weight and effect
of a certain parameter to be evaluated with respect to the other
ones. This could show the great potential of AI through the
k-means algorithm. Stochastic results indicate the
distribution error, highlighting that the best components
appear on the E and Z axes in soil A, and the E axis in
soil B. In general, the k-means algorithm divides the data in
a geometrical way; however, under defined constraints
and ranges, these divisions can also assume a physical
meaning.
3) New seismic inputs for ZS16 in terms of elastic spectra,
Sa, have been plotted. Results show that the horizontal
Sa values by code overestimate the response for soil A.
Synthetic spectra by attenuation equations provide a good
approximation for a far and relatively strong event
(scenario 1), and a poor approximation for a near-weak event
(scenario 2).
It is important to highlight that all results are rigorously
valid only for ZS16 under the constraints and hypotheses
described in the paper. The future integration of geophysical
and geotechnical data concerning the input model and
database (IGME 2015; Luzi et al. 2020) will certainly
improve the analyses and outputs. Based on the results of
this study, we hope to carry out an experimental campaign
in the future, allowing us to compare and validate the
proposed model. This will require new research and pose
new challenges in this field.
Acknowledgements The first author acknowledges the Itecons Institute,
Coimbra, Portugal, for the Wolfram Mathematica license and the
University of Coimbra (UC), Portugal, for paying the rights (when
applicable) to completely download all papers in the references. The first
author is also grateful for the Foundation for Science and Technology’s
support through funding UIDB/04625/2020 from the research
unit CERIS (https://doi.org/10.54499/UIDB/04625/2020).
Funding Open access funding provided by FCT|FCCN (b-on).
Data Availability Data will be made available on request.
Declarations
Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attri-
bution 4.0 International License, which permits use, sharing, adapta-
tion, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons licence, and indicate if changes
were made. The images or other third party material in this article are
included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in
the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a
copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Fig. 9 Synthetic spectra for a far and relatively strong event (scenario 1); b near-weak event (scenario 2)
References
Afshoon I, Miri M, Mousavi SR (2021) Combining Kriging meta
models with U-function and K-Means clustering for prediction
of fracture energy of concrete. J Build Eng 35:1–16
Almeida JAS, Barbosa LMS, Pais AACC, Formosinho SJ (2007)
Improving hierarchical cluster analysis: a new method with
outlier detection and automatic clustering. Chemom Intell Lab Syst
87:208–217
Ambraseys NN, Simpson KA, Bommer JJ (1996) Prediction of
horizontal response spectra in Europe. Earthq Eng Struct Dyn
25:371–400
Ambraseys NN, Douglas J, Sarma SK, Smit PM (2005) Equations for
the estimation of strong ground motions from shallow crustal
earthquakes using data from Europe and the Middle East:
horizontal peak ground acceleration and spectral acceleration. Bull
Earthq Eng 3(1):1–53
Beauval C, Hainzl S, Scherbaum F (2006) Probabilistic seismic hazard
estimation in low-seismicity regions considering non-Poissonian
seismic occurrence. Geophys J Int 164:543–550
Benito B, Gaspar-Escribano JM (2007) Ground motion characterization
and seismic hazard assessment in Spain: context, problems and
recent developments. J Seismolog 11:433–452
Berge-Thierry C, Cotton F, Scotti O (2003) New empirical response
spectral attenuation laws for moderate European earthquakes. J
Earthquake Eng 7(2):193–222
Chen L, Shan W, Liu P (2021a) Identification of concrete aggregates
using K-means clustering and level set method. Structures
34:2069–2076
Chen W, Wang X, Cai Z, Liu C, Zhu Y, Lin W (2021b) DP-GMM
clustering-based ensemble learning prediction methodology for
dam deformation considering spatiotemporal differentiation.
Knowl-Based Syst 222:1–16
Daszykowski M, Walczak B, Massart DL (2001) Looking for natural
patterns in data: part 1. Density-based approach. Chemometr Intell
Lab Syst 56:83–92
Di Giuseppe MG, Troiano A, Troise C, De Natale G (2014) k-means
clustering as tool for multivariate geophysical data analysis.
An application to shallow fault zone imaging. J Appl Geophys
101:108–115
European Committee for Standardization (CEN) (2004) Eurocode 8:
design of structures for earthquake resistance, part 1: general
rules, seismic actions and rules for buildings, BS EN 1998–1:
2004. Brussels, Belgium
Faccioli E, Paolucci R (2005) Elements of seismology applied to engi-
neering, Pitagora Editrice, Bologna, Italy, p. 255
Garcia-Fernandez M, Jimenez MJ, Kijko A (1989) Seismic hazard
parameters estimation in Spain from historical and instrumental
catalogues. Tectonophysics 167:245–251
Garcia-Mayordomo J, Insua-Arevalo JM (2011) Seismic hazard assess-
ment for the Itoiz dam site (Western Pyrenees, Spain). Soil Dyn
Earthq Eng 31:1051–1063
Hu J, Ma F (2021) Comparison of hierarchical clustering based defor-
mation prediction models for high arch dams during the initial
operation period. J Civ Struct Heal Monit 11:897–914
IGME (2015) ZESIS: Base de Datos de Zonas Sismogénicas de la
Península Ibérica y territorios de influencia para el cálculo de
la peligrosidad sísmica en España. http://info.igme.es/zesis.
Accessed March 2023
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM
Comput Surv 31:1–60
Ji K, Wen R, Ren Y, Dhakal YP (2020) Nonlinear seismic site response
classification using K-means clustering algorithm: case study of
the September 6, 2018 Mw6.6 Hokkaido Iburi-Tobu earthquake
Japan. Soil Dynamics Earthquake Eng 128:1–14
Ji L, Zhang X, Zhao Y, Li Z (2022) Anomaly detection of dam monitor-
ing data based on improved spectral clustering. J Internet Technol
23:1–11
Kramer SL (1996) Geotechnical Earthquake Engineering, first ed.,
Prentice-Hall, Upper Saddle River, NJ, p 653
Kuyuk HS, Yildirim E, Dogan E, Horasan G (2012) Application of
k-means and Gaussian mixture model for classification of seismic
activities in Istanbul. Nonlin Process Geophys 19:411–419
Lanzo G, Silvestri F (1999) Risposta Sismica Locale – Teorie ed Espe-
rienze, Hevelius Editor Srl, Italy, p 159
Lee S, Kim T (2020) Search space reduction for determination of earth-
quake source parameters using PCA and k-means clustering. J
Sens 1–12:2020
Li Y, Min K, Zhang Y, Wen L (2021) Prediction of the failure point
settlement in rockfill dams based on spatial-temporal data and
multiple-monitoring-point models. Eng Struct 243:1–12
Luzi L, Lanzano G, Felicetta C, D’Amico MC, Russo E, Sgobba S,
Pacor F, ORFEUS Working Group 5 (2020) Engineering Strong
Motion Database (ESM) (Version 2.0). Istituto Nazionale di
Geofisica e Vulcanologia (INGV). https://doi.org/10.13127/ESM.2
Milligan GW, Cooper MC (1985) An examination of procedures for
determining the number of clusters in a data set. Psychometrika
50:159–179
Morissette L, Chartier S (2013) The k-means clustering technique:
general considerations and implementation in Mathematica. Tutor
Quant Methods Psychol 9:15–24
Ramdani F, Kettani O, Tadili B (2015) Evidence for subduction
beneath Gibraltar Arc and Andean regions from k-means earth-
quake centroids. J Seismol 19:41–53
Rehman K, Burton PW, Weatherill GA (2014) K-means cluster analysis
and seismicity partitioning for Pakistan. J Seismol 18:401–419
Sabetta F, Pugliese A (1996) Estimation of response spectra and simu-
lation of nonstationary earthquake ground motions. Bull Seismol
Soc Am 86(2):337–352
Scordilis EM (2006) Empirical global relations converting Ms and mb
to moment magnitude. J Seismolog 10:225–236
Shafapourtehrany M, Yariyan P, Ozener H, Pradhan B, Shabani F
(2022) Evaluating the application of K-mean clustering in
Earthquake vulnerability mapping of Istanbul, Turkey. Int J Disaster
Risk Reduction 79:1–23
Shang X, Li X, Morales-Esteban A, Asencio-Cortes G, Wang Z (2018)
Data field-based k-means clustering for spatio-temporal seismicity
analysis and hazard assessment. Remote Sensing 10:1–22
Sheikhhosseini Z, Mirzaei N, Heidari R, Monkaresi H (2021)
Delineation of potential seismic sources using weighted K-means cluster
analysis and particle swarm optimization (PSO). Acta
Geophysica 69:2161–2172
Salvador S, Chan P (2004) Determining the number of clusters/segments
in hierarchical clustering/segmentation algorithms.
Proceedings of the 16th IEEE International Conference on Tools with
Artificial Intelligence (ICTAI), 15–17 November 2004, Boca
Raton, Florida, USA, 1–9
Song J, Zhang S, Tong F, Yang J, Zeng Z, Yuan S (2021) Outlier
detection based on multivariable panel data and K-means clus-
tering for dam deformation monitoring data. Adv Civil Eng
1–11:2021
Standard for the Exchange of Earthquake Data (SEED) (2012) refer-
ence manual, version 2.4, Incorporated Research Institutions for
Seismology (IRIS), USA, p. 224
Sugar CA, James GM (2003) Finding the number of clusters in a
dataset: an information-theoretic approach. J Am Stat Assoc
98:750–763
Symons MJ (1981) Clustering criteria and multivariate normal mix-
tures. Biometrics 37:35–43
Tayfur S, Alver N, Abdi S, Saatci S, Ghiami A (2018) Characteriza-
tion of concrete matrix/steel fiber de-bonding in an SFRC beam:
Principal component analysis and k-mean algorithm for clustering
AE data. Eng Fract Mech 194:73–85
Turco C, Funari MF, Teixeira E, Mateus R (2021) Artificial neural
networks to predict the mechanical properties of natural fibre-
reinforced compressed earth blocks (CEBs). Fibers 9:1–21
Weatherill G, Burton PW (2009) Delineation of shallow seismic source
zones using K-means cluster analysis, with application to the
Aegean region. Geophys J Int 176:565–588
Web of Science (WoS), database (2023) https://www.webofscience.com/wos/.
Accessed July 2023
Wolfram Mathematica, version 12.0, Wolfram Research, Inc.: Cham-
paign, IL, USA, 2019.
Yaghmaei-Sabegh S (2017) A novel approach for classification of
earthquake ground-motion records. J Seismol 21:885–907
Yuan R (2021) An improved K-means clustering algorithm for global
earthquake catalogs and earthquake magnitude prediction. J Seis-
mol 25:1005–1020
Zacchei E, Brasil R (2022) A new approach for physically based
probabilistic seismic hazard analyses for Portugal. Arab J Geosci
15:1–22
Zacchei E, Lyra P (2022) Recalibration of low seismic excitations in
Brazil through probabilistic and deterministic analyses: applica-
tion for shear buildings structures. Struct Concr 1–19:2022
Zacchei E, Molina JL (2022) Probabilistic seismic hazard analysis for
Andalusian dams in Southern Spain using new seismogenic zones.
ASCE-ASME J Risk Uncertain Eng Syst Part A: Civil Eng 8:1–13