Available via license: CC BY 4.0
Arabian Journal of Geosciences (2024) 17:302
https://doi.org/10.1007/s12517-024-12113-0
ORIGINAL PAPER
K-means for earthquakes: disaggregation analyses of small events
by considering wave components and soil types
Enrico Zacchei1,2 · Reyolando Brasil3,4
Received: 29 July 2023 / Accepted: 12 October 2024
© The Author(s) 2024
Abstract
In this paper, the k-means algorithm is used to disaggregate seismic parameters and evaluate their inter-correlations. The goal is to quantify, in a disaggregated way, the weight and effect of each parameter with respect to the others. About 4900 data, divided into 22 categories, were collected from the database. The main divisions concern the wave components along the horizontal and vertical axes and the soil characteristics. The studied seismic zone is the “Norpirenaica oriental,” located in the Pyrenees area between Spain and France and classified as a very high seismic hazard zone. Numerical and analytical analyses were carried out to implement the algorithm. Preliminary analyses and results aim to quantify the role of the horizontal sand stratigraphy, the non-linear effects, the elasticity of the soil, and the energy damping phenomenon. Curves are plotted as stochastic distributions and elastic spectral accelerations. Results show good prediction for vertical spectral accelerations and for far and relatively strong events. Strictly, the results are valid only for the studied seismogenic zone under predefined constraints and ranges.
Keywords AI · K-means · Disaggregation analyses · Pyrenees area · Seismic analyses
Introduction
Background
After the twentieth century, with the gradual development and promotion of artificial intelligence (AI), various algorithms have been employed to simulate input–output relationships for high-precision analyses. In Afshoon et al. (2021), the k-means algorithm is categorized as an “artificial neural network (ANN)” and is thus considered a form of AI, as also shown in Turco et al. (2021) and Salvador and Chan (2004). The k-means algorithm is useful for analyzing many values and dimensions, since it is very difficult for humans to compare items of such complexity reliably without a supporting tool (e.g., the classical issue of “big data” treatment (Li et al. 2021)). Analyses based on subjective human judgment are often influenced by personal experience, whereas the k-means algorithm, an unsupervised machine learning method, belongs to AI within the broad area of computer science (Turco et al. 2021; Salvador and Chan 2004).
The k-means algorithm has been used in “several branches of science,” as mentioned in Weatherill and Burton (2009) (e.g., chemistry (Almeida et al. 2007), medicine (Symons 1981), engineering (Chen et al. 2021a)), due to its simplicity, efficiency, and versatility (Chen et al. 2021a; Tayfur et al. 2018), and because it can be formulated in several types and forms (Daszykowski et al. 2001; Salvador and Chan 2004; Ji et al. 2022). Hu and Ma (2021) stated that “the greater the number of predictor variables, the harder it is to interpret and isolate each predictor variable’s effect”; thus, as stated in Chen et al. (2021b), “clustering analysis can help identify the data pattern, merge similar components and prepare for predictive model construction.”
In particular, the k-means algorithm has been used to recognize concrete aggregates from images (Chen et al. 2021a), to predict the fracture energy of concrete (Afshoon et al. 2021), to characterize steel fiber/matrix de-bonding separately from concrete matrix cracking sources (Tayfur et al. 2018), and to estimate deformations of dams (Hu and Ma 2021; Li et al. 2021; Ji et al. 2022).
Responsible Editor: Issa El-Hussain
* Enrico Zacchei
enricozacchei@gmail.com
1 University of Coimbra, CERIS, Coimbra, Portugal
2 Itecons, Coimbra, Portugal
3 Polytechnic School, University of São Paulo (USP), 380 Prof. Luciano Gualberto, São Paulo, SP, Brazil
4 Center for Engineering, Modeling and Applied Social Sciences, Federal University of ABC (UFABC), Alameda da Universidade s/n, São Bernardo do Campo, SP, Brazil
In seismology and earthquake engineering, the k-means algorithm has been used to define seismic source zones from hypocentre distributions (Weatherill and Burton 2009; Rehman et al. 2014), to classify seismic activity (Sheikhhosseini et al. 2021; Kuyuk et al. 2012) and earthquake ground-motion records according to their frequency content (Yaghmaei-Sabegh 2017), and to treat seismic data (Giuseppe et al. 2014; Shang et al. 2018).
The main goals of the k-means algorithm are as follows: (i) to produce groups of cases/variables with a high degree of similarity within each group (called “compactness” in Giuseppe et al. (2014) or “cohesion” in Sheikhhosseini et al. (2021)) and a low degree of similarity between groups (“separation” (Giuseppe et al. 2014)); (ii) to reduce the complexity of the data to obtain useful outputs.
The abovementioned publications provided the inspiration for the present paper. In fact, adopting the k-means algorithm is a possible alternative for seismic analyses (Weatherill and Burton 2009). Also, Sheikhhosseini et al. (2021) recently stated that “there is enormous potential for extending the k-means method and including more geological and geophysical information into the analysis.” As in Ji et al. (2018), this study could be an interesting example of how problems in earthquake engineering can be dealt with using state-of-the-art machine learning techniques.
K-means for earthquakes
The use of seismogenic zones (ZSs) is widely accepted by geophysicists and engineers; however, some studies highlight that the model used to define them is often controversial. Weatherill and Burton (2009) stated that “there can be substantial disparity in the way in which seismic sources are characterized,” whereas Sheikhhosseini et al. (2021) stated that “there does not exist a coordinated unique approach for the development of potential seismic source models.” The main reason is the lack of geological and seismological information, as discussed in Zacchei and Brasil (2022) and Zacchei and Lyra (2022).
In the Web of Science database (Web of Science (WoS), database 2023), there are only 5 articles (Yuan 2021; Lee and Kim 2020; Ramdani et al. 2015; Shafapourtehrany et al. 2022; Ji et al. 2018) published between 2015 and 2022 with the words “k-means” and “earthquake” in the title, indicating the need for further research. In general, to the best of the authors’ knowledge, there are not enough studies on the inter-correlations of the parameters involved in seismic analyses by using the k-means algorithm.
In Ramdani et al. (2015), the k-means algorithm was used to classify earthquakes in the Gibraltar Arc and Andean regions. It was stated that “very little research has been done on the basis of seismic events analyzed from clustering point of view” (Ramdani et al. 2015). In Shafapourtehrany et al. (2022), the ability of the k-means clustering method to create the training dataset for earthquake vulnerability analysis was proposed and tested for Turkey. In Ji et al. (2018), the non-linear seismic site response for Japan was classified by using the k-means algorithm. According to Ji et al. (2018), that study was the first to apply a machine learning clustering algorithm to this problem. In Lee and Kim (2020), a new search space reduction algorithm using machine learning techniques for earthquake source parameter determination was presented. Finally, in Yuan (2021), a seismic prediction model based on clustering of global earthquake data was proposed.
In parallel with the mentioned studies, a single ZS with relatively few events (in Ji et al. (2018), only one event was considered) is considered here to try to obtain homogeneous results. The Pyrenees area has been taken as a case study since it represents an interesting area (Garcia-Mayordomo and Insua-Arevalo 2011; Garcia-Fernandez et al. 1989; Beauval et al. 2006). The considered parameters deserve more attention, and this algorithm allows each parameter to be disaggregated with respect to the others so that its weight/effect on a seismic analysis can be understood. These aspects, plus the lack of studies mentioned above, justify the novelty of the present study.
Thus, the main goal of this paper is to disaggregate some seismic parameters to identify their inter-correlations and to provide new statistical values and seismic inputs in terms of elastic spectral accelerations. Analytical and numerical analyses have been carried out to develop the k-means algorithm.
In the “Seismic context” section, the studied area is described and some constraints are predefined. In the “Materials and methods” section, the materials (raw data and their ranges) and methods (the k-means algorithm) are explained, and preliminary results and considerations are shown. Finally, the “Analyses and results” section shows the analyses and results in terms of new stochastic outputs and seismic inputs. These results could be useful for more refined seismic analyses of the studied ZS.
Seismic context
The seismogenic zone (ZS) studied in this paper is ZS16, called the “Norpirenaica oriental” zone, located in the Pyrenees area between Spain and France. The main seismogenic parameters that characterize this area are as follows: the mean annual rate of exceedance, λc = 0.653; the b-value of the Gutenberg-Richter relation, b = 1.08; and the mean maximum moment magnitude, Mw = 6.50 ± 0.30. The samples collected to define ZS16 are considered abundant with a very homogeneous distribution, whereas the style of fault rupture that mainly generates the earthquakes is “normal faulting.” ZS16 is considered a high seismic hazard zone, with a frequency of occurrence for Mw > 4.0 of one event every 1.50 years. Figure 1 shows the ZS16 zone (light red area) from the Zesis database (IGME 2015), highlighting the 8 events (red stars) considered for this study.
This area has been chosen due to its seismic relevance, as explained above, and because, through a “cross-referencing data” process in the ESM database (Luzi et al. 2020), it was possible to obtain relevant and complete information regarding these 8 events. Thus, the main criteria were to select a single ZS, so that earthquakes have the same probability of occurrence at any point inside the zone, and to find sufficient processed data (e.g., magnitude, depth, epicentral distance, accelerations) to carry out the analyses. For the selection of the events, two main conditions have been verified, since the ZSs are calculated under the following hypotheses: (i) the depth, Δ, of each record is Δ < 30.0 km, and (ii) the style of fault is “normal faulting” (this information was available only for events 1, 2, and 6).
Materials and methods
Materials
Collected raw data
From the ESM database (Luzi et al. 2020), 234 data for 22 parameters have been collected (4914 data in total). These data concern the different records of each event. For brevity, Table 1 lists only some parameters (9 parameters), indicating their mean value.
The other 13 parameters are as follows: Mw, soil type, wave component, sampling interval, cut-off frequency of the low- and high-pass filter, unprocessed peak ground acceleration (PGA), peak ground velocity (PGV), peak ground displacement (PGD), Housner intensity, IH, spectral acceleration, Sa, at 1.0 s and 3.0 s structural period, moment magnitude. The meaning of these parameters is usually well known; thus, they are not explained here, but they can be found in the literature (Zacchei and Brasil 2022; Zacchei and Lyra 2022; Faccioli and Paolucci 2005).
Values in Table 1 refer to small earthquakes with low HPGA. Note that the values in Table 1 represent raw data without separating two important aspects treated later in this paper, i.e., the wave components and the soil type.
This number of data should be sufficient to apply the k-means algorithm reliably for the purpose of this paper since, for example, previous credible studies adopted only 150 data (Daszykowski et al. 2001) and 246 data (Afshoon et al. 2021).
Fig. 1 Studied seismogenic zone (ZS16) showing the 8 events (modified from IGME (2015))
Interdependencies and preliminary treatments
To better analyze the collected data, they have been separated by wave component (i.e., N, E, Z) and soil (i.e., A, B), thus obtaining 6 main conditions. N and E refer to the horizontal waveform components orthogonal to each other, north–south and east–west, respectively, whereas Z refers to the vertical waveform component (positive upward) (2012). Soils A and B correspond to rock and very dense sand, respectively, in accordance with Eurocode (2004).
Table 1 Mean value (except for “date”) of the collected parameters
| Parameter | Event 1 | Event 2 | Event 3 | Event 4 | Event 5 | Event 6 | Event 7 | Event 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Date (dd/mm/yy) | 11/12/2002 | 21/01/2003 | 09/02/2009 | 30/12/2012 | 15/11/2007 | 29/04/2014 | 17/11/2006 | 03/05/2008 |
| Local magnitude, ML | 4.20 | 4.50 | 4.10 | 4.70 | 4.50 | 4.40 | 5.20 | 3.70 |
| Depth, Δ (km) | 2.0 | 2.0 | 5.0 | 15.10 | 10.0 | 12.70 | 14.80 | 3.40 |
| Epicentral distance, de (km) a | 29.0 | 67.30 | 60.40 | 49.0 | 79.60 | 82.10 | 93.60 | 71.20 |
| HPGA (cm/s²) b | −0.54 | 2.74 | −1.14 | −0.70 | −0.63 | −2.44 | 0.88 | −5.37 |
| VPGA (cm/s²) c | 2.70 | 9.46 | 0.36 | 0.99 | −0.70 | −3.24 | −3.04 | 2.53 |
| Arias intensity, IA (cm/s) | 4.70 × 10⁻³ | 2.08 × 10⁻¹ | 3.94 × 10⁻³ | 2.57 × 10⁻² | 3.03 × 10⁻³ | 3.27 × 10⁻² | 1.74 × 10⁻¹ | 6.25 × 10⁻² |
| Time interval, T90 (s) | 7.53 | 18.38 | 17.83 | 17.38 | 21.53 | 23.58 | 23.92 | 18.76 |
| Sa0.30 (cm/s²) b | 1.53 | 3.02 | 0.68 | 2.42 | 0.92 | 3.15 | 7.12 | 1.94 |
HPGA horizontal peak ground acceleration, VPGA vertical PGA, Sa0.30 spectral acceleration at 0.30 s structural period. All values refer to the original list without separation in terms of wave components, soil types, etc. (except where indicated)
a Surface distance between the station where the event was registered and the earthquake epicenter
b Calculated by considering both horizontal waveform components, orthogonal to each other, i.e., N (north–south) and E (east–west)
c It refers to the vertical waveform component Z (positive upward)
Figure 2 shows the value of some parameters for each seismic event, referring to the seismic wave along the N axis and soil A, by using a box and whisker chart where the mean value is indicated by a “×” (note that the parameters have different dimensions). All parameters have already been explained, except the hypocentral distance, dh, calculated as $d_h = \sqrt{d_e^2 + \Delta^2}$. A long box indicates a great variability of a parameter, highlighting and justifying those that deserve closer study (e.g., HPGA, de).
Fig. 2 Variability of some parameters of the seismic wave in the N axis and soil A (panels a–c)
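The hypocentral distance defined above can be illustrated with a minimal sketch. The function name `hypocentral_distance` is hypothetical, and the inputs are the per-event mean values of de and Δ copied from Table 1, so the outputs are only indicative (the distance computed from mean values is not necessarily the mean of the per-record distances).

```python
import math


def hypocentral_distance(d_e_km: float, depth_km: float) -> float:
    """Return d_h = sqrt(d_e^2 + depth^2) in km."""
    return math.hypot(d_e_km, depth_km)


# (epicentral distance d_e, depth) per event, mean values taken from Table 1.
table1_means = {
    1: (29.0, 2.0), 2: (67.30, 2.0), 3: (60.40, 5.0), 4: (49.0, 15.10),
    5: (79.60, 10.0), 6: (82.10, 12.70), 7: (93.60, 14.80), 8: (71.20, 3.40),
}

for event, (d_e, depth) in table1_means.items():
    print(f"Event {event}: d_h = {hypocentral_distance(d_e, depth):.2f} km")
```

For the shallow events considered here (Δ < 30.0 km) and epicentral distances of tens of kilometres, dh stays close to de.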
Here, some preliminary results are obtained. Figure 3 shows (non-clustered) combinations between some parameters. Several parameters characterize an earthquake, some of which are very significant, e.g., (i) IA, which is proportional to the total energy input of an infinite set of undamped linear oscillators (T90 is correlated to IA); (ii) PGD, which is a more appropriate input than PGA for designing a structure, although managing displacements is more complicated than managing inertial forces; (iii) the HPGA/VPGA ratio, which can indicate the amplification of the horizontal waves due to site effects (Kuyuk et al. 2012; Ji et al. 2018).
The parameters in Fig. 3a to d have been plotted under the same physical meaning (i.e., velocity and displacement). In fact, IA is calculated as the integral of the acceleration squared over the entire signal duration.
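As an illustration of this integral, the sketch below evaluates the Arias intensity in its usual form, IA = π/(2g) ∫ a(t)² dt, with a trapezoidal rule. The π/(2g) factor is the standard definition and is not spelled out in the text above; the function name and the synthetic sine record are assumptions used only for demonstration, not ESM data.

```python
import math

G_CM_S2 = 980.665  # standard gravity in cm/s^2


def arias_intensity(acc_cm_s2, dt_s):
    """Trapezoidal estimate of I_A = pi/(2g) * integral(a(t)^2 dt), in cm/s."""
    squared = [a * a for a in acc_cm_s2]
    integral = sum(
        0.5 * (squared[i] + squared[i + 1]) * dt_s
        for i in range(len(squared) - 1)
    )
    return math.pi / (2.0 * G_CM_S2) * integral


# Synthetic record (hypothetical): 2 Hz sine, 1 cm/s^2 amplitude, 10 s long.
dt = 0.005
record = [math.sin(2.0 * math.pi * 2.0 * i * dt) for i in range(2001)]
ia = arias_intensity(record, dt)
print(f"I_A = {ia:.4f} cm/s")
```

With accelerations in cm/s² and time in seconds, the result is in cm/s, the unit used for IA in Table 1.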
Fig. 3 Correlations between some non-clustered parameters divided into components and soils (panels a, c, e: soil A; panels b, d, f: soil B)
In Fig. 3a and b, the linear trends (plotted in logarithmic scale) with R-squared values R² = 0.83–0.91 indicate a good approximation. The difference between the N axis and the E axis for soil B (Fig. 3b) is probably due to the horizontal sand stratigraphy, which provides a non-linear response between IA and PGV, whereas the different response of the Z axis could be correlated to the elasticity of the soil (Kramer 1996; Lanzo and Silvestri 1999). In Fig. 3c and d, a greater dispersion of the values is noted for soil B. Another difference between soil A and soil B concerns the IA values, which are slightly higher for soil A than for soil B, probably due to an energy damping phenomenon.
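A linear trend in logarithmic scale with its R² can be obtained by ordinary least squares on the log-transformed values. The sketch below is one possible way to do it; the function name `loglog_fit` and the synthetic power-law data are hypothetical, and the pairing of IA with PGV is only assumed from the discussion above.

```python
import math


def loglog_fit(x, y):
    """Least-squares line log10(y) = intercept + slope * log10(x), with R^2.

    All values must be strictly positive.
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - mean_y) ** 2 for b in ly)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical data following an exact power law I_A = 2 * PGV^2.
pgv = [0.01, 0.05, 0.1, 0.5]
ia = [2.0 * v ** 2 for v in pgv]
slope, intercept, r2 = loglog_fit(pgv, ia)
print(slope, intercept, r2)
```

For real records the points scatter around the line, and R² drops below 1, as in the 0.83–0.91 range reported above.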
Finally, the HPGA/VPGA ratio shows more amplified events (i.e., > 1.0, red dashed line) for soil A (Fig. 3e), but more events with longer significant duration (from ~13.0 s) are noted for soil B (Fig. 3f). The slightly damped values in Fig. 3f are probably due to the non-linearity of soil B (Kramer 1996; Lanzo and Silvestri 1999).
Methodology
The k‑means algorithm
The concept of “similarity” in cluster analysis is usually taken to mean “proximity,” and elements closer to one another in the input space are considered more similar. To calculate this similarity, the most widely used Euclidean distance (Li et al. 2021), dE, and thus the squared Euclidean distance (i.e., square error), dE², is adopted (Morissette and Chartier 2013; Song et al. 2021):
$$d_E = \sqrt{\sum_{i}^{k} \left(c_i - p_i\right)^2} \;\rightarrow\; d_E^2 = \sum_{i}^{k} \left(c_i - p_i\right)^2 \tag{1}$$
where ci(x, y) is the center of cluster i (up to the total cluster number k), pi(x, y) is the point to be compared, and the subscript i = {1, 2, 3, …, k} thus indicates the dimension. Since a 2D space is considered, the points pi and ci are defined by two coordinates x and y. For a set, t, of cases Π = {p1, p2, p3, …, pn} ∈ $\mathbb{R}^d$, where $\mathbb{R}^d$ is the data space of d dimensions, the k-means algorithm tries to find a set of k cluster centers C = {c1, c2, c3, …, ck} ∈ $\mathbb{R}^d$ that is a solution of the minimization problem:
$$E = \sum_{i}^{k} \sum_{j}^{n_i} \left\| p_{ij} - c_i \right\|^2 \tag{2}$$
$$\min_{n}\{E\} = \min_{n} \sum_{i}^{k} \sum_{j}^{n_i} \left\| p_{ij} - c_i \right\|^2 \tag{3}$$
where ni is the number of cases, with j = {1, 2, 3, …, ni}, included in cluster i and $\sum_{i}^{k} n_i = n$. Thus, the k-means clustering technique is considered a variance (Eq. (2)) minimization technique.
The interpretation of Eq. (3) is to minimize this function by optimizing the assignment of points to clusters and
updating the centroids. The goal is to ensure that the sum
of squared distances between points and their respective
centroids is as small as possible, resulting in tightly packed
clusters.
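The quantity minimized in Eq. (3) can be made concrete with a small sketch. The function name `squared_error` and the four toy points are hypothetical; the code only evaluates the sum of squared distances E of Eq. (2) for a given assignment, and does not perform the minimization itself.

```python
def squared_error(points, centers, labels):
    """Eq. (2): sum of squared Euclidean distances of points to their centers."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Four toy points forming two vertical pairs.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Grouping by pair gives tightly packed clusters.
tight = squared_error(points, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])
# Grouping across pairs gives a much larger error.
loose = squared_error(points, [(5.0, 0.0), (5.0, 2.0)], [0, 1, 0, 1])
print(tight, loose)  # 4.0 100.0
```

The first assignment is the one a k-means run would prefer, since it yields the smaller E.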
For each iterated ci, the following condition must be verified (Symons 1981):
$$\frac{\partial E}{\partial c_i} = 0 \;\rightarrow\; c_i^{(t+1)} = \frac{1}{\left| n_i^{t} \right|} \sum_{j}^{n_i^{t}} p_{ij} \tag{4}$$
Thus, the average of the elements of each group is taken as the new centroid.
Equation (4) represents the tolerance level between the cluster solutions. As mentioned in Morissette and Chartier (2013) and Symons (1981), mathematically, the k-means algorithm approximates the Gaussian model with an estimation of the clusters by maximum likelihood. This model considers a cluster as a probability for each case, based on the mean, μi, the standard deviation, ±σi, and its probability density function (PDF). The k-means algorithm is a sub-case that assumes that the clusters have ±σi values and a trend PDF.
Procedure
The main goal of the k-means algorithm is to produce groups with a high degree of similarity and to reduce the complexity of the data. The procedure is divided into the following steps:
1) Collection of the data to be analyzed: This is a key step since it is necessary that the cases, Π = {p1, p2, p3, …, pn} ∈ $\mathbb{R}^d$, “present a high connectivity among them” as mentioned in Almeida et al. (2007). In this study, the values shown in Table 1 refer to the same ZS; therefore, they can be correctly used. Thus, it is possible to obtain one datum for each group from several data and some groups. This shows the need to collect data with direct correlations.
2) Choice of the distance function to be used between ci and pi (Eq. (1)). Mathematica software (2019) treats pairs of elements as being less similar when their distances dE(xi, yi) are larger. The function dE can be any appropriate distance or dissimilarity function. A dissimilarity dE must satisfy the following conditions: dE(xi, xi) = 0, dE(xi, yi) ≥ 0, dE(xi, yi) = dE(yi, xi).
3) Definition of the number of clusters, k. Given that the k-means algorithm is an unsupervised (i.e., a classification made by clustering) and non-hierarchical or partitional method (i.e., a construction of clusters where the objects in a cluster are more like one another than to objects in different clusters), this step is mandatory (Jain et al. 1999). In fact, the k-means algorithm requires a priori knowledge about the number of clusters k (called the “true number” in Sugar and James (2003)). In contrast, a supervised classification is made by a “discriminant analysis,” and a hierarchical method decomposes the objects into several levels of “nested clusters” (e.g., by dendrograms) as mentioned in Jain et al. (1999) and Daszykowski et al. (2001).
4) Random positioning of ci and definition of its coordinates (first iteration). Here, a further determination of the k number is made; however, Mathematica (2019) provides automatic solutions, under multiple iterations, that are intrinsically computed. Therefore, it is important to highlight that step 3 only regards the optimum solution for each k where the convergence is already verified.
5) Estimation of E (Eq. (2)) and $\min_{n}\{E\}$ (Eq. (3)), and verification (Eq. (4)). This step is repeated until the centroids no longer move and a separation of the objects into classes for which the distances are minimized is obtained. If this is not verified, it is necessary to repeat step 4 up to convergence.
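The steps above can be sketched as a plain assignment/update iteration. This is only an illustrative sketch and not the built-in Mathematica routine used in this study: the function names, the optional `init` argument, and the six toy points are assumptions. The assignment step uses the squared Euclidean distance of Eq. (1), the update step takes the mean of each group as in Eq. (4), and the loop stops when the centroids no longer move.

```python
import random


def squared_distance(p, c):
    """Squared Euclidean distance between a point and a center (Eq. 1)."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centers, labels, E) after iterating until centroids stop moving."""
    rng = random.Random(seed)
    # Step 4: initial centers, either supplied or drawn at random from the data.
    centers = [tuple(c) for c in init] if init is not None else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Eq. (4): the new centroid is the mean of the members of each group.
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[i])  # keep the center of an empty cluster
        if updated == centers:  # step 5: centroids no longer move
            break
        centers = updated
    error = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, error


# Hypothetical 2D data with three well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 9.0), (5.0, 10.0)]
centers, labels, error = kmeans(points, 3, init=[(0, 0), (10, 0), (5, 9)])
print(centers, labels, error)
```

When `init` is omitted, the centers are drawn at random, so different seeds can lead to different local minima; running several seeds and keeping the lowest E is a common safeguard.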
Analyses and results
Clustering combinations
As stated in Morissette and Chartier (2013), “the k-means clustering technique will always converge”; however, it is liable to find a local minimum instead of a global one, so it may not find the optimal configuration. For this reason, it is necessary not only to define the number of clusters k, which is a mandatory operation to develop the k-means algorithm (here, as already mentioned, performed automatically by Mathematica software (2019)), but also to estimate the k number a posteriori. Mathematica (2019) is an adequate environment to learn, experiment with, and apply the k-means algorithm. With its built-in functions, or through a custom solution, its computational and visualization capabilities make it well suited for understanding clustering.
In general, all cases could represent useful results; however, the goal should be to use the k-means algorithm “to produce clusters that are more equiprobable than the population clusters” (Morissette and Chartier 2013). Also, this algorithm should “clean” the collected data that come from different studies.
In Salvador and Chan (2004), Milligan and Cooper (1985), and Sugar and James (2003), several methods are explained. In Almeida et al. (2007), a further “visual inspection” is suggested to help establish an optimum k number since, as mentioned in Salvador and Chan (2004), a visual inspection (or “manually” (Lee and Kim 2020)) is useful because the majority of the “methods to determine the number of clusters/segments may not work very well in practice.” Also, Weatherill and Burton (2009) stated that “the choice of k may still be an expert decision based on the output from different indices and methods.”
Giuseppe et al. (2014) mentioned that “there are no theoretical solutions to a priori define k,” and the best solution consists in obtaining “a compromise between good data clustering and acceptable physical interpretability.” Ji et al. (2018) stated that “there is no standard procedure for computation of the optimum number of clusters.”
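One simple aid to such an a-posteriori inspection is to tabulate the error E for several k values and look for the point where the decrease flattens. The sketch below is a hedged illustration, not one of the cited methods: the function names and toy data are hypothetical, and a deterministic farthest-point initialization is used only to make the example reproducible (it is an assumption, not what Mathematica does).

```python
def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_error(points, k, max_iter=100):
    """Run a basic k-means iteration and return the final squared error E."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda i: sq_dist(p, centers[i])) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[i])
        if updated == centers:
            break
        centers = updated
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical data with three natural groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 9.0), (5.0, 10.0)]
errors = {k: kmeans_error(points, k) for k in range(1, 5)}
for k, e in errors.items():
    print(k, e)
```

In this toy set E falls sharply up to k = 3 and only marginally afterwards, which is the kind of pattern a visual inspection would look for; on real data the choice remains an expert judgment, as the quoted studies point out.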
Here we adopt k = 3 since the data used are already sufficiently homogeneous, owing to the applied collecting and filtering process, and since this value separates well the “outliers,” as indicated in Daszykowski et al. (2001), Lee and Kim (2020), Song et al. (2021), and Ji et al. (2022), which can contaminate other clusters. For example, in Fig. 4a, the outlier blue point is evident, indicating that k = 3 is adequate; however, this is not always verified.
Figure 4a shows some combinations of clustered values for the N axis component (soil A) developed by using Mathematica software (2019), as detailed in the “Methodology” section. Obviously, when the horizontal and vertical axes quantify the same parameter, all clusters are plotted on a diagonal line, as also shown in Chen et al. (2021b). For brevity, only one case has been indicated, for which the best cluster is cluster 1. In general, the best cluster is the one placed in the middle with respect to the others. The general criterion was to choose the cluster whose mean value is closest to that of the raw data. This is because the goal is to obtain more reliable and refined results from real recordings.
Figure 4b shows the mode (i.e., the value that appears most often in a set of data values) calculated for 96 combinations, indicating that cluster 2 could be the most adequate for these analyses. This should further confirm that using k = 3 is sufficient.
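The selection criterion and the mode can be sketched together as follows. The function names and the three toy combinations are hypothetical and are not the 96 combinations of this study; the sketch picks, for each combination, the cluster whose mean is closest to the raw-data mean, and then takes the most frequent choice.

```python
from collections import Counter
from statistics import mean


def best_cluster(raw_values, clusters):
    """Return the id of the cluster whose mean is closest to the raw-data mean."""
    raw_mean = mean(raw_values)
    return min(clusters, key=lambda cid: abs(mean(clusters[cid]) - raw_mean))


def most_frequent(best_ids):
    """Mode of the selected cluster ids over all combinations."""
    return Counter(best_ids).most_common(1)[0][0]


# Hypothetical clustered values for three parameter combinations (k = 3 each).
combinations = [
    {1: [10.0, 11.0, 12.0], 2: [1.0, 2.0, 3.0], 3: [30.0]},
    {1: [0.5], 2: [4.0, 5.0, 6.0], 3: [20.0, 22.0]},
    {1: [1.0, 1.5], 2: [7.0, 8.0, 9.0], 3: [40.0]},
]

best_ids = []
for clusters in combinations:
    raw = [v for values in clusters.values() for v in values]
    best_ids.append(best_cluster(raw, clusters))

print(best_ids, most_frequent(best_ids))  # [1, 2, 2] 2
```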
Figure 5 shows the scatter diagrams with histograms, where it is possible to count how many values fall within a certain interval for each axis. For example, in the de–IA pair, a green scatter area indicates 1 point, up to a magenta area with 11 points. The color tone changes depending on the number of values in a specific area. Also, the horizontal de histograms correspond to the sum of the points, i.e., from left to right, there are 9, 12, 3, 7, and 5 points. For brevity, only some combinations (3 × 3 matrix) have been shown.
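The counting behind such scatter-histogram plots can be reproduced with equal-width bins. The sketch below uses hypothetical function names and toy values (not the records of this study) and assumes that all values lie inside the given range.

```python
def histogram_counts(values, n_bins, low, high):
    """Count values per equal-width bin; the last bin includes the upper bound."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def grid_counts(xs, ys, n_bins, x_range, y_range):
    """2D counts, indexed as grid[y_bin][x_bin], for a scatter-density plot."""
    x_width = (x_range[1] - x_range[0]) / n_bins
    y_width = (y_range[1] - y_range[0]) / n_bins
    grid = [[0] * n_bins for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        ix = min(int((x - x_range[0]) / x_width), n_bins - 1)
        iy = min(int((y - y_range[0]) / y_width), n_bins - 1)
        grid[iy][ix] += 1
    return grid


values = [float(v) for v in range(11)]
print(histogram_counts(values, 5, 0.0, 10.0))  # [2, 2, 2, 2, 3]
print(grid_counts([0.0, 1.0, 9.0, 10.0], [0.0, 1.0, 9.0, 10.0], 2, (0.0, 10.0), (0.0, 10.0)))
```

Summing the 2D grid along one axis gives the marginal histogram of the other axis, which is how the de histogram (9, 12, 3, 7, and 5 points, 36 in total) relates to the colored scatter areas.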
Results and discussions
Stochastic outputs
As mentioned, the k-means clustering technique can be described as a centroid model in which one vector representing the mean and its standard deviation, μi ± σi, is used to describe each cluster (Morissette and Chartier 2013). Thus, Table 2 lists the obtained μi ± σi values. For brevity, only the N axis component in soil A has been shown. For some combinations, e.g., the de–de pair, the results do not vary with respect to other components under the same soil type since the distance de is the same.
In general, Table 2 can be read (i) vertically, e.g., for the de parameter, it is possible to see its trend weighted as a function of the other parameters. This is useful to obtain a general estimate of a certain parameter by considering several combinations; in fact, by accounting for all combinations, the mean μi should provide the best results; (ii) horizontally, e.g., for the HPGA parameter, it is possible to evaluate all parameters only as a function of it. Thus, if the HPGA parameter is fixed, the other parameters could represent good solutions to be adopted.
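The vertical reading can be checked numerically. The last row of Table 2 appears to be consistent with a plain arithmetic mean of the μ and σ entries of each column; this is an observation from the tabulated numbers, not a procedure stated in the text. The sketch below, with a hypothetical function name, reproduces the de column.

```python
from statistics import mean


def column_summary(entries):
    """Average the (mu, sigma) pairs of one Table 2 column."""
    mu = mean(m for m, _ in entries)
    sigma = mean(s for _, s in entries)
    return mu, sigma


# d_e column of Table 2 (N axis, soil A): rows d_e, HPGA, I_A, T90.
de_column = [(82.36, 17.83), (77.60, 63.57), (44.35, 28.58), (87.18, 20.89)]
mu_de, sigma_de = column_summary(de_column)
print(f"{mu_de:.2f} ± {sigma_de:.2f} km")  # 72.87 ± 32.72 km
```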
It is important to highlight that the values in Table 1 have been obtained independently of the other parameters, whereas the μi ± σi values have been estimated by several coupled analyses. In this sense, the results provide disaggregated values that can be used to evaluate the weight of a certain parameter with respect to the others. This could show the great potential of AI through the k-means algorithm.
Table 3 shows the μi ± σi outputs of the studied parameters obtained by considering all combinations, which indicate new values for each component to be used to frame the seismic context of ZS16 more accurately. This division in terms of components and soils is important since it allows outputs to be obtained in a general way.
In Table 3, it is possible to note some physical behaviors (some already discussed for Fig. 3), in particular for soil A, given that rock can represent an ideal soil without strong influences of elasticity, heterogeneity, and non-linearity, in accordance with classical theories (Kramer 1996; Lanzo and Silvestri 1999). In fact, the ratio between vertical and horizontal PGA for soil A is 0.16 (= 0.53/3.35), which is close to the mean ratio obtained by using the European code (i.e., 0.18) (2004).
The PGA values of the horizontal components appear slightly different from each other, in particular for soil A, probably due to the different horizontal frequency of the rock in the two directions. The high value of 3.35 cm/s² (E axis in soil A) is strictly correlated to a smaller de and T90, indicating relatively closer, shorter, and more intense events. For the E axis in soil B, these correlations are also verified; however, a low value of PGA was found. This difference is probably correlated to the fact that rock provides amplified values, whereas a non-linear sand provides damped values due to material damping (Kramer 1996; Lanzo and Silvestri 1999; Ji et al. 2018). Also, for medium–high de values (i.e., 75.0 km (Zacchei and Lyra 2022)), the radiation damping due to the spreading of the energy plays an important role.
It is important to mention that the k-means algorithm only divides the data in a geometrical way. However, under defined constraints (discussed in the “Seismic context” section) and ranges (see Table 1), these divisions can also assume a physical meaning.
Fig. 4 a A combination for the N axis in soil A; b mode calculated considering 96 combinations
Figures 6 and 7 show the PDF curves of the clustered results (solid curves) and non-clustered results (dashed curves) for soils A and B, respectively. The former have been plotted by using the values in Table 3, whereas the latter refer to the raw data discussed in the “Materials” section. Note that the raw data plotted in Figs. 6 and 7, called μ0 ± σ0, are different from the values shown in Table 3 since, as mentioned, they were not divided as a function of components and soils.
In Fig. 6, obviously, the raw data, i.e., μ0 ± σ0, provide wide curves indicating poor distributions, whereas the sharpened shape of the PDFs indicates that the σi values are low; thus, the clustered results tend to a good calibration. In fact, the most useful information concerns the σi values, which directly represent the error of the stochastic distribution. The highlighted parameters appear on the E axis (Fig. 6a) and Z axis (Fig. 6b) in soil A, and the E axis in soil B (Fig. 7c).
Fig. 5 Scatter with histograms (some cases) for the N axis in soil A; panel axes: de (km), HPGA (cm/s²), IA (cm/s), and T90 (s); for the HPGA pair, see Fig. 4a
Table 2 Clustered values for N axis in soil A (obtained by cluster 1 except where indicated)
| N axis in soil A | de (km) | HPGA (cm/s²) | IA (cm/s) | T90 (s) |
| --- | --- | --- | --- | --- |
| de | 82.36 ± 17.83 a | 0.78 ± 6.30 | 0.03 ± 0.06 | 28.16 ± 6.59 a |
| HPGA | 77.60 ± 63.57 | −0.80 ± 6.30 | 0.02 ± 0.05 | 19.64 ± 10.71 |
| IA | 44.35 ± 28.58 | −0.80 ± 6.30 | 0.19 ± 0.05 b | 10.99 ± 4.86 |
| T90 | 87.18 ± 20.89 a | −0.80 ± 6.30 | 0.04 ± 0.07 | 17.58 ± 3.74 a |
| μi ± σi | 72.87 ± 32.72 | −0.41 ± 6.30 | 0.07 ± 0.06 | 19.09 ± 6.48 |
a Obtained by cluster 2
b Obtained by cluster 3
Also, in general, when μ0/μi ≈ 1.0, the preliminary
non-clustered value does not suffer alterations; there-
fore, it could be adopted for the seismic analysis. When
μ0/μi > 1.0 or μ0/μi < 1.0, the preliminary value could be
over- or under-estimated, respectively. For Fig.7, the same
considerations are valid.
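The comparison between clustered and non-clustered curves can be sketched numerically, assuming Gaussian PDFs as suggested by the Gaussian-model interpretation given in the “Methodology” section. The function names, the tolerance `tol`, and the raw values used below are hypothetical; only the clustered T90 value (19.09 ± 6.48 s, N axis in soil A) is taken from Table 3.

```python
import math


def normal_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def compare_means(mu_raw, mu_clustered, tol=0.05):
    """Classify the raw mean through the ratio mu_0/mu_i (tol is an assumed band)."""
    ratio = mu_raw / mu_clustered
    if abs(ratio - 1.0) <= tol:
        return "unchanged"
    return "over-estimated" if ratio > 1.0 else "under-estimated"


mu_i, sigma_i = 19.09, 6.48   # clustered T90, N axis in soil A (Table 3)
mu_0, sigma_0 = 20.0, 12.0    # hypothetical raw values, for illustration only

# A smaller sigma gives a higher and narrower peak at the mean.
peak_clustered = normal_pdf(mu_i, mu_i, sigma_i)
peak_raw = normal_pdf(mu_0, mu_0, sigma_0)
print(peak_clustered, peak_raw, compare_means(mu_0, mu_i))
```

Note that the ratio test is meaningful only when both means have the same sign; several PGA means in Table 3 are negative or close to zero, where the ratio should be interpreted with care.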
New seismic inputs
In this section, new seismic inputs in terms of (elastic
and synthetic) spectral acceleration, Sa, have been shown.
According to Eurocode (2004), “if the earthquakes that
contribute most to the seismic hazard defined for the site
for the purpose of probabilistic hazard assessment have a
Table 3 μi ± σi results by considering all combinations

| Component and soil | de (km) | PGA (cm/s²) | IA (cm/s) | T90 (s) |
|---|---|---|---|---|
| N axis in soil A (Table 2) | 72.87 ± 32.72 | −0.41 ± 6.30 | 0.07 ± 0.06 | 19.09 ± 6.48 |
| E axis in soil A | 68.77 ± 16.01 | −3.35 ± 7.90 | 0.05 ± 0.14 | 16.96 ± 4.71 |
| Z axis in soil A | 75.94 ± 32.39 | 0.53 ± 2.18 (a) | 0.09 ± 0.06 | 21.77 ± 7.09 |
| N axis in soil B | 35.73 ± 24.61 | −0.40 ± 1.32 | 0.02 ± 0.01 | 18.18 ± 4.49 |
| E axis in soil B | 36.81 ± 21.26 | 0.17 ± 1.63 | 0.01 ± 0.0 | 15.33 ± 4.58 |
| Z axis in soil B | 47.30 ± 23.0 | −0.30 ± 1.65 (a) | 0.01 ± 0.01 | 23.93 ± 6.51 |

(a) It corresponds to VPGA
Fig. 6 PDF curves for different parameters and wave components (soil A)
surface-wave magnitude, Ms, not greater than 5.5, it is recommended that the type 2 spectrum is adopted.” For the studied ZS16, the seismic characteristics indicate that a “very wide representation of Mw > 5.0” was registered (IGME 2015); therefore, the “type 1” spectrum was adopted here.
Figure 8 shows Sa trends as a function of the structural period, T, for a standard return period of 475 years (the curves of this study are represented by mean values).
In Fig. 8, the vertical Sa given by Eurocode (2004) (lower red line) provides good results for the site, whereas the horizontal Sa (upper red line) overestimates the response for soil A. This could indicate that the clustered mean curves approximate the vertical Sa curves well.
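For orientation, the horizontal type 1 elastic spectrum of Eurocode (2004) can be sketched as below. The sketch assumes 5% damping (η = 1.0), assumes that soils A and B of this study correspond to the Eurocode ground types A and B, and uses a hypothetical design ground acceleration a_g = 100 cm/s² that is not taken from the paper. It does not reproduce the curves of Fig. 8, and the vertical spectrum is not covered.

```python
# Type 1 parameters (S, TB, TC, TD) of EN 1998-1 for ground types A and B.
TYPE1_PARAMS = {
    "A": (1.0, 0.15, 0.4, 2.0),
    "B": (1.2, 0.15, 0.5, 2.0),
}

def elastic_spectrum(T, a_g, ground="A", eta=1.0):
    """Horizontal elastic spectral acceleration Se(T), in the units of a_g.

    T is the structural period in seconds (the code defines the spectrum up to 4 s).
    """
    if T < 0.0:
        raise ValueError("period must be non-negative")
    S, TB, TC, TD = TYPE1_PARAMS[ground]
    plateau = 2.5 * a_g * S * eta
    if T <= TB:  # linear rise from a_g * S at T = 0 to the plateau at T = TB
        return a_g * S * (1.0 + (T / TB) * (2.5 * eta - 1.0))
    if T <= TC:  # constant-acceleration plateau
        return plateau
    if T <= TD:  # constant-velocity branch
        return plateau * TC / T
    return plateau * TC * TD / (T * T)  # constant-displacement branch

a_g = 100.0  # hypothetical value in cm/s², for illustration only
for T in (0.0, 0.1, 0.3, 0.8, 3.0):
    se_a = elastic_spectrum(T, a_g, "A")
    se_b = elastic_spectrum(T, a_g, "B")
    print(f"T = {T:.1f} s: Se(A) = {se_a:.1f}, Se(B) = {se_b:.1f} cm/s²")
```

The plateau equals 2.5 times a_g·S, so any comparison with clustered mean curves depends directly on the a_g adopted for the site.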
Yaghmaei-Sabegh (2017) gives a simple relation for estimating the characteristic period, Tc, of an earthquake ground motion: Tc = 4.30 × (PGV/PGA). Estimating the Tc trend for each component (here 0.025–0.25 s) is therefore useful for avoiding possible resonance phenomena.
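The relation above is simple enough to evaluate directly. In the minimal sketch below, the function name and the PGV and PGA values are hypothetical and serve only to show the units: with PGV in cm/s and PGA in cm/s², Tc is obtained in seconds. The function expects peak amplitudes (non-negative), not the signed means of Table 3.

```python
def characteristic_period(pgv, pga, coeff=4.30):
    """Tc = coeff * PGV / PGA, in seconds for PGV in cm/s and PGA in cm/s²."""
    if pgv < 0 or pga <= 0:
        raise ValueError("expects peak amplitudes: pgv >= 0 and pga > 0")
    return coeff * pgv / pga

# Hypothetical peak values, for illustration only.
tc = characteristic_period(pgv=0.5, pga=20.0)
in_reported_range = 0.025 <= tc <= 0.25  # range reported in the text
print(f"Tc = {tc:.4f} s, within 0.025-0.25 s: {in_reported_range}")
```

A structure whose fundamental period is close to Tc would be the case to examine for resonance.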
Synthetic spectra have been obtained by using four attenuation relations: SP96 (Sabetta and Pugliese 1996), Am96 (Ambraseys et al. 1996), Am05 (Ambraseys et al. 2005), and BT03 (Berge-Thierry et al. 2003). Their use is justified because they have been calibrated for a seismological and geological context similar to that studied in this paper (Benito and Gaspar-Escribano 2007). They are also valid within the range of magnitude and distance considered. In general, these equations have been widely used in the South-European area, in particular for the Pyrenees area, where Am05 (Garcia-Mayordomo and Insua-Arevalo 2011; Beauval et al. 2006) and BT03 (Beauval et al. 2006) were adopted.
Table 4 lists the parameters used for each attenuation equation (for more details, see Sabetta and Pugliese 1996; Ambraseys et al. 1996, 2005; Berge-Thierry et al. 2003) for two scenarios: (i) a far and relatively strong event (Mw = 4.8, de = 68.77 km); (ii) a near-weak event (Mw = 3.4, de = 36.81 km). The de values of the E axis component have been considered for both scenarios since, for soil A, this component gave the maximum HPGA value in absolute terms, i.e., HPGA = −3.35 cm/s² (Table 3). As for the two magnitudes, the minimum and maximum values of the 6 events have been used.
Fig. 7 PDF curves for different parameters and wave components (soil B)
It is noted that “the greatest PGA can come from small earthquakes in the near-field, even if such earthquakes may not be especially damaging” (Weatherill and Burton 2009). Attenuation equations are known to have some limitations (called “random” and “epistemic” uncertainties (Zacchei and Lyra 2022; Zacchei and Molina 2022)); thus, it is recommended to use more than one equation to reduce the resulting uncertainty.
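One common way to use several attenuation equations together is to combine their predictions with weights. The sketch below uses a weighted geometric mean (an average in log space); this convention is an assumption made here and is not stated in the paper. The function name, the relation labels, and the Sa values are hypothetical placeholders, not results of this study.

```python
import math

def combine_predictions(predictions, weights=None):
    """Weighted geometric mean of positive predictions (mean taken in log space)."""
    names = list(predictions)
    if weights is None:  # equal weights by default
        weights = {name: 1.0 / len(names) for name in names}
    if abs(sum(weights[name] for name in names) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    log_mean = sum(weights[name] * math.log(predictions[name]) for name in names)
    return math.exp(log_mean)

# Placeholder Sa values (cm/s²) at a single period; NOT results from the paper.
sa_placeholder = {
    "relation_1": 4.0,
    "relation_2": 5.0,
    "relation_3": 6.0,
    "relation_4": 5.0,
}
sa_combined = combine_predictions(sa_placeholder)
spread = max(sa_placeholder.values()) / min(sa_placeholder.values())
print(f"combined Sa = {sa_combined:.2f} cm/s², max/min spread = {spread:.2f}")
```

The max/min spread gives a quick indication of how much the individual equations disagree at a given period.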
Figure 9 shows synthetic spectra for the two scenarios. For soil A (Fig. 9a), all curves provide similar results for stiff structures (i.e., T = 0–1.0 s), indicating that these attenuation equations are well suited to ZS16. The ratio between the maximum Sa and PGA is about 5.0, similar to the value of 5.5 in Garcia-Mayordomo and Insua-Arevalo (2011).
For soil B (Fig. 9b), this agreement is not observed; in fact, the 4 attenuation equations have not been calibrated for very dense sand (they are calibrated for the upper and lower limits, i.e., rock and soft or alluvium sites, respectively), and, in general, they are difficult to calibrate for low magnitudes. Although the response of very dense sand (soil B) is difficult to estimate well, these results suggest that the clustered curves provide values on the safe side.
Conclusions
In this paper, the k-means algorithm has been applied to disaggregate several seismic parameters and to understand their possible inter-correlations. As a case study, ZS16, called “Norpirenaica oriental” and located in the Pyrenees area between Spain and France, has been considered. The main conclusions are:
1) A total of 4914 data points, divided into 22 categories, have been processed with the k-means algorithm. The main divisions concern the wave components (i.e., N, E, Z) and the soil characteristics (i.e., A, B). This could quantify the role of the sand horizontal stratigraphy, the non-linear response, and the elasticity of the soil. Also, the HPGA/VPGA ratio would quantify the amplification of the earthquake. Preliminary results show linear trends with good approximations (R² ≈ 0.87), highlighting some differences correlated with the sand horizontal stratigraphy, non-linear response, elasticity of the soil, and energy damping phenomenon.
2) Clustered results, considering all combinations, provide
new values to be used for seismic analyses, in terms
Fig. 8 Elastic spectra for (a) soil A and (b) soil B
Table 4 Values used for the attenuation equations

| Attenuation equation | Magnitude, Mw (soil A) | Magnitude, Mw (soil B) | Distance, de (km) (soil A) | Distance, de (km) (soil B) | Observations (a) (soil A) | Observations (a) (soil B) |
|---|---|---|---|---|---|---|
| Am96 (Ambraseys et al. 1996) | 4.1 Ms (b) | 2.0 Ms | 68.77 (c) | 36.81 | Stiff soil site, SA = 1.0 | - |
| Am05 (Ambraseys et al. 2005) | 4.8 | 3.4 | 68.77 | 36.81 | Stiff soil site, SA = 1.0; normal faulting, FN = 1.0 | Normal faulting, FN = 1.0 |
| SP96 (Sabetta and Pugliese 1996) | 4.1 Ms | 2.0 Ms | 68.77 | 36.81 | Stiff site | - |
| BT03 (Berge-Thierry et al. 2003) | 4.1 Ms | 2.0 Ms | 68.77 (d) | 36.81 | Rock site | - |

(a) Other parameters/factors have been considered null; thus, they are not indicated here
(b) It corresponds to 4.8 Mw (Scordilis 2006)
(c) For relatively small magnitudes (i.e., Mw < 6.0), the distance from the station to the surface projection of the fault rupture can be approximated by the de distance (Ambraseys et al. 1996, 2005)
(d) As shown in Fig. 2a, dh ≈ de
of μi ± σi outputs, for ZS16 in a more refined way. This disaggregation analysis makes it possible to evaluate the weight and effect of a given parameter with respect to the others. This could show the great potential of AI through the k-means algorithm. Stochastic results indicate the distribution error, highlighting that the best components appear on the E and Z axes in soil A, and the E axis in soil B. In general, the k-means algorithm divides the data in a geometrical way; however, under defined constraints and ranges, these divisions can also take on a physical meaning.
3) New seismic inputs for ZS16 in terms of elastic spectra, Sa, have been plotted. Results show that the horizontal Sa values given by the code overestimate the response for soil A. Synthetic spectra obtained from attenuation equations provide a good approximation for a far and relatively strong event (scenario 1) and a poor approximation for a near-weak event (scenario 2).
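As a reminder of how the geometrical division mentioned in conclusion 2 works, the generic k-means iteration (Lloyd's algorithm) can be sketched for scalar data as below. This is not the Wolfram Mathematica implementation used in the paper; the function name, the random initialization, and the sample distances are hypothetical, and the population standard deviation is used for σ since the paper's convention is not stated here.

```python
import random
import statistics

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Lloyd's algorithm on scalar data; returns k clusters as lists of values."""
    rng = random.Random(seed)
    # Start from k distinct data values chosen at random.
    centroids = rng.sample(sorted(set(values)), k)
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            statistics.fmean(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return clusters

# Hypothetical epicentral distances (km); NOT data from the paper.
sample = [30.0, 32.0, 35.0, 70.0, 72.0, 75.0]
clusters = sorted((c for c in kmeans_1d(sample, k=2) if c), key=min)
for c in clusters:
    mu, sigma = statistics.fmean(c), statistics.pstdev(c)
    print(f"mu ± sigma = {mu:.2f} ± {sigma:.2f} km (n = {len(c)})")
```

The split is purely distance-based; whether a cluster corresponds to, for example, near or far events is an interpretation added afterwards under the stated constraints and ranges.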
It is important to highlight that all results are rigorously valid only for ZS16 under the constraints and hypotheses described in the paper. The future integration of geophysical and geotechnical data concerning the input model and database (IGME 2015; Luzi et al. 2020) is expected to improve the analyses and outputs. Based on the results of this study, we hope to carry out a future experimental campaign to compare and validate the proposed model. This will require new research and pose new challenges in this field.
Acknowledgements The first author acknowledges the Itecons Institute, Coimbra, Portugal, for the Wolfram Mathematica license and the University of Coimbra (UC), Portugal, for paying the access rights (when applicable) to download all papers in the references. The first author is also grateful for the support of the Foundation for Science and Technology through funding UIDB/04625/2020 from the research unit CERIS (https://doi.org/10.54499/UIDB/04625/2020).
Funding Open access funding provided by FCT|FCCN (b-on).
Data Availability Data will be made available on request.
Declarations
Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attri-
bution 4.0 International License, which permits use, sharing, adapta-
tion, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons licence, and indicate if changes
were made. The images or other third party material in this article are
included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in
the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a
copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Afshoon I, Miri M, Mousavi SR (2021) Combining Kriging meta models with U-function and K-Means clustering for prediction of fracture energy of concrete. J Build Eng 35:1–16
Almeida JAS, Barbosa LMS, Pais AACC, Formosinho SJ (2007) Improving hierarchical cluster analysis: a new method with outlier detection and automatic clustering. Chemom Intell Lab Syst 87:208–217
Ambraseys NN, Simpson KA, Bommer JJ (1996) Prediction of horizontal response spectra in Europe. Earthq Eng Struct Dyn 25:371–400
Ambraseys NN, Douglas J, Sarma SK, Smit PM (2005) Equations for the estimation of strong ground motions from shallow crustal earthquakes using data from Europe and the Middle East: horizontal peak ground acceleration and spectral acceleration. Bull Earthq Eng 3(1):1–53
Beauval C, Hainzl S, Scherbaum F (2006) Probabilistic seismic hazard estimation in low-seismicity regions considering non-Poissonian seismic occurrence. Geophys J Int 164:543–550
Benito B, Gaspar-Escribano JM (2007) Ground motion characterization and seismic hazard assessment in Spain: context, problems and recent developments. J Seismol 11:433–452
Berge-Thierry C, Cotton F, Scotti O (2003) New empirical response spectral attenuation laws for moderate European earthquakes. J Earthq Eng 7(2):193–222
Chen L, Shan W, Liu P (2021a) Identification of concrete aggregates using K-means clustering and level set method. Structures 34:2069–2076
Fig. 9 Synthetic spectra for (a) a far and relatively strong event (scenario 1); (b) a near-weak event (scenario 2)
Chen W, Wang X, Cai Z, Liu C, Zhu Y, Lin W (2021b) DP-GMM clustering-based ensemble learning prediction methodology for dam deformation considering spatiotemporal differentiation. Knowl-Based Syst 222:1–16
Daszykowski M, Walczak B, Massart DL (2001) Looking for natural patterns in data: part 1. Density-based approach. Chemom Intell Lab Syst 56:83–92
Di Giuseppe MG, Troiano A, Troise C, De Natale G (2014) k-means clustering as tool for multivariate geophysical data analysis. An application to shallow fault zone imaging. J Appl Geophys 101:108–115
European Committee for Standardization (CEN) (2004) Eurocode 8: design of structures for earthquake resistance, part 1: general rules, seismic actions and rules for buildings, BS EN 1998-1:2004. Brussels, Belgium
Faccioli E, Paolucci R (2005) Elements of seismology applied to engineering. Pitagora Editrice, Bologna, Italy, p 255
Garcia-Fernandez M, Jimenez MJ, Kijko A (1989) Seismic hazard parameters estimation in Spain from historical and instrumental catalogues. Tectonophysics 167:245–251
Garcia-Mayordomo J, Insua-Arevalo JM (2011) Seismic hazard assessment for the Itoiz dam site (Western Pyrenees, Spain). Soil Dyn Earthq Eng 31:1051–1063
Hu J, Ma F (2021) Comparison of hierarchical clustering based deformation prediction models for high arch dams during the initial operation period. J Civ Struct Health Monit 11:897–914
IGME (2015) ZESIS: Base de Datos de Zonas Sismogénicas de la Península Ibérica y territorios de influencia para el cálculo de la peligrosidad sísmica en España. http://info.igme.es/zesis. Accessed March 2023
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:1–60
Ji K, Wen R, Ren Y, Dhakal YP (2020) Nonlinear seismic site response classification using K-means clustering algorithm: case study of the September 6, 2018 Mw6.6 Hokkaido Iburi-Tobu earthquake, Japan. Soil Dyn Earthq Eng 128:1–14
Ji L, Zhang X, Zhao Y, Li Z (2022) Anomaly detection of dam monitoring data based on improved spectral clustering. J Internet Technol 23:1–11
Kramer SL (1996) Geotechnical earthquake engineering, 1st edn. Prentice-Hall, Upper Saddle River, NJ, p 653
Kuyuk HS, Yildirim E, Dogan E, Horasan G (2012) Application of k-means and Gaussian mixture model for classification of seismic activities in Istanbul. Nonlin Process Geophys 19:411–419
Lanzo G, Silvestri F (1999) Risposta Sismica Locale – Teorie ed Esperienze. Hevelius Editor Srl, Italy, p 159
Lee S, Kim T (2020) Search space reduction for determination of earthquake source parameters using PCA and k-means clustering. J Sens 2020:1–12
Li Y, Min K, Zhang Y, Wen L (2021) Prediction of the failure point settlement in rockfill dams based on spatial-temporal data and multiple-monitoring-point models. Eng Struct 243:1–12
Luzi L, Lanzano G, Felicetta C, D’Amico MC, Russo E, Sgobba S, Pacor F, ORFEUS Working Group 5 (2020) Engineering Strong Motion Database (ESM) (Version 2.0). Istituto Nazionale di Geofisica e Vulcanologia (INGV). https://doi.org/10.13127/ESM.2
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50:159–179
Morissette L, Chartier S (2013) The k-means clustering technique: general considerations and implementation in Mathematica. Tutor Quant Methods Psychol 9:15–24
Ramdani F, Kettani O, Tadili B (2015) Evidence for subduction beneath Gibraltar Arc and Andean regions from k-means earthquake centroids. J Seismol 19:41–53
Rehman K, Burton PW, Weatherill GA (2014) K-means cluster analysis and seismicity partitioning for Pakistan. J Seismol 18:401–419
Sabetta F, Pugliese A (1996) Estimation of response spectra and simulation of nonstationary earthquake ground motions. Bull Seismol Soc Am 86(2):337–352
Salvador S, Chan P (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 15–17 November 2004, Boca Raton, Florida, USA, 1–9
Scordilis EM (2006) Empirical global relations converting Ms and mb to moment magnitude. J Seismol 10:225–236
Shafapourtehrany M, Yariyan P, Ozener H, Pradhan B, Shabani F (2022) Evaluating the application of K-mean clustering in Earthquake vulnerability mapping of Istanbul, Turkey. Int J Disaster Risk Reduct 79:1–23
Shang X, Li X, Morales-Esteban A, Asencio-Cortes G, Wang Z (2018) Data field-based k-means clustering for spatio-temporal seismicity analysis and hazard assessment. Remote Sens 10:1–22
Sheikhhosseini Z, Mirzaei N, Heidari R, Monkaresi H (2021) Delineation of potential seismic sources using weighted K-means cluster analysis and particle swarm optimization (PSO). Acta Geophys 69:2161–2172
Song J, Zhang S, Tong F, Yang J, Zeng Z, Yuan S (2021) Outlier detection based on multivariable panel data and K-means clustering for dam deformation monitoring data. Adv Civ Eng 2021:1–11
Standard for the Exchange of Earthquake Data (SEED) (2012) Reference manual, version 2.4. Incorporated Research Institutions for Seismology (IRIS), USA, p 224
Sugar CA, James GM (2003) Finding the number of clusters in a dataset: an information-theoretic approach. J Am Stat Assoc 98:750–763
Symons MJ (1981) Clustering criteria and multivariate normal mixtures. Biometrics 37:35–43
Tayfur S, Alver N, Abdi S, Saatci S, Ghiami A (2018) Characterization of concrete matrix/steel fiber de-bonding in an SFRC beam: principal component analysis and k-mean algorithm for clustering AE data. Eng Fract Mech 194:73–85
Turco C, Funari MF, Teixeira E, Mateus R (2021) Artificial neural networks to predict the mechanical properties of natural fibre-reinforced compressed earth blocks (CEBs). Fibers 9:1–21
Weatherill G, Burton PW (2009) Delineation of shallow seismic source zones using K-means cluster analysis, with application to the Aegean region. Geophys J Int 176:565–588
Web of Science (WoS), database (2023) https://www.webofscience.com/wos/. Accessed July 2023
Wolfram Mathematica, version 12.0, Wolfram Research, Inc.: Champaign, IL, USA, 2019
Yaghmaei-Sabegh S (2017) A novel approach for classification of earthquake ground-motion records. J Seismol 21:885–907
Yuan R (2021) An improved K-means clustering algorithm for global earthquake catalogs and earthquake magnitude prediction. J Seismol 25:1005–1020
Zacchei E, Brasil R (2022) A new approach for physically based probabilistic seismic hazard analyses for Portugal. Arab J Geosci 15:1–22
Zacchei E, Lyra P (2022) Recalibration of low seismic excitations in Brazil through probabilistic and deterministic analyses: application for shear buildings structures. Struct Concr 2022:1–19
Zacchei E, Molina JL (2022) Probabilistic seismic hazard analysis for Andalusian dams in Southern Spain using new seismogenic zones. ASCE-ASME J Risk Uncertain Eng Syst Part A: Civ Eng 8:1–13