Scientific Reports | (2023) 13:21622 | https://doi.org/10.1038/s41598-023-49080-7
www.nature.com/scientificreports
An ensemble-based machine learning solution for imbalanced multiclass dataset during lithology log generation

Mohammad Saleh Jamshidi Gohari 1, Mohammad Emami Niri 2*, Saeid Sadeghnejad 3 & Javad Ghiasi-Freez 4
The lithology log, an integral component of the master log, graphically portrays the lithological sequence encountered during drilling operations. In addition to offering real-time cross-sectional insights, lithology logs greatly aid in correlating and evaluating multiple sections efficiently. This paper introduces a novel workflow reliant on an enhanced weighted average ensemble approach for producing high-resolution lithology logs. The research contends with a challenging multiclass imbalanced lithofacies distribution emerging from substantial heterogeneities within subsurface geological structures. Typically, methods to handle imbalanced data, e.g., cost-sensitive learning (CSL), are tailored for issues encountered in binary classification. Error correcting output code (ECOC) originates from decomposition strategies, effectively breaking down multiclass problems into numerous binary subproblems. The database comprises conventional well logs and lithology logs obtained from five proximate wells within a Middle Eastern oilfield. Utilizing well-known machine learning (ML) algorithms, such as support vector machine (SVM), random forest (RF), decision tree (DT), logistic regression (LR), and extreme gradient boosting (XGBoost), as baseline classifiers, this study aims to enhance the accurate prediction of underground lithofacies. Once a blind well is designated, the data from the remaining four wells are utilized to train the ML algorithms. After integrating ECOC and CSL techniques with the baseline classifiers, they undergo evaluation. In the initial assessment, both RF and SVM demonstrated superior performance, prompting the development of an enhanced weighted average ensemble based on them. Comprehensive numerical and visual analysis corroborates the outstanding performance of the developed ensemble. An average Kappa statistic of 84.50%, signifying almost-perfect agreement, and mean F-measures of 91.04% emphasize the robustness of the designed ensemble-based workflow during the evaluation of blind well data.
Abbreviations
ML Machine learning
CSL Cost-sensitive learning
ADASYN Adaptive synthetic sampling
ECOC Error correcting output code
SVM Support vector machine
RF Random forest
DT Decision tree
LR Logistic regression
XGBoost Extreme gradient boosting
CNN Convolutional neural networks
DL Deep learning
1Department of Petroleum Engineering, Kish International Campus, University of Tehran, Tehran, Iran. 2Institute
of Petroleum Engineering, School of Chemical Engineering, College of Engineering, University of Tehran, Tehran,
Iran. 3Department of Petroleum Engineering, Faculty of Chemical Engineering, Tarbiat Modares University, Tehran,
Iran. 4Faculty of Mining, Petroleum, and Geophysics, Shahrood University of Technology, Shahrood, Iran. *email:
Emami.m@ut.ac.ir
Content courtesy of Springer Nature, terms of use apply. Rights reserved
M-SMOTE Modified synthetic minority oversampling technique
OVA One-vs.-All
OVO One-vs.-One
Sh Shale
Ls Limestone
argiLs Argillaceous limestone
chkLs Chalky limestones
CGR Computed gamma ray log
SGR Spectral gamma ray log
NPHI Neutron porosity log
RHOB Density log
PE Photoelectric log
DT Sonic log
HD Hamming distance
Mean.K Mean kappa statistics
Mean.F Mean F-measures
L.L. Lithology log
ED Euclidean distance
F(V) Voting function

Subscript and superscript
ci,j Element (i, j) of the coding matrix
s0 Random element
ys0 Decoded vector
wc The weight assigned to class c
Recognizing lithofacies holds significant importance in characterizing subsurface reservoirs. The lithology log, an essential segment of the master log, delineates the sequences encountered in subsurface drilling. This log offers a real-time depiction of the subsurface layers. Utilizing lithology logs proves valuable for correlating and comparing equivalent parts or subsections across various areas. Depending on the geologist's goals, these logs can differ in format and style. Their primary function is to display geological and lithological formations. A lithology log is a visual summary of underground sedimentary rock units. Key benefits of such logs include summarizing extensive data, identifying patterns, and recognizing changes in sedimentary facies by providing an overview of the vertical sequence. Additionally, these logs are appropriate for verifying correlations across sections of the corresponding age in diverse regions, called well-to-well correlation1. In the geo-energy industry, accessing and analyzing lithology logs is challenging for reasons such as the age of drilled wells and mud loss. In such cases, lithology logs are traditionally generated manually by visually correlating logs from nearby wells. Subsurface geological heterogeneities exacerbate this technique's inaccuracy2. Because it relies on the interpreter's skills, the manual method has a relatively long processing time and considerable generalization errors. Aside from that, even experienced interpreters find this method cumbersome and inefficient when dealing with the increasing volume of data.
Additionally, cross-plot characterization can categorize lithofacies from well logs. Typically, well logs are sampled continuously as part of underground exploration. Besides measuring the petrophysical characteristics of subsurface rocks, well logs facilitate understanding lithofacies by revealing changes in lithology, texture, and structure. In light of the rising volume of data, cross-plot characterization also becomes time-consuming and challenging, even for skilled interpreters. In standard well logs, salinity, fluid content, diagenesis, fractures, and clay composition can produce log responses that parallel those of lithology. Moreover, well-log patterns for distinct lithologies, notably their transition subtypes, can be identical. In cross plots, such cases can complicate the problem and make it non-linear. The Exploration and Production industry has focused on machine learning (ML) techniques in light of their potential to handle non-linear issues, the massive volume of data, the need for skilled interpreters, and the generalization errors of manual methods3–10. Developing an ML-based methodology to generate high-resolution lithology logs via conventional well logs and lithology logs from nearby wells may therefore be crucial.
Over the past several decades, researchers have extensively investigated how ML techniques can identify lithofacies from well logs. Unsupervised learning techniques, e.g., expectation-maximization11, K-means clustering12, hierarchical clustering13, self-organizing maps14, and deep autoencoders15, provide only an overall perspective by arranging the lithofacies based on their inherent characteristics. They are helpful in cases where the dataset is limited, i.e., no label is available. In contrast, semi-supervised learning techniques, e.g., positive and unlabeled ML16, active semi-supervised algorithms17, and Laplacian support vector machines (SVM)18, are beneficial when a limited amount of labelled data is accessible. Conversely, supervised learning techniques are applicable when lithofacies are pre-defined in one well and the labels of a second well need to be determined. Several well-known supervised shallow learning algorithms are traditionally employed for lithofacies classification based on well logs labelled by cores. This category encompasses backpropagation neural networks19, SVM20, Bayesian networks21, K-nearest neighbors22, logistic regression (LR)23, decision trees (DT)24, kernel Fisher discriminant analysis25, quadratic discriminant analysis26, Gaussian naive Bayes27, and Bayesian artificial neural networks28. Moreover, homogeneous ensemble techniques, e.g., random forest (RF)29, adaptive boosting30, extreme gradient boosting (XGBoost)31, gradient-boosted DT32, logistic boosting regression, and generalized boosting modeling33, also fall under the same category. Additionally, the integration of RF and XGBoost34, the combination
of artificial neural networks and hidden Markov models35, and the stacked generalization of K-nearest neighbours, DT, RF, and XGBoost22 can be considered heterogeneous ensemble algorithms in the related domain. Such supervised algorithms use geological rules, making lithofacies estimation more trustworthy3. Moreover, researchers have employed several deep learning (DL) algorithms, e.g., convolutional neural networks (CNNs)36, hybrid CNN-long short-term memory networks37, and TabNet38, to classify lithofacies via core-labelled well logs. Nevertheless, many DL applications pay insufficient attention to the significance of sample size, a critical factor for effective lithofacies modeling. Generally, a more complex problem demands more sophisticated and improved algorithms, which, in turn, require more training data. Collecting such a volume of data can take considerable time and effort, making the process infeasible. To address the sample-size dilemma in lithofacies classification tasks, transfer learning, which uses DL models trained on large amounts of data, has emerged as a solution3. Transfer learning, however, requires access to a large volume of data similar or related to the dataset of the upcoming problem. Such data sources can occasionally be located, but this is not always the case. Alternatively, ensemble learning combines several baseline models into a larger one with more robust performance than each model individually. Furthermore, combining diverse baseline models reduces the risk of overfitting in ensemble learning. Many fields and domains have benefited from ensemble learning, often outperforming single models39,40. The selection of baseline classifiers in ensemble techniques leads to different designs. Two methodologies, homogeneous and heterogeneous ensembles, generate multiple classifiers based on their structure. Homogeneous ensembles, e.g., RF and bagging41, comprise similar baseline classifiers that utilize different datasets. The major limitation of homogeneous systems is generating diversity using a single algorithm. In contrast, heterogeneous ensembles, e.g., voting42 and stacking43, consist of several different baseline classifiers trained on a single dataset44. Research has proven that heterogeneity in base classifiers contributes to developing more accurate, robust, and scalable ensemble models45. Ensemble methods provide a means to handle non-linear, intricate, and multi-dimensional geoscience data46,47.
As mentioned above, researchers have to date utilized several supervised shallow/deep algorithms to determine the correspondence between multiple varieties of well logs (as input) and lithofacies derived from core data or well logs (i.e., electrofacies) (as target) and then used the resultant correlation to locate lithofacies in uncored intervals/wells. This research, in contrast, focuses on designing a robust and scalable heterogeneous ensemble-based workflow for lithofacies modelling using lithology logs as the target. Nevertheless, several significant drawbacks can be found in nearly all ML/ensemble-based paradigms for lithofacies classification, mainly (1) their scalability constraints and (2) their disregard of multiclass imbalances in data. The investigation attempts to overcome the first drawback by utilizing a blind well dataset from an oilfield with pronounced geological heterogeneity. Regarding the second drawback, subsurface geological heterogeneities place lithofacies modelling problems in the spotlight in various real-world scenarios with multiclass imbalanced data classification difficulties. Due to their focus on accuracy, traditional classifiers encounter performance challenges when confronted with class imbalance, leading to neglect of the minority class or classes. Moreover, conventional ML algorithms such as SVM, primarily devised for binary classification tasks, often demand adjustments to attain optimal performance in multiclass scenarios48. Furthermore, most standard tactics for combating imbalanced data, e.g., cost-sensitive learning (CSL)49, adaptive synthetic sampling (ADASYN), and the modified synthetic minority oversampling technique (M-SMOTE) (as resampling techniques)50, are designed for binary issues and fail to adapt directly to situations with multiple classes. However, in some research, e.g., Liu and Liu37 and Zhou et al.32, tactics for combating imbalanced binary data have been directly implemented for imbalanced multiclass lithofacies classification situations. We utilized decomposition techniques to extend imbalanced binary data combat tactics and binary-based ML algorithms (e.g., SVM) to multiclass environments. As part of these techniques, the original datasets are broken down into binary sets by a divide-and-conquer procedure. Consequently, multiple classifiers are required, each responsible for a specific binary problem. Decomposition strategies are divided into two main categories, i.e., One-vs.-All (OVA) and One-vs.-One (OVO). When there are k classes in a problem, OVA compares each class with the others using k binary classifiers. Alternatively, OVO uses k(k−1)/2 binary classifiers to differentiate between class pairs in k-class problems3. These binary classifier architectures can be significantly improved using error correcting output code (ECOC)51. Furthermore, by under-sampling the majority samples or over-sampling the minority observations, resampling techniques seek to balance data. Nevertheless, these methods are likely to exclude some relevant information or even raise the processing rates of irrelevant samples. Under-sampling techniques (e.g., one-sided selection52) and over-sampling algorithms (e.g., borderline synthetic minority oversampling53) alter the class distribution. In return, CSL considers the costs of misclassifying samples49. Additionally, there are other options available in this situation besides class decomposition; the research therefore also uses ad-hoc approaches designed to learn directly from the dataset54.
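The classifier counts implied by the two decomposition schemes can be checked directly, e.g., with scikit-learn's wrappers. This is a minimal sketch; scikit-learn, the base learner, and the toy data are assumptions, since the paper's own implementation is not specified here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Toy k = 4 class problem standing in for a lithofacies dataset.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ova.estimators_))  # OVA trains k = 4 binary classifiers
print(len(ovo.estimators_))  # OVO trains k(k-1)/2 = 6 binary classifiers
```

OVA grows linearly with the class count, while OVO grows quadratically, which matters once many lithofacies classes are present.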
In this study, we aim to develop a scalable ensemble-based workflow to generate high-resolution lithology logs reliably and automatically. We address two challenging topics: (1) the scalability of the designed workflow and (2) the analysis of the multiclass imbalanced dataset. The initial obstacle is overcome using a blind well dataset from an oilfield with complex heterogeneous conditions. Besides ad-hoc strategies, combining decomposition techniques with binary imbalanced data combat tactics is crucial in addressing the second concern. In this investigation, a heterogeneous ensemble model is designed and compared with baseline classifiers as popular algorithms in lithofacies classification research.
Methodology
General workflow
Figure 1 demonstrates an overview of the proposed high-resolution lithology log generation workflow, consisting of three main subsections: Workflows 1, 2, and 3. Following data collection and preprocessing, the dataset is partitioned into training, testing, and blind verification datasets. Workflow 1 evaluates the interaction of the baseline
classifiers with the synergy of decomposition techniques and binary imbalanced data handling methods. Through Workflow 2, the baseline classifiers are coupled with ad-hoc approaches. Finally, after the training and evaluation of all baseline classifiers, an enhanced weighted average ensemble of outstanding classifiers is integrated with superior synergies/ad-hoc tactics in Workflow 3.
Multiclass imbalanced learning
Even though minority classes are rare, they frequently provide vital knowledge and crucial learning content. This section addresses two main challenges: (1) the usability of standard ML algorithms and (2) the feasibility of conventional binary imbalanced data combat tactics for solving multiclass imbalance issues. A widely accepted methodology to simultaneously address both obstacles involves dividing the multiple-class modelling issue into several binary subproblems through ECOC, OVA, and OVO as decomposition strategies. This investigation focuses on the ECOC encoding process due to its functionality (in contrast to OVO/OVA), specifically regarding class overlap caused by the vicinity of the classes' spectra and influenced by their spatial positions. By exploiting ECOC, it is possible to apply standard ML algorithms and binary imbalanced data combat strategies to the multiclass imbalance concern at hand. However, several studies have concentrated on an overall framework that develops ad-hoc methods like Static-SMOTE55 instead of modifying conventional techniques for handling binary imbalanced data in the multiclass context. Ad-hoc approaches are generally limited to several specific types of research and are not very general. Additionally, CSL can handle an imbalanced binary class56,57. CSL proves more effective than sampling techniques (e.g., M-SMOTE) for imbalanced varieties58. Unlike sampling methods, CSL maintains the original distribution of the data59. As a result, due to CSL's capabilities, this paper focuses on its ability to address imbalanced data challenges. In the current research, the existing imbalanced multiclass problem is decomposed into binary subsets through the ECOC technique. Then, strategies for dealing with imbalanced binary data are implemented to address them. Additionally, the study utilizes Static-SMOTE as an ad-hoc tactic to highlight the efficiency of the proposed technique.
Error correcting output code concept
Theoretically, encoding and decoding are the two phases involved in ECOC schemes. Encoding produces a coding matrix, while decoding places every unidentified instance in the most similar class. An $N \times m$ coding matrix has an element $c_{i,j}$ in the ith row ($c_i$) and jth column. The ith class and the jth column are symbolized by $cla_i$ and $col_j$, respectively. The coding matrix must meet five specifications simultaneously. Initially, every row ought to include either a '+1' or a '−1'; if not, the relevant class cannot be identified during training:

(1)   $\sum_{j=1}^{m} \left| c_{i,j} \right| \neq 0, \quad \forall i \in [1, N]$

Secondly, to provide training examples for each group, all columns must include a '+1' or a '−1':

(2)   $\sum_{i=1}^{N} \left| c_{i,j} \right| \neq \left| \sum_{i=1}^{N} c_{i,j} \right|, \quad \forall j \in [1, m]$

The third rule is to avoid having duplicate overlapping columns:

(3)   $\sum_{i=1}^{N} \left| c_{i,j} - c_{i,l} \right| \neq 0, \quad \forall j, l \in [1, m], \; j \neq l$

As a fourth rule, no two rows should be alike:

(4)   $\sum_{j=1}^{m} \left| c_{i,j} - c_{l,j} \right| \neq 0, \quad \forall i, l \in [1, N], \; i \neq l$

Lastly, no pair of columns should have a reverse correlation:

(5)   $\sum_{i=1}^{N} \left| c_{i,j} + c_{i,l} \right| \neq 0, \quad \forall j, l \in [1, m], \; j \neq l$

Figure 1. An overview of the proposed workflow.
Every dichotomizer evaluates a random element $s_0$ during the decoding process, which forms the decoded vector $y_{s_0}$. Typically, the Hamming distance (HD) is applied to assess the similarity of $y_{s_0}$ with each $c_i$, and $s_0$ is allocated to the class $cla_o$ exhibiting the most similarity:

(6)   $HD\left(y_{s_0}, c_i\right) = \sum_{j=1}^{m} \left( 1 - \mathrm{sign}\left( y_{s_0,j} \cdot c_{i,j} \right) \right)$

(7)   $o = \arg\min_{i \in \{1, \dots, N\}} HD\left(y_{s_0}, c_i\right)$

In this case, $y_{s_0,j}$ refers to the jth item in $y_{s_0}$. In cases where soft outcomes are required, the Euclidean distance (ED) is applied instead of HD, which is restricted to crisp (+1/−1) results:

(8)   $ED\left(y_{s_0}, c_i\right) = \sqrt{\sum_{j=1}^{m} \left( y_{s_0,j} - c_{i,j} \right)^2}$

Data-independent and data-dependent strategies can be used to produce optimum coding matrices. The former generates coding matrices without considering the samples' distribution; OVA and OVO are subsets of this approach. Due to the predetermined nature of the coding matrices in this category, they cannot be applied to a wide range of datasets with satisfactory results. In contrast, the latter method creates coding matrices considering the numerical distributions, of which Data-Driven ECOC is one category. Because its coding matrices fit the sample distributions better, it typically provides superior classification performance60.
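The Hamming and Euclidean decoding rules of Eqs. (6)–(8) can be sketched numerically. The 4-class coding matrix below is a hypothetical example for illustration, not one taken from the paper:

```python
import numpy as np

# Hypothetical 4-class coding matrix C (rows = classes, columns = dichotomizers),
# entries in {+1, -1}; a real data-driven ECOC would search for this matrix.
C = np.array([[+1, +1, +1, -1, -1],
              [-1, +1, -1, +1, -1],
              [+1, -1, -1, -1, +1],
              [-1, -1, +1, +1, +1]])

def hamming_decode(y_s0, C):
    """Assign y_s0 to the row of C minimizing sum_j (1 - sign(y_j * c_ij)),
    i.e., the Hamming-style rule of Eqs. (6)-(7)."""
    d = np.sum(1 - np.sign(y_s0 * C), axis=1)
    return int(np.argmin(d))

def euclidean_decode(y_s0, C):
    """Soft decoding via the Euclidean distance of Eq. (8)."""
    d = np.sqrt(np.sum((y_s0 - C) ** 2, axis=1))
    return int(np.argmin(d))

y = np.array([+1, -1, -1, -1, +1])   # outputs of the five dichotomizers
print(hamming_decode(y, C))          # -> 2
```

Row 2 of C reproduces the dichotomizer outputs exactly, so both decoders assign the instance to class 2.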
Cost-sensitive learning method
In analyzing data, the CSL tactic refers to a learning approach that considers misclassification costs; its objective is total cost minimization. Under CSL procedures, such as the MetaCost approach, various classes pay varying costs to address class imbalance challenges. CSL can be used to handle the costs associated with unfair misclassifications and class imbalances. CSL consists of two distinct groups. The primary group develops classifiers that are independently cost-sensitive. In the second group, a "wrapper" is designed that converts current cost-insensitive classifiers to cost-sensitive ones61. Due to its ability to convert a wide range of cost-insensitive classifiers to cost-sensitive ones, the present study applies an instance-based weighting scheme from the second group. Adjusting class weights is one of the most straightforward ways to increase an algorithm's sensitivity to the minority class or classes (particularly in models that incorporate class weights). Logically, penalties for the misclassification of distinct categories correspond with class weights: a class with a higher weight is subject to higher penalties for misclassification than classes with a lower weight. There are several options for setting class weights. This investigation utilizes the following equation as a balanced heuristic for class-weight determination:

(9)   $w_c = \frac{N}{k \cdot |c|}$

where $w_c$ refers to the weight assigned to class c, $N$ denotes the number of samples within the dataset, $k$ stands for the class count within the dataset, and $|c|$ represents the sample count for class c62.
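Equation (9) matches the "balanced" class-weight heuristic found in common ML libraries. A minimal sketch, with illustrative facies labels:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Balanced heuristic of Eq. (9): w_c = N / (k * |c|), where N is the
    number of samples, k the number of classes, and |c| the size of class c."""
    counts = Counter(labels)
    N, k = len(labels), len(counts)
    return {c: N / (k * n) for c, n in counts.items()}

# Imbalanced toy labels: 6 'Ls', 3 'Sh', 1 'chkLs' (N = 10, k = 3).
y = ['Ls'] * 6 + ['Sh'] * 3 + ['chkLs']
w = balanced_class_weights(y)
print(w)  # roughly {'Ls': 0.56, 'Sh': 1.11, 'chkLs': 3.33}
```

The rare class receives a weight inversely proportional to its frequency, so misclassifying it is penalized more heavily; scikit-learn's `class_weight='balanced'` computes the same quantity.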
Baseline classifiers
SVM, DT, RF, LR, and XGBoost are the selected baseline classifiers. The selection of these algorithms was deliberate, aiming to leverage the diverse strengths of each model for addressing various aspects of the research problem. Indeed, a diverse array of baseline algorithms, including linear, non-linear, homogeneous ensemble, and tree-based methods, provides varied learning strategies for the available dataset. SVM handles complex boundaries well. It uses a hyperplane to divide n-dimensional attribute vectors into two classes. Kernel functions are utilized to train the SVM algorithm, facilitating the transformation of feature vectors into higher-dimensional domains. After that, a convex optimization approach is adopted to solve the ML task. According to the maximum-margin hyperplane, every incoming instance should fit logically into either of the categories. The support vectors are the data points nearest to the hyperplane that divides the classes63. Additionally, DT offers interpretability and enables analysts to create intelligent forecasting classifiers. A DT allows users to estimate an object's value based on gathered data. In light of a set of relevant decisions, a DT illustrates potential scenarios. As a result of this approach, users can weigh various decision alternatives, their costs, probabilities, and the importance of every option. This study implements a classification and regression tree training procedure. The procedure facilitates classification and regression tasks by utilizing discrete or continuous parameters. Classification and regression trees split each node into exactly two branches64. The classification task can also be conducted using RF, which provides robustness through ensemble learning. The model generates multiple DTs (a forest) during the training process. When performing classification tasks, the model returns the class corresponding to the mode of the classes. Moreover, this approach mitigates the risk of overfitting inherent in DTs65. LR is another ML algorithm primarily designed for predicting class membership, in which the objective is to estimate the probability that an instance falls into a particular class66. LR offers simplicity and is adequate for binary classification tasks. Moreover, XGBoost is a popular ML algorithm suitable for tabular data, ensuring high performance and scalability. With XGBoost, it is possible to detect complex numerical correlations between the
measured parameters and the desired model. This method combines conventional regression and classification trees alongside analytic boosting algorithms. XGBoost details are available in Raihan et al.67. Table 1 outlines the hyperparameters obtained through hyperparameter tuning for the baseline classifiers. These specific parameters were carefully chosen following preliminary experiments and subsequent fine-tuning conducted through grid search and cross-validation. This iterative process aimed to attain optimal performance while mitigating the risk of overfitting.
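The grid-search-with-cross-validation tuning described above can be sketched, e.g., for the RBF-kernel SVM. The search ranges below are illustrative assumptions built around the Table 1 values, not the authors' actual grids:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic multiclass data standing in for the well-log features.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Illustrative grid; macro-averaged F1 respects minority classes better
# than plain accuracy on imbalanced data.
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [1.0, 8.0, 64.0],
                                'gamma': [0.001, 0.01, 0.1]},
                    cv=5, scoring='f1_macro')
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern applies to the DT, RF, LR, and XGBoost grids.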
Voting ensemble classifier
Voting ensembles combine the estimates of several distinct classifiers. This technique improves the performance of the individual classifiers in an ensemble, ideally outperforming any single algorithm. Pooling forecasts across different algorithms enables the creation of a voting ensemble applicable to regression and classification problems. During classification, estimates for each label are added together, and the majority-vote label is determined. Suppose $N$ classifiers are chosen and identified by $S_1, \dots, S_N$, with $R = \{S_i : i = 1, 2, 3, \dots, N\}$. In the case of $M$ output classes, the ensemble voting algorithm determines how to combine the classifiers $S_i$ by voting $V$ to optimize the $F(V)$ function. $V$ is represented by an array with dimensions $N \times M$, where $V_{i,j}$ indicates the weight of the ith classifier's vote for the jth class. As a general rule, the more confident a classifier is, the greater the weight allocated, while the more uncertain a classifier is, the lower the weight assigned. $V_{i,j} \in [0, 1]$ represents the level of assurance the ith classifier has for the jth class. Combination rules use these weights to combine the predicted outcomes of the classifiers. There are two approaches to predicting the majority vote for classification: hard voting and soft voting. Hard voting involves counting the votes for each class label and predicting the one with the most votes. Soft voting involves summing the probability estimates of each class label, and the predicted class label is the one with the highest total probability. Voting ensembles are recommended when all models in an ensemble are predominantly in consensus or have similarly exemplary performance. They are particularly beneficial whenever several fits of identical baseline classifiers with various hyperparameters are combined68. The voting ensemble is limited in that it considers all algorithms equally, i.e., each model contributes identically to forecasting. To address this issue, an extension of the voting ensemble applies weighted averaging or weighted voting of the collaborating algorithms.
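Hard and soft voting can be sketched, e.g., with scikit-learn's VotingClassifier; the base models and toy data here are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
              ('svm', SVC(kernel='rbf', probability=True, random_state=0))]

hard = VotingClassifier(estimators, voting='hard').fit(X, y)  # majority label
soft = VotingClassifier(estimators, voting='soft').fit(X, y)  # summed probabilities
print(hard.predict(X[:3]), soft.predict(X[:3]))
```

Soft voting requires probability estimates from every base model, hence `probability=True` for the SVM.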
Enhanced weighted average ensemble method
This paper applies the enhanced weighted average ensemble model69 to classify multiclass imbalanced data. These ensembles have shown their effectiveness, accuracy, reliability, and robustness in addressing complex pattern recognition challenges70. Baseline classifiers that are more skilled than others are given additional weight in this method. The algorithm modifies voting ensembles, in which all models are deemed equally qualified and contribute identically to predictions. Each baseline classifier is assigned a weight to determine the amount of its contribution. Finding appropriate weights is a challenge for such algorithms. Optimum weights result in efficiency superior to ensembles based on equal weights and to the individual baseline classifiers. The present study utilizes a grid search strategy, assigning weights from the range [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] to each baseline classifier. This approach effectively optimizes the assigned weights, addressing the challenge. Additionally, the research utilizes soft and hard estimators for voting.
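A minimal sketch of the weight search: each baseline classifier is fitted once, and the grid of weights from the range above is scanned over the weighted average of their class probabilities. RF and SVM are paired here because the paper's final ensemble builds on them; scikit-learn and the toy data are assumptions:

```python
import itertools
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {'rf': RandomForestClassifier(n_estimators=50, random_state=0),
          'svm': SVC(kernel='rbf', probability=True, random_state=0)}
probas = {name: m.fit(X_tr, y_tr).predict_proba(X_te) for name, m in models.items()}

# Grid-search per-classifier weights over the paper's range 0.0, 0.1, ..., 1.0.
weight_range = [round(0.1 * i, 1) for i in range(11)]
best_score, best_weights = -1.0, None
for w_rf, w_svm in itertools.product(weight_range, repeat=2):
    if w_rf == 0.0 and w_svm == 0.0:
        continue  # at least one classifier must contribute
    avg = w_rf * probas['rf'] + w_svm * probas['svm']  # weighted soft vote
    score = (avg.argmax(axis=1) == y_te).mean()        # held-out accuracy
    if score > best_score:
        best_score, best_weights = score, (w_rf, w_svm)
print(best_weights, round(best_score, 3))
```

Because the equal-weight combination is itself a point on the grid, the selected weights can never score worse than a plain soft-voting ensemble on the evaluation set.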
Case study
One of the Middle East oil fields is selected as a case study. Geologically, the field lies in the transition zone between the highly folded Zagros region and the stable Arabian platform. The explored subsurface formations are Gurpi, Ilam, Laffan, Sarvak, and Kazhdumi, whose predicted strata are as follows:

1. The Gurpi Formation comprises a sequence of Shale (Sh), Limestone (Ls), and Argillaceous Limestone (argiLs) stratigraphically associated with the Ilam Formation (at the top section).
2. The Ilam Formation is composed mainly of yellow to grey-brown Ls containing glauconite alongside trace quantities of hydrocarbons. Oolitic Ls appears frequently intermingled with Ls. There are traces of Sh segments in its lower part and evidence of hydrocarbons. Sh sequences, secondary Ls, and hydrocarbon remains occur in the top position.
3. There are greyish to emerald-ash Sh layers with fine inclusions of white Ls in the Laffan Formation (roughly 10 m thick).
Table 1. Hyperparameters of baseline classifiers.

Baseline classifier   Hyperparameters
SVM                   Kernel: Radial Basis Function (RBF); C (regularization parameter): 8.0; Gamma: 0.001
DT                    Criterion: Gini impurity; Max depth: 5; Min samples split: 5
RF                    Number of estimators: 128; Max depth: 8; Max features: 'sqrt'
LR                    Solver: 'liblinear'; Regularization: L2; C (regularization parameter): 10.0
XGBoost               Number of boosting rounds: 100; Learning rate: 0.1; Max depth: 3; Objective function: binary logistic regression
4. The Sarvak Formation's lower lithotype contains numerous Sh layers and hydrocarbon residues. The remainder consists predominantly of grey Chalky Limestones (chkLs), light grey to white chkLs, and dark brown to pale brown Cherty Ls. Regional Sh accompanies these Lss.
5. The Kazhdumi Formation generally consists of dark black and dark brown Sh and pyritic Ls, rich in dark grey to pale ash and dark brown Sh-Ls.
Dataset
The dataset consists of computed gamma ray [CGR (GAPI)], spectral gamma ray [SGR (GAPI)], neutron porosity [NPHI (V/V)], photoelectric factor [PE (B/E)], density [RHOB (G/C3)], sonic [DT (US/F)], and lithology logs. Data from five wells, identified as W-01 to W-05, exist within the study area. Figure 2a demonstrates the geographical positions of the wells in the area under investigation. W-03 is selected as the blind well based on its geographical location and data range coverage. The ML algorithms are trained using data from the other four wells. As an illustration, Fig. 2b shows the conventional well logs and lithology log for W-02. Figure 3a–g display the distributions of the input features (CGR, SGR, DT, NPHI, PE, RHOB) and the target feature (Facies), respectively. Figure 3g illustrates a substantial imbalance within the input data.
Data preparation and class differentiation
In this subsection, the data undergo a check for missing values and outliers after encoding categorical features (such as facies names, well identifiers, and formations) into dummy variables. An error in a dataset can take many forms, for example, duplicate rows or weak columns. While refining the available data, columns with only a single value or low variance and rows containing repeated observations are identified and eliminated. Additionally, unnecessary columns are eliminated based on the correlation between different features. Furthermore, the distribution of the available datasets necessitated standardization. Before presentation as input to the ML algorithms, the data are standardized to zero mean and unit variance71.
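The zero-mean, unit-variance standardization can be done with scikit-learn's StandardScaler; a minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix standing in for the well-log inputs
X = np.array([[45.1, 2.45], [60.3, 2.61], [30.2, 2.70], [55.7, 2.38]])

# Fit on training data only, then reuse the same scaler on test/blind data
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)
# Each column now has (approximately) zero mean and unit variance
```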
Figure 2. (a) The geographic positions of the wells in the area under investigation, and (b) conventional well logs, lithology log, and a legend map for W-02 as an illustrative example.
However, complications such as drilling fluid disturbance or drill bit balling during lithology log recording can occur, making it challenging to separate different facies. Before training the classifier, the preprocessing stage therefore aims to achieve a high level of separation between classes. This goal is accomplished using linear discriminant analysis as a noise reduction technique72 with 97% accuracy. Through stratified sampling73, the input data are divided between training (75%) and testing (25%) sets to account for the problem of data imbalance; thus, both sets have a proportional representation of each class.
Figure 3. Distribution of input features including (a) CGR, (b) SGR, (c) DT, (d) NPHI, (e) PE, and (f) RHOB, alongside (g) Facies as the target feature.
Results and discussion
The study initiates with Workflow 1 (see Fig. 2), aimed at assessing the baseline classifiers while exploring synergies between the decomposition strategy and various tactics tailored for handling imbalanced binary data. This phase is crucial for pinpointing noteworthy interactions. Furthermore, Workflow 2 combines optimal baseline classifiers with customized ad-hoc methods. Subsequently, Workflow 3 introduces an enhanced weighted average ensemble that merges the most effective baseline classifiers. This ensemble is then integrated with superior synergies or ad-hoc techniques for an improved performance assessment. The assessment of imbalanced multiclass classification presents a challenge because widely used measures for evaluating classifiers' outputs, such as accuracy, are built upon assumptions of balanced distributed data. Previous studies have proposed mean Kappa statistics (Mean. K) and mean F-measures (Mean. F) to assess imbalanced situations74–76. The Landis and Koch grouping is commonly utilized for interpreting Kappa statistic values, where the ranges correspond to different levels of agreement: 0% (poor); 0–20% (slight); 21–40% (fair); 41–60% (moderate); 61–80% (substantial); and 81–100% (almost-perfect)77. For a detailed explanation of the Kappa statistic and F-measure for imbalanced multiclass classification, refer to Jamshidi Gohari et al.3. Developing lithology log generation within the Google Colaboratory platform involves various libraries, including PyTorch, Pandas, NumPy, Matplotlib, mpl_toolkits, and scikit-learn in Python 3.11.5. Additionally, the experiments ran on an Intel Core i7-11370H with 16 GB of RAM.
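Both metrics are available in scikit-learn; a minimal sketch of this kind of evaluation (the per-class F-measures are averaged with a macro mean here, which is one common reading of a mean F-measure; the paper's exact averaging may differ):

```python
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 3]   # imbalanced multiclass labels
y_pred = [0, 0, 1, 1, 1, 2, 3, 3]

kappa = cohen_kappa_score(y_true, y_pred)            # chance-corrected agreement, ~0.667
f_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes, ~0.733
```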
Synergy between ECOC and binary imbalanced data combat tactics
This subsection describes, through Workflow 1, how ECOC and binary imbalanced data combat tactics interact with the baseline classifiers. Static-SMOTE results are reported as part of Workflow 2. Table 2 presents average outcomes and rankings based on the average of 20 runs. The t-index represents test marks, whereas the b-index indicates blind evaluation scores. One section covers the ad-hoc approach, and the other presents the ECOC scheme. Each technique is ranked separately for a given unit in the "Rank" column. The highest marks are indicated in bold font. Furthermore, the basic version of the algorithms (i.e., Base and Std) is implemented to verify the results. Table 2 supports the following findings. When combined with ECOC and CSL as a component of Workflow 1, SVM produced the most accurate results (Rankb = 1). The effectiveness of this procedure manifested itself in a Mean. Fb of 86.87% and a Mean. Kb of 78.04% for the blind well dataset. ECOC-CSL is numerically better than ECOC-M-SMOTE or Static-SMOTE. In addition, coupling RF with the synergy of ECOC and CSL yielded a Mean. Fb of 86.28% and a Mean. Kb of 77.29% as a component of Workflow 1 (Rankb = 2). In this particular combination, when paired with RF, ECOC-CSL demonstrates superior numerical performance compared to other methods, thereby affirming its overall functionality. When examining the ECOC-CSL-SVM (Rankb = 1) and ECOC-CSL-RF (Rankb = 2) outputs, it becomes apparent that the former exhibits a higher level of proficiency. However, both perform satisfactorily on blind well data evaluation. Therefore, developing an enhanced weighted average ensemble that combines these two synergies from Workflow 1 may result in superior performance.

Table 2. Mean classifier test and blind well assessment outcomes (using a 20-run average) for baseline classifiers based on Mean. F and Mean. K (percentage-wise). The t-index signifies test grades, while the b-index denotes ratings from blind evaluations.

Method | Baseline classifier | Adaptation | Mean.Ft | Mean.Fb | Rankb | Mean.Kt | Mean.Kb | Rankb
Ad-hoc | SVM | Base | 93.26 | 82.46 | – | 88.15 | 70.61 | –
Ad-hoc | RF | Base | 92.72 | 81.88 | – | 87.49 | 69.96 | –
Ad-hoc | XGBoost | Base | 90.62 | 78.74 | – | 84.97 | 67.54 | –
Ad-hoc | DT | Base | 88.54 | 76.65 | – | 82.65 | 65.89 | –
Ad-hoc | LR | Base | 84.38 | 71.84 | – | 77.86 | 60.85 | –
Ad-hoc | SVM | Static-SMOTE | 93.33 | 83.58 | 5 | 89.24 | 72.55 | 5
Ad-hoc | RF | Static-SMOTE | 92.58 | 82.75 | 6 | 88.43 | 71.69 | 6
Ad-hoc | XGBoost | Static-SMOTE | 89.98 | 81.42 | 8 | 85.68 | 69.14 | 8
Ad-hoc | DT | Static-SMOTE | 88.99 | 80.68 | 10 | 83.45 | 67.82 | 10
Ad-hoc | LR | Static-SMOTE | 85.04 | 76.11 | 13 | 78.24 | 62.74 | 13
ECOC | SVM | Std | 93.87 | 85.30 | – | 90.03 | 75.03 | –
ECOC | RF | Std | 92.84 | 84.29 | – | 89.12 | 74.08 | –
ECOC | XGBoost | Std | 89.76 | 83.02 | – | 87.45 | 72.88 | –
ECOC | DT | Std | 87.65 | 81.45 | – | 85.94 | 70.86 | –
ECOC | LR | Std | 82.98 | 77.07 | – | 80.85 | 65.87 | –
ECOC | SVM | M-SMOTE | 89.92 | 81.38 | 9 | 83.56 | 68.82 | 9
ECOC | RF | M-SMOTE | 88.97 | 80.24 | 11 | 81.75 | 67.03 | 11
ECOC | XGBoost | M-SMOTE | 86.43 | 77.54 | 12 | 78.54 | 64.72 | 12
ECOC | DT | M-SMOTE | 83.95 | 72.97 | 14 | 77.14 | 62.68 | 14
ECOC | LR | M-SMOTE | 80.87 | 71.95 | 15 | 72.56 | 57.21 | 15
ECOC | SVM | CSL | 94.71 | 86.87 | 1 | 91.37 | 78.04 | 1
ECOC | RF | CSL | 94.09 | 86.28 | 2 | 90.55 | 77.29 | 2
ECOC | XGBoost | CSL | 93.87 | 84.08 | 3 | 89.62 | 75.42 | 3
ECOC | DT | CSL | 93.74 | 83.67 | 4 | 89.48 | 74.14 | 4
ECOC | LR | CSL | 90.32 | 81.54 | 7 | 85.98 | 70.52 | 7
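The ECOC-CSL pairing can be approximated with scikit-learn's OutputCodeClassifier wrapping a cost-sensitive SVM; a sketch on synthetic imbalanced data (class_weight="balanced" is one standard cost-sensitive setting and may differ from the study's cost matrix):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

# Imbalanced 4-class problem standing in for the facies labels
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=0)

# Cost-sensitive binary learner: class_weight="balanced" penalizes
# minority-class mistakes more heavily (the CSL ingredient)
base = SVC(kernel="rbf", class_weight="balanced", random_state=0)

# ECOC decomposition: each class gets a redundant binary codeword;
# code_size > 1 supplies the error-correcting redundancy
ecoc_csl = OutputCodeClassifier(base, code_size=4, random_state=0).fit(X, y)
pred = ecoc_csl.predict(X)
```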
SVM-RF enhanced weighted average ensemble development
In this subsection, the development of an enhanced weighted average ensemble based on the two superior combinations of Workflow 1, i.e., ECOC-CSL-SVM and ECOC-CSL-RF, is reported. The voting scheme consists of two types: soft voting and hard voting. Table 3 presents the average results and rankings across 20 runs. As reported, Workflow 3 provides the best performance, in which the enhanced weighted average ensemble of SVM and RF in soft voting mode is coupled with ECOC-CSL: a Mean. Fb of 91.04% and a Mean. Kb of 84.50%, indicating almost-perfect agreement, is proof of this superiority (Rankb = 1). Tables 2 and 3 illustrate that the enhanced weighted average ensemble of SVM and RF in soft voting mode coupled with ECOC-CSL constitutes the most efficient workflow, henceforth called the optimal workflow. Additionally, a comparison of the confusion matrices of the various workflows (i.e., Workflows 1, 2, and 3) shows that the optimal workflow provides the superior prediction for argiLs, chkLs, Ls, and Sh. Figure 4a,b present the confusion matrices comparing the optimized workflow against an unoptimized approach for evaluating blind well data. It is apparent that the unoptimized workflow exhibits bias towards the majority classes and performs suboptimally in recognizing the minority class, specifically Sh.
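The weighted soft-voting combination of SVM and RF can be sketched with scikit-learn's VotingClassifier (the weights and data here are illustrative; the enhanced ensemble in the study tunes its weights rather than fixing them):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# probability=True lets the SVM contribute class probabilities to soft voting
svm = SVC(probability=True, class_weight="balanced", random_state=0)
rf = RandomForestClassifier(n_estimators=128, max_depth=8, random_state=0)

# Soft voting averages the members' predicted probabilities; the weights
# (illustrative here) can favour the stronger member
ensemble = VotingClassifier(estimators=[("svm", svm), ("rf", rf)],
                            voting="soft", weights=[0.6, 0.4]).fit(X, y)
proba = ensemble.predict_proba(X)
```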
Table 3. Mean classifier test and blind well results (using a 20-run average) for the designed ensemble based on Mean. F and Mean. K (percentage-wise). The t-index signifies test grades, while the b-index denotes ratings from blind evaluations.

Method | Ensemble type | Adaptation | Mean.Ft | Mean.Fb | Rankb | Mean.Kt | Mean.Kb | Rankb
ECOC | Enhanced weighted average ensemble of SVM and RF in soft voting mode | CSL | 94.92 | 91.04 | 1 | 91.70 | 84.50 | 1
ECOC | Enhanced weighted average ensemble of SVM and RF in hard voting mode | CSL | 94.07 | 90.33 | 2 | 90.44 | 83.62 | 2

Figure 4. (a) Confusion matrix of the optimal workflow for blind well data evaluation, and (b) confusion matrix of an unoptimized workflow for blind well data assessment.

Graphical comparative assessment
Figure 5a–d depict the generated lithology log (i.e., Generated LL) for different depth intervals through the optimal workflow from the blind well dataset. The optimal workflow could separate Sh, one of the critical minority classes, from argiLs, chkLs, and Ls according to the peak values in the conventional well logs, especially CGR and SGR. The generated lithology log displays a reasonable similarity to the original one (i.e., Original LL in Fig. 5a–d) in pinpointing the regions where argiLs, chkLs, Ls, and Sh occur. Figure 5b displays the concentrated depth interval (2728–2750 m) for the minority Sh class in the blind well. It shows an excellent correlation among the peak positions of the blind well logs, the Sh positions in the original lithology log, and the generated one. A similar agreement holds for the argiLs, chkLs, and Ls facies, which share overlapping characteristics. Figure 5c highlights the blind well interval of 2450–2600 m, covering the argiLs and Ls facies. Additionally, Fig. 5d shows the depth interval of the blind well for chkLs, Ls, and Sh facies from 3175 to 3300 m. In these figures, the positions of argiLs, chkLs, Ls, and Sh in the generated lithology log reasonably match those in the original one.
Unlike the OVA and OVO approaches, which partition a multiclass modelling problem into a finite number of binary classification tasks, the ECOC algorithm allows any given class to be encoded into an arbitrarily large number of binary classification tasks. This redundant representation enables the additional models to function as "error-correction" predictors, enhancing prediction ability. Furthermore, a significant factor behind the superior CSL performance is that it assigns additional weight to misclassifications of minority classes and imposes a penalty for inaccurate classifications. Thus, these classes receive more attention from the model. This approach compels the model to learn instances from minority classes, making it a potent tool for forecasting occurrences from these classes. Moreover, unlike resampling approaches, CSL maintains the original distribution of the data. The SVM's classification effectiveness can be attributed to the fact that it transforms the initial data into a higher-dimensional space, separating the classes better while maintaining essentially the same computational cost as the initial problem; this feature is referred to as the kernel trick.
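The error-correction idea can be made concrete with a toy Hamming-distance decoder (the code matrix below is illustrative, not the one used in the study):

```python
import numpy as np

# Illustrative 4-class code matrix: each class is a 6-bit codeword, so even
# when a binary learner errs, the nearest codeword is often still correct
code = np.array([[0, 0, 1, 1, 0, 1],
                 [1, 0, 0, 1, 1, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 1, 1, 0, 0, 0]])

received = np.array([1, 0, 0, 1, 1, 1])  # binary learners' outputs, one bit flipped

# Decode by minimum Hamming distance to the class codewords
dists = (code != received).sum(axis=1)
pred_class = int(np.argmin(dists))  # class 1: its codeword differs in only one bit
```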
Furthermore, RF can minimize the impact of an imbalanced sample distribution during classification. This characteristic can enhance the identification efficiency for minority samples. On the other hand, when the ratio of imbalanced observations rises, the classification performance of RF is markedly impaired, making it impossible to train a complete classification algorithm. The current study addressed this drawback by coupling RF with ECOC-CSL. SVM behaved more skillfully than RF under similar conditions (i.e., when combined with the synergy of ECOC-CSL); however, both performed satisfactorily on blind well data evaluation. Designing an enhanced weighted average ensemble aims to maximize efficiency by combining these two models, each with unique advantages. As a result of its reduced error rate and lower variance, the ensemble has improved predictive performance over the individual models (i.e., the baseline classifiers). A single classifier can only represent some of the fundamental characteristics of the data; consequently, combining several primary learners can capture further insight into the data's internal layout and dramatically boost estimation precision.
Figure 5. Lithology log (LL) generated using the optimal workflow for blind well data, illustrating depth intervals: (a) 2351–3399 m, (b) 2728–2750 m, (c) 2450–2600 m, and (d) 3175–3300 m.
In addition, the study seeks to offer a scalable workflow to generate lithology logs or, more broadly, to model lithofacies, not restricted to the regions under investigation. Accordingly, the experiment sought to remedy the deficiencies of conventional procedures and considered multiple factors. Hence, a research site with considerable geological heterogeneity was chosen, highlighting the imbalanced multiclass data issue. The optimal workflow produced superior results in the blind well evaluation; this confirmation through blind well analysis is another indicator of its scalability. Furthermore, given that geological evidence is based on lithology log data, it is crucial to consider its uncertainty sources. Wellbore instabilities (e.g., breakouts and washouts), bit balling, and rheology disturbances can lead to inaccurate data sources. Incorporating LDA as a denoising tool to mitigate these concerns is advisable.
Additionally, the developed strategies for dealing with the multiclass imbalance dilemma manifest uniform performance irrespective of the classifier type. Consequently, the outcomes are comparable throughout, supporting validity. Finally, DL algorithms are more stable than shallow ML techniques, particularly when analyzing noisy and uncertain geoscience datasets. As a result, it is recommended that the geoscience and geo-energy communities collect a global data bank, similar to those developed in image processing, to facilitate transfer learning. Moreover, this investigation primarily focused on several standard imbalanced data combat tactics and ad-hoc techniques. However, considering further alternatives, such as employing tailored loss functions like balanced cross-entropy and focal loss78 for imbalanced lithofacies modelling, is suggested as a reasonable avenue for future research. Last but not least, this study provides a basis for future work in geosciences and engineering dealing with imbalanced multiclass data.
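As a pointer for that future direction, the binary focal loss can be written in a few lines of NumPy (this sketch follows the standard formulation with focusing parameter gamma and balancing factor alpha; it is illustrative, not code from the study):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy, well-classified examples so
    training focuses on hard (often minority-class) samples."""
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing factor
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# A confident correct prediction contributes far less loss than a hard one,
# which is what steers training toward the minority classes
easy = focal_loss(np.array([0.99]), np.array([1]))
hard = focal_loss(np.array([0.55]), np.array([1]))
```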
Conclusion
The current investigation focused on statistically and graphically analyzing high-resolution lithology log generation. A primary emphasis was placed on addressing two significant challenges: multiclass imbalanced data classification and scalability. Three distinct workflows were scrutinized to tackle the former, employing baseline classifiers, a custom ensemble algorithm, and methods tailored for handling multiclass imbalanced data. Addressing the latter challenge involved evaluating these workflows using blind well data from an oilfield characterized by substantial geological variations. The optimal workflow emerged as an enhanced weighted average ensemble of SVM and RF alongside ECOC and CSL. This amalgamation showcased notable strength and reliability, evidenced by a mean Kappa statistic of 84.50%, signifying almost-perfect agreement, and a mean F-measure of 91.04%. These results underscore the optimal workflow's robustness and efficacy in evaluating blind well data. Moreover, the devised ensemble showcased superior performance over commonly employed baseline classifiers in lithofacies classification endeavours. The constructed workflow adeptly handles multiclass imbalanced data with efficiency and logical coherence. Evaluation based on statistical and graphical analyses of the blind well dataset indicated a satisfactory correlation between the generated lithology log and the original one. Additionally, a notable advantage of the proposed workflow lies in its ability to retain the initial data distribution. In summary, the developed workflow presents a versatile solution capable of addressing multiclass imbalance issues within the geo-energy sector, extending beyond lithofacies classification tasks.
Data availability
The corresponding author will make all the data available upon a reasonable request.
Received: 25 October 2023; Accepted: 4 December 2023
References
1. Karimi, A. M., Sadeghnejad, S. & Rezghi, M. Well-to-well correlation and identifying lithological boundaries by principal component analysis of well-logs. Comput. Geosci. 157, 104942 (2021).
2. Zhan, C. et al. Subsurface sedimentary structure identification using deep learning: A review. Earth Sci. Rev. 239, 104370 (2023).
3. Jamshidi Gohari, M. S., Emami Niri, M., Sadeghnejad, S. & Ghiasi-Freez, J. Synthetic graphic well log generation using an enhanced deep learning workflow: Imbalanced multiclass data, sample size, and scalability challenges. SPE J. https://doi.org/10.2118/217466-PA (2023).
4. Masroor, M., Emami Niri, M., Rajabi-Ghozloo, A. H., Sharifinasab, M. H. & Sajjadi, M. Application of machine and deep learning techniques to estimate NMR-derived permeability from conventional well logs and artificial 2D feature maps. J. Pet. Explor. Prod. Technol. 12, 2937–2953 (2022).
5. Sharifinasab, M. H., Niri, M. E. & Masroor, M. Developing GAN-boosted artificial neural networks to model the rate of drilling bit penetration. Appl. Soft Comput. 136, 110067 (2023).
6. Haddadpour, H. & Niri, M. E. Uncertainty assessment in reservoir performance prediction using a two-stage clustering approach: Proof of concept and field application. J. Petrol. Sci. Eng. 204, 108765 (2021).
7. Kolajoobi, R. A., Haddadpour, H. & Niri, M. E. Investigating the capability of data-driven proxy models as solution for reservoir geological uncertainty quantification. J. Petrol. Sci. Eng. 205, 108860 (2021).
8. Mousavi, S.-P. et al. Modeling of H2S solubility in ionic liquids: Comparison of white-box machine learning, deep learning and ensemble learning approaches. Sci. Rep. 13, 7946 (2023).
9. Rezaei, F., Akbari, M., Rafiei, Y. & Hemmati-Sarapardeh, A. Compositional modeling of gas-condensate viscosity using ensemble approach. Sci. Rep. 13, 9659 (2023).
10. Nakhaei-Kohani, R. et al. Solubility of gaseous hydrocarbons in ionic liquids using equations of state and machine learning approaches. Sci. Rep. 12, 14276 (2022).
11. Glover, P. W., Mohammed-Sajed, O. K., Akyüz, C., Lorinczi, P. & Collier, R. Clustering of facies in tight carbonates using machine learning. Mar. Pet. Geol. 144, 105828 (2022).
12. Troccoli, E. B., Cerqueira, A. G., Lemos, J. B. & Holz, M. K-means clustering using principal component analysis to automate label organization in multi-attribute seismic facies analysis. J. Appl. Geophys. 198, 104555 (2022).
13. Emelyanova, I., Peyaud, J.-B., Dance, T. & Pervukhina, M. Detecting specific facies in well-log data sets using knowledge-driven hierarchical clustering. Petrophysics 61, 383–400 (2020).
14. Liu, Z., Cao, J., Chen, S., Lu, Y. & Tan, F. Visualization analysis of seismic facies based on deep embedded SOM. IEEE Geosci. Remote Sens. Lett. 18, 1491–1495 (2020).
15. Liu, X. et al. Deep classified autoencoder for lithofacies identification. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2021).
16. Lan, X., Zou, C., Kang, Z. & Wu, X. Log facies identification in carbonate reservoirs using multiclass semi-supervised learning strategy. Fuel 302, 121145 (2021).
17. Xie, W. & Spikes, K. T. Well-log facies classification using an active semi-supervised algorithm with pairwise constraints. Geophys. J. Int. 229, 56–69 (2022).
18. Li, Z. et al. Semi-supervised learning for lithology identification using Laplacian support vector machine. J. Pet. Sci. Eng. 195, 107510 (2020).
19. Zhang, L. et al. Diagenetic facies characteristics and quantitative prediction via wireline logs based on machine learning: A case of Lianggaoshan tight sandstone, Fuling area, Southeastern Sichuan Basin, Southwest China. Front. Earth Sci. 10, 1018442 (2022).
20. Wood, D. A. Carbonate/siliciclastic lithofacies classification aided by well-log derivative, volatility and sequence boundary attributes combined with machine learning. Earth Sci. Inform. 15, 1699–1721 (2022).
21. Zhao, Z. et al. Lithofacies identification of shale reservoirs using a tree augmented Bayesian network: A case study of the lower Silurian Longmaxi formation in the Changning block, South Sichuan basin, China. Geoenergy Sci. Eng. 221, 211385 (2023).
22. He, M., Gu, H. & Xue, J. Log interpretation for lithofacies classification with a robust learning model using stacked generalization. J. Pet. Sci. Eng. 214, 110541 (2022).
23. Antariksa, G., Muammar, R. & Lee, J. Performance evaluation of machine learning-based classification with rock-physics analysis of geological lithofacies in Tarakan Basin, Indonesia. J. Pet. Sci. Eng. 208, 109250 (2022).
24. Rau, E. G. et al. Applicability of decision tree-based machine learning models in the prediction of core-calibrated shale facies from wireline logs in the late Devonian Duvernay Formation, Alberta, Canada. Interpretation 10, T555–T566 (2022).
25. Dong, S., Zeng, L., Du, X., He, J. & Sun, F. Lithofacies identification in carbonate reservoirs by multiple kernel Fisher discriminant analysis using conventional well logs: A case study in A oilfield, Zagros Basin, Iraq. J. Pet. Sci. Eng. 210, 110081 (2022).
26. Dong, S.-Q. et al. A deep kernel method for lithofacies identification using conventional well logs. Pet. Sci. 20, 1411–1428 (2023).
27. Babasafari, A. A., Campane Vidal, A., Furlan Chinelatto, G., Rangel, J. & Basso, M. Ensemble-based machine learning application for lithofacies classification in a pre-salt carbonate reservoir, Santos Basin, Brazil. Pet. Sci. Technol. https://doi.org/10.1080/10916466.2022.2143813 (2022).
28. Feng, R. A Bayesian approach in machine learning for lithofacies classification and its uncertainty analysis. IEEE Geosci. Remote Sens. Lett. 18, 18–22 (2020).
29. Feng, R. Improving uncertainty analysis in well log classification by machine learning with a scaling algorithm. J. Pet. Sci. Eng. 196, 107995 (2021).
30. Nwaila, G. T. et al. Data-driven predictive modeling of lithofacies and Fe in-situ grade in the Assen Fe ore deposit of the Transvaal Supergroup (South Africa) and implications on the genesis of banded iron formations. Nat. Resour. Res. 31, 2369–2395 (2022).
31. Zheng, D. et al. Application of machine learning in the identification of fluvial-lacustrine lithofacies from well logs: A case study from Sichuan Basin, China. J. Pet. Sci. Eng. 215, 110610 (2022).
32. Zhou, K., Zhang, J., Ren, Y., Huang, Z. & Zhao, L. A gradient boosting decision tree algorithm combining synthetic minority oversampling technique for lithology identification. Geophysics 85, WA147–WA158 (2020).
33. Al-Mudhafar, W. J., Abbas, M. A. & Wood, D. A. Performance evaluation of boosting machine learning algorithms for lithofacies classification in heterogeneous carbonate reservoirs. Mar. Pet. Geol. 145, 105886 (2022).
34. Hou, M. et al. Machine learning algorithms for lithofacies classification of the Gulong shale from the Songliao Basin, China. Energies 16, 2581 (2023).
35. Feng, R. Lithofacies classification based on a hybrid system of artificial neural networks and hidden Markov models. Geophys. J. Int. 221, 1484–1498 (2020).
36. Kim, J. Lithofacies classification integrating conventional approaches and machine learning technique. J. Nat. Gas Sci. Eng. 100, 104500 (2022).
37. Liu, J.-J. & Liu, J.-C. Integrating deep learning and logging data analytics for lithofacies classification and 3D modeling of tight sandstone reservoirs. Geosci. Front. 13, 101311 (2022).
38. Ta, V.-C. et al. Tabnet efficiency for facies classification and learning feature embedding from well log data. Pet. Sci. Technol. https://doi.org/10.1080/10916466.2023.2223623 (2023).
39. Ngo, G., Beard, R. & Chandra, R. Evolutionary bagging for ensemble learning. Neurocomputing 510, 1–14 (2022).
40. Zhang, Q., Tsang, E. C., He, Q. & Guo, Y. Ensemble of kernel extreme learning machine based elimination optimization for multi-label classification. Knowl. Based Syst. 278, 10817 (2023).
41. Klikowski, J. & Woźniak, M. Deterministic sampling classifier with weighted bagging for drifted imbalanced data stream classification. Appl. Soft Comput. 122, 108855 (2022).
42. Tavana, P., Akraminia, M., Koochari, A. & Bagherifard, A. An efficient ensemble method for detecting spinal curvature type using deep transfer learning and soft voting classifier. Expert Syst. Appl. 213, 119290 (2023).
43. Cui, S., Yin, Y., Wang, D., Li, Z. & Wang, Y. A stacking-based ensemble learning method for earthquake casualty prediction. Appl. Soft Comput. 101, 107038 (2021).
44. Mohammed, A. & Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ. Comput. Inform. Sci. 35, 757–774 (2023).
45. Sesmero, M. P., Ledezma, A. I. & Sanchis, A. Generating ensembles of heterogeneous classifiers using stacked generalization. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 5, 21–34 (2015).
46. Dong, S.-Q. et al. How to improve machine learning models for lithofacies identification by practical and novel ensemble strategy and principles. Pet. Sci. 20, 733–752 (2023).
47. Ntibahanana, M., Luemba, M. & Tondozi, K. Enhancing reservoir porosity prediction from acoustic impedance and lithofacies using a weighted ensemble deep learning approach. Appl. Comput. Geosci. 16, 100106 (2022).
48. Huang, C. et al. A feature weighted support vector machine and artificial neural network algorithm for academic course performance prediction. Neural Comput. Appl. 35, 11517–11529 (2023).
49. Ding, Y., Jia, M., Zhuang, J. & Ding, P. Deep imbalanced regression using cost-sensitive learning and deep feature transfer for bearing remaining useful life estimation. Appl. Soft Comput. 127, 109271 (2022).
50. Lui, T. C., Gregory, D. D., Anderson, M., Lee, W.-S. & Cowling, S. A. Applying machine learning methods to predict geology using soil sample geochemistry. Appl. Comput. Geosci. 16, 100094 (2022).
51. Valencia, O., Ortiz, M., Ruiz, S., Sanchez, M. & Sarabia, L. Simultaneous class-modelling in chemometrics: A generalization of partial least squares class modelling for more than two classes by using error correcting output code matrices. Chemom. Intell. Lab. Syst. 227, 104614 (2022).
52. Santos, L. I. et al. Decision tree and artificial immune systems for stroke prediction in imbalanced data. Expert Syst. Appl. 191, 116221 (2022).
53. Leng, Q., Guo, J., Jiao, E., Meng, X. & Wang, C. NanBDOS: Adaptive and parameter-free borderline oversampling via natural neighbor search for class-imbalance learning. Knowl. Based Syst. 274, 110665 (2023).
54. Fernández, A. et al. Learning from Imbalanced Data Sets Vol. 10 (Springer, 2018).
55. Lango, M. & Stefanowski, J. What makes multiclass imbalanced problems difficult? An experimental study. Expert Syst. Appl. 199, 116962 (2022).
56. Volk, O., Ratnovsky, A., Naftali, S. & Singer, G. Classification of tracheal stenosis with asymmetric misclassification errors from EMG signals using an adaptive cost-sensitive learning method. Biomed. Signal Process. Control 85, 104962 (2023).
57. Chamseddine, E., Mansouri, N., Soui, M. & Abed, M. Handling class imbalance in COVID-19 chest X-ray images classification: Using SMOTE and weighted loss. Appl. Soft Comput. 129, 109588 (2022).
58. Zhang, C., Tan, K. C., Li, H. & Hong, G. S. A cost-sensitive deep belief network for imbalanced classification. IEEE Trans. Neural Netw. Learn. Syst. 30, 109–122 (2018).
59. Tang, J., Hou, Z., Yu, X., Fu, S. & Tian, Y. Multi-view cost-sensitive kernel learning for imbalanced classification problem. Neurocomputing 552, 126562 (2023).
60. Yi-Fan, L. et al. A novel error-correcting output codes based on genetic programming and ternary digit operators. Pattern Recognit. 110, 107642 (2021).
61. Wang, Y.-C. & Cheng, C.-H. A multiple combined method for rebalancing medical data with class imbalances. Comput. Biol. Med. 134, 104527 (2021).
62. Young, M. M., Himmelreich, J., Honcharov, D. & Soundarajan, S. Using artificial intelligence to identify administrative errors in unemployment insurance. Gov. Inform. Q. 39, 101758 (2022).
63. Mohammadi, M.-R. et al. Modeling hydrogen solubility in hydrocarbons using extreme gradient boosting and equations of state. Sci. Rep. 11, 17911 (2021).
64. Riazi, M. et al. Modelling rate of penetration in drilling operations using RBF, MLP, LSSVM, and DT models. Sci. Rep. 12, 11650 (2022).
65. Ghazwani, M. & Begum, M. Y. Computational intelligence modeling of hyoscine drug solubility and solvent density in supercritical processing: Gradient boosting, extra trees, and random forest models. Sci. Rep. 13, 10046 (2023).
66. Hartonen, T. et al. Nationwide health, socio-economic and genetic predictors of COVID-19 vaccination status in Finland. Nat. Hum. Behav. 7, 1069–1083 (2023).
67. Raihan, M. J., Khan, M. A.-M., Kee, S.-H. & Nahid, A.-A. Detection of the chronic kidney disease using XGBoost classifier and explaining the influence of the attributes on the model using SHAP. Sci. Rep. 13, 6263 (2023).
68. Khairy, R. S., Hussein, A. & ALRikabi, H. The detection of counterfeit banknotes using ensemble learning techniques of AdaBoost and voting. Int. J. Intell. Eng. Syst. 14, 326–339 (2021).
69. Loganathan, S., Geetha, C., Nazaren, A. R. & Fernandez, M. H. F. Autism spectrum disorder detection and classification using chaotic optimization based Bi-GRU network: An weighted average ensemble model. Expert Syst. Appl. 230, 120613 (2023).
70. Osamor, V. C. & Okezie, A. F. Enhancing the weighted voting ensemble algorithm for tuberculosis predictive diagnosis. Sci. Rep. 11, 14806 (2021).
71. Jamshidi Gohari, M. S., Emami Niri, M. & Ghiasi-Freez, J. Improving permeability estimation of carbonate rocks using extracted pore network parameters: A gas field case study. Acta Geophys. 69, 509–527 (2021).
72. Ma, H., Yan, J., Li, Y., Zhang, C. & Lin, H. Desert seismic random noise reduction based on LDA effective signal detection. Acta Geophys. 67, 109–121 (2019).
73. Yin, X. et al. Strength of stacking technique of ensemble learning in rockburst prediction with imbalanced data: Comparison of eight single and ensemble models. Nat. Resour. Res. 30, 1795–1815 (2021).
74. Doan, Q. H., Mai, S.-H., Do, Q. T. & Thai, D.-K. A cluster-based data splitting method for small sample and class imbalance problems in impact damage classification. Appl. Soft Comput. 120, 108628 (2022).
75. Wernicke, J., Seltmann, C. T., Wenzel, R., Becker, C. & Koerner, M. Forest canopy stratification based on fused, imbalanced and collinear LiDAR and Sentinel-2 metrics. Remote Sens. Environ. 279, 113134 (2022).
76. Zhang, X., Akber, M. Z. & Zheng, W. Predicting the slump of industrially produced concrete using machine learning: A multiclass classification approach. J. Build. Eng. 58, 104997 (2022).
77. Benchoufi, M., Matzner-Lober, E., Molinari, N., Jannot, A.-S. & Soyer, P. Interobserver agreement issues in radiology. Diagn. Interv. Imaging 101, 639–641 (2020).
78. Jiang, G., Yue, R., He, Q., Xie, P. & Li, X. Imbalanced learning for wind turbine blade icing detection via spatio-temporal attention model with a self-adaptive weight loss function. Expert Syst. Appl. 229, 120428 (2023).
Author contributions
MSJG: investigation, visualization, writing-original draft, conceptualization, validation, modeling. MEN: writing-review and editing, methodology, validation, supervision, data curation. SS: writing-review and editing, validation. JG-F: writing-review and editing, validation, methodology.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to M.E.N.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2023