An ensemble-based machine learning solution for imbalanced multiclass dataset during lithology log generation

Mohammad Saleh Jamshidi Gohari1, Mohammad Emami Niri2*, Saeid Sadeghnejad3 & Javad Ghiasi-Freez4
The lithology log, an integral component of the master log, graphically portrays the encountered lithological sequence during drilling operations. In addition to offering real-time cross-sectional insights, lithology logs greatly aid in correlating and evaluating multiple sections efficiently. This paper introduces a novel workflow reliant on an enhanced weighted average ensemble approach for producing high-resolution lithology logs. The research contends with a challenging multiclass imbalanced lithofacies distribution emerging from substantial heterogeneities within subsurface geological structures. Typically, methods to handle imbalanced data, e.g., cost-sensitive learning (CSL), are tailored for issues encountered in binary classification. Error correcting output code (ECOC) originates from decomposition strategies, effectively breaking down multiclass problems into numerous binary subproblems. The database comprises conventional well logs and lithology logs obtained from five proximate wells within a Middle Eastern oilfield. Utilizing well-known machine learning (ML) algorithms, such as support vector machine (SVM), random forest (RF), decision tree (DT), logistic regression (LR), and extreme gradient boosting (XGBoost), as baseline classifiers, this study aims to enhance the accurate prediction of underground lithofacies. Upon recognizing a blind well, the data from the remaining four wells are utilized to train the ML algorithms. After integrating ECOC and CSL techniques with the baseline classifiers, they undergo evaluation. In the initial assessment, both RF and SVM demonstrated superior performance, prompting the development of an enhanced weighted average ensemble based on them. The comprehensive numerical and visual analysis corroborates the outstanding performance of the developed ensemble. The average Kappa statistic of 84.50%, signifying almost-perfect agreement, and mean F-measures of 91.04% emphasize the robustness of the designed ensemble-based workflow during the evaluation of blind well data.
Abbreviations
ML Machine learning
CSL Cost-sensitive learning
ADASYN Adaptive synthetic sampling
ECOC Error correcting output code
SVM Support vector machine
RF Random forest
DT Decision tree
LR Logistic regression
XGBoost Extreme gradient boosting
CNN Convolutional neural networks
DL Deep learning
1Department of Petroleum Engineering, Kish International Campus, University of Tehran, Tehran, Iran. 2Institute
of Petroleum Engineering, School of Chemical Engineering, College of Engineering, University of Tehran, Tehran,
Iran. 3Department of Petroleum Engineering, Faculty of Chemical Engineering, Tarbiat Modares University, Tehran,
Iran. 4Faculty of Mining, Petroleum, and Geophysics, Shahrood University of Technology, Shahrood, Iran. *email:
Emami.m@ut.ac.ir
M-SMOTE Modied synthetic minority oversampling technique
OVA One-vs.-All
OVO One-vs.-One
Sh Shale
Ls Limestone
argiLs Argillaceous limestone
chkLs Chalky limestones
CGR Computed gamma ray log
SGR Spectral gamma ray log
NPHI Neutron porosity log
RHOB Density log
PE Photoelectric log
DT Sonic log
HD Hamming distance
Mean.K Mean kappa statistics
Mean.F Mean F-measures
L.L. Lithology log
ED Euclidean distance
F(V) Voting function

Subscripts and superscripts
c_i,j Element of the coding matrix
s_0 Random element
y_s0 Decoded vector
w_c Weight assigned to class c
Recognizing lithofacies holds significant importance in characterizing subsurface reservoirs. The lithology log, an essential segment of the master log, delineates the sequences encountered in subsurface drilling. This log offers a real-time depiction of the subsurface layers. Utilizing lithology logs proves valuable for correlating and comparing equivalent parts or subsections across various areas. Depending on the geologist's goals, these logs can differ in format and style. Their primary function is to display geological and lithological formations. A lithology log is a visual summary of underground sedimentary rock units. Summarising extensive data, identifying patterns, and recognizing changes in sedimentary facies by creating an overview of the vertical sequence are some of the key benefits of such logs. Additionally, these logs are appropriate for verifying correlations across sections of corresponding age in diverse regions, called well-to-well correlation1. In the geo-energy industry, accessing and analyzing lithology logs is challenging for reasons like the age of drilled wells and mud loss. In such cases, they are traditionally generated manually by visually correlating lithology logs from nearby wells. Subsurface geological heterogeneities exacerbate this technique's inaccuracy2. Due to its reliance on the interpreter's skills, the manual method has a relatively long processing time and considerable generalization errors. Aside from that, even experienced interpreters find this method cumbersome and inefficient when dealing with the increasing volume of data.
Additionally, cross-plot characterization can categorize lithofacies from well logs. Typically, well logs are sampled continuously as part of underground exploration. Besides measuring the petrophysical characteristics of subsurface rocks, well logs facilitate understanding lithofacies by revealing lithology, texture, and structure changes. In light of the rising volume of data, cross-plot characterization also becomes time-consuming and challenging, even for skilled interpreters. Salinity, fluid content, diagenesis, fractures, and clay composition can exhibit log responses that parallel those of lithology in standard well logs. Nevertheless, well-log patterns for distinct lithologies, notably their transition subtypes, can be identical. In cross plots, these cases can complicate the problem and render it non-linear. The Exploration and Production industry has focused on machine learning (ML) techniques in light of their potential to handle non-linear issues, the massive volume of data, the need for skilled interpreters, and manual methods' generalization errors3–10. Developing an ML-based methodology to generate high-resolution lithology logs via conventional well logs and lithology logs from nearby wells may therefore be crucial.
Over the past several decades, researchers have extensively investigated how ML techniques can identify lithofacies from well logs. Unsupervised learning techniques, e.g., expectation-maximization11, K-means clustering12, hierarchical clustering13, self-organizing maps14, and deep autoencoders15, provide only an overall perspective by arranging the lithofacies based on their inherent characteristics. They are helpful in cases where the dataset is limited, i.e., no label is available. In contrast, semi-supervised learning techniques, e.g., positive and unlabeled ML16, active semi-supervised algorithms17, and the Laplacian support vector machine (SVM)18, are beneficial when a limited amount of labelled data is accessible. Conversely, the supervised learning technique is applicable when lithofacies are pre-defined in one well and we need to determine the labels to which the data from a second well belong. Several well-known supervised shallow learning algorithms are traditionally employed for lithofacies classification based on well logs labelled by cores. This category encompasses backpropagation neural networks19, SVM20, Bayesian networks21, K-nearest neighbor22, logistic regression (LR)23, decision tree (DT)24, kernel Fisher discriminant analysis25, quadratic discriminant analysis26, Gaussian naive Bayes27, and Bayesian artificial neural networks28. Moreover, homogeneous ensemble techniques, e.g., random forest (RF)29, the adaptive boosting model30, extreme gradient boosting (XGBoost)31, gradient boosting DT32, logistic boosting regression, and generalized boosting modeling33, also fall under the same category. Additionally, the integration of RF and XGBoost34, the combination
of articial neural networks and hidden Markov models35, and the stacked generalization of K-nearest neigh-
bours, DT, RF, and XGBoost22 can be considered heterogeneous ensemble algorithms in the related domain.
Such supervised algorithms use geological rules, making lithofacies estimation more trustworthy3. Moreover,
researchers have employed several deep learning (DL) algorithms, e.g., convolutional neural networks (CNNs)36,
hybrid CNN-long short-term memory networks37, and TabNet38, to classify lithofacies via core-labelled well logs.
Nevertheless, many DL applications need to pay more attention to the signicance of sample size, a critical factor
for eective lithofacies modeling. Generally, a more complex problem demands more sophisticated and improved
algorithms, which, in turn, request more training data. Collecting such a volume of data can take time and eort,
making the process infeasible. To address the sample size dilemma in lithofacies classication tasks, transfer
learning, which uses DL models trained on large amounts of data, has emerged as a solution3. Transfer learning,
however, requires access to a large volume of data similar to or related to the upcoming problem dataset. It may
be possible to locate such data sources occasionally, but this may only sometimes be true. Alternatively, ensemble
learning involves combining several baseline models into a larger one with more robust performance than each
model individually. Furthermore, combining diverse baseline models reduces overtting risk in ensemble learn-
ing. Many elds and domains have beneted from ensemble learning, oen outperforming single models39,40. e
selection of baseline classiers in ensemble techniques results in dierences. Two methodologies, homogeneous
and heterogeneous ensembles, generate multiple classiers based on their structure. Homogeneous ensembles,
e.g., RF and bagging41, comprise similar baseline classiers that utilize dierent datasets. e major limitation
of homogenous systems is generating diversity using a single algorithm. In contrast, the heterogeneous ensem-
ble, e.g., voting42 and stacking43, consists of several baseline classiers trained on a single dataset44. Research
has proven that heterogeneity in base classiers contributes to developing more accurate, robust, and scalable
ensemble models45. Ensemble methods provide a means to handle non-linear, intricate, and multi-dimensional
geoscience data46,47.
As aforementioned, to date, researchers have utilized several supervised shallow/deep algorithms to determine the correspondence between multiple varieties of well logs (as input) and lithofacies derived from core data or well logs (i.e., electrofacies) (as target) and then used the resultant correlation to locate lithofacies in uncored intervals/wells. This research, in contrast, focuses on designing a robust and scalable heterogeneous ensemble-based workflow for lithofacies modelling using lithology logs as the target. Nevertheless, several significant drawbacks can be found in nearly all ML/ensemble-based paradigms for lithofacies classification, mainly (1) their scalability constraints and (2) their ignorance of multiclass imbalances in data. The investigation attempts to overcome the first drawback by utilizing the blind well dataset from an oilfield with bold geological heterogeneity. As for the second drawback, subsurface geological heterogeneities place lithofacies modelling problems in the spotlight in various real-world scenarios with multiclass imbalanced data classification difficulties. Due to their focus on accuracy, traditional classifiers encounter performance challenges when confronted with class imbalance, leading to neglect of the minority class or classes. Moreover, conventional ML algorithms such as SVM, primarily devised for binary classification tasks, often demand adjustments to attain optimal performance in multiclass scenarios48. Furthermore, most standard imbalanced data combat tactics, e.g., cost-sensitive learning (CSL)49, adaptive synthetic sampling (ADASYN), and the modified synthetic minority oversampling technique (M-SMOTE) (as resampling techniques)50, are designed for binary issues and fail to adapt directly to situations with multiple classes. However, in some research, e.g., Liu and Liu37 and Zhou et al.32, imbalanced binary data combat tactics have been directly implemented for imbalanced multiclass lithofacies classification situations. We utilized decomposition techniques to extend imbalanced binary data combat tactics and binary-based ML algorithms (e.g., SVM) to multiclass environments. The original datasets are broken down into binary sets as part of these techniques by a divide-and-conquer procedure. Consequently, multiple classifiers are required, each responsible for a specific binary problem. Decomposition strategies are divided into two main categories, i.e., One-vs.-All (OVA) and One-vs.-One (OVO). When there are k classes in a problem, OVA compares each class with the others using k binary classifiers. Alternatively, OVO uses k(k−1)/2 binary classifiers to differentiate between class pairs in k-class problems3. These binary classifier architectures can be significantly improved using error correcting output code (ECOC)51. Furthermore, by under-sampling the majority samples or over-sampling the minority observations, resampling techniques seek to balance data. Nevertheless, these methods are likely to exclude some relevant information or even raise the processing rates of irrelevant samples. Under-sampling techniques (e.g., one-sided selection52) and over-sampling algorithms (e.g., borderline synthetic minority oversampling53) alter the class distribution. In return, CSL considers the costs of misclassifying samples49. Additionally, there are other options available in this situation besides class decomposition; accordingly, the research also uses ad-hoc approaches designed to learn directly from the dataset54.
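For orientation, the sketch below shows how these decomposition strategies can be exercised with scikit-learn's off-the-shelf wrappers. It is a minimal illustration of ours, not the authors' code; the synthetic imbalanced four-class dataset and the RBF SVM merely stand in for the lithofacies problem.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import (OneVsOneClassifier, OneVsRestClassifier,
                                OutputCodeClassifier)
from sklearn.svm import SVC

# A toy imbalanced 4-class problem standing in for the lithofacies dataset.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=4, weights=[0.55, 0.25, 0.15, 0.05],
                           random_state=0)

base = SVC(kernel="rbf")                      # binary-capable base learner
ova = OneVsRestClassifier(base).fit(X, y)     # k binary classifiers
ovo = OneVsOneClassifier(base).fit(X, y)      # k(k-1)/2 binary classifiers
ecoc = OutputCodeClassifier(base, code_size=2.0,  # code_size > 1 adds the
                            random_state=0).fit(X, y)  # redundant error-correcting bits
print(ova.score(X, y), ovo.score(X, y), ecoc.score(X, y))
```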
In this study, we aim to develop a scalable ensemble-based workflow to generate high-resolution lithology logs reliably and automatically. We address two challenging topics: (1) the scalability of the designed workflow and (2) the analysis of the multiclass imbalanced dataset. The initial obstacle is overcome using a blind well dataset from an oilfield with complex heterogeneous conditions. Besides ad-hoc strategies, combining decomposition techniques with binary imbalanced data combat tactics is crucial in addressing the second concern. In this investigation, a heterogeneous ensemble model is designed and compared with baseline classifiers as popular algorithms in lithofacies classification research.
Methodology
General workflow
Figure 1 demonstrates an overview of the proposed high-resolution lithology log generation workflow, consisting of three main subsections: Workflows 1, 2, and 3. Following data collection and preprocessing, the data are partitioned into training, testing, and blind verification datasets. Workflow 1 evaluates the interaction of the baseline classifiers with the synergy of decomposition techniques and binary imbalanced data handling methods. Through Workflow 2, the baseline classifiers are coupled with ad-hoc approaches. Finally, after the training and evaluation of all baseline classifiers, an enhanced weighted average ensemble of outstanding classifiers is integrated with superior synergies/ad-hoc tactics in Workflow 3.
Multiclass imbalanced learning
Even though minority classes are rare, they frequently provide vital knowledge and crucial learning content. This section addresses two main challenges: (1) the usability of standard ML algorithms and (2) the feasibility of conventional binary imbalanced data combat tactics for solving multiclass imbalance issues. A widely accepted methodology to simultaneously address both obstacles involves dividing the multiclass modelling issue into several binary subproblems through ECOC, OVA, and OVO as decomposition strategies. This investigation focuses on the ECOC encoding process due to its functionality (in contrast to OVO/OVA), particularly regarding the overlap caused by the proximity of the classes' spectra and the influence of their spatial positions. By exploiting ECOC, it is possible to use standard ML algorithms and strategies for combating binary imbalanced data in the upcoming multiclass imbalance concern. However, several studies have concentrated on an overall framework that focuses on developing ad-hoc methods like Static-SMOTE55 instead of modifying conventional techniques for handling binary imbalanced data in the multiclass context. Ad-hoc approaches are generally limited to several specific types of research and are not very general. Additionally, CSL can handle an imbalanced binary class56,57. CSL proves more effective than sampling techniques (e.g., M-SMOTE) for imbalanced data58. Unlike sampling methods, CSL maintains the original distribution of the data59. As a result, due to CSL's capabilities, this paper focuses on its ability to address imbalanced data challenges. In the current research, the existing imbalanced multiclass problem is decomposed into binary subsets through the ECOC technique. Then, strategies for dealing with imbalanced binary data are implemented to address it. Additionally, the study utilizes Static-SMOTE as an ad-hoc tactic to highlight the efficiency of the proposed technique.
Error correcting output code concept
Theoretically, encoding and decoding are the two phases involved in ECOC schemes. Encoding produces a coding matrix, while decoding places every unidentified instance in the most similar class. An N × m coding matrix has an element c_i,j in the ith row (c_i) and jth column. The ith class and the jth column are symbolized by cla_i and col_j, respectively. The coding matrix must meet five specifications simultaneously. Initially, every row ought to include either a '+1' or a '−1':

(1) $\sum_{j=1}^{m} \left| c_{i,j} \right| \neq 0, \quad \forall i \in [1, N]$

If not, the relevant class cannot be identified during training. Secondly, to provide training examples for each group, all columns must include both a '+1' and a '−1':

(2) $\sum_{i=1}^{N} \left| c_{i,j} \right| \neq \left| \sum_{i=1}^{N} c_{i,j} \right|, \quad \forall j \in [1, m]$

The third rule is to avoid having duplicate overlapping columns:

(3) $\sum_{i=1}^{N} \left| c_{i,j} - c_{i,l} \right| \neq 0, \quad \forall j, l \in [1, m],\ j \neq l$

As a fourth rule, no two rows should be alike:

(4) $\sum_{j=1}^{m} \left| c_{i,j} - c_{l,j} \right| \neq 0, \quad \forall i, l \in [1, N],\ i \neq l$

Lastly, no pair of columns should have a reverse correlation:

(5) $\sum_{i=1}^{N} \left| c_{i,j} + c_{i,l} \right| \neq 0, \quad \forall j, l \in [1, m],\ j \neq l$
Figure1. An overview of the proposed workow.
Every dichotomizer selects a random element s_0 during the decoding process, which forms the decoded vector y_s0. Typically, the Hamming distance (HD) is applied to assess the similarity of y_s0 with each c_i, and s_0 is allocated to the cla_o exhibiting the most similarity:

(6) $HD\left(y_{s_0}, c_i\right) = \sum_{j=1}^{m} \frac{1 - \operatorname{sign}\left(y_{s_0,j} \cdot c_{i,j}\right)}{2}$

(7) $o = \underset{i \in \{1, \dots, N\}}{\operatorname{argmin}}\ HD\left(y_{s_0}, c_i\right)$

In this case, y_s0,j refers to the jth item in y_s0. In cases where soft outcomes are required, the Euclidean distance (ED) is applied instead of HD, which is restricted to hard results (+1/−1):

(8) $ED\left(y_{s_0}, c_i\right) = \sqrt{\sum_{j=1}^{m} \left(y_{s_0,j} - c_{i,j}\right)^2}$

Data-independent and data-dependent strategies can be used to produce optimum coding matrices. The former generates coding matrices without considering the samples' distribution; OVA and OVO are subsets of this approach. Due to the predetermined nature of the coding matrices in this category, they cannot be applied to a wide range of datasets with satisfactory results. In contrast, the latter method creates coding matrices considering the numerical distributions, of which Data-Driven ECOC is one category. Due to the better fit of its coding matrices to sample distributions, it typically provides superior classification performance60.
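As a concrete illustration of the decoding phase, the short sketch below implements Eqs. (6)-(8) in NumPy. The 4-class, 7-column exhaustive coding matrix is a textbook example chosen to satisfy the five design rules above; it is not the matrix used in the study.

```python
import numpy as np

# Exhaustive coding matrix for 4 classes (7 dichotomizers); each row is a
# class codeword satisfying the five design rules above.
C = np.array([[+1, +1, +1, +1, +1, +1, +1],
              [-1, -1, -1, -1, +1, +1, +1],
              [-1, -1, +1, +1, -1, -1, +1],
              [-1, +1, -1, +1, -1, +1, -1]])

def hamming_decode(y_s0, C):
    # Eq. (6): count the dichotomizers whose hard outputs disagree with c_i,
    # then Eq. (7): return the class with the most similar codeword.
    d = np.sum((1 - np.sign(y_s0 * C)) / 2, axis=1)
    return int(np.argmin(d))

def euclidean_decode(y_s0, C):
    # Eq. (8): ED accommodates soft (real-valued) dichotomizer outputs.
    return int(np.argmin(np.sqrt(np.sum((y_s0 - C) ** 2, axis=1))))

# Codeword of class 2 with one dichotomizer output flipped: the redundant
# columns absorb the error and both decoders still recover class 2.
y_s0 = np.array([+1, -1, +1, +1, -1, -1, +1])
print(hamming_decode(y_s0, C), euclidean_decode(y_s0, C))   # -> 2 2
```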
Cost-sensitive learning method
In analyzing data, the CSL tactic refers to a learning approach that considers misclassification costs; its objective is total cost minimization. Under CSL procedures, such as the MetaCost approach, various classes pay varying costs to address class imbalance challenges. CSL can thus handle the costs associated with unfair misclassifications and class imbalances. CSL methods fall into two distinct groups. The first group develops classifiers that are inherently cost-sensitive. The second group designs a "wrapper" that converts existing cost-insensitive classifiers to cost-sensitive ones61. Due to its ability to convert a wide range of cost-insensitive classifiers to cost-sensitive ones, the present study applies an instance-based weighting scheme from the second group. Adjusting class weights is one of the most straightforward ways to increase an algorithm's sensitivity to the minority class or classes (particularly in models that incorporate class weights). Logically, the penalties for misclassification of distinct categories correspond to the class weights: a class with a higher weight is subject to a higher misclassification penalty than classes with lower weights. There are several options for setting the class weights. This investigation utilizes the following equation as a balanced heuristic for class weight determination:

(9) $w_c = \frac{N}{k \cdot |c|}$

where w_c refers to the weight assigned to class c, N denotes the number of samples within the dataset, k stands for the class count within the dataset, and |c| represents the sample count for class c62.
Baseline classiers. SVM, DT, RF, LR, and XGBoost are selected baseline classiers. e selection of such
algorithms was deliberate, aiming to leverage the diverse strengths of each model for addressing various aspects
of the research problem. Indeed, a diverse array of baseline algorithms, including linear, non-linear, homogeneous
ensemble, and tree-based methods, provides varied learning strategies for the available dataset. SVM handles
complex boundaries well. It uses a hyperplane to divide n-dimensional attribute vectors into two classes. Kernel
functions are utilized to train the SVM algorithm, facilitating the transformation of feature vectors into higher-
dimensional domains. Aer that, the convex optimization approach is adopted to solve the ML task. According
to the maximum marginal hyperplane, every incoming instance should t logically into either of the categories.
A support vector is a set of data points nearest the hyperplane, which divides the class63. Additionally, DT oers
interpretability and enables analysts to create intelligent forecasting classiers. A DT allows users to estimate an
object’s value based on gathered data. In light of a set of relevant decisions, DT illustrates potential scenarios.
As a result of this approach, users can weigh various decision alternatives, the costs, the probability, and the
importance of every option. is study implements a classication and regression tree training procedure. e
procedure facilitates classication and regression tasks by utilizing discrete or contiguous parameters. Classica-
tion and regression trees have just a pair of leaves on each node64. e classication task could also be conducted
using RF, which provides robustness through ensemble learning. e model generates multiple DTs (or a forest)
for the training process. When performing classication tasks, the model returns the class that corresponds to
the mode of classes. Moreover, this approach eliminates the risk of overtting inherent in DTs65. LR is another
ML algorithm primarily designed for predicting class membership, in which the objective is to estimate the
probability of whether an instance falls into a particular class66. LR oers simplicity and is adequate for binary
classication tasks. Moreover, XGBoost is a popular ML algorithm suitable for tabular data, ensuring high
performance and scalability. With XGBoost, it is possible to detect complex numerical correlations between the
measured parameters and the desired model. This method combines conventional regression and categorization trees alongside analytic boosting algorithms. XGBoost details are available in Raihan et al.67. Table 1 outlines the hyperparameters obtained through hyperparameter tuning for the baseline classifiers. These specific parameters were carefully chosen following preliminary experiments and subsequent fine-tuning conducted through grid search and cross-validation. This iterative process aimed to attain optimal performance while mitigating the risk of overfitting.
Voting ensemble classier
Voting ensembles combine estimates of several distinct classiers. is technique improves the performance of
individual classiers in an ensemble, ideally outperforming any single algorithm. Pooling forecasts across dif-
ferent algorithms enables the creation of a voting ensemble applicable to regression and classication problems.
During classication, estimates for each label are added together, and the majority vote label is determined. Sup-
pose
N
classiers are chosen and identied by
S1,...,S
N
and
R={Si:i=1, 2, 3, ...N}
. In the case of
M
output
classes, the ensemble voting algorithm determines how to combine the classier
S1
by voting
V
to optimize the
F(V)
function. An array with dimensions
N×M
represents
V
. An indication of the weight of ith classier’s vote
for the jth class is provided by
V
i, j
. As a general rule, the more condent a classier is, the greater the weight
allocated, while the more uncertain a classier is, the lower the weight assigned.
V
i, j
[
0, 1
]
represents the
level of assurance the ith classier has for the jth class. Combination rules use weights to combine the predicted
outcomes of classiers. ere are two approaches to predicting the majority vote for classication: hard voting
and so voting. Hard voting involves calculating the total number of votes for each class label and predicting
which has the most votes. So voting involves summing the probability estimates of each class label, and the
predicted class label is the one with the highest probability. Voting ensembles are recommended when all models
in an ensemble are predominantly in consensus or have similar exemplary performance. ey are particularly
benecial whenever several ts of identical baseline classiers are combined with various hyperparameters68.
e voting ensemble is limited in considering all algorithms equally, i.e., each model contributes identically to
forecasting. To address such issues, an extension of the voting ensemble involves applying weighted averaging
or weighted voting of the collaborating algorithms.
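A minimal sketch of the two voting modes, assuming scikit-learn's VotingClassifier and a toy dataset in place of the study's models and logs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# A toy three-class problem standing in for the facies data.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

svm = SVC(probability=True)        # probability=True enables soft voting
rf = RandomForestClassifier()

hard_vote = VotingClassifier([("svm", svm), ("rf", rf)], voting="hard")
soft_vote = VotingClassifier([("svm", svm), ("rf", rf)], voting="soft")

# Hard: majority of predicted labels; soft: argmax of summed probabilities.
print(hard_vote.fit(X, y).predict(X[:5]))
print(soft_vote.fit(X, y).predict(X[:5]))
```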
Enhanced weighted average ensemble method
This paper applies the enhanced weighted average ensemble model69 to classify multiclass imbalanced data. Such ensembles have shown their effectiveness, accuracy, reliability, and robustness in addressing complex pattern recognition challenges70. Baseline classifiers that are more skilled than others are given additional weight in this method. The algorithm modifies voting ensembles, in which all models are deemed equally qualified and contribute identically to predictions. Each baseline classifier is assigned a weight that determines its contribution. Finding appropriate weights is the main challenge for such algorithms: optimum weights yield efficiency superior to both equally weighted ensembles and individual baseline classifiers. The present study utilizes a grid search strategy, assigning weights from the range [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] to each baseline classifier. This approach addresses the weight-selection challenge by optimizing the assigned weights effectively. Additionally, the research utilizes soft and hard estimators for voting.
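The weight search can be sketched as below. This is our simplified illustration: it scores candidate weight pairs on a validation split by accuracy, whereas any of the study's metrics (e.g., Mean.K or Mean.F) could equally serve as the selection criterion, and the probability matrices would come from the trained SVM and RF models.

```python
import itertools
import numpy as np

def grid_search_weights(p_svm, p_rf, y_val):
    """p_svm, p_rf: class-probability matrices on validation data;
    y_val: integer-coded true labels. Returns the best (w_svm, w_rf)."""
    grid = np.arange(0.0, 1.01, 0.1)          # [0.0, 0.1, ..., 1.0]
    best, best_w = -1.0, (0.5, 0.5)
    for w1, w2 in itertools.product(grid, grid):
        if w1 + w2 == 0:
            continue                           # at least one model must vote
        p = (w1 * p_svm + w2 * p_rf) / (w1 + w2)   # weighted soft vote
        acc = np.mean(np.argmax(p, axis=1) == y_val)
        if acc > best:
            best, best_w = acc, (w1, w2)
    return best_w, best

# Demo with fake probability outputs for a 4-class problem.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 4, size=50)
p_svm = rng.dirichlet(np.ones(4), size=50)
p_rf = rng.dirichlet(np.ones(4), size=50)
print(grid_search_weights(p_svm, p_rf, y_val))
```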
Case study
One of the Middle East oil fields is selected as a case study. Geologically, the field lies in the transition zone between the highly folded Zagros region and the stable Arabian platform. The underground formations explored are Gurpi, Ilam, Laffan, Sarvak, and Kazhdumi, whose predicted strata are as follows:
1. The Gurpi Formation comprises a sequence of Shale (Sh), Limestone (Ls), and Argillaceous Limestone (argiLs) stratigraphically associated with the Ilam Formation (at the top section).
2. The Ilam Formation is composed mainly of yellow to grey-brown Ls containing glauconite alongside trace quantities of hydrocarbons. Oolitic Ls appears frequently intermingled with Ls. There are traces of Sh segments in its lower part and evidence of hydrocarbons. Sh sequences, secondary Ls, and hydrocarbon remains occupy the top position.
3. There are greyish to emerald-ash Sh layers with fine inclusions of white Ls in the Laffan Formation (roughly 10 m thick).
Table 1. Hyperparameters of baseline classifiers.

| Baseline classifier | Hyperparameters |
|---|---|
| SVM | Kernel: Radial Basis Function (RBF); C (regularization parameter): 8.0; Gamma: 0.001 |
| DT | Criterion: Gini impurity; Max depth: 5; Min samples split: 5 |
| RF | Number of estimators: 128; Max depth: 8; Max features: 'sqrt' |
| LR | Solver: 'liblinear'; Regularization: L2; C (regularization parameter): 10.0 |
| XGBoost | Number of boosting rounds: 100; Learning rate: 0.1; Max depth: 3; Objective function: binary logistic regression |
4. The Sarvak Formation's lower lithotype contains numerous Sh layers and hydrocarbon residues. The remainder comprises predominantly grey Chalky Limestone (chkLs), light grey to white chkLs, and dark brown to pale brown Cherty Ls. Regional Sh accompanies these Ls units.
5. The Kazhdumi Formation generally consists of dark black and dark brown Sh and pyritic Ls, rich in dark grey to pale ash and dark brown Sh-Ls.
Dataset
The dataset consists of computed gamma ray [CGR (GAPI)], spectral gamma ray [SGR (GAPI)], neutron porosity [NPHI (V/V)], photoelectric factor [PE (B/E)], density [RHOB (G/C3)], sonic [DT (US/F)], and lithology logs. Data from five wells identified as W-01 to W-05 exist within the study area. Figure 2a demonstrates the geographical positions of the wells in the area under investigation. W-03 is selected as a blind well based on its geographical location and data range coverage. The ML algorithms are trained using data from the other four wells. As an example, Fig. 2b illustrates the conventional well logs and lithology logs for W-02. Figure 3a–g display the distribution of the input features (CGR, SGR, DT, NPHI, PE, RHOB) and the target feature (Facies), respectively. Figure 3g illustrates a substantial imbalance within the input data.
Data preparation and class dierentiation
As a part of this subsection, the data undergo a check for missing values and outliers aer encoding categorical
features (such as facies names, well identiers, and formations) into dummy variables. An error in a dataset can
take many forms, for example, duplicate rows or weak columns. While rening the available data, columns with
only a single value, low variance, and rows containing repeated observations are identied and eliminated. Addi-
tionally, unnecessary columns are eliminated based on the correlation between dierent features. Furthermore,
the distribution quantity of available datasets necessitated the application of standardization. Before presentation
as input to the ML algorithms, the data undergo standardization to achieve a zero mean and unit variance71.
Figure2. (a) e geographic positions of the wells in the area under investigation, and (b) Conventional well
logs, lithology log, and a legend map for W-02 as an illustrative example.
However, complications like drilling fluid disturbance or drill bit balling up can occur during lithology log recording, and these artifacts can make it challenging to separate different facies. Before training the classifier, the preprocessing stage therefore aims to achieve a high level of separation between the classes. This goal is
Figure3. Distribution of input features including (a) CGR, (b) SGR, (c) DT, (d) NPHI, (e) PE, and (f) RHOB,
alongside (g) Facies as the target feature.
performed using linear discriminant analysis as a noise reduction technique72 with 97% accuracy. Through stratified sampling73, the input data are divided between training (75%) and testing (25%) sets to account for the problem of data imbalance. Thus, both sets have a proportional representation of each class.
Results and discussion
The study initiates with Workflow 1 (see Fig. 1), aimed at assessing the baseline classifiers while exploring synergies between the decomposition strategy and various tactics tailored for handling imbalanced binary data. This phase is crucial for pinpointing noteworthy interactions. Furthermore, Workflow 2 amalgamates optimal baseline classifiers with customized ad-hoc methods. Subsequently, Workflow 3 introduces an enhanced weighted average ensemble that merges the most effective baseline classifiers. This ensemble is then integrated with superior synergies or ad-hoc techniques for an improved performance assessment. The assessment of imbalanced multiclass classification presents a challenge because widely used measures for evaluating classifiers' outputs, such as accuracy, are built upon assumptions of balanced data distributions. Previous studies have proposed the mean Kappa statistic (Mean.K) and mean F-measure (Mean.F) to assess imbalanced situations74–76. The Landis and Koch grouping is commonly utilized for interpreting Kappa statistic values, where the ranges correspond to different levels of agreement: < 0% (poor); 0–20% (slight); 21–40% (fair); 41–60% (moderate); 61–80% (substantial); and 81–100% (almost-perfect)77. For a detailed explanation of the Kappa statistic and F-measure for imbalanced multiclass classification, refer to Jamshidi Gohari et al.3. The lithology log generation was developed within the Google Colaboratory platform using various libraries, including Pytorch, Pandas, Numpy, Matplotlib, Mpl toolkits, and Sklearn in Python 3.11.5. The experiments ran on an Intel Core i7-11370H with 16 GB of RAM.
Synergy between ECOC and binary imbalanced data combat tactics
This subsection describes how ECOC and binary imbalanced data combat tactics interact with the baseline classifiers through Workflow 1. Static-SMOTE results, as part of Workflow 2, are reported alongside. Table 2 illustrates the average outcomes and rankings based on the average of 20 runs. The t-index represents test marks, whereas the b-index indicates blind evaluation scores. One section covers the ad-hoc approach, and the other presents the ECOC scheme. Each technique is ranked separately for a given unit in the "Rank" column. The highest marks are indicated in bold font. Furthermore, the basic version of the algorithms (i.e., Base and Std) is implemented to verify the results. Table 2 supports the following findings. When combined with ECOC and CSL as part of Workflow 1, SVM produced the most accurate results (Rankb = 1). The effectiveness of this procedure manifested itself in a Mean.Fb of 86.87% and a Mean.Kb of 78.04% for the blind well dataset. ECOC-CSL is numerically better
Table 2. Mean classier test and blind well assessment outcomes (using a 20-run average) for baseline
classiers based on Mean. F and Mean. K (Percentage-wise). e t-index signies test grades, while the
b-index denotes ratings from blind evaluations.
Method Baseline classier Adaptation
Mean.Ft
Mean.Fb
Rankb
Mean.Kt
Mean.Kb
Rankb
Ad-hoc
SVM
Base
93.26 82.46 88.15 70.61
RF 92.72 81.88 87.49 69.96
XGBoost 90.62 78.74 84.97 67.54
DT 88.54 76.65 82.65 65.89
LR 84.38 71.84 77.86 60.85
SVM
Static-SMOTE
93.33 83.58 5 89.24 72.55 5
RF 92.58 82.75 6 88.43 71.69 6
XGBoost 89.98 81.42 8 85.68 69.14 8
DT 88.99 80.68 10 83.45 67.82 10
LR 85.04 76.11 13 78.24 62.74 13
ECOC
SVM
Std
93.87 85.30 90.03 75.03
RF 92.84 84.29 89.12 74.08
XGBoost 89.76 83.02 87.45 72.88
DT 87.65 81.45 85.94 70.86
LR 82.98 77.07 80.85 65.87
SVM
M-SMOTE
89.92 81.38 9 83.56 68.82 9
RF 88.97 80.24 11 81.75 67.03 11
XGBoost 86.43 77.54 12 78.54 64.72 12
DT 83.95 72.97 14 77.14 62.68 14
LR 80.87 71.95 15 72.56 57.21 15
SVM
CSL
94.71 86.87 1 91.37 78.04 1
RF 94.09 86.28 2 90.55 77.29 2
XGBoost 93.87 84.08 3 89.62 75.42 3
DT 93.74 83.67 4 89.48 74.14 4
LR 90.32 81.54 7 85.98 70.52 7
than ECOC-M-SMOTE or Static-SMOTE. In addition, coupling RF with the synergy of ECOC and CSL yielded a Mean.Fb of 86.28% and a Mean.Kb of 77.29% as a co-factor of Workflow 1 (Rankb = 2). In this particular combination, when paired with RF, ECOC-CSL demonstrates superior numerical performance compared to other methods, thereby affirming its overall functionality. When examining the ECOC-CSL-SVM (Rankb = 1) and ECOC-CSL-RF (Rankb = 2) outputs, it becomes apparent that the former exhibits a higher level of proficiency. However, both perform satisfactorily on blind well data evaluation. Therefore, developing an enhanced weighted average ensemble that combines these two synergies from Workflow 1 may result in superior performance.
SVM-RF enhanced weighted average ensemble development
In this subsection, the development of an enhanced weighted average ensemble based on the two superior combinations of Workflow 1, i.e., ECOC-CSL-SVM and ECOC-CSL-RF, is reported. The voting scheme consists of two types: soft voting and hard voting. Table 3 presents the average results and rankings across 20 runs. As reported, Workflow 3 provides the best performance, in which the enhanced weighted average ensemble of SVM and RF in soft voting mode is coupled with ECOC-CSL; a Mean.Fb of 91.04% and a Mean.Kb of 84.50%, which indicates almost-perfect agreement, are proof of this superiority (Rankb = 1). Tables 2 and 3 illustrate that the enhanced weighted average ensemble of SVM and RF in soft voting mode coupled with ECOC-CSL is the most efficient workflow, henceforth called the optimal workflow. Additionally, comparing the confusion matrices of the various workflows (i.e., Workflows 1, 2, and 3) shows that the optimal workflow provided the superior prediction for argiLs, chkLs, Ls, and Sh. Figure 4a,b present the confusion matrices comparing the optimized workflow against an unoptimized approach for evaluating blind well data. It is apparent that the unoptimized workflow exhibits bias towards the majority classes and performs suboptimally in recognizing the minority class, specifically Sh.
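Confusion matrices such as those in Fig. 4 can be reproduced along the following lines; the blind-well labels and predictions here are invented placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Invented blind-well labels and predictions, for illustration only.
y_blind = ["Ls", "Sh", "Ls", "argiLs", "chkLs", "Ls", "Sh", "argiLs", "Ls"]
y_hat   = ["Ls", "Sh", "Ls", "argiLs", "Ls",    "Ls", "Ls", "argiLs", "Ls"]

ConfusionMatrixDisplay.from_predictions(y_blind, y_hat, colorbar=False)
plt.title("Blind-well confusion matrix (illustrative)")
plt.tight_layout()
plt.show()
```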
Graphical comparative assessment
Figure 5a–d depict the lithology log generated through the optimal workflow (i.e., Generated L.L.) for different depth intervals of the blind well dataset. The optimal workflow could separate Sh, one of the critical minority classes, from argiLs, chkLs, and Ls according to the peak values in the conventional well logs, especially CGR and SGR. The generated lithology log displays a reasonable similarity to the original one (i.e., Original L.L. in Fig. 5a–d) in pinpointing the regions where argiLs, chkLs, Ls, and Sh occur. Figure 5b displays the concentrated depth interval (2728–2750 m) for the minority Sh class in the blind well. It shows an excellent correlation among the peak positions of the blind well logs, the Sh positions in the original lithology log, and those in the generated one. A similar agreement holds for the argiLs, chkLs, and Ls facies, which share overlapping characteristics. Figure 5c
Table 3. Mean classier test and blind well results (using a 20-run average) for designed ensemble based on
Mean. F and Mean. K (Percentage-wise). e t-index signies test grades, while the b-index denotes ratings
from blind evaluations.
Method ensenble type Adaptation
Mean.Ft
Mean.Fb
Rankb
Mean.Kt
Mean.Kb
Rankb
ECOC
Enhanced weighted average ensemble of SVM
and RF in so voting mode CSL
94.92 91.04 1 91.70 84.50 1
Enhanced weighted average ensemble of SVM
and RF in hard voting mode 94.07 90.33 2 90.44 83.62 2
Figure4. (a) Confusion matrix of the optimal workow for blind well data evaluation, and (b) confusion
matrix of an unoptimized workow for blind well data assessment.
highlights the blind well interval of 2450–2600 m, covering the argiLs and Ls facies. Additionally, Fig. 5d shows the depth interval of the blind well for the chkLs, Ls, and Sh facies from 3175 to 3300 m. In these figures, the positions of argiLs, chkLs, Ls, and Sh in the generated lithology log reasonably match those in the original one.
Unlike the OVA and OVO approaches, which partition a multiclass modelling problem into a finite number of binary classification tasks, the ECOC algorithm allows any given class to be encoded by an arbitrarily large number of binary classification tasks. This redundant representation enables the additional models to act as "error-correction" forecasts, enhancing prediction ability. Furthermore, a significant factor behind the superior CSL performance is that it assigns additional weight to misclassifications of minority samples and imposes a penalty for inaccurate classifications; these classes thus receive more attention from the model. This approach compels the model to learn instances from the minority classes, making it a potent tool for forecasting occurrences of these classes. CSL, moreover, maintains the original distribution of the data, unlike resampling approaches. The SVM classification effectiveness can be attributed to the fact that it transforms the initial data into a higher-dimensional space. This ability separates the classes better while maintaining the same computational cost as the initial problem, a feature referred to as the kernel trick.
Furthermore, RF can minimize the impact of an imbalanced sample distribution during classification, a characteristic that enhances the identification efficiency for minority samples. On the other hand, when the ratio of imbalanced observations rises, the classification performance of RF is markedly impaired, preventing the training of a complete classification algorithm. The current study addressed this drawback by coupling the RF with ECOC-CSL. SVM behaved more skillfully than RF under similar conditions (i.e., when combined
Figure5. Lithology log (LL) generated using the optimal workow for blind well data, illustrating depth
intervals: (a) 2351–3399m, (b) 2728–2750m, (c) 2450–2600m, and (d) 3175–3300m.
with the synergy of ECOC-CSL); however, both performed satisfactorily on blind well data evaluation. Designing an enhanced weighted average ensemble aims to maximize efficiency by combining these two models, each with unique advantages. As a result of its reduced error rate and lower variance, the ensemble has improved predictive performance over the individual models (i.e., the baseline classifiers). A single classifier can capture only some of the fundamental characteristics of the data; consequently, combining several primary learners can capture further insight into the data's internal structure and dramatically boost estimation precision.
In addition, the study seeks to offer a scalable workflow to generate lithology logs or, more broadly, to model lithofacies, not restricted to the region under investigation. Accordingly, the experiment sought to remedy conventional procedures' deficiencies and considered multiple factors. Hence, a research site with considerable geological heterogeneity was chosen, highlighting the imbalanced multiclass data issue. The optimal workflow produced superior results in the blind well evaluation; its scalability is thereby confirmed through blind well analysis. Furthermore, given that geological evidence is based on lithology log data, it is crucial to consider its uncertainty sources. Wellbore instabilities (e.g., breakouts and washouts), balling up, and rheology disturbances can lead to inaccurate data sources. Incorporating LDA as a denoising tool to mitigate these concerns is advisable.
Additionally, the developed strategies for dealing with the multiclass imbalance dilemma manifest uniform performance irrespective of the classifier type. Consequently, the outcomes are comparable throughout, supporting validity. Finally, DL algorithms are more stable than shallow ML techniques, particularly when analyzing noisy and uncertain geoscience datasets. As a result, it is recommended that the geoscience and geo-energy communities collect a global data bank, similar to those developed in image processing, to facilitate transfer learning. Moreover, this investigation primarily focused on several standard imbalanced data combat tactics and ad-hoc techniques. However, considering further alternatives, such as employing tailored loss functions like balanced cross-entropy and focal loss78 for imbalanced lithofacies modelling, is suggested as a reasonable avenue for future research. Last but not least, this study provides a basis for future work in geosciences and engineering dealing with imbalanced multiclass data.
Conclusion
The current investigation focused on statistically and graphically analyzing high-resolution lithology log generation. A primary emphasis was placed on addressing two significant challenges: multiclass imbalanced data classification and scalability. Three distinct workflows were scrutinized to tackle the former, employing baseline classifiers, a custom ensemble algorithm, and methods tailored for handling multiclass imbalanced data. Addressing the latter challenge involved evaluating these workflows using blind well data from an oilfield characterized by substantial geological variations. The optimal workflow emerged as an enhanced weighted average ensemble of SVM and RF alongside ECOC and CSL. This amalgamation showcased notable strength and reliability, evidenced by a mean Kappa statistic of 84.50%, signifying almost-perfect agreement, and a mean F-measure of 91.04%. These results underscore the optimal workflow's robustness and efficacy in evaluating blind well data. Moreover, the devised ensemble showcased superior performance to commonly employed baseline classifiers in lithofacies classification endeavours. The constructed workflow adeptly handles multiclass imbalanced data with efficiency and logical coherence. Evaluation based on statistical and graphical analyses of the blind well dataset indicated a satisfactory correlation between the generated lithology log and the original one. Additionally, a notable advantage of the proposed workflow lies in its ability to retain the initial data distribution. In summary, the developed workflow presents a versatile solution capable of addressing multiclass imbalance issues within the geo-energy sector, extending beyond lithofacies classification tasks.
Data availability
The corresponding author will make all the data available upon a reasonable request.
Received: 25 October 2023; Accepted: 4 December 2023
References
1. Karimi, A. M., Sadeghnejad, S. & Rezghi, M. Well-to-well correlation and identifying lithological boundaries by principal component analysis of well-logs. Comput. Geosci. 157, 104942 (2021).
2. Zhan, C. et al. Subsurface sedimentary structure identification using deep learning: A review. Earth Sci. Rev. 239, 104370 (2023).
3. Jamshidi Gohari, M. S., Emami Niri, M., Sadeghnejad, S. & Ghiasi-Freez, J. Synthetic graphic well log generation using an enhanced deep learning workflow: Imbalanced multiclass data, sample size, and scalability challenges. SPE J. https://doi.org/10.2118/217466-PA (2023).
4. Masroor, M., Emami Niri, M., Rajabi-Ghozloo, A. H., Sharifinasab, M. H. & Sajjadi, M. Application of machine and deep learning techniques to estimate NMR-derived permeability from conventional well logs and artificial 2D feature maps. J. Pet. Explor. Prod. Technol. 12, 2937–2953 (2022).
5. Sharifinasab, M. H., Niri, M. E. & Masroor, M. Developing GAN-boosted artificial neural networks to model the rate of drilling bit penetration. Appl. Soft Comput. 136, 110067 (2023).
6. Haddadpour, H. & Niri, M. E. Uncertainty assessment in reservoir performance prediction using a two-stage clustering approach: Proof of concept and field application. J. Petrol. Sci. Eng. 204, 108765 (2021).
7. Kolajoobi, R. A., Haddadpour, H. & Niri, M. E. Investigating the capability of data-driven proxy models as solution for reservoir geological uncertainty quantification. J. Petrol. Sci. Eng. 205, 108860 (2021).
8. Mousavi, S.-P. et al. Modeling of H2S solubility in ionic liquids: Comparison of white-box machine learning, deep learning and ensemble learning approaches. Sci. Rep. 13, 7946 (2023).
9. Rezaei, F., Akbari, M., Rafiei, Y. & Hemmati-Sarapardeh, A. Compositional modeling of gas-condensate viscosity using ensemble approach. Sci. Rep. 13, 9659 (2023).
10. Nakhaei-Kohani, R. et al. Solubility of gaseous hydrocarbons in ionic liquids using equations of state and machine learning approaches. Sci. Rep. 12, 14276 (2022).
11. Glover, P. W., Mohammed-Sajed, O. K., Akyüz, C., Lorinczi, P. & Collier, R. Clustering of facies in tight carbonates using machine learning. Mar. Pet. Geol. 144, 105828 (2022).
12. Troccoli, E. B., Cerqueira, A. G., Lemos, J. B. & Holz, M. K-means clustering using principal component analysis to automate label organization in multi-attribute seismic facies analysis. J. Appl. Geophys. 198, 104555 (2022).
13. Emelyanova, I., Peyaud, J.-B., Dance, T. & Pervukhina, M. Detecting specific facies in well-log data sets using knowledge-driven hierarchical clustering. Petrophysics 61, 383–400 (2020).
14. Liu, Z., Cao, J., Chen, S., Lu, Y. & Tan, F. Visualization analysis of seismic facies based on deep embedded SOM. IEEE Geosci. Remote Sens. Lett. 18, 1491–1495 (2020).
15. Liu, X. et al. Deep classified autoencoder for lithofacies identification. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2021).
16. Lan, X., Zou, C., Kang, Z. & Wu, X. Log facies identification in carbonate reservoirs using multiclass semi-supervised learning strategy. Fuel 302, 121145 (2021).
17. Xie, W. & Spikes, K. T. Well-log facies classification using an active semi-supervised algorithm with pairwise constraints. Geophys. J. Int. 229, 56–69 (2022).
18. Li, Z. et al. Semi-supervised learning for lithology identification using Laplacian support vector machine. J. Pet. Sci. Eng. 195, 107510 (2020).
19. Zhang, L. et al. Diagenetic facies characteristics and quantitative prediction via wireline logs based on machine learning: A case of Lianggaoshan tight sandstone, Fuling area, Southeastern Sichuan Basin, Southwest China. Front. Earth Sci. 10, 1018442 (2022).
20. Wood, D. A. Carbonate/siliciclastic lithofacies classification aided by well-log derivative, volatility and sequence boundary attributes combined with machine learning. Earth Sci. Inform. 15, 1699–1721 (2022).
21. Zhao, Z. et al. Lithofacies identification of shale reservoirs using a tree augmented Bayesian network: A case study of the lower Silurian Longmaxi formation in the Changning block, South Sichuan basin, China. Geoenergy Sci. Eng. 221, 211385 (2023).
22. He, M., Gu, H. & Xue, J. Log interpretation for lithofacies classification with a robust learning model using stacked generalization. J. Pet. Sci. Eng. 214, 110541 (2022).
23. Antariksa, G., Muammar, R. & Lee, J. Performance evaluation of machine learning-based classification with rock-physics analysis of geological lithofacies in Tarakan Basin, Indonesia. J. Pet. Sci. Eng. 208, 109250 (2022).
24. Rau, E. G. et al. Applicability of decision tree-based machine learning models in the prediction of core-calibrated shale facies from wireline logs in the late Devonian Duvernay Formation, Alberta, Canada. Interpretation 10, T555–T566 (2022).
25. Dong, S., Zeng, L., Du, X., He, J. & Sun, F. Lithofacies identification in carbonate reservoirs by multiple kernel Fisher discriminant analysis using conventional well logs: A case study in A oilfield, Zagros Basin, Iraq. J. Pet. Sci. Eng. 210, 110081 (2022).
26. Dong, S.-Q. et al. A deep kernel method for lithofacies identification using conventional well logs. Pet. Sci. 20, 1411–1428 (2023).
27. Babasafari, A. A., Campane Vidal, A., Furlan Chinelatto, G., Rangel, J. & Basso, M. Ensemble-based machine learning application for lithofacies classification in a pre-salt carbonate reservoir, Santos Basin, Brazil. Pet. Sci. Technol. https://doi.org/10.1080/10916466.2022.2143813 (2022).
28. Feng, R. A Bayesian approach in machine learning for lithofacies classification and its uncertainty analysis. IEEE Geosci. Remote Sens. Lett. 18, 18–22 (2020).
29. Feng, R. Improving uncertainty analysis in well log classification by machine learning with a scaling algorithm. J. Pet. Sci. Eng. 196, 107995 (2021).
30. Nwaila, G. T. et al. Data-driven predictive modeling of lithofacies and Fe in-situ grade in the Assen Fe ore deposit of the Transvaal Supergroup (South Africa) and implications on the genesis of banded iron formations. Nat. Resour. Res. 31, 2369–2395 (2022).
31. Zheng, D. et al. Application of machine learning in the identification of fluvial-lacustrine lithofacies from well logs: A case study from Sichuan Basin, China. J. Pet. Sci. Eng. 215, 110610 (2022).
32. Zhou, K., Zhang, J., Ren, Y., Huang, Z. & Zhao, L. A gradient boosting decision tree algorithm combining synthetic minority oversampling technique for lithology identification. Geophysics 85, WA147–WA158 (2020).
33. Al-Mudhafar, W. J., Abbas, M. A. & Wood, D. A. Performance evaluation of boosting machine learning algorithms for lithofacies classification in heterogeneous carbonate reservoirs. Mar. Pet. Geol. 145, 105886 (2022).
34. Hou, M. et al. Machine learning algorithms for lithofacies classification of the Gulong shale from the Songliao Basin, China. Energies 16, 2581 (2023).
35. Feng, R. Lithofacies classification based on a hybrid system of artificial neural networks and hidden Markov models. Geophys. J. Int. 221, 1484–1498 (2020).
36. Kim, J. Lithofacies classification integrating conventional approaches and machine learning technique. J. Nat. Gas Sci. Eng. 100, 104500 (2022).
37. Liu, J.-J. & Liu, J.-C. Integrating deep learning and logging data analytics for lithofacies classification and 3D modeling of tight sandstone reservoirs. Geosci. Front. 13, 101311 (2022).
38. Ta, V.-C. et al. Tabnet efficiency for facies classification and learning feature embedding from well log data. Pet. Sci. Technol. https://doi.org/10.1080/10916466.2023.2223623 (2023).
39. Ngo, G., Beard, R. & Chandra, R. Evolutionary bagging for ensemble learning. Neurocomputing 510, 1–14 (2022).
40. Zhang, Q., Tsang, E. C., He, Q. & Guo, Y. Ensemble of kernel extreme learning machine based elimination optimization for multi-label classification. Knowl. Based Syst. 278, 10817 (2023).
41. Klikowski, J. & Woźniak, M. Deterministic sampling classifier with weighted bagging for drifted imbalanced data stream classification. Appl. Soft Comput. 122, 108855 (2022).
42. Tavana, P., Akraminia, M., Koochari, A. & Bagherifard, A. An efficient ensemble method for detecting spinal curvature type using deep transfer learning and soft voting classifier. Expert Syst. Appl. 213, 119290 (2023).
43. Cui, S., Yin, Y., Wang, D., Li, Z. & Wang, Y. A stacking-based ensemble learning method for earthquake casualty prediction. Appl. Soft Comput. 101, 107038 (2021).
44. Mohammed, A. & Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ. Comput. Inform. Sci. 35, 757–774 (2023).
45. Sesmero, M. P., Ledezma, A. I. & Sanchis, A. Generating ensembles of heterogeneous classifiers using stacked generalization. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 5, 21–34 (2015).
46. Dong, S.-Q. et al. How to improve machine learning models for lithofacies identification by practical and novel ensemble strategy and principles. Pet. Sci. 20, 733–752 (2023).
47. Ntibahanana, M., Luemba, M. & Tondozi, K. Enhancing reservoir porosity prediction from acoustic impedance and lithofacies using a weighted ensemble deep learning approach. Appl. Comput. Geosci. 16, 100106 (2022).
48. Huang, C. et al. A feature weighted support vector machine and articial neural network algorithm for academic course perfor-
mance prediction. Neural Comput. Appl. 35, 11517–11529 (2023).
49. Ding, Y., Jia, M., Zhuang, J. & Ding, P. Deep imbalanced regression using cost-sensitive learning and deep feature transfer for
bearing remaining useful life estimation. Appl. So Comput. 127, 109271 (2022).
50. Lui, T. C., Gregory, D. D., Anderson, M., Lee, W.-S. & Cowling, S. A. Applying machine learning methods to predict geology using
soil sample geochemistry. Appl. Comput. Geosci. 16, 100094 (2022).
Content courtesy of Springer Nature, terms of use apply. Rights reserved
14
Vol:.(1234567890)
Scientic Reports | (2023) 13:21622 | https://doi.org/10.1038/s41598-023-49080-7
www.nature.com/scientificreports/
51. Valencia, O., Ortiz, M., Ruiz, S., Sanchez, M. & Sarabia, L. Simultaneous class-modelling in chemometrics: A generalization of
Partial Least Squares class modelling for more than two classes by using error correcting output code matrices. Chemom. Intell.
Lab. Syst. 227, 104614 (2022).
52. Santos, L. I. et al. Decision tree and articial immune systems for stroke prediction in imbalanced data. Expert Syst. Appl. 191,
116221 (2022).
53. Leng, Q., Guo, J., Jiao, E., Meng, X. & Wang, C. NanBDOS: Adaptive and parameter-free borderline oversampling via natural
neighbor search for class-imbalance learning. Knowl. Based Syst. 274, 110665 (2023).
54. Fernández, A. et al. Learning from Imbalanced Data Sets Vol. 10 (Springer, 2018).
55. Lango, M. & Stefanowski, J. What makes multiclass imbalanced problems dicult? An experimental study. Expert Syst. Appl. 199,
116962 (2022).
56. Volk, O., Ratnovsky, A., Naali, S. & Singer, G. Classication of tracheal stenosis with asymmetric misclassication errors from
EMG signals using an adaptive cost-sensitive learning method. Biomed. Signal Process. Control 85, 104962 (2023).
57. Chamseddine, E., Mansouri, N., Soui, M. & Abed, M. Handling class imbalance in COVID-19 chest X-ray images classication:
Using SMOTE and weighted loss. Appl. So Comput. 129, 109588 (2022).
58. Zhang, C., Tan, K. C., Li, H. & Hong, G. S. A cost-sensitive deep belief network for imbalanced classication. IEEE Trans. Neural
Netw. Learn. Syst. 30, 109–122 (2018).
59. Tang, J., Hou, Z., Yu, X., Fu, S. & Tian, Y. Multi-view cost-sensitive kernel learning for imbalanced classication problem. Neuro
computing 552, 126562 (2023).
60. Yi-Fan, L. et al. A novel error-correcting output codes based on genetic programming and ternary digit operators. Pattern Recognit.
110, 107642 (2021).
61. Wang, Y.-C. & Cheng, C.-H. A multiple combined method for rebalancing medical data with class imbalances. Comput. Biol. Med.
134, 104527 (2021).
62. Young, M. M., Himmelreich, J., Honcharov, D. & Soundarajan, S. Using articial intelligence to identify administrative errors in
unemployment insurance. Gov. Inform. Q. 39, 101758 (2022).
63. Mohammadi, M.-R. et al. Modeling hydrogen solubility in hydrocarbons using extreme gradient boosting and equations of state.
Sci. Rep. 11, 17911 (2021).
64. Riazi, M. et al. Modelling rate of penetration in drilling operations using RBF, MLP, LSSVM, and DT models. Sci. Rep. 12, 11650
(2022).
65. Ghazwani, M. & Begum, M. Y. Computational intelligence modeling of hyoscine drug solubility and solvent density in supercritical
processing: Gradient boosting, extra trees, and random forest models. Sci. Rep. 13, 10046 (2023).
66. Hartonen, T. et al. Nationwide health, socio-economic and genetic predictors of COVID-19 vaccination status in Finland. Nat.
Hum. Behav. 7, 1069–1083 (2023).
67. Raihan, M. J., Khan, M.A.-M., Kee, S.-H. & Nahid, A.-A. Detection of the chronic kidney disease using XGBoost classier and
explaining the inuence of the attributes on the model using SHAP. Sci. Rep. 13, 6263 (2023).
68. Khairy, R. S., Hussein, A. & ALRikabi, H.,. e detection of counterfeit banknotes using ensemble learning techniques of AdaBoost
and voting. Int. J. Intell. Eng. and Syst. 14, 326–339 (2021).
69. Loganathan, S., Geetha, C., Nazaren, A. R. & Fernandez, M. H. F. Autism spectrum disorder detection and classication using
chaotic optimization based Bi-GRU network: An weighted average ensemble model. Expert Syst. Appl. 230, 120613 (2023).
70. Osamor, V. C. & Okezie, A. F. Enhancing the weighted voting ensemble algorithm for tuberculosis predictive diagnosis. Sci. Rep.
11, 14806 (2021).
71. Jamshidi Gohari, M. S., Emami Niri, M. & Ghiasi-Freez, J. Improving permeability estimation of carbonate rocks using extracted
pore network parameters: a gas eld case study. Acta Geophy. 69, 509–527 (2021).
72. Ma, H., Yan, J., Li, Y., Zhang, C. & Lin, H. Desert seismic random noise reduction based on LDA eective signal detection. Acta
Geophys. 67, 109–121 (2019).
73. Yin, X. et al. Strength of stacking technique of ensemble learning in rockburst prediction with imbalanced data: Comparison of
eight single and ensemble models. Nat. Resour. Res. 30, 1795–1815 (2021).
74. Doan, Q. H., Mai, S.-H., Do, Q. T. & ai, D.-K. A cluster-based data splitting method for small sample and class imbalance
problems in impact damage classication. Appl. So Comput. 120, 108628 (2022).
75. Wernicke, J., Seltmann, C. T., Wenzel, R., Becker, C. & Koerner, M. Forest canopy stratication based on fused, imbalanced and
collinear LiDAR and Sentinel-2 metrics. Remote Sens. Environ. 279, 113134 (2022).
76. Zhang, X., Akber, M. Z. & Zheng, W. Predicting the slump of industrially produced concrete using machine learning: A multiclass
classication approach. J. Build. Eng. 58, 104997 (2022).
77. B enchou, M., Matzner-Lober, E., Molinari, N., Jannot, A.-S. & Soyer, P. Interobserver agreement issues in radiology. Diagn. Inter.
Imaging 101, 639–641 (2020).
78. Jiang, G., Yue, R., He, Q., Xie, P. & Li, X. Imbalanced learning for wind turbine blade icing detection via spatio-temporal attention
model with a self-adaptive weight loss function. Expert Syst. Appl. 229, 120428 (2023).
Author contributions
MSJG: investigation, visualization, writing-original draft, conceptualization, validation, modeling. MEN: writing-review and editing, methodology, validation, supervision, data curation. SS: writing-review and editing, validation. JG-F: writing-review and editing, validation, methodology.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to M.E.N.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access is article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. e images or other third party material in this
article are included in the articles Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.
© e Author(s) 2023