Machine Learning in Rock Facies Classification: An Application of XGBoost
Licheng Zhang, Cheng Zhan
Summary
Big data analysis has drawn much attention across different industries. Geoscientists, meanwhile, have been analyzing voluminous data for many years, without even bragging about how big it is. In this paper, we present an application of machine learning, specifically the gradient boosting method, to rock facies classification based on certain geological features and constraints. Gradient boosting is both a popular and an effective approach to classification, which produces a prediction model as an ensemble of weak models, typically decision trees. The key to making gradient boosting work successfully lies in introducing a customized objective function and tuning the parameters iteratively based on cross-validation. Our model achieves a rather high F1 score when evaluated on data from two test wells.
Introduction and Background
Machine learning is emerging as a very promising area and should make the work of future geoscientists more fun and less tedious. Furthermore, as neural network technology matures, geological interpretation could become more automatic and accurate; for example, in the Gulf of Mexico region, salt body characterization (a challenge for velocity model building) might be elevated to the next level of higher-quality seismic images.
There are a few decision-tree-based algorithms for handling classification problems. One is the random forest, which operates by constructing multiple decision trees to reduce the possible variance error of each model. Another widely used technique is gradient boosting, which has been applied successfully in many Kaggle competitions. This method focuses on where the model performs poorly and improves those areas by introducing a new learner that compensates for the existing model.
This facies classification problem was originally introduced in The Leading Edge by Brendon Hall in October 2016 (Hall, 2016). It has evolved into the first machine learning contest in the SEG; more information can be found at https://github.com/seg/2016-ml-contest. At the time we submitted this paper, our ranking was 5th on the leaderboard.
The data come from the Council Grove gas reservoir in southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in southwestern Kansas. The training data come from ten wells (4149 examples), each example consisting of seven predictor variables and a rock facies (class) label; the validation (test) data (830 examples from two wells) have the same seven predictor variables in the feature vector. Facies are based on the examination of cores from nine wells, taken vertically at half-foot intervals. Predictor variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot rate.
The seven predictor variables are:

Five wireline log measurements:
- Gamma ray (GR)
- Resistivity logging (ILD_log10)
- Photoelectric effect (PE)
- Neutron-density porosity difference (DeltaPHI)
- Average neutron-density porosity (PHIND)

Two geologic constraints:
- Nonmarine indicator (NM_M)
- Relative position (RELPOS)
The nine discrete facies (classes of rocks), their abbreviated labels, and the corresponding adjacent facies are listed in Table 1 below. The facies gradually blend into one another, and some neighboring facies are rather close, so mislabeling between neighbors can occur.
Table 1: Facies classes, labels, and adjacent facies

Class of rocks             | Facies | Label | Adjacent facies
Nonmarine sandstone        | 1      | SS    | 2
Nonmarine coarse siltstone | 2      | CSiS  | 1,3
Nonmarine fine siltstone   | 3      | FSiS  | 2
Marine siltstone and shale | 4      | SiSh  | 5
Mudstone                   | 5      | MS    | 4,6
Wackestone                 | 6      | WS    | 5,7
Dolomite                   | 7      | D     | 6,8
Packstone-grainstone       | 8      | PS    | 6,7,9
Phylloid-algal bafflestone | 9      | BS    | 7,8
Methodology
Generally speaking, there are three types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. The application in this paper belongs to the category of supervised learning. This type of algorithm involves a target/outcome variable (dependent variable) that is to be predicted from a given set of predictors (independent variables, usually called features). Using these feature variables, a function that maps inputs to the desired outputs is learned, and training continues until the model achieves a satisfactory level of accuracy on the training data. Examples of supervised learning include regression, decision trees, random forests, KNN, and logistic regression.
The algorithm adopted here is XGBoost (eXtreme Gradient Boosting), an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. It was created and developed by Tianqi Chen, a Ph.D. student at the University of Washington (Chen and Guestrin, 2016). More details about XGBoost can be found at http://dmlc.cs.washington.edu/xgboost.html.
The basic idea of boosting is to combine hundreds of simple trees of low accuracy to build a more accurate model. Every iteration generates a new tree for the model. As for how a new tree is created, there are many methods; a famous one is the gradient boosting machine proposed by Friedman (Friedman, 2001). It uses gradient descent to generate the new tree based on all previous trees, driving the objective function toward its minimum.
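To make this idea concrete, here is a minimal, illustrative sketch (not the paper's code) of gradient boosting with squared loss, where each new shallow tree is fit to the negative gradient of the loss, i.e., the residuals of the current ensemble; the function name and default values are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Illustrative gradient boosting loop for squared loss."""
    prediction = np.full(len(y), y.mean())   # start from a constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction           # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # new tree corrects previous errors
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees
```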
An objective function usually consists of two parts, a training loss and a regularization term:
$$ Obj(\Theta) = L(\Theta) + \Omega(\Theta) \qquad (1) $$
where $L$ is the training loss function and $\Omega$ is the regularization term. The training loss measures how well the model performs on the training data, while the regularization term controls the complexity of the model, which helps prevent overfitting. The complexity of each tree is defined as follows:
$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} \qquad (2) $$
There is, of course, more than one way to define the
complexity, and this particular one works well in practice.
The objective function in XGBoost is then defined as:
$$ obj = \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^{2}\right] + \gamma T \qquad (3) $$
More details about the notation can be found at http://xgboost.readthedocs.io/en/latest/model.html.
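Following that page's notation ($T$ is the number of leaves, $w_j$ the weight of leaf $j$, and $G_j$ and $H_j$ the sums of the first- and second-order gradients of the loss over the instances in leaf $j$), one useful extra step is to minimize (3) with respect to each $w_j$, which yields the optimal leaf weight and objective value:

$$ w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T $$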
Data Analysis and Model Selection
Before building any machine learning model, it is necessary
to perform some exploratory analysis and cleanup.
First, we examine the data that will be used to train the classifier. The data consist of five wireline log measures, two indicator variables, and one facies label at half-foot intervals. In machine learning terminology, each depth sample forms a feature vector that maps a set of 'features' (the log measures) to a class (the facies type).
The pandas library in Python is a great tool for loading data into a dataframe structure for further manipulation.
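As a minimal sketch, loading the data might look like the following; the file name `facies_vectors.csv` and the `Facies` column follow the SEG 2016-ml-contest repository and are assumptions here:

```python
import pandas as pd

# Load the training data into a dataframe; the file name follows the
# SEG 2016-ml-contest repository and is an assumption here.
training_data = pd.read_csv('facies_vectors.csv')

print(training_data.describe())                  # per-column summary statistics
print(training_data['Facies'].value_counts())    # facies class distribution
```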
Then some basic statistical analyses are produced: for example, the distribution of the classes (Figure 1a), a heatmap of the features (Figure 1b), which gives a correlation plot for observing relationships between variables, and log plots for the wells (Figure 1c). These figures are the initial building blocks for exploring the data, and the visualization libraries used are seaborn and matplotlib.
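A possible sketch of the exploratory plots in Figure 1, assuming the `training_data` dataframe above and the contest's column names:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# The seven predictor variables (contest column names, an assumption here).
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']

# (a) distribution of the facies classes
sns.countplot(x='Facies', data=training_data)
plt.show()

# (b) correlation heatmap of the predictor variables
sns.heatmap(training_data[features].corr(), annot=True, cmap='coolwarm')
plt.show()
```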
Figure 1: (a) Distribution of facies; (b) heatmap of features; (c) log plots for wells SHRIMPLIN and SHANKLE.
The next step is data preparation and model selection. The goal is to build a reliable model that predicts the Y values (facies) from the X values (the seven predictor variables). To improve XGBoost's speed over many iterations, we convert the data to the DMatrix format. This process sorts the data up front to optimize tree construction in XGBoost and reduces the runtime correspondingly, which is especially helpful when learning with a large number of training examples.
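A minimal sketch of this conversion, reusing `training_data` and `features` from above; note that XGBoost's multi:softmax objective expects zero-based class labels, hence the shift of the 1-9 facies codes:

```python
import xgboost as xgb

X = training_data[features].values
y = training_data['Facies'].values - 1   # facies labels 1-9 -> 0-8

# DMatrix pre-sorts the data once, speeding up tree construction
# across the many boosting iterations.
dtrain = xgb.DMatrix(X, label=y, feature_names=features)
```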
On the other hand, in order to quantify the quality of the models, certain metrics are needed. We use accuracy metrics for judging the models. A simple guide to the terminology (e.g., accuracy, precision, recall) can be found at http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/.
There are several main parameters to tune to get a good model for this rock facies classification problem; they are summarized in Table 2, and a sketch of a corresponding parameter dictionary follows the table.
Table 2: Main parameters

learning_rate: Step-size shrinkage employed to prevent overfitting; it shrinks the feature weights to make the boosting process more conservative.
n_estimators: The number of trees.
max_depth: Maximum depth of a tree; increasing this value makes the model more complex (and more likely to overfit).
min_child_weight: Minimum sum of instance weight needed in a child.
gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
subsample: Subsample ratio of the training instances.
colsample_bytree: Subsample ratio of features when constructing each tree.
objective='multi:softmax': Sets XGBoost to perform multiclass classification using the softmax objective.
nthread: Number of parallel threads used to run XGBoost.
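For illustration, the table's entries map onto a parameter dictionary like the one below; the numeric values are common starting points, not the authors' final tuned settings:

```python
params = {
    'objective': 'multi:softmax',   # multiclass classification via softmax
    'num_class': 9,                 # nine facies classes
    'eta': 0.1,                     # learning rate (step-size shrinkage)
    'max_depth': 6,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'nthread': 4,
}
# num_boost_round plays the role of n_estimators in the native API.
model = xgb.train(params, dtrain, num_boost_round=200)
```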
Algorithm parameter tuning is a critical process for achieving the optimal performance of an algorithm and needs to be carefully justified before moving into production. Our workflow for optimizing the parameters is as follows:
1. Pick initial parameters (e.g., default values).
2. Tune the tree-based parameters (e.g., adjust max_depth and min_child_weight simultaneously).
3. Calibrate gamma, subsample, and colsample_bytree.
4. Balance the regularization parameters.
5. Reduce the learning rate and update the number of trees.
We adopt this flow because of the nature of the XGBoost algorithm, which is robust enough not to overfit as trees are added, but a high learning rate can degrade its ability to predict new test data. As we reduce the learning rate and increase the number of trees, the computation becomes expensive and can take a long time on standard personal computers.
Grid search is a typical approach to parameter tuning that methodically builds and evaluates a model for each combination of parameters in a specified grid. For instance, the code below examines different combinations of 'max_depth' and 'min_child_weight'.
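The original code block did not survive extraction; a plausible reconstruction using scikit-learn's GridSearchCV with the XGBoost sklearn wrapper (grid values are illustrative) is:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_child_weight': [1, 3, 5],
}
search = GridSearchCV(
    XGBClassifier(objective='multi:softmax', n_estimators=200),
    param_grid,
    scoring='f1_micro',   # micro-averaged F1, matching the contest metric
    cv=5,
)
search.fit(X, y)          # X, y as prepared above
print(search.best_params_, search.best_score_)
```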
Another way to tailor parameters is random search, which complements the predefined grid search procedure exploited here. In this case, we did not find that random search benefited the final results much.
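For comparison, a random search over the same parameters might look like the sketch below; the distributions and iteration count are illustrative:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

random_search = RandomizedSearchCV(
    XGBClassifier(objective='multi:softmax', n_estimators=200),
    param_distributions={'max_depth': randint(3, 10),
                         'min_child_weight': randint(1, 6)},
    n_iter=20,            # number of random parameter combinations tried
    scoring='f1_micro',
    cv=5,
)
random_search.fit(X, y)
```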
After several iterations, the final model is built. Cross-validation is conducted to assess the performance before applying the model to the two blind test wells. The best accuracy (F1 score) we have obtained so far is 0.564, ranked 5th in the contest. The feature importance plot of the model is shown below. Importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model: the more an attribute is used to make key decisions within the decision trees, the higher its relative importance.
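Such a plot can be reproduced with XGBoost's built-in helper, assuming the booster `model` trained in the earlier sketch:

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# F-scores count how often each feature is used to split across all trees.
xgb.plot_importance(model)
plt.show()
```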
Conclusions
We have successfully applied the gradient boosting method to a rock facies classification problem. A potential application of such predictions is validating the velocity model for seismic data. This work can be viewed as an early endeavor toward more machine learning applications in the oil and gas sector in the near future.
Acknowledgments
The authors would like to thank Ted Petrou, Aiqun Huang
and Zhongyang Dong for discussion. We also thank Yan Xu
for reviewing the manuscript.
References
Chen, T. & Guestrin, C., 2016. XGBoost: a scalable tree boosting system. arXiv preprint arXiv:1603.02754.
Friedman, J. H., 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), pp. 1189-1232.
Hall, B., 2016. Facies classification using machine learning. The Leading Edge, 35(10), pp. 906-909.
Natekin, A. & Knoll, A., 2013. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7, p. 21.