Conference Paper

XGBoost: A Scalable Tree Boosting System

Authors: Tianqi Chen, Carlos Guestrin

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

... Also, there could be cases where the ground truth is not entirely accurate. Taking these considerations into account, we choose XGBoost [43], which handles sparsity when building tree structures and is robust to outliers, as the algorithm for this case. We now provide a brief explanation of the XGBoost framework. ...
... after the t-th iteration that constructs a decision tree [43]. The shrinkage parameter η ∈ [0, 1], by which each of the base learner functions is multiplied, is also considered. ...
... The shrinkage parameter η ∈ [0, 1], by which each of the base learner functions is multiplied, is also considered. This parameter strengthens the training of an XGB model and improves its performance, since it spreads the influence of each base learner function over future trees [43]. Column subsampling is also applied to speed up training and address overfitting. ...
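As a minimal illustration of these two controls, here is a hedged sketch using the xgboost scikit-learn API; the data, task, and parameter values are placeholders, not details from the cited study:

```python
# Illustrative sketch: shrinkage (learning_rate, i.e. eta) and column
# subsampling in the xgboost scikit-learn API. Synthetic data only.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = XGBRegressor(
    n_estimators=200,      # number of boosted trees (base learners)
    learning_rate=0.1,     # shrinkage eta in [0, 1]: scales each tree's contribution
    colsample_bytree=0.8,  # column subsampling: fraction of features per tree
)
model.fit(X, y)
print(model.predict(X[:5]))
```

Smaller learning rates spread the fit across more trees, which is exactly the dispersal of each base learner's influence that the excerpt describes.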
Article
Full-text available
Water mapping for satellite imagery has been an active research field for many applications, in particular natural disasters such as floods. Synthetic Aperture Radar (SAR) provides high-resolution imagery without constraints on weather conditions. The single-date SAR approach is less accurate than the multi-temporal approach but can produce results more promptly. This paper proposes novel segmentation schemes that are designed to process both a target superpixel and its surrounding ones as the input for machine learning. Mixture-based Superpixel-Shallow Deit-Ti/XGBoost (MISP-SDT/XGB) schemes are devised to generate, annotate, and classify superpixels, and perform the land/water segmentation of SAR imagery. These schemes are applied to Sentinel-1 SAR data to examine segmentation performance. Single/mask/neighborhood models and single/neighborhood models are introduced in the MISP-SDT scheme and the MISP-XGB scheme, respectively. The effects of the contextual information about the target and its neighbor superpixels on segmentation performance are assessed. Regarding polarization, it is shown that the VH mode produces more encouraging results than the VV, which is consistent with previous studies. Also, under our MISP-SDT/XGB schemes, the neighborhood models show better performance than FCNN models. Overall, the neighborhood model gives better performance than the single model. Results from attention maps and feature importance scores show that neighbor regions are looked at or used by the algorithms in the neighborhood models. Our findings suggest that under our schemes, the contextual information has positive effects on land/water segmentation.
... It typically includes decision trees, and its fundamental principle is to make predictions with the created models, calculate the gradient of the error, and attempt to minimize the errors. This process is repeated to build a more robust model (Chen & Guestrin, 2016). In other words, Gradient Boosting is a tree-based algorithm (Friedman, 1999). ...
... At each step, a new decision tree is created to correct the errors made by the previously created trees (Natekin & Knoll, 2013). By calculating the gradient on the errors, new trees are generated in the opposite direction of the gradient, creating a more robust model (Chen & Guestrin, 2016). ...
... By using residual errors to improve step-by-step, each new tree aims to correct the errors of the previous ones. This methodology is a powerful tool, especially in classification and regression problems, providing high accuracy (Chen & Guestrin, 2016). ...
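To make the boosting loop described in these excerpts concrete, here is a minimal from-scratch sketch for squared error, where the negative gradient equals the residual; it is illustrative only, not the cited implementation:

```python
# Minimal from-scratch gradient boosting for squared error: each tree is
# fit to the residual, i.e. the negative gradient of the loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

eta, trees = 0.1, []
pred = np.full_like(y, y.mean())      # initial constant model
for _ in range(100):
    residual = y - pred               # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    pred += eta * tree.predict(X)     # step opposite the gradient

print("train MSE:", np.mean((y - pred) ** 2))
```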
Book
Full-text available
This book, titled Classification Handbook for Beginners, aims to provide a comprehensive understanding of various classification algorithms used in machine learning. The book is divided into eight distinct sections, each focusing on different models and approaches for classification, ranging from basic concepts to practical implementations in Python. Below, you will find an overview of the main topics covered in each section.

Section 1: Business Intelligence and Data Mining. The first section lays the groundwork by introducing key concepts in business intelligence and data mining. It explores the relationship between data, information, knowledge, and wisdom, as well as the role of business intelligence in decision-making processes. Additionally, this section introduces classification, its performance evaluation, and compatibility challenges.

Section 2: Linear Models. This section provides an in-depth look at linear classification models. It covers Support Vector Machines (SVM), Ridge Classifiers, and Lasso Regression, discussing their similarities, differences, hyperparameters, and scenarios in which they are best used. Detailed guidance on hyperparameter tuning for each algorithm is also provided.

Section 3: Probabilistic Models. The third section delves into probabilistic classification methods, such as Naive Bayes and Logistic Regression. It also discusses Hidden Markov Models (HMMs) and their use cases. Each method is compared to others, detailing its strengths, weaknesses, and how it applies to different types of data.

Section 4: Instance-Based Models. Section four explores instance-based learning approaches, such as K-Nearest Neighbors (KNN) and Radius Neighbors Classifiers. It discusses the strengths and limitations of these methods, along with scenarios where their use would be most appropriate. The section also includes a discussion on Lazy models, which are forms of instance-based learning.

Section 5: Decision Trees. This section explains decision tree-based methods, including well-known algorithms like ID3, C4.5, C5.0, and CART. Each algorithm is explored with a focus on how it creates decision boundaries, what makes it suitable for specific tasks, and which hyperparameters are crucial for optimizing performance.

Section 6: Neural Network Models. The sixth section introduces neural network models, specifically Perceptron and Multi-layer Perceptron (MLP). It covers how these models can be used for classification tasks and provides a comparison between them, highlighting their effectiveness in complex data structures.

Section 7: Ensemble Classifiers. In this section, the book focuses on ensemble learning methods, such as Random Forests, Gradient Boosted Trees, and Voting Classifiers. It explains how combining multiple classifiers can enhance overall model performance and tackle challenges like overfitting and imbalanced data.

Section 8: Implementing Classification Algorithms with Python. The final section of the book presents practical implementations of the discussed algorithms using Python. It explains how to work with datasets, train/test splits, and cross-validation techniques. Additionally, it covers the process of optimizing model parameters and automating batch classification using multiple algorithms.

Overall, Classification Handbook for Beginners serves as a valuable resource for readers at all levels, from those just beginning their journey in data science to experienced practitioners. By combining theoretical explanations with hands-on Python applications, the book provides a balanced learning experience, equipping readers to apply classification algorithms effectively in real-world projects.
... Our findings suggest that sensor data can be calibrated using a linear relationship with the reference Davis AirLink sensor data. This aligns with previous studies [25][26][27] that reported high correlation coefficients when calibrating Plantower sensors using reference data. However, acknowledging two critical limitations of low-cost sensors, including the Plantower model, is crucial. ...
... Micrometeorological factors (e.g., air temperature, relative humidity, wind, pollutants) can influence PM2.5 sensor measurements in non-linear ways. Neural networks capture these complex interactions, making PM2.5 sensor calibration more reliable [26]. Neural networks are powerful tools, but their success requires large practical training datasets, with high computational demands and lengthy execution times. ...
Article
Affordable IoT PM2.5 sensors, enabled by the Internet of Things, offer new ways to monitor air quality. However, concerns exist about their data accuracy. This study aimed (1) to investigate the low-cost PM sensor's performance under various outdoor ambient circumstances and (2) to evaluate seven calibration methods, which include decision trees, gradient-boosted trees, linear regression, nearest neighbors, neural networks, random forests, and the Gaussian Process. The Davis AirLink was used as a reference to compare the Plantower PMS3003 sensor's performance. The data from the Plantower PMS3003 sensor were then compared to the Davis AirLink values using calibration curves created by machine learning algorithms. Calibration curves were generated using machine learning algorithms trained on sensor measurements collected in two Thai cities (Nakhon Si Thammarat and Phuket). Our results show that all machine learning methods outperformed traditional linear regression, with decision trees and neural networks demonstrating the most significant improvement. This research highlights the need for sensor calibration and the limitations of current calibration methods and paves the way for advancements in cloud-based calibration and machine learning for improved data accuracy in IoT PM2.5 sensor technology. Doi: 10.28991/ESJ-2024-08-06-08
... The findings suggest that machine learning models, especially ensemble models, are effective tools for identifying evolving adware threats. Future work may include exploring real-time detection systems and the use of deep learning models to further improve detection accuracy [3] [5]. ...
... This ability to handle large datasets with numerous features contributes to its effectiveness, although its performance does not quite match that of XGBoost in terms of precision and recall. The use of decision trees makes Random Forest less sensitive to the noise in the data, but its ability to improve gradually through boosting is limited compared to XGBoost [5]. ...
Preprint
Full-text available
The increasing growth of mobile adware has turned it into one of the most critical cybersecurity challenges [9], [13]. It causes serious disruption to user experiences and grossly violates privacy by gathering sensitive information without the consent of users. The rapid proliferation of adware variants, abetted by sophisticated evasion techniques, has rendered traditional signature-based detection methods ineffective. This study classifies mobile adware variants using the CIC-AndMal2017 dataset in a machine learning-based approach. Logistic Regression, Random Forest, and XGBoost are implemented to detect adware and evaluate the performance of machine learning algorithms [2]. The results show that XGBoost and Random Forest achieve very good detection accuracy, precision, and recall. XGBoost outperforms the others due to its gradient boosting approach. Logistic Regression, though less effective, provides a comparison baseline. This paper also discusses practical challenges in the deployment of machine learning models for real-time adware detection, such as computational efficiency and scalability. The findings indicate the potential of machine learning in improving mobile security through effective detection of evolving adware threats. Future work will explore deep learning methods and real-time detection systems for further improvements.
... Numerous updates to gradient-boosted trees have improved their generalization capability, as the original model suffers from overfitting. In this study, we utilized XGBoost [17], an open-source software library that is known for its computational efficiency and performance, for model regression. ...
... To determine optimal XGBoost parameters, we used grid search with in-training-set five-fold cross-validation. Following the formalism in the XGBoost paper [17], the overall cost function can be formulated as ...
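The excerpt's formula is elided; for reference, the regularized objective defined in the XGBoost paper, which this formalism refers to, is:

```latex
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

where $l$ is a differentiable convex loss, $T$ is the number of leaves in a tree, and $w$ is the vector of leaf weights.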
Article
Full-text available
Background The widespread adoption of knowledge‐based planning in radiation oncology clinics is hindered by the lack of data and the difficulty associated with sharing medical data. Purpose This study aims to assess the feasibility of mitigating this challenge through federated learning (FL): a centralized model trained with distributed datasets, while keeping data localized and private. Methods This concept was tested using 273 prostate 45 Gy plans. The cases were split into a training set with 220 cases and a validation set with 53 cases. The training set was further separated into 10 subsets to simulate treatment plans from different clinics. A gradient‐boosting model was used to predict bladder and rectum V30Gy, V35Gy, and V40Gy. The Federated Averaging algorithm was employed to aggregate the individual model weights from distributed datasets. Grid search with five‐fold in‐training‐set cross‐validation was implemented to tune model hyperparameters. Additionally, we evaluated the robustness of the FL approach by varying the distribution of the training set data in several scenarios, including different number of sites and imbalanced data across sites. Results The mean absolute error (MAE) for the FL model (4.7% ± 2.9%) is significantly lower than individual models trained separately (6.5% ± 4.9%, p < 0.001) and similar to a traditional centralized model (4.4% ± 2.8%, p = 0.14). The federated model is robust to the number of subsets, showing MAE of 4.7% ± 3.2%, 4.8% ± 3.1%, 4.8% ± 2.9%, 4.5% ± 2.8%, 4.9% ± 3.3%, and 4.8% ± 3.1% for 5, 10, 15, 20, 25, and 30 subsets, respectively. For the two imbalanced datasets, the FL model achieves MAEs of 4.5% ± 2.9% and 5.6% ± 4.0%, non‐inferior to the balanced data model. For all bladder and rectum metrics, the FL model significantly outperforms 36.7% of individual models. Conclusions This study demonstrates the potential advantages of implementing a federated model over training individual models: the proposed FL approach achieves similar prediction accuracy as a conventional model without requiring centralized data storage. Even when local models struggle to produce accurate predictions due to data scarcity, the federated model consistently maintains high performance.
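A hedged sketch of the grid search with five-fold in-training-set cross-validation described above might look as follows; the grid values, data shapes, and target are illustrative placeholders, not the study's actual settings:

```python
# Hedged sketch of hyperparameter tuning via grid search with five-fold
# cross-validation on the training set; all values are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(220, 8))        # e.g. 220 training plans
y_train = rng.uniform(0, 1, size=220)      # e.g. a dose-volume metric

param_grid = {
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 200],
}
search = GridSearchCV(XGBRegressor(), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)
print(search.best_params_)
```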
... We studied the importance of various features in UE using GB. The XGBoost package [38] was used. The default hyperparameters were used (max_depth = 3; learning_rate = 0.1; n_estimators = 100). ...
... For chemoinformatics tasks (generation of MD, MF, molecular graphs, and atom-wise features) the CDK [39] (Chemistry Development Kit) framework, version 2.7.1, was used. The training of models for predicting gas chromatographic RI was performed exactly as described in our previous works [7,8] (using the Deeplearning4j [40] framework, version 1.0.0-beta6; and the XGBoost [38] library, version 1.0.0). More detailed information is given in the previous works [7,8]. ...
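For illustration, a sketch that trains XGBoost with the default hyperparameters quoted above and reads out feature importances; the data are synthetic placeholders, not the study's features:

```python
# Sketch: XGBoost with the hyperparameters quoted above, then
# feature importances. Synthetic data, for illustration only.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

clf = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)
clf.fit(X, y)
for i, score in enumerate(clf.feature_importances_):
    print(f"feature {i}: importance {score:.3f}")
```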
Article
Full-text available
Mass spectral identification (in particular, in metabolomics) can be refined by comparing the observed and predicted properties of molecules, such as chromatographic retention. Significant advancements have been made in predicting these values using machine learning and deep learning. Usually, model predictions do not contain any indication of the possible error (uncertainty) or only one criterion is used for this purpose. The spread of predictions of several models included in the ensemble, and the molecular similarity of the considered molecule and the most “similar” molecule from the training set, are values that allow us to estimate the uncertainty. The Euclidean distance between vectors, calculated based on real-valued molecular descriptors, can be used for the assessment of molecular similarity. Another factor indicating uncertainty is the molecule’s belonging to one of the clusters (data set clustering). Together, all three factors can be used as features for the uncertainty assessment model. Classification models that predict whether a prediction belongs to the worst 15% were obtained. The area under the receiver operating curve value is in the range of 0.73–0.82 for the considered tasks: the prediction of retention indices in gas chromatography, retention times in liquid chromatography, and collision cross-sections in ion mobility spectroscopy.
... CellSexID, as illustrated in Figure 1, identifies sex-specific gene features using a committee of four machine learning classifiers: XGBoost [10][11][12], Support Vector Machine [13][14][15], Random Forest [16][17][18][19][20], and Logistic Regression [21][22][23]. The committee determines important features based on classifier consensus (Fig. 1a). ...
... These methods, grounded in robust statistical and mathematical principles, allow for a detailed exploration of gene significance in predictive modeling. In our implementation, Logistic Regression, Random Forest, and Support Vector Machine are implemented with the Python package scikit-learn [55], and XGBoost is implemented using the 'XGBoost' package [10]. ...
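The committee-consensus idea can be sketched as below; this is an illustrative assumption of how consensus feature ranking across the four classifiers might be implemented, not the CellSexID code itself:

```python
# Illustrative consensus feature ranking across four classifiers,
# in the spirit of the committee described above (not the actual code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))
y = (X[:, 4] - X[:, 9] > 0).astype(int)

def top_k(scores, k=5):
    # Indices of the k features with the largest absolute score
    return set(np.argsort(np.abs(scores))[::-1][:k])

votes = [
    top_k(XGBClassifier().fit(X, y).feature_importances_),
    top_k(RandomForestClassifier().fit(X, y).feature_importances_),
    top_k(LogisticRegression(max_iter=1000).fit(X, y).coef_[0]),
    top_k(LinearSVC(max_iter=5000).fit(X, y).coef_[0]),
]
# Keep features selected by at least three of the four models
consensus = {f for f in range(X.shape[1]) if sum(f in v for v in votes) >= 3}
print(sorted(consensus))
```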
Preprint
Full-text available
Cell tracking in chimeric models is essential yet challenging, particularly in developmental biology, regenerative medicine, and transplantation studies. Existing methods, such as fluorescent labeling and genetic barcoding, are technically demanding, costly, and often impractical for dynamic, heterogeneous tissues. To address these limitations, we propose a computational framework that leverages sex as a surrogate marker for cell tracking. Our approach uses a machine learning model trained on single-cell transcriptomic data to predict cell sex with high accuracy, enabling clear distinction between donor (male) and recipient (female) cells in sex-mismatched chimeric models. The model identifies specific genes critical for sex prediction and has been validated using public datasets and experimental flow sorting, confirming the biological relevance of the identified cell populations. Applied to skeletal muscle macrophages, our method revealed distinct transcriptional profiles associated with cellular origins. This pipeline offers a robust, cost-effective solution for cell tracking in chimeric models, advancing research in regenerative medicine and immunology by providing precise insights into cellular origins and therapeutic outcomes.
... XGBoost is an enhancement of gradient boosting decision trees (GBDT). Compared to traditional GBDT, XGBoost introduces several innovations, including regularization (to improve generalization), a second-order gradient approximation of the objective function (to enhance computational efficiency), column subsampling (to reduce noise and boost generalization), and handling of missing values (using a default direction at tree nodes, making it suitable for sparse datasets) [37]. MLP is a type of feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer. ...
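The gradient approximation mentioned above corresponds to the second-order Taylor expansion of the objective used in the XGBoost paper (after dropping constant terms):

```latex
\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\left[ g_i\, f_t(\mathbf{x}_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(\mathbf{x}_i) \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right),
\quad
h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right)
```

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to the previous round's prediction.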
Preprint
Lattice thermal conductivity, being integral to thermal transport properties, is indispensable to advancements in areas such as thermoelectric materials and thermal management. Traditional methods, such as Density Functional Theory and Molecular Dynamics, require significant computational resources, posing challenges to the high-throughput prediction of lattice thermal conductivity. Although AI-driven material science has achieved fruitful progress, the trade-off between accuracy and interpretability in machine learning continues to hinder further advancements. This study utilizes interpretable deep learning techniques to construct a rapid prediction framework that enables both qualitative assessments and quantitative predictions, accurately forecasting the thermal transport properties of three novel materials. Furthermore, interpretable deep learning offers analytically grounded physical models while integrating with sensitivity analysis to uncover deeper theoretical insights.
... First, an initial tree is used to generate estimates, and a second tree is then learned from the difference between the actual labels and the predictions of the previous step. Continuing in this manner, the algorithm's error can be effectively minimized (Chen & Guestrin, 2016). ...
Article
Nowadays, with the development of the Internet, publishing and sharing news has become very easy, and anyone can do it. Along with the increasing amount of information on the Internet, besides official information, fake news continues to rise and spreads quickly across the network. Fake news has become a major societal problem, negatively impacting all aspects of economic, cultural, and social life. How to prevent the spread of fake news online is an urgent issue today. To help readers recognize whether news is trustworthy, this paper proposes using natural language processing techniques and machine learning models to detect fake news in posts on the social network Facebook in the Vietnamese language. After the training process, the resulting model can predict whether news is real or fake. The model evaluation results are reported using popular machine learning metrics; the best-performing model on the dataset used in this paper is the Light Gradient Boosting Machine (LGBM), with an accuracy of 88.21%, compared to the other models evaluated.
... The use of traditional methods for evaluating particle-size fractions makes studies difficult when the focus is on monitoring large study areas (Liu et al., 2018). However, this obstacle can be overcome through advances in pedometric techniques and the use of proximal sensors, which can provide a large volume of data about a studied area and assist in soil protection, for example by estimating crucial attributes such as clay, silt, and sand content (Hu et al., 2018; Chen et al., 2016). ...
... Unlike logistic regression, decision trees can improve with each boosting iteration, ultimately resulting in a high level of predictive accuracy. Popular algorithms that implement gradient boosting include XGBoost (16) and LightGBM (17), which have been optimized for both performance and computational efficiency. This makes them suitable for large-scale datasets. ...
Article
In recent years, machine learning, and particularly deep learning, has shown remarkable potential in various fields, including medicine. Advanced techniques like convolutional neural networks and transformers have enabled high-performance predictions for complex problems, making machine learning a valuable tool in medical decision-making. From predicting postoperative complications to assessing disease risk, machine learning has been actively used to analyze patient data and assist healthcare professionals. However, the "black box" problem, wherein the internal workings of machine learning models are opaque and difficult to interpret, poses a significant challenge in medical applications. The lack of transparency may hinder trust and acceptance by clinicians and patients, making the development of explainable AI (XAI) techniques essential. XAI aims to provide both global and local explanations for machine learning models, offering insights into how predictions are made and which factors influence these outcomes. In this article, we explore various applications of machine learning in medicine, describe commonly used algorithms, and discuss explainable AI as a promising solution to enhance the interpretability of these models. By integrating explainability into machine learning, we aim to ensure its ethical and practical application in healthcare, ultimately improving patient outcomes and supporting personalized treatment strategies.
... XGBoost, proposed by Chen et al. [18], is an ensemble learning method based on the GBDT framework. By optimizing computational efficiency and model performance within the gradient boosting framework, XGBoost has been widely applied in various machine learning prediction tasks [82]. ...
Article
The Hardgrove grindability index (HGI) is a crucial indicator for assessing the grindability of coal, and accurate prediction of HGI is essential for improving the production efficiency and economic benefits of the coal industry. This study employed six decision tree-based machine learning models to predict the HGI values of 129 coal samples, with hyperparameter optimization performed using Optuna, and model interpretability analyzed using SHapley Additive exPlanations (SHAP). The results showed that the optimized natural gradient boosting (NGBoost) model outperformed all other models, which achieved the highest performance on the test set with a coefficient of determination (R²) of 0.9715, a mean absolute error (MAE) of 1.1507, and a root mean squared error (RMSE) of 1.4735. SHAP analysis further revealed that volatile matter (VM) contributed the most to the model's predictions, while pyrite (FeS₂) had the least contribution. This study provides an efficient machine learning approach for accurate HGI prediction, offering excellent predictive performance, interpretability, and application value.
... It randomly chooses features for each tree and combines the results of all trees to generate the ultimate forecast of the model. The second machine learning approach used in this research is XGBoost [36], a tree-based algorithm. XGBoost utilizes two fundamental principles of ensemble learning: bagging and boosting. ...
Article
This research aims to enhance financial fraud detection by integrating SHAP-Instance Weighting and Anchor Explainable AI with XGBoost, addressing challenges of class imbalance and model interpretability. The study extends SHAP values beyond feature importance to instance weighting, assigning higher weights to more influential instances. This focuses model learning on critical samples. It combines this with Anchor Explainable AI to generate interpretable if-then rules explaining model decisions. The approach is applied to a dataset of financial statements from the listed companies on the Stock Exchange of Thailand. The method significantly improves fraud detection performance, achieving perfect recall for fraudulent instances and substantial gains in accuracy while maintaining high precision. It effectively differentiates between non-fraudulent, fraudulent, and grey area cases. The generated rules provide transparent insights into model decisions, offering nuanced guidance for risk management and compliance. This research introduces instance weighting based on SHAP values as a novel concept in financial fraud detection. By simultaneously addressing class imbalance and interpretability, the integrated approach outperforms traditional methods and sets a new standard in the field. It provides a robust, explainable solution that reduces false positives and increases trust in fraud detection models. Doi: 10.28991/ESJ-2024-08-06-016
... The eXtreme gradient boosting (XGBoost) algorithm is most commonly used in embedding methods, as it is highly effective and easy to operate (Chen & Guestrin, 2016). XGBoost automatically selects and ranks features during the model training process, revealing the subset of features with the largest contribution (Ben Jabeur et al., 2022). ...
Article
Full-text available
The aim of this study is to develop an effective financial distress prediction (FDP) model that enhances companies’ understanding of their financial states. We propose a novel definition of multi-class financial status and construct a multi-class FDP model accordingly. The multi-class FDP model is constructed based on feature selection and a deep forest algorithm. We compare 11 different forms of feature selection and select the optimal approach for input into the model, with the deep forest algorithm as the classifier. We enrich the indicator set by incorporating financial network indicators to enhance the model’s informational output. The analysis centers on Chinese listed companies from 2007 to 2020 and yields four main results. (1) The proposed multi-class FDP model exhibits excellent prediction performance, particularly in identifying financial distress, light financial soundness, and moderate financial soundness. (2) XGBoost provides optimal results among the eleven forms of feature selection, with an accuracy of 92.02%. Feature extraction and hybrid feature selection also show promising results. (3) The deep forest model demonstrates better prediction performance compared to other benchmark models. (4) The inclusion of financial network indicators in the indicator set improves the prediction performance of the model. This paper introduces a novel perspective on defining multiple states of corporate finance and explores the impact of various forms of feature selection on a multi-class FDP model. Moreover, we apply the deep forest algorithm to a multi-class FDP model for the first time, broadening its application in enterprise financial risk management.
... Recent advances include the XGBoost algorithm [9], renowned for its regression and classification capabilities. Fine-tuning XGBoost with Bayesian optimization has achieved impressive predictions of the TBM advance rate (AR) [10]. ...
Article
Full-text available
Tunnel boring machines (TBMs) are essential for excavating metro tunnels, reducing disruptions to surrounding rock, and ensuring efficient progress. This study examines how machine learning (ML) models can predict key tunneling outcomes, focusing on making these predictions clearer. Specifically, the models aim to predict surface settlements (ground sinking) and the TBM's penetration rate (PR) during the Athens Metro Line 2 extension to Hellinikon. For surface settlements, four artificial neural networks (ANNs) were developed, achieving an accuracy of over 79% on average. For the TBM's PR, both an XGBoost Regressor (XGBR) and ANNs performed consistently well, offering reliable predictions. Above all, this study emphasizes model transparency. Using the SHapley Additive exPlanations (SHAP) library, it is possible to explain how models make decisions, highlighting key factors like geological conditions and TBM operating data. With SHAP's Tree Explainer and Deep Explainer techniques, the study reveals which parameters matter most, making ML models less of a "black box" and more practical for real-world metro tunnel projects. By showing how decisions are made, these tools give decision-makers confidence to rely on ML in complex tunneling operations.
... XGBoost, renowned for its efficiency and accuracy, excels in handling complex, high-dimensional data [10]. Its advanced gradient boosting implementation, regularization features, and ability to manage missing data ensure precise predictions and strong generalization. ...
Article
Full-text available
With the continuous development of online education, how to improve the teaching ability of new PE teachers through adaptive and effective online training has become an important research issue. Based on machine learning algorithms, this paper discusses the influence of different characteristics on the adaptability of online training for new physical education teachers and evaluates the application of various models in predicting teacher training outcomes. The results show that factors such as teachers' professional experience, past training experience, school type, and whether they train alongside colleagues have a significant impact on the adaptability of online training. By comparing a Logistic regression model, a KNN model, a random forest model, an XGBoost model, and a support vector machine model, this paper finds that the random forest model is best in prediction accuracy and generalization ability. This study provides data support and a theoretical basis for optimizing the online training of physical education teachers and can serve as a reference for educational managers formulating personalized training programs.
... These algorithms extract effective features from vast amounts of raw data, and dynamically adjust parameters during the training process, which enables accurate and stable results of various prediction tasks. XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm based on decision trees that has found widespread application in prediction tasks [38,41]. XGBoost uses the gradient boosting framework to optimize models, continuously training weak classifiers to improve the overall accuracy of the model, with each weak classifier being a decision tree. ...
... A RF model is an ensemble learning method that trains multiple, in this case 150, decision trees during the training process, combining their predictions to improve accuracy and reduce overfitting. Tree boosting methods, such as XGBoost [46], are commonly used to improve performance. ...
Preprint
Full-text available
Predicting molecular taste remains a significant challenge in food science. Here, we present FART (Flavor Analysis and Recognition Transformer), a chemical language model fine-tuned on the largest public dataset (15,031 compounds) of molecular tastants to date. When operating within confidence bounds, FART achieves 88.4% accuracy in predicting four fundamental taste categories—sweet, bitter, sour, and umami. Unlike previous approaches focused on binary classification, FART performs multi-class prediction while maintaining interpretability through gradient-based visualization of molecular features. The model identifies key structural elements driving taste properties and demonstrates utility in analyzing both known tastants and novel compounds. By making both the model and dataset publicly available, we provide the food science community with tools for rapid taste prediction, potentially accelerating the development of new flavor compounds and enabling systematic exploration of taste chemistry.
... XGBoost: XGBoost is highly regarded for its rapid processing capabilities and precision in model outcomes (Chen & Guestrin, 2016). This algorithm's flexibility renders it particularly compatible with the diverse nature of our dataset. ...
Article
Full-text available
The study examines the use of machine learning models to forecast attendance at sports stadiums, specifically analyzing National Football League (NFL) games from 2000 to 2019, with over 5,055 regular-season games. The models, including Linear Regression, Classification and Regression Trees (CART), Random Forest, CatBoost, and XGBoost, integrate a diverse set of variables such as team performance, economic indicators, stadium characteristics, and weather conditions. Each model's accuracy and effectiveness are assessed using five statistical metrics. With a Mean Absolute Error (MAE) of 0.02 and a Root Mean Squared Error (RMSE) of 0.04, the models display high precision in predicting stadium attendance. The coefficient of determination (R²) reaches 77.27% after optimization. These figures suggest that the models, particularly Random Forest and CatBoost, are highly effective in forecasting attendance rates for NFL games. Key influences on game attendance include factors like 'stadium_name,' 'personal_income,' 'stadium_age,' and 'home_club_age', which emerge as significant predictors. This study fills a theoretical gap in the limited research on the NFL and provides valuable insights for strategic planning and decision-making in professional sports management.
... Extreme gradient boosting (XGB) is a highly effective ensemble learning algorithm commonly used in various fields such as classification, regression, and ranking [44]. The algorithm is developed based on the principles of gradient boosting framework. ...
Article
Full-text available
Background and objectives Child undernutrition is a leading global health concern, especially in low- and middle-income developing countries, including Bangladesh. Thus, the objectives of this study are to develop an appropriate model for predicting the risk of undernutrition and to identify its influencing predictors among under-five children in Bangladesh using explainable machine learning algorithms. Materials and methods This study used the latest nationally representative cross-sectional Bangladesh demographic health survey (BDHS), 2017–18 data. The Boruta technique was implemented to identify the important predictors of undernutrition, and logistic regression, artificial neural network, random forest, and extreme gradient boosting (XGB) were adopted to predict undernutrition (stunting, wasting, and underweight) risk. The models' performance was evaluated through accuracy and area under the curve (AUC). Additionally, SHapley Additive exPlanations (SHAP) were employed to illustrate the influencing predictors of undernutrition. Results The XGB-based model outperformed the other models, with accuracy and AUC respectively 81.73% and 0.802 for stunting, 76.15% and 0.622 for wasting, and 79.13% and 0.712 for underweight. Moreover, the SHAP method demonstrated that the father's education, wealth, mother's education, BMI, birth interval, vitamin A, watching television, toilet facility, residence, and water source are the influential predictors of stunting; BMI, mother's education, and BCG vaccination of wasting; and father's education, wealth, mother's education, BMI, birth interval, toilet facility, breastfeeding, birth order, and residence of underweight. Conclusion The proposed integrated framework will be supportive as a method for selecting important predictors and predicting children who are at high risk of stunting, wasting, and underweight in Bangladesh.
... We now investigate how well the suggested data cleansing scheme works in classification tasks on the MWSM dataset. For the classification task, we consider several well-known methods such as ADF (Rahman and Islam 2022), XGBoost (Chen and Guestrin 2016), and MTS-LSTM (Gauch et al. 2021). We utilise the GitHub source code (ADF⁴, XGBoost⁵, and MTS-LSTM⁶) to implement the methods. ...
[Fig. 7: Average performance of imputation methods in terms of R² and RMSE for different missing ratios, models, and patterns.]
Article
Full-text available
Quality in meteorological data is one of the main issues for many real applications, including weather forecasting and developing irrigation models. The integrity of meteorological data may be compromised for several reasons, including the presence of corrupted and missing data introduced by interference and equipment malfunctioning. A decrease in data quality can significantly affect the efficiency of weather forecasting systems and irrigation models. Therefore, it is imperative to address corrupt and missing data prior to their utilisation. In this study, we introduce a Data Cleansing Scheme (DCS) for handling corrupt and missing values in a real meteorological dataset. DCS utilises a cutting-edge corrupt data identification method and a cutting-edge missing data imputation method to cleanse the meteorological data. The finalised dataset, free from any corrupt or missing values, is subsequently employed for data mining endeavours such as classification and knowledge discovery. Despite the negative impact of corrupt and missing values on the quality of data analysis results, this study demonstrates improved analysis quality when corrupt data are identified and missing values are imputed using DCS. We also evaluate DCS on two publicly available datasets. Our extensive empirical and statistical analyses indicate the effectiveness of DCS for improving meteorological data quality.
... The machine learning framework of Irvin et al. (2021) is implemented in Python (Van Rossum and Drake 2009) using the scikit-learn (Pedregosa et al. 2011) and xgboost (Chen and Guestrin 2016) packages. MDS gapfilling was performed using the REddyProc package (Wutzler et al. 2018). ...
Article
Full-text available
Rewetting peatlands is required to limit carbon dioxide (CO₂) emissions; however, raising the groundwater level (GWL) will strongly increase the chance of methane (CH₄) emissions, which has a higher radiative forcing than CO₂. Data sets of CH₄ from different rewetting strategies and natural systems are scarce, and quantification and an understanding of the main drivers of CH₄ emissions are needed to make effective peatland rewetting decisions. We present a large data set of CH₄ fluxes (FCH₄) measured across 16 sites with eddy covariance on Dutch peatlands. Sites were classified into six land uses, which also determined their vegetation and GWL range. We investigated the principal drivers of emissions and gapfilled the data using machine learning (ML) to derive annual totals. In addition, Shapley values were used to understand the importance of drivers to ML model predictions. The data showed the typical controls of FCH₄, where temperature and the GWL were the dominant factors; however, some relationships were dependent on land use and the vegetation present. There was a clear average increase in FCH₄ with increasing GWLs, with the highest emissions occurring at GWLs near the surface. Soil temperature was the single most important predictor for ML gapfilling, but the Shapley values revealed the multi-driver dependency of FCH₄. Mean annual FCH₄ totals across all land uses ranged from 90 ± 11 to 632 ± 65 kg CH₄ ha⁻¹ year⁻¹ and were on average highest for semi-natural land uses, followed by paludiculture, lake, wet grassland and pasture with water infiltration system. The mean annual flux was strongly correlated with the mean annual GWL (R² = 0.80). The greenhouse gas balance of our sites still needs to be estimated to determine the net climate impact; however, our results indicate that considerable rates of CO₂ uptake and long-term storage are required to fully offset the emissions of CH₄ from land uses with high GWLs.
... This study used thirteen different supervised classifiers, as follows: Support Vector Machine (SVM) [62], K-Nearest Neighbor (KNN) [63], Decision Tree (DT) [64], Random Forest (RF) [65], XGBoost (XGB) [66], Stochastic Gradient Descent (SGD) [67], Histogram Gradient Boosting (HGB) [68], Gaussian Naive Bayes (GNB) [69], Multi-Layer Perceptron (MLP) [70], Logistic Regression (LR) [71], Adaboost (ADA) [72], Bagging Trees (BAG) [73] and CatBoost (CAT) [74]. ...
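As an illustrative sketch (not the cited study's pipeline), several of the listed classifiers can be benchmarked under a common cross-validation loop; the data below are synthetic stand-ins for the real features and labels:

```python
# Hedged sketch: benchmarking a subset of the listed classifiers with a
# shared cross-validation loop on synthetic placeholder data.
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 14))          # stand-in for EEG/fNIRS features
y = rng.integers(0, 2, size=200)        # stand-in for arousal/valence labels

models = {
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "XGB": XGBClassifier(),
    "MLP": MLPClassifier(max_iter=500),
    "BAG": BaggingClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```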
Article
Full-text available
This paper presents the identification of arousal and valence during visual stimuli exposure using electroencephalograms (EEGs) and functional near-infrared spectroscopy (fNIRS) signals. Specifically, various images were shown to several volunteers to evoke different emotions defined by their level of arousal and valence, such as happiness, sadness, fear, and anger. Brain activity was recorded using the Emotiv EPOC X and NIRSport2 devices separately. The recorded signals were then processed and analyzed to identify the primary brain regions activated during the trials. Next, machine learning methods were employed to classify the evoked emotions, achieving the highest accuracy values of 71.3% for EEG data with a Multi-Layer Perceptron (MLP) and 64.0% for fNIRS data using a Bagging Trees (BAG) algorithm. This approach not only highlights the effectiveness of using EEG and fNIRS technologies but also provides insights into the complex interplay between different brain areas during emotional experiences. By leveraging these advanced acquisition techniques, this study aims to contribute to the broader field of affective neuroscience and improve the accuracy of emotion recognition systems. The findings could have significant implications for developing intelligent systems capable of more empathetic interactions with humans, enhancing applications in areas such as mental health, human–computer interactions, or adaptive learning environments, among others.
... Tree boosting is a highly effective and widely applied technique in ML. Chen and Guestrin [29] introduced XGBoost, a scalable end-to-end tree-boosting system that has been widely embraced by data scientists for its ability to deliver state-of-the-art results in many ML challenges. They developed an innovative algorithm designed to efficiently handle sparse data and introduced a weighted quantile sketch technique for approximate tree learning. ...
Article
Full-text available
Railway noise, stemming from various sources such as wheel/rail interactions, locomotives, and track machinery, affects both human health and the environment. This study explores the application of machine learning (ML) models to quantify tram noise at sharp curves, considering variables such as weather conditions, train speed, crowd levels, and running directions. Data collection is carried out on a tram line in Birmingham, using an iPhone 11 to record acoustic data at a sample rate of 48 kHz. The noise is categorized into impact noise, rolling noise, flanging noise, and squeal noise based on frequency and power spectrum characteristics. Random Forests (RF) and Extreme Gradient Boosting (XGBoost) are employed to predict the root mean square (R.M.S) values of each type of noise. Results indicate that XGBoost outperformed RF with an R² of up to 0.96 during k-fold cross-validation. This model provides a robust tool for railway operators to optimize noise control measures and contributes to improved compliance with environmental regulations and a better quality of life for communities near rail tracks.
... In the next step, we applied the XGBOOST algorithm to disentangle and quantify the contributions of key climatic drivers, i.e. global warming signal and the IPO, to seasonal precipitation trends at each station in Fig. 1. XGBOOST is a machine-learning algorithm based on gradient-boosted decision trees (Chen and Guestrin, 2016). XGBOOST is particularly well-suited for handling interactions between explanatory variables, including nonlinear relationships and potential collinearity. ...
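A hedged sketch of how such driver contributions can be ranked with XGBoost feature importances follows; the synthetic variables stand in for the real predictors and are not the study's data:

```python
# Illustrative sketch: ranking the importance of two climate drivers for
# a precipitation-trend target with XGBoost. Synthetic stand-ins only.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(11)
gw_signal = rng.normal(size=500)        # stand-in: global warming signal
ipo_index = rng.normal(size=500)        # stand-in: IPO index
precip_trend = 0.7 * gw_signal - 0.3 * ipo_index + rng.normal(scale=0.2, size=500)

X = np.column_stack([gw_signal, ipo_index])
model = XGBRegressor(n_estimators=200).fit(X, precip_trend)
for name, score in zip(["global_warming", "IPO"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```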
Preprint
Full-text available
Global warming is a significant challenge of the 21st century, driving notable changes in weather patterns. On the other hand, the Interdecadal Pacific Oscillation (IPO) is a remarkable climatic mode of variability that impacts interdecadal climate patterns and the rate of global warming. This study introduces the extreme gradient boosting (XGBOOST) feature importance metric to disentangle and rank the fingerprints of global warming and the IPO on seasonal precipitation trends in Ohio, United States, a region characterized by variable weather. Using monthly precipitation data from 55 weather stations spanning 1960–2023, seasonal average trends for boreal winter, spring, summer, and autumn were analyzed using Theil-Sen's Slope method, and statistical significance was tested at the 95% confidence level. Results indicate a significant increase in precipitation during winter (0.15 mm/decade) and summer (0.13 mm/decade), while no statistically significant changes were observed for spring and autumn. Correlation analysis revealed that 56.4% of the stations showed statistically significant positive correlations between global warming signals and increased winter precipitation. In comparison, 40% of the stations negatively correlated with the IPO during winter. Therefore, global warming and the negative IPO phase are associated with the observed increase in winter precipitation in most of the analyzed stations. In 60% of the stations, including stations impacted by the lake-effect snow, the XGBOOST model showed that the fingerprint of global warming ranked higher than that of the IPO. This indicates that global warming has a stronger association with the observed positive winter precipitation trend in most stations, and the IPO's net effect is limited to a smaller number of stations (i.e., 40%). These findings highlight that Ohio's winters are becoming wetter, with global warming contributing remarkably to this trend.
... The Extreme Gradient Boosting (XGBoost) algorithm [12] is a novel implementation of the Gradient Boosting Machine, in particular an ensemble of K Classification and Regression Trees. The algorithm is based on the idea of "boosting", which combines the predictions of a set of "weak" learners to develop a "strong" learner through additive training strategies. ...
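In the notation of the XGBoost paper, this additive model over K classification and regression trees is:

```latex
\hat{y}_i = \phi(\mathbf{x}_i) = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \qquad f_k \in \mathcal{F}
```

where $\mathcal{F}$ is the space of regression trees (CART), so each $f_k$ maps an example to the weight of the leaf it falls into, and the trees are added one at a time during training.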
Article
Full-text available
The primary obstacles in addressing the energy consumption forecasting challenge revolve around ensuring reliability, stability, efficiency, and accuracy in forecasting methodologies. Current forecasting models face difficulties due to the unpredictable volatility of energy consumption data. There is a need for artificial intelligence models that can anticipate abrupt irregular changes and effectively capture long-term dependencies within the data. Within this study, a pioneering AI-boosted forecasting model is presented, combining Extreme Gradient Boosting (XGBoost) with parallel long short-term memory (PLSTM) neural networks. The integration of XGBoost with PLSTM neural networks contributes to the improved performance of the overall PLSTM network. The suggested model is assessed using the Mean Absolute Percentage Error (MAPE).
Preprint
Full-text available
Genetic sequence identification from electrical characterization of single molecules has emerged as a promising alternative to traditional approaches. Since electrical data on single molecules are extremely noisy due to the limitations of even state-of-the-art approaches, achieving high detection rates is challenging, particularly when the task involves distinguishing a sequence from its single base-pair mismatches. To address this issue, we propose an architecture based on combining a convolutional neural network with an ensemble learning method, XGBoost. In addition, four different input feature representations are considered: 1D conductance probability distributions and 2D conductance-versus-distance probability distributions, which can be viewed as images, each with or without averaging over the experimental parameters. The with-averaging case corresponds to feature matrices derived from mixed datasets. We find that 2D probability distributions are helpful with respect to classifier accuracy, but averaged conductance probability distributions are much more impactful and significantly enhance prediction accuracy. Our quantitative analysis of multiple sequences shows an impressive performance increase of approximately 10% for all sequences. While the basis of our analysis is conductance data of DNA strands for COVID-19 Alpha, Beta, and Delta variants and their single base-pair mismatches, our method is generally applicable to other single-molecule identification tasks based on conductance.
Article
Accurate cooling consumption forecasts are crucial for optimizing energy management, storage, and overall efficiency in interconnected HVAC systems. Weather conditions, building characteristics, and operational parameters significantly impact prediction accuracy. Since meteorological conditions highly influence cooling demand, leveraging external air data and user metrics offers a promising approach to estimating a building's hourly cooling energy usage. This study addresses the gap in existing research by comprehensively analyzing the performance of various machine learning algorithms, including ensemble learning and deep learning models, to improve prediction accuracy. By leveraging weather conditions, building characteristics, and operational parameters, we aim to predict cooling consumption across multiple systems (Cooling Ceiling, Ventilation, Free Cooling, and Total Cooling). Data from four weather stations, encompassing diverse features relevant to the European Central Bank (ECB) building's cooling consumption in Frankfurt, were employed. Our methodology includes the use of K-Nearest Neighbor, Decision Tree, Support Vector Regression, Linear Regression, Random Forest, Gradient Boosting, XGBoost, Adaboost, Long Short-Term Memory, and Gated Recurrent Unit models. The results consistently demonstrate the superiority of the Random Forest model across different weather stations and feature sets. This model achieved a Mean Squared Error of approximately 0.002-0.003, a Mean Absolute Error of around 0.031-0.034, and a Root Mean Squared Error of about 0.052-0.069. These findings contribute to improved building cooling load management, promoting insights into optimal energy utilization and sustainable building practices. Doi: 10.28991/ESJ-2024-08-06-01
Article
Full-text available
A novel set of dimensionless numbers for predicting cavitation flow in gas-liquid flow through a throttling orifice is proposed. Multiple sets of cavitation data are obtained from the literature, and eight dimensionless groups are extracted using dimensional analysis. Subsequently, self-similarity among these groups is established, leading to the proposal of new dimensionless correlations (CfW and CfWd). Investigation results indicate that the established correlations between dimensionless groups can predict cavitation flow in orifices under a wide range of hydrodynamic and geometric conditions. The predictive model is validated using different sets of experimental data, improving accuracy (average relative error decreased by more than 57%) and covering more physical conditions compared to previous correlation predictions. Furthermore, dimensionless parameters with significant influence on the cavitation discharge coefficient are identified by the explanatory machine learning technique SHapley Additive exPlanations (SHAP) and sensitivity analysis. Based on these findings, this investigation contributes to a reduction in the number of variables involved in cavitation experiments or simulations.
Article
Full-text available
Machine learning (ML) models can simulate flood risk by identifying critical non-linear relationships between flood damage locations and flood risk factors (FRFs). To explore this, Tampa Bay, Florida, is selected as a test site. The study's goal is to simulate flood risk and identify dominant FRFs using historical flood damage data as the target variable, with 16 FRFs as predictor variables. Five different ML models, namely decision tree (DT), support vector machine (SVM), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), and random forest (RF), were adopted. RF classifies 2.42% of Tampa Bay as very high risk and 2.54% as high risk, while XGBoost classifies 3.85% as very high risk and 1.11% as high risk. Moreover, communities residing at low altitudes and near waterbodies, with dense man-made infrastructure, are at high flood risk. This study introduces a comprehensive framework for flood risk assessment and helps policymakers mitigate flood risk.
Article
Full-text available
Sentiment analysis of news media articles is essential for understanding the dynamics of conflict and cooperation in transboundary rivers. However, it is not known which machine learning model(s) can best meet the requirement of sentiment analysis for transboundary rivers. This study presents a comparative examination of ten machine learning models commonly used in the field of text sentiment analysis, including K-Nearest Neighbors, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Decision Tree, Extreme Gradient Boosting, Multilayer Perceptron, Long Short-Term Memory and Bidirectional Encoder Representations from Transformers, for five-class sentiment classification of 9382 news articles (1977–2022) attending to transboundary water conflict and cooperation. By evaluating their performance in terms of accuracy, precision, recall and F1-score, the Bidirectional Encoder Representations from Transformers (BERT) model demonstrated good overall performance and prediction capabilities for news articles with conflictive sentiments. By comparing with the AFINN sentiment dictionary, BERT showed superior performance in the prediction and identification of conflictive sentiment labels. And by validating against historical water events in the three river basins, BERT performed best in the Indus River basin. The findings of this study hold significant implications for government agencies in transboundary rivers, allowing them to promptly assess and respond to public sentiment, thereby preventing water conflict and promoting water cooperation.
Article
Machine learning is increasingly being utilized across various domains of nutrition research due to its ability to analyse complex data, especially as large datasets become more readily available. At times, however, this enthusiasm has led to machine learning techniques being adopted before they are properly understood, resulting in non-robust study designs and results of questionable validity. To ensure that research standards do not suffer, key machine learning concepts must be understood by the research community. The aim of this review is to facilitate a better understanding of machine learning in research by outlining good practices and common pitfalls in each step of the machine learning process. Key themes include the importance of generating high-quality data, employing robust validation techniques, quantifying the stability of results, accurately interpreting machine learning outputs, adequately describing methodologies, and ensuring transparency when reporting findings. Achieving this aim will facilitate the implementation of robust machine learning methodologies, reducing false findings and making research more reliable, and will enable researchers to critically evaluate and better interpret the findings of others.
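One of the review's themes, quantifying the stability of results, can be illustrated with repeated cross-validation: report the spread of scores across folds and repeats rather than a single split. A minimal sketch, assuming scikit-learn:

```python
# Stability check via repeated stratified cross-validation (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```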
Article
Full-text available
Forest fires represent a paramount natural disaster of global concern. Zhejiang Province has the highest forest coverage rate in China, and forest fires are one of the main natural disasters affecting forest management in the region. In this study, we comprehensively analyzed the spatiotemporal distribution of forest fires based on MODIS data from 2013 to 2023. The results showed that the annual incidence of forest fires in Zhejiang Province exhibited an overall downward trend over this period, with fires occurring more frequently in winter and spring. Using eight contributing factors of forest fire occurrence as variables, three models were constructed: Logistic Regression (LR), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). The RF and XGBoost models demonstrated high predictive ability, achieving accuracies of 0.85 and 0.92, F1-scores of 0.84 and 0.92, and AUC values of 0.892 and 0.919, respectively. Further analysis using the RF and XGBoost models revealed that elevation and precipitation had the most significant effects on the occurrence of forest fires. Additionally, the forest fire risk predictions generated by the RF and XGBoost models indicated that the incidence rate is high in the southern part of Zhejiang Province, particularly in the Wenzhou and Lishui areas, as well as in the southwest of the Hangzhou area and the north of the Quzhou area. In the future, forest fire risk in this region can be predicted from site factors with the RF and XGBoost models, providing a scientific reference for forest management in Zhejiang Province and aiding in the prevention and mitigation of forest fires.
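The reported accuracy, F1, and AUC figures correspond to standard scikit-learn metrics; below is a hedged sketch of that evaluation step, with synthetic stand-ins for the eight fire-occurrence factors.

```python
# Illustrative evaluation of a fire-occurrence classifier (synthetic data).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = XGBClassifier().fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # probabilities for the AUC
print("accuracy:", accuracy_score(y_te, pred),
      "f1:", f1_score(y_te, pred),
      "auc:", roc_auc_score(y_te, proba))
```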
Article
Machine learning offers a powerful and versatile approach to flood susceptibility mapping, enabling us to leverage complex data and improve prediction accuracy. Given the plethora of available techniques and the challenges in selecting the optimal approach, this study investigates prominent ML algorithms for flood susceptibility mapping (FSM) in the Wardha River sub-basin, India. Seven machine learning algorithms, viz. support vector machine (SVM), extreme gradient boosting (XGB), artificial neural network (ANN), generalized linear model (GLM), gradient boosting machine (GBM), random forest (RF), and linear discriminant analysis (LDA), were evaluated at varying spatial resolutions (30 m, 50 m, 100 m, and 200 m). Seven flood-inducing factors (elevation, flow accumulation, topographic wetness index, slope, rainfall, land use, and drain density) were considered. Model performance was assessed using sensitivity, specificity, area under the curve (AUC), overall correlation, overall standard deviation ratio, and overall root mean square difference (RMSD). The impact of spatial resolution on models’ accuracy was analysed. SVM, GBM, and RF were significantly affected, while ANN, GLM, and XGB were less sensitive. LDA excelled in execution time and spatial resolution resilience. The overall ranking of models was executed based on their accuracy, AUC, and execution time. XGB outperformed GBM and RF, securing first place, while SVM ranked last. GLM, ANN, and LDA ranked third to fifth. The results highlighted the importance of algorithm selection in accurately mapping flood susceptibility, particularly when working with varying spatial resolution data. The study findings can inform the decision-making process for implementing FSM using these machine learning algorithms.
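The study's overall ranking combines accuracy, AUC, and execution time; a hedged sketch of how such a ranking loop might look, with two of the seven models and synthetic data in place of the flood-inducing factors:

```python
# Illustrative ranking ingredients: AUC and fit time per model (synthetic data).
import time

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, n_features=7, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

for name, model in [("XGB", XGBClassifier()),
                    ("LDA", LinearDiscriminantAnalysis())]:
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - t0
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC={auc:.3f}, fit time={elapsed:.2f}s")
```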
Article
Full-text available
LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them.
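XGBoost itself ships a LambdaMART-style learning-to-rank objective; a minimal sketch, assuming the xgboost Python package and synthetic query groups:

```python
# Minimal LambdaMART-style ranking sketch with XGBoost's rank:ndcg objective.
import numpy as np
from xgboost import XGBRanker

rng = np.random.default_rng(0)
X = rng.random((100, 5))                 # document features
y = rng.integers(0, 5, size=100)         # graded relevance labels
groups = [20] * 5                        # five queries with 20 documents each

ranker = XGBRanker(objective="rank:ndcg", n_estimators=50)
ranker.fit(X, y, group=groups)
scores = ranker.predict(X[:20])          # ranking scores for the first query
```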
Article
Full-text available
Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.
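The paper's central identity for the two-class case can be stated compactly; the LaTeX below sketches the additive-logistic view (notation follows the usual statement of the result):

```latex
% Boosting as additive modeling on the logistic scale: the ensemble is an
% additive expansion, and class probabilities follow a symmetric logistic link.
\[
F(x) = \sum_{m=1}^{M} f_m(x), \qquad
p(y = 1 \mid x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} = \frac{1}{1 + e^{-2F(x)}}
\]
```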
Article
Full-text available
Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point evaluated at the current step. It is shown that both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure. Specifically, at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used in place of the full sample to fit the base learner and compute the model update for the current iteration. This randomized approach also increases robustness against overcapacity of the base learner.
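In scikit-learn this randomization is exposed as the `subsample` parameter: values below 1.0 draw a without-replacement subsample at each boosting iteration. A minimal sketch:

```python
# Stochastic gradient boosting: subsample < 1.0 fits each stage on a random
# without-replacement subsample of the training data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, random_state=0)
model = GradientBoostingRegressor(subsample=0.5, random_state=0).fit(X, y)
print(model.train_score_[-1])   # training loss at the final iteration
```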
Conference Paper
Full-text available
We cast the ranking problem as (1) multiple classification (“Mc”) and (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regression-based ranker [6] in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.
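The Expected Relevance conversion is just an expectation over relevance grades; a tiny hedged sketch:

```python
# Expected Relevance: turn per-grade class probabilities into ranking scores.
import numpy as np

grades = np.array([0, 1, 2, 3, 4])            # relevance grades
p = np.array([[0.1, 0.2, 0.4, 0.2, 0.1],      # per-document grade probabilities
              [0.5, 0.3, 0.1, 0.1, 0.0]])
scores = p @ grades                           # E[relevance] per document
print(scores)                                 # documents ranked by these scores
```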
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
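scikit-learn's `liblinear` solver wraps this library, so one quick way to try it from Python is:

```python
# Logistic regression backed by LIBLINEAR via scikit-learn's solver option.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
clf = LogisticRegression(solver="liblinear").fit(X, y)
print(clf.score(X, y))
```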
Conference Paper
Full-text available
Gradient Boosted Regression Trees (GBRT) are the current state-of-the-art learning paradigm for machine learned web-search ranking - a domain notorious for very large data sets. In this paper, we propose a novel method for parallelizing the training of GBRT. Our technique parallelizes the construction of the individual regression trees and operates using the master-worker paradigm as follows. The data are partitioned among the workers. At each iteration, the worker summarizes its data-partition using histograms. The master processor uses these to build one layer of a regression tree, and then sends this layer to the workers, allowing the workers to build histograms for the next layer. Our algorithm carefully orchestrates overlap between communication and computation to achieve good performance. Since this approach is based on data partitioning, and requires a small amount of communication, it generalizes to distributed and shared memory machines, as well as clouds. We present experimental results on both shared memory machines and clusters for two large scale web search ranking data sets. We demonstrate that the loss in accuracy induced due to the histogram approximation in the regression tree creation can be compensated for through slightly deeper trees. As a result, we see no significant loss in accuracy on the Yahoo data sets and a very small reduction in accuracy for the Microsoft LETOR data. In addition, on shared memory machines, we obtain almost perfect linear speed-up with up to about 48 cores on the large data sets. On distributed memory machines, we get a speedup of 25 with 32 processors. Due to data partitioning our approach can scale to even larger data sets, on which one can reasonably expect even higher speedups.
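The core of the method is that per-partition histograms of gradient statistics are cheap to build and merge; a toy single-machine sketch of that idea (not the paper's implementation):

```python
# Toy histogram construction and merging, standing in for the worker/master
# exchange: each "worker" bins its partition's gradients; the master sums bins.
import numpy as np

def gradient_histogram(feature, gradients, bin_edges):
    idx = np.digitize(feature, bin_edges)
    hist = np.zeros(len(bin_edges) + 1)
    np.add.at(hist, idx, gradients)     # accumulate gradient sums per bin
    return hist

rng = np.random.default_rng(0)
bin_edges = np.linspace(0.0, 1.0, num=32)
h1 = gradient_histogram(rng.random(500), rng.normal(size=500), bin_edges)
h2 = gradient_histogram(rng.random(500), rng.normal(size=500), bin_edges)
merged = h1 + h2   # master merges histograms, then scans bin edges for splits
```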
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
Full-text available
Learning a function of many arguments is viewed from the perspective of high-dimensional numerical quadrature. It is shown that many of the popular ensemble learning procedures can be cast in this framework. In particular, randomized methods, including bagging and random forests, are seen to correspond to random Monte Carlo integration methods, each based on a particular importance sampling strategy. Non-random boosting methods are seen to correspond to deterministic quasi-Monte Carlo integration techniques. This view helps explain some of their properties and suggests modifications to them that can substantially improve their accuracy while dramatically improving computational performance.
Article
Full-text available
An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN. We present a new online algorithm for computing ε-approximate quantile summaries of very large data sequences. The algorithm has a worst-case space requirement of O((1/ε) log(εN)), improving upon the previous best result of O((1/ε) log²(εN)). Moreover, in contrast to earlier deterministic algorithms, our algorithm does not require a priori knowledge of the length of the input sequence. Finally, the actual space bounds obtained on experimental data are significantly better than the worst-case guarantees of our algorithm, as well as the observed space requirements of earlier algorithms.
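For intuition (this is not the paper's online algorithm), even a naive summary meets the ε-approximate definition: keep every k-th element of the sorted data with k ≈ εN, so any quantile query is answered within rank error εN from roughly 1/ε stored elements.

```python
# Naive epsilon-approximate quantile summary: offline, for intuition only.
import numpy as np

def build_summary(data, eps):
    k = max(1, int(eps * len(data)))
    return np.sort(data)[::k]            # ~1/eps elements, rank error <= eps*N

def query(summary, q):
    return summary[int(q * (len(summary) - 1))]

data = np.random.default_rng(0).normal(size=100_000)
summary = build_summary(data, eps=0.01)
print(query(summary, 0.5), np.quantile(data, 0.5))   # close agreement
```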
Article
Full-text available
Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient-descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least-absolute-deviation, and Huber-M loss functions for regression, and multi-class logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are decision trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of decision trees produces competitive, highly robust, interpretable procedures for regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire (1996) and of Friedman, Hastie and Tibshirani are discussed.
Article
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and of Friedman, Hastie and Tibshirani are discussed.
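The recipe admits a very small from-scratch illustration for squared loss, where the negative gradient is simply the residual; a hedged sketch using shallow scikit-learn trees as base learners:

```python
# Minimal least-squares gradient boosting: each stage fits a small tree to the
# current residuals (the negative gradient of L2 loss) and is added with
# shrinkage eta.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, eta=0.1, depth=2):
    pred = np.full(len(y), y.mean())          # F_0: best constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                  # negative gradient for L2 loss
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        pred += eta * tree.predict(X)
        trees.append(tree)
    return trees, pred

rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=500)
trees, fitted = gradient_boost(X, y)
```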
Article
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
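A minimal PySpark sketch of the MLlib estimator API (assumes a local Spark installation; the tiny inline dataset is illustrative only):

```python
# Minimal MLlib usage: a DataFrame with "label"/"features" columns and an
# estimator whose fit() returns a fitted model.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])
model = LogisticRegression(maxIter=10).fit(df)
print(model.coefficients)
```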
Article
Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most online advertising systems. With over 750 million daily active users and over 1 million active advertisers, predicting clicks on Facebook ads is a challenging machine learning task. In this paper we introduce a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact on the overall system performance. We then explore how a number of fundamental parameters impact the final prediction performance of our system. Not surprisingly, the most important thing is to have the right features: those capturing historical information about the user or ad dominate other types of features. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale). Picking the optimal handling for data freshness, learning rate schema and data sampling improves the model slightly, though much less than adding a high-value feature, or picking the right model to begin with.
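The trees-plus-logistic-regression hybrid can be sketched with scikit-learn: use the boosted trees' leaf indices as a one-hot feature encoding for a linear logistic model (a minimal stand-in, not the production system described):

```python
# GBDT leaves as features for logistic regression (binary classification).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

leaves = gbdt.apply(X)[:, :, 0]          # leaf index per example per tree
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves)
lr = LogisticRegression(max_iter=1000).fit(leaf_features, y)
print(lr.score(leaf_features, y))
```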
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
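The "internal estimates" mentioned here correspond to out-of-bag statistics; in scikit-learn an out-of-bag error estimate is exposed via `oob_score=True`, with impurity-based variable importances alongside. A minimal sketch:

```python
# Out-of-bag accuracy estimate and variable importance for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
# Impurity-based importances (a different estimator than Breiman's
# permutation importance, but serving the same purpose here).
print("top importances:", sorted(rf.feature_importances_, reverse=True)[:5])
```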
Article
We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees such as Adaboost for exponential loss and Friedman's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting on many of the datasets we have tested on.
Article
This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., recommender systems and object recognition in vision). The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Press edited book which is currently in production. Visit the tutorial website at http://hunch.net/~large_scale_survey/
Conference Paper
Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most implementations today are computationally expensive and require all training data to be in main memory. As training data becomes ever larger, there is motivation for us to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive and various approaches have been explored in existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two different distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation and the second utilizes MPI on the Hadoop grid environment.
Article
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
Conference Paper
We present a fast algorithm for computing approximate quantiles in high speed data streams with deterministic error bounds. For data streams of size N where N is unknown in advance, our algorithm partitions the stream into sub-streams of exponentially increasing size as they arrive. For each sub-stream, which has a fixed size, we compute and maintain a multi-level summary structure using a novel algorithm. In order to achieve high speed performance, the algorithm uses simple block-wise merge and sample operations. Overall, our algorithms for fixed-size streams and arbitrary-size streams have a computational cost of O(N log((1/ε) log(εN))) and an average per-element update cost of O(log log N) if ε is fixed.
The present and the future of the KDD Cup competition: an outsider's perspective
  • R Bekkerman
R. Bekkerman. The present and the future of the KDD Cup competition: an outsider's perspective.
General functional matrix factorization using gradient boosting
  • T Chen
  • H Li
  • Q Yang
  • Y Yu
T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436–444, 2013.
Efficient second-order gradient boosting for conditional random fields
  • T Chen
  • S Singh
  • B Taskar
  • C Guestrin
T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.
LIBLINEAR: A Library for Large Linear Classification
  • Rong-En Fan
  • Kai-Wei Chang
  • Cho-Jui Hsieh
  • Xiang-Rui Wang
  • Chih-Jen Lin