Conference Paper

XGBoost: A Scalable Tree Boosting System

Authors: Tianqi Chen, Carlos Guestrin

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
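As a minimal, hedged illustration of the system the abstract describes, the sketch below trains a small booster with the Python xgboost library; the synthetic data, parameter values, and round count are illustrative assumptions, not the paper's experiments.

```python
# Minimal usage sketch (illustrative assumptions throughout): train a binary
# classifier with xgboost's native API on synthetic data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)       # internal format; also accepts sparse input
params = {
    "objective": "binary:logistic",
    "max_depth": 4,                    # tree depth
    "eta": 0.1,                        # learning rate (shrinkage)
    "tree_method": "hist",             # histogram-based approximate split finding
}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(xgb.DMatrix(X))
```

The `"hist"` tree method bins features into quantiles, which is the library-level counterpart of the approximate, sketch-based split finding the abstract mentions.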

... Data cleaning, processing, and descriptive analysis: All data processing and analysis were undertaken in R statistical software version 4.1.3 (32), with the addition of the following packages: Tidyverse (33), Caret (34), PresenceAbsence (35,36), XGBoost (37), and SHAPforxgboost (38,39). ...
... Following initial exploration with a variety of algorithms, a gradient-boosting machine learning decision tree algorithm was chosen using XGBoost (37). This algorithm is a supervised decision tree algorithm that combines ensemble learning and gradient-boosting techniques. ...
... Model tuning parameters were analyzed based on the cross-validation results of the training dataset. Other model tuning parameters were set as follows: subsample = 1, max depth = 4, columns sampled by tree = 1, and number of parallel trees = 1 (37,47). The final model included the top 11 variables as judged by relative importance and Shapley Additive exPlanations (SHAP) values. ...
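A hedged sketch of how the tuning values quoted above map onto xgboost parameter names (subsample, max_depth, colsample_bytree, num_parallel_tree); the data, objective, and round count are illustrative assumptions, not the study's setup.

```python
# Sketch mapping the quoted settings onto xgboost parameter names; everything
# else here (data, objective, rounds) is an illustrative assumption.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=11, random_state=0)

params = {
    "objective": "binary:logistic",
    "subsample": 1.0,            # "subsample = 1"
    "max_depth": 4,              # "max depth = 4"
    "colsample_bytree": 1.0,     # "columns sampled by tree = 1"
    "num_parallel_tree": 1,      # "number of parallel trees = 1"
}
cv = xgb.cv(params, xgb.DMatrix(X, label=y), num_boost_round=200,
            nfold=5, metrics="logloss", early_stopping_rounds=10)
print(cv.tail(1))                # cross-validated logloss at the best iteration
```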
Article
Full-text available
Udder health remains a priority for the global dairy industry to reduce pain, economic losses, and antibiotic usage. The dry period is a critical time for the prevention of new intra-mammary infections and it provides a point for curing existing intra-mammary infections. Given the wealth of udder health data commonly generated through routine milk recording and the importance of udder health to the productivity and longevity of individual cows, an opportunity exists to extract greater value from cow-level data to undertake risk-based decision-making. The aim of this research was to construct a machine learning model, using routinely collected farm data, to make probabilistic predictions at drying off for an individual cow’s risk of a raised somatic cell count (hence intra-mammary infection) post-calving. Anonymized data were obtained as a large convenience sample from 108 UK dairy herds that undertook regular milk recording. The outcome measure evaluated was the presence of a raised somatic cell count in the 30 days post-calving in this observational study. Using a 56-farm training dataset, machine learning analysis was performed using the extreme gradient boosting decision tree algorithm, XGBoost . External validation was undertaken on a separate 28-farm test dataset. Statistical assessment to evaluate model performance using the external dataset returned calibration plots, a Scaled Brier Score of 0.095, and a Mean Absolute Calibration Error of 0.009. Test dataset model calibration performance indicated that the probability of a raised somatic cell count post-calving was well differentiated across probabilities to allow an end user to apply group-level risk decisions. Herd-level new intra-mammary infection rate during the dry period was a key driver of the probability that a cow had a raised SCC post-calving, highlighting the importance of optimizing environmental hygiene conditions. In conclusion, this research has determined that probabilistic classification of the risk of a raised SCC in the 30 days post-calving is achievable with a high degree of certainty, using routinely collected data. These predicted probabilities provide the opportunity for farmers to undertake risk decision-making by grouping cows based on their probabilities and optimizing management strategies for individual cows immediately after calving, according to their likelihood of intra-mammary infection.
... Decision tree-based algorithms such as random forests have also been found to perform well for diverse-species forests in the insular Caribbean. Extreme gradient boosting (XGBoost) is another tree-based algorithm, which gained popularity after winning several data-mining competitions hosted by Kaggle in 2015 (Chen and Guestrin, 2016) and has been a go-to machine learning algorithm among data scientists since then. XGBoost is based on the ensemble learning method, where multiple weak decision trees are combined into a strong decision tree. ...
... These decision trees are generated based on the objective function. Unlike the traditional gradient boosting algorithm, the objective function in XGBoost is penalized with regularization terms to prevent overfitting (Chen and Guestrin, 2016). XGBoost has been applied in several areas of forestry, especially in using remote sensing for forestry research (e.g., Wang et al., 2022; Xu et al., 2022; Zhang et al., 2022). ...
... The working mechanism of XGBoost involves minimizing the loss function using gradient descent optimization, while incorporating regularization techniques to prevent overfitting and improve generalization performance (Chen and Guestrin, 2016). The input data is a matrix X of dimension n × p where n is the number of observations and p is the number of predictor variables (DBH and CI in this case). ...
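For reference, the regularized objective that these snippets describe is, in the notation of Chen and Guestrin (2016):

```latex
% Regularized objective from Chen and Guestrin (2016): training loss over n
% observations plus a complexity penalty on each of the K trees.
\mathcal{L}(\phi) = \sum_{i=1}^{n} l\!\left(\hat{y}_i, y_i\right)
                  + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

where T is the number of leaves in a tree and w its vector of leaf weights; the γ and λ terms are the regularization that distinguishes XGBoost from plain gradient boosting.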
Article
Full-text available
Height-diameter relationship models, denoted as H-D models, have important applications in sustainable forest management, which include studying the vertical structure of a forest stand, understanding the habitat heterogeneity for wildlife niches, and analyzing growth rate patterns for making decisions regarding silvicultural treatments. Compared to monocultures, characterizing allometric relationships for uneven-aged, mixed-species forests, especially tropical forests, is more challenging and has historically received less attention. Modeling how the competitive interactions between trees of varying sizes and multiple species affect these relationships adds a high degree of complexity. In this study, five regression methods and five distance-independent competition indices were evaluated for temperate and pantropical tree species in different physiographic regions. A total of 163,922 individual tree measurements from the US Department of Agriculture, Forest Inventory and Analysis (FIA) database were used in the analyses, which cover the Appalachian Plateau (AP) and Ridge and Valley (VR) in the southeastern US, as well as Caribbean (CAR) and Pacific (PAC) islands. Results indicated that the generalized additive model (GAM) and the Pearl and Reed model provided more accurate predictions than the other regression methods examined. Models with competition indices had varying levels of predictability, while diameter ratio, cumulative distribution function and partitioned stand density index (PSDI) were found to improve the prediction accuracy for AP, VR and CAR. The results of this work provide additional insights on modeling H-D relationships for a variety of species in temperate and pantropical forests.
... It requires an input layer, hidden layers, and an output layer. Extreme Gradient Boosting Machine, on the other hand, is a gradient boosting technique, which is an ensemble method (Chen and Guestrin 2016). It builds several learners and combines the various predictions from these learners to make a final prediction. ...
... It is able to deal with overfitting problems using a regularization term. A unique advantage of using XGBoost is its scalability to fit high-dimensional data without overfitting and its ability to implement parallel computing to reduce computational complexity and learn faster (Chen and Guestrin 2016). Also, XGBoost can handle missing values. ...
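A brief sketch of the missing-value handling mentioned above: XGBoost learns a default direction per split, so NaN entries can be passed in without an imputation step. The data here is an illustrative assumption.

```python
# Sketch of native missing-value handling (data is an illustrative assumption):
# rows with NaN follow a learned default direction at each split.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X[rng.random(X.shape) < 0.2] = np.nan        # knock out ~20% of entries
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)                              # no imputation step needed
print(model.predict(X[:5]))
```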
Preprint
Cost overruns in construction projects, a leading cause of project failure, have been attracting increasing attention among construction stakeholders. Notably, cost overrun prediction model development can help identify factors that lead to cost overruns, thereby substantially improving cost estimates. Meanwhile, the application of machine learning to archival data for estimating construction cost overruns is still in development. Motivated by this, we applied an Extreme Gradient Boosting (XGBoost) machine to analyze historical data of construction projects in Ghana completed between 2016 and 2018. The comparison between the actual and predicted cost yielded a good model prediction. The RMSE, MSE, MAE, and MAPE values are 0.202, 0.041, 0.069, and 0.306, respectively. To visually explain the importance of each feature for cost overrun prediction, we used SHAP values to illustrate the effect of each feature for model interpretability. According to the SHAP ranking, we discovered that the initial contract amount, the number of storeys, scope changes, and the initial duration are the variables that most accurately predict project completion costs and cost overruns. This research explores an innovative way to understand and evaluate essential variables that can help develop a prediction model of cost overruns that could aid the construction industry's cost estimation.
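A hedged sketch of SHAP-style attribution for an XGBoost model, in the spirit of the study above; the data and feature names are illustrative assumptions. It uses xgboost's built-in `pred_contribs=True` output (per-feature SHAP contributions plus a bias column) rather than the separate shap library.

```python
# Sketch of SHAP-style feature attribution (data and names are assumptions).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

dtrain = xgb.DMatrix(X, label=y, feature_names=["f0", "f1", "f2", "f3"])
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=100)

contribs = booster.predict(dtrain, pred_contribs=True)  # shape (n, p + 1); last col = bias
mean_abs = np.abs(contribs[:, :-1]).mean(axis=0)        # global importance ranking
for name, v in sorted(zip(dtrain.feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {v:.3f}")
```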
... These assumptions were tested using machine learning (ML). Using the ground truth datasets outlined above, we developed eXtreme Gradient Boosting (XGBoost) ensemble classification models (Chen & Guestrin, 2016) that utilise the prediction results from the diverse tools used by pyRBDome as features to predict how likely an amino acid is to bind RNA (detailed in Fig. EV4). The XGBoost probability scores for SRP19, derived from all the pyRBDome results for this protein, are shown in the model prediction structure Fig. 2A and the score bar in Fig. 2B. ...
... These models discern patterns within the aggregated predictive results and align them with known RNA-binding amino acids in the existing structural data. The main reasons for relying on XGBoost to build these preliminary models include its frequent outperformance of neural networks when presented with tabular data (such as the data used here), its ability to handle missing data points effectively (useful in cases where a protein could not be analysed by one of the prediction tools), its competence in dealing with unbalanced datasets (our ground truth datasets are unbalanced), and its tolerance to uninformative features (Chen & Guestrin, 2016). ...
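For the unbalanced-data point raised above, one common XGBoost-level remedy is to weight the rare class via `scale_pos_weight`; the sketch below shows this under assumed synthetic data, not the study's ground-truth sets.

```python
# Sketch of handling class imbalance with scale_pos_weight (data is assumed).
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
ratio = (y == 0).sum() / (y == 1).sum()    # negatives per positive

clf = xgb.XGBClassifier(
    n_estimators=200,
    scale_pos_weight=ratio,                # upweight the rare positive class
    eval_metric="aucpr",                   # PR-AUC suits unbalanced problems
)
clf.fit(X, y)
```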
Preprint
Full-text available
High-throughput proteomics approaches have revolutionised the identification of RNA-binding proteins (RBPome) and RNA-binding sequences (RBDome) across organisms. Many novel putative RNA-binding proteins (RBPs) were discovered, including those that lack recognisable RNA-binding domains. Yet the extent of noise, including false-positive identifications, associated with these methodologies is difficult to quantify as experimental approaches for validating the results are generally low throughput. To address this, we introduce pyRBDome, a pipeline for in-depth in silico enhancement of RNA-binding proteome data. It does so by comparing experimental results with RNA-binding site (RBS) predictions from several distinct machine learning tools and integrates high-resolution structural data of protein-RNA complexes when available. By providing a statistical evaluation of RBDome data, users can rapidly identify protein sequences from RBDome experiments most likely to be bona fide RNA-binders. Furthermore, by leveraging the predictions collated by pyRBDome, we have enhanced the sensitivity and specificity of RBS detection through training new ensemble machine learning models. We describe a pyRBDome analysis of a large human RBDome dataset and conducted a comparison with known structural data. These analyses reinforced the significance of stacking interactions in UV cross-linking protein-RNA interactions. Surprisingly, our analyses revealed two contrasting findings: while UV cross-linked amino acids were more likely to contain predicted RBSs, they infrequently bind RNA in high-resolution structures. Given the known limitations of structural data as benchmarks, these findings highlight the utility of pyRBDome as a valuable alternative approach for enhancing confidence in RBDome datasets. Finally, our comprehensive analysis of hundreds of (putative) RBPs offers a valuable resource for RBP enthusiasts.
... XGBR is a decision tree-based ensemble algorithm that uses a gradient boosting framework. It works as Newton-Raphson in function space, unlike gradient boosting, which works as gradient descent in function space; a second-order Taylor approximation is used in the loss function to make the connection to the Newton-Raphson method (Chen and Guestrin, 2016). The idea behind boosting is to generate multiple "weak" prediction models sequentially, each of which takes the results of the previous model to generate a "stronger" model, with better predictive power and greater stability. ...
... in its results (Chen and Guestrin, 2016; Han et al., 2019; Putatunda and Rama, 2018). Finally, GLM extends the general linear model. ...
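The second-order (Newton-type) fitting described above is visible in xgboost's custom-objective API, which asks the user for both the gradient and the Hessian of the loss. The sketch below shows this for plain squared error; the data is an illustrative assumption.

```python
# Illustration of the Newton-type update: a custom objective must return the
# gradient and Hessian of the loss. Data is an illustrative assumption.
import numpy as np
import xgboost as xgb

def squared_error(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative of 1/2 (pred - y)^2
    hess = np.ones_like(preds)     # second derivative; used in the Newton step
    return grad, hess

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=100, obj=squared_error)
```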
Article
Full-text available
A key aspect in agricultural zones, such as the Pampean Plain of Argentina, is to accurately estimate evapotranspiration rates to optimize crops and irrigation requirements and the prediction of floods and droughts. In this sense, we evaluate six machine learning approaches to estimate the reference and actual evapotranspiration (ET0 and ETa) through CERES satellite product data. The results obtained applying machine learning techniques were compared with values obtained from ground-based information. After training and validating the algorithms, we observed that the Support Vector machine-based Regressor (SVR) showed the best accuracy. Then, with an independent dataset, the calibrated SVR was tested. For predicting the reference evapotranspiration, we observed statistical errors of MAE = 0.437 mm d−1 and RMSE = 0.616 mm d−1, with a determination coefficient, R2, of 0.893. Regarding actual evapotranspiration modelling, we observed statistical errors of MAE = 0.422 mm d−1 and RMSE = 0.599 mm d−1, with an R2 of 0.614. Comparing the results obtained with the machine learning models developed in other studies in the same field, we understand that the results are promising and represent a baseline for future studies. Combining CERES data with information from other sources may generate more specific evapotranspiration products, considering the different land covers.
... The specifics of the input data and the type of model are defined in Table 4 (comparison among the four tested models with total input features). The ML models were all instances of extreme gradient boosting (XGBoost) 1.7.6 [58]. XGBoost is an optimised distributed gradient-boosted decision tree (GBDT) library designed to be highly efficient, flexible, and portable. ...
... XGBoost often outperforms random forest, k-nearest neighbours, and support vector machines on similar classification tasks using hyperspectral imagery [59][60][61] due to its ability to handle sparse data and its scalability with parallel and GPU computing [62]. The algorithm provides advanced regularisation, which reduces overfitting and improves overall performance [58]. ...
Article
Full-text available
Mapping Antarctic Specially Protected Areas (ASPAs) remains a critical yet challenging task, especially in extreme environments like Antarctica. Traditional methods are often cumbersome, expensive, and risky, with limited satellite data further hindering accuracy. This study addresses these challenges by developing a workflow that enables precise mapping and monitoring of vegetation in ASPAs. The processing pipeline of this workflow integrates small unmanned aerial vehicles (UAVs)—or drones—to collect hyperspectral and multispectral imagery (HSI and MSI), global navigation satellite system (GNSS) enhanced with real-time kinematics (RTK) to collect ground control points (GCPs), and supervised machine learning classifiers. This workflow was validated in the field by acquiring ground and aerial data at ASPA 135, Windmill Islands, East Antarctica. The data preparation phase involves a data fusion technique to integrate HSI and MSI data, achieving the collection of georeferenced HSI scans with a resolution of up to 0.3 cm/pixel. From these high-resolution HSI scans, a series of novel spectral indices were proposed to enhance the classification accuracy of the model. Model training was achieved using extreme gradient boosting (XGBoost), with four different combinations tested to identify the best fit for the data. The research results indicate the successful detection and mapping of moss and lichens, with an average accuracy of 95%. Optimised XGBoost models, particularly Model 3 and Model 4, demonstrate the applicability of the custom spectral indices to achieve high accuracy with reduced computing power requirements. The integration of these technologies results in significantly more accurate mapping compared to conventional methods. This workflow serves as a foundational step towards more extensive remote sensing applications in Antarctic and ASPA vegetation mapping, as well as in monitoring the impact of climate change on the Antarctic ecosystem.
... A total of five hyperparameters were used: number of trees, learning rate, L1 regularization parameter, maximum depth of trees, and subsampling ratio for the training dataset and for the columns. Tuning was done using grid search 47. ...
... Keras and Tensorflow were used as backends for deep learning methods 48,49. The XGBoost library was used for the gradient boosting machines 47. OpenCV and PILLOW were used for digital image processing 37. ...
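A hedged sketch of the grid search described above, mapping the listed hyperparameters onto their xgboost names; the grid values and data are illustrative assumptions, not the study's actual settings.

```python
# Sketch of grid search over the five quoted hyperparameter families
# (values and data are illustrative assumptions).
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=35, random_state=0)

grid = {
    "n_estimators": [100, 300],        # number of trees
    "learning_rate": [0.05, 0.1],      # learning rate
    "reg_alpha": [0.0, 1.0],           # L1 regularization
    "max_depth": [3, 6],               # maximum tree depth
    "subsample": [0.8, 1.0],           # row subsampling ratio
    "colsample_bytree": [0.8, 1.0],    # column subsampling ratio
}
search = GridSearchCV(xgb.XGBClassifier(), grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```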
Article
Full-text available
This study used deep neural networks and machine learning models to predict facial landmark positions and pain scores using the Feline Grimace Scale© (FGS). A total of 3447 face images of cats were annotated with 37 landmarks. Convolutional neural networks (CNN) were trained and selected according to size, prediction time, predictive performance (normalized root mean squared error, NRMSE) and suitability for smartphone technology. Geometric descriptors (n = 35) were computed. XGBoost models were trained and selected according to predictive performance (accuracy; mean square error, MSE). For prediction of facial landmarks, the best CNN model had NRMSE of 16.76% (ShuffleNetV2). For prediction of FGS scores, the best XGBoost model had accuracy of 95.5% and MSE of 0.0096. Models showed excellent predictive performance and accuracy to discriminate painful and non-painful cats. This technology can now be used for the development of an automated, smartphone application for acute pain assessment in cats.
... For this purpose, we tried three methods. Two of these are the Machine Learning (ML) regression algorithms Support Vector Machine (SVM) [32] and XGBoost (XGB) [33]. We also made use of Deep Learning (DL) regression by designing a very simplified Artificial Neural Network (ANN). ...
Preprint
Full-text available
Respiratory diseases in children under the age of two, such as bronchiolitis or pneumonia, are a major cause of emergency consultations in hospital and primary care settings, and are also a significant cause of mortality in low-income countries. Early detection of respiratory distress and high respiratory rate is crucial for timely intervention and improved clinical outcomes. In this study, we developed and evaluated two computer vision techniques for respiratory rate estimation in young children. The first technique, remote photoplethysmography, uses changes in skin color due to blood flow modulation to estimate the respiratory rate, while the second technique, designed in this work, uses the motion of a sticker placed on the patient's abdomen and captures the variations of the reflected light throughout inhalation and exhalation. Both techniques were tested on a dataset of video recordings of children under the age of two taken in the Hospital 12 de Octubre of Madrid. Our results show that both techniques achieved accurate respiratory rate estimation, with the second technique having the lower mean absolute error. For high respiratory frequencies, the estimation errors are less than 3 bpm. These techniques have the potential to be used as low-cost and non-invasive tools for respiratory rate monitoring in low-resource settings, including remote and underserved areas of Africa. Besides, the elaboration of a labeled dataset will serve as potential groundwork for further research in this matter.
... Prediction was performed using four representative ML algorithms: logistic regression (LR), random forest (RF) 6, XGBoost (XGB) 7, and LightGBM (LGBM) 8. LR is a model that uses regression to predict the probability of data falling into a category and classifies it as belonging to the more likely category. ...
Article
Full-text available
Osteoporosis is a serious health concern in patients with rheumatoid arthritis (RA). Machine learning (ML) models have been increasingly incorporated into various clinical practices, including disease classification, risk prediction, and treatment response. However, only a few studies have focused on predicting osteoporosis using ML in patients with RA. We aimed to develop an ML model to predict osteoporosis using a representative Korean RA cohort database. The KORean Observational study Network for Arthritis (KORONA) database, established by the Clinical Research Center for RA in Korea, was used in this study. Among the 5077 patients registered in KORONA, 2374 patients were included in this study. Four representative ML algorithms were used for the prediction: logistic regression (LR), random forest, XGBoost (XGB), and LightGBM. The accuracy, F1 score, and area under the curve (AUC) of each model were measured. The LR model achieved the highest AUC value at 0.750, while the XGB model achieved the highest accuracy at 0.682. Body mass index, age, menopause, waist and hip circumferences, RA surgery, and monthly income were risk factors of osteoporosis. In conclusion, ML algorithms are a useful option for screening for osteoporosis in patients with RA.
... The concept behind gradient boosting is to use a gradient descent algorithm over an objective function to combine a single weak classifier with other weak classifiers to build a strong classifier. The aim of this process is to minimize the prediction error considerably [26]. ...
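A toy illustration of the stagewise idea just described: for squared error, the negative gradient is simply the residual, so each shallow tree is fit to the current residuals and added with shrinkage. This is purely illustrative pseudotraining on assumed data, not any study's pipeline.

```python
# Toy gradient boosting loop: fit each shallow tree to the residual
# (the negative gradient for squared error) and add it with shrinkage.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred, lr, trees = np.zeros_like(y), 0.1, []
for _ in range(100):
    residual = y - pred                      # negative gradient of 1/2 (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)             # shrunken stagewise update

print("training MSE:", np.mean((y - pred) ** 2))
```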
Article
Full-text available
Classifiers in machine learning work on the principle that the observations are evenly distributed across the classes. However, real-world datasets frequently exhibit skewed distributions of classes, called imbalanced distributions, causing the classifiers to make highly biased predictions. One of the several method groups that deal with the imbalanced data problem is class balancing methods. We aimed to compare some class balancing methods during the classification of pacing horses according to their origins. The data set contains morphological traits of horses and four origin classes with different sample sizes, which leads to a multi-class imbalanced data problem. The training data set was modified with different balancing methods. Each balanced data set was trained with C5.0, Random Forest and Extreme Gradient Boosting Machine classifiers. Method comparisons were made based on comparison metrics using the original test set. The best prediction result was obtained on the data set balanced with the random undersampling method regarding both G-mean and Matthews Correlation Coefficient; however, the best result according to F1 score was observed on the data set balanced with the Adaptive Synthetic Sampling Approach (ADASYN). The primary important variables of the best models were body length, withers height, chest circumference and rump height. The Bulgarian origin was the most accurately predicted class despite having the smallest sample size. Class balancing methods clearly improved the performance of classifiers for predicting origins of pacing horses.
... To this end, we train classifiers to distinguish samples generated from the true and modelled posterior distributions on the sample position in the base space, the position in target space, and the density in target space together with β2. In all three cases, the classification is carried out with a Boosted Decision Tree (BDT) [22] with the default settings of the XGBClassifier class provided by the python library xgboost [23]. The data generated from the true posterior is split into two equal parts, which are then used only for training and testing, respectively. ...
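A sketch of the classifier setup quoted above: a default-settings XGBClassifier trained and tested on an even split. The synthetic two-sample data below merely stands in for the paper's true and modelled posterior samples.

```python
# Two-sample classifier test sketch: default XGBClassifier, 50/50 split.
# The Gaussian samples are assumed stand-ins for the two posteriors.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
true_samples = rng.normal(0.0, 1.0, size=(2000, 3))    # stand-in: true posterior
model_samples = rng.normal(0.1, 1.1, size=(2000, 3))   # stand-in: modelled posterior

X = np.vstack([true_samples, model_samples])
y = np.concatenate([np.ones(2000), np.zeros(2000)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = xgb.XGBClassifier()          # default settings, as in the snippet
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))   # ~0.5 means indistinguishable
```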
Article
Full-text available
Studying potential BSM effects at the precision frontier requires accurate transfer of information from low-energy measurements to high-energy BSM models. We propose to use normalising flows to construct likelihood functions that achieve this transfer. Likelihood functions constructed in this way provide the means to generate additional samples and admit a "trivial" goodness-of-fit test in the form of a χ² test statistic. Here, we study a particular form of normalising flow, apply it to a multi-modal and non-Gaussian example, and quantify the accuracy of the likelihood function and its test statistic.
... Because of the wide variety of hosts and infection symptoms, we expect phytoplasmas' effectors to have different characteristics; thus we reasoned that different learning models would be able to capture diverse properties, yielding a more comprehensive prediction. Therefore, we used an ensemble learner composed of two tree-based algorithms, namely random forest and XGBoost (57,58), and two naive Bayes classifiers, including a Gaussian and a Multinomial model (59,60). We fed the four classifiers with our training dataset and measured their performances on the test set (refer to methods for dataset construction and Additional Table 4 for models' parameters). ...
Preprint
Background: Crop pathogens are a major threat to plant health, reducing the yield and quality of agricultural production. Among them, the Candidatus Phytoplasma genus, a group of fastidious phloem-restricted bacteria, can parasitize a wide variety of both ornamental and agro-economically important plants. Several aspects of the interaction with the plant host are still unclear, but it was discovered that phytoplasmas secrete certain proteins (effectors) responsible for the symptoms associated with the disease. Identifying and characterizing these proteins is of prime importance for globally improving plant health in an environmentally friendly context. Results: We challenged the identification of phytoplasma effectors by developing LEAPH, a novel machine-learning ensemble predictor for phytoplasma pathogenicity proteins. The prediction core is composed of four models: Random Forest, XGBoost, Gaussian, and Multinomial Naive Bayes. The consensus prediction is achieved by a novel consensus prediction score. LEAPH was trained on 479 proteins from 53 phytoplasma species, described by 30 features accounting for the biological complexity of these protein sequences. LEAPH achieved 97.49% accuracy, 95.26% precision, and 98.37% recall, ensuring a low false-positive rate and outperforming available state-of-the-art methods for putative effector prediction. The application of LEAPH to 13 phytoplasma proteomes yields a comprehensive landscape of 2089 putative pathogenicity proteins. We identified three classes of these proteins according to different secretion models: classical (presenting a signal peptide), classical-like, and non-classical (lacking the canonical secretion signal). Importantly, LEAPH was able to identify 15 out of 17 known experimentally validated effectors belonging to the three classes. Furthermore, to help the selection of novel candidates for biological validation, we applied the Self-Organizing Maps algorithm and developed a shiny app called EffectorComb. Both tools would be a valuable resource to improve our understanding of effectors in plant-phytoplasma interactions. Conclusions: LEAPH and the EffectorComb app can be used to boost the characterization of putative effectors at both computational and experimental levels and can be employed in other phytopathological models. Both tools are available at https://github.com/Plant-Net/LEAPH-EffectorComb.git.
... Due to its robustness and strength, XGB quickly became one of the most well-known and commonly utilized machine learning techniques. 96 It is comparable to GB but has specific additional characteristics that significantly increase its strength. A proportional shrinkage of leaf weights is used to increase model generality. ...
Article
The main objective of this paper is to use the data-driven approach to predict and evaluate the mechanical properties of concrete made with recycled concrete aggregate (RCA), including compressive strength and elastic modulus. Using 358 data samples, including 10 input variables, 10 popular machine learning (ML) algorithms are introduced to select the best-performing ML model for predicting RCA concrete's compressive strength and elastic modulus. Gradient Boosting and Categorical Boosting have the best performance in predicting the compressive strength of RCA concrete, with R² = 0.9112, RMSE = 5.3464 MPa, MAE = 4.0845 MPa, and R² = 0.9175, RMSE = 5.1520 MPa, MAE = 3.7567 MPa, respectively. Light Gradient Boosting and Categorical Boosting have the best performance in predicting the elastic modulus of RCA concrete, with R² = 0.8775, RMSE = 2.3560 GPa, MAE = 1.8330 GPa, and R² = 0.9300, RMSE = 2.3560 MPa, MAE = 1.2589 MPa, respectively. Based on the Shapley Additive Explanation analysis, the influence of the main factors on the compressive strength and elastic modulus of RCA concrete has been analyzed qualitatively and quantitatively. The RCA replacement level and cement/sand ratio slightly affect compressive strength but have a dominant influence on the elastic modulus of RCA concrete.
... In this paper, a bank customer churn model is constructed using a combination of a genetic algorithm (GA) and the XGBoost algorithm. XGBoost (Extreme Gradient Boosting) is a method based on the GBDT algorithm, proposed by Chen in 2016 [24]. It is a boosting tree model that improves the traditional GBDT model through regularization and a second-order Taylor expansion. ...
Article
Full-text available
In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, competition within the banking industry has intensified. At the same time, with the rapid development of information technology and Internet technology, customers' choice of financial products is becoming more and more diversified, customers' dependence on and loyalty to banking institutions is declining, and the problem of customer churn in commercial banks is becoming more and more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks to solve. Therefore, this study takes a bank's business data on the Kaggle platform as the research object, uses multiple sampling methods to compare the data for balancing, constructs a bank customer churn prediction model for churn identification by GA-XGBoost, and conducts interpretability analysis on the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) The applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data. (2) The F1 and AUC values of the XGBoost model improved and optimized using the genetic algorithm can reach 90% and 99%, respectively, which are optimal compared to the other six machine learning models. The GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results, and analyze the features that have a high impact on the model prediction, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. The contribution of this paper is mainly in two aspects: (1) this study can provide useful information from the black box model based on the accurate identification of churned customers, which can provide a reference for commercial banks to improve their service quality and retain customers; (2) it can provide a reference for customer churn early warning models in other related industries, which can help the banking industry to maintain customer stability, maintain market position and reduce corporate losses.
... The labels are one when a solar irradiance blockage by a probable partial/full snow cover increases the power loss of the PV system and zero otherwise. After labelling the dataset, a snow-cover prediction model based on extreme gradient boosting (XGBoost) [42] is developed that receives the current values of the AC power and the main meteorological parameters and detects the presence of a snow layer on the panels. The XGBoost model is identified as the best snow-cover predictor after developing and comparing several models based on other computational intelligence techniques. ...
Article
Full-text available
Energy management in a renewable energy‐based microgrid has a key role in improving energy utilisation and reducing the microgrid operation cost. The optimal energy management strategy can be significantly affected by the intermittency of renewable energies and also harsh weather conditions. In this study, a novel snow conditions‐compatible computational intelligence‐based short‐term photovoltaic (PV) power forecasting (PVPF) approach is proposed that is independent of exogenous weather forecasts. The proposed approach consists of a snow cover detection stage, a snow cover forecasting stage, and a PV power forecasting stage. This approach is then validated for a model predictive control (MPC)‐based energy management system (EMS) of a PV energy‐based grid‐connected microgrid located in a snow‐prone area. The PVPF method together with a computational intelligence‐based short‐term load demand forecasting model constitutes the forecasting block of the EMS. The forecasting block generates day‐ahead hourly forecasts based on the local measurements of the meteorological‐electrical parameters and sends them to the optimisation block where a two‐stage control method, corresponding to the tertiary and secondary control levels, is developed based on mixed‐integer linear and quadratic programming. The developed EMS is applied to a test microgrid simulated in MATLAB/Simulink and compared with a heuristic control method. The results show that the proposed approach can reduce the overall operation cost of the microgrid by 8% (24$), 15% (166$), and 13% (235$) on sunny, cloudy, and snowy days under study, respectively, compared to the heuristic controller.
... Given that observed variables may be interconnected and exhibit non-linear patterns, traditional methods such as the vector autoregressive (VAR) [13] model and the Gaussian process (GP) [14] model may fail to capture these patterns. The same problem also affects support vector regression [15] and XGBoost [16]. Furthermore, these statistical models may suffer from high computational complexity when dealing with larger datasets [17], [18]. ...
Article
Full-text available
Multivariate time series (MTS) forecasting is a crucial aspect of many classification and regression tasks. In recent years, deep learning models have become the mainstream framework for MTS forecasting. Among these deep learning methods, the transformer model has proved particularly effective due to its ability to capture long- and short-term dependencies. However, the computational complexity of transformer-based models poses obstacles in resource-constrained scenarios. To address this challenge, we propose a novel and efficient Skip-RCNN network that incorporates Skip-RNN and Skip-CNN modules to split the MTS into multiple frames with various time intervals. Thanks to the skipping process of Skip-RNN and Skip-CNN, the resulting network can process information with different receptive fields together and achieves better performance than the state-of-the-art networks. We conducted comparative experiments using our proposed method and six baseline models on seven publicly available datasets. The results demonstrate that our model outperforms the other baseline methods in accuracy under most conditions and surpasses the transformer-based model by 0.098 for a short interval and 0.068 for a long interval. Our Skip-RCNN network presents a promising approach to MTS forecasting that can meet the demands of resource-constrained prediction scenarios.
... XGB: XGBoost [51] uses a method called CART (classification and regression trees), in which all leaves are related to the final score of a model, unlike a decision-making tree that only considers the result values of leaf nodes [52]. While a common decision-making tree is interested only in how well the classification performed, CART enables the comparison of superiority among models that retain identical classification results. ...
Article
Full-text available
Natural gas is widely used for domestic and industrial purposes, yet whether it is leaking into the air cannot be directly known. The current problem is that gas leakage is not only economically harmful but also detrimental to health. Therefore, much research has been done on gas damage and leakage risks, but research on predicting gas leakages is just beginning. In this study, we propose a method based on deep learning to predict gas leakage from environmental data. Our proposed method has successfully improved the performance of machine learning classification algorithms by efficiently preparing training data using a deep autoencoder model. The proposed method was evaluated on an open dataset containing natural gas and environmental information and compared with the extreme gradient boosting (XGBoost), K-nearest neighbors (KNN), decision tree (DT), random forest (RF), and naive Bayes (NB) algorithms. The proposed method is evaluated using accuracy, F1-score, mean square error (MSE), mean intersection over union (mIoU), and area under the ROC curve (AUC). The method presented in this study outperformed all compared methods. Moreover, the deep autoencoder and ordinal encoder-based XGBoost (DA-MA-XGBoost) showed the best performance, giving 99.51% accuracy, an F1-score of 99.53%, an MSE of 0.003, an mIoU of 99.40, and an AUC of 99.62%.
... According to Morettin and Motta (2022), tree-based models were developed by Leo Breiman and are quite popular due to their conceptual and computational simplicity; over the years, they have been widely explored, serving as the basis for generalizations such as RF and XGBoost. The latter, for example, was presented by Chen and Guestrin (2016), who created an implementation of the Gradient Boosting algorithm (proposed by Friedman et al. (2000)). The motivation for creating this new algorithm was, in particular, to use fewer resources in a system with better performance than the alternatives: a scalable tree boosting system that reduces training time when dealing with large datasets. ...
Article
Full-text available
This study evaluates the performance of several classifiers from the literature on data in the tangent space, in the context of statistical shape analysis. In addition, simulations were carried out considering three scenarios: (1) data without the use of principal component analysis (PCA); and (2) with PCA, using the components that explain 70% to 75% and 90% to 95% of the variance. The simulation showed that when there is low concentration in the data, classifier performance decreases, with significant gains in accuracy when PCA was used in most of the observed scenarios. The next step consisted of performing classification using four applications to real data, considering the same scenarios as the simulation study. In these, the best results were observed on databases whose mean shapes were markedly distinct between groups. On the other hand, the worst performances were observed on data related to magnetic resonance images of schizophrenic patients, with a maximum accuracy of 85.7%.
... Therefore, these techniques offer significant informational value that is not readily identifiable through other statistical methods, thereby offering a novel perspective on the data. One prominent category of widely used algorithms comprises tree- or forest-based techniques, such as Random Forest (RF) (Breiman, 2001) or XGBoost (XGB) (Chen & Guestrin, 2016). A typical ML workflow involves the initial preparation of the training and testing datasets, which include both the features and the target variables. ...
... An established method for integrating ML into GPR data analysis revolves around regressing grouting thickness by utilizing GPR waveforms as sample data. CatBoost (Prokhorenkova et al., 2018), LightGBM (Ke et al., 2017) and XGBoost (XGB) (Chen & Guestrin, 2016; Shehadeh et al., 2021) are classical boosting algorithms built on gradient boosting decision tree (GBDT) techniques, which are designed to tackle regression and classification problems. Nonetheless, extracting and selecting features poses a substantial challenge for the integration of supervised ML algorithms into GPR data analysis, owing to the uncertainty surrounding the precise correlation between the detection target and its corresponding GPR waveform. ...
Article
Full-text available
Ground penetrating radar (GPR) is a vital non-destructive testing (NDT) technology that can be employed for detecting the backfill grouting of shield tunnels. To achieve intelligent analysis of GPR data and overcome the subjectivity of traditional data processing methods, the CatBoost & BO-TPE model was constructed for regressing the grouting thickness based on GPR waveforms. A full-scale model test and corresponding numerical simulations were carried out to collect GPR data at 400 and 900 MHz, with known backfill grouting thickness. The model test helps address the limitation of not knowing the grout body condition in actual field detection. The data were then used to create machine learning datasets. The method of feature selection was proposed based on the analysis of feature importance and the electromagnetic (EM) propagation law in mediums. The research shows that: (1) the CatBoost & BO-TPE model exhibited outstanding performance in both experimental and numerical data, achieving R2 values of 0.9760, 0.8971, 0.8808, and 0.5437 for numerical data and test data at 400 and 900 MHz. It outperformed extreme gradient boosting (XGBoost) and random forest (RF) in terms of performance in the backfill grouting thickness regression; (2) compared with the full-waveform GPR data, the feature selection method proposed in this paper can promote the performance of the model. The selected features within the 5–30 ns of the A-scan can yield the best performance for the model; (3) compared to GPR data at 900 MHz, GPR data at 400 MHz exhibited better performance in the CatBoost & BO-TPE model. This indicates that the results of the machine learning model can provide feedback for the selection of GPR parameters; (4) the application results of the trained CatBoost & BO-TPE model in engineering are in line with the patterns observed through traditional processing methods, yet they demonstrate a more quantitative and objective nature compared to the traditional method.
... Extreme Gradient Boosting (XGBoost): XGBoost is a powerful ML algorithm for regression and classification problems [41]. It implements gradient boosting machines, which are ensemble models that aim to minimize prediction error by combining the predictions of multiple simpler models, called 'weak learners'. ...
Article
Full-text available
Data-driven approaches are helpful for quantitative justification and performance evaluation. The Netherlands has made notable strides in establishing a national protocol for bicycle traffic counting and collecting GPS cycling data through initiatives such as the Talking Bikes program. This article addresses the need for a generic framework to harness cycling data and extract relevant insights. Specifically, it focuses on the application of estimating average bicycle delays at signalized intersections, as this is an essential variable in assessing the performance of the transportation system. This study evaluates machine learning (ML)-based approaches using GPS cycling data. The dataset provides comprehensive yet incomplete information regarding one million bicycle rides annually across The Netherlands. These ML models, including random forest, k-nearest neighbor, support vector regression, extreme gradient boosting, and neural networks, are developed to estimate bicycle delays. The study demonstrates the feasibility of estimating bicycle delays using sparse GPS cycling data combined with publicly accessible information, such as weather information and intersection complexity, alleviating the burden of understanding local traffic conditions. It emphasizes the potential of data-driven approaches to inform traffic management, bicycle policy, and infrastructure development.
... We selected several classical machine-learning classification algorithms included in the scikit-learn v1.3.2 library (Fabian, 2011) in Python v3.11, comprising Logistic Regression (LogisticRegression; (Cox, 1958)), Decision Trees (DecisionTreeClassifier; (Fisher, 1936)), Random Forest (RandomForestClassifier; (Breiman, 2001)), Support Vector Machines (SVC; (Cortes and Vapnik, 1995)), K-Nearest Neighbor (KNeighborsClassifier; (Fix and Hodges, 1989)), Gaussian Naive Bayes (GaussianNB; (Bayes, 1958)), Multi-layer Perceptron (MLPClassifier; (Rumelhart et al., 1987)), AdaBoost (AdaBoostClassifier; (Freund and Schapire, 1997)), Gradient Boosting (GradientBoostingClassifier; (Friedman, 2001)), Quadratic Discriminant Analysis (QuadraticDiscriminantAnalysis; (Fisher, 1936)), and XGBoost (XGBClassifier; (Chen et al., 2016)). These algorithms were chosen based on previous research outcomes, documented in the specialised literature, and their suitability for the nature of our data. ...
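A hedged sketch of the kind of multi-classifier comparison described above, using a small subset of the listed scikit-learn models plus XGBoost; the data and cross-validation settings are illustrative assumptions.

```python
# Sketch of comparing several classifiers by cross-validation
# (subset of the listed models; data is an illustrative assumption).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```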
Article
Full-text available
The Douro region is renowned for its quality wines, particularly the famous Port Wine. Vintage years, declared approximately 2-3 times per decade, signify exceptional quality linked to optimum climatic conditions driving grape quality attributes. Climate change poses challenges, as rising temperatures and extreme weather events impact viticulture. This study uses machine learning algorithms to assess the climatic influence on vintage years and climate change impacts for the next decades. Historical vintage data were collected from 1850 to 2014. Monthly climatic data for the same period were obtained, including temperature, precipitation, humidity, solar radiation, and wind components. Various machine-learning algorithms were selected for classification, and a statistical analysis helped identify relevant climate variables for differentiation. Cross-validation was used for model training and evaluation, with the hits and misses (confusion matrix) as the performance metric. The best-performing model underwent hyperparameter tuning. Subsequently, future climate projections were acquired for four regional climate models from 2030 until 2099 under different socioeconomic scenarios (IPCC SSP2, SSP3, and SSP5). Quantile mapping bias adjustment was applied to correct future climate data and reduce model biases. Past data revealed that vintages occurred in 23.6% of the years, with an average of two vintage years per decade and a slightly positive trend. Climate variables such as precipitation in March, air temperatures in April and May, humidity in March and April, solar radiation in March, and meridional wind in June were identified as important factors influencing vintage year occurrence. Machine-learning models were employed to predict vintage years based on the climate variables, with the XGBClassifier achieving the highest performance with 76%/88% hits for the vintage/non-vintage classes, respectively, and an ROC score of 0.86, demonstrating strong predictive capabilities. Future climate change scenarios under different socioeconomic pathways were assessed, and the results indicated a decrease in the occurrence of vintage years until 2099 (10.3% for SSP2, 9.1% for SSP3, and 5.8% for SSP5). The study provides valuable insights into the relationship between climate variables and wine vintage years, enabling winemakers to make informed decisions about vineyard management and grape cultivation. The predictions suggest that climate change may challenge the wine industry, emphasising the need for adaptation strategies.
... One of the tools that have significantly impacted the field is the machine learning library XGBoost. Known for its efficiency and accuracy, XGBoost has been a preferred choice for data scientists and researchers dealing with complex climate data [2]. ...
Article
Full-text available
The recent release of XGBoost 2.0, an advanced machine learning library, embodies a substantial advancement in analytical tools available for climate science research. With its novel features like Multi-Target Trees with Vector-Leaf Outputs, enhanced scalability, and computational efficiency improvements, XGBoost 2.0 is poised to significantly aid climate scientists in dissecting complex climate data, thereby fostering a deeper understanding of climate dynamics. This article delves into the key features of XGBoost 2.0 and elucidates its potential applications and benefits in the domain of climate science analytics.
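A hedged sketch of the vector-leaf, multi-target feature highlighted above, as exposed in XGBoost 2.0 through the `multi_strategy` parameter (to the best of our reading of the 2.0 release); the data and the two climate-like targets are illustrative assumptions.

```python
# Sketch of multi-target trees with vector-leaf outputs in XGBoost 2.0
# (assumes xgboost >= 2.0; data and targets are illustrative assumptions).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 8))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])  # two assumed targets

model = xgb.XGBRegressor(
    tree_method="hist",
    multi_strategy="multi_output_tree",   # one tree predicts all targets (vector leaves)
    n_estimators=100,
)
model.fit(X, Y)
print(model.predict(X[:3]))               # shape: (3, 2)
```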
... We then trained an XGBoost classifier [25] using the 6500 labeled reviews with a standard 80:20 train-test split for training the model. The model achieved 0.97 accuracy, 0.99 precision, 0.80 recall, and a 0.89 F1-score, indicating high accuracy and reliability [26]. ...
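A sketch of the evaluation just described: an 80:20 split and the four reported metrics. The assumed numeric features below stand in for the study's text-derived features.

```python
# Sketch of an 80:20 split with accuracy/precision/recall/F1 reporting
# (synthetic features stand in for the study's review data).
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = xgb.XGBClassifier().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("F1       :", f1_score(y_te, pred))
```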
Preprint
Full-text available
Online user feedback has become an essential mechanism for software organizations to gain insight into user concerns and to recognize areas for improvement. In software platform ecosystems, staying abreast of user feedback is particularly challenging due to the multitude of feedback channels and the complex interplay with third-party applications. In this paper we report on a mixed-method study of user feedback covering over 40,000 relevant reviews from 139 SECO platforms, out of 2.4 million online user reviews scraped from 283 retrieved SECO platforms. Through thematic analysis and machine learning classifiers with high accuracy, we identified and analyzed six categories of user challenges in the areas of Integration, Customer Support, Design & Complexity, Privacy & Security, Cost & Pricing, and Performance & Compatibility. Our analysis also shows a significant growth of SECO user feedback in the past five years, highlighting the importance of understanding such user feedback as well as research methodologies to automatically study online user concerns in software ecosystems. To further understand mitigation strategies for challenges reported by end users, we interviewed four executives from large ecosystems and describe strategies for addressing those identified challenges. This research is a first large-scale study of user feedback in software ecosystems; the categories of user concerns are hopefully useful in guiding platforms in designing and fostering better software ecosystems. Our methodology for automatically classifying SECO-related user feedback can also serve as guidance for future studies that can further advance our understanding of user feedback and how to integrate it into improved software ecosystems.
... The following is a selection of 14 traditional models, as well as the Deep Neural Network (DNN) models that have been commonly used in recent years. These traditional machine learning models for predicting customer churn include: the Logistic Regression (LR) model, commonly employed to predict probabilities and classify customers as either churned or non-churned [29]; the Random Forest Classifier (RFC), in which the predictions from multiple trees are combined to form the final prediction [30]; Gaussian Naive Bayes (NB), particularly suitable for handling continuous feature data in classification problems [31]; LightGBM (LGB), a gradient boosting framework known for its speed and efficiency, capable of handling large-scale datasets [32]; the Bagging Classifier (BGC), an ensemble learning technique that typically builds multiple models using bootstrap sampling and combines their predictions through averaging or voting to make the final prediction [33]; the Decision Tree Classifier (DTC), which repeatedly divides the dataset into different subsets to make the final prediction; the Gradient Boosting Classifier (GB), an ensemble learning technique often implemented using boosting algorithms, incrementally improving the accuracy of weak learners to build a powerful predictive model; XGBoost (XGB), which excels on large-scale datasets and offers regularization features to prevent overfitting [34]; Linear Discriminant Analysis (LDA), which achieves classification by projecting features into a lower-dimensional space [35]; and the Ridge Classifier (RC), a linear model that uses L2 regularization to control model complexity and mitigate issues with multicollinearity [36]. In the presentation of our research findings within the manuscript, we account for the heterogeneity across various datasets, recognizing the plausible challenge of data imbalance. ...
Article
Full-text available
This study proposes a hybrid approach to predict customer churn by combining statistical approaches and machine learning models. Unlike traditional methods, where churn is defined by a fixed period of time, the proposed algorithm uses the probability of the customer being alive, derived from the statistical model, to dynamically determine the churn line. After observing customer churn through clustering over time, the proposed method segmented customers into four behaviors: new, short-term, high-value, and churn, and selected machine learning models to predict the churned customers. This combination reduces the risk of customers with longer consumption cycles being misjudged as churned. Two public datasets were used to evaluate the hybrid approach: an online retailer of U.K. gifts and the largest e-commerce platform of Pakistan. Based on the top three learning models, the recall ranged from 0.56 to 0.72 in the former, while it ranged from 0.91 to 0.95 in the latter. Results show that the proposed approach enables companies to retain important customers earlier by predicting customer churn. The proposed hybrid method requires less data than existing methods.
... • Estimator F(·): an XGBoost regression model was used as the estimator [52]. For this selection we took into account the size of the available datasets (800-6K observations per AOI), and the fact that we had tabular data containing categorical variables extracted in the feature engineering process from the initial data (such as the province based on the coordinates of the observation) [53,54]. ...
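A hedged sketch of an estimator setup like the one described above: XGBoost regression on tabular data with a categorical feature, using `enable_categorical` with pandas category dtype (available in recent xgboost versions with the `hist` tree method). All column names and values are assumptions for illustration.

```python
# Sketch of XGBoost regression on tabular data with a categorical feature
# (assumes xgboost with enable_categorical support; names/values are assumed).
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(7)
n = 800
df = pd.DataFrame({
    "temperature": rng.normal(20, 5, n),
    "precipitation": rng.gamma(2.0, 10.0, n),
    "province": pd.Categorical(rng.choice(["A", "B", "C"], n)),  # engineered feature
})
y = 5 * df["temperature"] + 0.2 * df["precipitation"] + rng.normal(size=n)

model = xgb.XGBRegressor(tree_method="hist", enable_categorical=True)
model.fit(df, y)
```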
Article
Full-text available
Mosquito-borne diseases have been spreading across Europe over the past two decades, with climate change contributing to this spread. Temperature and precipitation are key factors in a mosquito’s life cycle, and are greatly affected by climate change. Using a machine learning framework, Earth Observation data, and future climate projections of temperature and precipitation, this work studies three different cases (Veneto region in Italy, Upper Rhine Valley in Germany and Pancevo, Serbia) and focuses on (i) evaluating the impact of climate factors on mosquito abundance and (ii) long-term forecasting of mosquito abundance based on EURO-CORDEX future climate projections under different Representative Concentration Pathways (RCPs) scenarios. The study shows that increases in precipitation and temperature are directly linked to increased mosquito abundance, with temperature being the main driving factor. Additionally, as the climatic conditions become more extreme, meaning higher variance, the mosquito abundance increases. Moreover, we show that in the upcoming decades mosquito abundance is expected to increase. In the worst-case scenario (RCP8.5) Serbia will face a 10% increase, Italy around a 40% increase, and Germany will reach almost a 200% increase by 2100, relative to the decade 2010–2020. However, in terms of absolute numbers both in Italy and Germany, the expected increase is similar. An interesting finding is that either strong (RCP2.6) or moderate mitigation actions (RCP4.5) against greenhouse gas concentration lead to similar levels of future mosquito abundance, as opposed to no mitigation action at all (RCP8.5), which is projected to lead to high mosquito abundance for all cases studied.
Article
Full-text available
The Internet of Things (IoT) is a new paradigm of our time, in which smart devices and sensors from around the world are connected into a global network, and distributed applications and services affect all spheres of human activity. Owing to its enormous economic impact and pervasive influence on our lives, the Internet of Things is an attractive target for criminals, and cybersecurity is becoming a priority for the IoT ecosystem. Intrusion detection systems (IDS) exist to protect IoT networks from attacks. This paper considers IDS whose core is built on machine learning methods, since such IDS are capable of self-learning and, unlike classical IDS, can operate at sufficient speed on relatively modest hardware. The paper provides a comprehensive classification of attacks on IoT networks, reviews classical machine learning methods and modern neural network architectures, including transformer models, and presents a comparative analysis of their results as applied to the task of intrusion detection in IoT networks.
Article
Full-text available
In recent years, with rapid economic development, environmental problems have become increasingly prominent; air pollution in particular increasingly affects people's daily lives. Air pollution is mobile and can cause long-term effects over large areas, which are detrimental to the natural environment and the human body. Haze is a form of air pollution comprising PM2.5, components of which adversely impair human health. Previous approaches for predicting PM2.5 have had limited accuracy while requiring vast quantities of data and computational resources. In order to tackle the difficulties of poor fit, large data demand, and slow convergence of prior prediction techniques, a PM2.5 prediction model based on the stacking integration method is proposed. This model employs eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF) as base models, while ridge regression is used as the meta-learner in the stack. PM2.5 concentration is influenced by a variety of pollutant and meteorological factors, and the correlation between PM2.5 concentration and other factors was analyzed using Spearman's correlation coefficient. Several significant factors that determine the haze concentration were selected, and the stacking model was built on these data for training and prediction. The experimental results indicate that the fusion model provides accurate PM2.5 concentration estimates with fewer data features. The RMSE of the proposed model is 19.2 and the R² reached 0.94, an improvement of 3-25% over the single models. This hybrid model performs better in terms of accuracy.
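The stack described above maps directly onto scikit-learn's `StackingRegressor`; the sketch below is a plausible reconstruction (hypothetical file and column names, and an illustrative Spearman cutoff), not the paper's code:

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

df = pd.read_csv("air_quality.csv")  # hypothetical file of numeric columns
target = df["pm25"]
# Keep features whose |Spearman rho| with PM2.5 exceeds an illustrative cutoff.
keep = [c for c in df.columns
        if c != "pm25" and abs(spearmanr(df[c], target)[0]) > 0.3]

stack = StackingRegressor(
    estimators=[("xgb", XGBRegressor()),
                ("lgb", LGBMRegressor()),
                ("rf", RandomForestRegressor())],
    final_estimator=Ridge())  # ridge regression as the meta-learner
stack.fit(df[keep], target)
```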
Article
DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes, where it plays important roles in various developmental and physiological processes and in a wide range of human diseases. It is therefore essential to detect 5mC sites accurately. Although current sequencing technologies can map genome-wide 5mC sites, these experimental methods are both costly and time-consuming. To achieve fast and accurate prediction of 5mC sites, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model using human promoter sequences as the language corpus; BERT is a deep bidirectional language representation model based on the Transformer. Second, we fine-tuned the domain-specific BERT model on the 5mC training dataset to build the final model. The cross-validation results show that our model achieves an AUROC of 0.966, which is higher than other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. Our model was also evaluated on the independent test set, where it again achieves an AUROC of 0.966, higher than the other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a webserver, which can be freely accessed at: http://5mc-pred.zhulab.org.cn.
Article
Full-text available
Infant mortality remains high and uneven in much of sub-Saharan Africa. Even low-cost, highly effective therapies can only save lives in proportion to how successfully they can be targeted to those children who, absent the treatment, would have died. This places great value on maximizing the accuracy of any targeting or means-testing algorithm. Yet the interventions that countries deploy in hopes of reducing mortality are often targeted based on simple models of wealth or income or a few additional variables. Examining 22 countries in sub-Saharan Africa, we illustrate the use of flexible (machine learning) risk models employing up to 25 generally available pre-birth variables from the Demographic and Health Surveys. Using these models, we construct risk scores such that the 10 percent of the population at highest risk account for 15-30 percent of infant mortality, depending on the country. Successful targeting in these models turned on several variables other than wealth, while models that employ only wealth data perform little or no better than chance. Consequently, employing such data and flexible models to predict high-risk births in the countries studied could substantially improve the targeting, and thus the life-saving potential, of existing interventions.
Article
Full-text available
Short-term power load forecasting refers to the use of load and weather information to forecast the day-ahead load, which is very important for power dispatch and the establishment of the power spot market. In this manuscript, a comprehensive study on the framing of input data for electricity load forecasting is proposed based on the extreme gradient boosting algorithm. First, the periodicity of the historical load data was analyzed using the discrete Fourier transform, the autocorrelation function, and the partial autocorrelation function to determine the key width of a sliding window for an optimized load feature. The mean absolute error (MAE) of the frame reached 52.04 on the validation dataset using a boosting model with a 7-day window width. Second, the fusion of datetime variables and meteorological factors was examined in detail to determine how best to improve performance. The datetime variables were encoded as a combination of integers, sine-cosine pairs, and Boolean types, and the meteorological features were determined as a combination of 540 features from 15 sampled sites, which further decreased the MAE to 44.32 on the validation dataset. Last, a training method for day-ahead forecasting was proposed that uses the Minkowski distance to determine the historical span. Under this framework, performance improved significantly without any tuning of the boosting algorithm; the proposed method further decreased the MAE to 37.84. Finally, the effectiveness of the proposed method was evaluated on a 200-day load dataset from the Estonian grid. The achieved MAE of 41.69 outperforms other baseline models, whose MAE ranged from 65.03 to 104.05, and represents a significant improvement of 35.89% over the method currently employed by the European Network of Transmission System Operators for Electricity (ENTSO-E). The robustness of the proposed method is also supported by excellent performance in extreme weather and on special days.
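The sine-cosine encoding of datetime variables mentioned above can be sketched as follows (assuming hourly sampling and hypothetical file and column names; this is an illustration, not the authors' pipeline):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("load.csv", parse_dates=["timestamp"])  # hypothetical file
hour = df["timestamp"].dt.hour
dow = df["timestamp"].dt.dayofweek
# Map cyclic time onto the unit circle so 23:00 and 00:00 end up adjacent.
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
df["dow_sin"] = np.sin(2 * np.pi * dow / 7)
df["dow_cos"] = np.cos(2 * np.pi * dow / 7)
# Lag features over the 7-day sliding-window width selected via ACF/PACF
# (24 rows per day under the hourly-sampling assumption).
for lag_days in range(1, 8):
    df[f"load_lag_{lag_days}d"] = df["load"].shift(24 * lag_days)

train = df.dropna()
features = [c for c in train.columns if c not in ("timestamp", "load")]
model = XGBRegressor().fit(train[features], train["load"])
```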
Article
Full-text available
Fingerprint localization using neural networks is emerging as the state-of-the-art technique for outdoor localization using mobile network features. In this paper, we introduce two sequence-based frameworks showing major accuracy enhancements in large-scale outdoor environments compared to the state of the art in this domain. The first uses a uni-directional LSTM network called SeqOutLoc, and the second uses a bi-directional LSTM network called BiOutLoc. We also introduce the AngleNoiseSynth augmenter to expand the dataset, taking into account the angle of user movement and system noise. For SeqOutLoc, we show how adding sequence information enhances the accuracy of localization in a large-scale outdoor urban area of 45 km² by 25% compared to previous work, while using 35% fewer network parameters. The second model, BiOutLoc, enhances the localization accuracy with fewer network parameters by utilizing both past and future information, which is useful in near-real-time localization. To the best of our knowledge, our work is the first to use a Bi-LSTM model in outdoor fingerprint-based localization. BiOutLoc achieves a median localization accuracy of 9.4 meters, surpassing other deep learning-based localization systems by 31%, while reducing the number of parameters by 67%. Finally, we use transfer learning to fine-tune the parameters of BiOutLoc trained in a certain area using the data from a new area. This results in an 18% enhancement in accuracy and a 71% reduction in training time compared to training the model using only the data of the new area.
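For orientation, a bi-directional LSTM regressor of this general shape can be written in a few lines of Keras; the sizes and input format are assumptions for illustration, not BiOutLoc's actual architecture:

```python
import tensorflow as tf

seq_len, n_features = 10, 32  # assumed sequence length and feature count
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, n_features)),
    # Bi-directional layer reads the fingerprint sequence in both directions.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),  # 2-D position output (e.g., local x, y)
])
model.compile(optimizer="adam", loss="mse")
```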
Preprint
Full-text available
Evolutionary biologists, primarily anatomists and ontogenists, employ modern geometric morphometrics to quantitatively analyse physical forms (e.g., skull morphology) and explore relationships, variations, and differences between samples and taxa using landmark coordinates. The standard approach comprises two steps, Generalised Procrustes Analysis (GPA) followed by Principal Component Analysis (PCA). PCA projects the superimposed data produced by GPA onto a set of uncorrelated variables, which can be visualised on scatterplots and used to draw phenetic, evolutionary, and ontogenetic conclusions. Recently, the use of PCA in genetic studies has been challenged. Due to PCA's central role in morphometrics, we sought to evaluate the standard approach and claims based on PCA outcomes. To test PCA's accuracy, robustness, and reproducibility using benchmark data of the crania of five papionin genera, we developed MORPHIX, a Python package containing the necessary tools for processing superimposed landmark data with classifier and outlier detection methods, which can be further visualised using various plots. We discuss the case of Homo Nesher Ramla, an archaic human with a questionable taxonomy. We found that PCA outcomes are artefacts of the input data and are neither reliable, robust, nor reproducible as field members may assume and that supervised machine learning classifiers are more accurate both for classification and detecting new taxa. Our findings raise concerns about PCA-based findings in 18,000 to 32,900 studies. Our work can be used to evaluate prior and novel claims concerning the origins and relatedness of inter- and intra-species and improve phylogenetic and taxonomic reconstructions.
Article
Full-text available
Objectives Mechanical ventilation in prematurely born infants, particularly if prolonged, can cause long term complications including bronchopulmonary dysplasia. Timely extubation then is essential, yet predicting its success remains challenging. Artificial intelligence (AI) may provide a potential solution. Content A narrative review was undertaken to explore AI's role in predicting extubation success in prematurely born infants. Across the 11 studies analysed, the reported area under the receiver operator characteristic curve (AUC) for the selected prediction models ranged between 0.7 and 0.87. Only two studies implemented an external validation procedure. Comparison to the results of clinical predictors was made in two studies. One group reported a logistic regression model that outperformed clinical predictors on decision tree analysis, while another group reported clinical predictors outperformed their artificial neural network model (AUCs: ANN 0.68 vs. clinical predictors 0.86). Amongst the studies there was a heterogeneous selection of variables for inclusion in prediction models, as well as variations in definitions of extubation failure. Summary Although there is potential for AI to enhance extubation success, no model's performance has yet surpassed that of clinical predictors. Outlook Future studies should incorporate external validation to increase the applicability of the models to clinical settings.
Article
Horizontal principal stress is a fundamental parameter for reservoir reconstruction. For improving single well productivity, accurate evaluation of reservoir stress characteristics is of great importance. One of the main challenges in predicting the magnitude of the in situ stress is how to obtain the rock mechanical parameters accurately. An intelligent fusion model was proposed to predict rock mechanical parameters to address the issue that traditional approaches are not very reliable at predicting the rock mechanical parameters of complex lithology reservoirs, using transitional shale reservoir rocks as the research object. Machine learning algorithms such as nearest neighbor regression, support vector machine, and random forest were selected to construct intelligent fusion models of different rock mechanics parameters based on the laboratory test data. Finally, the logging profile of transitional shale reservoir horizontal principal stress in the study area was obtained under the constraints of the empirical physical model and measured in situ stress data. The results showed that the fusion models outperformed the single model on rock mechanics parameters and had higher accuracy in both training and test sets, meeting the engineering requirements for predicting the horizontal principal stress in the study area.
Chapter
Emerging synergies of nanotechnology and artificial intelligence (AI) promise transformative impacts on various sectors. This fusion unlocks novel materials and applications previously unattainable. This chapter explores AI and nanotech potentials across healthcare, energy, environment, manufacturing, and transportation, emphasizing ethical frameworks for responsible use. The horizon shines bright for AI and nanotech, ushering in an era of unprecedented innovation. Rapid advancements beckon boundless achievements. However, prudent navigation is essential, given potential risks like autonomous weapons or hazardous nanomaterials. Ethical guidelines must steer these technologies toward positive trajectories. Concluding, the chapter addresses challenges and opportunities shaping AI and nanotech's trajectory. Their potential to reshape the world is evident. Guided by ethics, the authors hold the key to harnessing their power for global betterment, marrying innovation with ethical stewardship.
Article
Background An early warning tool to predict attacks could enhance asthma management and reduce the likelihood of serious consequences. Electronic health records (EHRs) providing access to historical data about patients with asthma coupled with machine learning (ML) provide an opportunity to develop such a tool. Several studies have developed ML-based tools to predict asthma attacks. Objective This study aims to critically evaluate ML-based models derived using EHRs for the prediction of asthma attacks. Methods We systematically searched PubMed and Scopus (the search period was between January 1, 2012, and January 31, 2023) for papers meeting the following inclusion criteria: (1) used EHR data as the main data source, (2) used asthma attack as the outcome, and (3) compared ML-based prediction models' performance. We excluded non-English papers and nonresearch papers, such as commentary and systematic review papers. In addition, we also excluded papers that did not provide any details about the respective ML approach and its result, including protocol papers. The selected studies were then summarized across multiple dimensions including data preprocessing methods, ML algorithms, model validation, model explainability, and model implementation. Results Overall, 17 papers were included at the end of the selection process. There was considerable heterogeneity in how asthma attacks were defined. Of the 17 studies, 8 (47%) studies used routinely collected data both from primary care and secondary care practices together. Extremely imbalanced data was a notable issue in most studies (13/17, 76%), but only 38% (5/13) of them explicitly dealt with it in their data preprocessing pipeline. The gradient boosting–based method was the best ML method in 59% (10/17) of the studies. Of the 17 studies, 14 (82%) studies used a model explanation method to identify the most important predictors. None of the studies followed the standard reporting guidelines, and none were prospectively validated. Conclusions Our review indicates that this research field is still underdeveloped, given the limited body of evidence, heterogeneity of methods, lack of external validation, and suboptimally reported models. We highlighted several technical challenges (class imbalance, external validation, model explanation, and adherence to reporting guidelines to aid reproducibility) that need to be addressed to make progress toward clinical adoption.
Preprint
Full-text available
Cardiac surgery-associated Acute Kidney Injury (CSA-AKI) is a significant complication that often leads to increased morbidity and mortality. Effective CSA-AKI management relies on timely diagnosis and interventions. However, many cases of CSA-AKI are detected too late. Despite the efforts of novel biomarkers and data-driven predictive models, their limited discriminative and generalization capabilities along with stringent application requirements pose challenges for clinical use. Here we incorporate a causal deep learning approach that combines the universal approximation abilities of neural networks with causal discovery to develop REACT, a reliable and generalizable model to predict a patient’s risk of developing CSA-AKI within the next 48 hours. REACT was developed using 21.5 billion time-stamped medical records from two large hospitals covering 23,933 patients and validated in three independent centers covering 30,963 patients. By analyzing the causal relationships buried in the time dimensions, REACT distilled the complex temporal dynamics among variables into six minimal causal inputs and achieved an average AUROC of 0.93 (ranging from 0.89 to 0.96 among different CSA-AKI stages), surpassing state-of-the-art models that depend on more complex variables. This approach accurately predicted 97% of CSA-AKI events within 48 hours for all prediction windows, maintaining a ratio of two false alerts for every true alert, improving practical feasibility. Compared to guideline-recommended pathways, REACT detected CSA-AKI on average 16.35 hours earlier in external tests. In addition, we have established a publicly accessible website and performed prospective validation on 754 patients across two centers, achieving high accuracy. Our study holds substantial promise in enhancing early detection and preserving critical intervention windows for clinicians.
Article
Full-text available
A critical problem for several real-world applications is class imbalance. In contexts like fraud detection or medical diagnostics, standard machine learning models fail because they are designed for balanced class distributions. Existing solutions typically increase the rare-class instances by generating synthetic records to achieve a balanced class distribution. However, these procedures generate implausible data and tend to create unnecessary noise. We propose a change of perspective: instead of relying on resampling techniques, we depend on unsupervised feature engineering approaches to represent records with a combination of features that helps the classifier capture the differences among classes, even in the presence of imbalanced data. Thus, we combine a large array of outlier detection, feature projection, and feature selection approaches to augment the expressiveness of the dataset population. We show the effectiveness of our proposal in a deep and wide set of benchmarking experiments as well as in real case studies.
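One way to realize this perspective, sketched below under my own choice of components (an isolation-forest outlier score and PCA projections; the paper combines a much larger array of such methods), is to append unsupervised signals to the raw features before fitting an ordinary classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Imbalanced synthetic data: roughly 95% majority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# Unsupervised feature engineering: outlier scores and linear projections.
outlier_score = IsolationForest(random_state=0).fit(X).score_samples(X)
projections = PCA(n_components=5).fit_transform(X)
X_augmented = np.column_stack([X, outlier_score, projections])

# No resampling: the classifier sees the original class distribution,
# but with a richer representation of each record.
clf = RandomForestClassifier(random_state=0).fit(X_augmented, y)
```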
Preprint
Full-text available
Objectives: Pelvic radiography can quickly diagnose pelvic fractures, and the Association for Osteosynthesis Foundation and Orthopedic Trauma Association (AO/OTA) classification system is useful for evaluating pelvic fracture instability. This study aimed to develop a radiomics-based machine-learning algorithm to quickly diagnose fractures on pelvic X-rays and classify their instability. A total of 93 radiomic features were extracted: 18 first-order, 24 GLCM, 16 GLRLM, 16 GLSZM, 5 NGTDM, and 14 GLDM features. To improve the performance of machine learning, the feature-selection methods RFE, SFS, LASSO, and Ridge were used, and the machine learning models were LR, SVM, RF, XGB, MLP, KNN, and LGBM. Results: The machine learning models were trained on the features selected by the four feature-selection methods. Among them, the combination with the SVM model showed the best performance, with an average AUC of 0.75±0.06. A feature-importance graph for the RFE and SVM combination makes it possible to identify the features with the highest importance. Conclusions: The AO/OTA classification of normal pelvic rings and pelvic fractures on pelvic AP radiographs using a radiomics-based machine learning model showed the highest AUC when using the SVM classifier combination.
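The best-reported combination — RFE wrapped around a linear SVM — can be sketched as follows (synthetic data stands in for the 93-feature radiomics matrix; the number of retained features is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for the radiomics feature matrix (93 features, as in the study).
X, y = make_classification(n_samples=200, n_features=93, random_state=0)

svm = SVC(kernel="linear")  # a linear kernel exposes coef_ for RFE to rank
selector = RFE(estimator=svm, n_features_to_select=20)  # illustrative count
X_selected = selector.fit_transform(X, y)

auc = cross_val_score(SVC(kernel="linear"), X_selected, y,
                      cv=5, scoring="roc_auc").mean()
print(f"mean AUC on selected features: {auc:.2f}")
```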
Article
Full-text available
LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them.
Article
Full-text available
Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.
Article
Full-text available
Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point evaluated at the current step. It is shown that both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure. Specifically, at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used in place of the full sample to fit the base learner and compute the model update for the current iteration. This randomized approach also increases robustness against overcapacity of the base learner.
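scikit-learn's gradient boosting exposes this randomization directly: setting `subsample` below 1.0 draws that fraction of the training rows, without replacement, for each boosting iteration. A minimal illustration on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# subsample=0.5: each base tree is fitted on a random half of the rows,
# the stochastic variant described in the abstract above.
model = GradientBoostingClassifier(subsample=0.5, random_state=0)
model.fit(X, y)
```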
Conference Paper
Full-text available
We cast the ranking problem as (1) multiple classification ("Mc") and (2) multiple ordinal classification, which leads to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regression-based ranker [6] in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.
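In symbols, with relevance grades k = 0, ..., K−1 and learned class probabilities, the Expected Relevance score can be written as (standard notation, not copied from the paper):

```latex
s(x) \;=\; \sum_{k=0}^{K-1} k \,\hat{P}(y = k \mid x)
```

Sorting documents by s(x) converts the classifier's probability estimates into ranking scores.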
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
Conference Paper
Full-text available
Gradient Boosted Regression Trees (GBRT) are the current state-of-the-art learning paradigm for machine learned web-search ranking - a domain notorious for very large data sets. In this paper, we propose a novel method for parallelizing the training of GBRT. Our technique parallelizes the construction of the individual regression trees and operates using the master-worker paradigm as follows. The data are partitioned among the workers. At each iteration, the worker summarizes its data-partition using histograms. The master processor uses these to build one layer of a regression tree, and then sends this layer to the workers, allowing the workers to build histograms for the next layer. Our algorithm carefully orchestrates overlap between communication and computation to achieve good performance. Since this approach is based on data partitioning, and requires a small amount of communication, it generalizes to distributed and shared memory machines, as well as clouds. We present experimental results on both shared memory machines and clusters for two large scale web search ranking data sets. We demonstrate that the loss in accuracy induced due to the histogram approximation in the regression tree creation can be compensated for through slightly deeper trees. As a result, we see no significant loss in accuracy on the Yahoo data sets and a very small reduction in accuracy for the Microsoft LETOR data. In addition, on shared memory machines, we obtain almost perfect linear speed-up with up to about 48 cores on the large data sets. On distributed memory machines, we get a speedup of 25 with 32 processors. Due to data partitioning our approach can scale to even larger data sets, on which one can reasonably expect even higher speedups.
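The histogram exchange at the heart of this scheme can be condensed into a single-machine sketch (my stand-in for the master-worker protocol, for one feature under a squared-error criterion):

```python
import numpy as np

def partition_histogram(x, residual, edges):
    """Worker side: per-bin residual sums and counts for one data partition."""
    bins = np.digitize(x, edges)           # bin index per sample
    n_bins = len(edges) + 1
    sums = np.bincount(bins, weights=residual, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    return sums, counts

def best_split(histograms, edges):
    """Master side: merge worker histograms, scan edges for the best split."""
    sums = sum(h[0] for h in histograms)
    counts = sum(h[1] for h in histograms)
    total_sum, total_cnt = sums.sum(), counts.sum()
    best_gain, best_edge = -np.inf, None
    left_sum = left_cnt = 0.0
    for i, edge in enumerate(edges):
        left_sum += sums[i]
        left_cnt += counts[i]
        right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        # Variance-reduction surrogate: maximize sum(S^2 / n) over children.
        gain = left_sum**2 / left_cnt + right_sum**2 / right_cnt
        if gain > best_gain:
            best_gain, best_edge = gain, edge
    return best_edge
```

In the distributed setting, each worker would run `partition_histogram` on its data shard; the master merges the results and repeats this scan for every feature at every tree level.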
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
Full-text available
Learning a function of many arguments is viewed from the perspective of high-dimensional numerical quadrature. It is shown that many of the popular ensemble learning procedures can be cast in this framework. In particular, randomized methods, including bagging and random forests, are seen to correspond to random Monte Carlo integration methods, each based on particular importance sampling strategies. Non-random boosting methods are seen to correspond to deterministic quasi Monte Carlo integration techniques. This view helps explain some of their properties and suggests modifications to them that can substantially improve their accuracy while dramatically improving computational performance.
Article
Full-text available
An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN. We present a new online algorithm for computing ε-approximate quantile summaries of very large data sequences. The algorithm has a worst-case space requirement of O((1/ε) log(εN)). This improves upon the previous best result of O((1/ε) log²(εN)). Moreover, in contrast to earlier deterministic algorithms, our algorithm does not require a priori knowledge of the length of the input sequence. Finally, the actual space bounds obtained on experimental data are significantly better than the worst-case guarantees of our algorithm as well as the observed space requirements of earlier algorithms.
Article
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire, and of Friedman, Hastie, and Tibshirani are discussed.
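The functional-gradient view summarized above reduces, at stage m, to fitting the base learner to pseudo-residuals (standard notation, not copied from the paper):

```latex
\tilde{y}_{im} \;=\; -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) \;=\; F_{m-1}(x) + \nu\, h_m(x)
```

Here h_m is the base learner (e.g., a regression tree) fitted by least squares to the pseudo-residuals, and ν is a shrinkage (learning-rate) parameter.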
Article
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
Article
Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most online advertising systems. With over 750 million daily active users and over 1 million active advertisers, predicting clicks on Facebook ads is a challenging machine learning task. In this paper we introduce a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact to the overall system performance. We then explore how a number of fundamental parameters impact the final prediction performance of our system. Not surprisingly, the most important thing is to have the right features: those capturing historical information about the user or ad dominate other types of features. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale). Picking the optimal handling for data freshness, learning rate schema and data sampling improve the model slightly, though much less than adding a high-value feature, or picking the right model to begin with.
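The trees-plus-logistic-regression hybrid can be reconstructed in a few lines with scikit-learn (my sketch, not Facebook's system): the boosted trees are fitted on one half of the data, each sample is then encoded by the leaf it reaches in every tree, and a logistic regression is trained on the one-hot leaf indicators using the other half, avoiding leakage between the two stages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, random_state=0)
X_tree, X_lr, y_tree, y_lr = train_test_split(X, y, random_state=0)

# Stage 1: boosted trees as a feature transformer.
gbdt = GradientBoostingClassifier(n_estimators=50).fit(X_tree, y_tree)
leaves = gbdt.apply(X_lr)[:, :, 0]  # leaf index per sample, per tree

# Stage 2: logistic regression on one-hot leaf-membership features.
enc = OneHotEncoder()
lr = LogisticRegression(max_iter=1000).fit(enc.fit_transform(leaves), y_lr)
```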
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees such as Adaboost for exponential loss and Friedman's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting on many of the datasets we have tested on.
Article
This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., recommender systems and object recognition in vision). The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Press edited book which is currently in production. Visit the tutorial website at http://hunch.net/~large_scale_survey/
Conference Paper
Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most implementations today are computationally expensive and require all training data to be in main memory. As training data becomes ever larger, there is motivation to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive and various approaches have been explored in existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two different distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation and the second utilizes MPI on the Hadoop grid environment.
Article
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
Conference Paper
We present a fast algorithm for computing approximate quantiles in high-speed data streams with deterministic error bounds. For data streams of size N, where N is unknown in advance, our algorithm partitions the stream into sub-streams of exponentially increasing size as they arrive. For each sub-stream, which has a fixed size, we compute and maintain a multi-level summary structure using a novel algorithm. In order to achieve high-speed performance, the algorithm uses simple block-wise merge and sample operations. Overall, our algorithms for fixed-size streams and arbitrary-size streams have a computational cost of O(N log((1/ε) log(εN))) and an average per-element update cost of O(log log N) if ε is fixed.
Article
Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient-descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least-absolute-deviation, and Huber-M loss functions for regression, and multi-class logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are decision trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of decision trees produces competitive, highly robust, interpretable procedures for regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire 1996, and Fr...
R. Bekkerman. The present and the future of the KDD Cup competition: an outsider's perspective.
T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436-444, 2013.
T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification.