Conference Paper

XGBoost: A Scalable Tree Boosting System

Authors: Tianqi Chen and Carlos Guestrin

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
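As a companion to the abstract, the sketch below shows how the system is typically driven through its open-source Python package; the synthetic data, the parameter values, and the choice of the approximate tree method are illustrative assumptions, not prescriptions from the paper.

```python
# Minimal sketch of driving the system described in the abstract through its
# open-source Python package. Data and parameter values are illustrative only.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)       # core data structure; also accepts sparse input
params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.1,                        # shrinkage / learning rate
    "tree_method": "approx",           # approximate split finding via quantile sketch
}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(dtrain)        # predicted probabilities
```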


... RF, BPNN, and other machine learning methods have also been widely used in BB84 and MDI-QKD. Extreme gradient boosting (XGBoost) is an open-source machine learning project developed in 2016 that efficiently implements the gradient boosting decision tree (GBDT) with many algorithmic and engineering improvements [30,31]. ...
... The input is $x_i$. The output $\hat{y}_i$ is predicted by $K$ additive functions [31]: $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, $f_k \in \mathcal{F}$, where $\mathcal{F}$ is the space of regression trees (CART), $y_i$ is the actual value, and $\hat{y}_i^{(t)}$ is the prediction of the i-th instance at the t-th iteration. Expanded by a second-order Taylor approximation, the simplified objective at step $t$ is [31]: $\tilde{\mathcal{L}}^{(t)} = \sum_i \big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \Omega(f_t)$. ...
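To make the $g_i$ and $h_i$ in the quoted objective concrete, here is a hedged sketch of a custom logistic objective for the xgboost Python package that returns exactly those per-instance first- and second-order gradients; the data and parameter choices are synthetic placeholders, not the cited study's setup.

```python
# Custom logistic objective returning the gradient statistics g_i, h_i that enter
# the Taylor-expanded objective quoted above (synthetic data, illustrative only).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] - X[:, 2] > 0).astype(float)
dtrain = xgb.DMatrix(X, label=y)

def logistic_obj(preds, dtrain):
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw margin
    grad = p - labels                  # g_i: first-order derivative of the loss
    hess = p * (1.0 - p)               # h_i: second-order derivative of the loss
    return grad, hess

booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=logistic_obj)
```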
Article
Full-text available
Twin-field quantum key distribution (TF-QKD) can overcome the basic limits of QKD without repeaters. In practice, TF-QKD needs to optimize all parameters when limited data sets are considered. Traditional exhaustive traversal or local search algorithms cannot meet the time and resource requirements of a real-time communication system. Combined with machine learning, prediction-based parameter optimization has become the mainstream approach for QKD. Random forest (RF) is a classical bagging algorithm in ensemble learning, and the back-propagation neural network (BPNN) is an important neural network algorithm. This paper uses extreme gradient boosting (XGBoost), a boosting-class algorithm, to predict the optimal parameters of TF-QKD and compares it with RF and BPNN. The results show that XGBoost can efficiently and accurately predict the optimal parameters, and its performance in parameter prediction is slightly better than that of RF and BPNN, which can provide a reference for future real-time QKD networks.
... Recently, neural network architectures and training routines for tabular data have advanced significantly. Leading methods in tabular deep learning [26,27,64,44] now perform on par with the traditionally dominant gradient boosted decision trees (GBDT) [24,56,16,41]. On top of their competitive performance, neural networks, which are end-to-end differentiable and extract complex data representations, possess numerous capabilities which decision trees lack; one especially useful capability is transfer learning, in which a representation learned on pre-training data is reused or fine-tuned on one or more downstream tasks. ...
... For GBDT implementation, we use the popular CatBoost [56] and XGBoost libraries [16]. ...
... The hyperparameter search space and distributions as well as the default configuration are presented in Table 6. For default configuration, we use default parameters from the XGBoost library [16]. ...
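As a hedged illustration of the "default configuration" mentioned in the snippet, the sketch below fits the XGBoost library's scikit-learn wrapper with its out-of-the-box parameters; the dataset is synthetic and the cited work's actual search space (its Table 6) is not reproduced here.

```python
# GBDT baseline with the XGBoost library's default configuration (illustrative data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = XGBClassifier()                  # default parameters from the XGBoost library
clf.fit(X_tr, y_tr)
print("accuracy with default configuration:", clf.score(X_te, y_te))
```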
Preprint
Full-text available
Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning .
... In parallel, the eXtreme Gradient Boosting (XGBoost) regressor was selected to predict the PV performance, since it is one of the most popular ML algorithms these days [18]. XGBoost has been reported to achieve the best performance in both prediction and classification problems [18]. It is an ensemble algorithm that combines decision trees (or weak learners) using a gradient boosting architecture to construct an enhanced model and optimize the output prediction (i.e., the power at the DC side in this work) [7]. ...
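A minimal sketch of an XGBoost regressor for a PV-power style prediction task like the one referenced above; the feature names, data, and parameters are assumptions for illustration and do not come from the cited study.

```python
# Illustrative XGBoost regression on synthetic PV-style features.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
irradiance = rng.uniform(0, 1000, size=500)       # W/m^2, synthetic
temperature = rng.uniform(0, 40, size=500)        # degrees C, synthetic
X = np.column_stack([irradiance, temperature])
dc_power = 0.2 * irradiance - 0.5 * temperature + rng.normal(scale=5, size=500)

model = XGBRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X, dc_power)
print(model.predict(X[:3]))                        # predicted DC-side power
```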
Conference Paper
Full-text available
A main challenge in the scope of integrating higher shares of photovoltaic (PV) systems is to ensure optimal operations. This can be achieved through next-generation monitoring with automatic data-driven functionalities. This work aims to address this fundamental challenge by presenting the stage of implementation of an advanced cloud-based monitoring platform and a control digital twin for PV power plants (MW scale). The platform is fully equipped with a multitude of artificial intelligent (AI) algorithms for health-state diagnostics and analytics. The performance of the digital twin to act as a health-state monitor was validated against field and synthetic data from PV systems at different locations and demonstrated high accuracies for PV performance modelling and fault diagnosis.
... XGBoost [47] is an open-source library with an efficient and scalable implementation of the gradient boosting framework [79]. Gradient boosting minimizes the prediction error with a gradient descent algorithm and produces a model in the form of a set of weak prediction models (decision trees in this case). ...
... A linear interpolation of the dielectric constant at the temperatures where ΔT_EC was measured provided data for 63 of the materials in the dataset. The missing dielectric constant values in the other 34 materials were handled by the default built-in method in the XGBoost algorithm [47]. Essentially, each of the decision nodes in XGBoost has a default direction, such that when a missing value is encountered during splitting in the tree branch, the instance is classified in that default direction. ...
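The sketch below shows the built-in missing-value handling the snippet describes: NaNs are routed along a learned default direction at each split, so no imputation step is required. The data is synthetic and purely illustrative.

```python
# XGBoost handles NaN entries natively via learned default split directions.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.3] = np.nan              # inject ~30% missing entries

model = XGBRegressor(n_estimators=50, missing=np.nan)  # NaN is treated as missing
model.fit(X, y)
print(model.predict(X[:3]))
```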
Article
Full-text available
An eXtreme Gradient Boosting (XGBoost) machine learning model is built to predict the electrocaloric (EC) temperature change of a ceramic based on its composition (encoded by Magpie elemental properties), dielectric constant, Curie temperature, and characterization conditions. A dataset of 97 EC ceramics is assembled from the experimental literature. By sampling data from clusters in the feature space, the model can achieve a coefficient of determination of 0.77 and a root mean square error of 0.38 K for the test data. Feature analysis shows that the model captures known physics for effective EC materials. The Magpie features help the model to distinguish between materials, with the elemental electronegativities and ionic charges identified as key features. The model is applied to 66 ferroelectrics whose EC performance has not been characterized. Lead-free candidates with a predicted EC temperature change above 2 K at room temperature and 100 kV/cm are identified.
... XGB is a comprehensive ML system for tree boosting proposed by Chen and Guestrin [25], and it was used by winning entries in Kaggle ML competitions in 2015. Gradient Boosting is the base model of XGB, in which multiple boosting iterations are performed. ...
... where f_n corresponds to an independent tree structure with leaf scores, and F is the space of trees. After minimizing the above equation, it takes the following form in Equation (2) [25]. ...
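For readers without access to the cited equation, the standard regularized objective from the original XGBoost paper and the closed-form expressions obtained after minimizing over the leaf scores are shown below as a reference (notation: T leaves, leaf scores w, and G_j, H_j the sums of first- and second-order gradients over the instances in leaf j).

```latex
% Regularized objective of the cited XGBoost paper and its minimized form.
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```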
Article
Full-text available
COVID-19 has imposed many challenges and barriers on traditional healthcare systems due to the high risk of being infected by the coronavirus. Modern electronic devices like smartphones with information technology can play an essential role in handling the current pandemic by contributing to different telemedical services. This study has focused on determining the presence of this virus by employing smartphone technology, as it is available to a large number of people. A publicly available COVID-19 dataset consisting of 33 features has been utilized to develop the aimed model, which can be collected from an in-house facility. The chosen dataset has 2.82% positive and 97.18% negative samples, demonstrating a high imbalance of class populations. The Adaptive Synthetic (ADASYN) method has been applied to overcome the class imbalance problem. Ten optimal features are chosen from the given 33 features, employing two different feature selection algorithms, namely the K Best and recursive feature elimination methods. Mainly, three classification schemes, Random Forest (RF), eXtreme Gradient Boosting (XGB), and Support Vector Machine (SVM), have been applied for the ablation studies, where the XGB, RF, and SVM classifiers achieved accuracies of 97.91%, 97.81%, and 73.37%, respectively. As the XGB algorithm yields the best results, it has been implemented in designing Android- and web-based applications. By analyzing 10 users’ questionnaires, the developed expert system can predict the presence of COVID-19 in a primary suspect. The preprocessed data and codes are available in a GitHub repository.
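A hedged sketch of the kind of pipeline the abstract describes: ADASYN oversampling, selection of 10 features, then an XGBoost classifier. The synthetic data, the f_classif scoring function, and all parameter values are assumptions for illustration, not the study's actual setup.

```python
# Imbalance-aware pipeline: ADASYN resampling -> 10-feature selection -> XGBoost.
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, n_features=33, weights=[0.97, 0.03],
                           random_state=0)                      # ~3% positive class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_tr, y_tr)  # rebalance classes
selector = SelectKBest(f_classif, k=10).fit(X_bal, y_bal)       # keep 10 features
clf = XGBClassifier(n_estimators=200).fit(selector.transform(X_bal), y_bal)
print("test accuracy:", clf.score(selector.transform(X_te), y_te))
```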
... The XGBoost model can explore this complex issue according to the literature (Ma et al., 2020a, 2020b; Qin et al., 2020). XGBoost is an already well-developed model and was reported in detail by Chen and Guestrin (2016); it has been applied in many engineering fields. As a consequence, the Python software was used to implement the XGBoost analysis through the open-source software library to probe the impact of driving behaviour on tyre wear. ...
... Table A1 in Appendix A shows the air quality features studied in the modeling process. Following a comprehensive review of the related literature of Chen and Guestrin [90] and Friedman [91], the GB and XGBoost algorithms are represented as follows: ...
Article
Full-text available
Air pollution, as one of the most significant environmental challenges, has adversely affected the global economy, human health, and ecosystems. Consequently, comprehensive research is being conducted to provide solutions to air quality management. Recently, it has been demonstrated that environmental parameters, including temperature, relative humidity, wind speed, air pressure, and vegetation, interact with air pollutants, such as particulate matter (PM), NO2, SO2, O3, and CO, contributing to frameworks for forecasting air quality. The objective of the present study is to explore these interactions in the three Iranian metropolises of Tehran, Tabriz, and Shiraz from 2015 to 2019 and develop a machine learning-based model to predict daily air pollution. Three distinct assessment criteria were used to assess the proposed XGBoost model, including R-squared (R²), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). Preliminary results showed that although air pollutants were significantly associated with meteorological factors and vegetation, the formulated model had low predictive accuracy (R²: PM2.5 = 0.36, PM10 = 0.27, NO2 = 0.46, SO2 = 0.41, O3 = 0.52, and CO = 0.38). Accordingly, future studies should consider more variables, including emission data from manufactories and traffic, as well as sunlight and wind direction. It is also suggested that strategies be applied to minimize the lack of observational data by considering second- and third-order interactions between parameters, increasing the number of simultaneous air pollution and meteorological monitoring stations, as well as hybrid machine learning models based on proximal and satellite data.
... The performance of the feature representation is actually coupled with an estimator. As a comparative study on the feature representation issue for the identification of DNA-binding proteins, we just examine the performance of several traditional classifiers released in the scikit-learn platform [57], including Gaussian naïve Bayes (GNB), K-nearest neighbors (KNN) [58], decision tree (DT) [59], logistic regression (LR), support vector machine (SVM) [60], random forest (RF) [55], gradient boosting decision tree (GBDT) [61], and eXtreme Gradient Boosting (XGB) [62]. The support vector machine (SVM) is a binary classification model, aiming at finding the largest separation hyperplane of positive and negative samples. ...
Article
Full-text available
The interaction between DNA and protein is vital for the development of a living body. Numerous previous studies on in silico identification of DNA-binding proteins (DBPs) usually include features extracted from the alignment-based (pseudo) position-specific scoring matrix (PSSM), leading to limited application due to its time-consuming generation. Few researchers have paid attention to the application of pretrained language models at the scale of evolution to the identification of DBPs. To this end, we present comprehensive insights into a comparison study on alignment-based PSSM and pretrained evolutionary scale modeling (ESM) representations in the field of DBP classification. The comparison is conducted by extracting information from PSSM and ESM representations using four unified averaging operations and by performing various feature selection (FS) methods. Experimental results demonstrate that the pretrained ESM representation outperforms the PSSM-derived features from a fair comparison perspective. The pretrained feature representation deserves wide application to the area of in silico DBP identification as well as other function annotation issues. Finally, it is also confirmed that an ensemble scheme aggregating various trained FS models can significantly improve the classification performance of DBPs.
... We used the scikit-learn library [40] to implement commonly-used models, including Logistic Regression, Lasso, Decision Tree, Random Forest as well as Extreme Gradient Boosting (XGBoost) as implemented in the XGBoost library [19]. XGBoost outperformed other algorithms in all cases; we therefore only discuss the XGBoost results below. ...
Conference Paper
Full-text available
Self-regulated learning (SRL) is a critical component of mathematics problem solving. Students skilled in SRL are more likely to effectively set goals, search for information, and direct their attention and cognitive process so that they align their efforts with their objectives. An influential framework for SRL, the SMART model, proposes that five cognitive operations (i.e., searching, monitoring, assembling, rehearsing, and translating) play a key role in SRL. However, these categories encompass a wide range of behaviors, making measurement challenging, often involving observing individual students and recording their think-aloud activities or asking students to complete labor-intensive tagging activities as they work. In the current study, we develop machine-learned indicators of SMART operations, in order to achieve better scalability than other measurement approaches. We analyzed students' textual responses and interaction data collected from a mathematical learning platform where students are asked to thoroughly explain their solutions and are scaffolded in communicating their problem-solving process to their peers and teachers. We built detectors of four indicators of SMART operations (namely, the assembling and translating operations). Our detectors are found to be reliable and generalizable, with AUC ROCs ranging from .76 to .89. When applied to the full test set, the detectors are robust against algorithmic bias, performing well across different student populations.
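An illustrative sketch of training a gradient-boosted detector and reporting AUC ROC with cross-validation, in the spirit of the detectors described above; the features and labels of the cited study are not reproduced, and everything here is synthetic.

```python
# Train an XGBoost detector and report cross-validated AUC ROC (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1500, n_features=30, random_state=0)
detector = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4)
aucs = cross_val_score(detector, X, y, cv=5, scoring="roc_auc")
print("AUC ROC per fold:", aucs.round(3))
```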
... Second, independent features were created based on a principal components analysis (PCA) vectorization of 40 high-level features (HLFs) (e.g., molecular weight, melting point, dimers, etc.) [29]. Third, these HLF and CNN model outputs were then both used as independent features for gradient-boosted decision trees (GBDTs) in order to produce the final predictions regarding whether DNA sequences can produce promising sensor candidates for each analyte and sensing environment [30]. The study demonstrated that these ML models can significantly predict DNA-SWCNT MR with relatively few data points. ...
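A hedged sketch of the feature pipeline the snippet describes: compress the high-level features with PCA, then feed them to a gradient-boosted tree classifier. All data, dimensions, and labels are synthetic stand-ins.

```python
# PCA-compressed high-level features fed to a gradient-boosted tree classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
hlf = rng.normal(size=(500, 40))                    # 40 high-level features (HLFs)
labels = (hlf[:, 0] + hlf[:, 1] > 0).astype(int)    # placeholder sensor-response label

hlf_pca = PCA(n_components=10).fit_transform(hlf)   # compressed HLF representation
gbdt = GradientBoostingClassifier().fit(hlf_pca, labels)
print("training accuracy:", gbdt.score(hlf_pca, labels))
```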
Article
Full-text available
Nanoparticle corona phase (CP) design offers a unique approach toward molecular recognition (MR) for sensing applications. Single-walled carbon nanotube (SWCNT) CPs can additionally transduce MR through its band-gap photoluminescence (PL). While DNA oligonucleotides have been used as SWCNT CPs, no generalized scheme exists for MR prediction de novo due to their sequence-dependent three-dimensional complexity. This work generated the largest DNA-SWCNT PL response library of 1408 elements and leveraged machine learning (ML) techniques to understand MR and DNA sequence dependence through local (LFs) and high-level features (HLFs). Out-of-sample analysis of our ML model showed significant correlations between model predictions and actual sensor responses for 6 out of 8 experimental conditions. Different HLF combinations were found to be uniquely correlated with different analytes. Furthermore, models utilizing both LFs and HLFs show improvement over that with HLFs alone, demonstrating that DNA-SWCNT CP engineering is more complex than simply specifying molecular properties.
... XG is a large-scale parallel boosted tree algorithm and an excellent representative of ensemble learning. It is frequently used in data competitions and industry (Chen et al. 2016b). ...
Conference Paper
Full-text available
Depression is a common and severe mental illness. Early detection can reduce costs and improve treatment outcomes. Previous studies mainly relied on posting behaviors to build automatic detection models for patients with depression but ignored replying behaviors. This study systematically analyzes the replying behavior, identifies various features about content, language style and emotion from user-written replies and user-replied posts, and compares their relative importance. The experimental results using a real-world dataset reveal that the replying behavior can significantly improve the traditional detection model. Compared with posting behavior, replying behavior is shown to be more important for depression detection. Further analysis of the replying behavior shows that user-written replies are not effective, while user-replied posts are effective. Considering that there are many users who only have replying behaviors, the proposed detection model will be applicable to a larger number of people.
... 3) eXtreme Gradient Boosting (XGBoost): XGBoost is a cutting-edge classifier that was introduced in [13]. It is an ensemble method that enhances the performance of simpler models by combining them. ...
Preprint
Full-text available
As opposed to standard authentication methods based on credentials, biometric-based authentication has lately emerged as a viable paradigm for attaining rapid and secure authentication of users. Among the numerous categories of biometric traits, electroencephalogram (EEG)-based biometrics is recognized as a promising method owing to its unique characteristics. This paper provides an experimental evaluation of the effect of auditory stimuli (AS) on EEG-based biometrics by studying the following aspects: i) the general change in AS-aided EEG-based biometric authentication in comparison with non-AS-aided EEG-based biometric authentication, ii) the role of the language of the AS and iii) the influence of the conduction method of the AS. Our results show that the presence of an AS can improve authentication performance by 9.27%. Additionally, the performance achieved with an in-ear AS is better than that obtained using a bone-conducting AS. Finally, we verify that performance is independent of the language of the AS.
... The Mann-Whitney U test [56] was conducted to identify the most relevant of the extracted features. An ensemble machine learning algorithm, XGBoost [57], is trained with the extracted features for automated detection of dysgraphia in children. ...
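An illustrative sketch of the screening-plus-classification idea above: keep the features whose Mann-Whitney U test is significant between the two classes, then train an XGBoost classifier. The handwriting features of the cited work are not reproduced; the data and the significance threshold are assumptions.

```python
# Mann-Whitney U feature screening followed by an XGBoost classifier (synthetic data).
import numpy as np
from scipy.stats import mannwhitneyu
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)       # synthetic labels tied to two features

keep = [j for j in range(X.shape[1])
        if mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue < 0.05]
clf = XGBClassifier(n_estimators=100).fit(X[:, keep], y)
print("features retained:", keep)
```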
Preprint
Full-text available
Learning disabilities, which primarily interfere with basic learning skills such as reading, writing and math, are known to affect around 10% of children in the world. Poor motor skills and motor coordination, as part of the neurodevelopmental disorder, can become a causative factor for the difficulty in learning to write (dysgraphia), hindering the academic track of an individual. The signs and symptoms of dysgraphia include but are not limited to irregular handwriting, improper handling of the writing medium, slow or labored writing, unusual hand position, etc. The widely accepted assessment criterion for all types of learning disabilities is the examination performed by medical experts. The few available artificial intelligence-powered screening systems for dysgraphia rely on the distinctive features of handwriting from the corresponding images. This work presents a review of the existing automated dysgraphia diagnosis systems for children in the literature. The main focus of the work is to review artificial intelligence-based systems for dysgraphia diagnosis in children. This work discusses the data collection methods, important handwriting features, and machine learning algorithms employed in the literature for the diagnosis of dysgraphia. Apart from that, this article also discusses some non-artificial-intelligence-based automated systems. Furthermore, this article discusses the drawbacks of existing systems and proposes a novel framework for dysgraphia diagnosis.
... GBC outperforms ABC in terms of accuracy due to its immense flexibility, which allows the algorithm to use as many differentiable and convex loss functions as possible [36]. On the other hand, XGB's scalability presents a structure that achieves algorithmic optimization, distinguishing it from the other boosters [37]. ...
Article
Full-text available
In industry, electric motors such as the squirrel cage induction motor (SCIM) generate motive power and are particularly popular due to their low acquisition cost, strength, and robustness. Along with these benefits, they have minimal maintenance costs and can run for extended periods before requiring repair and/or maintenance. Early fault detection in SCIMs, especially at low-load conditions, further helps minimize maintenance costs and mitigate abrupt equipment failure when loading is increased. Recent research on these devices is focused on fault/failure diagnostics with the aim of reducing downtime, minimizing costs, and increasing utility and productivity. Data-driven predictive maintenance offers a reliable avenue for intelligent monitoring whereby signals generated by the equipment are harnessed for fault detection and isolation (FDI). Particularly, motor current signature analysis (MCSA) provides a reliable avenue for extracting and/or exploiting discriminant information from signals for FDI and/or fault diagnosis. This study presents a fault diagnostic framework that exploits underlying spectral characteristics following MCSA and intelligent classification for fault diagnosis based on extracted spectral features. Results show that the extracted features reflect induction motor fault conditions with significant diagnostic performance (minimal false alarm rate) from intelligent models, out of which the random forest (RF) classifier was the most accurate, with an accuracy of 79.25%. Further assessment of the models showed that RF had the highest computational cost of 3.66 s, while NBC had the lowest at 0.003 s. Other significant empirical assessments were conducted, and the results support the validity of the proposed FDI technique.
... In order to compare the prediction results of different algorithms, five different classifiers were used: RF (Liu, 2017; Wei et al., 2017b), SVM (Song et al., 2018), K-nearest neighbor (KNN), logistic regression (LR) (Cha et al., 2015) and XGBoost (Chen and Guestrin, 2016). RF is a popular ML algorithm used to predict m6A RNA methylation, which was applied in SRAMP (Zhou et al., 2016) to predict mammalian m6A sites. ...
Article
N6-methyladenosine (m6A) is one of the most widely studied epigenetic modifications, which plays an important role in many biological processes, such as splicing, RNA localization, and degradation. Studies have shown that m6A on lncRNA has important functions, including regulating the expression and functions of lncRNA, regulating the synthesis of pre-mRNA, promoting the proliferation of cancer cells, and affecting cell differentiation and many others. Although a number of methods have been proposed to predict m6A RNA methylation sites, most of these methods aimed at general m6A site prediction without noticing the uniqueness of the lncRNA methylation prediction problem. Since many lncRNAs do not have a polyA tail and cannot be captured in the polyA selection step of the most widely adopted RNA-seq library preparation protocol, lncRNA methylation sites cannot be effectively captured and are thus likely to be significantly underrepresented in existing experimental data, affecting the accuracy of existing predictors. In this paper, we propose a new computational framework, LITHOPHONE, which stands for long noncoding RNA methylation sites prediction from sequence characteristics and genomic information with an ensemble predictor. We show that the methylation sites of lncRNA and mRNA have different patterns exhibited in the extracted features and should be differently handled when making predictions. Due to the used experiment protocols, the number of known lncRNA m6A sites is limited and insufficient to train a reliable predictor; thus, the performance can be improved by combining both lncRNA and mRNA data using an ensemble predictor. We show that the newly developed LITHOPHONE approach achieved a reasonably good performance when tested on independent datasets (AUC: 0.966 and 0.835 under full transcript and mature mRNA modes, respectively), marking a substantial improvement compared with existing methods. Additionally, LITHOPHONE was applied to scan the entire human lncRNAome for all possible lncRNA m6A sites, and the results are freely accessible at: http://180.208.58.19/lith/.
... Hence, the used classifier(s) is (are) iteratively trained on the labeled and the pseudo-labeled portions of the input data in various fashions. Among the oldest and most widely known wrapper methods, we distinguish self-training [13,57,66,76], co-training [74,77,80] and boosting [6,38,82]. These methods use a base learner for pseudo-labelling the records and rely on this learner's aptitude to correctly distinguish between the classes and to effectively score the labels. ...
Article
Full-text available
The abundant availability of data in the Big Data era has helped achieve significant advances in the machine learning field. However, many datasets appear with incompleteness from different perspectives such as values, labels, annotations and records. By discarding the records yielding ambiguity, the exploitable data settles down to a small, sometimes ineffective, portion. Making the most of this small portion is burdensome because it usually yields overfitted models. In this paper we propose a new taxonomy for data missingness, in the machine learning context, along with a new metamodel to address the missing data problem within real and open data. Our proposed methodology relies on an H2S Kernel whose ultimate goal is the effective learning of a generalized Bayesian network from small input datasets. Our contributions are motivated by the strong probabilistic foundation of the Bayesian network, on the one hand, and by the effectiveness of ensemble learning, on the other. The highlights of our kernel are the new strategy for multiple Bayesian network structure learning and the novel technique for the weighted fusion of Bayesian network structures. To harness the richness of the merged network in terms of knowledge, we propose four H2S-derived systems to address the impacts of missing values/records, involving annotation, balancing, missing-value imputation and data over-sampling. We combine these systems into a meta-model, and we perform a step-by-step experimental study. The obtained results showcase the efficiency of our contributions in dealing with multi-class problems and with extremely small datasets.
... The above evolutionary SR-based algorithms are not publicly available through online repositories and their performance has been tested on a small number of datasets with mixed results. This led us to focus our comparison in Section 5 on three top, state-of-the-art machine-learning classifiers: XGBoost (Extreme Gradient Boosting) [2], LightGBM (Light Gradient Boosting Machine) [6], and a Deep Neural Network (DNN) with 10 hidden layers of 16 nodes each. ...
Preprint
We present three evolutionary symbolic regression-based classification algorithms for binary and multinomial datasets: GPLearnClf, CartesianClf, and ClaSyCo. Tested over 162 datasets and compared to three state-of-the-art machine learning algorithms -- XGBoost, LightGBM, and a deep neural network -- we find our algorithms to be competitive. Further, we demonstrate how to find the best method for one's dataset automatically, through the use of a state-of-the-art hyperparameter optimizer.
... Extreme Gradient Boosting (XGBoost). XGBoost is a robust algorithm that comes under the category of boosting techniques [87]. It is a tree-based ensemble model in which new learners are added to minimize the errors made by the prior learners. ...
Preprint
Full-text available
Breast cancer is one of the leading causes of death among women across the globe. It is difficult to treat if detected at advanced stages; however, early detection can significantly increase the chances of survival and improve the lives of millions of women. Given the widespread prevalence of breast cancer, it is of utmost importance for the research community to come up with frameworks for early detection, classification and diagnosis. The artificial intelligence research community, in coordination with medical practitioners, is developing such frameworks to automate the task of detection. With the surge in research activities coupled with the availability of large datasets and enhanced computational power, it is expected that AI frameworks will help even more clinicians in making correct predictions. In this article, a novel framework for classification of breast cancer using mammograms is proposed. The proposed framework combines robust features extracted from a novel Convolutional Neural Network (CNN) with handcrafted features including HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern). The results obtained on the CBIS-DDSM dataset exceed the state of the art.
... For the classifiers, we used random forests (RF) (Breiman, 2001), naive Bayes (NB) (Rish, 2001), decision trees (DT) (Quinlan, 1986), and k-nearest neighbours (KNN) (Altman, 1992). For the regressors, we used the least absolute shrinkage and selection operator (LS) (Tibshirani, 1996), ridge regression (RR) (Hoerl & Kennard, 1970), RF, and eXtreme gradient boosting (XGB) (Chen & Guestrin, 2016). The RF models were built with 1000 trees, and a third of the number of attributes was used as the number of splitting variables considered at each node; the defaults were used for everything else. ...
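A sketch of the random forest configuration quoted above: 1000 trees with a third of the attributes considered at each split and defaults elsewhere; the regression data is synthetic and illustrative.

```python
# Random forest regressor with 1000 trees and max_features = 1/3 of the attributes.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=30, random_state=0)
rf = RandomForestRegressor(n_estimators=1000, max_features=1/3, random_state=0)
rf.fit(X, y)
print("R^2 on training data:", rf.score(X, y))
```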
Article
Full-text available
We present an extension to the federated ensemble regression using classification algorithm, an ensemble learning algorithm for regression problems which leverages the distribution of the samples in a learning set to achieve improved performance. We evaluated the extension using four classifiers and four regressors, two discretizers, and 119 responses from a wide variety of datasets in different domains. Additionally, we compared our algorithm to two resampling methods aimed at addressing imbalanced datasets. Our results show that the proposed extension is highly unlikely to perform worse than the base case, and on average outperforms the two resampling methods with significant differences in performance.
... It [43] is an improved version of gradient boosting techniques, crafted especially to enhance the speed and performance of gradient-boosted decision trees. It utilizes a regularized model to handle the over-fitting issue with reduced computational time and optimal consumption of resources. ...
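A hedged sketch of the regularization controls the snippet alludes to, as exposed by the XGBoost library; the values shown are illustrative, not recommended settings.

```python
# Regularization and subsampling knobs used to control over-fitting in XGBoost.
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,    # shrinkage slows down how fast the ensemble fits
    max_depth=5,           # limits per-tree complexity
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # column subsampling per tree
    reg_lambda=1.0,        # L2 penalty on leaf weights
    reg_alpha=0.0,         # L1 penalty on leaf weights
    gamma=0.0,             # minimum loss reduction required to make a split
)
```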
Article
Full-text available
The Internet-of-Things (IoT) has become an attractive attack surface for attackers to launch a multitude of cyber-attacks. The Distributed Denial of Service (DDoS) attack has emerged as the most menacing attack in IoT networks. In this article, we propose an attack detection system to identify anomalous activities in the fog-enabled IoT network. Initially, the authors have exhaustively investigated the performance of filter-based feature selection algorithms, comprising ReliefF, Correlation Feature Selection (CFS), Information Gain (IG), and Minimum-Redundancy-Maximum-Relevancy (mRMR), and classification algorithms of distinct categories upon the prepared dataset consisting of IoT-network-specific features. The performance of the tested classification algorithms is assessed using prominent evaluation measures. Moreover, the response time of the classifiers is calculated for centralized and fog-enabled IoT network infrastructures. The experimental outcomes reveal that, in terms of both accuracy and latency, the J48 classifier outperforms all other tested classifiers with the mRMR feature selection algorithm.
... We used logistic regression as a baseline, along with ensemble methods such as Random Forest and XG-Boost. The main difference between these two algorithms is that Random Forest builds several trees independently and combines them at the end of the process with bagging [11] while XG-Boost builds one tree at a time, with the new trees aiming at reducing the error committed by previous ones thanks to gradient boosting [12]. ...
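A minimal contrast of the two ensembling styles the snippet describes: a random forest grows its trees independently and averages them (bagging), while XGBoost adds trees sequentially, each correcting the current ensemble (boosting). Data and settings below are synthetic and illustrative.

```python
# Bagging (random forest) versus boosting (XGBoost) on the same synthetic task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)
models = [
    ("bagging (Random Forest)", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("boosting (XGBoost)", XGBClassifier(n_estimators=300, learning_rate=0.1)),
]
for name, model in models:
    score = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print(name, round(score, 3))
```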
Article
Full-text available
This project is part of a Kaggle competition. We reached 94% accuracy and 4th position on the leaderboard, which is accessible here. In this data challenge, we investigate how to determine whether summaries of news articles were written by humans or generated by machines. The issue was to extract relevant features to capture the stylistic differences between these two kinds of summaries. Performing a quick data exploration allowed us to conclude that we needed two types of features to perform this task: features on the style of summaries and features comparing the style of summaries with that of their associated documents. Leveraging pretrained language models such as BERT or GPT2, as well as statistics on the n-grams of the summaries and documents and metrics such as ROUGE scores, we managed to design such insightful features. We finally used these features to fit an XGBoost classifier, whose hyper-parameters were specifically optimized for this task. This paper is provided with a git repository accessible here.
... After feature selection, the selected features were sent into the prediction model. First, we used individual classifiers independently as prediction models, including RF (Breiman, 2001), XGBoost (Chen and Guestrin, 2016), logistic regression (LR) (Kleinbaum and Klein, 2010), and support vector machine (SVM) (Cortes and Vapnik, 1995). After that, we used two ensemble learning strategies, hard voting and soft voting, to combine the advantages of individual classifiers. ...
Article
Objectives: We aimed to identify whether ensemble learning can improve the performance of the epidermal growth factor receptor (EGFR) mutation status predicting model. Methods: We retrospectively collected 168 patients with non–small cell lung cancer (NSCLC), who underwent both computed tomography (CT) examination and EGFR test. Using the radiomics features extracted from the CT images, an ensemble model was established with four individual classifiers: logistic regression (LR), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). The synthetic minority oversampling technique (SMOTE) was also used to decrease the influence of data imbalance. The performances of the predicting model were evaluated using the area under the curve (AUC). Results: Based on the 26 radiomics features after feature selection, the SVM performed best (AUCs of 0.8634 and 0.7885 on the training and test sets, respectively) among four individual classifiers. The ensemble model of RF, XGBoost, and LR achieved the best performance (AUCs of 0.8465 and 0.8654 on the training and test sets, respectively). Conclusion: Ensemble learning can improve the model performance in predicting the EGFR mutation status of patients with NSCLC, showing potential value in clinical practice.
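A hedged sketch of a soft-voting ensemble over the four individual classifiers named in the abstract (LR, SVM, RF, XGBoost); the radiomics features of the cited study are not reproduced, and all data and parameters are placeholders.

```python
# Soft-voting ensemble of LR, SVM, RF, and XGBoost on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = make_classification(n_samples=400, n_features=26, random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),   # probabilities are required for soft voting
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=200)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```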
Article
With developing technology, cyber security has become one of the most common and pressing issues in recent years. Spam URLs are among the most common and dangerous cybersecurity threats and are widely used to defraud users. These attacks cause users to suffer monetary losses, steal private information, and install malicious software on their devices. It is very important to detect such threats promptly and to take precautions against them. Detection of malicious URLs is mostly done by using blacklists; however, these lists are insufficient to detect newly created URLs. In recent years, machine learning techniques have been developed to overcome this deficiency. In this study, URL classification was performed using different machine learning techniques, with 9 different classifiers evaluated and their performances compared in the URL classification process. In addition, similar studies in the literature have been comprehensively examined and discussed. Furthermore, since the preparation of data sets in the natural language processing pipeline has a great effect on the training of models, these steps are discussed in detail.
Chapter
Association football is one of the most popular sports in the world, having a large fan base that draws media and entertainment platforms to it. The three major difficulties in developing highly accurate match result prediction models, especially in predicting match outcomes, are data availability and quality [1], model assumptions [2], and testing various models and parameters [3]. The primary goal of the study is to identify the best model for predicting football match outcomes. The football dataset was obtained from the top five European leagues. Exploratory data analysis was conducted to better understand the dataset. Models used in predicting the football match outcomes include Logistic Regression, Artificial Neural Networks, and XGBoost. The predictive performance of the three classification models was compared in terms of accuracy, precision and recall. The results showed that the Artificial Neural Networks achieved the highest accuracy of 0.6788, followed by Logistic Regression (0.668) and XGBoost (0.654). These results are hoped to serve as benchmark results for future experiments in the area of football match classification.
Thesis
Full-text available
A wide variety of antiretroviral drugs has been developed in the last decades to help patients fight AIDS, with most current therapies comprising several antiretroviral drugs combined. Those therapies actively suppress viral replication by targeting different stages of the viral life cycle simultaneously. However, the evolutionary escape dynamics of HIV complicate the task, with drug resistance subsequently appearing. Nowadays, finding an effective treatment for each patient remains challenging: therapies need to be personalized due to the accumulation of mutations conferring drug resistance, complex interactions between antiretrovirals, uncertain therapy adherence and various side effects. Manual consideration of all these factors by doctors is infeasible; hence precision medicine steps in, aiming to assist clinical decisions based on statistical models. However, available clinical HIV datasets are sparse and unbalanced, with limited observations available for most drug combinations due to the diversity of antiretrovirals and the rapid pace at which mutations appear. With continuous improvements in machine and deep learning techniques, scientists develop models with potential applications to support the medical care of HIV-infected patients, helped by the emergence of large cohort studies and extensive databases. Here we propose approaches to predict the chance of a therapy switch being successful, to design optimal treatments, and to forecast the occurrence of new mutational events. Our model to predict treatment response addresses the challenge of treatment diversity and demonstrates promising performance with limited features. Thus, with the aid of additional indicators, we believe in its potential to significantly improve the interpretation of genotypic drug resistance tests and the clinical decision to prescribe suitable treatments.
Article
Intensive care unit (ICU) patients with venous thromboembolism (VTE) and/or cancer suffer from high mortality rates. Mortality prediction in the ICU has been a major medical challenge for which several scoring systems exist but lack in specificity. This study focuses on two target groups, namely patients with thrombosis or cancer. The main goal is to develop and validate interpretable machine learning (ML) models to predict early and late mortality, while exploiting all available data stored in the medical record. To this end, retrospective data from two freely accessible databases, MIMIC-III and eICU, were used. Well-established ML algorithms were implemented utilizing automated and purposely built ML frameworks for addressing class imbalance. Prediction of early mortality showed excellent performance in both disease categories, in terms of the area under the receiver operating characteristic curve (AUC−ROC): VTE-MIMIC-III 0.93, eICU 0.87, cancer-MIMIC-III 0.94. On the other hand, late mortality prediction showed lower performance, i.e., AUC−ROC: VTE 0.82, cancer 0.74–0.88. The predictive model of early mortality developed from 1651 VTE patients (MIMIC-III) ended up with a signature of 35 features and was externally validated in 2659 patients from the eICU dataset. Our model outperformed traditional scoring systems in predicting early as well as late mortality. Novel biomarkers, such as red cell distribution width, were identified.
Chapter
The deep convolutional neural network has been extensively applied for clinical computer-aided diagnosis. In this study, we combine deep learning feature extraction and an eXtreme gradient boosting (XGBoost) classifier for predicting the risk of esophageal varices (EV). First, the quantitative deep learning features and radiomics features of the regions of interest, which include the spleen, liver and esophagus, are extracted and concatenated. Then, XGBoost and the Least Absolute Shrinkage and Selection Operator (LASSO) are applied for optimal predictive feature selection and prediction of EV risk. XGBoost is used to assess the significance of the extracted features and LASSO is used to select the distinctive features. Finally, random forest, XGBoost and support vector machine classification methods are applied for predicting low-risk and high-risk esophageal varices. We collected computed tomography images of cirrhotic patients in two hospitals as the independent training and validation sets. Experimental results show that the features of the esophagus are more distinctive than those of other organs. Moreover, the combination of deep learning and radiomics features based on the XGBoost algorithm outperforms existing approaches in predicting the severity of EV disease.
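An illustrative sketch of the two-stage selection described above: rank features by XGBoost importance, then refine the subset with LASSO. The deep-learning and radiomics features of the cited work are not reproduced; everything here is synthetic.

```python
# Feature ranking with XGBoost importance followed by LASSO refinement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

xgb_clf = XGBClassifier(n_estimators=200).fit(X, y)
top = np.argsort(xgb_clf.feature_importances_)[-30:]   # 30 most important features
lasso = LassoCV(cv=5).fit(X[:, top], y)                # sparse refinement on that subset
selected = top[np.abs(lasso.coef_) > 1e-6]
print("selected feature indices:", selected)
```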
Article
Full-text available
The educational data mining research attempts have contributed to developing policies to improve student learning at different levels of educational institutions. One of the common challenges to building accurate classification and prediction systems is the imbalanced distribution of classes in the collected data. This study investigates data-level techniques and algorithm-level techniques. Six classifiers from each technique are used to explore their effectiveness in handling the imbalanced data problem while predicting students' graduation grades based on their performance at the first stage. The classifiers are tested using the k-fold cross-validation approach before and after applying the data-level and algorithm-level techniques. For the purpose of evaluation, various evaluation metrics have been used, such as accuracy, precision, recall, and f1-score. The results showed that the classifiers do not perform well with the imbalanced dataset, and the performance could be improved by using these techniques. As for the level of improvement, it varies from one technique to another. Additionally, the results of the statistical hypothesis testing confirmed that there were no statistically significant differences between classifiers of the two techniques.
Chapter
Foliage environment target detection has been an extremely difficult problem to solve. In this paper, we propose a machine learning approach for sense-through-foliage target detection. Detection of targets can be achieved with an accuracy of 93.7% with our XGBoost-based technology on a single received Ultra-Wide Band (UWB) radar waveform. This excellent result is achieved with very few computational resources, making it a lucrative application in the target field.
Article
Terrorism is a major problem worldwide, causing thousands of fatalities and billions of dollars in damage every year. To address this threat, we propose a novel feature representation method and evaluate machine learning models that learn from localized news data in order to predict whether a terrorist attack will occur on a given calendar date and in a given state. The best model (a Random Forest aided by a novel variable-length moving average method) achieved area under the receiver operating characteristic (AUROC) of ≥ 0.667 (statistically significant w.r.t. random guessing with p ≤ .0001) on four of the five states that were impacted most by terrorism between 2015 and 2018. These results demonstrate that treating terrorism as a set of independent events, rather than as a continuous process, is a fruitful approach—especially when historical events are sparse and dissimilar—and that large-scale news data contains information that is useful for terrorism prediction. Our analysis also suggests that predictive models should be localized (i.e., state models should be independently designed, trained, and evaluated) and that the characteristics of individual attacks (e.g., responsible group or weapon type) were not correlated with prediction success. These contributions provide a foundation for the use of machine learning in efforts against terrorism in the United States and beyond.
Article
Full-text available
One of the main applications of machine learning (ML) in remote sensing (RS) is the pixel-level classification of satellite images into land cover types. Although classes with different spectral signatures can be easily separated, e.g. aquatic and terrestrial land cover types, others have similar spectral signatures and are hard to separate using only the information within a single pixel. This work focused on the separation of two cover types with similar spectral signatures, cocoa agroforest and forest, over an area in Pará, Brazil. For this, we study the training and application of several ML algorithms on datasets obtained from a single composite image, a time-series (TS) composite obtained from the same location and by preprocessing the TS composite using simple TS preprocessing techniques. As expected, when ML algorithms are applied to a dataset obtained from a composite image, the median producer’s accuracy (PA) and user’s accuracy (UA) in those two classes are significantly lower than the median overall accuracy (OA) for all classes. The second dataset allows the ML models to learn the evolution of the spectral signatures over 5 months. Compared to the first dataset, the results indicate that ML models generalize better using TS data, even if the series are short and without any preprocessing. This generalization is further improved in the last dataset. The ML models are subsequently applied to an area with different geographical bounds. These last results indicate that, out of seven classifiers, the popular random forest (RF) algorithm ranked fourth, while XGBoost (XGB) obtained the best results. The best OA, as well as the best PA/UA balance, were obtained by performing feature construction using the M3GP algorithm and then applying XGB to the new extended dataset.
Article
Given the increased concern about racial disparities in stop-and-frisk programs, the New York Police Department (NYPD) requires publicly displaying detailed data for all the stops conducted by police authorities, including the suspected offense and race of the suspects. By adopting a public data transparency policy, it becomes possible to investigate racial biases in stop-and-frisk data and demonstrate the benefit of data transparency to approve or disapprove social beliefs and police practices. Thus, data transparency becomes a crucial need in the era of Artificial Intelligence (AI), where police and justice increasingly use different AI techniques not only to understand police practices but also to predict recidivism, crimes, and terrorism. In this study, we develop a predictive analytics method, including bias metrics and bias mitigation techniques, to analyze the NYPD Stop-and-Frisk datasets and discover whether underlying bias patterns are responsible for stops and arrests. In addition, we perform a fairness analysis on two protected attributes, namely race and gender, and investigate their impacts on arrest decisions. We also apply bias mitigation techniques. The experimental results show that the NYPD Stop-and-Frisk dataset is not biased toward colored and Hispanic individuals, and thus law enforcement authorities can apply the bias predictive analytics method to make fairer decisions before making any arrests.
Article
Full-text available
The fast, exponential increase of COVID-19 infections and their catastrophic effects on patients' health have required the development of tools that support health systems in the quick and efficient diagnosis and prognosis of this disease. In this context, the present study aims to identify the potential factors associated with COVID-19 infections, applying machine learning techniques, particularly random forest, chi-squared, XGBoost, and rpart for feature selection; ROSE and SMOTE were used as resampling methods due to the existence of class imbalance. Similarly, machine and deep learning algorithms such as support vector machines, C4.5, random forest, rpart, and deep neural networks were explored during the train/test phase to select the best prediction model. The dataset used in this study contains clinical data, anthropometric measurements, and other health parameters related to smoking habits, alcohol consumption, quality of sleep, physical activity, and health status during confinement due to the pandemic associated with COVID-19. The results showed that the XGBoost model identified the best features associated with COVID-19 infection, and random forest yielded the best predictive model, with a balanced accuracy of 90.41% using SMOTE as a resampling technique. The model with the best performance provides a tool to help prevent contracting SARS-CoV-2, since the variables with the highest risk factors are detected, and some of them are, to a certain extent, controllable.
Article
Real estate is one of the major sectors of the Armenian economy and has been developing dynamically since Armenia's transition from a planned to a market economy in the early 1990s. More recently, large online platforms have been developed in Armenia to advertise real estate offerings, thus reducing information asymmetry and increasing liquidity in both sales and rental markets. Simultaneously, granular geospatial data became increasingly affordable via platforms such as OpenStreetMap, Google Maps and Yandex Maps. With granular data concerning a representative portion of the real estate offering available online, it is increasingly tenable to monitor the real estate market in real time and develop analytical tools that can automatically and accurately estimate the value of real estate assets based on their internal and external features. This paper sets out to analyze the Armenian real estate market and assess the performance of a special class of machine learning models in predicting the price of a square meter of apartments in Yerevan. Furthermore, we present a way to determine the most decisive factors influencing the price of apartments on sale.
Article
In the study of human mobility, gait analysis is a well-recognized assessment methodology. Despite its widespread use, doubts exist about its clinical utility, i.e., its potential to influence the diagnostic-therapeutic practice. Gait analysis evaluates the walking pattern (normal/abnormal) based on the gait cycle. Based on the analysis obtained, various applications can be developed in the medical, security, sports, and fitness domain to improve overall outcomes. Wearable sensors provide a convenient, efficient, and low-cost approach to gather data, while machine learning methods provide high accuracy gait feature extraction for analysis. The problem is to identify gait abnormalities and if present, subsequently identify the locations of impairments that lead to the change in gait pattern of the individual. Proper physiotherapy treatment can be provided once the location/landmark of the impairment is known correctly. In this paper, classification of multiple anatomical regions and their combination on a large scale highly imbalanced dataset is carried out. We focus on identifying 27 different locations of injury and formulate it as a multi-class classification approach. The advantage of this method is the convenience and simplicity as compared to previous methods. In our work, a benchmark is set to identify the gait disorders caused by accidental impairments at multiple anatomical regions using the GaitRec dataset. In our work, machine learning models are trained and tested on the GaitRec dataset, which provides Ground Reaction Force (GRF) data, to analyze an individual’s gait and further classify the gait abnormality (if present) at the specific lower-region portion of the body. The design and implementation of machine learning models are carried out to detect and classify the gait patterns between healthy controls and gait disorders. Finally, the efficacy of the proposed approach is showcased using various qualitative accuracy metrics. The achieved test accuracy is 96% and an F1 score of 95% is obtained in classifying various gait disorders on unseen test samples. The paper concludes by stating how machine learning models can help to detect gait abnormalities along with directions of future work.
Article
Agent-based models return spatiotemporal information used to process time series of specific parameters for specific individuals called “agents”. For complex, advanced and detailed models, this typically comes at the expense of high computing times and requires access to substantial computing resources. This paper provides an example of how machine learning and artificial intelligence can help predict an agent-based model’s output values at regular intervals without having to rely on time-consuming numerical calculations. The gradient-boosting XGBoost package for GNU R was used in the social-ecological agent-based model 3MTSim to interpolate, in the time domain, sound pressure levels received at the agents’ positions that were occupied by the endangered St. Lawrence Estuary and Saguenay Fjord belugas and caused by anthropogenic noise of nearby transiting merchant vessels. A mean error of 3.23 ± 3.76 (1σ) dB on received sound pressure levels was obtained when compared to ground truth values that were processed using rigorous, although time-consuming, numerical algorithms. The computing time gain was significant, estimated at 10-fold relative to the ground truth simulation, whilst maintaining the original temporal resolution.
Chapter
With the development of 5G, big data and other new technologies, the Internet of Vehicles is widely recognized as the future direction of the industry. Operators’ data traffic ensures the network connectivity of vehicles, but how to recommend data products to the appropriate users has long been a research topic for operators. This paper implements a recommendation system based on operators’ big data. The system mines users’ behavioral characteristics and uses machine learning methods to predict the likelihood that a user will accept the recommended products. To verify the effect of the model’s recommendations, this paper compares the number of orders and the traffic growth rate before and after deploying the model; the results show that the payment rate of users served by the model is significantly higher than that of users who are not.
Article
Full-text available
Despite the availability of chromatin conformation capture experiments, discerning the relationship between the 1D genome and 3D conformation remains a challenge, which limits our understanding of their effect on gene expression and disease. We propose Hi-C-LSTM, a method that produces low-dimensional latent representations that summarize intra-chromosomal Hi-C contacts via a recurrent long short-term memory neural network model. We find that these representations contain all the information needed to recreate the observed Hi-C matrix with high accuracy, outperforming existing methods. These representations enable the identification of a variety of conformation-defining genomic elements, including nuclear compartments and conformation-related transcription factors. They furthermore enable in-silico perturbation experiments that measure the influence of cis-regulatory elements on conformation.
Article
A well-maintained road network is a crucial factor for sustainable urban development. Over the past few years, researchers have proposed smartphone-based crowdsourced applications as a low-cost, effective solution to acquire frequent road surface quality updates. One of the main limitations faced by these applications is that the collected values vary significantly with the conditions under which the road data were collected. This study is an attempt to develop a road roughness monitoring platform using passenger cars that can produce accurate results while reducing the effect of conditions such as the car type, the smartphone model, or its placement. The developed system offers several features, including automatic journey detection, the freedom to use any smartphone in any position with or without an active internet connection when collecting data, the convergence of values collected from different sources, and their visualization on a virtual map. A set of field tests was carried out to evaluate the proposed system with respect to road condition, passenger car type, smartphone model, and smartphone placement inside the vehicle. The results show that the proposed solution is effective in predicting accurate values after reducing the effect of these varying factors.
Article
Full-text available
LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them.
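For readers who want to experiment with a LambdaMART-style model in the context of this paper's system, XGBoost exposes ranking objectives whose gradients follow the LambdaMART approach. The sketch below uses synthetic query-grouped data; the hyperparameters are illustrative, and older xgboost versions take group= instead of qid=.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_feat = 100, 10, 20
X = rng.normal(size=(n_queries * docs_per_query, n_feat))
y = rng.integers(0, 5, size=n_queries * docs_per_query)      # relevance grades 0-4
qid = np.repeat(np.arange(n_queries), docs_per_query)        # query id per document

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=200,
                       max_depth=6, learning_rate=0.1)
ranker.fit(X, y, qid=qid)
scores = ranker.predict(X)   # higher score means ranked earlier within its query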
Article
Full-text available
Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.
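The statistical view summarized above can be stated compactly. In standard notation with labels y in {-1, +1} (the notation is an assumption for illustration, not quoted from the paper), the boosted classifier is an additive expansion whose exponential-loss minimizer is half the log-odds, so boosting effectively fits an additive logistic regression model:

\[
F(x) \;=\; \sum_{m=1}^{M} f_m(x),
\qquad
F^{*}(x) \;=\; \arg\min_{F}\, \mathbb{E}\!\left[e^{-yF(x)} \,\middle|\, x\right]
\;=\; \tfrac{1}{2}\log\frac{P(y=1\mid x)}{P(y=-1\mid x)},
\]
\[
P(y=1\mid x) \;=\; \frac{1}{1+e^{-2F^{*}(x)}}.
\]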
Article
Full-text available
Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point evaluated at the current step. It is shown that both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure. Specifically, at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used in place of the full sample to fit the base learner and compute the model update for the current iteration. This randomized approach also increases robustness against overcapacity of the base learner.
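As a concrete illustration of the randomized procedure described above, here is a minimal stochastic gradient boosting loop for squared-error loss, written around scikit-learn's regression trees. It is a sketch of the idea under arbitrary hyperparameter choices, not a faithful reimplementation of the paper's algorithm.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_boost(X, y, n_rounds=100, subsample=0.5,
                              learning_rate=0.1, max_depth=3, seed=0):
    """Stochastic gradient boosting for squared error; X, y are numpy arrays."""
    rng = np.random.default_rng(seed)
    F = np.full(len(y), y.mean())              # initial constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                      # pseudo-residuals for squared error
        # Draw a random subsample WITHOUT replacement for this iteration only.
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X[idx], residuals[idx])       # fit the base learner on the subsample
        F += learning_rate * tree.predict(X)   # update the model on all points
        trees.append(tree)
    return y.mean(), trees

def boosted_predict(init, trees, X, learning_rate=0.1):
    return init + learning_rate * sum(t.predict(X) for t in trees)

# Usage sketch:
#   init, trees = stochastic_gradient_boost(X_train, y_train)
#   y_hat = boosted_predict(init, trees, X_test)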
Conference Paper
Full-text available
We cast the ranking problem as (1) multiple classification (“Mc”) and (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regression-based ranker [6] in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.
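For reference, the two quantities named in the abstract can be written as follows; the notation is assumed for illustration and is not quoted from the paper.

\[
\mathrm{DCG}@p \;=\; \sum_{i=1}^{p} \frac{2^{r_i}-1}{\log_2(i+1)},
\qquad
s(x) \;=\; \sum_{k} g(k)\,\hat{P}(y=k \mid x),
\]
where \(r_i\) is the relevance grade of the document at rank \(i\), \(g(k)\) maps relevance class \(k\) to a numeric grade, and \(\hat{P}\) is the class probability produced by the gradient boosting tree classifier.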
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
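A convenient way to try LIBLINEAR without the command-line tools is through scikit-learn, whose liblinear solver and LinearSVC estimator are built on this library. The data below is synthetic and only meant to show the calls.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC   # also implemented on top of LIBLINEAR

X, y = make_classification(n_samples=10_000, n_features=200, random_state=0)
logreg = LogisticRegression(solver="liblinear", C=1.0).fit(X, y)
svm = LinearSVC(C=1.0).fit(X, y)
print(logreg.score(X, y), svm.score(X, y))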
Conference Paper
Full-text available
Gradient Boosted Regression Trees (GBRT) are the current state-of-the-art learning paradigm for machine learned web-search ranking - a domain notorious for very large data sets. In this paper, we propose a novel method for parallelizing the training of GBRT. Our technique parallelizes the construction of the individual regression trees and operates using the master-worker paradigm as follows. The data are partitioned among the workers. At each iteration, each worker summarizes its data partition using histograms. The master processor uses these to build one layer of a regression tree, and then sends this layer to the workers, allowing the workers to build histograms for the next layer. Our algorithm carefully orchestrates overlap between communication and computation to achieve good performance. Since this approach is based on data partitioning, and requires a small amount of communication, it generalizes to distributed and shared memory machines, as well as clouds. We present experimental results on both shared memory machines and clusters for two large scale web search ranking data sets. We demonstrate that the loss in accuracy induced by the histogram approximation in the regression tree creation can be compensated for through slightly deeper trees. As a result, we see no significant loss in accuracy on the Yahoo data sets and a very small reduction in accuracy for the Microsoft LETOR data. In addition, on shared memory machines, we obtain almost perfect linear speed-up with up to about 48 cores on the large data sets. On distributed memory machines, we get a speedup of 25 with 32 processors. Due to data partitioning, our approach can scale to even larger data sets, on which one can reasonably expect even higher speedups.
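The core histogram trick described above can be sketched in a few lines: bin a feature, accumulate residual statistics per bin, and score candidate splits from the compact histogram rather than the raw data. This is an illustrative single-feature, single-machine sketch using a squared-error gain, not the paper's distributed implementation.

import numpy as np

def histogram_best_split(x, residuals, n_bins=32):
    """Score split thresholds for one feature using binned residual sums."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])   # bin boundaries
    bins = np.searchsorted(edges, x)                              # bin index per row
    sum_r = np.bincount(bins, weights=residuals, minlength=n_bins)
    cnt = np.bincount(bins, minlength=n_bins).astype(float)

    best_gain, best_edge = -np.inf, None
    total_r, total_c = sum_r.sum(), cnt.sum()
    left_r = left_c = 0.0
    for b in range(n_bins - 1):               # candidate split after each bin
        left_r += sum_r[b]
        left_c += cnt[b]
        right_r, right_c = total_r - left_r, total_c - left_c
        if left_c == 0 or right_c == 0:
            continue
        # Squared-error reduction when each side predicts its mean residual.
        gain = left_r**2 / left_c + right_r**2 / right_c - total_r**2 / total_c
        if gain > best_gain:
            best_gain, best_edge = gain, edges[b]
    return best_gain, best_edge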
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
Full-text available
Learning a function of many arguments is viewed from the perspective of high-dimensional numerical quadrature. It is shown that many of the popular ensemble learning procedures can be cast in this framework. In particular randomized methods, including bagging and random forests, are seen to correspond to random Monte Carlo integration methods each based on particular importance sampling strategies. Non-random boosting methods are seen to correspond to deterministic quasi Monte Carlo integration techniques. This view helps explain some of their properties and suggests modifications to them that can substantially improve their accuracy while dramatically improving computational performance.
Article
Full-text available
An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN. We present a new online algorithm for computing ε-approximate quantile summaries of very large data sequences. The algorithm has a worst-case space requirement of O((1/ε) log(εN)). This improves upon the previous best result of O((1/ε) log²(εN)). Moreover, in contrast to earlier deterministic algorithms, our algorithm does not require a priori knowledge of the length of the input sequence. Finally, the actual space bounds obtained on experimental data are significantly better than the worst-case guarantees of our algorithm as well as the observed space requirements of earlier algorithms.
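To make the guarantee concrete, the toy sketch below builds an offline ε-approximate summary by keeping roughly every εN-th order statistic, so any quantile query is answered by an element whose rank is within εN of the requested one. This only illustrates the guarantee; it is not the online algorithm proposed in the paper.

import numpy as np

def build_summary(values, eps=0.01):
    """Offline epsilon-approximate summary: keep every ~eps*N-th order statistic."""
    v = np.sort(np.asarray(values))
    step = max(1, int(eps * len(v)))
    idx = np.arange(0, len(v), step)
    return v[idx], idx                      # stored values and their true ranks

def query_quantile(summary_values, summary_ranks, q, n):
    """Return a stored element whose rank is within eps*n of rank q*(n-1)."""
    target = q * (n - 1)
    j = np.argmin(np.abs(summary_ranks - target))
    return summary_values[j]

data = np.random.default_rng(0).normal(size=100_000)
values, ranks = build_summary(data, eps=0.01)
print(query_quantile(values, ranks, 0.5, len(data)), np.quantile(data, 0.5))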
Article
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and of Friedman, Hastie and Tibshirani are discussed.
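The procedure summarized above can be written compactly in conventional gradient boosting notation (the notation is standard, not quoted from the paper): pseudo-residuals are the negative gradient of the loss at the current model, a base learner is fit to them, and the model is updated additively after a line search.

\[
r_{im} \;=\; -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
\rho_m \;=\; \arg\min_{\rho}\sum_i L\big(y_i,\, F_{m-1}(x_i) + \rho\, h_m(x_i)\big),
\]
\[
F_m(x) \;=\; F_{m-1}(x) + \rho_m\, h_m(x),
\]
where \(h_m\) is the base learner (here a regression tree) fit by least squares to the pairs \((x_i, r_{im})\).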
Article
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
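A minimal sketch of MLlib's high-level pipeline API for gradient-boosted trees, assuming a working Spark installation; the input path and column names are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("gbt-sketch").getOrCreate()
df = spark.read.parquet("hdfs:///data/training.parquet")   # hypothetical path

assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"], outputCol="features")
gbt = GBTClassifier(labelCol="label", featuresCol="features",
                    maxIter=100, maxDepth=5)
model = Pipeline(stages=[assembler, gbt]).fit(df)
predictions = model.transform(df)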
Article
Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most online advertising systems. With over 750 million daily active users and over 1 million active advertisers, predicting clicks on Facebook ads is a challenging machine learning task. In this paper we introduce a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact to the overall system performance. We then explore how a number of fundamental parameters impact the final prediction performance of our system. Not surprisingly, the most important thing is to have the right features: those capturing historical information about the user or ad dominate other types of features. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale). Picking the optimal handling for data freshness, learning rate schema and data sampling improve the model slightly, though much less than adding a high-value feature, or picking the right model to begin with.
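A sketch of the "trees as feature transforms" idea described above: encode each example by the leaf it falls into in every boosted tree, one-hot those leaf indices, and train a logistic regression on top. This is a scikit-learn stand-in on synthetic data, not the paper's production system.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)
# Use disjoint halves for the tree model and the linear model to avoid leakage.
X_tree, X_lr, y_tree, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X_tree, y_tree)
leaves_lr = gbdt.apply(X_lr)[:, :, 0]          # leaf index of each sample in each tree

enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_lr)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves_lr), y_lr)
print(lr.predict_proba(enc.transform(leaves_lr))[:3, 1])   # click-probability style output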
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
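The "internal estimates" mentioned in this abstract map directly onto features of common implementations. As a small illustration (a scikit-learn stand-in on synthetic data, not Breiman's original code), the out-of-bag score monitors generalization error and the impurity-based importances give a variable-importance measure.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, n_features=25, random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            bootstrap=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)                     # internal error estimate
print("top features:", rf.feature_importances_.argsort()[::-1][:5])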
Article
We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees such as Adaboost for exponential loss and Friedman's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting on many of the datasets we have tested on.
Article
This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., recommender systems and object recognition in vision). The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Press edited book which is currently in production. Visit the tutorial website at http://hunch.net/~large_scale_survey/
Conference Paper
Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most implementations today are computationally expensive and require all training data to be in main memory. As training data becomes ever larger, there is motivation to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive and various approaches have been explored in the existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two different distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation and the second utilizes MPI on the Hadoop grid environment.
Article
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
Conference Paper
We present a fast algorithm for computing approximate quantiles in high speed data streams with deterministic error bounds. For data streams of size N where N is unknown in advance, our algorithm partitions the stream into sub-streams of exponentially increasing size as they arrive. For each sub-stream, which has a fixed size, we compute and maintain a multi-level summary structure using a novel algorithm. In order to achieve high speed performance, the algorithm uses simple block-wise merge and sample operations. Overall, our algorithms for fixed-size streams and arbitrary-size streams have a computational cost of O(N log((1/ε) log(εN))) and an average per-element update cost of O(log log N) if ε is fixed.
Article
Function approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient-descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least-absolute-deviation, and Huber-M loss functions for regression, and multi-class logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are decision trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of decision trees produces competitive, highly robust, interpretable procedures for regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire (1996), and of Friedman, Hastie and Tibshirani, are discussed.
R. Bekkerman. The present and the future of the KDD Cup competition: an outsider's perspective.
T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436–444, 2013.
T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS'15), volume 1, 2015.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.