Article

Machine Learning, Volume 45, Number 1

Authors: Leo Breiman

Abstract

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
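The mechanisms described in this abstract map directly onto modern library implementations. Below is a minimal, hedged sketch using scikit-learn (not the author's original code): max_features controls the random selection of features tried at each split, and the out-of-bag (OOB) score stands in for the internal error estimates mentioned above; the dataset is a synthetic placeholder.

```python
# Sketch only: scikit-learn's RandomForestClassifier as a stand-in for the
# forest described in the abstract (not the original implementation).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # generalization error converges as trees are added
    max_features="sqrt",   # random subset of features tried at each split
    oob_score=True,        # internal (out-of-bag) estimate of the error
    random_state=0,
)
forest.fit(X, y)

print("OOB accuracy (internal error estimate):", forest.oob_score_)
print("Impurity-based variable importances:", forest.feature_importances_[:5])
```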


... ML has a plethora of classification algorithms, including RF, LR, Support Vector Machine (SVM), naive Bayes classifier, decision trees, and many more. In this study, we used the RF method (Breiman 2001), a supervised, ensemble learning, decision-tree-based algorithm for classification and regression. RF is one of the most popular classifier ML algorithms. ...
Preprint
Full-text available
The Fermi fourth catalog of active galactic nuclei (AGNs) data release 3 (4LAC-DR3) contains 3407 AGNs, out of which 755 are flat spectrum radio quasars (FSRQs), 1379 are BL Lacertae objects (BL Lacs), 1208 are blazars of unknown type (BCUs), while 65 are non-AGNs. Accurate categorization of many unassociated blazars still remains a challenge due to the lack of sufficient optical spectral information. The aim of this work is to use high-precision, optimized machine learning (ML) algorithms to classify BCUs into BL Lacs and FSRQs. To address this, we selected the 4LAC-DR3 Clean sample (i.e., sources with no analysis flags) containing 1115 BCUs. We employ five different supervised ML algorithms, namely random forest, logistic regression, XGBoost, CatBoost, and neural network, with seven features: photon index, synchrotron-peak frequency, pivot energy, photon index at pivot energy, fractional variability, $\nu F\nu$ at synchrotron-peak frequency, and variability index. Combining results from all models leads to better accuracy and more robust predictions. These five methods together classified 610 BCUs as BL Lacs and 333 BCUs as FSRQs with a classification metric (area under the curve) > 0.96. Our results are also consistent with recent studies. The output from this study provides a larger blazar sample with many new targets that could be used for forthcoming multi-wavelength surveys. This work can be further extended by adding features in X-rays, UV, visible, and radio wavelengths.
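As a rough illustration of the "combining results from all models" step, the sketch below averages predicted class probabilities from two of the named model families and scores the blend with ROC AUC; the feature matrix and labels are placeholders rather than the 4LAC-DR3 data, and the exact ensembling rule used in the paper is not reproduced here.

```python
# Hedged sketch: average the class probabilities of several classifiers and
# evaluate with ROC AUC, loosely mirroring the multi-model blazar classification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=7, random_state=1)  # 7 features, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = [
    RandomForestClassifier(n_estimators=300, random_state=1),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
]
probas = []
for m in models:
    m.fit(X_tr, y_tr)
    probas.append(m.predict_proba(X_te)[:, 1])

blend = np.mean(probas, axis=0)          # simple soft-vote combination
print("ROC AUC of the blended prediction:", roc_auc_score(y_te, blend))
```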
... The binary UPSIT ® item response data were used to train and test popular machine learning algorithms available in MATLAB ® version 2020a (MATLAB ® , 2020), including logistic regression (Grimm & Yarnold, 1995), artificial neural networks (using ten hidden elements) (Haykin, 1998), decision trees (Breiman et al., 1984), k-nearest neighbor (kNN, k = 3 with city block distance metric), and the ensemble learning methods of random forests (Breiman, 2001), AdaBoost (short for adaptive boosting) (Freund & Schapire, 1999), and support vector machines (SVM) (Hearst et al., 1998). Parameter sweep analysis for number of hidden elements in neural networks and the number of nearest neighbors in the kNN method are provided in the supplementary Figs. ...
Article
Although there are numerous brief odor identification tests available for quantifying the ability to smell, none are available in multiple parallel forms that can be longitudinally administered without potential confounding from knowledge of prior test items. Moreover, empirical algorithms for establishing optimal test lengths have not been generally applied. In this study, we employed and compared eight machine learning algorithms to develop a set of four brief parallel smell tests employing items from the University of Pennsylvania Smell Identification Test that optimally differentiated 100 COVID-19 patients from 132 healthy controls. Among the algorithms, linear discriminant analysis (LDA) achieved the best overall performance. The minimum number of odorant test items needed to differentiate smell loss accurately was identified as eight. We validated the sensitivity of the four developed tests, whose means and variances did not differ from one another (Bradley–Blackwood test), by sequentially testing an independent group of 32 subjects that included persons with smell dysfunction not due to COVID-19. These eight-item tests clearly differentiated the olfactory-compromised subjects from normosmics, with areas under the ROC curve ranging from 0.79 to 0.83. Each test was correlated with the overall UPSIT scores from which they were derived. These brief smell tests can be used separately or sequentially over multiple days in a variety of contexts where longitudinal olfactory testing is needed.
... Multinomial Naïve Bayes [39] (naivebayes package [40]): applied only for count vectorization and TF-IDF vectorization. Classification and Regression Trees (CART) [41] (rpart package [42]): applied for all 9 vectorization methods, with Gini and information gain criteria to split the nodes. Bagged CART [43] (e1071 [44] and caret [45] packages): applied for all 9 vectorization methods, with repeated 10-fold cross-validation to further reduce the variance. C4.5 [46] (RWeka package [47]): applied for all 9 vectorization methods, with repeated 10-fold cross-validation to further reduce the variance. C5.0 [48] (C50 package [49]): applied for all 9 vectorization methods, with repeated 10-fold cross-validation to further reduce the variance. Random Forest [50] (ranger package [51]): applied for all 9 vectorization methods, with repeated 10-fold cross-validation to further reduce the variance. Support Vector Machines [52] (e1071 [44] and caret [45] packages) ...
Article
Full-text available
Modern approaches to computing consumer price indices include the use of various data sources, such as web-scraped data or scanner data, which are very large in volume and need special processing techniques. In this paper, we address one of the main problems in the consumer price index calculation, namely the product classification, which cannot be performed manually when using large data sources. Therefore, we conducted an experiment on automatic product classification according to an international classification scheme. We combined 9 different word-embedding techniques with 13 classification methods with the aim of identifying the best combination in terms of the quality of the resultant classification. Because the dataset used in this experiment was significantly imbalanced, we compared these methods not only using the accuracy, F1-score, and AUC, but also using a weighted F1-score that better reflected the overall classification quality. Our experiment showed that logistic regression, support vector machines, and random forests, combined with the FastText skip-gram embedding technique provided the best classification results, with superior values in performance metrics, as compared to other similar studies. An execution time analysis showed that, among the three mentioned methods, logistic regression was the fastest while the random forest recorded a longer execution time. We also provided per-class performance metrics and formulated an error analysis that enabled us to identify methods that could be excluded from the range of choices because they provided less reliable classifications for our purposes.
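Because the abstract stresses that plain accuracy can be misleading on an imbalanced product dataset, the short sketch below contrasts accuracy with a weighted F1-score on made-up labels; it only illustrates the metric, not the paper's pipeline.

```python
# Illustrative only: weighted F1 vs. accuracy on an imbalanced toy example.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10          # heavily imbalanced ground truth
y_pred = [0] * 100                    # a classifier that ignores the minority class

print("Accuracy:", accuracy_score(y_true, y_pred))                   # 0.90, looks fine
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # noticeably lower
```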
... Random forest (RF). The RF is an extensively used learning approach for classification, clustering, and interaction detection as well as regression analysis, which was developed by Breiman (2001). The RF employs ensemble trees to reduce these issues, resulting in more robust models. ...
Article
Groundwater contamination caused by elevated nitrate levels and its associated health effects is a serious global concern. The U.S. Environmental Protection Agency has developed a method for assessing potential human health risks from groundwater contamination that involves extensive groundwater sampling and analysis. However, this approach can be labor intensive, which constrains the robustness of the traditional approach. Machine learning (ML) could offer an alternative approach to these contemporary challenges. ML models such as deep neural networks (DNN), gradient boosting machines (GBM), random forests (RF) and generalized linear models (GLM) can provide solutions to overcome these limitations. In this study, the effectiveness of Hybrid Monte Carlo Machine Learning (MC-ML) models was evaluated by predicting health risks using hazard quotients. A total of 32 groundwater samples were collected and analyzed for nitrate and physical properties during the pre- and post-monsoon seasons. The results showed that the groundwater was severely contaminated by elevated nitrate concentrations, leading to high hazard quotient values. The prediction model results and validation using error and performance metrics showed that the Hybrid MC-DNN model outperformed the other models in both the training and testing phases. These results suggest that this surrogate approach could be a promising alternative to traditional health risk assessment methods.
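The hazard-quotient idea behind the hybrid models can be illustrated with a plain Monte Carlo sketch; the exposure formula and every parameter value below (intake rate, body weight, nitrate reference dose, concentration distribution) are assumptions for illustration, not the study's values.

```python
# Hedged Monte Carlo sketch of a nitrate hazard quotient (HQ) distribution.
# HQ = (C * IR) / (BW * RfD); all parameter values here are illustrative
# assumptions, not those used in the cited study.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
conc = rng.lognormal(mean=3.5, sigma=0.5, size=n)   # nitrate concentration, mg/L (assumed)
intake = 2.0                                        # drinking-water intake, L/day (assumed)
body_weight = 65.0                                  # kg (assumed)
rfd = 1.6                                           # reference dose, mg/kg/day (assumed)

hq = conc * intake / (body_weight * rfd)
print("Median HQ:", np.median(hq))
print("Share of simulations with HQ > 1 (potential risk):", (hq > 1).mean())
```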
... Introduced by Breiman (2001), the Random Forest Algorithm (RFA) gained tremendous popularity due to robust performance across a wide range of datasets (Wyner et al., 2017, p.7). The RFA can be considered a generalization of the decision-tree concept. ...
Article
Full-text available
Cardiovascular diseases (CVDs) are the number one cause of death globally. Coronary artery disease (CAD) is the most common form of CVD. Abundant research works propose decision support systems for early CAD detection. Most of the proposed solutions have their origins in the realm of machine learning and data mining. This paper presents two solutions for CAD prediction. The first solution optimizes a random forest model (RFM) through hyperparameter tuning. The second solution uses a case-based reasoning (CBR) methodology. The CBR solution takes advantage of feature importance to improve the execution time of the retrieve step in the CBR cycle. The experiments show that the RFM outperformed most recently published models for CAD diagnosis. By reducing the number of attributes, the CBR solution improves the execution time and also performs very well in terms of diagnosis accuracy. The performance of the CBR solution is expected to improve further because CBR is a learning methodology.
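As a hedged illustration of "optimizing a random forest model through hyperparameter tuning", the sketch below runs a small grid search with scikit-learn; the grid values, scoring, and synthetic data are assumptions, not the settings used for the CAD model.

```python
# Sketch: grid search over common random forest hyperparameters (illustrative values).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=13, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```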
... Finally, to understand what elements are most informative in the decision process, we calculate the permutation feature importance (Breiman 2001) of the classification pipeline. The feature importance for element X is defined as the decrease in cross-validation accuracy if we randomly shuffle the values of all abundances that include element X in the cross-validation data. ...
Article
Full-text available
In unveiling the nature of the first stars, the main astronomical clue is the elemental compositions of the second generation of stars, observed as extremely metal-poor (EMP) stars, in the Milky Way. However, no observational constraint was available on their multiplicity, which is crucial for understanding early phases of galaxy formation. We develop a new data-driven method to classify observed EMP stars into mono- or multi-enriched stars with support vector machines. We also use our own nucleosynthesis yields of core-collapse supernovae with mixing fallback that can explain many of the observed EMP stars. Our method predicts, for the first time, that 31.8% ± 2.3% of 462 analyzed EMP stars are classified as mono-enriched. This means that the majority of EMP stars are likely multi-enriched, suggesting that the first stars were born in small clusters. Lower-metallicity stars are more likely to be enriched by a single supernova, most of which have high carbon enhancement. We also find that Fe, Mg, Ca, and C are the most informative elements for this classification. In addition, oxygen is very informative despite its low observability. Our data-driven method sheds new light on solving the mystery of the first stars from the complex data set of Galactic archeology surveys.
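The permutation feature importance cited in the excerpt above (the drop in accuracy when one feature's values are shuffled) can be sketched as follows; for brevity the example scores on a held-out test set rather than cross-validation, and the data are placeholders.

```python
# Sketch of permutation importance: shuffle one feature and measure the accuracy drop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

baseline = model.score(X_te, y_te)
rng = np.random.default_rng(0)
drops = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = X_te[rng.permutation(len(X_te)), j]   # shuffle feature j only
    drops.append(baseline - model.score(X_perm, y_te))
print("Manual accuracy drops per feature:", np.round(drops, 3))

# Library equivalent for comparison.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("sklearn importances:", np.round(result.importances_mean, 3))
```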
Article
Full-text available
In the context of product innovation, there is an emerging trend to use Machine Learning (ML) models with the support of Design Of Experiments (DOE). The paper aims firstly to review the most suitable designs and ML models to use jointly in an Active Learning (AL) approach; it then reviews ALPERC, a novel AL approach, and proves the validity of this method through a case study on amorphous metallic alloys, where this algorithm is used in combination with a Random Forest model.
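ALPERC itself is not reproduced here, but the kind of pairing the paper reviews, an active-learning loop with a Random Forest surrogate, can be sketched as below; the acquisition rule (query the candidate with the largest spread across trees) and the toy objective are illustrative assumptions.

```python
# Hedged sketch of a generic active-learning loop with a Random Forest surrogate.
# The acquisition criterion (disagreement across trees) is an illustrative choice,
# not the ALPERC criterion from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def objective(x):                       # placeholder for an expensive experiment
    return np.sin(3 * x[:, 0]) + 0.5 * x[:, 1] ** 2

candidates = rng.uniform(-1, 1, size=(500, 2))        # candidate design points
labeled_idx = list(rng.choice(len(candidates), size=10, replace=False))

for round_ in range(5):
    X_lab = candidates[labeled_idx]
    y_lab = objective(X_lab)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_lab, y_lab)

    # Spread of per-tree predictions as a crude uncertainty estimate.
    per_tree = np.stack([t.predict(candidates) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)
    uncertainty[labeled_idx] = -np.inf                 # do not re-pick labeled points
    labeled_idx.append(int(np.argmax(uncertainty)))    # query the most uncertain point

print("Design points labeled so far:", len(labeled_idx))
```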
Chapter
DNA-binding proteins compact DNA and regulate different cellular processes. These proteins have been successfully used to address genetic disorders and critical diseases, such as cancer. Identification of DNA-binding proteins through experimental techniques is time-consuming and costly, so a reliable automatic computational method based on machine learning is desirable for detecting them. However, various factors affect the overall performance of machine learning algorithms when discriminating DNA-binding from non-DNA-binding proteins. In the current paper, we present a new methodology to cope with all the important factors. Firstly, we extract explanatory features based on different composition concepts. Secondly, a fuzzy rough set-assisted feature selection with harmony search is used to eliminate redundant and/or irrelevant attributes. Thirdly, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to produce optimally balanced datasets. Further, we explore the assessment measures of different learning techniques over unreduced, reduced, and optimally balanced reduced datasets. Next, a comparative study is presented to demonstrate the effectiveness of the entire methodology.
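A minimal sketch of the SMOTE balancing step mentioned above, using the imbalanced-learn package on placeholder data; the feature extraction and fuzzy rough selection steps of the methodology are not shown.

```python
# Sketch: balancing a skewed dataset with SMOTE (imbalanced-learn); the data
# here are synthetic placeholders, not protein features.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
print("Before SMOTE:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))   # classes now balanced roughly 1:1
```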
Chapter
In this paper, we present an effective methodology to improve the prediction performance of learning algorithms for forecasting anti-cancer peptides. Firstly, 489 informative features are extracted based on 11 compositions of interest. Thereafter, 117 non-redundant and relevant features are selected by using fuzzy rough feature selection (FRFS) with ant colony optimization (ACO) search. Then, instances of the reduced dataset are resampled by using the synthetic minority over-sampling technique (SMOTE) to achieve an optimal balancing ratio of 1:1. Next, we conduct comprehensive experiments with reduced and unreduced datasets using various learning algorithms based on tenfold cross validation (CV) and an 80:20 percentage split validation. Finally, we present a comparative study of our proposed methodology with possible alternatives and show that it outperforms previous existing methods. Experimental results indicate that the best results are produced by a vote-based classifier using the 80:20 percentage split validation technique, with specificity of 99.1%, sensitivity of 97.3%, accuracy of 98.2%, AUC of 0.983, and MCC of 0.888. From the experimentation, it can be concluded that our current methodology can enhance the discriminating ability of different artificial intelligence models for anti-cancer and non-anti-cancer peptides by using feature extraction, FRFS with ACO followed by SMOTE, and a vote-based classifier. Keywords: Feature extraction, Feature selection, SMOTE, Fuzzy set, Classification
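The reported metrics (specificity, sensitivity, accuracy, and MCC) can all be recovered from a binary confusion matrix; the sketch below uses made-up predictions purely to show the bookkeeping.

```python
# Illustrative computation of sensitivity, specificity, accuracy and MCC
# from a binary confusion matrix (toy predictions, not the peptide data).
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # true positive rate (recall)
specificity = tn / (tn + fp)          # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"accuracy={accuracy:.2f} MCC={matthews_corrcoef(y_true, y_pred):.2f}")
```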
Chapter
Full-text available
Customer journey analysis is important for organizations to learn as much as possible about the main behavior of their customers. This provides the basis to improve the customer experience within their organization. This paper addresses the problem of predicting the occurrence of a certain activity of interest in the remainder of the customer journey that follows the occurrence of another specific activity. For this, we propose the HIAP framework, which uses process mining techniques to analyze customer journeys. Different prediction models are researched to investigate which model is most suitable for high importance activity prediction. Furthermore, the effect of using a sliding window or landmark model for (re)training a model is investigated. The framework is evaluated using a real health insurance dataset and a benchmark dataset. The efficiency and prediction quality results highlight the usefulness of the framework under various realistic online business settings.
Article
Full-text available
Due to their excellent biocompatibility and physicochemical performance, luminogens with aggregation-induced emission (AIEgens) characteristics have recently played a significant role in biomedical fluorescence imaging. However, screening AIEgens for specific applications via conventional chemical synthesis routes takes considerable time and effort. Fortunately, artificial intelligence techniques that can predict the properties of AIEgen molecules would be helpful and valuable for the design and synthesis of novel AIEgens. In this work, we applied machine learning (ML) techniques to screen AIEgens with the expected excitation and emission wavelengths for biomedical deep fluorescence imaging. First, a database of various AIEgens collected from the literature was established. Then, by extracting key features using molecular descriptors and training various state-of-the-art ML models, a multi-modal molecular descriptors strategy was proposed to extract the structure-property relationships of AIEgens and predict molecular absorption and emission wavelength peaks. Compared to first-principles calculations, the proposed strategy provided greater accuracy at a lower computational cost. Finally, three newly predicted AIEgens with the desired absorption and emission wavelength peaks were synthesized successfully and applied to cellular fluorescence imaging and deep penetration imaging. All results were consistent with our expectations, demonstrating that ML has great potential for screening AIEgens with suitable wavelengths and could boost the design and development of novel organic fluorescent materials.
Article
Chickpea is an important edible legume consumed worldwide because of its rich nutrient composition. The physical parameters of chickpea are crucial attributes for the design of processing and classification systems. In this study, the effects of seven different irrigation treatments on size, shape, mass, and color properties of chickpea seeds were investigated, and machine learning algorithms were used to estimate the mass and color attributes of chickpea seeds. The results showed that the Multilayer Perceptron (MLP) had the greatest correlation coefficients for mass (0.9997) and chroma (0.9997). The MLP yielded better outcomes than Random Forest for both mass and color estimation. In terms of physical attributes, the best results were obtained in the I1 (rainfed) and I5 (irrigation at 50% flowering and 50% pod fill) irrigation treatments. Additionally, one or two irrigations at different physiological stages instead of a full irrigation treatment might be sufficient to improve the physical attributes of chickpea.
Article
Full-text available
Disturbance-dependent grasslands, often associated with hydromorphological and fire dynamics, are threatened, especially in subtropical climates. In the Nepalese and Indian Terai Arc Landscape at the foot of the Himalayas, natural and cultural grasslands serve a viable role for the greater one-horned rhinoceros (Rhinoceros unicornis) and for grazers that form prey of the Royal Bengal tiger (Panthera tigris). The grasslands are vulnerable to encroachment of forest. We aimed to establish the effects of environmental drivers, in particular river discharge, river channel dynamics, precipitation and forest fires, on the spatio-temporal dynamics of these grasslands. The study area is the floodplain of the eastern branch of the Karnali River and the adjacent western part of Bardia National Park. We created annual time series (1993–2019) of land cover with the use of field data, remotely sensed LANDSAT imagery and a supervised classification model. Additionally, we analysed the pattern of grassland patches and aerial photographs of 1964. Between 1964 and 2019, grassland patches decreased in abundance and size due to encroachment of forest. Outside the floodplain, conversion of grassland to bare substrate coincides with extreme precipitation events. Within the floodplain, conversion of grassland to bare substrate correlates with the magnitude of the annual peak discharge of the bifurcated Karnali River. Since 2009, however, this correlation is absent due to a shift of the main discharge channel to the western branch of the Karnali River. Consequently, alluvial tall grasslands (Saccharum spontaneum dominant) have vastly expanded between 2009 and 2019. Because the hydromorphological processes in the floodplain have become more static, other sources of disturbance – local flooding of ephemeral streams, anthropogenic maintenance, grazing and fires – have become all the more important for preventing encroachment of grasslands. Altogether, our findings underscore that a change in the environmental drivers impacts the surface area and heterogeneity of grassland patches in the landscape, which can lead to cascading effects for the grassland-dependent megafauna.
Article
Full-text available
Despite some encouraging successes, predicting the therapy response of acute myeloid leukemia (AML) patients remains highly challenging due to tumor heterogeneity. Here we aim to develop and validate MDREAM, a robust ensemble-based prediction model for drug response in AML based on an integration of omics data, including mutations and gene expression, and large-scale drug testing. Briefly, MDREAM is first trained in the BeatAML cohort ( n = 278), and then validated in the BeatAML ( n = 183) and two external cohorts, including a Swedish AML cohort ( n = 45) and a relapsed/refractory acute leukemia cohort ( n = 12). The final prediction is based on 122 ensemble models, each corresponding to a drug. A confidence score metric is used to convey the uncertainty of predictions; among predictions with a confidence score >0.75, the validated proportion of good responders is 77%. The Spearman correlations between the predicted and the observed drug response are 0.68 (95% CI: [0.64, 0.68]) in the BeatAML validation set, –0.49 (95% CI: [–0.53, –0.44]) in the Swedish cohort and 0.59 (95% CI: [0.51, 0.67]) in the relapsed/refractory cohort. A web-based implementation of MDREAM is publicly available at https://www.meb.ki.se/shiny/truvu/MDREAM/ .
Article
Full-text available
Suicide risk prediction models can identify individuals for targeted intervention. Discussions of transparency, explainability, and transportability in machine learning presume complex prediction models with many variables outperform simpler models. We compared random forest, artificial neural network, and ensemble models with 1500 temporally defined predictors to logistic regression models. Data from 25,800,888 mental health visits made by 3,081,420 individuals in 7 health systems were used to train and evaluate suicidal behavior prediction models. Model performance was compared across several measures. All models performed well (area under the receiver operating curve [AUC]: 0.794–0.858). Ensemble models performed best, but improvements over a regression model with 100 predictors were minimal (AUC improvements: 0.006–0.020). Results are consistent across performance metrics and subgroups defined by race, ethnicity, and sex. Our results suggest simpler parametric models, which are easier to implement as part of routine clinical practice, perform comparably to more complex machine learning methods.
Article
Do people have well-defined social preferences waiting to be applied when making decisions? Or do they have to construct social decisions on the spot? If the latter, how are those decisions influenced by the way in which information is acquired and evaluated? These temporal dynamics are fundamental to understanding how people trade off selfishness and prosociality in organizations and societies. Here, we investigate how the temporal dynamics of the choice process shape social decisions in three studies using response times and mouse tracking. In the first study, participants made binary decisions in mini-dictator games with and without time constraints. Using mouse trajectories and a starting time drift diffusion model, we find that, regardless of time constraints, selfish participants were delayed in processing others’ payoffs, whereas the opposite was true for prosocial participants. The independent mouse trajectory and computational modeling analyses identified consistent measures of the delay between considering one’s own and others’ payoffs (self-onset delay, SOD). This measure correlated with individual differences in prosociality and predicted heterogeneous effects of time constraints on preferences. We confirmed these results in two additional studies, one a purely behavioral study in which participants made decisions by pressing computer keys, and the other a replication of the mouse-tracking study. Together, these results indicate that people preferentially process either self or others’ payoffs early in the choice process. The intrachoice dynamics are crucial in shaping social preferences and might be manipulated via nudge policies (e.g., manipulating the display order or saliency of self and others’ outcomes) for behavior in managerial or other contexts. This paper was accepted by Yan Chen, behavioral economics and decisions analysis. Funding: F. Chen acknowledges support from the National Natural Science Foundation of China [Grants 71803174 and 72173113]. Z. Zhu acknowledges support from the Ministry of Science and Technology [Grant STI 2030-Major Projects 2021ZD0200409]. Q. Shen acknowledges support from the National Natural Science Foundation of China [Grants 71971199 and 71942004]. I. Krajbich acknowledges support from the U.S. National Science Foundation [Grant 2148982]. This work was also supported by the James McKeen Cattell Fund. Supplemental Material: The online appendix and data are available at https://doi.org/10.1287/mnsc.2023.4732 .
Article
Studies on safety in aviation are necessary for the development of new technologies to forecast and prevent aeronautical accidents and incidents. When predicting these occurrences, the literature frequently considers the internal characteristics of aeronautical operations, such as aircraft telemetry and flight procedures, or external characteristics, such as meteorological conditions, with only few relationships being identified between the two. In this study, data from 6,188 aeronautical occurrences involving accidents, incidents, and serious incidents, in Brazil between January 2010 and October 2021, as well as meteorological data from two automatic weather stations, totaling more than 2.8 million observations, were investigated using machine learning tools. For data analysis, decision tree, extra trees, Gaussian naive Bayes, gradient boosting, and k-nearest neighbor classifiers with a high identification accuracy of 96.20% were used. Consequently, the developed algorithm can predict occurrences as functions of operational and meteorological patterns. Variables such as maximum take-off weight, aircraft registration and model, and wind direction are among the main forecasters of aeronautical accidents or incidents. This study provides insight into the development of new technologies and measures to prevent such occurrences.
Conference Paper
Full-text available
Compacted soil is widely used as liners and covers in waste containment systems owing to its low hydraulic conductivity (k < 1 x 10^-9 m/s). The hydraulic conductivity is often measured in laboratories using flexible-wall permeameters or in the field using a lysimeter or infiltrometer, which all require weeks or months to reach equilibrium. In this study, a machine-learning-based predictive model was developed to determine the saturated hydraulic conductivity of compacted soil using the random forest (RF) algorithm. A database was created to train and validate the RF model, which contains the hydraulic conductivity and 12 impact factors of 329 soil samples in North America. The 12 impact factors, covering soil physical properties, hydration characteristics, and compaction conditions, were used in the RF model. A multiple linear regression model was also constructed as a comparison, using the same database and impact factors. The RF model validation indicated that 92% of the predicted hydraulic conductivities differ from the measured values by less than a factor of 10, and 100% by less than a factor of 100. The RF model has higher precision in predicting the hydraulic conductivity of compacted soil than the multiple linear regression model.
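The "within a factor of 10 (or 100)" validation criterion quoted above is easy to compute on a log scale; the sketch below assumes hypothetical arrays of measured and predicted hydraulic conductivities.

```python
# Sketch: fraction of predictions within a factor of 10 and 100 of the measured
# hydraulic conductivity, computed on a log10 scale (hypothetical values).
import numpy as np

k_measured = np.array([1e-10, 5e-10, 2e-9, 8e-11, 3e-10])    # m/s, placeholder data
k_predicted = np.array([2e-10, 1e-9, 1e-9, 5e-10, 2.5e-10])

log_ratio = np.abs(np.log10(k_predicted / k_measured))
print("Within a factor of 10: ", np.mean(log_ratio < 1))
print("Within a factor of 100:", np.mean(log_ratio < 2))
```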
Article
Concentrations of ambient particulate matter (PM) depend on various factors including emissions of primary pollutants, meteorology and chemical transformations. New Delhi, India is the most polluted megacity in the world and routinely experiences extreme pollution episodes. As part of the Delhi Aerosol Supersite study, we measured online continuous PM1 (particulate matter of size less than 1µm) concentrations and composition for over five years starting January 2017, using an Aerosol Chemical Speciation Monitor (ACSM). Here, we describe the development and application of machine learning models using random forest regression to estimate the concentrations, composition, sources and dynamics of PM in Delhi. These models estimate PM1 species concentrations based on meteorological parameters including ambient temperature, relative humidity, planetary boundary layer height, wind speed, wind direction, precipitation, agricultural burning fire counts, solar radiation and cloud cover. We used hour of day, day of week and month of year as proxies for time-dependent emissions (e.g., emissions from traffic during rush hours). We demonstrate the applicability of these models to capture temporal variability of the PM1 species, to understand the influence of individual factors via sensitivity analyses, and to separate impacts of the COVID-19 lockdowns and associated activity restrictions from impacts of other factors. Our models provide new insights into the factors influencing ambient PM1 in New Delhi, India, demonstrating the power of machine learning models in atmospheric science applications.
Article
Full-text available
Invasive alien plant species (IAPS) have negative impacts on ecosystems, including the loss of biodiversity and the alteration of ecosystem functions. The strategy for mitigating these impacts requires knowledge of these species’ spatial distribution and level of infestation. In situ inventories or aerial photo interpretation can be used to collect these data but they are labor-intensive, time-consuming, and incomplete, especially when dealing with large or inaccessible areas. Remote sensing may be an effective method of mapping IAPS for a better management strategy. Several studies using remote sensing to map IAPS have focused on single species detection and were conducted in relatively homogeneous natural environments, while other common, more heterogeneous environments, such as urban areas, are often invaded by multiple IAPS, posing management challenges. The main objective of this study was to develop a mapping method for three major IAPS observed in the urban agglomeration of Quebec City (Canada), namely Japanese knotweed (Fallopia japonica); giant hogweed (Heracleum mantegazzianum); and phragmites (Phragmites australis). Mono-date and multi-date classification approaches were used with WorldView-3 and SPOT-7 satellite imagery, acquired in the summer of 2020 and in the autumn of 2019, respectively. To estimate presence probability, object-based image analysis (OBIA) and nonparametric classifiers such as Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) were used. Overall, multi-date classification using WorldView-3 and SPOT-7 images produced the best results, with a Kappa coefficient of 0.85 and an overall accuracy of 91% using RF. For XGBoost, the Kappa coefficient was 0.81 with an overall accuracy of 89%, whereas the Kappa coefficient and overall accuracy were 0.80 and 88% for SVM classifier, respectively. Individual class performances based on F1-score revealed that Japanese knotweed had the highest maximum value (0.95), followed by giant hogweed (0.91), and phragmites (0.87). These results confirmed the potential of remote sensing to accurately map and simultaneously monitor the main IAPS in a heterogeneous urban environment using a multi-date approach. Although the approach is limited by image and reference data availability, it provides new tools to managers for IAPS invasion control.
Article
Full-text available
Mapping soil organic matter (SOM) content has become an important application of digital soil mapping. In this study, we processed all Sentinel-2 images covering the bare-soil period (March to June) in Northeast China from 2019 to 2022 and integrated the observation results into synthetic materials with four defined time intervals (10, 15, 20, and 30 d). Then, we used synthetic images corresponding to different time periods to conduct SOM mapping and determine the optimal time interval and time period before finally assessing the impacts of adding environmental covariates. The results showed the following: (1) in SOM mapping, the highest accuracy was obtained using day-of-year (DOY) 120 to 140 synthetic images with 20 d time intervals, as well as with different time intervals, ranked as follows: 20 d > 30 d > 15 d > 10 d; (2) when using synthetic images at different time intervals to predict SOM, the best time period for predicting SOM was always within May; and (3) adding environmental covariates effectively improved the SOM mapping performance, and the multiyear average temperature was the most important factor. In general, our results demonstrated the valuable potential of SOM mapping using multiyear synthetic imagery, thereby allowing detailed mapping of large areas of cultivated soil.
Article
Full-text available
The risk of cardiovascular disease (CVD) is a serious health threat to human society worldwide. The use of machine learning methods to predict the risk of CVD is of great relevance for identifying high-risk patients and taking timely interventions. In this study, we propose the XGBH machine learning model, a CVD risk prediction model based on key contributing features. In this paper, the generalisation of the model was enhanced by adding retrospective data of 14,832 Chinese Shanxi CVD patients to the Kaggle dataset. The XGBH risk prediction model proposed in this paper was validated to be highly accurate (AUC = 0.81) compared to the baseline risk score (AUC = 0.65), and the accuracy of the model for CVD risk prediction was improved by including the conventional biometric BMI variable. To increase the clinical applicability of the model, a simpler diagnostic model was designed in this paper, which requires only three characteristics from the patient (age, systolic blood pressure, and whether cholesterol is normal or not) to enable early intervention in the treatment of high-risk patients with only a slight reduction in accuracy (AUC = 0.79). Ultimately, a CVD risk score model with few features and high accuracy will be established based on the main contributing features. Of course, further prospective studies, as well as studies with other populations, are needed to assess the actual clinical effectiveness of the XGBH risk prediction model.
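The simplified three-feature diagnostic model described above could be prototyped along the following lines; the gradient-boosting hyperparameters and the synthetic data are assumptions, and the real XGBH settings are not reproduced.

```python
# Hedged sketch of a small three-feature gradient-boosted classifier scored by AUC,
# using synthetic stand-in data (age, systolic blood pressure, cholesterol flag).
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5000
age = rng.integers(30, 80, n)
sys_bp = rng.normal(130, 20, n)
chol_abnormal = rng.integers(0, 2, n)
risk = 0.03 * (age - 50) + 0.02 * (sys_bp - 120) + 0.8 * chol_abnormal
y = (risk + rng.normal(0, 1, n) > 1).astype(int)      # synthetic outcome labels
X = np.column_stack([age, sys_bp, chol_abnormal])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```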
Article
Full-text available
One of the main obstacles to the successful treatment of cancer is the phenomenon of drug resistance. A common strategy to overcome resistance is the use of combination therapies. However, the space of possibilities is huge and efficient search strategies are required. Machine Learning (ML) can be a useful tool for the discovery of novel, clinically relevant anti-cancer drug combinations. In particular, deep learning (DL) has become a popular choice for modeling drug combination effects. Here, we set out to examine the impact of different methodological choices on the performance of multimodal DL-based drug synergy prediction methods, including the use of different input data types, preprocessing steps and model architectures. Focusing on the NCI ALMANAC dataset, we found that feature selection based on prior biological knowledge has a positive impact: limiting gene expression data to cancer- or drug response-specific genes improved performance. Drug features appeared to be more predictive of drug response, with a 41% increase in coefficient of determination (R2) and 26% increase in Spearman correlation relative to a baseline model that used only cell line and drug identifiers. Molecular fingerprint-based drug representations performed slightly better than learned representations: ECFP4 fingerprints increased R2 by 5.3% and Spearman correlation by 2.8% with respect to the best learned representations. In general, fully connected feature-encoding subnetworks outperformed other architectures. DL outperformed other ML methods by more than 35% (R2) and 14% (Spearman). Additionally, an ensemble combining the top DL and ML models improved performance by about 6.5% (R2) and 4% (Spearman). Using a state-of-the-art interpretability method, we showed that DL models can learn to associate drug and cell line features with drug response in a biologically meaningful way. The strategies explored in this study will help to improve the development of computational methods for the rational design of effective drug combinations for cancer therapy.
Article
Full-text available
Background: Using visual, biological, and electronic health records data as the sole input source, pretrained convolutional neural networks and conventional machine learning methods have been heavily employed for the identification of various malignancies. Initially, a series of preprocessing and image segmentation steps are performed to extract region of interest features from noisy features. Then, the extracted features are applied to several machine learning and deep learning methods for the detection of cancer. Methods: In this work, a review of all the methods that have been applied to develop machine learning algorithms that detect cancer is provided. With more than 100 types of cancer, this study only examines research on the four most common and prevalent cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, by using state-of-the-art sentence transformers, namely SBERT (2019) and the unsupervised SimCSE (2021), this study proposes a new methodology for detecting cancer. This method requires raw DNA sequences of matched tumor/normal pairs as the only input. The learnt DNA representations retrieved from SBERT and SimCSE are then sent to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not been applied to represent DNA sequences in cancer detection settings. Results: The XGBoost model, which had the highest overall accuracy of 73 ± 0.13 % using SBERT embeddings and 75 ± 0.12 % using SimCSE embeddings, was the best performing classifier. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE’s sentence transformer only marginally improved the performance of machine learning models.
Article
Full-text available
The role of artificial intelligence (AI) in organizations has fundamentally changed from performing routine tasks to supervising human employees. While prior studies focused on normative perceptions of such AI supervisors, employees’ behavioral reactions towards them remained largely unexplored. We draw from theories on AI aversion and appreciation to tackle the ambiguity within this field and investigate if and why employees might adhere to unethical instructions either from a human or an AI supervisor. In addition, we identify employee characteristics affecting this relationship. To inform this debate, we conducted four experiments (total N = 1701) and used two state-of-the-art machine learning algorithms (causal forest and transformers). We consistently find that employees adhere less to unethical instructions from an AI than a human supervisor. Further, individual characteristics such as the tendency to comply without dissent or age constitute important boundary conditions. In addition, Study 1 identified that the perceived mind of the supervisors serves as an explanatory mechanism. We generate further insights on this mediator via experimental manipulations in two pre-registered studies by manipulating mind between two AI (Study 2) and two human supervisors (Study 3). In (pre-registered) Study 4, we replicate the resistance to unethical instructions from AI supervisors in an incentivized experimental setting. Our research generates insights into the ‘black box’ of human behavior toward AI supervisors, particularly in the moral domain, and showcases how organizational researchers can use machine learning methods as powerful tools to complement experimental research for the generation of more fine-grained insights.
Article
Full-text available
Nearly 10^8 types of high entropy alloys (HEAs) can be developed from about 64 elements in the periodic table. A major challenge for materials scientists and metallurgists at this stage is to predict their crystal structure and, therefore, their mechanical properties to reduce experimental efforts, which are energy and time intensive. Through this paper, we show that it is possible to use machine learning (ML) in this arena for phase prediction to develop novel HEAs. We tested five robust algorithms, namely K-nearest neighbours (KNN), support vector machine (SVM), decision tree classifier (DTC), random forest classifier (RFC) and XGBoost (XGB), in their vanilla form (base models) on a large dataset screened specifically from experimental data concerning HEA fabrication using melting and casting manufacturing methods. This was necessary to avoid the discrepancy inherent in comparing HEAs obtained from different synthesis routes, as it causes spurious effects when treating imbalanced data, an erroneous practice we observed in the reported literature. We found that (i) RFC model predictions were more reliable in contrast to other models and (ii) synthetic data augmentation is not a neat practice in materials science, especially for developing HEAs, where it cannot assure phase information reliably. To substantiate our claim, we compared the vanilla RFC (V-RFC) model for the original data (1200 datasets) with a SMOTE-Tomek links augmented RFC (ST-RFC) model for the new datasets (1200 original + 192 generated = 1392 datasets). We found that although the ST-RFC model showed a higher average test accuracy of 92%, no significant breakthroughs were observed when testing the number of correct and incorrect predictions using the confusion matrix and ROC-AUC scores for individual phases. Based on our RFC model, we report the development of a new HEA (Ni25Cu18.75Fe25Co25Al6.25) exhibiting an FCC phase, proving the robustness of our predictions.
Chapter
Statistical and machine learning methods have many applications in the environmental sciences, including prediction and data analysis in meteorology, hydrology and oceanography; pattern recognition for satellite images from remote sensing; management of agriculture and forests; assessment of climate change; and much more. With rapid advances in machine learning in the last decade, this book provides an urgently needed, comprehensive guide to machine learning and statistics for students and researchers interested in environmental data science. It includes intuitive explanations covering the relevant background mathematics, with examples drawn from the environmental sciences. A broad range of topics is covered, including correlation, regression, classification, clustering, neural networks, random forests, boosting, kernel methods, evolutionary algorithms and deep learning, as well as the recent merging of machine learning and physics. End-of-chapter exercises allow readers to develop their problem-solving skills, and online datasets allow readers to practise analysis of real data.
Article
Breast cancer is the most common form of cancer and is still the second leading cause of death for women in the world. Early detection and treatment of breast cancer can reduce mortality rates. Breast ultrasound is commonly used to detect and diagnose breast cancer. Accurate breast segmentation and benign/malignant diagnosis remain challenging tasks in ultrasound images. In this paper, we propose a classification model, short-ResNet with DC-UNet, to solve the segmentation and diagnosis challenge of finding the tumor and classifying it as benign or malignant in breast ultrasound images. The proposed model has a dice coefficient of 83% for segmentation and achieves an accuracy of 90% for classification of breast tumors. In the experiments, we compared segmentation and classification results on different datasets to show that the proposed model is more general and demonstrates better results. The deep learning model uses short-ResNet to classify tumors as benign or malignant and combines it with DC-UNet for the segmentation task to help improve the classification results.
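The Dice coefficient reported for the segmentation branch is a simple overlap measure between predicted and ground-truth masks; a small sketch with made-up binary masks:

```python
# Sketch: Dice coefficient between a predicted and a ground-truth binary mask.
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred_mask = np.zeros((64, 64), dtype=np.uint8)
true_mask = np.zeros((64, 64), dtype=np.uint8)
pred_mask[20:40, 20:40] = 1          # hypothetical predicted tumor region
true_mask[25:45, 22:42] = 1          # hypothetical annotated tumor region
print("Dice:", round(dice_coefficient(pred_mask, true_mask), 3))
```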
Article
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) belongs to the Coronaviridae family, and a change in the genetic sequence of SARS-CoV-2 is called a mutation, which gives rise to variants of SARS-CoV-2. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. In this method, we describe 16 dinucleotide and 64 trinucleotide features to differentiate SARS-CoV-2 variants of concern. The efficacy of the proposed features is proved by using four classifiers: k-nearest neighbor, support vector machines, multilayer perceptron, and random forest. The proposed method is evaluated on a dataset including 223,326 complete human genome sequences covering the recently designated variants of concern: the Alpha, Beta, Gamma, Delta, and Omicron variants. Experimental results show that the overall accuracy for detecting SARS-CoV-2 variants of concern increases remarkably when trinucleotide rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, a state-of-the-art method for reducing the number of features and choosing the most relevant ones. As a result of the whale optimization method, we select 44 trinucleotide features out of 64 to differentiate SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves about 99% accuracy, sensitivity, specificity, and precision on average. The proposed method presents an admirable performance for detecting SARS-CoV-2 variants.
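The 16 dinucleotide and 64 trinucleotide features described above are normalized k-mer frequencies; a minimal sketch of that feature extraction on a toy sequence (the classifiers and the whale-optimization selection step are not shown):

```python
# Sketch: dinucleotide (k=2) and trinucleotide (k=3) frequency features
# from a nucleotide sequence; the sequence below is a toy example.
from itertools import product

def kmer_frequencies(sequence, k):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]   # 16 for k=2, 64 for k=3
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:                # skip ambiguous bases such as 'N'
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)
    return [counts[km] / total for km in kmers]

seq = "ATGCGTACGTTAGCNATGCA"
features = kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)
print("Feature vector length:", len(features))   # 16 + 64 = 80
```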
Article
Full-text available
Electric shorting induced by tall vegetation is one of the major hazards affecting power transmission lines extending through rural regions and rough terrain for tens of kilometres. This raises the need for an accurate, reliable, and cost-effective approach for continuous monitoring of canopy heights. This paper proposes and evaluates two deep convolution neural network (CNN) variants based on Seg-Net and Res-Net architectures, characterized by their small number of trainable weights (nearly 800,000) while maintaining high estimation accuracy. The proposed models utilize the freely available data from Sentinel-2, and a digital surface model to estimate forest canopy heights with high accuracy and a spatial resolution of 10 metres. Various factors affect canopy height estimation, including topography signature, dataset diversity, input layers, and model structure. The proposed models are applied separately to two powerline regions located in the northern and southern parts of Thailand. The application results show that the proposed Encoder-Decoder CNN Seg-Net model presents an average mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2) of 1.38 m, 1.85 m, and 0.87, respectively, and is nearly 4.8 times faster than the CNN Res-Net model in conversion. These results prove the proposed model's capability of estimating and monitoring canopy heights with high accuracy and fine spatial resolution.
Article
Full-text available
This article aims to model the relationship between the size of the shadow economy and the most important government expenditures respectively social protection, health, and education, using nonlinear approaches. We applied four different Machine Learning models, namely Support Vector Regression, Neural Networks, Random Forest, and XGBoost on a cross-sectional dataset of 28 EU states between 1995 and 2020. Our goal is to calibrate an algorithm that can explain the variance of shadow economy size better than a linear model. Moreover, the most performant model has been used to predict the shadow economy size for over 30,000 simulated combinations of expenses in order to outline some possible inflection points after which government expenditures become counterproductive. Our findings suggest that ML algorithms outperform linear regression in terms of R-squared and root mean squared error and that social protection spending is the most important determinant of shadow economy size. Further to our analysis for the 28 EU states, between 1995 and 2020, the results suggest that the lowest size of shadow economy occurs when social protection expenses are greater than 20% of GDP, health expenses are greater than 6% of GDP, and education expenses range between 6% and 8% of GDP. To the best of the authors' knowledge, this is the first paper that used ML to model shadow economy and its determinants (i.e., government expenditures). We propose an easy-to-replicate methodology that can be developed in future research.
Article
Full-text available
Ensemble learning algorithms such as bagging often generate unnecessarily large models, which consume extra computational resources and may degrade the generalization ability. Pruning can potentially reduce ensemble size as well as improve performance; however, researchers have previously focused more on pruning classifiers rather than regressors. This is because, in general, ensemble pruning is based on two metrics: diversity and accuracy. Many diversity metrics are known for problems dealing with a finite set of classes defined by discrete labels. Therefore, most of the work on ensemble pruning is focused on such problems: classification, clustering, and feature selection. For the regression problem, it is much more difficult to introduce a diversity metric. In fact, the only such metric known to date is a correlation matrix based on regressor predictions. This study seeks to address this gap. First, we introduce the mathematical condition that allows checking whether the regression ensemble includes redundant estimators, i.e., estimators whose removal improves the ensemble performance. Developing this approach, we propose a new ambiguity-based pruning (AP) algorithm that is based on the error-ambiguity decomposition formulated for a regression problem. To check the quality of AP, we compare it with two methods that directly minimize the error by sequentially including and excluding regressors, as well as with the state-of-the-art Ordered Aggregation algorithm. Experimental studies confirm that the proposed approach reduces the size of the regression ensemble while simultaneously improving its performance, and surpasses all compared methods.
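For reference, the error-ambiguity decomposition that ambiguity-based pruning builds on can be written as follows (Krogh and Vedelsby's form for a weighted regression ensemble); the notation is mine and the paper's exact formulation may differ.

```latex
% Error-ambiguity decomposition for a weighted regression ensemble
% \bar{f}(x) = \sum_i w_i f_i(x), with w_i \ge 0 and \sum_i w_i = 1.
\[
\big(\bar{f}(x) - y\big)^2
  \;=\; \underbrace{\sum_i w_i \big(f_i(x) - y\big)^2}_{\text{weighted member error } \bar{E}(x)}
  \;-\; \underbrace{\sum_i w_i \big(f_i(x) - \bar{f}(x)\big)^2}_{\text{ambiguity (diversity) } \bar{A}(x)}
\]
```

Because the ambiguity term is subtracted, removing a regressor can lower the ensemble error whenever its contribution to the average member error outweighs its contribution to diversity, which is the kind of redundancy condition the pruning algorithm checks.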
Article
To assess the status and change trend of forests in China, an indicator framework was developed using SDG sub-indicators. In this paper, we propose an improved methodology and a set of workflows for calculating SDG indicators. The main modifications include the use of moderate and high spatial resolution satellite data, as well as state-of-the-art machine learning techniques for forest cover classification and estimation of forest above-ground biomass (AGB). This research employs GF-1 and GF-2 data with enhanced texture information to map forest cover, while time series Landsat data is used to estimate forest AGB across the whole territory of China. The study calculates two SDG sub-indicators: SDG 15.1.1 for forest area and SDG 15.2.1 for sustainable forest management. The evaluation results showed that the total forest area in China was approximately 219 million hectares at the end of 2021, accounting for about 23.51% of the land area. The average annual forest AGB from 2015 to 2021 was estimated to be 105.01 Mg/ha, and the overall trend of forest AGB change in China was positive, albeit with some spatial differences.
Article
Full-text available
Precision in the measurement of glucose levels in the artificial pancreas is a challenging task and a mandatory requirement for the proper functioning of an artificial pancreas. A suitable machine learning (ML) technique for the measurement of glucose levels in an artificial pancreas may play a crucial role in the management of diabetes. Therefore, in the present work, a comparison has been made among a few ML techniques for the measurement of glucose levels in the artificial pancreas, because ML is a powerful artificial intelligence technology that is widely applicable in various fields such as medical science, robotics, and environmental science. The models, namely decision tree (DT), random forest (RF), support vector machine (SVM), and K-nearest neighbor (KNN), based on supervised learning, are proposed for the Pima Indian dataset to predict and classify diabetes mellitus. The comparative behavior of all four models in predicting and classifying diabetes mellitus type 2 (DMT2) is discussed. The ML models developed here stratify and predict whether an individual is diabetic or not based on the features available in the dataset. The dataset is pre-processed, the ML algorithms are fitted to the training data, and the performance on the test set is then discussed. An error matrix (EM) has been generated to measure the accuracy score of the models. The accuracies in the prediction and classification of DMT2 are 71%, 77%, 78%, and 80% for the DT, SVM, RF, and KNN algorithms, respectively. The KNN model has shown a more precise result in comparison to the other models. The proposed methods have shown strong accuracy in the prediction of diabetes mellitus as compared to previously developed methods.
Article
Extreme gradient boosting (XGBoost) is an artificial intelligence algorithm capable of high accuracy and low inference time. The current study applies XGBoost to the production of platinum nano-film coatings through atomic layer deposition (ALD). In order to generate a database for model development, platinum is coated on α-Al2O3 using rotary-type ALD equipment. The process is controlled by four parameters: process temperature, stop valve time, precursor pulse time, and reactant pulse time. A total of 625 samples according to different process conditions are obtained. The Al/Pt component ratio, obtained through ICP-AES analysis during postprocessing, is used as the ALD coating index. The four process parameters serve as the input data, and the Al/Pt component ratio is the output. The postprocessed data set is randomly divided into 500 training samples and 125 test samples. XGBoost demonstrates 99.9% accuracy and a coefficient of determination of 0.99. The inference time is lower than that of random forest regression, in addition to a higher prediction safety than that of the light gradient boosting machine.
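A hedged sketch of the regression setup described above (four process parameters in, Al/Pt ratio out) with a rough inference-time measurement; the synthetic data and hyperparameters are placeholders, not the study's settings.

```python
# Sketch: XGBoost regression from four process parameters to a coating ratio,
# with a crude inference-time check (synthetic placeholder data).
import time
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(625, 4))   # temperature, stop-valve, precursor, reactant (scaled)
y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=625)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=500, random_state=0)
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_tr, y_tr)

start = time.perf_counter()
pred = model.predict(X_te)
elapsed = time.perf_counter() - start
print("R^2:", round(r2_score(y_te, pred), 3))
print("Inference time for 125 samples (s):", round(elapsed, 5))
```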
Article
Full-text available
We explore a new approach to shape recognition based on a virtually infinite family of binary features (queries) of the image data, designed to accommodate prior information about shape invariance and regularity. Each query corresponds to a spatial arrangement of several local topographic codes (or tags), which are in themselves too primitive and common to be informative about shape. All the discriminating power derives from relative angles and distances among the tags. The important attributes of the queries are a natural partial ordering corresponding to increasing structure and complexity; semi-invariance, meaning that most shapes of a given class will answer the same way to two queries that are successive in the ordering; and stability, since the queries are not based on distinguished points and substructures. No classifier based on the full feature set can be evaluated, and it is impossible to determine a priori which arrangements are informative. Our approach is to select informative features and build tree classifiers at the same time by inductive learning. In effect, each tree provides an approximation to the full posterior where the features chosen depend on the branch that is traversed. Due to the number and nature of the queries, standard decision tree construction based on a fixed-length feature vector is not feasible. Instead we entertain only a small random sample of queries at each node, constrain their complexity to increase with tree depth, and grow multiple trees. The terminal nodes are labeled by estimates of the corresponding posterior distribution over shape classes. An image is classified by sending it down every tree and aggregating the resulting distributions. The method is applied to classifying handwritten digits and synthetic linear and nonlinear deformations of three hundred [Formula: see text] symbols. State-of-the-art error rates are achieved on the National Institute of Standards and Technology database of digits. The principal goal of the experiments on [Formula: see text] symbols is to analyze invariance, generalization error and related issues, and a comparison with artificial neural networks methods is presented in this context. [Figure: see text]
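The tag-arrangement queries themselves are not reproduced here, but the core recipe (entertain only a small random sample of candidate features at each node, grow many trees, and average the terminal-node posterior distributions) can be approximated with scikit-learn's RandomForestClassifier, as in this hedged sketch on the scikit-learn digits data, used as a stand-in for the NIST digits.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                    # stand-in for handwritten digits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each node considers only a small random subset of features, many trees are grown,
# and the class-posterior estimates of the terminal nodes are averaged across trees.
forest = RandomForestClassifier(n_estimators=100, max_features=8, random_state=0)
forest.fit(X_tr, y_tr)
posteriors = forest.predict_proba(X_te)                # averaged terminal-node distributions
print("test accuracy:", forest.score(X_te, y_te))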
Article
Full-text available
In bagging [Bre94a] one uses bootstrap replicates of the training set [Efr79, ET93] to try to improve a learning algorithm's performance. The computational requirements for estimating the resultant generalization error on a test set by means of cross-validation are often prohibitive; for leave-one-out cross-validation one needs to train the underlying algorithm a number of times on the order of the training-set size m multiplied by the number of replicates. This paper presents several techniques for exploiting the bias-variance decomposition [GBD92, Wol96] to estimate the generalization error of a bagged learning algorithm without invoking yet more training of the underlying learning algorithm. The best of our estimators exploits stacking [Wol92]. In a set of experiments reported here, it was found to be more accurate than both the alternative cross-validation-based estimator of the bagged algorithm's error and the cross-validation-based estimator of the underlying algorithm's error.
Article
Full-text available
We attempt to provide a unifying explanation for the success of aggregation protocols of multiple classifiers. We show that all protocols produce weakly dependent classifiers conditional on class. This is the primary reason both for the decrease in the bias error of the aggregate classifier relative to each individual one and for the reduction in variance with respect to the training sample. We show that all protocols, including deterministic boosting, can be viewed as producing a sample from some distribution on the space of classifiers. There is a trade-off in terms of two key expectations with respect to this distribution. We study these points using a simple running example on which we also illustrate many of the properties we derive. In view of the proliferation of acronyms we have decided to add one of our own: MRCL. Introduction: In recent years a multitude of algorithms have been suggested in which a large number of classifiers are trained on the same training sample...
Article
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective methods is bagging. Here, modified training sets are formed by resampling from the original training set; classifiers are constructed using these training sets and then combined by voting. Y. Freund and R. Schapire [in L. Saitta (ed.), Machine Learning: Proc. Thirteenth Int. Conf. 148-156 (1996); see also Ann. Stat. 26, No. 5, 1651-1686 (1998; Zbl 0929.62069)] propose an algorithm whose basis is to adaptively resample and combine (hence the acronym “arcing”) so that the weights in the resampling are increased for those cases most often misclassified and the combining is done by weighted voting. Arcing is more successful than bagging in test set error reduction. We explore two arcing algorithms, compare them to each other and to bagging, and try to understand how arcing works. We introduce the definitions of bias and variance for a classifier as components of the test set error. Unstable classifiers can have low bias on a large range of data sets. Their problem is high variance. Combining multiple versions either through bagging or arcing reduces variance significantly.
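A minimal arcing-style sketch, assuming the arc-x4 update described in Breiman's report: resampling probabilities proportional to 1 + m(i)^4, where m(i) counts how often case i has been misclassified by the classifiers built so far, and unweighted plurality voting at the end. Function names and hyperparameters are illustrative, not taken from the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def arc_x4(X, y, n_rounds=50, seed=0):
    # Arcing loop: resampling probabilities grow with how often a case has been
    # misclassified so far. Assumes X, y are NumPy arrays with integer class labels.
    rng = np.random.default_rng(seed)
    n = len(y)
    miss = np.zeros(n)
    trees = []
    for _ in range(n_rounds):
        p = 1.0 + miss ** 4
        p /= p.sum()
        idx = rng.choice(n, size=n, replace=True, p=p)
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        trees.append(tree)
        miss += (tree.predict(X) != y)      # update misclassification counts on the original set
    return trees

def arc_predict(trees, X):
    # Unweighted plurality vote over the arced trees (non-negative integer labels assumed).
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)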
Article
Bagging (Breiman, 1994a) is a technique that tries to improve a learning algorithm's performance by using bootstrap replicates of the training set (Efron & Tibshirani, 1993; Efron, 1979). The computational requirements for estimating the resultant generalization error on a test set by means of cross-validation are often prohibitive; for leave-one-out cross-validation one needs to train the underlying algorithm a number of times on the order of the training-set size m multiplied by the number of replicates. This paper presents several techniques for estimating the generalization error of a bagged learning algorithm without invoking yet more training of the underlying learning algorithm (beyond that of the bagging itself), as is required by cross-validation-based estimation. These techniques all exploit the bias-variance decomposition (Geman, Bienenstock & Doursat, 1992; Wolpert, 1996). The best of our estimators also exploits stacking (Wolpert, 1992). In a set of experiments reported here, it was found to be more accurate than both the alternative cross-validation-based estimator of the bagged algorithm's error and the cross-validation-based estimator of the underlying algorithm's error. This improvement was particularly pronounced for small test sets. This suggests a novel justification for using bagging: more accurate estimation of the generalization error than is possible without bagging.
Article
Breiman (1996) showed that bagging could effectively reduce the variance of regression predictors while leaving the bias unchanged. A new form of bagging, which we call adaptive bagging, is effective in reducing both bias and variance. The procedure works in stages: the first stage is bagging. Based on the outcomes of the first stage, the output values are altered and a second stage of bagging is carried out using the altered output values. This is repeated until a specified noise level is reached. We give the background theory and test the method using both trees and nearest-neighbor regression methods. Application to two-class classification data gives some interesting results.
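One plausible reading of the staged procedure is sketched below, under the assumption that the "altered output values" are out-of-bag residuals from the previous stage; this is an illustrative approximation, not Breiman's exact algorithm, and the noise-level stopping rule is replaced here by a fixed number of stages.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaptive_bagging(X, y, n_stages=3, n_bags=25, seed=0):
    # Hedged sketch: each stage bags regression trees, then the targets are replaced
    # by out-of-bag residuals before the next stage. Assumes X, y are NumPy arrays.
    rng = np.random.default_rng(seed)
    n = len(y)
    stages, target = [], y.astype(float).copy()
    for _ in range(n_stages):
        bags, oob_sum, oob_cnt = [], np.zeros(n), np.zeros(n)
        for _ in range(n_bags):
            idx = rng.integers(0, n, n)                    # bootstrap sample
            tree = DecisionTreeRegressor().fit(X[idx], target[idx])
            bags.append(tree)
            oob = np.setdiff1d(np.arange(n), idx)          # cases left out of this bag
            oob_sum[oob] += tree.predict(X[oob])
            oob_cnt[oob] += 1
        stages.append(bags)
        oob_pred = np.where(oob_cnt > 0, oob_sum / np.maximum(oob_cnt, 1), 0.0)
        target = target - oob_pred                         # altered outputs: out-of-bag residuals
    return stages

def adaptive_predict(stages, X):
    # Final prediction: sum of the bagged predictions of the successive stages.
    return sum(np.mean([t.predict(X) for t in bags], axis=0) for bags in stages)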
Article
Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
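A minimal sketch of the idea, assuming scikit-learn's BaggingClassifier with CART trees as the unstable base predictor; the dataset and hyperparameters are illustrative. On unstable base learners such as trees, the bagged score is usually at least as good as the single tree, in line with the instability argument above.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
single_tree = DecisionTreeClassifier(random_state=0)
# Bootstrap replicates of the learning set, combined by plurality vote.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())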
Article
Stochastic discrimination is a general methodology for constructing classifiers appropriate for pattern recognition. It is based on combining arbitrary numbers of very weak components, which are usually generated by some pseudorandom process, and it has the property that the very complex and accurate classifiers produced in this way retain the ability, characteristic of their weak component pieces, to generalize to new data. In fact, it is often observed, in practice, that classifier performance on test sets continues to rise as more weak components are added, even after performance on training sets seems to have reached a maximum. This is predicted by the underlying theory, for even though the formal error rate on the training set may have reached a minimum, more sophisticated measures intrinsic to this method indicate that classifier performance on both training and test sets continues to improve as complexity increases. We begin with a review of the method of stochastic discrimination as applied to pattern recognition. Through a progression of examples keyed to various theoretical issues, we discuss considerations involved with its algorithmic implementation. We then take such an algorithmic implementation and compare its performance, on a large set of standardized pattern recognition problems from the University of California Irvine, and Statlog collections, to many other techniques reported on in the literature, including boosting and bagging. In doing these studies, we compare our results to those reported in the literature by the various authors for the other methods, using the same data and study paradigms used by them. Included in the paper is an outline of the underlying mathematical theory of stochastic discrimination and a remark concerning boosting, which provides a theoretical justification for properties of that method observed in practice, including its ability to generalize
Article
Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy
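A hedged sketch of the random subspace idea: each tree is grown on a pseudorandomly selected subset of components of the feature vector, and the forest combines the trees by plurality vote. The function names and the half-of-the-features default are illustrative choices, not taken from the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_forest(X, y, n_trees=50, subspace_size=None, seed=0):
    # Each tree sees only a pseudorandomly chosen subset of the feature components.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    k = subspace_size or max(1, d // 2)
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(d, size=k, replace=False)
        tree = DecisionTreeClassifier().fit(X[:, feats], y)
        forest.append((feats, tree))
    return forest

def subspace_predict(forest, X):
    # Plurality vote over the subspace trees; assumes non-negative integer class labels.
    votes = np.stack([tree.predict(X[:, feats]) for feats, tree in forest])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)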
Article
We study the notions of bias and variance for classification rules. Following Efron (1978) we develop a decomposition of prediction error into its natural components. Then we derive bootstrap estimates of these components and illustrate how they can be used to describe the error behaviour of a classifier in practice. In the process we also obtain a bootstrap estimate of the error of a "bagged" classifier. Keywords: classification, prediction error, bias, variance, bootstrap. Introduction: This article concerns classification rules that have been constructed from a set of training data. The training set X = (x_1, x_2, ..., x_n) consists of n observations x_i = (t_i, g_i), with t_i being the predictor or feature vector and g_i being the response, taking values in {1, 2, ..., K}. On the basis of X the statistician constructs a classification rule C(t; X). Our objective here is to un...
Article
The "minimum margin" of an ensemble classifier on a given training set is, roughly speaking, the smallest vote it gives to any correct training label. Recent work has shown that the Adaboost algorithm is particularly effective at producing ensembles with large minimum margins, and theory suggests that this may account for its success at reducing generalization error. We note, however, that the problem of finding good margins is closely related to linear programming, and we use this connection to derive and test new "LPboosting" algorithms that achieve better minimum margins than Adaboost.
Article
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the ...
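A small worked example of the margin just defined, normalized here by the total number of votes so it lies in [-1, 1]: the margin of an example is the vote count for its correct label minus the largest vote count for any incorrect label, and the minimum over the training set is the "minimum margin" discussed in the preceding abstract. The toy vote counts below are invented for illustration.

import numpy as np

def margins(vote_counts, true_labels):
    # vote_counts[i, c] = number of ensemble members voting class c on example i.
    n = vote_counts.shape[0]
    correct = vote_counts[np.arange(n), true_labels]
    masked = vote_counts.astype(float).copy()
    masked[np.arange(n), true_labels] = -np.inf       # ignore the correct class
    best_wrong = masked.max(axis=1)
    return (correct - best_wrong) / vote_counts.sum(axis=1)

# Toy example: 3 examples, 3 classes, 10 voters each.
votes = np.array([[7, 2, 1],
                  [4, 5, 1],
                  [3, 3, 4]])
y_true = np.array([0, 0, 2])
m = margins(votes, y_true)
print(m, "minimum margin:", m.min())                  # [0.5, -0.1, 0.1], minimum -0.1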
Article
Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a "base" learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization. Keywords: Decision trees, ensemble learning, bagg...
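A hedged sketch of the three-way comparison. C4.5 is not available in scikit-learn, so CART trees stand in: ExtraTreesClassifier randomizes the internal split decisions, BaggingClassifier resamples the training data, and AdaBoostClassifier plays the boosting role. The wine data and 5-fold scores are illustrative, not the paper's experiments.

from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
ensembles = {
    "randomization": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "boosting": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in ensembles.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())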
Article
Introduction: In recent research on combining predictors, it has been recognized that the key to success in combining low-bias predictors such as trees and neural nets is the use of methods that reduce the variability in the predictor due to training set variability. Assume that the training set consists of N independent draws from the same underlying distribution. Conceptually, training sets of size N can be drawn repeatedly and the same algorithm used to construct a predictor on each training set. These predictors will vary, and the extent of the variability is a dominant factor in the generalization prediction error. Given a training set {(y_n, x_n), n = 1, ..., N}, where the y's are either class labels or numerical values, the most common way of reducing variability is by perturbing the training set to produce alternative training sets, growing a predictor on ...
Article
In bagging, predictors are constructed using bootstrap samples from the training set and then aggregated to form a bagged predictor. Each bootstrap sample leaves out about 37% of the examples. These left-out examples can be used to form accurate estimates of important quantities. For instance, they can be used to give much improved estimates of node probabilities and node error rates in decision trees. Using estimated outputs instead of the observed outputs improves accuracy in regression trees. They can also be used to give nearly optimal estimates of generalization errors for bagged predictors. Introduction: We assume that there is a training set T = {(y_n, x_n), n = 1, ..., N} and a method for constructing a predictor Q(x, T) using the given training set. The output variable y can either be a class label (classification) or numerical (regression). In bagging (Breiman [1996a]) a sequence of training sets T_{B,1}, ..., T_{B,K} are generated ...
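A hedged sketch of out-of-bag estimation: each bootstrap sample of size N leaves out a fraction (1 - 1/N)^N ≈ e^-1 ≈ 0.368 of the cases, and aggregating each case's votes only from the trees that did not see it yields an estimate of the bagged predictor's generalization error. The iris data and tree settings are illustrative.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n, n_bags = len(y), 100

vote_counts = np.zeros((n, len(np.unique(y))))
left_out_fraction = []
for _ in range(n_bags):
    idx = rng.integers(0, n, n)                        # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)              # ~37% of cases are left out
    left_out_fraction.append(len(oob) / n)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    pred = tree.predict(X[oob])
    vote_counts[oob, pred] += 1                        # vote only where the case was left out

oob_pred = vote_counts.argmax(axis=1)
print("mean left-out fraction:", np.mean(left_out_fraction))   # close to 0.368
print("out-of-bag error estimate:", np.mean(oob_pred != y))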
Multiple randomized classifiers: MRCL
  • Y Amit
  • G Blanchard
  • K Wilder
Amit, Y., Blanchard, G., & Wilder, K. (1999). Multiple randomized classifiers: MRCL. Technical Report, Department of Statistics, University of Chicago.
An empirical comparison of voting classification algorithms
  • E Bauer
  • R Kohavi
Bauer, E. & Kohavi, R. (1999). An empirical comparison of voting classification algorithms. Machine Learning, 36(1/2), 105–139.
Out-of-bag estimation
  • L Breiman
Breiman, L. (1996b). Out-of-bag estimation. ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps
Arcing classifiers (discussion paper)
  • L Breiman
Breiman, L. (1998a). Arcing classifiers (discussion paper). Annals of Statistics, 26, 801–824.
Using adaptive bagging to debias regressions
  • L Breiman
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Statistics Dept. UCB.
Some infinity theory for predictor ensembles
  • L Breiman
Breiman, L. (2000). Some infinity theory for predictor ensembles. Technical Report 579, Statistics Dept. UCB.
An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization
  • T Dietterich
Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 1–22.
Experiments with a new boosting algorithm
  • Y Freund
  • R Schapire
Freund, Y. & Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148–156.
Some infinity theory for predictor ensembles
  • L Breiman