
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Authors: Ron Kohavi

Abstract and Figures

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment---over half a million runs of C4.5 and a Naive-Bayes algorithm---to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds. 1 Introduction It can not be emphasized eno...
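A minimal sketch of the recommended procedure, ten-fold stratified cross-validation of accuracy, using scikit-learn stand-ins for the two learners studied (a CART-style decision tree in place of C4.5, and Gaussian Naive Bayes in place of the paper's discretized Naive-Bayes); dataset and seeds are illustrative choices, not the paper's setup.

```python
# 10-fold stratified cross-validation accuracy estimate for two learners.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```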
[Figure: estimated accuracy (% acc) versus number of cross-validation folds for the Soybean, Vehicle, and Rand datasets and for the Chess, Hypo, and Mushroom datasets.]
[Figure: estimated accuracy (% acc) versus number of bootstrap samples (1 to 100) for the same six datasets.]
[Figure: standard deviation of the C4.5 and Naive-Bayes accuracy estimates versus number of folds, and of the C4.5 estimates versus number of bootstrap samples, for the Mushroom, Chess, Hypo, Breast, Vehicle, Soybean, and Rand datasets.]
... In weighted Pearson correlation, the reciprocals of the squared uncertainty values of the features are used as weights (Dorst et al., 2022). After ranking the features, a 10-fold stratified CV (Kohavi, 1995) is carried out for every possible number of features, and the minimum CV error determines the optimal number of features l ∈ N. From a mathematical point of view, FS is a mapping F_E → F_S, with F_S ∈ R^(m×l), l < k, containing only the optimal number of the most relevant features according to weighted Pearson correlation. ...
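A hedged sketch of the pipeline described in this excerpt: rank features by weighted Pearson correlation with the target (weights are the reciprocals of the squared per-sample uncertainties), then pick the feature count with the smallest 10-fold stratified CV error. The names X, y, u and the logistic-regression placeholder are illustrative assumptions, not the cited toolbox.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def weighted_pearson(x, y, w):
    # Weighted correlation between one feature x and the target y.
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

def select_n_features(X, y, u, cv_splits=10):
    w = 1.0 / u ** 2                        # reciprocal squared uncertainties
    scores = [abs(weighted_pearson(X[:, j], y, w)) for j in range(X.shape[1])]
    order = np.argsort(scores)[::-1]        # most relevant features first
    cv = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=0)
    errors = []
    for l in range(1, X.shape[1] + 1):      # CV error for every feature count
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, order[:l]], y, cv=cv).mean()
        errors.append(1.0 - acc)
    best_l = int(np.argmin(errors)) + 1
    return order[:best_l], errors
```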
Article
Humans spend most of their lives indoors, so indoor air quality (IAQ) plays a key role in human health. Thus, human health is seriously threatened by indoor air pollution, which leads to 3.8 × 10^6 deaths annually, according to the World Health Organization (WHO). With the ongoing improvement in life quality, IAQ monitoring has become an important concern for researchers. However, in machine learning (ML), measurement uncertainty, which is critical in hazardous gas detection, is usually only estimated using cross-validation and is not directly addressed, and this will be the main focus of this paper. Gas concentration can be determined by using gas sensors in temperature-cycled operation (TCO) and ML on the measured logarithmic resistance of the sensor. This contribution focuses on formaldehyde as one of the most relevant carcinogenic gases indoors and on the sum of volatile organic compounds (VOCs), i.e., acetone, ethanol, formaldehyde, and toluene, measured in the data set as an indicator for IAQ. As gas concentrations are continuous quantities, regression must be used. Thus, a previously published uncertainty-aware automated ML toolbox (UA-AMLT) for classification is extended for regression by introducing an uncertainty-aware partial least squares regression (PLSR) algorithm. The uncertainty propagation of the UA-AMLT is based on the principles described in the Guide to the Expression of Uncertainty in Measurement (GUM) and its supplements. Two different use cases are considered in this contribution for investigating the influence on ML results: model training with raw data, and model training with data manipulated by adding artificially generated white Gaussian or uniform noise to simulate increased data uncertainty. One of the benefits of this approach is to obtain a better understanding of where the overall system should be improved. This can be achieved by either improving the trained ML model or using a sensor with higher precision. Finally, an increase in robustness against random noise by training a model with noisy data is demonstrated.
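A down-scaled sketch of the noise-injection comparison described above, with scikit-learn's plain PLSRegression standing in for the uncertainty-aware PLSR of the toolbox; the synthetic data, noise level, and component count are placeholder assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score

def cv_rmse(X, y, n_components=10):
    # 10-fold cross-validated RMSE of a PLS regression model.
    pls = PLSRegression(n_components=n_components)
    scores = cross_val_score(pls, X, y,
                             cv=KFold(10, shuffle=True, random_state=0),
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

rng = np.random.default_rng(0)
# Placeholders for logarithmic sensor resistances (X) and gas concentration (y).
X = rng.normal(size=(500, 60))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=500)

sigma = 0.2 * X.std()
results = {
    "raw data": cv_rmse(X, y),
    "white Gaussian noise": cv_rmse(X + rng.normal(0.0, sigma, X.shape), y),
    "uniform noise": cv_rmse(X + rng.uniform(-sigma, sigma, X.shape), y),
}
print(results)
```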
... The skill of ANFIS in mapping the input-output relationship depends on the MF type and number. In this study, those parameters were optimized using k-fold cross-validation (Kohavi, 1995). ...
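A hedged sketch of the tuning step described in this excerpt: choose the membership-function (MF) type and the number of MFs for an ANFIS model by k-fold cross-validation. No public ANFIS API is assumed; the model is abstracted as a user-supplied `fit_predict` callable, a hypothetical placeholder for whatever implementation the authors used.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(fit_predict, X, y, mf_type, n_mfs, k=10):
    """Average validation RMSE of one (mf_type, n_mfs) configuration.

    fit_predict(X_tr, y_tr, X_val, mf_type=..., n_mfs=...) -> predictions
    is a hypothetical wrapper around the ANFIS implementation in use.
    """
    errors = []
    for tr, va in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        pred = fit_predict(X[tr], y[tr], X[va], mf_type=mf_type, n_mfs=n_mfs)
        errors.append(np.sqrt(np.mean((np.asarray(pred) - y[va]) ** 2)))
    return float(np.mean(errors))

def tune_anfis(fit_predict, X, y,
               mf_types=("gaussian", "gbell", "triangular"),
               mf_counts=(2, 3, 4, 5)):
    # Grid-search MF type and count; return the configuration with lowest CV RMSE.
    grid = itertools.product(mf_types, mf_counts)
    return min(grid, key=lambda cfg: cv_rmse(fit_predict, X, y, *cfg))
```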
Preprint
Urban flood vulnerability monitoring needs a large amount of socioeconomic and environmental data collected at regular time intervals. Collecting such data volume is a significant constraint in assessing changes in flood vulnerability. This study proposed a novel method to monitor spatiotemporal changes in urban flood vulnerability from satellite nighttime light (NTL) data. Peninsular Malaysia was chosen as the research region. A flood vulnerability index (FVI), estimated from socioeconomic and environmental data for a year, was linked to NTL data using a machine learning algorithm called Adaptive neuro-fuzzy inference system (ANFIS). The model was calibrated and validated with administrative unit scale data and subsequently used to predict FVI at a spatial resolution of 10 km for 2000‒2018 using NTL data. Finally, changes in estimated FVI at different grid points were evaluated using the Mann-Kendall trend method to determine changes in flood vulnerability over time and space. Results showed a nonlinear relationship between NTL and flood vulnerability factors such as population density, Gini coefficient and percentage of foreign nationals. The ANFIS technique showed a good performance in estimating FVI from NTL data with a normalized root-mean-square error of 0.68 and Kling-Gupta Efficiency of 0.73. The FVI revealed a high vulnerability in the urbanized western coastal region (FVI ~ 0.5 to 0.54), which matches well with major contributing regions to flood losses in Peninsular Malaysia. Trend assessment showed a significant increase in flood vulnerability in the study area from 2000 to 2018. The spatial distribution of the trend indicated an increase in FVI in the urbanized coastal plains, particularly in rapidly developing western and southern urban regions. The results indicate the potential of the technique in urban flood vulnerability assessment using freely available satellite NTL data.
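The two evaluation tools named above, the Kling-Gupta Efficiency and the Mann-Kendall trend statistic, can be written down compactly. The sketch below uses their standard definitions (an assumption; the study may use tie-corrected or otherwise modified variants).

```python
import numpy as np

def kling_gupta_efficiency(sim, obs):
    # KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), standard form.
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)      # variability ratio
    beta = np.mean(sim) / np.mean(obs)     # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def mann_kendall_z(x):
    # Mann-Kendall S statistic and its normal approximation (no tie correction).
    x = np.asarray(x)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / np.sqrt(var_s)
    if s < 0:
        return (s + 1) / np.sqrt(var_s)
    return 0.0
```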
... This form of CV has been used in clinical research (e.g., [60,61]) due to particularly small data set sizes. However, Kohavi [62] has empirically shown that although leave-one-out estimates are almost unbiased, the variance of the estimates can be large. In addition, it has been argued that repeated random splits lead to more stable results [63]. ...
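A sketch of the two resampling schemes contrasted in this excerpt, leave-one-out versus repeated random train/test splits, on a deliberately small sample; the dataset and classifier are placeholders chosen only to make the snippet runnable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (LeaveOneOut, ShuffleSplit,
                                     cross_val_score, train_test_split)

X, y = load_breast_cancer(return_X_y=True)
# Keep only 60 cases (stratified) to mimic a small clinical sample.
X, _, y, _ = train_test_split(X, y, train_size=60, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=5000)

loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
splits = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
rep_scores = cross_val_score(clf, X, y, cv=splits)

print(f"leave-one-out accuracy : {loo_acc:.3f}")
print(f"repeated random splits : {rep_scores.mean():.3f} "
      f"(spread across splits: {rep_scores.std():.3f})")
```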
Preprint
By promising more accurate diagnostics and individual treatment recommendations, deep neural networks, and in particular convolutional neural networks, have become a powerful tool in medical imaging. Here, we first give an introduction to key methodological concepts and the promises they entail, including representation and transfer learning, as well as modelling of domain-specific priors. After reviewing recent applications within neuroimaging-based psychiatric research, such as the diagnosis of psychiatric diseases, delineation of disease subtypes, normative modeling, and the development of neuroimaging biomarkers, we discuss current challenges. These include, for example, the difficulty of training models on small, heterogeneous, and biased data sets, the lack of validity of clinical labels, algorithmic bias, and the influence of confounding variables.
Article
Scene classification is an important problem in remote sensing (RS) and has attracted a lot of research in the past decade. Nowadays, most proposed methods are based on deep convolutional neural network (CNN) models, and many pretrained CNN models have been investigated. Ensemble techniques are well studied in the machine learning community; however, few works have used them in RS scene classification. In this work, we propose an ensemble approach, called RS-DeepSuperLearner, that fuses the outputs of five advanced CNN models, namely, VGG16, Inception-V3, DenseNet121, InceptionResNet-V2, and EfficientNet-B3. First, we improve the architecture of the five CNN models by attaching an auxiliary branch at specific layer locations. In other words, each model now has two output layers producing predictions, and the final prediction is the average of the two. The RS-DeepSuperLearner method starts by fine-tuning the five CNN models using the training data. Then, it employs a deep neural network (DNN) SuperLearner to learn the best way to fuse the outputs of the five CNN models by training it on the predicted probability outputs and the cross-validation accuracies (per class) of the individual models. The proposed methodology was assessed on six publicly available RS datasets: UC Merced, KSA, RSSCN7, Optimal31, AID, and NWPU-RSC45. The experimental results demonstrate its superior capabilities when compared to state-of-the-art methods in the literature.
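A heavily down-scaled sketch of the fusion idea: collect out-of-fold class probabilities from several base models and train a small neural-network "super learner" on the concatenated probabilities. Scikit-learn models and the digits dataset stand in for the five fine-tuned CNNs and the RS benchmarks, and the per-class CV-accuracy inputs used in the paper are omitted here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [LogisticRegression(max_iter=2000),
               RandomForestClassifier(n_estimators=200, random_state=0)]

# Level 0: out-of-fold probabilities, so the fuser never sees leaked labels.
meta_train = np.hstack([cross_val_predict(m, X_tr, y_tr, cv=5,
                                          method="predict_proba")
                        for m in base_models])
meta_test = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te)
                       for m in base_models])

# Level 1: small DNN fusing the base-model probabilities.
fuser = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
fuser.fit(meta_train, y_tr)
print("fused test accuracy:", fuser.score(meta_test, y_te))
```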
Article
Introduction Early diagnosis of cancer enhances treatment planning and improves prognosis. Many masses presenting to veterinary clinics are difficult to diagnose without using invasive, time-consuming, and costly tests. Our objective was to perform a preliminary proof-of-concept for the HT Vista device, a novel artificial intelligence-based thermal imaging system, developed and designed to differentiate benign from malignant, cutaneous and subcutaneous masses in dogs. Methods Forty-five dogs with a total of 69 masses were recruited. Each mass was clipped and heated by the HT Vista device. The heat emitted by the mass and its adjacent healthy tissue was automatically recorded using a built-in thermal camera. The thermal data from both areas were subsequently analyzed using an Artificial Intelligence algorithm. Cytology and/or biopsy results were later compared to the results obtained from the HT Vista system and used to train the algorithm. Validation was done using a “Leave One Out” cross-validation to determine the algorithm's performance. Results The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of the system were 90%, 93%, 88%, 83%, and 95%, respectively for all masses. Conclusion We propose that this novel system, with further development, could be used to provide a decision-support tool enabling clinicians to differentiate between benign lesions and those requiring additional diagnostics. Our study also provides a proof-of-concept for ongoing prospective trials for cancer diagnosis using advanced thermodynamics and machine learning procedures in companion dogs.
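A sketch of the "leave one out" validation and the reported metrics: collect one held-out prediction per mass and derive accuracy, sensitivity, specificity, PPV, and NPV from the confusion matrix. The random features and random-forest classifier below are placeholders, not the device's proprietary algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(69, 20))        # 69 masses, 20 placeholder thermal features
y = rng.integers(0, 2, size=69)      # 1 = malignant, 0 = benign (synthetic labels)

pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("accuracy   ", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity", tp / (tp + fn))
print("specificity", tn / (tn + fp))
print("PPV        ", tp / (tp + fp))
print("NPV        ", tn / (tn + fn))
```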
Article
Deep learning techniques are proving instrumental in identifying, classifying, and quantifying patterns in medical images. Segmentation is one of the important applications in medical image analysis. The U-Net has become the predominant deep-learning approach to medical image segmentation tasks. Existing U-Net-based models have limitations in several respects, however, including: the requirement for millions of parameters in the U-Net, which consumes considerable computational resources and memory; the lack of global information; and incomplete segmentation in difficult cases. To remove some of those limitations, we built on our previous work and applied two modifications to improve the U-Net model: (1) we designed and added the dilated channel-wise CNN module, and (2) we simplified the U-shape network. We then proposed a novel lightweight architecture, the Channel-wise Feature Pyramid Network for Medicine (CFPNet-M). To evaluate our method, we selected five datasets from different imaging modalities: thermography, electron microscopy, endoscopy, dermoscopy, and digital retinal images. We compared its performance with several models having a variety of complexities. We used the Tanimoto similarity instead of the Jaccard index for gray-level image comparisons. CFPNet-M achieves segmentation results on all five medical datasets that are comparable to those of existing methods, yet it requires only 8.8 MB of memory and just 0.65 million parameters, which is about 2% of U-Net. Unlike other deep-learning segmentation methods, this new approach is suitable for real-time application: its inference speed can reach 80 frames per second when implemented on a single RTX 2070Ti GPU with an input image size of 256 × 192 pixels.
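The Tanimoto similarity mentioned above reduces to the Jaccard index (intersection over union) for binary masks but also handles gray-level predictions; a minimal sketch, assuming the usual inner-product form of the definition.

```python
import numpy as np

def tanimoto(pred, target, eps=1e-8):
    # T(a, b) = <a, b> / (|a|^2 + |b|^2 - <a, b>), computed on flattened images.
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    dot = np.dot(pred, target)
    return dot / (np.dot(pred, pred) + np.dot(target, target) - dot + eps)

# For binary masks this equals intersection over union:
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
print(tanimoto(a, b))   # 1/3, the Jaccard index of the two masks
```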
Article
Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMED), which integrates scientific-discipline wisdom, in a scientist-in-the-loop approach, with state-of-the-art symbolic regression (SR) methods. SciMED combines a wrapper selection method based on a genetic algorithm with automatic machine learning and two levels of SR methods. We test SciMED on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMED is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages, even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMED can alert the user about possible missing features, unlike the majority of current SR systems.
Article
The design of a pattern recognition system requires careful attention to error estimation. The error rate is the most important descriptor of a classifier's performance. The commonly used estimates of error rate are based on the holdout method, the resubstitution method, and the leave-one-out method. All suffer either from large bias or large variance, and their sample distributions are not known. Bootstrapping refers to a class of procedures that resample given data by computer. It permits determining the statistical properties of an estimator when very little is known about the underlying distribution and no additional samples are available. Since its publication in the last decade, the bootstrap technique has been successfully applied to many statistical estimation and inference problems. However, it has not been exploited in the design of pattern recognition systems. We report results on the application of several bootstrap techniques in estimating the error rate of 1-NN and quadratic classifiers. Our experiments show that, in most cases, the confidence interval of a bootstrap estimator of classification error is smaller than that of the leave-one-out estimator. The errors of the 1-NN, quadratic, and Fisher classifiers are estimated for several real data sets.
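A minimal sketch of one bootstrap error estimator of the kind discussed above, applied to a 1-NN classifier: train on each bootstrap resample, score on the cases left out of it, and compare with the ordinary leave-one-out estimate. The dataset and the number of resamples (B = 200) are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
B, n = 200, len(y)

errs = []
for _ in range(B):
    boot = rng.integers(0, n, size=n)            # sample indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)       # out-of-bag cases for testing
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[boot], y[boot])
    errs.append(1.0 - clf.score(X[oob], y[oob]))
print("bootstrap error estimate:", np.mean(errs))

loo = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
print("leave-one-out error     :", 1.0 - loo.mean())
```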
Article
This paper introduces stacked generalization, a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. When used with multiple generalizers, stacked generalization can be seen as a more sophisticated version of cross-validation, exploiting a strategy more sophisticated than cross-validation's crude winner-takes-all for combining the individual generalizers. When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question. After introducing stacked generalization and justifying its use, this paper presents two numerical experiments. The first demonstrates how stacked generalization improves upon a set of separate generalizers for the NETtalk task of translating text to phonemes. The second demonstrates how stacked generalization improves the performance of a single surface-fitter. With the other experimental evidence in the literature, the usual arguments supporting cross-validation, and the abstract justifications presented in this paper, the conclusion is that for almost any real-world generalization problem one should use some version of stacked generalization to minimize the generalization error rate. This paper ends by discussing some of the variations of stacked generalization, and how it touches on other fields like chaos theory.
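A library-level sketch of the scheme described above: scikit-learn's StackingClassifier trains the level-1 combiner on cross-validated predictions of the level-0 generalizers, rather than using cross-validation's winner-takes-all. The particular models and dataset are illustrative choices, not those of the paper's NETtalk experiments.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=5,                          # level-1 inputs come from 5-fold CV predictions
    stack_method="predict_proba",  # combine class probabilities, not hard votes
)
print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```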
Conference Paper
We evaluate the performance of weakest-link pruning of decision trees using cross-validation. This technique maps tree pruning into a problem of tree selection: find the best (i.e., the right-sized) tree from a set of trees ranging in size from the unpruned tree to a null tree. For samples with at least 200 cases, extensive empirical evidence supports the following conclusions relative to tree selection: (a) 10-fold cross-validation is nearly unbiased; (b) not pruning a covering tree is highly biased; (c) 10-fold cross-validation is consistent with optimal tree selection for large sample sizes; and (d) the accuracy of tree selection by 10-fold cross-validation is largely dependent on sample size, irrespective of the population distribution.
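A sketch of weakest-link pruning with cross-validated tree selection, using scikit-learn's minimal cost-complexity pruning path as a stand-in for the procedure evaluated in the paper: each alpha on the path corresponds to one tree in the nested pruning sequence, and 10-fold CV picks the "right-sized" one. Dataset and seed are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The pruning path: one ccp_alpha per subtree, from the full tree to the root.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

cv_acc = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=10).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_acc))]
print("selected ccp_alpha:", best_alpha, "with CV accuracy:", max(cv_acc))
```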
Article
It is commonly accepted that statistical modeling should follow the parsimony principle; namely, that simple models should be given priority whenever possible. But little is known quantitatively about how large a penalty for complexity should be regarded as allowable. We try to understand the parsimony principle in the context of model selection. In particular, the generalized final prediction error criterion is considered, and we argue that the penalty term should be chosen between 1.5 and 5 for most practical situations. Applying our results to the cross-validation criterion, we obtain insights into how the partition of the data should be done. We also discuss the small-sample performance of our methods.
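For orientation, one common form of the generalized final prediction error criterion is sketched below; this is an assumed variant, stated only to make the role of the penalty term concrete, and the article's exact definition may differ.

```latex
% Generalized FPE with penalty lambda (assumed form): for a candidate model
% with p parameters, residual sum of squares RSS_p, and n observations,
% choose the model minimising
\[
  \mathrm{FPE}_{\lambda}(p) \;=\; \frac{\mathrm{RSS}_p}{n}
  \Bigl(1 + \frac{\lambda\, p}{n}\Bigr).
\]
% lambda = 2 corresponds (to first order) to Akaike's classical FPE, while the
% article argues for choosing lambda between roughly 1.5 and 5.
```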
Article
We construct a prediction rule on the basis of some data, and then wish to estimate the error rate of this rule in classifying future observations. Cross-validation provides a nearly unbiased estimate, using only the original data. Cross-validation turns out to be related closely to the bootstrap estimate of the error rate. This article has two purposes: to understand better the theoretical basis of the prediction problem, and to investigate some related estimators, which seem to offer considerably improved estimation in small samples.
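One of the improved small-sample estimators studied in this line of work is the .632 bootstrap estimator, stated here in its standard form; the notation below is mine, not taken from the article.

```latex
% .632 bootstrap estimate of prediction error: \bar{err} is the apparent
% (resubstitution) error of the rule on its own training data, and
% \widehat{Err}^{(1)} is the leave-one-out bootstrap error, i.e. the average
% error on observations not contained in the bootstrap sample used for fitting.
\[
  \widehat{\mathrm{Err}}^{(.632)}
  \;=\; 0.368\,\overline{\mathrm{err}}
  \;+\; 0.632\,\widehat{\mathrm{Err}}^{(1)}.
\]
```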
Article
We consider the problem of selecting a model having the best predictive ability among a class of linear models. The popular leave-one-out cross-validation method, which is asymptotically equivalent to many other model selection methods such as the Akaike information criterion (AIC), the C_p, and the bootstrap, is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive ability does not converge to 1 as the total number of observations n → ∞. We show that the inconsistency of leave-one-out cross-validation can be rectified by using a leave-n_v-out cross-validation with n_v, the number of observations reserved for validation, satisfying n_v/n → 1 as n → ∞. This is a somewhat shocking discovery, because n_v/n → 1 is totally opposite to the popular leave-one-out recipe in cross-validation. Motivations, justifications, and discussions of some practical aspects of the use of the leave-n_v-out cross-validation method are provided, and results from a simulation study are presented.
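A sketch of the leave-n_v-out idea using scikit-learn's ShuffleSplit: reserve a large validation fraction (here 80%, so n_v/n is close to 1 in the spirit of the consistency condition above) and average over many random splits when comparing candidate linear models. The dataset and the two candidate models are placeholder assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_diabetes(return_X_y=True)
# Monte Carlo CV with n_v = 0.8 * n observations held out per split.
mccv = ShuffleSplit(n_splits=200, test_size=0.8, random_state=0)

for name, model in [("full linear model", LinearRegression()),
                    ("sparse (lasso) model", Lasso(alpha=1.0))]:
    score = cross_val_score(model, X, y, cv=mccv,
                            scoring="neg_mean_squared_error")
    print(f"{name}: leave-n_v-out CV MSE = {-score.mean():.1f}")
```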
Conference Paper
Many aspects of concept learning research can be understood more clearly in light of a basic mathematical result stating, essentially, that positive performance in some learning situations must be offset by an equal degree of negative performance in others. We present a proof of this result and comment on some of its theoretical and practical ramifications.
Chapter
Statistics is a subject of many uses and surprisingly few effective practitioners. The traditional road to statistical knowledge is blocked, for most, by a formidable wall of mathematics. The approach in An Introduction to the Bootstrap avoids that wall. It arms scientists and engineers, as well as statisticians, with the computational techniques they need to analyze and understand complicated data sets.