Article · PDF Available

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Authors: Ron Kohavi

Abstract and Figures

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment---over half a million runs of C4.5 and a Naive-Bayes algorithm---to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds. 1 Introduction It cannot be emphasized eno...
[Figures omitted: the panels plot estimated accuracy (% acc) against the number of cross-validation folds (2, 5, 10, 20; negative values denote leave-k-out) and against the number of bootstrap samples (1 to 100) for the Soybean, Vehicle, and Rand datasets (roughly 45-75% accuracy) and the Chess, Hypo, and Mushroom datasets (roughly 96-100% accuracy), together with the standard deviation of the C4.5 and Naive-Bayes estimates against folds and bootstrap samples for the Mushroom, Chess, Hypo, Breast, Vehicle, Soybean, and Rand datasets.]
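The experimental setup summarized in the abstract, estimating accuracy by k-fold cross-validation with and without stratification, can be sketched as follows. This is a minimal illustration in scikit-learn, with a decision tree and Gaussian Naive-Bayes standing in for the paper's C4.5 and Naive-Bayes inducers; the dataset, fold counts, and library are assumptions, not the MLC++ setup used in the original experiments.

```python
# Sketch: accuracy estimates as the number of folds varies, with and without
# stratification, for a decision tree (stand-in for C4.5) and Gaussian Naive-Bayes.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for name, est in [("C4.5-like tree", DecisionTreeClassifier(random_state=0)),
                  ("Naive-Bayes", GaussianNB())]:
    for folds in (2, 5, 10, 20):
        plain = cross_val_score(est, X, y, cv=KFold(folds, shuffle=True, random_state=0)).mean()
        strat = cross_val_score(est, X, y, cv=StratifiedKFold(folds, shuffle=True, random_state=0)).mean()
        print(f"{name:14s} {folds:2d} folds  plain={plain:.3f}  stratified={strat:.3f}")
```

Stratification only changes how the folds are formed, so the two estimates can be compared fold for fold; the bootstrap side of the comparison is sketched further below next to the bootstrap reference.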
... The average of the five performance measurements on the validation sets provided the cross-validated performance metric. Additionally, another independent cohort (15 PD patients and 11 healthy controls) was used as a validation set in each fold to evaluate the performance of the classifier [39][40][41]. ...
Article
Full-text available
Background Parkinson’s Disease (PD) is a neurodegenerative disorder, and eye movement abnormalities are a significant diagnostic symptom. In this paper, we developed a multi-task paradigm driven by eye movements in a virtual reality (VR) environment to elicit PD-specific eye movement abnormalities. The abnormal features were subsequently modeled using the proposed deep learning algorithm to achieve an auxiliary diagnosis of PD. Methods We recruited 114 PD patients and 125 healthy controls and collected their eye-tracking data in a VR environment. Participants completed a series of specific VR tasks, including gaze stability, pro-saccades, anti-saccades, and smooth pursuit. After the tasks, eye movement features were extracted from the behaviors of fixations, saccades, and smooth pursuit to establish a PD diagnostic model. Results The performance of the models was evaluated through cross-validation, revealing a recall of 97.65%, an accuracy of 92.73%, and a receiver operating characteristic area under the curve (ROC-AUC) of 97.08% for the proposed model. Conclusion We extracted PD-specific eye movement features from the behaviors of fixations, saccades, and smooth pursuit in a VR environment to create a model with high accuracy and recall for PD diagnosis. Our method provides physicians with a new auxiliary tool to improve the prognosis and quality of life of PD patients.
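A rough sketch of the kind of cross-validated, multi-metric evaluation reported above (recall, accuracy, and ROC-AUC averaged over folds), using scikit-learn's cross_validate. The logistic-regression stand-in, the synthetic data, and the five folds are assumptions, not the study's actual pipeline.

```python
# Sketch: cross-validated recall, accuracy and ROC-AUC for a binary classifier.
# The logistic-regression stand-in and synthetic data are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, StratifiedKFold

X, y = make_classification(n_samples=239, n_features=20, random_state=0)  # same size as the cohort, content assumed
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring=("recall", "accuracy", "roc_auc"),
)
for metric in ("recall", "accuracy", "roc_auc"):
    print(metric, round(np.mean(scores[f"test_{metric}"]), 4))
```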
... ➢ Models were trained and validated using cross-validation techniques to ensure robustness [10]. ...
Article
Full-text available
Chronic pain is a prevalent condition that significantly impacts patients' quality of life and poses substantial challenges to healthcare systems. The integration of health informatics, particularly predictive analytics and digital health tools, has emerged as a transformative approach in chronic pain management. This study investigates the role of health informatics in optimizing chronic pain care by analyzing secondary data and synthesizing evidence from peer-reviewed literature. Predictive analytics, powered by machine learning models, demonstrated enhanced accuracy in pain prediction, enabling early identification of at-risk patients and the personalization of treatment strategies. Digital health tools, including wearable devices, mobile health applications, and telemedicine platforms, fostered real-time monitoring and patient engagement, leading to improved adherence and clinical outcomes. However, significant barriers such as data standardization, algorithmic bias, and privacy concerns impede the broader adoption of these innovations. This paper highlights the need for standardized data frameworks, diverse training datasets, and transparent policies to address these challenges. The findings underscore the potential of health informatics to revolutionize chronic pain management and provide actionable recommendations for advancing research, clinical practice, and policy development.
... This technique ensures each fold contains a balanced representation of all classes, reducing bias in the assessment. Metrics like accuracy, precision, recall, and ROC-AUC were measured, providing a comprehensive evaluation of the model's performance across different data splits [80]. Figure 2 shows the end-to-end workflow of our proposed aero-engine defect detection approach. ...
Article
Full-text available
This study explores the impact of transfer learning on enhancing deep learning models for detecting defects in aero-engine components. We focused on metrics such as accuracy, precision, recall, and loss to compare the performance of the VGG19 and DeiT (data-efficient image transformer) models. RandomSearchCV was used for hyperparameter optimization, and we selectively froze some layers during training to better tailor the models to our dataset. We conclude that the difference in performance across all metrics can be attributed to the DeiT model's transformer-based architecture, which captures complex patterns in the data well. This research demonstrates that transformer models hold promise for improving the accuracy and efficiency of defect detection within the aerospace industry, which will, in turn, contribute to cleaner and more sustainable aviation activities.
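A minimal PyTorch sketch of the layer-freezing step described above: a VGG19 feature extractor is frozen and only a new classification head is trained. The two-class head, dummy batch, and optimizer settings are assumptions for illustration; the study's models, RandomSearchCV search, and data are not reproduced here.

```python
# Sketch: transfer learning by freezing pretrained VGG19 features and
# training a new classification head (illustrative, not the authors' setup).
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg19(weights=None)          # pass weights="IMAGENET1K_V1" to load pretrained weights
for param in model.features.parameters():   # freeze the convolutional feature extractor
    param.requires_grad = False

num_classes = 2                              # e.g. defective vs. non-defective (assumed)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 224, 224)              # dummy batch standing in for aero-engine images
labels = torch.randint(0, num_classes, (4,))
loss = criterion(model(x), labels)
loss.backward()
optimizer.step()
```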
... The goal of BO is to iteratively and efficiently search through the model space to find the optimal configuration. BO has proven to be a powerful tool for hyperparameter tuning, as it carefully selects new hyperparameters in each iteration to refine both prediction effectiveness and training time (Kohavi 1995; Wang et al. 2020; Malkomes et al. 2016). This approach improves model performance by rapidly converging on the best set of hyperparameters. ...
Article
Full-text available
Geothermal energy is a sustainable resource for power generation, particularly in Yemen. Efficient utilization necessitates accurate forecasting of subsurface temperatures, which is challenging with conventional methods. This research leverages machine learning (ML) to optimize geothermal temperature forecasting in Yemen's western region. The data set, collected from 108 geothermal wells, was divided into two sets: set 1 with 1402 data points and set 2 with 995 data points. Feature engineering prepared the data for model training. We evaluated a suite of machine learning regression models, from simple linear regression (SLR) to multi-layer perceptron (MLP). Hyperparameter tuning using Bayesian optimization (BO) was selected as the optimization process to boost model accuracy and performance. The MLP model outperformed the others, achieving high R² values and low error values across all metrics after BO. Specifically, MLP achieved an R² of 0.999, with an MAE of 0.218, RMSE of 0.285, RAE of 4.071%, and RRSE of 4.011%. BO significantly upgraded the Gaussian process model, achieving an R² of 0.996, a minimum MAE of 0.283, RMSE of 0.575, RAE of 5.453%, and RRSE of 8.717%. The models demonstrated robust generalization capabilities with high R² values and low error metrics (MAE and RMSE) across all sets. This study highlights the potential of enhanced ML techniques and the novel BO in optimizing geothermal energy resource exploitation, contributing significantly to renewable energy research and development.
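A hedged sketch of Bayesian-optimization hyperparameter tuning in the spirit described above, using scikit-optimize's BayesSearchCV around an MLP regressor. The choice of scikit-optimize, the search space, and the synthetic data are assumptions; the study's own features and configurations are not reproduced.

```python
# Sketch: Bayesian-optimization hyperparameter tuning (scikit-optimize) for an MLP regressor.
# The search space and data below are illustrative assumptions.
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

search = BayesSearchCV(
    estimator=MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0),
    search_spaces={
        "alpha": Real(1e-6, 1e-1, prior="log-uniform"),            # L2 regularization strength
        "learning_rate_init": Real(1e-4, 1e-1, prior="log-uniform"),
    },
    n_iter=25,      # number of Bayesian-optimization evaluations
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_, "CV R^2:", round(search.best_score_, 3))
```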
... ➢ The dataset was split into training (70%) and testing (30%) sets using stratified sampling to ensure proportional representation of outcome classes. Stratification was crucial to address the potential class imbalance in mortality and LOS categories [11]. ...
Article
Full-text available
This study explores the transformative potential of machine learning (ML) in predicting clinical outcomes in intensive care units (ICUs). By analyzing a dataset of 35,000 ICU admissions, the research evaluates the performance of advanced ML algorithms, including XGBoost, Random Forest, Neural Networks, and Logistic Regression, in predicting mortality and length of stay (LOS). The findings demonstrate that ML models, particularly XGBoost, outperform traditional methods like Logistic Regression, achieving superior accuracy, precision, and ROC-AUC scores. The study highlights critical predictors such as age, lactate levels, and CRP levels, offering insights into their clinical relevance. Emphasizing the integration of explainable AI and addressing challenges like data quality and ethical concerns, the research underscores ML’s role in enhancing ICU decision-making and operational efficiency. This work contributes to the growing field of data-driven healthcare and provides a framework for advancing ML applications in critical care settings.
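The stratified 70/30 split described in the citation context above comes down to a single call; the placeholder labels below simulate an imbalanced outcome such as mortality, and the ICU data themselves are not reproduced.

```python
# Sketch: stratified 70/30 train/test split preserving class proportions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 12))                                # placeholder features
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])           # imbalanced outcome (assumed rate)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

print("train positives:", round(float(y_train.mean()), 3),
      "test positives:", round(float(y_test.mean()), 3))
```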
Article
Vegetation overgrowth in rivers worldwide is a considerable problem because it can potentially reduce the flood-flowing capacity and cause biodiversity loss. In this study, we developed a model to predict vegetation recruitment during the initial stages of secondary succession, which leads to vegetation overgrowth. This study chose a logistic regression model to predict vegetation recruitment because of its simplicity and lower computational load than machine learning methods. The model was designed for the Kinu River in Japan, which is associated with extensive vegetation overgrowth. Data for the model development were obtained from unmanned aerial vehicle (UAV) aerial surveys and public databases. To ensure the model's applicability beyond the training rivers, we trained the logistic regression model across different river flows and geomorphic characteristics, including normal and flood times and gravel and sand beds. The results indicated that the logistic regression model with three explanatory variables, namely distance from the river stream, relative height, and vegetation existence history, was optimal for all rivers, with F-measures in the range of 0.79 to 0.85. In addition, using UAV imagery allows for high spatial resolution in predicting vegetation recruitment. The best model prediction map of vegetation recruitment demonstrated that it could accurately predict the vegetation distribution along the main river channel for gravel and sand beds. The simplicity of the present model would be advantageous when applied to other rivers with similar topographic and biological characteristics within the same river segment without hydrodynamic calculations.
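A minimal sketch of the kind of model described above: a logistic regression on three explanatory variables, scored by the F-measure. The synthetic inputs stand in for distance from the river stream, relative height, and vegetation existence history; the scales and coefficients are assumptions.

```python
# Sketch: logistic regression with three explanatory variables, scored by F-measure.
# Synthetic stand-ins for distance-to-stream, relative height and vegetation history.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 2000
distance = rng.uniform(0, 200, n)          # distance from the river stream (assumed scale)
rel_height = rng.uniform(0, 5, n)          # relative height above water level (assumed scale)
history = rng.integers(0, 2, n)            # vegetation existed previously (0/1)
X = np.column_stack([distance, rel_height, history])

# Synthetic recruitment labels just to make the example runnable.
logit = -2.0 + 0.01 * distance + 0.6 * rel_height + 1.5 * history
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("F-measure:", round(f1_score(y_te, model.predict(X_te)), 3))
```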
Article
Full-text available
Emphasis on future environmental changes grows due to climate change, with simulations predicting rising river temperatures globally. For Poland, which has a long history of thermal studies of rivers, such an approach has not been implemented to date. This study used 9 Global Climate Models and tested three machine-learning techniques to predict river temperature changes. Random Forest performed best, with R² = 0.88 and the lowest error (RMSE: 2.25, MAE: 1.72). The range of future water temperature changes by the end of the 21st century was based on the Shared Socioeconomic Pathway scenarios SSP2-4.5 and SSP5-8.5. It was determined that by the end of the 21st century, the average temperature will increase by 2.1°C (SSP2-4.5) and 3.7°C (SSP5-8.5). A more detailed analysis, divided by the two major basins, Vistula and Odra, covered about 90% of Poland’s territory. The average temperature increase, according to scenarios SSP2-4.5 and SSP5-8.5, is 1.6°C and 3.2°C for the Odra basin rivers and 2.3°C and 3.8°C for the Vistula basin rivers, respectively. The Vistula basin’s higher warming is due to less groundwater input and continental climate influence. These findings provide a crucial basis for water management to mitigate warming effects in Poland.
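For orientation, the headline regression metrics above (R², RMSE, MAE) for a Random Forest model can be computed as sketched below; the synthetic data and hyperparameters are placeholders, not the study's climate inputs.

```python
# Sketch: fitting a Random Forest regressor and reporting R^2, RMSE and MAE
# (placeholder data; the study's climate-model inputs are not reproduced).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

X, y = make_regression(n_samples=2000, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("R2  :", round(r2_score(y_te, pred), 3))
print("RMSE:", round(float(np.sqrt(mean_squared_error(y_te, pred))), 3))
print("MAE :", round(mean_absolute_error(y_te, pred), 3))
```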
Article
Full-text available
The design of a pattern recognition system requires careful attention to error estimation. The error rate is the most important descriptor of a classifier's performance. The commonly used estimates of error rate are based on the holdout method, the resubstitution method, and the leave-one-out method. All suffer either from large bias or large variance, and their sample distributions are not known. Bootstrapping refers to a class of procedures that resample given data by computer. It permits determining the statistical properties of an estimator when very little is known about the underlying distribution and no additional samples are available. Since its publication in the last decade, the bootstrap technique has been successfully applied to many statistical estimation and inference problems. However, it has not been exploited in the design of pattern recognition systems. We report results on the application of several bootstrap techniques in estimating the error rate of 1-NN and quadratic classifiers. Our experiments show that, in most cases, the confidence interval of a bootstrap estimator of classification error is smaller than that of the leave-one-out estimator. The error rates of 1-NN, quadratic, and Fisher classifiers are estimated for several real data sets.
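A small sketch in the spirit of the comparison described above: the bootstrap error rate of a 1-NN classifier with a percentile confidence interval, next to the leave-one-out estimate. The dataset, the 200 bootstrap replicates, and the percentile interval are assumptions.

```python
# Sketch: bootstrap estimate of 1-NN error rate with a percentile confidence
# interval, compared to the leave-one-out estimate (illustrative assumptions).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(y)

# Leave-one-out error rate.
loo_err = 1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut()).mean()

# Bootstrap: train on a resample, test on the cases left out of that resample.
boot_errs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)               # draw n cases with replacement
    oob = np.setdiff1d(np.arange(n), idx)          # cases not drawn form the test set
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
    boot_errs.append(1.0 - clf.score(X[oob], y[oob]))

lo, hi = np.percentile(boot_errs, [2.5, 97.5])
print(f"LOO error: {loo_err:.3f}  bootstrap error: {np.mean(boot_errs):.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```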
Article
Full-text available
This paper introduces stacked generalization, a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. When used with multiple generalizers, stacked generalization can be seen as a more sophisticated version of cross-validation, exploiting a strategy more sophisticated than cross-validation's crude winner-takes-all for combining the individual generalizers. When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question. After introducing stacked generalization and justifying its use, this paper presents two numerical experiments. The first demonstrates how stacked generalization improves upon a set of separate generalizers for the NETtalk task of translating text to phonemes. The second demonstrates how stacked generalization improves the performance of a single surface-fitter. With the other experimental evidence in the literature, the usual arguments supporting cross-validation, and the abstract justifications presented in this paper, the conclusion is that for almost any real-world generalization problem one should use some version of stacked generalization to minimize the generalization error rate. This paper ends by discussing some of the variations of stacked generalization, and how it touches on other fields like chaos theory.
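A compact sketch of stacked generalization as described above: level-0 generalizers produce out-of-fold predictions that train a level-1 combiner. scikit-learn's StackingClassifier implements this pattern; the base learners and dataset here are arbitrary choices.

```python
# Sketch: stacked generalization, where out-of-fold predictions of level-0 models
# feed a level-1 combiner (base learners and data are illustrative choices).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # level-1 generalizer
    cv=5,                                               # folds used to build the level-1 training data
)
print("stacked CV accuracy:", round(float(cross_val_score(stack, X, y, cv=10).mean()), 3))
```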
Conference Paper
Full-text available
We evaluate the performance of weakest-link pruning of decision trees using cross-validation. This technique maps tree pruning into a problem of tree selection: find the best (i.e., the right-sized) tree from a set of trees ranging in size from the unpruned tree to a null tree. For samples with at least 200 cases, extensive empirical evidence supports the following conclusions relative to tree selection: (a) 10-fold cross-validation is nearly unbiased; (b) not pruning a covering tree is highly biased; (c) 10-fold cross-validation is consistent with optimal tree selection for large sample sizes; and (d) the accuracy of tree selection by 10-fold cross-validation is largely dependent on sample size, irrespective of the population distribution.
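The tree-selection procedure described above, generating a nested sequence of pruned trees and picking the right size by 10-fold cross-validation, can be sketched with scikit-learn's cost-complexity pruning path (a CART-style weakest-link pruning); the dataset is a placeholder.

```python
# Sketch: weakest-link (cost-complexity) pruning with the pruning level chosen
# by 10-fold cross-validation (CART-style; dataset is a placeholder).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The pruning path yields the candidate trees, from the unpruned tree (alpha = 0)
# up to the trivial root-only tree (largest alpha).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]   # drop the largest alpha, which collapses the tree to its root

cv_scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                             X, y, cv=10).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]
print("selected ccp_alpha:", round(float(best_alpha), 5),
      "CV accuracy:", round(max(cv_scores), 3))
```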
Article
It is commonly accepted that statistical modeling should follow the parsimony principle; namely, that simple models should be given priority whenever possible. But little quantitative knowledge is available concerning the amount of penalty (for complexity) regarded as allowable. We try to understand the parsimony principle in the context of model selection. In particular, the generalized final prediction error criterion is considered, and we argue that the penalty term should be chosen between 1.5 and 5 for most practical situations. Applying our results to the cross-validation criterion, we obtain insights into how the partition of data should be done. We also discuss the small-sample performance of our methods.
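For orientation only, a criterion of the kind discussed above can be written as a residual-error term plus a per-parameter complexity penalty lambda; this is one common Cp-style formulation and not necessarily the exact criterion analyzed in the cited paper.

```latex
% One common generalized final-prediction-error / C_p-style criterion:
% choose the model size k minimizing FPE_lambda; lambda = 2 recovers the
% classical AIC / C_p-like penalty, while the cited argument favours
% lambda between 1.5 and 5.
\mathrm{FPE}_{\lambda}(k) \;=\; \frac{\mathrm{RSS}_k}{n} \;+\; \lambda\,\frac{k}{n}\,\hat{\sigma}^{2}
```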
Article
We construct a prediction rule on the basis of some data, and then wish to estimate the error rate of this rule in classifying future observations. Cross-validation provides a nearly unbiased estimate, using only the original data. Cross-validation turns out to be related closely to the bootstrap estimate of the error rate. This article has two purposes: to understand better the theoretical basis of the prediction problem, and to investigate some related estimators, which seem to offer considerably improved estimation in small samples.
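One of the improved small-sample estimators associated with this line of work is the .632 bootstrap, which blends the optimistic resubstitution (apparent) error with the out-of-sample bootstrap error; it is shown here for orientation.

```latex
% .632 bootstrap estimate of prediction error: \overline{err} is the apparent
% (resubstitution) error and \hat{Err}^{(1)} the error measured on cases not
% drawn into each bootstrap sample.
\widehat{\mathrm{Err}}^{(.632)} \;=\; 0.368\,\overline{\mathrm{err}} \;+\; 0.632\,\widehat{\mathrm{Err}}^{(1)}
```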
Article
We consider the problem of selecting a model having the best predictive ability among a class of linear models. The popular leave-one-out cross-validation method, which is asymptotically equivalent to many other model selection methods such as the Akaike information criterion (AIC), the Cp, and the bootstrap, is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive ability does not converge to 1 as the total number of observations n → ∞. We show that the inconsistency of the leave-one-out cross-validation can be rectified by using a leave-nv-out cross-validation with nv, the number of observations reserved for validation, satisfying nv/n → 1 as n → ∞. This is a somewhat shocking discovery, because nv/n → 1 is totally opposite to the popular leave-one-out recipe in cross-validation. Motivations, justifications, and discussions of some practical aspects of the use of the leave-nv-out cross-validation method are provided, and results from a simulation study are presented.
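A brief sketch of the leave-n_v-out idea above: model selection with a large validation fraction, implemented here as repeated random splits (Monte-Carlo cross-validation) with n_v/n = 0.75. The candidate models, split fraction, and repetition count are assumptions.

```python
# Sketch: leave-n_v-out style model selection with a large validation fraction
# (n_v/n = 0.75 here), via repeated random splits. Candidate models are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=400, n_features=3, noise=15.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "cubic": make_pipeline(PolynomialFeatures(3), LinearRegression()),
}
splitter = ShuffleSplit(n_splits=50, test_size=0.75, random_state=0)  # large n_v/n
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=splitter).mean()
    print(f"{name:9s} mean validation R^2 = {score:.3f}")
```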
Conference Paper
Many aspects of concept learning research can be understood more clearly in light of a basic mathematical result stating, essentially, that positive performance in some learning situations must be offset by an equal degree of negative performance in others. We present a proof of this result and comment on some of its theoretical and practical ramifications.
Chapter
Statistics is a subject of many uses and surprisingly few effective practitioners. The traditional road to statistical knowledge is blocked, for most, by a formidable wall of mathematics. The approach in An Introduction to the Bootstrap avoids that wall. It arms scientists and engineers, as well as statisticians, with the computational techniques they need to analyze and understand complicated data sets.