Article (PDF available)

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Abstract and Figures

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment---over half a million runs of C4.5 and a Naive-Bayes algorithm---to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds. 1 Introduction It cannot be emphasized eno...
[Figures (only axis data recoverable): % accuracy vs. number of cross-validation folds (2, 5, 10, 20, -5, -2, -1) for the Soybean, Vehicle, and Rand datasets and for the Chess, Hypo, and Mushroom datasets; % accuracy (and estimated accuracy) vs. number of bootstrap samples (1, 2, 5, 10, 20, 50, 100) for the same datasets; standard deviation vs. folds and vs. bootstrap samples for C4.5 and Naive-Bayes on the Mushroom, Chess, Hypo, Breast, Vehicle, Soybean, and Rand datasets.]
... Nonetheless, different variations of k-fold cross-validation exist that involve repeated rounds of k-fold cross-validation or stratified folds; stratification ensures that each fold has the same proportion of instances of a given label (see Figure 2). In article [67], the author recommended stratified 10-fold cross-validation as the best model selection method from a study of several approaches, which included different cross-validation methods (regular cross-validation, leave-one-out cross-validation, and stratified cross-validation) and bootstrap for estimating accuracy. Stratified 10-fold cross-validation yielded performance estimates with less bias. ...
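The stratified folding described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the cited authors' implementation; the helper name `stratified_kfold` and the round-robin assignment are my own choices:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=10, seed=0):
    """Split indices into k folds that preserve each label's proportion.

    Returns a list of k index lists (the folds).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # deal this label's instances round-robin across the folds,
        # so every fold receives a near-equal share of the label
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

# toy dataset: 60% label "a", 40% label "b"
labels = ["a"] * 60 + ["b"] * 40
folds = stratified_kfold(labels, k=10)
```

Because the 60/40 class sizes divide evenly by 10, each fold here contains exactly six "a" instances and four "b" instances, mirroring the overall class ratio.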
... Table IX shows significant differences for stratified k-fold cross-validation. This finding supports the research article [67], which recommended stratified 10-fold cross-validation as the best model selection method in a study of several validation approaches. ...
Article
Full-text available
This article aims to study the performance of machine learning models in forecasting gender based on the students' open education competency perception. Data were collected from a convenience sample of 326 students from 26 countries using the eOpen instrument. The analysis comprises 1) a study of the students' perceptions of knowledge, skills, and attitudes or values related to open education and its sub-competencies from a 30-item questionnaire using machine learning models to forecast participants' gender, 2) validation of performance through cross-validation methods, 3) statistical analysis to find significant differences between machine learning models, and 4) an analysis from explainable machine learning models to find relevant features to forecast gender. The results confirm our hypothesis that the performance of machine learning models can effectively forecast gender based on the student's perceptions of knowledge, skills, and attitudes or values related to open education competency.
... Cross-validation is a fundamental technique in machine learning and statistics that plays a pivotal role in assessing predictive models' performance and generalization capability. It addresses the challenge of evaluating a model's performance on new, unseen data, which is crucial to avoid overfitting and ensure reliable predictions [35]. ...
... Moreover, cross-validation allows for better model tuning by identifying potential issues like overfitting or underfitting. It also helps to avoid the bias that may occur when using a fixed validation database [35]. ...
Article
Marshall stability (MS) is used to evaluate the resistance of asphalt concrete to settlement, deformation, and displacement. However, these experiments are complex, expensive, and time-consuming. Therefore, it is important to develop an alternative method to determine these parameters quickly. This paper presents a comprehensive investigation into applying machine learning techniques for predicting the MS of basalt fiber asphalt concrete. The study leverages the Gradient Boosting algorithm to establish predictive models. A database containing 128 samples is employed as the foundation for model construction. Additionally, SHAP analysis is employed to reveal the underlying variables influencing the predictive outcomes. To extend the practicality of the findings, a Graphical User Interface (GUI) is developed to give material engineers easy access to the predictive tool. The results show that the 4.75 mm aggregate content is the most influential variable, followed by the 2.36 mm aggregate content, the fiber content, the binder content, and the 9.5 mm aggregate content, in descending order of impact.
... K-fold cross validation is a technique that involves dividing the dataset into k subsets, called folds, and training the model k times, each time using a different fold as the validation set and the remaining folds as the training set. K-fold cross validation is important because it helps to prevent overfitting, which is when a model performs well on the training data but poorly on unseen data [1]. By training the model on different subsets of the data and evaluating its performance on each subset, we can get a better sense of how the model will perform on unseen data. ...
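The k-fold procedure described in this snippet can be sketched as follows. This is a minimal pure-Python illustration under my own naming (`kfold_accuracy` and the toy majority-class model are hypothetical, not from the cited work):

```python
import random
from statistics import mean

def kfold_accuracy(xs, ys, train_and_predict, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation: each fold serves
    once as the validation set while the remaining folds train the model."""
    rng = random.Random(seed)
    idxs = list(range(len(xs)))
    rng.shuffle(idxs)
    folds = [idxs[i::k] for i in range(k)]  # k disjoint index lists
    accs = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idxs if i not in held]
        preds = train_and_predict([xs[i] for i in train],
                                  [ys[i] for i in train],
                                  [xs[i] for i in fold])
        correct = sum(p == ys[i] for p, i in zip(preds, fold))
        accs.append(correct / len(fold))
    return mean(accs)  # average accuracy over the k validation folds

# toy model for illustration: predict the majority training label
def majority(train_x, train_y, test_x):
    guess = max(set(train_y), key=train_y.count)
    return [guess] * len(test_x)

xs = list(range(100))
ys = [0] * 70 + [1] * 30
acc = kfold_accuracy(xs, ys, majority, k=10)
```

With this toy data, every fold's training set keeps label 0 as the majority, so the cross-validated accuracy equals the overall fraction of 0-labels, 0.7.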
Experiment Findings
Full-text available
Exploring the potential of machine learning in climate trend analysis and utilizing the Jena Climate dataset, the goal is to predict temperature trends using non-temperature-related parameters. To achieve this, we crafted and compared custom implementations of three distinct machine learning models: Random Forest, Naive Bayes algorithm, and LSTM neural networks. This project was not only about developing predictive models but also about a thorough comparison with existing third-party implementations. Through this hands-on experience, we gained deep insights into model effectiveness, data handling, and the nuances of building machine learning models from scratch.
... Holding aside a portion of the data as a validation set is expected in supervised machine learning problems. K-Fold cross-validation [36] is utilized in this research to avoid overfitting and fully use the data used to train the model. The validation set is no longer required when using this approach. ...
Article
The shear strength of corroded reinforced concrete (CRC) beams is a critical consideration during the design stages of RC structures. In this study, we propose a machine learning technique for estimating the shear strength of CRC beams across a range of service periods. To do this, we gathered 158 CRC beam shear tests and used Artificial Neural Network (ANN) to create a forecast model for the considered output. Twelve input variables indicate the geometrical and material properties, reinforcing parameters, and the degree of corrosion in the beam, whereas the shear strength is the output considered. The database is designed to employ 70 percent of the data point to train the model and 30 percent to assess the performance. The model makes outstanding predictions, according to the results, with an R2 value of 0.989. In addition, five empirical shear strength models in the literature are utilized to test the suggested ANN model, demonstrating that the new model performs much better. With any given service period, the suggested time-dependent prediction model can offer the shear strength of CRC beams.
... The choice of K must be appropriate because if K is too large, the training set will be much larger than the validation set, and the evaluation results will not reflect the true nature of the machine learning method, especially with large datasets. In this study, K=10 was selected following a previous work's suggestion [39], as briefly illustrated in Fig. 2 (the 10-fold cross-validation technique used in this study). ...
Article
This study proposes the application of the Ensemble Decision Tree Boosted (EDT Boosted) model for forecasting the surface chloride concentration (Cs) of marine concrete. A database of 386 experimental results, collected from 17 different sources and covering twelve variables, was used to build and verify the predictive power of the EDT model. The input factors comprised eleven variables: the contents of cement, fly ash, blast furnace slag, silica fume, superplasticizer, water, fine aggregate, and coarse aggregate, the annual mean temperature, the chloride concentration in seawater, and the exposure time. The results indicate that EDT Boosted is a good predictor of Cs, as verified via good performance evaluation criteria: the R2, RMSE, MAE, and MAPE values were 0.84, 0.16, 0.17, and 17%, respectively. A partial dependence plot (PDP) was then developed to correlate the eleven input variables with Cs. The PDP implied that the strongest factors affecting Cs were the fine aggregate content, the chloride concentration, the exposure time, and the amounts of cement and water, which is useful for material engineers in the design of the grade.
... The model is trained k times; each time, one subset is held out as the validation data and the remaining k-1 subsets serve as the training data. The cross-validation estimate of accuracy is the average over all runs [30]. In this work, k = 10 was chosen to split the training dataset due to the minor errors and low variances observed through experimentation [31]. ...
Article
γ-Fe2O3 nanoparticles (NPs) were synthesized by the co-precipitation method followed by an annealing treatment at 200 °C in ambient air for 6 hours. A mass-type sensor was prepared by coating γ-Fe2O3 NPs on the active electrode of a quartz crystal microbalance (QCM). The obtained results indicate that the γ-Fe2O3 NPs based QCM sensor has high response and good repeatability toward SO2 gas in the range of 2.5–20 ppm at room temperature. Moreover, the frequency shift (ΔF) and the change in mass of SO2 adsorption per unit area (Δm) of the γ-Fe2O3 NPs coated QCM sensor are related to the mass density of the γ-Fe2O3 NPs and the SO2 concentration. An artificial neural network (ANN) model using Levenberg-Marquardt optimization was used to model the ΔF and Δm of the γ-Fe2O3 NPs coated QCM sensor. The validation results proved the model to be a reliable bridge between experimental and predicted values.
Chapter
Climate data is an essential kind of data for humans in the world. Improving the ability to forecast the climate will contribute to the development of many industries, such as agriculture and shipping. In this project, we use the climate data in Brazil from 2000 to 2020. Attributes of the data mainly are date and time, temperature, precipitation, wind speed, and the province in which these data are measured. This study aims to classify these climate data and analyze the changing climate trends in the same province. An artificial neural network is established as the model in this project to implement this objective. The performance shows that this model can complete this classification task.
Article
Full-text available
A properly performing and efficient bond market is widely considered important for the smooth functioning of trading systems in general. An important feature of the bond market for investors is its liquidity. High-frequency trading employs sophisticated algorithms to explore numerous markets, such as fixed-income markets. In this trading, transactions are processed more quickly, and the volume of trades rises significantly, improving liquidity in the bond market. This paper presents a comparison of neural networks, fuzzy logic, and quantum methodologies for predicting bond price movements through a high-frequency strategy in advanced and emerging countries. Our results indicate that, of the selected methods, QGA, DRCNN, and DLNN-GA can correctly interpret the expected future direction and rate changes of bond prices satisfactorily, while QFuzzy tends to perform worse in forecasting the future direction of bond prices. Our work has a large potential impact on the possible directions of algorithmic trading strategies for investors and stakeholders in fixed-income markets, and all methodologies proposed in this study could be promising options for exploring other financial markets.
Article
Full-text available
The design of a pattern recognition system requires careful attention to error estimation. The error rate is the most important descriptor of a classifier's performance. The commonly used estimates of error rate are based on the holdout method, the resubstitution method, and the leave-one-out method. All suffer either from large bias or large variance and their sample distributions are not known. Bootstrapping refers to a class of procedures that resample given data by computer. It permits determining the statistical properties of an estimator when very little is known about the underlying distribution and no additional samples are available. Since its publication in the last decade, the bootstrap technique has been successfully applied to many statistical estimations and inference problems. However, it has not been exploited in the design of pattern recognition systems. We report results on the application of several bootstrap techniques in estimating the error rate of 1-NN and quadratic classifiers. Our experiments show that, in most cases, the confidence interval of a bootstrap estimator of classification error is smaller than that of the leave-one-out estimator. The error of 1-NN, quadratic, and Fisher classifiers are estimated for several real data sets.
Article
Full-text available
This paper introduces stacked generalization, a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. When used with multiple generalizers, stacked generalization can be seen as a more sophisticated version of cross-validation, exploiting a strategy more sophisticated than cross-validation's crude winner-takes-all for combining the individual generalizers. When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question. After introducing stacked generalization and justifying its use, this paper presents two numerical experiments. The first demonstrates how stacked generalization improves upon a set of separate generalizers for the NETtalk task of translating text to phonemes. The second demonstrates how stacked generalization improves the performance of a single surface-fitter. With the other experimental evidence in the literature, the usual arguments supporting cross-validation, and the abstract justifications presented in this paper, the conclusion is that for almost any real-world generalization problem one should use some version of stacked generalization to minimize the generalization error rate. This paper ends by discussing some of the variations of stacked generalization, and how it touches on other fields like chaos theory.
Conference Paper
Full-text available
We evaluate the performance of weakest-link pruning of decision trees using cross-validation. This technique maps tree pruning into a problem of tree selection: find the best (i.e. the right-sized) tree from a set of trees ranging in size from the unpruned tree to a null tree. For samples with at least 200 cases, extensive empirical evidence supports the following conclusions relative to tree selection: (a) 10-fold cross-validation is nearly unbiased; (b) not pruning a covering tree is highly biased; (c) 10-fold cross-validation is consistent with optimal tree selection for large sample sizes; and (d) the accuracy of tree selection by 10-fold cross-validation is largely dependent on sample size, irrespective of the population distribution.
Article
It is commonly accepted that statistical modeling should follow the parsimony principle; namely, that simple models should be given priority whenever possible. But little quantitative knowledge is known concerning the amount of penalty (for complexity) regarded as allowable. We try to understand the parsimony principle in the context of model selection. In particular, the generalized final prediction error criterion is considered, and we argue that the penalty term should be chosen between 1.5 and 5 for most practical situations. Applying our results to the cross-validation criterion, we obtain insights into how the partition of data should be done. We also discuss the small sample performance of our methods.
Article
We construct a prediction rule on the basis of some data, and then wish to estimate the error rate of this rule in classifying future observations. Cross-validation provides a nearly unbiased estimate, using only the original data. Cross-validation turns out to be related closely to the bootstrap estimate of the error rate. This article has two purposes: to understand better the theoretical basis of the prediction problem, and to investigate some related estimators, which seem to offer considerably improved estimation in small samples.
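The bootstrap estimate of the error rate discussed here can be sketched as a leave-one-out bootstrap: train on a resample drawn with replacement and score on the points the resample missed. This is a minimal pure-Python illustration; the helper `bootstrap_error` and the toy always-zero model are hypothetical, not the article's estimator:

```python
import random
from statistics import mean

def bootstrap_error(xs, ys, train_and_predict, n_boot=100, seed=0):
    """Leave-one-out bootstrap error: for each round, train on a
    bootstrap resample and test on the out-of-bag points."""
    rng = random.Random(seed)
    n = len(xs)
    errs = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        in_bag = set(sample)
        out = [i for i in range(n) if i not in in_bag]  # roughly 37% of points
        if not out:
            continue  # rare degenerate round: every point was drawn
        preds = train_and_predict([xs[i] for i in sample],
                                  [ys[i] for i in sample],
                                  [xs[i] for i in out])
        wrong = sum(p != ys[i] for p, i in zip(preds, out))
        errs.append(wrong / len(out))
    return mean(errs)

# toy model for illustration: always predict label 0
def always_zero(train_x, train_y, test_x):
    return [0] * len(test_x)

err = bootstrap_error(list(range(50)), [0] * 40 + [1] * 10, always_zero, n_boot=50)
```

Because each round tests only on held-out points, this estimator avoids the optimism of resubstitution; Efron's .632 estimator then blends it with the resubstitution error to reduce its pessimistic bias.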
Article
We consider the problem of selecting a model having the best predictive ability among a class of linear models. The popular leave-one-out cross-validation method, which is asymptotically equivalent to many other model selection methods such as the Akaike information criterion (AIC), the Cp, and the bootstrap, is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive ability does not converge to 1 as the total number of observations n → ∞. We show that the inconsistency of the leave-one-out cross-validation can be rectified by using a leave-nv-out cross-validation with nv, the number of observations reserved for validation, satisfying nv/n → 1 as n → ∞. This is a somewhat shocking discovery, because nv/n → 1 is totally opposite to the popular leave-one-out recipe in cross-validation. Motivations, justifications, and discussions of some practical aspects of the use of the leave-nv-out cross-validation method are provided, and results from a simulation study are presented.
Conference Paper
Many aspects of concept learning research can be understood more clearly in light of a basic mathematical result stating, essentially, that positive performance in some learning situations must be offset by an equal degree of negative performance in others. We present a proof of this result and comment on some of its theoretical and practical ramifications.
Chapter
Statistics is a subject of many uses and surprisingly few effective practitioners. The traditional road to statistical knowledge is blocked, for most, by a formidable wall of mathematics. The approach in An Introduction to the Bootstrap avoids that wall. It arms scientists and engineers, as well as statisticians, with the computational techniques they need to analyze and understand complicated data sets.