Conference Paper (PDF available)

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Authors: Ron Kohavi

Abstract and Figures

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
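The ten-fold cross-validation procedure the paper evaluates can be sketched as follows. This is an illustrative stdlib-Python implementation, not the authors' code; the majority-class baseline in the demo is a hypothetical stand-in for learners such as C4.5 or Naive-Bayes.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, train_fn, predict_fn, k=10, seed=0):
    """Mean held-out accuracy over k folds: each fold serves once as
    the test set while the model is trained on the remaining folds."""
    accs = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        X_tr = [X[i] for i in range(len(X)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        model = train_fn(X_tr, y_tr)
        hits = sum(predict_fn(model, X[i]) == y[i] for i in fold)
        accs.append(hits / len(fold))
    return sum(accs) / len(accs)

# Demo: a majority-class baseline (hypothetical stand-in classifier).
train = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model
X, y = list(range(100)), [0] * 70 + [1] * 30
acc = cross_val_accuracy(X, y, train, predict, k=10)
```

With a 70/30 class balance, the majority-class baseline's cross-validated accuracy comes out at the majority-class frequency, since every training split still has class 0 in the majority.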
[Figures: plots of % accuracy versus number of cross-validation folds (2, 5, 10, 20, -5, -2, -1) for the Soybean, Vehicle, and Rand datasets and for the Chess, Hypo, and Mushroom datasets; plots of % accuracy versus number of bootstrap samples (1 to 100) for the same two dataset groups; and plots of the standard deviation of the estimates versus folds and versus samples for C4.5 and Naive-Bayes on the Mushroom, Chess, Hypo, Breast, Vehicle, Soybean, and Rand datasets.]
... Even though k-fcv offers several advantages for performance estimation, such as a reduced computation time compared to leave-one-out [6], its application is not totally risk-free [7,8]. It may cause dataset shift [9,10], in which the data used to build and evaluate the model do not follow the same distribution. ...
... The problem of target shift has been widely studied in classification [15,16]. In this context, stratification [6] is employed to reduce target shift related to k-fcv. It consists of having the same proportion of samples of each class in the training and test sets. ...
... It consists of having the same proportion of samples of each class in the training and test sets. This approach has provided successful results creating cross-validation folds for both model selection and evaluation in classification [6]. ...
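The stratification described above can be sketched in a few lines: group sample indices by class, then deal each class round-robin across the folds so every fold keeps the overall class proportions. This is an illustrative stdlib-Python sketch, not code from the cited papers.

```python
import random
from collections import defaultdict

def stratified_folds(y, k, seed=0):
    """Build k folds that each preserve the overall class proportions:
    indices are grouped by class label and dealt round-robin."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

# An 80/20 class balance is preserved inside every one of the 10 folds.
y = [0] * 80 + [1] * 20
folds = stratified_folds(y, k=10)
```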
Article
Full-text available
Data that have not been modeled cannot be correctly predicted. Under this assumption, this research studies how k-fold cross-validation can introduce dataset shift in regression problems: the data distributions in the training and test sets differ and, therefore, the model performance estimation deteriorates. Even though stratification of the output variable is widely used in the field of classification to reduce the impact of dataset shift induced by cross-validation, its use in regression is not widespread in the literature. This paper analyzes the consequences for dataset shift of including different regressand stratification schemes in cross-validation with regression data. The results obtained show that these schemes create more similar training and test sets, reducing the presence of dataset shift related to cross-validation. The bias and deviation of the performance estimation results obtained by regression algorithms are improved using the highest numbers of strata, as is the number of cross-validation repetitions necessary to obtain these better results.
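A minimal sketch of regressand stratification of the kind this abstract describes: sort by the continuous target, cut the sorted order into quantile strata, and deal each stratum across the folds. This is a hypothetical stdlib-Python illustration; `n_strata` and the equal-size quantile cut are assumptions, not the paper's exact scheme.

```python
import random

def regressand_stratified_folds(y, k, n_strata=5, seed=0):
    """Stratify a continuous target: sort indices by y, slice the
    sorted order into n_strata quantile strata, then deal each stratum
    round-robin over k folds so train/test y-distributions stay close."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    size = len(y) // n_strata
    strata = [order[s * size:(s + 1) * size] for s in range(n_strata - 1)]
    strata.append(order[(n_strata - 1) * size:])  # last stratum takes the remainder
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for stratum in strata:
        rng.shuffle(stratum)
        for j, i in enumerate(stratum):
            folds[j % k].append(i)
    return folds

# Every fold draws equally from each quintile of the target.
y = [float(v) for v in range(100)]
folds = regressand_stratified_folds(y, k=5)
```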
... K-fold cross-validation (Kohavi, 1995) was applied to compare and select the best teak stand growth modeling system. The dataset was randomly split into K subsamples (K = 10) of equal size, in which K − 1 subsamples were used to develop models and calculate AIC (Akaike, 1973) and adjusted R², and the remaining subsample was used to validate the models and estimate errors such as percent bias, root mean squared error (RMSE), and mean absolute percent error (MAPE, %). ...
Article
Full-text available
We developed a system for modeling the growth and yield of planted teak (Tectona grandis L.f.) for small diameter products under varying management regimes in the tropical Central Highlands of Viet Nam. We compared an independent and simultaneous system of models to predict dominant height (Ho), quadratic mean diameter (Dg), averaged tree height (Hg) with Dg, and mean tree volume (V) versus stand age (A). In addition, the model system performance with and without site index (SI) and stand density (N) as covariates was compared using K-fold cross-validation. The best modeling system was obtained with the simultaneously fit models that included SI and N and were of the form: Dg = Dm/(1 + a × exp(−b × A)) × exp[e1 × (SI − 15) + e2/1000 × (N − 722)]; Hg = Hm × exp(−a × exp(−b × A)) × exp[e1 × (SI − 15) + e2/1000 × (N − 722)]; and V = π/4 × 10⁻⁴ × Dg² × Hg × 0.45; where Dm, Hm, a, b, e1, and e2 were the parameters to be estimated. These models will help predict the growth and yield of teak planted under different planting schemes, including monoculture, agroforestry, and forest enrichment planting in this region.
... In this technique, we resample the dataset by randomly dividing the total dataset to 80% for training (i.e., UX experiments: 12,715 observations) and 20% for testing (i.e., UX experiments: 3178 observations). Furthermore, multiple iterations are applied following the bootstrap resampling technique to evaluate the error estimate precision and be able to determine whether an error difference is statistically significant [29,30] (see Figure 3). Randomly subsampling the training and testing datasets allows us to obtain a distribution of the errors instead of a point estimate. ...
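The resampling idea in this excerpt, obtaining a distribution of errors rather than a point estimate by repeatedly redrawing the train/test split, can be sketched as follows. This is an illustrative stdlib-Python sketch; the 80/20 split, iteration count, and the toy mean-predictor error function are assumptions, not the cited study's setup.

```python
import random

def error_distribution(data, error_fn, n_iter=200, test_frac=0.2, seed=0):
    """Repeatedly draw a random train/test split and record the test
    error, yielding a distribution of errors instead of a point estimate."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_frac)
    errs = []
    for _ in range(n_iter):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        errs.append(error_fn(train, test))
    return errs

# Toy error: MAE of predicting the training mean on the test targets.
data = [(x, 2.0 * x) for x in range(50)]
def mae_of_mean_predictor(train, test):
    mean_y = sum(t[1] for t in train) / len(train)
    return sum(abs(t[1] - mean_y) for t in test) / len(test)

errs = error_distribution(data, mae_of_mean_predictor, n_iter=200)
lo, hi = sorted(errs)[4], sorted(errs)[-5]  # ~95% empirical interval
```

The percentile interval `[lo, hi]` is what lets one judge whether an error difference between two models is statistically meaningful rather than split-to-split noise.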
Article
Full-text available
Identifying the factors that control the dynamics of pedestrians is a crucial step towards modeling and building various pedestrian-oriented simulation systems. In this article, we empirically explore the influential factors that control the single-file movement of pedestrians and their impact. Our goal in this context is to apply feed-forward neural networks to predict and understand the individual speeds for different densities of pedestrians. With artificial neural networks, we can approximate the fitting function that describes pedestrians’ movement without having modeling bias. Our analysis is focused on the distances and range of interactions across neighboring pedestrians. As indicated by previous research, we find that the speed of pedestrians depends on the distance to the predecessor. Yet, in contrast to classical purely anisotropic approaches—which are based on vision fields and assume that the interaction mainly depends on the distance in front—our results demonstrate that the distance to the follower also significantly influences movement. Using the distance to the follower combined with the subject pedestrian’s headway distance to predict the speed improves the estimation by 18% compared to the prediction using the space in front alone.
... Finally, in the leave-one-out method, the split is repeated for as many observations as are present in the dataset, i.e., the dataset is iteratively split so that only one observation is used to test the accuracy of predictions, while the remaining observations are used to estimate model parameters or coefficients. In general, using the k-fold validation and leave-one-out methods is preferable to the simple holdout method (Kohavi 1995). As regards the number of folds, authors have suggested using k = 5 or k = 10 folds rather than k = 2 when performing cross-validation, as using a larger number of folds is expected to decrease bias in estimating prediction errors (Rodriguez et al. 2010); however, increasing the number of folds is only feasible with large datasets, as the large-sample condition needs to be met in each fold (Wong 2015). ...
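Leave-one-out as just described is simply k-fold cross-validation with k equal to the number of observations. A minimal stdlib-Python sketch, with a hypothetical 1-NN classifier on the number line as the demo learner:

```python
def leave_one_out_accuracy(X, y, train_fn, predict_fn):
    """Train on all points but one, test on the held-out point,
    repeated for every point (k-fold CV with k = n)."""
    n = len(X)
    hits = 0
    for i in range(n):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        model = train_fn(X_tr, y_tr)
        hits += predict_fn(model, X[i]) == y[i]
    return hits / n

# Toy 1-NN on 1-D points (hypothetical data: two tight clusters).
def train_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda p: abs(p[0] - x))[1]

X = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
y = [0, 0, 0, 1, 1, 1]
acc = leave_one_out_accuracy(X, y, train_1nn, predict_1nn)
```

Because every held-out point's nearest remaining neighbour shares its label, the leave-one-out accuracy on this toy data is perfect; the cost is n model fits, which is what makes the method expensive for large n.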
Chapter
The aim of this chapter is to introduce and describe how digital technologies, in particular smartphones, can be used in research in two areas, namely (i) to conduct personality assessment and (ii) to assess and promote physical activity. This area of research is very timely, because it demonstrates how the ubiquitously available smartphone technology—next to its known advantages in day-to-day life—can provide insights into many variables, relevant for psycho-social research, beyond what is possible within the classic spectrum of self-report inventories and laboratory experiments. The present chapter gives a brief overview on first empirical studies and discusses both opportunities and challenges in this rapidly developing research area. Please note that the personality part of this chapter in the second edition has been slightly updated.
... 2) Building classifiers on all feature subsets based on one classification algorithm. 3) All classifiers are evaluated by a 10-fold cross-validation (Kohavi, 1995). 4) The optimum feature subset and classifier are defined as the feature set and classifier with the best classification performance, respectively. ...
Article
Full-text available
Notably, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a tight relationship with the immune system. Human resistance to COVID-19 infection comprises two stages. The first stage is immune defense, while the second stage is extensive inflammation. This process is further divided into innate and adaptive immunity during the immune defense phase. These two stages involve various immune cells, including CD4+ T cells, CD8+ T cells, monocytes, dendritic cells, B cells, and natural killer cells. Various immune cells are involved and make up the complex and unique immune system response to COVID-19, providing characteristics that set it apart from other respiratory infectious diseases. In the present study, we identified cell markers for differentiating COVID-19 from common inflammatory responses, non-COVID-19 severe respiratory diseases, and healthy populations based on single-cell profiling of the gene expression of six immune cell types by using Boruta and mRMR feature selection methods. Some features such as IFI44L in B cells, S100A8 in monocytes, and NCR2 in natural killer cells are involved in the innate immune response of COVID-19. Other features such as ZFP36L2 in CD4+ T cells can regulate the inflammatory process of COVID-19. Subsequently, the IFS method was used to determine the best feature subsets and classifiers in the six immune cell types for two classification algorithms. Furthermore, we established the quantitative rules used to distinguish the disease status. The results of this study can provide theoretical support for a more in-depth investigation of COVID-19 pathogenesis and intervention strategies.
... This step is fundamental as it assures that the randomly selected part used to evaluate the quality of the model is unseen in the training process. The cross-validation method was used within the training dataset to reduce the possibility of overfitting [58]. The k-fold cross-validation subdivides the training dataset into smaller k folds with the same classes as the original dataset. ...
Article
Liquefaction causes damage and economic losses that can exceed the impact caused by ground shaking in earthquakes. However, probabilistic models to predict liquefaction occurrence on a regional scale are scarce and uncertain. We developed a non-parametric model using a database with more than 40 events worldwide. We trained and tested a supervised machine-learning model to predict liquefaction occurrence and non-occurrence, using a well-established methodology to select the optimal explanatory variables that correlate best with liquefaction occurrence. The optimal variables include strain proxy, slope, topographic roughness index, water-table depth, average precipitation, and distance to the closest water body. We compared the proposed model with existing proposals from the literature using the area under the Receiver Operating Characteristic (ROC) curve and the Brier score. Lastly, we apply the proposed model to assess liquefaction occurrence for one historical event and two hypothetical scenarios in Montenegro and Albania.
... Then, the final result is calculated as the mean of the results over all n folds. In this experiment, n is set to 10, as suggested by [32], in order to reliably demonstrate the efficiency of any proposed algorithm. The 10-fold cross-validation is repeated M = 10 times; each time, the order of the data set instances is randomised. ...
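Repeating 10-fold cross-validation with a reshuffled instance order each time, as in the excerpt, can be sketched like this. It is an illustrative stdlib-Python sketch; the per-fold score function is a placeholder for a real train/evaluate step.

```python
import random
import statistics

def repeated_kfold(n, fold_score_fn, k=10, m=10, seed=0):
    """Run k-fold CV m times, reshuffling the instance order each run;
    return the mean and std-dev of the m run-level scores."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(m):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        run_scores.append(sum(fold_score_fn(f) for f in folds) / k)
    return statistics.mean(run_scores), statistics.pstdev(run_scores)

# Placeholder score: fraction of even-numbered instances in the fold.
mean_score, sd = repeated_kfold(
    100, lambda f: sum(i % 2 == 0 for i in f) / len(f))
```

Averaging over m reshuffled runs reduces the variance contributed by any single random partition of the data.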
Article
Full-text available
Intrusion detection systems (IDSs) based on machine learning algorithms represent a key component for securing computer networks, where normal and abnormal behaviours of network traffic are automatically learned with no or limited domain experts’ interference. Most of existing IDS approaches rely on labeled predefined classes which require domain experts to efficiently and accurately identify anomalies and threats. However, it is very hard to acquire reliable, up-to-date, and sufficient labeled data for an efficient traffic intrusion detection model. To address such an issue, this paper aims to develop a novel self-automatic labeling intrusion detection approach (called SAL) which utilises only small labeled network traffic data to potentially detect most types of attacks including zero-day attacks. In particular, the proposed SAL approach has three phases including: (i) an ensemble-based decision-making phase to address the limitations of a single classifier by relying on the predictions of multi-classifiers, (ii) a function agreement phase to assign the class label based on an adaptive confidence threshold to unlabeled observations, and (iii) an augmentation labeling phase to maximise the accuracy and the efficiency of the intrusion detection systems in a classifier model and to detect new attacks and anomalies by utilising a hybrid voting-based ensemble learning approach. Experimental results on available network traffic data sets demonstrate that the proposed SAL approach achieves high performance in comparison to two well-known baseline IDSs based on machine learning algorithms.
... For each subset, a classifier was trained based on one classification algorithm and samples consisting of the features from this feature subset. This classifier was further evaluated by 10-fold cross-validation (Kohavi, 1995;Chen et al., 2017;Zhou et al., 2020;Zhu et al., 2021). Then, the classifier giving the optimal performance can be obtained. ...
Article
Full-text available
Mammalian cortical interneurons (CINs) could be classified into more than two dozen cell types that possess diverse electrophysiological and molecular characteristics, and participate in various essential biological processes in the human neural system. However, the mechanism to generate diversity in CINs remains controversial. This study aims to predict CIN diversity in mouse embryo by using single-cell transcriptomics and the machine learning methods. Data of 2,669 single-cell transcriptome sequencing results are employed. The 2,669 cells are classified into three categories, caudal ganglionic eminence (CGE) cells, dorsal medial ganglionic eminence (dMGE) cells, and ventral medial ganglionic eminence (vMGE) cells, corresponding to the three regions in the mouse subpallium where the cells are collected. Such transcriptomic profiles were first analyzed by the minimum redundancy and maximum relevance method. A feature list was obtained, which was further fed into the incremental feature selection, incorporating two classification algorithms (random forest and repeated incremental pruning to produce error reduction), to extract key genes and construct powerful classifiers and classification rules. The optimal classifier could achieve an MCC of 0.725, and category-specified prediction accuracies of 0.958, 0.760, and 0.737 for the CGE, dMGE, and vMGE cells, respectively. The related genes and rules may provide helpful information for deepening the understanding of CIN diversity.
... A model can be validated by fitting on the training set and obtaining GOF statistics on the test set. In neuroscience, the leave-one-out cross-validation algorithm is often applied: the model is fitted on n − 1 data points and tested on the remaining one, repeated so that every data point serves once as the "test set" (Kohavi, 1995; Browne, 2000; Hastie et al., 2009; J.R. Cohen et al., 2010). A generalization is leave-p-out, which, however, becomes more CPU-intensive with increasing p. ...
Article
Full-text available
Model selection is often implicit: when performing an ANOVA, one assumes that the normal distribution is a good model of the data; fitting a tuning curve implies that an additive and a multiplicative scaler describes the behavior of the neuron; even calculating an average implicitly assumes that the data were sampled from a distribution that has a finite first statistical moment: the mean. Model selection may be explicit, when the aim is to test whether one model provides a better description of the data than a competing one. As a special case, clustering algorithms identify groups with similar properties within the data. They are widely used from spike sorting to cell type identification to gene expression analysis. We discuss model selection and clustering techniques from a statistician's point of view, revealing the assumptions behind, and the logic that governs the various approaches. We also showcase important neuroscience applications and provide suggestions how neuroscientists could put model selection algorithms to best use as well as what mistakes should be avoided.
Article
Full-text available
Purpose: In recent years, the advent of social networking sites has attracted more attention to review-based recommender systems. The purpose of developing such systems is to use the valuable information which can be obtained from users' textual reviews. This paper presents a collaborative filtering recommender system using sentiment analysis. Design/Methodology: For this purpose, a sample of 7210 comments about 221 books from the Amazon website is used for sentiment analysis. We used ensemble models to extract users' opinions, applying a weighted vote-based classifier ensemble technique. The required data were collected from Amazon.com through Web crawlers written in Java and were limited to Amazon users' comments on specific book topics such as Business Intelligence. We applied different methods, including text normalization and ensemble methods, for the sentiment analysis. Findings: The results showed that sentiment analysis of user reviews has a positive effect on recommending popular goods and on the performance of recommender systems. Practical Implications: These results show that by understanding the effect of sentiment analysis for analyzing unstructured data, online retailers can use it for policy making and to recommend new suggestions to their customers. The system also helps consumers make informed decisions. Originality/Value: This study combines sentiment analysis and recommender systems and shows a remarkable improvement in recommender-system performance.
Article
Full-text available
The design of a pattern recognition system requires careful attention to error estimation. The error rate is the most important descriptor of a classifier's performance. The commonly used estimates of error rate are based on the holdout method, the resubstitution method, and the leave-one-out method. All suffer either from large bias or large variance and their sample distributions are not known. Bootstrapping refers to a class of procedures that resample given data by computer. It permits determining the statistical properties of an estimator when very little is known about the underlying distribution and no additional samples are available. Since its publication in the last decade, the bootstrap technique has been successfully applied to many statistical estimations and inference problems. However, it has not been exploited in the design of pattern recognition systems. We report results on the application of several bootstrap techniques in estimating the error rate of 1-NN and quadratic classifiers. Our experiments show that, in most cases, the confidence interval of a bootstrap estimator of classification error is smaller than that of the leave-one-out estimator. The error of 1-NN, quadratic, and Fisher classifiers are estimated for several real data sets.
Article
Full-text available
This paper introduces stacked generalization, a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. When used with multiple generalizers, stacked generalization can be seen as a more sophisticated version of cross-validation, exploiting a strategy more sophisticated than cross-validation's crude winner-takes-all for combining the individual generalizers. When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question. After introducing stacked generalization and justifying its use, this paper presents two numerical experiments. The first demonstrates how stacked generalization improves upon a set of separate generalizers for the NETtalk task of translating text to phonemes. The second demonstrates how stacked generalization improves the performance of a single surface-fitter. With the other experimental evidence in the literature, the usual arguments supporting cross-validation, and the abstract justifications presented in this paper, the conclusion is that for almost any real-world generalization problem one should use some version of stacked generalization to minimize the generalization error rate. This paper ends by discussing some of the variations of stacked generalization, and how it touches on other fields like chaos theory.
Conference Paper
Full-text available
We evaluate the performance of weakest-link pruning of decision trees using cross-validation. This technique maps tree pruning into a problem of tree selection: find the best (i.e., the right-sized) tree from a set of trees ranging in size from the unpruned tree to a null tree. For samples with at least 200 cases, extensive empirical evidence supports the following conclusions relative to tree selection: (a) 10-fold cross-validation is nearly unbiased; (b) not pruning a covering tree is highly biased; (c) 10-fold cross-validation is consistent with optimal tree selection for large sample sizes; and (d) the accuracy of tree selection by 10-fold cross-validation is largely dependent on sample size, irrespective of the population distribution.
Article
It is commonly accepted that statistical modeling should follow the parsimony principle; namely, that simple models should be given priority whenever possible. But little quantitative knowledge is known concerning the amount of penalty (for complexity) regarded as allowable. We try to understand the parsimony principle in the context of model selection. In particular, the generalized final prediction error criterion is considered, and we argue that the penalty term should be chosen between 1.5 and 5 for most practical situations. Applying our results to the cross-validation criterion, we obtain insights into how the partition of data should be done. We also discuss the small sample performance of our methods.
Article
We construct a prediction rule on the basis of some data, and then wish to estimate the error rate of this rule in classifying future observations. Cross-validation provides a nearly unbiased estimate, using only the original data. Cross-validation turns out to be related closely to the bootstrap estimate of the error rate. This article has two purposes: to understand better the theoretical basis of the prediction problem, and to investigate some related estimators, which seem to offer considerably improved estimation in small samples.
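Among the "related estimators" this abstract alludes to is the 0.632 bootstrap, which blends the optimistic resubstitution error with the pessimistic error on points left out of each bootstrap sample. A minimal stdlib-Python sketch, with a majority-class demo classifier as an assumed stand-in for a real learner:

```python
import random
from collections import Counter

def bootstrap_632_error(X, y, train_fn, predict_fn, n_boot=50, seed=0):
    """0.632 bootstrap error estimate:
    0.368 * resubstitution error + 0.632 * mean out-of-bag error."""
    n = len(X)
    full = train_fn(X, y)
    resub = sum(predict_fn(full, X[i]) != y[i] for i in range(n)) / n
    rng = random.Random(seed)
    oob_errs = []
    for _ in range(n_boot):
        picks = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        out = [i for i in range(n) if i not in set(picks)]
        if not out:
            continue
        model = train_fn([X[i] for i in picks], [y[i] for i in picks])
        oob_errs.append(
            sum(predict_fn(model, X[i]) != y[i] for i in out) / len(out))
    return 0.368 * resub + 0.632 * (sum(oob_errs) / len(oob_errs))

# Demo: majority-class classifier on a 70/30 dataset (hypothetical data).
train = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model
X, y = list(range(100)), [0] * 70 + [1] * 30
err = bootstrap_632_error(X, y, train, predict)
```

About 36.8% of the points fall out-of-bag in each resample, which is where the 0.632 weight comes from.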
Article
We consider the problem of selecting a model having the best predictive ability among a class of linear models. The popular leave-one-out cross-validation method, which is asymptotically equivalent to many other model selection methods such as the Akaike information criterion (AIC), the Cp, and the bootstrap, is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive ability does not converge to 1 as the total number of observations n → ∞. We show that the inconsistency of the leave-one-out cross-validation can be rectified by using a leave-nv-out cross-validation with nv, the number of observations reserved for validation, satisfying nv/n → 1 as n → ∞. This is a somewhat shocking discovery, because nv/n → 1 is totally opposite to the popular leave-one-out recipe in cross-validation. Motivations, justifications, and discussions of some practical aspects of the use of the leave-nv-out cross-validation method are provided, and results from a simulation study are presented.
Conference Paper
Many aspects of concept learning research can be understood more clearly in light of a basic mathematical result stating, essentially, that positive performance in some learning situations must be offset by an equal degree of negative performance in others. We present a proof of this result and comment on some of its theoretical and practical ramifications.