PERFORMANCE-ESTIMATION PROPERTIES OF CROSS-VALIDATION-BASED PROTOCOLS WITH SIMULTANEOUS HYPER-PARAMETER OPTIMIZATION
Ioannis Tsamardinos
Department of Computer Science, University of Crete, and
Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH)
Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece
tsamard.it@gmail.com
Amin Rakhshani
Department of Computer Science, University of Crete, and
Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), Vassilika Vouton,
Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece
aminra@ics.forth.gr
Vincenzo Lagani
Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), Vassilika Vouton,
Heraklion, GR-700 13, Greece
vlagani@ics.forth.gr
In a typical supervised data analysis task, one needs to perform the following two tasks: (a) select an optimal combination of learning methods (e.g., for variable selection and classifier) and tune their hyper-parameters (e.g., K in K-NN), also called model selection, and (b) provide an estimate of the performance of the final, reported model. Combining the two tasks is not trivial because, when one selects the set of hyper-parameters that seem to provide the best estimated performance, this estimation is optimistic (biased / overfitted) due to performing multiple statistical comparisons. In this paper, we discuss the theoretical properties of performance estimation when model selection is present, and we confirm that simple Cross-Validation with model selection is indeed optimistic (overestimates performance) in small sample scenarios and should be avoided. We present in detail the Nested Cross Validation and a method by Tibshirani and Tibshirani for removing the estimation bias, and we investigate their theoretical properties. In computational experiments with real datasets, both protocols provide conservative estimation of performance and should be preferred. These statements hold true even if feature selection is performed as preprocessing.
Keywords: Performance Estimation; Model Selection; Cross Validation; Stratification; Comparative
Evaluation.
1. Introduction
A typical supervised analysis (e.g., classification or regression) consists of several steps that result in a final, single prediction or diagnostic model. For example, the analyst may need to impute missing values, perform variable selection or general dimensionality reduction, discretize variables, try several different representations of the data, and, finally, apply a learning algorithm for classification or regression. Each of these steps requires a selection of algorithms out of hundreds or even thousands of possible choices, as well as the tuning of their hyper-parameters*. Hyper-parameter optimization is also called the model selection problem, since each combination of hyper-parameters tried leads to a possible classification or regression model, out of which the best is to be selected. There are several alternatives in the literature about how to identify a good combination of methods and their hyper-parameters (e.g., [1][2]), and they all involve implicitly or explicitly searching the space of hyper-parameters and trying different combinations. Unfortunately, trying multiple combinations, estimating their performance, and reporting the performance of the best model found leads to overestimating the performance (i.e., underestimating the error / loss), sometimes also referred to as overfitting†. This phenomenon is called the problem of multiple comparisons in induction algorithms; it has been analyzed in detail in [3] and is related to multiple testing or multiple comparisons in statistical hypothesis testing. Intuitively, when one selects among several models whose estimations vary around their true mean value, it becomes likely that what seems to be the best model has been "lucky" on the specific test set and its performance has been overestimated. Extensive discussions and experiments on the subject can be found in [2].

* We use the term "hyper-parameters" to denote the algorithm parameters that can be set by the user and are not estimated directly from the data, e.g., the parameter K in the K-NN algorithm. In contrast, the term "parameters" in the statistical literature typically refers to the model quantities that are estimated directly from the data, e.g., the weight vector w in a linear regression model y = wᵀx + b. See [2] for a definition and discussion too.
† The term "overfitting" is a more general term and we prefer the term "overestimating" to characterize this phenomenon.
An intuitive small example now follows. Let's suppose method M1 has 85% true accuracy and method M2 has 83% true accuracy on a given classification task when trained with a randomly selected dataset of a given size. In 4 randomly drawn training and corresponding test sets on the same problem, the estimations of accuracy may be 80, 82, 88, 90 percent for M1 and 88, 85, 79, 79 percent for M2. If M1 were evaluated by itself, its mean accuracy would be estimated as 85%, and for M2 it would be 82.75%; both are close to their true means. If performance estimations were perfect, then M1 would be chosen each time and the average performance of the models returned with model selection would be 85%. However, when both methods are tried, the best is selected, and the maximum performance is reported, we obtain the series of estimations 88, 85, 88, 90, whose average is 87.75% and which will in general be biased. A larger example and contrived experiment now follows:
Example: In a binary classification problem, an analyst tries N different classification algorithms, producing N corresponding models from the data. They estimate the performance (accuracy) of each model on a test set of M samples. They then select the model that exhibits the best estimated accuracy B̂ and report this performance as the estimated performance of the selected model. Let's assume that all models have the same true accuracy of 85%. What is the expected value of the estimated accuracy B̂, and how biased is it?
Let's denote the true accuracy of each model with P_i = 0.85 and its estimate on the test set with P̂_i. The true performance of the final model is of course also B = max_i P_i = 0.85. But the estimated performance B̂ = max_i P̂_i is biased. Table 1 shows E[B̂] for different values of N and M, assuming each model makes independent errors on the test set, as estimated with 10000 simulations. The table also shows the 5th and 95th percentiles as an indication of the range of the estimation. Invariably, the expected estimated accuracy E[B̂] of the final model is overestimated.
As expected, the bias increases with the number of models tried and decreases with the size of the test set. For sample sizes less than or equal to 100, the bias is significant: when the number of models produced is larger than 100, it is not uncommon to estimate the performance of the best model as 100%. Notice that, when using Cross-Validation-based protocols to estimate performance, each sample serves once and only once as a test case. Thus, one can consider the total dataset sample size as the size of the test set. Typical high-dimensional datasets in biology often contain less than 100 samples and thus one should be careful with the estimation protocols employed for their analysis.
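The simulation behind Table 1 is straightforward to reproduce. The following Python sketch is our own illustration (not the authors' code; the function name is hypothetical): it draws the estimated accuracy of each of N equally accurate models as a binomial proportion on a test set of M samples, assuming independent errors as above, and reports the mean and the 5th/95th percentiles of the maximum.

    import numpy as np

    def max_accuracy_bias(n_models, test_size, true_acc=0.85, n_sims=10_000, seed=0):
        """Expected estimated accuracy E[B^] of the best of n_models models with equal
        true accuracy, each making independent errors on a test set of test_size samples."""
        rng = np.random.default_rng(seed)
        # Estimated accuracy of each model: Binomial(test_size, true_acc) / test_size
        acc_hat = rng.binomial(test_size, true_acc, size=(n_sims, n_models)) / test_size
        best = acc_hat.max(axis=1)  # accuracy reported for the selected ("best") model
        return best.mean(), np.percentile(best, [5, 95])

    mean_best, (p5, p95) = max_accuracy_bias(n_models=100, test_size=20)
    print(f"E[B^] = {mean_best:.3f}  [{p5:.2f}; {p95:.2f}]")  # close to the corresponding cell of Table 1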
What about the number of different models tried in an analysis? Is it realistic to expect an analyst to generate thousands of different models? Obviously, it is very rare that any analyst will employ thousands of different algorithms; however, most learning algorithms are parameterized by several different hyper-parameters.
Table 1. Average estimated accuracy E[B̂] when reporting the (estimated) performance of N models with equal true accuracy of 85%. In brackets, the 5th and 95th percentiles are shown. The smaller the sample size and the larger the number of models N out of which selection is performed, the larger the overestimation.

Number of       Test set sample size
models          20              80              100             500             1000
5               0.935           0.895           0.891           0.868           0.863
                [0.85; 1.00]    [0.86; 0.94]    [0.86; 0.93]    [0.85; 0.89]    [0.85; 0.88]
10              0.959           0.908           0.902           0.874           0.867
                [0.90; 1.00]    [0.88; 0.94]    [0.87; 0.93]    [0.86; 0.89]    [0.86; 0.88]
20              0.977           0.920           0.913           0.879           0.871
                [0.95; 1.00]    [0.89; 0.95]    [0.89; 0.94]    [0.87; 0.89]    [0.86; 0.88]
50              0.993           0.933           0.925           0.885           0.875
                [0.95; 1.00]    [0.91; 0.96]    [0.90; 0.95]    [0.87; 0.90]    [0.87; 0.88]
100             0.999           0.941           0.932           0.889           0.878
                [1.00; 1.00]    [0.93; 0.96]    [0.91; 0.95]    [0.88; 0.90]    [0.87; 0.89]
1000            1.000           0.962           0.952           0.899           0.885
                [1.00; 1.00]    [0.95; 0.97]    [0.94; 0.97]    [0.89; 0.91]    [0.88; 0.89]
For example, the standard 1-norm, polynomial Support Vector Machine algorithm takes as hyper-parameters the cost C of misclassifications and the degree d of the polynomial. Similarly, most variable selection methods take as input a statistical significance threshold or the number of variables to return. If an analyst tries several different methods for imputation, discretization, variable selection, and classification, each with several different hyper-parameter values, the number of combinations explodes and can easily reach into the thousands. Notice that model selection and optimistic estimation of performance may also happen unintentionally and implicitly in many other settings. More specifically, consider a typical publication where a new algorithm is introduced and its performance (after tuning the hyper-parameters) is compared against numerous other alternatives from the literature (again, after tuning their hyper-parameters), on several datasets. The comparison aims to comparatively evaluate the methods. However, the reported performances of the best method on each dataset suffer from the same problem of multiple inductions and are on average optimistically estimated.
We now discuss the different factors that affect the estimation. In the simulations above, we assumed that the N different models provide independent predictions. However, this is unrealistic, as the same classifier with slightly different hyper-parameters will produce models that give correlated predictions (e.g., K-NN models with K=1 and K=3 will often make the same errors). Thus, in a real analysis setting, the amount of bias may be smaller than what is expected when assuming no dependence between models. The violation of independence makes the theoretical analysis of the bias difficult, and so in this paper we rely on empirical evaluations of the different estimation protocols.
There are other factors that affect the bias, such as the difference between the performance of the best method and that of the other methods attempted, relative to the variance of the estimation. For example, if the best method attempted has a true accuracy of 85% with variance 3% and all the other methods attempted have a true accuracy of 50% with variance 3%, we do not expect considerable bias in the estimation: the best method will always be selected no matter whether its performance is overestimated or underestimated on the specific dataset, and thus on average the estimate will be unbiased. This observation actually forms the basis for the Tibshirani and Tibshirani method [4] described below.
In the remainder of the paper, we revisit the Cross-Validation (CV) protocol. We corroborate [2][5] that CV overestimates performance when it is used with hyper-parameter optimization. As expected, overestimation of performance increases with decreasing sample sizes. We then present three other performance estimation methods from the literature. The first is a simple approach that re-evaluates CV performance by using a different split of the data (CVM-CV)‡. The method by Tibshirani and Tibshirani (hereafter TT) [4] tries to estimate the bias and remove it from the estimation. The Nested Cross Validation (NCV) method [6] cross-validates the whole hyper-parameter optimization procedure (which includes an inner cross-validation, hence the name). NCV is a generalization of the technique where data is partitioned into train-validation-test sets.

‡ We thank the anonymous reviewers for suggesting the method.
We show that the behavior of the four methods is markedly different, ranging from overestimation to conservative estimation of performance, in terms of both bias and variance. To our knowledge, this is the first time these methods are compared against each other on real datasets.
There are two sets of experiments, namely with and without a feature selection preprocessing step. On one side, we expect that the models will gain predictive power from the elimination of irrelevant or superfluous variables. However, the inclusion of one further modelling step increases the number of hyper-parameter configurations to evaluate, and thus performance overestimation should increase as well. Empirically, we show that this is indeed the case. The effect of stratification is also empirically examined. Stratification is a technique that, during the partitioning of the data into folds, forces the same distribution of the outcome classes in each fold. When data are split randomly, on average, the distribution of the outcome in each fold will be the same as in the whole dataset. However, in small sample sizes or imbalanced data it could happen that a fold gets no samples that belong to one of the classes (or, in general, that the class distribution in a fold is very different from the original). Stratification ensures that this does not occur. We show that stratification has different effects depending on (a) the specific performance estimation method and (b) the performance metric. However, we argue that stratification should always be applied as a cautionary measure against excessive variance in performance estimation.
2. Cross-Validation Without Hyper-Parameter Optimization (CV)
K-fold Cross-Validation is perhaps the most common method for estimating the performance of a learning method for small and medium sample sizes. Despite its popularity, its theoretical properties are arguably not well known, especially outside the machine learning community, particularly when it is employed with simultaneous hyper-parameter optimization, as evidenced by the following common machine learning books: Duda ([7], p. 484) presents CV without discussing it in the context of model selection and only hints that it may underestimate (when used without model selection): "The jackknife [i.e., leave-one-out CV] in particular, generally gives good estimates because each of the n classifiers is quite similar to the classifier being tested …". Similarly, Mitchell [8] (pp. 112, 147, 150) mentions CV but only in the context of hyper-parameter optimization. Bishop [9] does not deal at all with issues of performance estimation and model selection. A notable exception is the book by Hastie and co-authors [10], which offers the best treatment of the subject and upon which the following sections are based. Yet, even there, CV is not discussed in the context of model selection.
Let's assume a dataset D = {⟨x_i, y_i⟩, i = 1, …, N} of identically and independently distributed (i.i.d.) predictor vectors x_i and corresponding outcomes y_i. Let us also assume that we have a single method f for learning from such data and producing a single prediction model. We will denote with f(D)(x) the output of the model produced by the learner f when trained on data D and applied on input x. The actual model produced by f on dataset D is denoted with f(D). We will denote with L(y, ŷ) the loss (error) measure of prediction ŷ when the true output is y. One common loss function is the zero-one loss: L(y, ŷ) = 1 if y ≠ ŷ, and L(y, ŷ) = 0 otherwise. Thus, the average zero-one loss of a classifier equals 1 − accuracy, i.e., it is the probability of making an incorrect classification. K-fold CV partitions the data D into K subsets called folds F_1, …, F_K. We denote with D\F_k the data excluding fold F_k and with |F_k| the sample size of fold k. The K-fold CV algorithm is shown in Algorithm 1.

Algorithm 1: K-Fold Cross-Validation, CV(D, f)
  Input:  A dataset D = {⟨x_i, y_i⟩, i = 1, …, N}
  Output: A model M
          An estimation of the performance (loss) of M
  1. Randomly partition D into K folds F_1, …, F_K
  2. M ← f(D)   // the model learned on all data D
  3. Estimation L_CV:
       L_CV ← (1/K) Σ_{k=1..K} (1/|F_k|) Σ_{⟨x_j, y_j⟩ ∈ F_k} L(y_j, f(D\F_k)(x_j))
  4. Return ⟨M, L_CV⟩
First, notice that CV should return the model learned from all data D, f(D)§. This is the model to be employed operationally for classification. It then tries to estimate the performance of the returned model by constructing K other models from the datasets D\F_k, each time excluding one fold from the training set. Each of these models is then applied on the corresponding fold serving as test set, and the loss is averaged over all samples.
Is L_CV an unbiased estimate of the loss of f(D)? First, notice that each sample is used once and only once as a test case. Thus, effectively there are as many i.i.d. test cases as samples in the dataset. Perhaps this characteristic is what makes CV so popular versus other protocols, such as repeatedly partitioning the dataset into train-test subsets. The test size being as large as possible could facilitate the estimation of the loss and its variance (although theoretical results show that there is no unbiased estimator of the variance of the CV! [11]). However, test cases are predicted with different models! If these models were trained on independent training sets of size equal to the original data D, then CV would indeed estimate the average loss of the models produced by the specific learning method on the specific task when trained with the specific sample size. As it stands though, since the models are correlated and are trained on fewer samples than the original, we can state the following:
K-Fold CV estimates the average loss of models returned by the specific learning method f on the specific classification task when trained with subsets of D of size N(K−1)/K.

§ This is often a source of confusion for some practitioners, who sometimes wonder which model to return out of the ones produced during Cross-Validation.
Since N(K−1)/K < N (e.g., for 5-fold CV, we are using 80% of the total sample size for training each time) and assuming that the learning method improves on average with larger sample sizes, we expect L_CV to be conservative (i.e., the true performance to be underestimated). Exactly how conservative it will be depends on where the classifier is operating on its learning curve for this specific task. It also depends on the number of folds K: the larger the K, the more (K−1)/K approaches 100% and the bias disappears, i.e., leave-one-out CV should be the least biased (however, there may still be significant estimation problems, see [12], p. 151, and [5] for an extreme failure of leave-one-out CV). When sample sizes are small or distributions are imbalanced (i.e., some classes are quite rare in the data), we expect most classifiers to quickly benefit from increased sample size, and thus L_CV to be more conservative.
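For concreteness, a minimal Python sketch of Algorithm 1 under the zero-one loss, using scikit-learn only for the fold partitioning; the function name and the choice of K = 5 are our own (an illustration, not the authors' implementation).

    import numpy as np
    from sklearn.base import clone
    from sklearn.model_selection import KFold

    def cv_estimate(learner, X, y, k=5, seed=0):
        """Algorithm 1: return the model f(D) trained on all data, together with the
        K-fold CV estimate of its zero-one loss (each fold is predicted by a model
        trained on the remaining K-1 folds)."""
        fold_losses = []
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
            m = clone(learner).fit(X[train_idx], y[train_idx])           # model on D \ F_k
            fold_losses.append(np.mean(m.predict(X[test_idx]) != y[test_idx]))
        return clone(learner).fit(X, y), float(np.mean(fold_losses))     # <f(D), L_CV>

As argued above, the returned loss estimate tends to be slightly conservative for the returned model, since each fold-model is trained on only (K−1)/K of the data.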
3. Cross-Validation With Hyper-Parameter Optimization (CVM)
A typical data analysis involves several steps (representing the data, imputation, discretization, variable selection or dimensionality reduction, learning a classifier), each with hundreds of available choices of algorithms in the literature. In addition, each algorithm takes several hyper-parameter values that should be tuned by the user. A general method for tuning the hyper-parameters is to try a set of predefined combinations of methods and corresponding values and select the best. We will represent this set of hyper-parameter value combinations with A, e.g., A = {⟨no variable selection, K-NN, K=5⟩, ⟨Lasso, λ=2, linear SVM, C=10⟩} when the intent is to try K-NN (with K=5) with no variable selection, and a linear SVM (with C=10) using the Lasso algorithm (with λ=2) for variable selection. The pseudo-code is shown in Algorithm 2. The symbol f(D, a)(x) now denotes the output of the model learned when using hyper-parameters a on dataset D and applied on input x. Correspondingly, the symbol f(D, a) denotes the model produced by applying hyper-parameters a on D.
The quantity L_CV(a) is now parameterized by the specific values a, and the minimizer a* of the loss (maximizer of performance) is found. The final model returned is f(D, a*), i.e., the model produced by setting the hyper-parameter values to a* and learning from all data D.
On one hand, we expect CV with model selection (hereafter, CVM) to underestimate performance because estimations are computed using models trained on only a subset of the dataset. On the other hand, we expect CVM to overestimate performance because it returns the maximum performance found after trying several hyper-parameter values. In Section 8 we examine this behavior empirically and determine (in concordance with [2], [5]) that indeed, when the sample size is relatively small and the number of models tried is in the hundreds, CVM overestimates performance. Thus, in these cases other types of estimation protocols are required.
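In scikit-learn terms, CVM corresponds closely to a grid search that refits the best configuration on all the data and reports its best cross-validated score; a sketch under that assumption (the learner, grid, and synthetic data below are purely illustrative, not those used in Section 8):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=80, n_features=20, random_state=0)  # stands in for D

    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # the set A (illustrative values)
    cvm = GridSearchCV(SVC(), param_grid, scoring="accuracy",
                       cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), refit=True)
    cvm.fit(X, y)
    model = cvm.best_estimator_   # f(D, a*): refit on all data with the selected a*
    l_cvm = 1 - cvm.best_score_   # the CVM loss estimate, optimistically biased as argued above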
4. The Double Cross Validation Method (CVM-CV)
The CVM is biased because, when trying hundreds or more learning methods, what appears to be the best one has probably also been "lucky" on the particular test sets. Thus, one idea to reduce the bias is to re-evaluate the selected, best method on different test sets. Of course, since we are limited to the given samples (dataset), it is impossible to do so on truly different test cases. One idea, thus, is to re-evaluate the selected method on a different split (partitioning) into folds and repeat Cross-Validation only for the single, selected, best method. We name this approach CVM-CV, since it sequentially performs CVM and CV for model selection and performance estimation, respectively; it is shown in Algorithm 3. The tilde symbol '~' is used to denote a returned value that is ignored.
Algorithm 2: K-Fold Cross-Validation with Hyper-Parameter Optimization (Model Selection), CVM(D, f, A)
  Input:  A dataset D = {⟨x_i, y_i⟩, i = 1, …, N}
          A set A of hyper-parameter value combinations
  Output: A model M
          An estimation of the performance (loss) of M
  1. Partition D into K folds F_1, …, F_K
  2. Estimate L_CV(a) for each a ∈ A:
       e_k(a) ← (1/|F_k|) Σ_{⟨x_j, y_j⟩ ∈ F_k} L(y_j, f(D\F_k, a)(x_j)),  k = 1, …, K
       L_CV(a) ← (1/K) Σ_k e_k(a)
  3. Find the minimizer a* of L_CV(a)   // "best hyper-parameters"
  4. M ← f(D, a*)                       // the model from all data D with the best hyper-parameters
  5. L_CVM ← L_CV(a*)
  6. Return ⟨M, L_CVM⟩
Algorithm 3: Double Cross-Validation, CVM-CV(D, f, A)
  Input:  A dataset D = {⟨x_i, y_i⟩, i = 1, …, N}
          A set A of hyper-parameter value combinations
  Output: A model M
          An estimation of the performance (loss) of M
  1. Partition D into K folds
  2. ⟨M, ~⟩ ← CVM(D, f, A)           // a* is the hyper-parameter configuration corresponding to M
  3. Estimation L_CVM-CV:
       Partition D into K new, randomly chosen folds
       ⟨~, L_CVM-CV⟩ ← CV(D, f_a*)    // f_a* is the learner f with hyper-parameters fixed to a*
  4. Return ⟨M, L_CVM-CV⟩
Notice that CVM-CV is not theoretically expected to fully remove the overestimation bias: information from the test sets of the final Cross-Validation step for performance estimation is still employed during training to select the best model. Nevertheless, our experiments show that this relatively computationally efficient approach does reduce the CVM overestimation bias.
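A sketch of CVM-CV under the same illustrative assumptions as before: after the grid search selects a configuration, only that single configuration is cross-validated again on a fresh partitioning, and the second estimate is the one reported (our own illustration, not the authors' implementation).

    from sklearn.base import clone
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=80, n_features=20, random_state=0)

    # Step 1 (CVM): select a* and refit the final model f(D, a*) on all data
    cvm = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                       scoring="accuracy", cv=StratifiedKFold(5, shuffle=True, random_state=0))
    cvm.fit(X, y)

    # Step 2 (CV on a new split): re-estimate the performance of the selected configuration only
    new_split = StratifiedKFold(5, shuffle=True, random_state=1)   # a different partitioning into folds
    l_cvm_cv = 1 - cross_val_score(clone(cvm.best_estimator_), X, y,
                                   scoring="accuracy", cv=new_split).mean()
    # The model returned is still cvm.best_estimator_; only the reported loss changes.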
5. The Tibshirani and Tibshirani (TT) Method
The TT method [4] attempts to heuristically and approximately estimate the bias of the CV error estimation due to model selection and add it to the final estimate of the loss. For each fold, the bias due to model selection is estimated as e_k(a*) − e_k(a_k), where, as before, e_k(a) is the average loss in fold k when using hyper-parameter values a, a_k is the hyper-parameter value combination that minimizes the loss for fold k, and a* is the global minimizer over all folds; the final bias estimate is the average over the folds, Bias = (1/K) Σ_k [e_k(a*) − e_k(a_k)]. Notice that if in all folds the same values provide the best performance, then these will also be selected globally, and hence a_k = a* for all k. In this case, the bias estimate will be zero. The justification of this estimate for the bias is in [4]. It is quite important to notice that TT does not require any additional model training and has minimal computational overhead.
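Given the matrix of per-fold losses e_k(a) (rows = folds, columns = hyper-parameter configurations), the TT correction is a few lines of arithmetic; the sketch below is our own illustration of the formula above, not code from [4].

    import numpy as np

    def tt_estimate(fold_losses):
        """fold_losses[k, j] = e_k(a_j): average loss in fold k for configuration a_j.
        Returns the TT-corrected loss estimate for the globally best configuration a*."""
        fold_losses = np.asarray(fold_losses, dtype=float)
        l_cv = fold_losses.mean(axis=0)             # L_CV(a_j), averaged over folds
        j_star = int(np.argmin(l_cv))               # global minimizer a*
        per_fold_best = fold_losses.min(axis=1)     # e_k(a_k): best loss within each fold
        bias = np.mean(fold_losses[:, j_star] - per_fold_best)  # (1/K) sum_k [e_k(a*) - e_k(a_k)]
        return l_cv[j_star] + bias                  # L_TT = L_CV(a*) + Bias

    # The correction is zero whenever the same configuration wins in every fold:
    print(tt_estimate([[0.2, 0.3], [0.1, 0.3], [0.3, 0.4]]))   # bias = 0, returns 0.2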
6. The Nested Cross-Validation Method (NCV)
We could not trace who first introduced or coined the name Nested Cross-Validation (NCV), but the authors have independently discovered it and have been using it since 2005 [6],[13],[14]; one early comment hinting at the method is in [15], while Witten and Frank briefly discuss the need of treating any parameter tuning step as part of the training process when assessing performance (see [12], page 286).
Algorithm 4: The Tibshirani and Tibshirani method, TT(D, f, A)
  Input:  A dataset D = {⟨x_i, y_i⟩, i = 1, …, N}
          A set A of hyper-parameter value combinations
  Output: A model M
          An estimation of the performance (loss) of M
  1. Partition D into K folds F_1, …, F_K
  2. For each a ∈ A and each fold k:
       e_k(a) ← (1/|F_k|) Σ_{⟨x_j, y_j⟩ ∈ F_k} L(y_j, f(D\F_k, a)(x_j))
       L_CV(a) ← (1/K) Σ_k e_k(a)
  3. Find the minimizer a* of L_CV(a)            // global minimizer
  4. Find the minimizers a_k of e_k(a)           // the minimizers for each fold
  5. Estimate Bias ← (1/K) Σ_k [e_k(a*) − e_k(a_k)]
  6. M ← f(D, a*)   // the model learned on all data D with the best hyper-parameters
  7. L_TT ← L_CV(a*) + Bias
  8. Return ⟨M, L_TT⟩
A similar method was used in a bioinformatics analysis as early as 2003 [16]. The main idea is to consider the model selection as part of the learning procedure f. Thus, f tests several hyper-parameter values, selects the best using CV, and returns a single model. NCV cross-validates f to estimate the performance of the average model returned by f, just as normal CV would do with any other learning method taking no hyper-parameters; it is just that f now contains an internal CV trying to select the best model. NCV is a generalization of the Train-Validation-Test protocol, where one trains on the Train set for all hyper-parameter values, selects the values that provide the best performance on the Validation set, trains a single model on Train+Validation using the best-found values, and estimates its performance on the Test set. Since Test is used only once by a single model, the performance estimation has no bias due to the model selection process. The final model is trained on all data using the best-found values a*. NCV generalizes the above protocol so as to cross-validate every step of this procedure: for each fold serving as Test, all remaining folds serve in turn as Validation, and this process is repeated for each fold serving as Test. The pseudo-code is shown in Algorithm 5. The pseudo-code is similar to CV (Algorithm 1), with CVM (Cross-Validation with Model Selection, Algorithm 2) serving as the learning function f. NCV requires a quadratic number of models to be trained with respect to the number of folds K (one model is trained for every possible pair of folds serving as test and validation, respectively) and thus it is the most computationally expensive protocol out of the four.
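In scikit-learn terms, NCV amounts to cross-validating the whole selection procedure, i.e., treating a grid-search object as the learner f in an outer CV loop; a sketch under that assumption (illustrative learner, grid, and data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=20, random_state=0)

    inner = StratifiedKFold(4, shuffle=True, random_state=0)   # K' = K - 1 inner folds
    outer = StratifiedKFold(5, shuffle=True, random_state=0)   # K outer folds

    # f = CVM: a "learner" that internally cross-validates to pick a* and then refits
    cvm = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                       scoring="accuracy", cv=inner)

    l_ncv = 1 - cross_val_score(cvm, X, y, scoring="accuracy", cv=outer).mean()  # conservative estimate
    final_model = cvm.fit(X, y).best_estimator_   # the model actually returned, f(D, a*)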
7. Stratification of Folds
In CV, folds are partitioned randomly, which should maintain on average the same class distribution in each fold. However, in cases of small sample sizes or highly imbalanced class distributions, it may happen that some folds contain no samples from one of the classes (or, in general, that the class distribution in a fold is very different from the original). In that case, the estimation of performance for that fold will exclude that class. To avoid this case, "in stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset" [5].
Algorithm 5: K-Fold Nested Cross-Validation, NCV(D, f, A)
  Input:  A dataset D = {⟨x_i, y_i⟩, i = 1, …, N}
          A set A of hyper-parameter value combinations
  Output: A model M
          An estimation of the performance (loss) of M
  1. Partition D into K folds F_1, …, F_K
  2. ⟨M, ~⟩ ← CVM(D, f, A)             // the final model, selected and trained on all of D
  3. Estimation L_NCV:
       For each fold k:
         ⟨M_k, ~⟩ ← CVM(D\F_k, f, A)    // best-performing model selected on D\F_k
         L_k ← (1/|F_k|) Σ_{⟨x_j, y_j⟩ ∈ F_k} L(y_j, M_k(x_j))
       L_NCV ← (1/K) Σ_k L_k
  4. Return ⟨M, L_NCV⟩
Notice that leave-one-out CV guarantees that each fold will be unstratified, since it contains only one sample, which can cause serious estimation problems ([12], p. 151, [5]).
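The effect is easy to see on a small imbalanced sample: plain K-fold splitting can leave folds with no instances of the rare class, whereas stratified splitting preserves the class proportions in every fold (a small illustration of scikit-learn's splitters, not an experiment from the paper):

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    y = np.array([0] * 35 + [1] * 5)          # 40 samples, 12.5% positives
    X = np.arange(len(y)).reshape(-1, 1)      # dummy predictors

    for name, splitter in [("KFold", KFold(5, shuffle=True, random_state=3)),
                           ("StratifiedKFold", StratifiedKFold(5, shuffle=True, random_state=3))]:
        positives_per_fold = [int(y[test].sum()) for _, test in splitter.split(X, y)]
        print(name, positives_per_fold)   # plain KFold often leaves some fold without positives;
                                          # the stratified version yields exactly one positive per fold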
8. Empirical Comparison of Different Protocols
We performed an empirical comparison in order to assess the characteristics of each data-analysis protocol. In particular, we focus on three specific aspects of the protocols' performance: (a) bias and variance of the estimation, (b) effect of feature selection, and (c) effect of stratification.
Notice that, assuming data are partitioned into the same folds, all methods return the same model, that is, the model returned by f on all data D using the minimizer a* of the CV error. However, each method returns a different estimate of the performance of this model.
8.1. The Experimental Set-Up
Original Datasets: Seven datasets from different scientific fields were employed for the experimentation. The computational task for each dataset consists in predicting a binary outcome on the basis of a set of numerical predictors (binary classification). Datasets were selected to have a relatively large number of samples, so that several smaller datasets that follow the same joint distribution can be sub-sampled from the original dataset; when the sample size is large, the sub-samples are relatively independent, providing independent estimates of all metrics in the experimental part. In more detail, the SPECT [17] dataset contains data from Single Photon Emission Computed Tomography images collected in both healthy and cardiac patients. Data in Gamma [18] consist of simulated registrations of high-energy gamma particles in an atmospheric Cherenkov telescope, where each registered particle can either originate from the upper atmosphere (background noise) or be a primary gamma particle (signal). Discriminating biodegradable vs. non-biodegradable molecules on the basis of their chemical characteristics is the aim of the Biodeg [19] dataset. The Bank [20] dataset was gathered through direct marketing campaigns (phone calls) of a Portuguese banking institution, for discriminating customers who want to subscribe to a term deposit from those who do not. CD4vsCD8 [21] contains the phosphorylation levels of 18 intra-cellular proteins as predictors to discriminate naïve CD4+ and CD8+ human immune system cells. SeismicBumps [22] focuses on forecasting high-energy seismic bumps (higher than 10^4 J) in coal mines; the data come from longwalls located in a Polish coal mine. Finally, the MiniBooNE dataset is taken from the first phase of the Booster Neutrino Experiment conducted at Fermilab [23]; the goal is to distinguish between electron neutrinos (signal) and muon neutrinos (background). Table 2 summarizes the datasets' characteristics. It should be noticed that the outcome distribution considerably varies across datasets.
Model Selection: To generate the hyper-parameter vectors in A, we employed two different strategies, named No Feature Selection (NFS) and With Feature Selection (WFS).
Strategy NFS (No Feature Selection) includes three different modelers: the Logistic Regression classifier ([9], p. 205), as implemented in Matlab 2013b, which takes no hyper-parameters; the Decision Tree [24], also as implemented in Matlab 2013b, with hyper-parameters MinLeaf and MinParents both varying within {1, 2, …, 10, 20, 30, 40, 50}; and Support Vector Machines, as implemented in the libsvm software [25], with linear, Gaussian, and polynomial kernels, varying the kernel hyper-parameters (the Gaussian width γ and the polynomial degree d) and the cost parameter C over predefined grids. When a classifier takes multiple hyper-parameters, all combinations of choices are tried. Overall, 247 hyper-parameter value combinations and corresponding models are produced each time to select the best.
Strategy WFS (With Feature Selection) adds feature selection methods as preprocessing steps to Strategy NFS. Two feature selection methods are tried each time, namely univariate selection and the Statistically Equivalent Signature (SES) algorithm [26]. The former simply applies a statistical test for assessing the association between each individual predictor and the target outcome (chi-square test for categorical variables and Student t-test for continuous ones). Predictors whose p-values are below a given significance threshold t are retained for successive computations. The SES algorithm [26] belongs to the family of constraint-based, Markov-Blanket-inspired feature selection methods [27]. In short, SES repetitively applies a statistical test of conditional independence for identifying the set of predictors that are associated with the outcome given any combination of the remaining variables. SES also requires the user to set a priori a significance threshold t, along with the hyper-parameter maxK that limits the number of predictors to condition upon. Both feature selection methods are coupled in turn with each modeler in order to build the hyper-parameter vectors in A. The significance threshold t is varied in {0.01, 0.05} for both methods, while maxK varies in {3, 5}, bringing the number of hyper-parameter combinations produced in Strategy WFS to 1729.
Sub-Datasets and Hold-out Datasets. Each original dataset D is partitioned into two separate, stratified parts: Dpool, containing 30% of the total samples, and the hold-out set Dhold-out, consisting of the remaining samples. Subsequently, from each Dpool, N sub-datasets are randomly sampled with replacement for each sample size in the set {20, 40, 60, 80, 100, 500, 1500}, for a total of 7 × 7 × N sub-datasets D_{i,j,k}, one for each combination of original dataset, sample size, and sub-sampling (where i indexes the original dataset, j the sample size, and k the sub-sampling).
Table 2. Datasets' characteristics. Dpool is a 30% partition from which sub-sampled datasets are produced. Dhold-out is the remaining 70% of samples from which an accurate estimation of the true performance is computed.

Dataset Name    # Samples   # Attributes   Classes ratio   |Dpool|   |Dhold-out|   Ref.
SPECT           267         22             3.85            81        186           [17]
Biodeg          1055        41             1.96            317       738           [19]
SeismicBumps    2584        18             14.2            776       1808          [22]
Gamma           19020       11             1.84            5706      13314         [18]
CD4vsCD8        24126       18             1.13            7238      16888         [21]
MiniBooNE       31104       50             2.55            9332      21772         [23]
Bank            45211       17             7.54            13564     31647         [20]
Most of the original datasets have been selected with a relatively large sample size, so that each Dhold-out is large enough to allow an accurate (low-variance) estimation of performance. In addition, the size of Dpool is also relatively large, so that each sub-sampled dataset can be approximately considered a dataset independently sampled from the data population of the problem. Nevertheless, we also include a couple of datasets with smaller sample size. We set the number of sub-samples to N = 30.
Bias and Variance of each Protocol: For each of the data-analysis protocols CVM, CVM-CV, TT, and NCV, both the stratified and the non-stratified versions are applied to each sub-dataset, in order to select the "best model / hyper-parameter values" and estimate its performance. For each sub-dataset, the same split into folds was employed for the stratified versions of CVM, CVM-CV, TT, and NCV, so that the four data-analysis protocols always select exactly the same model and differ only in the estimation of performance. For the NCV, the internal CV loop uses K′ = K − 1 folds. Some of the datasets, though, are characterized by a particularly high class ratio, and typically this leads to a scarcity of instances of the rarest class in some sub-datasets.
Figure 1. Average loss bias and variance for the AUC metric in Strategy NFS. From left to right: stratified CVM, CVM-CV, TT, and NCV. The top row contains the average bias, the second row the standard deviation of the performance estimation. The results largely vary depending on the specific dataset. In general, CVM is clearly optimistic (positive bias) for sample sizes less than or equal to 100, while NCV tends to underestimate performance. CVM-CV and TT show a behavior that is in between these two extremes. CVM has the lowest variance, at least for small sample sizes.
If the number of instances of a given class is smaller than K, we set K equal to that number, in order to ensure the presence of both classes in each fold. For NCV and NS-NCV, we forgo analyzing sub-datasets where K < 3.
The bias is computed as the loss on the hold-out set minus the loss estimated by the corresponding analysis protocol. Thus, a positive bias indicates a higher "true" error (i.e., as estimated on the hold-out set) than the one estimated by the protocol, and implies that the estimation protocol is optimistic. For each protocol, original dataset, and sample size, the mean bias, its variance, and its standard deviation are computed over the 30 sub-samplings.
Performance Metric: All algorithms are presented using a loss function L computed for each sample, averaged within each fold, and then averaged over all folds. The zero-one loss function is typically assumed, corresponding to 1 − accuracy of the classifier. A valid alternative metric for binary classification problems is the Area Under the Receiver Operating Characteristic Curve (AUC) [28]. The AUC does not depend on the prior class distribution. In contrast, the zero-one loss depends on the class distribution: for a problem with a class distribution of 50-50%, a classifier with accuracy 85% (loss 15%) has greatly improved over the baseline of a trivial classifier predicting the majority class; for a problem with an 84-16% class distribution, a classifier with 85% accuracy has not improved much over the baseline. On the other hand, computing the AUC on small test sets leads to poor estimates [29], and it is impossible when leave-one-out cross-validation is used (unless multiple predictions are pooled together, a practice that creates additional issues [29]). Moreover, the AUC cannot be expressed as a loss function L(y, ŷ) where ŷ is a single prediction. Nevertheless, all Algorithms 1-4 remain the same if we substitute L_i = 1 − AUC(f(D\F_i), F_i), i.e., the error in fold i is 1 minus the AUC of the model learned by f on all data except fold F_i, as estimated on F_i as the test set. In order to contrast the properties of the two metrics, we have performed all analyses twice, using in turn the 0-1 loss and 1 − AUC as the metrics to optimize. For both metrics, a positive bias corresponds to overestimated performance.
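As a sketch of this substitution, the per-fold loss can be computed as 1 − AUC from the held-out fold's predicted scores (illustrative code using scikit-learn; the learner and data are placeholders):

    import numpy as np
    from sklearn.base import clone
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=100, n_features=10, random_state=0)

    fold_losses = []
    for train_idx, test_idx in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
        m = clone(LogisticRegression(max_iter=1000)).fit(X[train_idx], y[train_idx])
        scores = m.predict_proba(X[test_idx])[:, 1]                 # continuous scores, required for AUC
        fold_losses.append(1 - roc_auc_score(y[test_idx], scores))  # L_i = 1 - AUC on fold F_i
    l_cv_auc = float(np.mean(fold_losses))                          # fold-averaged 1 - AUC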
8.2. Experimental Results
The results of the analysis greatly differ depending on the specific dataset, performance metric, and performance estimation method. The following Tables and Figures show the results obtained with the AUC metric, while the remaining results are provided in Appendix A.
We first comment on the results obtained in the experimentation following Strategy NFS. Figure 1 (first row) shows the average loss bias of the four methods (stratified versions), suggesting that CVM indeed overestimates performance for small sample sizes (underestimates error), corroborating the results in [2],[5]. In contrast, NCV tends to be over-pessimistic and to underestimate the real performance. CVM-CV and TT exhibit results that are between these two extremes, with CVM > CVM-CV > TT > NCV in terms of overestimation. It should be noted that the results on the SeismicBumps dataset strongly penalize the CVM, CVM-CV, and TT methods. Interestingly, this dataset has the highest ratio between outcome classes (14.2), suggesting that estimating AUC performance in highly imbalanced datasets may be challenging for these three methods.
Table 3 shows the bias averaged over all datasets: CVM overestimates AUC by up to ~17 points for small sample sizes, while CVM-CV and TT are always below 10 points, and the NCV bias never exceeds 5 points. We perform a t-test for the null hypothesis that the bias is zero, which is typically rejected: all methods usually exhibit some bias, whether positive or negative.
Figure 2. Effect of stratification using the AUC in Strategy NFS. The first column reports the average bias for the stratified versions of each method, while the second column reports it for the non-stratified ones. The effect of stratification is dependent on each specific method and dataset.
The second row of Figure 1 and Table 4 show the standard deviations of the loss bias. We apply O'Brien's modification of Levene's statistical test [30] with the null hypothesis that the variance of a method is the same as the variance of the NCV for the corresponding sample size. We note that CVM has the lowest variance, while all the other methods show almost statistically indistinguishable variances.
Table 5 reports the average performances on the hold-out set. As expected, these performances improve with sample size, because the final models are trained on a larger number of samples. The corresponding results for the accuracy metric are reported in Figure 4 and Tables 6-8 in Appendix A, and generally follow the same patterns and conclusions as the ones reported above for the AUC metric.
Figure 3. Comparing Strategy NFS (No Feature Selection) and WFS (With Feature Selection) over AUC. The first row reports the average bias of each method for Strategy NFS and WFS, respectively, while the second row provides the variance in performance estimation. CVM and TT show an evident increase in bias in Strategy WFS, presumably due to the larger hyper-parameter space explored in this setting.
The only noticeable difference is a large improvement in the average bias for the SeismicBumps dataset. A close inspection of the results for this dataset reveals that all methods tend to select models with performances close to the trivial classifier, both during model selection and during the performance estimation on the hold-out set. The selection of these models minimizes the bias, but they have little practical utility, since they tend to predict the most frequent class. This example clearly underlines accuracy's inadequacy for tasks involving highly imbalanced datasets.
Figure 2 contrasts the methods' stratified and non-stratified versions for Strategy NFS and AUC. The effect of stratification seems to be quite dependent on the specific method. The non-stratified version of NCV has larger bias and variance than the stratified version for small sample sizes, while for the other protocols the non-stratified version shows a decreased bias at the cost of larger variance (see Table 3 and Table 4). Interestingly, the results for the accuracy metric show an almost identical pattern (see Figure 5, Tables 6 and 7 in Appendix A). In general, we do not suggest forgoing stratification, given the increase in variance that doing so usually incurs.
Finally, Figure 3 shows the effect of feature selection in the analysis and contrasts Strategy NFS and Strategy WFS on the AUC metric. The average bias for both CVM and TT increases in Strategy WFS. This increase is explained by the fact that Strategy WFS explores a larger hyper-parameter space than Strategy NFS. The lack of increase in predictive power in Strategy WFS is probably due to the absence of irrelevant variables: all datasets have a limited dimensionality (at most 50 features). In terms of variance, the NCV method shows a decrease in standard deviation for small sample sizes in the experimentation with feature selection. Similar results are observed with the accuracy metric (Figure 6), where the decrease in variance is present for all the methods.
9. Related Work and Discussion
Estimating the performance of the final reported model while simultaneously selecting the best pipeline of algorithms and tuning their hyper-parameters is a fundamental task for any data analyst. Yet, arguably, these issues have not been examined in full depth in the literature. The origins of cross-validation in statistics can be traced back to the "jackknife" technique of Quenouille [31]. In machine learning, [5] studied cross-validation without model selection (the title of the paper may be confusing), comparing it against the bootstrap and reaching the important conclusions that (a) CV is preferable to the bootstrap, (b) a value of K=10 for the number of folds is preferable to leave-one-out, and (c) stratification is always preferable. In terms of theory, Bengio [11] showed that there exists no unbiased estimator of the variance of the CV performance estimation, which impacts hypothesis testing of performance using the CV. To the extent of our knowledge, the first to study the problem of bias in the context of model selection in machine learning is [3]. Varma [32] demonstrated the optimism of the CVM protocol and instead suggested the use of the NCV protocol. Unfortunately, all their experiments are performed on simulated data only. Tibshirani and Tibshirani [4] introduced the TT protocol, but they do not compare it against alternatives and they include only a proof-of-concept experiment on a single dataset. Thus, the present paper is the first work that compares all four protocols (CVM, CVM-CV, NCV, and TT) on multiple real datasets.
Based on our experiments, we found evidence that both the CVM-CV and the TT method have relatively small bias for sample sizes above 20 and have about the same variance as the NCV; the TT method does not introduce additional computational overhead.
Table 3. Average AUC bias over datasets (Strategy NFS). P-values produced by a t-test with the null hypothesis that the mean bias is zero (* P<0.05, ** P<0.01). NS stands for Non-Stratified.

Sample size   CVM        NS-CVM     CVM-CV     NS-CVM-CV   TT         NS-TT      NCV        NS-NCV
20            0.1702**   0.1581**   0.0929**   0.0407**    0.0798**   0.1237**   -0.0483*   -0.0696**
40            0.1321**   0.1367**   0.0695**   0.0398**    0.0418**   0.0744**   -0.0137    -0.0364*
60            0.1095**   0.1072**   0.0647**   0.0223**    0.0371**   0.0538*    0.0065     -0.0113
80            0.0939**   0.0933**   0.0574**   0.0348**    0.0230**   0.0447     -0.0162    -0.0147
100           0.0803**   0.0788**   0.0499**   0.0351**    0.0056     0.0296     0.0093*    0.0017
500           0.0197**   0.0172**   0.0143**   0.0079*     -0.0236**  0.0068     0.0031     0.0002
1500          -0.0023    -0.0024    -0.0031    -0.0028     -0.0132**  -0.0447**  -0.0049**  -0.0044**
Table 4. Standard deviation of AUC estimations over datasets (Strategy NFS). P-values produced by a test with the null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (* P<0.05, ** P<0.01). NS stands for Non-Stratified.

Sample size   CVM        NS-CVM     CVM-CV     NS-CVM-CV   TT         NS-TT      NCV       NS-NCV
20            0.1289**   0.1681**   0.2043     0.2113      0.2140     0.1485**   0.2073    0.2156
40            0.1063**   0.0991**   0.1571*    0.1871      0.1872     0.1026**   0.1946    0.1908
60            0.0845**   0.0881**   0.1435     0.1729      0.1439     0.0787**   0.1637    0.1855
80            0.0711**   0.0769**   0.1057**   0.1493      0.1240*    0.0742**   0.1544    0.1560
100           0.0757**   0.0806**   0.1136*    0.1352      0.1342     0.0659**   0.1474    0.1498
500           0.0758**   0.0745**   0.0834     0.0867      0.1182**   0.0436**   0.0967    0.0938
1500          0.0436     0.0441     0.0448     0.0444      0.0542*    0.0298**   0.0458    0.0460
Table 5. Average AUC on the hold-out sets (Strategy NFS). All methods use CVM for model selection and thus have the same performances on the hold-out sets. NS stands for Non-Stratified.

Sample size   CVM       NS-CVM
20            0.6903    0.6720
40            0.7385    0.7377
60            0.7753    0.7783
80            0.7934    0.7898
100           0.8015    0.8004
500           0.8555    0.8604
1500          0.9163    0.9163
However, the TT method seems to overestimate in very small sample sizes when a large number of hyper-parameter configurations are tested. Moreover, caution should still be exercised and further research is required to better investigate the properties of the TT protocol. One particularly worrisome situation is the use of TT in a leave-one-out fashion, which we advise against. In this extreme case, each fold contains a single test case. If the overall-best classifier predicts it wrong, the loss is 1. If any other classifier tried predicts it correctly, its loss is 0. When numerous classifiers are tried, at least one of them will predict the test case correctly with high probability. In this case, the bias estimate (1/K) Σ_k [e_k(a*) − e_k(a_k)] of the TT method will be equal to the loss of the best classifier. Thus, TT will estimate the loss of the best classifier found as L_TT = L_CV(a*) + Bias = 2 · L_CV(a*), i.e., twice as much as found during leave-one-out CV. To recapitulate: in leave-one-out CV, when the number of classifiers tried is high, TT estimates the loss of the best classifier found as twice its cross-validated loss, which is overly conservative.
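This degenerate behavior is easy to check numerically. In the hypothetical simulation below (our own illustration, not an experiment from the paper), each leave-one-out fold contains a single sample, many classifiers make independent errors, the fold-wise minimum loss is essentially always 0, and the TT estimate comes out as twice the cross-validated loss of the selected classifier.

    import numpy as np

    rng = np.random.default_rng(1)
    n_samples, n_classifiers, p_correct = 50, 500, 0.85
    # Leave-one-out zero-one losses: e_k(a_j) for fold k (a single sample) and classifier j
    e = (rng.random((n_samples, n_classifiers)) > p_correct).astype(float)

    l_cv = e.mean(axis=0)                          # leave-one-out L_CV(a_j) for each classifier
    j_star = int(np.argmin(l_cv))                  # selected classifier a*
    bias = np.mean(e[:, j_star] - e.min(axis=1))   # TT bias; e.min(axis=1) is (almost surely) all zeros here
    print(l_cv[j_star] + bias, 2 * l_cv[j_star])   # the two values coincide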
A previous version of this work has appeared elsewhere [33], presenting a more restricted methodological discussion and empirical comparison. Moreover, the analysis protocol of the present version markedly differs from the protocol of the previous one. In [33] the class ratio for sub-datasets below a given sample size was forced to one, i.e., the sub-sampling procedure selected an equal number of instances from the two classes, independently of the original class distribution. In the present experimentation the original class distribution is maintained in the sub-datasets, and the number of folds is dynamically changed in order to ensure at least one instance of the rarest class in each fold. This important change gives rise to a number of differences in the results of the two works. We do underline, though, that the findings and conclusions of the previous study are still valid in the context of the design of its experimentation.
We also note the concerning issue that the variance of the estimation for small sample sizes is large, again in concordance with the experiments in [2]. The authors in the latter advocate methods that may be biased but exhibit reduced variance. However, we believe that CVM is too biased no matter its variance; implicitly, the authors in [2] agree when they declare that model selection should be integrated in the performance estimation procedure in such a way that test samples are never employed for selecting the best model. Instead, they suggest as alternatives limiting the extent of the search over the hyper-parameters or performing model averaging. In our opinion, neither option is satisfactory for all analysis purposes and more research is required. One approach that we suggest is to repeat the whole analysis several times using a different random partitioning into folds each time, and to average the loss estimations. Repeating the analysis for different fold partitionings can be performed for the inner CV loop (if one employs NCV), for the outer CV loop, or for both. Averaging over several repeats reduces the component of the variance that is due to the specific partition into folds, which could be relatively high for small sample sizes.
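For the outer loop, this repetition over random partitionings is readily expressed with repeated stratified splitting; a brief sketch under the same illustrative assumptions as in the earlier examples:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                         StratifiedKFold, cross_val_score)
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=20, random_state=0)

    cvm = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring="accuracy",
                       cv=StratifiedKFold(4, shuffle=True, random_state=0))

    # 10 repeats of 5-fold NCV, each with a different partitioning into folds; averaging
    # reduces the variance component that is due to the specific split
    outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    l_ncv_avg = 1 - cross_val_score(cvm, X, y, scoring="accuracy", cv=outer).mean()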
Another source of variance of the performance estimation is the stochastic nature of certain classification algorithms. Numerous classification methods are non-deterministic and will return a different model even when trained on the same dataset. Typical examples include Artificial Neural Networks (ANNs), where the final model typically depends on the random initialization of the weights. The effect of initialization may be quite intense and result in very different models being returned. The exact same theory and algorithms presented above apply to such classifiers; however, one would expect an even larger variance of the estimated performances, because an additional variance component is added due to the stochastic nature of the classifier training. In this case, we would suggest training the same classifier multiple times and averaging the results to produce an estimate of performance.
Particularly for ANNs, we note further possible complications when using the above protocols. Let us assume that the number of epochs of weight updates is used as a hyper-parameter in the above protocols. In this case, the value of the number of epochs that optimizes the CV loss is selected, and then used to train an ANN on the full dataset. However, training the ANN on a larger sample size may require many more epochs to achieve a good fit on the dataset. Using the same number of epochs as in a smaller dataset may underfit and result in high loss. This violates the assumption made in Section 2 that the learning method improves on average with larger sample sizes. The number-of-epochs hyper-parameter is highly dependent on the sample size and thus possibly violates this assumption. To satisfy the assumption, it should be the case that training the ANN with a fixed number of epochs results in a better model (smaller loss) on average with increasing sample size. Typically, such hyper-parameters can be substituted with alternatives (e.g., a criterion that dynamically determines the number of epochs) so that performance is monotonic (on average) with sample size for any fixed values of the hyper-parameters. Thus, before using the above protocols, an analyst is advised to consider whether the monotonicity assumption holds for all hyper-parameters.
Finally, we would like to comment on the use of Bayesian non-parametric techniques, such as Gaussian Processes [34]. Such methods consider and reason with all models of a given family, averaging out all model parameters to provide a prediction. However, they still have hyper-parameters. In this case, these are defined as the free parameters of the whole learning process over which there is no marginalization (averaging out). Examples of hyper-parameters include the type of the kernel covariance function in Gaussian Processes and the parameters of the kernel function [35]. In fact, since one can compositionally combine kernels via sum and product operations, dynamically composing the appropriate kernel adds a new level of complexity to the hyper-parameter search [36]. Thus, in general, such methods still require hyper-parameters to tune, and in our opinion they do not completely obviate the need to select them. The protocols presented here could be employed to select these hyper-parameter values, types of kernels, types of priors, etc. From a different perspective, however, the value of hyper-parameters in some settings (e.g., the number of hidden units in a neural-network architecture) could be selected using Bayesian non-parametric machinery. Thus, non-parametric methods could also substitute, in some cases, the need for the protocols in this paper.
10. Conclusions
In the absence of hyper-parameter optimization (model selection), simple Cross-Validation underestimates the performance of the model returned when training on the full dataset. In the presence of learning-method and hyper-parameter optimization, simple Cross-Validation overestimates performance. Alternatives include rerunning Cross-Validation one more time for the final selected model only (CVM-CV), the method proposed by Tibshirani and Tibshirani [4] that estimates and subtracts the bias (TT), and Nested Cross-Validation (NCV), which cross-validates the entire model-selection procedure (itself containing an inner cross-validation). These alternatives reduce the bias, with Nested Cross-Validation being conservative in general and robust across datasets, although it incurs a higher computational overhead; the TT method seems promising and requires no additional training of models. We also acknowledge the limited scope of our experiments in terms of the number and type of datasets, the inclusion of other preprocessing steps in the analysis, the inclusion of hyper-parameter optimization procedures that dynamically decide which value combinations to consider, the use of other performance metrics, and experimentation with regression methods; addressing these points forms our future work towards more general answers to these research questions.
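To make concrete why the TT method needs no additional model training, the following NumPy sketch implements the correction as we read it from [4]: it only reuses the matrix of fold losses already produced by cross-validation with model selection (the variable names and data below are hypothetical).

    import numpy as np

    def tt_corrected_loss(fold_losses):
        # Tibshirani-Tibshirani correction, as we read it from [4].
        # fold_losses[k, j]: loss on fold k of the model trained on the remaining
        # folds with hyper-parameter configuration j.
        cv_loss = fold_losses.mean(axis=0)        # CV loss of each configuration
        best = int(np.argmin(cv_loss))            # configuration selected by CVM
        best_per_fold = fold_losses.min(axis=1)   # best loss achievable in each fold
        bias = np.mean(fold_losses[:, best] - best_per_fold)
        return cv_loss[best] + bias               # corrected (more conservative) estimate

    # Hypothetical example: 5 folds, 4 hyper-parameter configurations.
    rng = np.random.default_rng(0)
    print(tt_corrected_loss(rng.uniform(0.1, 0.4, size=(5, 4))))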
Acknowledgements
The work was funded by the STATegra EU FP7 project, No 306000, and by the EPILO-
GEAS GSRT ARISTEIA II project, No 3446.
References
[1] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “In-Sample and Out-of-Sample Model Selection and Error Estimation for Support Vector Machines,” IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 9, pp. 1390–1406, Sep. 2012.
[2] G. C. Cawley and N. L. C. Talbot, “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation,” J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Mar. 2010.
[3] D. D. Jensen and P. R. Cohen, “Multiple comparisons in induction algorithms,” Mach. Learn., vol. 38, pp. 309–338, 2000.
[4] R. J. Tibshirani and R. Tibshirani, “A bias correction for the minimum error rate in cross-validation,” Ann. Appl. Stat., vol. 3, no. 2, pp. 822–829, Jun. 2009.
[5] R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection,” in International Joint Conference on Artificial Intelligence, 1995, vol. 14, pp. 1137–1143.
[6] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis,” Bioinformatics, vol. 21, no. 5, pp. 631–643, Mar. 2005.
[7] R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern Classification (2nd Edition),” Oct. 2000.
[8] T. M. Mitchell, “Machine Learning,” Mar. 1997.
[9] C. M. Bishop, “Pattern Recognition and Machine Learning (Information Science and Statistics),” Aug. 2006.
[10] T. Hastie, R. Tibshirani, and J. Friedman, “The Elements of Statistical Learning,” Elements, vol. 1, pp. 337–387, 2009.
[11] Y. Bengio and Y. Grandvalet, “Bias in Estimating the Variance of K-Fold Cross-Validation,” in Statistical Modeling and Analysis for Complex Data Problems, vol. 1, 2005, pp. 75–95.
[12] I. H. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems),” Jun. 2005.
[13] V. Lagani and I. Tsamardinos, “Structure-based variable selection for survival data,” Bioinformatics, vol. 26, no. 15, pp. 1887–1894, 2010.
[14] A. Statnikov, I. Tsamardinos, Y. Dosbayev, and C. F. Aliferis, “GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data,” Int. J. Med. Inform., vol. 74, no. 7–8, pp. 491–503, Aug. 2005.
[15] S. Salzberg, “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach,” Data Min. Knowl. Discov., vol. 1, no. 3, pp. 317–328, 1997.
[16] N. Iizuka, M. Oka, H. Yamada-Okabe, M. Nishida, Y. Maeda, N. Mori, T. Takao, T. Tamesa, A. Tangoku, H. Tabuchi, K. Hamada, H. Nakayama, H. Ishitsuka, T. Miyamoto, A. Hirabayashi, S. Uchimura, and Y. Hamamoto, “Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection,” Lancet, vol. 361, no. 9361, pp. 923–929, Mar. 2003.
[17] L. A. Kurgan, K. J. Cios, R. Tadeusiewicz, M. Ogiela, and L. S. Goodenday, “Knowledge discovery approach to automated cardiac SPECT diagnosis,” Artif. Intell. Med., vol. 23, no. 2, pp. 149–169, Oct. 2001.
[18] R. K. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M. Jiřina, J. Klaschka, E. Kotrč, P. Savický, S. Towers, A. Vaiciulis, and W. Wittek, “Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope,” Nucl. Instruments Methods Phys. Res. Sect. A Accel. Spectrometers, Detect. Assoc. Equip., vol. 516, no. 2–3, pp. 511–528, Jan. 2004.
[19] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, and V. Consonni, “Quantitative structure-activity relationship models for ready biodegradability of chemicals,” J. Chem. Inf. Model., vol. 53, pp. 867–878, 2013.
[20] S. Moro and R. M. S. Laureano, “Using Data Mining for Bank Direct Marketing: An application of the CRISP-DM methodology,” Eur. Simul. Model. Conf., pp. 117–121, 2011.
[21] S. C. Bendall, E. F. Simonds, P. Qiu, E. D. Amir, P. O. Krutzik, R. Finck, R. V Bruggner, R. Melamed, A. Trejo, O. I. Ornatsky, R. S. Balderas, S. K. Plevritis, K. Sachs, D. Pe’er, S. D. Tanner, and G. P. Nolan, “Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum,” Science, vol. 332, no. 6030, pp. 687–696, May 2011.
[22] M. Sikora and L. Wrobel, “Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines,” Arch. Min. Sci., vol. 55, no. 1, pp. 91–114, 2010.
[23] A. A. Aguilar-Arevalo, A. O. Bazarko, S. J. Brice, B. C. Brown, L. Bugel, J. Cao, L. Coney, J. M. Conrad, D. C. Cox, A. Curioni, Z. Djurcic, D. A. Finley, B. T. Fleming, R. Ford, F. G. Garcia, G. T. Garvey, C. Green, J. A. Green, T. L. Hart, E. Hawker, R. Imlay, R. A. Johnson, P. Kasper, T. Katori, T. Kobilarcik, I. Kourbanis, S. Koutsoliotas, E. M. Laird, J. M. Link, Y. Liu, Y. Liu, W. C. Louis, K. B. M. Mahn, W. Marsh, P. S. Martin, G. McGregor, W. Metcalf, P. D. Meyers, F. Mills, G. B. Mills, J. Monroe, C. D. Moore, R. H. Nelson, P. Nienaber, S. Ouedraogo, R. B. Patterson, D. Perevalov, C. C. Polly, E. Prebys, J. L. Raaf, H. Ray, B. P. Roe, A. D. Russell, V. Sandberg, R. Schirato, D. Schmitz, M. H. Shaevitz, F. C. Shoemaker, D. Smith, M. Sorel, P. Spentzouris, I. Stancu, R. J. Stefanski, M. Sung, H. A. Tanaka, R. Tayloe, M. Tzanov, R. Van de Water, M. O. Wascko, D. H. White, M. J. Wilking, H. J. Yang, G. P. Zeller, and E. D. Zimmerman, “Search for electron neutrino appearance at the Δm² ~ 1 eV² scale,” Phys. Rev. Lett., vol. 98, p. 231801, 2007.
[24] D. Coppersmith, S. J. Hong, and J. R. M. Hosking, “Partitioning Nominal Attributes in Decision Trees,” Data Min. Knowl. Discov., vol. 3, pp. 197–217, 1999.
[25] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 27, pp. 1–27, 2011.
[26] I. Tsamardinos, V. Lagani, and D. Pappas, “Discovering multiple, equivalent biomarker signatures,” in 7th Conference of the Hellenic Society for Computational Biology and Bioinformatics (HSCBB12), 2012.
[27] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing Bayesian network structure learning algorithm,” Mach. Learn., vol. 65, no. 1, pp. 31–78, 2006.
[28] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, pp. 861–874, 2006.
[29] A. Airola, T. Pahikkala, W. Waegeman, B. De Baets, and T. Salakoski, “A comparison of AUC estimators in small-sample studies,” J. Mach. Learn. Res. W&CP, vol. 8, pp. 3–13, 2010.
[30] R. G. O’brien, “A General ANOVA Method for Robust Tests of Additive Models for Variances,” J. Am. Stat. Assoc., vol. 74, no. 368, pp. 877–880, Dec. 1979.
[31] M. H. Quenouille, “Approximate tests of correlation in time-series 3,” Math. Proc. Cambridge Philos. Soc., vol. 45, no. 3, pp. 483–484, Oct. 1949.
[32] S. Varma and R. Simon, “Bias in error estimation when using cross-validation for model selection,” BMC Bioinformatics, vol. 7, p. 91, Jan. 2006.
[33] I. Tsamardinos, V. Lagani, and A. Rakhshani, “Performance-Estimation Properties of Cross-Validation-Based Protocols with Simultaneous Hyper-Parameter Optimization,” in SETN’14: Proceedings of the 8th Hellenic Conference on Artificial Intelligence, 2014.
[34] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, vol. 14, 2006.
[35] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[36] D. Duvenaud, J. Lloyd, R. Grosse, J. Tenenbaum, and Z. Ghahramani, “Structure discovery in nonparametric regression through compositional kernel search,” in Proceedings of the International Conference on Machine Learning (ICML), 2013, vol. 30, pp. 1166–1174.
Appendix A.
Figure 4. Average loss and variance for the accuracy metric in Strategy NFS. From left to right: stratified CVM, CVM-CV, TT, and NCV. The top row contains the average bias, the second row the standard deviation of the bias. The results vary considerably depending on the specific dataset. In general, CVM is clearly optimistic for sample sizes less than or equal to 100, while NCV tends to underestimate performance. CVM-CV and TT show a behavior that lies between these two extremes. CVM has the lowest variance, at least for small sample sizes.
Figure 5. Effect of stratification using accuracy in Strategy NFS. The first column reports the average bias for the stratified versions of each method, while the second column reports it for the non-stratified ones. The effect of stratification depends on the specific method and dataset.
Figure 6. Comparing Strategy NFS (No Feature Selection) and WFS (With Feature Selection) in terms of accuracy. The first row reports the average bias of each method for Strategy NFS and WFS, respectively, while the second row provides the standard deviation of the bias. CVM and TT show an evident increase in bias in Strategy WFS, presumably due to the larger hyper-parameter space explored in this setting.
Table 6. Average Accuracy Bias over Datasets (Strategy NFS). P-values produced by a t-test with null hypothesis the mean bias is zero (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size    CVM         NS-CVM      CVM-CV      NS-CVM-CV   TT          NS-TT       NCV         NS-NCV
20             0.1013**    0.0948**    0.0590**    0.0458**    0.0486**    0.0315**    -0.0127     -0.0311**
40             0.0685**    0.0682**    0.0443**    0.0357**    0.0168**    0.0096*     -0.0096     -0.0236**
60             0.0669**    0.0617**    0.0477**    0.0434**    0.0204**    0.0113**    0.0144**    0.0171**
80             0.0567**    0.0549**    0.0397**    0.0389**    0.0139**    0.0074*     0.0166**    0.0090**
100            0.0466**    0.0422**    0.0337**    0.0263**    -0.0050     -0.0084**   0.0093      0.0039
500            0.0076**    0.0075**    0.0036**    0.0028**    -0.0140**   -0.0141**   -0.0010     -0.0020
1500           0.0023**    0.0023**    0.0004      0.0010      -0.0091**   -0.0092**   -0.0015     -0.0020*
Table 7. Standard deviation of Accuracy estimations over Datasets (Strategy NFS). P-values produced by a test with null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size    CVM         NS-CVM      CVM-CV      NS-CVM-CV   TT          NS-TT       NCV         NS-NCV
20             0.0766**    0.0862**    0.1086**    0.1067**    0.1242      0.1509      0.1390      0.1374
40             0.0591**    0.0600**    0.0747*     0.0826      0.0962      0.1033      0.0914      0.1118**
60             0.0435**    0.0463**    0.0595      0.0537*     0.0701      0.0786*     0.0650      0.0639
80             0.0431**    0.0454**    0.0522*     0.0520*     0.0675      0.0732      0.0629      0.0709
100            0.0444**    0.0422**    0.0510      0.0508      0.0682**    0.0663*     0.0564      0.0575
500            0.0359      0.0357      0.0356      0.0366      0.0432*     0.0436**    0.0368      0.0370
1500           0.0278      0.0282      0.0281      0.0275      0.0290      0.0299      0.0277      0.0278
Table 8. Average Accuracy on the hold-out sets (Strategy NFS). All methods use CVM for model selection and thus have the same performances on the hold-out sets. NS stands for Non-Stratified.

Sample size    CVM       NS-CVM
20             0.7699    0.7639
40             0.8061    0.8016
60             0.8186    0.8203
80             0.8296    0.8280
100            0.8351    0.8377
500            0.8816    0.8814
1500           0.8805    0.8805
Table 9. Average AUC Bias over Datasets (Strategy WFS). P-values produced by a t-test with null hypothesis the mean bias is zero (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size    CVM         NS-CVM      CVM-CV      NS-CVM-CV   TT          NS-TT       NCV         NS-NCV
20             0.2589**    0.2821**    0.0651**    0.0128      0.2139**    0.1487**    -0.0634**   -0.1124**
40             0.1628**    0.1729**    0.0698**    0.0418**    0.0886**    0.0691**    -0.0359*    -0.0418**
60             0.1257**    0.1349**    0.0540**    0.0362**    0.0510**    0.0563*     -0.0081     -0.0084
80             0.1034**    0.1094**    0.0475**    0.0484**    0.0321**    0.0381      -0.0222     -0.0170
100            0.0985**    0.1029**    0.0595**    0.0525**    0.0284**    0.0294      0.0174**    0.0017*
500            0.0239**    0.0259**    0.0120**    0.0143**    -0.0242**   -0.0014     0.0079      0.0018
1500           0.0007      0.0001      -0.0007     -0.0005     -0.0122**   -0.0471**   -0.0031     -0.0035**
Table 10. Standard deviation of AUC estimations over Datasets (Strategy WFS). P-values produced by a test with null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size    CVM         NS-CVM      CVM-CV      NS-CVM-CV   TT          NS-TT       NCV         NS-NCV
20             0.0685**    0.0596**    0.2202      0.2390*     0.1289**    0.1303**    0.2078      0.2456*
40             0.0711**    0.0769**    0.1584**    0.1911      0.1345**    0.0974**    0.2020      0.1896
60             0.0692**    0.0669**    0.1344      0.1590      0.1281*     0.0859**    0.1636      0.1669
80             0.0607**    0.0624**    0.1133**    0.1214**    0.1095**    0.0773**    0.1686      0.1581
100            0.0622**    0.0643**    0.1032*     0.1283      0.1168      0.0717**    0.1386      0.1689
500            0.0601**    0.0599**    0.0824      0.0740      0.1003**    0.0468**    0.0754      0.0882
1500           0.0408      0.0410      0.0424      0.0416      0.0519*     0.0312**    0.0443      0.0444
Table 11. Average AUC performance on the hold-out sets (Strategy WFS). All methods use CVM for model selection and thus have the same performances on the hold-out sets. NS stands for Non-Stratified.

Sample size    CVM       NS-CVM
20             0.6887    0.6871
40             0.7496    0.7471
60             0.7794    0.7720
80             0.7983    0.7944
100            0.8090    0.8030
500            0.8643    0.8630
1500           0.9160    0.9165
Table 12. Average accuracy Bias over Datasets (Strategy WFS). P-values produced by a t-test with null hypothesis the mean bias is zero (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size    CVM         NS-CVM      CVM-CV      NS-CVM-CV   TT          NS-TT       NCV         NS-NCV
20             0.1439**    0.1507**    0.0555**    0.0214**    0.0763**    0.0757**    -0.0231     -0.0612**
40             0.0855**    0.0854**    0.0505**    0.0395**    0.0236**    0.0185**    -0.0060     -0.0325**
60             0.0690**    0.0671**    0.0411**    0.0393**    0.0121*     0.0075      0.0035      0.0039
80             0.0601**    0.0583**    0.0368**    0.0395**    0.0044      0.0026      0.0095*     -0.0013
100            0.0598**    0.0584**    0.0418**    0.0378**    0.0028      0.0005      0.0144**    0.0132**
500            0.0117**    0.0103**    0.0055**    0.0045**    -0.0173**   -0.0192**   -0.0031     -0.0035*
1500           0.0038**    0.0039**    0.0021**    0.0013      -0.0109**   -0.0108**   -0.0020*    -0.0019*
Table 13. Standard deviation of accuracy estimations over Datasets (Strategy WFS). P-values produced by a test with null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size    CVM         NS-CVM      CVM-CV      NS-CVM-CV   TT          NS-TT       NCV         NS-NCV
20             0.0593**    0.0703**    0.1384      0.1407      0.1143      0.1313      0.1286      0.1482
40             0.0488**    0.0528**    0.0706      0.0734      0.0885      0.0948      0.0812      0.1028**
60             0.0453**    0.0472**    0.0624*     0.0612**    0.0817      0.0858      0.0759      0.0711
80             0.0446**    0.0440**    0.0566      0.0500**    0.0761*     0.0766**    0.0650      0.0776*
100            0.0408**    0.0420**    0.0468**    0.0484*     0.0663*     0.0703**    0.0567      0.0566
500            0.0378      0.0371      0.0387      0.0388      0.0477*     0.0469*     0.0410      0.0405
1500           0.0299      0.0296      0.0300      0.0301      0.0318      0.0313      0.0300      0.0293
Table 14. Average accuracy performance on the hold-out sets (Strategy WFS). All methods use CVM for model selection and thus have the same performances on the hold-out sets. NS stands for Non-Stratified.

Sample size    CVM       NS-CVM
20             0.7571    0.7542
40             0.8051    0.8001
60             0.8206    0.8204
80             0.8308    0.8303
100            0.8333    0.8320
500            0.8791    0.8806
1500           0.8802    0.8804