International Journal on Artificial Intelligence Tools
Vol. XX, No. X (2015) 1–30
World Scientific Publishing Company
PERFORMANCE-ESTIMATION PROPERTIES OF CROSS-VALIDATION-BASED PROTOCOLS WITH SIMULTANEOUS HYPER-PARAMETER OPTIMIZATION
Ioannis Tsamardinos
Department of Computer Science, University of Crete, and
Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH)
Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece
tsamard.it@gmail.com
Amin Rakhshani
Department of Computer Science, University of Crete, and
Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), Vassilika Vouton,
Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece
aminra@ics.forth.gr
Vincenzo Lagani
Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), Vassilika Vouton,
Heraklion, GR-700 13, Greece
vlagani@ics.forth.gr
Received (Day Month Year)
Revised (Day Month Year)
Accepted (Day Month Year)
In a typical supervised data analysis task, one needs to perform the following two tasks: (a) select an
optimal combination of learning methods (e.g., for variable selection and classifier) and tune their
hyper-parameters (e.g., K in K-NN), also called model selection, and (b) provide an estimate of the
performance of the final, reported model. Combining the two tasks is not trivial because when one
selects the set of hyper-parameters that seem to provide the best estimated performance, this estima-
tion is optimistic (biased / overfitted) due to performing multiple statistical comparisons. In this pa-
per, we discuss the theoretical properties of performance estimation when model selection is present
and we confirm that the simple Cross-Validation with model selection is indeed optimistic (overes-
timates performance) in small sample scenarios and should be avoided. We present in detail the Nested Cross Validation protocol and a method by Tibshirani and Tibshirani for removing the estimation bias, and we investigate their theoretical properties. In computa-
tional experiments with real datasets both protocols provide conservative estimation of performance
and should be preferred. These statements hold true even if feature selection is performed as prepro-
cessing.
Keywords: Performance Estimation; Model Selection; Cross Validation; Stratification; Comparative
Evaluation.
1. Introduction
A typical supervised analysis (e.g., classification or regression) consists of several steps
that result in a final, single prediction, or diagnostic model. For example, the analyst may
need to impute missing values, perform variable selection or general dimensionality re-
duction, discretize variables, try several different representations of the data, and finally,
apply a learning algorithm for classification or regression. Each of these steps requires a
selection of algorithms out of hundreds or even thousands of possible choices, as well as
the tuning of their hyper-parameters*. Hyper-parameter optimization is also called the
model selection problem since each combination of hyper-parameters tried leads to a pos-
sible classification or regression model out of which the best is to be selected. There are
several alternatives in the literature about how to identify a good combination of methods
and their hyper-parameters (e.g., [1][2]) and they all involve implicitly or explicitly
searching the space of hyper-parameters and trying different combinations. Unfortunate-
ly, trying multiple combinations, estimating their performance, and reporting the perfor-
mance of the best model found leads to overestimating the performance (i.e., underestimating the error/loss), sometimes also referred to as overfitting†. This phenomenon is
called the problem of multiple comparisons in induction algorithms and has been ana-
lyzed in detail in [3] and is related to the multiple testing or multiple comparisons in sta-
tistical hypothesis testing. Intuitively, when one selects among several models whose
estimations vary around their true mean value, it becomes likely that what seems to be the
best model has been “lucky” in the specific test set and its performance has been overes-
timated. Extensive discussions and experiments on the subject can be found in [2].
* We use the term “hyper-parameters” to denote the algorithm parameters that can be set by the user and are not estimated directly from the data, e.g., the parameter K in the K-NN algorithm. In contrast, the term “parameters” in the statistical literature typically refers to the model quantities that are estimated directly from the data, e.g., the weight vector w in a linear regression model y = wᵀx + b. See [2] for a definition and discussion too.
† The term “overfitting” is a more general term and we prefer the term “overestimating” to characterize this phenomenon.
An intuitive small example now follows. Suppose method M1 has 85% true accuracy and method M2 has 83% true accuracy on a given classification task when trained with a randomly selected dataset of a given size. In 4 randomly drawn training and corresponding test sets on the same problem, the accuracy estimates may be 80, 82, 88, and 90 percent for M1 and 88, 85, 79, and 79 percent for M2. If M1 were evaluated by itself, its mean accuracy would be estimated as 85%; for M2 it would be 82.75%. Both estimates are close to the true means. If performance estimations were perfect, M1 would be chosen each time and the average performance of the models returned with model selection would be 85%. However, when both methods are tried, the best is selected, and the maximum performance is reported, we obtain the series of estimates 88, 85, 88, 90, whose average is 87.75% and is in general optimistically biased. A larger example and contrived experiment now follows:
Example: In a binary classification problem, an analyst tries N different classification
algorithms, producing N corresponding models from the data. They estimate the perfor-
mance (accuracy) of each model on a test set of M samples. They then select the model that exhibits the best estimated accuracy B̂ and report this performance as the estimated performance of the selected model. Let's assume that all models have the same true accuracy of 85%. What is the expected value of the estimated B̂, and how biased is it?
Let's denote the true accuracy of each model with Pᵢ = 0.85 and its accuracy as estimated on the test set with P̂ᵢ. The true performance of the final model is of course also B = maxᵢ Pᵢ = 0.85. But the estimated performance B̂ = maxᵢ P̂ᵢ is biased. Table 1 shows E[B̂] for different values of N and M, assuming each model makes independent errors on the test set, as estimated with 10,000 simulations. The table also shows the 5th and 95th percentiles as an indication of the range of the estimation. Invariably, the expected estimated accuracy E[B̂] of the final model is overestimated.
As expected, the bias increases with the number of models tried and decreases with
the size of the test set. For sample sizes less than or equal to 100, the bias is significant:
when the number of models produced is larger than 100, it is not uncommon to estimate
the performance of the best model as 100%. Notice that, when using Cross Validation-
based protocols to estimate performance each sample serves once and only once as a test
case. Thus, one can consider the total data-set sample size as the size of the test set. Typ-
ical high-dimensional datasets in biology often contain less than 100 samples and thus,
one should be careful with the estimation protocols employed for their analysis.
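To make the simulation behind Table 1 concrete, the following Python sketch (ours, not the authors' original code) estimates E[B̂] and its 5th/95th percentiles for N models of equal true accuracy making independent errors on a test set of M samples; the function name and default values are illustrative.

```python
import numpy as np

def expected_max_estimated_accuracy(n_models, test_size, true_acc=0.85,
                                    n_sim=10000, seed=0):
    """Simulate selecting the best of n_models classifiers with identical true
    accuracy, evaluated on a test set of test_size samples with independent
    errors; return the mean and the 5th/95th percentiles of the reported
    (maximum) estimated accuracy."""
    rng = np.random.default_rng(seed)
    correct = rng.binomial(test_size, true_acc, size=(n_sim, n_models))
    best_estimate = correct.max(axis=1) / test_size   # B_hat = max_i P_hat_i
    return best_estimate.mean(), np.percentile(best_estimate, [5, 95])

# e.g., 10 models on a 100-sample test set gives roughly 0.90 [0.87; 0.93]
mean_b, (p5, p95) = expected_max_estimated_accuracy(10, 100)
print(f"{mean_b:.3f} [{p5:.2f}; {p95:.2f}]")
```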
What about the number of different models tried in an analysis? Is it realistic to ex-
pect an analyst to generate thousands of different models? Obviously, it is very rare that
any analyst will employ thousands of different algorithms; however, most learning algo-
rithms are parameterized by several different hyper-parameters. For example, the standard 1-norm, polynomial Support Vector Machine algorithm takes as hyper-parameters the cost C of misclassifications and the degree d of the polynomial.
Table 1. Average estimated accuracy E[B̂] when reporting the (estimated) performance of N models with equal true accuracy of 85%. In brackets, the 5th and 95th percentiles are shown. The smaller the sample size and the larger the number of models N out of which selection is performed, the larger the overestimation.

                                          Test set sample size
Number of models   20                   80                   100                  500                  1000
5                  0.935 [0.85; 1.00]   0.895 [0.86; 0.94]   0.891 [0.86; 0.93]   0.868 [0.85; 0.89]   0.863 [0.85; 0.88]
10                 0.959 [0.90; 1.00]   0.908 [0.88; 0.94]   0.902 [0.87; 0.93]   0.874 [0.86; 0.89]   0.867 [0.86; 0.88]
20                 0.977 [0.95; 1.00]   0.920 [0.89; 0.95]   0.913 [0.89; 0.94]   0.879 [0.87; 0.89]   0.871 [0.86; 0.88]
50                 0.993 [0.95; 1.00]   0.933 [0.91; 0.96]   0.925 [0.90; 0.95]   0.885 [0.87; 0.90]   0.875 [0.87; 0.88]
100                0.999 [1.00; 1.00]   0.941 [0.93; 0.96]   0.932 [0.91; 0.95]   0.889 [0.88; 0.90]   0.878 [0.87; 0.89]
1000               1.000 [1.00; 1.00]   0.962 [0.95; 0.97]   0.952 [0.94; 0.97]   0.899 [0.89; 0.91]   0.885 [0.88; 0.89]
Similarly, most variable
selection methods take as input a statistical significance threshold or the number of varia-
bles to return. If an analyst tries several different methods for imputation, discretization,
variable selection, and classification, each with several different hyper-parameter values,
the number of combinations explodes and can easily reach into the thousands. Notice
that, model selection and optimistic estimation of performance may also happen uninten-
tionally and implicitly in many other settings. More specifically, consider a typical publi-
cation where a new algorithm is introduced and its performance (after tuning the hyper-
parameters) is compared against numerous other alternatives from the literature (again,
after tuning their hyper-parameters), on several datasets. The comparison aims to com-
paratively evaluate the methods. However, the reported performances of the best method
on each dataset suffer from the same problem of multiple inductions and are on average
optimistically estimated.
We now discuss the different factors that affect estimation. In the simulations above,
we assume that the N different models provide independent predictions. However, this is
unrealistic as the same classifier with slightly different hyper-parameters will produce
models that give correlated predictions (e.g., K-NN models with K=1 and K=3 will often
make the same errors). Thus, in a real analysis setting, the amount of bias may be smaller
than what is expected when assuming no dependence between models. The violation of
independence makes the theoretical analysis of the bias difficult and so in this paper, we
rely on the empirical evaluations of the different estimation protocols.
There are other factors that affect the bias. For example, the difference between the performance of the best method and that of the other methods attempted, relative to the variance of the estimation, affects the bias. If the best method attempted has a true accuracy
of 85% with variance 3% and all the other methods attempted have a true accuracy of
50% with variance 3%, we do not expect considerable bias in the estimation: the best
method will always be selected no matter whether its performance is overestimated or
underestimated with the specific dataset, and thus on average it will be unbiased. This
observation actually forms the basis for the Tibshirani and Tibshirani method [4] de-
scribed below.
In the remainder of the paper, we revisit the Cross-Validation (CV) protocol. We cor-
roborate [2][5] that CV overestimates performance when it is used with hyper-parameter
optimization. As expected overestimation of performance increases with decreasing sam-
ple sizes. We present three other performance estimation methods in the literature. The
first is a simple approach that re-evaluates CV performance by using a different split of
the data (CVM-CV)†. The method by Tibshirani and Tibshirani (hereafter TT) [4] tries to
estimate the bias and remove it from the estimation. The Nested Cross Validation (NCV)
method [6] cross-validates the whole hyper-parameter optimization procedure (which
includes an inner cross-validation, hence the name). NCV is a generalization of the tech-
nique where data is partitioned in train-validation-test sets.
† We thank the anonymous reviewers for suggesting the method.
We show that the behavior of the four methods is markedly different, ranging from overestimation to conservative estimation of performance, both in terms of bias and variance. To our
knowledge, this is the first time these methods are compared against each other on real
datasets.
There are two sets of experiments, namely with and without a feature selection pre-
processing step. On the one hand, we expect that the models will gain predictive power from
the elimination of irrelevant or superfluous variables. However, the inclusion of one fur-
ther modelling step increases the number of hyper-parameter configurations to evaluate,
and thus performance overestimation should increase as well. Empirically, we show that
this is indeed the case. The effect of stratification is also empirically examined. Stratification is a technique that, during partitioning of the data into folds, forces each fold to have the same distribution of the outcome classes. When data are split randomly, on average, the
distribution of the outcome in each fold will be the same as in the whole dataset. Howev-
er, in small sample sizes or imbalanced data it could happen that a fold gets no samples
that belong in one of the classes (or in general, the class distribution in a fold is very dif-
ferent from the original). Stratification ensures that this does not occur. We show that
stratification has different effects depending on (a) the specific performance estimation
method and (b) the performance metric. However, we argue that stratification should
always be applied as a cautionary measure against excessive variance in performance.
2. Cross-Validation Without Hyper-Parameter Optimization (CV)
K-fold Cross Validation is perhaps the most common method for estimating performance
of a learning method for small and medium sample sizes. Despite its popularity, its theo-
retical properties are arguably not well known especially outside the machine learning
community, particularly when it is employed with simultaneous hyper-parameter optimi-
zation, as evidenced by the following common machine learning books: Duda ([7], p.
484) presents CV without discussing it in the context of model selection and only hints
that it may underestimate (when used without model selection): “The jackknife [i.e., leave-one-out CV] in particular, generally gives good estimates because each of the n classifiers is quite similar to the classifier being tested …”.
Algorithm 1: K-Fold Cross-Validation, CV(f, D)
Input: A dataset D = {(x_i, y_i), i = 1, …, N}
Output: A model M
        An estimation of performance (loss) L_CV of M
Randomly partition D into K folds F_1, …, F_K
M ← f(D)   // the model learned on all data D
Estimation:
    L_CV ← (1/K) Σ_{i=1..K} (1/|F_i|) Σ_{(x,y)∈F_i} L(f(D_-i, x), y)
Return ⟨M, L_CV⟩
Similarly, Mitchell [8] (pp. 112, 147, 150) mentions CV but only in the context of hyper-parameter optimization.
Bishop [9] does not deal at all with issues of performance estimation and model selection.
A notable exception is the book by Hastie and co-authors [10], which offers the best treatment of the subject and upon which the following sections are based. Yet, CV is still not discussed
in the context of model selection.
Let's assume a dataset D = {(x_i, y_i), i = 1, …, N} of identically and independently distributed (i.i.d.) predictor vectors x_i and corresponding outcomes y_i. Let us also assume that we have a single method f for learning from such data and producing a single prediction model. We will denote with f(D, x) the output of the model produced by the learner f when trained on data D and applied on input x. The actual model produced by f on dataset D is denoted with f(D). We will denote with L(ŷ, y) the loss (error) measure of prediction ŷ when the true output is y. One common loss function is the zero-one loss function: L(ŷ, y) = 0 if ŷ = y and L(ŷ, y) = 1 otherwise. Thus, the average zero-one loss of a classifier equals 1 − accuracy, i.e., it is the probability of making an incorrect classification. K-fold CV partitions the data D into K subsets called folds F_1, …, F_K. We denote with D_-i the data excluding fold F_i and with |F_i| the sample size of each fold. The K-fold CV algorithm is shown in Algorithm 1.
First, notice that CV should return the model learned from all data D, f(D)§. This
is the model to be employed operationally for classification. It then tries to estimate the
performance of the returned model by constructing K other models from datasets D_-i,
each time excluding one fold from the training set. Each of these models is then applied
on each fold serving as test and the loss is averaged over all samples.
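For concreteness, here is a minimal Python sketch of Algorithm 1 (a sketch under our own naming, not the authors' implementation), using a K-NN learner and the zero-one loss as placeholders:

```python
import numpy as np
from sklearn.metrics import zero_one_loss
from sklearn.neighbors import KNeighborsClassifier

def cv(learner, X, y, K=5, seed=0):
    """K-fold CV (Algorithm 1): return the model trained on ALL the data
    together with the cross-validated estimate of its loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)   # random partition into K folds
    fold_losses = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        m = learner().fit(X[train], y[train])            # model trained without fold i
        fold_losses.append(zero_one_loss(y[test], m.predict(X[test])))
    final_model = learner().fit(X, y)                    # the returned model f(D)
    return final_model, float(np.mean(fold_losses))

# usage sketch (X, y: numpy arrays of predictors and 0/1 outcomes):
# model, est_loss = cv(lambda: KNeighborsClassifier(n_neighbors=5), X, y)
```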
Is L_CV an unbiased estimate of the loss of f(D)? First, notice that each sample is
used once and only once as a test case. Thus, effectively there are as many i.i.d. test cases
as samples in the dataset. Perhaps, this characteristic is what makes the CV so popular
versus other protocols such as repeatedly partitioning the dataset to train-test subsets. The
test size being as large as possible could facilitate the estimation of the loss and its vari-
ance (although, theoretical results show that there is no unbiased estimator of the variance
for the CV! [11]). However, test cases are predicted with different models! If these mod-
els were trained on independent train sets of size equal to the original data D, then CV
would indeed estimate the average loss of the models produced by the specific learning
method on the specific task when trained with the specific sample size. As it stands
though, since the models are correlated and are trained on smaller datasets than the original, we can state the following:
K-fold CV estimates the average loss of models returned by the specific learning method f on the specific classification task when trained with subsets of D of size N·(K−1)/K.
§ This is often a source of confusion for some practitioners who sometimes wonder which model to return out of
the ones produced during Cross-Validation.
Since the training sets have size N·(K−1)/K < N (e.g., for 5-fold CV we are using 80% of the total sample size for training each time) and assuming that the learning method improves on average with larger sample sizes, we expect L_CV to be conservative (i.e., to underestimate the true performance). Exactly how conservative it will be depends on where the
classifier is operating on its learning curve for this specific task. It also depends on the
number of folds K: the larger the K, the more (K-1)/K approaches 100% and the bias dis-
appears, i.e., leave-one-out CV should be the least biased (however, there may still be significant estimation problems; see [12], p. 151, and [5] for an extreme failure of leave-
one-out CV). When sample sizes are small or distributions are imbalanced (i.e., some
classes are quite rare in the data), we expect most classifiers to quickly benefit from in-
creased sample size, and thus L_CV to be more conservative.
3. Cross-Validation With Hyper-Parameter Optimization (CVM)
A typical data analysis involves several steps (representing the data, imputation, dis-
cretization, variable selection or dimensionality reduction, learning a classifier) each with
hundreds of available choices of algorithms in the literature. In addition, each algorithm
takes several hyper-parameter values that should be tuned by the user. A general method
for tuning the hyper-parameters is to try a set of predefined combinations of methods and
corresponding values and select the best. We will represent this set with A, a set of hyper-parameter value combinations a, e.g., A = {⟨no variable selection, K-NN, K = 5⟩, ⟨Lasso, λ = 2, linear SVM, C = 10⟩} when the intent is to try K-NN with no variable selection, and a linear SVM using the Lasso algorithm for variable selection. The pseudo-code is shown in Algorithm 2. The symbol f(a, D, x) now denotes the output of the model learned when using hyper-parameters a on dataset D and applied on input x. Correspondingly, the symbol f(a, D) denotes the model produced by applying hyper-parameters a on D. The quantity L_CV(a) is now parameterized by the specific values a and the minimizer of the loss (maximizer of performance) a* is found. The final model returned is f(a*, D), i.e., the model produced by setting hyper-parameter values to a* and learning from all data D.
On one hand, we expect CV with model selection (hereafter, CVM) to underestimate
performance because estimations are computed using models trained on only a subset of
the dataset. On the other hand, we expect CVM to overestimate performance because it
returns the maximum performance found after trying several hyper-parameter values. In
Section 8 we examine this behavior empirically and determine (in concordance with [2],
[5]) that indeed when sample size is relatively small and the number of models tried is in
the hundreds CVM overestimates performance. Thus, in these cases other types of esti-
mation protocols are required.
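The following Python sketch mirrors Algorithm 2 below; it is our own illustration rather than the authors' code, and it additionally returns the selected configuration a* so that later sketches can reuse it (a small deviation from the pseudo-code, which returns only the model and the estimate).

```python
import numpy as np
from sklearn.metrics import zero_one_loss

def fold_losses(configs, X, y, folds):
    """e[a][i]: average loss in fold i of the model trained with configuration a
    on all the data except fold i. 'configs' maps a label to a model factory."""
    e = {a: [] for a in configs}
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for a, make_model in configs.items():
            m = make_model().fit(X[train], y[train])
            e[a].append(zero_one_loss(y[test], m.predict(X[test])))
    return e

def cvm(configs, X, y, K=5, seed=0):
    """CV with model selection (Algorithm 2): pick the configuration with the
    smallest CV loss, retrain it on all data, report its (optimistic) CV loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    e = fold_losses(configs, X, y, folds)
    cv_loss = {a: float(np.mean(e[a])) for a in configs}   # L_CV(a)
    a_star = min(cv_loss, key=cv_loss.get)                  # best hyper-parameters
    return configs[a_star]().fit(X, y), a_star, cv_loss[a_star]

# usage sketch:
# configs = {f"K={k}": (lambda k=k: KNeighborsClassifier(n_neighbors=k)) for k in (1, 3, 5, 7)}
# model, a_star, est_loss = cvm(configs, X, y)
```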
4. The Double Cross Validation Method (CVM-CV)
The CVM is biased because when trying hundreds or more learning methods, what ap-
pears to be the best one has probably also been “lucky” for the particular test sets. Thus,
one idea to reduce the bias is to re-evaluate the selected, best method on different test
sets. Of course, since we are limited to the given samples (dataset) it is impossible to do
so on truly different test cases. One idea thus, is to re-evaluate the selected method on a
different split (partitioning) to folds and repeat Cross-Validation only for the single, se-
lected, best method. We name this approach CVM-CV, since it sequentially performs
CVM and CV for model selection and performance estimation, respectively, and it is shown in Algorithm 3. The tilde symbol ‘~’ is used to denote a returned value that is ignored.
Algorithm 2: K-Fold Cross-Validation with Hyper-Parameter Optimization (Model Selection), CVM(f, A, D)
Input: A dataset D = {(x_i, y_i), i = 1, …, N}
       A set A of hyper-parameter value combinations
Output: A model M
        An estimation of performance (loss) of M
Partition D into K folds F_1, …, F_K
Estimate L_CV(a) for each a ∈ A:
    L_CV(a) ← (1/K) Σ_{i=1..K} (1/|F_i|) Σ_{(x,y)∈F_i} L(f(a, D_-i, x), y)
Find the minimizer a* of L_CV(a)   // “best hyper-parameters”
M ← f(a*, D)   // the model from all data D with the best hyper-parameters
L_CVM ← L_CV(a*)
Return ⟨M, L_CVM⟩
Algorithm 3: Double Cross Validation, CVM-CV(f, A, D)
Input: A dataset D = {(x_i, y_i), i = 1, …, N}
       A set A of hyper-parameter value combinations
Output: A model M
        An estimation of performance (loss) of M
Partition D into K folds
⟨M, ~⟩ ← CVM(f, A, D)
a* is the hyper-parameter configuration corresponding to M
Estimation:
    Partition D into K new randomly chosen folds F′_1, …, F′_K
    L_CVM-CV ← (1/K) Σ_{i=1..K} (1/|F′_i|) Σ_{(x,y)∈F′_i} L(f(a*, D \ F′_i, x), y)
Return ⟨M, L_CVM-CV⟩
Notice that CVM-CV is not theoretically expected to fully remove the overesti-
mation bias: information from the test sets in the final Cross-Validation step for perfor-
mance estimation is still employed during training to select the best model. Nevertheless,
our experiments show that this relatively computationally-efficient approach does reduce
CVM overestimation bias.
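A sketch of CVM-CV (Algorithm 3) on top of the cvm() function of the previous section, again our illustration rather than the authors' code:

```python
import numpy as np
from sklearn.metrics import zero_one_loss

def cvm_cv(configs, X, y, K=5, seed=0):
    """Double Cross Validation: select the best configuration with CVM, then
    re-estimate its loss by a fresh CV on a different random partition, for the
    selected configuration only."""
    model, a_star, _ = cvm(configs, X, y, K=K, seed=seed)   # CVM estimate is ignored ('~')
    rng = np.random.default_rng(seed + 1)                   # a *different* split of the data
    folds = np.array_split(rng.permutation(len(y)), K)
    losses = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        m = configs[a_star]().fit(X[train], y[train])
        losses.append(zero_one_loss(y[test], m.predict(X[test])))
    return model, float(np.mean(losses))
```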
5. The Tibshirani and Tibshirani (TT) Method
The TT method [4] attempts to heuristically and approximately estimate the bias of the
CV error estimation due to model selection and add it to the final estimate of loss. For
each fold, the bias due to model selection is estimated as e_k(a*) − e_k(a_k), where, as before, e_k(a) is the average loss in fold k when using hyper-parameter values a, a_k is the combination of hyper-parameter values that minimizes the loss in fold k, and a* is the global minimizer over all folds; the total bias estimate is the average of these terms over the K folds. Notice that, if in all folds the same values provide the best performance, then these will also be selected globally and hence a_k = a* for all k. In this case, the bias estimate will be zero. The justi-
fication of this estimate for the bias is in [4]. It is quite important to notice that TT does
not require any additional model training and has minimum computational overhead.
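Since TT only re-uses the per-fold losses already computed during CVM, it can be implemented as a small post-processing step; the sketch below (ours) takes the dictionary of per-fold losses produced by the fold_losses() helper sketched in Section 3:

```python
import numpy as np

def tt_estimate(e):
    """TT bias-corrected loss estimate from per-fold losses e[a][k].
    Returns the globally selected configuration a* and L_CV(a*) plus the
    estimated bias (1/K) * sum_k [e_k(a*) - e_k(a_k)]."""
    cv_loss = {a: float(np.mean(losses)) for a, losses in e.items()}
    a_star = min(cv_loss, key=cv_loss.get)                 # global minimizer over all folds
    K = len(e[a_star])
    bias = float(np.mean([e[a_star][k] - min(e[a][k] for a in e) for k in range(K)]))
    return a_star, cv_loss[a_star] + bias
```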
6. The Nested Cross-Validation Method (NCV)
We could not trace who first introduced or coined the name Nested Cross-Validation (NCV), but the authors have independently discovered the method and have been using it since 2005 [6],[13],[14]; one early comment hinting at the method is in [15], while Witten and Frank briefly discuss the need to treat any parameter-tuning step as part of the training process when assessing performance (see [12], page 286).
Algorithm 4: Cross-Validation with the Tibshirani and Tibshirani Bias Correction, TT(f, A, D)
Input: A dataset D = {(x_i, y_i), i = 1, …, N}
       A set A of hyper-parameter value combinations
Output: A model M
        An estimation of performance (loss) of M
Partition D into K folds F_i
For each a ∈ A and each fold i:
    e_i(a) ← (1/|F_i|) Σ_{(x,y)∈F_i} L(f(a, D_-i, x), y)
    L_CV(a) ← (1/K) Σ_{i=1..K} e_i(a)
Find the minimizer a* of L_CV(a)   // global minimizer
Find the minimizers a_i of e_i(a)   // the minimizers for each fold
Estimate Bias ← (1/K) Σ_{i=1..K} [e_i(a*) − e_i(a_i)]
M ← f(a*, D), i.e., the model learned on all data D with the best hyper-parameters
L_TT ← L_CV(a*) + Bias
Return ⟨M, L_TT⟩
A similar method in a bioinformatics analysis was used as early as 2003 [16]. The
main idea is to consider the model selection as part of the learning procedure f. Thus, f
tests several hyper-parameter values, selects the best using CV, and returns a single mod-
el. NCV cross-validates f to estimate the performance of the average model returned by f
just as normal CV would do with any other learning method taking no hyper-parameters;
it’s just that f now contains an internal CV trying to select the best model. NCV is a gen-
eralization of the Train-Validation-Test protocol where one trains on the Train set for all
hyper-parameter values, selects the ones that provide the best performance on Validation,
trains on Train+Validation a single model using the best-found values and estimates its
performance on Test. Since Test is used only once by a single model, performance esti-
mation has no bias due to the model selection process. The final model is trained on all
data using the best found values for a. NCV generalizes the above protocol to cross-validate every step of this procedure: for each fold serving as Test, all remaining folds serve in turn as Validation, and this process is repeated for each fold serving as Test. The pseudo-code is shown in Algorithm 5 and is similar to CV (Algorithm 1), with CVM (Cross-Validation with Model Selection, Algorithm 2) serving as the learning function f. NCV requires a number of trained models that is quadratic in the number of folds K (one model is trained for every possible pair of folds serving as test and validation, respectively), and thus it is the most computationally expensive protocol out of the four.
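A sketch of NCV built on the cvm() function of Section 3 (our illustration; the inner CV uses K − 1 folds, as in the experimental set-up of Section 8):

```python
import numpy as np
from sklearn.metrics import zero_one_loss

def ncv(configs, X, y, K=5, seed=0):
    """Nested CV: cross-validate the whole model-selection procedure. For each
    outer fold, cvm() runs an inner CV on the remaining data and the model it
    returns is tested on the held-out fold; the returned model is the one
    selected by cvm() on all the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    outer_losses = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        best_model, _, _ = cvm(configs, X[train], y[train], K=K - 1, seed=seed)
        outer_losses.append(zero_one_loss(y[test], best_model.predict(X[test])))
    final_model, _, _ = cvm(configs, X, y, K=K, seed=seed)
    return final_model, float(np.mean(outer_losses))
```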
7. Stratification of Folds
In CV, folds are partitioned randomly which should maintain on average the same class
distribution in each fold. However, in cases of small sample sizes or highly imbalanced
class distributions it may happen that some folds contain no samples from one of the
classes (or in general, the class distribution is very different from the original). In that
case, the estimation of performance for that fold will exclude that class. To avoid this
case, “in stratified cross-validation, the folds are stratified so that they contain approxi-
mately the same proportions of labels as the original dataset” [5].
Algorithm 5: K-Fold Nested Cross-Validation, NCV(f, A, D)
Input: A dataset D = {(x_i, y_i), i = 1, …, N}
       A set A of hyper-parameter value combinations
Output: A model M
        An estimation of performance (loss) of M
Partition D into K folds F_1, …, F_K
⟨M, ~⟩ ← CVM(f, A, D)
Estimation:
    For i = 1, …, K: ⟨M_i, ~⟩ ← CVM(f, A, D_-i)   // best performing model on D_-i
    L_NCV ← (1/K) Σ_{i=1..K} (1/|F_i|) Σ_{(x,y)∈F_i} L(M_i(x), y)
Return ⟨M, L_NCV⟩
Notice that leave-one-out CV guarantees that each fold will be unstratified, since each fold contains only one sample, which can cause serious estimation problems ([12], p. 151, [5]).
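A minimal sketch of stratified fold assignment (ours; scikit-learn's StratifiedKFold provides an equivalent, more complete implementation):

```python
import numpy as np

def stratified_folds(y, K, seed=0):
    """Assign sample indices to K folds so that each fold has approximately the
    same class proportions as the whole dataset."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(K)]
    for c in np.unique(y):                          # deal out each class separately
        idx = rng.permutation(np.flatnonzero(y == c))
        for i, sample in enumerate(idx):            # round-robin over the folds
            folds[i % K].append(sample)
    return [np.array(f) for f in folds]
```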
8. Empirical Comparison of Different Protocols
We performed an empirical comparison in order to assess the characteristics of each data-
analysis protocol. Particularly, we focus on three specific aspects of the protocols’ per-
formances: (a) bias and variance of the estimation, (b) effect of feature selection and (c)
effect of stratification.
Notice that, assuming data are partitioned into the same folds, all methods return the same model, that is, the model returned by f on all data D using the minimizer a* of the CV error. However, each method returns a different estimate of the performance of this model.
8.1. The Experimental Set-Up
Original Datasets: Seven datasets from different scientific fields were employed for the
experimentations. The computational task for each dataset consists in predicting a binary
outcome on the basis of a set of numerical predictors (binary classification). Datasets
were selected to have a relatively large number of samples so that several smaller datasets that follow the same joint distribution can be sub-sampled from the original dataset; when the sample size is large, the sub-samples are relatively independent, providing independent estimates of all metrics in the experimental part. In more detail, the
SPECT [17] dataset contains data from Single Photon Emission Computed Tomography
images collected in both healthy and cardiac patients. Data in Gamma [18] consist of
simulated registrations of high energy gamma particles in an atmospheric Cherenkov
telescope, where each gamma particle can be originated from the upper atmosphere
(background noise) or being a primary gamma particle (signal). Discriminating biode-
gradable vs. non-biodegradable molecules on the basis of their chemical characteristics is
the aim of the Biodeg [19] dataset. The Bank [20] dataset was gathered by direct market-
ing campaigns (phone calls) of a Portuguese banking institution for discriminating cus-
tomers who want to subscribe to a term deposit from those who don't. The CD4vsCD8 dataset [21] contains the phosphorylation levels of 18 intra-cellular proteins as predictors to discriminate naïve CD4+ and CD8+ human immune system cells. SeismicBumps [22] focuses on forecasting high-energy seismic bumps (higher than 10^4 J) in coal mines; data come from longwalls located in a Polish coal mine. The MiniBooNE dataset is taken from the first phase of the
Booster Neutrino Experiment conducted in the FermiLab [23]; the goal is to distinguish
between electron neutrinos (signal) and muon neutrinos (background). Table 2 summa-
rizes datasets’ characteristics. It should be noticed that the outcome distribution consider-
ably varies across datasets.
Model Selection: To generate the hyper-parameter vectors in A we employed two differ-
ent strategies, named No Feature Selection (NFS) and With Feature Selection (WFS).
Strategy NFS (No Feature Selection) includes three different modelers: the Logistic
Regression classifier ([9], p. 205), as implemented in Matlab 2013b, that takes no hyper-
parameters; the Decision Tree [24], as implemented also in Matlab 2013b with hyper-
parameters MinLeaf and MinParents both within {1, 2, …, 10, 20, 30, 40, 50}; Support
Vector Machines as implemented in the libsvm software [25] with linear, Gaussian, and polynomial kernels, varying the kernel hyper-parameters and the cost parameter C over predefined grids. When a classifier takes multiple hyper-parameters, all combina-
tions of choices are tried. Overall, 247 hyper-parameter value combinations and corre-
sponding models are produced each time to select the best.
Strategy WFS (With Feature Selection) adds feature selection methods as prepro-
cessing steps to Strategy NFS. Two feature selection methods are tried each time, namely
the univariate selection and the Statistically Equivalent Signature (SES) algorithm [26].
The former simply applies a statistical test for assessing the association between each
individual predictor and the target outcome (chi-square test for categorical variables and
Student t-test for continuous ones). Predictors whose p-values are below a given signifi-
cance threshold t are retained for successive computations. The SES algorithm [26] be-
longs to the family of constraint-based, Markov-Blanket inspired feature selection meth-
ods [27]. In short, SES repetitively applies a statistical test of conditional independence
for identifying the set of predictors that are associated with the outcome given any com-
bination of the remaining variables. Also SES requires the user to set a priori a signifi-
cance threshold t, along with the hyper-parameter maxK that limits the number of predic-
tors to condition upon. Both feature selection methods are coupled in turn with each
modeler in order to build the hyper-parameters vector a. The significance threshold t is
varied in {0.01, 0.05} for both methods, while maxK varies in {3, 5}, bringing the num-
ber of hyper-parameter combinations produced in Strategy WFS to 1729.
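To illustrate how such configuration sets can be assembled, the sketch below (ours) combines a univariate-selection threshold with a small linear-SVM grid using scikit-learn components; the grids are placeholders, not the paper's actual 247/1729 combinations, and the resulting dictionary is directly usable with the cvm()/ncv() sketches above.

```python
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def wfs_configs():
    """Illustrative configuration set in the spirit of Strategy WFS: each
    feature-selection threshold is combined with each classifier
    hyper-parameter value."""
    configs = {}
    for t in (0.01, 0.05):                      # univariate selection threshold
        for c in (1, 10, 100):                  # SVM cost parameter
            configs[f"fpr(t={t}) + linear SVM (C={c})"] = (
                lambda t=t, c=c: make_pipeline(SelectFpr(f_classif, alpha=t),
                                               SVC(kernel="linear", C=c)))
    return configs
```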
Sub-Datasets and Hold-out Datasets. Each original dataset D is partitioned into two
separate, stratified parts: Dpool, containing 30% of the total samples, and the hold-out set
Dhold-out, consisting of the remaining samples. Subsequently, for each Dpool N sub-datasets
are randomly sampled with replacement for each sample size in the set {20, 40, 60, 80, 100, 500, and 1500}.
Table 2. Datasets' characteristics. Dpool is a 30% partition from which sub-sampled datasets are produced. Dhold-out is the remaining 70% of samples from which an accurate estimation of the true performance is computed.

Dataset Name   # Samples  # Attributes  Classes ratio  |Dpool|  |Dhold-out|  Ref.
SPECT          267        22            3.85           81       186          [17]
Biodeg         1055       41            1.96           317      738          [19]
SeismicBumps   2584       18            14.2           776      1808         [22]
Gamma          19020      11            1.84           5706     13314        [18]
CD4vsCD8       24126      18            1.13           7238     16888        [21]
MiniBooNE      31104      50            2.55           9332     21772        [23]
Bank           45211      17            7.54           13564    31647        [20]
This yields a total of 7 × 7 × N sub-datasets D_{i,j,k}, where i indexes the original dataset, j the sample size, and k the sub-sampling. Most of the original datasets have
been selected with a relatively large sample size so that each Dhold-out is large enough to
allow an accurate (low variance) estimation of performance. In addition, the size of Dpool is also relatively large, so that each sub-sampled dataset can be approximately considered a dataset independently sampled from the data population of the problem. Nevertheless, we also include a couple of datasets with smaller sample size. We set the number of sub-samples to N = 30.
Bias and Variance of each Protocol: For each of the data-analysis protocols CVM, CVM-CV, TT, and NCV, both the stratified and the non-stratified versions are applied to each sub-dataset, in order to select the “best model/hyper-parameter values” and estimate its performance. For each sub-dataset, the same split into K folds was employed for the stratified versions of CVM, CVM-CV, TT and NCV, so that the data-analysis protocols always select exactly the same model and differ only in the estimation of performance. For the NCV, the internal CV loop uses K′ = K − 1 folds. Some of the datasets, though, are characterized by a particularly high class ratio, and typically this leads to a scarcity of instances of the rarest class in some sub-datasets.
Figure 1. Average loss and variance for AUC metric in Strategy NFS. From left to right: stratified
CVM, CVM-CV, TT, and NCV. Top row contains average bias, second row the standard deviation of
performance estimation. The results largely vary depending on the specific dataset. In general, CVM is
clearly optimistic (positive bias) for sample sizes less or equal to 100, while NCV tends to underesti-
mate performances. CVM-CV and TT show a behavior that is in between these two extremes. CVM
has the lowest variance, at least for small sample sizes.
If the number of instances of a given class is smaller than K, we set K equal to the number of instances of that class, in order to ensure the presence of both classes in each fold. For NCV and NS-NCV, we forgo analyzing sub-datasets where K < 3.
The bias is computed as the loss on the hold-out set minus the estimated loss. Thus, a positive bias indicates a higher “true”
error (i.e., as estimated on the hold-out set) than the one estimated by the corresponding
analysis protocol and implies the estimation protocol is optimistic. For each protocol,
original dataset, and sample size the true mean bias, its variance and its standard devia-
tion are computed over 30 sub-samplings.
Performance Metric: All algorithms are presented using a loss function L computed for each sample, averaged out within each fold, and then averaged over all folds. The zero-one loss function is typically assumed, corresponding to 1 − accuracy of the classifier. A valid alternative metric for binary classification problems is the Area Under the Receiver's Operat-
ing Characteristic Curve (AUC) [28]. The AUC does not depend on the prior class distri-
bution. In contrast, the zero-one loss depends on the class distribution: for a problem with
class distribution of 50-50%, a classifier with accuracy 85% (loss 15%) has greatly im-
proved over the baseline of a trivial classifier predicting the majority class; for a problem
of 84-16% class distribution, a classifier with 85% accuracy has not improved much over
the baseline. On the other hand, computing AUC on small test sets leads to poor esti-
mates [29], and it is impossible when leave-one-out cross validation is used (unless mul-
tiple predictions are pooled together, a practice that creates additional issues [29]). More-
over, the AUC cannot be expressed as a loss function L(ŷ, y) where ŷ is a single prediction. Nevertheless, all Algorithms 1-4 remain the same if we substitute the average loss in fold i with L_i = 1 − AUC(f(D_-i), F_i), i.e., the error in fold i is 1 minus the AUC of the model learned by f on all data except fold F_i, as estimated on F_i as the test set. In order to contrast the proper-
ties of the two metrics, we have performed all analyses twice, using in turn 0-1 loss and
1-AUC as the metrics to optimize. For both metrics a positive bias corresponds to overes-
timated performance.
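As an illustration of this substitution, the following sketch (ours) computes the 1 − AUC loss of a fitted scikit-learn classifier on a held-out fold; predict_proba and decision_function are the usual ways to obtain the continuous scores that the AUC requires.

```python
from sklearn.metrics import roc_auc_score

def one_minus_auc_fold_loss(model, X_test, y_test):
    """Fold-level loss L_i = 1 - AUC: computed once per fold from continuous
    scores, rather than per sample as with the zero-one loss."""
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    else:
        scores = model.decision_function(X_test)     # e.g., SVM decision values
    return 1.0 - roc_auc_score(y_test, scores)
```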
8.2. Experimental Results
The results of the analysis greatly differ depending on the specific dataset, performance
metric and performance estimation method. The following Tables and Figures show the
results obtained with the AUC metric, while the remaining results are provided in Ap-
pendix A.
We first comment on the results obtained in the experimentation following Strategy NFS.
Figure 1 (first row) shows the average loss bias of the four methods, (stratified version)
suggesting that indeed CVM overestimates performance for small sample sizes (underes-
timates error), corroborating the results in [2],[5]. In contrast, NCV tends to be overly pessimistic and to underestimate the real performances. CVM-CV and TT exhibit results that
are between these two extremes, with CVM > CVM-CV > TT > NCV in terms of overes-
timation. It should be noted that the results on the SeismicBump dataset strongly penalize
the CVM, CVM-CV and TT methods. Interestingly, this dataset has the highest ratio be-
tween outcome classes (14.2), suggesting that estimating AUC performances in highly
imbalanced datasets may be challenging for these three methods.
Table 3 shows the bias averaged over all datasets: CVM overestimates AUC by up to ~17 points for small sample sizes, while CVM-CV and TT are always below 10 points, and the NCV bias never exceeds 5 points in magnitude. We perform a
t-test for the null hypothesis that the bias is zero, which is typically rejected: all methods
usually exhibit some bias whether positive or negative.
Figure 2. Effect of stratification using the AUC in Strategy NFS. The first column reports the aver-
age bias for the stratified versions of each method, while the second column for the non-stratified ones.
The effect of stratification is dependent on each specific method and dataset.
The second row of Figure 1 and Table 4 show the standard deviations for the loss bi-
as. We apply O'Brien's modification of Levene's statistical test [30] with the null hypothesis that the variance of a method is the same as the corresponding variance of the NCV for the same sample size. We note that CVM has the lowest variance, while all the
other methods show almost statistically indistinguishable variances.
Table 5 reports the average performances on the hold-out set. As expected, these per-
formances improve with sample size, because the final models are trained on a larger
number of samples. The corresponding results for the accuracy metric are reported in
Figure 4 and Tables 6-8 in Appendix A, and generally follow the same patterns and con-
clusions as the ones reported above for the AUC metric. The only noticeable difference is
Figure 3. Comparing Strategy NFS (No Feature Selection) and WFS (With Feature Selection)
over AUC. The first row reports the average bias of each method for Strategy NFS and WFS, respec-
tively, while the second row provides the variance in performance estimation. CVM and TT show an
evident increment in bias in Strategy WFS, presumably due to the larger hyper-parameter space ex-
plored in this setting.
a large improvement in the average bias for the SeismicBump dataset. A close inspection
of the results for this dataset reveals that all methods tend to select models with perfor-
mances close to the trivial classifier, both during the model selection and the performance
estimation on the hold-out set. The selection of these models minimizes the bias but they
have little practical utility, since they tend to predict the most frequent class. This exam-
ple clearly underlines accuracy’s inadequacy for tasks involving highly imbalanced da-
tasets.
Figure 2 contrasts methods’ stratified and non-stratified versions for Strategy NFS
and AUC. The effect of stratification seems to be quite dependent on the specific method.
The non-stratified version of NCV has larger bias and variance than the stratified version
for small sample sizes, while for other protocols the non-stratified version shows a de-
creased bias at the cost of larger variance (see Table 3 and Table 4). Interestingly, the
results for the accuracy metric show an almost identical pattern (see Figure 5, Tables 6
and 7 in Appendix A). In general, we do not suggest forgoing stratification, given the increase in variance that this usually produces.
Finally, Figure 3 shows the effect of feature selection in the analysis and contrasts
Strategy NFS and Strategy WFS on the AUC metric. The average bias for both CVM and
TT increases in Strategy WFS. This increment is explained by the fact that Strategy WFS
explores a larger hyper-parameter space than Strategy NFS. The lack of increment in
predictive power in Strategy WFS is probably due to the absence of irrelevant variables: all
datasets have a limited dimensionality (max number of features: 40). In terms of variance
the NCV method shows a decrease in standard deviation for small sample sizes in the
experimentation with feature selection. Similar results are observed with the accuracy
metric (Figure 6), where the decrease in variance is present for all the methods.
9. Related Work and Discussion
Estimating performance of the final reported model while simultaneously selecting the
best pipeline of algorithms and tuning their hyper-parameters is a fundamental task for
any data analyst. Yet, arguably these issues have not been examined in full depth in the
literature. The origins of cross-validation can be traced back to the “jackknife” technique of Quenouille [31] in the statistical community. In machine learning, [5]
studied the cross-validation without model selection (the title of the paper may be confus-
ing) comparing it against the bootstrap and reaching the important conclusion that (a) CV
is preferable to the bootstrap, (b) a value of K=10 is preferable for the number of folds
versus a leave-one-out, and (c) stratification is also always preferable. In terms of theory,
Bengio [11] showed that there exists no unbiased estimator of the variance of the CV performance estimation, which impacts hypothesis testing of performance using the CV.
To the best of our knowledge, the first to study the problem of bias in the context of
model selection in machine learning is [3]. Varma [32] demonstrated the optimism of the
CVM protocol and instead suggests the use of the NCV protocol. Unfortunately, all their
experiments are performed on simulated data only. Tibshirani and Tibshirani [4] intro-
duced the TT protocol but they do not compare it against alternatives and they include
only a proof-of-concept experiment on a single dataset. Thus, the present paper is the first
work that compares all four protocols (CVM, CVM-CV, NCV, and TT) on multiple real
datasets.
Based on our experiments we found evidence that both the CVM-CV and the TT
method have relatively small bias for sample sizes above 20 and have about the same
variance as the NCV; the TT method does not introduce additional computational overhead.
Table 3. Average AUC bias over datasets (Strategy NFS). P-values produced by a t-test with null hypothesis that the mean bias is zero (P<0.05 *, P<0.01 **). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.1702**  0.1581**  0.0929**  0.0407**   0.0798**   0.1237**   -0.0483*   -0.0696**
40           0.1321**  0.1367**  0.0695**  0.0398**   0.0418**   0.0744**   -0.0137    -0.0364*
60           0.1095**  0.1072**  0.0647**  0.0223**   0.0371**   0.0538*    0.0065     -0.0113
80           0.0939**  0.0933**  0.0574**  0.0348**   0.023**    0.0447     -0.0162    -0.0147
100          0.0803**  0.0788**  0.0499**  0.0351**   0.0056     0.0296     0.0093*    0.0017
500          0.0197**  0.0172**  0.0143**  0.0079*    -0.0236**  0.0068     0.0031     0.0002
1500         -0.0023   -0.0024   -0.0031   -0.0028    -0.0132**  -0.0447**  -0.0049**  -0.0044**
Table 4. Standard deviation of AUC estimations over datasets (Strategy NFS). P-values produced by a test with null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P<0.05 *, P<0.01 **). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT        NS-TT     NCV     NS-NCV
20           0.1289**  0.1681**  0.2043    0.2113     0.214     0.1485**  0.2073  0.2156
40           0.1063**  0.0991**  0.1571*   0.1871     0.1872    0.1026**  0.1946  0.1908
60           0.0845**  0.0881**  0.1435    0.1729     0.1439    0.0787**  0.1637  0.1855
80           0.0711**  0.0769**  0.1057**  0.1493     0.124*    0.0742**  0.1544  0.156
100          0.0757**  0.0806**  0.1136*   0.1352     0.1342    0.0659**  0.1474  0.1498
500          0.0758**  0.0745**  0.0834    0.0867     0.1182**  0.0436**  0.0967  0.0938
1500         0.0436    0.0441    0.0448    0.0444     0.0542*   0.0298**  0.0458  0.046
Table 5. Average AUC on the hold-out sets (Strategy NFS). All methods use CVM for model selection and thus have the same performances on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM     NS-CVM
20           0.6903  0.6720
40           0.7385  0.7377
60           0.7753  0.7783
80           0.7934  0.7898
100          0.8015  0.8004
500          0.8555  0.8604
1500         0.9163  0.9163
However, the TT method seems to overestimate in very small sample sizes when a
large number of hyper-parameter configurations are tested. Moreover, caution should still
be exercised and further research is required to better investigate the properties of the TT
protocol. One particularly worrisome situation is the use of TT in a leave-one-out fashion
which we advise against. In this extreme case, each fold contains a single test case. If the
overall-best classifier predicts it wrong, the loss is 1. If any other classifier tried predicts
it correctly its loss is 0. When numerous classifiers are tried at least one of them will pre-
dict the test case correctly with high probability. In this case, the estimation of the bias, Bias = (1/K) Σ_k [e_k(a*) − e_k(a_k)] = (1/K) Σ_k e_k(a*), of the TT method will be equal to the loss of the best classifier. Thus, TT will estimate the loss of the best classifier found as L_CV(a*) + Bias = 2 · L_CV(a*), i.e., twice as much as found during leave-one-out CV. To
recapitulate: in leave-one-out CV, when the number of classifiers tried is high, TT estimates
the loss of the best classifier found as twice its cross-validated loss, which is overly con-
servative.
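The argument can be checked numerically with the tt_estimate() sketch of Section 5 on a contrived leave-one-out setting; all the numbers below are hypothetical, and with many configurations making independent errors the reported TT loss comes out close to twice the cross-validated loss of the selected configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_configs, p_err = 100, 200, 0.15      # leave-one-out: one test case per fold
# hypothetical 0/1 losses: each configuration errs independently with probability 0.15
e = {a: rng.binomial(1, p_err, size=n_samples).tolist() for a in range(n_configs)}
a_star, tt_loss = tt_estimate(e)
print(np.mean(e[a_star]), tt_loss)                # tt_loss is about twice the CV loss of a*
```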
A previous version of this work has appeared elsewhere [33], presenting a more re-
stricted methodological discussion and empirical comparison. Moreover, the analysis
protocol of the present version markedly differs from the protocol of the previous one. In
[33] the class ratio for sub-datasets below a given sample size was forced to one, i.e., the
sub-sampling procedure was selecting an equal number of instances from the two classes,
independently by the original class distribution. In the present experimentations the origi-
nal class distribution is maintained in the sub-datasets, and the number of folds is dynam-
ically changed in order to ensure at least one instance from the rarest class in each fold.
This important change gives rise to a number of differences between the results of the two works.
We do underline though that the findings and conclusions of the previous study are still
valid in the context of the design of its experimentations.
We also note the concerning issue that the variance of estimation for small sample
sizes is large, again in concordance with the experiments in [2]. The authors in the latter
advocate methods that may be biased but exhibit reduced variance. However, we believe
that CVM is too biased no matter its variance; implicitly the authors in [2] agree when
they declare that model selection should be integrated in the performance estimation pro-
cedure in such a way that test samples are never employed for selecting the best model.
Instead, they suggest as alternatives limiting the extent of the search of the hyper-
parameters or performing model averaging. In our opinion, neither option is satisfactory
for all analysis purposes and more research is required. One approach that we suggest is
to repeat the whole analysis several times using a different random partitioning to folds
each time, and average the loss estimations. Repeating the analysis for different fold par-
titionings can be performed both for the inner CV loop (if one employs NCV) or just the
outer CV loop. Averaging over several repeats reduces the component of the variance
that is due to the specific partition to folds, which could be relatively high for small sam-
ple sizes.
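As a sketch of this suggestion (ours), the outer-loop repetition can be as simple as averaging the NCV estimate of Section 6 over several random fold partitions:

```python
import numpy as np

def repeated_ncv_loss(configs, X, y, K=5, repeats=10, seed=0):
    """Average the NCV loss estimate over several different random partitions
    into folds, reducing the variance component due to the specific split."""
    estimates = [ncv(configs, X, y, K=K, seed=seed + r)[1] for r in range(repeats)]
    return float(np.mean(estimates))
```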
Another source of variance of performance estimation is due to the stochastic nature
of certain classification algorithms. Numerous classification methods are non-
deterministic and will return a different model even when trained on the same dataset.
Typical examples include Artificial Neural Networks (ANNs), where the final model
typically depends on the random initialization of weights. The effect of initialization may
be quite intense and result in very different models being returned. The exact same theory and
algorithms presented above apply to such classifiers; however, one would expect an even
larger variance of estimated performances because an additional variance component is
added due to the stochastic nature of the classifier training. In this case, we would suggest
training the same classifier multiple times and averaging the results to produce an esti-
mate of performance.
Particularly, for ANNs we note further possible complications when using the above
protocols. Let us assume that the number of epochs of weight updates is used as a hyper-
parameter in the above protocols. In this case, the value of the number of epochs that
optimizes the CV loss is selected, and then used to train an ANN on the full dataset.
However, training the ANN on a larger sample size may require many more epochs to
achieve a good fit on the dataset. Using the same number of epochs as in a smaller dataset
may underfit and result in high loss. This violates the assumption made in Section 2 that
the learning method improves on average with larger sample sizes. The number-of-epochs hyper-parameter is highly dependent on the sample size and thus possibly violates
this assumption. To satisfy the assumption it should be the case that training the ANN
with a fixed number of epochs should result in a better model (smaller loss) on average
with increasing sample size. Typically, such hyper-parameters can be substituted with
other alternatives (e.g., a criterion that dynamically determines the number of epochs) so
that performance is monotonic (on average) with sample size for any fixed values of the
hyper-parameters. Thus, before using the above protocols an analyst is advised to consid-
er whether the monotonicity assumption holds for all hyper-parameters.
Finally, we’d like to comment on the use of Bayesian non-parametric techniques,
such as Gaussian Processes [34]. Such methods consider and reason with all models of a
given family, averaging out all model parameters to provide a prediction. However, they
still have hyper-parameters. In this case, they are defined as the free parameters of the
whole learning process over which there is no marginalization (averaging out). Examples
of hyper-parameters include the type of the kernel covariance function in Gaussian Pro-
cesses and the parameters of the kernel function [35]. In fact, since one can compose kernels via sum and product operations, dynamically composing the ap-
propriate kernel adds a new level of complexity to hyper-parameter search [36]. Thus, in
general, such methods still require hyper-parameters to tune and, in our opinion, they do not completely obviate the need to select them. The protocols presented here could be
employed to select these hyper-parameter values, type of kernel, type of priors, etc. From
a different perspective however, the value of hyper-parameters in some settings (e.g.,
number of hidden units in a neural-network architecture) could be selected using a Bayes-
ian non-parametric machinery. Thus, non-parametric methods could in some cases also substitute for the protocols in this paper.
10. Conclusions
In the absence of hyper-parameter optimization (model selection), simple Cross-Validation underestimates the performance of the model returned when training on the full dataset. In the presence of learning-method and hyper-parameter optimization, simple Cross-Validation overestimates performance. Alternatives include rerunning Cross-Validation one more time for the final selected model only, the method proposed by Tibshirani and Tibshirani [4] to estimate and remove the bias, and Nested Cross-Validation, which cross-validates the entire model selection procedure (itself containing an inner cross-validation); a sketch of the latter is given below. These alternatives seem to reduce the bias, with Nested Cross-Validation being conservative in general and robust to the dataset, although it incurs a higher computational overhead; the TT method seems promising and does not require training additional models. We also acknowledge the limited scope of our experiments in terms of the number and type of datasets. Including other preprocessing steps in the analysis, other hyper-parameter optimization procedures that dynamically decide which value combinations to consider, other performance metrics, and regression methods forms our future work on the subject, in order to obtain more general answers to these research questions.
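For reference, a minimal sketch of Nested Cross-Validation (our own illustration, assuming scikit-learn; the learner and hyper-parameter grid are arbitrary): the inner loop selects the hyper-parameters and the outer loop estimates the performance of the whole selection procedure.

    # Sketch (assumption): nested cross-validation with an SVM.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=150, n_features=20, random_state=0)

    # Inner CV: model selection over the hyper-parameter grid.
    inner_search = GridSearchCV(SVC(),
                                param_grid={"C": [0.1, 1, 10],
                                            "gamma": ["scale", 0.01]},
                                cv=5)

    # Outer CV: estimates the performance of the selection procedure itself.
    outer_scores = cross_val_score(inner_search, X, y, cv=5)
    print("NCV estimate of accuracy: %.3f" % outer_scores.mean())

    # The final model is produced by running the selection on the full dataset;
    # its best_score_ is the optimistic CV-with-model-selection (CVM) estimate.
    inner_search.fit(X, y)
    print("CVM (optimistic) estimate: %.3f" % inner_search.best_score_)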
Acknowledgements
The work was funded by the STATegra EU FP7 project, No 306000, and by the EPILOGEAS GSRT ARISTEIA II project, No 3446.
References
[1] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “In-Sample and Out-of-Sample Model Selection and Error Estimation for Support Vector Machines,” IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 9, pp. 1390–1406, Sep. 2012.
[2] G. C. Cawley and N. L. C. Talbot, “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation,” J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Mar. 2010.
[3] D. D. Jensen and P. R. Cohen, “Multiple comparisons in induction algorithms,” Mach. Learn., vol. 38, pp. 309–338, 2000.
[4] R. J. Tibshirani and R. Tibshirani, “A bias correction for the minimum error rate in cross-validation,” Ann. Appl. Stat., vol. 3, no. 2, pp. 822–829, Jun. 2009.
[5] R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection,” in International Joint Conference on Artificial Intelligence, 1995, vol. 14, pp. 1137–1143.
[6] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis,” Bioinformatics, vol. 21, no. 5, pp. 631–643, Mar. 2005.
[7] R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern Classification (2nd Edition),” Oct. 2000.
[8] T. M. Mitchell, “Machine Learning,” Mar. 1997.
[9] C. M. Bishop, “Pattern Recognition and Machine Learning (Information Science and Statistics),” Aug. 2006.
[10] T. Hastie, R. Tibshirani, and J. Friedman, “The Elements of Statistical Learning,” Elements, vol. 1, pp. 337–387, 2009.
[11] Y. Bengio and Y. Grandvalet, “Bias in Estimating the Variance of K-Fold Cross-Validation,” in Statistical Modeling and Analysis for Complex Data Problems, vol. 1, 2005, pp. 75–95.
[12] I. H. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems),” Jun. 2005.
[13] V. Lagani and I. Tsamardinos, “Structure-based variable selection for survival data,” Bioinformatics, vol. 26, no. 15, pp. 1887–1894, 2010.
[14] A. Statnikov, I. Tsamardinos, Y. Dosbayev, and C. F. Aliferis, “GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data,” Int. J. Med. Inform., vol. 74, no. 7–8, pp. 491–503, Aug. 2005.
[15] S. Salzberg, “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach,” Data Min. Knowl. Discov., vol. 1, pp. 317–328, 1997.
[16] N. Iizuka, M. Oka, H. Yamada-Okabe, M. Nishida, Y. Maeda, N. Mori, T. Takao, T. Tamesa, A. Tangoku, H. Tabuchi, K. Hamada, H. Nakayama, H. Ishitsuka, T. Miyamoto, A. Hirabayashi, S. Uchimura, and Y. Hamamoto, “Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection,” Lancet, vol. 361, no. 9361, pp. 923–929, Mar. 2003.
[17] L. A. Kurgan, K. J. Cios, R. Tadeusiewicz, M. Ogiela, and L. S. Goodenday, “Knowledge discovery approach to automated cardiac SPECT diagnosis,” Artif. Intell. Med., vol. 23, no. 2, pp. 149–169, Oct. 2001.
[18] R. K. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M. Jiřina, J. Klaschka, E. Kotrč, P. Savický, S. Towers, A. Vaiciulis, and W. Wittek, “Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope,” Nucl. Instruments Methods Phys. Res. Sect. A Accel. Spectrometers, Detect. Assoc. Equip., vol. 516, no. 2–3, pp. 511–528, Jan. 2004.
[19] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, and V. Consonni, “Quantitative structure-activity relationship models for ready biodegradability of chemicals,” J. Chem. Inf. Model., vol. 53, pp. 867–878, 2013.
[20] S. Moro and R. M. S. Laureano, “Using Data Mining for Bank Direct Marketing: An application of the CRISP-DM methodology,” Eur. Simul. Model. Conf., pp. 117–121, 2011.
[21] S. C. Bendall, E. F. Simonds, P. Qiu, E. D. Amir, P. O. Krutzik, R. Finck, R. V. Bruggner, R. Melamed, A. Trejo, O. I. Ornatsky, R. S. Balderas, S. K. Plevritis, K. Sachs, D. Pe’er, S. D. Tanner, and G. P. Nolan, “Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum,” Science, vol. 332, no. 6030, pp. 687–696, May 2011.
[22] M. Sikora and L. Wrobel, “Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines,” Arch. Min. Sci., vol. 55, no. 1, pp. 91–114, 2010.
[23] A. A. Aguilar-Arevalo, A. O. Bazarko, S. J. Brice, B. C. Brown, L. Bugel, J. Cao, L.
Coney, J. M. Conrad, D. C. Cox, A. Curioni, Z. Djurcic, D. A. Finley, B. T. Fleming, R.
Ford, F. G. Garcia, G. T. Garvey, C. Green, J. A. Green, T. L. Hart, E. Hawker, R. Imlay,
R. A. Johnson, P. Kasper, T. Katori, T. Kobilarcik, I. Kourbanis, S. Koutsoliotas, E. M.
Laird, J. M. Link, Y. Liu, Y. Liu, W. C. Louis, K. B. M. Mahn, W. Marsh, P. S. Martin, G.
McGregor, W. Metcalf, P. D. Meyers, F. Mills, G. B. Mills, J. Monroe, C. D. Moore, R. H.
Nelson, P. Nienaber, S. Ouedraogo, R. B. Patterson, D. Perevalov, C. C. Polly, E. Prebys,
J. L. Raaf, H. Ray, B. P. Roe, A. D. Russell, V. Sandberg, R. Schirato, D. Schmitz, M. H.
Shaevitz, F. C. Shoemaker, D. Smith, M. Sorel, P. Spentzouris, I. Stancu, R. J. Stefanski,
M. Sung, H. A. Tanaka, R. Tayloe, M. Tzanov, R. Van de Water, M. O. Wascko, D. H.
White, M. J. Wilking, H. J. Yang, G. P. Zeller, and E. D. Zimmerman, “Search for
electron neutrino appearance at the Delta m2 approximately 1 eV2 scale,” Phys. Rev. Lett., vol. 98, p. 231801, 2007.
[24] D. Coppersmith, S. J. Hong, and J. R. M. Hosking, “Partitioning Nominal Attributes in Decision Trees,” Data Min. Knowl. Discov., vol. 3, pp. 197–217, 1999.
[25] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 27, pp. 1–27, 2011.
[26] I. Tsamardinos, V. Lagani, and D. Pappas, “Discovering multiple, equivalent biomarker signatures,” in 7th Conference of the Hellenic Society for Computational Biology and Bioinformatics (HSCBB12), 2012.
[27] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing Bayesian network structure learning algorithm,” Mach. Learn., vol. 65, no. 1, pp. 31–78, 2006.
[28] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, pp. 861–874, 2006.
[29] A. Airola, T. Pahikkala, W. Waegeman, B. De Baets, and T. Salakoski, “A comparison of AUC estimators in small-sample studies,” J. Mach. Learn. Res. W&CP, vol. 8, pp. 3–13, 2010.
[30] R. G. O’Brien, “A General ANOVA Method for Robust Tests of Additive Models for Variances,” J. Am. Stat. Assoc., vol. 74, no. 368, pp. 877–880, Dec. 1979.
[31] M. H. Quenouille, “Approximate tests of correlation in time-series 3,” Math. Proc. Cambridge Philos. Soc., vol. 45, no. 3, pp. 483–484, Oct. 1949.
[32] S. Varma and R. Simon, “Bias in error estimation when using cross-validation for model selection,” BMC Bioinformatics, vol. 7, p. 91, Jan. 2006.
[33] I. Tsamardinos, V. Lagani, and A. Rakhshani, “Performance-Estimation Properties of Cross-Validation-Based Protocols with Simultaneous Hyper-Parameter Optimization,” in SETN ’14: Proceedings of the 8th Hellenic Conference on Artificial Intelligence, 2014.
[34] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[35] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[36] D. Duvenaud, J. Lloyd, R. Grosse, J. Tenenbaum, and Z. Ghahramani, “Structure discovery in nonparametric regression through compositional kernel search,” in Proceedings of the International Conference on Machine Learning (ICML), 2013, vol. 30, pp. 1166–1174.
Appendix A.
Figure 4. Average loss and variance for the accuracy metric in Strategy NFS. From left to right: stratified CVM, CVM-CV, TT, and NCV. The top row contains the average bias, the second row the standard deviation of the bias. The results vary considerably depending on the specific dataset. In general, CVM is clearly optimistic for sample sizes less than or equal to 100, while NCV tends to underestimate performance. CVM-CV and TT show a behavior that is in between these two extremes. CVM has the lowest variance, at least for small sample sizes.
Figure 5. Effect of stratification using accuracy in Strategy NFS. The first column reports the average bias for the stratified versions of each method, while the second column reports it for the non-stratified ones. The effect of stratification depends on the specific method and dataset.
Figure 6. Comparing Strategy NFS (No Feature Selection) and WFS (With Feature Selection) over accuracy. The first row reports the average bias of each method for Strategy NFS and WFS, respectively, while the second row provides the standard deviation of the bias. CVM and TT show an evident increase in bias in Strategy WFS, presumably due to the larger hyper-parameter space explored in this setting.
Table 6. Average Accuracy Bias over Datasets (Strategy NFS). P-values produced by a t-test with the null hypothesis that the mean bias is zero (P < 0.05 *, P < 0.01 **). NS stands for Non-Stratified.

Sample size  CVM        NS-CVM     CVM-CV     NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.1013**   0.0948**   0.0590**   0.0458**   0.0486**   0.0315**   -0.0127    -0.0311**
40           0.0685**   0.0682**   0.0443**   0.0357**   0.0168**   0.0096*    -0.0096    -0.0236**
60           0.0669**   0.0617**   0.0477**   0.0434**   0.0204**   0.0113**   0.0144**   0.0171**
80           0.0567**   0.0549**   0.0397**   0.0389**   0.0139**   0.0074*    0.0166**   0.0090**
100          0.0466**   0.0422**   0.0337**   0.0263**   -0.0050    -0.0084**  0.0093     0.0039
500          0.0076**   0.0075**   0.0036**   0.0028**   -0.0140**  -0.0141**  -0.0010    -0.0020
1500         0.0023**   0.0023**   0.0004     0.0010     -0.0091**  -0.0092**  -0.0015    -0.0020*
Table 7. Standard deviation of Accuracy estimations over Datasets (Strategy NFS). P-values produced by a test with the null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P < 0.05 *, P < 0.01 **). NS stands for Non-Stratified.

Sample size  CVM        NS-CVM     CVM-CV     NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.0766**   0.0862**   0.1086**   0.1067**   0.1242     0.1509     0.1390     0.1374
40           0.0591**   0.0600**   0.0747*    0.0826     0.0962     0.1033     0.0914     0.1118**
60           0.0435**   0.0463**   0.0595     0.0537*    0.0701     0.0786*    0.0650     0.0639
80           0.0431**   0.0454**   0.0522*    0.0520*    0.0675     0.0732     0.0629     0.0709
100          0.0444**   0.0422**   0.0510     0.0508     0.0682**   0.0663*    0.0564     0.0575
500          0.0359     0.0357     0.0356     0.0366     0.0432*    0.0436**   0.0368     0.0370
1500         0.0278     0.0282     0.0281     0.0275     0.0290     0.0299     0.0277     0.0278
Table 8. Average Accuracy on the hold-out sets (Strategy NFS). All methods use CVM for model selection and thus have the same performance on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM      NS-CVM
20           0.7699   0.7639
40           0.8061   0.8016
60           0.8186   0.8203
80           0.8296   0.8280
100          0.8351   0.8377
500          0.8816   0.8814
1500         0.8805   0.8805
Table 9. Average AUC Bias over Datasets (Strategy WFS). P-values produced by a t-test with the null hypothesis that the mean bias is zero (P < 0.05 *, P < 0.01 **). NS stands for Non-Stratified.

Sample size  CVM        NS-CVM     CVM-CV     NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.2589**   0.2821**   0.0651**   0.0128     0.2139**   0.1487**   -0.0634**  -0.1124**
40           0.1628**   0.1729**   0.0698**   0.0418**   0.0886**   0.0691**   -0.0359*   -0.0418**
60           0.1257**   0.1349**   0.0540**   0.0362**   0.051**    0.0563*    -0.0081    -0.0084
80           0.1034**   0.1094**   0.0475**   0.0484**   0.0321**   0.0381     -0.0222    -0.017
100          0.0985**   0.1029**   0.0595**   0.0525**   0.0284**   0.0294     0.0174**   0.0017*
500          0.0239**   0.0259**   0.0120**   0.0143**   -0.0242**  -0.0014    0.0079     0.0018
1500         0.0007     0.0001     -0.0007    -0.0005    -0.0122**  -0.0471**  -0.0031    -0.0035**
Table 10. Standard deviation of AUC estimations over Datasets (Strategy WFS). P-values produced by a test with the null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P < 0.05 *, P < 0.01 **). NS stands for Non-Stratified.

Sample size  CVM        NS-CVM     CVM-CV     NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.0685**   0.0596**   0.2202     0.2390*    0.1289**   0.1303**   0.2078     0.2456*
40           0.0711**   0.0769**   0.1584**   0.1911     0.1345**   0.0974**   0.2020     0.1896
60           0.0692**   0.0669**   0.1344     0.1590     0.1281*    0.0859**   0.1636     0.1669
80           0.0607**   0.0624**   0.1133**   0.1214**   0.1095**   0.0773**   0.1686     0.1581
100          0.0622**   0.0643**   0.1032*    0.1283     0.1168     0.0717**   0.1386     0.1689
500          0.0601**   0.0599**   0.0824     0.0740     0.1003**   0.0468**   0.0754     0.0882
1500         0.0408     0.0410     0.0424     0.0416     0.0519*    0.0312**   0.0443     0.0444
Table 11. Average AUC performance on the hold-out sets (Strategy WFS). All methods use CVM for model selection and thus have the same performance on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM      NS-CVM
20           0.6887   0.6871
40           0.7496   0.7471
60           0.7794   0.772
80           0.7983   0.7944
100          0.809    0.803
500          0.8643   0.863
1500         0.916    0.9165
Table 12. Average Accuracy Bias over Datasets (Strategy WFS). P-values produced by a t-test with the null hypothesis that the mean bias is zero (P < 0.05 *, P < 0.01 **). NS stands for Non-Stratified.

Sample size  CVM        NS-CVM     CVM-CV     NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.1439**   0.1507**   0.0555**   0.0214**   0.0763**   0.0757**   -0.0231    -0.0612**
40           0.0855**   0.0854**   0.0505**   0.0395**   0.0236**   0.0185**   -0.0060    -0.0325**
60           0.0690**   0.0671**   0.0411**   0.0393**   0.0121*    0.0075     0.0035     0.0039
80           0.0601**   0.0583**   0.0368**   0.0395**   0.0044     0.0026     0.0095*    -0.0013
100          0.0598**   0.0584**   0.0418**   0.0378**   0.0028     0.0005     0.0144**   0.0132**
500          0.0117**   0.0103**   0.0055**   0.0045**   -0.0173**  -0.0192**  -0.0031    -0.0035*
1500         0.0038**   0.0039**   0.0021**   0.0013     -0.0109**  -0.0108**  -0.0020*   -0.0019*
Table 13. Standard deviation of Accuracy estimations over Datasets (Strategy WFS). P-values produced by a test with the null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P < 0.05 *, P < 0.01 **). NS stands for Non-Stratified.

Sample size  CVM        NS-CVM     CVM-CV     NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.0593**   0.0703**   0.1384     0.1407     0.1143     0.1313     0.1286     0.1482
40           0.0488**   0.0528**   0.0706     0.0734     0.0885     0.0948     0.0812     0.1028**
60           0.0453**   0.0472**   0.0624*    0.0612**   0.0817     0.0858     0.0759     0.0711
80           0.0446**   0.0440**   0.0566     0.0500**   0.0761*    0.0766**   0.0650     0.0776*
100          0.0408**   0.0420**   0.0468**   0.0484*    0.0663*    0.0703**   0.0567     0.0566
500          0.0378     0.0371     0.0387     0.0388     0.0477*    0.0469*    0.0410     0.0405
1500         0.0299     0.0296     0.0300     0.0301     0.0318     0.0313     0.0300     0.0293
Table 14. Average Accuracy performance on the hold-out sets (Strategy WFS). All methods use CVM for model selection and thus have the same performance on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM      NS-CVM
20           0.7571   0.7542
40           0.8051   0.8001
60           0.8206   0.8204
80           0.8308   0.8303
100          0.8333   0.8320
500          0.8791   0.8806
1500         0.8802   0.8804