
International Journal on Artificial Intelligence Tools

Vol. XX, No. X (2015) 1–30

World Scientific Publishing Company


PERFORMANCE-ESTIMATION PROPERTIES OF CROSS-VALIDATION-BASED PROTOCOLS WITH SIMULTANEOUS HYPER-PARAMETER OPTIMIZATION

Ioannis Tsamardinos

Department of Computer Science, University of Crete, and

Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH)

Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece

tsamard.it@gmail.com

Amin Rakhshani

Department of Computer Science, University of Crete, and

Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Vassilika Vouton,

Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece

aminra@ics.forth.gr

Vincenzo Lagani

Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Vassilika Vouton,

Heraklion, GR-700 13, Greece

vlagani@ics.forth.gr

Received (Day Month Year)

Revised (Day Month Year)

Accepted (Day Month Year)

In a typical supervised data analysis task, one needs to perform the following two tasks: (a) select an optimal combination of learning methods (e.g., for variable selection and classification) and tune their hyper-parameters (e.g., K in K-NN), also called model selection, and (b) provide an estimate of the performance of the final, reported model. Combining the two tasks is not trivial: when one selects the set of hyper-parameters that seem to provide the best estimated performance, this estimation is optimistic (biased / overfitted) due to performing multiple statistical comparisons. In this paper, we discuss the theoretical properties of performance estimation when model selection is present, and we confirm that simple Cross-Validation with model selection is indeed optimistic (overestimates performance) in small sample scenarios and should be avoided. We present in detail and discuss the properties of the Nested Cross-Validation and of a method by Tibshirani and Tibshirani for removing the bias of the estimation, and investigate their theoretical properties. In computational experiments with real datasets both protocols provide conservative estimation of performance and should be preferred. These statements hold true even if feature selection is performed as preprocessing.

Keywords: Performance Estimation; Model Selection; Cross Validation; Stratification; Comparative Evaluation.


1. Introduction

A typical supervised analysis (e.g., classification or regression) consists of several steps that result in a final, single prediction or diagnostic model. For example, the analyst may need to impute missing values, perform variable selection or general dimensionality reduction, discretize variables, try several different representations of the data, and finally, apply a learning algorithm for classification or regression. Each of these steps requires a selection of algorithms out of hundreds or even thousands of possible choices, as well as the tuning of their hyper-parameters*. Hyper-parameter optimization is also called the model selection problem, since each combination of hyper-parameters tried leads to a possible classification or regression model, out of which the best is to be selected. There are several alternatives in the literature about how to identify a good combination of methods and their hyper-parameters (e.g., [1][2]), and they all involve implicitly or explicitly searching the space of hyper-parameters and trying different combinations. Unfortunately, trying multiple combinations, estimating their performance, and reporting the performance of the best model found leads to overestimating the performance (i.e., underestimating the error / loss), sometimes also referred to as overfitting†. This phenomenon is called the problem of multiple comparisons in induction algorithms; it has been analyzed in detail in [3] and is related to multiple testing or multiple comparisons in statistical hypothesis testing. Intuitively, when one selects among several models whose estimations vary around their true mean value, it becomes likely that what seems to be the best model has been "lucky" on the specific test set and its performance has been overestimated. Extensive discussions and experiments on the subject can be found in [2].

* We use the term "hyper-parameters" to denote the algorithm parameters that can be set by the user and are not estimated directly from the data, e.g., the parameter K in the K-NN algorithm. In contrast, the term "parameters" in the statistical literature typically refers to the model quantities that are estimated directly from the data, e.g., the weight vector w in a linear regression model $y = w \cdot x + b$. See [2] for a definition and discussion too.

† The term "overfitting" is a more general term and we prefer the term "overestimating" to characterize this phenomenon.

An intuitive small example now follows. Let's suppose method M1 has 85% true accuracy and method M2 has 83% true accuracy on a given classification task when trained with a randomly selected dataset of a given size. In 4 randomly drawn training and corresponding test sets on the same problem, the estimations of accuracy may be 80, 82, 88, 90 percent for M1 and 88, 85, 79, 79 percent for M2. If M1 were evaluated by itself, its mean accuracy would be estimated as 85%, and for M2 it would be 82.75%; both are close to their true means. If performance estimations were perfect, then M1 would be chosen each time and the average performance of the models returned with model selection would be 85%. However, when both methods are tried, the best is selected, and the maximum performance is reported, we obtain the series of estimations 88, 85, 88, 90, whose average is 87.75 and which will in general be biased. A larger example and contrived experiment now follows:

Example: In a binary classification problem, an analyst tries N different classification algorithms, producing N corresponding models from the data. They estimate the performance (accuracy) of each model on a test set of M samples. They then select the model that exhibits the best estimated accuracy $\hat{B}$ and report this performance as the estimated performance of the selected model. Let's assume that all models have the same true accuracy of 85%. What is the expected value of the estimate $\hat{B}$, and how biased is it?

Let's denote the true accuracy of each model with $P_i = 0.85$ and its estimate on the test set with $\hat{P}_i$. The true performance of the final model is of course also $B = \max_i P_i = 0.85$. But the estimated performance $\hat{B} = \max_i \hat{P}_i$ is biased. Table 1 shows $E[\hat{B}]$ for different values of N and M, assuming each model makes independent errors on the test set, as estimated with 10000 simulations. The table also shows the 5th and 95th percentiles as an indication of the range of the estimation. Invariably, the expected estimated accuracy $E[\hat{B}]$ of the final model is overestimated.

As expected, the bias increases with the number of models tried and decreases with the size of the test set. For sample sizes less than or equal to 100, the bias is significant: when the number of models produced is larger than 100, it is not uncommon to estimate the performance of the best model as 100%. Notice that, when using Cross-Validation-based protocols to estimate performance, each sample serves once and only once as a test case. Thus, one can consider the total dataset sample size as the size of the test set. Typical high-dimensional datasets in biology often contain less than 100 samples and thus, one should be careful with the estimation protocols employed for their analysis.

What about the number of different models tried in an analysis? Is it realistic to expect an analyst to generate thousands of different models? Obviously, it is very rare that any analyst will employ thousands of different algorithms; however, most learning algorithms are parameterized by several different hyper-parameters. For example, the standard 1-norm, polynomial Support Vector Machine algorithm takes as hyper-parameters the cost C of misclassifications and the degree of the polynomial d. Similarly, most variable selection methods take as input a statistical significance threshold or the number of variables to return. If an analyst tries several different methods for imputation, discretization, variable selection, and classification, each with several different hyper-parameter values, the number of combinations explodes and can easily reach into the thousands. Notice that model selection and optimistic estimation of performance may also happen unintentionally and implicitly in many other settings. More specifically, consider a typical publication where a new algorithm is introduced and its performance (after tuning the hyper-parameters) is compared against numerous other alternatives from the literature (again, after tuning their hyper-parameters), on several datasets. The comparison aims to comparatively evaluate the methods. However, the reported performances of the best method on each dataset suffer from the same problem of multiple inductions and are on average optimistically estimated.

Table 1. Average estimated accuracy $E[\hat{B}]$ when reporting the (estimated) performance of the best of N models with equal true accuracy of 85%. In brackets the 5th and 95th percentiles are shown. The smaller the sample size, and the larger the number of models N out of which selection is performed, the larger the overestimation.

Number of models \ Test set sample size | 20 | 40 | 80 | 100 | 500 | 1000
5    | 0.935 [0.85; 1.00] | 0.913 [0.85; 0.97] | 0.895 [0.86; 0.94] | 0.891 [0.86; 0.93] | 0.868 [0.85; 0.89] | 0.863 [0.85; 0.88]
10   | 0.959 [0.90; 1.00] | 0.930 [0.90; 0.97] | 0.908 [0.88; 0.94] | 0.902 [0.87; 0.93] | 0.874 [0.86; 0.89] | 0.867 [0.86; 0.88]
20   | 0.977 [0.95; 1.00] | 0.946 [0.90; 0.97] | 0.920 [0.89; 0.95] | 0.913 [0.89; 0.94] | 0.879 [0.87; 0.89] | 0.871 [0.86; 0.88]
50   | 0.993 [0.95; 1.00] | 0.961 [0.93; 1.00] | 0.933 [0.91; 0.96] | 0.925 [0.90; 0.95] | 0.885 [0.87; 0.90] | 0.875 [0.87; 0.88]
100  | 0.999 [1.00; 1.00] | 0.971 [0.95; 1.00] | 0.941 [0.93; 0.96] | 0.932 [0.91; 0.95] | 0.889 [0.88; 0.90] | 0.878 [0.87; 0.89]
1000 | 1.000 [1.00; 1.00] | 0.994 [0.97; 1.00] | 0.962 [0.95; 0.97] | 0.952 [0.94; 0.97] | 0.899 [0.89; 0.91] | 0.885 [0.88; 0.89]
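The following minimal Python sketch (ours, not the authors' original simulation code; the function and variable names are hypothetical) reproduces the kind of experiment behind Table 1: it draws the test-set accuracy estimates of N models with identical true accuracy and reports the expected maximum.

import numpy as np

rng = np.random.default_rng(0)

def best_model_estimate(n_models, test_size, true_acc=0.85, n_sim=10_000):
    """Simulate selecting the best of n_models models that all share the same
    true accuracy, each evaluated on a test set of test_size samples."""
    # Number of correct predictions per model per simulation (independent errors).
    correct = rng.binomial(test_size, true_acc, size=(n_sim, n_models))
    # Report the maximum estimated accuracy, as a model-selecting analyst would.
    best = correct.max(axis=1) / test_size
    return best.mean(), np.percentile(best, [5, 95])

mean, (p5, p95) = best_model_estimate(n_models=100, test_size=20)
print(f"E[estimate] ~ {mean:.3f}, 5th-95th percentiles [{p5:.2f}; {p95:.2f}]")
# The true accuracy is 0.85; the reported estimate is far higher (cf. Table 1).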

We now discuss the different factors that affect estimation. In the simulations above, we assume that the N different models provide independent predictions. However, this is unrealistic, as the same classifier with slightly different hyper-parameters will produce models that give correlated predictions (e.g., K-NN models with K=1 and K=3 will often make the same errors). Thus, in a real analysis setting, the amount of bias may be smaller than what is expected when assuming no dependence between models. The violation of independence makes the theoretical analysis of the bias difficult, and so in this paper we rely on the empirical evaluations of the different estimation protocols.

There are other factors that affect the bias. For example, the difference between the performance of the best method and that of the other methods attempted, relative to the variance of the estimation, affects the bias. If the best method attempted has a true accuracy of 85% with variance 3% and all the other methods attempted have a true accuracy of 50% with variance 3%, we do not expect considerable bias in the estimation: the best method will always be selected no matter whether its performance is overestimated or underestimated on the specific dataset, and thus on average its estimate will be unbiased. This observation actually forms the basis for the Tibshirani and Tibshirani method [4] described below.

In the remainder of the paper, we revisit the Cross-Validation (CV) protocol. We corroborate [2][5] that CV overestimates performance when it is used with hyper-parameter optimization. As expected, overestimation of performance increases with decreasing sample sizes. We present three other performance estimation methods in the literature. The first is a simple approach that re-evaluates CV performance by using a different split of the data (CVM-CV)‡. The method by Tibshirani and Tibshirani (hereafter TT) [4] tries to estimate the bias and remove it from the estimation. The Nested Cross-Validation (NCV) method [6] cross-validates the whole hyper-parameter optimization procedure (which includes an inner cross-validation, hence the name). NCV is a generalization of the technique where data is partitioned in train-validation-test sets.

‡ We thank the anonymous reviewers for suggesting the method.


We show that the behavior of the four methods is markedly different in terms of both bias and variance, ranging from overestimation to conservative estimation of performance. To our knowledge, this is the first time these methods are compared against each other on real datasets.

There are two sets of experiments, namely with and without a feature selection preprocessing step. On one hand, we expect that the models will gain predictive power from the elimination of irrelevant or superfluous variables. On the other hand, the inclusion of one further modelling step increases the number of hyper-parameter configurations to evaluate, and thus performance overestimation should increase as well. Empirically, we show that this is indeed the case. The effect of stratification is also empirically examined. Stratification is a technique that, during partitioning of the data into folds, forces the same distribution of the outcome classes on each fold. When data are split randomly, on average the distribution of the outcome in each fold will be the same as in the whole dataset. However, with small sample sizes or imbalanced data it can happen that a fold gets no samples that belong to one of the classes (or, in general, that the class distribution in a fold is very different from the original). Stratification ensures that this does not occur. We show that stratification has different effects depending on (a) the specific performance estimation method and (b) the performance metric. However, we argue that stratification should always be applied as a cautionary measure against excessive variance in performance estimation.

2. Cross-Validation Without Hyper-Parameter Optimization (CV)

K-fold Cross-Validation is perhaps the most common method for estimating the performance of a learning method for small and medium sample sizes. Despite its popularity, its theoretical properties are arguably not well known, especially outside the machine learning community, particularly when it is employed with simultaneous hyper-parameter optimization, as evidenced by the following common machine learning books. Duda ([7], p. 484) presents CV without discussing it in the context of model selection and only hints that it may underestimate (when used without model selection): "The jackknife [i.e., leave-one-out CV] in particular, generally gives good estimates because each of the n classifiers is quite similar to the classifier being tested …". Similarly, Mitchell [8] (pp. 112, 147, 150) mentions CV but only in the context of hyper-parameter optimization. Bishop [9] does not deal at all with issues of performance estimation and model selection. A notable exception is the book by Hastie and co-authors [10], which offers the best treatment of the subject and upon which the following sections are based. Yet, even there CV is not discussed in the context of model selection.

Let's assume a dataset $D = \{\langle x_i, y_i \rangle, i = 1, \ldots, N\}$ of identically and independently distributed (i.i.d.) predictor vectors $x_i$ and corresponding outcomes $y_i$. Let us also assume that we have a single method f for learning from such data and producing a single prediction model. We will denote with $f(D)(x)$ the output of the model produced by the learner f when trained on data D and applied on input $x$. The actual model produced by f on dataset D is denoted with $f(D)$. We will denote with $L(y, \hat{y})$ the loss (error) measure of prediction $\hat{y}$ when the true output is $y$. One common loss function is the zero-one loss function: $L(y, \hat{y}) = 0$ if $y = \hat{y}$ and $L(y, \hat{y}) = 1$ otherwise. Thus, the average zero-one loss of a classifier equals 1 - accuracy, i.e., it is the probability of making an incorrect classification. K-fold CV partitions the data D into K subsets called folds $F_1, \ldots, F_K$. We denote with $D_{\setminus k}$ the data excluding fold $F_k$ and with $|F_k|$ the sample size of each fold. The K-fold CV algorithm is shown in Algorithm 1.

Algorithm 1: K-Fold Cross-Validation
Input: A dataset D
Output: A model M; an estimation of performance (loss) $L_{CV}$ of M
Randomly partition D into K folds $F_1, \ldots, F_K$
Model $M \leftarrow f(D)$   // the model learned on all data D
Estimation $L_{CV} \leftarrow \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{\langle x_i, y_i \rangle \in F_k} L(y_i, f(D_{\setminus k})(x_i))$
Return $\langle M, L_{CV} \rangle$
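As a concrete companion to Algorithm 1, here is a minimal Python sketch (our illustration, not the paper's code; `learner` stands in for the generic method f and returns a callable model):

import numpy as np

def k_fold_cv(X, y, learner, K=10, seed=0):
    """Algorithm 1: return the model trained on all data plus the CV loss estimate."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)            # random partition into K folds
    fold_losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = learner(X[train], y[train])   # f trained on D \ F_k
        fold_losses.append(np.mean(model(X[test]) != y[test]))  # zero-one loss on F_k
    return learner(X, y), float(np.mean(fold_losses))           # model f(D) and L_CV

def majority_learner(X, y):
    """A trivial learner: always predict the most frequent training class."""
    label = np.bincount(y).argmax()
    return lambda X_new: np.full(len(X_new), label)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model, loss = k_fold_cv(X, y, majority_learner, K=10)
print(loss)  # roughly 0.5 for this trivial learner on a balanced problem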

First, notice that CV should return the model learned from all data D, $f(D)$§. This is the model to be employed operationally for classification. CV then tries to estimate the performance of the returned model by constructing K other models from the datasets $D_{\setminus k}$, each time excluding one fold from the training set. Each of these models is then applied on its corresponding excluded fold serving as test set, and the loss is averaged over all samples.

Is $L_{CV}$ an unbiased estimate of the loss of $f(D)$? First, notice that each sample is used once and only once as a test case. Thus, effectively there are as many i.i.d. test cases as samples in the dataset. Perhaps this characteristic is what makes CV so popular versus other protocols, such as repeatedly partitioning the dataset into train-test subsets. The test size being as large as possible could facilitate the estimation of the loss and its variance (although theoretical results show that there is no unbiased estimator of the variance for the CV! [11]). However, test cases are predicted with different models! If these models were trained on independent train sets of size equal to the original data D, then CV would indeed estimate the average loss of the models produced by the specific learning method on the specific task when trained with the specific sample size. As it stands though, since the models are correlated and are trained on smaller sample sizes than the original, we can state the following:

K-fold CV estimates the average loss of models returned by the specific learning method f on the specific classification task when trained with subsets of D of size $|D| \cdot (K-1)/K$.

§ This is often a source of confusion for some practitioners, who sometimes wonder which model to return out of the ones produced during Cross-Validation.


Since $|D_{\setminus k}| = |D| \cdot (K-1)/K < |D|$ (e.g., for 5-fold CV, we are using 80% of the total sample size for training each time), and assuming that the learning method improves on average with larger sample sizes, we expect $L_{CV}$ to be conservative (i.e., the true performance to be underestimated). Exactly how conservative it will be depends on where the classifier is operating on its learning curve for this specific task. It also depends on the number of folds K: the larger the K, the more (K-1)/K approaches 100% and the bias disappears, i.e., leave-one-out CV should be the least biased (however, there may still be significant estimation problems; see [12], p. 151, and [5] for an extreme failure of leave-one-out CV). When sample sizes are small or distributions are imbalanced (i.e., some classes are quite rare in the data), we expect most classifiers to quickly benefit from increased sample size, and thus $L_{CV}$ to be more conservative.

3. Cross-Validation With Hyper-Parameter Optimization (CVM)

A typical data analysis involves several steps (representing the data, imputation, discretization, variable selection or dimensionality reduction, learning a classifier), each with hundreds of available choices of algorithms in the literature. In addition, each algorithm takes several hyper-parameter values that should be tuned by the user. A general method for tuning the hyper-parameters is to try a set of predefined combinations of methods and corresponding values and select the best. We will represent this set with a set a of hyper-parameter value combinations, e.g., a = { ⟨no variable selection, K-NN with K=5⟩, ⟨Lasso with λ=2, linear SVM with C=10⟩ } when the intent is to try K-NN with no variable selection, and a linear SVM using the Lasso algorithm for variable selection. The pseudo-code is shown in Algorithm 2. The symbol $f(D, a)(x)$ now denotes the output of the model learned when using hyper-parameters a on dataset D and applied on input x. Correspondingly, the symbol $f(D, a)$ denotes the model produced by applying hyper-parameters a on D. The quantity $L_{CV}(a)$ is now parameterized by the specific values a, and the minimizer of the loss (maximizer of performance) $a^*$ is found. The final model returned is $f(D, a^*)$, i.e., the model produced by setting the hyper-parameter values to $a^*$ and learning from all data D.

On one hand, we expect CV with model selection (hereafter, CVM) to underestimate performance, because estimations are computed using models trained on only a subset of the dataset. On the other hand, we expect CVM to overestimate performance, because it returns the maximum performance found after trying several hyper-parameter values. In Section 8 we examine this behavior empirically and determine (in concordance with [2], [5]) that, when the sample size is relatively small and the number of models tried is in the hundreds, CVM indeed overestimates performance. Thus, in these cases other types of estimation protocols are required.

Algorithm 2: K-Fold Cross-Validation with Hyper-Parameter Optimization (Model Selection)
Input: A dataset D; a set a of hyper-parameter value combinations
Output: A model M; an estimation of performance (loss) of M
Partition D into K folds $F_1, \ldots, F_K$
Estimate $L_{CV}(a_j)$ for each $a_j \in a$:
  $L_{CV}(a_j) \leftarrow \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{\langle x_i, y_i \rangle \in F_k} L(y_i, f(D_{\setminus k}, a_j)(x_i))$
Find the minimizer $a^*$ of $L_{CV}(a_j)$   // "best hyper-parameters"
$M \leftarrow f(D, a^*)$   // the model from all data D with the best hyper-parameters
Return $\langle M, L_{CV}(a^*) \rangle$

4. The Double Cross Validation Method (CVM-CV)

The CVM is biased because, when trying hundreds or more learning methods, what appears to be the best one has probably also been "lucky" on the particular test sets. Thus, one idea to reduce the bias is to re-evaluate the selected, best method on different test sets. Of course, since we are limited to the given samples (dataset), it is impossible to do so on truly different test cases. One idea, thus, is to re-evaluate the selected method on a different split (partitioning) into folds, repeating Cross-Validation only for the single, selected, best method. We name this approach CVM-CV, since it sequentially performs CVM and CV for model selection and performance estimation, respectively; it is shown in Algorithm 3. The tilde symbol '~' is used to denote a returned value that is ignored. Notice that CVM-CV is not theoretically expected to fully remove the overestimation bias: information from the test sets in the final Cross-Validation step for performance estimation is still employed during training to select the best model. Nevertheless, our experiments show that this relatively computationally-efficient approach does reduce the CVM overestimation bias.

Algorithm 3: Double Cross Validation (CVM-CV)
Input: A dataset D; a set a of hyper-parameter value combinations
Output: A model M; an estimation of performance (loss) of M
Partition D into K folds $F_1, \ldots, F_K$
$\langle M, \sim \rangle \leftarrow$ CVM(D, a); let $a^*$ be the hyper-parameter configuration corresponding to M
Partition D into K new, randomly chosen folds $F'_1, \ldots, F'_K$
Estimation $L \leftarrow \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F'_k|} \sum_{\langle x_i, y_i \rangle \in F'_k} L(y_i, f(D \setminus F'_k, a^*)(x_i))$
Return $\langle M, L \rangle$
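To make Algorithm 2 concrete, here is a minimal Python sketch in the same style as the CV snippet above (ours, not the paper's code); each hyper-parameter configuration a_j is represented as a named learner factory:

import numpy as np

def cvm(X, y, learners, K=10, seed=0):
    """Algorithm 2: CV with model selection. `learners` maps each hyper-parameter
    configuration a_j to a function (X, y) -> model. Returns the final model,
    the (optimistic) CV loss of the winner, and the winning configuration a*."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    cv_loss = {}
    for name, learner in learners.items():
        losses = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = learner(X[train], y[train])
            losses.append(np.mean(model(X[test]) != y[test]))
        cv_loss[name] = float(np.mean(losses))
    a_star = min(cv_loss, key=cv_loss.get)          # best hyper-parameters a*
    return learners[a_star](X, y), cv_loss[a_star], a_star

Reporting cv_loss[a_star] as the performance of the returned model is exactly the optimistic estimate this section warns about.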

5. The Tibshirani and Tibshirani (TT) Method

The TT method [4] attempts to heuristically and approximately estimate the bias of the CV error estimation due to model selection and add it to the final estimate of the loss. For each fold, the bias due to model selection is estimated as $L_k(a^*) - L_k(a_k)$, where, as before, $L_k(a)$ is the average loss in fold k under hyper-parameter values a, $a_k$ is the hyper-parameter values that minimize the loss for fold k, and $a^*$ is the global minimizer over all folds; the overall bias estimate is the average $\widehat{Bias} = \frac{1}{K} \sum_{k=1}^{K} (L_k(a^*) - L_k(a_k))$. Notice that, if in all folds the same values provide the best performance, then these will also be selected globally and hence $a_k = a^*$ for all k. In this case, the bias estimate will be zero. The justification of this estimate for the bias is in [4]. It is quite important to notice that TT does not require any additional model training and has minimal computational overhead.
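In code, given the per-fold, per-configuration losses already computed during CVM, the TT correction is a few lines; a minimal sketch (ours, with hypothetical variable names):

import numpy as np

def tt_estimate(fold_losses):
    """fold_losses[k, j] = average loss in fold k under hyper-parameter config a_j.
    Returns the TT-corrected loss estimate for the globally best configuration."""
    cv_loss = fold_losses.mean(axis=0)           # L_CV(a_j) for each configuration
    j_star = cv_loss.argmin()                    # global minimizer a*
    per_fold_best = fold_losses.min(axis=1)      # L_k(a_k) for each fold
    bias = np.mean(fold_losses[:, j_star] - per_fold_best)
    return cv_loss[j_star] + bias                # corrected, more conservative loss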

6. The Nested Cross-Validation Method (NCV)

We could not trace who first introduced or coined the name Nested Cross-Validation (NCV), but the authors have independently discovered the method and have been using it since 2005 [6],[13],[14]; one early comment hinting at the method is in [15], while Witten and Frank briefly discuss the need of treating any parameter tuning step as part of the training process when assessing performance (see [12], page 286).

Algorithm 4: The Tibshirani and Tibshirani (TT) Method
Input: A dataset D; a set a of hyper-parameter value combinations
Output: A model M; an estimation of performance (loss) of M
Partition D into K folds $F_1, \ldots, F_K$
Estimate $L_k(a_j)$ for each fold k and each $a_j \in a$
Find the minimizer $a^*$ of $L_{CV}(a_j) = \frac{1}{K} \sum_k L_k(a_j)$   // global optimizer
Find the minimizers $a_k$ of $L_k(a_j)$   // the minimizers for each fold
Estimate $\widehat{Bias} \leftarrow \frac{1}{K} \sum_{k=1}^{K} (L_k(a^*) - L_k(a_k))$
$M \leftarrow f(D, a^*)$, i.e., the model learned on all data D with the best hyper-parameters
Return $\langle M, L_{CV}(a^*) + \widehat{Bias} \rangle$


A similar method was used in a bioinformatics analysis as early as 2003 [16]. The main idea is to consider the model selection as part of the learning procedure f. Thus, f tests several hyper-parameter values, selects the best using CV, and returns a single model. NCV cross-validates f to estimate the performance of the average model returned by f, just as normal CV would do with any other learning method taking no hyper-parameters; it is just that f now contains an internal CV trying to select the best model. NCV is a generalization of the Train-Validation-Test protocol, where one trains on the Train set for all hyper-parameter values, selects the values that provide the best performance on Validation, trains on Train+Validation a single model using the best-found values, and estimates its performance on Test. Since Test is used only once by a single model, performance estimation has no bias due to the model selection process. The final model is trained on all data using the best found values for a. NCV generalizes the above protocol to cross-validate every step of this procedure: for each choice of Test fold, the remaining folds serve in turn as Validation, and this process is repeated for each fold serving as Test. The pseudo-code is shown in Algorithm 5. It is similar to CV (Algorithm 1) with CVM (Cross-Validation with Model Selection, Algorithm 2) serving as the learning function f. NCV requires training a number of models that is quadratic in the number of folds K (one model is trained for every possible pair of folds serving as test and validation, respectively), and thus it is the most computationally expensive protocol out of the four.

Algorithm 5: K-Fold Nested Cross-Validation
Input: A dataset D; a set a of hyper-parameter value combinations
Output: A model M; an estimation of performance (loss) of M
Partition D into K folds $F_1, \ldots, F_K$
$M \leftarrow$ CVM(D, a)   // best performing model on all of D
Estimation $L_{NCV} \leftarrow \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{\langle x_i, y_i \rangle \in F_k} L(y_i, M_k(x_i))$, where $M_k \leftarrow$ CVM($D_{\setminus k}$, a)   // best performing model on $D_{\setminus k}$
Return $\langle M, L_{NCV} \rangle$
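Under the same conventions, a minimal sketch of NCV (ours), reusing the cvm function from the sketch in Section 3; the whole model-selection procedure is treated as the learner and cross-validated:

import numpy as np

def ncv(X, y, learners, K=10, seed=0):
    """Algorithm 5: cross-validate the whole model-selection procedure."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # Inner CV (with K-1 folds) selects and trains the best model on D \ F_k.
        model, _, _ = cvm(X[train], y[train], learners, K=K - 1, seed=seed)
        losses.append(np.mean(model(X[test]) != y[test]))    # F_k never used for selection
    final_model, _, _ = cvm(X, y, learners, K=K, seed=seed)  # model returned to the user
    return final_model, float(np.mean(losses))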

7. Stratification of Folds

In CV, folds are partitioned randomly, which should maintain on average the same class distribution in each fold. However, in cases of small sample sizes or highly imbalanced class distributions, it may happen that some folds contain no samples from one of the classes (or, in general, that the class distribution is very different from the original). In that case, the estimation of performance for that fold will exclude that class. To avoid this, "in stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset" [5]. Notice that leave-one-out CV guarantees that each fold will be unstratified, since it contains only one sample, which can cause serious estimation problems ([12], p. 151, [5]).
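A minimal sketch (ours) of one way to build stratified folds; scikit-learn's StratifiedKFold offers an off-the-shelf implementation of the same idea:

import numpy as np

def stratified_folds(y, K, seed=0):
    """Assign each sample to one of K folds so that every fold keeps approximately
    the original class proportions (stratification)."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        # Deal this class's samples round-robin across the K folds.
        fold_of[members] = np.arange(len(members)) % K
    return fold_of

y = np.array([0] * 45 + [1] * 5)   # imbalanced toy outcome (9:1)
folds = stratified_folds(y, K=5)
for k in range(5):                  # every fold gets exactly one minority sample
    print(k, np.bincount(y[folds == k], minlength=2))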

8. Empirical Comparison of Different Protocols

We performed an empirical comparison in order to assess the characteristics of each data-analysis protocol. In particular, we focus on three specific aspects of the protocols' performance: (a) bias and variance of the estimation, (b) effect of feature selection, and (c) effect of stratification.

Notice that, assuming data are partitioned into the same folds, all methods return the same model, that is, the model returned by f on all data D using the minimizer $a^*$ of the CV error. However, each method returns a different estimate of the performance for this model.

8.1. The Experimental Set-Up

Original Datasets: Seven datasets from different scientific fields were employed for the experimentations. The computational task for each dataset consists of predicting a binary outcome on the basis of a set of numerical predictors (binary classification). Datasets were selected to have a relatively large number of samples, so that several smaller datasets that follow the same joint distribution can be sub-sampled from the original dataset; when the sample size is large, the sub-samples are relatively independent, providing independent estimates of all metrics in the experimental part. In more detail, the SPECT [17] dataset contains data from Single Photon Emission Computed Tomography images collected in both healthy and cardiac patients. Data in Gamma [18] consist of simulated registrations of high-energy gamma particles in an atmospheric Cherenkov telescope, where each particle can originate from the upper atmosphere (background noise) or be a primary gamma particle (signal). Discriminating biodegradable vs. non-biodegradable molecules on the basis of their chemical characteristics is the aim of the Biodeg [19] dataset. The Bank [20] dataset was gathered in direct marketing campaigns (phone calls) of a Portuguese banking institution, for discriminating customers who want to subscribe to a term deposit from those who do not. CD4vsCD8 [21] contains the phosphorylation levels of 18 intra-cellular proteins as predictors to discriminate naïve CD4+ and CD8+ human immune system cells. SeismicBumps [22] focuses on forecasting high-energy seismic bumps (higher than 10^4 J) in coal mines; data come from longwalls located in a Polish coal mine. Last, the MiniBooNE dataset is taken from the first phase of the Booster Neutrino Experiment conducted at Fermilab [23]; the goal is to distinguish between electron neutrinos (signal) and muon neutrinos (background). Table 2 summarizes the datasets' characteristics. It should be noticed that the outcome distribution considerably varies across datasets.

Model Selection: To generate the hyper-parameter vectors in a we employed two different strategies, named No Feature Selection (NFS) and With Feature Selection (WFS).

Strategy NFS (No Feature Selection) includes three different modelers: the Logistic Regression classifier ([9], p. 205), as implemented in Matlab 2013b, which takes no hyper-parameters; the Decision Tree [24], as also implemented in Matlab 2013b, with hyper-parameters MinLeaf and MinParents both within {1, 2, ..., 10, 20, 30, 40, 50}; and Support Vector Machines as implemented in the libsvm software [25], with linear, Gaussian, and polynomial kernels, whose kernel hyper-parameters and cost parameter C are varied over predefined grids. When a classifier takes multiple hyper-parameters, all combinations of choices are tried. Overall, 247 hyper-parameter value combinations and corresponding models are produced each time to select the best.

Strategy WFS (With Feature Selection) adds feature selection methods as preprocessing steps to Strategy NFS. Two feature selection methods are tried each time, namely univariate selection and the Statistically Equivalent Signature (SES) algorithm [26]. The former simply applies a statistical test for assessing the association between each individual predictor and the target outcome (chi-square test for categorical variables and Student's t-test for continuous ones); predictors whose p-values are below a given significance threshold t are retained for successive computations. The SES algorithm [26] belongs to the family of constraint-based, Markov-Blanket-inspired feature selection methods [27]. In short, SES repetitively applies a statistical test of conditional independence for identifying the set of predictors that are associated with the outcome given any combination of the remaining variables. SES also requires the user to set a priori a significance threshold t, along with the hyper-parameter maxK that limits the number of predictors to condition upon. Both feature selection methods are coupled in turn with each modeler in order to build the hyper-parameter vectors a. The significance threshold t is varied in {0.01, 0.05} for both methods, while maxK varies in {3, 5}, bringing the number of hyper-parameter combinations produced in Strategy WFS to 1729.

Sub-Datasets and Hold-out Datasets: Each original dataset D is partitioned into two separate, stratified parts: Dpool, containing 30% of the total samples, and the hold-out set Dhold-out, consisting of the remaining samples. Subsequently, from each Dpool, N sub-datasets are randomly sampled with replacement for each sample size in the set {20, 40, 60, 80, 100, 500, 1500}, for a total of 7 (datasets) × 7 (sample sizes) × N sub-datasets Di,j,k (where i indexes the original dataset, j the sample size, and k the sub-sampling). Most of the original datasets have been selected with a relatively large sample size, so that each Dhold-out is large enough to allow an accurate (low-variance) estimation of performance. In addition, the size of Dpool is also relatively large, so that each sub-sampled dataset can be approximately considered a dataset independently sampled from the data population of the problem. Nevertheless, we also include a couple of datasets with smaller sample size. We set the number of sub-samples to N = 30.

Table 2. Datasets' characteristics. Dpool is a 30% partition from which sub-sampled datasets are produced. Dhold-out is the remaining 70% of samples from which an accurate estimation of the true performance is computed.

Dataset Name | # Samples | # Attributes | Classes ratio | |Dpool| | |Dhold-out| | Ref.
SPECT        | 267   | 22 | 3.85 | 81    | 186   | [17]
Biodeg       | 1055  | 41 | 1.96 | 317   | 738   | [19]
SeismicBumps | 2584  | 18 | 14.2 | 776   | 1808  | [22]
Gamma        | 19020 | 11 | 1.84 | 5706  | 13314 | [18]
CD4vsCD8     | 24126 | 18 | 1.13 | 7238  | 16888 | [21]
MiniBooNE    | 31104 | 50 | 2.55 | 9332  | 21772 | [23]
Bank         | 45211 | 17 | 7.54 | 13564 | 31647 | [20]

Bias and Variance of each Protocol: For each of the data analysis protocols CVM, CVM-CV, TT, and NCV, both the stratified and the non-stratified versions are applied to each sub-dataset, in order to select the "best model / hyper-parameter values" and estimate its performance. For each sub-dataset, the same split in folds was employed for the stratified versions of CVM, CVM-CV, TT and NCV, so that the four data-analysis protocols always select exactly the same model and differ only in the estimation of performance. For the NCV, the internal CV loop uses $K' = K - 1$ folds. Some of the datasets, though, are characterized by a particularly high class ratio, and typically this leads to a scarcity of instances of the rarest class in some sub-datasets. If the number of instances of a given class is smaller than K, we set K to the number of instances of the rarest class, in order to ensure the presence of both classes in each fold. For NCV and NS-NCV, we forgo analyzing sub-datasets where K < 3.

The bias is computed as the difference between the loss on the hold-out set and the loss estimated by the protocol. Thus, a positive bias indicates a higher "true" error (i.e., as estimated on the hold-out set) than the one estimated by the corresponding analysis protocol and implies the estimation protocol is optimistic. For each protocol, original dataset, and sample size, the true mean bias, its variance and its standard deviation are computed over the 30 sub-samplings.

Figure 1. Average loss and variance for AUC metric in Strategy NFS. From left to right: stratified CVM, CVM-CV, TT, and NCV. Top row contains average bias, second row the standard deviation of performance estimation. The results largely vary depending on the specific dataset. In general, CVM is clearly optimistic (positive bias) for sample sizes less than or equal to 100, while NCV tends to underestimate performances. CVM-CV and TT show a behavior that is in between these two extremes. CVM has the lowest variance, at least for small sample sizes.

Performance Metric: All algorithms are presented using a loss function L computed for each sample and averaged first within each fold and then over all folds. The zero-one loss function is typically assumed, corresponding to 1 - accuracy of the classifier. A valid alternative metric for binary classification problems is the Area Under the Receiver Operating Characteristic Curve (AUC) [28]. The AUC does not depend on the prior class distribution. In contrast, the zero-one loss depends on the class distribution: for a problem with a class distribution of 50-50%, a classifier with accuracy 85% (loss 15%) has greatly improved over the baseline of a trivial classifier predicting the majority class; for a problem with an 84-16% class distribution, a classifier with 85% accuracy has not improved much over the baseline. On the other hand, computing AUC on small test sets leads to poor estimates [29], and it is impossible when leave-one-out cross-validation is used (unless multiple predictions are pooled together, a practice that creates additional issues [29]). Moreover, the AUC cannot be expressed as a loss function $L(y, \hat{y})$ where $\hat{y}$ is a single prediction. Nevertheless, all Algorithms 1-5 remain the same if we substitute $L_k = 1 - AUC_k$, i.e., the error in fold k is 1 minus the AUC of the model learned by f on all data except fold $F_k$, as estimated on $F_k$ as the test set. In order to contrast the properties of the two metrics, we have performed all analyses twice, using in turn the 0-1 loss and 1-AUC as the metrics to optimize. For both metrics a positive bias corresponds to overestimated performance.
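For instance, using scikit-learn's roc_auc_score, the per-fold substitution looks as follows (a sketch under our naming, assuming the model outputs a continuous score for the positive class):

import numpy as np
from sklearn.metrics import roc_auc_score

def fold_auc_loss(model_scores, y_test):
    """1 - AUC of a model's continuous scores on one test fold; used in place of
    the averaged zero-one loss in the algorithms above."""
    return 1.0 - roc_auc_score(y_test, model_scores)

y_test = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(fold_auc_loss(scores, y_test))  # 0.0 would mean a perfect ranking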

8.2. Experimental Results

The results of the analysis greatly differ depending on the specific dataset, performance metric and performance estimation method. The following Tables and Figures show the results obtained with the AUC metric, while the remaining results are provided in Appendix A.

We first comment on the results obtained in the experimentation following Strategy NFS. Figure 1 (first row) shows the average loss bias of the four methods (stratified versions), suggesting that CVM does indeed overestimate performance for small sample sizes (i.e., it underestimates the error), corroborating the results in [2],[5]. In contrast, NCV tends to be over-pessimistic and to underestimate the real performances. CVM-CV and TT exhibit results that are between these two extremes, with CVM > CVM-CV > TT > NCV in terms of overestimation. It should be noted that the results on the SeismicBumps dataset strongly penalize the CVM, CVM-CV and TT methods. Interestingly, this dataset has the highest ratio between outcome classes (14.2), suggesting that estimating AUC performances in highly imbalanced datasets may be challenging for these three methods.


Table 3 shows the bias averaged over all datasets: CVM overestimates AUC by up to ~17 points for small sample sizes, while CVM-CV and TT always remain below 10 points, and the NCV bias never exceeds 5 points in magnitude. We perform a t-test for the null hypothesis that the bias is zero, which is typically rejected: all methods usually exhibit some bias, whether positive or negative.

Figure 2. Effect of stratification using the AUC in Strategy NFS. The first column reports the average bias for the stratified versions of each method, the second column for the non-stratified ones. The effect of stratification is dependent on each specific method and dataset.


The second row of Figure 1 and Table 4 show the standard deviations of the loss bias. We apply O'Brien's modification of Levene's statistical test [30] with the null hypothesis that the variance of a method is the same as the corresponding variance of the NCV for the same sample size. We note that CVM has the lowest variance, while all the other methods show almost statistically indistinguishable variances.

Table 5 reports the average performances on the hold-out set. As expected, these performances improve with sample size, because the final models are trained on a larger number of samples. The corresponding results for the accuracy metric are reported in Figure 4 and Tables 6-8 in Appendix A, and generally follow the same patterns and conclusions as the ones reported above for the AUC metric. The only noticeable difference is a large improvement in the average bias for the SeismicBumps dataset. A close inspection of the results for this dataset reveals that all methods tend to select models with performances close to the trivial classifier, both during model selection and during the performance estimation on the hold-out set. The selection of these models minimizes the bias, but they have little practical utility, since they tend to predict the most frequent class. This example clearly underlines accuracy's inadequacy for tasks involving highly imbalanced datasets.

Figure 3. Comparing Strategy NFS (No Feature Selection) and WFS (With Feature Selection) over AUC. The first row reports the average bias of each method for Strategy NFS and WFS, respectively, while the second row provides the variance in performance estimation. CVM and TT show an evident increment in bias in Strategy WFS, presumably due to the larger hyper-parameter space explored in this setting.

Figure 2 contrasts the methods' stratified and non-stratified versions for Strategy NFS and AUC. The effect of stratification seems to be quite dependent on the specific method. The non-stratified version of NCV has larger bias and variance than the stratified version for small sample sizes, while for the other protocols the non-stratified version shows a decreased bias at the cost of larger variance (see Table 3 and Table 4). Interestingly, the results for the accuracy metric show an almost identical pattern (see Figure 5, Tables 6 and 7 in Appendix A). In general, we do not suggest forgoing stratification, given the increase in variance that this usually entails.

Finally, Figure 3 shows the effect of feature selection in the analysis and contrasts Strategy NFS and Strategy WFS on the AUC metric. The average bias for both CVM and TT increases in Strategy WFS. This increment is explained by the fact that Strategy WFS explores a larger hyper-parameter space than Strategy NFS. The lack of increase in predictive power in Strategy WFS is probably due to the absence of irrelevant variables: all datasets have a limited dimensionality (max number of features: 50). In terms of variance, the NCV method shows a decrease in standard deviation for small sample sizes in the experimentation with feature selection. Similar results are observed with the accuracy metric (Figure 6), where the decrease in variance is present for all the methods.

9. Related Work and Discussion

Estimating the performance of the final reported model while simultaneously selecting the best pipeline of algorithms and tuning their hyper-parameters is a fundamental task for any data analyst. Yet, arguably, these issues have not been examined in full depth in the literature. The origins of cross-validation can be traced back to the "jackknife" technique of Quenouille [31] in the statistical community. In machine learning, [5] studied cross-validation without model selection (the title of the paper may be confusing), comparing it against the bootstrap and reaching the important conclusions that (a) CV is preferable to the bootstrap, (b) a value of K=10 for the number of folds is preferable to leave-one-out, and (c) stratification is also always preferable. In terms of theory, Bengio [11] showed that there exists no unbiased estimator of the variance of the CV performance estimation, which impacts hypothesis testing of performance using the CV. To the extent of our knowledge, the first to study the problem of bias in the context of model selection in machine learning is [3]. Varma [32] demonstrated the optimism of the CVM protocol and instead suggests the use of the NCV protocol; unfortunately, all their experiments are performed on simulated data only. Tibshirani and Tibshirani [4] introduced the TT protocol, but they do not compare it against alternatives and include only a proof-of-concept experiment on a single dataset. Thus, the present paper is the first work that compares all four protocols (CVM, CVM-CV, NCV, and TT) on multiple real datasets.

Based on our experiments, we found evidence that both the CVM-CV and the TT method have relatively small bias for sample sizes above 20 and have about the same variance as the NCV; the TT method does not introduce additional computational overhead.

Table 3. Average AUC bias over datasets (Strategy NFS). P-values produced by a t-test with null hypothesis that the mean bias is zero (P < 0.05*, P < 0.01**). NS stands for Non-Stratified. Rows correspond to sample sizes.

Size | CVM | NS-CVM | CVM-CV | NS-CVM-CV | TT | NS-TT | NCV | NS-NCV
20   | 0.1702** | 0.1581** | 0.0929** | 0.0407** | 0.0798**  | 0.1237**  | -0.0483*  | -0.0696**
40   | 0.1321** | 0.1367** | 0.0695** | 0.0398** | 0.0418**  | 0.0744**  | -0.0137   | -0.0364*
60   | 0.1095** | 0.1072** | 0.0647** | 0.0223** | 0.0371**  | 0.0538*   | 0.0065    | -0.0113
80   | 0.0939** | 0.0933** | 0.0574** | 0.0348** | 0.0230**  | 0.0447    | -0.0162   | -0.0147
100  | 0.0803** | 0.0788** | 0.0499** | 0.0351** | 0.0056    | 0.0296    | 0.0093*   | 0.0017
500  | 0.0197** | 0.0172** | 0.0143** | 0.0079*  | -0.0236** | 0.0068    | 0.0031    | 0.0002
1500 | -0.0023  | -0.0024  | -0.0031  | -0.0028  | -0.0132** | -0.0447** | -0.0049** | -0.0044**

Table 4. Standard deviation of AUC estimations over datasets (Strategy NFS). P-values produced by a test with null hypothesis that the variances are the same as the corresponding variance of the NCV protocol (P < 0.05*, P < 0.01**). NS stands for Non-Stratified. Rows correspond to sample sizes.

Size | CVM | NS-CVM | CVM-CV | NS-CVM-CV | TT | NS-TT | NCV | NS-NCV
20   | 0.1289** | 0.1681** | 0.2043   | 0.2113 | 0.2140   | 0.1485** | 0.2073 | 0.2156
40   | 0.1063** | 0.0991** | 0.1571*  | 0.1871 | 0.1872   | 0.1026** | 0.1946 | 0.1908
60   | 0.0845** | 0.0881** | 0.1435   | 0.1729 | 0.1439   | 0.0787** | 0.1637 | 0.1855
80   | 0.0711** | 0.0769** | 0.1057** | 0.1493 | 0.1240*  | 0.0742** | 0.1544 | 0.1560
100  | 0.0757** | 0.0806** | 0.1136*  | 0.1352 | 0.1342   | 0.0659** | 0.1474 | 0.1498
500  | 0.0758** | 0.0745** | 0.0834   | 0.0867 | 0.1182** | 0.0436** | 0.0967 | 0.0938
1500 | 0.0436   | 0.0441   | 0.0448   | 0.0444 | 0.0542*  | 0.0298** | 0.0458 | 0.0460

Table 5. Average AUC on the hold-out sets (Strategy NFS). All methods use CVM for model selection and thus have the same performances on the hold-out sets. NS stands for Non-Stratified. Rows correspond to sample sizes.

Size | CVM | NS-CVM
20   | 0.6903 | 0.6720
40   | 0.7385 | 0.7377
60   | 0.7753 | 0.7783
80   | 0.7934 | 0.7898
100  | 0.8015 | 0.8004
500  | 0.8555 | 0.8604
1500 | 0.9163 | 0.9163

However, the TT method seems to overestimate in very small sample sizes when a large number of hyper-parameter configurations are tested. Moreover, caution should still be exercised and further research is required to better investigate the properties of the TT protocol. One particularly worrisome situation is the use of TT in a leave-one-out fashion, which we advise against. In this extreme case, each fold contains a single test case. If the overall-best classifier predicts it wrong, the loss is 1. If any other classifier tried predicts it correctly, its loss is 0. When numerous classifiers are tried, at least one of them will predict the test case correctly with high probability, so the per-fold minimum loss $L_k(a_k)$ is 0 for every fold. In this case, the estimation of the bias of the TT method, $\widehat{Bias} = \frac{1}{K} \sum_k (L_k(a^*) - L_k(a_k)) = L_{CV}(a^*)$, will be equal to the loss of the best classifier. Thus, the TT will estimate the loss of the best classifier found as $L_{CV}(a^*) + \widehat{Bias} = 2 L_{CV}(a^*)$, i.e., twice as much as found during leave-one-out CV. To recapitulate: in leave-one-out CV when the number of classifiers tried is high, TT estimates the loss of the best classifier found as twice its cross-validated loss, which is overly conservative.

A previous version of this work has appeared elsewhere [33], presenting a more restricted methodological discussion and empirical comparison. Moreover, the analysis protocol of the present version markedly differs from the protocol of the previous one. In [33] the class ratio of the sub-sampled datasets was forced to one, i.e., the sub-sampling procedure was selecting an equal number of instances from the two classes, independently of the original class distribution. In the present experimentations the original class distribution is maintained in the sub-datasets, and the number of folds is dynamically changed in order to ensure at least one instance from the rarest class in each fold. This important change gives rise to a number of differences in the results of the two works. We do underline, though, that the findings and conclusions of the previous study are still valid in the context of the design of its experimentations.

We also note the concerning issue that the variance of the estimation for small sample sizes is large, again in concordance with the experiments in [2]. The authors of the latter advocate methods that may be biased but exhibit reduced variance. However, we believe that CVM is too biased no matter its variance; implicitly, the authors in [2] agree when they declare that model selection should be integrated in the performance estimation procedure in such a way that test samples are never employed for selecting the best model. Instead, they suggest as alternatives limiting the extent of the search over the hyper-parameters or performing model averaging. In our opinion, neither option is satisfactory for all analysis purposes and more research is required. One approach that we suggest is to repeat the whole analysis several times using a different random partitioning into folds each time, and average the loss estimations. Repeating the analysis for different fold partitionings can be performed for the inner CV loop (if one employs NCV), for the outer CV loop, or for both. Averaging over several repeats reduces the component of the variance that is due to the specific partition into folds, which can be relatively high for small sample sizes.
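As an illustration of this suggestion, a minimal sketch (ours) that reuses the cvm function sketched in Section 3 and averages the loss estimate over several random fold partitionings:

import numpy as np

def repeated_cvm_loss(X, y, learners, K=10, repeats=10):
    """Average the CVM loss estimate over several random partitionings into folds,
    reducing the variance component due to any single fold assignment."""
    losses = []
    for r in range(repeats):
        _, loss, _ = cvm(X, y, learners, K=K, seed=r)  # new partition each repeat
        losses.append(loss)
    return float(np.mean(losses))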

Another source of variance in performance estimation is the stochastic nature of certain classification algorithms. Numerous classification methods are non-deterministic and will return a different model even when trained on the same dataset. Typical examples include Artificial Neural Networks (ANNs), where the final model typically depends on the random initialization of the weights. The effect of initialization may be quite pronounced and result in very different models being returned. The exact same theory and algorithms presented above apply to such classifiers; however, one would expect an even larger variance of the estimated performances, because an additional variance component is added due to the stochastic nature of the classifier training. In this case, we would suggest training the same classifier multiple times and averaging the results to produce an estimate of performance.

Particularly for ANNs, we note further possible complications when using the above protocols. Let us assume that the number of epochs of weight updates is used as a hyper-parameter in the above protocols. In this case, the value of the number of epochs that optimizes the CV loss is selected, and then used to train an ANN on the full dataset. However, training the ANN on a larger sample size may require many more epochs to achieve a good fit on the dataset. Using the same number of epochs as in a smaller dataset may underfit and result in high loss. This violates the assumption made in Section 2 that "the learning method improves on average with larger sample sizes". The number-of-epochs hyper-parameter is highly dependent on the sample size and thus possibly violates this assumption. To satisfy the assumption, it should be the case that training the ANN with a fixed number of epochs results in a better model (smaller loss) on average with increasing sample size. Typically, such hyper-parameters can be substituted with other alternatives (e.g., a criterion that dynamically determines the number of epochs) so that performance is monotonic (on average) with sample size for any fixed values of the hyper-parameters. Thus, before using the above protocols, an analyst is advised to consider whether the monotonicity assumption holds for all hyper-parameters.

Finally, we would like to comment on the use of Bayesian non-parametric techniques, such as Gaussian Processes [34]. Such methods consider and reason with all models of a given family, averaging out all model parameters to provide a prediction. However, they still have hyper-parameters. In this case, they are defined as the free parameters of the whole learning process over which there is no marginalization (averaging out). Examples of hyper-parameters include the type of the kernel covariance function in Gaussian Processes and the parameters of the kernel function [35]. In fact, since one can compositionally combine kernels via sum and product operations, dynamically composing the appropriate kernel adds a new level of complexity to hyper-parameter search [36]. Thus, in general, such methods still require hyper-parameters to tune and, in our opinion, do not completely obviate the need to select them. The protocols presented here could be employed to select these hyper-parameter values, the type of kernel, the type of priors, etc. From a different perspective, however, the value of hyper-parameters in some settings (e.g., the number of hidden units in a neural-network architecture) could be selected using Bayesian non-parametric machinery. Thus, non-parametric methods could also, in some cases, substitute for the protocols in this paper.


10. Conclusions

In the absence of hyper-parameter optimization (model selection), simple Cross-Validation underestimates the performance of the model returned when training on the full dataset. In the presence of learning method and hyper-parameter optimization, simple Cross-Validation overestimates performance. Alternatives are to rerun Cross-Validation one more time but only for the final selected model (CVM-CV), the method proposed by Tibshirani and Tibshirani [4] to estimate and remove the bias, and the Nested Cross-Validation, which cross-validates the whole model selection procedure (which includes an inner cross-validation). These alternatives seem to reduce the bias, with Nested Cross-Validation being conservative in general and robust to the dataset, although incurring a higher computational overhead; the TT method seems promising and does not require additional training of models. We would also like to acknowledge the limited scope of our experiments in terms of the number and type of datasets. Including other preprocessing steps into the analysis, including other procedures for hyper-parameter optimization that dynamically decide which value combinations to consider, using other performance metrics, and experimenting with regression methods form our future work on the subject, in order to obtain more general answers to these research questions.

Acknowledgements

The work was funded by the STATegra EU FP7 project, No. 306000, and by the EPILOGEAS GSRT ARISTEIA II project, No. 3446.

References

[1] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “In-Sample and Out-of-Sample Model Selection and Error Estimation for Support Vector Machines,” IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 9, pp. 1390–1406, Sep. 2012.

[2] G. C. Cawley and N. L. C. Talbot, “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation,” J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Mar. 2010.

[3] D. D. Jensen and P. R. Cohen, “Multiple comparisons in induction algorithms,” Mach. Learn., vol. 38, pp. 309–338, 2000.

[4] R. J. Tibshirani and R. Tibshirani, “A bias correction for the minimum error rate in cross-validation,” Ann. Appl. Stat., vol. 3, no. 2, pp. 822–829, Jun. 2009.

[5] R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection,” in International Joint Conference on Artificial Intelligence, 1995, vol. 14, pp. 1137–1143.


[6] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis,” Bioinformatics, vol. 21, no. 5, pp. 631–643, Mar. 2005.

[7] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley, 2000.

[8] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.

[9] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.

[10] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. Springer, 2009.

[11] Y. Bengio and Y. Grandvalet, “Bias in Estimating the Variance of K-Fold Cross-Validation,” in Statistical Modeling and Analysis for Complex Data Problems, 2005, pp. 75–95.

[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, 2005.

[13] V. Lagani and I. Tsamardinos, “Structure-based variable selection for survival data,” Bioinformatics, vol. 26, no. 15, pp. 1887–1894, 2010.

[14] A. Statnikov, I. Tsamardinos, Y. Dosbayev, and C. F. Aliferis, “GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data,” Int. J. Med. Inform., vol. 74, no. 7–8, pp. 491–503, Aug. 2005.

[15] S. Salzberg, “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach,” Data Min. Knowl. Discov., vol. 1, no. 3, pp. 317–328, 1997.

[16] N. Iizuka, M. Oka, H. Yamada-Okabe, M. Nishida, Y. Maeda, N. Mori, T. Takao, T. Tamesa, A. Tangoku, H. Tabuchi, K. Hamada, H. Nakayama, H. Ishitsuka, T. Miyamoto, A. Hirabayashi, S. Uchimura, and Y. Hamamoto, “Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection,” Lancet, vol. 361, no. 9361, pp. 923–929, Mar. 2003.

[17] L. A. Kurgan, K. J. Cios, R. Tadeusiewicz, M. Ogiela, and L. S. Goodenday, “Knowledge discovery approach to automated cardiac SPECT diagnosis,” Artif. Intell. Med., vol. 23, no. 2, pp. 149–169, Oct. 2001.

[18] R. K. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M. Jiřina, J. Klaschka, E. Kotrč, P. Savický, S. Towers, A. Vaiciulis, and W. Wittek, “Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope,” Nucl. Instruments Methods Phys. Res. Sect. A Accel. Spectrometers, Detect. Assoc. Equip., vol. 516, no. 2–3, pp. 511–528, Jan. 2004.


[19] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, and V. Consonni, “Quantitative structure-activity relationship models for ready biodegradability of chemicals,” J. Chem. Inf. Model., vol. 53, pp. 867–878, 2013.

[20] S. Moro and R. M. S. Laureano, “Using Data Mining for Bank Direct Marketing: An application of the CRISP-DM methodology,” Eur. Simul. Model. Conf., pp. 117–121, 2011.

[21] S. C. Bendall, E. F. Simonds, P. Qiu, E. D. Amir, P. O. Krutzik, R. Finck, R. V. Bruggner, R. Melamed, A. Trejo, O. I. Ornatsky, R. S. Balderas, S. K. Plevritis, K. Sachs, D. Pe’er, S. D. Tanner, and G. P. Nolan, “Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum,” Science, vol. 332, no. 6030, pp. 687–696, May 2011.

[22] M. Sikora and L. Wrobel, “Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines,” Arch. Min. Sci., vol. 55, no. 1, pp. 91–114, 2010.

[23] A. A. Aguilar-Arevalo, A. O. Bazarko, S. J. Brice, B. C. Brown, L. Bugel, J. Cao, L. Coney, J. M. Conrad, D. C. Cox, A. Curioni, Z. Djurcic, D. A. Finley, B. T. Fleming, R. Ford, F. G. Garcia, G. T. Garvey, C. Green, J. A. Green, T. L. Hart, E. Hawker, R. Imlay, R. A. Johnson, P. Kasper, T. Katori, T. Kobilarcik, I. Kourbanis, S. Koutsoliotas, E. M. Laird, J. M. Link, Y. Liu, Y. Liu, W. C. Louis, K. B. M. Mahn, W. Marsh, P. S. Martin, G. McGregor, W. Metcalf, P. D. Meyers, F. Mills, G. B. Mills, J. Monroe, C. D. Moore, R. H. Nelson, P. Nienaber, S. Ouedraogo, R. B. Patterson, D. Perevalov, C. C. Polly, E. Prebys, J. L. Raaf, H. Ray, B. P. Roe, A. D. Russell, V. Sandberg, R. Schirato, D. Schmitz, M. H. Shaevitz, F. C. Shoemaker, D. Smith, M. Sorel, P. Spentzouris, I. Stancu, R. J. Stefanski, M. Sung, H. A. Tanaka, R. Tayloe, M. Tzanov, R. Van de Water, M. O. Wascko, D. H. White, M. J. Wilking, H. J. Yang, G. P. Zeller, and E. D. Zimmerman, “Search for electron neutrino appearance at the Δm² ∼ 1 eV² scale,” Phys. Rev. Lett., vol. 98, p. 231801, 2007.

[24] D. Coppersmith, S. J. Hong, and J. R. M. Hosking, “Partitioning Nominal Attributes in Decision Trees,” Data Min. Knowl. Discov., vol. 3, pp. 197–217, 1999.

[25] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 27, pp. 1–27, 2011.

[26] I. Tsamardinos, V. Lagani, and D. Pappas, “Discovering multiple, equivalent biomarker signatures,” in 7th Conference of the Hellenic Society for Computational Biology and Bioinformatics (HSCBB12), 2012.

[27] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing Bayesian network structure learning algorithm,” Mach. Learn., vol. 65, no. 1, pp. 31–78, 2006.

[28] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, pp. 861–874, 2006.


[29] A. Airola, T. Pahikkala, W. Waegeman, B. De Baets, and T. Salakoski, “A comparison of AUC estimators in small-sample studies,” J. Mach. Learn. Res. W&CP, vol. 8, pp. 3–13, 2010.

[30] R. G. O’Brien, “A General ANOVA Method for Robust Tests of Additive Models for Variances,” J. Am. Stat. Assoc., vol. 74, no. 368, pp. 877–880, Dec. 1979.

[31] M. H. Quenouille, “Approximate tests of correlation in time-series 3,” Math. Proc. Cambridge Philos. Soc., vol. 45, no. 3, pp. 483–484, Oct. 1949.

[32] S. Varma and R. Simon, “Bias in error estimation when using cross-validation for model selection,” BMC Bioinformatics, vol. 7, p. 91, Jan. 2006.

[33] I. Tsamardinos, V. Lagani, and A. Rakhshani, “Performance-Estimation Properties of Cross-Validation-Based Protocols with Simultaneous Hyper-Parameter Optimization,” in Proceedings of the 8th Hellenic Conference on Artificial Intelligence (SETN’14), 2014.

[34] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.

[35] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” Ann. Stat., vol. 36, no. 3, pp. 1171–1220, 2008.

[36] D. Duvenaud, J. Lloyd, R. Grosse, J. Tenenbaum, and Z. Ghahramani, “Structure discovery in nonparametric regression through compositional kernel search,” in Proceedings of the International Conference on Machine Learning (ICML), 2013, vol. 30, pp. 1166–1174.


Appendix A.

Figure 4. Average bias and bias standard deviation for the accuracy metric in Strategy NFS. From left to right: stratified CVM, CVM-CV, TT, and NCV. The top row contains the average bias, the second row the bias standard deviation. The results largely vary depending on the specific dataset. In general, CVM is clearly optimistic for sample sizes less than or equal to 100, while NCV tends to underestimate performance. CVM-CV and TT show a behavior that is in between these two extremes. CVM has the lowest variance, at least for small sample sizes.


Figure 5. Effect of stratification using accuracy in Strategy NFS. The first column reports the average bias for the stratified version of each method, while the second column reports it for the non-stratified versions. The effect of stratification depends on the specific method and dataset.


Figure 6. Comparing Strategy NFS (No Feature Selection) and WFS (With Feature Selection) over accuracy. The first row reports the average bias of each method for Strategy NFS and WFS, respectively, while the second row provides the bias standard deviation. CVM and TT show an evident increase in bias in Strategy WFS, presumably due to the larger hyper-parameter space explored in this setting.


Table 6. Average Accuracy Bias over Datasets (Strategy NFS). P-values are produced by a t-test with the null hypothesis that the mean bias is zero (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT         NS-TT      NCV       NS-NCV
20           0.1013**  0.0948**  0.0590**  0.0458**   0.0486**   0.0315**   -0.0127   -0.0311**
40           0.0685**  0.0682**  0.0443**  0.0357**   0.0168**   0.0096*    -0.0096   -0.0236**
60           0.0669**  0.0617**  0.0477**  0.0434**   0.0204**   0.0113**   0.0144**  0.0171**
80           0.0567**  0.0549**  0.0397**  0.0389**   0.0139**   0.0074*    0.0166**  0.0090**
100          0.0466**  0.0422**  0.0337**  0.0263**   -0.0050    -0.0084**  0.0093    0.0039
500          0.0076**  0.0075**  0.0036**  0.0028**   -0.0140**  -0.0141**  -0.0010   -0.0020
1500         0.0023**  0.0023**  0.0004    0.0010     -0.0091**  -0.0092**  -0.0015   -0.0020*

Table 7. Standard deviation of Accuracy estimations over Datasets (Strategy NFS). P-values are produced by a test with the null hypothesis that the variance equals the corresponding variance of the NCV protocol (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT        NS-TT     NCV     NS-NCV
20           0.0766**  0.0862**  0.1086**  0.1067**   0.1242    0.1509    0.1390  0.1374
40           0.0591**  0.0600**  0.0747*   0.0826     0.0962    0.1033    0.0914  0.1118**
60           0.0435**  0.0463**  0.0595    0.0537*    0.0701    0.0786*   0.0650  0.0639
80           0.0431**  0.0454**  0.0522*   0.0520*    0.0675    0.0732    0.0629  0.0709
100          0.0444**  0.0422**  0.0510    0.0508     0.0682**  0.0663*   0.0564  0.0575
500          0.0359    0.0357    0.0356    0.0366     0.0432*   0.0436**  0.0368  0.0370
1500         0.0278    0.0282    0.0281    0.0275     0.0290    0.0299    0.0277  0.0278

Table 8. Average Accuracy on the hold-out sets (Strategy NFS). All methods use CVM for model selection and thus have the same performance on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM     NS-CVM
20           0.7699  0.7639
40           0.8061  0.8016
60           0.8186  0.8203
80           0.8296  0.8280
100          0.8351  0.8377
500          0.8816  0.8814
1500         0.8805  0.8805


Table 9. Average AUC Bias over Datasets (Strategy WFS). P-values are produced by a t-test with the null hypothesis that the mean bias is zero (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT         NS-TT      NCV        NS-NCV
20           0.2589**  0.2821**  0.0651**  0.0128     0.2139**   0.1487**   -0.0634**  -0.1124**
40           0.1628**  0.1729**  0.0698**  0.0418**   0.0886**   0.0691**   -0.0359*   -0.0418**
60           0.1257**  0.1349**  0.0540**  0.0362**   0.0510**   0.0563*    -0.0081    -0.0084
80           0.1034**  0.1094**  0.0475**  0.0484**   0.0321**   0.0381     -0.0222    -0.0170
100          0.0985**  0.1029**  0.0595**  0.0525**   0.0284**   0.0294     0.0174**   0.0017*
500          0.0239**  0.0259**  0.0120**  0.0143**   -0.0242**  -0.0014    0.0079     0.0018
1500         0.0007    0.0001    -0.0007   -0.0005    -0.0122**  -0.0471**  -0.0031    -0.0035**

Table 10. Standard deviation of AUC estimations over Datasets (Strategy WFS). P-values are produced by a test with the null hypothesis that the variance equals the corresponding variance of the NCV protocol (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT        NS-TT     NCV     NS-NCV
20           0.0685**  0.0596**  0.2202    0.2390*    0.1289**  0.1303**  0.2078  0.2456*
40           0.0711**  0.0769**  0.1584**  0.1911     0.1345**  0.0974**  0.2020  0.1896
60           0.0692**  0.0669**  0.1344    0.1590     0.1281*   0.0859**  0.1636  0.1669
80           0.0607**  0.0624**  0.1133**  0.1214**   0.1095**  0.0773**  0.1686  0.1581
100          0.0622**  0.0643**  0.1032*   0.1283     0.1168    0.0717**  0.1386  0.1689
500          0.0601**  0.0599**  0.0824    0.0740     0.1003**  0.0468**  0.0754  0.0882
1500         0.0408    0.0410    0.0424    0.0416     0.0519*   0.0312**  0.0443  0.0444

Table 11. Average AUC performance on the hold-out sets (Strategy WFS). All methods use CVM for model selection and thus have the same performance on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM     NS-CVM
20           0.6887  0.6871
40           0.7496  0.7471
60           0.7794  0.7720
80           0.7983  0.7944
100          0.8090  0.8030
500          0.8643  0.8630
1500         0.9160  0.9165


Table 12. Average accuracy Bias over Datasets (Strategy WFS). P-values are produced by a t-test with the null hypothesis that the mean bias is zero (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT         NS-TT      NCV       NS-NCV
20           0.1439**  0.1507**  0.0555**  0.0214**   0.0763**   0.0757**   -0.0231   -0.0612**
40           0.0855**  0.0854**  0.0505**  0.0395**   0.0236**   0.0185**   -0.0060   -0.0325**
60           0.0690**  0.0671**  0.0411**  0.0393**   0.0121*    0.0075     0.0035    0.0039
80           0.0601**  0.0583**  0.0368**  0.0395**   0.0044     0.0026     0.0095*   -0.0013
100          0.0598**  0.0584**  0.0418**  0.0378**   0.0028     0.0005     0.0144**  0.0132**
500          0.0117**  0.0103**  0.0055**  0.0045**   -0.0173**  -0.0192**  -0.0031   -0.0035*
1500         0.0038**  0.0039**  0.0021**  0.0013     -0.0109**  -0.0108**  -0.0020*  -0.0019*

Table 13. Standard deviation of accuracy estimations over Datasets (Strategy WFS). P-values are produced by a test with the null hypothesis that the variance equals the corresponding variance of the NCV protocol (P<0.05*, P<0.01**). NS stands for Non-Stratified.

Sample size  CVM       NS-CVM    CVM-CV    NS-CVM-CV  TT       NS-TT     NCV     NS-NCV
20           0.0593**  0.0703**  0.1384    0.1407     0.1143   0.1313    0.1286  0.1482
40           0.0488**  0.0528**  0.0706    0.0734     0.0885   0.0948    0.0812  0.1028**
60           0.0453**  0.0472**  0.0624*   0.0612**   0.0817   0.0858    0.0759  0.0711
80           0.0446**  0.0440**  0.0566    0.0500**   0.0761*  0.0766**  0.0650  0.0776*
100          0.0408**  0.0420**  0.0468**  0.0484*    0.0663*  0.0703**  0.0567  0.0566
500          0.0378    0.0371    0.0387    0.0388     0.0477*  0.0469*   0.0410  0.0405
1500         0.0299    0.0296    0.0300    0.0301     0.0318   0.0313    0.0300  0.0293

Table 14. Average accuracy performance on the hold-out sets (Strategy WFS). All methods use CVM for model selection and thus have the same performance on the hold-out sets. NS stands for Non-Stratified.

Sample size  CVM     NS-CVM
20           0.7571  0.7542
40           0.8051  0.8001
60           0.8206  0.8204
80           0.8308  0.8303
100          0.8333  0.8320
500          0.8791  0.8806
1500         0.8802  0.8804