
Selecting Classification Algorithms with Active Testing

Rui Leite1, Pavel Brazdil1, and Joaquin Vanschoren2

1LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Portugal,

rleite@fep.up.pt, pbrazdil@inescporto.pt

2LIACS - Leiden Institute of Advanced Computer Science, University of Leiden,

The Netherlands, joaquin@liacs.nl

Abstract. Given the large amount of data mining algorithms, their

combinations (e.g. ensembles) and possible parameter settings, ﬁnding

the most adequate method to analyze a new dataset becomes an ever

more challenging task. This is because in many cases testing all possi-

bly useful alternatives quickly becomes prohibitively expensive. In this

paper we propose a novel technique, called active testing, that intel-

ligently selects the most useful cross-validation tests. It proceeds in a

tournament-style fashion, in each round selecting and testing the algo-

rithm that is most likely to outperform the best algorithm of the previous

round on the new dataset. This ‘most promising’ competitor is chosen

based on a history of prior duels between both algorithms on similar

datasets. Each new cross-validation test will contribute information to

a better estimate of dataset similarity, and thus better predict which

algorithms are most promising on the new dataset. We have evaluated

this approach using a set of 292 algorithm-parameter combinations on

76 UCI datasets for classiﬁcation. The results show that active testing

will quickly yield an algorithm whose performance is very close to the

optimum, after relatively few tests. It also provides a better solution than

previously proposed methods.

1 Background and Motivation

In many data mining applications, an important problem is selecting the best

algorithm for a speciﬁc problem. Especially in classiﬁcation, there are hundreds

of algorithms to choose from. Moreover, these algorithms can be combined into

composite learning systems (e.g. ensembles) and often have many parameters

that greatly inﬂuence their performance. This yields a whole spectrum of meth-

ods and their variations, so that testing all possible candidates on the given

problem, e.g., using cross-validation, quickly becomes prohibitively expensive.

The issue of selecting the right algorithm has been the subject of many

studies over the past 20 years [17, 3, 23,20, 19]. Most approaches rely on the

concept of metalearning. This approach exploits characterizations of datasets

and past performance results of algorithms to recommend the best algorithm on

the current dataset. The term metalearning stems from the fact that we try to

learn the function that maps dataset characterizations (meta-data) to algorithm

performance estimates (the target variable).

The earliest techniques considered only the dataset itself and calculated an

array of various simple, statistical or information-theoretic properties of the

data (e.g., dataset size, class skewness and signal-noise ratio) [17, 3]. Another

approach, called landmarking [2, 12], ran simple and fast versions of algorithms

(e.g. decision stumps instead of decision trees) on the new dataset and used their

performance results to characterize the new dataset. Alternatively, in sampling

landmarks [21, 8,14], the complete (non-simpliﬁed) algorithms are run on small

samples of the data. A series of sampling landmarks on increasingly large samples

represents a partial learning curve which characterizes datasets and which can be

used to predict the performance of algorithms signiﬁcantly more accurately than

with classical dataset characteristics [13, 14]. Finally, an ‘active testing strategy’

for sampling landmarks [14] was proposed that actively selects the most infor-

mative sample sizes while building these partial learning curves, thus reducing

the time needed to compute them.

Motivation. All these approaches have focused on dozens of algorithms at most

and usually considered only default parameter settings. Dealing with hundreds,

perhaps thousands, of algorithm-parameter combinations³ provides a new challenge that requires a new approach. First, distinguishing between hundreds of

subtly diﬀerent algorithms is signiﬁcantly harder than distinguishing between a

handful of very diﬀerent ones. We would need many more data characterizations

that relate the eﬀects of certain parameters on performance. On the other hand,

the latter method [14] has a scalability issue: it requires that pairwise compar-

isons be conducted between algorithms. This would be rather impractical when

faced with hundreds of algorithm-parameter combinations.

To address these issues, we propose a quite diﬀerent way to characterize

datasets, namely through the eﬀect that the dataset has on the relative perfor-

mance of algorithms run on it. As in landmarking, we use the fact that each

algorithm has its own learning bias, making certain assumptions about the data

distribution. If the learning bias ‘matches’ the underlying data distribution of

a particular dataset, it is likely to perform well (e.g., achieve high predictive

accuracy). If it does not, it will likely under- or overﬁt the data, resulting in a

lower performance.

As such, we characterize a dataset based on the pairwise performance diﬀer-

ences between algorithms run on it: if the same algorithms win, tie or lose

against each other on two datasets, then the data distributions of these datasets

are likely to be similar as well, at least in terms of their eﬀect on learning per-

formance. It is clear that the more algorithms are used, the more accurate the

characterization will be. While we cannot run all algorithms on each new dataset

because of the computational cost, we can run a fair number of CV tests to get

a reasonably good idea of which prior datasets are most similar to the new one.

Moreover, we can use these same performance results to establish which (yet

untested) algorithms are likely to perform well on the new dataset, i.e., those

algorithms that outperformed or rivaled the currently best algorithm on similar

datasets in the past. As such, we can intelligently select the most promising

³ In the remainder of this text, when we speak of algorithms, we mean fully-defined

algorithm instances with ﬁxed components (e.g., base-learners, kernel functions) and

parameter settings.

algorithms for the new dataset, run them, and then use their performance results

to gain increasingly better estimates of the most similar datasets and the most

promising algorithms.

Key concepts. There are two key concepts used in this work. The ﬁrst one is

that of the current best candidate algorithm which may be challenged by other

algorithms in the process of ﬁnding an even better candidate.

The second is the pairwise performance difference between two algorithms run on the same dataset, which we call a relative landmark. A collection of such rela-

tive landmarks represents a history of previous ‘duels’ between two algorithms

on prior datasets. The term itself originates from the study of landmarking al-

gorithms: since absolute values for the performance of landmarkers vary a lot

depending on the dataset, several types of relative landmarks have been pro-

posed, which basically capture the relative performance diﬀerence between two

algorithms [12]. In this paper, we extend the notion of relative landmarks to all

(including non-simpliﬁed) classiﬁcation algorithms.

The history of previous algorithm duels is used to select the most promis-

ing challenger for the current best candidate algorithm, namely the method

that most convincingly outperformed or rivaled the current champion on prior

datasets similar to the new dataset.

Approach. Given the current best algorithm and a history of relative landmarks

(duels), we can start a tournament game in which, in each round, the current

best algorithm is compared to the next, most promising contender. We select

the most promising challenger as discussed above, and run a CV test with this

algorithm. The winner becomes the new current best candidate, the loser is

removed from consideration. We will discuss the exact procedure in Section 3.

We call this approach active testing (AT)⁴, as it actively selects the most

interesting CV tests instead of passively performing them one by one: in each

iteration the best competitor is identiﬁed, which determines a new CV test to

be carried out. Moreover, the same result will be used to further characterize

the new dataset and more accurately estimate the similarity between the new

dataset and all prior datasets.

Evaluation. By intelligently selecting the most promising algorithms the test

on the new dataset, we can more quickly discover an algorithm that performs

very well. Note that running a selection of algorithms is typically done anyway

to ﬁnd a suitable algorithm. Here, we optimize and automate this process using

historical performance results of the candidate algorithms on prior datasets.

While we cannot possibly guarantee to return the absolute best algorithm

without performing all possible CV tests, we can return an algorithm whose

performance is either identical or very close to the truly best one. The diﬀerence

between the two can be expressed in terms of a loss. Our aim is thus to minimize

⁴ Note that while the term ‘active testing’ is also used in the context of actively

selected sampling landmarks [14], there is little or no relationship to the approach

described here.

this loss using a minimal number of tests, and we will evaluate our technique as

such.

In all, the research hypothesis that we intend to prove in this paper is: Relative

landmarks provide useful information on the similarity of datasets and can be

used to eﬃciently predict the most promising algorithms to test on new datasets.

We will test this hypothesis by running our active testing approach in a leave-

one-out fashion on a large set of CV evaluations testing 292 algorithms on 76

datasets. The results show that our AT approach is indeed eﬀective in ﬁnding

very accurate algorithms in a very limited number of tests.

Roadmap. The remainder of this paper is organized as follows. First, we formu-

late the concepts of relative landmarks in Section 2 and active testing in Section

3. Next, Section 4 presents the empirical evaluation and Section 5 presents an

overview of some work in other related areas. The ﬁnal section presents conclu-

sions and future work.

2 Relative Landmarks

In this section we formalize our definition of relative landmarks, and explain how they are used to identify the most promising competitor for the current best algorithm.

Given a set of classification algorithms and some new classification dataset $d_{new}$, the aim is to identify the potentially best algorithm for this task with respect to some given performance measure $M$ (e.g., accuracy, AUC or rank). Let us represent the performance of algorithm $a_i$ on dataset $d_{new}$ as $M(a_i, d_{new})$. As such, we need to identify an algorithm $a^*$ for which the performance measure is maximal, i.e. $\forall a_i: M(a^*, d_{new}) \geq M(a_i, d_{new})$. The decision concerning $\geq$ (i.e. whether $a^*$ is at least as good as $a_i$) may be established using either a statistical significance test or a simple comparison.

However, instead of searching exhaustively for $a^*$, we aim to find a near-optimal algorithm $\hat{a}^*$, for which the probability $P(M(\hat{a}^*, d_{new}) \geq M(a_i, d_{new}))$ of being optimal is high, ideally close to 1.

As in other work that exploits metalearning, we assume that $\hat{a}^*$ is likely better than $a_i$ on dataset $d_{new}$ if it was found to be better on a similar dataset $d_j$ (for which we have performance estimates):

$$P(M(\hat{a}^*, d_{new}) \geq M(a_i, d_{new})) \sim P(M(\hat{a}^*, d_j) \geq M(a_i, d_j)) \quad (1)$$

The latter estimate can be maximized by going through all algorithms and identifying the algorithm $\hat{a}^*$ that satisfies the $\geq$ constraint in a maximum number of cases. However, this requires that we know which datasets $d_j$ are most similar to $d_{new}$. Since our definition of similarity requires CV tests to be run on $d_{new}$, but we cannot run all possible CV tests, we use an iterative approach in which we repeat this scan for $\hat{a}^*$ in every round, using only the datasets $d_j$ that seem most similar at that point, as dataset similarities are recalculated after every CV test.

Initially, having no information, we deem all datasets to be similar to $d_{new}$, so that $\hat{a}^*$ will be the globally best algorithm over all prior datasets. We then call this algorithm the current best algorithm $a_{best}$ and run a CV test to calculate its performance on $d_{new}$. Based on this, the dataset similarities are recalculated (see Section 3), yielding a possibly different set of datasets $d_j$. The best algorithm on this new set becomes the best competitor $a_k$ (different from $a_{best}$), calculated by counting the number of times that $M(a_k, d_j) > M(a_{best}, d_j)$ over all datasets $d_j$.

We can further refine this method by taking into account how large the performance differences are: the larger a difference was in the past, the higher the chances of obtaining a large gain on the new dataset. This leads to the notion of relative landmarks $RL$, defined as:

$$RL(a_k, a_{best}, d_j) = i\big(M(a_k, d_j) > M(a_{best}, d_j)\big) \cdot \big(M(a_k, d_j) - M(a_{best}, d_j)\big) \quad (2)$$

The function $i(test)$ returns 1 if the test is true and 0 otherwise. As stated before, this can be a simple comparison or a statistical significance test that only returns 1 if $a_k$ performs significantly better than $a_{best}$ on $d_j$. The term $RL$ thus expresses how much better $a_k$ is, relative to $a_{best}$, on a dataset $d_j$. Experimental tests have shown that this approach yields much better results than simply counting the number of wins.
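To make Equation 2 concrete, the relative landmark of a pair of algorithms on one prior dataset can be computed as in the following sketch (Python; the `perf` lookup and the optional `significant` predicate are hypothetical names, not part of the original paper):

```python
def relative_landmark(perf, a_k, a_best, d_j, significant=None):
    """RL(a_k, a_best, d_j) as in Equation 2: the performance gain of a_k over
    a_best on prior dataset d_j, or 0 if a_k was not better there.

    perf: dict mapping (algorithm, dataset) -> stored performance M(a, d).
    significant: optional predicate implementing a statistical test; if given,
    it replaces the simple '>' comparison mentioned in the text.
    """
    m_k, m_best = perf[(a_k, d_j)], perf[(a_best, d_j)]
    better = significant(m_k, m_best) if significant else (m_k > m_best)
    return (m_k - m_best) if better else 0.0
```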

Up to now, we have assumed a dataset $d_j$ to be either similar to $d_{new}$ or not. A second refinement is to use a gradual (non-binary) measure of similarity $Sim(d_{new}, d_j)$ between datasets $d_{new}$ and $d_j$. As such, we can weigh the performance difference between $a_k$ and $a_{best}$ on $d_j$ by how similar $d_j$ is to $d_{new}$. Indeed, the more similar the datasets, the more informative the performance difference is. As such, we aim to optimize the following criterion:

$$a_k = \arg\max_{a_i} \sum_{d_j \in D} RL(a_i, a_{best}, d_j) \cdot Sim(d_{new}, d_j) \quad (3)$$

in which $D$ is the set of all prior datasets $d_j$.

To calculate the similarity $Sim()$, we use the outcome of each CV test on $d_{new}$ and compare it to the outcomes on $d_j$. In each iteration, with each CV test, we obtain a new evaluation $M(a_i, d_{new})$, which is used to recalculate all similarities $Sim(d_{new}, d_j)$. In fact, we will compare four variants of $Sim()$, which are discussed in the next section. With this, we can recalculate equation 3 to find the next best competitor $a_k$.
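Combining Equations 2 and 3, the next competitor is the untested algorithm whose similarity-weighted sum of relative landmarks against $a_{best}$ is largest. A minimal sketch, building on the `relative_landmark` helper above and assuming a `sim` function that implements one of the variants described in the next section:

```python
def best_competitor(perf, candidates, a_best, prior_datasets, sim):
    """Return a_k maximizing sum over d_j of RL(a_k, a_best, d_j) * Sim(d_new, d_j)."""
    def score(a_k):
        return sum(relative_landmark(perf, a_k, a_best, d_j) * sim(d_j)
                   for d_j in prior_datasets)
    # max() returns the first maximal element, so ordering `candidates` by the
    # global ranking CA breaks ties in favour of the higher-ranked algorithm.
    return max(candidates, key=score)
```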

3 Active Testing

In this section we describe the active testing (AT) approach, which proceeds

according to the following steps:

1. Construct a global ranking of a given set of algorithms using performance information from past experiments (metadata).

2. Initiate the iterative process by assigning the top-ranked algorithm as $a_{best}$ and obtaining the performance of this algorithm on $d_{new}$ using a CV test.

3. Find the most promising competitor $a_k$ for $a_{best}$ using relative landmarks and all previous CV tests on $d_{new}$.

4. Obtain the performance of $a_k$ on $d_{new}$ using a CV test and compare it with $a_{best}$. Use the winner as the current best algorithm, and eliminate the losing algorithm.

5. Repeat the whole process starting with step 3 until a stopping criterion has been reached. Finally, output the current $a_{best}$ as the overall winner. (A schematic sketch of this loop is given below.)
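The five steps can be summarized in the following loop (a sketch only; `cv_test` stands for running a cross-validation test of one algorithm on $d_{new}$, and `make_sim` for recomputing the dataset similarities from the CV tests done so far, both hypothetical helpers; `best_competitor` is the sketch from Section 2):

```python
def active_testing(perf, ranking_CA, prior_datasets, cv_test, make_sim,
                   max_tests=20):
    """Tournament-style active testing, following steps 1-5."""
    candidates = list(ranking_CA)              # step 1: global ranking CA
    a_best = candidates.pop(0)                 # step 2: top-ranked algorithm
    results = {a_best: cv_test(a_best)}        # CV test of a_best on d_new
    while candidates and len(results) < max_tests:   # step 5: stopping criterion
        sim = make_sim(results)                # similarities from CV tests so far
        a_k = best_competitor(perf, candidates, a_best,
                              prior_datasets, sim)    # step 3
        candidates.remove(a_k)
        results[a_k] = cv_test(a_k)            # step 4: run CV test and compare
        if results[a_k] > results[a_best]:
            a_best = a_k                       # the winner becomes the current best
    return a_best
```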

Step 1 - Establish a Global Ranking of Algorithms. Before having run any

CV tests, we have no information on the new dataset dnew to deﬁne which prior

datasets are similar to it. As such, we naively assume that all prior datasets are similar, and generate a global ranking of all algorithms using the performance results of all algorithms on all previous datasets, choosing the

top-ranked algorithm as our initial candidate abest. To illustrate this, we use a toy

example involving 6 classiﬁcation algorithms, with default parameter settings,

from Weka [10] evaluated on 40 UCI datasets [1], a portion of which is shown in

Table 1.

As said before, our approach is entirely independent of the exact evaluation measure used: the most appropriate measure can be chosen by the user depending on the specific data domain. In this example, we use success rate

(accuracy), but any other suitable measure of classiﬁer performance, e.g. AUC

(area under the ROC curve), precision, recall or F1 can be used as well.

Each accuracy ﬁgure shown in Table 1 represents the mean of 10 values

obtained in 10-fold cross-validation. The ranks of the algorithms on each dataset

are shown in parentheses next to the accuracy value. For instance, if we consider

dataset abalone, algorithm MLP is attributed rank 1 as its accuracy is highest

on this problem. The second rank is occupied by LogD, etc.

The last row in the table shows the mean rank of each algorithm, obtained by averaging over the ranks obtained on each dataset: $R_{a_i} = \frac{1}{n}\sum_{d_j=1}^{n} R_{a_i,d_j}$, where $R_{a_i,d_j}$ represents the rank of algorithm $a_i$ on dataset $d_j$ and $n$ the number of datasets.

This is a quite common procedure, often used in machine learning to assess how

a particular algorithm compares to others [5].

The mean ranks permit us to obtain a global ranking of candidate algorithms, CA. In our case, $CA = \langle MLP, J48, JRip, LogD, IB1, NB \rangle$. It must be noted that such a ranking is not very informative in itself; for instance, statistical significance tests would be needed to obtain a trustworthy ranking. Here, we only use the global ranking CA as a starting point for the iterative procedure, as explained next.
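For illustration, the mean-rank computation behind Table 1 could be implemented as follows (a sketch; `scores` is assumed to be a nested dict dataset -> algorithm -> accuracy, and ties in rank are ignored for brevity):

```python
from statistics import mean

def global_ranking(scores):
    """Rank algorithms by their mean rank over all prior datasets (best first)."""
    algorithms = list(next(iter(scores.values())).keys())
    ranks = {a: [] for a in algorithms}
    for accs in scores.values():
        # rank 1 = highest accuracy on this dataset
        ordered = sorted(algorithms, key=lambda a: accs[a], reverse=True)
        for r, a in enumerate(ordered, start=1):
            ranks[a].append(r)
    return sorted(algorithms, key=lambda a: mean(ranks[a]))
```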

Step 2 - Identify the Current Best Algorithm. The global ranking CA permits us to identify the top-ranked algorithm as our initial best candidate algorithm $a_{best}$. In Table 1, $a_{best} = MLP$. This algorithm is then evaluated using a CV test to establish its performance on the new dataset $d_{new}$.

Table 1. Accuracies and ranks (in parentheses) of the algorithms 1-nearest neighbor (IB1), C4.5 (J48), RIPPER (JRip), LogisticDiscriminant (LogD), MultiLayerPerceptron (MLP) and naive Bayes (NB) on different datasets, and their mean rank.

Datasets IB1 J48 JRip LogD MLP NB

abalone .197 (5) .218 (4) .185 (6) .259 (2) .266 (1) .237 (3)

acetylation .844 (1) .831 (2) .829 (3) .745 (5) .609 (6) .822 (4)

adult .794 (6) .861 (1) .843 (3) .850 (2) .830 (5) .834 (4)

... ... ... ... ... ... ...

Mean rank 4.05 2.73 3.17 3.74 2.54 4.78

Step 3 - Identify the Most Promising Competitor. In the next step we identify $a_k$, the best competitor of $a_{best}$. To do this, all algorithms are considered one by one, except for $a_{best}$ and the eliminated algorithms (see step 4).

For each algorithm we analyze the information of past experiments (metadata) to calculate the relative landmarks, as outlined in the previous section. As equation 3 shows, for each $a_k$, we sum up all relative landmarks involving $a_{best}$, weighted by a measure of similarity between dataset $d_j$ and the new dataset $d_{new}$. The algorithm $a_k$ that achieves the highest value is the most promising competitor in this iteration. In case of a tie, the competitor that appears first in ranking CA is chosen.

To calculate $Sim(d_{new}, d_j)$, the similarity between $d_j$ and $d_{new}$, we have explored four different variants, AT0, AT1, ATWs and ATx, described below.

AT0 is a baseline method which ignores dataset similarity. It always returns

a similarity value of 1 and so all datasets are considered similar. This means

that the best competitor is determined by summing up the values of the relative

landmarks.

The AT1 method works as AT0 at the beginning, when no tests have been carried out on $d_{new}$. In all subsequent iterations, this method estimates dataset similarity using only the most recent CV test. Consider the algorithms listed in Table 1 and the ranking CA. Suppose we started with algorithm MLP as the current best candidate. Suppose also that in the next iteration LogD was identified as the best competitor, and won against MLP in the CV test: $M(LogD, d_{new}) > M(MLP, d_{new})$. Then, in the subsequent iteration, all prior datasets $d_j$ satisfying the condition $M(LogD, d_j) > M(MLP, d_j)$ are considered similar to $d_{new}$. In general terms, suppose that the last test revealed that $M(a_k, d_{new}) > M(a_{best}, d_{new})$; then $Sim(d_{new}, d_j)$ is 1 if also $M(a_k, d_j) > M(a_{best}, d_j)$, and 0 otherwise. The similarity measure determines which RLs are taken into account when summing up their contributions to identify the next best competitor.
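Under these assumptions, the binary AT1 similarity can be written as follows (a sketch; `last_winner` and `last_loser` denote the two algorithms of the most recent CV duel on $d_{new}$, winner first, and `perf` is the same hypothetical metadata lookup as before):

```python
def sim_AT1(perf, last_winner, last_loser, d_j):
    """AT1: Sim(d_new, d_j) = 1 iff the most recent duel on d_new has the
    same outcome on d_j, i.e. the winner also beat the loser there."""
    return 1.0 if perf[(last_winner, d_j)] > perf[(last_loser, d_j)] else 0.0
```

In the loop sketched in Section 3, `make_sim` would simply bind the first three arguments (e.g. with `functools.partial`) to obtain the one-argument `sim` function used there.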

Another variant of AT1 could use the difference between $RL(a_k, a_{best}, d_{new})$ and $RL(a_k, a_{best}, d_j)$, normalized between 0 and 1, to obtain a real-valued (non-binary) similarity estimate $Sim(d_{new}, d_j)$. In other words, $d_j$ is more similar to $d_{new}$ if the relative performance difference between $a_k$ and $a_{best}$ is about as large on both $d_j$ and $d_{new}$. We plan to investigate this in our future work.

ATWs is similar to AT1, but instead of only using the last test, it uses all CV tests carried out on the new dataset, and calculates the Laplace-corrected ratio of corresponding results. For instance, suppose we have conducted 3 tests on $d_{new}$, thus yielding 3 pairwise algorithm comparisons on $d_{new}$. Suppose that 2 tests had the same result on dataset $d_j$ (i.e. $M(a_x, d_{new}) > M(a_y, d_{new})$ and $M(a_x, d_j) > M(a_y, d_j)$); then the frequency of occurrence is 2/3, which is adjusted by the Laplace correction to obtain a probability estimate of $(2+1)/(3+2)$. As such, $Sim(d_{new}, d_j) = \frac{3}{5}$.

ATx is similar to ATWs, but requires that all pairwise comparisons yield the

same outcome. In the example used above, it will return $Sim(d_{new}, d_j) = 1$ only if all three comparisons lead to the same result on both datasets, and 0 otherwise.
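The same idea extends to these two variants; a sketch, where `duels` is the list of (winner, loser) pairs from all CV tests carried out on $d_{new}$ so far:

```python
def sim_ATWs(perf, duels, d_j):
    """ATWs: Laplace-corrected fraction of duels on d_new whose outcome is
    reproduced on prior dataset d_j, e.g. 2 of 3 agreements -> (2+1)/(3+2)."""
    agree = sum(1 for winner, loser in duels
                if perf[(winner, d_j)] > perf[(loser, d_j)])
    return (agree + 1) / (len(duels) + 2)

def sim_ATx(perf, duels, d_j):
    """ATx: d_j is similar only if *all* duels have the same outcome on d_j."""
    return 1.0 if all(perf[(winner, d_j)] > perf[(loser, d_j)]
                      for winner, loser in duels) else 0.0
```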

Step 4 - Determine which of the Two Algorithms is Better. Having

found ak, we can now run a CV test and compare it with abest. The winner

(which may be either the current best algorithm or the competitor) is used

as the new current best algorithm in the new round. The losing algorithm is

eliminated from further consideration.

Step 5 - Repeat the Process and Check the Stopping Criteria. The

whole process of identifying the best competitor (step 3) of abest and determining

which one of the two is better (step 4) is repeated until a stopping criterium has

been reached. For instance, the process could be constrained to a ﬁxed number

of CV tests: considering the results presented further on in Section 4, it would

be suﬃcient to run at most 20% of all possible CV tests. Alternatively, one could

impose a ﬁxed CPU time, thus returning the best algorithm in hhours, as in

budgeted learning. In any case, until aborted, the method will keep choosing a

new competitor in each round: there will always be a next best competitor. In

this respect our system diﬀers from, for instance, hill climbing approaches which

can get stuck in a local minimum.

Discussion - Comparison with Active Learning: The term active testing

was chosen because the approach shares some similarities with active learning [7].

The concern of both is to speed up the process of improving a given performance

measure. In active learning, the goal is to select the most informative data point

to be labeled next, so as to improve the predictive performance of a supervised

learning algorithm with a minimum of (expensive) labelings. In active testing,

the goal is to select the most informative CV test, so as to improve the prediction

of the best algorithm on the new dataset with a minimum of (expensive) CV

tests.

4 Empirical Evaluation

4.1 Evaluation Methodology and Experimental Set-up

The proposed method AT was evaluated using a leave-one-out method [18]. The

experiments reported here involve $D$ datasets and so the whole procedure was repeated $D$ times. In each cycle, all performance results on one dataset were left out for testing and the results on the remaining $D-1$ datasets were used as

metadata to determine the best candidate algorithm.

Fig. 1. Median loss as a function of the number of CV tests (log scale), comparing the variants AT0, AT1, ATx and ATWs.

This study involved 292 algorithms (algorithm-parameter combinations), which

were extracted from the experiment database for machine learning (ExpDB)

[11, 22]. This set includes many diﬀerent algorithms from the Weka platform

[10], which were varied by assigning diﬀerent values to their most important

parameters. It includes SMO (a support vector machine, SVM), MLP (Multi-

layer Perceptron), J48 (C4.5), and diﬀerent types of ensembles, including Ran-

domForest, Bagging and Boosting. Moreover, diﬀerent SVM kernels were used

with their own parameter ranges and all non-ensemble learners were used as

base-learners for the ensemble learners mentioned above. The 76 datasets used

in this study were all from UCI [1]. A complete overview of the data used

in this study, including links to all algorithms and datasets can be found on

http://expdb.cs.kuleuven.be/ref/blv11.

The main aim of the test was to prove the research hypothesis formulated

earlier: relative landmarks provide useful information for predicting the most

promising algorithms on new datasets. Therefore, we also include two baseline

methods:

TopN has been described before (e.g. [3]). It also builds a ranking of candidate

algorithms as described in step 1 (although other measures diﬀerent from

mean rank could be used), and only runs CV tests on the first $N$ algorithms.

The overall winner is returned.

Rand simply selects $N$ algorithms at random from the given set, evaluates them

using CV and returns the one with the best performance. It is repeated 10

times with diﬀerent random seeds and the results are averaged.

Since our AT methods are iterative, we will restart TopN and Rand $N$ times, with $N$ equal to the number of iterations (or CV tests).

To evaluate the performance of all approaches, we calculate the loss of the currently best algorithm, defined as $M(a_{best}, d_{new}) - M(a^*, d_{new})$, where $a_{best}$ represents the currently best algorithm, $a^*$ the best possible algorithm, and $M(\cdot)$ the performance measure (success rate).
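In code, the leave-one-out loss measurement could look like the sketch below (hypothetical names; `run_AT` stands for one of the AT variants above restricted to the metadata of the remaining D−1 datasets, and `scores` is the same dataset -> algorithm -> performance table used earlier, so CV results on the held-out dataset are simply looked up rather than recomputed; the loss is taken as the positive gap to the best possible algorithm):

```python
from statistics import median

def loo_median_loss(scores, run_AT, n_tests):
    """Median loss over all datasets after n_tests CV tests (Section 4.1)."""
    losses = []
    for d_new in scores:
        metadata = {d: accs for d, accs in scores.items() if d != d_new}
        a_best = run_AT(metadata, d_new, n_tests)    # recommended algorithm
        best_possible = max(scores[d_new].values())  # performance of a*
        losses.append(best_possible - scores[d_new][a_best])  # loss (positive gap)
    return median(losses)
```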

4.2 Results

By aggregating the results over the $D$ datasets, we can track the median loss of the

recommended algorithm as a function of the number of CV tests carried out.

The results are shown in Figure 1. Note that the number of CV tests is plotted

on a logarithmic scale.

First, we see that ATWs and AT1 perform much better than AT0, which indicates that it is indeed useful to include dataset similarity. If we consider a particular level of loss (say 0.5%), we note that these variants require far fewer CV tests than AT0. The results also indicate that the information associated with relative landmarks obtained on the new dataset is indeed valuable. The median loss curves decline quite rapidly and are always below the AT0 curve.

We also see that after only 10 CV tests (representing about 3% of all possible

tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20%

of all possible tests) the median loss is near 0.

Also note that ATWs, which uses all relative landmarks involving $a_{best}$ and $d_{new}$, does not perform much better than AT1, which only uses the most recent CV test. This result suggests that, when looking for the most promising competitor, the latest test is more informative than the previous ones.

Method ATx, the most restrictive approach, only considers prior datasets on which all relative landmarks involving $a_{best}$ obtained similar results. As shown in

Figure 1, this approach manages to reduce the loss quite rapidly, and competes

well with the other variants in the initial region. However, after achieving a

minimum loss in the order of 0.5%, there are no more datasets that fulﬁll this

restriction, and consequently no new competitor can be chosen, causing it to stop.

The other two methods, ATWs and AT1, do not suffer from this shortcoming.

Fig. 2. Median loss of AT0 and the two baseline methods (TopN and Rand) as a function of the number of CV tests.

AT0 was also our best baseline method. To avoid overloading Figure 1, we show this comparison separately in Figure 2. Indeed, AT0 is clearly better than the random choice method Rand. Comparing AT0 to TopN, we cannot say that one is clearly better than the other overall, as the curves cross. However, it is clear that TopN loses out if we allow more CV tests, and that it is not competitive with the more advanced methods such as AT1 and ATWs.

The curves for mean loss (instead of median loss) follow similar trends, but the values are 1-2% worse due to outliers (see Fig. 3 for method AT1). This figure also shows the curves associated with the 25% and 75% quartiles for AT1. As the number of CV tests increases, the distance between these two curves decreases and approaches the median curve. Similar behavior has been observed for ATWs, but we omit those curves here.

Algorithm trace. It is interesting to trace the iterations carried out for one

particular dataset. Table 2 shows the details for method AT1, where abalone represents the new dataset. Column 1 shows the number of the iteration (thus the number of CV tests). Column 2 shows the most promising competitor $a_k$ chosen in each step. Column 3 shows the index of $a_k$ in our initial ranking CA, and column 4 the index of $a_{best}$, the new best algorithm after the CV test has been performed. As such, if the values in columns 3 and 4 are the same, then the most promising competitor has won the duel.

Fig. 3. Loss of AT1 as a function of the number of CV tests: mean, median, and 25%/75% quantile curves.

For instance, in step 2, SMO.C.1.0.Polynomial.E.3, i.e. an SVM with complexity constant set to 1.0 and a 3rd degree polynomial kernel (index 96), has been identified as the best competitor to be used (column 2), and after the CV test it has won against Bagging.I.75..100.PART, i.e. Bagging with a high number of iterations (between 75 and 100) and PART as the base-learner. As such, it wins this round and becomes the new $a_{best}$. Columns 5 and 6 show the actual rank of the competitor

and the winner on the abalone dataset. Column 7 shows the loss compared to

the optimal algorithm and the ﬁnal column shows the number of datasets whose

similarity measure is 1.

We observe that after only 12 CV tests, the method has identified an algorithm with a very small loss of 0.2%: Bagging.I.25..50.MultilayerPerceptron, i.e. Bagging with relatively few iterations but with a MultiLayerPerceptron base-learner.

Incidentally, this dataset appears to represent a quite atypical problem: the

truly best algorithm, SMO.C.1.0.RBF.G.20, i.e. SVM with an RBF kernel with

kernel width (gamma) set to 20, is ranked globally as algorithm 246 (of all 292).

AT1 identiﬁes it after 177 CV tests.

5 Related Work in Other Scientiﬁc Areas

In this section we brieﬂy cover some work in other scientiﬁc areas which is

related to the problem tackled here and could provide further insight into how

to improve the method.

One particular area is experiment design [6] and in particular active learning.

As discussed before, the method described here follows the main trends that have

been outlined in this literature. However, there is relatively little work on active

Table 2. Trace of the steps taken by AT1 in the search for the supposedly best algorithm for the abalone dataset.

CV test | Algorithm used (current best competitor, ak) | CA: ak | CA: new abest | abalone: ak | abalone: new abest | Loss (%) | D size

1 Bagging.I.75..100.PART 1 1 75 75 1.9 75

2 SMO.C.1.0.Polynomial.E.3 96 96 56 56 1.6 29

3 AdaBoostM1.I.10.MultilayerPerceptron 92 92 47 47 1.5 34

4 Bagging.I.50..75.RandomForest 15 92 66 47 1.5 27

· · · · · · · · · · · · · · · · · · · · · · · ·

10 LMT 6 6 32 32 1.1 45

11 LogitBoost.I.10.DecisionStump 81 6 70 32 1.1 51

12 Bagging.I.25..50.MultilayerPerceptron 12 12 2 2 0.2 37

13 LogitBoost.I.160.DecisionStump 54 12 91 2 0.2 42

· · · · · · · · · · · · · · · · · · · · · · · ·

177 SMO.C.1.0.RBF.G.20 246 246 1 1 0 9

learning for ranking tasks. One notable exception is [15], who use the notion

of Expected Loss Optimization (ELO). Another work in this area is [4], whose

aim was to identify the most interesting substances for drug screening using

a minimum number of tests. In the experiments described, the authors have

focused on the top-10 substances. Several diﬀerent strategies were considered

and evaluated. Our problem here is not ranking, but rather simply ﬁnding the

best item (algorithm), so this work is only partially relevant.

Another relevant area is the so-called multi-armed bandit problem (MAB)

studied in statistics and machine learning [9, 16]. This problem is often formu-

lated in a setting that involves a set of traditional slot machines. When a partic-

ular lever is pulled, a reward is provided from a distribution associated with that

speciﬁc lever. The bandit problem is formally equivalent to a one-state Markov

decision process. The aim is to minimize regret after T rounds, which is deﬁned

as a diﬀerence between the reward sum associated with an optimal strategy and

the sum of collected rewards. Indeed, pulling a lever can be compared to carrying

out a CV test on a given algorithm. However, there is one fundamental diﬀerence

between MAB and our setting: whereas in MAB the aim is to maximize the sum

of collected rewards, our aim is to maximize a single reward, i.e. the reward asso-

ciated with identifying the best algorithm. So again, this work is only partially

relevant.

To the best of our knowledge, no other work in this area has addressed the

issue of how to select a suitable algorithm from a large set of candidates.

6 Signiﬁcance and Impact

In this paper we have addressed the problem of selecting the best classiﬁcation

algorithm for a speciﬁc task. We have introduced a new method, called active

testing, that exploits information concerning past evaluation results (metadata),

to recommend the best algorithm using a limited number of tests on the new

dataset.

Starting from an initial ranking of algorithms on previous datasets, the

method runs additional CV evaluations to test several competing algorithms

on the new dataset. However, the aim is to reduce the number of tests to a mini-

mum. This is done by carefully selecting which tests should be carried out, using

the information of both past and present algorithm evaluations represented in

the form of relative landmarks.

In our view this method incorporates several innovative features. First, it

is an iterative process that uses the information in each CV test to ﬁnd the

most promising next test based on a history of prior ‘algorithm duels’. In a

tournament-style fashion, it starts with a current best (parameterized) algo-

rithm, selects the most promising rival algorithm in each round, evaluates it on

the given problem, and eliminates the algorithm that performs worse. Second, it

continually focuses on the most similar prior datasets: those where the algorithm

duels had a similar outcome to those on the new dataset.

Four variants of this basic approach, differing in their definition of dataset similarity, were investigated in a very extensive experimental setup involving

292 algorithm-parameter combinations on 76 datasets. Our experimental results

show that particularly the versions ATWs and AT1 provide good recommendations using a small number of CV tests. When plotting the median loss as a function of the number of CV tests (Fig. 1), we see that both outperform all other variants and baseline methods. They also outperform AT0, indicating that dataset similarity is an important aspect.

We also see that after only 10 CV tests (representing about 3% of all possible

tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20%

of all possible tests) the median loss is near 0. Similar trends can be observed

when considering mean loss.

The results support the hypothesis that we have formulated at the outset

of our work, that relative landmarks are indeed informative and can be used to

suggest the best contender. If this is procedure is used iteratively, it can be used

to accurately recommend a classiﬁcation algorithm after a very limited number

of CV tests.

Still, we believe that the results could be improved further. Classical information-

theoretic measures and/or sampling landmarks could be incorporated into the

process of identifying the most similar datasets. This could lead to further im-

provements and forms part of our future plans.

References

1. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

2. B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proceedings of the 17th Int. Conf. on Machine Learning (ICML-2000), Stanford, CA, 2000.

3. P. Brazdil, C. Soares, and J. Costa. Ranking learning algorithms: Using IBL and

meta-learning on accuracy and time results. Machine Learning, 50:251–277, 2003.

4. K. De Grave, J. Ramon, and L. De Raedt. Active learning for primary drug

screening. In Proceedings of Discovery Science. Springer, 2008.

5. J. Demsar. Statistical comparisons of classiﬁers over multiple data sets. The

Journal of Machine Learning Research, 7:1–30, 2006.

6. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.

7. Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query

by committee algorithm. Machine Learning, 28:133–168, 1997.

8. Johannes Fürnkranz and Johann Petrak. An evaluation of landmarking variants. In

Proceedings of the ECML/PKDD Workshop on Integrating Aspects of Data Mining,

Decision Support and Meta-Learning (IDDM-2001), pages 57–68. Springer, 2001.

9. J. Gittins. Multi-armed bandit allocation indices. In Wiley Interscience Series in

Systems and Optimization. John Wiley & Sons, Ltd., 1989.

10. Mark Hall, Eibe Frank, Geoﬀrey Holmes, Bernhard Pfahringer, Peter Reutemann,

and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explor.

Newsl., 11(1):10–18, 2009.

11. H. Blockeel. Experiment databases: A novel methodology for experimental research. In Lecture Notes in Computer Science 3933. Springer, 2006.

12. J. Fürnkranz and J. Petrak. An evaluation of landmarking variants. In C. Giraud-Carrier, N. Lavrac, and S. Moyle, editors, Working Notes of the ECML/PKDD 2001 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, 2001.

13. R. Leite and P. Brazdil. Predicting relative performance of classiﬁers from sam-

ples. In ICML ’05: Proceedings of the 22nd international conference on Machine

learning, pages 497–503, New York, NY, USA, 2005. ACM Press.

14. Rui Leite and Pavel Brazdil. Active testing strategy to predict the best classiﬁca-

tion algorithm via sampling and metalearning. In Proceedings of the 19th European

Conference on Artiﬁcial Intelligence - ECAI 2010, 2010.

15. B. Long, O. Chapelle, Y. Zhang, Y. Chang, Z. Zheng, and B. Tseng. Active learning

for rankings through expected loss optimization. In Proceedings of the SIGIR’10.

ACM, 2010.

16. A. Mahajan and D. Teneketzis. Multi-armed bandit problems. In D. A. Castanon,

D. Cochran, and K. Kastella, editors, Foundations and Applications of Sensor

Management. Springer-Verlag, 2007.

17. D. Michie, D.J.Spiegelhalter, and C.C.Taylor. Machine Learning, Neural and Sta-

tistical Classiﬁcation. Ellis Horwood, 1994.

18. Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

19. John R. Rice. The algorithm selection problem. In Advances in Computers, volume 15, pages 65–118. Elsevier, 1976.

20. Kate A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algo-

rithm selection. ACM Comput. Surv., 41(1):1–25, 2008.

21. Carlos Soares, Johann Petrak, and Pavel Brazdil. Sampling-based relative land-

marks: Systematically test-driving algorithms before choosing. In Proceedings of

the 10th Portuguese Conference on Artiﬁcial Intelligence (EPIA 2001), pages 88–

94. Springer, 2001.

22. J. Vanschoren and H. Blockeel. A community-based platform for machine learning

experimentation. In Machine Learning and Knowledge Discovery in Databases,

European Conference, ECML PKDD 2009, volume LNCS 5782, pages 750–754.

Springer, 2009.

23. Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning.

Artif. Intell. Rev., 18(2):77–95, 2002.