Selecting Classification Algorithms with Active Testing
Rui Leite1, Pavel Brazdil1, and Joaquin Vanschoren2
1 LIAAD-INESC Porto L.A. / Faculty of Economics, University of Porto, Portugal
2 LIACS - Leiden Institute of Advanced Computer Science, University of Leiden, the Netherlands
Abstract. Given the large number of data mining algorithms, their
combinations (e.g. ensembles) and possible parameter settings, ﬁnding
the most adequate method to analyze a new dataset becomes an ever
more challenging task. This is because in many cases testing all possi-
bly useful alternatives quickly becomes prohibitively expensive. In this
paper we propose a novel technique, called active testing, that intel-
ligently selects the most useful cross-validation tests. It proceeds in a
tournament-style fashion, in each round selecting and testing the algo-
rithm that is most likely to outperform the best algorithm of the previous
round on the new dataset. This ‘most promising’ competitor is chosen
based on a history of prior duels between both algorithms on similar
datasets. Each new cross-validation test will contribute information to
a better estimate of dataset similarity, and thus better predict which
algorithms are most promising on the new dataset. We have evaluated
this approach using a set of 292 algorithm-parameter combinations on
76 UCI datasets for classiﬁcation. The results show that active testing
will quickly yield an algorithm whose performance is very close to the
optimum, after relatively few tests. It also provides a better solution than
previously proposed methods.
1 Background and Motivation
In many data mining applications, an important problem is selecting the best
algorithm for a speciﬁc problem. Especially in classiﬁcation, there are hundreds
of algorithms to choose from. Moreover, these algorithms can be combined into
composite learning systems (e.g. ensembles) and often have many parameters
that greatly inﬂuence their performance. This yields a whole spectrum of meth-
ods and their variations, so that testing all possible candidates on the given
problem, e.g., using cross-validation, quickly becomes prohibitively expensive.
The issue of selecting the right algorithm has been the subject of many
studies over the past 20 years [17, 3, 23, 20, 19]. Most approaches rely on the
concept of metalearning. This approach exploits characterizations of datasets
and past performance results of algorithms to recommend the best algorithm on
the current dataset. The term metalearning stems from the fact that we try to
learn the function that maps dataset characterizations (meta-data) to algorithm
performance estimates (the target variable).
The earliest techniques considered only the dataset itself and calculated an
array of various simple, statistical or information-theoretic properties of the
data (e.g., dataset size, class skewness and signal-noise ratio) [17, 3]. Another
approach, called landmarking [2, 12], ran simple and fast versions of algorithms
(e.g. decision stumps instead of decision trees) on the new dataset and used their
performance results to characterize the new dataset. Alternatively, in sampling landmarks [21, 8, 14], the complete (non-simplified) algorithms are run on small
samples of the data. A series of sampling landmarks on increasingly large samples
represents a partial learning curve which characterizes datasets and which can be
used to predict the performance of algorithms signiﬁcantly more accurately than
with classical dataset characteristics [13, 14]. Finally, an 'active testing strategy' for sampling landmarks was proposed that actively selects the most informative sample sizes while building these partial learning curves, thus reducing the time needed to compute them.
Motivation. All these approaches have focused on dozens of algorithms at most
and usually considered only default parameter settings. Dealing with hundreds,
perhaps thousands of algorithm-parameter combinations3, provides a new chal-
lenge that requires a new approach. First, distinguishing between hundreds of
subtly diﬀerent algorithms is signiﬁcantly harder than distinguishing between a
handful of very diﬀerent ones. We would need many more data characterizations
that relate the effects of certain parameters on performance. On the other hand, the latter method has a scalability issue: it requires that pairwise comparisons be conducted between all algorithms. This would be rather impractical when faced with hundreds of algorithm-parameter combinations.
To address these issues, we propose a quite diﬀerent way to characterize
datasets, namely through the eﬀect that the dataset has on the relative perfor-
mance of algorithms run on them. As in landmarking, we use the fact that each
algorithm has its own learning bias, making certain assumptions about the data
distribution. If the learning bias ‘matches’ the underlying data distribution of
a particular dataset, it is likely to perform well (e.g., achieve high predictive accuracy). If it does not, it will likely under- or overfit the data, resulting in poor performance.
As such, we characterize a dataset based on the pairwise performance diﬀer-
ences between algorithms run on them: if the same algorithms win, tie or lose
against each other on two datasets, then the data distributions of these datasets
are likely to be similar as well, at least in terms of their eﬀect on learning per-
formance. It is clear that the more algorithms are used, the more accurate the
characterization will be. While we cannot run all algorithms on each new dataset
because of the computational cost, we can run a fair amount of CV tests to get
a reasonably good idea of which prior datasets are most similar to the new one.
Moreover, we can use these same performance results to establish which (yet
untested) algorithms are likely to perform well on the new dataset, i.e., those
algorithms that outperformed or rivaled the currently best algorithm on similar
datasets in the past. As such, we can intelligently select the most promising algorithms for the new dataset, run them, and then use their performance results to gain increasingly better estimates of the most similar datasets and the most promising algorithms.

3 In the remainder of this text, when we speak of algorithms, we mean fully-defined algorithm instances with fixed components (e.g., base-learners, kernel functions) and fixed parameter settings.
Key concepts. There are two key concepts used in this work. The ﬁrst one is
that of the current best candidate algorithm which may be challenged by other
algorithms in the process of ﬁnding an even better candidate.
The second is the pairwise performance difference between two algorithms run on the same dataset, which we call a relative landmark. A collection of such relative landmarks represents a history of previous 'duels' between two algorithms on prior datasets. The term itself originates from the study of landmarking algorithms: since absolute values for the performance of landmarkers vary a lot depending on the dataset, several types of relative landmarks have been proposed, which basically capture the relative performance difference between two algorithms. In this paper, we extend the notion of relative landmarks to all
(including non-simpliﬁed) classiﬁcation algorithms.
The history of previous algorithm duels is used to select the most promis-
ing challenger for the current best candidate algorithm, namely the method
that most convincingly outperformed or rivaled the current champion on prior
datasets similar to the new dataset.
Approach. Given the current best algorithm and a history of relative landmarks
(duels), we can start a tournament game in which, in each round, the current
best algorithm is compared to the next, most promising contender. We select
the most promising challenger as discussed above, and run a CV test with this
algorithm. The winner becomes the new current best candidate, the loser is
removed from consideration. We will discuss the exact procedure in Section 3.
We call this approach active testing (AT)4, as it actively selects the most
interesting CV tests instead of passively performing them one by one: in each
iteration the best competitor is identiﬁed, which determines a new CV test to
be carried out. Moreover, the same result will be used to further characterize
the new dataset and more accurately estimate the similarity between the new
dataset and all prior datasets.
Evaluation. By intelligently selecting the most promising algorithms to test on the new dataset, we can more quickly discover an algorithm that performs
very well. Note that running a selection of algorithms is typically done anyway
to ﬁnd a suitable algorithm. Here, we optimize and automate this process using
historical performance results of the candidate algorithms on prior datasets.
While we cannot possibly guarantee to return the absolute best algorithm
without performing all possible CV tests, we can return an algorithm whose
performance is either identical or very close to the truly best one. The diﬀerence
between the two can be expressed in terms of a loss. Our aim is thus to minimize this loss using a minimal number of tests, and we will evaluate our technique as such.

4 Note that while the term 'active testing' is also used in the context of actively selected sampling landmarks, there is little or no relationship to the approach presented here.

In all, the research hypothesis that we intend to prove in this paper is: Relative
landmarks provide useful information on the similarity of datasets and can be
used to eﬃciently predict the most promising algorithms to test on new datasets.
We will test this hypothesis by running our active testing approach in a leave-
one-out fashion on a large set of CV evaluations testing 292 algorithms on 76
datasets. The results show that our AT approach is indeed eﬀective in ﬁnding
very accurate algorithms in a very limited number of tests.
Roadmap. The remainder of this paper is organized as follows. First, we formu-
late the concepts of relative landmarks in Section 2 and active testing in Section
3. Next, Section 4 presents the empirical evaluation and Section 5 presents an
overview of some work in other related areas. The ﬁnal section presents conclu-
sions and future work.
2 Relative Landmarks
In this section we formalize our definition of relative landmarks, and explain how they are used to identify the most promising competitor for the current best candidate algorithm.

Given a set of classification algorithms and some new classification dataset d_new, the aim is to identify the potentially best algorithm for this task with respect to some given performance measure M (e.g., accuracy, AUC or rank).
Let us represent the performance of algorithm a_i on dataset d_new as M(a_i, d_new). As such, we need to identify an algorithm a* for which the performance measure is maximal, i.e. M(a*, d_new) ≥ M(a_i, d_new) for all a_i. The decision concerning ≥ (i.e. whether a* is at least as good as a_i) may be established using either a statistical significance test or a simple comparison.

However, instead of searching exhaustively for a*, we aim to find a near-optimal algorithm â*, which has a high probability P(M(â*, d_new) ≥ M(a_i, d_new)) of being optimal, ideally close to 1.

As in other work that exploits metalearning, we assume that â* is likely to be better than a_i on dataset d_new if it was found to be better on a similar dataset d_j (for which we have performance estimates):

P(M(â*, d_new) ≥ M(a_i, d_new)) ∼ P(M(â*, d_j) ≥ M(a_i, d_j))   (1)
The latter estimate can be maximized by going through all algorithms and identifying the algorithm â* that satisfies the ≥ constraint in a maximum number of cases. However, this requires that we know which datasets d_j are most similar to d_new. Since our definition of similarity requires CV tests to be run on d_new, but we cannot run all possible CV tests, we use an iterative approach in which we repeat this scan for â* in every round, using only the datasets d_j that seem most similar at that point, as dataset similarities are recalculated after every CV test.
Initially, having no information, we deem all datasets to be similar to d_new, so that â* will be the globally best algorithm over all prior datasets. We then call this algorithm the current best algorithm a_best and run a CV test to calculate its performance on d_new. Based on this, the dataset similarities are recalculated (see Section 3), yielding a possibly different set of datasets d_j. The best algorithm on this new set becomes the best competitor a_k (different from a_best), calculated by counting the number of times that M(a_k, d_j) > M(a_best, d_j) over all datasets d_j.
We can further refine this method by taking into account how large the performance differences are: the larger a difference was in the past, the higher the chances of obtaining a large gain on the new dataset. This leads to the notion of relative landmarks RL, defined as:

RL(a_k, a_best, d_j) = i(M(a_k, d_j) > M(a_best, d_j)) * (M(a_k, d_j) − M(a_best, d_j))   (2)

The function i(test) returns 1 if the test is true and 0 otherwise. As stated before, this can be a simple comparison or a statistical significance test that only returns 1 if a_k performs significantly better than a_best on d_j. The term RL thus expresses how much better a_k is, relative to a_best, on a dataset d_j. Experimental tests have shown that this approach yields much better results than simply counting the number of wins.
Up to now, we have assumed a dataset d_j to be either similar to d_new or not. A second refinement is to use a gradual (non-binary) measure of similarity Sim(d_new, d_j) between datasets d_new and d_j. As such, we can weigh the performance difference between a_k and a_best on d_j by how similar d_j is to d_new. Indeed, the more similar the datasets, the more informative the performance difference is. As such, we aim to optimize the following criterion:

a_k = argmax_{a_i} Σ_{d_j ∈ D} RL(a_i, a_best, d_j) * Sim(d_new, d_j)   (3)

in which D is the set of all prior datasets d_j.
To calculate the similarity Sim(), we use the outcome of each CV test on d_new and compare it to the outcomes on d_j. In each iteration, with each CV test, we obtain a new evaluation M(a_i, d_new), which is used to recalculate all similarities Sim(d_new, d_j). In fact, we will compare four variants of Sim(), which are discussed in the next section. With this, we can recalculate equation 3 to find the next best competitor a_k.
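The selection criterion in equations 2 and 3 can be sketched as follows. This is a minimal Python illustration with hypothetical algorithm names and performance values, not the paper's actual metadata:

```python
# Sketch of equations 2 and 3 (toy data): pick the competitor a_k that
# maximizes the similarity-weighted sum of relative landmarks against a_best.

def relative_landmark(perf, a_k, a_best, d):
    """RL(a_k, a_best, d): positive performance gap of a_k over a_best on d, else 0."""
    diff = perf[a_k][d] - perf[a_best][d]
    return diff if diff > 0 else 0.0

def best_competitor(perf, a_best, sim):
    """Equation 3: argmax over a_i of sum_j RL(a_i, a_best, d_j) * Sim(d_new, d_j)."""
    candidates = [a for a in perf if a != a_best]
    return max(candidates, key=lambda a: sum(
        relative_landmark(perf, a, a_best, d) * s for d, s in sim.items()))

# Toy metadata: accuracies of three algorithms on two prior datasets.
perf = {"MLP": {"d1": 0.85, "d2": 0.70},
        "J48": {"d1": 0.80, "d2": 0.75},
        "NB":  {"d1": 0.92, "d2": 0.65}}
sim = {"d1": 1.0, "d2": 1.0}  # AT0-style: every prior dataset deemed similar
print(best_competitor(perf, "MLP", sim))  # NB (gap 0.07 on d1 beats J48's 0.05 on d2)
```

With non-uniform similarities, the same routine weighs duels on similar datasets more heavily, which is exactly the refinement introduced above.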
3 Active Testing
In this section we describe the active testing (AT) approach, which proceeds
according to the following steps:
1. Construct a global ranking of a given set of algorithms using performance
information from past experiments (metadata).
2. Initiate the iterative process by assigning the top-ranked algorithm as a_best and obtain the performance of this algorithm on d_new using a CV test.
3. Find the most promising competitor a_k for a_best using relative landmarks and all previous CV tests on d_new.
4. Obtain the performance of a_k on d_new using a CV test and compare it with a_best. Use the winner as the current best algorithm, and eliminate the losing algorithm from further consideration.
5. Repeat the whole process starting with step 3 until a stopping criterion has been reached. Finally, output the current a_best as the overall winner.
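The five steps above can be sketched as a tournament loop. The function names and the pluggable competitor-selection rule below are our own illustration, not the paper's exact implementation:

```python
# Minimal sketch of the active testing loop (steps 1-5). Competitor selection
# is abstracted into a user-supplied function; in the paper it is driven by
# relative landmarks weighted by dataset similarity.

def active_testing(global_ranking, cv_test, select_competitor, max_tests):
    """global_ranking: algorithms sorted by mean rank on prior datasets (step 1).
    cv_test(a): performance of algorithm a on d_new (one CV test).
    select_competitor(remaining, a_best): most promising challenger (step 3)."""
    remaining = list(global_ranking)
    a_best = remaining.pop(0)               # step 2: start with top-ranked algorithm
    perf_best, tests = cv_test(a_best), 1
    while remaining and tests < max_tests:  # step 5: stopping criterion
        a_k = select_competitor(remaining, a_best)
        remaining.remove(a_k)               # the loser is never reconsidered
        perf_k, tests = cv_test(a_k), tests + 1
        if perf_k > perf_best:              # step 4: winner becomes current best
            a_best, perf_best = a_k, perf_k
    return a_best, perf_best

# Toy run: fixed (hypothetical) accuracies stand in for CV tests on d_new.
acc = {"MLP": 0.83, "J48": 0.85, "JRip": 0.79, "NB": 0.75}
print(active_testing(["MLP", "J48", "JRip", "NB"], acc.get,
                     lambda rem, best: rem[0], max_tests=3))  # ('J48', 0.85)
```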
Step 1 - Establish a Global Ranking of Algorithms. Before having run any
CV tests, we have no information on the new dataset d_new to define which prior datasets are similar to it. We therefore naively assume that all prior datasets are similar, and generate a global ranking of all algorithms using the
performance results of all algorithms on all previous datasets, and choose the
top-ranked algorithm as our initial candidate a_best. To illustrate this, we use a toy example involving 6 classification algorithms, with default parameter settings, from Weka, evaluated on 40 UCI datasets, a portion of which is shown in Table 1.
As said before, our approach is entirely independent from the exact evalu-
ation measure used: the most appropriate measure can be chosen by the user
in function of the speciﬁc data domain. In this example, we use success rate
(accuracy), but any other suitable measure of classiﬁer performance, e.g. AUC
(area under the ROC curve), precision, recall or F1 can be used as well.
Each accuracy ﬁgure shown in Table 1 represents the mean of 10 values
obtained in 10-fold cross-validation. The ranks of the algorithms on each dataset
are shown in parentheses next to the accuracy value. For instance, on dataset abalone, algorithm MLP is attributed rank 1 as its accuracy is highest on this problem. The second rank is occupied by LogD, etc.

The last row in the table shows the mean rank of each algorithm, obtained by averaging over its ranks across all datasets: R̄_{a_i} = (1/n) Σ_{j=1..n} R_{a_i,d_j}, where R_{a_i,d_j} represents the rank of algorithm a_i on dataset d_j and n the number of datasets. This is a quite common procedure, often used in machine learning to assess how a particular algorithm compares to others.
The mean ranks permit us to obtain a global ranking of candidate algorithms, CA. In our case, CA = ⟨MLP, J48, JRip, LogD, IB1, NB⟩. It must be noted that such a ranking is not very informative in itself. For instance, statistical significance tests would be needed to obtain a truthful ranking. Here, we only use this global ranking CA as a starting point for the iterative procedure, as explained below.
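For illustration, the mean-rank computation can be reproduced from the three accuracy rows of Table 1 shown in this excerpt. Note that a ranking computed from only these three datasets differs from the CA above, which is based on all 40 datasets:

```python
# Mean ranks (step 1), computed from the three Table 1 rows reproduced here.

def ranks(row):
    """Rank 1 = highest accuracy on that dataset (ties ignored for simplicity)."""
    order = sorted(row, key=row.get, reverse=True)
    return {alg: i + 1 for i, alg in enumerate(order)}

acc = {
    "abalone":     {"IB1": .197, "J48": .218, "JRip": .185, "LogD": .259, "MLP": .266, "NB": .237},
    "acetylation": {"IB1": .844, "J48": .831, "JRip": .829, "LogD": .745, "MLP": .609, "NB": .822},
    "adult":       {"IB1": .794, "J48": .861, "JRip": .843, "LogD": .850, "MLP": .830, "NB": .834},
}
algs = ["IB1", "J48", "JRip", "LogD", "MLP", "NB"]
mean_rank = {a: sum(ranks(row)[a] for row in acc.values()) / len(acc) for a in algs}
print(sorted(algs, key=mean_rank.get))  # global ranking on these 3 datasets only
```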
Step 2 - Identify the Current Best Algorithm. The global ranking CA permits us to identify the top-ranked algorithm as our initial best candidate algorithm a_best. In Table 1, a_best = MLP. This algorithm is then evaluated using a CV test to establish its performance on the new dataset d_new.
Table 1. Accuracies and ranks (in parentheses) of the algorithms 1-nearest neighbor
(IB1), C4.5 (J48), RIPPER (JRip), LogisticDiscriminant (LogD), MultiLayerPercep-
tron (MLP) and naive Bayes (NB) on diﬀerent datasets and their mean rank.
Datasets IB1 J48 JRip LogD MLP NB
abalone .197 (5) .218 (4) .185 (6) .259 (2) .266 (1) .237 (3)
acetylation .844 (1) .831 (2) .829 (3) .745 (5) .609 (6) .822 (4)
adult .794 (6) .861 (1) .843 (3) .850 (2) .830 (5) .834 (4)
... ... ... ... ... ... ...
Mean rank 4.05 2.73 3.17 3.74 2.54 4.78
Step 3 - Identify the Most Promising Competitor. In the next step we identify a_k, the best competitor of a_best. To do this, all algorithms are considered one by one, except for a_best and the eliminated algorithms (see step 4).

For each algorithm we analyze the information of past experiments (metadata) to calculate the relative landmarks, as outlined in the previous section. As equation 3 shows, for each a_k, we sum up all relative landmarks involving a_best, weighted by a measure of similarity between dataset d_j and the new dataset d_new. The algorithm a_k that achieves the highest value is the most promising competitor in this iteration. In case of a tie, the competitor that appears first in ranking CA is chosen.

To calculate Sim(d_new, d_j), the similarity between d_j and d_new, we have explored four different variants, AT0, AT1, ATWs and ATx, described below.
AT0 is a baseline method which ignores dataset similarity. It always returns a similarity value of 1, so all datasets are considered similar. This means that the best competitor is determined by summing up the values of the relative landmarks over all prior datasets.
The AT1 method works as AT0 at the beginning, when no tests have been carried out on d_new. In all subsequent iterations, it estimates dataset similarity using only the most recent CV test. Consider the algorithms listed in Table 1 and the ranking CA. Suppose we started with algorithm MLP as the current best candidate. Suppose also that in the next iteration LogD was identified as the best competitor, and won from MLP in the CV test: M(LogD, d_new) > M(MLP, d_new). Then, in the subsequent iteration, all prior datasets d_j satisfying the condition M(LogD, d_j) > M(MLP, d_j) are considered similar to d_new. In general terms, suppose that the last test revealed that M(a_k, d_new) > M(a_best, d_new); then Sim(d_new, d_j) is 1 if also M(a_k, d_j) > M(a_best, d_j), and 0 otherwise. The similarity measure thus determines which RLs are taken into account when summing up their contributions to identify the next best competitor.
Another variant of AT1 could use the difference between RL(a_k, a_best, d_new) and RL(a_k, a_best, d_j), normalized between 0 and 1, to obtain a real-valued (non-binary) similarity estimate Sim(d_new, d_j). In other words, d_j is more similar to d_new if the relative performance difference between a_k and a_best is about as large on both d_j and d_new. We plan to investigate this in our future work.
ATWs is similar to AT1, but instead of only using the last test, it uses all CV tests carried out on the new dataset, and calculates the Laplace-corrected ratio of corresponding results. For instance, suppose we have conducted 3 tests on d_new, thus yielding 3 pairwise algorithm comparisons on d_new. Suppose that 2 tests had the same result on dataset d_j (i.e. M(a_x, d_new) > M(a_y, d_new) and M(a_x, d_j) > M(a_y, d_j)); then the frequency of occurrence is 2/3, which is adjusted by Laplace correction to obtain an estimate of probability: (2 + 1)/(3 + 2). As such, Sim(d_new, d_j) = 3/5.
ATx is similar to ATWs, but requires that all pairwise comparisons yield the same outcome. In the example used above, it will return Sim(d_new, d_j) = 1 only if all three comparisons lead to the same result on both datasets, and 0 otherwise.
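The four similarity variants can be summarized in code. Each function receives, for one prior dataset d_j, a list of booleans recording whether each duel carried out on d_new so far had the same outcome on d_j; this representation is our own sketch, not the paper's implementation:

```python
# Sketch of the four similarity variants. `agreements[i]` is True iff the
# i-th pairwise duel on d_new had the same outcome on prior dataset d_j.

def sim_at0(agreements):
    """AT0: ignore similarity; every prior dataset counts fully."""
    return 1.0

def sim_at1(agreements):
    """AT1: similar iff the most recent duel had the same outcome."""
    return 1.0 if agreements and agreements[-1] else 0.0

def sim_atws(agreements):
    """ATWs: Laplace-corrected fraction of duels with the same outcome."""
    return (sum(agreements) + 1) / (len(agreements) + 2)

def sim_atx(agreements):
    """ATx: similar only if *all* duels had the same outcome."""
    return 1.0 if agreements and all(agreements) else 0.0

# The example from the text: 3 duels on d_new, 2 with matching outcomes on d_j.
duels = [True, True, False]
print(sim_atws(duels))  # (2 + 1) / (3 + 2) = 0.6
print(sim_atx(duels))   # 0.0 -- not all outcomes match
```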
Step 4 - Determine which of the Two Algorithms is Better. Having found a_k, we can now run a CV test and compare its performance with that of a_best. The winner (which may be either the current best algorithm or the competitor) is used as the new current best algorithm in the next round. The losing algorithm is eliminated from further consideration.
Step 5 - Repeat the Process and Check the Stopping Criteria. The whole process of identifying the best competitor of a_best (step 3) and determining which one of the two is better (step 4) is repeated until a stopping criterion has been reached. For instance, the process could be constrained to a fixed number of CV tests: considering the results presented further on in Section 4, it would be sufficient to run at most 20% of all possible CV tests. Alternatively, one could impose a fixed CPU time, thus returning the best algorithm in h hours, as in budgeted learning. In any case, until aborted, the method will keep choosing a new competitor in each round: there will always be a next best competitor. In this respect our system differs from, for instance, hill climbing approaches which can get stuck in a local minimum.
Discussion - Comparison with Active Learning. The term active testing was chosen because the approach shares some similarities with active learning. The concern of both is to speed up the process of improving a given performance measure. In active learning, the goal is to select the most informative data point to be labeled next, so as to improve the predictive performance of a supervised learning algorithm with a minimum of (expensive) labelings. In active testing, the goal is to select the most informative CV test, so as to improve the prediction of the best algorithm on the new dataset with a minimum of (expensive) CV tests.
4 Empirical Evaluation
4.1 Evaluation Methodology and Experimental Set-up
The proposed method AT was evaluated using a leave-one-out procedure. The experiments reported here involve D datasets, and so the whole procedure was repeated D times. In each cycle, all performance results on one dataset were left out for testing and the results on the remaining D − 1 datasets were used as metadata to determine the best candidate algorithm.
Fig. 1. Median accuracy loss (%) as a function of the number of CV tests (log scale).
This study involved 292 algorithms (algorithm-parameter combinations), which were extracted from the experiment database for machine learning (ExpDB) [11, 22]. This set includes many different algorithms from the Weka platform, which were varied by assigning different values to their most important parameters. It includes SMO (a support vector machine, SVM), MLP (Multilayer Perceptron), J48 (C4.5), and different types of ensembles, including RandomForest, Bagging and Boosting. Moreover, different SVM kernels were used with their own parameter ranges, and all non-ensemble learners were used as base-learners for the ensemble learners mentioned above. The 76 datasets used in this study were all from UCI. A complete overview of the data used in this study, including links to all algorithms and datasets, can be found online.
The main aim of the test was to prove the research hypothesis formulated earlier: relative landmarks provide useful information for predicting the most promising algorithms on new datasets. Therefore, we also include two baseline methods:
TopN has been described before. It also builds a ranking of candidate algorithms as described in step 1 (although measures other than mean rank could be used), and only runs CV tests on the first N algorithms. The overall winner is returned.

Rand simply selects N algorithms at random from the given set, evaluates them using CV and returns the one with the best performance. It is repeated 10 times with different random seeds and the results are averaged.
Since our AT methods are iterative, we will restart TopN and Rand N times, with N equal to the number of iterations (or CV tests).
To evaluate the performance of all approaches, we calculate the loss of the currently best algorithm, defined as M(a*, d_new) − M(a_best, d_new), where a_best represents the currently best algorithm, a* the best possible algorithm, and M(.) represents the performance measure (success rate).
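This loss computation can be sketched as follows, with hypothetical success rates and a helper name of our own choosing:

```python
# Loss of the currently recommended algorithm: the performance gap between
# the truly best algorithm on d_new and the recommended one.

def loss(perf_on_dnew, a_best):
    """Gap between the truly best performance on d_new and that of a_best."""
    return max(perf_on_dnew.values()) - perf_on_dnew[a_best]

perf = {"MLP": 0.83, "J48": 0.85, "SVM": 0.84}  # hypothetical success rates
print(round(100 * loss(perf, "MLP"), 1))  # 2.0 (percent loss)
print(round(100 * loss(perf, "J48"), 1))  # 0.0 -- recommended the true best
```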
By aggregating the results over D datasets, we can track the median loss of the
recommended algorithm as a function of the number of CV tests carried out.
The results are shown in Figure 1. Note that the number of CV tests is plotted
on a logarithmic scale.
First, we see that ATWs and AT1 perform much better than AT0, which indicates that it is indeed useful to include dataset similarity. If we consider a particular level of loss (say 0.5%), we note that these variants require far fewer CV tests than AT0. The results also indicate that the information associated with relative landmarks obtained on the new dataset is indeed valuable. The median loss curves decline quite rapidly and are always below the AT0 curve. We also see that after only 10 CV tests (representing about 3% of all possible tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20% of all possible tests), the median loss is near 0.

Also note that ATWs, which uses all relative landmarks involving a_best and d_new, does not perform much better than AT1, which only uses the most recent CV test. This result suggests that, when looking for the most promising competitor, the latest test is more informative than the previous ones.
Method ATx, the most restrictive approach, only considers prior datasets on which all relative landmarks including a_best obtained similar results. As shown in Figure 1, this approach manages to reduce the loss quite rapidly, and competes well with the other variants in the initial region. However, after achieving a minimum loss in the order of 0.5%, there are no more datasets that fulfill this restriction, and consequently no new competitor can be chosen, causing it to stop. The other two methods, ATWs and AT1, do not suffer from this shortcoming.
Fig. 2. Median accuracy loss (%) of AT0 and the two baseline methods, as a function of the number of CV tests (log scale).
AT0 was also our best baseline method. To avoid overloading Figure 1, we show this comparison separately in Figure 2. Indeed, AT0 is clearly better than the random choice method Rand. Comparing AT0 to TopN, we cannot say that one is clearly better than the other overall, as the curves cross. However, it is clear that TopN loses out if we allow more CV tests, and that it is not competitive with the more advanced methods such as AT1 and ATWs.
The curves for mean loss (instead of median loss) follow similar trends, but the values are 1-2% worse due to outliers (see Fig. 3 for method AT1). This figure also shows the 25% and 75% quartile curves for AT1. As the number of CV tests increases, the distance between the two curves decreases and approaches the median curve. Similar behavior has been observed for ATWs, but we omit those curves in this text.
Algorithm trace. It is interesting to trace the iterations carried out for one particular dataset. Table 2 shows the details for method AT1, where abalone represents the new dataset. Column 1 shows the number of the iteration (thus the number of CV tests). Column 2 shows the most promising competitor a_k chosen in each step. Column 3 shows the index of a_k in our initial ranking CA, and column 4 the index of a_best, the new best algorithm after the CV test has been performed. As such, if the values in columns 3 and 4 are the same, then the most promising competitor has won the duel. For instance, in step 2, SMO.C.1.0.Polynomial.E.3, i.e. SVM with complexity constant set to 1.0 and a 3rd degree polynomial kernel (index 96), has been identified as the best
Fig. 3. Accuracy loss (%) of AT1 as a function of the number of CV tests (log scale).
competitor to be used (column 2), and after the CV test, it has won against Bagging.I.75..100.PART, i.e. Bagging with a high number of iterations (between 75 and 100) and PART as a base-learner. As such, it wins this round and becomes the new a_best. Columns 5 and 6 show the actual rank of the competitor and the winner on the abalone dataset. Column 7 shows the loss compared to the optimal algorithm, and the final column shows the number of datasets whose similarity measure is 1.

We observe that after only 12 CV tests, the method has identified an algorithm with a very small loss of 0.2%: Bagging.I.25..50.MultilayerPerceptron, i.e. Bagging with relatively few iterations but with a MultilayerPerceptron base-learner.
Incidentally, this dataset appears to represent a quite atypical problem: the
truly best algorithm, SMO.C.1.0.RBF.G.20, i.e. SVM with an RBF kernel with
kernel width (gamma) set to 20, is ranked globally as algorithm 246 (of all 292).
AT1 identiﬁes it after 177 CV tests.
5 Related Work in Other Scientiﬁc Areas
In this section we brieﬂy cover some work in other scientiﬁc areas which is
related to the problem tackled here and could provide further insight into how
to improve the method.
One particular area is experiment design, and in particular active learning. As discussed before, the method described here follows the main trends that have been outlined in this literature. However, there is relatively little work on active
Table 2. Trace of the steps taken by AT 1 in the search for the supposedly best
algorithm for the abalone dataset
CV Algorithm used CA CA abalone abalone Loss D
test (current best competitor, a_k) a_k new a_best a_k new a_best (%) size
1 Bagging.I.75..100.PART 1 1 75 75 1.9 75
2 SMO.C.1.0.Polynomial.E.3 96 96 56 56 1.6 29
3 AdaBoostM1.I.10.MultilayerPerceptron 92 92 47 47 1.5 34
4 Bagging.I.50..75.RandomForest 15 92 66 47 1.5 27
· · · · · · · · · · · · · · · · · · · · · · · ·
10 LMT 6 6 32 32 1.1 45
11 LogitBoost.I.10.DecisionStump 81 6 70 32 1.1 51
12 Bagging.I.25..50.MultilayerPerceptron 12 12 2 2 0.2 37
13 LogitBoost.I.160.DecisionStump 54 12 91 2 0.2 42
· · · · · · · · · · · · · · · · · · · · · · · ·
177 SMO.C.1.0.RBF.G.20 246 246 1 1 0 9
learning for ranking tasks. One notable exception uses the notion of Expected Loss Optimization (ELO). Another work in this area aimed to identify the most interesting substances for drug screening using a minimum number of tests. In the experiments described, the authors focused on the top-10 substances. Several different strategies were considered and evaluated. Our problem here is not ranking, but rather simply finding the best item (algorithm), so this work is only partially relevant.
Another relevant area is the so-called multi-armed bandit (MAB) problem
studied in statistics and machine learning [9, 16]. This problem is often
formulated in a setting that involves a set of traditional slot machines: when a
particular lever is pulled, a reward is drawn from a distribution associated with
that specific lever. The bandit problem is formally equivalent to a one-state
Markov decision process. The aim is to minimize the regret after T rounds,
defined as the difference between the reward sum of an optimal strategy and
the sum of collected rewards. Indeed, pulling a lever can be compared to carrying
out a CV test on a given algorithm. However, there is one fundamental difference
between MAB and our setting: whereas in MAB the aim is to maximize the sum
of collected rewards, our aim is to maximize a single reward, namely the one
associated with identifying the best algorithm. So again, this work is only
partially relevant.
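To make the contrast concrete: in standard bandit terminology, the MAB objective above is cumulative regret, while identifying a single best algorithm corresponds to minimizing simple regret (the 'best-arm identification' setting). A sketch in standard notation, not taken from this paper:

```latex
% Cumulative regret (MAB objective) vs. simple regret (best-arm
% identification, closer to our setting). Here \mu^{*} is the mean reward
% of the best arm, r_t the reward collected in round t, and \hat{a}_T
% the arm recommended after T rounds.
R_T^{\mathrm{cum}} = T\,\mu^{*} - \sum_{t=1}^{T} r_t ,
\qquad
R_T^{\mathrm{simple}} = \mu^{*} - \mu_{\hat{a}_T} .
```

Minimizing simple regret only rewards the quality of the final recommendation, which is why MAB strategies tuned for cumulative regret are only partially applicable here.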
To the best of our knowledge, no other work in this area has addressed the
issue of how to select a suitable algorithm from a large set of candidates.
6 Significance and Impact
In this paper we have addressed the problem of selecting the best classification
algorithm for a speciﬁc task. We have introduced a new method, called active
testing, that exploits information concerning past evaluation results (metadata),
to recommend the best algorithm using a limited number of tests on the new
dataset.
Starting from an initial ranking of algorithms on previous datasets, the
method runs additional CV evaluations to test several competing algorithms
on the new dataset. However, the aim is to reduce the number of tests to a mini-
mum. This is done by carefully selecting which tests should be carried out, using
the information of both past and present algorithm evaluations represented in
the form of relative landmarks.
In our view this method incorporates several innovative features. First, it
is an iterative process that uses the information in each CV test to find the
most promising next test based on a history of prior ‘algorithm duels’. In a
tournament-style fashion, it starts with a current best (parameterized) algo-
rithm, selects the most promising rival algorithm in each round, evaluates it on
the given problem, and eliminates the algorithm that performs worse. Second, it
continually focuses on the most similar prior datasets: those where the algorithm
duels had a similar outcome to those on the new dataset.
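The tournament loop just described can be sketched in a minimal, self-contained form. All names below (`estimated_gain`, `active_testing`, the duel-history structure) are our own illustrative choices, not the authors' implementation, and the similarity re-weighting is a crude stand-in for the relative-landmark similarity used in the paper.

```python
def estimated_gain(challenger, champion, history, weights):
    """Similarity-weighted fraction of prior datasets where the
    challenger won its duel against the current champion."""
    total = wins = 0.0
    for dataset, duels in history.items():
        outcome = duels.get((challenger, champion))  # 1.0 = challenger won
        if outcome is None:
            continue
        w = weights.get(dataset, 1.0)
        total += w
        wins += w * outcome
    return wins / total if total else 0.0

def active_testing(ranked_algorithms, history, cv_test, rounds):
    """Tournament-style search: repeatedly CV-test the rival most
    likely to beat the current champion, keeping the winner."""
    weights = {d: 1.0 for d in history}        # dataset similarity weights
    remaining = list(ranked_algorithms)
    champion = remaining.pop(0)                # top-ranked algorithm overall
    champ_score = cv_test(champion)
    for _ in range(rounds):
        if not remaining:
            break
        rival = max(remaining,
                    key=lambda a: estimated_gain(a, champion, history, weights))
        remaining.remove(rival)
        rival_score = cv_test(rival)
        rival_won = rival_score > champ_score
        # Trust prior datasets whose duel outcome matches the new result.
        for dataset, duels in history.items():
            outcome = duels.get((rival, champion))
            if outcome is not None:
                agrees = (outcome == 1.0) == rival_won
                weights[dataset] *= 2.0 if agrees else 0.5
        if rival_won:
            champion, champ_score = rival, rival_score
    return champion, champ_score

# Toy metadata: per prior dataset, 1.0 if the first algorithm beat the second.
history = {
    "d1": {("B", "A"): 1.0, ("C", "A"): 0.0},
    "d2": {("B", "A"): 1.0},
}
toy_scores = {"A": 0.70, "B": 0.80, "C": 0.75}  # stand-in for CV accuracies
best, best_score = active_testing(["A", "B", "C"], history,
                                  lambda a: toy_scores[a], rounds=2)
```

Here the dataset weights are doubled or halved depending on whether a prior duel agreed with the newly observed one; in the paper, each CV test refines the similarity estimate in a more principled way.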
Four variants of this basic approach, differing in their definition of algorithm
similarity, were investigated in a very extensive experimental setup involving
292 algorithm-parameter combinations on 76 datasets. Our experimental results
show that versions ATWs and AT1 in particular provide good recommendations
using a small number of CV tests. When plotting the median loss as a function
of the number of CV tests (Fig. 1), both outperform all other variants and
baseline methods. They also outperform AT0, indicating that algorithm
similarity is an important aspect.
We also see that after only 10 CV tests (representing about 3% of all possible
tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20%
of all possible tests) the median loss is near 0. Similar trends can be observed
when considering mean loss.
The results support the hypothesis that we have formulated at the outset
of our work, that relative landmarks are indeed informative and can be used to
suggest the best contender. If this procedure is applied iteratively, it can
accurately recommend a classification algorithm after a very limited number
of CV tests.
Still, we believe that the results could be improved further. Classical information-
theoretic measures and/or sampling landmarks could be incorporated into the
process of identifying the most similar datasets. This could lead to further im-
provements and forms part of our future plans.
References

1. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
2. B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking
various learning algorithms. In Proceedings of the 17th Int. Conf. on Machine
Learning (ICML-2000), Stanford, CA, 2000.
3. P. Brazdil, C. Soares, and J. Costa. Ranking learning algorithms: Using IBL and
meta-learning on accuracy and time results. Machine Learning, 50:251–277, 2003.
4. K. De Grave, J. Ramon, and L. De Raedt. Active learning for primary drug
screening. In Proceedings of Discovery Science. Springer, 2008.
5. J. Demsar. Statistical comparisons of classifiers over multiple data sets. The
Journal of Machine Learning Research, 7:1–30, 2006.
6. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
7. Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query
by committee algorithm. Machine Learning, 28:133–168, 1997.
8. J. Fürnkranz and J. Petrak. An evaluation of landmarking variants. In
Proceedings of the ECML/PKDD Workshop on Integrating Aspects of Data Mining,
Decision Support and Meta-Learning (IDDM-2001), pages 57–68. Springer, 2001.
9. J. Gittins. Multi-armed bandit allocation indices. Wiley Interscience Series in
Systems and Optimization. John Wiley & Sons, Ltd., 1989.
10. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten.
The WEKA data mining software: an update. SIGKDD Explorations Newsletter,
11(1):10–18, 2009.
11. H. Blockeel. Experiment databases: A novel methodology for experimental research.
In Lecture Notes in Computer Science 3933. Springer, 2006.
12. J. Fürnkranz and J. Petrak. An evaluation of landmarking variants. In
C. Giraud-Carrier, N. Lavrac, and S. Moyle, editors, Working Notes of the
ECML/PKDD 2001 Workshop on Integrating Aspects of Data Mining, Decision
Support and Meta-Learning, 2001.
13. R. Leite and P. Brazdil. Predicting relative performance of classifiers from
samples. In ICML '05: Proceedings of the 22nd International Conference on Machine
Learning, pages 497–503, New York, NY, USA, 2005. ACM Press.
14. R. Leite and P. Brazdil. Active testing strategy to predict the best classifica-
tion algorithm via sampling and metalearning. In Proceedings of the 19th European
Conference on Artificial Intelligence (ECAI 2010), 2010.
15. B. Long, O. Chapelle, Y. Zhang, Y. Chang, Z. Zheng, and B. Tseng. Active learning
for ranking through expected loss optimization. In Proceedings of SIGIR '10, 2010.
16. A. Mahajan and D. Teneketzis. Multi-armed bandit problems. In D. A. Castanon,
D. Cochran, and K. Kastella, editors, Foundations and Applications of Sensor
Management. Springer-Verlag, 2007.
17. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and
Statistical Classification. Ellis Horwood, 1994.
18. T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
19. J. R. Rice. The algorithm selection problem. In Advances in Computers,
volume 15, pages 65–118. Elsevier, 1976.
20. K. A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algo-
rithm selection. ACM Computing Surveys, 41(1):1–25, 2008.
21. C. Soares, J. Petrak, and P. Brazdil. Sampling-based relative landmarks:
Systematically test-driving algorithms before choosing. In Proceedings of
the 10th Portuguese Conference on Artificial Intelligence (EPIA 2001), pages
88–94. Springer, 2001.
22. J. Vanschoren and H. Blockeel. A community-based platform for machine learning
experimentation. In Machine Learning and Knowledge Discovery in Databases,
European Conference, ECML PKDD 2009, LNCS 5782, pages 750–754. Springer, 2009.
23. R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning.
Artificial Intelligence Review, 18(2):77–95, 2002.