
Selecting Classification Algorithms with Active Testing


Selecting Classification Algorithms with Active Testing
Rui Leite1, Pavel Brazdil1, and Joaquin Vanschoren2
1 LIAAD-INESC Porto L.A. / Faculty of Economics, University of Porto, Portugal
2 LIACS - Leiden Institute of Advanced Computer Science, University of Leiden
Abstract. Given the large amount of data mining algorithms, their
combinations (e.g. ensembles) and possible parameter settings, finding
the most adequate method to analyze a new dataset becomes an ever
more challenging task. This is because in many cases testing all possi-
bly useful alternatives quickly becomes prohibitively expensive. In this
paper we propose a novel technique, called active testing, that intel-
ligently selects the most useful cross-validation tests. It proceeds in a
tournament-style fashion, in each round selecting and testing the algo-
rithm that is most likely to outperform the best algorithm of the previous
round on the new dataset. This ‘most promising’ competitor is chosen
based on a history of prior duels between both algorithms on similar
datasets. Each new cross-validation test will contribute information to
a better estimate of dataset similarity, and thus better predict which
algorithms are most promising on the new dataset. We have evaluated
this approach using a set of 292 algorithm-parameter combinations on
76 UCI datasets for classification. The results show that active testing
will quickly yield an algorithm whose performance is very close to the
optimum, after relatively few tests. It also provides a better solution than
previously proposed methods.
1 Background and Motivation
In many data mining applications, an important problem is selecting the best
algorithm for a specific problem. Especially in classification, there are hundreds
of algorithms to choose from. Moreover, these algorithms can be combined into
composite learning systems (e.g. ensembles) and often have many parameters
that greatly influence their performance. This yields a whole spectrum of meth-
ods and their variations, so that testing all possible candidates on the given
problem, e.g., using cross-validation, quickly becomes prohibitively expensive.
The issue of selecting the right algorithm has been the subject of many
studies over the past 20 years [17, 3, 23, 20, 19]. Most approaches rely on the
concept of metalearning. This approach exploits characterizations of datasets
and past performance results of algorithms to recommend the best algorithm on
the current dataset. The term metalearning stems from the fact that we try to
learn the function that maps dataset characterizations (meta-data) to algorithm
performance estimates (the target variable).
The earliest techniques considered only the dataset itself and calculated an
array of various simple, statistical or information-theoretic properties of the
data (e.g., dataset size, class skewness and signal-noise ratio) [17, 3]. Another
approach, called landmarking [2, 12], ran simple and fast versions of algorithms
(e.g. decision stumps instead of decision trees) on the new dataset and used their
performance results to characterize the new dataset. Alternatively, in sampling
landmarks [21, 8,14], the complete (non-simplified) algorithms are run on small
samples of the data. A series of sampling landmarks on increasingly large samples
represents a partial learning curve which characterizes datasets and which can be
used to predict the performance of algorithms significantly more accurately than
with classical dataset characteristics [13, 14]. Finally, an ‘active testing strategy’
for sampling landmarks [14] was proposed that actively selects the most infor-
mative sample sizes while building these partial learning curves, thus reducing
the time needed to compute them.
Motivation. All these approaches have focused on dozens of algorithms at most
and usually considered only default parameter settings. Dealing with hundreds,
perhaps thousands of algorithm-parameter combinations3, provides a new chal-
lenge that requires a new approach. First, distinguishing between hundreds of
subtly different algorithms is significantly harder than distinguishing between a
handful of very different ones. We would need many more data characterizations
that relate the effects of certain parameters on performance. On the other hand,
the latter method [14] has a scalability issue: it requires that pairwise compar-
isons be conducted between algorithms. This would be rather impractical when
faced with hundreds of algorithm-parameter combinations.
To address these issues, we propose a quite different way to characterize
datasets, namely through the effect that the dataset has on the relative perfor-
mance of algorithms run on them. As in landmarking, we use the fact that each
algorithm has its own learning bias, making certain assumptions about the data
distribution. If the learning bias ‘matches’ the underlying data distribution of
a particular dataset, it is likely to perform well (e.g., achieve high predictive
accuracy). If it does not, it will likely under- or overfit the data, resulting in a
lower performance.
As such, we characterize a dataset based on the pairwise performance differ-
ences between algorithms run on them: if the same algorithms win, tie or lose
against each other on two datasets, then the data distributions of these datasets
are likely to be similar as well, at least in terms of their effect on learning per-
formance. It is clear that the more algorithms are used, the more accurate the
characterization will be. While we cannot run all algorithms on each new dataset
because of the computational cost, we can run a fair amount of CV tests to get
a reasonably good idea of which prior datasets are most similar to the new one.
Moreover, we can use these same performance results to establish which (yet
untested) algorithms are likely to perform well on the new dataset, i.e., those
algorithms that outperformed or rivaled the currently best algorithm on similar
datasets in the past. As such, we can intelligently select the most promising
3 In the remainder of this text, when we speak of algorithms, we mean fully-defined
algorithm instances with fixed components (e.g., base-learners, kernel functions) and
parameter settings.
algorithms for the new dataset, run them, and then use their performance results
to gain increasingly better estimates of the most similar datasets and the most
promising algorithms.
Key concepts. There are two key concepts used in this work. The first one is
that of the current best candidate algorithm which may be challenged by other
algorithms in the process of finding an even better candidate.
The second is the pairwise performance difference between two algorithms run
on the same dataset, which we call a relative landmark. A collection of such
relative landmarks represents a history of previous ‘duels’ between two algorithms
on prior datasets. The term itself originates from the study of landmarking al-
gorithms: since absolute values for the performance of landmarkers vary a lot
depending on the dataset, several types of relative landmarks have been pro-
posed, which basically capture the relative performance difference between two
algorithms [12]. In this paper, we extend the notion of relative landmarks to all
(including non-simplified) classification algorithms.
The history of previous algorithm duels is used to select the most promis-
ing challenger for the current best candidate algorithm, namely the method
that most convincingly outperformed or rivaled the current champion on prior
datasets similar to the new dataset.
Approach. Given the current best algorithm and a history of relative landmarks
(duels), we can start a tournament game in which, in each round, the current
best algorithm is compared to the next, most promising contender. We select
the most promising challenger as discussed above, and run a CV test with this
algorithm. The winner becomes the new current best candidate, the loser is
removed from consideration. We will discuss the exact procedure in Section 3.
We call this approach active testing (AT)4, as it actively selects the most
interesting CV tests instead of passively performing them one by one: in each
iteration the best competitor is identified, which determines a new CV test to
be carried out. Moreover, the same result will be used to further characterize
the new dataset and more accurately estimate the similarity between the new
dataset and all prior datasets.
Evaluation. By intelligently selecting the most promising algorithms to test
on the new dataset, we can more quickly discover an algorithm that performs
very well. Note that running a selection of algorithms is typically done anyway
to find a suitable algorithm. Here, we optimize and automate this process using
historical performance results of the candidate algorithms on prior datasets.
While we cannot possibly guarantee to return the absolute best algorithm
without performing all possible CV tests, we can return an algorithm whose
performance is either identical or very close to the truly best one. The difference
between the two can be expressed in terms of a loss. Our aim is thus to minimize
4 Note that while the term ‘active testing’ is also used in the context of actively
selected sampling landmarks [14], there is little or no relationship to the approach
described here.
this loss using a minimal number of tests, and we will evaluate our technique as
such. In all, the research hypothesis that we intend to prove in this paper is: Relative
landmarks provide useful information on the similarity of datasets and can be
used to efficiently predict the most promising algorithms to test on new datasets.
We will test this hypothesis by running our active testing approach in a leave-
one-out fashion on a large set of CV evaluations testing 292 algorithms on 76
datasets. The results show that our AT approach is indeed effective in finding
very accurate algorithms in a very limited number of tests.
Roadmap. The remainder of this paper is organized as follows. First, we formu-
late the concepts of relative landmarks in Section 2 and active testing in Section
3. Next, Section 4 presents the empirical evaluation and Section 5 presents an
overview of some work in other related areas. The final section presents conclu-
sions and future work.
2 Relative Landmarks
In this section we formalize our definition of relative landmarks, and explain
how they are used to identify the most promising competitor for the current
best algorithm.
Given a set of classification algorithms and some new classification dataset
d_new, the aim is to identify the potentially best algorithm for this task with
respect to some given performance measure M (e.g., accuracy, AUC or rank).
Let us represent the performance of algorithm a_i on dataset d_new as
M(a_i, d_new). As such, we need to identify an algorithm a*, for which the
performance measure is maximal, i.e., ∀a_i : M(a*, d_new) ≥ M(a_i, d_new).
The decision concerning ≥ (i.e. whether a* is at least as good as a_i) may be
established using either a statistical significance test or a simple comparison.
However, instead of searching exhaustively for a*, we aim to find a near-optimal
algorithm â which has a high probability P(M(â, d_new) ≥ M(a_i, d_new)) of
being optimal, ideally close to 1.
As in other work that exploits metalearning, we assume that â is likely
better than a_i on dataset d_new if it was found to be better on a similar dataset
d_j (for which we have performance estimates):

P(M(â, d_new) ≥ M(a_i, d_new)) ≈ P(M(â, d_j) ≥ M(a_i, d_j))   (1)
The latter estimate can be maximized by going through all algorithms and
identifying the algorithm â that satisfies the constraint in a maximum number of
cases. However, this requires that we know which datasets d_j are most similar
to d_new. Since our definition of similarity requires CV tests to be run on d_new,
but we cannot run all possible CV tests, we use an iterative approach in which
we repeat this scan for â in every round, using only the datasets d_j that seem
most similar at that point, as dataset similarities are recalculated after every
CV test.
Initially, having no information, we deem all datasets to be similar to d_new,
so that â will be the globally best algorithm over all prior datasets. We then call
this algorithm the current best algorithm a_best and run a CV test to calculate its
performance on d_new. Based on this, the dataset similarities are recalculated (see
Section 3), yielding a possibly different set of datasets d_j. The best algorithm on
this new set becomes the best competitor a_k (different from a_best), calculated by
counting the number of times that M(a_k, d_j) > M(a_best, d_j) over all datasets d_j.
We can further refine this method by taking into account how large the
performance differences are: the larger a difference was in the past, the higher
the chance of obtaining a large gain on the new dataset. This leads to the notion
of relative landmarks RL, defined as:

RL(a_k, a_best, d_j) = i(M(a_k, d_j) > M(a_best, d_j)) × (M(a_k, d_j) − M(a_best, d_j))   (2)

The function i(test) returns 1 if the test is true and 0 otherwise. As stated before,
this can be a simple comparison or a statistical significance test that only returns
1 if a_k performs significantly better than a_best on d_j. The term RL thus expresses
how much better a_k is, relative to a_best, on a dataset d_j. Experimental tests have
shown that this approach yields much better results than simply counting the
number of wins.
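As a concrete illustration, the relative landmark of equation 2 can be computed as follows. The data layout (a dict keyed by (algorithm, dataset) pairs) and all names are our own illustrative choices, not the authors' implementation:

```python
# Minimal sketch of equation 2. M[(a, d)] holds the measured
# performance (e.g. accuracy) of algorithm a on dataset d.

def relative_landmark(M, a_k, a_best, d_j):
    """RL(a_k, a_best, d_j): the performance gain of a_k over a_best on
    d_j, or 0 when a_k did not win (the indicator i(.) in equation 2)."""
    diff = M[(a_k, d_j)] - M[(a_best, d_j)]
    return diff if diff > 0 else 0.0

# Toy metadata: a_k beats a_best on d1 (gain 0.25) but loses on d2.
M = {("ak", "d1"): 0.75, ("abest", "d1"): 0.50,
     ("ak", "d2"): 0.50, ("abest", "d2"): 0.75}
print(relative_landmark(M, "ak", "abest", "d1"))  # 0.25
print(relative_landmark(M, "ak", "abest", "d2"))  # 0.0
```

Replacing the simple comparison by a statistical significance test, as the text allows, would only change the indicator, not the gain term.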
Up to now, we have assumed a dataset d_j to be either similar to d_new or
not. A second refinement is to use a gradual (non-binary) measure of similarity
Sim(d_new, d_j) between datasets d_new and d_j. As such, we can weigh the
performance difference between a_k and a_best on d_j by how similar d_j is to
d_new. Indeed, the more similar the datasets, the more informative the performance
difference is. As such, we aim to optimize the following criterion:

a_k = arg max_{a_i} Σ_{d_j ∈ D} (RL(a_i, a_best, d_j) × Sim(d_new, d_j))   (3)

in which D is the set of all prior datasets d_j.
To calculate the similarity Sim(), we use the outcome of each CV test on
d_new and compare it to the outcomes on d_j.
In each iteration, with each CV test, we obtain a new evaluation M(a_i, d_new),
which is used to recalculate all similarities Sim(d_new, d_j). In fact, we will
compare four variants of Sim(), which are discussed in the next section. With this,
we can re-evaluate equation 3 to find the next best competitor a_k.
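Putting equations 2 and 3 together, the selection of the next competitor can be sketched as below; `perf`, `sim`, and the candidate handling are hypothetical stand-ins for the paper's metadata store and similarity estimates:

```python
# Sketch of equation 3: for every untested algorithm, sum its relative
# landmarks against a_best over all prior datasets, weighted by the
# current similarity estimates, and pick the arg max.

def best_competitor(perf, candidates, a_best, datasets, sim):
    """perf[(a, d)]: past performance of a on d; sim[d]: Sim(d_new, d)."""
    def score(a):
        return sum(max(perf[(a, d)] - perf[(a_best, d)], 0.0) * sim[d]
                   for d in datasets)
    return max(candidates, key=score)

# Toy example: 'b' has a big win on the (similar) dataset d1, while 'c'
# only wins on the dissimilar d2, so 'b' is selected as the challenger.
perf = {("a", "d1"): 0.6, ("b", "d1"): 0.8, ("c", "d1"): 0.5,
        ("a", "d2"): 0.6, ("b", "d2"): 0.6, ("c", "d2"): 0.9}
sim = {"d1": 1.0, "d2": 0.0}
print(best_competitor(perf, ["b", "c"], "a", ["d1", "d2"], sim))  # b
```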
3 Active Testing
In this section we describe the active testing (AT) approach, which proceeds
according to the following steps:
1. Construct a global ranking of a given set of algorithms using performance
information from past experiments (metadata).
2. Initiate the iterative process by assigning the top-ranked algorithm as a_best
and obtain the performance of this algorithm on d_new using a CV test.
3. Find the most promising competitor a_k for a_best using relative landmarks
and all previous CV tests on d_new.
4. Obtain the performance of a_k on d_new using a CV test and compare it with
a_best. Use the winner as the current best algorithm, and eliminate the losing
algorithm.
5. Repeat the whole process starting with step 3 until a stopping criterion has
been reached. Finally, output the current a_best as the overall winner.
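The five steps above can be sketched as a loop. Here `cv_test` (which runs a CV test on d_new) and `most_promising` (the equation 3 selection) are stand-ins for the components described in the steps, not the authors' code:

```python
# Tournament-style active testing loop (steps 1-5, sketched).

def active_testing(ranking, cv_test, most_promising, max_tests):
    a_best = ranking[0]                       # steps 1-2: top-ranked algorithm
    tested = {a_best: cv_test(a_best)}        # CV test of a_best on d_new
    remaining = set(ranking[1:])
    while remaining and len(tested) < max_tests:   # step 5: stopping criterion
        a_k = most_promising(remaining, a_best, tested)  # step 3
        tested[a_k] = cv_test(a_k)                       # step 4: run the duel
        if tested[a_k] > tested[a_best]:
            a_best = a_k          # challenger wins and becomes the champion
        remaining.discard(a_k)    # a tested challenger never returns
    return a_best

# Toy run: cv_test is just a lookup into the true accuracies, and the
# challenger heuristic is a naive stand-in for equation 3.
true_perf = {"mlp": 0.80, "j48": 0.85, "nb": 0.70}
pick_any = lambda remaining, a_best, tested: max(remaining)
print(active_testing(["mlp", "j48", "nb"], true_perf.get, pick_any, 3))  # j48
```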
Step 1 - Establish a Global Ranking of Algorithms. Before having run any
CV tests, we have no information on the new dataset d_new to determine which
prior datasets are similar to it. As such, we naively assume that all prior datasets
are similar, and generate a global ranking of all algorithms using the performance
results of all algorithms on all previous datasets, choosing the top-ranked
algorithm as our initial candidate a_best. To illustrate this, we use a toy
example involving 6 classification algorithms, with default parameter settings,
from Weka [10] evaluated on 40 UCI datasets [1], a portion of which is shown in
Table 1.
As said before, our approach is entirely independent from the exact evalu-
ation measure used: the most appropriate measure can be chosen by the user
in function of the specific data domain. In this example, we use success rate
(accuracy), but any other suitable measure of classifier performance, e.g. AUC
(area under the ROC curve), precision, recall or F1 can be used as well.
Each accuracy figure shown in Table 1 represents the mean of 10 values
obtained in 10-fold cross-validation. The ranks of the algorithms on each dataset
are shown in parentheses next to the accuracy value. For instance, if we consider
dataset abalone, algorithm MLP is attributed rank 1 as its accuracy is highest
on this problem. The second rank is occupied by LogD, etc.
The last row in the table shows the mean rank of each algorithm, obtained
by averaging over the ranks on each dataset: R_{a_i} = (1/n) Σ_{j=1}^{n} R_{a_i,d_j},
where R_{a_i,d_j} represents the rank of algorithm a_i on dataset d_j and n
the number of datasets. This is a quite common procedure, often used in machine
learning to assess how a particular algorithm compares to others [5].
The mean ranks permit us to obtain a global ranking of candidate algorithms,
CA. In our case, CA = ⟨MLP, J48, JRip, LogD, IB1, NB⟩. It must be noted
that such a ranking is not very informative in itself. For instance, statistical
significance tests are needed to obtain a truthful ranking. Here, we only use this
global ranking CA as a starting point for the iterative procedure, as explained below.
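The mean ranks in the last row of Table 1 can be reproduced with a short sketch; the tie-handling (arbitrary order) is our own assumption, since the paper does not specify it:

```python
# Mean-rank computation for a table of accuracies (rank 1 = best).

def mean_ranks(acc):
    """acc: dict dataset -> dict algorithm -> accuracy.
    Returns dict algorithm -> mean rank over all datasets."""
    totals = {}
    for row in acc.values():
        ordered = sorted(row, key=row.get, reverse=True)  # best first
        for rank, alg in enumerate(ordered, start=1):
            totals[alg] = totals.get(alg, 0) + rank
    return {alg: total / len(acc) for alg, total in totals.items()}

# First two rows of Table 1.
acc = {"abalone":     {"IB1": .197, "J48": .218, "JRip": .185,
                       "LogD": .259, "MLP": .266, "NB": .237},
       "acetylation": {"IB1": .844, "J48": .831, "JRip": .829,
                       "LogD": .745, "MLP": .609, "NB": .822}}
print(mean_ranks(acc)["MLP"])  # (1 + 6) / 2 = 3.5
```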
Step 2 - Identify the Current Best Algorithm. The global ranking CA
permits us to identify the top-ranked algorithm as our initial best candidate
algorithm a_best. In Table 1, a_best = MLP. This algorithm is then evaluated
using a CV test to establish its performance on the new dataset d_new.
Table 1. Accuracies and ranks (in parentheses) of the algorithms 1-nearest neighbor
(IB1), C4.5 (J48), RIPPER (JRip), LogisticDiscriminant (LogD), MultiLayerPercep-
tron (MLP) and naive Bayes (NB) on different datasets and their mean rank.
Datasets IB1 J48 JRip LogD MLP NB
abalone .197 (5) .218 (4) .185 (6) .259 (2) .266 (1) .237 (3)
acetylation .844 (1) .831 (2) .829 (3) .745 (5) .609 (6) .822 (4)
adult .794 (6) .861 (1) .843 (3) .850 (2) .830 (5) .834 (4)
... ... ... ... ... ... ...
Mean rank 4.05 2.73 3.17 3.74 2.54 4.78
Step 3 - Identify the Most Promising Competitor. In the next step we
identify a_k, the best competitor of a_best. To do this, all algorithms are considered
one by one, except for a_best and the eliminated algorithms (see step 4).
For each algorithm we analyze the information of past experiments (metadata)
to calculate the relative landmarks, as outlined in the previous section. As
equation 3 shows, for each a_k, we sum up all relative landmarks involving a_best,
weighted by a measure of similarity between dataset d_j and the new dataset
d_new. The algorithm a_k that achieves the highest value is the most promising
competitor in this iteration. In case of a tie, the competitor that appears first in
ranking CA is chosen.
To calculate Sim(d_new, d_j), the similarity between d_j and d_new, we have
explored four different variants, AT0, AT1, ATWs and ATx, described below.
AT0 is a baseline method which ignores dataset similarity. It always returns
a similarity value of 1, so all datasets are considered similar. This means
that the best competitor is determined by summing up the values of the relative
landmarks.
AT1 works like AT0 at the beginning, when no tests have been carried
out on d_new. In all subsequent iterations, this method estimates dataset
similarity using only the most recent CV test. Consider the algorithms listed
in Table 1 and the ranking CA. Suppose we started with algorithm MLP
as the current best candidate. Suppose also that in the next iteration LogD
was identified as the best competitor, and won from MLP in the CV test:
M(LogD, d_new) > M(MLP, d_new). Then, in the subsequent iteration, all prior
datasets d_j satisfying the condition M(LogD, d_j) > M(MLP, d_j) are considered
similar to d_new. In general terms, suppose that the last test revealed that
M(a_k, d_new) > M(a_best, d_new); then Sim(d_new, d_j) is 1 if also
M(a_k, d_j) > M(a_best, d_j), and 0 otherwise. The similarity measure determines
which RLs are taken into account when summing up their contributions to identify
the next best competitor.
Another variant of AT1 could use the difference between RL(a_k, a_best, d_new)
and RL(a_k, a_best, d_j), normalized between 0 and 1, to obtain a real-valued (non-
binary) similarity estimate Sim(d_new, d_j). In other words, d_j is more similar to
d_new if the relative performance difference between a_k and a_best is about as large
on both d_j and d_new. We plan to investigate this in our future work.
ATWs is similar to AT1, but instead of only using the last test, it uses all
CV tests carried out on the new dataset, and calculates the Laplace-corrected
ratio of corresponding results. For instance, suppose we have conducted 3 tests
on d_new, thus yielding 3 pairwise algorithm comparisons on d_new. Suppose that
2 tests had the same result on dataset d_j (i.e. M(a_x, d_new) > M(a_y, d_new) and
M(a_x, d_j) > M(a_y, d_j)); then the frequency of occurrence is 2/3, which is
adjusted by Laplace correction to obtain an estimate of probability (2 + 1)/(3 + 2).
As such, Sim(d_new, d_j) = 3/5.
ATx is similar to ATWs, but requires that all pairwise comparisons yield the
same outcome. In the example used above, it will return Sim(d_new, d_j) = 1 only
if all three comparisons lead to the same result on both datasets, and 0 otherwise.
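The four variants can be sketched as follows; the boolean encoding of duel outcomes (whether a_x beat a_y in each CV test, on d_new and on the prior dataset d_j) is our own assumption, not the authors' representation:

```python
# Sketches of the four Sim() variants. `new` and `dj` hold the outcomes
# of the same sequence of duels on d_new and on d_j, respectively.

def sim_at0(new, dj):
    return 1.0                                 # every prior dataset is similar

def sim_at1(new, dj):
    return 1.0 if new[-1] == dj[-1] else 0.0   # only the latest duel counts

def sim_atws(new, dj):
    # Laplace-corrected ratio of agreeing duels, e.g. 2 of 3 -> (2+1)/(3+2)
    agree = sum(a == b for a, b in zip(new, dj))
    return (agree + 1) / (len(new) + 2)

def sim_atx(new, dj):
    return 1.0 if new == dj else 0.0           # all duels must agree

print(sim_atws([True, True, False], [True, True, True]))  # 0.6, i.e. 3/5
```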
Step 4 - Determine which of the Two Algorithms is Better. Having
found a_k, we can now run a CV test and compare its performance with that of
a_best. The winner (which may be either the current best algorithm or the
competitor) is used as the new current best algorithm in the next round. The
losing algorithm is eliminated from further consideration.
Step 5 - Repeat the Process and Check the Stopping Criteria. The
whole process of identifying the best competitor (step 3) of a_best and determining
which one of the two is better (step 4) is repeated until a stopping criterion has
been reached. For instance, the process could be constrained to a fixed number
of CV tests: considering the results presented further on in Section 4, it would
be sufficient to run at most 20% of all possible CV tests. Alternatively, one could
impose a fixed CPU time, thus returning the best algorithm in h hours, as in
budgeted learning. In any case, until aborted, the method will keep choosing a
new competitor in each round: there will always be a next best competitor. In
this respect our system differs from, for instance, hill climbing approaches which
can get stuck in a local minimum.
Discussion - Comparison with Active Learning: The term active testing
was chosen because the approach shares some similarities with active learning [7].
The concern of both is to speed up the process of improving a given performance
measure. In active learning, the goal is to select the most informative data point
to be labeled next, so as to improve the predictive performance of a supervised
learning algorithm with a minimum of (expensive) labelings. In active testing,
the goal is to select the most informative CV test, so as to improve the prediction
of the best algorithm on the new dataset with a minimum of (expensive) CV
tests.
4 Empirical Evaluation
4.1 Evaluation Methodology and Experimental Set-up
The proposed method AT was evaluated using a leave-one-out method [18]. The
experiments reported here involve D datasets and so the whole procedure was
repeated D times. In each cycle, all performance results on one dataset were left
out for testing and the results on the remaining D−1 datasets were used as
metadata to determine the best candidate algorithm.
Fig. 1. Median loss as a function of the number of CV tests.
This study involved 292 algorithms (algorithm-parameter combinations), which
were extracted from the experiment database for machine learning (ExpDB)
[11, 22]. This set includes many different algorithms from the Weka platform
[10], which were varied by assigning different values to their most important
parameters. It includes SMO (a support vector machine, SVM), MLP (Multi-
layer Perceptron), J48 (C4.5), and different types of ensembles, including Ran-
domForest, Bagging and Boosting. Moreover, different SVM kernels were used
with their own parameter ranges and all non-ensemble learners were used as
base-learners for the ensemble learners mentioned above. The 76 datasets used
in this study were all from UCI [1]. A complete overview of the data used in
this study, including links to all algorithms and datasets, can be found online.
The main aim of the test was to prove the research hypothesis formulated
earlier: relative landmarks provide useful information for predicting the most
promising algorithms on new datasets. Therefore, we also include two baseline
methods:
TopN has been described before (e.g. [3]). It also builds a ranking of candidate
algorithms as described in step 1 (although measures other than
mean rank could be used), and only runs CV tests on the first N algorithms.
The overall winner is returned.
Rand simply selects N algorithms at random from the given set, evaluates them
using CV and returns the one with the best performance. It is repeated 10
times with different random seeds and the results are averaged.
Since our AT methods are iterative, we will restart TopN and Rand N times,
with N equal to the number of iterations (or CV tests).
To evaluate the performance of all approaches, we calculate the loss of the
currently best algorithm, defined as M(a*, d_new) − M(a_best, d_new), where a_best
represents the currently best algorithm, a* the best possible algorithm and M(.)
represents the performance measure (success rate).
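As a sketch, this loss is a simple difference of success rates (the dict layout and names are illustrative):

```python
# Loss of a recommendation: accuracy gap between the truly best
# algorithm a_star and the recommended a_best on the new dataset.

def loss(M, a_best, a_star, d_new):
    """Returns 0 when the recommendation is optimal, positive otherwise."""
    return M[(a_star, d_new)] - M[(a_best, d_new)]

M = {("svm", "dnew"): 0.875, ("j48", "dnew"): 0.750}
print(loss(M, "j48", "svm", "dnew"))  # 0.125
```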
4.2 Results
By aggregating the results over D datasets, we can track the median loss of the
recommended algorithm as a function of the number of CV tests carried out.
The results are shown in Figure 1. Note that the number of CV tests is plotted
on a logarithmic scale.
First, we see that ATWs and AT1 perform much better than AT0, which
indicates that it is indeed useful to include dataset similarity. If we consider a
particular level of loss (say 0.5%), we note that these variants require far fewer
CV tests than AT0. The results also indicate that the information associated
with relative landmarks obtained on the new dataset is indeed valuable. The
median loss curves decline quite rapidly and are always below the AT0 curve.
We also see that after only 10 CV tests (representing about 3% of all possible
tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20%
of all possible tests) the median loss is near 0.
Also note that ATWs, which uses all relative landmarks involving a_best and
d_new, does not perform much better than AT1, which only uses the most recent
CV test. This result suggests that, when looking for the most promising
competitor, the latest test is more informative than the previous ones.
Method ATx, the most restrictive approach, only considers prior datasets on
which all relative landmarks involving a_best obtained similar results. As shown in
Figure 1, this approach manages to reduce the loss quite rapidly, and competes
well with the other variants in the initial region. However, after achieving a
minimum loss on the order of 0.5%, there are no more datasets that fulfill this
restriction, and consequently no new competitor can be chosen, causing it to stop.
The other two methods, ATWs and AT1, do not suffer from this shortcoming.
Fig. 2. Median loss of AT0 and the two baseline methods.
AT0 was also our best baseline method. To avoid overloading Figure 1, we
show this separately in Figure 2. Indeed, AT0 is clearly better than the random
choice method Rand. Comparing AT0 to TopN, we cannot say that one is clearly
better than the other overall, as the curves cross. However, it is clear that TopN
loses out if we allow more CV tests, and that it is not competitive with the
more advanced methods such as AT1 and ATWs.
The curves for mean loss (instead of median loss) follow similar trends, but
the values are 1-2% worse due to outliers (see Fig. 3 for method AT1). This
figure also shows the curves associated with the 25% and 75% quartiles for AT1.
As the number of CV tests increases, the distance between the two curves
decreases and approaches the median curve. Similar behavior has been observed
for ATWs, but we omit those curves in this text.
Algorithm trace. It is interesting to trace the iterations carried out for one
particular dataset. Table 2 shows the details for method AT1, where abalone
represents the new dataset. Column 1 shows the number of the iteration (thus
the number of CV tests). Column 2 shows the most promising competitor a_k
chosen in each step. Column 3 shows the index of a_k in our initial ranking
CA, and column 4 the index of a_best, the new best algorithm after the CV test
has been performed. As such, if the values in columns 3 and 4 are the same,
then the most promising competitor has won the duel. For instance, in step
2, SMO.C.1.0.Polynomial.E.3, i.e. SVM with complexity constant set to 1.0
and a 3rd degree polynomial kernel (index 96), has been identified as the best
Fig. 3. Loss of AT1 as a function of the number of CV tests.
competitor to be used (column 2), and after the CV test, it has won against
Bagging.I.75..100.PART, i.e. Bagging with a high number of iterations (between
75 and 100) and PART as a base-learner. As such, it wins this round and
becomes the new a_best. Columns 5 and 6 show the actual rank of the competitor
and the winner on the abalone dataset. Column 7 shows the loss compared to
the optimal algorithm and the final column shows the number of datasets whose
similarity measure is 1.
We observe that after only 12 CV tests, the method has identified an algorithm
with a very small loss of 0.2%: Bagging.I.25..50.MultilayerPerceptron, i.e.
Bagging with relatively few iterations but with a MultilayerPerceptron base-learner.
Incidentally, this dataset appears to represent a quite atypical problem: the
truly best algorithm, SMO.C.1.0.RBF.G.20, i.e. SVM with an RBF kernel with
kernel width (gamma) set to 20, is ranked globally as algorithm 246 (of all 292).
AT1 identifies it after 177 CV tests.
5 Related Work in Other Scientific Areas
In this section we briefly cover some work in other scientific areas which is
related to the problem tackled here and could provide further insight into how
to improve the method.
One such area is experiment design [6], and in particular active learning.
As discussed before, the method described here follows the main trends that have
been outlined in this literature. However, there is relatively little work on active
learning for ranking tasks.

Table 2. Trace of the steps taken by AT1 in the search for the supposedly best
algorithm for the abalone dataset

CV test | Algorithm used (current best competitor, ak) | ak index in CA | new abest index in CA | ak rank on abalone | new abest rank on abalone | Loss (%) | D size
1 | Bagging.I.75..100.PART | 1 | 1 | 75 | 75 | 1.9 | 75
2 | SMO.C.1.0.Polynomial.E.3 | 96 | 96 | 56 | 56 | 1.6 | 29
3 | AdaBoostM1.I.10.MultilayerPerceptron | 92 | 92 | 47 | 47 | 1.5 | 34
4 | Bagging.I.50..75.RandomForest | 15 | 92 | 66 | 47 | 1.5 | 27
... | ... | ... | ... | ... | ... | ... | ...
10 | LMT | 6 | 6 | 32 | 32 | 1.1 | 45
11 | LogitBoost.I.10.DecisionStump | 81 | 6 | 70 | 32 | 1.1 | 51
12 | Bagging.I.25..50.MultilayerPerceptron | 12 | 12 | 2 | 2 | 0.2 | 37
13 | LogitBoost.I.160.DecisionStump | 54 | 12 | 91 | 2 | 0.2 | 42
... | ... | ... | ... | ... | ... | ... | ...
177 | SMO.C.1.0.RBF.G.20 | 246 | 246 | 1 | 1 | 0 | 9

One notable exception is [15], who use the notion
of Expected Loss Optimization (ELO). Another work in this area is [4], whose
aim was to identify the most interesting substances for drug screening using
a minimum number of tests. In the experiments described, the authors have
focused on the top-10 substances. Several different strategies were considered
and evaluated. Our problem here is not ranking, but rather simply finding the
best item (algorithm), so this work is only partially relevant.
Another relevant area is the so called multi-armed bandit problem (MAB)
studied in statistics and machine learning [9, 16]. This problem is often formu-
lated in a setting that involves a set of traditional slot machines. When a partic-
ular lever is pulled, a reward is provided from a distribution associated with that
specific lever. The bandit problem is formally equivalent to a one-state Markov
decision process. The aim is to minimize regret after T rounds, which is defined
as a difference between the reward sum associated with an optimal strategy and
the sum of collected rewards. Indeed, pulling a lever can be compared to carrying
out a CV test on a given algorithm. However, there is one fundamental difference
between MAB and our setting: whereas in MAB the aim is to maximize the sum
of collected rewards, our aim is to maximize a single reward, namely the one
associated with identifying the best algorithm. So again, this work is only partially
relevant.
To the best of our knowledge, no other work in this area has addressed the
issue of how to select a suitable algorithm from a large set of candidates.
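The distinction drawn above, cumulative reward in MAB versus the single reward of identifying the best algorithm, can be made concrete with a toy computation. This is purely illustrative; the reward values and function names are hypothetical, and in practice the recommendation would come from empirical estimates rather than the true means used here for clarity.

```python
# Toy illustration: MAB minimizes cumulative regret over all pulls, whereas
# algorithm selection cares only about the final (simple) regret of the one
# item we end up recommending.

def regrets(mean_rewards, pulls):
    """mean_rewards: true mean reward per arm; pulls: sequence of arm indices.
    Returns (cumulative regret, simple regret of the recommended arm)."""
    best = max(mean_rewards)
    # MAB objective: every suboptimal pull adds to the regret.
    cumulative = sum(best - mean_rewards[i] for i in pulls)
    # Our objective: only the final recommendation matters (here we simply
    # recommend the best arm among those that were pulled).
    recommended = max(set(pulls), key=lambda i: mean_rewards[i])
    simple = best - mean_rewards[recommended]
    return cumulative, simple
```

A strategy that pulls many weak arms early can thus have a large cumulative regret yet still achieve zero simple regret, which is exactly the regime active testing operates in.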
6 Significance and Impact
In this paper we have addressed the problem of selecting the best classification
algorithm for a specific task. We have introduced a new method, called active
testing, that exploits information concerning past evaluation results (metadata),
to recommend the best algorithm using a limited number of tests on the new
dataset.
Starting from an initial ranking of algorithms on previous datasets, the
method runs additional CV evaluations to test several competing algorithms
on the new dataset. However, the aim is to reduce the number of tests to a mini-
mum. This is done by carefully selecting which tests should be carried out, using
the information of both past and present algorithm evaluations represented in
the form of relative landmarks.
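A relative landmark is simple to write down; the following minimal sketch assumes a hypothetical `perf` table standing in for the stored metadata of past evaluations.

```python
# A relative landmark records the outcome of an 'algorithm duel' on a prior
# dataset: the signed performance difference between two algorithms.
# The perf dictionary and names below are illustrative, not the paper's API.

def relative_landmark(perf, a1, a2, dataset):
    """perf[(algorithm, dataset)]: stored CV accuracy from the metadata.
    Positive result means a1 beat a2 on that dataset."""
    return perf[(a1, dataset)] - perf[(a2, dataset)]
```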
In our view this method incorporates several innovative features. First, it
is an iterative process that uses the information in each CV test to find the
most promising next test based on a history of prior ‘algorithm duels’. In a
tournament-style fashion, it starts with a current best (parameterized) algo-
rithm, selects the most promising rival algorithm in each round, evaluates it on
the given problem, and eliminates the algorithm that performs worse. Second, it
continually focuses on the most similar prior datasets: those where the algorithm
duels had a similar outcome to those on the new dataset.
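The tournament-style process described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the helper names (`cv_accuracy`, `prior_duels`) are hypothetical, and the similarity update is a crude stand-in for the paper's AT1/ATWs weighting schemes.

```python
# Sketch of the active-testing loop: start from the top-ranked algorithm,
# repeatedly test the most promising rival, and keep the winner of each duel.

def active_testing(algorithms, prior_duels, cv_accuracy, max_tests):
    """algorithms: list ordered by the initial global ranking.
    prior_duels[(a, b)][d]: accuracy difference a - b on prior dataset d.
    cv_accuracy(a): runs a CV test of algorithm a on the new dataset."""
    best = algorithms[0]
    tested = {best: cv_accuracy(best)}
    # Every prior dataset starts out equally similar to the new one.
    weight = {d: 1.0 for duels in prior_duels.values() for d in duels}

    for _ in range(max_tests - 1):
        rivals = [a for a in algorithms if a not in tested]
        if not rivals:
            break

        # Expected gain of a rival over the current best, estimated from
        # prior duels and weighted by current dataset similarity.
        def expected_gain(a):
            duels = prior_duels.get((a, best), {})
            return sum(weight[d] * max(diff, 0.0) for d, diff in duels.items())

        challenger = max(rivals, key=expected_gain)
        tested[challenger] = cv_accuracy(challenger)

        # Similarity update (AT1-style simplification): keep only the prior
        # datasets on which this duel had the same winner as on the new one.
        challenger_won = tested[challenger] > tested[best]
        for d, diff in prior_duels.get((challenger, best), {}).items():
            if (diff > 0) != challenger_won:
                weight[d] = 0.0
        if challenger_won:
            best = challenger
    return best
```

For example, with three algorithms where prior duels suggest B usually beats A, the loop tests B first and keeps it once the CV test confirms the win.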
Four variants of this basic approach, differing in their definition of algorithm
similarity, were investigated in an extensive experimental setup involving
292 algorithm-parameter combinations on 76 datasets. Our results
show that the versions ATWs and AT1 in particular provide good recommendations
using a small number of CV tests. When the median loss is plotted as a function
of the number of CV tests (Fig. 1), both outperform all other variants
and the baseline methods. They also outperform AT0, indicating that algorithm
similarity is an important aspect.
We also see that after only 10 CV tests (representing about 3% of all possible
tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20%
of all possible tests) the median loss is near 0. Similar trends can be observed
when considering mean loss.
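The loss figures quoted here can be computed as in the following minimal sketch, where the loss after t CV tests is the accuracy gap between the best algorithm found within those tests and the truly best algorithm on the dataset. The accuracy values in the example are illustrative, not taken from the paper's experiments.

```python
def loss_curve(test_accuracies, optimum):
    """test_accuracies: CV accuracies in the order the tests were run;
    optimum: accuracy of the truly best algorithm on this dataset.
    Returns the loss (in %) after each CV test, as plotted in Fig. 3."""
    curve, best_so_far = [], float("-inf")
    for acc in test_accuracies:
        best_so_far = max(best_so_far, acc)
        curve.append(100.0 * (optimum - best_so_far))
    return curve
```

The curve is non-increasing by construction: each additional CV test can only improve (or keep) the best algorithm found so far.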
The results support the hypothesis formulated at the outset of our work:
relative landmarks are indeed informative and can be used to
suggest the best contender. If this procedure is applied iteratively, it can
accurately recommend a classification algorithm after a very limited number
of CV tests.
Still, we believe that the results could be improved further. Classical information-
theoretic measures and/or sampling landmarks could be incorporated into the
process of identifying the most similar datasets; this forms part of our future
plans.
References
1. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
2. B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking
various learning algorithms. In Proceedings of the 17th Int. Conf. on Machine
Learning (ICML-2000), Stanford, CA, 2000.
3. P. Brazdil, C. Soares, and J. Costa. Ranking learning algorithms: Using IBL and
meta-learning on accuracy and time results. Machine Learning, 50:251–277, 2003.
4. K. De Grave, J. Ramon, and L. De Raedt. Active learning for primary drug
screening. In Proceedings of Discovery Science. Springer, 2008.
5. J. Demsar. Statistical comparisons of classifiers over multiple data sets. The
Journal of Machine Learning Research, 7:1–30, 2006.
6. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
7. Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query
by committee algorithm. Machine Learning, 28:133–168, 1997.
8. Johannes Fürnkranz and Johann Petrak. An evaluation of landmarking variants. In
Proceedings of the ECML/PKDD Workshop on Integrating Aspects of Data Mining,
Decision Support and Meta-Learning (IDDM-2001), pages 57–68. Springer, 2001.
9. J. Gittins. Multi-armed bandit allocation indices. In Wiley Interscience Series in
Systems and Optimization. John Wiley & Sons, Ltd., 1989.
10. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,
and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explor.
Newsl., 11(1):10–18, 2009.
11. H. Blockeel. Experiment databases: A novel methodology for experimental research.
In Lecture Notes in Computer Science 3933. Springer, 2006.
12. J. Fürnkranz and J. Petrak. An evaluation of landmarking variants. In C. Giraud-Carrier,
N. Lavrac, and S. Moyle, editors, Working Notes of the ECML/PKDD 2001 Workshop
on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, 2001.
13. R. Leite and P. Brazdil. Predicting relative performance of classifiers from sam-
ples. In ICML ’05: Proceedings of the 22nd international conference on Machine
learning, pages 497–503, New York, NY, USA, 2005. ACM Press.
14. Rui Leite and Pavel Brazdil. Active testing strategy to predict the best classifica-
tion algorithm via sampling and metalearning. In Proceedings of the 19th European
Conference on Artificial Intelligence - ECAI 2010, 2010.
15. B. Long, O. Chapelle, Y. Zhang, Y. Chang, Z. Zheng, and B. Tseng. Active learning
for rankings through expected loss optimization. In Proceedings of the SIGIR’10.
ACM, 2010.
16. A. Mahajan and D. Teneketzis. Multi-armed bandit problems. In D. A. Castanon,
D. Cochran, and K. Kastella, editors, Foundations and Applications of Sensor
Management. Springer-Verlag, 2007.
17. D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and
Statistical Classification. Ellis Horwood, 1994.
18. Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
19. John R. Rice. The algorithm selection problem. In Advances in Computers,
volume 15, pages 65–118. Elsevier, 1976.
20. Kate A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algo-
rithm selection. ACM Comput. Surv., 41(1):1–25, 2008.
21. Carlos Soares, Johann Petrak, and Pavel Brazdil. Sampling-based relative land-
marks: Systematically test-driving algorithms before choosing. In Proceedings of
the 10th Portuguese Conference on Artificial Intelligence (EPIA 2001), pages 88–
94. Springer, 2001.
22. J. Vanschoren and H. Blockeel. A community-based platform for machine learning
experimentation. In Machine Learning and Knowledge Discovery in Databases,
European Conference, ECML PKDD 2009, volume LNCS 5782, pages 750–754.
Springer, 2009.
23. Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning.
Artif. Intell. Rev., 18(2):77–95, 2002.
Currently many classification algorithms exist and no algorithm exists that would outperform all the others. Therefore it is of interest to determine which classification algorithm is the best one for a given task. Although direct comparisons can be made for any given problem using a cross-validation evaluation, it is desirable to avoid this, as the computational costs are significant. We describe a method which relies on relatively fast pairwise comparisons involving two algorithms. This method is based on a previous work and exploits sampling landmarks, that is information about learning curves besides classical data characteristics. One key feature of this method is an iterative procedure for extending the series of experiments used to gather new information in the form of sampling landmarks. Metalearning plays also a vital role. The comparisons between various pairs of algorithm are repeated and the result is represented in the form of a partially ordered ranking. Evaluation is done by comparing the partial order of algorithm that has been predicted to the partial order representing the supposedly correct result. The results of our analysis show that the method has good performance and could be of help in practical applications.