A Proposed Meta-learning Framework
for Algorithm Selection Utilising
Regression-based Landmarkers
Technical Report Number 569
May 2005
Daren Ler, Irena Koprinska and Sanjay Chawla
ISBN 1 86487 721 9
School of Information Technologies
University of Sydney NSW 2006
A Proposed Meta-learning Framework for Algorithm Selection Utilising
Regression-based Landmarkers
Daren Ler    LER@IT.USYD.EDU.AU
Irena Koprinska    IRENA@IT.USYD.EDU.AU
Sanjay Chawla    CHAWLA@IT.USYD.EDU.AU
School of Information Technologies, University of Sydney, NSW 2006, Australia
Abstract
In this paper, we present a framework for meta-
learning that adopts the use of regression-based
landmarkers. Each such landmarker exploits the
correlations between the various patterns of
performance for a given set of algorithms so as
to construct a regression function that represents
the pattern of performance of one algorithm from
that set. The idea is that the independents utilised by these regression functions – i.e. landmarkers – correspond to the performance of a subset of the given algorithms. In this manner, we may control the number of algorithms being landmarked; the more that are landmarked, the fewer independents (i.e. the less evidence) we have to make those approximations, and the less accurate the landmarkers are. We investigate the ability of
such landmarkers in combination with meta-
learners to learn how to predict the most accurate
algorithm from a given set. While our results show that the accuracy of the meta-learning solutions increases as the quality of the meta-attributes improves (i.e. when fewer algorithm performance measurements are landmarked and more are instead evaluated as independents), we find that, in general, the results are still poor. However, we
find that when a simple sorting mechanism is
instead employed, the results are quite
promising.
1. Introduction
The selection of the most adequate learning algorithm for
a given dataset is an important problem. If unlimited time
were available to make this decision, hold-out testing
(e.g. cross-validation or bootstrapping) could be used to
evaluate the performance of all applicable algorithms and
thus determine which should be utilised – e.g. (Schaffer,
1993). However, such evaluation is computationally
infeasible due to the large number of available
algorithms. To overcome this limitation, various
algorithm selection methods have been proposed.
Typically referred to as a kind of meta-learning (Giraud-
Carrier et al., 2004; Vilalta & Drissi, 2002), these
algorithm selection solutions utilise experience on
previous datasets (i.e. meta-knowledge) to learn how to
characterise the areas of expertise of the candidate
algorithms. (Given the set of all possible datasets, these
domains of expertise correspond to subsets in which
certain algorithms are deemed to be superior to others.)
Predominantly, such solutions involve the mapping of
dataset characteristics to the domains of expertise of some
set of candidate algorithms.
Recently, the concept of landmarking (Fürnkranz &
Petrak, 2001; Ler et al., 2004a; Pfahringer et al., 2000)
has emerged as a technique that characterises a dataset by
directly measuring the performance of simple and fast
learning algorithms, called landmarkers.
In (Ler et al., 2004b), we proposed landmarker selection criteria based on efficiency and correlativity, and, based on these criteria, a landmarker generation approach. This approach
exploits the correlations between the various patterns of
performance of a given set of algorithms to construct
landmarkers that each correspond to a regression function,
with the independents for these regression functions
consisting of the performance measurements of a subset
of the given algorithms. Accordingly, we refer to these as
regression-based landmarkers.
In this paper, we consider the use of these landmarkers in
a meta-learning framework for algorithm selection; we
evaluate and discuss the proposed framework on the
algorithm selection task of predicting the most accurate
candidate algorithm from a given set.
2. Meta-learning via Landmarking
Various meta-learning approaches have been proposed to
perform algorithm selection (Aha, 1992; Brazdil et al.,
2003; Gama & Brazdil, 1995; Kalousis and Hilario, 2001;
Lindner & Studer, 1999; Michie et al., 1994; Pfahringer et
al., 2000; Todorovski et al., 2002). The predominant
strategy is to describe learning tasks in terms of a set of
meta-attributes and classify them based on some aspect of
the performance of the set of candidate algorithms (e.g.
which among the candidate algorithms will give the best
performance on the learning problem in question).
To date, three types of meta-attributes have been
suggested: (i) dataset characteristics, including basic,
statistical and information theoretic measurements
(Brazdil et al., 2003; Gama & Brazdil, 1995; Kalousis and
Hilario, 2001; Lindner & Studer, 1999; Michie et al.,
1994); (ii) properties of induced classifiers over the
dataset in question (Bensusan, 1998; Peng et al., 2002);
and (iii) measurements that represent the performance or other output of classifiers that are representative of the candidate algorithms, i.e. landmarkers (Fürnkranz & Petrak, 2001;
Ler et al., 2004a; Pfahringer et al., 2000).
Correspondingly, the types of algorithm selection
problems that have been suggested include: (i) classifying
an algorithm as appropriate or inappropriate on the
learning task in question (given that an algorithm is
appropriate if it is not considered worse than the best
performing candidate algorithm) (e.g. see Gama &
Brazdil, 1995; Michie et al., 1994), (ii) classifying which
of two specific candidate algorithms is superior (e.g. see
Fürnkranz & Petrak, 2001; Pfahringer et al., 2000), (iii)
classifying the best performing algorithm for the learning
task in question (e.g. see Pfahringer et al., 2000), and (iv)
classifying the set of rankings for all the candidate
algorithms (e.g. see Brazdil et al., 2003).
In this paper, we focus primarily on the type of meta-
attributes used for algorithm selection problems, and in
particular, on solutions employing landmarkers.
2.1 Landmarkers and Landmarking
Traditionally, a landmarker is associated with a single
algorithm with low computational complexity. The
general idea is that the performance of a learning
algorithm on a dataset reveals some characteristics of that
dataset. However, in the initial landmarking work
(Fürnkranz & Petrak, 2001; Pfahringer et al., 2000),
despite the presence of two landmarker criteria (i.e.
efficiency and bias diversity), no actual mechanism for
generating appropriate landmarkers was defined, and the
choice of landmarkers was made in an ad hoc fashion.
Subsequently, Fürnkranz & Petrak (2001) proposed to
generate landmarkers using: (i) the candidate algorithms
themselves, but only on a sub-sample of the given data
(called sampling-based landmarkers), (ii) the relative
performance of each pair of candidate algorithms (called
relative landmarkers), and (iii) a combination of both (i)
and (ii). However, we note that to compute relative
landmarkers (in the absence of sampling), we are required
to evaluate the performance of the candidate algorithms
themselves, making the meta-learning task(s) redundant.
Also, when adopting sampling-based landmarkers, the
question of appropriate sample size is difficult to solve –
one might even assume that some sub-samples might not
be indicative enough of the learning task in question and
thus not capture the right dataset characteristics.
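To make these two variants concrete, the following sketch (in Python; purely illustrative and not code from the cited work) shows one possible reading of a sampling-based landmarker and of a relative landmarker built on top of it; the helper names, the scikit-learn usage and the particular sample sizes are assumptions made only for illustration.

```python
# Illustrative sketch only (not code from the cited work): one possible reading
# of sampling-based and relative landmarkers.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def sampling_based_landmarker(clf, X, y, sample_frac=0.1, random_state=0):
    """Accuracy of `clf` trained and tested on a small sub-sample of the data."""
    # keep only a small, stratified fraction of the data
    X_s, _, y_s, _ = train_test_split(X, y, train_size=sample_frac,
                                      stratify=y, random_state=random_state)
    # simple hold-out estimate on the sub-sample
    X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.3,
                                              random_state=random_state)
    return clone(clf).fit(X_tr, y_tr).score(X_te, y_te)

def relative_landmarker(clf_a, clf_b, X, y, **kwargs):
    """+1 if clf_a's sub-sample estimate beats clf_b's, -1 if worse, 0 if tied."""
    diff = (sampling_based_landmarker(clf_a, X, y, **kwargs)
            - sampling_based_landmarker(clf_b, X, y, **kwargs))
    return int(np.sign(diff))
```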
The fundamental difficulty with landmarkers is that we
must find algorithms that are: (i) efficient (i.e. more
efficient than the set of candidate algorithms), and (ii)
able to describe the datasets so that the regions of
expertise of the set of candidate algorithms are well
represented. The sampling-based and relative landmarkers
proposed in (Fürnkranz & Petrak, 2001) take a step closer
toward this goal since those landmarkers would
potentially map similar regions of expertise.
2.2 Regression-based Landmarkers
Consequently, in (Ler et al., 2004a), we proposed
alternate landmarker selection criteria (i.e. efficiency and
correlativity) and correspondingly proposed landmarkers based on regression estimators whose independents
correspond to a subset of the set of candidate algorithms.
Essentially, we wish the chosen landmarkers (i.e. meta-
attributes) to capture the patterns of performance of the
given set of candidate algorithms; in other words, to be
correlated to the fluctuations in performance of the
candidate algorithms. As such, we propose that each
candidate algorithm be represented by a regression
function that would be indicative of the performance of
the candidate algorithm being landmarked. Further, in
order to preserve efficiency (and thus the benefit of this
type of meta-learning) we require that the independents of
these regression functions correspond to a subset of the
candidate algorithms. In this manner, we utilise meta-
knowledge regarding the correlativity between the
candidate algorithms to infer the performance of the
subset that is not evaluated.
More specifically, given a set of candidate algorithms A = {a_1, …, a_m} and a set of datasets S = {s_1, …, s_n}, let the pattern of performance of each a_i ∈ A be the vector PP(a_i) = {performance(a_i, s_1), …, performance(a_i, s_n)}, which describes the performance of a_i over each s_j ∈ S. Thus, the landmarker for an algorithm a_j is an estimate of PP(a_j) based on some (regression) function ƒ(a_k | a_k ∈ A′), where A′ ⊆ A \ {a_j}. Consequently, we only require landmarkers for a subset of A, as the complement set is actually evaluated and used by the landmarkers. As such, depending on the amount of computation we wish to save, we may adjust the sizes of either set; the more computation we save (i.e. the fewer algorithm performance measurements that are evaluated), the less evidence we provide for the landmarkers and the less accurate the predictions of the performance of the corresponding landmarked candidate algorithms. This conceptualisation of landmarking is depicted in Figure 1.
Thus far, we have only utilised linear regression
functions. This is because they represent the simplest
relationships between the patterns of performance
possible, and because they are relatively inexpensive.
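As an illustration of this idea, the sketch below (Python; a minimal sketch under our own simplifying assumptions, not the implementation used in our experiments) fits one linear-regression landmarker per algorithm in A \ A′, using the accuracies of the evaluated subset A′ over the corpus S as independents; the names perf, evaluated_idx and the two helper functions are hypothetical.

```python
# Illustrative sketch only (hypothetical helpers, not the experimental code):
# fitting linear regression-based landmarkers from past performance scores.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_landmarkers(perf, evaluated_idx):
    """`perf` is an (n_datasets x m_algorithms) array of accuracies over the
    corpus S. For each algorithm outside `evaluated_idx` (i.e. in A \\ A'),
    fit a linear regression that predicts its pattern of performance from the
    performances of the evaluated subset A'."""
    evaluated_idx = list(evaluated_idx)
    X = perf[:, evaluated_idx]                      # independents: PP of A'
    landmarkers = {}
    for j in range(perf.shape[1]):
        if j in evaluated_idx:
            continue
        landmarkers[j] = LinearRegression().fit(X, perf[:, j])  # target: PP(a_j)
    return landmarkers

def estimate_performance(landmarkers, evaluated_scores, evaluated_idx, n_algs):
    """Combine the evaluated scores (for A') with the landmarker estimates
    (for A \\ A') into one full vector of performance values."""
    full = np.empty(n_algs)
    full[list(evaluated_idx)] = evaluated_scores
    x = np.asarray(evaluated_scores).reshape(1, -1)
    for j, reg in landmarkers.items():
        full[j] = reg.predict(x)[0]
    return full
```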
Figure 1. The proposed regression-based landmarkers. For the set of candidate algorithms A, only the algorithms in the subset A \ A′ require landmarkers; the patterns of performance from the algorithms in the complement subset A′ serve as potential independents for the regression functions ƒ(a_k | a_k ∈ A′) that estimate the patterns of performance of the landmarked algorithms (i.e. ƒ(a_k | a_k ∈ A′) ≈ PP(a_i)).
The actual generation method is discussed in greater
detail in (Ler et al., 2004a, 2004b), while a more efficient
hill-climbing version of the proposed landmarker
generation algorithm is described in (Ler et al., 2005).
2.3 Meta-learning via Regression-based Landmarkers
Landmarkers by themselves may provide adequate meta-
knowledge regarding the regions of expertise of the
candidate algorithms, as shown in (Fürnkranz & Petrak,
2001; Ler et al., 2004a, 2004b; Pfahringer et al., 2000).
However, the quality of such results could potentially be
further enhanced by applying them within a meta-learning
framework.
In essence, the proposed meta-learning framework is
expected to perform the following: (a) the base-learning
task(s) – learn to approximate the patterns of performance
of a subset of candidate algorithms via regression-based
landmarkers, which use (i.e. whose independents are) the
evaluated performance scores from the complement
subset, (b) the meta-learning task(s) – learn to map the
various evaluated and estimated algorithm performance
measurements to classes concerning the comparisons
between the candidate algorithms (e.g. the targets described in Section 2, such as the best performing algorithm among the set of candidate algorithms, or the superiority of algorithm a_1 versus a_2). The proposed meta-learning framework is depicted in Figure 2.
For both the base and meta-learning (algorithm selection)
tasks, more than one task may be defined. For the base-
learning task, the number of tasks corresponds to the
number of candidate algorithms whose performance we
wish to estimate (i.e. to landmark). The number of meta-
learning tasks on the other hand, depends on how we
decide to structure the algorithm selection solution. For
example, if we decide to learn the superiority between
each pair of candidate algorithms, then C(|A|, 2) meta-learning
tasks must be solved; alternatively, should we decide to
simply learn which algorithm is overall superior, then we
could adopt a single meta-learning task.
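For concreteness, with the six candidate algorithms used later in our experiments, the pairwise formulation would give the following number of meta-learning tasks (a trivial arithmetic check only):

```python
from math import comb

# One meta-learning task per pair of candidate algorithms.
n_candidates = 6
print(comb(n_candidates, 2))  # -> 15 pairwise meta-learning tasks
# The alternative formulation (predicting the overall best algorithm directly)
# requires only a single meta-learning task.
```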
Obviously, the actual learning mechanism utilised for any
base or meta-learning (algorithm selection) task may
correspond to any applicable learning solution, including
(model combination) meta-learning ones (e.g. stacking).
For the sake of clarity, we abstract the learner involved,
and reserve the term meta-learner (i.e. ML in Figure 2) for
solution(s) to the meta-learning task(s).
Following Figure 2, given a performance evaluation mechanism (e.g. stratified ten-fold cross-validation) that computes one or more performance measurements (e.g. accuracy, precision, recall, F-score, etc.), we may obtain
the raw meta-data characterising the dataset under
scrutiny in terms of the set of candidate algorithms. This
raw meta-data may then be used to generate the meta-
class MC and (indirectly) the meta-attributes MA for each
algorithm selection meta-learning task. MA are generated
via the base-learning tasks solved by the generated
regression-based landmarkers.
To utilise the solutions (i.e. meta-classifiers) on a new dataset s_new, we require that: (i) the performance of the algorithms in A′ be evaluated on s_new; (ii) these measurements (i.e. the performance(a_k, s_new) score for each a_k in A′) then be used to infer the approximate performance values of the remaining algorithms; and finally (iii) all the performance measurements and approximations (or some subset of them) be input into the meta-classifiers to generate the meta-classes, or rather, the algorithm selection predictions.
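The three steps above can be summarised by the following sketch (Python; illustrative only and not the code used in our experiments), where evaluate_accuracy, candidate_algorithms, landmarkers (one fitted regressor per algorithm in A \ A′), evaluated_idx and a trained meta_classifier are assumed to be available:

```python
# Illustrative sketch only: applying the framework to a new dataset s_new,
# following steps (i)-(iii) above. All names are assumed/hypothetical.
import numpy as np

def select_algorithm(s_new, candidate_algorithms, evaluated_idx,
                     landmarkers, meta_classifier, evaluate_accuracy):
    n_algs = len(candidate_algorithms)
    meta_attributes = np.empty(n_algs)
    # (i) evaluate only the algorithms in A' on s_new
    evaluated = [evaluate_accuracy(candidate_algorithms[k], s_new)
                 for k in evaluated_idx]
    meta_attributes[list(evaluated_idx)] = evaluated
    # (ii) infer the performance of the landmarked algorithms A \ A'
    x = np.asarray(evaluated).reshape(1, -1)
    for j, reg in landmarkers.items():
        meta_attributes[j] = reg.predict(x)[0]
    # (iii) the meta-classifier maps the (evaluated + estimated) scores to the
    # meta-class, here the index of the predicted most accurate algorithm
    best_idx = int(meta_classifier.predict(meta_attributes.reshape(1, -1))[0])
    return candidate_algorithms[best_idx]
```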
3. Experimental Setup
For our experiments we utilise a set of candidate
algorithms (i.e. A) consisting of 6 classification learning
algorithms from WEKA (Witten & Frank, 2000) (i.e.
naive Bayes – N.Bayes, k-nearest neighbour (with k = 5) –
IB5, a polynomial kernel SVM – SMO, an RBF kernel
SVM – SMO-R, a WEKA implementation of C4.5 – J4.8,
and Ripper – JRip); and as our corpus of datasets (i.e. S)
80 classification datasets from the UCI repository (Blake
& Merz, 1998). To evaluate the performance of each
candidate algorithm on each dataset (i.e. as our
performance evaluation mechanism), accuracy based on
stratified ten-fold cross-validation was employed.
The effectiveness of the proposed meta-learning
framework is then evaluated using the leave-one-out
cross-validation approach. This corresponds to n-fold
cross-validation, where n is the number of instances,
which in our case is 80, each pertaining to one UCI
dataset.
For each fold we use 79 of the datasets to generate the regression-based landmarkers as described in Section 2.2. The resultant set of landmarkers indicates which algorithms must be evaluated and which will be estimated (i.e. A′ and A \ A′ respectively). Recall from Section 2.2
that we may vary the number of candidate algorithm
performance measurements that are estimated by the
regression-based landmarkers and thus, the number of
candidate algorithm performance measurements that are
actually evaluated. Given |A| = 6, we may generate the regression-based landmarkers utilising a subset of independents whose size ranges from 1 to 5 (i.e. 1 ≤ |A′| < 6). Additionally, we evaluate the outputs from two different regression-based landmarker sets, each generated utilising one of two different criteria (i.e. r² and r² + efficiency.gained – see (Ler et al., 2004b) for details).
Thus, in our experiments, we generate 10 sets of meta-attributes (i.e. MA_r2,1, …, MA_r2,5 and MA_r2+EG,1, …, MA_r2+EG,5), each utilising the outputs from the regression-based landmarkers generated based on one of the two criteria, and using one of the available independent sets.
As our meta-learning task, we attempt to map the meta-
attributes to a meta-class (i.e. MC) indicating the
candidate algorithm with the highest accuracy. More
specifically, each instance in our meta-learning problem
consists of the performance evaluations/estimations of the
6 candidate algorithms on one of the UCI datasets (MA),
and the index of the candidate algorithm that attained the
highest accuracy on that dataset (MC). It should be noted
that the determination of this best performing algorithm is done in a simplistic fashion, by directly comparing the
stratified ten-fold cross-validation accuracies of the
candidate algorithms.
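To make the construction of the meta-level data explicit, the following sketch (Python; illustrative only) builds one meta-instance per UCI dataset from an assumed 80 x 6 accuracy_matrix of stratified ten-fold cross-validation accuracies, with the meta-class obtained by the simple comparison just described; the leave-one-out bookkeeping discussed below is omitted for brevity.

```python
# Illustrative sketch only: building the meta-level dataset (MA, MC).
# `accuracy_matrix` is an assumed (80 UCI datasets x 6 algorithms) array of
# stratified ten-fold cross-validation accuracies; `landmarkers` and
# `evaluated_idx` are as in the earlier sketches. The leave-one-out
# bookkeeping (retraining landmarkers without the test dataset) is omitted.
import numpy as np

def build_meta_dataset(accuracy_matrix, evaluated_idx, landmarkers):
    MA, MC = [], []
    for row in accuracy_matrix:
        evaluated = row[list(evaluated_idx)]        # evaluated scores (A')
        meta_attrs = row.astype(float).copy()
        x = evaluated.reshape(1, -1)
        for j, reg in landmarkers.items():          # estimated scores (A \ A')
            meta_attrs[j] = reg.predict(x)[0]
        MA.append(meta_attrs)                       # meta-attributes
        MC.append(int(np.argmax(row)))              # meta-class: most accurate
    return np.vstack(MA), np.asarray(MC)
```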
For each (leave-one-out) fold, we thus have 5 datasets for each criterion, one for each meta-attribute set/meta-class pairing (i.e. (MA_1, MC), …, (MA_5, MC)), and thus 10 datasets in total.
For potential meta-learning algorithms, we employ the
same 6 WEKA algorithms, and the default class (i.e. the
ZeroR WEKA algorithm). This means that we have 7
meta-classifiers for each dataset, and thus, a total of 35
classifiers per (leave-one-out) fold.
These classifiers are then tested on the instance (i.e.
representing the UCI dataset) that was left out. Note that
to obtain either some MA_r2,i or MA_r2+EG,i for this instance,
Figure 2. The proposed meta-learning framework that adopts the regression-based landmarkers. The landmarkers are employed to generate part of the meta-attributes MA, which are then combined with a meta-class MC to form the meta-learning task. The figure depicts the following components: a performance evaluation mechanism (e.g. stratified 10-fold cross-validation) applied to the corpus of datasets S and the set of candidate algorithms A to produce the patterns of performance {PP(a_1), …, PP(a_m)}; a landmarker generation mechanism (e.g. the proposed method of generating regression-based landmarkers) that defines the base-learning task, so that the meta-attributes MA comprise the approximated PP(a_i) for each (landmarked) a_i ∈ A \ A′ and the true PP(a_j) for each a_j ∈ A′; a meta-class generation mechanism (e.g. computing the candidate algorithm with the highest accuracy) that yields the meta-class MC for each s_x ∈ S; and a meta-learning algorithm ML (i.e. some learning solution to map MA to MC). Note that we may decide to formulate more than one meta-learning task (e.g. one task for each candidate algorithm pair comparison).
we use the performance measurements obtained via
evaluation (for the algorithms in the respective A’) and
estimation (for the algorithms in the corresponding A \
A’). In the latter case, it should further be noted that the
performance measurements of the current test instance
(i.e. UCI dataset) were not used to train the regression-
based landmarkers that are used.
Let best.acc(i) and worst.acc(i) denote the stratified ten-
fold cross-validation accuracies of the most and least
accurate candidate algorithms respectively on the test
dataset used in fold i. Also, let prediction.acc(i) be the
accuracy of the algorithm that is predicted by the meta-
classifier to be the most accurate candidate algorithm on
the test dataset used in fold i.
To grade the success of these classifiers, we measure the
following:
1. Classification accuracy (Acc): the proportion of
test instances (i.e. leave-one-out folds) in which
the classifier made a correct prediction of the
most accurate algorithm.
2. Average rank (Rank): the average rank of the
candidate algorithm predicted to be most
accurate (i.e. an indication of the rank of the
algorithm predicted as the most accurate).
3. E[prediction.acc(i) – best.acc(i)] for i = 1 .. 80
(PtoB): the mean difference between the
accuracy of the predicted most accurate
candidate algorithm and actual most accurate
algorithm over all the test datasets.
4. E[prediction.acc(i) – worst.acc(i)] for i = 1 .. 80
(PtoW): the mean difference between the
accuracy of the predicted most accurate
candidate algorithm and actual least accurate
algorithm over all the test datasets.
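A sketch of how these four measures could be computed over the 80 leave-one-out folds is given below (Python; illustrative only, with ties broken arbitrarily); true_acc and predicted are assumed arrays holding, respectively, the cross-validation accuracies of all candidate algorithms per fold and the index of the algorithm predicted to be most accurate.

```python
# Illustrative sketch only: the four measures above, computed over the 80
# leave-one-out folds. `true_acc[i]` holds the ten-fold cross-validation
# accuracies of all candidate algorithms on the test dataset of fold i, and
# `predicted[i]` is the index of the algorithm the meta-classifier picked.
import numpy as np

def grade(true_acc, predicted):
    true_acc = np.asarray(true_acc, dtype=float)   # shape (n_folds, n_algorithms)
    predicted = np.asarray(predicted)              # shape (n_folds,)
    best = true_acc.max(axis=1)
    worst = true_acc.min(axis=1)
    pred_acc = true_acc[np.arange(len(predicted)), predicted]
    # rank 1 = most accurate on that fold's test dataset
    ranks = (true_acc > pred_acc[:, None]).sum(axis=1) + 1
    return {
        "Acc":  100.0 * np.mean(predicted == true_acc.argmax(axis=1)),
        "Rank": ranks.mean(),
        "PtoB": (pred_acc - best).mean(),   # <= 0; closer to 0 is better
        "PtoW": (pred_acc - worst).mean(),  # >= 0; larger is better
    }
```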
4. Results and Discussion
The results of the experiments are presented in Tables 1 through 5. Each table reports the success of the
various meta-learners trained using meta-attributes
generated via a specific number of regression-based
landmarkers. That is, Table 1 reports the results based on
meta-attributes obtained by evaluating the performance
measurements of one candidate algorithm and estimating
the remaining five, while Table 2 reports the results based
on two evaluated performance measurements and four
estimated ones, and so forth. In addition to the results of
the meta-learners, we also list the results obtained by
directly sorting the meta-attributes in question (listed as
sorted best). Also note that in each table, we report the
results obtained via the meta-attributes generated based
on both the r
2
(crit.1) and r
2
+ efficiency.gained (crit.2)
criteria (listed in rows 2 through 8 and 9 through 15
respectively). As a baseline, the accuracy based on the
default class is also listed in row 1 in each of these tables.
Table 1. The results based on the meta-attributes generated via 5 regression-based landmarkers. The remaining 1 accuracy measurement was directly evaluated.

Meta-learner              Acc    Rank   PtoW   PtoB
ZeroR (def. class)        32.5   2.96   11.2   -3.6
IB5 (crit.1)              41.3   2.56   11.9   -2.9
J4.8 (crit.1)             38.8   2.58   12.0   -2.8
JRip (crit.1)             40.0   2.54   12.4   -2.4
N.Bayes (crit.1)          33.8   2.76   11.5   -3.3
SMO-R (crit.1)            42.5   2.32   12.3   -2.5
SMO (crit.1)              32.5   2.96   11.2   -3.6
Sorted Best (crit.1)      28.8   2.63   12.4   -2.4
IB5 (crit.2)              28.8   3.13   11.1   -3.7
J4.8 (crit.2)             27.5   2.99   11.2   -3.6
JRip (crit.2)             31.3   2.94   11.0   -3.8
N.Bayes (crit.2)          18.8   3.63    8.8   -6.0
SMO-R (crit.2)            26.3   3.12   10.9   -3.9
SMO (crit.2)              32.5   2.96   11.2   -3.6
Sorted Best (crit.2)      31.3   2.83   11.7   -3.1
Table 2. The results based on the meta-attributes generated via 4 regression-based landmarkers. The remaining 2 accuracy measurements were directly evaluated.

Meta-learner              Acc    Rank   PtoW   PtoB
ZeroR (def. class)        32.5   2.96   11.2   -3.6
IB5 (crit.1)              35.0   2.65   12.2   -2.6
J4.8 (crit.1)             33.8   2.68   12.1   -2.7
JRip (crit.1)             40.0   2.54   12.4   -2.4
N.Bayes (crit.1)          36.3   2.73   11.0   -3.8
SMO-R (crit.1)            38.8   2.48   11.7   -3.1
SMO (crit.1)              32.5   2.96   11.2   -3.6
Sorted Best (crit.1)      30.0   2.51   12.5   -2.3
IB5 (crit.2)              43.8   2.51   11.7   -3.1
J4.8 (crit.2)             47.5   2.33   11.7   -3.1
JRip (crit.2)             45.0   2.46   12.3   -2.5
N.Bayes (crit.2)          31.3   2.91   11.4   -3.4
SMO-R (crit.2)            27.5   2.86   11.5   -3.3
SMO (crit.2)              32.5   2.96   11.2   -3.6
Sorted Best (crit.2)      42.5   2.07   13.6   -1.2
From these results, we notice that, in general, the accuracy of the meta-learning solutions tends to increase as the quality of the meta-attributes increases (i.e. when more
performance measurements are evaluated instead of
estimated). However, this improvement is far more
pronounced in the solutions where the meta-attributes are
simply sorted, and the best chosen from that ranking.
For the meta-attribute sets generated using more
estimated than evaluated accuracy measurements (i.e.
Table 3. The results based on the meta-attributes generated via 3 regression-based landmarkers. The remaining 3 accuracy measurements were directly evaluated.

Meta-learner              Acc    Rank   PtoW   PtoB
ZeroR (def. class)        32.5   2.96   11.2   -3.6
IB5 (crit.1)              47.5   2.35   12.7   -2.1
J4.8 (crit.1)             37.5   2.74   11.5   -3.3
JRip (crit.1)             40.0   2.59   11.7   -3.1
N.Bayes (crit.1)          33.8   2.86   11.3   -3.5
SMO-R (crit.1)            28.8   2.83   11.5   -3.3
SMO (crit.1)              32.5   2.96   11.2   -3.6
Sorted Best (crit.1)      41.3   2.19   13.0   -1.8
IB5 (crit.2)              50.0   2.26   13.1   -1.7
J4.8 (crit.2)             50.0   2.26   13.1   -1.7
JRip (crit.2)             43.8   2.53   12.3   -2.5
N.Bayes (crit.2)          26.3   2.95   11.1   -3.7
SMO-R (crit.2)            27.5   2.78   11.5   -3.3
SMO (crit.2)              32.5   2.96   11.2   -3.6
Sorted Best (crit.2)      55.0   1.85   13.9   -0.9
Table 4. The results based on the meta-attributes generated via 2 regression-based landmarkers. The remaining 4 accuracy measurements were directly evaluated.

Meta-learner              Acc    Rank   PtoW   PtoB
ZeroR (def. class)        32.5   2.96   11.2   -3.6
IB5 (crit.1)              45.0   2.36   12.5   -2.3
J4.8 (crit.1)             42.5   2.46   12.2   -2.6
JRip (crit.1)             42.5   2.51   12.2   -2.6
N.Bayes (crit.1)          32.5   2.69   11.8   -3.0
SMO-R (crit.1)            26.3   2.91   11.0   -3.8
SMO (crit.1)              32.5   2.96   11.2   -3.5
Sorted Best (crit.1)      52.5   1.82   14.1   -0.8
IB5 (crit.2)              48.8   2.23   12.8   -2.0
J4.8 (crit.2)             33.8   2.74   11.0   -3.8
JRip (crit.2)             38.8   2.63   12.0   -2.8
N.Bayes (crit.2)          30.0   2.74   11.8   -3.0
SMO-R (crit.2)            27.5   2.89   11.4   -3.4
SMO (crit.2)              32.5   2.96   11.2   -3.6
Sorted Best (crit.2)      65.0   1.62   14.2   -0.6
Tables 1 and 2), we find that the meta-learning solutions tend to fare better than directly sorting the meta-attributes. When there is an equal number of evaluated and estimated accuracy meta-attributes (i.e. Table 3), the solution based on direct sorting approaches the performance of the best performing meta-learning solution. When more evaluated than estimated meta-attributes are utilised (i.e. Tables 4 and 5), the sorting solution clearly outperforms the meta-learning solutions.
Table 5. The results based on the meta-attributes generated via 1 regression-based landmarker. The remaining 5 accuracy measurements were directly evaluated.

Meta-learner              Acc    Rank   PtoW   PtoB
ZeroR (def. class)        32.5   2.96   11.2   -3.6
IB5 (crit.1)              45.0   2.35   13.1   -1.7
J4.8 (crit.1)             50.0   2.21   13.0   -1.8
JRip (crit.1)             43.8   2.26   13.2   -1.6
N.Bayes (crit.1)          36.3   2.72   11.9   -2.9
SMO-R (crit.1)            27.5   2.94   10.8   -4.0
SMO (crit.1)              32.5   2.96   11.2   -3.6
Sorted Best (crit.1)      77.5   1.42   14.6   -0.2
IB5 (crit.2)              48.8   2.23   12.9   -1.9
J4.8 (crit.2)             43.8   2.54   12.0   -2.8
JRip (crit.2)             46.3   2.40   12.3   -2.5
N.Bayes (crit.2)          30.0   2.80   11.7   -3.1
SMO-R (crit.2)            26.3   3.07   10.7   -4.1
SMO (crit.2)              32.5   2.96   11.2   -3.6
Sorted Best (crit.2)      86.3   1.19   14.6   -0.3
The accuracy of the meta-learning solutions ranges from 26.3% to 50%, which suggests that none of the meta-
classifiers generated over the various meta-attribute sets
are able to sufficiently learn how to classify the candidate
algorithm with the highest accuracy. However, in
comparison to the baseline accuracy afforded by the
default class (i.e. ZeroR accuracy of 32.5%), we find that
the IB5, J4.8, and JRip meta-learners consistently perform
better. The performance of the Naive Bayes meta-learner
on the other hand, is close to that achieved by the default
class, while the SVMs perform quite poorly. One possible
reason for this is that the amount of meta-knowledge
regarding the patterns of performance of the candidate
algorithms is insufficient for any meta-learner to
satisfactorily learn to distinguish the most accurate
candidate algorithm.
Essentially, the meta-learner must attempt to decode the
tangle of partially approximated performance patterns (i.e.
potentially noisy accuracy measurements) and then
predict which is likely to be superior. This problem seems
to be too complex for the meta-learner given only the
accuracy measurements of the candidate algorithms over
80 UCI datasets. To clarify this point, we also evaluate
the meta-learners using the actual stratified ten-fold cross-
validation results as meta-attributes. These results, which
are described in Table 6, suggest that even when the true
accuracy scores of the candidate algorithms are provided,
there is still insufficient data (i.e. UCI datasets) for the
meta-learners to learn how to choose the highest score
from among the accuracies input (i.e. to learn to perform
an argmax).
Table 6. The meta-learning task results based on meta-attributes corresponding to the stratified 10-fold cross-validation accuracies.

Meta-learner              Acc     Rank   PtoW   PtoB
ZeroR (def. class)        32.5    2.96   11.3   -1.9
IB5                       43.8    2.27   12.9   -1.9
J4.8                      45.0    2.51   12.4   -2.4
JRip                      46.3    2.54   12.1   -2.7
N.Bayes                   38.8    2.62   12.0   -2.6
SMO-R                     32.5    2.96   11.3   -1.9
SMO                       27.5    2.89   11.1   -1.8
Sorted Best*              100.0   1.00   14.8    0.0

* This corresponds to the procedure used to compute the target meta-classes.
In comparison, the accuracy achieved by simply sorting
the generated meta-attributes improves more significantly
as the number of evaluated accuracy measurements
increases, and eventually outperforms the default class
and other meta-learners. This may be because the bias behind the sorting operation is directly representative of the (argmax) task, and thus only induction over the accuracy score approximations is required. This also means that as more accuracy measurements are evaluated, the required induction is correspondingly
lessened. Consider the following. Given |A| candidate algorithms, and assuming that each a_i ∈ A has an equal chance of being the most accurate on a given dataset s_j, there is a 1 in |A| chance of selecting the most accurate algorithm from among A for s_j. If k algorithms are evaluated (and thus the accuracies of the remaining |A| - k remain unknown), then the chance of selecting the most accurate algorithm becomes 1 in (|A| - k + 1) – i.e. we know the most accurate algorithm among the k that are run, but not whether any of those that were not run are actually even more accurate. Essentially,
when evaluating all but one candidate algorithm, there is a
1 in 2 chance of picking the most accurate one (i.e. from
between the best of those evaluated, and the one that was
not). However, the meta-learning solutions cannot take
advantage of this, and the difficulty of the induction task
that is faced (i.e. to determine the most accurate candidate
algorithm) persists despite this potential discount.
5. Conclusion and Future Work
In this paper we present a new meta-learning framework
for algorithm selection utilising regression-based
landmarkers. In essence, we seek to solve the algorithm selection task of identifying the most accurate algorithm from a given set of candidate algorithms by: 1) generating meta-attributes that correspond to the performance
patterns either directly evaluated or estimated via
regression-based landmarkers; 2) attempting to (meta-)
learn the mapping between these meta-attributes and
meta-classes corresponding to the most accurate
algorithm in the set. From our experiments using 80 UCI datasets and 6 WEKA algorithms, we discover that, for the meta-knowledge employed, learning to predict the most accurate algorithm given some new dataset is too complex a task, and that simply sorting the evaluated accuracy measurements together with those predicted via the regression-based landmarkers achieves more satisfactory results.
There are several possible avenues for future work,
including:
- Developing theory regarding the difficulty of meta-learning tasks, and which of these to solve given some finite amount of meta-knowledge.
- Generating and meta-learning with more universally representative datasets, or perhaps datasets that are attuned to the failings of specific learning algorithms.
- Experimenting with different meta-learning tasks; for example, learning how to classify the C(|A|, 2) pairwise comparisons among the given candidate algorithms A.
- The generation and use of more accurate meta-class data. In particular, this corresponds to the use of more statistically sound methods of algorithm evaluation and comparison (e.g. using stratified 10x10-fold cross-validation, paired t-tests, McNemar tests, etc.).
- Experimentation with other meta-attributes, e.g. other types of landmarkers (such as regression-based relational landmarkers) and other dataset characteristics.
- Considering more complicated performance indicators (e.g. F-score, or other cost/utility functions) to either landmark or use as meta-classes.
- The development of landmarker theory.
Landmarking remains a new and relatively unexplored facet of meta-learning for algorithm selection, and warrants further investigation.
Acknowledgments
The authors would like to acknowledge the support of the
Smart Internet Technology CRC in this research.
References
Aha, D. (1992). Generalizing from case studies: a case
study. Proceedings of the 9th International Conference
on Machine Learning, (pp. 1-10).
Blake, C., & Merz, C. (1998). UCI repository of
machine learning databases. University of California,
Irvine, Department of Information and Computer
Sciences.
Bensusan, H. (1998). God doesn't always shave with
Occam's Razor: learning when and how to prune.
Proceedings of the 9th European Conference on
Machine Learning, (pp. 119-124).
Brazdil, P., Soares, C., & Costa, J. (2003). Ranking
learning algorithms: using IBL and meta-learning on
accuracy and time results. Machine Learning, 50(3),
251–277.
Fürnkranz, J., & Petrak, J. (2001). An evaluation of
landmarking variants. Proceedings of the 10th
European Conference on Machine Learning, Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, (pp. 57–68).
Gama, J., & Brazdil, P. (1995). Characterization of
classification algorithms. Proceedings of the 7th
Portuguese Conference in AI, (pp. 83-102).
Giraud-Carrier, C., Vilalta, R., & Brazdil, P. (2004).
Introduction to the special issue on meta-learning.
Machine Learning, 54(3), 187–193.
Kalousis, A., & Hilario, M. (2001). Model selection via
meta-learning: a comparative study. International
Journal on Artificial Intelligence Tools, 10(4), 525–554.
Ler, D., Koprinska, I., & Chawla, S. (2005). A hill-
climbing landmarker generation algorithm based on
efficiency and correlativity criteria. Proceedings of the
18th International FLAIRS Conference, Machine
Learning Track, (in press).
Ler, D., Koprinska, I., & Chawla, S. (2004a). A
landmarker selection algorithm based on correlation and
efficiency criteria. Proceedings of the 17th Australian
Joint Conference on Artificial Intelligence, (pp. 296–
306).
Ler, D., Koprinska, I., & Chawla, S. (2004b).
Comparisons between heuristics based on correlativity
and efficiency for landmarker generation. Proceedings
of the 4th International Conference on Hybrid
Intelligent Systems, (pp. 32–37).
Lindner, G., & Studer, R. (1999). AST: support for
algorithm selection with a CBR approach. Proceedings
of the 3rd European Conference on Principles and
Practice of Knowledge Discovery in Databases, (pp.
418–423).
Michie, D., Spiegelhalter, D., & Taylor, C. (1994).
Machine learning, neural and statistical classification.
Ellis Horwood.
Pfahringer, B., Bensusan, H., & Giraud-Carrier, C.
(2000). Meta-learning by landmarking various learning
algorithms. Proceedings of the 17th International
Conference on Machine Learning, (pp. 743–750).
Schaffer, C. (1993). Technical note: selecting a
classification method by cross-validation. Machine
Learning, 13(1), 135–143.
Todorovski, L., Blockeel, H., & Dzeroski, S. (2002).
Ranking with predictive clustering trees. Proceedings of
the 13th European Conference on Machine Learning,
(pp. 444-455).
Vilalta, R., & Drissi, Y. (2002). A perspective view and
survey of meta-learning. Journal of Artificial
Intelligence Review, 18(2), 77–95.
Witten, I., & Frank, E. (2000). Data mining: practical
machine learning tools with Java implementations.
Morgan Kaufmann.
... However, none has investigated the merits of landmarkers as metafeatures. Since these metafeatures use simple estimates of performance to predict the actual performance of algorithms, its efficacy in solving the algorithm selection problem is not only expected but has been demonstrated in various other tasks [3,11,17,18,20,21,25]. Therefore, it is important to understand if their effect is similarly positive in selecting CF algorithms. ...
... Such metafeatures rely on the assumption that by estimating the performance of fast and simple models or by using samples of the data, the performance estimates will correlate well with the best algorithms, hence enabling future predictions. In fact, these metafeatures have proven successful on the selection of algorithms for various tasks [3,11,17,18,20,21,25]. ...
... This section presents our proposal of subsampling landmarkers for the selection of CF algorithms and the experimental procedure used to validate them. Our motivation for using landmarkers is that, although they have been successfully applied to the algorithm selection problem in other learning tasks [3,11,17,18,20,21,25], they were never adapted for selecting CF algorithms. Since there are no fast/simple CF algorithms, which can be used as traditional landmarkers, we have followed the alternative approach of developing subsampling landmarkers, i.e. applying the complete CF algorithms on samples of the data. ...
Conference Paper
Full-text available
Recommender Systems have become increasingly popular, propelling the emergence of several algorithms. As the number of algorithms grows, the selection of the most suitable algorithm for a new task becomes more complex. The development of new Recommender Systems would benefit from tools to support the selection of the most suitable algorithm. Metalearning has been used for similar purposes in other tasks, such as classification and regression. It learns predictive models to map characteristics of a dataset with the predictive performance obtained by a set of algorithms. For such, different types of characteristics have been proposed: statistical and/or information-theoretical, model-based and landmarkers. Recent studies argue that landmarkers are successful in selecting algorithms for different tasks. We propose a set of landmarkers for a Metalearning approach to the selection of Collaborative Filtering algorithms. The performance is compared with a state of the art systematic metafeatures approach using statistical and/or information-theoretical metafeatures. The results show that the metalevel accuracy performance using landmarkers is not statistically significantly better than the metafeatures obtained with a more traditional approach. Furthermore, the baselevel results obtained with the algorithms recommended using landmarkers are worse than the ones obtained with the other metafeatures. In summary, our results show that, contrary to the results obtained in other tasks, these landmarkers are not necessarily the best metafeatures for algorithm selection in Collaborative Filtering.
... RMSE) as meta-features for the dataset. An analysis of landmarkers for regression problems can be found in Ler et al. (2005). ...
Preprint
We investigate the learning of quantitative structure activity relationships (QSARs) as a case-study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning there is no agreed single best way of learning QSARs, and therefore the problem area is well-suited to meta-learning. We first carried out the most comprehensive ever comparison of machine learning methods for QSAR learning: 18 regression methods, 6 molecular representations, applied to more than 2,700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13%, on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and as this investigation is one of the most extensive ever comparisons of base and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning.
... In future work, we plan to investigate more sophisticated strategies for building dynamic ensembles such as staking [25] and arbitrated ensembles [18], and also meta-learning for ensemble member selection and ranking [26,27]. Another direction for future work is studying seasonal differences [28] and building ensembles that are better tuned to the seasonal variations. ...
... RMSE) as meta-features for the dataset. An analysis of landmarkers for regression problems can be found in Ler et al. (2005). ...
Article
Full-text available
We investigate the learning of quantitative structure activity relationships (QSARs) as a case-study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning there is no agreed single best way of learning QSARs, and therefore the problem area is well-suited to meta-learning. We first carried out the most comprehensive ever comparison of machine learning methods for QSAR learning: 18 regression methods, 6 molecular representations, applied to more than 2,700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13%, on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and as this investigation is one of the most extensive ever comparisons of base and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning.
... We will also investigate if feature selection [22] applied to both power and weather data can improve the results. Another direction for future work is selecting the best prediction algorithm for a given solar dataset or different days and times of the day, by investigating methods based on meta-learning [23]. ...
Conference Paper
Full-text available
We consider the task of forecasting the electricity power generated by a photovoltaic solar system, for the next day at half-hourly intervals. The forecasts are based on previous power output and weather data, and weather prediction for the next day. We present a new approach that forecasts all the power outputs for the next day simultaneously. It builds separate prediction models for different types of days, where these types are determined using clustering of weather patterns. As prediction models it uses ensembles of neural networks, trained to predict the power output for a given day based on the weather data. We evaluate the performance of our approach using Australian photovoltaic solar data for two years. The results showed that our approach obtained MAE=83.90 kW and MRE=6.88%, outperforming four other methods used for comparison.
... Another interesting extension would be to investigate the application of adaptive wavelet packets with best basis selection -finding the best wavelet basis for different data segments as opposed to finding the best wavelet basis for the whole dataset as in AWNN. Another avenue for future work is algorithm selectionour approach is not limited to using NN as a prediction algorithm and methods for selecting the most appropriate prediction algorithm such as landmarking [43] can be explored. ...
Preprint
Full-text available
Considerable progress has been made in the recent literature studies to tackle the Algorithms Selection and Parametrization (ASP) problem, which is diversified in multiple meta-learning setups. Yet there is a lack of surveys and comparative evaluations that critically analyze, summarize and assess the performance of existing methods. In this paper, we provide an overview of the state of the art in this continuously evolving field. The survey sheds light on the motivational reasons for pursuing classifiers selection through meta-learning. In this regard, Automated Machine Learning (AutoML) is usually treated as an ASP problem under the umbrella of the democratization of machine learning. Accordingly, AutoML makes machine learning techniques accessible to domain scientists who are interested in applying advanced analytics but lack the required expertise. It can ease the task of manually selecting ML algorithms and tuning related hyperparame-ters. We comprehensively discuss the different phases of classifiers selection based on a generic framework that is formed as an outcome of reviewing prior works. Subsequently, we propose a benchmark knowledge base of 4 millions previously learned models and present extensive comparative evaluations of the prominent methods for classifiers selection based on 08 classification algorithms and 400 benchmark datasets. The comparative study quantitatively assesses the performance of algorithms selection methods along while emphasizing the strengths and limitations of existing studies.
Thesis
Full-text available
Machine learning (ML) has penetrated all aspects of the modern life, and brought more convenience and satisfaction for variables of interest. However, building such solutions is a time consuming and challenging process that requires highly technical expertise. This certainly engages many more people, not necessarily experts, to perform analytics tasks. While the selection and the parametrization of ML models require tedious episodes of trial and error. Additionally, domain experts often lack the expertise to apply advanced analytics. Consequently, they intend frequent consultations with data scientists. However, these collaborations often result in increased costs in terms of undesired delays. It thus can lead risks such as human-resource bottlenecks. Subsequently, as the tasks become more complex, similarly the more support solutions are needed for theincreased ML usability for the non-ML masters. To that end, Automated ML(AutoML) is a data-mining formalism with the aim of redureducing human effort and readily improving the development cycle through automation. The field of AutoML aims to make these decisions in a data-driven, objective, and automated way. Thereby, AutoML makes ML techniques accessible to domain scientists who are interested in applying advanced analytics but lack the required expertise. This can be seen as a democratization of ML. AutoML is usually treated as an algorithms selection and parametrization problem. In this regard, existing approaches include Bayesian optimization, evolutionary algorithms as well as reinforcement learning. Theseapproaches have focused on providing user assistance by automating parts or the entire data analysis process, but without being concerned on its impact on the analysis. The goal has generally been focused on the performance factors, thus leaving aside other important and even crucial aspects such as computational complexity, confidence and transparency. In contrast, this thesis aims at developing alternative methods that provide assistance in building appropriate modeling techniques while providing the rationale for the selected models. In particular, we consider this important demand in intelligent assistance as a meta-analysis process, and we make progress towards addressing two challenges in AutoML research. First, to overcome the computational complexity problem, we studied a formulation of AutoML as a recommendation problem, and proposed a new conceptualization of a Meta-Learning (MtL)-based expert system capable of recommending optimal ML pipelines for a given task; Second, we investigated the automatic explainability aspect of the AutoML process to address the problem of the acceptance of, and the trust in such black-boxes support systems. To this end, we have designed and implemented a framework architecture that leverages ideas from MtL to learn the relationship between a new set of datasets meta-data and mining algorithms. This eventually enables recommending ML pipelines according to their potential impact on the analysis. To guide the development of our work, we chose to focus on the Industry 4.0 as a main field of application for all the constraints it offers.Finally, in this doctoral thesis, we focus on the user assistance in the algorithms selection and tuning step. We devise an architecture and build a tool, AMLBID, that provides users support with the aim of improving the analysis and decreasing the amount of time spent in algorithms selection and parametrization. 
It is a tool that for the first time does not aim at providing data analysis support only, but instead, it is oriented towards positively contributing to the trust-in such powerful support systems by automatically providing a set of explanation levels to inspect the provided results.
Chapter
Full-text available
Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible. Not only does this dramatically speed up and improve the design of machine learning pipelines or neural architectures, it also allows us to replace hand-engineered algorithms with novel approaches learned in a data-driven way. In this chapter, we provide an overview of the state of the art in this fascinating and continuously evolving field.
Preprint
Full-text available
Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible. Not only does this dramatically speed up and improve the design of machine learning pipelines or neural architectures, it also allows us to replace hand-engineered algorithms with novel approaches learned in a data-driven way. In this chapter, we provide an overview of the state of the art in this fascinating and continuously evolving field.
Conference Paper
Full-text available
Landmarking is a novel approach to describing tasks in meta-learning. Previous approaches to meta-learning mostly considered only statistics-inspired measures of the data as a source for the denition of metaattributes. Contrary to such approaches, landmarking tries to determine the location of a specic learning problem in the space of all learning problems by directly measuring the performance of some simple and ecient learning algorithms themselves. In the experiments reported we show how such a use of landmark values can help to distinguish between areas of the learning space favouring dierent learners. Experiments, both with articial and real-world databases, show that landmarking selects, with moderate but reasonable level of success, the best performing of a set of learning algorithms. 1.
Conference Paper
Full-text available
Landmarking is a recent and promising meta-learning strategy, which defines meta-features that are themselves efficient learning algorithms. However, the choice of landmarkers is often made in an ad hoc manner. In this paper, we propose a new perspective and set of criteria for landmarkers. Based on the new criteria, we propose a landmarker generation algorithm, which generates a set of landmarkers that are each subsets of the algorithms being landmarked. Our experiments show that the landmarkers formed, when used with linear regression are able to estimate the accuracy of a set of candidate algorithms well, while only utilising a small fraction of the computational cost required to evaluate those candidate algorithms via ten-fold cross-validation.
Conference Paper
Full-text available
This paper is concerned with the problem of characterization of classification algorithms. The aim is to determine under what circumstances a particular classification algorithm is applicable. The method used involves generation of different kinds of models. These include regression and rule models, piecewise linear models (model trees) and instance based models. These are generated automatically on the basis of dataset characteristics and given test results. The lack of data is compensated for by various types of preprocessing. The models obtained are characterized by quantifying their predictive capability and the best models are identified.
Article
Full-text available
We present a meta-learning method to support selection of candidate learning algorithms. It uses a k-Nearest Neighbor algorithm to identify the datasets that are most similar to the one at hand. The distance between datasets is assessed using a relatively small set of data characteristics, which was selected to represent properties that affect algorithm performance. The performance of the candidate algorithms on those datasets is used to generate a recommendation to the user in the form of a ranking. The performance is assessed using a multicriteria evaluation measure that takes not only accuracy, but also time into account. As it is not common in Machine Learning to work with rankings, we had to identify and adapt existing statistical techniques to devise an appropriate evaluation methodology. Using that methodology, we show that the meta-learning method presented leads to significantly better rankings than the baseline ranking method. The evaluation methodology is general and can be adapted to other ranking problems. Although here we have concentrated on ranking classification algorithms, the meta-learning framework presented can provide assistance in the selection of combinations of methods or more complex problem solving strategies.
Article
Full-text available
Recent advances in meta-learning are providing the foundations to construct meta-learning assistants and task-adaptive learners. The goal of this special issue is to foster an interest in meta-learning by compiling representative work in the field. The contributions to this special issue provide strong insights into the construction of future meta-learning tools. In this introduction we present a common frame of reference to address work in meta-learning through the concept of meta-knowledge. We show how meta-learning can be simply defined as the process of exploiting knowledge about learning that enables us to understand and improve the performance of learning algorithms.
Article
If we lack relevant problem-specific knowledge, cross-validation methods may be used to select a classification method empirically. We examine this idea here to show in what senses cross-validation does and does not solve the selection problem. As illustrated empirically, cross-validation may lead to higher average performance than application of any single classification strategy, and it also cuts the risk of poor performance. On the other hand, cross-validation is no more or less a form of bias than simpler strategies, and applying it appropriately ultimately depends in the same way on prior knowledge. In fact, cross-validation may be seen as a way of applying partial information about the applicability of alternative classification strategies.
Conference Paper
For a given classification task, there are typically several learning algorithms available. The question then arises: which is the most appropriate algorithm to apply. Recently, we proposed a new algorithm for making such a selection based on landmarking - a meta-learning strategy that utilises meta-features that are measurements based on efficient learning algorithms. This algorithm, which creates a set of landmarkers that each utilise subsets of the algorithms being landmarked, was shown to be able to estimate accuracy well, even when employing a small fraction of the given algorithms. However, that version of the algorithm has exponential computational complexity for training. In this paper, we propose a hill-climbing version of the landmarker generation algorithm, which requires only polynomial training time complexity. Our experiments show that the landmarkers formed have similar results to the more complex version of the algorithm.
Conference Paper
A novel class of applications of predictive clustering trees is addressed, namely ranking. Predictive clustering trees, as implemented in CLUS, allow for predicting multiple target variables. This approach makes sense especially if the target variables are not independent of each other. This is typically the case in ranking, where the (relative) performance of several approaches on the same task has to be predicted from a given description of the task. We propose to use predictive clustering trees for ranking. As compared to existing ranking approaches which are instance-based, our approach also allows for an explanation of the predicted rankings. We illustrate our approach on the task of ranking machine learning algorithms, where the (relative) performance of the learning algorithms on a dataset has to be predicted from a given dataset description.