Mach Learn (2006) 63: 3–42
DOI 10.1007/s10994-006-6226-1
Extremely randomized trees
Pierre Geurts · Damien Ernst · Louis Wehenkel
Received: 14 June 2005 / Revised: 29 October 2005 / Accepted: 15 November 2005 /
Published online: 2 March 2006
© Springer Science + Business Media, LLC 2006
Editor: Johannes Fürnkranz

P. Geurts (✉) · D. Ernst · L. Wehenkel
Department of Electrical Engineering and Computer Science, University of Liège,
Sart-Tilman B-28, B-4000 Liège, Belgium
e-mail: P.Geurts@ulg.ac.be
D. Ernst, e-mail: Dernst@ulg.ac.be
L. Wehenkel, e-mail: L.Wehenkel@ulg.ac.be
Abstract This paper proposes a new tree-based ensemble method for supervised classifica-
tion and regression problems. It essentially consists of randomizing strongly both attribute
and cut-point choice while splitting a tree node. In the extreme case, it builds totally random-
ized trees whose structures are independent of the output values of the learning sample. The
strength of the randomization can be tuned to problem specifics by the appropriate choice
of a parameter. We evaluate the robustness of the default choice of this parameter, and we
also provide insight on how to adjust it in particular situations. Besides accuracy, the main
strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the
Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization
of the models induced.
Keywords Supervised learning · Decision and regression trees · Ensemble methods · Cut-point randomization · Bias/variance tradeoff · Kernel-based models
1. Introduction
In this article, we propose a new tree induction algorithm that selects splits, both attribute
and cut-point, totally or partially at random.
The idea that randomized decision trees could perform as well as classical ones appeared
in an experimental study published in the late eighties (Mingers, 1989), even if later it was
shown in a more carefully designed experiment that they were actually significantly less
accurate than normal ones on many datasets (Buntine and Niblett, 1992).
During the early nineties, the statistical notions of variance and its companion, the
bias, were studied more systematically by machine learning researchers (see for example,
Dietterich and Kong, 1995; Breiman, 1996a; Friedman, 1997), and the high variance of
tree-based models, like those induced by CART or C4.5, was acknowledged by the research
community. Actually, because of this high variance, the models induced by these latter methods are to a large extent random: the splits, both attributes and cut-points, chosen at each internal node depend to a very large extent on the random nature of the learning sample (Wehenkel, 1997; Geurts and Wehenkel, 2000; Geurts, 2002). In particular,
in (Wehenkel, 1997) it was shown empirically that the cut-point variance is very high, even
for large sample sizes. More precisely, the optimal cut-point (the one maximizing, for a
given problem and a given attribute, the score computed on the learning sample) was shown
to depend very strongly on the particular learning sample used. Furthermore, this cut-point
variance appeared to be responsible for a significant part of the error rates of tree-based
methods (see, e.g., Geurts, 2002). Thus, in order to reduce cut-point variance, various kinds of cut-point smoothing or averaging approaches have been tried out; while these do reduce cut-point variance and hence improve interpretability, they did not translate into significantly better predictive accuracy (Geurts and Wehenkel, 2000).
During the same early nineties period, various ideas to combine multiple models emerged
with the objective of reducing variance or bias in machine learning methods. For example,
Bayesian averaging (Buntine and Weigend, 1991; Buntine, 1992) is essentially a variance
reduction technique, whereas Stacking and Boosting (Wolpert, 1992; Freund and Schapire,
1995) essentially target a reduction of bias (actually, Boosting also reduces variance; Bauer
and Kohavi, 1999). Breiman came up in 1994 with the “Bagging” idea (Breiman, 1996b) in
order to reduce the variance of a learning algorithm without increasing its bias too much.
One can view Bagging as a particular instance of a broader family of ensemble methods
that we will call randomization methods. These methods explicitly introduce randomization
into the learning algorithm and/or exploit at each run a different randomized version of the
original learning sample, so as to produce an ensemble of more or less strongly diversified
models. Then, the predictions of these models are aggregated by a simple average or a
majority vote in the case of classification. For example, Bagging introduces randomization
by building the models from bootstrap samples drawn with replacement from the original
dataset. Several other generic randomization methods have been proposed which, like Bag-
ging, are applicable to any machine learning algorithm (e.g., Ho, 1998; Bauer and Kohavi,
1999; Webb, 2000; Breiman, 2000a). These generic methods often improve considerably
the accuracy of decision or regression trees which, otherwise, are often not competitive with
other supervised learning algorithms like neural networks or support vector machines. Fur-
thermore, although ensemble methods require growing several models, their combination
with decision/regression trees remains also very attractive in terms of computational effi-
ciency because of the low computational cost of the standard tree growing algorithm. Hence,
given the success of generic ensemble methods with trees, several researchers have looked at
specific randomization techniques for trees based on a direct randomization of the tree grow-
ing method (e.g., Ali and Pazzani, 1996; Ho, 1998; Dietterich, 2000; Breiman, 2001; Cutler and Guohua, 2001; Geurts, 2002; Kamath et al., 2002). All these randomization methods
actually cause perturbations in the induced models by modifying the algorithm responsible
for the search of the optimal split during tree growing. In the context of an ensemble method,
it is thus productive to somewhat deteriorate the “optimality” of the classical deterministic
induction algorithm that looks for the best split locally at each tree node.
Although existing randomization methods for trees significantly randomize the standard
tree growing algorithm, they are still far from building totally random trees. Yet, the very
high variance of decision and regression tree splits suggests investigating whether higher
randomization levels could improve accuracy with respect to existing ensemble methods. To
this end, the present paper proposes and studies a new algorithm that, for a given numerical
attribute, selects its cut-point fully at random, i.e., independently of the target variable. At
each tree node, this is combined with a random choice of a certain number of attributes
among which the best one is determined. In the extreme case, the method randomly picks a
single attribute and cut-point at each node, and hence builds totally randomized trees whose
structures are independent of the target variable values of the learning sample. While we
also propose a way to select random splits for categorical attributes, we focus in this paper
on the study of this randomization idea in the context of numerical attributes only.
The rest of the paper is organized as follows: Section 2 introduces the Extra-Trees (for extremely randomized trees) algorithm with its default parameter settings, and carries out a systematic empirical evaluation in terms of both accuracy and computational efficiency; Section 3 provides a detailed analysis of the effect of parameters in different conditions; Section 4 presents an empirical study of bias and variance of the Extra-Trees algorithm; and Section 5 analyses the main geometrical properties of the Extra-Trees models. The paper ends with a discussion of related work, conclusions and suggestions for future work directions. The appendices collect relevant details and complementary simulation results.
2. Extra-Trees algorithm
We consider the standard batch-mode supervised learning problem, and focus on learning
problems characterized by (possibly a large number of) numerical input variables and one
single (categorical or numerical) target variable. We start with the description of the Extra-
Trees (ET) algorithm and a brief discussion of its rationale. We continue with a systematic
empirical evaluation based on a diverse set of classification and regression problems, where
we compare this new method with standard tree-based methods, in terms of both accuracy
and computational efficiency.
In the rest of the paper, the term attribute denotes a particular input variable used in a
supervised learning problem. The candidate attributes denote all input variables that are
available for a given problem. We use the term output to refer to the target variable that
defines the supervised learning problem. When the output is categorical, we talk about a
classification problem and when it is numerical, we talk about a regression problem.The
term learning sample denotes the observations used to build a model, and the term test
sample the observations used to compute its accuracy (error-rate, or mean square-error).
Nrefers to the size of the learning sample, i.e., its number of observations, and nrefers to
the number of candidate attributes, i.e., the dimensionality of the input space.
2.1. Algorithm description and rationale
The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees
according to the classical top-down procedure. Its two main differences with other tree-
based ensemble methods are that it splits nodes by choosing cut-points fully at random and
that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees.
Table 1 Extra-Trees splitting algorithm (for numerical attributes)

Split_a_node(S)
Input: the local learning subset S corresponding to the node we want to split
Output: a split [a < a_c] or nothing
– If Stop_split(S) is TRUE then return nothing.
– Otherwise select K attributes {a_1, ..., a_K} among all non constant (in S) candidate attributes;
– Draw K splits {s_1, ..., s_K}, where s_i = Pick_a_random_split(S, a_i), ∀i = 1, ..., K;
– Return a split s∗ such that Score(s∗, S) = max_{i=1,...,K} Score(s_i, S).

Pick_a_random_split(S, a)
Inputs: a subset S and an attribute a
Output: a split
– Let a_max^S and a_min^S denote the maximal and minimal value of a in S;
– Draw a random cut-point a_c uniformly in [a_min^S, a_max^S];
– Return the split [a < a_c].

Stop_split(S)
Input: a subset S
Output: a boolean
– If |S| < n_min, then return TRUE;
– If all attributes are constant in S, then return TRUE;
– If the output is constant in S, then return TRUE;
– Otherwise, return FALSE.
The Extra-Trees splitting procedure for numerical attributes is given in Table 1.¹ It has two parameters: K, the number of attributes randomly selected at each node, and nmin, the minimum sample size for splitting a node. It is used several times with the (full) original learning sample to generate an ensemble model (we denote by M the number of trees of this ensemble). The predictions of the trees are aggregated to yield the final prediction, by majority vote in classification problems and arithmetic average in regression problems.
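For readers who prefer executable code to pseudocode, the following is a minimal Python sketch of the splitting routine of Table 1. The function names mirror the table; the representation of the local subset S as a list of (x, y) pairs and the Score callable are illustrative assumptions of this sketch, not part of the original algorithm description.

```python
import random

def pick_a_random_split(S, a):
    """Draw a cut-point a_c uniformly between the min and max of attribute a in S."""
    values = [x[a] for x, _ in S]
    a_c = random.uniform(min(values), max(values))
    return (a, a_c)                      # encodes the test [a < a_c]

def stop_split(S, n_min):
    """Stop if the node is too small, or if inputs or outputs are constant in S."""
    if len(S) < n_min:
        return True
    xs, ys = [x for x, _ in S], [y for _, y in S]
    n_attr = len(xs[0])
    if all(len({x[a] for x in xs}) == 1 for a in range(n_attr)):
        return True
    return len(set(ys)) == 1

def split_a_node(S, K, n_min, score):
    """Return the best of K random splits according to `score`, or None."""
    if stop_split(S, n_min):
        return None
    xs = [x for x, _ in S]
    non_constant = [a for a in range(len(xs[0]))
                    if len({x[a] for x in xs}) > 1]
    attrs = random.sample(non_constant, min(K, len(non_constant)))
    splits = [pick_a_random_split(S, a) for a in attrs]
    return max(splits, key=lambda s: score(s, S))
```

With K = 1 this degenerates into the totally randomized splitting rule discussed later in the paper.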
From the bias-variance point of view, the rationale behind the Extra-Trees method is that
the explicit randomization of the cut-point and attribute combined with ensemble averaging
should be able to reduce variance more strongly than the weaker randomization schemes
used by other methods. The usage of the full original learning sample rather than bootstrap replicas is motivated by the desire to minimize bias. From the computational point of view, the complexity of the tree growing procedure is, assuming balanced trees, on the order of N log N with respect to learning sample size, like most other tree growing procedures. However, given the simplicity of the node splitting procedure we expect the constant factor to be much smaller than in other ensemble based methods which locally optimize cut-points.

The parameters K, nmin and M have different effects: K determines the strength of the attribute selection process, nmin the strength of averaging of output noise, and M the strength
adapted to the problem specifics in a manual or an automatic way (e.g. by cross-validation).
However, we prefer to use default settings for them in order to maximize the computational
advantages and autonomy of the method. Section 3 studies these default settings in terms of
robustness and suboptimality in various contexts.
¹ The complete Extra-Trees algorithm, for numerical and categorical attributes, is given in Appendix A.
To specify the value of the main parameter K, we will use the notation ETK, where the subscript K is replaced by ‘d’ to denote that default settings are used, by ‘∗’ to denote the best results obtained over the range of possible values of K, and by ‘cv’ if K is adjusted by cross-validation.
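For practical experimentation, the Extra-Trees method is available in common libraries; for instance, scikit-learn's ExtraTreesClassifier and ExtraTreesRegressor expose n_estimators, max_features and min_samples_split as counterparts of M, K and nmin. The snippet below is a usage sketch that maps the default settings ETd of this paper onto that interface (the training arrays are assumed to be available); implementation details may differ slightly from the authors' original code.

```python
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

# ETd for classification: K = round(sqrt(n)), nmin = 2, M = 100, full learning sample
clf = ExtraTreesClassifier(n_estimators=100, max_features="sqrt",
                           min_samples_split=2, bootstrap=False)

# ETd for regression: K = n, nmin = 5, M = 100, full learning sample
reg = ExtraTreesRegressor(n_estimators=100, max_features=None,
                          min_samples_split=5, bootstrap=False)

# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)   # X_train, etc. assumed given
```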
2.2. Empirical evaluation
We now present an empirical evaluation of the performance of the Extra-Trees method with
default settings. We first describe the datasets used for this purpose and discuss the range of
conditions that are intended to be covered by them. Then we specify the algorithms (all are
tree-based) with which we compare our method and the simulation protocol used to evaluate
them. Finally, we provide and discuss the results in terms of accuracy and computational
requirements.
2.2.1. Datasets
We have used 24 different datasets. Half of them concern classification problems, with a
number of classes ranging from 2 to 26; the other half are regression problems. Overall,
these datasets cover a wide range of conditions in terms of number of candidate attributes
(between 2 and 617), learning sample size (between 300 and 10,000), and observation redundancy (number N/n of observations per attribute between 10 and 1000). In terms of the relative
importance of candidate attributes, we have datasets with a number of totally irrelevant
variables, with equally important variables and with attributes of variable importance. In
terms of problem complexity, we cover linear problems as well as strongly non-linear ones.
Some problems present high noise, others are noise free. All datasets are well known and
publicly available. Out of the 24 problems, 13 are synthetic ones, and for three of them
the explicit form of the Bayes optimal model and its residual error are known. Notice that
we have restricted our choice to datasets which originally had no missing values and only
numerical attributes.² The choice of datasets was made a priori and independently of the
results obtained with our methods, and no dataset was later excluded.
Appendix B provides further details concerning these datasets, in particular their size and
numbers of candidate attributes and classes, and a brief discussion of their origins and results
obtained by standard methods such as k-nearest neighbors (kNN) and linear models.
2.2.2. Compared algorithms
Below, the compared methods are given in the order of their publication. Except for the first
one, they all build ensembles of trees.
2.2.2.1. Single CART Tree (ST/PST). We use the CART algorithm to build single trees
(Breiman et al., 1984). Trees are grown in a deterministic way from the learning sample and
pruned according to the cost-complexity pruning algorithm with error estimates by ten-fold
cross-validation. We will use the acronym ST (respectively, PST) to denote unpruned single
trees (respectively, pruned single trees).
2.2.2.2. Tree Bagging (TB). When talking about Tree Bagging we refer to the standard
algorithm published by Breiman (1996b). In this algorithm, unpruned CART trees are grown
² See also (Geurts et al., 2005a) for results with categorical attributes.
from bootstrap replicas (obtained by N times random sampling with replacement in the
original learning sample). In Tree Bagging, attribute and cut-point randomization is thus
carried out implicitly (and in a parameter free way) via the bootstrap re-sampling.
2.2.2.3. (Local) Random Subspace (RS). We consider the variant where this method random-
izes locally the set of attributes used to determine an optimal split at each tree node (Ho,
1998). To this end, it looks for the best split over a subset of attributes, selected locally at
each test node by drawing without replacement a number K of attributes from the candidate
attributes. The trees are built from the original learning sample and the cut-point for a given
attribute is optimized node-wise. The strength of the randomization is inversely proportional
to the parameter K. We use the notation RSK to denote this method, with K = ∗ denoting the best result found over the range of values K = 1, ..., n.
2.2.2.4. Random Forests (RF). This algorithm has been proposed by Breiman (2001) as an
enhancement of Tree Bagging. To build a tree it uses a bootstrap replica of the learning sample,
and the CART algorithm (without pruning) together with the modification used in the Random
Subspace method. At each test node the optimal split is derived by searching a random
subset of size K of candidate attributes (selected without replacement from the candidate
attributes). Empirical studies have shown that Random Forests significantly outperform Tree
Bagging and other random tree ensemble methods in terms of accuracy. In terms of degree of
randomization, this algorithm is stronger than Tree Bagging, especially if K is small compared to the number of attributes, n. It is also stronger than Random Subspace since it combines this method with bootstrap sampling. Randomization is both implicit (attribute and cut-point) and explicit (attribute). We use the notation RFK, with K = d for the default setting, and K = ∗ for the best result over the range K = 1, ..., n.
2.2.2.5. Parameter choices. Except for CART, all algorithms have the common parameter M (the number of trees of the ensemble). We use for this parameter the common value of M = 100, which is large enough to ensure convergence of the ensemble effect with all our datasets. Another common parameter of the ensemble methods is nmin, which we set to 2 for classification (fully grown trees) and to 5 in regression (slight smoothing). Finally, Random Subspace, Random Forests and Extra-Trees have the common parameter K. For our method and unless otherwise specified, we use the default setting in all trials, which is K = √n (rounded to the closest integer) in classification and K = n in regression. This choice is denoted by ETd and further discussed in Section 3.1. For the purpose of accuracy assessments, we did run Random Subspace and Random Forests systematically with K ranging from 1 to n, and we report the best results obtained (i.e., RS∗ and RF∗). For the computational assessment, we did however run Random Forests with the same default K (denoted by RFd) as our method, so as to put these methods in comparable conditions from the computational point of view.
2.2.2.6. Score measures. We use the same score measure for all methods. In regression this
is the amount of output variance reduction and in classification it is a normalized version of
the Shannon information gain. Detailed score formulas are given in Appendix A.
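The exact score formulas are given in Appendix A (not reproduced in this excerpt). As an illustration only, the sketch below computes a relative variance reduction for regression and a symmetrically normalized information gain for classification, which is the flavor of normalization used here; treat the details as an approximation of Appendix A rather than a verbatim transcription.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def regression_score(y, left):
    """Relative output-variance reduction; `left` is a boolean mask of the left child.
    Assumes both children are non-empty (always true for a cut-point drawn strictly
    inside the attribute range)."""
    y = np.asarray(y, dtype=float)
    if y.var() == 0.0:
        return 0.0
    n, n_l = len(y), left.sum()
    var_after = (n_l * y[left].var() + (n - n_l) * y[~left].var()) / n
    return (y.var() - var_after) / y.var()

def classification_score(y, left):
    """Information gain of the binary split, symmetrically normalized by the split
    and class entropies (an approximation of the formula in Appendix A)."""
    y = np.asarray(y)
    n, n_l = len(y), left.sum()
    h_c = entropy(y)                                   # class entropy
    h_s = entropy(left)                                # split (partition) entropy
    h_c_given_s = (n_l / n) * entropy(y[left]) + (1 - n_l / n) * entropy(y[~left])
    gain = h_c - h_c_given_s                           # mutual information
    return 2.0 * gain / (h_s + h_c) if (h_s + h_c) > 0 else 0.0
```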
2.2.2.7. Aggregation scheme. For all ensemble methods we used majority vote for classifi-
cation problems and arithmetic average for regression problems.
Table 2 Win/Draw/Loss records (corrected t-tests) comparing the algorithm in the column versus the algorithm in the row

           Classification problems                       Regression problems
       PST     TB      RS∗     RF∗     ETd       PST     TB      RS∗     RF∗     ETd
PST    –       8/4/0   11/1/0  11/1/0  10/2/0    –       10/2/0  8/4/0   10/2/0  10/2/0
TB     0/4/8   –       7/5/0   7/5/0   7/5/0     0/2/10  –       0/9/3   1/11/0  4/8/0
RS∗    0/1/11  0/5/7   –       0/8/4   2/10/0    0/4/8   3/9/0   –       4/8/0   4/7/1
RF∗    0/1/11  0/5/7   4/8/0   –       5/7/0     0/2/10  0/11/1  0/8/4   –       3/7/2
ETd    0/2/10  0/5/7   0/10/2  0/7/5   –         0/2/10  0/8/4   1/7/4   2/7/3   –
2.2.3. Protocols
The algorithms are run ten times on each dataset, except for smaller datasets where they are
run 50 times (these are marked by a star in Table 7, Appendix B). At each run, each dataset
is first randomly divided into a learning (LS) and test (TS) sample (their respective sizes are given in Table 7). Then all algorithms are run on the learning sample and their errors are
estimated on the test sample. We report and analyze the average and standard deviations of
the errors of each method obtained in this way on each dataset. In classification problems
these numbers refer to error rates in percent, whereas in regression problems they refer to
mean square-errors, multiplied by the factor given in the last column of Table 7 (these factors
are inversely proportional to the order of magnitude of the errors). These results are reported
graphically on Fig. 1. Each individual graph shows, for a particular problem, in left to right
order the accuracies of the five methods, with the pruned CART trees (PST) on the far left
and the Extra-Trees with default settings (ETd) on the far right. We recall that the Random
Subspace and Random Forests results correspond to the best value of K and are therefore denoted by RS∗ and RF∗.
All the numerical values are collected in Table 8 of Appendix D, with results obtained by
other variants, the kNN method, and least squares linear regression.
We have also performed statistical tests to compare the different algorithms. For this
purpose, we used a corrected paired two-sided t-test with a confidence level of 95% (see Appendix C). Table 2 reports on the “Win-Draw-Loss” statuses of all pairs of methods.
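Appendix C specifies the exact test. Assuming it is the corrected resampled t-test of Nadeau and Bengio (2003), which is the usual correction for repeated random learning/test splits of the same dataset, a sketch of the computation is:

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(err_a, err_b, n_train, n_test, alpha=0.05):
    """Corrected resampled paired t-test (Nadeau & Bengio, 2003) over k runs,
    each based on an independent random LS/TS split of the same dataset."""
    d = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    k = len(d)
    t = d.mean() / np.sqrt((1.0 / k + n_test / n_train) * d.var(ddof=1))
    p = 2.0 * stats.t.sf(abs(t), df=k - 1)
    return t, p, p < alpha                  # significant difference at level alpha
```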
2.2.4. Discussion of results
Although this is difficult to assess from Fig. 1, let us first notice that for each problem the standard deviations of the errors of the last three methods (RS∗, RF∗, ETd) are very close to each other and are significantly smaller than those of PST (50% on average). For Tree Bagging, on the other hand, these standard deviations are on classification problems close to those of PST, while on regression problems they are close to those of the other three methods.
Concerning average accuracies, Fig. 1 highlights that on a very large majority of problems
Extra-Trees are as accurate or more accurate than the other ensemble methods. Among
these other methods, Tree Bagging is less convincing on classification problems, while on regression problems it appears to be equivalent to Random Forests. On regression problems, the Random Subspace method is occasionally significantly worse than the other ensemble methods (on the two Hwang problems, and a little less so on Pumadyn-32nm). Only on Vehicle and Housing are Extra-Trees visibly (slightly) less accurate than some other ensemble methods; overall, they work very reliably. Finally, we note that all ensemble
Fig. 1 Comparison (average error and standard deviation) on 12 classification and 12 regression problems.
methods are better (generally much better) than the single pruned CART trees. Indeed, this
latter method is competitive only on Pumadyn-32fh. Notice that the results are sometimes
slightly worse and sometimes slightly better for unpruned single CART trees (see Table 8,
Appendix D).
Considering the significance tests of Table 2, we note that Extra-Trees never lose on clas-
sification problems, whereas Tree Bagging loses 7 times with respect to the other ensemble
methods, Random Subspace loses 2 times with respect to Extra-Trees, and Random Forests
loses 5 times with respect to Extra-Trees and 4 times with respect to Random Subspace. The
advantage of Extra-Trees is also valid on regression problems, although here it occasionally
loses (1 time with respect to Random Subspace and 2 times with respect to Random Forests).
To assess the effect of the default choice of the parameter K on these conclusions, we did run a side experiment where the value of K was adjusted for each run of the Extra-Trees
method by using a 10-fold cross-validation technique internal to the learning sample. The
detailed numerical results are given in Table 8(Appendix D) in the column denoted ETcv;
significance tests show that 22 times out of 24 the ETcv variant performs the same as ETd,
and two times it wins (on the Segment and Isolet datasets, where slightly better results are
obtained for K values higher than the default setting). In terms of Win/Draw/Loss reports
with respect to the other methods, the ETcv variant also appears slightly better than ETd.
Finally, the comparison of the ETcv version with an ET∗ variant (see Table 8) shows that there is no significant difference (24 draws on 24 datasets) between these two variants.
These results confirm that the conclusions in terms of accuracy would have been affected
only marginally in favor of the Extra-Trees algorithm if we had used the version which
adjusts K by cross-validation instead of the default settings. Given the very small gain in
accuracy with respect to the default setting and the very high cost in terms of computational
burden, we do not advocate the use of the cross-validation procedure together with the Extra-
Trees algorithm, except in very special circumstances where one can a priori foresee that the
default value would be suboptimal. These issues are further discussed in Section 3, together
with the analysis of the other two parameters of the method.
2.3. Computational requirements
We compare Extra-Trees with CART, Tree Bagging and Random Forests. In this comparison,
we have used unpruned CART trees.³ We also use the same default settings to run Random Forests as for Extra-Trees (K = √n or K = n), so as to put them in similar conditions from
the computational point of view. Notice that we dropped the Random Subspace method for
this comparison, since its computational requirements are systematically larger than those
of Random Forests under identical conditions.⁴
Tables 3 and 4 provide respectively tree complexities (number of leaves of a tree or of an ensemble of trees) and CPU times (in msec)⁵ required by the tree growing phase, averaged over the 10 or 50 runs for each dataset. We report in the left part of these tables results for classification problems, and in the right part those for regression problems. Notice that, because in regression the default value of K is equal to n, the Random Forests method degenerates into Tree Bagging; so their results are merged in one column.
Regarding complexity, Table 3 shows that Extra-Trees are between 1.5 and 3 times larger
in terms of number of leaves than Random Forests. The average ratio is 2.69 over the
classification problems and 1.67 in regression problems. However, in terms of average tree
depth, this increase in complexity is much smaller due to the exponential growth of the
number of leaves with tree depth. Thus, Extra-Trees are on the average no more than two
levels deeper than those produced by Tree Bagging or Random Forests. For example in our
trials, the most complex trees were obtained on the Census-16H dataset, with an average
depth of 11 for Tree Bagging and 12 for Extra-Trees. Thus, from a practical point of view,
the increase in complexity is detrimental only in terms of memory requirements.
Regarding computing times, Table 4 shows that Extra-Trees training is systematically
faster than that of Random Forests and Tree Bagging. In classification problems, the average
³ Pruning by ten-fold cross-validation only slightly affects accuracy (see Table 8), but leads to learning times about ten times higher than for unpruned trees.
⁴ This is due to the fact that the only difference between these methods is that Random Forests use bootstrap replicas, which leads to smaller trees (about 30%) and faster tree growing and testing times. Note also that bootstrap sampling could be combined with our Extra-Trees and lead to similar improvements. However, we have found that it deteriorates accuracy, often significantly (see Appendix D).
⁵ The system is implemented in C under Linux and runs on a Pentium 4 2.4GHz with 1GB of main memory. In our experiments, all data and models are stored in main memory.
Table 3 Model complexities (total number of leaves, ensembles of M = 100 trees)

Classification problems                          Regression problems
Dataset      ST     TB      RFd     ETd          Dataset          ST     TB/RFd   ETd
Waveform 38 2661 3554 10782 Friedman1 118 7412 11756
Two-Norm 26 1860 2519 7885 Housing 180 11590 18831
Ring-Norm 23 1666 2237 7823 Hwang-f5 773 48720 78438
Vehicle 123 8880 11699 29812 Hwang-f5n 829 51683 79524
Vowel 127 10505 13832 33769 Pumadyn-32fh 755 48043 78503
Segment 62 5083 8071 24051 Pumadyn-32nm 754 47794 78022
Spambase 202 14843 21174 51155 Abalone 1189 76533 129016
Satellite 371 26571 34443 83659 Ailerons 1786 116242 205939
Pendigits 248 19783 29244 76698 Elevators 1946 124346 208356
Dig44 823 59999 81110 239124 Poletelecomm 434 29488 57342
Letter 1385 104652 144593 278984 Bank-32nh 1313 83385 139866
Isolet 171 12593 21072 48959 Census-16H 2944 187543 320074
ratio over the twelve datasets is 0.36 in favor of Extra-Trees with respect to Random Forests,
and they are on the average 10 times faster than Tree Bagging. In regression problems, the
average ratio is 0.81 (with respect to both methods).
Notice that our implementations of Tree Bagging and Random Forests pre-sort the learning
sample before growing all trees to avoid having to re-sort it each time a node is split. On our
problems, this pre-sorting reduced the average computing times of these methods by a factor of two. Our implementation of Extra-Trees, on the other hand, does not use pre-sorting, which
is a further advantage in the case of very large problems, where it may not be possible to
keep in memory a sorted copy of the learning sample for each candidate attribute. Actually,
strictly speaking, since pre-sorting requires on the order of n N log N operations, it makes the computational complexity of Random Forests depend linearly on the number of attributes, even when K = √n. Hence, for very large numbers of attributes the computational advantage
of Extra-Trees is even higher. This is observed on the largest dataset (Isolet), where they are
more than ten times faster than Random Forests.
Table 4 Computing times (msec) of training (ensembles of M = 100 trees)

Classification problems                          Regression problems
Dataset      ST     TB      RFd     ETd          Dataset          ST     TB/RFd   ETd
Waveform 68 4022 1106 277 Friedman1 7 372 284
Two-Norm 66 3680 830 196 Housing 12 685 601
Ring-Norm 101 4977 1219 251 Hwang-f5 16 917 742
Vehicle 80 5126 1500 685 Hwang-f5n 15 948 748
Vowel 236 14445 4904 694 Pumadyn-32fh 251 13734 9046
Segment 291 18793 5099 1053 Pumadyn-32nm 221 11850 9318
Spambase 822 55604 9887 8484 Abalone 73 4237 3961
Satellite 687 45096 11035 5021 Ailerons 495 29677 26572
Pendigits 516 34449 12080 5183 Elevators 289 15958 13289
Dig44 4111 259776 67286 9494 Poletelecomm 497 28342 26576
Letter 665 44222 17041 14923 Bank-32nh 613 34402 20178
Isolet 37706 2201246 126480 11469 Census-16H 597 35207 27900
3. On the effect of parameters
This section analyzes and discusses the effect of the three parameters K, nmin and M on the
Extra-Trees.
3.1. Attribute selection strength K
The parameter K denotes the number of random splits screened at each node to develop Extra-Trees. It may be chosen in the interval [1, ..., n], where n is the number of attributes. For a given problem, the smaller K is, the stronger the randomization of the trees and the weaker the dependence of their structure on the output values of the learning sample. In the extreme, for K = 1, the splits (attributes and cut-points) are chosen in a totally independent way of the output variable (we therefore use the term totally randomized trees to denote this variant). On the other extreme, when K = n, the attribute choice is not explicitly randomized anymore, and the randomization effect acts only through the choice of cut-points.
In order to see how this parameter influences accuracy, and to support our default settings, we have carried out a systematic experiment on all our datasets, varying the parameter over its range. Figure 2 shows the evolution of the error of Extra-Trees with respect to the value of K, top on classification problems, bottom on regression problems. For classification problems, the default value of K = √n is shown as a vertical line on the graphs; for regression the default corresponds to the highest possible value of K (i.e., K = n).
Let us first discuss the results for classification problems, given in the upper part of Fig. 2. We see that there are three types of trends: monotonically decreasing (Vehicle, Satellite, Pendigits), monotonically increasing (only Two-Norm), and decreasing followed by increasing (the other 8 problems). For the first two categories, the default setting K = √n is obviously suboptimal.
Actually, the analysis of the Two-Norm problem leads to a better understanding of the
method. This problem is characterized by strong symmetry over the attributes, and its Bayes
optimal decision surface is linear and invariant with respect to permutations of the attributes.
This invariance is also obtained in the Extra-Trees (approximately for small values of M,
and exactly for very large values of M) provided that we use K = 1. As soon as K > 1, the
method starts fitting the attribute choices to the learning sample, which increases variance
(without affecting bias) and hence error rates. We can generalize this finding by saying that
problem symmetries should be reflected by the splitting procedure because this allows variance to be reduced without increasing bias.
We conjecture that a high percentage of irrelevant variables yields the opposite behavior,
namely a monotonic decrease of the error rate with increasing K. Clearly, in such situations
using a higher value of K leads to a better chance of filtering out the irrelevant variables, which
then leads to a significant reduction of bias over-compensating the increase of variance. In the
more generic intermediate situations, where attributes are of variable importance, we obtain
the non-monotonic behavior. In this case the default setting of K generally does a good job.
To check this hypothesis, we carried out an experiment introducing irrelevant variables on
the Two-Norm and Ring-Norm problems. Figure 3 compares on these problems the original trajectory of the error with K with the same curve obtained when we add as many irrelevant attributes as original attributes.⁶ We see that on the Two-Norm problem the loss of symmetry
⁶ The irrelevant attribute values were drawn from N(0, 1) distributions.
Fig. 2 Evolution of the error of Extra-Trees with K, on 12 classification and 12 regression problems.
Fig. 3 Effect of irrelevant attributes on the evolution of the error of Extra-Trees with K.
Fig. 4 Evolution of the error of ETd with nmin for different levels of output noise.
indeed leads to the loss of monotonicity. On the other hand, the optimal value of K increases from 4 to 7 on the Ring-Norm problem, while the default value (√n, rounded) increases from 4 to 6.
Considering the results shown for the regression problems in Fig. 2, we find that, with the exception of Abalone, the behavior is monotonically decreasing, which justifies the default setting of K = n on these problems. Abalone is a rather untypical regression problem, since its output is discrete (integer valued). Note that if we had treated this problem as a classification problem, we would have used the default value of K = √8 ≈ 3, which is also the optimal value found when it is treated as a regression problem (see Fig. 2).
3.2. Smoothing strength nmin
The second parameter of the Extra-Trees method is the number nmin of samples required
for splitting a node. Larger values of nmin lead to smaller trees, higher bias and smaller
variance, and its optimal value hence depends in principle on the level of output noise in
the dataset. To assess the robustness of its default values, we have tried out different ones
on all 24 datasets. This did not yield a significant improvement on any classification prob-
lem⁷, while on two regression problems, nmin = 2 was better than the default of 5: on Hwang-f5, which is deterministic by construction, the error decreases from 1.62 to 1.13, and on Housing it decreases from 9.68 to 9.09. Conversely, on two other regression problems stronger smoothing was better: using nmin = 10 decreased the error on Hwang-f5n (the noisy version of this problem) from 7.50 to 7.15 and on Abalone from 4.69 to 4.56. Therefore, although possibly slightly suboptimal, the default values of nmin = 2 (classification) and nmin = 5 (regression) appear to be robust choices in a broad range of typical
conditions.
Clearly, in very noisy problems, ensembles of fully grown trees will over-fit the data. In
order to highlight this in the context of Extra-Trees, we report an experiment where we have added noise on the output values in the training set, and used ETd with increasing values
of nmin. The results obtained for Hwang-f5 (an originally noise-free regression problem)
and for Letter (classification) are shown in Fig. 4, where in addition to the curve corre-
sponding to the original dataset, we show for each problem two curves where additional
⁷ According to the corrected t-test with a significance level of 95%.
noise was superposed on the output variable: for the classification problem, we have ran-
domly flipped the class of learning samples (respectively in 10% and 20% of the cases) and,
for the regression problem, we have superposed Gaussian noise (with a standard deviation
respectively of 50% and 100% of that of the output variable), while errors are computed
using the original test samples (i.e. without additional noise). These results clearly illus-
trate that the noisier the output, the higher the optimal value of nmin, and they also show
the robustness of the method to high noise conditions, provided that the value of nmin is
increased.
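A sketch of this type of experiment for a regression dataset, assuming scikit-learn and placeholder arrays X and y: Gaussian noise is superposed on the training outputs only, min_samples_split (the counterpart of nmin) is swept, and errors are always measured against the noise-free test outputs.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)   # X, y: placeholder arrays
rng = np.random.default_rng(0)

for noise_frac in (0.0, 0.5, 1.0):                  # noise std as a fraction of std(y)
    y_noisy = y_tr + rng.normal(0.0, noise_frac * y_tr.std(), size=len(y_tr))
    for n_min in (2, 5, 10, 20, 50):
        et = ExtraTreesRegressor(n_estimators=100, max_features=None,
                                 min_samples_split=n_min, bootstrap=False)
        et.fit(X_tr, y_noisy)
        # test error is always measured against the noise-free test outputs
        print(noise_frac, n_min, mean_squared_error(y_te, et.predict(X_te)))
```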
Similar experiments with the other datasets confirm this analysis, and show that the
default values of nmin are often robust with respect to a moderate increase in the output
noise. On the other hand, we note that from a purely theoretical point of view, one could
ensure consistency (i.e., convergence to the Bayes optimal model with increasing values of the sample size N) of the Extra-Trees by letting nmin grow slowly with N (e.g., nmin ∼ √N). In
this respect, Extra-Trees are not different from other tree-based methods and the proofs of
consistency given in (Breiman et al., 1984) still hold.
3.3. Averaging strength M
The parameter M denotes the number of trees in the ensemble. It is well known that for randomization methods the prediction error is a monotonically decreasing function of M (see, e.g., Breiman, 2001). So in principle, the higher the value of M, the better from the accuracy point of view. The choice of an appropriate value will thus essentially depend on the resulting compromise between computational requirements and accuracy. Different randomization methods may have different convergence profiles on different problems, depending also on the sample size and other parameter settings (e.g., K and nmin). So not much more can be said in general about this value.
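A convergence profile of this kind can be traced by growing the ensemble incrementally and recording the test error after each batch of trees. The sketch below uses scikit-learn's warm_start mechanism for this purpose; the data arrays are assumed to be given.

```python
from sklearn.ensemble import ExtraTreesClassifier

# Grow the ensemble incrementally and record the test error after each batch of trees;
# X_train, y_train, X_test, y_test are assumed to be given.
et = ExtraTreesClassifier(n_estimators=0, warm_start=True,
                          max_features="sqrt", min_samples_split=2, bootstrap=False)
errors = []
for m in range(10, 101, 10):
    et.set_params(n_estimators=m)                   # adds 10 new trees, keeps the old ones
    et.fit(X_train, y_train)
    errors.append((m, 1.0 - et.score(X_test, y_test)))
print(errors)
```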
For the sake of illustration, the top of Fig. 5 shows, on a classification and on a regression problem, the evolution of the error when increasing the number of trees. To better illustrate the speed of convergence, the bottom of the same figure shows on the same problems the evolution with M of the ratio between the error with M trees and the error with 100 trees. In classification, we compare Bagging, Extra-Trees (with the default setting and with K = 1), and Random Forests (also with K = √n). In regression, we compare Bagging and Extra-Trees with K = n and K = 1. The straight lines in the top of Fig. 5 represent the error of one single pruned tree.
These curves are typical of what we observed on most problems. In classification, the
convergence of Extra-Trees with the default setting is slower than the convergence of Tree
Bagging and of Random Forests (to a lesser extent). However, Extra-Trees quickly outperform these two methods. In regression, RFd degenerates into Tree Bagging and the speed
of convergence of this latter method is now indistinguishable from that of Extra-Trees with
default setting.
As concerns totally randomized trees (ET1), we note that their speed of convergence in
regression is comparable to that of the other ensemble methods. In classification, however,
they converge more slowly. On the considered problem, they outperform Tree Bagging only
after M = 40 models. This suggests that K has indeed an influence on the number of trees that need to be aggregated to ensure convergence, but we found that this influence is notable only for very small values of K and only for classification problems.
Fig. 5 Top, evolution of the error with the number M of trees in the ensemble; bottom, evolution of the ratio between the error for a given M and the error for 100 trees for the same method.
4. Bias/variance analysis
In this section, we analyse the bias/variance characteristics of the Extra-Trees algorithm and
compare them with those of the other tree-based methods. In order to make the paper self-
contained, Appendix E provides a theoretical analysis of the bias/variance decomposition
of randomization methods. Before turning to the simulation results, we summarize the main
findings of this analysis as follows:
– randomization increases bias and variance of individual trees, but may decrease their variance with respect to the learning sample;
– the part of the variance due to randomization can be canceled out by averaging over a sufficiently large ensemble of trees;
– overall, the bias/variance tradeoff is different in regression than in classification problems; in particular, classification problems can tolerate high levels of bias of class probability estimates without yielding high classification error rates.
4.1. Experiments and protocols
We have chosen three classification (Waveform, Two-Norm, Ring-Norm) and three re-
gression problems (Friedman1, Pumadyn-32nm, Census-16H) corresponding to the larger
datasets. To estimate bias and variance (see, e.g., Bauer and Kohavi, 1999), each dataset is
split into two samples: a pool sample (PS) and a test sample (TS). Then, 100 models are
learned from 100 learning samples (LS) randomly drawn (with replacement) from the pool.
Finally, bias, variance, and average errors are estimated on the test sample by means of these
100 models. The sample sizes used in our experiments are given in Table 5.
Table 5 Sample sizes for bias/variance analysis

Dataset          PS size   TS size   LS size
Waveform 4000 1000 300
Two-Norm 8000 2000 300
Ring-Norm 8000 2000 300
Friedman1 8000 2000 300
Pumadyn-32nm 6192 2000 300
Census-16H 15000 7784 2000
In regression, since the Bayes model is unknown on some problems in our experiments,
we call bias the sum of the true bias and the residual error. This sum represents the true error of
the average model. In the case of classification, we will provide bias/variance decompositions
of the average square-error of conditional class probability estimates, together with average
error-rates.
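A sketch of this estimation protocol for the regression case, with the convention above that "bias" absorbs the residual error: models are learned on samples drawn with replacement from the pool, and their test-sample predictions are decomposed into the error of the average model plus the variance around it. The array names and sizes are placeholders, and scikit-learn's Extra-Trees implementation stands in for the authors' own code.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def bias_variance(X_pool, y_pool, X_test, y_test, ls_size=300, n_models=100):
    """Estimate (bias^2 + residual error) and variance of an ET regressor."""
    rng = np.random.default_rng(0)
    preds = np.empty((n_models, len(X_test)))
    for m in range(n_models):
        idx = rng.choice(len(X_pool), size=ls_size, replace=True)   # LS drawn from the pool
        model = ExtraTreesRegressor(n_estimators=100, max_features=None,
                                    min_samples_split=5, bootstrap=False)
        model.fit(X_pool[idx], y_pool[idx])
        preds[m] = model.predict(X_test)
    avg_pred = preds.mean(axis=0)
    bias_plus_residual = float(np.mean((avg_pred - y_test) ** 2))   # error of the average model
    variance = float(preds.var(axis=0).mean())                      # spread across learning samples
    return bias_plus_residual, variance
```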
In order to ease the comparison with the other methods, we discuss in this section results for the ET∗ and ET1 variants of Extra-Trees, rather than ETd. Notice, however, that on regression problems ETd provides identical results to ET∗, while on classification problems it has a slightly higher variance and lower bias.
4.2. Comparison of the different randomization methods
Figure 6 shows (on regression problems in the upper part and on classification problems in the lower part) the bias/variance decomposition for different learning algorithms. Error, bias, and variance are expressed in percent on classification problems and scaled according to the factor of Table 7 on regression problems. We observe the following:

Fig. 6 Bias/variance decomposition for different algorithms.

– The variance of single trees (ST) is very high with respect to their bias, the latter including the residual error. All ensemble methods increase the bias and decrease the variance with respect to single trees, as predicted by the analysis of Appendix E.
– Among ensemble methods, Extra-Trees (ET∗ and ET1) reduce variance most strongly, but they also increase the bias more strongly than the other ensemble methods. Overall, ET∗
provides the best tradeoff between bias and variance. Its variance reduction with respect to
ST is very impressive in all cases (95% on the average), while its bias increase is moderate
(21% on the average). We also observe that RS∗ has a lower bias than RF∗, which is due
to the fact that it uses the full learning sample rather than a bootstrap replica to build trees.
– Comparing the two ET variants, we observe on most problems that ET1 has smaller variance and a higher bias than ET∗, and that the increase of bias of ET1 is more strongly marked in regression. A notable exception is the Two-Norm problem, where ET1 is identical to ET∗, which confirms the analysis of Section 3.1 concerning the impact of the symmetric nature of this problem on bias and variance.
– The strongest increase of the bias of ET1 is observed on Friedman1 and Pumadyn-32nm. Actually, these two problems have a large proportion of irrelevant attributes (5 out of 10 on Friedman1, 30 out of 32 on Pumadyn-32nm). The effect of removing them is analyzed in Section 4.4.
– On classification problems, ET∗ provides the best results in terms of error rate, although in terms of “bias+variance” of probability estimates it is sometimes inferior to RS∗ and/or RF∗. Note that this remains true for the ETd variant.
– On all classification problems, ET1 provides smaller error rates than Tree Bagging, even
though its “bias+variance” in terms of class probability estimates is sometimes significantly
higher (on Ring-Norm, and to a lesser extent, on Waveform). This is due to the fact that
in classification problems higher bias of probability estimates does not necessarily imply
higher error rates, as discussed in Appendix E.
4.3. Bias/variance tradeoff with K
Figure 7 shows the evolution with K of bias and variance of the mean square-error of Extra-Trees, left on the Friedman1 and right on the Waveform dataset. For the Waveform dataset, we actually refer to the mean square-error of probability estimates. We observe that bias monotonically decreases and variance increases when K increases, and that from this point
of view there is no qualitative difference between classification and regression problems.
In classification (Waveform), we see from the curve labeled “error rate” that the optimum
error rate corresponds to a much higher degree of randomization (i.e., a much smaller value
of K) than the optimum of the square-error of probability estimates. Notice that this effect is
also observed on other datasets, because the misclassification error is (intrinsically) more tol-
erant to an increase of bias than the regression error. One can therefore take better advantage in classification of the decrease of variance resulting from a stronger tree randomization.
Fig. 7 Evolution of bias and variance with K, left on Friedman1, right on Waveform.
Fig. 8 Bias/variance on regression problems with irrelevant attributes removed.
This explains why Extra-Trees and totally randomized trees (and also Random Forests) are
usually significantly better than Tree Bagging on classification problems while they are not
on regression problems (see detailed results in Table 8, Appendix D).
4.4. Totally randomized trees and irrelevant variables
While the decrease of variance brought by totally randomized trees is not surprising, we
would like to explain why bias increases so much on some problems. Since, with this method,
attributes are selected randomly without looking at their relationship with the output, if there
is a (locally or globally) irrelevant attribute, it will nevertheless be selected with the same
probability as a relevant one. By averaging, the effect of irrelevant attributes on the ensemble
prediction will be canceled, but this will have the side effect of increasing bias. Actually,
splitting on an irrelevant attribute is tantamount to randomly splitting the learning sample.
Thus, irrelevant attributes act virtually as a reduction of the sample size, which results in
increased bias and variance of individual trees. While the variance increase is compensated
by tree averaging, the increase in bias is not.8
This analysis is supported by the fact that the two regression problems that present
the highest increase in bias of ET1 (Friedman1 and Pumadyn-32nm) both contain a high
proportion of irrelevant attributes. Indeed, by construction (Friedman, 1991), Friedman1
contains five totally irrelevant attributes among the 10 and, in Pumadyn-32nm, 2 attributes
(out of 32) contain over 95% of the information about the output.⁹ Fig. 8 shows, for single
unpruned trees, Tree Bagging, and totally randomized trees, the effect of removing the
irrelevant attributes on bias and variance. We see that variance is mostly unaffected for
ensemble methods, while it is reduced for single trees. For all three methods bias is reduced,
but this occurs in a much stronger way for the totally randomized trees. For example, by
removing the irrelevant attributes from the Pumadyn-32nm problem, the bias of the ensemble
of 100 totally randomized trees has dropped from 82.44 to 9.83, while variance has increased
only slightly from 0.94 to 1.40. On the Friedman1 problem, bias drops from 12.52 to 6.25
and variance increases from 0.31 to 0.34.
⁸ For example, on the Pumadyn-32nm problem, the bias and variance of a single totally randomized tree are respectively 82.67 and 40.11, while for an ensemble of 100 such trees variance drops to 0.94 and bias remains unchanged up to the first decimal. Similarly, on the Friedman1 dataset, the bias of a single totally randomized tree is 13.02 and its variance is 12.38, while for an ensemble of 100 such trees variance drops to 0.31, and bias slightly decreases to 12.52.
⁹ This was found by computing attribute importance from a set of trees obtained by Tree Bagging. Attribute importance was computed according to the algorithm described in (Hastie et al., 2001).
Fig. 9 Bias/variance on classification problems with irrelevant attributes added.
On the three classification problems of Fig. 6all attributes are important for determining
the output and, hence, the increase of bias of totally randomized trees is less important. To
further illustrate this behavior, we have reproduced the experiment of Section 3.1 where we
have added as many irrelevant attributes¹⁰ as original attributes on the three classification
problems. The results are shown on Fig. 9. We observe that with the irrelevant attributes, the
increase of bias of ET1 with respect to single trees is much more important. Note also that on
Two-Norm the totally randomized trees nevertheless remain competitive with Tree Bagging
in terms of error-rate, which is less strongly affected on this problem than the square-error
of class probabilities.
5. Model characterizations
To provide further insight, we analyze in this section the Extra-Trees algorithm in terms of
properties of the models it induces, from two different points of view. First, we provide a
geometrical characterization of the models output by the Extra-Trees algorithm. Then, we
show that these models can be considered as kernel-based models.
5.1. Geometrical point of view
To illustrate the geometrical properties of fully developed Extra-Trees (nmin = 2), we
show on Fig. 10 their models obtained for a simple one-dimensional noise-free regression
problem, together with those of Tree Bagging under identical conditions. The models are
obtained with a specific sample depicted on the figure, which was obtained by drawing
20 points uniformly in the unit interval. The figure shows also the true function behind this
sample. In the left part, it gives models for ensembles of M=100 trees and in the right
part for M=1000. These graphics illustrate the fact that Extra-Trees produce models which
appear to be piecewise linear in the limit of M→∞, and are much smoother than those of
Tree Bagging.
¹⁰ Irrelevant attribute values are drawn from N(0, 1) distributions.

Fig. 10 Tree Bagging and fully developed Extra-Trees (nmin = 2) on a one-dimensional piecewise linear problem (N = 20). Left with M = 100 trees, right with M = 1000 trees.

One can show that in general, for an n-dimensional input space and $n_{\min} \geq 2$, infinite ensembles of Extra-Trees produce a continuous piecewise multi-linear approximation of the sample. To make this explicit, let us consider a learning sample of size N,
\[
ls_N = \{(x^i, y^i) : i = 1, \ldots, N\},
\]
where each $x^i = (x^i_1, \ldots, x^i_n)$ is an attribute vector of dimension n and $y^i$ is the corresponding output value, and let us denote by $(x^{(1)}_j, \ldots, x^{(N)}_j)$ the sample values of the j-th attribute taken by increasing order. For notational simplicity let us define
\[
x^{(0)}_j = -\infty \quad \text{and} \quad x^{(N+1)}_j = +\infty, \quad \forall j = 1, \ldots, n,
\]
and denote, $\forall (i_1, \ldots, i_n) \in \{0, \ldots, N\}^n$, by $I_{(i_1,\ldots,i_n)}(x)$ the characteristic function of the hyper-interval
\[
[x^{(i_1)}_1, x^{(i_1+1)}_1[ \; \times \cdots \times \; [x^{(i_n)}_n, x^{(i_n+1)}_n[.
\]
With these notations, one can show¹¹ that an infinite ensemble of Extra-Trees provides an approximation in the form of
\[
\hat{y}(x) = \sum_{i_1=0}^{N} \cdots \sum_{i_n=0}^{N} I_{(i_1,\ldots,i_n)}(x) \sum_{X \subset \{x_1,\ldots,x_n\}} \lambda^{X}_{(i_1,\ldots,i_n)} \prod_{x_j \in X} x_j, \tag{1}
\]
where the real-valued parameters $\lambda^{X}_{(i_1,\ldots,i_n)}$ depend on the sample inputs $x^i$ and outputs $y^i$ as well as on the parameters $n_{\min}$ and K of the method.

In the particular case of fully developed trees ($n_{\min} = 2$) they are such that
\[
\hat{y}(x^i) = y^i, \quad \forall (x^i, y^i) \in ls, \tag{2}
\]
and if the input space is one-dimensional ($n = 1$, and $x = (x_1)$), the model degenerates into a piecewise linear model
\[
\hat{y}(x) = \sum_{i_1=0}^{N} I_{(i_1)}(x) \sum_{X \subset \{x_1\}} \lambda^{X}_{(i_1)} \prod_{x_j \in X} x_j = \sum_{i=0}^{N} I_{(i)}(x_1)\left(\lambda^{\emptyset}_{i} + \lambda^{\{x_1\}}_{i} x_1\right), \tag{3}
\]
where $I_{(i)}(x_1)$ denotes the characteristic function of the interval $[x^{(i)}_1, x^{(i+1)}_1[$. The values of $\lambda^{\emptyset}_{i}$ and $\lambda^{\{x_1\}}_{i}$ may be derived directly from the N equations (2), N continuity constraints, and constraints imposing a constant model over the intervals $[x^{(i)}_1, x^{(i+1)}_1]$, $i \in \{0, N\}$.

¹¹ The proof is a straightforward adaptation of the proofs given in (Zhao, 2000). See Appendix F.
Extremely and totally randomized tree ensembles hence provide an interpolation of any output variable which, for finite M, is piecewise constant (and hence non-smooth), and for M → ∞ becomes piecewise multi-linear and continuous. This is in contrast with other tree-based ensemble methods whose models remain piecewise constant even for M → ∞. From a bias/variance viewpoint, the continuous nature of the model translates into smaller variance and bias in the regions where the target function is smooth and hence leads to more accurate models in such regions.
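This smoothing effect is easy to reproduce numerically. The sketch below, in the spirit of Fig. 10 but not a reproduction of it, fits fully developed Extra-Trees and bagged trees (scikit-learn assumed) on a small one-dimensional sample of an arbitrary piecewise linear target and compares both to the linear interpolation of the sample on a fine grid.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)       # 20 points in the unit interval
y = np.clip(3 * x.ravel() - 1, 0, 1)                    # an arbitrary piecewise linear target

grid = np.linspace(0, 1, 1000).reshape(-1, 1)
et = ExtraTreesRegressor(n_estimators=1000, max_features=None,
                         min_samples_split=2, bootstrap=False).fit(x, y)
tb = BaggingRegressor(DecisionTreeRegressor(), n_estimators=1000).fit(x, y)

# With many trees, the ET curve approaches a continuous piecewise linear interpolation
# of the sample, while the bagged-tree curve keeps visible staircase artifacts.
linear_interp = np.interp(grid.ravel(), x.ravel(), y)
print("ET max deviation from linear interpolation:", np.abs(et.predict(grid) - linear_interp).max())
print("TB max deviation from linear interpolation:", np.abs(tb.predict(grid) - linear_interp).max())
```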
5.2. Kernel point of view
For the sake of simplicity, we now particularize our discussion to regression trees.12 Let us denote by ls_N = {(x^i, y^i): i = 1, ..., N} a learning sample of size N, by t a tree structure derived from it comprising l_t leaves, by l_{t,i}(x) the characteristic function of the ith leaf of t, by n_{t,i} the number of learning samples such that l_{t,i}(x) = 1, and by
\[
l_t(x) = \left( \frac{l_{t,1}(x)}{n_{t,1}}, \ldots, \frac{l_{t,l_t}(x)}{n_{t,l_t}} \right)^{T} \qquad (4)
\]
the vector of (normalized) characteristic functions of t. Then the model defined by the tree t can be computed by the equation
\[
\hat{y}_t(x) = \sum_{i=1}^{N} y^i\, l_t^{T}(x^i)\, l_t(x), \qquad (5)
\]
which shows that tree-based models are kernel-based models, where the kernel
\[
K_t(x, x') = l_t^{T}(x)\, l_t(x') \qquad (6)
\]
is the scalar product over a feature space defined by the (normalized) characteristic functions of the leaf nodes of the tree. Furthermore, the kernel and model defined by an ensemble T = {t_i: i = 1, ..., M} of M trees are straightforwardly obtained by
\[
K_T(x, x') = M^{-1} \sum_{i=1}^{M} K_{t_i}(x, x'), \qquad (7)
\]
and
\[
\hat{y}_T(x) = M^{-1} \sum_{i=1}^{M} \hat{y}_{t_i}(x) = \sum_{i=1}^{N} y^i\, K_T(x^i, x). \qquad (8)
\]
Alternatively, we can construct a feature vector of \(\sum_{i=1}^{M} l_{t_i}\) components by
\[
l_T(x) = \left( \frac{l_{t_1}^{T}(x)}{\sqrt{M}}, \ldots, \frac{l_{t_M}^{T}(x)}{\sqrt{M}} \right)^{T} \qquad (9)
\]
and compute the ensemble kernel by
\[
K_T(x, x') = l_T^{T}(x)\, l_T(x'). \qquad (10)
\]
12 The extension to the estimation of conditional class-probabilities is obtained by considering it as the regression of a vector of class-indicator variables.
The feature vector l_t(x) induced by a single tree structure is sparse in the sense that, ∀x ∈ X, only one feature is different from zero. For an ensemble of M trees, the feature vector is also sparse, in the sense that ∀x ∈ X only M features are non-zero.
In the particular case of ensembles of fully developed trees, we have l_{t_i} = N, ∀i = 1, ..., M and n_{t_i,j} = 1, ∀i = 1, ..., M, ∀j = 1, ..., N, and the ensemble feature space is of dimensionality MN. In this case the tree ensemble interpolates the learning sample, namely
\[
\hat{y}_T(x^i) = y^i, \quad \forall (x^i, y^i) \in ls. \qquad (11)
\]
In general, with n_min ≥ 2, it provides a bounded approximation of the sample, namely
\[
\min_i y^i \le \hat{y}_T(x) \le \max_i y^i, \quad \forall x \in X. \qquad (12)
\]
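The kernel view of Eqs. (6)-(8) is easy to reproduce numerically. The sketch below is an illustration of ours, not the authors' code; it assumes scikit-learn's ExtraTreesRegressor, whose apply() method returns the leaf index of each sample in each tree, and builds K_T from leaf co-occurrences before checking that the weighted sum of Eq. (8) reproduces the ensemble predictions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(1)
X = rng.uniform(-1, 1, size=(50, 3))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]

et = ExtraTreesRegressor(n_estimators=200, min_samples_split=2,
                         bootstrap=False, random_state=1).fit(X, y)

def ensemble_kernel(model, A, B, X_train):
    """K_T(a, b) = M^-1 sum_t [a, b in same leaf of t] / n_leaf, cf. Eqs. (4)-(7)."""
    leaves_A = model.apply(A)          # shape (len(A), M): leaf index per tree
    leaves_B = model.apply(B)
    leaves_tr = model.apply(X_train)
    n_trees = len(model.estimators_)
    K = np.zeros((len(A), len(B)))
    for t in range(n_trees):
        # Leaf sizes n_{t,i}, counted on the training sample.
        counts = np.bincount(leaves_tr[:, t])
        same = leaves_A[:, t][:, None] == leaves_B[:, t][None, :]
        K += same / counts[leaves_A[:, t]][:, None]
    return K / n_trees

X_test = rng.uniform(-1, 1, size=(5, 3))
K = ensemble_kernel(et, X, X_test, X)          # K_T(x^i, x), Eq. (7)
print(np.allclose(y @ K, et.predict(X_test)))  # Eq. (8): should print True
```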
With totally randomized trees, the kernel K_T(x, x') is independent of the output values y^i of the learning sample. Hence, we can view an ensemble of totally randomized tree structures as an ensemble of (randomized) metrics which allow one to interpolate or approximate output values from learning samples using their attribute values. It is clear that these metrics are invariant with respect to linear transformations of the coordinate axes (re-scaling).
For finite M, the ensemble kernel K_T(x, x') is obviously piecewise constant. However, the kernel corresponding to an infinite ensemble of Extra-Trees (M → ∞) is continuous and piecewise multi-linear with respect to both x and x' (see Appendix F). Breiman (2000b) shows, under the assumptions of a uniform prior distribution P(x), infinite sample size (N → ∞), and infinite ensembles of totally randomized trees with a fixed number l of leaves, that the kernel is approximately given by13
\[
K_T(x, x') \approx \exp\{-\lambda |x - x'|_1\}, \qquad (13)
\]
13 This exponential form, in apparent contradiction with the piecewise multi-linear form found earlier, is due to the infinite sample size hypothesis, which causes the number of hyper-intervals of (multi-)linearity to grow to infinity and thus results in a non-piecewise and nonlinear function.
where |x − x'|_1 denotes the city-block distance. In this expression λ denotes the "sharpness" of the kernel and is defined by
\[
\lambda = \frac{\log l}{n}, \qquad (14)
\]
where n is the dimension of the input space. Notice that for balanced trees built from a finite sample of size N, the number of leaves l is on the order of N/n_min, which suggests that a high dimension of the input space has a very strong smoothing effect, actually much stronger than a high value of n_min.
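A rough numerical illustration of this effect (the figures below are ours, chosen only for illustration, and take natural logarithms in Eq. (14)): for balanced trees grown from N = 1000 samples with n_min = 2, one gets l on the order of 500 leaves, so
\[
\lambda \approx \frac{\log 500}{n} \approx \frac{6.2}{n}.
\]
In dimension n = 2 the kernel (13) then decays by a factor e over a city-block distance of about 0.3, whereas in dimension n = 20 the same decay requires a distance of about 3.2; the kernel thus spreads out, and smooths, far more strongly in high dimension than any reasonable increase of n_min could achieve.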
Along a slightly different line of reasoning, Lin and Jeon (2002) show that if the number of samples n_{t_i,j} in each terminal node of all trees is kept constant and equal to k as N → ∞, then the number of samples which could influence the prediction of an ensemble of trees at some point14 is on the order of
\[
k\, (\log N)^{(n-1)}. \qquad (15)
\]
This implies, in good agreement with the results of (Breiman, 2000b), that the averaging effect in ensembles of regression trees could grow exponentially faster with the number of dimensions of the input space than with the number of samples k kept in the leaves. These results may explain why for most of our high-dimensional (and sometimes very noisy) regression problems, we could not improve accuracies when increasing the value of n_min.

14 They call them the potential k nearest neighbors of this point.
The effect of increasing the value of K (attribute selection strength) in the Extra-Trees method is to make the kernel become sharper along those input directions along which the output variable varies more strongly and less sharp in the other directions, and hence to reduce (locally or globally) the dimension of the space over which the kernel actually operates. This may explain why in regression problems it is often better to use higher values of K in favor of an implicit reduction of the input-space dimensionality, which can reduce the over-smoothing effect of the curse of dimensionality in the presence of irrelevant variables. In classification problems this effect is less marked because classification problems are much more tolerant with respect to over-smoothed (biased) class probability estimates.
6. Related work on randomized trees
Besides the Random Subspace method (Ho, 1998) and Random Forests (Breiman, 2001),
which have been used in this paper for comparison purposes, several other randomized tree
growing algorithms have been proposed in the context of ensemble methods. Although some
of them could be applied to regression problems as well, all these methods have only been
evaluated on classification problems.
For example, Ali (1995) perturbs the standard tree induction algorithm by replacing the choice of the best test by the choice of a test at random among the best ones. If S is the score of the optimal test, a test is randomly selected, with a probability proportional to its score, among the tests which have a score greater than (1 − β)S, where β is some constant between 0 and 1. Dietterich (2000) proposes a similar approach that consists in randomly selecting a test among the k best splits. Choosing β equal to 1 in Ali's method or letting k increase to infinity in Dietterich's method reduces these two algorithms to totally randomized trees.15 However, these authors did not study these extreme parameter values, nor the impact of these parameters on accuracy. In (Ali, 1995) the value of β was set to 0.75 in all experiments and in (Dietterich, 2000) the sole value of k = 20 was considered. The results shown in the present paper suggest that these two methods could reach similar accuracy as Extra-Trees by tuning their parameters appropriately. However, even in the case of strong randomization (β → 1 or k → ∞), these methods would require the score computation of all possible splits at each test node, thus losing the computational advantages of Extra-Trees.

15 Not strictly for Ali's method, since a test is randomly drawn with a probability proportional to its score.
Zheng and Webb (1998)'s Stochastic Attribute Selection Committees (SASC) are close to the Random Subspace method. At each node, the best test is searched among only a subset of the candidate attributes, where each attribute has a probability P of being selected in the subset. P plays a similar role as K in the Random Subspace method and its value has been fixed to 0.33. A study of bias and variance of this method and other boosting-based algorithms in (Webb and Zheng, 2004) shows that SASC works mainly by reducing the variance of the standard decision tree method while bias remains mainly unaffected. Furthermore, the combination of SASC with Wagging (a variant of Bagging) shows improvement with respect to SASC alone, again because of a more important reduction of variance. This is consistent with our experiments showing that randomization should be quite high in classification.
Kamath et al. (2002) randomize the tree induction method by discretizing continuous
attributes through histograms at each tree node, evaluating the score only of bin boundaries,
and then selecting a split point in some interval around the best bin boundary. In their
experiments, the number of bins was fixed at the square root of the local learning sample
size. However, when the number of bins in the histograms is equal to the local learning
sample size, this algorithm builds standard trees and when there is only one bin per attribute,
this algorithm is equivalent to Extra-Trees with K = n. The reduction of computing times was also advanced as an argument in favor of this method with respect to other ensemble methods, like Tree Bagging, that require evaluating every possible split.
Cutler and Guohua (2001) propose an algorithm for classification problems that builds
almost totally randomized trees. To split each non-terminal node in their variant, two exam-
ples are first randomly selected from different classes in the local learning sample. Then, an
attribute is selected at random and the cut-point is randomly and uniformly drawn between
the values of this attribute for the two random examples. Like our Extra-Trees, the trees are
fully grown from the original learning sample so as to perfectly classify the learning sample.
For this reason, they are called perfect random trees, or PERT for short. This method was compared to Bagging and Random Forests on several classification problems. It often gives results competitive with these two methods and also comes with an important reduction in computing times. Although PERT splits are not totally independent of the output, we believe
computing times. Although PERT splits are not totally independent of the output, we believe
that this method is very close to our totally randomized trees on classification problems, both
in terms of accuracy and computing times. Note however that this randomization scheme
does not readily apply to regression problems.
In (Geurts, 2002) and (Geurts, 2003), we have proposed a randomized tree algorithm
that, at each test node, generates random tests (without replacement of the attribute) until
finding one that realizes a score greater than some threshold. This score threshold plays a very similar role as the parameter K of the algorithm proposed here. When the score threshold is equal to 0, this method builds totally randomized trees and when it is equal to the maximal score value, it is equivalent to Extra-Trees with K = n. A comparison between
these two methods does not show any significant differences in terms of accuracy. At first,
we thought that filtering bad tests with a score threshold would improve computing times
and also facilitate the choice of a default value for the parameter. However, we have found no
evidence over many experiments that this was indeed the case. We therefore prefer the variant
proposed here, for simplicity reasons, and because its computational complexity is more
predictable.
Besides the tree world, our analysis suggests a possible way to apply the idea of extreme
randomization to other algorithms. To design a good randomization method, we should be
able to build randomized models that are good on the learning sample to keep the bias low
and then aggregate several of these models to get a low variance. This idea is especially
interesting with trees because building a perfect tree on the learning sample is trivial and
very fast. However, there may be other kinds of models where the same idea could be applied.
Herbrich et al. (2001) have proposed an algorithm that may be interpreted as the application
of this idea to support vector machines; the algorithm consists in generating and aggregating
several perfect linear models in the extended input space. These models are obtained by using
a simple perceptron learning rule and by randomizing the order in which the learning cases
are presented to the algorithm. Like our Extra-Trees, the main advantage of this algorithm is
its computational efficiency with respect to the classical support vector machine approach.
The parallel between these two approaches is certainly worth being explored.
7. Conclusion

In this paper, we have proposed an extremely randomized tree growing algorithm that combines the attribute randomization of Random Subspace with a totally random selection of the cut-point. In addition to the number M of trees generated (a common parameter of all ensemble methods), this method depends on one main parameter, called K, that controls the strength of the attribute randomization, and on a secondary parameter, called n_min, that controls the degree of smoothing. The analysis of the algorithm and the determination of the optimal value of K on several test problem variants have shown that this value is in principle dependent on problem specifics, in particular the proportion of irrelevant attributes. Nevertheless, our empirical validations have shown that the default values for K are near-optimal on 22 out of 24 diverse datasets, and only slightly suboptimal on the two others. They also yield competitive results with respect to state-of-the-art randomization methods, in terms of accuracy and computational efficiency.
This empirical validation was completed by a bias/variance analysis of the Extra-Trees
algorithm and a geometrical and a kernel characterization of its models. The bias/variance
analysis has shown that Extra-Trees work by decreasing variance while at the same time
increasing bias. Once the randomization level is properly adjusted, the variance almost
vanishes while bias only slightly increases with respect to standard trees. When the ran-
domization is increased above the optimal level, variance decreases slightly while bias
increases often significantly. We have also shown that this bias increase was due to the fact
that over-randomization prevents the algorithm from detecting attributes of low relevance
and reduces the effective sample size when there are many such attributes. Furthermore,
we have highlighted the different nature of the bias/variance tradeoff in classification and
regression problems, explaining why classification problems can take advantage of stronger
randomization.
The geometrical analysis has shown that Extra-Trees asymptotically produce continuous,
piecewise multi-linear functions. The resulting models are thus smoother than the piecewise
constant ones obtained with other ensemble methods which optimize the cut-points. This
potentially leads to better accuracy in regions of the input space where the target function
is indeed smooth. We have also shown that tree-based ensemble models can be written as
kernel-based models. In the case of totally randomized trees, the kernel is independent of
the output values and thus it defines a universal scale-invariant metric defined on the input
space that can be used to approximate any target function. When K increases, the kernel
is automatically adapted to the output values, its sharpness increasing in those directions
along which the target function varies more strongly. Theoretical results from the literature
furthermore suggest that the spreading of the kernel increases rapidly with the dimension of
the input space resulting in a strong smoothing effect in the high dimensional case.
Actually, this paper has come up with two new learning algorithms that have comple-
mentary features: Extra-Trees with the default setting and totally randomized trees. Both
methods are non-parametric. The first one provides near-optimal accuracy and good computational complexity, especially on classification problems. The second one, although not as accurate as the first one, is trivial to implement, even faster, and the models it induces are independent of the output variable, making this algorithm a very interesting alternative to the kNN algorithm. Both methods have already proven useful in a number of applications. In particular, problems of very high dimensionality, like image classification problems (Marée et al., 2004), mass-spectrometry datasets (Geurts et al., 2005b), or time-series classification
problems (Geurts and Wehenkel, 2005), make the Extra-Trees a first choice method due
to its attractive computational performances. Also, the fact that totally randomized trees
have a tree structure independent of the output variable has been exploited in the context
of reinforcement learning where it ensures the convergence of the reinforcement learning
algorithm and leads to a very efficient implementation (Ernst et al., 2005).
There remain several future work directions. First of all, while we have focused here on
numerical attributes, it is also very desirable to handle other types of attributes. For categorical
attributes, we propose to generate random (binary) splits by selecting a random subset of
their possible values. This approach has already been successfully used in order to treat
biological (genetic) sequence classification problems (Geurts et al., 2005a). However, more
systematic empirical studies have to be carried out, and the analytical characterization of the models obtained with such attributes still has to be explored; it will certainly result in interesting, and probably very different, properties, especially in the context of categorical attributes only.
Since bias is the dominant component of the error of Extra-Trees, future improvements of
randomization methods should focus on this part of the error. There exist several techniques
to reduce the bias. One simple technique in the context of trees could be to extend tree
tests to take into account several attributes. This idea was already applied with some success
in the context of Random Forests by Breiman (2001). On the other hand, since Boosting
is a method known to reduce bias, it could possibly be combined with our Extra-Trees
so as to reduce their bias (see Webb, 2000; Webb and Zheng, 2004, for a combination
of Boosting and different randomization methods). Stochastic Discrimination (Kleinberg,
1990) is another theoretical framework to transform a weak classification algorithm (i.e. one
with high bias) into a stronger one. A deeper analysis of this framework in our context of
extreme randomization could also help in this direction.
Finally, along the line of the model characterization carried out in this paper, further work
towards a theoretical analysis of randomization methods is still needed and could lead to
a better understanding of these methods. For example, while we have shown that uniform
sampling of cut-points leads to multi-linear models, it would be interesting to study the impact
of different randomization schemes (e.g. cut-points drawn from a Gaussian distribution) on
the analytical form of the approximation. Also, a theoretical analysis of the exact effect of
values of K greater than 1 on the approximation and its corresponding kernel is still missing.
Lastly, we have shown that it is possible to exploit problem symmetries to justify particular
randomization schemes. Along this idea, a deeper theoretical analysis could also help to
take advantage of a priori knowledge (e.g. invariances or symmetries) in designing ad hoc
methods for specific classes of problems.
Appendix
A. Pseudo-code of the complete Extra-Trees algorithm and score measures
The complete Extra-Trees algorithm is described in Table 6, together with the node splitting
procedures for both numerical and categorical attributes.
Our score measure in classification is a particular normalization of the information gain. For a sample S and a split s, this measure is given by:
\[
Score_C(s, S) = \frac{2\, I^{s}_{c}(S)}{H_s(S) + H_c(S)},
\]
where H_c(S) is the (log) entropy of the classification in S, H_s(S) is the split entropy (also called split information by Quinlan (1986)), and I^s_c(S) is the mutual information of the split outcome and the classification. With respect to Quinlan's gain ratio, this normalization, proposed by Wehenkel and Pavella (1991), has the advantage of being symmetric in c and s, and it also further mitigates the "end-cut" preference of this latter measure (Wehenkel, 1998). For a discussion of entropy-based score measures and normalization, the interested reader can refer to (Wehenkel, 1996).
In regression, we use the relative variance reduction. If S_l and S_r denote the two subsets of cases from S corresponding to the two outcomes of a split s, then the score is defined as follows:
\[
Score_R(s, S) = \frac{var\{y|S\} - \frac{|S_l|}{|S|}\, var\{y|S_l\} - \frac{|S_r|}{|S|}\, var\{y|S_r\}}{var\{y|S\}},
\]
where var{y|S} is the variance of the output y in the sample S.
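As an informal illustration of how the splitting procedure of Table 6 and the regression score above fit together, the following sketch is our own simplified rendering, not the authors' implementation: it handles numerical attributes only, uses hypothetical helper names, and represents a sample as a list of (attribute-tuple, output) pairs. It draws K random splits and keeps the one maximizing the relative variance reduction.

```python
import random

def relative_variance_reduction(y, y_left, y_right):
    """Score_R of a split: the fraction of output variance removed by the split."""
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    total = var(y)
    if total == 0.0:
        return 0.0
    weighted = (len(y_left) * var(y_left) + len(y_right) * var(y_right)) / len(y)
    return (total - weighted) / total

def pick_a_random_split(sample, attr):
    """Draw a cut-point uniformly between the min and max of a numerical attribute."""
    values = [x[attr] for x, _ in sample]
    cut = random.uniform(min(values), max(values))
    return attr, cut

def split_a_node(sample, K):
    """Pick K attributes at random and return the random split with the best Score_R."""
    n = len(sample[0][0])                       # number of candidate attributes
    attrs = random.sample(range(n), min(K, n))  # without replacement
    best, best_score = None, -1.0
    for a in attrs:
        a, cut = pick_a_random_split(sample, a)
        left = [y for x, y in sample if x[a] < cut]
        right = [y for x, y in sample if x[a] >= cut]
        if not left or not right:               # degenerate split (constant attribute)
            continue
        score = relative_variance_reduction([y for _, y in sample], left, right)
        if score > best_score:
            best, best_score = (a, cut), score
    return best
```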
B. Description of datasets
The experiments are conducted on 12 classification and 12 regression problems which are
summarized in Table 7. Most datasets are available in the UCI Machine Learning Repository
(Blake and Merz, 1998). Friedman1, Two-Norm, and Ring-Norm are three artificial prob-
lems introduced respectively in (Friedman, 1991) and (Breiman, 1996a). Pumadyn, Hwang,
Table 6 Pseudo-code of the Extra-Trees algorithm

Build an extra tree ensemble(S).
Input: a training set S. Output: a tree ensemble T = {t_1, ..., t_M}.
– For i = 1 to M:
  – Generate a tree: t_i = Build an extra tree(S);
– Return T.

Build an extra tree(S).
Input: a training set S. Output: a tree t.
– Return a leaf labeled by class frequencies (or average output, in regression) in S if
  (i) |S| < n_min, or
  (ii) all candidate attributes are constant in S, or
  (iii) the output variable is constant in S.
– Otherwise:
  1. Select randomly K attributes, {a_1, ..., a_K}, without replacement, among all (non constant in S) candidate attributes;
  2. Generate K splits {s_1, ..., s_K}, where s_i = Pick a random split(S, a_i), ∀i = 1, ..., K;
  3. Select a split s* such that Score(s*, S) = max_{i=1,...,K} Score(s_i, S);
  4. Split S into subsets S_l and S_r according to the test s*;
  5. Build t_l = Build an extra tree(S_l) and t_r = Build an extra tree(S_r) from these subsets;
  6. Create a node with the split s*, attach t_l and t_r as left and right subtrees of this node and return the resulting tree t.

Pick a random split(S, a).
Input: a training set S and an attribute a. Output: a split.
– If the attribute a is numerical:
  – Compute the maximal and minimal value of a in S, denoted respectively by a^S_min and a^S_max;
  – Draw a cut-point a_c uniformly in [a^S_min, a^S_max];
  – Return the split [a < a_c].
– If the attribute a is categorical (denote by A its set of possible values):
  – Compute A_S, the subset of A of values of a that appear in S;
  – Randomly draw a proper non-empty subset A_1 of A_S and a subset A_2 of A \ A_S;
  – Return the split [a ∈ A_1 ∪ A_2].
Bank, and Census come from the DELVE repository of data16 and Ailerons, Elevators,
and Poletelecomm are taken from (Torgo, 1999).17 Notice that the last column of Table 7
provides the normalizing factors that we have used to display mean square-errors for the
regression problems. Notice also that the datasets marked with a star correspond to problems
where either the learning sample or the test sample contains a small number of observations (≤ 300). For these, our protocol consists of running 50 experiments corresponding to
50 random LS/TS splits, instead of only 10.
To give some insight into the range of problems considered, we give a few indications
about the performance of various learning methods and/or intrinsic properties of the datasets.
To this end, we discuss separately regression and classification problems.
16 http://www.cs.utoronto.ca/delve.
17 http://www.liacc.up.pt/ltorgo.
Table 7 Datasets summaries

Classification problems                          Regression problems
Dataset       Atts  Class  LS size  TS size      Dataset        Atts  LS size  TS size  Err ×
Waveform*       21      3      300     4700      Friedman1*       10      300     9700      1
Two-Norm*       20      2      300     9700      Housing*         13      455       51      1
Ring-Norm*      20      2      300     9700      Hwang-f5          2     2000    11600   10^3
Vehicle*        18      4      761       85      Hwang-f5n         2     2000    11600   10^2
Vowel*          10     11      891       99      Pumadyn-32fh     32     2000     6291   10^4
Segment*        19      7     2079      231      Pumadyn-32nm     32     2000     6291   10^5
Spambase        57      2     3221     1380      Abalone           8     3133     1044      1
Satellite       36      6     4435     2000      Ailerons         40     5000     8750   10^8
Pendigits       16     10     7494     3498      Elevators        18     5000    11559   10^6
Dig44           16     10     9000     9000      Poletelecomm     48     5000    10000   10^1
Letter          16     26    10000    10000      Bank-32nh        32     3692     4500   10^3
Isolet         617     26     6238     1559      Census-16H       16     7784    15000   10^9
Among the 12 classification problems, the first 3 are synthetic ones and the other 9 are
real ones. On two problems, the kNN method yields very good results (Two-Norm and
Vowel, see Table 8), and on two other problems its results are close to those of the tree-
based ensemble methods (Pendigits, and Dig44). On four datasets (Ring-Norm, Spambase,
Segment, and Isolet) the kNN gives very disappointing results. On Waveform, Vehicle,
Satellite, and Letter, it is slightly suboptimal with respect to Extra-Trees. On Ring-Norm and
Vehicle the best performance is obtained with quadratic discriminants. On Two-Norm the
Bayes optimal classifier is linear. On the other problems the best performance published in
the literature is obtained with non-linear multi-layer perceptron type of methods.
Among the 12 regression problems, nine are synthetic ones and three are real datasets
(Housing, Abalone, Census-16H). On one problem (Bank-32nh), linear regression slightly
outperforms the best tree-based methods, while on three others (Pumadyn-32fh, Abalone,
Ailerons), it performs equally well. On the other hand, it is largely suboptimal on most other
problems. As for kNN, it works well on Hwang-f5n, and to a lesser extent on Census-16H.
Hwang, Pumadyn, Bank, and Census are families of problems specifically chosen in the
DELVE project to evaluate regression methods. Hwang-f5 and Hwang-f5n are respectively
a noise free and a noisy variant of the same two-dimensional non linear problem. By
construction, Pumadyn-32fh is a fairly linear problem with high noise while Pumadyn-32nm
is a highly non linear problem with medium noise. Bank-32nh is a highly noisy problem and
Census-16H is defined in DELVE as a highly difficult problem.
C. Corrected t-test
In each run of random sub-sampling, the data set is divided into a learning sample of a given size n_L and a test sample of size n_T. The learning algorithm is run on the learning sample and its error is estimated on the test sample. The process is repeated N_s times and the resulting errors are averaged. Let e^i_A and e^i_B denote the errors of two methods A and B in the ith run (1 ≤ i ≤ N_s) of random sub-sampling and let d_i denote the difference e^i_A − e^i_B. The statistic corresponding to the t-test is:
\[
t = \frac{\mu_d}{\sqrt{\dfrac{\sigma^2_d}{N_s}}}, \qquad (16)
\]
where
\[
\mu_d = \frac{\sum_{i=1}^{N_s} d_i}{N_s} \quad \text{and} \quad \sigma^2_d = \frac{\sum_{i=1}^{N_s} (d_i - \mu_d)^2}{N_s - 1}. \qquad (17)
\]
Under the null hypothesis stating that A and B are equivalent, and assuming that the differences d_i are independent, t follows a Student distribution with N_s − 1 degrees of freedom. Under re-sampling, the hypothesis of independence is clearly violated as the different learning and test samples partially overlap. Nadeau and Bengio (2003) have proposed a correction to this t-test that takes this overlapping into account. With the same notations, the corrected statistic is the following:
\[
t_{corr} = \frac{\mu_d}{\sqrt{\left(\dfrac{1}{N_s} + \dfrac{n_T}{n_L}\right) \sigma^2_d}}. \qquad (18)
\]
This statistic is also assumed to follow a Student distribution with N_s − 1 degrees of freedom. Experiments in (Nadeau and Bengio, 2003) show that this test improves the type I error with respect to the standard t-test. Note that Nadeau and Bengio (2003) suggest to use a value of n_L 5 to 10 times larger than n_T.
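For readers who wish to reproduce the significance tests, a direct transcription of Eqs. (16)-(18) might look as follows. This is our own sketch, not code from the paper; the per-split error values and the scipy dependency are assumptions made only for the example.

```python
import numpy as np
from scipy import stats

def corrected_t_test(errors_a, errors_b, n_train, n_test):
    """Nadeau-Bengio corrected paired t-test over N_s random LS/TS splits (Eq. 18)."""
    d = np.asarray(errors_a) - np.asarray(errors_b)
    n_s = len(d)
    mu_d = d.mean()
    var_d = d.var(ddof=1)                                # unbiased estimate, Eq. (17)
    t_corr = mu_d / np.sqrt((1.0 / n_s + n_test / n_train) * var_d)
    p_value = 2 * stats.t.sf(abs(t_corr), df=n_s - 1)    # two-sided p-value
    return t_corr, p_value

# Hypothetical per-split error rates of two methods over 10 random LS/TS splits.
err_a = [0.166, 0.170, 0.165, 0.168, 0.171, 0.169, 0.167, 0.172, 0.164, 0.170]
err_b = [0.175, 0.178, 0.174, 0.176, 0.180, 0.177, 0.176, 0.181, 0.173, 0.179]
print(corrected_t_test(err_a, err_b, n_train=3000, n_test=1000))
```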
D. Detailed results of empirical study
Table 8 gives the average errors of the different learning algorithms compared in this paper as they were obtained with the protocol described in Section 2.2.3. In classification problems these numbers refer to error rates in percent, whereas in regression problems they refer to (normalized) mean square-errors.
To make the comparison easier, we provide in the first column of Table 8 the standard deviations of the error rates over the different LS/TS splits, as obtained for each problem with the Extra-Trees algorithm with default settings (column named ETd). Notice that these standard deviations are also indicative of those of all the other tree-based methods except ST and TB. The standard deviations of the errors of the single trees are at least twice as large on all problems, and this is also the case for Tree Bagging on classification problems. On regression problems, however, the standard deviation of the errors of Tree Bagging is close to that of the other ensemble methods. The five following columns give the results obtained by the reference methods (single unpruned and pruned CART trees, Tree Bagging, Random Subspace and Random Forests with the optimal value of K). The subsequent columns provide error rates for the Extra-Trees with different ways of adjusting the parameter K: first K = * (K is adjusted optimally on the average test set error rates), then K = cv (it is adjusted for each LS/TS split by 10-fold cross-validation internal to the learning sample), then K = d (K fixed a priori according to the default setting, i.e., d = √n in classification problems, and d = n in regression problems), then a version denoted ETd_B, where K = d is combined with bootstrap resampling of the training set, and finally the totally randomized version (K = 1). For the classification problems we also provide the results for the value of K = n. For regression problems this value is not explicitly given since it is equal to the default value K = d. Instead, we provide the mean square error obtained by a linear regression method for these latter problems (column LR). Finally, the last column provides the results obtained with the k-nearest neighbor method on these datasets;
Table 8 Error rates of all methods

Classification problems
Dataset       σ_ETd    ST    PST     TB    RS*    RF*    ET*   ETcv    ETd  ETd_B    ET1    ETn   kNN*
Waveform       0.70  29.17  28.76  19.41  17.45  16.97  16.60  16.84  16.61  16.59  18.02  17.77  18.35
Two-Norm       0.27  21.53  22.31   6.97   3.76   3.54   3.15   3.52   3.53   3.47   3.15   5.14   2.78
Ring-Norm      0.38  16.31  15.40   8.12   4.57   3.64   3.27   3.55   3.27   3.54   5.39   5.53  36.22
Vehicle        4.71  26.80  28.05  25.15  24.16  24.73  24.02  24.47  26.00  25.72  27.22  24.09  28.09
Vowel          1.33  20.63  20.71   7.35   2.73   3.43   1.47   1.47   1.74   2.28   2.22   2.14   1.21
Segment        0.96   3.24   3.35   2.20   1.72   1.83   1.40   1.34   1.77   2.00   2.57   1.48   3.54
Spambase       0.60   8.33   7.36   5.73   4.10   4.54   4.15   4.41   4.17   4.36   4.64   4.38   9.48
Satellite      0.49  14.63  13.52   9.56   8.14   8.45   7.96   8.12   8.43   8.97   9.31   8.18   9.37
Pendigits      0.19   3.91   3.89   1.74   0.90   1.02   0.60   0.63   0.67   0.77   0.96   0.61   0.63
Dig44          0.24  15.03  13.96   8.24   4.87   5.23   4.48   4.47   4.49   4.82   5.40   4.73   4.42
Letter         0.15  14.84  14.75   7.44   4.39   4.87   3.75   3.77   3.80   4.24   5.46   4.23   6.33
Isolet         0.43  26.45  24.43  12.40   7.85   8.43   7.08   7.19   7.61   8.52  19.68   8.28  15.45

Regression problems
Dataset       σ_ETd    ST    PST     TB    RS*    RF*    ET*   ETcv    ETd  ETd_B    ET1     LR   kNN*
Friedman1      0.26  11.73  10.84   5.69   5.59   5.55   4.97   5.00   4.97   5.50  12.60   7.38   9.44
Housing        4.91  19.90  19.63   9.86   8.79   9.72   9.63   9.81   9.68  10.89  17.36  24.38  18.91
Hwang-f5       0.14  10.91  10.91   4.72  10.28   4.72   1.62   1.58   1.62   2.89  19.11 812.61   5.22
Hwang-f5n      0.07  12.32   9.71   7.91   9.37   7.91   7.50   7.51   7.50   7.25   8.87  87.88   7.56
Pumadyn-32fh   0.06   8.27   4.26   4.22   4.28   4.18   4.18   4.19   4.23   4.20   6.26   4.19   5.38
Pumadyn-32nm   0.08  13.41   9.06   6.77   7.59   6.75   6.28   6.29   6.28   6.72  84.33  72.75  76.89
Abalone        0.31   8.55   5.61   4.71   4.69   4.57   4.61   4.62   4.67   4.59   4.76   4.93   4.90
Ailerons       0.06   5.53   4.21   2.88   2.97   2.85   2.93   2.93   2.93   2.94   6.03   3.07   4.69
Elevators      0.36  19.49  15.86   9.42   9.45   9.36   7.99   8.01   8.03   8.86  16.62   8.43  16.53
Poletelecomm   0.25   6.90   6.90   3.41   2.90   3.24   2.87   2.91   2.95   3.81  26.06  92.98  11.84
Bank-32nh      0.22  14.44   9.00   7.46   7.41   7.35   7.21   7.24   7.27   7.29  12.33   6.96  10.22
Census-16H     0.04   2.20   1.82   1.15   1.10   1.12   1.15   1.16   1.15   1.20   1.50   2.09   1.46
as suggested by the notation kNN*, this column reports the average error corresponding to the value of k giving the best test set error for each problem. The kNN method uses a Euclidean metric over attributes rescaled by the inverse of their standard deviation.
The comparison of the accuracies of ETd with those of ETd_B confirms that bootstrap resampling tends to reduce the accuracy of the Extra-Trees method. Indeed, on two of the classification problems (Letter and Satellite) and 4 regression problems (Housing, Hwang-f5, Elevators, Poletelecomm) ETd_B is significantly less accurate than ETd, while it is significantly better only on two regression problems (Hwang-f5n and Abalone).
The reader may wish to compare the results of kNN* with those of the totally randomized trees (column ET1), which is also an unsupervised kernel-based method. Also, the relative performance of this method with respect to the results obtained with ET* gives some indication of the diversity of the problems in terms of the relative performance of the kNN method. For example, it is interesting to observe that on Pumadyn-32fh both ET1 and kNN work quite well, while on Pumadyn-32nm they provide very disappointing results. Actually, since in this problem 30 attributes among the 32 contain almost no information, this explains why kNN and ET1, which treat all attributes equally, are so sub-optimal even with respect to single unpruned trees.
Finally, the comparison of the accuracies of ETd with those obtained by a linear regression (LR) gives an indication of how competitive ETd is in regression. We observe that the linear least squares regression method is much less accurate on most problems than ETd. Only on Bank-32nh does it outperform the other methods, while it provides comparable results on Pumadyn-32fh, Abalone, and Ailerons.
E. Bias/variance formulation of ensembles of randomized trees
In this appendix we provide the derivations leading to the bias/variance decomposition of
ensembles of randomized trees. We start with the case of regression and then consider
classification problems.
E.1. Bias/variance decomposition of the square-error for regression problems
Before considering randomized methods, we first recall the standard derivation of the
bias/variance decomposition of the square-error for deterministic learning algorithms.
E.1.1. Deterministic learning algorithms. A deterministic learning algorithm can be viewed as a function mapping a learning sample into a model. We denote the prediction of such a model at a point x of the input space by f(x; LS).18 Assuming a random sampling scheme generating learning samples of fixed size N, this prediction is a random variable, and so is its average error over the input space. We study the average value of this quantity, defined by:
\[
\begin{aligned}
E_{LS}\{Err(f(\cdot; LS))\} &= E_{LS}\{E_{X,Y}\{L(Y, f(X; LS))\}\}\\
&= E_{X}\{E_{LS}\{E_{Y|X}\{L(Y, f(X; LS))\}\}\} \qquad (19)\\
&= E_{X}\{E_{LS}\{Err(f(X; LS))\}\},
\end{aligned}
\]
18 In this appendix we use upper case letters to denote random variables and lower case letters to refer to their realizations.
where L(·, ·) is the loss function and where Err(f(X; LS)) denotes the expected loss at point X and Err(f(·; LS)) its expectation over the input space. If L(y, ŷ) = (y − ŷ)^2, the error locally decomposes into (see, e.g., Geman et al., 1992; Hastie et al., 2001):
\[
E_{LS}\{Err(f(x; LS))\} = \sigma^2_R(x) + bias^2_R(x) + var_R(x), \qquad (20)
\]
where
\[
\sigma^2_R(x) \triangleq E_{Y|x}\{(Y - f_B(x))^2\}, \qquad (21)
\]
\[
bias^2_R(x) \triangleq (f_B(x) - \bar{f}(x))^2, \qquad (22)
\]
\[
var_R(x) \triangleq E_{LS}\{(f(x; LS) - \bar{f}(x))^2\} \qquad (23)
\]
with
\[
f_B(x) \triangleq \arg\min_{y} E_{Y|x}\{(Y - y)^2\} = E_{Y|x}\{Y\}, \qquad (24)
\]
\[
\bar{f}(x) \triangleq E_{LS}\{f(x; LS)\}. \qquad (25)
\]
The function f_B(·) is called the Bayes (optimal) model. By definition the Bayes model minimizes the average error defined by Eq. (19), and in the case of a squared loss function it is equal to the conditional expectation of Y given X = x. The first term of the decomposition (20) is the local error of this model. It is called the residual (squared) error, σ^2_R(x). It provides a theoretical lower bound on the error which is independent of the learning algorithm. Thus the (local) sub-optimality of a particular learning algorithm is composed of two (non-negative) terms:
– the (squared) bias, bias^2_R(x), measuring the discrepancy between the Bayes model f_B(x) and the average model \bar{f}(x);
– the variance, var_R(x), measuring the variability of the predictions (around the average model) with respect to the learning sample randomness.
These local quantities can be “globalized” by averaging them over the input distribution.
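These definitions translate directly into a Monte-Carlo estimation procedure. The sketch below is our own illustration, not part of the paper; the data generator, noise level and sample sizes are arbitrary assumptions. It repeatedly redraws learning samples, refits a model, and estimates the residual error, squared bias and variance of Eq. (20), averaged over a grid of test points.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
noise_std = 0.3                      # residual error sigma_R^2 = noise_std**2

def bayes(x):                        # f_B(x) = E{Y | x} for this synthetic problem
    return np.sin(4 * x)

def draw_ls(n=100):                  # random sampling scheme generating LS of size n
    x = rng.uniform(0, 1, n)
    return x.reshape(-1, 1), bayes(x) + noise_std * rng.normal(size=n)

x_test = np.linspace(0, 1, 200).reshape(-1, 1)
preds = []
for _ in range(100):                 # 100 independent learning samples
    X, y = draw_ls()
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)
    preds.append(model.predict(x_test))
preds = np.array(preds)              # shape (100 learning samples, 200 test points)

avg_model = preds.mean(axis=0)                           # f_bar(x), Eq. (25)
bias2 = ((bayes(x_test[:, 0]) - avg_model) ** 2).mean()  # Eq. (22), averaged over x
variance = preds.var(axis=0).mean()                      # Eq. (23), averaged over x
print(f"sigma_R^2={noise_std**2:.3f}  bias^2={bias2:.4f}  variance={variance:.4f}")
```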
E.1.2. Randomized learning algorithms. We use the following terminology:
– "Original algorithm" denotes any given automatic learning algorithm (we assume, without any limitation, that this algorithm is deterministic). A model returned by this algorithm for some learning sample ls will be denoted by f(·; ls).
– "Randomized algorithm" denotes the randomized version of the original algorithm, according to some perturbation technique. We formalize this by introducing an additional random variable ε which summarizes the random perturbation scheme, and denote by ε a value of this variable and accordingly by f_r(·; ls, ε) a particular model generated by a call to the randomized algorithm.
– "Averaged algorithm" denotes the algorithm producing models built by aggregating M different models obtained with the randomized algorithm. An averaged model will be denoted by f_{a_M}(·; ls, ε^M), where ε^M = (ε_1, ..., ε_M). Its predictions are computed as:
\[
f_{a_M}(x; ls, \epsilon^M) = \frac{1}{M} \sum_{i=1}^{M} f_r(x; ls, \epsilon_i). \qquad (26)
\]
Let us analyze the bias and variance terms of the randomized and averaged algorithms and
relate them to bias and variance of the original algorithm.
E.1.2.1. Randomized algorithm. Because the randomized algorithm depends on a second random variable ε, its average square-error and its bias/variance decomposition become:
\[
E_{LS}\{Err(f_r(x; LS))\} = \sigma^2_R(x) + (f_B(x) - \bar{f}_r(x))^2 + var_{LS}\{f_r(x; LS)\}, \qquad (27)
\]
where:
\[
\bar{f}_r(x) \triangleq E_{LS}\{f_r(x; LS)\}, \qquad (28)
\]
\[
var_{LS}\{f_r(x; LS)\} \triangleq E_{LS}\{(f_r(x; LS) - \bar{f}_r(x))^2\}. \qquad (29)
\]
The variance term (29) may be further decomposed into two (positive) terms:
\[
var_{LS}\{f_r(x; LS)\} = var_{LS}\{E_{\epsilon|LS}\{f_r(x; LS)\}\} + E_{LS}\{var_{\epsilon|LS}\{f_r(x; LS)\}\}. \qquad (30)
\]
The first term is the variance, with respect to the learning set randomness, of the average prediction according to ε. It measures the dependence of the model on the learning sample, independently of ε. The second term is the expectation over all learning sets of the variance of the prediction with respect to ε. It measures the strength of the randomization ε.
E.1.2.2. Averaged algorithm. Assuming that the ε_i are independently drawn from the same distribution P(ε|LS), the average model of the averaged algorithm is nothing but the average model of the randomized algorithm:
\[
\bar{f}_{a_M}(x) = E_{LS,\epsilon^M}\{f_{a_M}(x; LS, \epsilon^M)\} = \frac{1}{M} \sum_{i=1}^{M} E_{LS,\epsilon_i}\{f_r(x; LS, \epsilon_i)\} = \bar{f}_r(x). \qquad (31)
\]
So, its bias is equal to the bias of the randomized algorithm. On the other hand, under the same assumptions it can be shown that its variance satisfies:
\[
var_{LS,\epsilon^M}\{f_{a_M}(x; LS, \epsilon^M)\} = var_{LS}\{E_{\epsilon|LS}\{f_r(x; LS)\}\} + \frac{E_{LS}\{var_{\epsilon|LS}\{f_r(x; LS)\}\}}{M}. \qquad (32)
\]
So, averaging over M values of ε reduces the second part of the variance of the randomized algorithm by a factor M and leaves the remaining parts of the average square-error unchanged. This implies that the larger the number of ensemble terms, the smaller the average square-error of the averaged algorithm.
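For completeness, Eq. (32) follows from the law of total variance together with the conditional independence of the ε_i given LS (a short derivation of ours, not reproduced from the paper):
\[
\begin{aligned}
var_{LS,\epsilon^M}\{f_{a_M}(x;LS,\epsilon^M)\}
&= var_{LS}\Big\{E_{\epsilon^M|LS}\Big\{\tfrac{1}{M}\textstyle\sum_i f_r(x;LS,\epsilon_i)\Big\}\Big\}
 + E_{LS}\Big\{var_{\epsilon^M|LS}\Big\{\tfrac{1}{M}\textstyle\sum_i f_r(x;LS,\epsilon_i)\Big\}\Big\}\\
&= var_{LS}\{E_{\epsilon|LS}\{f_r(x;LS)\}\}
 + \frac{1}{M^2}\, E_{LS}\Big\{\textstyle\sum_{i=1}^{M} var_{\epsilon_i|LS}\{f_r(x;LS,\epsilon_i)\}\Big\}\\
&= var_{LS}\{E_{\epsilon|LS}\{f_r(x;LS)\}\}
 + \frac{E_{LS}\{var_{\epsilon|LS}\{f_r(x;LS)\}\}}{M},
\end{aligned}
\]
where the second line uses the fact that, conditionally on LS, the ε_i are i.i.d., so the variance of their average is 1/M times the individual variance (the cross-covariances vanish).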
Fig. 11 Expected evolution of bias and variance by randomization and averaging.

E.1.2.3. The averaged vs the original algorithm. Figure 11 shows the evolution of bias and variance from the original algorithm to the averaged algorithm. Asymptotically, with respect to M, the whole process of randomization and averaging decreases the variance with respect to our original algorithm if:
\[
var_{LS}\{E_{\epsilon|LS}\{f_r(x; LS)\}\} < var_{LS}\{f(x; LS)\}, \qquad (33)
\]
This condition is easy to satisfy. However, to compare the averaged algorithm with the
original algorithm, we need also to take into account their biases. Indeed, in general,
faM(x)=fr(x)= f(x) (34)
and hence bias could possibly be increased by the randomization. Actually, when the original
learning algorithm explicitly tries to minimize empirical risk, it is likely that the randomiza-
tion will disturb the algorithm in this goal. Consequently, the averaged algorithm is likely to
have higher bias than the original one. It thus appears that in most randomization methods
there is a bias/variance tradeoff controlled by the randomization strength:
– the stronger the randomization, the lesser the dependence of the randomized models on the learning sample and the smaller the variance of the averaged algorithm;
– the stronger the randomization, the lesser the dependence of output predictions on the input attributes and the higher the bias of both the randomized and the averaged algorithms.
Note that the increase of variance resulting from the second term of (30) does not influence
the tradeoff since this term is the one which is canceled by averaging. Nevertheless, larger
values of this term will imply slower convergence with respect to the number Mof ensemble
terms, and thus lead to higher computational requirements in practice.
E.2. Bias/variance decompositions for classification problems
Several bias/variance decompositions have been proposed in the literature for the average
error rate. They all try to mimic the main properties of the decomposition of the average
square-error (see for example Geurts, 2002, or James, 2003, for a review of several decom-
positions). However, none of them is fully satisfactory. The problem in classification is that a
decrease of the variability of the predictions at some point may actually increase the error (if
the predictions are wrong on average). The consequence when studying randomization methods is that, unlike the square-error, the error rate of the averaged algorithm may be greater
than the error rate of the randomized algorithm. So, it is very difficult to study randomization
methods by trying to understand directly their effect on the misclassification error (see Bauer
and Kohavi, 1999; Webb, 2000; Valentini and Dietterich, 2004, for empirical bias/variance
analyses of ensemble methods).
However, many classification algorithms, like decision trees, work by first computing a numerical estimate f_c(x; LS) of P(Y = c|X = x) and then deriving a classification rule by predicting the class maximizing this estimate:
\[
f(x; LS) = \arg\max_{c} f_c(x; LS). \qquad (35)
\]
We could thus consider classification models as multidimensional regression models and apply the previous analysis to study the effect of randomization methods. Thus, an approach to study classification algorithms is to relate the bias and variance terms of these estimates to the average misclassification error of the resulting classification rule (35). Friedman (1997) has made this connection in the particular case of a two-class problem and assuming that the distribution of f_c(x; LS) with respect to LS is close to Gaussian. He shows that the average misclassification error at some point x may be written as:
\[
E_{LS}\{Err(f(x; LS))\} = 1 - P(f_B(x)|x) + \bar{\Phi}\left( \frac{E_{LS}\{f_{f_B(x)}(x; LS)\} - 0.5}{\sqrt{var_{LS}\{f_{f_B(x)}(x; LS)\}}} \right) \big(2 P(f_B(x)|x) - 1\big), \qquad (36)
\]
where f_B(·) is the Bayes optimal classifier and \bar{\Phi}(·) is the upper tail of the standard normal distribution. According to the sign of the numerator in (36) we have:
– When a majority of models vote in the same way as the Bayes classifier, a decrease of variance will decrease the error.
– Conversely, when a majority of models vote wrongly, a decrease of variance will increase the error.
Another important conclusion that can be drawn from (36) is that, whatever the regression
bias on the approximation of fc(x;LS), the classification error can be driven to its minimum
value by reducing solely the variance, under the assumption that a majority of models are
right. In other words, perfect classification rules can be induced from very bad (rather biased,
but of small variance) probability estimators.
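A small numerical illustration of this point (our own figures, not taken from the paper): suppose that at some x the true conditional probability of the Bayes class is P(f_B(x)|x) = 0.9, and that the learning algorithm produces a badly biased estimate with E_LS{f_{f_B(x)}(x; LS)} = 0.6. Equation (36) then reads
\[
E_{LS}\{Err(f(x;LS))\} = 0.1 + 0.8\,\bar{\Phi}\!\left(\frac{0.1}{\sqrt{var_{LS}\{f_{f_B(x)}(x;LS)\}}}\right),
\]
so with a standard deviation of 0.2 the error is about 0.1 + 0.8 · Φ̄(0.5) ≈ 0.35, while driving the variance towards zero sends the argument to +∞, Φ̄ to 0, and the error down to the Bayes error 0.1, despite the strong bias of the probability estimate.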
This discussion shows that where the average probability estimates of the randomized
algorithm are consistent with the Bayes optimal decision rule, the reduction of variance ob-
tained by the averaged algorithm will lead to a reduction of the average error rate. Conversely,
when this is not the case, the averaged algorithm will be worse than the randomized one.
Because of the non additive interaction between bias and variance of probability estimates
and the average error rate, the best tradeoff in classification will not correspond to the best
tradeoff in terms of the average square-error of probability estimates. Actually, since from
Eq. (36) a high bias of probability estimates does not prevent a low misclassification error if
the variance is low, we may expect that the best tradeoff in terms of misclassification error
will correspond to a lower variance and hence a higher randomization level than the best
tradeoff in terms of the square-error.
F. Geometrical characterization of the Extra-Trees kernel
In this appendix, we show that the Extra-Trees kernel is a continuous piecewise multi-linear
function of its two arguments when the number of trees grows to infinity. The developments
follow from an adaptation of the proofs given in (Zhao, 2000).
Let us consider a learning sample of size N
\[
ls_N = \{(x^i, y^i): i = 1, \ldots, N\},
\]
where each x^i = (x^i_1, ..., x^i_n) is an attribute vector of dimension n and y^i is the corresponding output value, and let us denote by
\[
(x^{(1)}_j, \ldots, x^{(N)}_j)
\]
the sample values of the jth attribute taken by increasing order. Let us denote by I^l_k, k ∈ {1, ..., n}, l ∈ {1, ..., N − 1}, the interval ]x^{(l)}_k, x^{(l+1)}_k] as well as the random event corresponding to a split on variable k in this interval at the top node of an Extra-Tree grown from ls_N. Let us further denote by ls^l_{k,L} and ls^l_{k,R} the left and right subsets of ls_N resulting from that split:
\[
ls^l_{k,L} = \{(x^i, y^i) \in ls_N \mid x^i_k \le x^{(l)}_k\} \quad \text{and} \quad ls^l_{k,R} = \{(x^i, y^i) \in ls_N \mid x^i_k \ge x^{(l+1)}_k\}.
\]
For M → ∞, the kernel defined by (7) (Section 5.2) becomes the average, over the distribution of Extra-Trees, of 0 when x and x' fall in different leaves of the tree and 1/n_l when x and x' fall in the same leaf l of size n_l.
Given this definition, the following recursive equations allow one to compute this kernel:
– If N < n_min:
\[
K^{\infty}_T(x, x'; ls_N) = \frac{1}{N}; \qquad (37)
\]
– Otherwise:
\[
K^{\infty}_T(x, x'; ls_N) = \sum_{k=1}^{n} \sum_{l=1}^{N-1} P(I^l_k) \cdot
\begin{cases}
K^{\infty}_T(x, x'; ls^l_{k,L}) & \text{if } x_k \le x^{(l)}_k \wedge x'_k \le x^{(l)}_k \\[4pt]
K^{\infty}_T(x, x'; ls^l_{k,R}) & \text{if } x_k > x^{(l+1)}_k \wedge x'_k > x^{(l+1)}_k \\[4pt]
\dfrac{x^{(l+1)}_k - x'_k}{x^{(l+1)}_k - x^{(l)}_k}\, K^{\infty}_T(x, x'; ls^l_{k,L}) + \dfrac{x_k - x^{(l)}_k}{x^{(l+1)}_k - x^{(l)}_k}\, K^{\infty}_T(x, x'; ls^l_{k,R}) & \text{if } x^{(l)}_k < x_k < x'_k \le x^{(l+1)}_k \\[4pt]
\dfrac{x^{(l+1)}_k - x_k}{x^{(l+1)}_k - x^{(l)}_k}\, K^{\infty}_T(x, x'; ls^l_{k,L}) + \dfrac{x'_k - x^{(l)}_k}{x^{(l+1)}_k - x^{(l)}_k}\, K^{\infty}_T(x, x'; ls^l_{k,R}) & \text{if } x^{(l)}_k < x'_k \le x_k \le x^{(l+1)}_k \\[4pt]
0 & \text{otherwise}
\end{cases} \qquad (38)
\]
where x_k (resp. x'_k) is the kth component of the attribute vector x (resp. x').
The sum in (38) is over all possible splitting intervals. The first two choices correspond to the cases when both values x_k and x'_k are outside the interval I^l_k but on the same side with respect to this interval. In this case, the kernel is equal to the kernel computed from the left or the right subset of the learning sample corresponding to that split. The third and fourth choices correspond to the case when both values fall in the interval I^l_k. For example, in the third situation, the kernel is equal to K^∞_T(x, x'; ls^l_{k,L}) when the cut-point is greater than x'_k, which happens with probability (x^{(l+1)}_k − x'_k)/(x^{(l+1)}_k − x^{(l)}_k) (since the cut-point is drawn from a uniform distribution), or equal to K^∞_T(x, x'; ls^l_{k,R}) when the cut-point is lower than x_k, which happens with probability (x_k − x^{(l)}_k)/(x^{(l+1)}_k − x^{(l)}_k).
In the case of totally randomized trees, the probabilities P(I^l_k) are easily computed as:
\[
P(I^l_k) = \frac{1}{n} \cdot \frac{x^{(l+1)}_k - x^{(l)}_k}{x^{(N)}_k - x^{(1)}_k}. \qquad (39)
\]
In the case of Extra-Trees, the probability of splitting in some interval I^l_k depends on the output values of the learning sample cases and also on the score measure used.
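The recursion (37)-(39) is simple enough to transcribe directly. The sketch below is ours and is meant only as an illustration: it handles the totally randomized case, where P(I^l_k) is given by Eq. (39), and assumes that all attribute values in the sample are distinct.

```python
import numpy as np

def kernel_inf(x, xp, X, n_min=2):
    """K_T^inf(x, x'; ls_N) for totally randomized trees, via Eqs. (37)-(39)."""
    N, n = X.shape
    if N < n_min:
        return 1.0 / N                                   # Eq. (37)
    total = 0.0
    for k in range(n):
        xs = np.sort(X[:, k])
        span = xs[-1] - xs[0]
        for l in range(N - 1):
            a, b = xs[l], xs[l + 1]
            p = (b - a) / (n * span)                     # Eq. (39)
            if p == 0.0:
                continue
            left = X[X[:, k] <= a]                       # ls^l_{k,L}
            right = X[X[:, k] >= b]                      # ls^l_{k,R}
            lo, hi = min(x[k], xp[k]), max(x[k], xp[k])
            if hi <= a:                                  # both left of I_k^l (case 1)
                total += p * kernel_inf(x, xp, left, n_min)
            elif lo > b:                                 # both right of I_k^l (case 2)
                total += p * kernel_inf(x, xp, right, n_min)
            elif a < lo and hi <= b:                     # both inside I_k^l (cases 3-4)
                total += p * ((b - hi) / (b - a) * kernel_inf(x, xp, left, n_min)
                              + (lo - a) / (b - a) * kernel_inf(x, xp, right, n_min))
            # otherwise the cut-point separates x and x': contribution 0
    return total

# A tiny one-dimensional learning sample.
X = np.array([[0.1], [0.4], [0.7], [0.9]])
print(kernel_inf(np.array([0.2]), np.array([0.3]), X))
```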
From Eq. (38), it is easy to show that the kernel is continuous with respect to both x and x'. For a given value of x' (resp. x), the function K^∞_T(x, x'; ls_N) is also piecewise multi-linear with respect to x (resp. x'). From (38), it is clear that the function K^∞_T(x, x'; ls_N) is a sum of products of the x_k, k ∈ {1, 2, ..., n}. Furthermore, an attribute can appear at most once in each product. Indeed, the only terms that make x_k appear explicitly are the third and fourth choices in (38), and if x_k and x'_k are in I^l_k, they can only be outside the splitting intervals on the kth attribute appearing in ls^l_{k,L} or ls^l_{k,R}. Hence, K^∞_T(x, x'; ls^l_{k,L}) and K^∞_T(x, x'; ls^l_{k,R}) cannot involve the kth attribute anymore through the third or fourth choice in subsequent applications of the recursion (38). Hence, the kernel is piecewise multi-linear with respect to both x and x'. Hyper-intervals where K^∞_T(x, x'; ls_N) is multi-linear with respect to x (resp. x') are delimited by the values of the attributes observed in the learning sample as well as by the values x'_k (resp. x_k), k ∈ {1, ..., n}.
Since the approximation given by an ensemble of Extra-Trees is written as a weighted sum
of kernels (8), it is also continuous and piecewise multi-linear when M→∞. Furthermore,
since kernels in (8) are all centered at learning sample points, hyper-intervals where the
approximation is multi-linear are defined only by attribute values appearing in the learning
sample. Hence, this proves also the form (1) given in Section 5.1 for the approximation
provided by an infinite ensemble of Extra-Trees.
Acknowledgments Damien Ernst and Pierre Geurts gratefully acknowledge the financial support of the
Belgian National Fund of Scientific Research (FNRS).
References
Ali, K., & Pazzani, M. (1996). Error reduction through learning multiple descriptions. Machine Learning,
24:3, 173–206.
Ali, K. (1995). On the link between error correlation and error reduction in decision tree ensembles. Technical
report, Department of Information and Computer Science, University of California, Irvine.
Bauer, E., & Kohavi., R. (1999). An empirical comparison of voting classification algorithms: bagging,
boosting, and variants. Machine Learning, 36, 105–139.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth
International.
Breiman, L. (1996a). Arcing classifiers. Technical report, University of California, Department of Statistics.
Breiman, L. (1996b). Bagging predictors. Machine Learning, 24:2, 123–140.
Breiman, L. (2000a). Randomizing outputs to increase prediction accuracy.Machine Learning, 40:3, 229–242.
Breiman, L. (2000b). Some infinity theory for predictor ensembles. Technical Report 579, University of
California, Department of Statistics.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Buntine, W., & Niblett, T. (1992), A further comparison of splitting rules for decision-tree induction. Machine
Learning, 8, 75–85.
Buntine, W., & Weigend, A. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.
Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63–73.
Cutler, A.,& Guohua, Z. (2001), PERT — Perfect random tree ensembles. Computing Science and Statistics
33.
Dietterich, T., & Kong, E. (1995). Machine learning bias, statistical bias, and statistical variance of decision
tree algorithms. Technical report, Department of Computer Science, Oregon State University.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision
trees: bagging, boosting, and randomization. Machine Learning, 40:2, 139–157.
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of
Machine Learning Research, 6, 503–556.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application
to boosting. In: Proceedings of the 2nd European Conference on Computational Learning Theory, 23–27.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19:1, 1–141.
Friedman, J. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge
Discovery, 1, 55–77.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural
Computation, 4, 1–58.
Geurts, P., Blanco Cuesta A., & Wehenkel, L. (2005a). Segment and combine approach for biological sequence
classification. In: Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and
Computational Biology, 194–201.
Geurts, P., Fillet,M., de Seny, D., Meuwis, M. -A., Merville, M. -P., & Wehenkel, L. (2005b). Proteomic
mass spectra classification using decision tree based ensemble methods. Bioinformatics, 21:14, 3138–
3145.
Geurts, P., & Wehenkel, L. (2000). Investigation and reduction of discretization variance in decision tree
induction. In: Proceedings of the 11th European Conference on Machine Learning, 162–170.
Geurts, P., & Wehenkel, L. (2005). Segment and combine approach for non-parametric time-series classifica-
tion. In: Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery
in Databases. pp. 478–485.
Geurts, P. (2002). Contributions to decision tree induction: bias/variance tradeoff and time series classification. Ph.D. thesis, University of Liège.
Geurts, P. (2003). Extremely randomized trees. Technical report, University of Liège, Department of Electrical Engineering and Computer Science.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference,
and prediction. Springer.
Herbrich, R., Graepel, T., & Campbell, C. (2001). Bayes point machines. Journal of Machine Learning
Research, 1, 241–279.
Ho, T. (1998). The Random subspace method for constructing decision forests. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 20:8, 832–844.
James, G. (2003). Variance and bias for generalized loss functions. Machine Learning, 51, 115–135.
Kamath, C., Cantu-Paz, E., & Littau, D. (2002). Approximate splitting for ensembles of trees using histograms.
In: Proceedings of the 2nd SIAM International Conference on Data mining.
Kleinberg, E. (1990). Stochastic discrimination. Annals of Mathematics and Artificial Intelligence 1, 207–239.
Lin, Y., & Jeon, Y. (2002). Random forests and adaptive nearest neighbors. Technical Report 1055, University
of Wisconsin, Department of Statistics.
Marée, R., Geurts, P., Piater, J., & Wehenkel, L. (2004). A generic approach for image classification based on decision tree ensembles and local sub-windows. In: Proceedings of the 6th Asian Conference on Computer Vision, 2, 860–865.
Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine
Learning, 3, 319–342.
Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52:3, 239–281.
Quinlan, J. (1986). C4.5: Programs for machine learning. Morgan Kaufmann (San Mateo).
Torgo, L. (1999). Inductive learning of tree-based regression models. Ph.D. thesis, University of Porto.
Valentini, G., & Dietterich, T. (2004). Bias-variance analysis of support vector machines for the development
of SVM-based ensemble methods. Journal of Machine Learning Research, 5, 725–775.
Webb, G., & Zheng, Z. (2004). Multi-strategy ensemble learning: reducing error by combining ensemble
learning techniques. IEEE Transactions on Knowledge and Data Engineering,16:8, 980–991.
Webb, G. (2000). Multiboosting: a technique for combining boosting and wagging. Machine Learning, 40:2,
159–196.
Wehenkel, L., & Pavella, M. (1991). Decision trees and transient stability of electric power systems. Auto-
matica, 27:1, 115–134.
Wehenkel,L. (1996). On uncertainty measures used for decision tree induction. In: Proceedings of Information
Processing and Management of Uncertainty in Knowledge Based Systems, 413–418.
Wehenkel, L. (1997). Discretization of continuous attributes for supervised learning: variance evaluation and
variance reduction. In: Proceedings of the International Fuzzy Systems Association World Congress,
381–388.
Wehenkel, L. (1998). Automatic Learning Techniques in Power Systems. Boston: Kluwer Academic.
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Zhao, G. (2000). A new perspective on classification. Ph.D. thesis, Utah State University, Department of
Mathematics and Statistics.
Zheng, Z., & Webb, G. (1998). Stochastic attribute selection committees. In: Proceedings of the 11th Australian
Joint Conference on Artificial Intelligence, 321–332.
... This drawback may impede the uptake of virtual sensing for operational purposes. However, the success of other ensemble techniques such as the extremely randomized trees algorithm [19], which has been tremendous in various application domains in recent years, is yet to be explored for N and P prediction. ...
Article
Full-text available
Harmful cyanobacterial bloom (HCB) is problematic for drinking water treatment, and some of its strains can produce toxins that significantly affect human health. To better control eutrophication and HCB, catchment managers need to continuously keep track of nitrogen (N) and phosphorus (P) in the water bodies. However, the high-frequency monitoring of these water quality indicators is not economical. In these cases, machine learning techniques may serve as viable alternatives since they can learn directly from the available surrogate data. In the present work, a random forest, extremely randomized trees (ET), extreme gradient boosting, k-nearest neighbors, a light gradient boosting machine, and bagging regressor-based virtual sensors were used to predict N and P in two catchments with contrasting land uses. The effect of data scaling and missing value imputation were also assessed, while the Shapley additive explanations were used to rank feature importance. A specification book, sensitivity analysis, and best practices for developing virtual sensors are discussed. Results show that ET, MinMax scaler, and a multivariate imputer were the best predictive model, scaler, and imputer, respectively. The highest predictive performance, reported in terms of R2, was 97% in the rural catchment and 82% in an urban catchment.
... The Extremely randomized trees (or Extra-Trees) algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure. Its two main differences from RF are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees (Geurts et al., 2006). In other words, in extremely randomized trees, randomness goes one step further in the way splits are computed. ...
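The two quoted differences are easy to see with scikit-learn, whose ExtraTreesClassifier implements the Extra-Trees algorithm of Geurts et al. (2006): cut-points are drawn at random and, with bootstrap=False (the default), every tree is grown on the whole learning sample, whereas a Random Forest optimizes cut-points and trains each tree on a bootstrap replica. The data below are synthetic.

```python
# Side-by-side run of Extra-Trees and Random Forest on a synthetic problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)

models = {
    "Extra-Trees":   ExtraTreesClassifier(n_estimators=100, bootstrap=False,
                                          random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, bootstrap=True,
                                            random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```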
Thesis
The surface of the cerebral cortex is very convoluted, with a large number of folds, the cortical sulci. Moreover, these folds are extremely variable from one individual to another. This great variability is a problem for many applications in neuroscience and brain imaging. One central problem is that cerebral sulci are not a good unit for describing folding over the cortical surface. In particular, their geometry (shape) and topology (branches, number of pieces) are very variable. “Plis de passage” (PPs), or “annectant gyri”, can explain part of this variability. The concept of PPs was first introduced by Gratiolet (1854) to describe transverse gyri that interconnect both sides of a sulcus, are frequently buried in the depth of these sulci, and are sometimes apparent on the cortical surface. As an interesting feature of the cortical folding process, the structural connectivity underlying PPs has also generated a lot of interest. However, the difficulty of identifying PPs and the lack of systematic methods to detect them automatically have limited their use. This thesis aims to detect and characterise PPs on the cortical surface from both morphological and connectivity perspectives. It is structured around two main research axes: 1. Definition of a machine learning-based PP detection process using their geometrical (or morphological) characteristics. 2. Investigation of the relationships between PPs and their underlying structural connectivity, and further development of multi-modal machine learning models. In the first part, we present a method to detect PPs on the cortex automatically according to the local morphological characteristics proposed in (Bodin et al., 2021). To record the local morphological patterns for each vertex on the cortical surface, we used the cortical surface profiling method (Li et al., 2010). The three-dimensional PP recognition problem is thereby converted into a two-dimensional, class-imbalanced image classification problem, in which more points in the STS are non-PPs than PPs. To address this imbalance, we propose an ensemble SVM model (EnsSVM) with a rebalancing strategy. Experimental results and quantitative statistical analyses show the effectiveness and robustness of our method. In the second part, we study the structural connectivity, particularly short-range U-fibers, underlying the location of PPs, and propose a new approach to study the density of U-fiber terminations on the cortical surface. We hypothesize that PPs are located in regions with a high density of intercrossing U-fiber terminations. Indeed, our statistical analyses show a robust correlation between PPs and U-fiber termination density. Moreover, we discuss the impact of connectivity heterogeneity in the STS on the machine learning results, and the myelin map is then used as a supplement to the structural connectivity.
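The excerpt does not spell out how EnsSVM rebalances the classes, so the following is only a hedged sketch of one common way to ensemble SVMs over a class-imbalanced problem: each member is trained on the minority class plus a different random subsample of the majority class, and predictions are combined by majority vote. All names and data are placeholders, not the thesis's method or cortical data.

```python
# Hypothetical rebalanced SVM ensemble (an assumption, not the thesis's EnsSVM).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)

members = []
for _ in range(11):  # odd number of members avoids vote ties
    sub = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, sub])
    members.append(SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in members])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over members
```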
... The Extreme Random Forest (ERF) model is one of the most popular machine learning algorithms used for classification. It is an extension of the Random Forest (RF) model and is an ensemble machine learning algorithm [48]. The ERF algorithm works by creating a large number of unpruned decision trees from the training dataset. ...
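To make "cut-points chosen at random" concrete, the didactic fragment below draws one candidate split the way a totally randomized tree would: pick an attribute at random and a threshold uniformly between that attribute's minimum and maximum in the current node. In the full Extra-Trees/ERF algorithm, K such candidate splits are drawn and the best-scoring one is kept; that scoring step is omitted here, and the node data are synthetic.

```python
# Didactic fragment: drawing a single random split for a hypothetical tree node.
import numpy as np

def random_split(X_node, rng):
    """Pick a random attribute and a cut-point drawn uniformly within its range."""
    attribute = rng.integers(X_node.shape[1])
    lo, hi = X_node[:, attribute].min(), X_node[:, attribute].max()
    return attribute, rng.uniform(lo, hi)

rng = np.random.default_rng(0)
X_node = rng.normal(size=(50, 5))        # samples reaching the node (placeholder)
attr, cut = random_split(X_node, rng)
left = X_node[X_node[:, attr] <= cut]    # samples sent to the left child
right = X_node[X_node[:, attr] > cut]    # samples sent to the right child
```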
Preprint
Full-text available
Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics, which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 and 15 years. We proposed an optimal feature set, and our model achieved an average F1 score (the harmonic mean of precision and recall) of 0.84 across all ages. Our results suggest that sex classification is generally more accurate when a model is developed for each year group rather than for children in 4-year age bands, with classification accuracy being better for older age groups. We found that spontaneous speech could provide more helpful cues for sex classification than scripted speech, especially for children younger than 7 years. For younger age groups, a broad range of acoustic factors contributed evenly to sex classification, while for older age groups, F0-related acoustic factors were generally found to be the most critical predictors. Other important acoustic factors for older age groups include vocal tract length estimators, spectral flux, loudness, and unvoiced features.
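As a quick reminder of the metric quoted above, the F1 score is the harmonic mean of precision and recall; the toy labels below (unrelated to the study's data) simply confirm that identity numerically.

```python
# F1 as the harmonic mean of precision and recall, checked on made-up labels.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(2 * p * r / (p + r), f1_score(y_true, y_pred))  # both print 0.75
```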
Chapter
When doctors are trained to diagnose a specific disease, they learn faster when presented with cases in order of increasing difficulty. This creates the need for automatically estimating how difficult it is for doctors to classify a given case. In this paper, we introduce methods for estimating how hard it is for a doctor to diagnose a case represented by a medical image, both when ground truth difficulties are available for training, and when they are not. Our methods are based on embeddings obtained with deep metric learning. Additionally, we introduce a practical method for obtaining ground truth human difficulty for each image case in a dataset using self-assessed certainty. We apply our methods to two different medical datasets, achieving high Kendall rank correlation coefficients on both, showing that we outperform existing methods by a large margin on our problem and data. Keywords: difficulty estimation; deep metric learning; human classification.
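The evaluation described above boils down to comparing two rankings of the same cases, which is what Kendall's rank correlation measures; a minimal illustration with placeholder difficulty scores (not the paper's embeddings or data) follows.

```python
# Kendall rank correlation between assumed human and estimated difficulty scores.
from scipy.stats import kendalltau

human_difficulty = [0.10, 0.35, 0.40, 0.55, 0.80, 0.90]      # placeholder ground truth
estimated_difficulty = [0.20, 0.30, 0.50, 0.45, 0.85, 0.95]  # placeholder model output

tau, p_value = kendalltau(human_difficulty, estimated_difficulty)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```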
Preprint
Full-text available
Background: Multiple organ dysfunction syndrome (MODS) disproportionately drives sepsis morbidity and mortality among children. The biology of this heterogeneous syndrome is complex, dynamic, and incompletely understood. Gene expression signatures correlated with MODS trajectories may facilitate identification of molecular targets and predictive enrichment. Methods: Secondary analyses of publicly available datasets. (1) Supervised machine learning (ML) was used to identify genes correlated with persistent MODS relative to those without MODS in the derivation cohort. Model performance was tested across 4 validation cohorts, among children and adults with differing inciting causes of organ dysfunction, to identify a stable set of genes and a fixed classification model that reliably estimate the risk of MODS. Clinical propensity scores, where available, were used to enhance model performance. (2) We identified organ-specific dysfunction signatures by eliminating redundancies between the shared MODS signature and those of individual organ dysfunctions. (3) Finally, novel patient subclasses were identified through unsupervised hierarchical clustering of genes correlated with persistent MODS and compared with previously established pediatric septic shock endotypes. Results: 568 genes were differentially expressed, among which ML identified 109 genes that were consistently correlated with persistent MODS. The AUROC of a model that incorporated the stable features chosen from repeated cross-validation experiments to estimate the risk of MODS was 0.87 (95% CI: 0.85–0.88). Model performance using the top 20 genes and an ExtraTree classification model yielded AUROCs ranging from 0.77 to 0.96 among the validation cohorts. Genes correlated with day 3 and day 7 cardiovascular, respiratory, and renal dysfunctions were identified. Finally, the top 50 genes were used to discover four novel subclasses, of which patients belonging to M1 and M2 had the worst clinical outcomes. Reactome pathway analyses revealed a potential role of the transcription factor RUNX1 in distinguishing subclasses. Interaction with receipt of adjuvant steroids suggested that the newly derived M1 and M2 endotypes were biologically distinct relative to established endotypes. Conclusions: Our data suggest the existence of complex sub-endotypes among children with septic shock, wherein overlapping biological pathways may be linked to differential responses to therapies. Future studies in cohorts enriched for patients with MODS may facilitate discovery and development of disease-modifying therapies for subsets of critically ill children with sepsis.
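The model evaluation reported above (a tree ensemble on a small gene panel, scored by AUROC under repeated cross-validation) can be sketched with standard tools. The snippet assumes scikit-learn's ExtraTreesClassifier, which may differ from the exact ExtraTree model the authors used, and replaces the gene-expression cohorts with synthetic stand-in data.

```python
# Hedged sketch: extra-trees classifier scored by AUROC under repeated CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)  # synthetic stand-in for a 20-gene panel
clf = ExtraTreesClassifier(n_estimators=500, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
aurocs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUROC: {aurocs.mean():.2f} +/- {aurocs.std():.2f}")
```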
Article
Medical decision support systems have been on the rise with technological advances, and they have been the subject of many studies. Developing an effective medical decision support system requires high accuracy, precision, and sensitivity, as well as time efficiency, which is inversely proportional to the complexity of the model. Hepatitis C virus (HCV) infection is one of the most important causes of chronic liver disease worldwide. In this study, data discovery was carried out by applying data science processes, and HCV infection was predicted with machine learning methods. By analyzing and visualizing the values in the data set, features that may be important for HCV were determined, and HCV prediction was performed using various machine learning methods, pre-processing, and feature extraction. Based on the features obtained in this study, HCV can be predicted automatically, providing a decision support system that helps researchers and clinicians. In this study, HCV was predicted with 99.31% accuracy by adding new features and eliminating imbalances between classes. The model in this study can be used as an alternative method for predicting Hepatitis C-related diseases.
Article
Essential emergency services can be provided, and many lives can be saved, if the severity of a road accident is analyzed well in time. Several works have been proposed to ascertain accident severity in intelligent transportation systems based on traditional machine learning (ML) approaches such as Logistic Regression, Support Vector Machines (SVM), Random Forests, etc. The aim of this research is to propose an efficient and reliable approach, named E-SVM, for classifying the severity of road accidents by combining the feature space of an extreme learning machine (ELM) with an SVM, making the best use of their respective characteristics. The ELM performs feature mapping of the input data, and a radial basis function kernel is then used to train the SVM model, which performs the classification. The Extra-Trees classifier used for feature selection yields a reduced dataset of significant features, which contributes to efficient accident severity classification. The experimental results show that the proposed approach outperforms other state-of-the-art ML classifiers in terms of both computational time and system performance, hence justifying its usage in real-life applications.
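Two of the stages mentioned above are straightforward to reproduce with common tooling: Extra-Trees-based feature selection followed by an RBF-kernel SVM classifier. The sketch below assumes scikit-learn and synthetic data, and deliberately omits the ELM feature mapping that defines the proposed E-SVM, so it illustrates those two stages rather than the authors' model.

```python
# Extra-Trees feature selection feeding an RBF-kernel SVM (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=40, n_informative=10,
                           random_state=0)  # placeholder accident records
pipe = make_pipeline(
    SelectFromModel(ExtraTreesClassifier(n_estimators=200, random_state=0)),
    SVC(kernel="rbf", gamma="scale"),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```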
Chapter
Fake news, usually targeting a person or a group, frequently goes viral as it contains provocative information. Misleading articles have been an issue for a long time, and machine-generated sentences can be so well written that most readers cannot tell they were produced by a machine learning model. To address this problem in our society, this research focused on how fake news and real news can be classified, and on how fake articles can be created with a deep learning model. The first step was collecting data from Kaggle, feeding it to the algorithms, and sorting it. The machine learning algorithms used were LGBM, logistic regression, XGB, KNN, and Gaussian NB; the auto-machine learning models included Naive Bayes, AdaBoost, a linear SVM, LDA, and a dummy classifier, as well as logistic regression and KNN. The machine learning models had an average accuracy of around 83 percent, and most of the auto-machine learning models came close to 100 percent accuracy. The next step was creating fake news with pretrained and fine-tuned GPT-2 models. Unlike the pretrained model, the fine-tuned ones generated longer and more fluent texts that were comparable to human writing. The results from the classifiers demonstrate that the spread of fake news can be prevented, while the generated fake news suggests that creating fake information does not require much effort. Society remains vulnerable to countless fake articles, but further research may mitigate this issue by introducing simpler and more accessible classifiers to readers. Keywords: NLP; fake news; GPT-2; machine learning; deep learning.
Article
Full-text available
Ensemble classifiers originated in the machine learning community. They work by fitting many individual classifiers and combining them by weighted or unweighted voting. The ensemble classifier is often much more accurate than the individual classifiers from which it is built. In fact, ensemble classifiers are among the most accurate general-purpose classifiers available. We introduce a new ensemble method, PERT, in which each individual classifier is a perfectly-fit classification tree with random selection of splits. Compared to other ensemble methods, PERT is very fast to fit. Given the randomness of the split selection, PERT is surpr