BMC Bioinformatics
Open Access
Methodology article
Conditional variable importance for random forests
Carolin Strobl*1, Anne-Laure Boulesteix2, Thomas Kneib1, Thomas Augustin1 and Achim Zeileis3
Address: 1Department of Statistics, Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany; 2Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindener Straße 1, D-81677 München, Germany; 3Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Wien, Austria
Email: Carolin Strobl* - carolin.strobl@stat.uni-muenchen.de; Anne-Laure Boulesteix - boulesteix@slcmsr.org; Thomas Kneib - thomas.kneib@stat.uni-muenchen.de; Thomas Augustin - thomas.augustin@stat.uni-muenchen.de; Achim Zeileis - achim.zeileis@wu-wien.ac.at
* Corresponding author
Published: 11 July 2008    Received: 1 April 2008    Accepted: 11 July 2008
BMC Bioinformatics 2008, 9:307 doi:10.1186/1471-2105-9-307
This article is available from: http://www.biomedcentral.com/1471-2105/9/307
© 2008 Strobl et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Random forests are becoming increasingly popular in many scientific fields because
they can cope with "small n large p" problems, complex interactions and even highly correlated
predictor variables. Their variable importance measures have recently been suggested as screening
tools for, e.g., gene expression studies. However, these variable importance measures show a bias
towards correlated predictor variables.
Results: We identify two mechanisms responsible for this finding: (i) A preference for the
selection of correlated predictors in the tree building process and (ii) an additional advantage for
correlated predictor variables induced by the unconditional permutation scheme that is employed
in the computation of the variable importance measure. Based on these considerations we develop
a new, conditional permutation scheme for the computation of the variable importance measure.
Conclusion: The resulting conditional variable importance reflects the true impact of each
predictor variable more reliably than the original marginal approach.
1 Background
Within the past few years, random forests [1] have
become a popular and widely-used tool for non-paramet-
ric regression in many scientific areas. They show high
predictive accuracy and are applicable even in high-
dimensional problems with highly correlated variables, a
situation which often occurs in bioinformatics. Recently,
the variable importance measures yielded by random for-
ests have also been suggested for the selection of relevant
predictor variables in the analysis of microarray data,
DNA sequencing and other applications (see, e.g., [2-5]).
Identifying relevant predictor variables, rather than only
predicting the response by means of some "black-box"
model, is of interest in many applications. By means of
variable importance measures the candidate predictor var-
iables can be compared with respect to their impact in pre-
dicting the response or even their causal effect (see, e.g.,
[6] for assumptions necessary for interpreting the impor-
tance of a variable as a causal effect). In this case a key
advantage of random forest variable importance meas-
ures, as compared to univariate screening methods, is that
they cover the impact of each predictor variable individu-
ally as well as in multivariate interactions with other pre-
dictor variables. For example, Lunetta et al. [2] find that
genetic markers relevant in interactions with other mark-
ers or environmental variables can be detected more effi-
ciently by means of random forests than by means of
univariate screening methods like Fisher's exact test. In the
analysis of amino acid sequence data Segal et al. [7] also
point out the necessity to consider interactions between
sequence positions. Tree-based methods like random for-
ests can help identify relevant predictor variables even in
such high dimensional settings involving complex inter-
actions. Therefore, the impact of different amino acid
properties, some of which have been shown to be relevant
in DNA and protein evolution [8], for predicting peptide
binding is investigated in our application example in Sec-
tion 4. However, we will find in this application example,
as often in practical problems, that many predictor varia-
bles are highly correlated.
The issue of correlated predictor variables is prominent in,
but not limited to, applications in genomics and other
high-dimensional problems. Therefore, it is important to
note that in any non-experimental scientific study, where
the predictor variable settings cannot be manipulated
independently by the investigator, the distinction
between the marginal and the conditional effect of a vari-
able is crucial.
Consider, for example, the apparent correlation between
rates of complication after surgery and mortality in hospi-
tals, that was investigated by Silber and Rosenbaum [9]. It
is plausible to believe that the mortality rate of a hospital
depends on the rate of complications – or even that the
mortalities are caused by the complications. However,
when severity of illness is taken into account, the correla-
tion disappears [9].
This phenomenon is known as a spurious correlation (see
also Stigler [10] for a historical example). In the hospital
mortality example, the spurious correlation is caused by
the fact that hospitals that treat many serious cases have
both higher complication and mortality rates. However,
when conditioning on severity of illness (i.e. comparing
only patients with similar severity of illness), mortality is
no longer associated with complications.
If you consider this as a prediction problem, once the truly
influential background variable (severity of illness) is
known, it is clear that the remaining covariate (complica-
tion rate) provides no or little additional information for
predicting the response (mortality rate). From a statistical
point of view, however, this distinction can only be made
by a conditional importance measure.
We will point out throughout this article that correla-
tions between predictor variables – regardless of whether
they arise from small-scale characteristics, such as proxim-
ities between genetic loci in organisms, or large-scale char-
acteristics, such as similarities in the clientele of hospitals
– severely affect the original random forest variable
importance measures, because they can be considered as
measures of marginal importance, even though what is of
interest in most applications is the conditional effect of
each variable. To make this distinction more clear, let us
shortly review previous suggestions from the literature for
measuring or illustrating variable importance in classifica-
tion and regression trees (termed "classification trees" in
the following for brevity, while all results apply to both
classification and regression trees) and random forests:
Breiman [11] displays the change in the response variable
over the range of one predictor variable in "partial
dependence plots" (see also [12] for a related approach).
This is reminiscent of the interpretation of model coeffi-
cients in linear models. However, whether the effect of a
variable is interpretable as conditional on all other varia-
bles, as in linear models, may not be guaranteed in other
models – and we will point out explicitly below that this
is not the case in classification trees or random forests.
The permutation accuracy importance, which is described in
more detail in Section 2.3, follows the rationale that a ran-
dom permutation of the values of the predictor variable is
supposed to mimic the absence of the variable from the
model. The difference in the prediction accuracy before
and after permuting the predictor variable, i.e. with and
without the help of this predictor variable, is used as an
importance measure. The actual permutation accuracy
importance measure will be termed "permutation impor-
tance" in the following, while the general concept of the
impact of a predictor variable in predicting the response is
termed "variable importance". The alternative variable
importance measure used in random forests, the Gini
importance, is based on the principle of impurity reduc-
tion that is followed in most traditional classification tree
algorithms. However, it has been shown to be biased
when predictor variables vary in their number of catego-
ries or scale of measurement [13], because the underlying
Gini gain splitting criterion is a biased estimator and can
be affected by multiple testing effects [14]. Therefore, we
will focus on the permutation importance in the follow-
ing, which is reliable when subsampling without replace-
ment – instead of bootstrap sampling – is used in the
construction of the forest [13].
Based on the permutation importance, schemes for varia-
ble selection and for providing statements of the "signifi-
cance" of a predictor variable (instead of a merely
descriptive ranking of the variables w.r.t. their importance
scores) have been derived: Breiman and Cutler [15] sug-
gest a simple significance test that, however, shows poor
statistical properties [16]. An approach for variable selec-
tion in large scale screening studies is introduced by Diaz-
Uriarte and Alvarez de Andres [17], who suggest a back-
ward elimination strategy. This approach has been shown
to provide a reasonable selection of genes in many situa-
tions and is freely available in an R package [18], that also
provides different plots for comparing the performance
on the original data set to those on a data set with ran-
domly permuted values of the response variable. The lat-
ter mimics the overall null hypothesis that none of the
predictor variables is relevant and may serve as a baseline
for significance statements. A similar approach is followed
by Rodenburg et al. [19]. However, some recent simula-
tion studies indicate that the performance of the variable
importance measures may not be reliable when predictor
variables are correlated: Even though Archer and Kimes
[20] show in their extensive simulation study that the Gini
importance can identify influential predictor variables out
of sets of correlated covariates in many settings, the pre-
liminary results of the simulation study of Nicodemus
and Shugart [21] indicate that the ability of the permuta-
tion importance to detect influential predictor variables in
sets of correlated covariates is less reliable than that of
alternative machine learning methods and depends strongly on the number of randomly preselected splitting variables mtry. These studies, as well as our simulation results, indicate that random forests show a preference for correlated predictor variables, which is also carried forward to any significance test or variable selection scheme constructed from the importance measures.
In this work we aim at providing a deeper understanding
of the underlying mechanisms responsible for the obser-
vations of [20] and [21]. In addition to this, we want to
broaden the scope of considered problems to the compar-
ison of the influence of correlated and uncorrelated pre-
dictor variables. For this type of problem we introduce a
new, conditional permutation importance for random
forests, which better reflects the true importance of predictor
variables. Our approach is motivated by the visual means
of illustration introduced by Nason et al. [22]: In their "CARTscans" plots they display not only the marginal influence of a predictor variable, as in the partial dependence plots of Breiman [11], but also the influence of continuous predictor variables separately for the levels of two other, categorical predictor variables, i.e. a conditional influence plot.
As pointed out above, in the case of correlated predictor
variables it is important to distinguish between condi-
tional and marginal influence of a variable, because a var-
iable that may appear influential marginally might
actually be independent of the response when considered
conditional on another variable. In this respect the
approach of [22] is an important improvement, but in its
current form is only applicable for categorical covariates.
Therefore our aim in this work is to provide a general
scheme that can be used both for illustrating the effect of
a variable and for computing its permutation importance
conditional on relevant covariates of any type. While the
conditioning scheme of [22] can be considered as a full-
factorial cross-tabulation based on two categorical predic-
tor variables, our conditioning scheme is based on a par-
tition of the entire feature space that is determined directly
by the fitted random forest model.
In the following Section 2 we will outline how ensembles
of classification trees are constructed and illustrate in a
simulation study why correlated predictor variables tend
to be overselected. Then we will review the construction of
the original permutation importance before we introduce
a new permutation scheme that we suggest for the con-
struction of a conditional permutation importance meas-
ure. The advantage of this measure over the currently-used
one is illustrated in the results of our simulation study in
Section 3 and in the application to peptide-binding data
in Section 4.
2 Methods
In random forests and the related method bagging, an
ensemble of classification trees is created by means of
drawing several bootstrap samples or subsamples from
the original training data and fitting a single classification
tree to each sample. Due to the random variation in the
samples and the instability of the single classification
trees, the ensemble will consist of a diverse set of trees. For
prediction, a vote (or average) over the predictions of the single trees is used and has been shown to substantially outperform the single trees: By combining the predictions of a diverse set of trees, bagging exploits the fact that classification trees are unstable but on average produce the right prediction. This understanding has been supported by several empirical studies (see, e.g., [23-26]) and especially the theoretical results of Bühlmann and Yu [27], who showed that the improvement in the prediction accuracy of ensembles is achieved by smoothing the hard-cut decision boundaries created by splitting in single classification trees, which in turn reduces the variance of the prediction.
In random forests, another source of diversity is intro-
duced when the set of predictor variables to select from is
randomly restricted in each split, producing even more
diverse trees. In addition to the smoothing of hard deci-
sion boundaries, the random selection of splitting varia-
bles in random forests allows predictor variables that were
otherwise outplayed by their competitors to enter the
ensemble. Even though these variables may not be opti-
mal with respect to the current split, their selection may
reveal interaction effects with other variables that other-
wise would have been missed and thus work towards the
global optimality of the ensemble.
The classification trees from which random forests are built are grown recursively, in that the next splitting variable is selected by locally optimizing a criterion (such as the Gini gain in the traditional CART algorithm [28]) within the current node. This current node is
defined by a configuration of predictor values, that is
determined by all previous splits in the same branch of
the tree (see, e.g., [29] for illustrations). In this respect the
evaluation of the next splitting variable can be considered
conditional on the previously selected predictor variables,
but regardless of any other predictor variable. In particu-
lar, the selection of the first splitting variable involves only
the marginal, univariate association between that predic-
tor variable and the response, regardless of all other pre-
dictor variables. However, this search strategy leads to a
variable selection pattern where a predictor variable that is
per se only weakly or not at all associated with the
response, but is highly correlated with another influential
predictor variable, may appear equally well suited for
splitting as the truly influential predictor variable. We will
illustrate this point in more detail in the following simu-
lation study.
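To make the locally optimal search strategy concrete, the following is a minimal sketch in R of a greedy split search using the Gini gain for a binary response coded as 0/1. This is an illustration only, not the implementation used later in this paper (the cforest trees employ conditional inference tests rather than the Gini gain), and the function names gini() and best_split() are ours.

# Greedy, locally optimal split search: one variable and one cutpoint at a
# time, evaluated only within the current node (binary response y in {0, 1}).
gini <- function(y) { p <- mean(y); 2 * p * (1 - p) }   # Gini impurity
best_split <- function(X, y) {
  best <- list(gain = -Inf)
  for (j in seq_len(ncol(X))) {            # consider each variable separately
    for (cp in unique(X[, j])) {           # each observed value as candidate cutpoint
      left <- X[, j] <= cp
      if (all(left) || !any(left)) next    # skip splits with an empty child node
      gain <- gini(y) - mean(left) * gini(y[left]) - (1 - mean(left)) * gini(y[!left])
      if (gain > best$gain) best <- list(var = j, cutpoint = cp, gain = gain)
    }
  }
  best                                     # variable and cutpoint with maximal Gini gain
}

Because the criterion is evaluated only within the current node, a variable that merely mirrors an influential, correlated competitor can achieve a gain almost as large as the truly influential variable, which is the selection pattern illustrated in the simulation study below.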
2.1 Simulation design
A simulation study was set up in order to illustrate the
treatment of correlated predictor variables in ensemble
methods based on classification trees. Data sets were gen-
erated according to a linear model with twelve predictor variables,

y_i = β_1·x_i,1 + ... + β_12·x_i,12 + ε_i,   with ε_i ~ N(0, 0.5) i.i.d.

The predictor variables were sampled from a multivariate normal distribution X_1,..., X_12 ~ N(0, Σ), where the covariance structure Σ was chosen such that all variables have unit variance σ_j,j = 1 and only the first four predictor variables are block-correlated with σ_j,j' = 0.9 for j ≠ j' ≤ 4, while the rest were independent with σ_j,j' = 0.
Of the twelve predictor variables only six were influential,
as indicated by their coefficients in Table 1. A covariance
structure of this type was already used for illustrating the
effect of correlations by Archer and Kimes [20]. However,
while their study mainly aimed at identifying one influen-
tial predictor out of a correlated set, here we also want to
compare the importance scores of predictor variables with
equally large coefficients, while some of the predictor var-
iables are correlated and others are not: X_1,..., X_4 and X_5,..., X_8 share the same coefficient pattern, while only X_1,..., X_4 are correlated. From the generated data sets, random for-
ests were built with the cforest function from the party
package [30,31] in the R system for statistical computing
[32]. Different values for the parameter mtry, that regu-
lates the number of randomly preselected splitting varia-
bles, were considered to be able to investigate the
mechanisms responsible for the results of Nicodemus and
Shugart [21]. Default settings were used for all other
parameters.
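A hedged sketch of one replication of this design is given below. The sample size per data set is not stated in this excerpt (n = 100 is an assumption for illustration), N(0, 0.5) is read here as a standard deviation of 0.5, and rmvnorm() from the mvtnorm package is used to draw the correlated predictors.

# One simulated data set and a cforest fit, under the stated assumptions.
library(mvtnorm)   # rmvnorm()
library(party)     # cforest(), cforest_unbiased()

set.seed(42)
n <- 100                                   # assumption: not given in this excerpt
Sigma <- matrix(0, 12, 12)
Sigma[1:4, 1:4] <- 0.9                     # block correlation among X1,...,X4
diag(Sigma) <- 1                           # unit variances
X <- rmvnorm(n, sigma = Sigma)
colnames(X) <- paste0("X", 1:12)
beta <- c(5, 5, 2, 0, -5, -5, -2, rep(0, 5))   # coefficients as in Table 1
y <- as.vector(X %*% beta) + rnorm(n, sd = 0.5)
dat <- data.frame(y = y, X)

rf <- cforest(y ~ ., data = dat,
              control = cforest_unbiased(mtry = 3, ntree = 500))

Repeating this for several values of mtry (here 1, 3 and 8) and many replications yields the selection frequencies and importance scores reported below.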
2.2 Illustration of variable selection
We find in the panel on the left hand side of Figure 1 that in the first splits of all trees, where the variables are considered only marginally with respect to their association with the response, the variables correlated with highly influential predictors (X_3 and X_4) are selected as often as the highly influential predictor variables themselves (X_1 and X_2 as well as X_5 and X_6) for mtry = 1, where no competitors are available and the correlated predictors can serve as replacements for the influential ones (the fact that the non-influential predictor variables X_8 through X_12 are selected almost equally often is only due to the lax choice of the stopping criterion). When mtry increases, so that the highly influential variables may be available as predominant competitors in some splits, the variables correlated with highly influential predictors (X_3 and X_4) are selected less often than the highly influential correlated ones (X_1 and X_2) themselves, but still more often than even the highly influential uncorrelated ones (X_5 and X_6). When we consider all splits of all trees in the panel on the right hand side of Figure 1, the correlated predictors lose most of their advantage, because variable selection is now conditional on the previously chosen variables in the same branch of the tree, which may include the truly influential correlated predictors. However, since variable selection is not conditional on all (or at least all correlated) variables, there is still a preference for the correlated variables with low and zero coefficients (X_3 and X_4 over X_7 and X_8), with a similar dependency on mtry.
This selection pattern is due to the locally optimal variable selection scheme used in recursive partitioning, which considers only one variable at a time, conditional only on the current branch. However, since this characteristic of
tree-based methods is a crucial means of reducing compu-
tational complexity (and any attempts to produce globally
optimal partitions are strictly limited to low dimensional
problems at the moment, see [33]), it shall remain untouched here.

Table 1: Simulation design. Regression coefficients of the data generating process.

X_j:   X_1   X_2   X_3   X_4   X_5   X_6   X_7   X_8   ...   X_12
β_j:    5     5     2     0    -5    -5    -2     0    ...    0
2.3 The permutation importance
The rationale of the original random forest permutation importance is the following: By randomly permuting the predictor variable X_j, its original association with the response Y is broken. When the permuted variable X_j, together with the remaining non-permuted predictor variables, is used to predict the response for the out-of-bag observations, the prediction accuracy (i.e. the number of observations classified correctly) decreases substantially if the original variable X_j was associated with the response. Thus, Breiman [1] suggests the difference in prediction accuracy before and after permuting X_j, averaged over all trees, as a measure for variable importance, that we formalize as follows: Let B^(t) be the out-of-bag (oob) sample for a tree t, with t ∈ {1,..., ntree}. Then the variable importance of variable X_j in tree t is

VI^(t)(X_j) = Σ_{i ∈ B^(t)} I(y_i = ŷ_i^(t)) / |B^(t)|  −  Σ_{i ∈ B^(t)} I(y_i = ŷ_{i,π_j}^(t)) / |B^(t)|,   (1)

where ŷ_i^(t) = f^(t)(x_i) is the predicted class for observation i before and ŷ_{i,π_j}^(t) = f^(t)(x_{i,π_j}) is the predicted class for observation i after permuting its value of variable X_j, i.e. x_{i,π_j} = (x_{i,1},..., x_{i,j−1}, x_{π_j(i),j}, x_{i,j+1},..., x_{i,p}). (Note that VI^(t)(X_j) = 0 by definition, if variable X_j is not in tree t.) The raw variable importance score for each variable is then computed as the mean importance over all trees:

VI(X_j) = Σ_{t=1}^{ntree} VI^(t)(X_j) / ntree.

Figure 1: Selection rates. Relative selection rates for twelve variables in the first splits (left) and in all splits (right) of all trees in random forests built with different values for mtry (mtry = 1, 3 and 8).

In standard implementations of random forests an additional scaled version of the permutation importance
(often called z-score), which is obtained by dividing the raw importance by its standard error, is provided. However, since recent results ([16], see also [17]) indicate that the raw importance VI(X_j) has better statistical properties, we will only consider the unscaled version here.
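The following R sketch spells out Equation (1) for a single tree. It is an illustration under assumptions: predict_tree(tree, newdata) is a hypothetical helper returning that tree's predicted classes, and oob holds the row indices of the out-of-bag observations of that tree.

# Per-tree permutation importance as in Equation (1), unconditional scheme.
vi_tree <- function(tree, X, y, oob, j) {
  acc_before <- mean(predict_tree(tree, X[oob, , drop = FALSE]) == y[oob])
  X_perm <- X[oob, , drop = FALSE]
  X_perm[, j] <- sample(X_perm[, j])   # unconditional permutation of X_j
  acc_after <- mean(predict_tree(tree, X_perm) == y[oob])
  acc_before - acc_after               # VI^(t)(X_j)
}
# The forest-level importance VI(X_j) is the average of vi_tree() over all trees.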
2.4 Types of independence
We know that the original permutation importance over-
estimates the importance of correlated predictor variables.
Part of this artefact may be due to the preference of corre-
lated predictor variables in early splits as illustrated in Sec-
tion 2.2. However, we also have to take into account the
permutation scheme that is employed in the computation
of the permutation importance. In the following we will
first outline what notion of independence corresponds to
the current permutation scheme of the random forest per-
mutation importance. Then we will introduce a more sen-
sible permutation scheme that better reflects the true
impact of predictor variables.
It can help our understanding to consider the permuta-
tion scheme in the context of permutation tests (see, e.g.,
[34]): Usually a null hypothesis is considered that implies
the independence of particular (sets of) variables. Under
this null hypothesis some permutations of the data are
permitted because they preserve the structure determined
by the null hypothesis. If, for example, the response vari-
able Y is independent from all predictor variables (global
null hypothesis) a permutation of the (observed) values
of Y affects neither the marginal distribution of Y nor the
joint distribution of X
1
,..., X
p
and Y, because the joint dis-
tribution can be factorized as P(Y, X
1
,..., X
p
) =
P(YP(X
1
,..., X
p
) under the null hypothesis. If, however,
the null hypothesis is not true, the same permutation will
lead to a deviation in the joint distribution or some rea-
sonable test statistic computed from it. Therefore, a
change in the distribution or test statistic caused by the
permutation can serve as an indicator that the data do not
follow the independence structure we would expect under
the null hypothesis.
With this framework in mind, we can now take a second
look at the random forest permutation importance and
ask: Under which null hypothesis would this permutation
scheme be permitted? If the data are actually generated
under this null hypothesis the permutation importance
will be (a random value from a distribution with mean)
zero, while any deviation from the null hypothesis will
lead to a change in the prediction accuracy, that is used as
a test statistic here, and thus will be detectable as an
increase in the value of the permutation importance.
We find that the original permutation importance, where one predictor variable X_j is permuted against both the response Y and the remaining (one or more) predictor variables Z = X_1,..., X_{j−1}, X_{j+1},..., X_p, as illustrated in the left panel of Figure 2, corresponds to a null hypothesis of independence between X_j and both Y and Z:

H_0: X_j ⊥ Y, Z   or equivalently   X_j ⊥ Y and X_j ⊥ Z.   (2)

Under this null hypothesis the joint distribution can be factorized as

P(Y, X_j, Z) = P(Y, Z) · P(X_j) under H_0.   (3)

What is crucial when we want to understand why correlated predictor variables are preferred by the original random forest permutation importance is that a positive value of the importance corresponds to a deviation from this null hypothesis, which can be caused by a violation of either part: the independence of X_j and Y, or the independence of X_j and Z. However, of these two aspects only one is of interest when we want to assess the impact of X_j in helping to predict Y, namely the question whether X_j and Y are independent. This aim, to measure only the impact of X_j on Y, would be better reflected if we could create a measure of deviation from the null hypothesis that X_j and Y are independent under a given correlation structure between X_j and the other predictor variables, as determined by our data set. To meet this aim we suggest a conditional permutation scheme, where X_j is permuted only within groups of observations with Z = z, to preserve the correlation structure between X_j and the other predictor variables, as illustrated in the right panel of Figure 2.

This permutation scheme corresponds to the following null hypothesis:

H_0: (X_j ⊥ Y) | Z,   (4)

where the conditional distribution can be factorized under the null hypothesis as

P(Y, X_j | Z) = P(Y | Z) · P(X_j | Z),   or equivalently   P(Y | X_j, Z) = P(Y | Z),   (5)

which is the definition of conditional independence.

In the special case where X_j and Z are independent, both permutation schemes will give the same result, as illustrated by our simulation results below. When X_j and Z are correlated, however, the original permutation scheme will lead to an apparent increase in the importance of correlated predictor variables, which is due to deviations from the uninteresting null hypothesis of independence between X_j and Z.
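The difference between the two null hypotheses can be made tangible with a small, self-contained illustration in R. The toy data below are ours (they are not the simulation design of Section 2.1): an unconditional permutation of X_j destroys its correlation with Z, whereas permuting X_j only within groups of a discretized Z approximately preserves it.

# Toy illustration of the two permutation schemes (artificial data).
set.seed(1)
z  <- rnorm(1000)
xj <- z + rnorm(1000, sd = 0.3)                 # X_j strongly correlated with Z
cor(xj, z)                                      # close to 1
cor(sample(xj), z)                              # close to 0: mimics independence of X_j from (Y, Z)
grp <- cut(z, quantile(z, 0:10 / 10), include.lowest = TRUE)
xj_cond <- ave(xj, grp, FUN = sample)           # permutation within groups of Z
cor(xj_cond, z)                                 # still large: mimics independence of X_j and Y given Z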
2.5 A new, conditional permutation scheme
Technically, any kind of conditional assessment of the
importance of one variable conditional on another one is
straightforward whenever the variables to be conditioned
on, Z, are categorical as in [22]. However, for our aim to conditionally permute the values of X_j within groups of Z = z, where Z can contain potentially large sets of covariates of different scales of measurement, we want to supply a grid that (i) is applicable to variables of different types, (ii) is as parsimonious as possible, but (iii) is also computationally feasible. Our suggestion is to define the grid within which the values of X_j are permuted for each tree by
means of the partition of the feature space induced by that
tree. The main advantages of this approach are that this
partition was already learned from the data during model
fitting, contains splits in categorical, ordered and continu-
ous predictor variables and can thus serve as an internally
available means for discretizing the feature space.
In principle, any partition derived from a classification tree can be used to define the permutation grid. Here we used partitions produced by unbiased conditional inference trees [31], which employ binary splitting as in the standard CART algorithm [28]. This means that, if k is the number of categories of an unordered or ordered categorical variable, up to k, but potentially fewer than k, subsets of the data are separated.
Continuous variables are treated in the same way: Every binary split in a variable provides one or more cutpoints, which induce a more or less finely graded grid on this variable. By using the grid resulting from the current tree we are able to condition in a straightforward way not only on categorical, but also on continuous variables, and to create a grid that may be more parsimonious than the full-factorial approach of [22]. In only one respect do we suggest departing from the recursive partition induced by a tree: Within a tree structure, each cutpoint refers to a split in a variable only within the current node (i.e. a split in a variable may not bisect the entire sample space but only partial planes of it). However, for ease of computation, we suggest that the conditional permutation grid use all cutpoints as bisectors of the sample space (the same approach is followed by [22]). This leads to a more finely graded grid, and may in some cases result in small cell frequencies inducing greater variation (even though our simulation results indicate that in practice this is not a critical issue). From a theoretical point of view, however, conditioning too strictly has no negative effect, while a lack of conditioning produces artefacts as observed for the unconditional permutation importance.
In summary, the conditional permutation importance is derived as follows:

1. In each tree compute the oob-prediction accuracy before the permutation as in Equation 1: Σ_{i ∈ B^(t)} I(y_i = ŷ_i^(t)) / |B^(t)|.

2. For all variables Z to be conditioned on: Extract the cutpoints that split this variable in the current tree and create a grid by bisecting the sample space at each cutpoint.

3. Within this grid permute the values of X_j and compute the oob-prediction accuracy after permutation: Σ_{i ∈ B^(t)} I(y_i = ŷ_{i,π_j|Z}^(t)) / |B^(t)|, where ŷ_{i,π_j|Z}^(t) = f^(t)(x_{i,π_j|Z}) is the predicted class for observation i after permuting its value of variable X_j within the grid defined by the variables Z (a short sketch of this step is given after this list).

4. The difference between the prediction accuracy before and after the permutation again gives the importance of X_j for one tree (see Equation 1). The importance of X_j for the forest is again computed as an average over all trees.
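The sketch below illustrates step 3 under the assumption that step 2 has already converted each conditioning variable into a factor of grid cells (a data frame Z_grid of factors); the function name conditional_permute() is ours, not part of any package.

# Permute X_j only within the cells of the conditioning grid.
conditional_permute <- function(x_j, Z_grid) {
  cell <- interaction(Z_grid, drop = TRUE)         # one stratum per grid cell
  x_perm <- x_j
  for (g in levels(cell)) {
    idx <- which(cell == g)
    x_perm[idx] <- x_j[idx][sample(length(idx))]   # permute within the cell only
  }
  x_perm
}

In the conditional scheme, this within-grid permutation replaces the unconditional permutation of X_j when evaluating Equation (1) for each tree.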
Figure 2: Permutation schemes. Permutation scheme for the original marginal (left) and for the newly suggested conditional (right) permutation importance.

To determine the variables Z to be conditioned on, the most conservative – or rather overcautious – strategy would be to include all other variables as conditioning variables, as was indicated by our initial notation. A more intuitive choice is to include only those variables whose empirical correlation with the variable of interest X_j exceeds a certain moderate threshold, as we do with the Pearson correlation coefficient for continuous variables in the following simulation study and application example. For the more general case of predictor variables of different scales of
measurement the framework promoted by Hothorn et al.
[31] provides p-values of conditional inference tests as
measures of association. The p-values have the advantage
that they are comparable for variables of all types and can
serve as an intuitive and objective means for selecting the
variables Z to be conditioned on in any problem. Another
option is to let the user himself select certain variables to
condition on, if, e.g., a hypothesis of interest includes cer-
tain independencies.
Note however, that neither a high number of condition-
ing variables nor a high overall number of variables in the
data set poses a problem for the conditional permutation
approach: The permutation importance is computed indi-
vidually for each tree and then averaged over all trees. Cor-
respondingly, the conditioning grid for each tree is
determined by the partition of that particular tree only.
Thus, even if in principle the stability of the permutation
may be affected by small cell counts in the grid, practically
the complexity of the grid is limited by the depth of each
tree.
The depth of the tree, however, does not depend on the
overall number of predictor variables, but on various
other characteristics of the data set (most importantly the
ratio of relevant vs. noise variables, that is usually low, for
example in genomics) in combination with tuning
parameter settings (including the number of randomly
preselected predictor variables, the split selection crite-
rion, the use of stopping criteria and so forth). Lin and
Jeon [35] even point out that limiting the depth of the
trees in random forests may prove beneficial w.r.t. predic-
tion accuracy in certain situations.
Another important aspect is that the conditioning varia-
bles, especially if there are many, may not necessarily
appear all together with the variable of interest in each
individual tree, but different combinations may be repre-
sented in different trees if the forest is large enough.
3 Results
For the simulation design introduced in Section 2.1, Fig-
ure 3 shows the median and interquartile range (over 500
iterations) of the importance scores of each variable for
the different permutation schemes: the original marginal
permutation and the newly suggested conditional permu-
tation scheme. The set of variables Z to be conditioned on
was chosen here to include all variables with an empirical
correlation r ≥ .2.
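A minimal sketch of this conditioning-set rule for continuous predictors is given below; X is assumed to be the numeric predictor matrix, the function name select_Z is ours, and taking the absolute value of the correlation is an assumption (in this design the relevant correlations are all positive).

# Conditioning set for X_j: all other predictors with |r| >= 0.2.
select_Z <- function(X, j, threshold = 0.2) {
  r <- abs(cor(X[, j], X[, -j]))
  colnames(X)[-j][as.vector(r) >= threshold]
}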
We find that the pattern of the coefficients induced in the
data generating process is not reflected by the importance
values computed with the ordinary permutation scheme.
With this scheme the importance scores of the correlated
predictor variables are highly overestimated. This effect is
most pronounced for small values of mtry, because corre-
lated variables have a higher chance to end up in a top
position in a tree when their correlated competitors are
not available.
For the conditional permutation scheme the importance scores better reflect the true pattern: The correlated variables X_1 and X_2 with the same coefficient show an almost equal level of importance as the uncorrelated variables X_5 and X_6, while the importance of X_3 and X_4, which are correlated but have a lower or zero coefficient, decreases. For the
variables with small and zero coefficients we still find a
difference between the correlated and uncorrelated varia-
bles, such that for the correlated variables the importance
values are still overestimated – however to a much lesser
extent than with the unconditional permutation scheme.
This remaining disadvantage of the uncorrelated predictor
variables may be due to the fact that for most values of
mtry these variables are selected less often and in lower
positions in the tree (see Figure 1) and thus have a lower
chance to produce a high importance value. The degree of
the preference of correlated predictor variables also
depends on the choice of mtry and is most pronounced
for small values of mtry, as expected from the selection
frequencies.

Figure 3: Permutation importance. Median permutation importance for the marginal (dashed) and conditional (solid) permutation scheme along with the inter-quartile range, for mtry = 1, 3 and 8. Note that the ordering of variables in the plot is arbitrary.

On the other hand, we find in Figure 3 that
the variability of the importance increases for large values
of mtry, and the prediction accuracy is expected to be
higher for smaller values of mtry. Another interesting fea-
ture of the conditional permutation scheme is that the
variability of the conditional importance is lower than
that of the unconditional importance within each level of
mtry.
With respect to the identifiability of few influential predic-
tors from a set of correlated and other noise variables
(which was the task in [20] and [21]), we can see from the
importance scores for X_1,..., X_3 in comparison to that of X_4
that the conditional importance reflects the same pattern
as the unconditional importance, however with a notably
smaller variation that may improve the identifiability. In
the comparison of potentially influential correlated and
uncorrelated predictor variables on the other hand, the
conditional importance is much better suited as a means
of comparison than the original importance. For piece-
wise constant functions, that can be more easily addressed
with recursive partitioning methods, the beneficial effect
of conditioning is even stronger than presented here.
4 Example: Relating amino acid sequence to
phenotype in peptide-binding data
As an application example we consider peptide-binding
data that were previously analysed with recursive parti-
tioning techniques by Segal et al. [7]. The data set includes
105 variables for a total of n = 310 amino acid sequences.
The response to be predicted is a binding property that
can be coded as a binary variable (binding/no binding).
The remaining variables available in this data set corre-
spond to 13 amino acid properties for each of the eight
considered amino acid positions. These 13 properties
include, e.g. volume, polarity, bulkiness, flexibility, aro-
maticity, and charge, yielding in total 104 continuous pre-
dictor variables. A random forest with 1000 trees and mtry
= 104 (which corresponds to bagging [23,24] as a special
case of a random forest where mtry is equal to the number
of candidate predictors and variable selection is not ran-
domly restricted) was fit to the data set. The permutation
importance was computed either with the unconditional
or the conditional permutation scheme. The resulting
importance scores are displayed in Figure 4 (note that the
absolute values of the scores should not be interpreted).
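The following is a hedged sketch of this analysis in R; peptide and binding are placeholder names for the data set and its binary response (assumed to be coded as a factor), and the varimp() interface with a conditional argument reflects later releases of the party package, whose argument names may differ across versions.

# Fit the forest (mtry = 104, i.e. bagging) and compute both importances.
library(party)
rf_pep <- cforest(binding ~ ., data = peptide,
                  control = cforest_unbiased(mtry = 104, ntree = 1000))
vi_marginal    <- varimp(rf_pep)                      # unconditional permutation scheme
vi_conditional <- varimp(rf_pep, conditional = TRUE)  # conditional permutation scheme
sort(vi_conditional, decreasing = TRUE)[1:10]         # highest-ranked predictors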
The few predictor variables whose importance scores are highest, or that even exceed the plotting area, would be selected for further analysis in any case. However, for some of the variables with the next smaller importance scores the ranking strongly depends on the permutation scheme. We will focus our illustration on the ranking of three exemplary predictor variables, "h2y8", "flex8" and "pol3", which are highlighted in Figure 4: We find in the unconditional view in the top panel of Figure 4 that "h2y8" and "flex8" appear to be of higher importance than "pol3" (ranks "h2y8": 8, "flex8": 9, "pol3": 11). However, in the conditional view in the bottom panel of Figure 4 their order is reversed and it turns out that "pol3" is actually more important than "h2y8" and "flex8" (ranks "h2y8": 9, "flex8": 8, "pol3": 7). This change in the ranks
of the predictor variables is most pronounced for large
mtry as expected, but similar effects can be observed for
smaller values.
When exploring the reason why the importances of
"h2y8" and "flex8" are moderated by conditioning, while
the importance of "pol3" remains almost constant, we
find that "h2y8" and "flex8" are correlated with influen-
tial covariates, while "pol3" is only correlated with non-
influential covariates. For example, "h2y8" is highly correlated with the polarity at position eight, "pol8", which is indicated by the * symbol in Figure 4. The variable "pol8" shows a high importance (which is, however, also moderated by conditioning) and was already found to be influential by Segal et al. [7], who note that it may approximate an effect of the eighth position in the original sequence data, while the results of Xia and Li [8] indicate an effect of the amino acid property polarity itself.
This shows that importance rankings in data sets that con-
tain complex correlations between predictor variables can
be severely affected by the underlying permutation
scheme: When the conditional permutation is used, the importance scores of correlated predictors are moderated such that the truly influential predictor variables have a higher chance of being detected.
Figure 4: Example: peptide-binding data. Marginal (top) and conditional (bottom) permutation importance of 104 predictors of peptide-binding; the variables "h2y8", "flex8" and "pol3" are highlighted, and "pol8" is marked by *.
5 Discussion and conclusion
We have investigated the sources of preferences in the var-
iable importance measures of random forests in favor of
correlated predictor variables and suggested a new, condi-
tional permutation scheme for the computation of the
variable importance measure. This new, conditional per-
mutation scheme uses the partition that is automatically
provided by the fitted model as a conditioning grid and
reflects the true impact of each predictor variable better
than the original, marginal approach. Even though the
conditional permutation cannot entirely eliminate the
preference for correlated predictor variables, it has been
shown to provide a more fair means of comparison that
can help identify the truly relevant predictor variables.
Our simulation results also illustrate the impact of the
choice of the random forest tuning parameter mtry: While
the default value mtry = √p is often found to be optimal
with respect to prediction accuracy in empirical studies
(see, e.g., [36]), our findings indicate that in the case of
correlated predictor variables different values of mtry
should be considered. However, it should also be noted
that any interpretation of random forest variable impor-
tance scores can only be sensible when the number of
trees is chosen sufficiently large such that the results pro-
duced with different random seeds do not vary systemati-
cally. Only then is it assured that the differences between,
e.g., unconditional and conditional importance are not
only due to random variation.
The conditional permutation importance will be freely
available in the next release of the party package for recur-
sive partitioning [30,31] in the R system for statistical
computing [32].
Authors' contributions
CS defined the research question, suggested the condi-
tional variable importance, set up and performed the sim-
ulation experiments and drafted the manuscript. A–LB
analyzed the peptide-binding data. TK, TA and AZ con-
tributed to the theoretical understanding and presenta-
tion of the problem. All authors contributed to and
approved the final version of the manuscript.
Acknowledgements
A–LB was supported by the Porticus Foundation in the context of the Inter-
national School for Technical Medicine and Clinical Bioinformatics.
The authors would like to thank Torsten Hothorn for providing essential
help with accessing and processing cforest objects.
References
1. Breiman L: Random Forests. Machine Learning 2001, 45:5-32.
2. Lunetta KL, Hayward LB, Segal J, Eerdewegh PV: Screening Large-
Scale Association Study Data: Exploiting Interactions Using
Random Forests. BMC Genetics 2004, 5:32.
3. Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP,
Eerdewegh PV: Identifying SNPs Predictive of Phenotype
Using Random Forests. Genetic Epidemiology 2005,
28(2):171-182.
4. Huang X, Pan W, Grindle S, Han X, Chen Y, Park SJ, Miller LW, Hall
J: A Comparative Study of Discriminating Human Heart Fail-
ure Etiology Using Gene Expression Profiles. BMC Bioinformat-
ics 2005, 6:205.
5. Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of Different
Biological Data and Computational Classification Methods
for Use in Protein Interaction Prediction. Proteins 2006,
63(3):490-500.
6. van der Laan M: Statistical Inference for Variable Importance. International Journal of Biostatistics 2006, 2:Article 2 [http://www.bepress.com/ijb/vol2/iss1/2/].
7. Segal MR, Cummings MP, Hubbard AE: Relating Amino Acid
Sequence to Phenotype: Analysis of Peptide-binding Data.
Biometrics 2001, 57(2):632-643.
8. Xia X, Li WH: What Amino Acid Properties Affect Protein
Evolution? Journal of Molecular Evolution 1998, 47(5):557-564.
9. Silber JH, Rosenbaum PR: A Spurious Correlation Between Hos-
pital Mortality and Complication Rates. The Importance of
Severity Adjustment. Journal of Urology 1998, 160:288-289.
10. Stigler SM: Correlation and Causation: A Comment. Perspec-
tives in Biology and Medicine 2005, 48:88-94. Supplement
11. Breiman L: Statistical Modeling: The Two Cultures. Statistical
Science 2001, 16(3):199-231.
12. Feraud R, Clerot F: A Methodology to Explain Neural Network
Classification. Neural Networks 2002, 15(2):237-246.
13. Strobl C, Boulesteix AL, Zeileis A, Hothorn T: Bias in Random
Forest Variable Importance Measures: Illustrations, Sources
and a Solution. BMC Bioinformatics 2007, 8:25.
14. Strobl C, Boulesteix AL, Augustin T: Unbiased Split Selection for
Classification Trees Based on the Gini Index. Computational
Statistics & Data Analysis 2007, 52:483-501.
15. Breiman L, Cutler A: Random Forests – Classification Manual (website accessed in 12/2007). [http://www.math.usu.edu/~adele/forests/].
16. Strobl C, Zeileis A: Danger: High Power! – Exploring the Sta-
tistical Properties of a Test for Random Forest Variable
Importance. Proceedings of the 18th International Conference on Com-
putational Statistics, Porto, Portugal 2008.
17. Diaz-Uriarte R, Alvarez de Andrés S: Gene Selection and Classi-
fication of Microarray Data Using Random Forest. BMC Bioin-
formatics 2006, 7:3.
18. Diaz-Uriarte R: GeneSrF and varSelRF: A Web-based Tool and
R Package for Gene Selection and Classification Using Ran-
dom Forest. BMC Bioinformatics 2007, 8:328.
19. Rodenburg W, Heidema AG, Boer JM, Bovee-Oudenhoven IM, Fes-
kens EJ, Mariman EC, Keijer J: A Framework to Identify Physio-
logical Responses in Microarray Based Gene Expression
Studies: Selection and Interpretation of Biologically Rele-
vant Genes. Physiological Genomics 2008, 33:78-90.
20. Archer KJ, Kimes RV: Empirical characterization of random
forest variable importance measures. Computational Statistics &
Data Analysis 2008, 52(4):2249-2260.
21. Nicodemus K, Shugart YY: Impact of Linkage Disequilibrium
and Effect Size on the Ability of Machine Learning Methods
to Detect Epistasis in Case-Control Studies. Abstract volume of
the Sixteenth Annual Meeting of the International Genetic Epidemiology
Society, North Yorkshire, UK 2007, 31(6):611.
22. Nason M, Emerson S, Leblanc M: CARTscans: A Tool for Visual-
izing Complex Models. Journal of Computational and Graphical Sta-
tistics 2004, 13(4):1-19.
23. Breiman L: Bagging Predictors. Machine Learning 1996,
24(2):123-140.
24. Breiman L: Arcing Classifiers. The Annals of Statistics 1998,
26(3):801-849.
25. Bauer E, Kohavi R: An Empirical Comparison of Voting Classi-
fication Algorithms: Bagging, Boosting, and Variants.
Machine Learning 1999, 36(1–2):105-139.
26. Dietterich TG: An Experimental Comparison of Three Meth-
ods for Constructing Ensembles of Decision Trees: Bagging,
Boosting, and Randomization. Machine Learning 2000,
40(2):139-157.
27. Bühlmann P, Yu B: Analyzing Bagging. The Annals of Statistics 2002,
30(4):927-961.
28. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and
Regression Trees New York: Chapman and Hall; 1984.
29. Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning
New York: Springer; 2001.
30. Hothorn T, Hornik K, Zeileis A: party: A Laboratory for Recursive Part(y)itioning. [http://CRAN.R-project.org/package=party]. R package version 0.9-96.
31. Hothorn T, Hornik K, Zeileis A: Unbiased Recursive Partition-
ing: A Conditional Inference Framework. Journal of Computa-
tional and Graphical Statistics 2006, 15(3):651-674.
32. R Development Core Team: R: A Language and Environment for Statistical Computing 2008 [http://www.R-project.org/]. R Foundation for Statistical Computing, Vienna, Austria.
33. van Os BJ, Meulman J: Globally Optimal Tree Models. In Abstract
Book of the 3rd World Conference on Computational Statistics & Data
Analysis of the International Association for Statistical Computing, Cyprus,
Greece Edited by: Azen S, Kontoghiorghes E, Lee JC. Matrix Compu-
tations and Statistics Group; 2005:79.
34. Good P: Permutation, Parametric, and Bootstrap Tests of Hypotheses 3rd
edition. New York: Springer Series in Statistics; 2005.
35. Lin Y, Jeon Y: Random Forests and Adaptive Nearest Neigh-
bors. Journal of the American Statistical Association 2006,
101(474):578-590.
36. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP:
Random Forest: A Classification and Regression Tool for
Compound Classification and QSAR Modeling. Journal of
Chemical Information and Computer Sciences 2003, 43(6):1947-1958.
... In addition to differences in the research question being asked (see above), the difficulty of selecting study sites and the relatively recent recognition of the multiple meanings of "complexity" likely contributes to this rarity. However, an added deterrent is the computational and statistical difficulties associated with using large numbers of explanatory variables relative to the number of observations ("large p -small n") in traditional linear or additive models, often with high colinearity (Graham, 2003, Strobl et al., 2008. This leads to high type II error rates and uncertainty of parameter estimates (Carrascal et al., 2009). ...
... The tree-like structure allows non-linear relationships and complex interactions among predictors to be modelled, without the need to express them explicitly. Random forests are ensembles of tree models, which, by introducing randomness into the tree-building process, often give better accuracy of predictions compared to single tree models (Strobl et al., 2008). For each tree, slightly different training data is obtained by sampling observations with replacement (a "bootstrap" sample). ...
... The increase in the OOB error rate after scrambling is the PVI for that variable, and simulates the effect of removing the variable from the forest. Although it should be used with caution (see Strobl et al., 2008), this PVI should make a suitable measure for comparison between different responses. ...
Thesis
Full-text available
In this thesis novel analytical approaches using random forests were used to explore temporal and inter-specific variation in the influence of landscape context on species of aphid and hymenopterous parasitoid in winter wheat fields in the UK, and to produce models predicting the abundance of aphids, parasitoids and syrphid larvae as functional groups. Estimates of the economic value of natural enemies to farmers were made, showing substantial potential to reduce yield loss and insecticide costs through beneficial habitat management.
... Notably, random forest models are robust to overfitting and to the presence of correlated regressors. In addition, even though bootstrap aggregation methods result in improved accuracy at the expense of interpretability, they enable the obtention of variable importance measures such as the original individual variable importance proposed by Breiman [37], group variable importance [38] and conditional variable importance [39], [40] and, thus, they have been successfully applied to quality assessment in resistance spot welding process [6], [16]. ...
... In [39], the authors pointed that the above-described variable importance measure showed a bias towards correlated predictors and developed an alternative measure: the conditional variable importance, in which the dependence between a predictor and the outcome is calculated conditionally upon the values of other predictors. In particular, for each tree, they propose to divide -completely bisect-the predictor space into a multidimensional grid in accordance with the partition induced by that tree, and it is within each such partition that the OOB values are conditionally permuted. ...
... As regards the variable importance analyses conducted, Fig. 3 shows on the top the results of the individual variable importance proposed by Breiman [37] -which have been obtained using the randomForestSRC R package [60]-and on the bottom the individual conditional importance of each variable according to Debeer and Strobl [40] -which were calculated with the R packages party [39], [61], [62] and permimp [40]-. Notably, Breiman's approach provides the importance of each variable within the model, as its random permutation of the values of the regressor variable is supposed to mimic the absence of such variable within the model; on its part, the conditional variable importance approach quantifies the contribution of each predictor conditioned to the presence of the rest of the regressors. ...
Article
Full-text available
Resistance spot welding (RSW) is a widespread manufacturing process in the automotive industry. There are different approaches for assessing the quality level of RSW joints. Multi-input-single-output methods, which take as inputs either the intrinsic parameters of the welding process or ultrasonic nondestructive testing variables, are commonly used. This work demonstrates that the combined use of both types of inputs can significantly improve the already competitive approach based exclusively on ultrasonic analyses. The use of stacking of tree ensemble models as classifiers dominates the classification results in terms of accuracy, F-measure and area under the receiver operating characteristic curve metrics. Through variable importance analyses, the results show that although the welding process parameters are less relevant than the ultrasonic testing variables, some of the former provide marginal information not fully captured by the latter.
... The CART method is commonly used to investigate complex datasets with numeric and/or categorical variables (predictor variables) that interact with each other nonlinearly (De'ath and Fabricius 2000). Both RF and BRT belong to the CART family, which has been implemented in different disciplines, such as species distributions (Shabani et al., 2017), groundwater mapping (Naghibi et al., 2016), water quality (Golden et al., 2016; Povak et al., 2014), aquatic ecosystems (Elith et al., 2008; Smucker et al., 2013; Tonkin et al., 2014), and environmental modeling (Giri et al., 2019; Strobl et al., 2008). ...
... In the end, an ensemble average of all individual trees is computed. The inclusion of several trees increases the probability of deriving an effective prediction model (Breiman, 2001; Strobl et al., 2008). The accuracy of the random forests algorithm relies mainly on the strength of the individual trees (Amit & Geman, 1997). ...
Article
Full-text available
Reliable water quality prediction can improve environmental flow monitoring and the sustainability of the stream ecosystem. In this study, we compared two machine learning methods to predict water quality parameters, namely total nitrogen (TN), total phosphorus (TP), and turbidity (TUR), for 97 watersheds located in the Southeast Atlantic region of the USA. The modeling framework incorporates multiple climate and watershed variables (characteristics) that often control the water quality indicators in different landscapes. Three techniques, namely stepwise regression (SR), the Least Absolute Shrinkage and Selection Operator (LASSO), and a genetic algorithm (GA), are implemented to identify appropriate predictors out of 28 climate and catchment-related variables. The selected predictors were then used to develop the Random Forest (RF) and Boosted Regression Tree (BRT) models for water quality predictions in selected watersheds. The results highlighted that while both algorithms provided reasonable results (based on statistical metrics), the RF algorithm was easier to train and robust to model overfitting. Partial dependence plots highlighted the complex and nonlinear relationships between the individual predictors and the water quality indicators. The thresholds obtained from partial dependence plots showed that the median values of total nitrogen (TN) and total phosphorus (TP) in streams increase significantly when the percentage of urban and agricultural lands is above 40% and 43% of the watershed area, respectively. Furthermore, when soil hydraulic conductivity increases, the reduction in runoff results in decreased turbidity levels in streams. Therefore, identifying the key watershed characteristics and their critical thresholds can help watershed managers create appropriate regulations for managing and sustaining healthy stream ecosystems. In addition, the forecasting models can improve water quality predictions in ungauged watersheds.
... It can be directly calculated through the 'importance' argument of the training function in R. Creating a partial dependence plot is an alternative way to explore and interpret the relationships between predictor variables and predicted classes [47]. The partial dependence could depict the relative logit contribution of the prediction variables to class probability [48]. The marginal effect for each predictor variable was plotted using the partial dependence plot (PDP) package in R. ...
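A hedged sketch of a partial dependence plot with the pdp R package mentioned in this excerpt; the model and variable names are illustrative only.

library(randomForest)
library(pdp)

set.seed(3)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.3)
d  <- data.frame(y, x1, x2)

rf <- randomForest(y ~ ., data = d, ntree = 300)

## partial dependence of the prediction on x1, averaged over the other predictors
pd <- partial(rf, pred.var = "x1", train = d)
plotPartial(pd)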
... In a hyper-fragmented landscape, there is a significant change in the size structure of tree assemblages and a reduction in the number of large trees [2]. The loss of large trees can lead to increased herbaceous/shrub vegetation and fine fuels, which is related to increased active fire detection [48,56]. ...
Article
Full-text available
The fire susceptibility of the Atlantic Forest has largely increased over the past two decades due to a combination of climate change and anthropogenic factors such as land cover change and human modification. High rates of forest fragmentation have contributed to escalating fires in this imperilled global biodiversity hotspot. Understanding fire patterns is essential to developing an effective forest fire management strategy. In this research, we utilized the Random Forest (RF) machine learning approach for identifying the role of climatic and anthropogenic factors in influencing fire occurrence probability and mapping the spatial distribution of fire risk. We found that the Normalized Difference Vegetation Index value and climate variables (i.e., temperature and solar radiation) were significant drivers of fire occurrence risk. Results also confirm that forest fragmentation increases with fire density in the region.
... Therefore, to gain insight into which driver variables are mainly responsible for the predictions of the model, so-called variable importance measures are typically used. Here we also follow that approach and employ the permutation variable importance (see Strobl et al., 2008). In short, the idea is to break up the relationship between the response variable and one driver variable by permuting the latter and then assessing how much the predictive performance deteriorates. ...
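A model-agnostic sketch of the permutation idea restated in this excerpt: permute one driver variable and measure how much the predictive performance deteriorates. For simplicity the sketch uses a hold-out split rather than the OOB samples; data and names are illustrative only.

library(randomForest)

set.seed(4)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(n)

train <- dat[1:300, ]
test  <- dat[301:400, ]

fit  <- randomForest(y ~ ., data = train, ntree = 300)
base <- mean((test$y - predict(fit, test))^2)    # baseline test MSE

## increase in test MSE after breaking the association for each driver variable
perm_imp <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled      <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  mean((test$y - predict(fit, shuffled))^2) - base
})
perm_imp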
Preprint
Full-text available
Despite its scarcity, upward lightning initiated from tall structures causes more damage than common downward lightning. One particular subtype, carrying a continuous current only, is not detectable by conventional lightning location systems (LLS), which significantly reduces detection efficiency. Upward lightning has become a major concern due to the recent push in the field of renewable wind energy generation. The growing number of tall wind turbines has increased lightning-related damage. Upward lightning may be initiated by the tall structure triggering the flash itself (self-triggered) or by a flash striking close by (other-triggered). The major objective of this study is to find the driving atmospheric conditions influencing whether an upward flash is self-triggered or other-triggered and whether it is of the undetectable subtype. We explore upward flashes directly measured at the Gaisberg Tower in Salzburg (Austria) between 2000 and 2015. These upward flashes are combined with atmospheric reanalysis data stratified into five main meteorological groups: cloud physics, mass field, moisture field, surface exchange and wind field. We use classification methods based on tree-structured ensembles in the form of conditional random forests. From these random forests we assess the meteorological influence and identify the most important atmospheric drivers for each type of event.
... One approach is permutation-based, using feature importance (Breiman, 2001). Strobl et al. (2008) proposed a permutation algorithm to compute conditional feature importance, but they did not provide general guidance for feature selection. Kursa and Rudnicki (2010) used a new set of shadow features to debias the feature importance, and a feature selection procedure was also proposed based on a "Z score"; also see Sandri and Zuccolotto (2008) for a similar approach. ...
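A hedged sketch of the shadow-feature approach of Kursa and Rudnicki (2010) mentioned in this excerpt, as implemented in their Boruta R package; the toy data are illustrative only.

library(Boruta)

set.seed(5)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- dat$x1 + rnorm(n)

## each predictor is compared against permuted "shadow" copies of the features
sel <- Boruta(y ~ ., data = dat)
print(sel)                  # decisions: Confirmed / Tentative / Rejected
getSelectedAttributes(sel)  # names of the confirmed predictors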
Preprint
Full-text available
Random forests are a widely used machine learning algorithm, but their computational efficiency is undermined when they are applied to large-scale datasets with numerous instances and useless features. Herein, we propose a nonparametric feature selection algorithm that incorporates random forests and deep neural networks; its theoretical properties are also investigated under regularity conditions. Using different synthetic models and a real-world example, we demonstrate the advantage of the proposed algorithm over other alternatives in terms of identifying useful features, avoiding useless ones, and computational efficiency. Although the algorithm is proposed using standard random forests, it can be widely adapted to other machine learning algorithms, as long as features can be sorted accordingly.
Article
N-is/ist constructions are elements in the left periphery of English/German sentences that have developed pragmatic meaning: they can be used as discourse markers with various functions, depending on the nominal element that is used in the construction. We use evidence from parallel and comparable corpora of English and German to investigate variable article use in these focaliser constructions and model factors that may play a role in article omission/retention (such as modification, choice of head noun, degree of syntactic integration of the focaliser). Our evidence shows that article use largely depends on the lexical head in German but is constrained by different factors in English (notably modification). We interpret our results against the backdrop of construction grammar, arguing that article omission plays a different role in the two languages. From a contrastive point of view, formal syntactic separation in English is easier to achieve than in German and thus facilitates use of English N-is constructions as focalisers.
Article
Detailed maps of soil properties are essential for soil protection planning and management; however, creating very high-resolution maps at the national level with sufficient accuracy is a challenging task and remains unavailable for most countries. For the Czech Republic, very high resolution (20 m·pixel⁻¹) soil property maps (soil organic carbon—SOC, texture, pH, bulk density, soil depth) were created using digital soil mapping methods, combined with a wide database of soil legacy and current samples. The latest approaches were employed for predictive mapping: a quantile random forest model with the determination of prediction intervals, a mosaic of bare soils from Sentinel-2 satellite data, a Gaussian pyramid of terrain attributes, and a buffer distance map. These variables were found to be among the most important in the resulting models. The properties were mapped with an RMSE accuracy of 0.43% SOC, 5.56–11.14% for texture fractions, 0.70 pH, 0.13 g·cm⁻³ bulk density, and 20.03 cm for soil depth, thus providing detailed data on soil cover. Greater levels of inaccuracy were found in areas with extreme values, for which further investigation is necessary either through more detailed sampling based on active learning, or adapted methods for enhanced predictive ability.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective is bagging. Here, modified training sets are formed by resampling from the original training set, classifiers constructed using these training sets and then combined by voting. Y. Freund and R. Schapire [in L. Saitta (ed.), Machine Learning: Proc. Thirteenth Int. Conf. 148-156 (1996); see also Ann. Stat. 26, No. 5, 1651-1686 (1998; Zbl 0929.62069)] propose an algorithm the basis of which is to adaptively resample and combine (hence the acronym “arcing”) so that the weights in the resampling are increased for those cases most often misclassified and the combining is done by weighted voting. Arcing is more successful than bagging in test set error reduction. We explore two arcing algorithms, compare them to each other and to bagging, and try to understand how arcing works. We introduce the definitions of bias and variance for a classifier as components of the test set error. Unstable classifiers can have low bias on a large range of data sets. Their problem is high variance. Combining multiple versions either through bagging or arcing reduces variance significantly.
Article
The party package (Hothorn, Hornik, and Zeileis 2006) aims at providing a recursive part(y)itioning laboratory assembling various high- and low-level tools for building tree-based regression and classification models. This includes conditional inference trees (ctree), conditional inference forests (cforest) and parametric model trees (mob). At the core of the package is ctree, an implementation of conditional inference trees which embed tree-structured regression models into a well defined theory of conditional inference procedures. This nonparametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates. This vignette comprises a practical guide to exploiting the flexible and extensible computational tools in party for fitting and visualizing conditional inference trees.
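A small sketch of fitting and plotting a conditional inference tree with ctree() from party, along the lines of the vignette summarized above; it uses R's built-in airquality data purely for illustration.

library(party)

aq <- subset(airquality, !is.na(Ozone))  # drop rows with a missing response
ct <- ctree(Ozone ~ ., data = aq)

print(ct)  # splits selected via conditional inference tests
plot(ct)   # tree structure with node-wise response distributions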