BIAS-CORRECTED QUANTILE REGRESSION FORESTS FOR HIGH-DIMENSIONAL DATA

NGUYEN THANH TUNG (1,4), JOSHUA ZHEXUE HUANG (1,2), THUY THI NGUYEN (3), IMRAN KHAN (1)

1. Shenzhen Key Laboratory of High Performance Data Mining, SIAT, CAS, Shenzhen 518055, China.
2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.
3. Hanoi University of Agriculture, Vietnam.
4. Water Resources University, Vietnam.

E-MAIL: tungnt@wru.vn, {tungnt, zx.huang, imran.khan}@siat.ac.cn, ntthuy@hua.edu.vn
Abstract:
Quantile Regression Forest (QRF), a nonparametric regression method based on random forests, has been shown to perform well in terms of prediction accuracy, especially when the conditional distributions are not Gaussian. However, the method may suffer from two kinds of bias when solving regression problems: the first arises in the feature selection stage and the second in solving the regression problem itself. In this paper, we propose a new bias-correction algorithm based on QRF. To correct the first kind of bias, we propose a new feature sampling scheme that selects good features for growing trees; the first-level QRF is built on this basis. For the second kind of bias, the residual term of the first-level QRF model is used as the response feature to train a second-level QRF model, which is then used to compute bias-corrected predictions. In our experiments, the proposed algorithm dramatically reduced prediction errors and outperformed most existing regression random forests models on a synthetic data set as well as several well-known real-world data sets.
Keywords:
Bias Correction; Quantile Regression Forests; High-Dimensional Data; Random Forests; Data Mining
1. Introduction
Random forest (RF) [1] is a widely used non-parametric method for classification and regression problems. An RF builds trees from the bagged samples and the bagged features of the training data $L = \{(X_i, Y_i)\}_{i=1}^{N}$ with $X_i \in \mathbb{R}^M$ and $Y_i \in \mathbb{R}$, where $N$ is the number of training samples and $M$ is the number of features. Given an input $X = x$, a regression RF is used as a function $f : \mathbb{R}^M \rightarrow \mathbb{R}$ to estimate the unknown value $y$ of the input $x \in \mathbb{R}^M$, denoted as $\hat{f}(x)$. We write the regression RF in the common form $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma^2_{\varepsilon}$. The function $f(\cdot)$ is estimated from $L$ and the prediction $\hat{f}(x)$ is obtained for an independent test case $x$.
For point regression, each tree $T_k$ in the RF yields a prediction $\hat{f}_k(x)$, and these predictions are averaged across all $K$ trees to get the final RF prediction $\hat{f}(x) = \frac{1}{K}\sum_{k=1}^{K}\hat{f}_k(x)$. This is the estimation of $f(x) = E(Y \mid X = x)$. We consider the prediction's mean-squared error (MSE) to measure the effectiveness of $\hat{f}$, defined as [6]

$$MSE[\hat{f}(x)] = E[(Y - \hat{f}(x))^2 \mid X = x] = E[Y - f(x)]^2 + [E\hat{f}(x) - f(x)]^2 + E[E\hat{f}(x) - \hat{f}(x)]^2 \quad (1)$$
The first term of Eq. (1) is the variance $\sigma^2_{\varepsilon}$ of the target around its true mean $f(x)$. This cannot be avoided no matter how well we estimate $f(x)$, unless $\sigma^2_{\varepsilon} = 0$. The second term is the squared bias $Bias^2[\hat{f}(x)]$, and the last term is the variance $Var[\hat{f}(x)]$.
In regression, traditional RFs predict the value at each leaf node as the mean of the $Y$ values of the samples in that leaf node. This causes bias because extreme values in the samples are underestimated or overestimated. Prediction accuracy is improved when the median is used to obtain predicted values, since the median is more robust towards extreme values than the mean. Therefore, in this framework, the median is used in the QRF model for point regression. QRF [2] grows unpruned trees in the same way as RF [1]. However, at each leaf node, QRF estimates the conditional distribution $F(y \mid X = x) = P(Y \leq y \mid X = x)$ instead of only the mean of the $Y$ values as RF does. Given a probability $\alpha$, we can estimate the quantile $Q_{\alpha}(X)$ as

$$\hat{Q}_{\alpha}(X = x_{new}) = \inf\{y : \hat{F}(y \mid X = x_{new}) \geq \alpha\}.$$

Figure 1 Bias in point and range prediction by the QRF model. A large number of points scatter from the solid line.

For range prediction, we have

$$[Q_{\alpha_l}(X), Q_{\alpha_h}(X)] = [\inf\{y : \hat{F}(y \mid X = x) \geq \alpha_l\}, \; \inf\{y : \hat{F}(y \mid X = x) \geq \alpha_h\}] \quad (2)$$
where $\alpha_l < \alpha_h$ and $(\alpha_h - \alpha_l) = \tau$. Here, $\tau$ is the probability that the prediction of $Y$ will fall within the range $[Q_{\alpha_l}(X), Q_{\alpha_h}(X)]$. For point regression, the prediction can be chosen as a value in this range, such as the mean or the median of the $Y$ values. Moreover, QRF works well in situations where the conditional distribution is not Gaussian. A detailed description can be found in [2, 3].
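To make the estimator concrete, the following is a minimal Python sketch of estimating conditional quantiles from the leaf samples of a fitted forest, in the spirit of QRF. It simply pools the training responses that share a leaf with the query point in each tree (the weighting scheme of [2] is more refined) and uses scikit-learn rather than the R packages used in our experiments; the helper name forest_quantiles and its arguments are illustrative.

```python
# Sketch: QRF-style conditional quantiles from a fitted random forest.
# Illustration only; the experiments in this paper use the R packages in [12-14].
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_quantiles(rf, X_train, y_train, X_new, alphas=(0.05, 0.5, 0.95)):
    """Pool the training Y values that share a leaf with each new point,
    tree by tree, and read off empirical quantiles of the pooled sample."""
    train_leaves = rf.apply(X_train)   # (n_train, n_trees) leaf ids
    new_leaves = rf.apply(X_new)       # (n_new, n_trees) leaf ids
    preds = np.empty((X_new.shape[0], len(alphas)))
    for i, leaves in enumerate(new_leaves):
        # collect Y values of co-located training samples across all trees
        mask = (train_leaves == leaves)
        pooled = np.concatenate([y_train[mask[:, t]] for t in range(mask.shape[1])])
        preds[i] = np.quantile(pooled, alphas)
    return preds

# usage sketch:
# rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5).fit(X_train, y_train)
# q05, q50, q95 = forest_quantiles(rf, X_train, y_train, X_new).T
```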
As stated above, QRF has the same bias as RF in point prediction accuracy even though it takes the median instead of the mean to predict $\hat{f}$. To illustrate this kind of bias, we generated 200 samples for a training data set and 1000 samples for a testing data set using the following model:

$$Y = 10\sin(\pi X_1 X_2) + 20(X_3 - 0.5)^2 + 10X_4 + 5X_5 + \epsilon \quad (3)$$

where $X_1, X_2, X_3, X_4, X_5$ and $\epsilon$ are drawn from $U(0, 1)$.
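For concreteness, data from the model of Eq. (3) can be generated as in the following sketch; the helper name and the noisy-feature option are illustrative, and the extra noisy features are only needed for the higher-dimensional versions described in Section 4.1.

```python
# Sketch: generate data from Eq. (3), with all X_j and the noise term
# drawn from U(0, 1) as stated above.
import numpy as np

def make_synthetic(n_samples, n_noise=0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 5 + n_noise))
    eps = rng.uniform(0.0, 1.0, size=n_samples)
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + eps)
    return X, y

# X_train, y_train = make_synthetic(200)    # LM5-style training set
# X_test, y_test = make_synthetic(1000)     # test set
```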
We ran QRF under the default settings in R [14]. Figure 1 shows the predicted median values against the true values for point and range prediction. These results are based on the QRF predictions for the 200 simulated samples. The bias in the point estimates becomes larger as the observed values increase or decrease. The solid line connects the points where the predicted values and the true values are equal. The range prediction gives a range that will cover a new observation of $Y$ with high probability. Range predictions are shown as transparent grey bars, with vertical black lines at the bottom and top. The green dots are the observations within the range prediction, while the red ones are those outside it.
On the other hand, both RFs and QRFs have bias in the feature selection process [4]. The main cause is that in the process of growing a tree from the bagged sample data, the tree may select a less important feature as the best split among the candidate features. It tends to favor uninformative features containing more values (i.e., fewer missing values, or many categorical or distinct values) [4, 5].

Breiman [1] introduced bagging in RF as a way to reduce the prediction variance and increase prediction accuracy, but the bias remained. Recently, Zhang et al. [7] proposed a simple non-iterative version using the original RF to correct the bias in regression problems. Their method compares favorably with other bias-correction approaches for point prediction. However, their approach can only be applied to point regression. Moreover, the mean values were used in predictions, which, as mentioned before, can suffer from extreme values in the data. Besides, the techniques were tested only on small low-dimensional data sets with at most 13 features.
In this paper, we propose a new bias-corrected algorithm with two kinds of bias correction based on QRF, namely bcQRF, where the first corrects the bias in the importance measure for feature selection and the second corrects the bias in the regression process. In our approach, the bcQRF algorithm based on the QRF model is used to correct the bias in regression problems instead of the adaptive bagging proposed by Breiman [6]. The predicted values and the residual term are obtained from the first-level QRF model. The original data set is then extended with the residuals as the response feature, while the original response feature is treated as a predictor feature. The second-level QRF model is used to estimate the bias values for the extended data set. The bias-corrected values are computed from the difference between the values predicted by the first-level QRF model and the bias values predicted by the second-level QRF model.

In our proposed method, both point regression and range prediction bias are corrected using the QRF algorithm. Furthermore, for the bias correction of the importance measures in feature selection, we generate artificial features and add them to the original data set; we then apply the feature permutation method [3] to this extended data to calculate the importance of features, producing the feature importance scores. Our experimental results show that the proposed algorithm with these bias-correction techniques dramatically reduces the prediction errors and outperforms most existing regression random forests on synthetic as well as well-known real-world data sets, especially on high-dimensional data.
2. Bias-Correction of Feature Importance Measures

A bias correction for the importance measure in the feature selection process is intended to avoid uninformative features being selected as the best split when trees are grown in the RF model. To do so, we add to the original data set an artificial feature containing the same values, possible splits and distribution as a predictor feature, but having no association with the $Y$ values. Such an artificial feature participates only in the competition for the best split and decreases the probability of selecting this kind of feature as a splitting node.
This bias correction for the importance measure starts with a brief recall of the feature permutation method recently described by Tung et al. [3]. Given a training data set $L$, let $X = \{L \setminus Y\}$ be a matrix containing $N$ samples and $M$ features. A set of matrices $\{A_i\}_{i=1}^{N}$ is generated by randomly permuting $N$ times the $N$ rows of $X$. We say that the columns of $A$ are artificial features; each artificial feature has the same distribution, values and number of missing values as the corresponding predictor feature.
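As an illustration, one replicate of the extended data set with artificial features can be built with a single row permutation of $X$, as in the sketch below; the helper is hypothetical and not part of our implementation.

```python
# Sketch: build one replicate of the extended data set [X, A], where A is X
# with its rows randomly permuted so each artificial column keeps the values
# and distribution of its counterpart but loses any association with Y.
import numpy as np

def add_artificial_features(X, rng=np.random.default_rng(0)):
    A = X[rng.permutation(X.shape[0]), :]   # permute the rows of X
    return np.hstack([X, A])                # 2M-dimensional extended data
```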
We build a quantile regression forests model QRF from this extended data set of $2M$ dimensions including predictor and artificial features. Following the importance measure produced by a permutation framework [3, 8, 9], we use QRF to compute $2M$ importance scores for the $2M$ features. We repeat the same process $R$ times to compute $R$ replicates. In each region into which the regression tree partitions the data set, each artificial feature $A_i$ shares approximately the same properties as the corresponding $X_i$, but it is independent of $Y$, and consequently has approximately the same probability of being selected as a splitting candidate. This is crucial for correcting the bias, as it avoids uninformative features when trees are grown in the forest.
We adjust the importance measure $\widehat{VI}_j$ $(j = 1, \ldots, M)$ by the average of all $R$ replicates of the raw importance scores $VI^{r}_{X_j}$, that is, $\widehat{VI}_j = R^{-1}\sum_{r=1}^{R} VI^{r}_{X_j}$. According to the values of $\widehat{VI}_j$, we can normalize them and define a variable $\theta_j$ that maps the $\widehat{VI}_j$ value into $[0, 1]$ using min-max normalization as follows:

$$\theta_j = \frac{\widehat{VI}_j - \min(\widehat{VI})}{\max(\widehat{VI}) - \min(\widehat{VI})}. \quad (4)$$
The corrected importance scores $\theta_j$ can be used as an approximation of the bias affecting the importance of $X_j$. Hence, the permutation method can reduce the bias due to different measurement levels of $X_j$ and can yield a better ranking of features according to their importance. The weights $\{\theta_1, \theta_2, \ldots, \theta_M\}$ define a probability distribution on the features in the training data set $L$, which can be used for building trees in the forest.
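The computation of the weights from the replicated importance scores is summarized in the sketch below. The final division by the sum, which turns the $\theta_j$ of Eq. (4) into a proper sampling distribution, is an assumption made for the weighted sampling used in Section 3.1; the function name is illustrative.

```python
# Sketch: average R replicates of raw importance scores and apply the
# min-max normalization of Eq. (4) to obtain feature weights theta.
import numpy as np

def feature_weights(vi_replicates):
    """vi_replicates: array of shape (R, M) with raw importance scores of
    the M original (non-artificial) features in each of the R runs."""
    vi = np.asarray(vi_replicates).mean(axis=0)        # corrected VI_j
    theta = (vi - vi.min()) / (vi.max() - vi.min())    # Eq. (4)
    return theta / theta.sum()   # normalized so it can be used as sampling probabilities

# usage sketch: theta = feature_weights(np.vstack(scores_from_R_runs))
```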
3. Our Proposed Bias-correction Algorithm
3.1. The first level Quantile Regression Forests
We have the weights $\{\theta_j\}_{j=1}^{M}$ given by Eq. (4). This weighting method is used to select bagged features instead of the simple sampling of Breiman's original method [1]. At a node, we randomly select mtry $(1 < mtry \leq M)$ candidate features using the probability distribution given by the weights $\{\theta_j\}$. We then aggregate them into the subspace and use them as candidates for splitting the node when growing trees. The first-level quantile regression forests algorithm is summarized as follows.
1) For each feature $X_j$, take the $R$ importance scores and compute the bias-corrected importance measure $\widehat{VI}_j$.

2) Compute the feature weights $\{\theta_1, \ldots, \theta_M\}$ according to Eq. (4).

3) Draw $K$ bootstrap samples of the original data set.

4) Grow a regression tree $T_k$ for each bootstrap sample as follows:

a) At each node, randomly select a subspace of $mtry = \log_2(M) + 1$ features using the probabilities $\{\theta_1, \theta_2, \ldots, \theta_M\}$, and use the subspace features as candidates for splitting the node.

b) Grow the unpruned tree until the minimum node size $n_{min}$ is reached. At each leaf node, all $Y$ values of the samples in the leaf node are kept.

c) Compute the observation weights of each $X_i$ by the individual trees and the forest with out-of-bag samples.

5) Given a probability $\tau$, and $\alpha_l$ and $\alpha_h$ with $\alpha_h - \alpha_l = \tau$, compute the corresponding quantiles $Q_{\alpha_l}$ and $Q_{\alpha_h}$ with Eq. (2) (we set the default values $\alpha_l = 0.05$, $\alpha_h = 0.95$ and $\tau = 0.9$).

6) Given $X = x_{new}$, estimate the prediction value as a value in the quantile range between $Q_{\alpha_l}$ and $Q_{\alpha_h}$, such as the median.
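The following Python sketch illustrates the weighted subspace idea of step 4(a). As simplifications, it draws one feature subspace per tree rather than per node, returns the median of per-tree predictions as the point prediction, and assumes the weights theta sum to one; it is an approximation for illustration, not our R/C++ implementation.

```python
# Sketch: grow K trees, each on a bootstrap sample and a feature subspace of
# size mtry drawn without replacement with probabilities theta (cf. step 4a).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def grow_weighted_forest(X, y, theta, K=500, nodesize=5, seed=0):
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    mtry = int(np.log2(M)) + 1
    forest = []
    for _ in range(K):
        boot = rng.integers(0, X.shape[0], size=X.shape[0])        # bootstrap sample
        feats = rng.choice(M, size=mtry, replace=False, p=theta)   # weighted subspace
        tree = DecisionTreeRegressor(min_samples_leaf=nodesize)
        tree.fit(X[boot][:, feats], y[boot])
        forest.append((tree, feats))
    return forest

def forest_predict_median(forest, X_new):
    # median of the per-tree predictions, used here as the point prediction
    per_tree = np.array([t.predict(X_new[:, f]) for t, f in forest])
    return np.median(per_tree, axis=0)
```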
3.2. Our Proposed Bias-correction Algorithm
In our approach, the QRF model is used to correct the bias in regression problems. The prediction errors are obtained from the first-level QRF model on the original data set and used to train the second-level QRF model. The final bias-corrected values are calculated from the difference between the results predicted by the first-level QRF model and the second-level QRF model. The proposed bias-correction bcQRF algorithm for range prediction is summarized as follows.
Step 1: Grow the first-level quantile regression forests QRF model from the training data $L = (X, Y)$.

Step 2: Obtain the predicted values $\hat{Q}_{\alpha}(X = x_{OOB})$ on the out-of-bag data. The estimation of the bias is then calculated as the median predicted values minus the true values, defined as $\tilde{E} = \hat{Q}_{0.5}(X = x_{OOB}) - Y$.

Step 3: Given $X = x_{new}$, use the first-level QRF to predict the testing data set; we get $\hat{Q}_{\alpha}(X = x_{new})$ and the range $[Q_{\alpha_l}(X = x_{new}), Q_{\alpha_h}(X = x_{new})]$.

Step 4: Replace $Y$ in $L$ with the vector $\tilde{E}$ to obtain an extended data set $L_e = \{L, \tilde{E}\}$. Grow the second-level quantile regression forests QRF model using $L_e$ (with the response feature $\tilde{E}$). Use the second-level QRF to predict the testing data to get $\tilde{E}_{new}$.

Step 5: The bias-corrected values are computed as $\hat{Q}_{new} = [Q_{\alpha_l}(X = x_{new}), Q_{\alpha_h}(X = x_{new})] - \tilde{E}_{new}$. For point prediction, the predicted value is $\hat{Q}_{0.5}$.
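A compact sketch of Steps 1-5 is given below, with ordinary random forests standing in for the two QRF models, the out-of-bag mean prediction replacing the out-of-bag median of Step 2, and the first-level prediction substituted for the unknown response of the test cases in the second-level model; the last point is our reading of the description in the Introduction (where $Y$ becomes a predictor feature), and the helper name is illustrative.

```python
# Sketch of the two-level bias correction (Steps 1-5), with ordinary random
# forests used as stand-ins for the two QRF models.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bias_corrected_predict(X_train, y_train, X_new, n_trees=500, seed=0):
    # Step 1: first-level model on the original data
    rf1 = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                                random_state=seed).fit(X_train, y_train)
    # Step 2: estimated bias E~ = out-of-bag prediction minus the true response
    bias = rf1.oob_prediction_ - y_train
    # Step 3: first-level predictions for the new data
    pred1 = rf1.predict(X_new)
    # Step 4: second-level model trained on the extended data {L, E~};
    # the original response Y enters as a predictor feature
    X_ext_train = np.column_stack([X_train, y_train])
    rf2 = RandomForestRegressor(n_estimators=n_trees,
                                random_state=seed).fit(X_ext_train, bias)
    bias_new = rf2.predict(np.column_stack([X_new, pred1]))
    # Step 5: subtract the predicted bias from the first-level prediction
    return pred1 - bias_new
```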
4. Experiments
4.1. Data sets
Synthetic data sets: We use the model of Eq. (3) to generate two pairs of high-dimensional synthetic data sets. We first created 200 samples with 5 predictor features plus a response feature, named LM5, where M5 indicates the dimensionality of the data set. After that, we expanded the data set with 195 and 495 noisy features to obtain LM200 and LM500, respectively. Similarly, we generated extra data sets with 1000 samples as test data sets, HM200 and HM500.

Real-world data sets: Fourteen real-world data sets, listed in Table 1, are used to evaluate the performance of the regression forests algorithms. The upper part of Table 1 lists 11 small data sets that were used in [10]. The lower part presents the characteristics of the high-dimensional data sets. The computed tomography (CT) data set was taken from UCI (http://archive.ics.uci.edu/). The TFIDF-2006 data set is used in [11]. The Rivers data set (http://www.usgs.gov) was used to predict the flow level of a river. It is based on a data set containing river discharge levels of 1,439 Californian rivers over a period of 12,054 days. This data set contains 48.6% missing values; an imputation function in [12] was used to fill in the missing data. The
Table 1 Characteristics of the data sets.
Data set name #Samples #Features
1 Servo 167 4
2 Childhood 654 4
3 Computer Hardware 209 6
4 Auto MPG 392 7
5 Concrete Slump Test 103 7
6 Concrete Comp. Strength 1,030 8
7 Pulse Rates 110 10
8 Horse Racing 102 12
9 Boston housing 506 13
10 Breast Cancer Wisc. 194 30
11 Communities and Crime 1,994 102
12 Computed tomography 53,500 385
13 Rivers 12,054 1,440
14 TFIDF-2006 19,395 150,360
level of the 1,440-th river was predicted in our experiments; the target values were converted from [0.062; 101,000] to [0; 1].
4.2. Experimental Settings
We used the latest R packages for regression random forests (RF), unbiased conditional random forests (cRF) and quantile regression forests (QRF) [12, 13, 14], together with our bias-corrected bcQRF algorithm, to build regression models from the training data sets. The MSE measure of Eq. (1) is used to evaluate the models when applied to the test data sets. The bias-correction algorithm is a new implementation. To conduct our experiments, we used the R environment to call the corresponding C/C++ functions on a 64-bit machine with an Intel(R) Xeon(R) CPU E5620 2.40 GHz, 4 MB cache, and 32 GB main memory.

For the first 11 real-world data sets, we used 10-fold cross-validation to evaluate the prediction performance of the regression random forests algorithms. In each fold, we built 30 regression models, each with 500 trees, and tested the 30 models on the corresponding test data sets.

For the three large data sets CT, Rivers and TFIDF-2006, we experimented with only one model with 500 trees for each random forests method. For the CT data set, two-thirds was used for training and one-third for testing. For the Rivers, TFIDF-2006 and synthetic data sets, the given training data was used to learn the models and the test data was used to evaluate the models.
(a) Range prediction by QRF (b) Range prediction by bcQRF (c) Point prediction by QRF (d) Point prediction by bcQRF
Figure 2 Comparison of the QRF and bcQRF models on synthetic data sets HM200 and HM500.

The number of features considered when splitting a node is $M/3$ for the low-dimensional data sets and $\sqrt{M}$ for the high-dimensional data sets, respectively. The parameter nodesize in a regression forests model is 5. The performance of each algorithm was measured by the average MSE.
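The evaluation protocol can be summarized by the following sketch, in which scikit-learn's max_features and min_samples_leaf stand in for mtry and nodesize; the function and its arguments are illustrative only.

```python
# Sketch: evaluate a regression forest by the average MSE over repeated runs,
# mirroring the protocol above (30 models of 500 trees per data set or fold).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def average_mse(X_train, y_train, X_test, y_test, high_dim=False, repeats=30):
    mtry = "sqrt" if high_dim else 1.0 / 3.0   # sqrt(M) vs M/3 features per split
    errors = []
    for r in range(repeats):
        rf = RandomForestRegressor(n_estimators=500, max_features=mtry,
                                   min_samples_leaf=5, random_state=r)
        rf.fit(X_train, y_train)
        errors.append(mean_squared_error(y_test, rf.predict(X_test)))
    return float(np.mean(errors))
```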
4.3. Results and discussion
Figures 2(a) and 2(b) show the range prediction results on the data set HM200, and Figures 2(c) and 2(d) present the point prediction results on the data set HM500, by QRF without bias correction in Figures 2(a) and 2(c) and by bcQRF with bias correction in Figures 2(b) and 2(d). The biased range predictions can be clearly observed in Figure 2(a); they were corrected by bcQRF as shown in Figure 2(b). We can also see that the point predictions with bias correction by bcQRF, shown in Figure 2(d), were clearly improved in comparison with the predictions by QRF in Figure 2(c).

Figure 3 shows the 90% range prediction results of the large high-dimensional data set TFIDF-2006 by QRF and bcQRF. Looking at the prediction errors on this data set, a clear bias in point prediction can be observed in the results of QRF, as shown in Figure 3(a). The bias was corrected in the results of bcQRF, as shown in Figure 3(b). The effect of the bias correction is clearly demonstrated.
The upper part of Table 2 presents the average MSE estimated from the 30 repetitions on the small data sets. To compare our method with the others, we collected the best results from Roy et al. [10] using the same criteria; these results are shown in the 5th column. We can see that the results of bcQRF are better than those of the other methods. On the data sets where bcQRF did not obtain the best results, the differences from the best results were minor. These results indicate that bcQRF outperformed the state-of-the-art QRF in prediction.

(a) Range predictions by QRF (b) Range predictions by bcQRF
Figure 3 Comparison of range predictions by QRF and bcQRF on the large data set TFIDF-2006.
The comparison of the four regression random forests RF, cRF, QRF and bcQRF on the high-dimensional data sets, evaluated by MSE, is listed in the lower part of Table 2. Since cRF could not be run on the large data set TFIDF-2006, we excluded it from that experiment. It can be seen that bcQRF significantly reduced the MSE errors; these results demonstrate that the proposed algorithm bcQRF with its two kinds of bias correction performs better than the other algorithms on regression problems, especially on high-dimensional data.
Table 2 Comparison of algorithms on real-world data sets; smaller values are better. The best results in the 5th column (RF_LAD) are from [10].
Data set name RF cRF QRF RF_LAD bcQRF
Servo 0.391 0.745 0.551 0.524 0.420
Childhood 0.188 0.173 0.175 0.190 0.171
Comp. Hardware 3470 9324 5500 3500 3370
Auto MPG 7.15 8.64 7.96 7.41 7.43
Concrete Slump 2357 3218 2747 2085 1988
Conc. Com. Str. 22.8 43.5 25.2 25.5 20.3
Pulse Rates 224 337 217 255 183
Horse Racing 4.04 6.93 4.32 3.48 3.18
Boston housing 10.3 16.0 9.1 10.3 8.0
Breast Can. Wisc. 1145 1089 1266 1100 1369
Comm. and Crime 0.019 0.019 0.023 0.018 0.019
CT 1.785 6.264 1.280 - 0.261
Rivers (×10^2) 0.059 0.093 0.068 - 0.026
TFIDF-2006 0.337 - 0.351 - 0.118

5. Conclusions

We have presented a new bias-correction algorithm, bcQRF, with two kinds of bias correction based on the QRF model, which aims to correct the bias in regression problems. The first main contribution is the correction of the bias of importance measures for the feature selection process. This bias correction avoids selecting uninformative features as the best split when trees are grown. The second contribution is the correction of the bias in point and range prediction using the QRF model. Both correction techniques increase the prediction accuracy of regression random forests models. We have presented a series of experimental results on both synthetic and real-world data sets to demonstrate the capability of the bcQRF models in bias correction and the advantages of bcQRF over other commonly used regression random forests algorithms.
Acknowledgment
This research is supported in part by NSFC under Grant No. 61203294, the Shenzhen New Industry Development Fund under Grants No. JC201005270342A and No. JCYJ20120617120716224, the National High-tech Research and Development Program (No. 2012AA040912), and the Guangdong-CAS project (No. 2012B091100221).
References
[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.

[2] N. Meinshausen, "Quantile regression forests," The Journal of Machine Learning Research, vol. 7, pp. 983-999, 2006.

[3] N. T. Tung, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high dimensional data," in Advances in Knowledge Discovery and Data Mining. Springer, 2014, vol. 8444, pp. 247-258.

[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.

[5] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: Illustrations, sources and a solution," BMC Bioinformatics, vol. 8, no. 1, p. 25, 2007.

[6] L. Breiman, "Using adaptive bagging to debias regressions," Technical Report 547, Statistics Dept., UC Berkeley, 1999.

[7] G. Zhang and Y. Lu, "Bias-corrected random forests in regression," Journal of Applied Statistics, vol. 39, no. 1, pp. 151-160, 2012.

[8] E. Tuv, A. Borisov, G. Runger, and K. Torkkola, "Feature selection with ensembles, artificial variables, and redundancy elimination," The Journal of Machine Learning Research, vol. 10, pp. 1341-1366, 2009.

[9] M. Sandri and P. Zuccolotto, "Analysis and correction of bias in total decrease in node impurity measures for tree-based algorithms," Statistics and Computing, vol. 20, no. 4, pp. 393-407, 2010.

[10] M.-H. Roy and D. Larocque, "Robustness of random forests for regression," Journal of Nonparametric Statistics, vol. 24, no. 4, pp. 993-1006, 2012.

[11] C.-H. Ho and C.-J. Lin, "Large-scale linear support vector regression," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 3323-3348, 2012.

[12] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18-22, 2002.

[13] T. Hothorn, K. Hornik, and A. Zeileis, "party: A laboratory for recursive part(y)itioning," R package version 0.9-9999, 2011. URL: http://cran.r-project.org/package=party.

[14] N. Meinshausen, "quantregForest: Quantile regression forests," R package version 0.2-3, 2012.