BIAS-CORRECTED QUANTILE REGRESSION FORESTS FOR HIGH-DIMENSIONAL DATA

NGUYEN THANH TUNG^{1,4}, JOSHUA ZHEXUE HUANG^{1,2}, THUY THI NGUYEN^{3}, IMRAN KHAN^{1}

1. Shenzhen Key Laboratory of High Performance Data Mining, SIAT, CAS, Shenzhen 518055, China.
2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.
3. Hanoi University of Agriculture, Vietnam.
4. Water Resources University, Vietnam.

E-MAIL: tungnt@wru.vn, {tungnt, zx.huang, imran.khan}@siat.ac.cn, ntthuy@hua.edu.vn
Abstract:
Quantile Regression Forest (QRF), a nonparametric regression method based on random forests, has been shown to perform well in terms of prediction accuracy, especially when the conditional distributions are not Gaussian. However, the method may suffer from two kinds of bias when solving regression problems: bias in the feature selection stage and bias in the regression estimates themselves. In this paper, we propose a new bias-correction algorithm based on QRF that addresses both. To correct the first kind of bias, we propose a new feature sampling scheme that selects good features for growing trees; the first-level QRF is built on this scheme. For the second kind of bias, the residual term of the first-level QRF model is used as the response feature to train a second-level QRF model, which is then used to compute bias-corrected predictions. In our experiments, the proposed algorithm dramatically reduced prediction errors and outperformed most existing regression random forests models when applied to a synthetic data set as well as several well-known real-world data sets.

Keywords:
Bias Correction; Quantile Regression Forests; High-Dimensional Data; Random Forests; Data Mining
1. Introduction
Random forest (RF) [1] is a widely used non-parametric method for classification and regression problems. RFs build trees from the bagged samples and the bagged features of the training data $L = \{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i \in \mathbb{R}^M$, $Y_i \in \mathbb{R}$, $N$ is the number of training samples, and $M$ is the number of features. Given an input $X = x$, a regression RF is used as a function $f: \mathbb{R}^M \to \mathbb{R}$ to estimate the unknown value $y$ of the input $x \in \mathbb{R}^M$, denoted as $\hat{f}(x)$. We write the regression RF in the common form $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. The function $f(\cdot)$ is estimated from $L$, and the prediction $\hat{f}(x)$ is obtained for an independent test case $x$.
For point regression, each tree $T_k$ in the RF yields a prediction $\hat{f}_k(x)$, and these predictions are averaged across all $K$ trees to get the final RF prediction $\hat{f}(x) = \sum_{k=1}^{K} \hat{f}_k(x)/K$. This is the estimation of $f(x) = E(Y \mid X = x)$. We consider the prediction's mean-squared error (MSE) to measure the effectiveness of $\hat{f}$, defined as [6]
$$\mathrm{MSE}[\hat{f}(x)] = E[(Y - \hat{f}(x))^2 \mid X = x] = E[Y - f(x)]^2 + [E\hat{f}(x) - f(x)]^2 + E[E\hat{f}(x) - \hat{f}(x)]^2 \quad (1)$$
The first term of Eq. (1) is the variance $\sigma_\varepsilon^2$ of the target around its true mean $f(x)$. This cannot be avoided no matter how well we estimate $f(x)$, unless $\sigma_\varepsilon^2 = 0$. The second term is the squared bias $\mathrm{Bias}^2[\hat{f}(x)]$, and the last term is the variance $\mathrm{Var}[\hat{f}(x)]$.
In regression, traditional RFs predict, at each leaf node, the mean of the Y values of the samples in that node. This causes bias because extreme values in the samples are underestimated or overestimated. Prediction accuracy improves when the median is used to obtain predicted values, since the median is more robust to extreme values than the mean. Therefore, in this framework, the median is used in the QRF model for point regression. QRF [2] grows unpruned trees in the same way as RF [1]. However, at each leaf node, QRF estimates the conditional distribution $F(y \mid X = x) = P(Y \le y \mid X = x)$ instead of only the mean of the Y values as RF does. Given a probability $\alpha$, we can estimate the quantile $Q_\alpha(X)$ as
$$\hat{Q}_\alpha(X = x_{new}) = \inf\{y : \hat{F}(y \mid X = x_{new}) \ge \alpha\}.$$

Figure 1 – Bias in point and range prediction by the QRF model. A large number of points scatter away from the solid line.
For range prediction, we have
$$[Q_{\alpha_l}(X), Q_{\alpha_h}(X)] = [\inf\{y : \hat{F}(y \mid X = x) \ge \alpha_l\},\ \inf\{y : \hat{F}(y \mid X = x) \ge \alpha_h\}] \quad (2)$$
where $\alpha_l < \alpha_h$ and $\alpha_h - \alpha_l = \tau$. Here, $\tau$ is the probability that a prediction of $Y$ falls within the range $[Q_{\alpha_l}(X), Q_{\alpha_h}(X)]$. For point regression, the prediction can be taken as a value in this range, such as the mean or the median of the Y values. In addition, QRF works well in situations where the conditional distribution is not Gaussian. A detailed description can be found in [2, 3].
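To make the quantile estimation concrete, the following is a minimal Python sketch of QRF-style prediction built on scikit-learn's RandomForestRegressor; it is not the authors' implementation. It pools, tree by tree, the training Y values that share a leaf with the query point and reads quantiles off that empirical distribution, a simplification of the leaf-weighting scheme in [2].

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_qrf(X_train, y_train, n_trees=500, min_leaf=5):
    y_train = np.asarray(y_train)
    forest = RandomForestRegressor(n_estimators=n_trees, min_samples_leaf=min_leaf)
    forest.fit(X_train, y_train)
    # Leaf index of every training sample in every tree (N x K matrix).
    train_leaves = forest.apply(X_train)
    return forest, train_leaves, y_train

def predict_quantiles(model, X_new, alphas=(0.05, 0.5, 0.95)):
    forest, train_leaves, y_train = model
    new_leaves = forest.apply(X_new)  # n_new x K leaf indices
    preds = np.empty((X_new.shape[0], len(alphas)))
    for i, leaves in enumerate(new_leaves):
        # Training Y values that fall in the same leaf as x_new, tree by tree,
        # treated as a sample from the conditional distribution F(y | X = x_new).
        pooled = np.concatenate([y_train[train_leaves[:, k] == leaf]
                                 for k, leaf in enumerate(leaves)])
        preds[i] = np.quantile(pooled, alphas)
    return preds

A range prediction with tau = 0.9 then corresponds to the columns for alpha_l = 0.05 and alpha_h = 0.95, while the middle column gives the median used for point prediction.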
As stated above, QRF has the same bias as RF in point prediction even though it uses the median instead of the mean to predict $\hat{f}$. To illustrate this kind of bias, we generated 200 samples for a training data set and 1,000 samples for a testing data set using the following model:
$$Y = 10\sin(\pi X_1 X_2) + 20(X_3 - 0.5)^2 + 10X_4 + 5X_5 + \epsilon \quad (3)$$
where $X_1, X_2, X_3, X_4, X_5$ and $\epsilon$ are drawn from $U(0, 1)$.
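For reference, the data of Eq. (3) can be generated with a short sketch like the one below; following the text, $X_1, \ldots, X_5$ and the noise term are all drawn from $U(0, 1)$. The optional noise-feature argument anticipates the LM200/LM500 data sets of Section 4.1; function and parameter names are illustrative.

import numpy as np

def make_eq3_data(n_samples, n_noise_features=0, seed=0):
    rng = np.random.default_rng(seed)
    # Five informative features plus optional uniform noise features.
    X = rng.uniform(0.0, 1.0, size=(n_samples, 5 + n_noise_features))
    eps = rng.uniform(0.0, 1.0, size=n_samples)
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + eps)
    return X, y

X_train, y_train = make_eq3_data(200)         # training set
X_test, y_test = make_eq3_data(1000, seed=1)  # testing set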
We ran QRF under the default settings in R [14]. Figure 1 shows the predicted median values against the true values for point and range prediction. These results are based on QRF predictions for the 200 simulated samples. The bias in the point estimates becomes larger as the observed values increase or decrease. The solid line connects the points where the predicted values and the true values are equal. The range prediction gives a range that covers a new observation of $Y$ with high probability. Range predictions are shown as transparent grey bars, with vertical black lines at the bottom and top. The green dots are the observations within the range prediction, while the red ones are those outside it.
On the other hand, both RFs and QRFs have bias in the feature selection process [4]. The main cause is that, in the process of growing a tree from the bagged sample data, the trees may select a less important feature as the best split among the candidate features. The selection tends to favor uninformative features containing more values (i.e., fewer missing values and many categorical or distinct values) [4, 5].

Breiman [1] introduced bagging in RF as a way to reduce the prediction variance and increase prediction accuracy, but the bias remained. Recently, Zhang and Lu [7] proposed a simple non-iterative method using the original RF to correct the bias in regression problems. Their methods compare favorably with other bias-correction approaches for point prediction. However, their approach can only be applied to point regression. Moreover, the mean values were used in predictions, which, as mentioned before, can suffer from extreme values in the data. Besides, the techniques were tested only on small, low-dimensional data sets with at most 13 features.
In this paper, we propose a new bias-correction algorithm based on QRF, named bcQRF, with two kinds of bias correction: the first corrects the bias in the importance measure for feature selection and the second corrects the bias in the regression process. In our approach, the bcQRF algorithm based on the QRF model is used to correct the bias in regression problems instead of the adaptive bagging proposed by Breiman [6]. The predicted values and the residual term are obtained from the first-level QRF model. The original data set is then extended with the residuals as the response feature, while the original response feature is treated as a predictor feature. The second-level QRF model is used to estimate the bias values for this extended data set. The bias-corrected values are computed from the difference between the values predicted by the first-level QRF model and the bias values predicted by the second-level QRF model.

In our proposed method, both point regression and range prediction bias are corrected using the QRF algorithm. Furthermore, for the bias correction of importance measures in feature selection, we generate artificial features and add them to the original data set; we then apply the feature permutation method [3] to this extended data set to calculate the importance of features, producing the feature importance scores. Our experimental results show that the proposed algorithm with these bias-correction techniques dramatically reduced the prediction errors and outperformed most existing regression random forests on the synthetic as well as well-known real-world data sets, especially on high-dimensional data.
2. Bias-Correction of Feature Importance Measures
A bias correction for the importance measure in the feature selection process is intended to prevent an uninformative feature from being selected as the best split when trees are grown in the RF model. To do so, we add to the original data set an artificial feature that has the same values, possible splits and distribution as an original feature but no association with the $Y$ values. The artificial feature participates only in the competition for the best split and decreases the probability of selecting this kind of feature as a splitting node.
This bias correction for the importance measure starts with a brief recall of the feature permutation method recently described by Tung et al. [3]. Given a training data set $L$, let $X = \{L \setminus Y\}$ be a matrix containing $N$ samples and $M$ features. A set of matrices $\{A_i\}_{i=1}^{N}$ is generated by randomly permuting the $N$ rows of $X$ $N$ times. We call the columns of $A$ artificial features; each artificial feature has the same distribution, values and number of missing values as the corresponding predictor feature.

We build a quantile regression forests (QRF) model from this extended data set of $2M$ dimensions, including the predictor and artificial features. Following the importance measure produced by a permutation framework [3, 8, 9], we use QRF to compute $2M$ importance scores for the $2M$ features. We repeat the same process $R$ times to obtain $R$ replicates. In each region into which a regression tree partitions the data set, each artificial feature $A_i$ shares approximately the same properties as the corresponding $X_i$ but is independent of $Y$, and consequently has approximately the same probability of being selected as a splitting candidate. This is the key step for correcting the bias that favors uninformative features when trees are grown in the forest.
We adjust the importance measure $\widehat{VI}_j$ ($j = 1, \ldots, M$) as the average of all $R$ replicates of the raw importance scores $VI^{r}_{X_j}$: $\widehat{VI}_j = R^{-1}\sum_{r=1}^{R} VI^{r}_{X_j}$. We then normalize these values by defining a variable $\theta_j$ that maps $\widehat{VI}_j$ into $[0, 1]$ using min-max normalization as follows:
$$\theta_j = \frac{\widehat{VI}_j - \min(\widehat{VI})}{\max(\widehat{VI}) - \min(\widehat{VI})}. \quad (4)$$
The corrected importance scores $\theta_j$ can be used as an approximation of the bias affecting the importance of $X_j$. Hence, the permutation method can reduce the bias due to different measurement levels of $X_j$ and can yield a better ranking of features according to their importance. The weights $\{\theta_1, \theta_2, \ldots, \theta_M\}$ define a probability distribution over the features in the training data set $L$, which can be used for building trees in the forest.
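The sketch below illustrates this correction under stated assumptions rather than reproducing the authors' code: the artificial features are obtained by a joint row permutation of X, a forest is fitted on the extended 2M-column data, importances are averaged over R replicates, and Eq. (4) is applied. scikit-learn's permutation_importance, computed on the training data rather than on out-of-bag samples, stands in for the permutation measure of [3, 8, 9].

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def feature_weights(X, y, n_replicates=10, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    scores = np.zeros(m)
    for r in range(n_replicates):
        # Artificial copy of X: same columns and marginal distributions,
        # rows shuffled so every link with y is destroyed.
        X_art = X[rng.permutation(n)]
        X_ext = np.hstack([X, X_art])
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=r)
        forest.fit(X_ext, y)
        imp = permutation_importance(forest, X_ext, y, n_repeats=1, random_state=r)
        # Keep only the scores of the M real (predictor) features.
        scores += imp.importances_mean[:m]
    vi = scores / n_replicates
    # Min-max normalization of Eq. (4) gives the weights theta_j in [0, 1].
    return (vi - vi.min()) / (vi.max() - vi.min())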
3. Our Proposed Bias-correction Algorithm
3.1. The first level Quantile Regression Forests
We have the weights $\{\theta_j\}_{j=1}^{M}$ given by Eq. (4). This weighting is used to select the bagged features, instead of the simple random sampling of Breiman's original method [1]. At each node, we randomly select $mtry$ candidate features ($1 < mtry \ll M$) according to the probability distribution defined by the weights $\{\theta_j\}$ and aggregate them into a subspace, which is used as the set of candidates for splitting the node when growing trees. The first-level quantile regression forests algorithm is summarized as follows.
1) For each feature $X_j$, take the $R$ importance scores and compute the bias-corrected importance measure $\widehat{VI}_j$.

2) Compute the feature weights $\{\theta_1, \ldots, \theta_M\}$ according to Eq. (4).

3) Draw $K$ bootstrap samples of the original data set.

4) Grow a regression tree $T_k$ for each bootstrap sample as follows:

a) At each node, randomly select a subspace of $mtry = \lfloor \log_2(M) + 1 \rfloor$ features using the probabilities $\{\theta_1, \theta_2, \ldots, \theta_M\}$, and use the subspace features as candidates for splitting the node (a sketch of this sampling step is given after this list).

b) Grow an unpruned, large tree until the minimum node size $n_{min}$ is reached. At each leaf node, all Y values of the samples in the leaf node are kept.

c) Compute the observation weights of each $X_i$ from the individual trees and from the forest using the out-of-bag samples.

5) Given a probability $\tau$ and quantile levels $\alpha_l$, $\alpha_h$ with $\alpha_h - \alpha_l = \tau$, compute the corresponding quantiles $Q_{\alpha_l}$ and $Q_{\alpha_h}$ with Eq. (2) (we use the default values $\alpha_l = 0.05$ and $\alpha_h = 0.95$, i.e. $\tau = 0.9$).

6) Given $X = x_{new}$, estimate the prediction value from a value in the quantile range between $Q_{\alpha_l}$ and $Q_{\alpha_h}$, such as the median.
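A minimal sketch of the weighted subspace sampling of step 4a follows; standard forest libraries do not expose this hook at the node level, so the example shows the sampling step in isolation, with illustrative weights and names.

import numpy as np

def sample_feature_subspace(theta, rng):
    # mtry = floor(log2(M) + 1) candidate features, drawn without replacement
    # with probabilities proportional to the weights theta_j of Eq. (4).
    m = len(theta)
    mtry = int(np.floor(np.log2(m) + 1))
    probs = theta / theta.sum()
    return rng.choice(m, size=mtry, replace=False, p=probs)

rng = np.random.default_rng(0)
# Features with larger theta_j are more likely to enter the candidate set.
candidates = sample_feature_subspace(np.array([0.9, 0.1, 0.4, 0.05, 0.6]), rng)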
3.2. Our Proposed Bias-correction Algorithm
In our approach, the QRF model is used to correct the bias in regression problems. The prediction errors are obtained from the first-level QRF model on the original data set and are used to train the second-level QRF model. The final bias-corrected values are calculated from the difference between the results predicted by the first-level QRF model and those predicted by the second-level QRF model. The proposed bias-correction bcQRF algorithm for range prediction is summarized in the following steps; a code sketch is given after them.
• Step 1: Grow the first-level quantile regression forests (QRF) model from the training data $L = (X, Y)$.

• Step 2: Obtain the predicted values $\hat{Q}_\alpha(X = x_{OOB})$ on the out-of-bag data. The estimated bias is then calculated as the median predicted values minus the true values, defined as $\tilde{E} = \hat{Q}_{0.5}(X = x_{OOB}) - Y$.

• Step 3: Given $X = x_{new}$, use the first-level QRF to predict the testing data set, obtaining $\hat{Q}_\alpha(X = x_{new})$ and the range $[Q_{\alpha_l}(X = x_{new}), Q_{\alpha_h}(X = x_{new})]$.

• Step 4: Replace $Y$ in $L$ with the vector $\tilde{E}$ to obtain an extended data set $L_e = \{L, \tilde{E}\}$. Grow the second-level quantile regression forests (QRF) model on $L_e$ (with $\tilde{E}$ as the response feature). Use the second-level QRF to predict the testing data and obtain $\tilde{E}_{new}$.

• Step 5: The bias-corrected values are computed as $\hat{Q}_{new} = [Q_{\alpha_l}(X = x_{new}), Q_{\alpha_h}(X = x_{new})] - \tilde{E}_{new}$.

For point prediction, the predicted value is the bias-corrected $\hat{Q}_{0.5}$.
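The following is a compact sketch of Steps 1-5, reusing the fit_qrf and predict_quantiles helpers sketched in Section 1. Two simplifications are assumptions of this sketch, not of the algorithm above: in-sample rather than out-of-bag residuals are used in Step 2, and, since Y is unknown for new samples, the first-level median prediction stands in for the response-turned-predictor when the second-level model is queried.

import numpy as np

def bias_corrected_predict(X_train, y_train, X_new, alphas=(0.05, 0.5, 0.95)):
    # Step 1: first-level quantile forest on the original training data.
    model1 = fit_qrf(X_train, y_train)

    # Step 2: residual E = median prediction minus true value
    # (the algorithm uses out-of-bag predictions; in-sample is used here for brevity).
    med_train = predict_quantiles(model1, X_train, alphas=(0.5,))[:, 0]
    E = med_train - y_train

    # Step 4: second-level forest trained to predict E, with the original
    # response moved to the predictor side.
    model2 = fit_qrf(np.column_stack([X_train, y_train]), E)

    # Step 3: first-level quantile predictions for the new data.
    q_new = predict_quantiles(model1, X_new, alphas=alphas)

    # Query the second-level model; the first-level median replaces the
    # unknown Y of the new samples (an assumption of this sketch).
    med_new = predict_quantiles(model1, X_new, alphas=(0.5,))[:, 0]
    E_new = predict_quantiles(model2, np.column_stack([X_new, med_new]),
                              alphas=(0.5,))[:, 0]

    # Step 5: subtract the estimated bias from every predicted quantile.
    return q_new - E_new[:, None]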
4. Experiments
4.1. Data sets
Synthetic data sets: We use the model of Eq. (3) to generate two pairs of high-dimensional synthetic data sets. We first created 200 samples in 5 dimensions plus a response feature, named LM5, where M5 indicates the dimensionality of the data set. We then expanded this data set with 195 and 495 noisy features to obtain LM200 and LM500, respectively. Similarly, we generated test data sets HM200 and HM500 with 1,000 samples each.

Real-world data sets: Fourteen real-world data sets, listed in Table 1, are used to evaluate the performance of the regression forests algorithms. The upper part of Table 1 lists the 11 small data sets used in [10]. The lower part presents the characteristics of the high-dimensional data sets. The computed tomography (CT) data set was taken from the UCI repository (http://archive.ics.uci.edu/). The TFIDF-2006 data set is used in [11]. The Rivers data set (http://www.usgs.gov) was used to predict the flow level of a river. It is based on a data set containing river discharge levels of 1,439 Californian rivers over a period of 12,054 days. This data set contains 48.6% missing values; an imputation function in [12] was used to fill in the missing data.
Table 1 – Characteristics of the data sets.
Data set name #Samples #Features
1 Servo 167 4
2 Childhood 654 4
3 Computer Hardware 209 6
4 Auto MPG 392 7
5 Concrete Slump Test 103 7
6 Concrete Comp. Strength 1,030 8
7 Pulse Rates 110 10
8 Horse Racing 102 12
9 Boston housing 506 13
10 Breast Cancer Wisc. 194 30
11 Communities and Crime 1,994 102
12 Computed tomography 53,500 385
13 Rivers 12,054 1,440
14 TFIDF-2006 19,395 150,360
The level of the 1,440th river was predicted in our experiments; the target values were converted from [0.062, 101,000] to [0, 1].
4.2. Experimental Settings
We used the latest versions of the R packages for regression random forests (RF), unbiased conditional random forests (cRF) and quantile regression forests (QRF) [12, 13, 14], together with our bias-corrected bcQRF algorithm, to build regression models from the training data sets. The MSE measure of Eq. (1) is used to evaluate the models on the test data sets. The bias-correction algorithm is a new implementation. The experiments were run in the R environment, calling the corresponding C/C++ functions, on a 64-bit machine with an Intel(R) Xeon(R) E5620 2.40 GHz CPU, 4 MB cache and 32 GB main memory.
For the first 11 real-world data sets, we used 10-fold cross-validation to evaluate the prediction performance of the regression random forests algorithms. In each fold, we built 30 regression models, each with 500 trees, and tested the 30 models on the corresponding test data sets.
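As a brief illustration of this protocol (not the exact experiment scripts), the sketch below runs 10-fold cross-validation, fits several models per fold, and reports the average test MSE; the fit and predict callables abstract any of the forest variants compared here and are illustrative.

import numpy as np
from sklearn.model_selection import KFold

def cv_mse(X, y, fit, predict, n_folds=10, n_repeats=30):
    errors = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
        for _ in range(n_repeats):
            model = fit(X[train_idx], y[train_idx])   # e.g. a 500-tree forest
            pred = predict(model, X[test_idx])
            errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(errors)  # average MSE, as reported in Table 2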
For the three large data sets CT, Rivers and TFIDF-2006, we ran only one model with 500 trees for each random forests method. For the CT data set, two-thirds of the data was used for training and one-third for testing. For the Rivers, TFIDF-2006 and synthetic data sets, the given training data was used to learn the models and the test data was used to evaluate them.
The number of features considered for splitting a node is $\lfloor M/3 \rfloor$ for the low-dimensional data sets and $\lfloor \sqrt{M} \rfloor$ for the high-dimensional data sets, respectively. The parameter nodesize in a regression forests model is 5. The performance of each algorithm was measured by the average MSE.

Figure 2 – Comparison of the QRF and bcQRF models on the synthetic data sets HM200 and HM500: (a) range prediction by QRF; (b) range prediction by bcQRF; (c) point prediction by QRF; (d) point prediction by bcQRF.
4.3. Results and discussion
Figures 2(a) and 2(b) show the range prediction results on the data set HM200, and Figures 2(c) and 2(d) show the point prediction results on the data set HM500, by QRF without bias correction (Figures 2(a) and 2(c)) and by bcQRF with bias correction (Figures 2(b) and 2(d)). The biased range predictions can be clearly observed in Figure 2(a); they were corrected by bcQRF, as shown in Figure 2(b). Point predictions with bias correction by bcQRF, shown in Figure 2(d), are clearly improved in comparison with the predictions by QRF in Figure 2(c).

Figure 3 shows the 90% range prediction results of the large high-dimensional data set TFIDF-2006 by QRF and bcQRF. A clear bias in the point predictions can be observed in the results of QRF, shown in Figure 3(a). The bias was corrected in the results of bcQRF, shown in Figure 3(b). The effect of the bias correction is clearly demonstrated.
Figure 3 – Comparison of range predictions by (a) QRF and (b) bcQRF on the large data set TFIDF-2006.

The upper part of Table 2 presents the average MSE over the 30 repetitions on the small data sets. To compare our method with others, we collected the best results reported by Roy et al. [10] under the same criteria; they are shown in the RF LAD column of Table 2. The results of bcQRF are better than those of the other methods on most data sets. On the data sets where bcQRF did not obtain the best result, the differences from the best results were minor. These results indicate that bcQRF outperforms the state-of-the-art QRF in prediction.
The comparison results of the four regression random forests RF, cRF, QRF and bcQRF on the high-dimensional data sets, evaluated by MSE, are listed in the lower part of Table 2. Since cRF could not be run on the large data set TFIDF-2006, it was excluded from that experiment. It can be seen that bcQRF significantly reduced the MSE; these results demonstrate that the proposed bcQRF algorithm with its two kinds of bias correction performs better than the other algorithms on regression problems, especially on high-dimensional data.

Table 2 – Comparison of algorithms on real-world data sets; smaller values are better. The RF LAD column contains the best results from [10].

Data set name            RF      cRF     QRF     RF LAD  bcQRF
Servo                    0.391   0.745   0.551   0.524   0.420
Childhood                0.188   0.173   0.175   0.190   0.171
Comp. Hardware           3470    9324    5500    3500    3370
Auto MPG                 7.15    8.64    7.96    7.41    7.43
Concrete Slump           2357    3218    2747    2085    1988
Conc. Com. Str.          22.8    43.5    25.2    25.5    20.3
Pulse Rates              224     337     217     255     183
Horse Racing             4.04    6.93    4.32    3.48    3.18
Boston housing           10.3    16.0    9.1     10.3    8.0
Breast Can. Wisc.        1145    1089    1266    1100    1369
Comm. and Crime          0.019   0.019   0.023   0.018   0.019
CT                       1.785   6.264   1.280   -       0.261
Rivers (×10^-2)          0.059   0.093   0.068   -       0.026
TFIDF-2006               0.337   -       0.351   -       0.118

5. Conclusions

We have presented a new bias-correction algorithm, bcQRF, with two kinds of bias correction based on the QRF model, which aims to correct the bias in regression problems. The first main contribution is the correction of the bias of importance measures in the feature selection process; this correction avoids selecting uninformative features as the best split when trees are grown. The second is the correction of the bias in point and range prediction using the QRF model. Both correction techniques increase the prediction accuracy of the regression random forests model. We have presented a series of experimental results on both synthetic and real-world data sets to demonstrate the capability of the bcQRF models in bias correction and the advantages of bcQRF over other commonly used regression random forests algorithms.
Acknowledgment
This research is supported in part by NSFC under Grant No. 61203294, the Shenzhen New Industry Development Fund under Grants No. JC201005270342A and No. JCYJ20120617120716224, the National High-tech Research and Development Program (No. 2012AA040912), and the Guangdong-CAS project (No. 2012B091100221).
References
[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[2] N. Meinshausen, "Quantile regression forests," The Journal of Machine Learning Research, vol. 7, pp. 983-999, 2006.
[3] N. T. Tung, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high dimensional data," in Advances in Knowledge Discovery and Data Mining. Springer, 2014, vol. 8444, pp. 247-258.
[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.
[5] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: Illustrations, sources and a solution," BMC Bioinformatics, vol. 8, no. 1, p. 25, 2007.
[6] L. Breiman, "Using adaptive bagging to debias regressions," Technical Report 547, Statistics Department, University of California, Berkeley, 1999.
[7] G. Zhang and Y. Lu, "Bias-corrected random forests in regression," Journal of Applied Statistics, vol. 39, no. 1, pp. 151-160, 2012.
[8] E. Tuv, A. Borisov, G. Runger, and K. Torkkola, "Feature selection with ensembles, artificial variables, and redundancy elimination," The Journal of Machine Learning Research, vol. 10, pp. 1341-1366, 2009.
[9] M. Sandri and P. Zuccolotto, "Analysis and correction of bias in total decrease in node impurity measures for tree-based algorithms," Statistics and Computing, vol. 20, no. 4, pp. 393-407, 2010.
[10] M.-H. Roy and D. Larocque, "Robustness of random forests for regression," Journal of Nonparametric Statistics, vol. 24, no. 4, pp. 993-1006, 2012.
[11] C.-H. Ho and C.-J. Lin, "Large-scale linear support vector regression," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 3323-3348, 2012.
[12] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18-22, 2002.
[13] T. Hothorn, K. Hornik, and A. Zeileis, "party: A laboratory for recursive partytioning," R package version 0.9-9999, 2011. URL: http://cran.r-project.org/package=party.
[14] N. Meinshausen, "quantregForest: Quantile regression forests," R package version 0.2-3, 2012.