# A bias-variance analysis of ensemble learning for classification

Proceedings of the 58th Annual Conference of SASA (2016), 57–64
Arnu Pretorius 1
Stellenbosch University
e-mail: arnu@ml.sun.ac.za
Surette Bierman
Stellenbosch University
Sarel J. Steel
Stellenbosch University
Key words: Bias-variance analysis, Classification, Ensemble learning.
Abstract: A decomposition of the expected prediction error into bias and variance components is useful when investigating the accuracy of a predictor. However, in classification such a decomposition is not as straightforward as in the case of squared-error loss in regression. As a result, various definitions of bias and variance for classification can be found in the literature. In this paper these definitions are reviewed and an empirical study of a particular bias-variance decomposition is presented for ensemble classifiers.
1. Introduction
Consider a supervised learning problem with response variable $Y$ and input variables $X_1, X_2, \ldots, X_p$. Training data $\{(x_i, y_i), i = 1, \ldots, N\}$ are used to estimate a function $f(x)$. The estimated function is denoted by $\hat{f}(x)$ and is referred to as an (estimated) predictor, which is used to predict the response. The accuracy of $\hat{f}(x)$ is measured in terms of a loss function $L(Y, \hat{f}(x))$.
In regression problems the most commonly used loss function is squared-error loss, i.e. $L_{SE}(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$. Let $E(\hat{f}(x))$ be denoted by $\bar{f}(x)$. If an additive error model $Y = f(X) + \varepsilon$ is assumed, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, the following decomposition of the expected prediction error (also referred to as the generalisation error) of $\hat{f}$ at a point $X = x$ can be derived (Geman, Bienenstock and Doursat, 1992):

$$
\begin{aligned}
\mathrm{Err}_{SE}(x) &= \sigma_\varepsilon^2 + (\bar{f}(x) - f(x))^2 + E[(\hat{f}(x) - \bar{f}(x))^2] \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}. \qquad (1)
\end{aligned}
$$
In this expression the expectation is with respect to the response $Y$ corresponding to $x$ and with respect to the training data. The function $f(x)$ in (1) denotes the true function underlying the data. Although the search for good predictors can be restricted to the class of unbiased procedures, it is well known that better accuracy can often be achieved by trading off a small increase in bias against a larger decrease in variance. Examples of procedures obtained from such an approach are ridge regression and the lasso in linear regression analysis (Hastie, Tibshirani and Friedman, 2009).

Footnote 1: Member of MIH Media Lab at Stellenbosch University.
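The decomposition in (1) can be checked numerically in a deliberately simple setting. The sketch below is not taken from the paper: it assumes a single point with true value f(x), n noisy training responses observed at that point, and a hypothetical shrunken-mean predictor `lam * mean(y)`, for which the bias and variance are available in closed form. All function names and parameter values are illustrative.

```python
import random


def decomposition(f_x, sigma, n, lam):
    """Analytic terms of (1) for the predictor lam * mean(y_1, ..., y_n).

    Assumes y_i = f_x + eps_i with eps_i ~ N(0, sigma^2), so that the
    average predictor is lam * f_x and its variance is lam^2 * sigma^2 / n.
    """
    irreducible = sigma ** 2
    bias_sq = (lam * f_x - f_x) ** 2
    variance = lam ** 2 * sigma ** 2 / n
    return irreducible, bias_sq, variance


def simulated_error(f_x, sigma, n, lam, reps=40000, seed=1):
    """Monte Carlo estimate of E[(Y - f_hat)^2] over new responses and training sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Draw a fresh training set and fit the shrunken mean.
        train = [f_x + rng.gauss(0.0, sigma) for _ in range(n)]
        f_hat = lam * sum(train) / n
        # Draw a new response that is independent of the training set.
        y_new = f_x + rng.gauss(0.0, sigma)
        total += (y_new - f_hat) ** 2
    return total / reps
```

With f(x) = 1, sigma = 2 and n = 4, the unbiased choice `lam = 1` gives 4 + 0 + 1 = 5, whereas `lam = 0.5` gives 4 + 0.25 + 0.25 = 4.5: a small increase in bias is traded for a larger decrease in variance.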
It is clear that a decomposition of the expected prediction error into bias and variance components is useful when investigating the accuracy of a predictor. The focus in this paper is on bias-variance decompositions of prediction error in classification problems. It will become clear that such decompositions are not nearly as straightforward as for squared-error loss regression problems. In fact, the literature contains many different definitions of bias and variance in classification problems. These will be reviewed and an empirical study illustrating the different approaches for several popular ensemble classifiers will be presented.
2. Bias and Variance of a Classifier
Many different definitions of bias and variance in a classification context can be found in the literature. These include Dietterich and Kong (1995), Breiman (1996), Kohavi and Wolpert (1996), Tibshirani (1996), James and Hastie (1997), Heskes (1998), Breiman (2000) and Domingos (2000). The different definitions are based on different requirements and desired properties of bias and variance. In most of the proposals there is an interest in finding an additive decomposition specifically suited for expected 0-1 loss, analogous to that for squared-error loss in regression.

Arguably the most convincing explanation for the existence of so many different definitions and attempts at finding an appropriate general decomposition is given by James and Hastie (1997) and James (2003). The key observation is that the bias and variance of a model each play two different roles, referred to here as the inherent measure and the effect measure.

1. Inherent measure: The bias measures the disagreement between the average model and the truth, and the variance measures the variation of the estimate around its mean.
2. Effect measure: The bias measures the proportion of the generalisation error attributed to the disagreement between the average model and the truth (the effect of bias on error), and the variance measures the proportion of the generalisation error attributed to the variability of the estimated model (the effect of variance on error).

James (2003) notes that in regression these two roles are indistinguishable. In other words, the inherent measures of bias and variance are equal to their respective effects on the generalisation error. However, this is not generally the case in classification scenarios, and more specifically not so for expected 0-1 loss.
3. Bias and Variance for Symmetric Loss
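Reconsider the squared-error decomposition in (1). Omitting the argument $x$ for notational convenience, and following James and Hastie (1997) and James (2003), this can be rewritten as

$$
\begin{aligned}
\mathrm{Err}_{SE}(x) &= E[(Y - f)^2] + (\bar{f} - f)^2 + E[(\hat{f} - \bar{f})^2] \\
&= E[(Y - f)^2] + E[(Y - \bar{f})^2 - (Y - f)^2] + E[(Y - \hat{f})^2 - (Y - \bar{f})^2].
\end{aligned}
$$

Therefore, (1) becomes

$$
\mathrm{Err}_{SE}(x) = \sigma_\varepsilon^2 + E[L_{SE}(Y, \bar{f}) - L_{SE}(Y, f)] + E[L_{SE}(Y, \hat{f}) - L_{SE}(Y, \bar{f})]. \qquad (2)
$$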
The second term in (2) measures the effect on generalisation error from the expected difference in loss between the average model and the truth. The third term measures the effect on generalisation error from the expected difference in loss between the specific estimate $\hat{f}$ and the average model. However, the decomposition given in (2) is not restricted to squared-error loss and is valid for any symmetric loss function, i.e. where $L(a, b) = L(b, a)$ (James, 2003). Therefore, for an estimate $\hat{h}$ of a response $S$ (numeric or categorical) at $x$, with

$$
\begin{aligned}
\sigma(x) &= E[L(S, h)], \\
SE(x) &= E[L(S, \bar{h}) - L(S, h)], \\
VE(x) &= E[L(S, \hat{h}) - L(S, \bar{h})],
\end{aligned}
$$

where $h$ is the true underlying function, $\bar{h}$ is the average model and $L(\cdot, \cdot)$ is any symmetric loss, a general decomposition is given by

$$
\mathrm{Err}(x) = \sigma(x) + SE(x) + VE(x). \qquad (3)
$$

James (2003) refers to $SE(x)$ and $VE(x)$ as the systematic effect and the variance effect respectively. In regression with squared-error loss, the systematic and variance effects are indistinguishable from bias and variance. However, in classification the situation is not the same.
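For completeness, the step from (1) to (2) under squared-error loss can be spelled out. The derivation below is a sketch that assumes, in addition to the additive error model $Y = f + \varepsilon$ with $E(\varepsilon) = 0$, that the new response $Y$ is independent of the training data used to obtain $\hat{f}$; this independence is an assumption made here and is not stated explicitly in the text.

```latex
\begin{aligned}
E[(Y - \bar{f})^2 - (Y - f)^2]
  &= E[(f - \bar{f} + \varepsilon)^2 - \varepsilon^2] \\
  &= (\bar{f} - f)^2 + 2 (f - \bar{f}) E(\varepsilon)
   = (\bar{f} - f)^2, \\[4pt]
E[(Y - \hat{f})^2 - (Y - \bar{f})^2]
  &= E[(\hat{f} - \bar{f})^2] - 2 E[(Y - \bar{f})(\hat{f} - \bar{f})] \\
  &= E[(\hat{f} - \bar{f})^2] - 2 E(Y - \bar{f}) \, E(\hat{f} - \bar{f})
   = E[(\hat{f} - \bar{f})^2].
\end{aligned}
```

The final equality uses $E(\hat{f}) = \bar{f}$. Under these assumptions the systematic effect equals the squared bias and the variance effect equals the variance, which is why the two roles coincide for squared-error loss.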
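Consider the expected 0-1 loss, $E[L_{01}(C, g)] = P(g \neq C)$, where $g$ is any classifier and $C$ is the true class label at a point $x$. Let $\bar{g} = \arg\max_k E[I(\hat{g}(x) = k)]$ denote the majority vote classifier. With $g$ now taken to be the Bayes classifier, which plays the role of the true function $h$ in (3), the analogue of (1) may be expressed as

$$
\begin{aligned}
&\text{Irreducible Error} + \text{Bias} + \text{Variance} \\
&\quad = P(g \neq C) + I(\bar{g} \neq g) + P(\hat{g} \neq \bar{g}) \\
&\quad \neq P(g \neq C) + [P(\bar{g} \neq C) - P(g \neq C)] + [P(\hat{g} \neq C) - P(\bar{g} \neq C)] \\
&\quad = \sigma_{01}(x) + SE_{01}(x) + VE_{01}(x) \\
&\quad = \text{Irreducible Error} + \text{Systematic effect} + \text{Variance effect}.
\end{aligned}
$$

Therefore, in classification using 0-1 loss, the effects of bias and of variance on generalisation error are not equal to the inherent measures of bias and variance (James, 2003).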
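A single-point numerical illustration may make the inequality concrete. The sketch below is illustrative only: it assumes a binary problem, that the new label C is independent of the training data, and two hypothetical inputs, namely P(C = 1 | x) and the probability (over training sets) that the fitted classifier predicts class 1 at x. Ties at probability 0.5 are resolved in favour of class 1, which is an arbitrary choice.

```python
def zero_one_terms(p_class1, p_pred1):
    """Pointwise bias, variance and their effects under 0-1 loss (binary case).

    p_class1: P(C = 1 | x), the conditional class probability.
    p_pred1:  P(g_hat(x) = 1), taken over training sets.
    """
    bayes = 1 if p_class1 >= 0.5 else 0   # Bayes classifier g at x
    g_bar = 1 if p_pred1 >= 0.5 else 0    # majority vote classifier at x

    def err(label):
        # P(label != C) for a fixed predicted label.
        return 1.0 - p_class1 if label == 1 else p_class1

    sigma = err(bayes)                                   # irreducible error
    bias = float(g_bar != bayes)                         # I(g_bar != g)
    variance = 1.0 - p_pred1 if g_bar == 1 else p_pred1  # P(g_hat != g_bar)
    # P(g_hat != C), using independence of g_hat and C at x.
    total = p_pred1 * err(1) + (1.0 - p_pred1) * err(0)
    return {
        "sigma": sigma,
        "bias": bias,
        "variance": variance,
        "systematic_effect": err(g_bar) - sigma,
        "variance_effect": total - err(g_bar),
        "error": total,
    }
```

For example, with P(C = 1 | x) = 0.7 and P(g_hat(x) = 1) = 0.4, the bias is 1 and the variance is 0.4, while the systematic effect is 0.4 and the variance effect is -0.16. The error of 0.54 equals 0.3 + 0.4 - 0.16, but not 0.3 + 1 + 0.4.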
4. An Empirical Investigation
In this section, bias and variance and their respective effects are estimated on simulated data sets for classification trees, bagging, random forests and boosting (see footnote 2).
4.1. Data sets
Sixteen different simulated data sets were used in the empirical investigation of bias and variance and their respective effects. The first set of four simulated data sets consists of observations drawn from a multivariate normal distribution. Each data set has $p = 15$ input variables, with the pairwise correlation between all variables configured as follows: $\rho = 0.9$ (highly correlated), $\rho = 0.5$ (fairly correlated), $\rho = 0.1$ (weakly correlated) and $\rho = 0$ (uncorrelated). All input variables are associated with the response, which is coded $C = 1$ if $1/(1 + e^{-\sum_{j=1}^{15} X_j}) > 0.5$, or as $C = 0$ otherwise.
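A minimal sketch of this first simulation setup is given below. Several details are assumptions rather than facts from the text: zero means and unit variances for all inputs, the negative sign in the exponent of the logistic transform, and the construction of equicorrelated normals through one shared factor (which yields pairwise correlation rho for 0 <= rho <= 1). The function name and defaults are illustrative.

```python
import math
import random


def simulate_mvnorm(n, p=15, rho=0.5, seed=0):
    """Draw n observations of p equicorrelated standard normal inputs.

    Each input is sqrt(rho) * shared + sqrt(1 - rho) * own, where shared and
    own are independent N(0, 1) draws, so Var(X_j) = 1 and Corr(X_j, X_k) = rho.
    The label is 1 when the logistic transform of sum_j X_j exceeds 0.5.
    """
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    X, C = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        x = [a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)]
        prob = 1.0 / (1.0 + math.exp(-sum(x)))
        X.append(x)
        C.append(1 if prob > 0.5 else 0)
    return X, C
```

Under these assumptions the rule reduces to C = 1 exactly when the sum of the inputs is positive.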
The second set of simulated data sets was generated using the following simulation setup: $X_1, \ldots, X_p \sim U[0, 1]$ and $C = 1$ if $q + (1 - 2q) \cdot I(\sum_{l=1}^{J} X_l > J/2) > 0.5$, where $J < p$ and $0 < q < 1$, otherwise $C = 0$ (Mease and Wyner, 2008). This implies that the response $C$ only depends on $X_1, X_2, \ldots, X_J$. The remaining $p - J$ variables are noise. With $p = 30$ and $q = 0.15$, four configurations were chosen: $J = 2$ (mostly noise), $J = 5$ (fairly noisy), $J = 15$ (half signal/half noise) and $J = 20$ (mostly signal).
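The second setup can be sketched in the same way. The code below follows the labelling rule literally as stated above; the function name, argument names and the use of Python's `random` module are illustrative choices, not the authors' implementation.

```python
import random


def simulate_mease_wyner(n, p=30, J=5, q=0.15, seed=0):
    """Draw n observations with p independent U[0, 1] inputs.

    Only the first J inputs carry signal: the score is
    q + (1 - 2q) * I(sum of the first J inputs > J / 2),
    and the label is 1 when that score exceeds 0.5.
    """
    rng = random.Random(seed)
    X, C = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(p)]
        indicator = 1 if sum(x[:J]) > J / 2 else 0
        score = q + (1 - 2 * q) * indicator
        X.append(x)
        C.append(1 if score > 0.5 else 0)
    return X, C
```

With q = 0.15 the score takes only the values 0.15 and 0.85, so the label coincides with the indicator and the inputs with index above J never influence it.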
In addition, eight popular configurations were selected and simulated using the mlbench R package, viz. 2dnormals, Twonorm, Threenorm, Ringnorm, Circle, Cassini, Cuboids and XOR (Leisch and Dimitriadou, 2010). The data sets Twonorm, Threenorm and Ringnorm were also used in Breiman (1996) and Breiman (2000).
4.2. Experimental design
To approximate the necessary probabilities (expectations over indicator functions) the first step was to simulate 100 different training sets of size 400 and to fit a model to each training set. For each model fitted, predictions were made on a test set of size 1000. Using the known data generating mechanism to obtain the Bayes classifier, together with the predictions from each fit, the bias, variance, systematic and variance effects could be computed as averages over the training sets. More specifically, let the $D = 100$ training sets be indexed by $d = 1, \ldots, D$, let $\hat{g}_d$ denote the classifier fitted to training set $d$, and let $te = \{(x_{0i}, c_{0i}), i = 1, \ldots, N_0\}$ be the test set. Furthermore, let $\bar{g}(x) = \arg\max_k \frac{1}{D} \sum_{d=1}^{D} I(\hat{g}_d(x) = k)$. Then,

$$
\begin{aligned}
\widehat{\text{Bias}} &= \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq g(x_{0i})), \\
\widehat{\text{Variance}} &= \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_0} \sum_{i=1}^{N_0} I(\hat{g}_d(x_{0i}) \neq \bar{g}(x_{0i})), \\
\widehat{SE} &= \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq c_{0i}) - \frac{1}{N_0} \sum_{i=1}^{N_0} I(g(x_{0i}) \neq c_{0i}), \\
\widehat{VE} &= \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_0} \sum_{i=1}^{N_0} I(\hat{g}_d(x_{0i}) \neq c_{0i}) - \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq c_{0i}).
\end{aligned}
$$

Footnote 2: The code necessary to reproduce the analysis is publicly available at: https://github.com/arnupretorius/BiasVarAnalEnsembleLearn.
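The four estimators translate directly into code. The following is a minimal sketch and not the authors' implementation (which is referenced in footnote 2): it assumes predictions are stored as a list of D rows, one per training set, each holding the N0 test-set predictions, and it breaks majority-vote ties in favour of the smallest class label, a choice the text does not specify. All names are illustrative.

```python
from collections import Counter


def majority_vote(preds):
    """Return g_bar at every test point; preds[d][i] is g_hat_d(x_0i)."""
    votes = []
    for i in range(len(preds[0])):
        counts = Counter(row[i] for row in preds)
        best = max(counts.values())
        # Ties are resolved towards the smallest label (an assumption).
        votes.append(min(k for k, c in counts.items() if c == best))
    return votes


def error_rate(pred, truth):
    """Proportion of test points at which pred and truth disagree."""
    return sum(a != b for a, b in zip(pred, truth)) / len(truth)


def estimate_components(preds, bayes, labels):
    """Estimate bias, variance, systematic effect and variance effect.

    preds:  D x N0 predictions of the fitted models on the test set.
    bayes:  N0 predictions of the Bayes classifier g.
    labels: N0 observed test labels c_0i.
    """
    D = len(preds)
    g_bar = majority_vote(preds)
    err_bar = error_rate(g_bar, labels)
    err_bayes = error_rate(bayes, labels)
    err_mean = sum(error_rate(row, labels) for row in preds) / D
    return {
        "bias": error_rate(g_bar, bayes),
        "variance": sum(error_rate(row, g_bar) for row in preds) / D,
        "bayes_error": err_bayes,
        "systematic_effect": err_bar - err_bayes,
        "variance_effect": err_mean - err_bar,
        "error": err_mean,
    }
```

By construction the estimated error equals the sum of the Bayes error, the systematic effect and the variance effect, whereas no such identity holds for the bias and variance estimates.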
It is typically the case that the choice of values for the tuning parameters of an algorithm strongly affects its bias and variance (James, Witten, Hastie and Tibshirani, 2013). Therefore, before each fit, ten-fold cross-validation was performed to find the optimal tuning parameters for each algorithm from a pre-specified grid of available values. The pre-specified grids were chosen as follows:
Trees: Cost-complexity parameter $cp = \{0.1, 0.2, \ldots, 0.9, 1.0\}$.
Bagging: Number of trees $B = 200$.
Random forest: Number of trees $B = 200$, with the subset size of randomly selected variables $\xi = \{1, 3, 5, \ldots, p - 1\}$.
Boosting: Number of trees $B = 200$, tree interaction depth either one or six and step-length factor $\nu = \{0.01, 0.05, 0.1\}$. For binary classification the exponential loss was used and, for multi-class problems, the multinomial loss.
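The tuning step can be sketched generically as a grid search with ten-fold cross-validation. The code below is a hedged illustration only: the dictionary keys are made-up parameter names that do not correspond to any particular library, `cv_error` is a hypothetical user-supplied callback that fits a model on the training indices and returns the misclassification rate on the validation indices, and the fold construction is one of several reasonable choices.

```python
import itertools
import random

# Boosting grid as described in the text; key names are illustrative.
BOOSTING_GRID = {
    "n_trees": [200],
    "interaction_depth": [1, 6],
    "step_length": [0.01, 0.05, 0.1],
}


def expand_grid(grid):
    """List every combination of parameter values in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_by_cv(n, grid, cv_error, k=10, seed=0):
    """Return the grid entry with the lowest mean k-fold validation error."""
    folds = k_fold_indices(n, k, seed)
    best_params, best_err = None, float("inf")
    for params in expand_grid(grid):
        errors = []
        for j, valid in enumerate(folds):
            train = [i for m, fold in enumerate(folds) if m != j for i in fold]
            errors.append(cv_error(params, train, valid))
        mean_err = sum(errors) / k
        if mean_err < best_err:  # the first combination wins exact ties
            best_params, best_err = params, mean_err
    return best_params
```

For the boosting grid this evaluates 1 x 2 x 3 = 6 candidate configurations per training set, each on ten folds of 40 observations when the training set size is 400.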
4.3. Results
The results are summarised in Table 1. Values in bold represent the minimum achieved for a particular quantity amongst the algorithms. The following conclusions are drawn:
Single tree vs. ensemble: For each data set, randomisation and aggregation succeeded in drastically reducing the variance as well as the variance effect. Interestingly, the bias and systematic effect were either also reduced or remained unchanged.
Bagging vs. random forest: For the majority of data sets, the additional randomisation at each node of a tree in random forests resulted in a further reduction of the variance and of the variance effect when compared to bagging. In many of the data sets, bagging reduced the bias and systematic effect substantially. This is most evident in the case of multivariate normal distributions.
Random forest vs. boosting: The strategy employed by the boosting algorithm had a significant effect on variance. This is especially evident in the presence of noise. In several of the data sets, the reduction in variance led to a substantial reduction in the variance effect when compared to random forests.
Negative variance effects: For all the algorithms, the Circle data set resulted in most (if not all) of the reducible error being attributed to the systematic effect, with a bias of equal size. For this data set, bagging and random forests managed to obtain a variance very close to zero as well as a small negative variance effect. Boosting performed the worst in terms of variance, but the best in terms of the variance effect. This is possibly an illustration of the Friedman effect. For more details, see Friedman (1997).
Table 1: Estimated bias, variance, systematic effect and variance effect on simulated data. (Values in bold indicate row-wise minima.)

| Data | Quantity | Tree | Bagging | Random forest | Boosting |
|---|---|---|---|---|---|
| mvnorm (p=15, ρ=0.9) | Error | 0.105 | 0.044 | **0.038** | 0.043 |
| | Bayes Error | 0.028 | 0.028 | 0.028 | 0.028 |
| | Systematic Effect | 0.015 | **0.003** | 0.004 | 0.005 |
| | Variance Effect | 0.062 | 0.013 | **0.006** | 0.010 |
| | Bias | 0.025 | **0.005** | 0.006 | 0.007 |
| | Variance | 0.097 | 0.028 | **0.018** | 0.026 |
| mvnorm (p=15, ρ=0.5) | Error | 0.233 | 0.075 | **0.061** | 0.068 |
| | Bayes Error | 0.040 | 0.040 | 0.040 | 0.040 |
| | Systematic Effect | 0.034 | **0.009** | 0.010 | 0.013 |
| | Variance Effect | 0.159 | 0.026 | **0.011** | 0.015 |
| | Bias | 0.060 | **0.013** | 0.022 | 0.033 |
| | Variance | 0.224 | 0.056 | **0.034** | 0.043 |
| mvnorm (p=15, ρ=0.1) | Error | 0.368 | 0.152 | **0.129** | 0.130 |
| | Bayes Error | 0.078 | 0.078 | 0.078 | 0.078 |
| | Systematic Effect | 0.064 | **0.011** | 0.014 | 0.026 |
| | Variance Effect | 0.226 | 0.063 | 0.037 | **0.026** |
| | Bias | 0.098 | **0.027** | 0.028 | 0.040 |
| | Variance | 0.355 | 0.120 | 0.085 | **0.084** |
| mvnorm (p=15, ρ=0) | Error | 0.427 | 0.239 | 0.216 | **0.200** |
| | Bayes Error | 0.141 | 0.141 | 0.141 | 0.141 |
| | Systematic Effect | 0.074 | 0 | 0 | 0.004 |
| | Variance Effect | 0.212 | 0.101 | 0.077 | **0.055** |
| | Bias | 0.168 | **0.031** | 0.044 | 0.060 |
| | Variance | 0.405 | 0.197 | 0.163 | **0.136** |
| Mease-Wyner (2008) (p=30, J=2) | Error | 0.295 | 0.212 | 0.214 | 0.212 |
| | Bayes Error | 0.147 | 0.147 | 0.147 | 0.147 |
| | Systematic Effect | 0.050 | **0.002** | 0.008 | 0.017 |
| | Variance Effect | 0.098 | 0.063 | 0.059 | **0.048** |
| | Bias | 0.072 | **0.002** | 0.008 | 0.021 |
| | Variance | 0.202 | 0.094 | 0.096 | **0.092** |
| Mease-Wyner (2008) (p=30, J=5) | Error | 0.390 | 0.275 | 0.272 | **0.259** |
| | Bayes Error | 0.143 | 0.143 | 0.143 | 0.143 |
| | Systematic Effect | 0.075 | 0.021 | **0.017** | 0.021 |
| | Variance Effect | 0.172 | 0.111 | 0.112 | **0.095** |
| | Bias | 0.095 | 0.029 | 0.029 | 0.037 |
| | Variance | 0.338 | 0.184 | 0.179 | **0.159** |
| Mease-Wyner (2008) (p=30, J=15) | Error | 0.452 | 0.308 | 0.304 | **0.284** |
| | Bayes Error | 0.136 | 0.136 | 0.136 | 0.136 |
| | Systematic Effect | 0.117 | 0.034 | **0.015** | 0.020 |
| | Variance Effect | 0.199 | 0.138 | 0.153 | **0.128** |
| | Bias | 0.155 | 0.046 | 0.029 | **0.028** |
| | Variance | 0.424 | 0.233 | 0.228 | **0.201** |
| Mease-Wyner (2008) (p=30, J=20) | Error | 0.455 | 0.318 | 0.308 | **0.290** |
| | Bayes Error | 0.134 | 0.134 | 0.134 | 0.134 |
| | Systematic Effect | 0.171 | 0.043 | 0.034 | **0.023** |
| | Variance Effect | 0.150 | 0.141 | 0.140 | **0.133** |
| | Bias | 0.237 | 0.059 | 0.046 | **0.031** |
| | Variance | 0.421 | 0.248 | 0.236 | **0.212** |
| 2dnormals (p=2, K=6) | Error | 0.420 | 0.301 | 0.292 | **0.273** |
| | Bayes Error | 0.243 | 0.243 | 0.243 | 0.243 |
| | Systematic Effect | 0.076 | 0.004 | 0.005 | **0.002** |
| | Variance Effect | 0.101 | 0.054 | 0.044 | **0.028** |
| | Bias | 0.157 | 0.019 | 0.019 | 0.019 |
| | Variance | 0.290 | 0.158 | 0.138 | **0.099** |
| Twonorm (p=20, K=2) | Error | 0.319 | 0.057 | **0.033** | 0.040 |
| | Bayes Error | 0.024 | 0.024 | 0.024 | 0.024 |
| | Systematic Effect | 0.025 | **0** | 0.003 | 0.004 |
| | Variance Effect | 0.270 | 0.033 | **0.006** | 0.012 |
| | Bias | 0.037 | **0.008** | 0.011 | 0.014 |
| | Variance | 0.314 | 0.047 | **0.018** | 0.025 |
| Threenorm (p=20, K=2) | Error | 0.392 | 0.177 | **0.157** | 0.167 |
| | Bayes Error | 0.085 | 0.085 | 0.085 | 0.085 |
| | Systematic Effect | 0.092 | 0.039 | **0.037** | 0.048 |
| | Variance Effect | 0.215 | 0.053 | 0.035 | **0.034** |
| | Bias | 0.132 | 0.077 | **0.075** | 0.088 |
| | Variance | 0.368 | 0.127 | **0.094** | 0.102 |
| Ringnorm (p=20, K=2) | Error | 0.300 | 0.087 | **0.042** | 0.051 |
| | Bayes Error | 0.018 | 0.018 | 0.018 | 0.018 |
| | Systematic Effect | 0.161 | 0.013 | **0.007** | 0.019 |
| | Variance Effect | 0.121 | 0.056 | 0.017 | **0.014** |
| | Bias | 0.177 | 0.023 | **0.021** | 0.029 |
| | Variance | 0.256 | 0.077 | **0.030** | 0.034 |
| Circle (p=20, K=2) | Error | 0.171 | 0.168 | 0.168 | **0.140** |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.171 | 0.171 | 0.171 | **0.152** |
| | Variance Effect | 0 | -0.003 | -0.003 | **-0.012** |
| | Bias | 0.171 | 0.171 | 0.171 | **0.152** |
| | Variance | **0** | 0.007 | 0.004 | 0.046 |
| Cassini (p=2, K=3) | Error | 0.003 | 0.003 | **0.002** | 0.004 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.001 | 0.001 | **0** | 0.001 |
| | Variance Effect | 0.002 | 0.002 | 0.002 | 0.003 |
| | Bias | 0.001 | 0.001 | **0** | 0.001 |
| | Variance | 0.003 | 0.003 | **0.002** | 0.003 |
| Cuboids (p=3, K=4) | Error | 0.074 | 0.0001 | **0** | 0.0002 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0 | 0 | 0 | 0 |
| | Variance Effect | 0.074 | 0.0001 | **0** | 0.0002 |
| | Bias | 0 | 0 | 0 | 0 |
| | Variance | 0.074 | 0.0001 | **0** | 0.0002 |
| XOR (p=2, K=2) | Error | 0.059 | **0.007** | 0.009 | 0.008 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.002 | 0 | 0 | 0 |
| | Variance Effect | 0.057 | **0.007** | 0.009 | 0.008 |
| | Bias | 0.002 | 0 | 0 | 0 |
| | Variance | 0.059 | **0.007** | 0.009 | 0.008 |
Inherent and effect measure correlation: As was observed by James (2003), bias and variance
seem to be highly correlated with their respective effects on generalisation error. The median
correlation between bias and the systematic effect in the empirical study was 93.93%, while
the median correlation between the variance and variance effect was 95.61%.
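The text does not state how these correlations were computed. One plausible reading, adopted here purely as an assumption, is a Pearson correlation across the four algorithms within each data set, with the median then taken over data sets. The sketch below computes one such within-data-set correlation from the bias and systematic-effect values reported in Table 1 for the multivariate normal data set with ρ = 0.9.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Table 1, mvnorm (p = 15, rho = 0.9): tree, bagging, random forest, boosting.
bias = [0.025, 0.005, 0.006, 0.007]
systematic_effect = [0.015, 0.003, 0.004, 0.005]

r = pearson(bias, systematic_effect)
```

For this data set the correlation is above 0.99, in line with the strong association reported between the inherent and effect measures.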
In summary, Table 2 provides a win/tie analysis of the results, in which a tally of wins and ties is given for each quantity and algorithm.
Table 2: Win/Tie analysis of bias, variance, systematic effect and variance effect. Each entry is given as wins/ties.

| Quantity | Tree | Bagging | Random forest | Boosting |
|---|---|---|---|---|
| Error | 0/0 | 1/1 | 8/0 | 6/1 |
| Systematic Effect | 0/1 | 5/3 | 4/4 | 3/2 |
| Variance Effect | 0/1 | 1/1 | 3/2 | 10/0 |
| Bias | 0/1 | 6/4 | 3/4 | 3/3 |
| Variance | 1/0 | 1/0 | 7/0 | 7/0 |
| Total | 1/3 | 14/9 | 25/10 | 29/4 |
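A win/tie tally of this kind can be computed mechanically. The sketch below assumes that a win is a unique row-wise minimum and that a tie is a minimum shared by two or more algorithms; the text does not spell out the counting rule, and exact equality of the printed (rounded) values is used here. The function name is illustrative, and the rows are the Error values of the sixteen data sets as printed in Table 1.

```python
ALGORITHMS = ("Tree", "Bagging", "Random forest", "Boosting")

# Error rows of Table 1, ordered as in ALGORITHMS.
ERROR_ROWS = [
    [0.105, 0.044, 0.038, 0.043], [0.233, 0.075, 0.061, 0.068],
    [0.368, 0.152, 0.129, 0.130], [0.427, 0.239, 0.216, 0.200],
    [0.295, 0.212, 0.214, 0.212], [0.390, 0.275, 0.272, 0.259],
    [0.452, 0.308, 0.304, 0.284], [0.455, 0.318, 0.308, 0.290],
    [0.420, 0.301, 0.292, 0.273], [0.319, 0.057, 0.033, 0.040],
    [0.392, 0.177, 0.157, 0.167], [0.300, 0.087, 0.042, 0.051],
    [0.171, 0.168, 0.168, 0.140], [0.003, 0.003, 0.002, 0.004],
    [0.074, 0.0001, 0.0, 0.0002], [0.059, 0.007, 0.009, 0.008],
]


def tally(rows, names=ALGORITHMS):
    """Count unique minima (wins) and shared minima (ties) per algorithm."""
    wins = dict.fromkeys(names, 0)
    ties = dict.fromkeys(names, 0)
    for row in rows:
        lowest = min(row)
        best = [name for name, value in zip(names, row) if value == lowest]
        target = wins if len(best) == 1 else ties
        for name in best:
            target[name] += 1
    return wins, ties


wins, ties = tally(ERROR_ROWS)
```

Under this counting rule the printed Error values give 0/0, 1/1, 8/0 and 6/1 for the tree, bagging, random forest and boosting respectively, matching the Error row of Table 2.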
From Table 2 it is clear that bagging was the best performer in terms of bias. This also translated into the best performance with respect to the systematic effect. Random forests and boosting were tied for top position in terms of variance; however, with respect to the variance effect, boosting outperformed all of the other algorithms. Despite this, random forests still managed to achieve the lowest error rate on the highest number of simulation configurations. This seems to indicate that a random forest performs well not because it exclusively reduces either the bias/systematic effect or the variance/variance effect. Rather, a random forest seems to be the most successful in simultaneously reducing both these quantities.
5. Conclusion
In this paper we considered bias-variance decompositions of prediction error in a classification context. In this context, several definitions of bias and variance can be found in the literature. In an empirical study, decompositions of the prediction error into bias and variance, and into systematic and variance effects, were illustrated for several ensemble classifiers. Bagging was generally found to have the smallest bias and systematic effect. Although random forests did not perform the best in terms of any single component, they managed to achieve the best error rate most frequently. Interestingly, boosting was found to perform well in terms of its variance effect. This is contrary to the conclusion in Bühlmann and Hothorn (2007), viz. that improvements in the bagging prediction error boasted by boosting should be attributed to a decrease in bias. In conclusion, it seems that heuristic studies similar to the one presented here may prove useful in facilitating the proposal of sensible ensemble classifiers.
Acknowledgements
The authors would like to thank the reviewers for their insightful comments which improved the overall quality of this paper. The work in this paper was partially supported by the MIH Media Lab at Stellenbosch University and the National Research Foundation (NRF) under grant number: UID 94615. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF.
References
BREIMAN, L. (1996). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.
BREIMAN, L. (2000). Randomizing outputs to increase prediction accuracy. Machine Learning, 40 (3), 229–242.
BÜHLMANN, P. AND HOTHORN, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22 (4), 477–505.
DIETTERICH, T. G. AND KONG, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
DOMINGOS, P. (2000). A unified bias-variance decomposition for zero-one and squared loss. AAAI/IAAI, 2000, 564–569.
FRIEDMAN, J. H. (1997). On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1 (1), 55–77.
GEMAN, S., BIENENSTOCK, E., AND DOURSAT, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4 (1), 1–58.
HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. (2009). The Elements of Statistical Learning. Springer.
HESKES, T. (1998). Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10 (6), 1425–1433.
JAMES, G. AND HASTIE, T. (1997). Generalizations of the bias/variance decomposition for prediction error. Dept. Statistics, Stanford Univ., Stanford, CA, Tech. Rep.
JAMES, G., WITTEN, D., HASTIE, T., AND TIBSHIRANI, R. (2013). An Introduction to Statistical Learning. Springer.
JAMES, G. M. (2003). Variance and bias for general loss functions. Machine Learning, 51 (2), 115–135.
KOHAVI, R. AND WOLPERT, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. ICML, 96, 275–83.
LEISCH, F. AND DIMITRIADOU, E. (2010). Machine learning benchmark problems. mlbench R package.
MEASE, D. AND WYNER, A. (2008). Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9 (Feb), 131–156.
TIBSHIRANI, R. (1996). Bias, variance and prediction error for classification rules. University of Toronto, Department of Statistics.