Proceedings of the 58th Annual Conference of SASA (2016), 57–64

A BIAS-VARIANCE ANALYSIS OF ENSEMBLE LEARNING FOR CLASSIFICATION

Arnu Pretorius¹

Stellenbosch University

e-mail: arnu@ml.sun.ac.za

Surette Bierman

Stellenbosch University

Sarel J. Steel

Stellenbosch University

Key words: Bias-variance analysis, Classification, Ensemble learning.

Abstract: A decomposition of the expected prediction error into bias and variance components is useful when investigating the accuracy of a predictor. However, in classification such a decomposition is not as straightforward as in the case of squared-error loss in regression. As a result, various definitions of bias and variance for classification can be found in the literature. In this paper these definitions are reviewed and an empirical study of a particular bias-variance decomposition is presented for ensemble classifiers.

1. Introduction

Consider a supervised learning problem with response variable $Y$ and input variables $X_1, X_2, \ldots, X_p$. Training data $\{(x_i, y_i), i = 1, \ldots, N\}$ are used to estimate a function $f(x)$. The estimated function is denoted by $\hat{f}(x)$ and is referred to as an (estimated) predictor, which is used to predict the response. The accuracy of $\hat{f}(x)$ is measured in terms of a loss function $L(Y, \hat{f}(x))$.

In regression problems the most commonly used loss function is squared-error loss, i.e. $L_{SE}(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$. Let $E(\hat{f}(x))$ be denoted by $\bar{f}(x)$; then if an additive error model $Y = f(X) + \varepsilon$ is assumed, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, the following decomposition of the expected prediction error, also referred to as the generalisation error, of $\hat{f}$ at a point $X = x$ can be derived (Geman, Bienenstock and Doursat, 1992):

$$
\begin{aligned}
\mathrm{Err}_{SE}(x) &= \sigma_\varepsilon^2 + (\bar{f}(x) - f(x))^2 + E[(\hat{f}(x) - \bar{f}(x))^2] \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}
\tag{1}
$$

In this expression the expectation is with respect to the response $Y$ corresponding to $x$ and with respect to the training data. The function $f(x)$ in (1) denotes the true function underlying the data. Although the search for good predictors can be restricted to the class of unbiased procedures, it is well known that better accuracy can often be achieved by trading off a small increase in bias against a larger decrease in variance. Examples of procedures obtained from such an approach are ridge regression and the lasso in linear regression analysis (Hastie, Tibshirani and Friedman, 2009).

¹ Member of MIH Media Lab at Stellenbosch University.

It is clear that a decomposition of the expected prediction error into bias and variance components is useful when investigating the accuracy of a predictor. The focus in this paper is on bias-variance decompositions of prediction error in classification problems. It will become clear that such decompositions are not nearly as straightforward as for squared-error loss regression problems. In fact, the literature contains many different definitions of bias and variance in classification problems. These will be reviewed and an empirical study illustrating the different approaches for several popular ensemble classifiers will be presented.
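As a rough numerical illustration of the squared-error decomposition (1), the following Python sketch estimates each term by Monte Carlo at a single point. The true function $f(t) = t^2$, the uniform inputs, the noise level and the deliberately biased constant predictor (the sample mean of the training responses) are all assumptions chosen for illustration; none of them comes from the paper.

```python
import random
import statistics


def decompose_squared_error(x=0.5, n_train=20, n_sets=5000, sigma=0.3, seed=1):
    """Monte Carlo estimate of the terms in (1) at the point x.

    Returns (expected prediction error, irreducible error, bias^2, variance).
    All modelling choices below are illustrative assumptions.
    """
    rng = random.Random(seed)

    def f(t):
        # Assumed true function underlying the data.
        return t ** 2

    preds, sq_errors = [], []
    for _ in range(n_sets):
        # Draw a fresh training set and fit a constant predictor to it.
        ys = [f(rng.random()) + rng.gauss(0.0, sigma) for _ in range(n_train)]
        f_hat = statistics.fmean(ys)
        # Draw an independent test response at x.
        y_new = f(x) + rng.gauss(0.0, sigma)
        preds.append(f_hat)
        sq_errors.append((y_new - f_hat) ** 2)

    f_bar = statistics.fmean(preds)                        # average model at x
    bias_sq = (f_bar - f(x)) ** 2                          # squared bias
    variance = statistics.fmean((p - f_bar) ** 2 for p in preds)
    err = statistics.fmean(sq_errors)                      # estimate of Err_SE(x)
    return err, sigma ** 2, bias_sq, variance


err, noise, bias_sq, variance = decompose_squared_error()
print(f"Err = {err:.4f}, sum of components = {noise + bias_sq + variance:.4f}")
```

Up to simulation noise, the estimated error should be close to the sum of the three components.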

2. Bias and Variance of a Classifier

Many different definitions of bias and variance in a classification context can be found in the literature. These include Dietterich and Kong (1995), Breiman (1996), Kohavi and Wolpert (1996), Tibshirani (1996), James and Hastie (1997), Heskes (1998), Breiman (2000) and Domingos (2000). The different definitions are based on different requirements and desired properties of bias and variance. In most of the proposals there is an interest in finding an additive decomposition specifically suited for expected 0-1 loss, analogous to that for squared-error loss in regression.

Arguably the most convincing explanation for the existence of so many different definitions and attempts at finding an appropriate general decomposition is given by James and Hastie (1997) and James (2003). The key observation is that the bias and variance of a model each play two different roles, referred to here as the inherent measure and the effect measure.

1. Inherent measure: The bias measures the disagreement between the average model and the truth, and the variance measures the variation of the estimate around its mean.

2. Effect measure: The bias measures the proportion of the generalisation error attributed to the disagreement between the average model and the truth (the effect of bias on error), and the variance measures the proportion of the generalisation error attributed to the variability of the estimated model (the effect of variance on error).

James (2003) notes that in regression these two roles are indistinguishable. In other words, the inherent measures of bias and variance are equal to their respective effects on the generalisation error. However, this is not generally the case in classification scenarios, and more specifically not so for expected 0-1 loss.

3. Bias and Variance for Symmetric Loss

Reconsider the squared-error decomposition in (1). Omitting the argument $x$ for notational convenience, and following James and Hastie (1997) and James (2003), this can be rewritten as

$$
\begin{aligned}
\mathrm{Err}_{SE}(x) &= E[(Y - f)^2] + (\bar{f} - f)^2 + E[(\hat{f} - \bar{f})^2] \\
&= E[(Y - f)^2] + E[(Y - \bar{f})^2 - (Y - f)^2] + E[(Y - \hat{f})^2 - (Y - \bar{f})^2].
\end{aligned}
$$

Therefore, (1) becomes

$$
\mathrm{Err}_{SE}(x) = \sigma_\varepsilon^2 + E[L_{SE}(Y, \bar{f}) - L_{SE}(Y, f)] + E[L_{SE}(Y, \hat{f}) - L_{SE}(Y, \bar{f})].
\tag{2}
$$
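The equality between the bias and variance terms of (1) and the loss differences in (2) can be checked term by term. The short derivation below is a sketch that assumes, as in Section 1, that $Y = f + \varepsilon$ with $E(\varepsilon) = 0$, and additionally that the test response $Y$ is independent of the training data (so that $E[\hat{f} - \bar{f}] = 0$ can be factored out of the cross term):

```latex
\begin{align*}
E[(Y - \bar{f})^2 - (Y - f)^2]
  &= E[(f - \bar{f} + \varepsilon)^2 - \varepsilon^2] \\
  &= (f - \bar{f})^2 + 2 (f - \bar{f}) E[\varepsilon]
   = (\bar{f} - f)^2, \\
E[(Y - \hat{f})^2 - (Y - \bar{f})^2]
  &= E[(\hat{f} - \bar{f})^2] - 2 E[(Y - \bar{f})(\hat{f} - \bar{f})] \\
  &= E[(\hat{f} - \bar{f})^2] - 2 E[Y - \bar{f}] \, E[\hat{f} - \bar{f}]
   = E[(\hat{f} - \bar{f})^2].
\end{align*}
```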

The second term in (2) measures the effect on generalisation error from the expected difference in loss between the average model and the truth. The third term measures the effect on generalisation error from the expected difference in loss between the specific estimate $\hat{f}$ and the average model.

However, the decomposition given in (2) is not restricted to squared-error loss and is valid for any symmetric loss function, i.e. where $L(a, b) = L(b, a)$ (James, 2003). Therefore, for an estimate $\hat{h}$ of a response $S$ (numeric or categorical) at $x$, with

$$
\begin{aligned}
\sigma(x) &= E[L(S, h)], \\
SE(x) &= E[L(S, \bar{h}) - L(S, h)], \\
VE(x) &= E[L(S, \hat{h}) - L(S, \bar{h})],
\end{aligned}
$$

where $h$ is the true underlying function, $\bar{h}$ is the average model and $L(\cdot, \cdot)$ is any symmetric loss, a general decomposition is given by

$$
\mathrm{Err}(x) = \sigma(x) + SE(x) + VE(x).
\tag{3}
$$

James (2003) refers to $SE(x)$ and $VE(x)$ as the systematic effect and the variance effect respectively. In regression with squared-error loss, the systematic and variance effects are indistinguishable from bias and variance. However, in classification the situation is not the same.

Consider the expected 0-1 loss, $E[L_{0-1}(C, g)] = P(g \neq C)$, where $g$ is any classifier and $C$ is the true class label at a point $x$. Let $\bar{g} = \arg\max_k E[I(\hat{g}(x) = k)]$ denote the majority vote classifier; then the analogue of (1) may be expressed as

$$
\begin{aligned}
&\text{Irreducible Error} + \text{Bias} + \text{Variance} \\
&\quad = P(g \neq C) + I(\bar{g} \neq g) + P(\hat{g} \neq \bar{g}) \\
&\quad \neq P(g \neq C) + [P(\bar{g} \neq C) - P(g \neq C)] + [P(\hat{g} \neq C) - P(\bar{g} \neq C)] \\
&\quad = \sigma_{0-1}(x) + SE_{0-1}(x) + VE_{0-1}(x) \\
&\quad = \text{Irreducible Error} + \text{Systematic effect} + \text{Variance effect}.
\end{aligned}
$$

Therefore, in classification using 0-1 loss, the effects of bias and of variance on generalisation error are not equal to the inherent measures of bias and variance (James, 2003).
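A small worked example may make the gap between the inherent measures and the effect measures concrete. The Python sketch below computes all quantities exactly at a single point $x$ for a binary problem, taking $g$ to be the Bayes classifier. The probabilities used (a 0.6 chance of class 1 and a classifier that predicts class 1 on 30% of training sets) are hypothetical, and the prediction is assumed to be independent of the test label.

```python
from fractions import Fraction


def zero_one_decomposition(p_class1, p_pred1):
    """Exact 0-1 loss quantities at one point x of a binary problem.

    p_class1: P(C = 1 | x).
    p_pred1:  P(g_hat(x) = 1), taken over training sets.
    Assumes the prediction and the test label are independent.
    """
    p_c = {1: p_class1, 0: 1 - p_class1}
    p_g = {1: p_pred1, 0: 1 - p_pred1}
    g = max(p_c, key=p_c.get)        # Bayes classifier at x
    g_bar = max(p_g, key=p_g.get)    # majority vote classifier at x

    bayes_error = 1 - p_c[g]                              # P(g != C)
    err_bar = 1 - p_c[g_bar]                              # P(g_bar != C)
    err_hat = sum(p_g[k] * (1 - p_c[k]) for k in (0, 1))  # P(g_hat != C)
    return {
        "error": err_hat,
        "irreducible": bayes_error,
        "bias": int(g_bar != g),                  # I(g_bar != g)
        "variance": 1 - p_g[g_bar],               # P(g_hat != g_bar)
        "systematic_effect": err_bar - bayes_error,
        "variance_effect": err_hat - err_bar,
    }


result = zero_one_decomposition(Fraction(3, 5), Fraction(3, 10))
for name, value in result.items():
    print(name, value)
```

In this hypothetical setting the bias equals 1 while the systematic effect is only 1/5, and the variance is 3/10 while the variance effect is negative (−3/50): because the majority vote is wrong here, variability around it occasionally produces the correct class.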

4. An Empirical Investigation

In this section, bias and variance and their respective effects are estimated on simulated data sets for classification trees, bagging, random forests and boosting.²

4.1. Data sets

Sixteen different simulated data sets were used in the empirical investigation of bias and variance and their respective effects. The first set of four simulated data sets consists of observations drawn from a multivariate normal distribution. Each data set has $p = 15$ input variables with the pairwise correlation between all variables configured as follows: $\rho = 0.9$ (highly correlated), $\rho = 0.5$ (fairly correlated), $\rho = 0.1$ (weakly correlated) and $\rho = 0$ (uncorrelated). All input variables are associated with the response, which is coded $C = 1$ if $1/(1 + e^{-\sum_{j=1}^{15} X_j}) > 0.5$, or as $C = 0$ otherwise.
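A minimal Python sketch of this first simulation setup is given below. The means and variances of the inputs are not stated above, so zero means and unit variances are assumed; equicorrelated normals are built from one shared factor plus independent noise, which is one standard construction rather than necessarily the one used in the study. The function name and signature are hypothetical.

```python
import math
import random


def simulate_mvnorm(n, p=15, rho=0.5, seed=0):
    """Draw n observations with p equicorrelated standard normal inputs.

    Assumes zero means, unit variances and 0 <= rho <= 1 (not stated in
    the text). X_j = sqrt(rho) * Z_0 + sqrt(1 - rho) * Z_j has pairwise
    correlation rho between any two inputs.
    """
    rng = random.Random(seed)
    X, C = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        row = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ]
        # 1 / (1 + exp(-s)) > 0.5 is equivalent to s > 0, which avoids overflow.
        C.append(1 if sum(row) > 0 else 0)
        X.append(row)
    return X, C


X, C = simulate_mvnorm(400, rho=0.9)
print(len(X), len(X[0]), sum(C) / len(C))
```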

The second set of simulated data sets was generated using the following simulation setup: $X_1, \ldots, X_p \sim U[0, 1]$ and $C = 1$ if $q + (1 - 2q) \cdot I(\sum_{l=1}^{J} X_l > J/2) > 0.5$, where $J < p$ and $0 < q < 1$, otherwise $C = 0$ (Mease and Wyner, 2008). This implies that the response $C$ only depends on $X_1, X_2, \ldots, X_J$. The remaining $p - J$ variables are noise. With $p = 30$ and $q = 0.15$, four configurations were chosen: $J = 2$ (mostly noise), $J = 5$ (fairly noisy), $J = 15$ (half signal/half noise) and $J = 20$ (mostly signal).
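The second setup can be sketched in Python as follows. By default the sketch applies the thresholding rule exactly as written above. The optional `stochastic` flag is an assumption that is not stated in the text: it treats $q + (1 - 2q) \cdot I(\cdot)$ as the probability that $C = 1$ and draws the label accordingly, which is one way the non-zero Bayes errors reported for these data sets could arise. The function name and signature are hypothetical.

```python
import random


def simulate_mease_wyner(n, p=30, J=5, q=0.15, stochastic=False, seed=0):
    """Draw n observations with p uniform inputs, of which the first J carry signal."""
    rng = random.Random(seed)
    X, C = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(p)]
        signal = 1 if sum(row[:J]) > J / 2 else 0
        score = q + (1 - 2 * q) * signal   # equals q or 1 - q
        if stochastic:
            # Assumption: read the score as P(C = 1 | x) and draw the label.
            label = 1 if rng.random() < score else 0
        else:
            # Rule as written in the text: threshold the score at 0.5.
            label = 1 if score > 0.5 else 0
        X.append(row)
        C.append(label)
    return X, C


X, C = simulate_mease_wyner(400, J=2)
print(len(X), len(X[0]), sum(C) / len(C))
```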

In addition, eight popular configurations were selected and simulated using the mlbench R package, viz. 2dnormals, Twonorm, Threenorm, Ringnorm, Circle, Cassini, Cuboids and XOR (Leisch and Dimitriadou, 2010). The data sets Twonorm, Threenorm and Ringnorm were also used in Breiman (1996) and Breiman (2000).

4.2. Experimental design

To approximate the necessary probabilities (expectations over indicator functions) the first step was to simulate 100 different training sets of size 400 and to fit a model to each training set. For each model fitted, predictions were made on a test set of size 1000. Using the known data generating mechanism to obtain the Bayes classifier, together with the predictions from each fit, the bias, variance, systematic and variance effects could be computed as averages over the training sets. More specifically, let $\Omega_1, \ldots, \Omega_D$ denote the $D = 100$ training sets and let $\Omega_{te} = \{(x_{0i}, c_{0i}), i = 1, \ldots, N_0\}$ be the test set. Furthermore, let $\bar{g}(x) = \arg\max_k \frac{1}{D} \sum_{d=1}^{D} I(\hat{g}_d(x) = k)$. Then,

$$
\begin{aligned}
\widehat{\text{Bias}} &= \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq g(x_{0i})), \\
\widehat{\text{Variance}} &= \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_0} \sum_{i=1}^{N_0} I(\hat{g}_d(x_{0i}) \neq \bar{g}(x_{0i})), \\
\widehat{SE} &= \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq c_{0i}) - \frac{1}{N_0} \sum_{i=1}^{N_0} I(g(x_{0i}) \neq c_{0i}), \\
\widehat{VE} &= \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_0} \sum_{i=1}^{N_0} I(\hat{g}_d(x_{0i}) \neq c_{0i}) - \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq c_{0i}).
\end{aligned}
$$

² The code necessary to reproduce the analysis is publicly available at: https://github.com/arnupretorius/BiasVarAnalEnsembleLearn.
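The four estimators can be sketched in a few lines of Python. This is an illustrative re-implementation, not the authors' code: the function names, the list-of-lists layout of the predictions and the tie-breaking rule in the majority vote (smallest label wins) are all assumptions.

```python
from collections import Counter


def majority_vote(votes):
    """Most frequent label; ties go to the smallest label (an assumption)."""
    counts = Counter(votes)
    return max(sorted(counts), key=counts.get)


def estimate_quantities(preds, bayes, labels):
    """Estimate bias, variance and their effects under 0-1 loss.

    preds[d][i]: prediction of the model fitted on training set d at test point i.
    bayes[i]:    Bayes classifier g at test point i.
    labels[i]:   observed class label c_0i at test point i.
    """
    D, N0 = len(preds), len(labels)
    g_bar = [majority_vote([preds[d][i] for d in range(D)]) for i in range(N0)]

    def rate(a, b):
        # Proportion of test points on which a and b disagree.
        return sum(u != v for u, v in zip(a, b)) / N0

    bayes_error = rate(bayes, labels)
    err_bar = rate(g_bar, labels)
    err = sum(rate(p, labels) for p in preds) / D
    return {
        "error": err,
        "bayes_error": bayes_error,
        "bias": rate(g_bar, bayes),
        "variance": sum(rate(p, g_bar) for p in preds) / D,
        "systematic_effect": err_bar - bayes_error,
        "variance_effect": err - err_bar,
    }


# Toy input: D = 3 fitted models, N0 = 4 test points.
toy = estimate_quantities(
    preds=[[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]],
    bayes=[1, 0, 1, 0],
    labels=[1, 0, 0, 0],
)
print(toy)
```

By construction the estimated error equals the sum of the estimated Bayes error, systematic effect and variance effect, whereas no such identity holds for the estimated bias and variance.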

It is typically the case that the choice of values for the tuning parameters of an algorithm largely affects its bias and variance (James, Witten, Hastie and Tibshirani, 2013). Therefore, before each fit, ten-fold cross-validation was performed to find the optimal tuning parameters for each algorithm from a pre-specified grid of available values. The pre-specified grids were chosen as follows:

• Trees: Cost-complexity parameter $cp \in \{0.1, 0.2, \ldots, 0.9, 1.0\}$.

• Bagging: Number of trees $B = 200$.

• Random forest: Number of trees $B = 200$, with the subset size of randomly selected variables $\xi \in \{1, 3, 5, \ldots, p - 1\}$.

• Boosting: Number of trees $B = 200$, tree interaction depth either one or six and step-length factor $\nu \in \{0.01, 0.05, 0.1\}$. For binary classification the exponential loss was used and, for multi-class problems, the multinomial loss.

4.3. Results

The results are summarised in Table 1. Values in bold represent the minimum achieved for a particular quantity amongst the algorithms. The following conclusions are drawn:

• Single tree vs. ensemble: For each data set, randomisation and aggregation succeeded in drastically reducing the variance as well as the variance effect. Interestingly, the bias and systematic effect were either also reduced or remained unchanged.

• Bagging vs. random forest: For the majority of data sets, the additional randomisation at each node of a tree in random forests resulted in a further reduction of the variance and of the variance effect when compared to bagging. In many of the data sets, bagging reduced the bias and systematic effect substantially. This is most evident in the case of multivariate normal distributions.

• Random forest vs. boosting: The strategy employed by the boosting algorithm had a significant effect on variance. This is especially evident in the presence of noise. In several of the data sets, the reduction in variance led to a substantial reduction in the variance effect when compared to random forests.

• Negative variance effects: For all the algorithms, the Circle data set resulted in most (if not all) of the reducible error being attributed to the systematic effect, with a bias of equal size. For this data set, bagging and random forests managed to obtain a variance very close to zero as well as a small negative variance effect. Boosting performed the worst in terms of variance, but the best in terms of the variance effect. This is possibly an illustration of the Friedman effect. For more details, see Friedman (1997).

Table 1: Estimated bias, variance, systematic effect and variance effect on simulated data. (Values in bold indicate row-wise minima.)

| Data | Quantity | Tree | Bagging | Random forest | Boosting |
|---|---|---|---|---|---|
| mvnorm (p=15, ρ=0.9) | Error | 0.105 | 0.044 | 0.038 | 0.043 |
| | Bayes Error | 0.028 | 0.028 | 0.028 | 0.028 |
| | Systematic Effect | 0.015 | 0.003 | 0.004 | 0.005 |
| | Variance Effect | 0.062 | 0.013 | 0.006 | 0.010 |
| | Bias | 0.025 | 0.005 | 0.006 | 0.007 |
| | Variance | 0.097 | 0.028 | 0.018 | 0.026 |
| mvnorm (p=15, ρ=0.5) | Error | 0.233 | 0.075 | 0.061 | 0.068 |
| | Bayes Error | 0.040 | 0.040 | 0.040 | 0.040 |
| | Systematic Effect | 0.034 | 0.009 | 0.010 | 0.013 |
| | Variance Effect | 0.159 | 0.026 | 0.011 | 0.015 |
| | Bias | 0.060 | 0.013 | 0.022 | 0.033 |
| | Variance | 0.224 | 0.056 | 0.034 | 0.043 |
| mvnorm (p=15, ρ=0.1) | Error | 0.368 | 0.152 | 0.129 | 0.130 |
| | Bayes Error | 0.078 | 0.078 | 0.078 | 0.078 |
| | Systematic Effect | 0.064 | 0.011 | 0.014 | 0.026 |
| | Variance Effect | 0.226 | 0.063 | 0.037 | 0.026 |
| | Bias | 0.098 | 0.027 | 0.028 | 0.040 |
| | Variance | 0.355 | 0.120 | 0.085 | 0.084 |
| mvnorm (p=15, ρ=0) | Error | 0.427 | 0.239 | 0.216 | 0.200 |
| | Bayes Error | 0.141 | 0.141 | 0.141 | 0.141 |
| | Systematic Effect | 0.074 | 0 | 0 | 0.004 |
| | Variance Effect | 0.212 | 0.101 | 0.077 | 0.055 |
| | Bias | 0.168 | 0.031 | 0.044 | 0.060 |
| | Variance | 0.405 | 0.197 | 0.163 | 0.136 |
| Mease-Wyner (2008) (p=30, J=2) | Error | 0.295 | 0.212 | 0.214 | 0.212 |
| | Bayes Error | 0.147 | 0.147 | 0.147 | 0.147 |
| | Systematic Effect | 0.050 | 0.002 | 0.008 | 0.017 |
| | Variance Effect | 0.098 | 0.063 | 0.059 | 0.048 |
| | Bias | 0.072 | 0.002 | 0.008 | 0.021 |
| | Variance | 0.202 | 0.094 | 0.096 | 0.092 |
| Mease-Wyner (2008) (p=30, J=5) | Error | 0.390 | 0.275 | 0.272 | 0.259 |
| | Bayes Error | 0.143 | 0.143 | 0.143 | 0.143 |
| | Systematic Effect | 0.075 | 0.021 | 0.017 | 0.021 |
| | Variance Effect | 0.172 | 0.111 | 0.112 | 0.095 |
| | Bias | 0.095 | 0.029 | 0.029 | 0.037 |
| | Variance | 0.338 | 0.184 | 0.179 | 0.159 |
| Mease-Wyner (2008) (p=30, J=15) | Error | 0.452 | 0.308 | 0.304 | 0.284 |
| | Bayes Error | 0.136 | 0.136 | 0.136 | 0.136 |
| | Systematic Effect | 0.117 | 0.034 | 0.015 | 0.020 |
| | Variance Effect | 0.199 | 0.138 | 0.153 | 0.128 |
| | Bias | 0.155 | 0.046 | 0.029 | 0.028 |
| | Variance | 0.424 | 0.233 | 0.228 | 0.201 |
| Mease-Wyner (2008) (p=30, J=20) | Error | 0.455 | 0.318 | 0.308 | 0.290 |
| | Bayes Error | 0.134 | 0.134 | 0.134 | 0.134 |
| | Systematic Effect | 0.171 | 0.043 | 0.034 | 0.023 |
| | Variance Effect | 0.150 | 0.141 | 0.140 | 0.133 |
| | Bias | 0.237 | 0.059 | 0.046 | 0.031 |
| | Variance | 0.421 | 0.248 | 0.236 | 0.212 |
| 2dnormals (p=2, K=6) | Error | 0.420 | 0.301 | 0.292 | 0.273 |
| | Bayes Error | 0.243 | 0.243 | 0.243 | 0.243 |
| | Systematic Effect | 0.076 | 0.004 | 0.005 | 0.002 |
| | Variance Effect | 0.101 | 0.054 | 0.044 | 0.028 |
| | Bias | 0.157 | 0.019 | 0.019 | 0.019 |
| | Variance | 0.290 | 0.158 | 0.138 | 0.099 |
| Twonorm (p=20, K=2) | Error | 0.319 | 0.057 | 0.033 | 0.040 |
| | Bayes Error | 0.024 | 0.024 | 0.024 | 0.024 |
| | Systematic Effect | 0.025 | 0 | 0.003 | 0.004 |
| | Variance Effect | 0.270 | 0.033 | 0.006 | 0.012 |
| | Bias | 0.037 | 0.008 | 0.011 | 0.014 |
| | Variance | 0.314 | 0.047 | 0.018 | 0.025 |
| Threenorm (p=20, K=2) | Error | 0.392 | 0.177 | 0.157 | 0.167 |
| | Bayes Error | 0.085 | 0.085 | 0.085 | 0.085 |
| | Systematic Effect | 0.092 | 0.039 | 0.037 | 0.048 |
| | Variance Effect | 0.215 | 0.053 | 0.035 | 0.034 |
| | Bias | 0.132 | 0.077 | 0.075 | 0.088 |
| | Variance | 0.368 | 0.127 | 0.094 | 0.102 |
| Ringnorm (p=20, K=2) | Error | 0.300 | 0.087 | 0.042 | 0.051 |
| | Bayes Error | 0.018 | 0.018 | 0.018 | 0.018 |
| | Systematic Effect | 0.161 | 0.013 | 0.007 | 0.019 |
| | Variance Effect | 0.121 | 0.056 | 0.017 | 0.014 |
| | Bias | 0.177 | 0.023 | 0.021 | 0.029 |
| | Variance | 0.256 | 0.077 | 0.030 | 0.034 |
| Circle (p=20, K=2) | Error | 0.171 | 0.168 | 0.168 | 0.140 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.171 | 0.171 | 0.171 | 0.152 |
| | Variance Effect | 0 | -0.003 | -0.003 | -0.012 |
| | Bias | 0.171 | 0.171 | 0.171 | 0.152 |
| | Variance | 0 | 0.007 | 0.004 | 0.046 |
| Cassini (p=2, K=3) | Error | 0.003 | 0.003 | 0.002 | 0.004 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.001 | 0.001 | 0 | 0.001 |
| | Variance Effect | 0.002 | 0.002 | 0.002 | 0.003 |
| | Bias | 0.001 | 0.001 | 0 | 0.001 |
| | Variance | 0.003 | 0.003 | 0.002 | 0.003 |
| Cuboids (p=3, K=4) | Error | 0.074 | 0.0001 | 0 | 0.0002 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0 | 0 | 0 | 0 |
| | Variance Effect | 0.074 | 0.0001 | 0 | 0.0002 |
| | Bias | 0 | 0 | 0 | 0 |
| | Variance | 0.074 | 0.0001 | 0 | 0.0002 |
| XOR (p=2, K=2) | Error | 0.059 | 0.007 | 0.009 | 0.008 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.002 | 0 | 0 | 0 |
| | Variance Effect | 0.057 | 0.007 | 0.009 | 0.008 |
| | Bias | 0.002 | 0 | 0 | 0 |
| | Variance | 0.059 | 0.007 | 0.009 | 0.008 |

• Inherent and effect measure correlation: As was observed by James (2003), bias and variance seem to be highly correlated with their respective effects on generalisation error. The median correlation between bias and the systematic effect in the empirical study was 93.93%, while the median correlation between the variance and variance effect was 95.61%.

In summary, Table 2 provides a win/tie analysis of the results, where a tally of wins and ties for each quantity and algorithm is given.

Table 2: Win/Tie analysis of bias, variance, systematic effect and variance effect. (Entries are given as Win/Tie.)

| Quantity | Tree | Bagging | Random forest | Boosting |
|---|---|---|---|---|
| Error | 0/0 | 1/1 | 8/0 | 6/1 |
| Systematic Effect | 0/1 | 5/3 | 4/4 | 3/2 |
| Variance Effect | 0/1 | 1/1 | 3/2 | 10/0 |
| Bias | 0/1 | 6/4 | 3/4 | 3/3 |
| Variance | 1/0 | 1/0 | 7/0 | 7/0 |
| Total | 1/3 | 14/9 | 25/10 | 29/4 |
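The tallying rule behind a win/tie analysis can be sketched as follows. How ties were determined in the study is not stated; the sketch assumes that a tie means exact equality of the reported (rounded) values and that smaller values are better. The function name is hypothetical, and the two example rows are the Error rows of the mvnorm ($\rho = 0.9$) and Mease-Wyner ($J = 2$) data sets from Table 1.

```python
def win_or_tie(row):
    """Return ("win" | "tie", list of leading algorithms) for one table row.

    row maps an algorithm name to its reported value; smaller is better.
    A tie is assumed to mean exact equality of the reported values.
    """
    best = min(row.values())
    leaders = [name for name, value in row.items() if value == best]
    return ("win" if len(leaders) == 1 else "tie", leaders)


error_mvnorm_09 = {"Tree": 0.105, "Bagging": 0.044, "Random forest": 0.038, "Boosting": 0.043}
error_mease_wyner_j2 = {"Tree": 0.295, "Bagging": 0.212, "Random forest": 0.214, "Boosting": 0.212}

print(win_or_tie(error_mvnorm_09))
print(win_or_tie(error_mease_wyner_j2))
```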

From Table 2 it is clear that bagging was the best performer in terms of bias. This also translated into the best performance with respect to the systematic effect. Random forests and boosting were tied for top position in terms of variance; however, with respect to the variance effect, boosting outperformed all of the other algorithms. Despite this, random forests still managed to achieve the lowest error rate on the highest number of simulation configurations. This seems to indicate that a random forest performs well not because it exclusively reduces either the bias/systematic effect or the variance/variance effect. Rather, a random forest seems to be the most successful in simultaneously reducing both these quantities.

5. Conclusion

In this paper we considered bias-variance decompositions of prediction error in a classification context. In this context, several definitions of bias and variance can be found in the literature. In an empirical study, a decomposition of the prediction error into bias and variance, and into systematic and variance effects, was illustrated for several ensemble classifiers. Bagging was generally found to have the smallest bias and systematic effect. Although random forests did not perform the best in terms of any of the components, they managed to achieve the best error rate more frequently. Interestingly, boosting was found to perform well in terms of its variance effect. This is contrary to the conclusion in Bühlmann and Hothorn (2007), viz. that improvements in the bagging prediction error boasted by boosting should be attributed to a decrease in bias. In conclusion, it seems that heuristic studies similar to the one presented here may prove useful in facilitating the proposal of sensible ensemble classifiers.

Acknowledgements

The authors would like to thank the reviewers for their insightful comments which improved the overall quality of this paper. The work in this paper was partially supported by the MIH Media Lab at Stellenbosch University and the National Research Foundation (NRF) under grant number: UID 94615. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF.

References

BREIMAN, L. (1996). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.

BREIMAN, L. (2000). Randomizing outputs to increase prediction accuracy. Machine Learning, 40 (3), 229–242.

BÜHLMANN, P. AND HOTHORN, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22 (4), 477–505.

DIETTERICH, T. G. AND KONG, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

DOMINGOS, P. (2000). A unified bias-variance decomposition for zero-one and squared loss. AAAI/IAAI, 2000, 564–569.

FRIEDMAN, J. H. (1997). On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1 (1), 55–77.

GEMAN, S., BIENENSTOCK, E., AND DOURSAT, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4 (1), 1–58.

HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. (2009). The Elements of Statistical Learning. Springer.

HESKES, T. (1998). Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10 (6), 1425–1433.

JAMES, G. AND HASTIE, T. (1997). Generalizations of the bias/variance decomposition for prediction error. Dept. Statistics, Stanford Univ., Stanford, CA, Tech. Rep.

JAMES, G., WITTEN, D., HASTIE, T., AND TIBSHIRANI, R. (2013). An Introduction to Statistical Learning. Springer.

JAMES, G. M. (2003). Variance and bias for general loss functions. Machine Learning, 51 (2), 115–135.

KOHAVI, R. AND WOLPERT, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. ICML, 96, 275–83.

LEISCH, F. AND DIMITRIADOU, E. (2010). Machine learning benchmark problems. mlbench R package.

MEASE, D. AND WYNER, A. (2008). Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9 (Feb), 131–156.

TIBSHIRANI, R. (1996). Bias, variance and prediction error for classification rules. University of Toronto, Department of Statistics.