A bias-variance analysis of ensemble learning for classification

Proceedings of the 58th Annual Conference of SASA (2016), 57–64
Arnu Pretorius¹
Stellenbosch University
e-mail: arnu@ml.sun.ac.za
Surette Bierman
Stellenbosch University
Sarel J. Steel
Stellenbosch University
Key words: Bias-variance analysis, Classification, Ensemble learning.
Abstract: A decomposition of the expected prediction error into bias and variance components is useful when investigating the accuracy of a predictor. However, in classification such a decomposition is not as straightforward as in the case of squared-error loss in regression. As a result various definitions of bias and variance for classification can be found in the literature. In this paper these definitions are reviewed and an empirical study of a particular bias-variance decomposition is presented for ensemble classifiers.
1. Introduction
Consider a supervised learning problem with response variable $Y$ and input variables $X_1, X_2, \ldots, X_p$. Training data $\{(x_i, y_i), i = 1, \ldots, N\}$ are used to estimate a function $f(x)$. The estimated function is denoted by $\hat{f}(x)$ and is referred to as an (estimated) predictor, which is used to predict the response. The accuracy of $\hat{f}(x)$ is measured in terms of a loss function $L(Y, \hat{f}(x))$.
In regression problems the most commonly used loss function is squared-error loss, i.e. $L_{SE}(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$. Let $E(\hat{f}(x))$ be denoted by $\bar{f}(x)$, then if an additive error model $Y = f(X) + \varepsilon$ is assumed, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, the following decomposition of the expected prediction error, also referred to as the generalisation error, of $\hat{f}$ at a point $X = x$ can be derived (Geman, Bienenstock and Doursat, 1992):
$$\begin{aligned}
\mathrm{Err}_{SE}(x) &= \sigma_\varepsilon^2 + (\bar{f}(x) - f(x))^2 + E[(\hat{f}(x) - \bar{f}(x))^2] \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned} \tag{1}$$
In this expression the expectation is with respect to the response $Y$ corresponding to $x$ and with respect to the training data. The function $f(x)$ in (1) denotes the true function underlying the data.
Although the search for good predictors can be restricted to the class of unbiased procedures, it is well known that better accuracy can often be achieved by trading off a small increase in bias against a larger decrease in variance. Examples of procedures obtained from such an approach are ridge regression and the lasso in linear regression analysis (Hastie, Tibshirani and Friedman, 2009).
¹ Member of MIH Media Lab at Stellenbosch University.
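
As a small illustration of decomposition (1) and of the bias-variance trade-off, the sketch below uses a toy setting that is not part of the paper: the true value $f$ is a constant, the training responses are $f$ plus normal noise, and the predictor is a shrunken sample mean $c \cdot \bar{y}$. The function names, the shrinkage factor `c` and all numerical settings are assumptions made for illustration only.

```python
import random


def decomposition(f, sigma2, n, c):
    """Closed-form terms of (1) for the predictor c * mean(y_1, ..., y_n)."""
    bias2 = ((c - 1.0) * f) ** 2    # (f_bar - f)^2, since f_bar = c * f
    variance = c ** 2 * sigma2 / n  # variance of c times a sample mean
    return sigma2, bias2, variance


def simulated_error(f, sigma2, n, c, reps=20000, seed=1):
    """Monte Carlo estimate of E[(Y - f_hat)^2] over training sets and a new Y."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(reps):
        # Fit: shrunken mean of n noisy training responses.
        f_hat = c * sum(f + rng.gauss(0.0, sd) for _ in range(n)) / n
        # Evaluate on a fresh response drawn at the same point.
        y_new = f + rng.gauss(0.0, sd)
        total += (y_new - f_hat) ** 2
    return total / reps


for c in (1.0, 0.8):
    irreducible, bias2, variance = decomposition(f=1.0, sigma2=4.0, n=5, c=c)
    print(c, irreducible + bias2 + variance, simulated_error(1.0, 4.0, 5, c))
```

With these settings the unbiased choice `c = 1.0` has a closed-form error of about 4.8, while `c = 0.8` accepts a squared bias of about 0.04 in exchange for a lower variance and reaches about 4.552. The simulated errors should agree with these values up to Monte Carlo noise.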
It is clear that a decomposition of the expected prediction error into bias and variance components is useful when investigating the accuracy of a predictor. The focus in this paper is on bias-variance decompositions of prediction error in classification problems. It will become clear that such decompositions are not nearly as straightforward as for squared-error loss regression problems. In fact, the literature contains many different definitions of bias and variance in classification problems. These will be reviewed and an empirical study illustrating the different approaches for several popular ensemble classifiers will be presented.
2. Bias and Variance of a Classifier
Many different definitions of bias and variance in a classification context can be found in the literature. These include Dietterich and Kong (1995), Breiman (1996), Kohavi and Wolpert (1996), Tibshirani (1996), James and Hastie (1997), Heskes (1998), Breiman (2000) and Domingos (2000). The different definitions are based on different requirements and desired properties of bias and variance. In most of the proposals there is an interest in finding an additive decomposition specifically suited for expected 0-1 loss, analogous to that for squared-error loss in regression.
Arguably the most convincing explanation for the existence of so many different definitions and
attempts at finding an appropriate general decomposition, is given by James and Hastie (1997) and
James (2003). The key observation is that the bias and variance of a model each play two different
roles, referred to here as the inherent measure and the effect measure.
1. Inherent measure: The bias measures the disagreement between the average model and the
truth, and the variance measures the variation of the estimate around its mean.
2. Effect measure: The bias measures the proportion of the generalisation error attributed to the
disagreement between the average model and the truth (the effect of bias on error), and the
variance measures the proportion of the generalisation error attributed to the variability of the
estimated model (the effect of variance on error).
James (2003) notes that in regression these two roles are indistinguishable. In other words, the inherent measures of bias and variance are equal to their respective effects on the generalisation error. However, this is not generally the case in classification scenarios, and more specifically not so for expected 0-1 loss.
3. Bias and Variance for Symmetric Loss
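Reconsider the squared-error decomposition in (1). Omitting the argument $x$ for notational convenience, and following James and Hastie (1997) and James (2003), this can be rewritten as
$$\begin{aligned}
\mathrm{Err}_{SE}(x) &= E[(Y - f)^2] + (\bar{f} - f)^2 + E[(\hat{f} - \bar{f})^2] \\
&= E[(Y - f)^2] + E[(Y - \bar{f})^2 - (Y - f)^2] + E[(Y - \hat{f})^2 - (Y - \bar{f})^2].
\end{aligned}$$
Therefore, (1) becomes
$$\mathrm{Err}_{SE}(x) = \sigma_\varepsilon^2 + E[L_{SE}(Y, \bar{f}) - L_{SE}(Y, f)] + E[L_{SE}(Y, \hat{f}) - L_{SE}(Y, \bar{f})]. \tag{2}$$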
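
The rewriting above can be checked term by term. The following worked step is a sketch that assumes, in addition to $E(\varepsilon) = 0$, that the noise $\varepsilon$ in the new response is independent of the training data (and hence of $\hat{f}$). Writing $Y = f + \varepsilon$, the second term is

$$E[(Y - \bar{f})^2 - (Y - f)^2] = E[(f - \bar{f} + \varepsilon)^2 - \varepsilon^2] = (f - \bar{f})^2 + 2(f - \bar{f})E(\varepsilon) = (\bar{f} - f)^2,$$

and the third term is

$$\begin{aligned}
E[(Y - \hat{f})^2 - (Y - \bar{f})^2] &= E[(f - \hat{f})^2] + \sigma_\varepsilon^2 - (f - \bar{f})^2 - \sigma_\varepsilon^2 \\
&= E[(\hat{f} - \bar{f})^2] + (\bar{f} - f)^2 - (\bar{f} - f)^2 \\
&= E[(\hat{f} - \bar{f})^2],
\end{aligned}$$

where the second line uses $E[(\hat{f} - f)^2] = E[(\hat{f} - \bar{f})^2] + (\bar{f} - f)^2$, which holds because $E(\hat{f}) = \bar{f}$.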
The second term in (2) measures the effect on generalisation error from the expected difference in loss between the average model and the truth. The third term measures the effect on generalisation error from the expected difference in loss between the specific estimate $\hat{f}$ and the average model. However, the decomposition given in (2) is not restricted to squared-error loss and is valid for any symmetric loss function, i.e. where $L(a, b) = L(b, a)$ (James, 2003). Therefore, for an estimate $\hat{h}$ of a response $S$ (numeric or categorical) at $x$, with
$$\begin{aligned}
\sigma(x) &= E[L(S, h)] \\
SE(x) &= E[L(S, \bar{h}) - L(S, h)] \\
VE(x) &= E[L(S, \hat{h}) - L(S, \bar{h})],
\end{aligned}$$
where $h$ is the true underlying function, $\bar{h}$ is the average model and $L(\cdot, \cdot)$ is any symmetric loss, a general decomposition is given by
$$\mathrm{Err}(x) = \sigma(x) + SE(x) + VE(x). \tag{3}$$
James (2003) refers to $SE(x)$ and $VE(x)$ as the systematic effect and the variance effect respectively. In regression with squared-error loss, the systematic and variance effects are indistinguishable from bias and variance. However, in classification the situation is not the same.
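Consider the expected 0-1 loss, $E[L_{01}(C, g)] = P(g \neq C)$, where $g$ is any classifier and $C$ is the true class label at a point $x$. In what follows $g$ takes the role of $h$ in (3), i.e. the true underlying (Bayes) classifier. Let $\bar{g} = \arg\max_k E[I(\hat{g}(x) = k)]$ denote the majority vote classifier, then the analogue of (1) may be expressed as
$$\begin{aligned}
\text{Irreducible Error} + \text{Bias} + \text{Variance}
&= P(g \neq C) + I(\bar{g} \neq g) + P(\hat{g} \neq \bar{g}) \\
&\neq P(g \neq C) + [P(\bar{g} \neq C) - P(g \neq C)] + [P(\hat{g} \neq C) - P(\bar{g} \neq C)] \\
&= \sigma_{01}(x) + SE_{01}(x) + VE_{01}(x) \\
&= \text{Irreducible Error} + \text{Systematic effect} + \text{Variance effect}.
\end{aligned}$$
Therefore, in classification using 0-1 loss, the effects of bias and of variance on generalisation error are not equal to the inherent measures of bias and variance (James, 2003).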
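
The gap between the inherent measures and the effect measures under 0-1 loss can be seen at a single point $x$. The sketch below assumes a two-class problem with labels 0 and 1, takes $g$ to be the Bayes classifier (the role played by $h$ in (3)), and assumes that the label $C$ and the prediction $\hat{g}$ are independent given $x$. The function name, its arguments and the tie-breaking at probability 0.5 are illustrative choices, not taken from the paper.

```python
def zero_one_decomposition(p_class1, p_pred1):
    """Pointwise 0-1 loss quantities for a two-class problem (labels 0 and 1).

    p_class1: P(C = 1 | x), the true class probability at x.
    p_pred1:  P(g_hat(x) = 1), taken over training sets.
    """
    bayes = 1 if p_class1 > 0.5 else 0     # g, the Bayes classifier
    majority = 1 if p_pred1 > 0.5 else 0   # g_bar, the majority vote

    def p_wrong(label):
        """P(label != C) for a fixed predicted label."""
        return 1.0 - p_class1 if label == 1 else p_class1

    noise = p_wrong(bayes)                 # P(g != C)
    majority_error = p_wrong(majority)     # P(g_bar != C)
    error = p_pred1 * p_wrong(1) + (1.0 - p_pred1) * p_wrong(0)  # P(g_hat != C)
    return {
        "error": error,
        "noise": noise,
        "bias": float(majority != bayes),                             # I(g_bar != g)
        "variance": p_pred1 if majority == 0 else 1.0 - p_pred1,      # P(g_hat != g_bar)
        "systematic_effect": majority_error - noise,
        "variance_effect": error - majority_error,
    }


result = zero_one_decomposition(p_class1=0.9, p_pred1=0.4)
for name, value in result.items():
    print(name, round(value, 3))
```

With $P(C = 1 \mid x) = 0.9$ and $P(\hat{g}(x) = 1) = 0.4$, the bias is 1 and the variance 0.4, whereas the systematic effect is 0.8 and the variance effect is $-0.32$. The effects add up to the error of 0.58 together with the noise of 0.1; the inherent measures do not. The negative variance effect arises because, at a point where the majority vote is wrong, every deviation from the majority vote moves the prediction towards the more likely class.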
4. An Empirical Investigation
In this section, bias and variance and their respective effects are estimated on simulated data sets for
classification trees, bagging, random forests and boosting.²
4.1. Data sets
Sixteen different simulated data sets were used in the empirical investigation of bias and variance and their respective effects. The first set of four simulated data sets consists of observations drawn from a multivariate normal distribution. Each data set has $p = 15$ input variables with the pairwise correlation between all variables configured as follows: $\rho = 0.9$ (highly correlated), $\rho = 0.5$ (fairly correlated), $\rho = 0.1$ (weakly correlated) and $\rho = 0$ (uncorrelated). All input variables are associated with the response, which is coded $C = 1$ if $1/(1 + e^{-\sum_{j=1}^{15} X_j}) > 0.5$, or as $C = 0$ otherwise.
The second set of simulated data sets was generated using the following simulation setup: $X_1, \ldots, X_p \sim U[0, 1]$ and $C = 1$ if $q + (1 - 2q) \cdot I(\sum_{l=1}^{J} X_l > J/2) > 0.5$, where $J < p$ and $0 < q < 1$, otherwise $C = 0$ (Mease and Wyner, 2008). This implies that the response $C$ only depends on $X_1, X_2, \ldots, X_J$. The remaining $p - J$ variables are noise. With $p = 30$ and $q = 0.15$, four configurations were chosen: $J = 2$ (mostly noise), $J = 5$ (fairly noisy), $J = 15$ (half signal/half noise) and $J = 20$ (mostly signal).
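
The two simulation setups described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not state the means and variances of the multivariate normal inputs, so zero means and unit variances are assumed, and the equicorrelated inputs are built from one shared normal component. The labelling rules follow the text as stated; the function names and seeds are hypothetical.

```python
import math
import random


def simulate_mvnorm(n, p=15, rho=0.5, seed=0):
    """Equicorrelated normal inputs (assumed mean 0, variance 1) with a logistic rule."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        # sqrt(rho) * shared + sqrt(1 - rho) * own noise has variance 1 and
        # pairwise correlation rho across the p inputs.
        x = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(p)]
        c = 1 if 1.0 / (1.0 + math.exp(-sum(x))) > 0.5 else 0
        data.append((x, c))
    return data


def simulate_mease_wyner(n, p=30, J=5, q=0.15, seed=0):
    """Uniform inputs; only the first J inputs carry signal."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(p)]
        score = q + (1.0 - 2.0 * q) * (1 if sum(x[:J]) > J / 2 else 0)
        c = 1 if score > 0.5 else 0
        data.append((x, c))
    return data


train = simulate_mease_wyner(400, J=5, seed=1)
print(len(train), sum(c for _, c in train))
```

With $q = 0.15$ the score takes only the values 0.15 and 0.85, so the rule as stated reduces to $C = 1$ exactly when $\sum_{l=1}^{J} X_l > J/2$.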
In addition, eight popular configurations were selected and simulated using the mlbench R package, viz. 2dnormals, Twonorm, Threenorm, Ringnorm, Circle, Cassini, Cuboids and XOR (Leisch and Dimitriadou, 2010). The data sets Twonorm, Threenorm and Ringnorm were also used in Breiman (1996) and Breiman (2000).
4.2. Experimental design
To approximate the necessary probabilities (expectations over indicator functions) the first step was to simulate 100 different training sets of size 400 and to fit a model to each training set. For each model fitted, predictions were made on a test set of size 1000. Using the known data generating mechanism to obtain the Bayes classifier, together with the predictions from each fit, the bias, variance, systematic and variance effects could be computed as averages over the training sets. More specifically, let the $D = 100$ training sets be indexed by $d = 1, \ldots, D$, with $\hat{g}_d$ the classifier fitted to training set $d$, and let $te = \{(x_{0i}, c_{0i}), i = 1, \ldots, N_0\}$ be the test set. Furthermore, let $\bar{g}(x) = \arg\max_k \frac{1}{D} \sum_{d=1}^{D} I(\hat{g}_d(x) = k)$. Then,
$$\begin{aligned}
\widehat{\text{Bias}} &= \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq g(x_{0i})), \\
\widehat{\text{Variance}} &= \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_0} \sum_{i=1}^{N_0} I(\hat{g}_d(x_{0i}) \neq \bar{g}(x_{0i})), \\
\widehat{SE} &= \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq c_{0i}) - \frac{1}{N_0} \sum_{i=1}^{N_0} I(g(x_{0i}) \neq c_{0i}), \\
\widehat{VE} &= \frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_0} \sum_{i=1}^{N_0} I(\hat{g}_d(x_{0i}) \neq c_{0i}) - \frac{1}{N_0} \sum_{i=1}^{N_0} I(\bar{g}(x_{0i}) \neq c_{0i}).
\end{aligned}$$
² The code necessary to reproduce the analysis is publicly available at: https://github.com/arnupretorius/BiasVarAnalEnsembleLearn.
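
The four estimators defined in this section can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: predictions are passed in as plain lists (one list of test-set predictions per fitted model), ties in the majority vote are broken in favour of the class seen first, and all names are hypothetical.

```python
from collections import Counter


def estimate_quantities(preds_per_model, bayes_preds, labels):
    """Estimate bias, variance, systematic effect and variance effect.

    preds_per_model: D lists, each holding one model's N0 test predictions.
    bayes_preds:     the Bayes classifier's N0 test predictions.
    labels:          the N0 observed test labels.
    """
    n_models = len(preds_per_model)
    n_test = len(labels)

    # Majority vote over the D fitted models at every test point.
    g_bar = [Counter(p[i] for p in preds_per_model).most_common(1)[0][0]
             for i in range(n_test)]

    def rate(a, b):
        """Proportion of test points on which a and b disagree."""
        return sum(u != v for u, v in zip(a, b)) / n_test

    error = sum(rate(p, labels) for p in preds_per_model) / n_models
    majority_error = rate(g_bar, labels)
    bayes_error = rate(bayes_preds, labels)
    return {
        "error": error,
        "bayes_error": bayes_error,
        "bias": rate(g_bar, bayes_preds),
        "variance": sum(rate(p, g_bar) for p in preds_per_model) / n_models,
        "systematic_effect": majority_error - bayes_error,
        "variance_effect": error - majority_error,
    }


# Toy input: three fitted models and four test points.
preds = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0]]
print(estimate_quantities(preds, bayes_preds=[1, 0, 1, 0], labels=[1, 0, 1, 1]))
```

By construction the average error equals the Bayes error plus the systematic effect plus the variance effect, which offers a quick consistency check on any set of estimates.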
It is typically the case that the choice of values for the tuning parameters of an algorithm largely affects its bias and variance (James, Witten, Hastie and Tibshirani, 2013). Therefore, before each fit, ten-fold cross-validation was performed to find the optimal tuning parameters for each algorithm from a pre-specified grid of available values. The pre-specified grids were chosen as follows:
• Trees: Cost-complexity parameter $cp = \{0.1, 0.2, \ldots, 0.9, 1.0\}$.
• Bagging: Number of trees $B = 200$.
• Random forest: Number of trees $B = 200$, with the subset size of randomly selected variables $\xi = \{1, 3, 5, \ldots, p - 1\}$.
• Boosting: Number of trees $B = 200$, tree interaction depth either one or six and step-length factor $\nu = \{0.01, 0.05, 0.1\}$. For binary classification the exponential loss was used and, for multi-class problems, the multinomial loss.
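
The grids listed above can be written out explicitly, which makes the number of candidate settings per algorithm easy to see. The sketch below is illustrative only: the dictionary keys are hypothetical names, and the random forest grid is assumed to contain the odd subset sizes below $p$, which matches $\{1, 3, 5, \ldots, p - 1\}$ when $p$ is even.

```python
from itertools import product


def tuning_grids(p):
    """Candidate tuning-parameter settings for a data set with p inputs."""
    return {
        "tree": [{"cp": round(0.1 * k, 1)} for k in range(1, 11)],
        "bagging": [{"n_trees": 200}],
        # Assumption: odd subset sizes 1, 3, 5, ... strictly below p.
        "random_forest": [{"n_trees": 200, "subset_size": m}
                          for m in range(1, p, 2)],
        "boosting": [{"n_trees": 200, "depth": depth, "step_length": nu}
                     for depth, nu in product((1, 6), (0.01, 0.05, 0.1))],
    }


grids = tuning_grids(p=30)
for name, grid in grids.items():
    print(name, len(grid))
```

For $p = 30$ this gives 10 candidate settings for trees, 1 for bagging, 15 for random forests and 6 for boosting, each of which would be compared by ten-fold cross-validation before a fit.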
4.3. Results
The results are summarised in Table 1. Values in bold represent the minimum achieved for a particular quantity amongst the algorithms. The following conclusions are drawn:
• Single tree vs. ensemble: For each data set, randomisation and aggregation succeeded in drastically reducing the variance as well as the variance effect. Interestingly, the bias and systematic effect were either also reduced or remained unchanged.
• Bagging vs. random forest: For the majority of data sets, the additional randomisation at each node of a tree in random forests resulted in a further reduction of the variance and of the variance effect when compared to bagging. In many of the data sets, bagging reduced the bias and systematic effect substantially. This is most evident in the case of multivariate normal distributions.
• Random forest vs. boosting: The strategy employed by the boosting algorithm had a significant effect on variance. This is especially evident in the presence of noise. In several of the data sets, the reduction in variance led to a substantial reduction in the variance effect when compared to random forests.
• Negative variance effects: For all the algorithms, the circle data set resulted in most (if not all) of the reducible error being attributed to the systematic effect, with a bias of equal size. For this data set, bagging and random forests managed to obtain a variance very close to zero as well as a small negative variance effect. Boosting performed the worst in terms of variance, but the best in terms of the variance effect. This is possibly an illustration of the Friedman effect. For more details, see Friedman (1997).
Table 1: Estimated bias, variance, systematic effect and variance effect on simulated data. (Values in bold indicate row-wise minima.)
| Data | Quantity | Tree | Bagging | Random forest | Boosting |
|---|---|---|---|---|---|
| mvnorm, p=15, ρ=0.9 | Error | 0.105 | 0.044 | 0.038 | 0.043 |
| | Bayes Error | 0.028 | 0.028 | 0.028 | 0.028 |
| | Systematic Effect | 0.015 | 0.003 | 0.004 | 0.005 |
| | Variance Effect | 0.062 | 0.013 | 0.006 | 0.010 |
| | Bias | 0.025 | 0.005 | 0.006 | 0.007 |
| | Variance | 0.097 | 0.028 | 0.018 | 0.026 |
| mvnorm, p=15, ρ=0.5 | Error | 0.233 | 0.075 | 0.061 | 0.068 |
| | Bayes Error | 0.040 | 0.040 | 0.040 | 0.040 |
| | Systematic Effect | 0.034 | 0.009 | 0.010 | 0.013 |
| | Variance Effect | 0.159 | 0.026 | 0.011 | 0.015 |
| | Bias | 0.060 | 0.013 | 0.022 | 0.033 |
| | Variance | 0.224 | 0.056 | 0.034 | 0.043 |
| mvnorm, p=15, ρ=0.1 | Error | 0.368 | 0.152 | 0.129 | 0.130 |
| | Bayes Error | 0.078 | 0.078 | 0.078 | 0.078 |
| | Systematic Effect | 0.064 | 0.011 | 0.014 | 0.026 |
| | Variance Effect | 0.226 | 0.063 | 0.037 | 0.026 |
| | Bias | 0.098 | 0.027 | 0.028 | 0.040 |
| | Variance | 0.355 | 0.120 | 0.085 | 0.084 |
| mvnorm, p=15, ρ=0 | Error | 0.427 | 0.239 | 0.216 | 0.200 |
| | Bayes Error | 0.141 | 0.141 | 0.141 | 0.141 |
| | Systematic Effect | 0.074 | 0 | 0 | 0.004 |
| | Variance Effect | 0.212 | 0.101 | 0.077 | 0.055 |
| | Bias | 0.168 | 0.031 | 0.044 | 0.060 |
| | Variance | 0.405 | 0.197 | 0.163 | 0.136 |
| Mease-Wyner (2008), p=30, J=2 | Error | 0.295 | 0.212 | 0.214 | 0.212 |
| | Bayes Error | 0.147 | 0.147 | 0.147 | 0.147 |
| | Systematic Effect | 0.050 | 0.002 | 0.008 | 0.017 |
| | Variance Effect | 0.098 | 0.063 | 0.059 | 0.048 |
| | Bias | 0.072 | 0.002 | 0.008 | 0.021 |
| | Variance | 0.202 | 0.094 | 0.096 | 0.092 |
| Mease-Wyner (2008), p=30, J=5 | Error | 0.390 | 0.275 | 0.272 | 0.259 |
| | Bayes Error | 0.143 | 0.143 | 0.143 | 0.143 |
| | Systematic Effect | 0.075 | 0.021 | 0.017 | 0.021 |
| | Variance Effect | 0.172 | 0.111 | 0.112 | 0.095 |
| | Bias | 0.095 | 0.029 | 0.029 | 0.037 |
| | Variance | 0.338 | 0.184 | 0.179 | 0.159 |
| Mease-Wyner (2008), p=30, J=15 | Error | 0.452 | 0.308 | 0.304 | 0.284 |
| | Bayes Error | 0.136 | 0.136 | 0.136 | 0.136 |
| | Systematic Effect | 0.117 | 0.034 | 0.015 | 0.020 |
| | Variance Effect | 0.199 | 0.138 | 0.153 | 0.128 |
| | Bias | 0.155 | 0.046 | 0.029 | 0.028 |
| | Variance | 0.424 | 0.233 | 0.228 | 0.201 |
| Mease-Wyner (2008), p=30, J=20 | Error | 0.455 | 0.318 | 0.308 | 0.290 |
| | Bayes Error | 0.134 | 0.134 | 0.134 | 0.134 |
| | Systematic Effect | 0.171 | 0.043 | 0.034 | 0.023 |
| | Variance Effect | 0.150 | 0.141 | 0.140 | 0.133 |
| | Bias | 0.237 | 0.059 | 0.046 | 0.031 |
| | Variance | 0.421 | 0.248 | 0.236 | 0.212 |
| 2dNorms, p=2, K=6 | Error | 0.420 | 0.301 | 0.292 | 0.273 |
| | Bayes Error | 0.243 | 0.243 | 0.243 | 0.243 |
| | Systematic Effect | 0.076 | 0.004 | 0.005 | 0.002 |
| | Variance Effect | 0.101 | 0.054 | 0.044 | 0.028 |
| | Bias | 0.157 | 0.019 | 0.019 | 0.019 |
| | Variance | 0.290 | 0.158 | 0.138 | 0.099 |
| Twonorm, p=20, K=2 | Error | 0.319 | 0.057 | 0.033 | 0.040 |
| | Bayes Error | 0.024 | 0.024 | 0.024 | 0.024 |
| | Systematic Effect | 0.025 | 0 | 0.003 | 0.004 |
| | Variance Effect | 0.270 | 0.033 | 0.006 | 0.012 |
| | Bias | 0.037 | 0.008 | 0.011 | 0.014 |
| | Variance | 0.314 | 0.047 | 0.018 | 0.025 |
| Threenorm, p=20, K=2 | Error | 0.392 | 0.177 | 0.157 | 0.167 |
| | Bayes Error | 0.085 | 0.085 | 0.085 | 0.085 |
| | Systematic Effect | 0.092 | 0.039 | 0.037 | 0.048 |
| | Variance Effect | 0.215 | 0.053 | 0.035 | 0.034 |
| | Bias | 0.132 | 0.077 | 0.075 | 0.088 |
| | Variance | 0.368 | 0.127 | 0.094 | 0.102 |
| Ringnorm, p=20, K=2 | Error | 0.300 | 0.087 | 0.042 | 0.051 |
| | Bayes Error | 0.018 | 0.018 | 0.018 | 0.018 |
| | Systematic Effect | 0.161 | 0.013 | 0.007 | 0.019 |
| | Variance Effect | 0.121 | 0.056 | 0.017 | 0.014 |
| | Bias | 0.177 | 0.023 | 0.021 | 0.029 |
| | Variance | 0.256 | 0.077 | 0.030 | 0.034 |
| Circle, p=20, K=2 | Error | 0.171 | 0.168 | 0.168 | 0.140 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.171 | 0.171 | 0.171 | 0.152 |
| | Variance Effect | 0 | 0.003 | 0.003 | -0.012 |
| | Bias | 0.171 | 0.171 | 0.171 | 0.152 |
| | Variance | 0 | 0.007 | 0.004 | 0.046 |
| Cassini, p=2, K=3 | Error | 0.003 | 0.003 | 0.002 | 0.004 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.001 | 0.001 | 0 | 0.001 |
| | Variance Effect | 0.002 | 0.002 | 0.002 | 0.003 |
| | Bias | 0.001 | 0.001 | 0 | 0.001 |
| | Variance | 0.003 | 0.003 | 0.002 | 0.003 |
| Cuboids, p=3, K=4 | Error | 0.074 | 0.0001 | 0 | 0.0002 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0 | 0 | 0 | 0 |
| | Variance Effect | 0.074 | 0.0001 | 0 | 0.0002 |
| | Bias | 0 | 0 | 0 | 0 |
| | Variance | 0.074 | 0.0001 | 0 | 0.0002 |
| XOR, p=2, K=2 | Error | 0.059 | 0.007 | 0.009 | 0.008 |
| | Bayes Error | 0 | 0 | 0 | 0 |
| | Systematic Effect | 0.002 | 0 | 0 | 0 |
| | Variance Effect | 0.057 | 0.007 | 0.009 | 0.008 |
| | Bias | 0.002 | 0 | 0 | 0 |
| | Variance | 0.059 | 0.007 | 0.009 | 0.008 |
• Inherent and effect measure correlation: As was observed by James (2003), bias and variance seem to be highly correlated with their respective effects on generalisation error. The median correlation between bias and the systematic effect in the empirical study was 93.93%, while the median correlation between the variance and variance effect was 95.61%.
In summary, Table 2 provides a win/tie analysis of the results, where a tally of wins and ties is given for each quantity and algorithm.
Table 2: Win/Tie analysis of bias, variance, systematic effect and variance effect.
| Quantity (Win/Tie) | Tree | Bagging | Random forest | Boosting |
|---|---|---|---|---|
| Error | 0/0 | 1/1 | 8/0 | 6/1 |
| Systematic Effect | 0/1 | 5/3 | 4/4 | 3/2 |
| Variance Effect | 0/1 | 1/1 | 3/2 | 10/0 |
| Bias | 0/1 | 6/4 | 3/4 | 3/3 |
| Variance | 1/0 | 1/0 | 7/0 | 7/0 |
| Total | 1/3 | 14/9 | 25/10 | 29/4 |
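
The tallying behind a win/tie analysis can be sketched as follows. The sketch assumes that a win means an algorithm alone attains the row-wise minimum and a tie means the minimum is shared, judged at the precision reported in Table 1; this rule is an interpretation, not a definition given in the paper. The input is the Error row of each of the sixteen data sets in Table 1.

```python
ALGORITHMS = ("Tree", "Bagging", "Random forest", "Boosting")

# Error rows of Table 1, one tuple per data set, in the order of ALGORITHMS.
ERROR_ROWS = [
    (0.105, 0.044, 0.038, 0.043), (0.233, 0.075, 0.061, 0.068),
    (0.368, 0.152, 0.129, 0.130), (0.427, 0.239, 0.216, 0.200),
    (0.295, 0.212, 0.214, 0.212), (0.390, 0.275, 0.272, 0.259),
    (0.452, 0.308, 0.304, 0.284), (0.455, 0.318, 0.308, 0.290),
    (0.420, 0.301, 0.292, 0.273), (0.319, 0.057, 0.033, 0.040),
    (0.392, 0.177, 0.157, 0.167), (0.300, 0.087, 0.042, 0.051),
    (0.171, 0.168, 0.168, 0.140), (0.003, 0.003, 0.002, 0.004),
    (0.074, 0.0001, 0.0, 0.0002), (0.059, 0.007, 0.009, 0.008),
]


def win_tie_tally(rows):
    """Return {algorithm: [wins, ties]} based on row-wise minima."""
    tally = {name: [0, 0] for name in ALGORITHMS}
    for row in rows:
        best = min(row)
        winners = [name for name, value in zip(ALGORITHMS, row) if value == best]
        slot = 0 if len(winners) == 1 else 1  # 0 = outright win, 1 = tie
        for name in winners:
            tally[name][slot] += 1
    return tally


for name, (wins, ties) in win_tie_tally(ERROR_ROWS).items():
    print(f"{name}: {wins}/{ties}")
```

Applied to the Error rows, this rule gives 0/0 for the tree, 1/1 for bagging, 8/0 for random forests and 6/1 for boosting, which agrees with the Error row of Table 2.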
From Table 2 it is clear that bagging was the best performer in terms of bias. This also translated into the best performance with respect to the systematic effect. Random forests and boosting were tied for top position in terms of variance; however, with respect to the variance effect, boosting outperformed all of the other algorithms. Despite this, random forests still managed to achieve the lowest error rate on the highest number of simulation configurations. This seems to indicate that a random forest performs well not because it exclusively reduces either the bias/systematic effect or the variance/variance effect. Rather, a random forest seems to be the most successful in simultaneously reducing both these quantities.
5. Conclusion
In this paper we considered bias-variance decompositions of prediction error in a classification context. In this context, several definitions of bias and variance can be found in the literature. In an empirical study, a decomposition of the prediction error into bias and variance, and into systematic and variance effects, was illustrated for several ensemble classifiers. Bagging was generally found to have the smallest bias and systematic effect. Although random forests did not perform the best in terms of any of the components, they managed to achieve the best error rate more frequently. Interestingly, boosting was found to perform well in terms of its variance effect. This is contrary to the conclusion in Bühlmann and Hothorn (2007), viz. that improvements over the bagging prediction error achieved by boosting should be attributed to a decrease in bias. In conclusion, it seems that heuristic studies similar to the one presented here may prove useful in facilitating the proposal of sensible ensemble classifiers.
Acknowledgements
The authors would like to thank the reviewers for their insightful comments which improved the overall quality of this paper. The work in this paper was partially supported by the MIH Media Lab at Stellenbosch University and the National Research Foundation (NRF) under grant number: UID 94615. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF.
References
BREIMAN, L. (1996). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.
BREIMAN, L. (2000). Randomizing outputs to increase prediction accuracy. Machine Learning, 40 (3), 229–242.
BÜHLMANN, P. AND HOTHORN, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22 (4), 477–505.
DIETTERICH, T. G. AND KONG, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
DOMINGOS, P. (2000). A unified bias-variance decomposition for zero-one and squared loss. AAAI/IAAI, 2000, 564–569.
FRIEDMAN, J. H. (1997). On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1 (1), 55–77.
GEMAN, S., BIENENSTOCK, E., AND DOURSAT, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4 (1), 1–58.
HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. (2009). The Elements of Statistical Learning. Springer.
HESKES, T. (1998). Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10 (6), 1425–1433.
JAMES, G. AND HASTIE, T. (1997). Generalizations of the bias/variance decomposition for prediction error. Dept. Statistics, Stanford Univ., Stanford, CA, Tech. Rep.
JAMES, G., WITTEN, D., HASTIE, T., AND TIBSHIRANI, R. (2013). An introduction to statistical learning. Springer.
JAMES, G. M. (2003). Variance and bias for general loss functions. Machine Learning, 51 (2), 115–135.
KOHAVI, R. AND WOLPERT, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. ICML, 96, 275–83.
LEISCH, F. AND DIMITRIADOU, E. (2010). Machine learning benchmark problems. mlbench R package.
MEASE, D. AND WYNER, A. (2008). Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9 (Feb), 131–156.
TIBSHIRANI, R. (1996). Bias, variance and prediction error for classification rules. University of Toronto, Department of Statistics.