Ensemble-based discriminant learning with boosting for face recognition.
ABSTRACT In this paper, we propose a novel ensemble-based approach to boost performance of traditional Linear Discriminant Analysis (LDA)-based methods used in face recognition. The ensemble-based approach is based on the recently emerged technique known as "boosting". However, it is generally believed that boosting-like learning rules are not suited to a strong and stable learner such as LDA. To break the limitation, a novel weakness analysis theory is developed here. The theory attempts to boost a strong learner by increasing the diversity between the classifiers created by the learner, at the expense of decreasing their margins, so as to achieve a tradeoff suggested by recent boosting studies for a low generalization error. In addition, a novel distribution accounting for the pairwise class discriminant information is introduced for effective interaction between the booster and the LDA-based learner. The integration of all these methodologies proposed here leads to the novel ensemble-based discriminant learning approach, capable of taking advantage of both the boosting and LDA techniques. Promising experimental results obtained on various difficult face recognition scenarios demonstrate the effectiveness of the proposed approach. We believe that this work is especially beneficial in extending the boosting framework to accommodate general (strong/weak) learners.
166 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 1, JANUARY 2006
Ensemble-Based Discriminant Learning
With Boosting for Face Recognition
Juwei Lu, Member, IEEE, K. N. Plataniotis, Senior Member, IEEE, A. N. Venetsanopoulos, Fellow, IEEE, and
Stan Z. Li
Index Terms—Boosting, face recognition (FR), linear discrimi-
nant analysis, machine learning, mixture of linear models, small-
sample-size (SSS) problem, strong learner.
A. Face Recognition
FACE RECOGNITION (FR) has a wide range of applications, such as face-based video indexing and browsing engines, biometric identity authentication, human-computer interaction, and multimedia monitoring/surveillance. Within the past two decades, numerous FR algorithms have been proposed, and detailed surveys of the developments in the area have appeared in the literature. Among various FR methodologies used, the most popular are the so-called appearance-based approaches, which include the three most well-known FR methods, namely Eigenfaces, Fisherfaces, and Bayes Matching. With focus on low-dimensional statistical feature extraction, the appearance-based approaches
Manuscript received March 23, 2004; revised December 24, 2004.
This work was supported in part by the Bell University Laboratories at the
University of Toronto.
J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos are with The Edward S.
Rogers Sr. Department of Electrical and Computer Engineering, University of
Toronto, ON M5S 3G4 Canada (e-mail: firstname.lastname@example.org).
Stan Z. Li is with the Center for Biometrics and Security Research, Institute
of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China.
Digital Object Identifier 10.1109/TNN.2005.860853
generally operate directly on appearance images of the face object and process them as two-dimensional (2-D) holistic patterns to avoid difficulties associated with three-dimensional (3-D) modeling, and shape or landmark detection. Of the appearance-based FR methods, those based on linear discriminant analysis (LDA) have shown promising results. However, statistical learning methods such as the LDA-based ones often suffer from the so-called “small-sample-size” (SSS) problem, encountered in high-dimensional pattern recognition tasks where the number of training samples available for each subject is smaller than the dimensionality of the samples. For example, in the experiments reported here only [...] training samples per subject are available while the dimensionality of the sample space is up to [...]. In addition, the performance of linear appearance-based methods including LDA often deteriorates rapidly when face patterns are subject to large variations in viewpoint, illumination, or facial expression. These variations result in a highly nonconvex and complex distribution of face images. Thus, the limited success of these methods should be attributed to their linear nature.
In general, a nonconvex distribution can be handled either by globally nonlinear models or by a mixture of locally linear models (or ensemble-based methods, as they are known in the machine learning literature). Globally nonlinear methods are not without problems. Approaches such as those based on kernel machines require the optimization of many design parameters, tend to overfit easily due to the increased algorithmic complexity, and are computationally expensive compared to their linear counterparts. The last point is particularly [...] performed in a high-dimensional input space. On the other hand, ensemble-based approaches embody the principle of “divide and conquer,” by which a complex recognition task is decomposed into a set of simpler ones, in each of which a locally linear pattern distribution can be generalized and dealt with by a relatively simple linear solution. As such, the ensemble-based methods are simpler, easier to implement, and more cost effective [...] ensemble-based FR methods are developed based on traditional cluster analysis. As a consequence, a disadvantage for classification tasks is that the submodels’ division/combination criteria used in these clustering techniques are not directly related to the classification error rate (CER) of the resulting classifier, especially the true CER (often referred to as the generalization error rate).
1045-9227/$20.00 © 2006 IEEE
LU et al.: ENSEMBLE-BASED DISCRIMINANT LEARNING WITH BOOSTING FOR FACE RECOGNITION 167
B. Ensemble-Based Learning With Boosting
Recently, a machine-learning technique known as “boosting” has received considerable attention in the pattern recognition community, due to its usefulness in designing ensemble-based [...] employ a base classifier on a weighted version of the training sample set to generalize a set of classifiers of its kind. Often the base classifier is also called the “learner.” These weights are updated at each iteration through a classification-error-driven mechanism. Although any individual classifier produced by the learner may perform only slightly better than random guessing, the formed ensemble can provide a very accurate (strong) classifier. It has been shown, both theoretically and experimentally, that boosting is particularly robust in preventing overfitting and reducing the generalization error by increasing the so-called margins of the training examples. The margin is defined as the minimal distance of an example to the decision surface of classification. For a classifier, a larger expected margin of training data generally leads to a lower generalization error. Since its introduction, AdaBoost became known as the most [...]
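The reweighting loop described in this subsection can be sketched as follows. This is a minimal illustration of standard binary AdaBoost, assuming labels in {-1, +1} and threshold “stumps” as the weak learner; it is not the multiclass method proposed in this paper, and all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    # Predict +1 where pol * (x_j - thr) >= 0, and -1 elsewhere.
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    # Exhaustively pick the stump with the lowest weighted error.
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # uniform initial sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # vote of this weak classifier
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * y * pred)      # raise weights of mistakes
        w = w / w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)
```

On a toy pattern such as `X = [[0], [1], [2]]`, `y = [1, -1, 1]`, no single stump separates the classes, while a three-round ensemble does, which is the “weak learners, strong ensemble” effect referred to above.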
However, the machine-learning community generally regards ensemble-based learning rules, including boosting and bagging, as not suited to a strong and stable learner such as LDA. The reason behind this belief is that the effectiveness of these rules depends, to a great extent, on the learner’s “instability,” which means that small changes in the training set could cause large changes in the resulting classifier. On the other hand, it has been found in practical applications that boosting may fail given a too weak learner. In recent boosting studies, Murua introduced a useful notion of weak dependence between classifiers constructed with the same training data, and proposed an interesting upper bound on the [...] and their dependence. Murua’s bound reveals that to achieve a low generalization error, the boosting procedure should not only create the classifiers with large expected margins, but also keep their dependence low or weak. This suggests in theory that there exists a tradeoff between the large margins and the [...]
The requirement for an appropriately weak learner significantly restricts the applicability of the boosting algorithms in [...] recognition methods involve the utilization of a strong learner. Therefore, it is highly desirable to improve the traditional boosting frameworks, so that they are capable of accommodating more general learners in both the pattern recognition and machine learning communities.
C. Overview of the Contributions
[...] overcome the limitation of the weak learners, which are necessary in existing boosting algorithms. To this end, a new variable called “learning difficulty degree” (LDD) is introduced along with a cross-validation method. They are used to analyze and appropriately regulate the weakness of the classifiers generalized by a strong learner via the training data. In addition, a new loss function with respect to the LDD is proposed to quantitatively estimate the generalization power of these produced classifiers. This is achieved in the loss function by balancing the averaged empirical error of the classifiers and their mutual dependence. These are two key factors in the generalization error of the formed ensemble classifier, as shown in Murua’s theory.
The proposed weakness analysis theory is applied to boost the performance of the traditional LDA-based approaches in complex FR tasks. Thus, the learners in this work are the LDA-based ones, which differ from the traditional learners used in boosting in two respects: 1) they are rather strong and stable, and 2) they are feature extractors rather than pure classifiers. The latter makes this work similar in spirit to those of Viola, Tieu and Jones, where the boosting process is viewed as a feature selection process. In particular, to boost the specific LDA-based learners, a new variable called “pairwise class discriminant distribution” (PCDD) is also introduced to build an effective interaction mechanism between the booster and the learner. As a result, a novel ensemble-based discriminant learning method is developed here under the boosting framework through the utilization of the PCDD and the weakness analysis theory. In the proposed method, each round of boosting generalizes a new LDA subspace particularly targeting those examples from the hard-to-separate pairs of classes indicated by its preceding PCDD, so that the separability between these classes is enhanced in the new LDA subspace. The final result obtained by the process is an ensemble of multiple relatively weak but very specific LDA solutions. The ensemble-based solution is able to take advantage of both boosting and LDA. It is shown by the FR experiments to outperform the single solutions created by the LDA-based learners in various difficult learning scenarios, which include the cases with different SSS settings and the case with increased nonlinear variations.
The rest of the paper is organized as follows. In Section II, we briefly review the AdaBoost approach and its multiclass extensions. Then, in Section III, the theory and algorithm of how to boost a LDA-based strong learner are introduced and described in detail. Section IV reports on a set of experiments conducted on the FERET face database to demonstrate the effectiveness of the proposed methodologies. Finally, conclusions are summarized in Section V. In addition, a brief introduction to the adopted LDA-based learners is given in Appendix I.
II. RELATED WORK
Since the boosting method proposed here is developed from AdaBoost, we begin with a brief review of the algorithm and its multiclass extensions.
In the case of pattern classification, the task of learning from examples can be formulated in the following way: Given a training set containing a number of classes, with each class consisting of a number of examples and their corresponding class labels, a total of [...] examples are available in the set. Let [...] be the label set. Taking as input such a set, the objective of learning is to estimate a function or classifier that will correctly classify unseen examples.
To this end, AdaBoost works by repeatedly applying a given weak learner to a weighted version of the training set in a series of rounds, and then linearly combining the weak classifiers constructed in each round into a single classifier. The most interesting feature of AdaBoost is its surprising ability to reduce the amount of overfitting and the generalization error of classification, even as the number of rounds becomes large. To explain the property, quite a number of perspectives [...] AdaBoost to be an efficient method for maximizing the margin. However, many researchers have shown that the margin theory provides only partial answers to the puzzle. As a result, AdaBoost still remains a somewhat mysterious algorithm, and explaining its behavior is considered one of the most important unsolved problems in machine learning. On the other hand, the limitation in the theoretical explanation does not seem to hamper the success of AdaBoost-style approaches in practical applications. For example, Viola and Jones built the first real-time face detection system by using AdaBoost, which is considered a dramatic breakthrough in face detection research.
Fig. 1. Algorithm of boosting a LDA-style learner (simply replacing the learner with JD-LDA or EFM in step 3 to obtain B-JD-LDA or B-EFM). Either the PCDD or its modified variant can be used to replace each other during the boosting process.
AdaBoost was originally developed to support binary classification tasks. Its multiclass extensions include two variants, AdaBoost.M1 and AdaBoost.M2. AdaBoost.M1 is the most straightforward generalization. However, the algorithm halts if the classification error rate (CER) of the weak classifier produced in any iterative step exceeds 50%. Research indicates that this limitation often terminates the procedure too early, resulting in insufficient classification capabilities. To avoid the problem, rather than the ordinary CER, AdaBoost.M2 attempts to minimize a more sophisticated error measure called “pseudoloss,” which is expressed as (1) [...], where [...] (see steps 7 and 8 of Fig. 1 for definition) is the so-called “mislabel distribution” defined over the set of all mislabels [...]. With the pseudoloss, the boosting process can continue as long as the weak classifier produced has pseudoloss slightly better than random guessing. In addition, the introduction of the mislabel distribution enhances the communication between the learner and the booster. In this way, AdaBoost.M2 can focus the learner not only on hard-to-classify examples, but more specifically, on the incorrect labels. For all these reasons, we develop the ensemble-based discriminant algorithm proposed in the next section following the AdaBoost.M2 paradigm.
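As an illustration of the pseudoloss, the sketch below follows the usual AdaBoost.M2 definition from the boosting literature: half the sum, over every example i and every incorrect label y, of D(i, y) · (1 − h(x_i, y_i) + h(x_i, y)). The symbols D and h, the array layout, and the function name are assumptions made for this example, since the paper's own equation (1) is not legible here.

```python
import numpy as np

def pseudoloss(D, h, y):
    """AdaBoost.M2-style pseudoloss (illustrative sketch).

    D : (N, C) mislabel distribution; D[i, y[i]] is ignored, the rest sums to 1.
    h : (N, C) plausibility scores in [0, 1] for each example/label pair.
    y : (N,) integer true labels in 0..C-1.
    """
    n = len(y)
    rows = np.arange(n)
    h_true = h[rows, y][:, None]        # h(x_i, y_i), broadcast over labels
    terms = D * (1.0 - h_true + h)      # contribution of each (i, label) pair
    terms[rows, y] = 0.0                # only incorrect labels are mislabels
    return 0.5 * terms.sum()
```

Under this definition a perfect classifier scores 0, an uninformative one (equal plausibility for every label) scores 0.5, and boosting can proceed whenever the value stays below 0.5.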
There are two LDA-based FR approaches (or learners) that are boosted in this work. One is the so-called “Enhanced Fisher LDA Model” (hereafter EFM), and the other is called “Revised Direct LDA” (hereafter JD-LDA), proposed by the authors recently. The EFM method is an improvement of the Fisherfaces method, while the JD-LDA method is a LDA variant introduced specifically for face recognition in high-dimensional, small-sample-size scenarios. For completeness, the details of the two learners are described in Appendix I. Compared to traditional learners used in the boosting algorithms, two points about the LDA-based learners should be emphasized again. 1) They are strong and stable learners, which can be successfully used as stand-alone procedures in FR tasks. That obviously contradicts the general belief that boosting solutions should operate only on top of weak learners. 2) The EFM or JD-LDA learner is composed of a LDA-based feature extractor and a nearest center classifier. As can be seen in Appendix I, the learning focus of such a learner is on the feature extractor rather than the classifier. This differs from the original boosting design, where the weak learners are used only as pure classifiers without concern for feature extraction. It makes the AdaBoost learning tend toward an adaptive feature selection process, an idea also seen in related work. Therefore, accommodating a learner such as JD-LDA or EFM requires a generalized boosting framework, which is not restricted by the assumption of the weak learner availability. To highlight these differences, we call the more general classifier produced by the LDA-based learners a “gClassifier” in the rest of the paper.
III. BOOSTING A LDA-STYLE LEARNER
A. Interaction Between the LDA Learner and the Booster
To boost a learner, we first have to build a strong connection between the learner and the boosting framework. In AdaBoost, this is implemented by manipulating the so-called “sample distribution,” which is a measure of how hard it is to classify an example. However, we need a more specific connecting variable in this work, given that LDA is by nature a feature extractor, whose goal is to find a linear mapping that enhances the between-class separability of the samples under learning. For this purpose, a new distribution called “pairwise class discriminant distribution” (PCDD) is introduced here. Defined on any one pair of classes, the PCDD can be computed at the t-th iteration as (2) [...]. As it is known from the AdaBoost.M2 developments, the mislabel distribution indicates the extent of difficulty in distinguishing an example from an incorrect label [...] from the preceding gClassifiers. Thus, the PCDD can be intuitively considered as a measure of how important it is to discriminate between the two classes of a pair when designing the current gClassifier. Obviously, a larger value of the PCDD implies worse separability between the two classes. It is, therefore, suitable to drive a LDA-based learner through the PCDD, so that it is focused specifically on the hard-to-separate pairs of classes. To this end, rather than the ordinary definition of the between-class scatter matrix, which is built from the mean of each class and the average of the ensemble, we introduce a variant of it, which can be expressed as (3) [...].
It should be noted at this point that this variant embodies the design principle behind the so-called “fractional-step” LDA. According to this principle, object classes that are difficult to separate in the low-dimensional output spaces generalized in previous rounds can potentially result in misclassification. Thus, they should be paid more attention by being more heavily weighted [...] so that their separability is enhanced in the resulting feature space. It can be easily seen that the variant reduces to the ordinary definition when the PCDD is equal to a constant.
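The two ideas above, aggregating the mislabel distribution into pairwise class weights and using those weights in a between-class scatter matrix, can be sketched as follows. The exact forms of equations (2) and (3) are not legible here, so the aggregation rule (a symmetric average of the mislabel mass exchanged between two classes), the unnormalized pairwise scatter, and all names are assumptions chosen only to illustrate the principle.

```python
import numpy as np

def pcdd(mislabel, y, n_classes):
    """Pairwise class weights from an (N, C) mislabel distribution.

    Assumed form: A[p, q] is half the mass that class-p examples place on
    label q plus half the mass that class-q examples place on label p.
    Labels in y are assumed to be the integers 0..n_classes-1.
    """
    M = np.zeros((n_classes, n_classes))
    for p in range(n_classes):
        M[p] = mislabel[y == p].sum(axis=0)   # mass class p puts on each label
    A = 0.5 * (M + M.T)
    np.fill_diagonal(A, 0.0)                  # a class is not paired with itself
    return A

def pairwise_between_scatter(X, y, A):
    """Sum over class pairs p < q of A[p, q] * (m_p - m_q)(m_p - m_q)^T."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    S = np.zeros((X.shape[1], X.shape[1]))
    for p in range(len(classes)):
        for q in range(p + 1, len(classes)):
            diff = means[p] - means[q]
            S += A[p, q] * np.outer(diff, diff)
    return S
```

With constant weights, this pairwise form is proportional to the ordinary scatter of the class means around their average, which mirrors the remark that the variant reduces to the ordinary definition when the PCDD is constant.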
[...] can be given as follows: (4) [...], where [...] is defined over the training set as the sample distribution, similar to the one given in AdaBoost. Since it is derived indirectly from the pseudoloss, we call it a “pseudo sample distribution” for the purpose of distinction. It can be seen that a larger value of the pseudo sample distribution implies a harder-to-classify example for those preceding gClassifiers.
Recently, it has been shown that to achieve a low generalization error, the boosting procedure should not only create classifiers with large expected margins, but also keep their dependence low or weak. Obviously, classifiers trained with more overlapping examples will result in stronger dependence among them. A way to avoid building similar gClassifiers repeatedly is to artificially introduce some randomness in the construction of the training data. To this end, a modified PCDD is introduced as (5) [...]. As a result of using the modified PCDD instead of the original one, it can be seen that only those subject sets that include the examples mislabeled by the last gClassifier are contributing to the construction of the current [...]. By manipulating [...], we can reduce the extent of overlapping between the training examples used to build different gClassifiers, and thus achieve the goal of weakening the dependence among these gClassifiers. Also, this has the effect of forcing every gClassifier to focus only on the hard-to-separate pairs of classes suggested by its preceding gClassifier, resulting in a more diverse committee of gClassifiers to be generalized in the end. On the other hand, the classification ability of the individual gClassifier is weakened to some extent due to fewer training examples involved in its construction. This weakening results in a decrease in the examples’ margins. However, it should be noted at this point that there appears to be a tradeoff between weak dependence and large expected margins to achieve a low generalization error. Our experimentation indicates that in some cases, the utilization of the modified PCDD may yield a better balance than that of the original PCDD, improving the classification performance.
Based on the introduction of [...], we can now give a new boosting solution, depicted in Fig. 1, from which it can be seen that the LDA-style learner at every iteration is tuned to conquer a particular subproblem generalized by the feedback [...] in a manner similar to “auto-[...] subspaces by weighted linear combination. Either JD-LDA or EFM can be adopted as the LDA learner in step 3 during the boosting process. In the remainder of the paper, we call the algorithm utilizing JD-LDA “B-JD-LDA,” while “B-EFM” indicates the one employing EFM.
B. A Cross-Validation Mechanism to Weaken the Strong
As we mentioned earlier, JD-LDA or EFM itself is a rather strong and stable learner in terms of classification ability. As a consequence, two problems are often encountered: 1) the gClassifiers created exhibit a high similarity or mutual dependence, given the same training data; 2) a pseudoloss of [...] is often obtained, halting the boosting process too early. To solve the problems, we have to artificially weaken the gClassifiers and increase their diversity accordingly. Generally speaking, the learning capacity of any LDA-like algorithm is directly proportional to the number of training examples per subject, and reciprocally proportional to the number of the subjects. Combining the two factors, we can define a variable called Learning Difficulty Degree (LDD), [...], to roughly estimate the degree of difficulty for the discriminant learning task on hand. It should be noted that the average number of training examples per subject is considered, as subjects are allowed to have different numbers of training examples. [...] value implies a more difficult learning task. In other words, if a learner is trained with different sample sets, the classification strength of the obtained gClassifiers will be different: a sample set with a smaller LDD value leads to a weaker gClassifier. Thus, from the training data point of view, the LDD provides a qualitative measure of the weakness of the gClassifiers created by the same learner. For the purpose of distinguishing the two meanings, we denote the LDD as [...] when it is used to express the degree of difficulty for a learning [...] denotes the weakness of a gClassifier.
Fig. 2. Flow chart of the cross-validation mechanism embedded in the proposed boosting framework to weaken the LDA-style learner. The flow chart is based on one iteration, and the NCC denotes the nearest center classifier.
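Reading the verbal definition of the LDD literally (proportional to the average number of training examples per subject, inversely proportional to the number of subjects), one plausible form is the ratio of the two. The sketch below computes that ratio; the ratio itself and the function name are assumptions, since the displayed definition is not legible here.

```python
from collections import Counter

def learning_difficulty_degree(labels):
    """Assumed LDD: (average examples per subject) / (number of subjects)."""
    counts = Counter(labels)                  # examples per subject
    n_subjects = len(counts)
    avg_per_subject = sum(counts.values()) / n_subjects
    return avg_per_subject / n_subjects
```

For example, four subjects with three examples each give 3 / 4 = 0.75, while the same four subjects with one example each give 0.25, which matches the statement that a smaller value indicates a harder learning task.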
Based on the above analysis, we can introduce into the proposed boosting framework the cross-validation mechanism depicted in Fig. 2. With the mechanism in place, only a subset of the entire training set is used to train the LDA-style learner. The subset is formed in each iteration by choosing the [...] hardest-to-classify examples per class based on current values of [...]. Please note that [...] denotes the size of [...]. Subsequently, the obtained LDA [...] (see Appendix I for details) are used to build a gClassifier based on the nearest center rule. The gClassifier is applied to the entire training set, including those examples unseen by the learner. All the variables defined on the training set, such as [...] (or [...]), are then reported and used in the next iteration. The detailed implementation steps of the mechanism have been embedded in Fig. 1.
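The subset-selection step of this mechanism can be sketched as below. It assumes that “hardest-to-classify” means the largest per-example weight (for instance the pseudo sample distribution) and that the per-class subset size is a parameter `r`; both the parameter name and the function name are hypothetical.

```python
import numpy as np

def select_hardest_per_class(weights, labels, r):
    """Return sorted indices of the r highest-weight examples in each class."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)                  # members of class c
        order = np.argsort(-weights[idx], kind="stable")   # hardest first
        chosen.extend(idx[order[:r]].tolist())
    return sorted(chosen)
```

The examples left out of the returned subset are the ones the learner never sees in that round, so evaluating the resulting gClassifier on the whole training set also measures it on held-out data.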
It can be seen that under the proposed strategy, the LDD value of the sample set used to train the strong learner decreases [...] in each iteration. [...] weakens the gClassifiers produced by the learner. At the same time, since each iteration feeds the learner a different subset of the entire training set, this essentially increases the diversity among these gClassifiers. Also, it should be added at this point that one of the side effects of using only [...] examples per subject during the construction of each gClassifier is obtaining a better estimate of the pseudoloss. This is achieved by using what Leo Breiman calls the “out-of-bag” samples (those samples not used during the training of the classifier) to estimate the error rate. Hence, finding the optimal [...] also provides a balance between good classifier performance and an improved estimate of the misclassification.
C. Estimation of Appropriate Weakness
hances the strength of the proposed boosting algorithm, but also
raises the problem of model selection, that is, the determination
of the optimal
. As we know from the analysis in
last section, a smaller/larger
a weaker/stronger gClassifier, given the same learner. However,
boosting may fail when either too weak (e.g.,
. Consequently, we can conjecture that a gClassifier with
appropriate weakness should have a
and . Intuitively, it is reasonable to further assume that a
a learner, trained on a smaller fraction of the training set i.e., a
, should generalize a weaker but more diverse
committee of gClassifiers with each one having a more honest
estimate of misclassification. Thus, a sort of loss function with
respectto thatbalances thetwofactorscanbe usedtodrivethe
model selection process. The proposed here function is defined
value will equivalently lead to
) or too
value in between
by applying the gClassifier
regularization parameter that controls the tradeoff between the
weakness and the diversity of the gClassifiers. It can be seen
that the tradeoff embodied in (6) implements the design princi-
ples described earlier in the sense that in order to compensate
for high empirical error, the gClassifiers should have low mu-
tual dependence, and vice versa. With the introduction of the
loss, determining the set of gClassifiers with the optimal
be seen in the experiments reported here, the estimation results
look rather accurate across various settings of the
In this paper, the weakness analysis theory, including the
cross-validation mechanism of weakening a strong learner and
is the empirical CER obtained
constructed byto the
is a , and
the subsequent estimation method of appropriate weakness,
is developed for LDA-style learners. However, it can be
seen from the preceding presentation that both methods
depend only on the training set, where each subject is
required to have at least two examples. As a result, a traditional
boosting framework enhanced with the weakness analysis
theory is applicable to any general (weak/strong)
learner. This offers a promising approach to
breaking the traditional limitation of weak learners in the
IV. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed boosting
methodology by applying it to a challenging pattern classifi-
cation task, namely face recognition. Due to space limitations,
only the results of B-JD-LDA and B-EFM with the PCDD
are reported here (The interested readers can refer to
 for the results obtained with
A. The Face Database and FR Evaluation Design
To show the high complexity of the face pattern distribution,
the two evaluation databases used in the experiments are taken
from the well-known FERET database, which has been consid-
ered the largest, most comprehensive and representative face
database to be used for evaluating the state-of-the-art in face
recognition , . The evaluation databases are constructed
in two stages. First, for the purpose of preprocessing, we find
in the FERET database all grayscale face images that are sup-
plied along with the coordinate information of eyes, nose tip
and mouth center to form a set
3817 face images of 1200 subjects. Subsequently, the first
evaluation database denoted as
all (1051) images of 104 subjects with each one having at
tasks, ranging from
the corresponding performance changes of the boosting algo-
rithms. Similarly, the second evaluation database denoted as
(including) is constructed by choosing all (1703) images of
256 subjects in
with at least four images per subject. This
is utilized to test the learning capacity of the al-
gorithms as the size of the evaluation database becomes larger.
The details of the images included in
can be found in , .
The original images in the FERET database are raw face images
that include not only the face but also data irrelevant to
face recognition, such as hair, neck, shoulders, and background,
as shown in Fig. 3: Left. To avoid incorrect evaluations
, we follow the preprocessing sequence recommended in
and scaled (to size 150
130) so that the centers of the eyes are
Middle is applied to remove the nonface portions; 3) histogram
equalization is performed in the masked facial pixels; 4) face
data are further normalized to have zero mean and unit standard
deviation. Figs. 3: Right and 4 depict some examples after the
. The setcontains in total
is formed by choosing in the
to, to study
andare depicted in
172 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 1, JANUARY 2006
TABLE I. NUMBER OF IMAGES DIVIDED INTO THE STANDARD FERET IMAGERY
CATEGORIES IN THE EVALUATION DATABASES, AND THE POSE ANGLE
(DEGREE) OF EACH CATEGORY
Fig. 3. Left: Original samples in the FERET database. Middle: The standard
mask. Right: The samples after preprocessing.
Fig. 4. Examples for six subjects drawn from the two normalized FERET
subsets, ? and ?.
preprocessing sequence is applied. For the computational pur-
pose, each image is finally represented as a column vector of
Following standard FR practices , the database
is randomly partitioned into two subsets: The
and test set. The training set is composed of
to form the test set
. Any FR method evaluated here
is first trained with
, and the resulting face recognizer is then
below are averaged over five runs. Each run is executed on such
a random partition of the database
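The evaluation protocol described above, a random per-subject split into training and test sets with results averaged over five runs, can be sketched as follows. This is a minimal illustration under stated assumptions: the names `partition_per_subject`, `average_cer`, `n_train`, and the `evaluate` callback are hypothetical, and the fixed seed is used only to make the sketch reproducible.

```python
import random
from collections import defaultdict


def partition_per_subject(labels, n_train, rng):
    """Randomly pick n_train samples of each subject for training.

    The remaining samples of each subject form the test set.
    """
    by_subject = defaultdict(list)
    for index, subject in enumerate(labels):
        by_subject[subject].append(index)

    train, test = [], []
    for subject in sorted(by_subject):
        indices = list(by_subject[subject])
        if len(indices) <= n_train:
            raise ValueError(f"subject {subject!r} has too few samples")
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


def average_cer(labels, n_train, evaluate, runs=5, seed=0):
    """Average the error rate returned by evaluate(train, test) over several runs."""
    rng = random.Random(seed)
    rates = []
    for _ in range(runs):
        train, test = partition_per_subject(labels, n_train, rng)
        rates.append(evaluate(train, test))
    return sum(rates) / len(rates)
```

Here `evaluate` stands for any routine that trains a recognizer on the training indices and returns its CER on the test indices.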
B. The Comparison of FR Performance in Terms of CER
Besides the two proposed boosting methods, B-JD-LDA
and B-EFM, their corresponding stand-alone JD-LDA and
EFM methods (without boosting) were also run to measure
the improvement brought by boosting. In addition, three FR
algorithms, the Eigenfaces method , the Fisherfaces method
 and the Bayes matching method , were implemented
to provide performance baselines. Both Eigenfaces and Fisherfaces
are considered to be among the most cited and influential
FR algorithms , while the Bayes method was the top performer
in the 1996/1997 FERET competitions .
COMPARISON OF THE CERs (%) AS A FUNCTION OF ?? ????? ???? OBTAINED
ON THE DATABASE ?
The first experiment conducted on
the sensitivity of the CER measure to
learning tasks arising from different database partitions) and
(i.e., various weakness extents of gClassifiers in each
task). For all the seven methods compared here, the CER is a
function of the number of extracted feature vectors,
the number of available training examples per subject,
addition, the performance of B-JD-LDA and B-EFM is affected
by , the number of examples per subject that is used to control
the weakness of the produced gClassifiers during the boosting
process. Considering the huge computational cost, we simply
fixed the feature number
for B-EFM rather than seek their respective optimal
maximal iteration number used in boosting was set as
beyond which it was empirically observed that boosting was
very likely to overfit. The lowest CERs finally obtained by
the seven methods under various settings of
are depicted in Tables II and III, where
CER of B-JD-LDA or B-EFM with the best found iteration
five nonboosting methods with the best found feature number
. All these variables have been averaged over five runs
as we mentioned earlier. To further facilitate the comparison
of boosting performance, we define a quantitative statistic
regarding the biggest CER improvement achieved by boosting,
anddenote the CERs of a boosting-based
method (B-JD-LDA or B-EFM) and its corresponding
nonboosting version (JD-LDA or EFM), respectively, and
. The results are summarized in
Table IV, from which it can be clearly seen that B-JD-LDA and
of JD-LDA and EFM, respectively, across various SSS learning
is designed to test
(i.e., various SSS
for B-JD-LDA and
denotes the CER of the
LU et al.: ENSEMBLE-BASED DISCRIMINANT LEARNING WITH BOOSTING FOR FACE RECOGNITION 173
THE CERs, ? ? ?? ?, OBTAINED BY THE THREE BENCHMARK METHODS ON ?
THE BIGGEST CER IMPROVEMENT ACHIEVED BY B-JD-LDA AND
B-EFM IN VARIOUS TASKS
the performance enhancement is more pronounced in the more severe
SSS scenarios. This demonstrates the effectiveness of the two
boosting approaches against the SSS problem. The biggest
%, is achieved by B-JD-LDA when
in the most difficult task
The second experiment conducted on
the CER performance changes as the size of the evaluation
dataset increases. Since the two boosting methods require at
least three training samples per subject, we are allowed to
create only one partition case from
to an SSS learning task with
the lowest CERs obtained by the seven methods are shown in
Table V. It can be seen from these results that a stable boosting
performance is achieved by both boosting approaches. The
goes up to
B-JD-LDA and B-EFM, respectively. The results indicate
only a slightly better boosting performance compared to that
achieved under the assumption
iment, although in theory a higher performance improvement
is expected when more pattern variations are introduced. The
reason may be explained by the fact that most new samples
added to the database
come from the fa and fb sets, the
simplest categories in the FERET database, as shown in Table
I. As a result, the performance margins between different
methods are reduced to some extent.
In both experiments, Eigenfaces is the worst performer
among the seven methods. From the results delivered by
the most popular benchmark method, we can roughly learn how
difficult it is to conduct face recognition on the two evaluation
and. Also, it is of interest to compare the perfor-
Published results indicate that the latter generally outperforms,
ysis (K-PCA) or Independent Component Analysis techniques
(ICA) by a margin of at least ten percent . However, as it
superior to the Bayes method. In particular, B-JD-LDA outperforms the
state-of-the-art method by up to 4.38% and 4.91% in the two most
difficult learning tasks,
to . Particularly,
is designed to test
, i.e., , which leads
% and% for
in the first exper-
the first experiment. Although we admit that our implementation
of the Bayes method¹ may not be as good as the original
implementation of Moghaddam et al. , this comparison still
provides a promising perspective: It is possible to boost a tradi-
tional FR algorithm to the state-of-the-art level under the pro-
posed framework. Moreover, it should be mentioned again at
this point that unlike the five nonboosting methods, we did not
seek the CERs with the optimal
only suboptimal. We expect that a higher boosting performance
gain can be obtained when a better
values for the two boosting
, as a substitute for, is
value is used.
C. Weakness Analysis of the gClassifiers
As mentioned earlier, the proposed boosting approaches
would, in theory, fail to perform well when the gClassifiers
are too weak or too strong. This can be observed
experimentally in the example shown in
Fig. 5, where the results are obtained by B-JD-LDA in the task
. In this example, the weakest and strongest
gClassifiers are produced by B-JD-LDA when
respectively. However, the generalization performance of the
former is only slightly better than that of the single JD-LDA,
while the latter tends to overfit quickly, although it yields the
lowest training error. In contrast with this, appropriately weak
gClassifiers are produced when
cases, it can be seen from Fig. 5 that the B-JD-LDA exhibits the
beautiful property of boosting: The test CER is continuously
improved, even long after the training error has dropped
to zero. Similar phenomena have also been observed with the
Based on the theory developed in Section III-C, the gClassi-
fiers with the best weakness or the optimal
by minimizing a generalization loss function
spect to , i.e.,
curacy of the method, we applied the loss function to the var-
ious learning tasks designed in the first experiment. The ob-
tained results including
are depicted in Tables VI and VII for B-JD-LDA and B-EFM,
respectively, where the values of
should be mentioned here that it is not a difficult task to find an
value within [0, 1]. In fact, our experiments reveal
that there exists a range of
values which produce the same esti-
mation for the preference rankings of the
for B-JD-LDA found in the experiment. Com-
are used. In these
can be found
(6) with re-
. To test the estimation ac-
, and the worstvalue
were found empirically. It
values, for example,
¹ To reduce the effect of reimplementation-related issues, we use the maximum
likelihood (ML) version of the Bayes method instead of the maximum a posteriori
(MAP) version. The former is much easier to implement than the
latter, and there is only a very slight performance difference between the
ML and MAP versions as shown in the works of .
COMPARISON OF THE CERs (%) OBTAINED ON THE DATABASE ?
Fig. 5. Training and test CERs of B-JD-LDA with varying weakness extents of gClassifiers as a function of ? in the task ? ??? ? ?????. The CER of JD-LDA
is the one obtained with ? ? ? .
THE GENERALIZATION LOSS ?????? WITH ? ? ????, THE BEST ? ESTIMATE
?? ? AND THE WORST ? ESTIMATE ?? ? OBTAINED BY B-JD-LDA
ON THE DATABASE ?
be seen that the values of the loss correctly indicate the optimal
, the worst, and even the goodness of the
them in most cases, for example, the second, third, and fourth
THE GENERALIZATION LOSS ?????? WITH ? ? ????, THE BEST ?
ESTIMATE ?? ? AND THE WORST ? ESTIMATE ?? ? OBTAINED BY B-EFM
ON THE DATABASE ?
D. Some Discussions on the Convergence of Boosting
From Fig. 5, it can be seen that given an appropriately weak
learner, the generalization error of
becomes large even long after the training error reaches
often continues to drop
zero. However, the phenomenon also leads to the difficulty in
determining when the boosting procedure should be stopped in
order to avoid possible overfitting.
Considering the relationship between boosting and the
margin theory, intuitively, it is reasonable to use the cumulative
margin distribution of the training examples as an indicator to
roughly estimate an appropriate value of
can observe the changes of the margins of the training examples
at every boosting iteration, and consider it convergent when
the margins of most training examples stop increasing or are
increasing slowly. Our experiments indicate that this approach
works well in many cases (see  for details). However, as
mentioned earlier, the margin theory alone is insufficient to
explain the behaviors of boosting , , . It is therefore
unrealistic to expect that the heuristic approach can accurately
estimate the optimal value of
our experiments that JD-LDA with the best found
yielded much better cumulative margin distributions than its
boosting version .
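The margin-based stopping heuristic described above can be sketched as follows: boosting is regarded as converged once the margins of most training examples have stopped increasing or are increasing only slowly. This is a minimal illustration, not the procedure used in the experiments. The function names and the values of `tolerance` and `quorum` are hypothetical and would need tuning in practice.

```python
def margins_have_converged(previous, current, tolerance=0.01, quorum=0.9):
    """Return True when most margins grew by no more than `tolerance`.

    `previous` and `current` hold the margin of every training example
    at two consecutive boosting iterations.
    """
    if len(previous) != len(current) or not current:
        raise ValueError("margin lists must be non-empty and of equal length")
    settled = sum(
        1 for before, after in zip(previous, current) if after - before <= tolerance
    )
    return settled / len(current) >= quorum


def first_converged_iteration(margin_history, tolerance=0.01, quorum=0.9):
    """Index of the first iteration whose margins look converged, else None."""
    for t in range(1, len(margin_history)):
        if margins_have_converged(
            margin_history[t - 1], margin_history[t], tolerance, quorum
        ):
            return t
    return None
```

As noted above, the margin theory alone does not fully explain the behavior of boosting, so a rule of this kind gives only a rough estimate of when to stop.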
Also, compared to many other boosting methods that usu-
ally need hundreds of iterations, it should be noted that only
iterations are required to find an excellent
result using the B-JD-LDA and B-EFM algorithms in the ex-
periments reported in Tables II and V. Considering that each
compared to those nonboosting methods, such
a computational cost is affordable for most existing personal
. In other words, we
. For example, it is found in
In this paper, a novel weakness analysis theory has been
developed to overcome the limitation of the weak learners in
traditional boosting techniques. The theory proposed here is
composed of a cross-validation mechanism of weakening a
strong learner and a subsequent estimation method of appro-
priate weakness for the classifiers created by the learner. With
the introduction of the weakness analysis theory, a traditional
boosting algorithm can be used to work effectively with a gen-
eral (strong or weak) learner. To demonstrate the effectiveness,
the new boosting framework is applied to two strong LDA-style
learners, which are generally believed to be rather difficult to
boost. To this end, a novel variable, the pairwise class
discriminant distribution, is introduced to build an effective
connection between the booster and the learners. As a result,
two novel ensemble-based discriminant learning methods,
B-JD-LDA and B-EFM, are introduced. By manipulating the
boosting process, a set of specific LDA feature spaces can be
constructed effectively in a manner similar to “automatic
gain control.” Unlike most traditional mixture models of linear
subspaces that are based on cluster analysis , these LDA
subspaces are generalized in the context of classification error
proaches including boosting power, estimation accuracy of the
loss function, and robustness against the overfitting and SSS
problems has been demonstrated through the FR experimenta-
tion performed on the FERET database. It is further anticipated
that in addition to JD-LDA and EFM, other existing traditional
face recognizers such as those based on PCA or ICA techniques
posed boosting framework.
TWO LDA-STYLE LEARNERS: EFM AND JD-LDA
In face recognition applications, each sample
image, represented as a column vector of length
denotes the-dimensional real space. A LDA-style learner
and, by optimizing the Fisher’s discriminant
is a face
is the image size, and
matrices of the training set, respectively. However, the estima-
tion for either
oris extremely ill-posed due to the SSS
problem in most FR tasks. Generally, two kinds of discriminant
feature bases are considered for the solution: 1)
and; and 2)
Some researchers such as , , , prefer the feature (1)
based on the consideration that the small/zero eigenvalues of
tend to capture noise. Other researchers such as , ,
, consider (2) the optimal discriminant feature bases, since
they maximize the ratio of (7). Particularly, EFM  is an
extension to Fisherfaces , while JD-LDA  is an improvement
of . Based on the experience of the authors, it is hard
to say which kind of feature is better. Different experimental
settings often lead to different conclusions as shown in Table II.
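For reference, the criterion optimized by LDA-style learners is commonly written in the textbook form below. It is given here only as a sketch: the symbols $\Psi$, $S_b$, $S_w$, $z_{ij}$, $\bar{z}_i$, $\bar{z}$, $C$, $C_i$ and $N$ are notational assumptions and may differ from those used in (7).

```latex
\Psi^{*} = \arg\max_{\Psi}
  \frac{\left|\Psi^{T} S_b \Psi\right|}{\left|\Psi^{T} S_w \Psi\right|},
\qquad
S_b = \frac{1}{N}\sum_{i=1}^{C} C_i\,(\bar{z}_i-\bar{z})(\bar{z}_i-\bar{z})^{T},
\qquad
S_w = \frac{1}{N}\sum_{i=1}^{C}\sum_{j=1}^{C_i} (z_{ij}-\bar{z}_i)(z_{ij}-\bar{z}_i)^{T}
```

Here $C$ is the number of classes, $C_i$ the number of training samples in class $i$, $N=\sum_i C_i$, $\bar{z}_i$ the mean of class $i$, and $\bar{z}$ the overall mean. Since the rank of $S_w$ is at most $N-C$, it is singular whenever the image dimensionality exceeds the number of training samples, which is the source of the ill-posed estimation under the SSS problem.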
Both EFM and JD-LDA return
is the center of the class . For sim-
plicity, we denote EFM or JD-LDA as a function
. Owing to space limitations, only
the pseudocode implementation of the lesser-known JD-LDA
is depicted as a learner in Fig. 6. Following this example, the EFM
learner can be implemented easily in a similar way.
For an input face image , its LDA-based representation
can be obtained by a linear mapping:
subsequent classification in the feature space can be performed
using any classifier. However, from the viewpoint of reducing
the overfitting chances in the context of boosting, a simple dis-
criminant function that explains most of the data is preferable to
a complex one. Consequently, a classic nearest center classifier
(NCC) is adopted here for the classification task. The NCC is
based on a normalized Euclidean distance, given by
andare the between- and within-class scatter
. Based on the nearest center rule,
of the inputcan be inferred through
. The classification score
the class label
has values in [0, 1], and thus it can fulfill the functional require-
ment of the boosting algorithm (AdaBoost.M2 ), indicating
Fig. 6. Pseudocode implementation of the JD-LDA feature extractor: ??? ? in the ?th boosting iteration, where the input ? ? ? , and ? ? ? is an adaptively
updated subset defined in Section III-B.
a “degree of plausibility” for labeling
such as the NCC discussed here usually yields two
outputs, the classification score
, we denote, and
as the class . Since a
and the class label
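The nearest center rule described above can be illustrated with a minimal sketch of a classifier that returns both a score in [0, 1] for every class and a class label. The exact normalized distance used in the paper is not reproduced here. The min-max normalization below, and the names `ncc_scores` and `ncc_label`, are assumptions made for the example.

```python
import math


def ncc_scores(feature, centers):
    """Map the distance to each class center into a score in [0, 1].

    Assumption for illustration: scores are min-max normalized so that
    the closest center gets 1.0 and the farthest center gets 0.0.
    """
    distances = {
        label: math.dist(feature, center) for label, center in centers.items()
    }
    d_min = min(distances.values())
    d_max = max(distances.values())
    if d_max == d_min:
        # All centers are equally far away; no class is preferred.
        return {label: 1.0 for label in distances}
    return {label: (d_max - d) / (d_max - d_min) for label, d in distances.items()}


def ncc_label(feature, centers):
    """Class whose center is nearest to the feature vector."""
    scores = ncc_scores(feature, centers)
    return max(scores, key=scores.get)
```

In use, `feature` would be the LDA-based representation of an input image and `centers` the class centers returned by the learner, both expressed in the same feature space.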
Portions of the research in this paper use the FERET database
of facial images collected under the FERET program . The
authors would like to thank the FERET Technical Agent, the
U.S. National Institute of Standards and Technology (NIST) for
providing the FERET database.
 A. Samal and P. A. Iyengar, “Automatic recognition and analysis of
human faces and facial expressions: A survey,” Pattern Recognit., vol.
25, pp. 65–77, 1992.
 D. Valentin, H. A. Alice, J. O. Toole, and G. W. Cottrell, “Connectionist
models of face processing: A survey,” Pattern Recognit., vol. 27, no. 9,
pp. 1209–1230, 1994.
nition of faces: A survey,” Proc. IEEE, vol. 83, no. 5, pp. 705–740, May
 S. Gong, S. J. McKenna, and A. Psarrou, Dynamic Vision From Images
to Face Recognition, Singapore: World Scientific , May 2000.
 M. Turk, “A random walk through eigenspace,” IEICE Trans. Inf. Syst.,
vol. E84-D, no. 12, pp. 1586–1695, Dec. 2001.
 W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recog-
nition: A literature survey,” ACM Comput. Surv., vol. 35, no. 4, pp.
399–458, Dec. 2003.
 M. A. Turk and A. P. Pentland, “Eigenfaces for recognition,” J. Cogn.
Neurosci., vol. 3, no. 1, pp. 71–86, 1991.
 P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs.
fisherfaces: Recognition using class specific linear projection,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, Jul. 1997.
Pattern Recognit., vol. 33, pp. 1771–1782, 2000.
 W. Zhao, R. Chellappa, and J. Phillips, “Subspace linear discriminant
analysis for face recognition,” Univ. Maryland, College Park, MD, Tech.
Rep., CS-TR4009, 1999.
 L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new
LDA-based face recognition system which can solve the small sample
size problem,” Pattern Recognit., vol. 33, pp. 1713–1726, 2000.
 H. Yu and J. Yang, “A direct LDA algorithm for high-dimensional
data—With application to face recognition,” Pattern Recognit., vol. 34,
pp. 2067–2070, Oct. 2001.
 C. Liu and H. Wechsler, “Gabor feature based classification using the
enhanced fisher linear discriminant model for face recognition,” IEEE
Trans. Image Process., vol. 11, no. 4, pp. 467–476, Apr. 2002.
 J. Ye and Q. Li, “LDA/QR: An efficient and effective dimension reduc-
no. 4, pp. 851–854, Apr. 2004.
 M. J. Er, W. Chen, and S. Wu, “High-speed face recognition based on
discrete cosine transform and rbf neural networks,” IEEE Trans. Neural
Netw., vol. 16, no. 3, pp. 679–691, May 2005.
 S. J. Raudys and A. K. Jain, “Small sample size effects in statistical
pattern recognition: Recommendations for practitioners,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 13, no. 3, pp. 252–264, Mar. 1991.
 M. Bichsel and A. P. Pentland, “Human face recognition and the face
image set’s topology,” CVGIP: Image Understanding, vol. 59, pp.
 S. Kutin, “Algorithmic Stability and Ensemble-Based Learning,” Ph.D.
Thesis, Faculty Div. Phys. Sci., Univ. Chicago, June 2002.
 B. Schölkopf, A. Smola, and K. R. Müller, “Nonlinear component
analysis as a kernel eigenvalue problem,” Neural Comput., vol. 10, pp.
 G. Baudat and F. Anouar, “Generalized discriminant analysis using a
kernel approach,” Neural Comput., vol. 12, pp. 2385–2404, 2000.
 K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An
introduction to kernel-based learning algorithms,” IEEE Trans. Neural
Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001.
 J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition
Int. Workshop Neural Networks for Signal Processing, Falmouth, MA,
Sep. 2001, pp. 373–382.
, “Face recognition using kernel direct discriminant analysis algo-
rithms,” IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 117–126, Jan.
 M. Wang and S. Chen, “Enhanced fmam based on empirical kernel
map,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 557–564, May
svm classification tree generated by membership-based lle data parti-
tion,” IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 436–446, Mar. 2005.
 H. Xiong, M. N. S. Swamy, and M. O. Ahmad, “Optimizing the kernel
in the empirical feature space,” IEEE Trans. Neural Netw., vol. 16, no.
2, pp. 460–474, Mar. 2005.
 A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular
eigenspaces for face recognition,” in Proc. Computer Vision and Pattern
Recognition Conf. , June 1994, pp. 1–7.
 K.-K. Sung and T. Poggio, “Example-based learning for view-based
human face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20,
no. 1, pp. 39–51, Jan. 1998.
 B. J. Frey, A. Colmenarez, and T. S. Huang, “Mixtures of local linear
subspaces for face recognition,” in Proc. IEEE Conf. Computer Vision
and Pattern Recognition, Santa Barbara, CA, Jun. 1998, pp. 32–37.
 J. Lu and K. N. Plataniotis, “Boosting face recognition on a large-scale
database,” in Proc. IEEE Int. Conf. Image Processing, Rochester, NY,
Sep. 2002, pp. II.109–II.112.
 Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” J. Comput. Syst. Sci.,
vol. 55, no. 1, pp. 119–139, 1997.
 R. E. Schapire, “The boosting approach to machine learning: An
overview,” MSRI Workshop Nonlinear Estimation and Classification,
pp. 149–172, 2002.
Process. Syst. 8, pp. 479–485, 1996.
 R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, “Boosting the
margin: A new explanation for the effectiveness of voting methods,” in
Proc. 14th Int. Conf. Machine Learning , 1997, pp. 322–330.
 L. Breiman, “Arcing classifiers,” Ann. Statistics, vol. 26, no. 3, pp.
 V. N. Vapnik, The Nature of Statistical Learning Theory.
vol. 32, no. 1, pp. 1–11, 2004.
, “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140,
 M. Skurichina and R. P. W. Duin, “Bagging, boosting and the random
subspace method for linear classifiers,” Pattern Anal. Appl., vol. 5, no.
2, pp. 121–135, Jun. 2002.
 A. Murua, “Upper bounds for error rates of linear combinations of
classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp.
591–602, May 2002.
, vol. 2, Oct. 2003, pp. 734–741.
 K. Tieu and P. Viola, “Boosting image retrieval,” Int. J. Comput. Vis.,
vol. 56, no. 1, pp. 17–36, 2004.
 P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J.
Comput. Vis., vol. 57, pp. 137–154, May 2004.
 Y. Freund and R. E. Schapire, “A discussion of ‘Process consistency for
AdaBoost’ by Wenxin Jiang, ‘On the Bayes-risk consistency of regularized
boosting methods’ by Gábor Lugosi and Nicolas Vayatis, ‘Statistical
behavior and consistency of classification methods based on convex risk
minimization’ by Tong Zhang,” Ann. Statist., vol. 32, no. 1, 2004.
 A. J. Grove and D. Schuurmans, “Boosting in the limit: Maximizing the
margin of learned ensembles,” in Proc. 15th Nat. Conf. Artificial
Intelligence, Jul. 1998, pp. 692–699.
 C. Rudin, R. E. Schapire, and I. Daubechies, “Boosting based on
a smooth margin,” in COLT (Computational Learning Theory), J.
Shawe-Taylor and Y. Singer, Eds.
 J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition
using LDA based algorithms,” IEEE Trans. Neural Netw., vol. 14, no. 1,
pp. 195–200, Jan. 2003.
, “Regularized discriminant analysis for the small sample size
problem in face recognition,” Pattern Recognit. Lett., vol. 24, no. 16,
pp. 3079–3087, Dec. 2003.
 R. Lotlikar and R. Kothari, “Fractional-step dimensionality reduction,”
 Out-of-Bag Estimation, L. Breiman. (1996). [Online]. Available:
 J. Lu, “Discriminant learning for face recognition,” Ph.D. Dissertation,
The Edward S. Rogers Sr. Dept. Elect. Comp. Eng., Univ. Toronto,
Toronto, Canada, Jun. 2004.
 Image Group, Information Access Division, ITL, NIST (2004, Jan.).
[Online]. Available: http://www.itl.nist.gov/iad/humanid/feret/
New York: Springer-Verlag, 2004.
tion methodology for face-recognition algorithms,”IEEE Trans. Pattern
Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000.
and evaluation procedure for face recognition algorithms,” Image Vis.
Comput. J., vol. 16, no. 5, pp. 295–306, 1998.
 L.-F. Chen, H.-Y. M. Liao, J.-C. Lin, and C.-C. Han, “Why recognition
in a statistics-based face recognition system should be based on the pure
face portion: A probabilistic decision-based proof,” Pattern Recognit.,
vol. 34, no. 7, pp. 1393–1403, 2001.
nition, Washington, DC, May 20–21, 2002, pp. 235–241.
 B. Moghaddam, “Principal manifolds and probabilistic subspaces for
visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24,
no. 6, pp. 780–788, Jun. 2002.
 Evaluation of Face Recognition Algorithms Website, R. Bev-
 R. Xu and D. Wunsch II, “Survey of clustering algorithms,” IEEE Trans.
Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
Juwei Lu (M’00) received the B.Eng. degree in
electrical engineering from Nanjing University of
Aeronautics and Astronautics, China, in 1994, the
M.Eng. degree in electrical and electronic engi-
neering from Nanyang Technological University,
Singapore, in 1999, and the Ph.D. degree in elec-
trical and computer engineering from University of
Toronto, Canada, in 2004.
From July 1999 to January 2001, he was a
Research Engineer with the Center for Signal Pro-
cessing, Singapore. From April 2004 to December
2004, he was a Postdoctoral Researcher at the Bell Canada Multimedia Labora-
University of Toronto. Currently, he is a Senior Software Developer at
Epson Canada Limited, Toronto, ON, Canada. His research interests include
multimedia signal processing, visual object detection and recognition, kernel
methods, support vector machines, neural networks, and boosting technologies.
He has published 28 refereed papers and book chapters in these areas.
Dr. Lu is a member of the IEEE Computational Intelligence Society. He is a re-
viewer of many journals, such as IEEE TRANSACTIONS ON PATTERN ANALYSIS
AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON SYSTEMS, MAN AND
CYBERNETICS - PART B and Pattern Recognition Letters.
(S’90–M’92–SM’03) received the B. Eng. degree in
computer engineering informatics from University
of Patras, Greece, in 1988, and the M.S and the
Ph.D. degrees in electrical engineering from Florida
Institute of Technology (Florida Tech) in Melbourne,
Florida, in 1992 and 1994, respectively.
He is an Associate Professor with The Edward S.
Rogers Sr. Department of Electrical and Computer
Engineering at the University of Toronto, Toronto,
ON, Canada, a Nortel Institute for Telecommunica-
tions Associate, a member of the Knowledge Media Design Institute at the Uni-
versity of Toronto and an Adjunct Professor with the School of Computer Sci-
metrics, image and signal processing, stochastic estimation, and pattern recog-
Dr. Plataniotis is the Vice Chair of the 9th International IEEE Conference on
Intelligent Transportation Systems (ITSC 06), September 18–20, 2006, Toronto,
Canada, and the Technical Program Co-Chair for the IEEE International Con-
ference on Multimedia Expo (ICME) 2006, July 9–12, Toronto, Canada. He is
an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS.
SM’79–F’88) received the Diploma in engineering
degree from the National Technical University
of Athens (NTU), Athens, Greece, in 1965, and
the M.S., M.Phil., and Ph.D. degrees in electrical
engineering from Yale University, New Haven, CT,
in 1966, 1968, and 1969, respectively.
He joined the Department of Electrical and Com-
puter Engineering of the University of Toronto, ON,
Canada, in September 1968, as a Lecturer and he was
promoted to Assistant Professor in 1970, Associate
Professor in 1973, and Professor in 1981. Since July 1997, he has been Asso-
ciate Chair: Graduate Studies of the Edward S. Rogers Sr. Department of Elec-
trical and Computer Engineering and was Acting Chair during the spring term
of 1998–1999. Since July 2001, he has served as the 12th Dean of the Fac-
ulty of Applied Science and Engineering of the University of Toronto. He has
served as Chair of the Communications Group and Associate Chair of the De-
partment of Electrical Engineering and Associate Chair: Graduate Studies for
the Department of Electrical and Computer Engineering. He was on research
leave at Imperial College of Science and Technology, the National Technical
University of Athens, the Swiss Federal Institute of Technology, the University
of Florence and the Federal University of Rio de Janeiro, and has also served
as Adjunct Professor at Concordia University. He has served as lecturer in 138
short courses to industry and continuing education programs and as Consultant
thor of Color Image Processing and Applications (New York: Springer-Verlag,
2000), Nonlinear Filters in Image Processing: Principles Applications (Nor-
well, MA: Kluwer, 1990), Artificial Neural Networks: Learning Algorithms,
Performance Evaluation and Applications (Norwell, MA: Kluwer, 1993), and
Fuzzy Reasoning in Information Decision and Control systems (Norwell, MA:
Kluwer 1994). He has served as Chair on numerous boards, councils and tech-
nical conference committees of the Institute of Electrical and Electronic En-
gineers (IEEE), such as the Toronto Section (1977–1979) and the IEEE Cen-
tral Canada Council (1980–1982); he was President of the Canadian Society
for Electrical Engineering and Vice President of the Engineering Institute of
Canada (EIC) (1983–1986). He was a Guest Editor or Associate Editor for
several IEEE Journals and the Editor of the Canadian Electrical Engineering
Journal (1981–1983). He was the Technical Program Co-Chair of the IEEE In-
ternational Conference on Image Processing (ICIP’01). He has published 750
papers in refereed journals and conference PROCEEDINGS on digital signal and
image processing, and digital communications.
Prof. A.N. Venetsanopoulos is a member of the IEEE Communications, Cir-
cuits and Systems, Computer, and Signal Processing Societies of IEEE, as well
as a member of Sigma Xi, the Technical Chamber of Greece, the European As-
sociation of Signal Processing, the Association of Professional Engineers of
Ontario (APEO) and Greece. He was elected as a Fellow of the IEEE “for con-
tributions to digital signal and image processing.” He is also a Fellow of the
EIC, “for contributions to electrical engineering,” and was awarded an Hon-
orary Doctorate from the National Technical University of Athens, in October
1994. In October 1996, he was awarded the “Excellence in Innovation Award”
of the Information Technology Research Center of Ontario and Royal Bank of
Canada, “for innovative work in color image processing.”
Stan Z. Li received the B.Eng. degree from Hunan
University, P. R. China, the M.Eng. degree from Na-
and the Ph.D. degree from Surrey University, U.K.
He is a Researcher at National Lab of Pattern
Recognition (NLPR), Institute of Automation,
Chinese Academy of Sciences (CASIA), China,
Beijing, P. R. China, and the Director of the Center
for Biometrics and Security Research (CBSR), Bei-
jing, P. R. China. He worked at Microsoft Research
Asia, Beijing, P. R. China, as a Researcher, from
May 2000 to August 2004. Prior to that, he was an Associate Professor of
Nanyang Technological University, Singapore. His current research interest is
in face recognition technologies, biometrics, intelligent surveillance, pattern
recognition, and machine learning. He has published several books, including
Handbook of Face Recognition (New York: Springer-Verlag, 2004) and Markov
Random Field Modeling in Image Analysis (New York: Springer-Verlag, 2nd
edition in 2001), and over 200 refereed papers and book chapters in these areas.