Normalized LDA for semi-supervised learning
Bin Fan, Zhen Lei, Stan Z. Li
Center for Biometrics and Security Research & National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences
95 Zhongguancun Donglu, Beijing 100190, China
Abstract
Linear Discriminant Analysis (LDA) has been a popular
method for feature extraction and face recognition. As a
supervised method, it requires manually labeled samples for
training, and labeling samples is time-consuming and
exhausting work. A semi-supervised LDA (SDA [3])
has been proposed recently to enable training of LDA with
partially labeled samples.
In this paper, we first reformulate supervised LDA based
on the normalized perspective of LDA. Then we show that
such a reformulation is powerful for semi-supervised learn-
ing of LDA. We call this approach Normalized LDA, which
uses total diversity to normalize intra-class diversity and
aims to find projection directions that minimize normalized
intra-class diversity. Although the Normalized LDA is iden-
tical to LDA in the supervised situation, a semi-supervised
approach can be easily incorporated into its framework
to make use of unlabeled samples to improve the performance
in the learned subspace. Moreover, unlike SDA, which uses
unlabeled samples to preserve neighboring relations, the
Normalized LDA uses unlabeled samples to obtain a more
accurate estimation of the data space. Face recognition
experiments on the FRGC version 2 database and the CMU PIE
database demonstrate that the Normalized LDA outperforms SDA.
1. Introduction
In the past two decades, face recognition has been widely
researched in the area of pattern recognition and computer
vision [12] [4] [15]. This is due not only to its potential
applications, such as biometrics, human-computer interaction
and security, but also to its theoretical challenges, since it
is a hard pattern classification problem with a large number
of classes and large diversity within each class.
One of the most successful and well-studied techniques for
face recognition is the appearance-based method. PCA [7]
and LDA [8] are two popular methods of this kind, which learn
a face subspace from a set of training samples. In PCA,
the subspace is constructed to retain most of the information
about the original samples by the Karhunen-Loeve transform.
However, it does not utilize any class information and thus
may drop some important clues for classification. LDA was then
proposed; it seeks the feature subspace that best separates
different face classes by maximizing the between-class
covariance and simultaneously minimizing the within-class
covariance. Because of the high dimension of the feature space
(e.g. the total number of pixels in a facial image) and the small
sample size, the within-class scatter matrix $S_w$ is often
singular, so the optimal solution of LDA cannot be found
directly. Efforts have been made to attenuate the singularity
problem in the estimation of $S_w$. These methods include
PCA+LDA (Fisher LDA) [8], Direct LDA (D-LDA) [16],
Null space LDA (N-LDA) [5], regularized LDA [6] and
so on. However, all these LDA variants suffer from the same
problem: weak generalization capability when there are not
enough training samples.
In reality, it is difficult to obtain a large number of
labeled samples, but unlabeled ones are abundant; e.g. we
can find billions of facial images on the Internet.
Can we use these unlabeled samples to improve the
classifier's accuracy? If the answer is yes, then how
can we use them? These are the problems to be solved in
semi-supervised learning [18], which uses both labeled and
unlabeled data to find a hypothesis that accurately labels
unseen examples. Research on semi-supervised learning
has gained much attention in recent years [13] [17] [19] [2].
Rosenberg et al. [11] have presented a semi-supervised ap-
proach to train object detectors based on self-training, and
their experimental results show that the trained model for
detection is comparable to the model trained with a large
number of labeled samples. Balcan et al. [1] have applied
semi-supervised learning to the problem of person identi-
fication in webcam images. Roli and Marcialis [10] have
used self-training to develop a semi-supervised face recog-
nition algorithm on the basis of the standard PCA-based al-
gorithm.
In this paper, we propose a novel approach for
semi-supervised LDA learning, called Normalized LDA, which
978-1-4244-2154-1/08/$25.00 ©2008 IEEE
is motivated by reformulating LDA in a normalized way.
The Normalized LDA aims to find a group of projection
directions on which the normalized intra-class diversity is
minimized, which is equivalent to maximizing the diversity of
the total samples and minimizing the intra-class diversity
simultaneously. Compared with supervised LDA, a
semi-supervised approach can be easily incorporated into the
Normalized LDA to learn a more discriminative subspace.
Specifically, the unlabeled and labeled samples are used
together to get a more accurate estimation of the diversity
of the total sample space and thereby improve the performance
in the learned subspace. Regarding semi-supervised learning
of LDA, a related work is Semi-supervised Discriminant
Analysis (SDA) [3] proposed by Cai et al., which is
essentially a regularized approach based on the assumption that
nearby points will have similar embeddings. It uses
unlabeled samples to preserve neighboring relations. In
contrast, the unlabeled samples in the Normalized LDA are used
to estimate the diversity of the total sample space.
Statistically, more samples lead to a more accurate
estimation, and hence improve the generalization of the
proposed approach. Moreover, we generalize our algorithm
by introducing weights for labeled samples to further improve
its robustness. The experimental
results show that our algorithm outperforms SDA.
The rest of this paper is organized as follows. Section 2
describes how to enable semi-supervised learning of LDA
by reformulating LDA in a normalized way. In Section 3,
we introduce weights for labeled samples to further improve
the performance of our approach. The algorithmic procedure
of our approach is presented in Section 4. Then the
experimental results are shown in Section 5 and finally we
conclude this paper in Section 6.
2. Semi-supervised learning of LDA
In this section, we will reformulate supervised LDA as
an approach which aims to find projection vectors minimizing
the normalized intra-class diversity. The intra-class
diversity is normalized by the diversity of total samples. Then
we will show that such a reformulation is powerful for
semi-supervised learning of LDA.
2.1. Normalized Perspective of LDA
Let $X_L = \{x_1, x_2, \cdots, x_N\} \subset \mathbb{R}^d$ be a
training set with N labeled samples belonging to K classes
$\{C_1, C_2, \cdots, C_K\}$, and $X = [x_1, x_2, \cdots, x_N]$ is the
data matrix of these training samples. We denote the number of
samples in the l-th class as $N_l$.
In LDA, the intra-class scatter matrix $S_w$ and the between-class
scatter matrix $S_b$ are defined as follows:

$$S_w = \frac{1}{N} \sum_{l=1}^{K} \sum_{x_i \in C_l} (x_i - m_l)(x_i - m_l)^T \tag{1}$$

$$S_b = \frac{1}{N} \sum_{l=1}^{K} N_l (m_l - m)(m_l - m)^T \tag{2}$$

in which $m_l = \frac{1}{N_l} \sum_{x_i \in C_l} x_i$ is the mean vector of
class $C_l$ and $m = \frac{1}{N} \sum_{i=1}^{N} x_i$ is the total mean
vector. The total scatter matrix $S_t$ is defined as:

$$S_t = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)(x_i - m)^T \tag{3}$$

It follows that $S_t = S_w + S_b$, and then using $\mathrm{tr}(S_t)$ to
normalize $\mathrm{tr}(S_w)$ and $\mathrm{tr}(S_b)$, we get

$$\frac{\mathrm{tr}(S_w)}{\mathrm{tr}(S_t)} + \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_t)} = 1 \tag{4}$$

in which $\mathrm{tr}(A)$ denotes the trace of matrix A.
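
As a quick numerical check of Equs. (1)-(4), the following sketch computes the three scatter matrices on random data and confirms that the two normalized traces sum to one. The function name `scatter_matrices`, the column-per-sample layout and the random data are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_w, S_b, S_t) following Equs. (1)-(3).

    X: (d, N) data matrix with one sample per column.
    y: length-N array of class labels.
    """
    d, N = X.shape
    m = X.mean(axis=1, keepdims=True)          # total mean vector m
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        X_l = X[:, y == label]                 # samples of class C_l
        N_l = X_l.shape[1]
        m_l = X_l.mean(axis=1, keepdims=True)  # class mean m_l
        D = X_l - m_l
        S_w += D @ D.T
        S_b += N_l * (m_l - m) @ (m_l - m).T
    S_t = (X - m) @ (X - m).T / N
    return S_w / N, S_b / N, S_t

# Hypothetical toy data: 3 classes, 10 samples each, 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 30))
y = np.repeat(np.arange(3), 10)
S_w, S_b, S_t = scatter_matrices(X, y)
ratio_sum = np.trace(S_w) / np.trace(S_t) + np.trace(S_b) / np.trace(S_t)
```

Here `ratio_sum` should equal 1 up to floating-point error, which is the statement of Equ. (4).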
The objective of LDA is to maximize the between-class
scatter while minimizing the intra-class scatter.
From Equ. (4), we can see that these two objectives are
identical after normalization and can be achieved simultaneously.
In other words, LDA aims to find projection vectors that
either minimize $\mathrm{tr}(S_w)/\mathrm{tr}(S_t)$ or maximize
$\mathrm{tr}(S_b)/\mathrm{tr}(S_t)$, which can be expressed as follows:

$$\text{minimize} \quad \frac{\mathrm{tr}(P^T S_w P)}{\mathrm{tr}(P^T S_t P)} \tag{5}$$

$$\text{maximize} \quad \frac{\mathrm{tr}(P^T S_b P)}{\mathrm{tr}(P^T S_t P)} \tag{6}$$

in which P is the transform matrix constituted by projection
vectors as columns. Since

$$\mathrm{tr}(P^T S_w P) = \frac{1}{N} \sum_{l=1}^{K} \sum_{x_i \in C_l} \| P^T x_i - P^T m_l \|^2 \tag{7}$$

$$\mathrm{tr}(P^T S_b P) = \frac{1}{N} \sum_{l=1}^{K} N_l \| P^T m_l - P^T m \|^2 \tag{8}$$

$$\mathrm{tr}(P^T S_t P) = \frac{1}{N} \sum_{i=1}^{N} \| P^T x_i - P^T m \|^2 \tag{9}$$

$\mathrm{tr}(P^T S_w P)$ represents the intra-class diversity,
$\mathrm{tr}(P^T S_b P)$ represents the between-class diversity and
$\mathrm{tr}(P^T S_t P)$ represents the total diversity. Thus
$\mathrm{tr}(P^T S_w P)/\mathrm{tr}(P^T S_t P)$ is the normalized
intra-class diversity and $\mathrm{tr}(P^T S_b P)/\mathrm{tr}(P^T S_t P)$
is the normalized between-class diversity.
Therefore, the objective of LDA is reformulated either to
maximize the normalized between-class diversity or to
minimize the normalized intra-class diversity.
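
The projected quantities in Equs. (7)-(9) can be evaluated directly from squared distances in the projected space. The sketch below does this for an arbitrary, randomly chosen projection matrix `P` (an assumption made only for illustration) and checks that the normalized intra-class and between-class diversities still sum to one after projection.

```python
import numpy as np

# Hypothetical toy setup: K classes in d dimensions, projected to 2 dimensions.
rng = np.random.default_rng(1)
d, n_per_class, K = 6, 8, 3
X = rng.normal(size=(d, n_per_class * K))   # one sample per column
y = np.repeat(np.arange(K), n_per_class)
P = rng.normal(size=(d, 2))                 # arbitrary projection, not optimized
N = X.shape[1]

Z = P.T @ X                                 # projected samples P^T x_i
z_mean = Z.mean(axis=1, keepdims=True)      # P^T m

intra = 0.0
between = 0.0
for label in range(K):
    Z_l = Z[:, y == label]
    z_l = Z_l.mean(axis=1, keepdims=True)   # P^T m_l
    intra += np.sum((Z_l - z_l) ** 2)                        # Equ. (7), before 1/N
    between += Z_l.shape[1] * np.sum((z_l - z_mean) ** 2)    # Equ. (8), before 1/N
intra /= N
between /= N
total = np.sum((Z - z_mean) ** 2) / N       # Equ. (9)

normalized_intra = intra / total            # objective of Equ. (5)
normalized_between = between / total        # objective of Equ. (6)
```

Because the two normalized values sum to one for any `P`, minimizing the first is the same as maximizing the second in the fully supervised case.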
Based on the above reformulation, we propose an approach
named Normalized LDA for semi-supervised learning
of LDA. In real applications, e.g. face recognition, there
are billions of classes, which makes estimating the
relationship between classes difficult with only a limited
number of training samples. However, we can assume that samples
in each class have a similar distribution. Taking facial
images as an example, the variation among different images of
one person is similar to that of another person. Therefore, we
take Equ. (5) as the objective of our method, which is to
minimize the normalized intra-class diversity. A small value
of the normalized intra-class diversity means that the samples
in each class are compactly clustered while the samples in
the data space as a whole are widely dispersed, which indicates
that different classes are well separated. In the supervised
situation, it is identical to LDA. However, as we shall see later,
a semi-supervised approach can be easily incorporated into the
Normalized LDA to enable training with partially labeled
samples. The advantage of Normalized LDA is that it can be
trained with unlabeled samples.
2.2. Training with unlabeled samples
In LDA, the discriminating subspace is learned by
considering only the labeled samples. When there are not enough
labeled samples, its performance on testing samples cannot
be guaranteed. In real applications, especially in face
recognition, obtaining samples is easy but labeling them
is time-consuming and exhausting work. Therefore,
developing an algorithm that uses unlabeled samples
to improve performance in the learned subspace will largely
reduce human labor. In this subsection, we show how to
use unlabeled samples to improve the accuracy of the
Normalized LDA, as well as the reason why it can learn a more
discriminative subspace.
As described in the previous subsection, the objective of
Normalized LDA is to maximize the total diversity and
minimize the intra-class diversity. Such an objective can easily
make use of unlabeled samples to improve the discriminating
capability of the learned subspace. Specifically,
since the diversity of the total samples evaluates how data
scatters in the sample space, it reflects the character of
the data space and does not utilize any class information.
In other words, it does not depend on the labels of the samples.
Therefore, all available samples, labeled and unlabeled,
can be used to calculate the total diversity. In this
case $S_t$ becomes:

$$S_t' = \frac{1}{N_L + N_U} \sum_{x_i \in \{X_L, X_U\}} (x_i - m')(x_i - m')^T \tag{10}$$

in which $X_L$ and $X_U$ denote the labeled sample set and the
unlabeled sample set respectively, $N_L$ and $N_U$ are the
numbers of samples in these sets, and $m'$ is the mean vector
of both labeled and unlabeled samples. From a statistical
point of view, a larger number of samples leads to a more accurate
estimation of the data space. In this way, the discriminating
capability of the learned subspace is improved by a more
accurate estimation of total diversity.
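
A minimal sketch of Equ. (10) follows, assuming samples are stored as columns; the function name `total_scatter` and the random data are hypothetical. No labels are needed anywhere in the computation, which is why unlabeled samples can contribute to it.

```python
import numpy as np

def total_scatter(X_labeled, X_unlabeled):
    """S_t' of Equ. (10); both inputs are (d, n) matrices with samples as columns."""
    X_all = np.hstack([X_labeled, X_unlabeled])   # pool labeled and unlabeled samples
    m_prime = X_all.mean(axis=1, keepdims=True)   # mean vector m' of the pooled set
    D = X_all - m_prime
    return D @ D.T / X_all.shape[1]               # divide by N_L + N_U

# Hypothetical toy data: few labeled samples, many unlabeled ones.
rng = np.random.default_rng(2)
X_L = rng.normal(size=(4, 10))
X_U = rng.normal(size=(4, 200))
S_t_prime = total_scatter(X_L, X_U)
```

The result coincides with the biased sample covariance of the pooled data, so the usual statistical intuition applies: more samples give a more stable estimate.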
3. Weighting labeled samples
Estimating the intra-class diversity as well as the total
diversity plays a key role in the Normalized LDA, and it is
vital for the performance of the learned subspace. The more
accurate the estimation is, the more discriminative the learned
subspace is. By adding a large number of unlabeled samples
to the training set, the estimation of the total diversity of the
data set will be more accurate. However, obtaining a large number
of samples in each class to guarantee an accurate estimation
of the intra-class diversity is nearly impractical. Usually,
we can only get a few samples in each class. Thus the
estimation of the intra-class diversity is sensitive to noise:
with so few samples, a single noisy sample can greatly distort
the estimate. In order to reduce the influence of noise and
further improve the semi-supervised learning of LDA, we
introduce a weight $w(x_i)$ for each labeled sample $x_i$ instead
of treating labeled samples equally. If a sample is likely
to be noise, it should have a small weight. By
assigning weights to different labeled samples to reduce the
influence of noise, it is expected that the estimation of the
intra-class diversity will be more accurate. With a weight
for each labeled sample, Equ. (7) is rewritten as:

$$\mathrm{tr}(P^T S_w P) = \sum_{l=1}^{K} \sum_{x_i \in C_l} w(x_i) \| P^T x_i - P^T m_l \|^2 \tag{11}$$

For simplicity in presentation, we denote the sum of
the weights of samples in class $C_l$ by $s_l$. In other words,
$s_l = \sum_{i \in C_l} w(x_i)$. Then, we denote $W_l$ to be the
diagonal matrix of the weights in class $C_l$. Therefore, the mean
vector $m_l$ of class $C_l$ can be rewritten as $m_l = X_l W_l e_l / s_l$,
where $X_l$ is a data matrix containing samples in class $C_l$,
each column is a sample in $C_l$, and $e_l$ is the vector of all
ones whose size is the number of samples in $C_l$.
We then rewrite the intra-class diversity as follows:

$$\mathrm{tr}(P^T S_w P) = \sum_{l=1}^{K} \sum_{i \in C_l} w(x_i) \| P^T x_i - P^T m_l \|^2$$

$$= \sum_{l=1}^{K} \sum_{i \in C_l} w(x_i) \left\| P^T x_i - P^T \frac{X_l W_l e_l}{s_l} \right\|^2$$

$$= \sum_{l=1}^{K} \left\| P^T \left( X_l - \frac{X_l W_l e_l e_l^T}{s_l} \right) W_l^{1/2} \right\|_F^2$$

$$= \sum_{l=1}^{K} \left\| P^T X_l W_l^{1/2} \left( I - \frac{W_l^{1/2} e_l e_l^T W_l^{1/2}}{s_l} \right) \right\|_F^2$$

Let $M_l = I - \frac{W_l^{1/2} e_l e_l^T W_l^{1/2}}{s_l}$ and note that
$s_l = e_l^T W_l e_l$; we then have $M_l^2 = M_l$.
Based on the fact that $\|A\|_F^2 = \mathrm{tr}(A A^T)$, we obtain:

$$\mathrm{tr}(P^T S_w P) = \mathrm{tr}\left( P^T \sum_{l=1}^{K} A_l P \right) \tag{12}$$

in which

$$A_l = X_l W_l^{1/2} M_l W_l^{1/2} X_l^T \tag{13}$$

Then, the task is how to set the weight for each labeled
sample to improve the robustness of Normalized LDA. We
believe that the weight of a sample should be related to
its location in the data space. If a sample lies near the
margin, it is more likely to be noise and thus should have
a small weight. Conversely, a sample should have a large
weight if it lies in a dense area. Here we set the weight
of a sample inversely proportional to its distance to the mean
of the class that the sample belongs to, that is,

$$w(x_i \in C_l) \propto \frac{1}{\mathrm{Dis}(x_i, m_l)}$$

where Dis() is a distance function. This paper uses the
Euclidean distance.
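
The weighting scheme and Equ. (13) can be sketched as follows. Several details here are assumptions rather than the authors' stated choices: the weights are computed from the distance to the unweighted class mean, a small `eps` guards against division by zero, and the weights are left unnormalized. The sketch checks numerically that $A_l$ equals the direct weighted sum of outer products around the weighted mean, and that $M_l$ is idempotent.

```python
import numpy as np

def class_weights(X_l, eps=1e-12):
    """Weights inversely proportional to the Euclidean distance to the class mean.

    Assumption: the distance is measured to the unweighted mean of X_l,
    and eps avoids division by zero for a sample that sits on the mean.
    """
    mean = X_l.mean(axis=1, keepdims=True)
    dist = np.linalg.norm(X_l - mean, axis=0)
    return 1.0 / (dist + eps)

def weighted_intra_matrix(X_l, w):
    """Return A_l = X_l W^{1/2} M_l W^{1/2} X_l^T (Equ. 13) and M_l."""
    n = X_l.shape[1]
    s_l = w.sum()                              # s_l = e_l^T W_l e_l
    v = np.sqrt(w).reshape(-1, 1)              # W_l^{1/2} e_l
    M_l = np.eye(n) - (v @ v.T) / s_l
    B = X_l * np.sqrt(w)                       # X_l W_l^{1/2}: scales each column
    return B @ M_l @ B.T, M_l

# Hypothetical toy class with 7 samples in 4 dimensions.
rng = np.random.default_rng(3)
X_l = rng.normal(size=(4, 7))
w = class_weights(X_l)
A_l, M_l = weighted_intra_matrix(X_l, w)

# Direct form: sum_i w(x_i) (x_i - m_l)(x_i - m_l)^T with m_l = X_l W_l e_l / s_l.
m_l = (X_l @ w / w.sum()).reshape(-1, 1)
D = X_l - m_l
A_direct = (D * w) @ D.T
```

Since `A_l` and `A_direct` agree, Equ. (12) is just the trace form of the weighted sum in Equ. (11).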
4. The Algorithm
We are given a set $X_L = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$
of samples belonging to K classes as the labeled sample
set in the training set, in which $y_i$ is the label of the
i-th sample. Without loss of generality, we assume that
$X_L$ is ordered according to the samples' labels, that is,
$X_L = \{X_1, X_2, \cdots, X_K\}$ where $X_i$ is the set of samples
in the i-th class. Besides, we have an unlabeled sample
set $X_U = \{x_{N+1}, x_{N+2}, \cdots, x_{N+M}\}$ in the training set.
Let P be the transform matrix that we want to find.
The steps of the Normalized LDA can be summarized as
follows:
1. Calculate the total diversity: According to
Equ. (10), calculate the total scatter matrix $S_t'$; then
$\mathrm{tr}(P^T S_t' P)$ is the total diversity.
2. Calculate the intra-class diversity: Either assign a
weight to each sample in $X_l$ inversely proportional to its
distance to the mean of the samples in $X_l$ and then calculate
$A_l$ according to Equ. (13), or calculate the intra-class scatter
matrix $S_w$ according to Equ. (1). The intra-class diversity is
then $\mathrm{tr}(P^T S_w' P)$, in which $S_w' = \sum_{l=1}^{K} A_l$
or $S_w' = S_w$.
3. Find optimal solution: The objective of the Normalized
LDA is to find a transform matrix P that minimizes the
normalized intra-class diversity as follows:

$$\text{minimize} \quad \frac{\mathrm{tr}(P^T S_w' P)}{\mathrm{tr}(P^T S_t' P)} \tag{14}$$

The solution can be obtained by solving the following generalized
eigen-problem:

$$S_w' P = \lambda S_t' P \tag{15}$$

Then compute the eigenvectors and eigenvalues
in Equ. (15), and sort the eigenvectors by
their corresponding eigenvalues in ascending order.
The sorted eigenvectors are denoted by
$[V_1, V_2, \cdots, V_d]$, where d is the dimension of a sample.
4. Select projection directions: Construct the transformation
matrix according to the number of features to extract. If
we want to extract $N_f$ ($N_f < d$) features, then the
transformation matrix is $P = [V_1, V_2, \cdots, V_{N_f}]$.
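
The four steps above can be sketched in NumPy as follows, for the unweighted variant ($S_w' = S_w$). This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `normalized_lda` is hypothetical, a small ridge term is added to $S_t'$ so that its Cholesky factor exists, and the generalized eigen-problem of Equ. (15) is solved by whitening $S_t'$ rather than by a dedicated generalized solver.

```python
import numpy as np

def normalized_lda(X_labeled, y, X_unlabeled, n_features, ridge=1e-6):
    """Sketch of steps 1-4 for the unweighted variant (S_w' = S_w).

    X_labeled: (d, N) labeled samples as columns; y: length-N labels.
    X_unlabeled: (d, M) unlabeled samples as columns.
    ridge: small multiple of the identity added to S_t' (an assumption,
           used only to keep the Cholesky factorization well defined).
    """
    d, N = X_labeled.shape

    # Step 1: total scatter over labeled and unlabeled samples, Equ. (10).
    X_all = np.hstack([X_labeled, X_unlabeled])
    D = X_all - X_all.mean(axis=1, keepdims=True)
    S_t = D @ D.T / X_all.shape[1]

    # Step 2: intra-class scatter from the labeled samples, Equ. (1).
    S_w = np.zeros((d, d))
    for label in np.unique(y):
        X_l = X_labeled[:, y == label]
        D_l = X_l - X_l.mean(axis=1, keepdims=True)
        S_w += D_l @ D_l.T
    S_w /= N

    # Step 3: solve S_w v = lambda S_t v. With S_t = L L^T and u = L^T v,
    # this becomes the symmetric problem (L^-1 S_w L^-T) u = lambda u.
    L = np.linalg.cholesky(S_t + ridge * np.eye(d))
    L_inv = np.linalg.inv(L)
    C = L_inv @ S_w @ L_inv.T
    eigvals, U = np.linalg.eigh((C + C.T) / 2)   # eigenvalues in ascending order
    V = L_inv.T @ U                              # generalized eigenvectors

    # Step 4: keep the directions with the smallest eigenvalues.
    return V[:, :n_features], eigvals[:n_features]

# Hypothetical toy data: 3 well-separated classes in 5 dimensions.
rng = np.random.default_rng(4)
centers = rng.normal(scale=5.0, size=(5, 3))
y = np.repeat(np.arange(3), 6)
X_L = centers[:, y] + rng.normal(size=(5, 18))
X_U = centers[:, rng.integers(0, 3, size=100)] + rng.normal(size=(5, 100))
P, lam = normalized_lda(X_L, y, X_U, n_features=2)
```

For the weighted variant, step 2 would instead accumulate the matrices $A_l$ of Equ. (13); the remaining steps are unchanged.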
5. Experiments
In this section, we conduct experiments on face recognition
to test our algorithm, using the FRGC version 2 database
[9] and the CMU PIE database [14]. In the experiments, each
facial image in the databases is in grey scale and cropped to
71×60 pixels by fixing the positions of the two eyes. Fig. 1
shows some sample facial images in the databases. PCA is
applied to all samples in the training set, the gallery set and
the probe set to reduce the computational complexity. After
extracting features, the nearest neighbor classifier with the
Euclidean distance metric is adopted to classify the samples.
Experimental results on both databases demonstrate that the
approach proposed in this paper outperforms
the SDA.
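
The classification step described above can be illustrated with a minimal nearest neighbor sketch. The function name and the tiny hand-made data are hypothetical; in the experimental pipeline the columns would be the extracted features $P^T x$ of the gallery and probe images.

```python
import numpy as np

def nearest_neighbor_labels(gallery, gallery_labels, probe):
    """1-NN with Euclidean distance; features are stored as columns."""
    # Squared distances between every probe column and every gallery column.
    diff = probe[:, :, None] - gallery[:, None, :]
    dist2 = np.sum(diff ** 2, axis=0)            # shape (n_probe, n_gallery)
    return gallery_labels[np.argmin(dist2, axis=1)]

# Two gallery samples at (0, 0) and (10, 10) with labels 7 and 9.
gallery = np.array([[0.0, 10.0], [0.0, 10.0]])
gallery_labels = np.array([7, 9])
# Three probe samples at (1, 0.5), (9, 11) and (-2, 1).
probe = np.array([[1.0, 9.0, -2.0], [0.5, 11.0, 1.0]])
predicted = nearest_neighbor_labels(gallery, gallery_labels, probe)
accuracy = np.mean(predicted == np.array([7, 9, 7]))
```

Recognition accuracy is then the fraction of probe samples whose nearest gallery sample carries the correct identity.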
5.1. On FRGC Database
The query set of facial images in the FRGC version 2 is
used for the experiment to evaluate our algorithm. There
are 8104 facial images of 466 subjects containing the vari-
ations of illumination, expression, time and blurring. Since
there are only a few images available for some persons in
this set, a labeled subset is selected to ensure the number
of images of each subject in the selected subset is no less
than 10. Specifically, the first 10 facial images are taken
into the subset if the number of images of this subject is no
less than 10. Thus we get a subset of 3160 facial images
of 316 subjects. Then, the selected labeled subset is further
divided into three subsets: the training set, the probe set
and the gallery set. First, 200 subjects’ images are selected
as labeled samples in the training set, and the remaining
116 subjects’ images are exploited as the gallery set and the
probe set. Second, five facial images of each subject in the
remaining 116 subjects are selected as the gallery set and
the other five images are used as the probe set. The images
in the query set but not in the selected labeled subset
serve as the unlabeled samples in the training set.
Figure 1. Sample facial images in the data set. First row shows samples in the CMU PIE database, and second row shows sample images
in the FRGC database
To reduce variation, the experiment is repeated ten times
with random selection, and the results shown in Fig. 2
are the averages. In Fig. 2, Norm-LDA indicates
the method proposed in this paper, W-Norm-LDA indicates
Norm-LDA with the sample weights discussed
in Section 3, and SDA indicates Cai's method [3].
Cai's method has two parameters that need to be set:
the number of nearest neighbors p used to calculate
the graph matrix, and the coefficient α that controls the
balance between the model complexity and the empirical loss.
Since there is no guide for setting these parameters in [3],
we set them to 3 and 0.01 respectively, which
are the best values found over many repeated
experiments. For comparison, we also show the
result of LDA. We can see that adding unlabeled samples
to the training set does improve the discriminating
capability of the learned subspace, for both our method
and Cai's method. However, compared with Cai's method,
our method achieves a larger improvement
in classification accuracy. Moreover, our method has no
parameters to set, while Cai's method requires setting
several parameters that depend on experience.
Furthermore, the higher performance of W-Norm-LDA compared
with Norm-LDA validates that our strategy of assigning
a different weight to each sample is useful.
5.2. On CMU PIE Database
The CMU PIE database contains 68 subjects with 41,368
facial images under 13 different poses, 43 different
illumination conditions, and with 4 different expressions. The
frontal pose (C27) subset of the CMU PIE database is chosen
for our experiment. It contains facial images with varying
lighting and illumination but fixed pose and expression. In
the experiment, we randomly choose 30 subjects for training.
For each subject, 10 images are randomly selected as
the labeled samples in the training set. The probe set and
the gallery set are selected from the images of 20 subjects
among the remaining subjects in the data set. Specifically,
we first select 10 images of each of the selected 20
persons for the probe set and the gallery set.
Then, five images of each subject are chosen for the probe set
and the other five images of this subject are chosen for the
gallery set. The unlabeled samples in the training set are
selected from the remaining images of this subset (C27).
Finally, we obtain a training set with 300 labeled images from
30 subjects and 907 unlabeled samples, a probe set with 100
images belonging to 20 subjects and a gallery set with 100
images belonging to the same subjects as the images in the
probe set. The set of persons used for training is disjoint from
the set of persons in the gallery and the probe. As in the
experiment performed on the FRGC data set, we
average the results over 10 random splits to reduce variation.
Figure 2. Experimental result on FRGC: accuracy versus number of features for W-Norm-LDA, Norm-LDA, LDA and SDA.
Fig. 3 shows the experimental results on the CMU
PIE database, which are similar to the results on the FRGC
database. It can be seen that the proposed method,
either Norm-LDA or W-Norm-LDA, performs better
than SDA. Meanwhile, W-Norm-LDA outperforms
Norm-LDA in Fig. 3, which further indicates that the
strategy of assigning a different weight to each sample is
useful.
6. Conclusion
This paper proposed a new method called Normalized
LDA for semi-supervised LDA learning, which is motivated