Feature Extraction for Regression Problems and an
Example Application for Pose Estimation of a Face
Nojun Kwak¹, Sang-Il Choi², and Chong-Ho Choi²
¹Division of Electrical & Computer Engineering, Ajou University, San 5, Woncheon-dong,
Yeongtong-gu, Suwon, Gyeonggi-do, 443-749, Korea
nojunk@ieee.org
WWW home page: http://ajou.ac.kr/∼nojunk
²School of Electrical Engineering and Computer Science, Seoul National University,
#047, San 56-1, Sillim-dong, Gwanak-gu, Seoul 151-744, Korea
{karachchoi}@csl.snu.ac.kr
Abstract. In this paper, we propose a new feature extraction method for regression problems. It is a modified version of linear discriminant analysis (LDA), which is a very successful feature extraction method for classification problems. In the proposed method, the between-class and within-class scatter matrices of LDA are modified so that they fit regression problems. The samples with small differences in the target values are used to constitute the within-class scatter matrix, while the ones with large differences in the target values are used for the between-class scatter matrix. We have applied the proposed method to estimating the head pose and compared its performance with that of conventional feature extraction methods.
Key words: Regression, Feature extraction, Dimensionality reduction, LDA.
1 Introduction
Regression, the process of estimating a real-valued function from a finite set of noisy samples, is one of the classical problems in the statistics, machine learning, and pattern recognition communities. Like classification, regression is a supervised learning problem, where a data set consisting of pairs of input objects and desired outputs is given. The input objects and the desired outputs are usually called the input variables and the target variables, respectively.
It is well known that reducing the number of input variables through dimensionality reduction techniques such as feature selection or feature extraction is desirable. Reducing the dimensionality of the feature space may improve the learning process by retaining only the most important components of the data representation, ideally preserving the maximum information of the original data while improving generalization [1]. Dimensionality reduction is desirable not only with respect to the number of required training samples, but also in terms of data storage and computational complexity.
In this paper, we focus on linear feature extraction methods for regression problems to reduce the dimensionality of the input space.
Many studies have addressed the feature extraction problem, among which principal component analysis (PCA) [2] and independent component analysis (ICA) [3] have been widely used. Although PCA is one of the most popular methods and is very useful in reducing the dimension of a feature space to a manageable size, it can still be improved for supervised learning problems, since it is an unsupervised method that does not make use of the target information. Likewise, ICA, another unsupervised method, leaves much room for improvement when used for supervised learning problems. Unlike PCA and ICA, linear discriminant analysis (LDA) [4] was originally developed for supervised learning, especially to find the optimal linear discriminant functions for classification problems.
Although many feature extraction methods have been developed for classification problems, relatively little attention has been given to feature extraction for regression problems in the machine learning community.
In statistics, on the other hand, several algorithms have been developed for dimensionality reduction in regression problems, among which classical multivariate linear regression (MLR) [5] is a starting point. Although MLR is optimal in the least-squares sense, it has the limitation that it can produce only one feature. To overcome this limitation, a local linear dimensionality reduction method based on a nearest-neighbor scheme has been proposed [6]. Sliced inverse regression (SIR) [7] and principal Hessian directions (PHD) [8] are also popular dimensionality reduction techniques for regression problems in statistics.
In this paper, we propose a new feature extraction method for regression problems. It is a generalization of LDA to regression that maximizes the ratio of the scatter of sample pairs with large differences in target value to the scatter of pairs with small differences in target value. The experimental results show that the proposed method performs well for many regression problems. In addition, because it only requires solving an eigenvalue decomposition problem, it is faster than iterative methods such as ICA-FX [9].
The paper is organized as follows. In Section 2, we briefly review conventional feature extraction methods for regression problems. The new feature extraction method is presented in Section 3, and the experimental results are shown in Section 4. Finally, conclusions and future work follow in Section 5.
2 Conventional Methods: Linear Feature Extraction for Regression
Consider a set of predictor/response pairs {(x_i, y_i)}_{i=1}^n, where x_i ∈ ℜ^{d×1}, y_i ∈ ℜ^{t×1}, and n denotes the number of given predictor/response pairs. Here, d is the number of input variables and t is the number of target variables, which is equal to 1 in most problems. Note that the terms predictor/response and input/target are used interchangeably below. From now on, we assume t = 1 and use the scalar form y instead of the vector form y.
In this regression setting, we want to find a set of linear transformations of x that constitute sufficient statistics for the target y. Each transformation can be denoted as f_i = w_i^T x, where f_i is the ith new feature and w_i ∈ ℜ^{d×1} is the corresponding coefficient (weight) vector.
In this section, we introduce several conventional methods for this purpose.
2.1 Sliced Inverse Regression (SIR)
The following is the standard SIR algorithm. For simplicity, let us assume that t = 1 and that the covariance matrix S_x of the input variables x is the d × d identity matrix.

Step 1. Sort the data y_i in increasing order.
Step 2. Divide the ordered data set into L slices so that the slice sizes are as equal as possible. Let n_l be the number of examples in slice l.
Step 3. Within each slice, compute the sample mean of x: x̄_l = (1/n_l) Σ_{i∈slice l} x_i.
Step 4. Compute the covariance matrix of the slice means of x, weighted by the slice sizes:

    S_η = (1/n) Σ_{l=1}^{L} n_l (x̄_l − x̄)(x̄_l − x̄)^T    (1)

Here, x̄ denotes the sample mean of x, i.e., x̄ = (1/n) Σ_{i=1}^{n} x_i.
Step 5. Find the kth SIR direction w_k by the eigenvalue decomposition of S_η:

    S_η w_k = λ_k w_k,    λ_1 ≥ λ_2 ≥ ··· ≥ λ_d    (2)
Note the similarity of SIR to PCA. SIR takes L points, each of which is the sample mean of the n_l points in slice l, and then performs PCA on these L points. The difference is that, in generating the L points, the x's associated with similar y values are averaged, which captures the relationship between the input x and the target y.
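The five steps above can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code; the function name is ours, the whitening assumption (S_x = I, as stated above) is taken literally, and the linear target used in the demo anticipates the experiment of Section 4.1.

```python
import numpy as np

def sir(X, y, L=10, k=1):
    """Sliced inverse regression: return the top-k SIR directions.

    Assumes X has (approximately) identity covariance, as in the text;
    otherwise whiten X first.
    """
    n, d = X.shape
    order = np.argsort(y)                       # Step 1: sort by y
    slices = np.array_split(order, L)           # Step 2: L (near-)equal slices
    xbar = X.mean(axis=0)
    S_eta = np.zeros((d, d))
    for sl in slices:                           # Steps 3-4: weighted covariance
        m = X[sl].mean(axis=0)                  #           of the slice means
        S_eta += len(sl) * np.outer(m - xbar, m - xbar)
    S_eta /= n
    vals, vecs = np.linalg.eigh(S_eta)          # Step 5: eigen-decomposition
    return vecs[:, np.argsort(vals)[::-1][:k]]  # columns = leading directions

# demo: linear target y = 2*x1 + x2 from Section 4.1
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
y = 2 * X[:, 0] + X[:, 1]
w = sir(X, y, L=10)[:, 0]
```

For this noiseless linear target the leading SIR direction should align closely with [2, 1]/√5 ≈ [0.89, 0.45]^T, the value reported later in the paper.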
2.2 Principal Hessian Directions (PHD)
As in SIR, let us assume that t = 1, and let f(x) be the regression function E(Y|x), where E(·) denotes expectation. Consider the Hessian matrix H(x) of f(x), whose (i,j) component is

    H_{ij}(x) = ∂²f(x) / (∂x_i ∂x_j),    (3)

where x_k denotes the kth component of the vector x.

Hessian matrices are important in studying multivariate nonlinear functions, and PHD exploits their properties for dimensionality reduction. In the PHD algorithm, the principal Hessian directions w_k (k = 1,··· ,d) are obtained by solving the following eigenvalue decomposition problem:

    S_yxx w_k = λ_k w_k,    λ_1 ≥ λ_2 ≥ ··· ≥ λ_d,    (4)
where S_yxx can be estimated by

    S_yxx = (1/n) Σ_{i=1}^{n} (y_i − ȳ)(x_i − x̄)(x_i − x̄)^T.    (5)

Because PHD is based on the Hessian matrix, it performs poorly on problems where the targets are linearly related to the input variables.
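PHD as defined by (4)-(5) can be sketched similarly. Again this is an illustrative NumPy version with our own naming; sorting the eigenvalues by absolute value is a common implementation choice (not spelled out in the text), since Hessian eigenvalues can be negative. The quadratic demo target anticipates the example of Section 4.1.

```python
import numpy as np

def phd(X, y, k=1):
    """Principal Hessian directions via the moment matrix S_yxx of (5).

    Assumes X is (approximately) whitened, as in the text.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # S_yxx = (1/n) * sum_i (y_i - ybar)(x_i - xbar)(x_i - xbar)^T
    S_yxx = (Xc * yc[:, None]).T @ Xc / len(y)
    vals, vecs = np.linalg.eigh(S_yxx)
    idx = np.argsort(np.abs(vals))[::-1][:k]   # order by |eigenvalue|
    return vecs[:, idx]

# demo: quadratic target y = 4*(x1 - 2*x2)^2 + (2*x1 + x2)^2 from Section 4.1
rng = np.random.default_rng(0)
X = rng.standard_normal((50000, 2))
y = 4 * (X[:, 0] - 2 * X[:, 1]) ** 2 + (2 * X[:, 0] + X[:, 1]) ** 2
w = phd(X, y)[:, 0]
```

For this target the population S_yxx equals the (constant) Hessian of the regression function, whose principal direction is proportional to [1, −2]^T, so the empirical w should come out near ±[0.45, −0.89]^T.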
2.3 Linear Discriminant Analysis (LDA)
Unlike the methods previously described in this section, LDA focuses on the classification problem, where instead of a continuous target variable y, a discrete class identifier c ∈ {1,··· ,N_c} is used. Here, N_c is the number of classes.

In LDA, we optimize Fisher's criterion, i.e., the ratio of the between-class covariance matrix S_b = (1/n) Σ_{c=1}^{N_c} n_c (x̄_c − x̄)(x̄_c − x̄)^T to the within-class covariance matrix S_w = (1/n) Σ_{c=1}^{N_c} Σ_{i∈{class=c}} (x_i − x̄_c)(x_i − x̄_c)^T:

    W = argmax_W (W^T S_b W) / (W^T S_w W)    (6)

Here, x̄ = (1/n) Σ_{i=1}^{n} x_i is the total mean of the samples, n_c is the number of samples belonging to class c, and x̄_c = (1/n_c) Σ_{i∈{class=c}} x_i is the mean of the samples belonging to class c.

The optimization problem in (6) is equivalent to the following generalized eigenvalue problem:

    S_b w_k = λ_k S_w w_k,    λ_1 ≥ λ_2 ≥ ··· ≥ λ_d,    (7)

where w_1 is the most discriminant component, w_2 the second, and so on.
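Criterion (6) and eigenproblem (7) can be sketched as follows. The implementation is a generic illustration: the small ridge term `reg` guarding against a singular S_w is our addition, not part of the text, and the demo uses a synthetic two-class problem whose optimal direction, Σ^{-1}(μ_1 − μ_0), is known in closed form.

```python
import numpy as np

def lda(X, c, k=1, reg=1e-8):
    """Fisher LDA directions from the generalized eigenproblem S_b w = lambda S_w w."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for cls in np.unique(c):
        Xc = X[c == cls]
        m = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(m - xbar, m - xbar)   # between-class scatter
        Sw += (Xc - m).T @ (Xc - m)                    # within-class scatter
    Sb /= n
    Sw = Sw / n + reg * np.eye(d)                      # ridge keeps S_w invertible
    # solve S_w^{-1} S_b w = lambda w and keep the leading eigenvectors
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    idx = np.argsort(vals.real)[::-1][:k]
    return vecs[:, idx].real

# demo: two Gaussians with shared covariance diag(1, 4), means (0,0) and (2,2);
# the optimal direction is proportional to Sigma^{-1}(mu1 - mu0) = (2, 0.5)
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 2)) * np.array([1.0, 2.0])
B = A[1000:] + np.array([2.0, 2.0])
X = np.vstack([A[:1000], B])
c = np.repeat([0, 1], 1000)
w = lda(X, c)[:, 0]
```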
3 The Proposed Method: LDA for Regression
For classification problems, LDA has been a very successful dimensionality reduction method, and many variants of it have been developed. As described in the previous section, the gist of LDA lies in maximizing Fisher's criterion, which maximizes the between-class scatter while minimizing the within-class scatter.

In this section, we extend this idea to regression problems and propose a new feature extraction algorithm for regression. From now on, the new method will be referred to as LDAr.
Unlike classification problems, it is difficult to define the between-class and within-class scatter matrices in regression problems because the target variable is continuous. We therefore use a simple idea to define them: samples with small differences in the target values are regarded as belonging to the same class, while those with large differences are regarded as belonging to different classes. The modified within-class and between-class scatter matrices for LDAr are

    S_wr = (1/n_w) Σ_{(i,j)∈A_w} f(y_i − y_j)(x_i − x_j)(x_i − x_j)^T    (8)

    S_br = (1/n_b) Σ_{(i,j)∈A_b} f(y_i − y_j)(x_i − x_j)(x_i − x_j)^T.    (9)

Here, A_w = {(i,j) : |y_i − y_j| < τ, i,j ∈ {1,··· ,n}, i ≠ j}, A_b = {(i,j) : |y_i − y_j| ≥ τ, i,j ∈ {1,··· ,n}, i ≠ j}, n_w = |A_w|, and n_b = |A_b|. The function f(·) is a weight function taking positive values. Note that n_w + n_b = n(n−1)/2.

Using these modified scatter matrices, Fisher's criterion can be rewritten for regression problems as

    W = argmax_W (W^T S_br W) / (W^T S_wr W).    (10)

As stated earlier, maximizing this criterion is equivalent to solving the generalized eigenvalue problem

    S_br w_k = λ_k S_wr w_k,    λ_1 ≥ λ_2 ≥ ··· ≥ λ_d,    (11)

which is in turn equivalent to the following eigenvalue decomposition problem:

    S_wr^{−1} S_br w_k = λ_k w_k,    λ_1 ≥ λ_2 ≥ ··· ≥ λ_d,    (12)

where w_1 is the most important component, w_2 the second, and so on.
In modifying LDA for regression problems, we could instead have segmented the given dataset into several virtual classes based on the target values with fixed boundaries and applied conventional LDA for classification. Although this approach is simple, its results can depend heavily on how the boundaries are set and on the number of virtual classes. In addition, it may not take into account the different levels of similarity among classes. Therefore, in LDAr, soft boundaries that differ from one sample pair to another are used.

Note that the threshold parameter τ plays an important role in setting the boundary. If τ is small, n_w becomes small while n_b becomes large, and vice versa. The threshold τ can be expressed as a multiple of the standard deviation σ_y of the target variable y, i.e., τ = ασ_y. A typical range for α is 0.1 to 1.0.

Although the weight function f(·) can be set to a constant, e.g., f(x) = 1, it is usually better to let f(x) vary with its input. Because |y_i − y_j| = τ is the boundary that determines whether the pair (i,j) belongs to A_w or A_b, the effect of pairs near this boundary can be reduced by setting f(x) ≃ 0 for x ≃ τ. Typical examples of f(·) fulfilling this requirement are f(x) = |x − τ| and f(x) = √|x − τ|.

Note that, as with LDA, LDAr is not invariant to transformations of the input features and is susceptible to their scaling. Therefore, it is desirable to preprocess the given dataset by applying PCA, which is often called the sphering process [2].
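Putting (8)-(12) together with τ = ασ_y and f(x) = √|x − τ|, a compact NumPy sketch of LDAr might look like the following. This is our illustrative reading of the method, not the authors' code: it enumerates all n(n−1)/2 pairs at once, assumes the input has already been sphered, and adds a small ridge term (our addition) to keep S_wr invertible.

```python
import numpy as np

def ldar(X, y, alpha=0.3, k=1, reg=1e-8):
    """LDAr directions from the generalized eigenproblem S_br w = lambda S_wr w."""
    n, d = X.shape
    tau = alpha * np.std(y)                    # threshold tau = alpha * sigma_y
    i, j = np.triu_indices(n, k=1)             # all n(n-1)/2 unordered pairs
    dy = np.abs(y[i] - y[j])
    dX = X[i] - X[j]
    f = np.sqrt(np.abs(dy - tau))              # weight ~ 0 near the boundary dy = tau
    W = f[:, None] * dX
    within = dy < tau                          # pair sets A_w (True) and A_b (False)
    Swr = W[within].T @ dX[within] / max(int(within.sum()), 1)
    Sbr = W[~within].T @ dX[~within] / max(int((~within).sum()), 1)
    Swr += reg * np.eye(d)
    # S_wr^{-1} S_br w = lambda w, keep the leading eigenvectors
    vals, vecs = np.linalg.eig(np.linalg.solve(Swr, Sbr))
    idx = np.argsort(vals.real)[::-1][:k]
    return vecs[:, idx].real

# demo: linear target from Section 4.1; X ~ N(0, I) is already sphered
rng = np.random.default_rng(0)
X = rng.standard_normal((800, 2))
y = 2 * X[:, 0] + X[:, 1]
w = ldar(X, y, alpha=0.3)[:, 0]
```

On this linear target the leading direction comes out close to [2, 1]/√5 ≈ [0.89, 0.45]^T, in line with the value reported in Section 4.1.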
The computational complexity of LDAr can be decomposed into two parts. The first part is obtaining the scatter matrices in (8) and (9), which is proportional to the square of the number of examples, i.e., O(n²). The second part is related
Fig. 1. One thousand random points drawn from N(0, I_2): (a) linear target; (b) quadratic target. The slanted lines and ellipses in red are contour maps connecting the points that have the same y value.
to solving the eigenvalue decomposition problem in (11), which is typically proportional to the cube of the input dimension, i.e., O(d³).

Comparing this with the complexity of LDA: since the second part is common to LDA and LDAr, LDAr is somewhat more computationally expensive than LDA, which requires only O(n) operations to obtain its scatter matrices. For large n, however, a subset of the samples can be used in computing the scatter matrices to reduce the computational cost.
4 Experimental Results

4.1 Linear and Quadratic Targets
Consider two independent input features x_1 and x_2, each normally distributed with zero mean and unit variance. In addition, suppose the target variable y has one of the following relationships with the input x:

    Linear:      y = 2x_1 + x_2    (13)
    Quadratic:   y = 4(x_1 − 2x_2)² + (2x_1 + x_2)².    (14)
In Fig. 1(a) and (b), we have plotted 1,000 samples for each case. In each figure, a contour map is drawn in red connecting the points that have the same y value (slanted lines for the linear case, ellipses for the quadratic case). To these empirical data, we applied SIR, PHD, and LDAr.
Linear target: For the linear case, the optimal feature is f = 2x_1 + x_2, which corresponds to the optimal weight vector w* = [2, 1]^T.
Considering that the area between neighboring slanted lines can be regarded as a slice in SIR, there will be significant differences among the mean values x̄_l (l = 1,··· ,L) of the slices, and we expect SIR to work well for this problem. As expected, SIR produced w = [0.89, 0.45]^T, which is very close to the optimal value w*. The number of slices was set to L = 10 in this case.
Regarding PHD, because y is linear with respect to x, all the elements of the Hessian matrix are zero, so we can expect that PHD cannot solve this problem. As a matter of fact, for the empirical data shown in Fig. 1(a), PHD produced w = [0.88, −0.51]^T, which is far from w*.
The reason PHD fails on this problem lies in the form of its weight function. In PHD, the weight is simply the deviation from the target mean ȳ. Therefore, the points in the lower-left part of Fig. 1(a) have negative weights (y_i − ȳ < 0), while the points in the upper-right part have positive weights (y_j − ȳ > 0). As a result, the contributions of any two points that are symmetric with respect to the center cancel each other in the formation of S_yxx, so the eigenvalues of S_yxx become very small, resulting in the poor performance of PHD.
For this example, LDAr was also applied, with weight function f(x) = √|x − τ| and α = 0.3. LDAr resulted in w = [0.89, 0.45]^T, which is very close to the optimal weight. Note that in LDAr the scatter matrices are all positive semidefinite.
Quadratic target: As shown in Fig. 1(b), for a fixed y, (x_1, x_2) lies on an ellipse whose major axis is in the direction of (2, 1) and whose minor axis is in the direction of (−1, 2). If we are to extract only one feature from the set of linear combinations of the input variables x_1 and x_2, the direction in which y varies fastest is the best projection; it corresponds to the feature f = x_1 − 2x_2, i.e., w* = [1, −2]^T.
As expected, SIR does not work well for this example because the mean values of all the slices are near zero, so a direction highly dependent on the specific data will be chosen essentially at random. For the empirical data shown in Fig. 1(b), SIR with L = 10 extracted the first weight vector w = [−0.84, 0.52]^T, which is far from the optimal value w* = [1, −2]^T.
Unlike SIR, PHD works well for this problem because y is quadratic with respect to x, so the principal Hessian directions are easily calculated. Expanding (14) gives y = 8x_1² − 12x_1x_2 + 17x_2², so the Hessian matrix is constant:

    H = [ 16  −12
         −12   34 ],

and the principal Hessian direction, the eigenvector with the largest absolute eigenvalue, is [1, −2]^T, as expected. For the empirical data shown in Fig. 1(b), the PHD algorithm resulted in w = [0.44, −0.90]^T, which is very close to the optimal value.

For this example, LDAr was also applied with weight function f(x) = √|x − τ| and α = 0.3. LDAr resulted in w = [0.44, −0.90]^T, which matches the optimal direction.
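This eigen-structure is easy to verify numerically (a quick sanity check we add here, not part of the original paper):

```python
import numpy as np

# Hessian of y = 4*(x1 - 2*x2)**2 + (2*x1 + x2)**2
H = np.array([[16.0, -12.0],
              [-12.0, 34.0]])
vals, vecs = np.linalg.eigh(H)
w = vecs[:, np.argmax(vals)]   # principal Hessian direction
# w is proportional to [1, -2]/sqrt(5), i.e. the optimal projection
```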
4.2 Pose Estimation
In this part, the proposed algorithm is applied to a pose estimation problem, treated as a regression problem, and is compared with several conventional methods.
In face recognition systems, pose variation in a face image significantly degrades the accuracy of face recognition. Therefore, it is important to estimate the pose of a face image and classify it into the correct pose class before the recognition procedure.

Fig. 2. Edge images for different poses (c22, c02, c05, c27, c29, c14, c34): (a) images under various poses; (b) corresponding edge images.

Given face images with pose variation, an image can be assigned to a pose class by a classification method using extracted features.
However, unlike general classification problems, pose classes can be ordered from left profile to right profile in the pose space, so there is an order relationship between classes. This relationship can be represented as a distance, and the distance between classes can be used as a measure of class similarity. For example, consider a pose estimation problem with three pose classes: 'front (0°)', 'half profile (45°)', and 'profile (90°)'. In this problem, 'profile' images are closer to 'half profile' images than to 'front' images. If a classifier misclassifies a 'profile' image, it is better to classify it as 'half profile' than as 'front'. Thus, we can make use of the order relationship between classes for feature extraction. In this sense, these types of classification problems are similar to regression problems. If each pose class is assigned a numerical target value, the pose estimation problem may be regarded as a regression problem, and the regression feature extraction methods can be used to extract features that discriminate the pose of a face image.
We evaluate pose estimation performance on the CMU-PIE database [10]. The CMU-PIE database contains more than 40,000 facial images of 68 individuals under 21 illumination conditions, 12 poses, and four different expressions. Among them, we selected the images of 65 individuals with seven pose indices (c22, c02, c05, c27, c29, c14, c34). Each face was cropped to include only the face, rotated on the basis of the distances among manually selected points on the image, and then rescaled to a size of 120 × 100 (see Fig. 2(a)). Three images under different illumination conditions for each of the 65 individuals in each pose class were used as the training set, while the remaining 8,190 (65 × 18 × 7) images were used for testing. We first divided the pose space into seven pose classes from left profile to right profile and built a feature space for each pose class using the feature extraction methods explained in the previous section. To estimate the pose of a face image, each of the seven pose classes was assigned a numerical target value from 1 (left profile) to 7 (right profile).
Table 1. Error rates in pose classification on face images (%)

Method      | c22    c02    c05    c27   c29   c14   c34   | Overall
PHD (1200)  | 28.80  44.62  28.89  1.37  1.88  5.98   3.76 | 12.36
SIR (1200)  | 29.74  44.87  27.95  1.71  2.22  7.61   3.25 | 16.76
LDA (6)     |  9.66   0      0     4.53  9.49  8.38  12.48 |  6.34
LDAr (200)  |  7.61   0.09   0     2.56  2.82  4.87   7.18 |  3.59
Table 2. Error rates in pose classification on edge images (%)

Method      | c22   c02   c05   c27   c29   c14   c34  | Overall
PHD (1200)  | 9.91  5.04  1.97  2.65  2.65  5.73  4.87 | 4.69
SIR (1200)  | 9.32  4.87  1.97  2.65  2.65  5.38  4.44 | 4.47
LDA (6)     | 1.03  1.03  0.17  0.26  0.26  1.97  2.56 | 1.04
LDAr (200)  | 0.94  0.94  0     0.35  0.09  1.03  3.23 | 0.80
In the experiment below, each pixel was used as an input feature, constituting a 12,000-dimensional input space, and the methods presented in the previous section were used to extract features for estimating the pose. As can be seen, this problem is a typical example of the small sample size (SSS) problem, in which the input dimension d (12,000) is much larger than the number of training examples n (1,365). To resolve this SSS problem, in all the feature extraction methods we preprocessed the dataset with PCA and reduced the dimension of the input space to n − 1. For the proposed method, the weight function was f(x) = √|x − τ| and α was set to 0.1. With these extracted features, the one-nearest-neighbor rule with the Euclidean (L2) distance was used as the classifier.
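The evaluation pipeline described above (PCA projection, feature extraction, then one-nearest-neighbor classification with the L2 distance) can be sketched generically. Since the CMU-PIE images cannot be reproduced here, the demo below runs on random placeholder data with a planted pose signal; all names, sizes, and the placeholder data are our own, and the feature-extraction step in the middle is elided (the PCA features are fed to 1-NN directly).

```python
import numpy as np

def pca_reduce(X_train, X_test, k):
    """Project train/test data onto the top-k principal components of the training set."""
    mu = X_train.mean(axis=0)
    # economy SVD: rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[:k].T
    return (X_train - mu) @ P, (X_test - mu) @ P

def one_nn(F_train, labels, F_test):
    """1-nearest-neighbor classification with the Euclidean (L2) distance."""
    d2 = ((F_test[:, None, :] - F_train[None, :, :]) ** 2).sum(axis=2)
    return labels[np.argmin(d2, axis=1)]

# placeholder stand-in for the pose data: 7 pose classes, tiny "images"
rng = np.random.default_rng(0)
n_train, n_test, dim = 140, 70, 50
train_pose = np.repeat(np.arange(1, 8), n_train // 7)   # targets 1 .. 7
test_pose = np.repeat(np.arange(1, 8), n_test // 7)
mu_dir = 2 * rng.standard_normal(dim)                   # planted pose direction
X_train = rng.standard_normal((n_train, dim)) + train_pose[:, None] * mu_dir
X_test = rng.standard_normal((n_test, dim)) + test_pose[:, None] * mu_dir

k = min(n_train - 1, dim)                               # "reduce to n-1" when d > n
F_train, F_test = pca_reduce(X_train, X_test, k)
pred = one_nn(F_train, train_pose, F_test)
err = float((pred != test_pose).mean())
```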
Table 1 shows the error rates of pose classification for the test images using the various methods; the numbers in parentheses are the numbers of extracted features. As can be seen in Table 1, the proposed method is better than the other methods in most cases. The overall error rates of PHD and SIR (L = 10) are above 12%, while LDA gives an overall error rate of 6.34%. However, since pose estimation is a classification problem in which levels of similarity among different classes can be defined, LDAr is more suitable for this problem than LDA, and the overall error rate of LDAr is 2.75 percentage points lower than that of LDA.
On the other hand, images such as those in Fig. 2(a) contain not only the information necessary for pose estimation but also other information such as the illumination condition, appearance variation, etc. To remove information that is redundant for pose estimation, we transform each face image into an edge image using the Sobel mask [11]. As shown in Fig. 2(b), the edge images enhance the geometrical distribution of facial feature points. Even though edge images may be sensitive to illumination variation, pose estimation can be performed reliably under illumination variation if the training set contains edge images under various illumination conditions. As can be seen in Table 2, the overall error rates are accordingly lower than those in Table 1. With edge images, the performance differences among the feature extraction methods became smaller compared to the raw images, but the performance of LDAr is still better than that of the other methods.
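The Sobel transform used for the edge images is standard; below is a dependency-free sketch (our illustration, not the authors' preprocessing code) that computes the gradient magnitude with the two 3×3 Sobel masks and zero padding at the border.

```python
import numpy as np

KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)   # horizontal-gradient mask
KY = KX.T                                  # vertical-gradient mask

def sobel_edges(img):
    """Gradient-magnitude edge image via 3x3 Sobel masks (zero padding)."""
    p = np.pad(img.astype(float), 1)
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for di in range(3):                    # cross-correlate with both masks
        for dj in range(3):
            patch = p[di:di + h, dj:dj + w]
            gx += KX[di, dj] * patch
            gy += KY[di, dj] * patch
    return np.hypot(gx, gy)

# a vertical step edge: the response concentrates on the columns at the step
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
```

On this step edge the two columns adjacent to the step light up while flat interior regions stay at zero, which is the kind of geometric structure Fig. 2(b) illustrates.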
5 Conclusions
In this paper, we have proposed a new linear feature extraction method for regression problems. It is a modified version of LDA: the distance information among samples is utilized in constructing the within-class and between-class scatter matrices.

The two examples in Section 4.1 show the advantage of the proposed method over conventional methods such as SIR and PHD. It performed well on both examples, while SIR and PHD each performed poorly on one of them. We also applied the proposed method to estimating the head pose of a face image and compared its performance with that of conventional feature extraction methods.

The experimental results in pose estimation show that the proposed method produces better features than conventional methods such as SIR, PHD, and LDA. The proposed method is easy to implement and is expected to be useful in finding good linear transformations for regression problems.
Acknowledgments. This work was partly supported by Samsung Electronics.
References

1. Cios, K.J., Pedrycz, W., Swiniarski, R.W.: Data Mining Methods for Knowledge Discovery, Chapter 9. Kluwer Academic Publishers (1998)
2. Jolliffe, I.T.: Principal Component Analysis. Springer-Verlag (1986)
3. Bell, A.J., Sejnowski, T.J.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7 (1995) 1129–1159
4. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, New York (1990)
5. Weisberg, S.: Applied Linear Regression, Chapter 3. 2nd ed. John Wiley, New York (1985)
6. Loog, M.: Supervised Dimensionality Reduction and Contextual Pattern Recognition in Medical Image Processing, Chapter 3. Ponsen & Looijen, Wageningen, The Netherlands (2004)
7. Li, K.C.: Sliced Inverse Regression for Dimension Reduction (with discussion). J. American Statistical Association 86 (1991) 316–342
8. Li, K.C.: On Principal Hessian Directions for Data Visualization and Dimension Reduction: Another Application of Stein's Lemma. J. American Statistical Association 87 (1992) 1025–1039
9. Kwak, N., Kim, C.: Dimensionality Reduction Based on ICA for Regression Problems. Proc. Int'l Conf. on Artificial Neural Networks (ICANN) (2006) 1–10
10. Sim, T., Baker, S., Bsat, M.: The CMU Pose, Illumination, and Expression Database. IEEE Trans. Pattern Analysis and Machine Intelligence 25 (2003) 1615–1618
11. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose. IEEE Trans. Pattern Analysis and Machine Intelligence 23 (2001) 643–660