k-Means Discriminant Maps for Data Visualization and
Classification
Vo Dinh Minh Nhat
Ubiquitous Computing Lab
Kyung Hee University
Suwon, Korea
vdmnhat@uclab.khu.ac.kr
SungYoung Lee
Ubiquitous Computing Lab
Kyung Hee University
Suwon, Korea
sylee@uclab.khu.ac.kr
ABSTRACT
Over the years, many dimensionality reduction algorithms have been proposed for learning the structure of high-dimensional data by linearly or non-linearly transforming it into a low-dimensional space. Some techniques can keep the local structure of the data, while others try to preserve the global structure. In this paper, we propose a linear dimensionality reduction technique that characterizes both the local and the global properties of the data by first applying the k-means algorithm to the original data, and then finding the projection that simultaneously globally maximizes the between-cluster scatter and locally minimizes the within-cluster scatter, thereby preserving both the local and global structure of the data. Low complexity and structure preservation are the two main advantages of the proposed technique. Experiments on both artificial and real data sets show the effectiveness and novelty of the proposed algorithm for visualization and classification tasks.
Categories and Subject Descriptors
I.5.2 [Computing Methodologies]: Pattern Recognition—
Design Methodology.
General Terms
Algorithms, Design, Experimentation, Performance, The-
ory.
Keywords
Dimensionality Reduction, k-Means, Manifold Learning, Lin-
ear Discriminant Analysis.
1. INTRODUCTION
The purpose of dimensionality reduction is to transform
high dimensional data into a low-dimensional space, while
retaining most of the underlying structure in the data. The
reason for using dimensionality reduction is based on the fact
that some features may be irrelevant and that the "intrinsic" dimensionality of the data may be smaller than the number of features.
Dimensionality reduction can also be used to visualize high
dimensional data by transforming the data into two or three
dimensions, thereby giving additional insight into the prob-
lem at hand. With the rapidly increasing demand on dimen-
sionality reduction techniques, it is not surprising to see an
overwhelming amount of research publications on this topic
in recent years. In general, there are linear and nonlinear
dimensionality reduction techniques. Linear dimensionality
reduction methods include Principal Component Analysis
(PCA) developed by Pearson (1901) and Hotelling (1933)
[6][3], and Multi-Dimensional Scaling (MDS) by Torgerson
(1952) and Shepard (1962) [10][8]. While PCA finds a low-
dimensional embedding of the data points that best pre-
serves their variance as measured in the high-dimensional
input space, MDS finds an embedding that preserves the
inter-point distances, which is equivalent to PCA when the
distances are Euclidean. Besides linear methods, a number of non-linear dimensionality reduction techniques have also been developed. Kernel PCA (KPCA) [4] maps the inputs nonlinearly into a new space and then performs PCA there. Laplacian Eigenmaps (LE) [1] preserve nearness relations as encoded by the graph Laplacian. ISOMAP [9] assumes that the data lie on a (Riemannian) manifold and maps the data to a low-dimensional representation such that the geodesic distance between two data points is as close as possible to the Euclidean distance between the two corresponding points in the low-dimensional space. Diffusion Maps (DM) [5] are based on defining a Markov random walk on the graph of the data; in the low-dimensional representation, the pairwise diffusion distances are retained as well as possible. Locally Linear Embedding (LLE) [7] maps its inputs into a single global coordinate system of lower dimensionality by computing a low-dimensional, neighborhood-preserving embedding of the high-dimensional inputs, and its optimization does not involve local minima; it effectively recovers global nonlinear structure from locally linear fits. Due to space constraints, other techniques, most of which are variants of those reviewed above, are not covered here. In this paper, we propose a linear dimensionality reduction technique called k-Means Discriminant Maps (kDM). The algorithm first applies k-means to cluster the original data and then, in order to keep both the local and global structure of the data, finds a projection that simultaneously minimizes the within-cluster scatter and maximizes the between-cluster scatter. The main contributions of the proposed algorithm are its low complexity, owing to its linearity, and its ability to preserve both the local and global structure of the data. The outline of this paper is as follows. The proposed method is described in Section 2. In Section 3, experiments are performed on both artificial and real data sets to demonstrate the effectiveness of our method. Finally, conclusions are presented in Section 4.
2. K-MEANS DISCRIMINANT MAPS
The dimensionality reduction problem is: given a data set $\{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^n$, find a set of points $\{y_1, y_2, \ldots, y_N\}$, where $y_i \in \mathbb{R}^m$ and $m < n$, such that each $y_i$ "represents" its counterpart $x_i$. For convenience of presentation, we denote the matrix $X = [x_1, x_2, \ldots, x_N]$ and correspondingly the matrix $Y = [y_1, y_2, \ldots, y_N]$. In this section, our emphasis is on the description of the proposed algorithm; due to the paper length, the reader is referred to the respective literature for the previously mentioned dimensionality reduction techniques.
2.1 k-Means
The objective k-means tries to achieve is to minimize the total intra-cluster variance, or the squared error function

$$f = \sum_{i=1}^{k} \sum_{x_j \in \Pi_i} \| x_j - \mu_i \|^2 \qquad (1)$$

where there are $k$ clusters $\Pi_i$, $i = 1, 2, \ldots, k$, and $\mu_i$ is the centroid, or mean point, of all the points $x_j \in \Pi_i$.
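For concreteness, below is a minimal sketch of the standard Lloyd iteration that minimizes the objective in (1); the function name, random initialization, and stopping criterion are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X has shape (N, n); returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assign each sample to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned samples
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```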
2.2 k-Means Discriminant Maps - kDM
Assume that after clustering the data with the k-means algorithm, each data sample belongs to one of the $C$ clusters $\{\Pi_1, \Pi_2, \ldots, \Pi_C\}$. Let $N_i$ be the number of samples in cluster $\Pi_i$ ($i = 1, 2, \ldots, C$) and let $\mu_i = \frac{1}{N_i} \sum_{x \in \Pi_i} x$ be the mean of the samples, i.e. the centroid, of cluster $\Pi_i$. Then we define the between-cluster scatter matrix $S_b$ and the within-cluster scatter matrix $S_w$ as follows:

$$S_b = \frac{1}{N} \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (2)$$

$$S_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{x_k \in \Pi_i} (x_k - \mu_i)(x_k - \mu_i)^T \qquad (3)$$

where $\mu$ is the mean of all $N$ samples.
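As an illustration of equations (2) and (3), the scatter matrices can be accumulated directly from the k-means labels; the sketch below assumes samples are stored as rows of `X`, so each outer product is formed from a transposed (column) vector.

```python
import numpy as np

def cluster_scatter_matrices(X, labels):
    """Compute S_b and S_w of equations (2) and (3); X has shape (N, n)."""
    N, n = X.shape
    mu = X.mean(axis=0)                     # global mean of all samples
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for i in np.unique(labels):
        Xi = X[labels == i]                 # samples in cluster Pi_i
        Ni = len(Xi)
        mu_i = Xi.mean(axis=0)              # cluster centroid
        d = (mu_i - mu)[:, None]
        Sb += Ni * (d @ d.T)                # between-cluster contribution
        D = (Xi - mu_i).T                   # n x Ni matrix of centered samples
        Sw += D @ D.T                       # within-cluster contribution
    return Sb / N, Sw / N
```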
For the purpose of keeping both the local and global structure of the data, we try to find a projection that draws close samples (ones in the same cluster) closer together while simultaneously pushing distant samples (ones from different clusters) even further apart. From this point of view, a desirable projection is one that, at the same time, minimizes the within-cluster scatter and maximizes the between-cluster scatter. So in kDM, the projection $W_{opt}$ is chosen to maximize the ratio of the determinant of the between-cluster scatter matrix of the projected samples to the determinant of the within-cluster scatter matrix of the projected samples, i.e.,

$$J(w) = \arg\max_{w} \frac{w^T S_b w}{w^T S_w w} \qquad (4)$$

From the criterion in (4), we can find the projection by simultaneously globally maximizing the between-cluster scatter and locally minimizing the within-cluster scatter, which keeps both the local and global structure of the data.
Technique        Parameter Settings
PCA              None
Kernel PCA       $\kappa = (XX^T + 1)^3$
Diffusion Maps   $\sigma = 1$
LLE              $k = 12$
LE               $k = 12$, $\sigma = 1$
ISOMAP           $k = 12$
kDM              $k = 3$

Table 1: Parameter settings for the experiments
It is also easy to see that criterion (4) is formally similar to the Fisher criterion, since both are Rayleigh quotients. However, in kDM we form the between-cluster scatter matrix $S_b$ and the within-cluster scatter matrix $S_w$ without knowing the class labels of the samples. This means the Fisher discriminant projection is supervised, while the projection determined by kDM is obtained in an unsupervised manner. The optimal projection for kDM is $W = [w_1\, w_2\, \ldots\, w_m]$, where $\{w_i \mid i = 1, 2, \ldots, m\}$ is the set of generalized eigenvectors of $S_b$ and $S_w$ corresponding to the $m$ largest generalized eigenvalues $\{\lambda_i \mid i = 1, 2, \ldots, m\}$, i.e.,

$$S_b w_i = \lambda_i S_w w_i \;\Leftrightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i, \qquad i = 1, 2, \ldots, m \qquad (5)$$
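When $S_w$ is nonsingular, the generalized eigenproblem in (5) can be solved with a standard symmetric-definite eigensolver. A minimal sketch, assuming SciPy is available and that `Sb` and `Sw` come from the previous sketch, is:

```python
import numpy as np
from scipy.linalg import eigh

def kdm_projection(Sb, Sw, m):
    """Solve S_b w = lambda S_w w and keep the m leading generalized eigenvectors."""
    # eigh solves the symmetric-definite generalized problem; eigenvalues come back ascending
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
    W = eigvecs[:, order[:m]]               # n x m projection matrix
    return W

# low-dimensional representation: Y = W^T X for column-sample data,
# or X @ W when samples are stored as rows
```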
However, in some cases the dimension of the sample space is larger than the number of samples. As a consequence, $S_w$ is singular. This is known as the "small sample size (3S) problem" [2]. To solve it, we use the strategy of Direct-LDA [11] to implement our kDM algorithm in the 3S case. The key idea of DLDA is to discard the null space of $S_b$, which contains no useful information, rather than discarding the null space of $S_w$, which contains the most discriminative information. $S_b$ is first diagonalized as $S_b = U \Lambda U^T$, where $U \in \mathbb{R}^{n \times (C-1)}$ is a matrix whose columns are eigenvectors of $S_b$, $C$ is the number of classes (equivalent to the number of clusters $k$ in the k-means step of kDM; we use $k$ and $C$ interchangeably), and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues. The projected within-cluster scatter matrix is formed as

$$\tilde{S}_w = \Lambda^{-1/2} U^T S_w U \Lambda^{-1/2} \qquad (6)$$

Let $\tilde{W}_w = [w_1\, w_2\, \ldots\, w_{C-1}]$, where $\{w_i \mid i = 1, 2, \ldots, C-1\}$ is the set of eigenvectors of $\tilde{S}_w$. Then the optimal projection for DLDA is $W_{opt} = U \Lambda^{-1/2} \tilde{W}_w$.
3. EXPERIMENTS
In this section, systematic empirical experiments on the performance of the previously reviewed techniques and our proposed technique kDM are presented. We perform the evaluation on two types of datasets: (1) artificial datasets and (2) real datasets (the ORL face database and the PolyU Palmprint database).
3.1 Data Visualization on Artificial Datasets
The artificial datasets on which we performed experiments are: (1) the Swiss roll dataset and (2) the intersecting dataset. The parameters used in this part of the experiments are listed in Table 1. We perform PCA, Kernel PCA, Diffusion Maps, LLE, LE, ISOMAP, and kDM on 1000 data points of the Swiss roll dataset to obtain the two-dimensional representations shown in Fig. 1.
Figure 1: Two-dimensional visualization of the Swiss roll dataset using a variety of techniques. [Panels and running times: original Swiss roll, PCA (t = 0.23 s), Kernel PCA (t = 39.48 s), Diffusion Maps (t = 1.22 s), LLE (t = 2.38 s), LE (t = 1.17 s), ISOMAP (t = 104.39 s), kDM (t = 0.16 s).]
Figure 2: Performance of kDM for k = 5, 10, 15, 20, 25 on the Swiss roll dataset. [Panels: original Swiss roll and the kDM embedding for each value of k.]
Figure 3: Performance of kDM for k = 5, 10, 15, 20, 25 on the intersecting dataset. [Panels: original intersecting data and the kDM embedding for each value of k.]
Figure 4: Two-dimensional visualization of the ORL face database using a variety of techniques. [Panels with classification accuracy and running time: PCA (88%, t = 0.69 s), Kernel PCA (15%, t = 0.09 s), Diffusion Maps (91%, t = 0.25 s), LLE (91%, t = 0.28 s), LE (90%, t = 0.11 s), ISOMAP (86%, t = 0.36 s), kDM (93%, t = 0.08 s); a sample face image is also shown.]
Figure 5: Two-dimensional visualization of the PolyU Palmprint database using a variety of techniques. [Panels with classification accuracy and running time: PCA (87%, t = 0.41 s), Kernel PCA (13%, t = 0.11 s), Diffusion Maps (70%, t = 0.30 s), LLE (95%, t = 0.30 s), LE (79%, t = 0.08 s), ISOMAP (96%, t = 0.39 s), kDM (97%, t = 0.11 s); a sample palm image is also shown.]
From the depicted representations, we can see that PCA, Kernel PCA, and Diffusion Maps are not capable of successfully learning the two-dimensional structure of the Swiss roll manifold. While LLE and Laplacian Eigenmaps are capable of learning the local structure of the manifold, ISOMAP can learn the global structure of the data. From the figure we can also see the advantage of the new dimensionality reduction technique kDM: it can learn both the local and global structure of the Swiss roll dataset, i.e. "close" data points remain "close" and "far" data points remain "far" in the embedding coordinates. Since kDM is a linear method, its running time is very low compared to the other, non-linear techniques. We next vary the number of clusters, k = 5, 10, 15, 20, 25, in the kDM algorithm to see how kDM behaves (see Figs. 2 and 3). It seems to us that, to some extent, kDM does not depend systematically on the number of clusters k, which is an issue currently under our investigation.
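Purely as an illustration of the visualization experiment, the following sketch embeds a synthetically generated Swiss roll with the kDM helpers sketched in Section 2; the data generator (scikit-learn's make_swiss_roll) and the plotting details are our assumptions, since the paper does not specify them.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll

# generate 1000 Swiss roll points (the paper's exact generator is not specified)
X, color = make_swiss_roll(n_samples=1000, random_state=0)

# kDM pipeline: cluster, build scatter matrices, project to 2-D
# (kmeans, cluster_scatter_matrices, kdm_projection are the sketches from Section 2)
labels, _ = kmeans(X, k=3)                      # k = 3 as in Table 1
Sb, Sw = cluster_scatter_matrices(X, labels)
W = kdm_projection(Sb, Sw, m=2)
Y = X @ W                                       # 2-D embedding (row-sample convention)

plt.scatter(Y[:, 0], Y[:, 1], c=color, s=8)
plt.title("kDM embedding of the Swiss roll")
plt.show()
```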
3.2 Experiment on Biometrics Databases
In this section, we perform experiments on real biometric databases, namely the ORL face database and the PolyU Palmprint database. We choose the value k = 5 in the kDM algorithm for both databases, while the parameters for the other methods remain the same as in Table 1. Due to the high dimensionality of biometric data (the 3S problem), in this section the kDM algorithm is implemented using the DLDA strategy discussed in the previous section. From the ORL face database, we randomly select 10 subjects, each of which contains 10 sample images. All images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement), and they are manually cropped and resized to 50x50-pixel images. Two-dimensional visualizations of the ORL face database based on a variety of techniques are presented in Fig. 4. It should be noted that the classification accuracy rates are calculated and shown in the title of each subplot in Fig. 4. We can see that kDM gives the best accuracy rate (93%), LLE and Diffusion Maps are second best (91%), while Kernel PCA performs very poorly. The PolyU Palmprint Database [12] contains 7752 grayscale images corresponding to 386 different palms. We also randomly select 10 subjects and 10 palms per subject for the experiment. We use the inscribed-circle-based segmentation approach in [13] to extract the palms and resize each palm to a radius of 25 pixels. In the case of the palmprint database, from Fig. 5, kDM still gives good performance in terms of both data visualization and classification, with 97% accuracy.
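The paper does not state which classifier produces the accuracy figures reported in Figs. 4 and 5. Purely as a hypothetical illustration, one could evaluate a 1-nearest-neighbour classifier in the kDM embedding as sketched below, reusing the helper functions from Section 2 and assuming `X` holds the vectorized images as rows and `y` the subject labels; the function names and the cross-validation protocol are our own choices.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def kdm_embed(X, n_clusters=5, m=2):
    """Embed row-sample data X using the kDM sketches from Section 2."""
    labels, _ = kmeans(X, k=n_clusters)                  # unsupervised clustering step
    Sb, Sw = cluster_scatter_matrices(X, labels)
    # fall back to the Direct-LDA-style sketch when S_w is singular (3S case)
    W = kdm_projection_3s(Sb, Sw) if X.shape[1] >= X.shape[0] else kdm_projection(Sb, Sw, m)
    return (X @ W)[:, :m]

def knn_accuracy_in_embedding(X, y, n_clusters=5, m=2, cv=10):
    """Hypothetical evaluation: cross-validated 1-NN accuracy in the 2-D kDM embedding."""
    Y = kdm_embed(X, n_clusters=n_clusters, m=m)
    return cross_val_score(KNeighborsClassifier(n_neighbors=1), Y, y, cv=cv).mean()

# e.g. X: (100, 2500) vectorized 50x50 ORL images, y: 10 subject labels (hypothetical data)
# print(f"1-NN accuracy: {knn_accuracy_in_embedding(X, y):.0%}")
```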
3.3 Discussion
The experiments on both artificial and real biometric datasets have been performed systematically. These experiments reveal a number of interesting points:
• kDM can be a good candidate for data visualization because it can learn the whole structure (both the local and the global structure) of the data.
• Though kDM is an unsupervised technique, it still has the ability to find discriminative features, which is very helpful in classification tasks.
• kDM is quite easy to implement and runs fast compared to the other, non-linear techniques.
4. CONCLUSIONS
In this paper, we propose a linear dimensionality reduction technique that can keep both the local and global structure of data. The experiments on both artificial and real datasets show its potential for data visualization and classification tasks. The cornerstone of the idea is to combine the useful properties of k-means and the Fisher criterion. In the first step of applying k-means, "close" data samples tend to be kept in the same cluster, while "distant" data samples are grouped into different clusters. This topology of the data is then preserved by using the Fisher criterion to embed the data into a low-dimensional representation. An obvious direction for future work is to study the effect of the k-means step on the kDM algorithm.
5. ACKNOWLEDGMENTS
This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITFSIP (IT Foreign Specialist Inviting Program) supervised by the IITA (Institute of Information Technology Advancement).
6. REFERENCES
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and
spectral techniques for embedding and clustering.
Advances in Neural Information Processing Systems
14.
[2] K. Fukunaga. Introduction to statistical pattern
recognition. Academic Press Professional, Inc., San
Diego, CA, USA, 1990.
[3] H. Hotelling. Analysis of a complex of statistical
variables into principal components. J. Educational
Psychology, 27:417–441, 1933.
[4] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1998.
[5] B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21, 2006.
[6] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.
[7] S. T. Roweis and L. K. Saul. Nonlinear dimensionality
reduction by locally linear embedding. Science,
290(5500):2323–2326, 2000.
[8] R. N. Shepard. The analysis of proximities:
Multidimensional scaling with an unknown distance
function. Psychometrika, 27:125–140, 1962.
[9] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, December 2000.
[10] W. S. Torgerson. Multidimensional scaling.
Psychometrika, 17:401–419, 1952.
[11] H. Yu and J. Yang. A direct lda algorithm for
high-dimensional data - with application to face
recognition. Pattern Recognition, 34(10):2067–2070,
2001.
[12] D. Zhang. PolyU palmprint database. http://www.comp.polyu.edu.hk/~biometrics/.
[13] D. Zhang. Palmprint Authentication. Kluwer
Academic, 2004.