
k-Means Discriminant Maps for Data Visualization and Classification

Vo Dinh Minh Nhat

Ubiquitous Computing Lab

Kyung Hee University

Suwon, Korea

vdmnhat@uclab.khu.ac.kr

SungYoung Lee

Ubiquitous Computing Lab

Kyung Hee University

Suwon, Korea

sylee@uclab.khu.ac.kr

ABSTRACT

Over the years, many dimensionality reduction algorithms have been proposed for learning the structure of high-dimensional data by linearly or nonlinearly transforming it into a low-dimensional space. Some techniques keep the local structure of the data, while others try to preserve the global structure. In this paper, we propose a linear dimensionality reduction technique that characterizes both the local and global properties of the data by first applying the k-means algorithm to the original data, and then finding the projection that simultaneously maximizes the between-cluster scatter globally and minimizes the within-cluster scatter locally, which keeps both the local and global structure of the data. Low complexity and structure preservation are the two main advantages of the proposed technique. Experiments on both artificial and real data sets show the effectiveness and novelty of the proposed algorithm in visualization and classification tasks.

Categories and Subject Descriptors

I.5.2 [Computing Methodologies]: Pattern Recognition—

Design Methodology.

General Terms

Algorithms, Design, Experimentation, Performance, The-

ory.

Keywords

Dimensionality Reduction, k-Means, Manifold Learning, Lin-

ear Discriminant Analysis.

1. INTRODUCTION

The purpose of dimensionality reduction is to transform

high dimensional data into a low-dimensional space, while

retaining most of the underlying structure in the data. The

reason for using dimensionality reduction is based on the fact


that some features may be irrelevant and the "intrinsic" dimensionality of the data may be smaller than the number of features.

Dimensionality reduction can also be used to visualize high

dimensional data by transforming the data into two or three

dimensions, thereby giving additional insight into the prob-

lem at hand. With the rapidly increasing demand on dimen-

sionality reduction techniques, it is not surprising to see an

overwhelming amount of research publications on this topic

in recent years. In general, there are linear and nonlinear

dimensionality reduction techniques. Linear dimensionality

reduction methods include Principal Component Analysis

(PCA) developed by Pearson (1901) and Hotelling (1933)

[6][3], and Multi-Dimensional Scaling (MDS) by Torgerson

(1952) and Shepard (1962) [10][8]. While PCA ﬁnds a low-

dimensional embedding of the data points that best pre-

serves their variance as measured in the high-dimensional

input space, MDS ﬁnds an embedding that preserves the

inter-point distances, which is equivalent to PCA when the

distances are Euclidean. Besides linear methods, a number of nonlinear dimensionality reduction techniques have also been developed. Kernel PCA (KPCA) [4] maps inputs nonlinearly into a new feature space and then performs PCA.

Laplacian Eigenmaps (LE) [1] preserves nearness relations as encoded by the graph Laplacian.

the data lie on a (Riemannian) manifold and maps data to

its low-dimensional representation in such a way that the

geodesic distance between two data points is as close as possible to the Euclidean distance between the two corresponding points in the low-dimensional space. Diffusion Maps (DM) [5] is

based on deﬁning a Markov random walk on the graph of

the data. In the low-dimensional representation of the data,

the pairwise diﬀusion distances are retained as well as pos-

sible. Locally Linear Embedding (LLE) [7] maps its inputs

into a single global coordinate system of lower dimensional-

ity by computing low-dimensional, neighborhood preserving

embedding of high-dimensional inputs, and its optimization

does not involve local minima. It actually recovers global

nonlinear structure from locally linear fits. Due to the length of the paper, other techniques, most of which are variants of those reviewed above, are not covered here. In this pa-

per, we propose a linear dimensionality reduction technique

called k-Means Discriminant Maps (kDM). The algorithm

first applies k-means to cluster the original data; then, for the purpose of keeping both the local and global structure of the data, it tries to find a desirable projection that simultane-

ously minimizes the within-cluster scatter and maximizes

the between-cluster scatter matrices. The main contributions of the proposed algorithm are its low complexity, due to its linearity, and its ability to keep both the local and global structure of the data. The outline of this paper is as

follows. The proposed method is described in Section 2. In Section 3, experiments are performed on both artificial and real data sets to demonstrate the effectiveness of

our method. Finally, conclusions are presented in Section 4.

2. K-MEANS DISCRIMINANT MAPS

The dimensionality reduction problem is, given a data set $\{x_1, x_2, \ldots, x_N\}$ where $x_i \in \mathbb{R}^n$, to find a set of points $\{y_1, y_2, \ldots, y_N\}$ where $y_i \in \mathbb{R}^m$ and $m < n$, such that each $y_i$ "represents" its counterpart $x_i$. For convenience of presentation, we denote the matrix $X = [x_1, x_2, \ldots, x_N]$ and correspondingly the matrix $Y = [y_1, y_2, \ldots, y_N]$. In this section, our emphasis is on the description of the proposed algorithm. Due to the paper length, the previous dimensionality reduction techniques can be found in their respective references.

2.1 k-Means

Given the data set, k-means partitions it into k clusters so as to minimize the total intra-cluster variance, i.e., the squared error function

f = \sum_{i=1}^{k} \sum_{x_j \in \Pi_i} \lVert x_j - \mu_i \rVert^2    (1)

where $\Pi_i$, $i = 1, 2, \ldots, k$, are the clusters and $\mu_i$ is the centroid, or mean point, of all the points $x_j \in \Pi_i$.
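As a concrete illustration of this clustering step, here is a minimal NumPy sketch of Lloyd's k-means iteration for the objective in (1). The function name, the random initialization, and the stopping rule are our own illustrative choices, not details prescribed by the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means on the rows of X (shape N x n).
    Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialize centroids with k distinct random samples (illustrative choice).
    centroids = X[rng.choice(N, size=k, replace=False)].astype(float)
    labels = np.full(N, -1)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so objective (1) can no longer decrease
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
```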

2.2 k-Means Discriminant Maps - kDM

Assume that after clustering the data by the k-means algorithm, each data sample belongs to one of the $C$ clusters $\{\Pi_1, \Pi_2, \ldots, \Pi_C\}$. Let $N_i$ be the number of samples in cluster $\Pi_i$ ($i = 1, 2, \ldots, C$), $\mu_i = \frac{1}{N_i} \sum_{x \in \Pi_i} x$ be the mean of the samples, i.e., the centroid, of cluster $\Pi_i$, and $\mu = \frac{1}{N} \sum_{j=1}^{N} x_j$ be the mean of all $N$ samples. Then we define the between-cluster scatter matrix $S_b$ and the within-cluster scatter matrix $S_w$ as follows

S_b = \frac{1}{N} \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T    (2)

S_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{x_k \in \Pi_i} (x_k - \mu_i)(x_k - \mu_i)^T    (3)
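The scatter matrices in (2) and (3) follow directly from the k-means cluster assignments. The sketch below computes them with NumPy; the function and variable names are our own.

```python
import numpy as np

def cluster_scatter_matrices(X, labels):
    """Compute the between-cluster scatter S_b, eq. (2), and within-cluster
    scatter S_w, eq. (3), from data X (N x n) and k-means labels (length N)."""
    N, n = X.shape
    mu = X.mean(axis=0)                      # mean of all samples
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for i in np.unique(labels):
        Xi = X[labels == i]                  # samples in cluster Pi_i
        Ni = Xi.shape[0]
        mu_i = Xi.mean(axis=0)               # cluster centroid
        d = (mu_i - mu)[:, None]
        Sb += Ni * (d @ d.T)                 # N_i (mu_i - mu)(mu_i - mu)^T
        Dc = (Xi - mu_i).T                   # centered cluster samples, n x N_i
        Sw += Dc @ Dc.T                      # sum of (x_k - mu_i)(x_k - mu_i)^T
    return Sb / N, Sw / N
```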

For the purpose of keeping both the local and global structure of the data, we try to find a projection which draws close samples (ones in the same cluster) closer together while simultaneously making distant samples (ones from different clusters) even more distant from each other. From this point of view, a desirable projection should be one that, at the same time, minimizes the within-cluster scatter and maximizes the between-cluster scatter. So in kDM, the projection $W_{opt}$ is chosen to maximize the ratio of the between-cluster scatter of the projected samples to their within-cluster scatter, i.e.,

W_{opt} = \arg\max_{w} \frac{w^T S_b w}{w^T S_w w}    (4)

From the criterion in (4), we can find the projection by simultaneously globally maximizing the between-cluster scatter and locally minimizing the within-cluster scatter, which actually keeps both the local and global structure of the data.

Technique        Parameter Settings
PCA              None
Kernel PCA       κ = (X X^T + 1)^3
Diffusion Maps   σ = 1
LLE              k = 12
LE               k = 12, σ = 1
ISOMAP           k = 12
kDM              k = 3

Table 1: Parameter settings for the experiments

It is also easy to realize that the criterion (4) is formally

similar to the Fisher criterion since they are both Rayleigh

quotients. However, in kDM we form the between-cluster scatter matrix $S_b$ and the within-cluster scatter matrix $S_w$ without knowing the class labels of the samples. This means that the Fisher discriminant projection is supervised, while the projection determined by kDM can be obtained in an unsupervised manner. The optimal projection for kDM is $W = [w_1\, w_2\, \ldots\, w_m]$, where $\{w_i \mid i = 1, 2, \ldots, m\}$ is the set of generalized eigenvectors of $S_b$ and $S_w$ corresponding to the $m$ largest generalized eigenvalues $\{\lambda_i \mid i = 1, 2, \ldots, m\}$, i.e.,

S_b w_i = \lambda_i S_w w_i \;\Longleftrightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i, \quad i = 1, 2, \ldots, m    (5)
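When $S_w$ is nonsingular, (5) is a generalized symmetric eigenvalue problem. A possible sketch using SciPy's generalized eigensolver is shown below; the choice of solver and the function names are ours, not the paper's. The data are then embedded as Y = W^T X.

```python
import numpy as np
from scipy.linalg import eigh

def kdm_projection(Sb, Sw, m):
    """Solve S_b w = lambda S_w w and return the m eigenvectors with the
    largest eigenvalues as columns of W (requires S_w to be nonsingular)."""
    # eigh solves the generalized symmetric-definite problem; eigenvalues ascend.
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1][:m]   # indices of the m largest eigenvalues
    return eigvecs[:, order]                # W = [w_1 w_2 ... w_m]

# Illustrative pipeline, using the sketches above, with the paper's data matrix
# X of shape n x N (columns are samples):
#   labels, _ = kmeans(X.T, k)
#   Sb, Sw = cluster_scatter_matrices(X.T, labels)
#   W = kdm_projection(Sb, Sw, m)
#   Y = W.T @ X                            # low-dimensional representation
```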

However, in some cases the dimension of the sample space is larger than the number of samples. As a consequence, $S_w$ is singular. This problem is known as the "small sample size (3S) problem" [2]. To solve this problem, we use the strategy of Direct-LDA [11] to implement our kDM algorithm in the 3S case. The key idea of DLDA is to discard the null space of $S_b$, which contains no useful information, rather than discarding the null space of $S_w$, which contains the most discriminative information. $S_b$ is firstly diagonalized as $S_b = U \Lambda U^T$, where $U \in \mathbb{R}^{n \times (C-1)}$ is a matrix whose columns are eigenvectors of $S_b$, $C$ is the number of classes (equivalent to the number of clusters $k$ in the k-means step of kDM; we use $k$ and $C$ interchangeably), and $\Lambda$ is a diagonal matrix of eigenvalues. The new projected

within scatter matrix is formed as

\tilde{S}_w = \Lambda^{-1/2} U^T S_w U \Lambda^{-1/2}    (6)

Let $\tilde{W}_w = [w_1\, w_2\, \ldots\, w_{C-1}]$, where $\{w_i \mid i = 1, 2, \ldots, C-1\}$ is the set of eigenvectors of $\tilde{S}_w$. Then, the optimal projection for DLDA is $W_{opt} = U \Lambda^{-1/2} \tilde{W}_w$.
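The following sketch mirrors this DLDA-style computation for the small-sample-size case. The numerical threshold used to drop the null space of $S_b$ is our own illustrative choice.

```python
import numpy as np

def kdm_projection_sss(Sb, Sw, tol=1e-10):
    """Direct-LDA-style projection for the small-sample-size case, following (6).
    Keeps only the range (non-null) space of S_b, then whitens S_w inside it."""
    # Diagonalize S_b = U Lambda U^T and drop its null space
    # (eigenvalues below `tol`, a numerical threshold of our choosing).
    lam, U = np.linalg.eigh(Sb)
    keep = lam > tol                          # at most C - 1 eigenvalues survive
    U, lam = U[:, keep], lam[keep]
    L_inv_sqrt = np.diag(1.0 / np.sqrt(lam))
    # Projected within-cluster scatter matrix, eq. (6).
    Sw_tilde = L_inv_sqrt @ U.T @ Sw @ U @ L_inv_sqrt
    # Eigenvectors of the projected within-cluster scatter.
    _, Ww = np.linalg.eigh(Sw_tilde)
    # Final projection W_opt = U Lambda^{-1/2} W_w.
    return U @ L_inv_sqrt @ Ww
```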

3. EXPERIMENTS

In this section, systematic empirical experiments on the performance of the previous techniques and our proposed technique kDM are performed. We carry out the evaluation on two types of datasets: (1) artificial datasets and (2) real datasets (the ORL face database and the PolyU Palmprint database).

3.1 Data Visualization on Artiﬁcial Datasets

The artificial datasets on which we performed experiments are: (1) the Swiss roll dataset and (2) the intersecting dataset. The parameter settings used in this part of the experiments are given in Table 1. We perform PCA, Kernel PCA, Diffusion Maps, LLE, LE, ISOMAP, and kDM on 1000 data points of the Swiss roll dataset; the resulting two-dimensional representations are shown in Fig. 1.

[Figure 1: Two-dimensionality Visualization of the Swiss roll dataset based on a variety of techniques. Panels: Swiss Roll (original 3D data); PCA (t = 0.23438 s); Kernel PCA (t = 39.4844 s); Diffusion Maps (t = 1.2188 s); LLE (t = 2.375 s); LE (t = 1.1719 s); ISOMAP (t = 104.3906 s); kDM (t = 0.15625 s).]

[Figure 2: Performance of kDM for k = 5, 10, 15, 20, 25 on the Swiss roll dataset.]

[Figure 3: Performance of kDM for k = 5, 10, 15, 20, 25 on the intersecting dataset.]

[Figure 4: Two-dimensionality Visualization of the ORL face database based on a variety of techniques, with classification accuracies and running times. Panels: PCA (88%, t = 0.6875 s); Kernel PCA (15%, t = 0.09375 s); Diffusion Maps (91%, t = 0.25 s); LLE (91%, t = 0.28125 s); LE (90%, t = 0.10938 s); ISOMAP (86%, t = 0.35938 s); kDM (93%, t = 0.0825 s); sample image.]

[Figure 5: Two-dimensionality Visualization of the PolyU Palmprint database based on a variety of techniques, with classification accuracies and running times. Panels: PCA (87%, t = 0.40625 s); Kernel PCA (13%, t = 0.10938 s); Diffusion Maps (70%, t = 0.29688 s); LLE (95%, t = 0.29688 s); LE (79%, t = 0.078125 s); ISOMAP (96%, t = 0.39063 s); kDM (97%, t = 0.1125 s); sample palm.]


From the depicted representations, we can see that PCA,

Kernel PCA and Diﬀusion Maps techniques are not capa-

ble of successfully learning the 2-dimensional structure of

the Swiss roll manifold. While LLE and Laplacian Eigen-

maps are capable of learning the local structure of the man-

ifold, ISOMAP can learn the global structure of data. Also,

from the graphs we can see the advantage of the new dimensionality reduction technique kDM: it learns both the local and global structure of the Swiss roll dataset, i.e., "close" data points remain "close" and "far" data points remain "far" in the embedding coordinates. Since kDM is a linear method, its running time is very low compared to the other nonlinear techniques. We next vary the number of clusters k = 5, 10, 15, 20, 25 in the kDM algorithm to see how kDM behaves (see Figs. 2 and 3). It seems to us that, to some extent, kDM does not depend systematically on the number of clusters k; this is an issue under our investigation.

3.2 Experiment on Biometrics Databases

In this section, we perform experiments on real biometric databases, namely the ORL face database and the PolyU Palmprint database. We choose k = 5 in the kDM algorithm for both databases, while the parameters for the other methods remain the same as in Table 1. Due to the high dimensionality of the biometric data (the 3S problem), in this section the kDM algorithm is implemented based on the strategy of DLDA, as discussed in the previous section. From the ORL face database, we randomly select 10 subjects, each of which contains 10 sample images. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement), and they are manually cropped and resized to 50×50 pixel images. Two-dimensional visualizations of the ORL face database based on a variety of techniques are presented in Fig. 4. It should be noted that the classification accuracies are calculated and shown in the title of each subplot in Fig. 4. We can see that kDM gives the best accuracy (93%), LLE and Diffusion Maps are the second best (91%), while Kernel PCA gives a very poor performance. The PolyU Palmprint Database [12] contains 7752 grayscale images corresponding to 386 different palms. We also randomly select 10 subjects and 10 palm images for each subject for the experiment. We use the inscribed-circle-based segmentation approach in [13] to extract the palms and resize each palm to a radius of 25 pixels. For the palmprint database, as shown in Fig. 5, kDM still gives good performance in terms of both data visualization and classification, with 97% accuracy.
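The paper does not state which classifier produced the accuracies reported in Figs. 4 and 5. Purely as an illustration of how classification accuracy in the embedded space might be measured, here is a leave-one-out 1-nearest-neighbour sketch; the protocol is entirely our assumption, not the paper's.

```python
import numpy as np

def one_nn_accuracy(Y, labels):
    """Leave-one-out 1-NN accuracy in the embedded space Y (m x N, columns are
    samples). Purely illustrative: the paper does not specify its classifier."""
    D = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)  # pairwise distances
    np.fill_diagonal(D, np.inf)                                 # exclude self-matches
    nearest = D.argmin(axis=1)                                  # nearest other sample
    return float(np.mean(labels[nearest] == labels))
```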

3.3 Discussion

The experiments on both artificial and real biometric datasets have been performed systematically. These experiments reveal a number of interesting points, as follows:

• kDM can be a good candidate for data visualization because it can learn the whole structure (both the local and the global structure) of the data.

• Though kDM is an unsupervised technique, it still has the ability to find discriminative features, which is very helpful in classification tasks.

• kDM is quite easy to implement and runs fast compared to the other nonlinear techniques.

4. CONCLUSIONS

In this paper, we propose a linear dimensionality reduc-

tion technique that can keep both local and global struc-

ture of data. The experiments on both artiﬁcial and real

datasets show its potential in data visualization and clas-

sification tasks. The cornerstone of the idea is the use of the nice properties of k-means and the Fisher criterion. In the first step, applying k-means tends to keep "close" data samples in the same cluster, while "distant" data samples are grouped into different clusters. This topology of the data is then preserved by using the Fisher criterion to embed the data into a low-dimensional representation. An obvious direction for future work is to study the effect of the k-means step on the kDM algorithm.

5. ACKNOWLEDGMENTS

This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITFSIP (IT Foreign Specialist Inviting Program), supervised by the IITA (Institute of Information Technology Advancement).

6. REFERENCES

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and

spectral techniques for embedding and clustering.

Advances in Neural Information Processing Systems

14.

[2] K. Fukunaga. Introduction to statistical pattern

recognition. Academic Press Professional, Inc., San

Diego, CA, USA, 1990.

[3] H. Hotelling. Analysis of a complex of statistical

variables into principal components. J. Educational

Psychology, 27:417–441, 1933.

[4] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. pages 536–542, 1998.

[5] B. Nadler, S. Lafon, R. R. Coifman, and I. G.

Kevrekidis. Diﬀusion maps, spectral clustering and

reaction coordinates of dynamical systems. Applied

and Computational Harmonic Analysis, 21.

[6] K. Pearson. On lines and planes of closest ﬁt to

systems of points in space. Philosophical Magazine,

2:559–572, 1901.

[7] S. T. Roweis and L. K. Saul. Nonlinear dimensionality

reduction by locally linear embedding. Science,

290(5500):2323–2326, 2000.

[8] R. N. Shepard. The analysis of proximities:

Multidimensional scaling with an unknown distance

function. Psychometrika, 27:125–140, 1962.

[9] J. Tenenbaum, V. de Silva, and J. Langford. A global

geometric framework for nonlinear dimensionality

reduction. Science, 290(5500):2319–2323, December 2000.

[10] W. S. Torgerson. Multidimensional scaling.

Psychometrika, 17:401–419, 1952.

[11] H. Yu and J. Yang. A direct lda algorithm for

high-dimensional data - with application to face

recognition. Pattern Recognition, 34(10):2067–2070,

2001.

[12] D. Zhang. PolyU Palmprint Database - http://www.comp.polyu.edu.hk/ biometrics/.

[13] D. Zhang. Palmprint Authentication. Kluwer

Academic, 2004.