VOLUME 88, NUMBER 1    PHYSICAL REVIEW LETTERS    7 JANUARY 2002

Algorithm for Data Clustering in Pattern Recognition Problems Based on Quantum Mechanics

David Horn and Assaf Gottlieb

School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University,

Tel Aviv 69978, Israel

(Received 16 July 2001; published 20 December 2001)

We propose a novel clustering method that is based on physical intuition derived from quantum mechanics. Starting with given data points, we construct a scale-space probability function. Viewing the latter as the lowest eigenstate of a Schrödinger equation, we use simple analytic operations to derive a potential function whose minima determine cluster centers. The method has one parameter, determining the scale over which cluster structures are searched. We demonstrate it on data analyzed in two dimensions (chosen from the eigenvectors of the correlation matrix). The method is applicable in higher dimensions by limiting the evaluation of the Schrödinger potential to the locations of data points.

DOI: 10.1103/PhysRevLett.88.018702    PACS numbers: 89.75.Kd, 02.70.-c, 03.65.Ge, 03.67.Lx

Clustering of data is a well-known problem of pattern recognition, covered in textbooks such as [1-3]. The problem we are looking at is defining clusters of data solely by the proximity of data points to one another. This problem is one of unsupervised learning, and is in general ill defined. Solutions to such problems can be based on intuition derived from physics. A good example of the latter is the algorithm of [4], which is based on associating points with Potts spins and formulating an appropriate model of statistical mechanics. We propose an alternative that is also based on physical intuition, this one being derived from quantum mechanics.

As an introduction to our approach we start with the scale-space algorithm of [5], which uses a Parzen-window estimator [3] of the probability distribution leading to the data at hand. The estimator is constructed by associating a Gaussian with each of the $N$ data points in a Euclidean space of dimension $d$ and summing over all of them. This can be represented, up to an overall normalization, by

$$\psi(\mathbf{x}) = \sum_i e^{-(\mathbf{x}-\mathbf{x}_i)^2/2\sigma^2}, \qquad (1)$$

where the $\mathbf{x}_i$ are the data points. Roberts [5] views the maxima of this function as determining the locations of cluster centers.
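For illustration, Eq. (1) translates directly into a few lines of NumPy. This sketch is ours, not part of the Letter; the function and variable names are arbitrary.

```python
import numpy as np

def psi(x, data, sigma):
    """Parzen-window estimator of Eq. (1), up to an overall normalization.

    x     : (d,) point at which the wave function is evaluated
    data  : (N, d) array of data points x_i
    sigma : scale parameter
    """
    sq_dist = np.sum((data - x) ** 2, axis=1)            # |x - x_i|^2 for every i
    return np.sum(np.exp(-sq_dist / (2.0 * sigma ** 2)))
```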

An alternative, and somewhat related, method is support vector clustering (SVC) [6], which is based on a Hilbert-space analysis. In SVC, one defines a transformation from data space to vectors in an abstract Hilbert space. SVC proceeds to search for the minimal sphere surrounding these states in Hilbert space. We will also associate data points with states in Hilbert space. Such states may be represented by Gaussian wave functions, whose sum is $\psi(\mathbf{x})$. This is the starting point of our quantum clustering (QC) method. We will search for the Schrödinger potential for which $\psi(\mathbf{x})$ is a ground state. The minima of the potential define our cluster centers.

The Schrödinger potential.—We wish to view $\psi$ as an eigenstate of the Schrödinger equation

$$H\psi \equiv \left(-\frac{\sigma^2}{2}\nabla^2 + V(\mathbf{x})\right)\psi = E\psi. \qquad (2)$$

Here we have rescaled $H$ and $V$ of the conventional quantum mechanical equation to leave only one free parameter, $\sigma$. For comparison, the case of a single point at $\mathbf{x}_1$ corresponds to Eq. (2) with $V = \frac{1}{2\sigma^2}(\mathbf{x}-\mathbf{x}_1)^2$ and $E = d/2$, thus coinciding with the ground state of the harmonic oscillator in quantum mechanics.

Given $\psi$ for any set of data points we can solve Eq. (2) for $V$:

$$V(\mathbf{x}) = E + \frac{\sigma^2}{2}\,\frac{\nabla^2\psi}{\psi}
= E - \frac{d}{2} + \frac{1}{2\sigma^2\psi}\sum_i (\mathbf{x}-\mathbf{x}_i)^2\, e^{-(\mathbf{x}-\mathbf{x}_i)^2/2\sigma^2}. \qquad (3)$$

Let us furthermore require that $\min V = 0$. This sets the value of

$$E = -\min \frac{\sigma^2}{2}\,\frac{\nabla^2\psi}{\psi} \qquad (4)$$

and determines $V(\mathbf{x})$ uniquely. $E$ has to be positive since $V$ is a non-negative function. Moreover, since the last term in Eq. (3) is positive definite, it follows that

$$0 < E \le \frac{d}{2}. \qquad (5)$$
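As a concrete reading of Eqs. (3)-(4), the following NumPy sketch (ours, with hypothetical function names) evaluates the potential at a set of points; in this sketch $E$ is fixed by the minimum over the evaluation points rather than the true global minimum, which is an approximation.

```python
import numpy as np

def v_unshifted(x, data, sigma):
    """(sigma^2 / 2) * laplacian(psi)/psi at point x, i.e. V(x) - E in Eq. (3)."""
    sq_dist = np.sum((data - x) ** 2, axis=1)              # |x - x_i|^2
    g = np.exp(-sq_dist / (2.0 * sigma ** 2))               # Gaussian terms of psi
    d = data.shape[1]
    return -0.5 * d + np.sum(sq_dist * g) / (2.0 * sigma ** 2 * np.sum(g))

def potential_on_points(points, data, sigma):
    """V at the given evaluation points, with E set so that min V = 0 (Eq. (4))."""
    u = np.array([v_unshifted(x, data, sigma) for x in points])
    E = -u.min()                                             # Eq. (4), taken over 'points'
    return u + E, E
```

In practice the evaluation points are either a grid (for two-dimensional maps such as Figs. 1 and 2) or, as discussed below, the data points themselves.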

We note that $\psi$ is positive definite. Hence, being an eigenfunction of the operator $H$ in Eq. (2), its eigenvalue $E$ is the lowest eigenvalue of $H$, i.e., it describes the ground state. All higher eigenfunctions have nodes whose numbers increase as their energy eigenvalues increase. (In quantum mechanics, where one interprets $|\psi|^2$ as the probability distribution, all eigenfunctions of $H$ have physical meaning. Although this approach could be adopted, we have chosen $\psi$ as the probability distribution because of the simplicity of algebraic manipulations.)

Given a set of points defined within some region of space, we expect $V(\mathbf{x})$ to grow quadratically outside this region, and to exhibit one or several local minima within the region. We identify these minima with cluster centers, which seems natural in view of the opposite roles of the two terms in Eq. (2): given a potential function, it attracts the data distribution function $\psi$ to its minima, while the Laplacian drives it away. The diffused character of the distribution is the balance of the two effects.

As an example we display results for the crab data set taken from Ripley's book [7]. These data, given in a five-dimensional parameter space, show nice separation of the four classes contained in them when displayed in two dimensions spanned by the second and third principal components [8] (eigenvectors) of the correlation matrix of the data. The information supplied to the clustering algorithm contains only the coordinates of the data points. We display the correct classification to allow for visual comparison of the clustering method with the data. Starting with $\sigma = 1/\sqrt{2}$ we see in Fig. 1 that the Parzen probability distribution, or the wave function $\psi$, has only a single maximum. Nonetheless, the potential, displayed in Fig. 2, already shows four minima at the relevant locations. The overlap of the topographic map of the potential with the true classification is quite amazing. The minima are the centers of attraction of the potential, and they are clearly evident although the wave function does not display local maxima at these points. The fact that $V(\mathbf{x}) = E$ lies above the range where all valleys merge explains why $\psi(\mathbf{x})$ is smoothly distributed over the whole domain.

As $\sigma$ is decreased, more minima appear in $V(\mathbf{x})$. For the crab data, we find two new minima as $\sigma$ is decreased to one-half. Nonetheless, the previous minima become deeper and still dominate the scene. The new minima are insignificant, in the sense that they lie at high values (of order $E$). Classifying data points to clusters according to their topographic location on the surface of $V(\mathbf{x})$, roughly the same clustering assignment is expected for a range of $\sigma$ values. One important advantage of quantum clustering is that $E$ sets the scale on which minima are observed. Thus, we learn from Fig. 2 that the cores of all four clusters can be found at $V$ values below $0.4E$. In comparison, the additional maxima of $\psi$, which start to appear at lower values of $\sigma$, may lie much lower than the leading maximum and may be hard to locate numerically.

FIG. 1. Ripley's crab data [7] displayed on a plot of their second and third principal components (PC2 vs PC3) with a superimposed topographic map of Roberts' probability distribution for $\sigma = 1/\sqrt{2}$.

Principal component analysis (PCA).—In our example, data were given in some high-dimensional space and we analyzed them after defining a projection and a metric, using the PCA approach. The latter defines a metric that is intrinsic to the data, determined by second-order statistics. But, even then, several possibilities exist, leading to nonequivalent results.

Principal component decomposition can be applied both to the correlation matrix $C_{ab} = \langle x_a x_b \rangle$ and to the covariance matrix

$$\tilde{C}_{ab} = \langle (x_a - \langle x\rangle_a)(x_b - \langle x\rangle_b)\rangle = C_{ab} - \langle x\rangle_a \langle x\rangle_b. \qquad (6)$$

In both cases averaging is performed over all data points, and the indices indicate spatial coordinates from 1 to $d$. The principal components are the eigenvectors of these matrices. Thus we have two natural bases in which to represent the data. Moreover, one often renormalizes the eigenvector projections, dividing them by the square roots of their eigenvalues. This procedure is known as "whitening," leading to a renormalized correlation or covariance matrix of unity. This is a scale-free representation that would naturally lead one to start with $\sigma = 1$ in the search for (higher-order) structure of the data.
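This preprocessing step can be sketched as follows (our own illustration; the Letter prescribes no code, and the choice of NumPy and of function names is ours). The covariance option corresponds to Eq. (6); otherwise the correlation matrix is used.

```python
import numpy as np

def whitened_pca(X, use_covariance=False):
    """Project data onto principal components and whiten the projections.

    X : (N, d) data matrix.  With use_covariance=True the mean is subtracted
    first, as in Eq. (6); otherwise the correlation matrix <x_a x_b> is used.
    Returns the (N, d) matrix of whitened projections.
    """
    if use_covariance:
        X = X - X.mean(axis=0)
    C = X.T @ X / X.shape[0]                  # correlation (or covariance) matrix
    eigval, eigvec = np.linalg.eigh(C)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]          # principal components first
    eigval, eigvec = eigval[order], eigvec[:, order]
    return (X @ eigvec) / np.sqrt(eigval)     # divide by sqrt(eigenvalue): whitening
```

For the crab example above one keeps, e.g., the second and third columns of the returned matrix.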

FIG. 2. A topographic map of the potential for the crab data (PC2 vs PC3) with $\sigma = 1/\sqrt{2}$, displaying four minima (denoted by crossed circles) that are interpreted as cluster centers. The contours of the topographic map are set at values of $V(\mathbf{x})/E = 0.2$, 0.4, 0.6, 0.8, 1.


The PCA approach that we have used in our example was based on whitened correlation-matrix projections. Had we used the covariance matrix instead, we would have obtained similar, but slightly worse, separation of the crab data. Our example is meant to convince the reader that once a good metric is found, QC conveys the correct information. Hence we allowed ourselves to search first for the best geometric representation, and then apply QC.

QC in higher dimensions.—Increasing dimensionality means higher computational complexity, often limiting the applicability of a numerical method. Nonetheless, here we can overcome this "curse of dimensionality" by limiting ourselves to evaluating $V$ at the locations of the data points only. Since we are interested in where the minima lie, and since invariably they lie near data points, no harm is done by this limitation. The results are depicted in Fig. 3. Here we analyzed the crab problem in a three-dimensional (3D) space, spanned by the first three PCs. Shown in this figure are $V/E$ values as functions of the serial number of the data, using the same symbols as in Fig. 2 to allow for comparison. Using all data with $V < 0.3E$, one obtains cluster cores that are well separated in space, corresponding to the four classes that exist in the data. Only 9 of the 129 points that obey $V < 0.3E$ are misclassified by this procedure. Adding higher PCs, first component 4 and then component 5, leads to deterioration in clustering quality. In particular, lower cutoffs in $V/E$, including lower fractions of data, are required to define cluster cores that are well separated in their relevant spaces.

One may locate the cluster centers, and deduce the clustering allocation of the data, by following the dynamics of gradient descent into the potential minima. Defining $\mathbf{y}_i(0) = \mathbf{x}_i$, one follows the steps $\mathbf{y}_i(t+\Delta t) = \mathbf{y}_i(t) - \eta(t)\nabla V(\mathbf{y}_i(t))$, letting the points $\mathbf{y}_i$ reach an asymptotic fixed value coinciding with a cluster center. More sophisticated minimum search algorithms (see, e.g., chapter 10 in [9]) can be applied to reach the fixed points faster. The results of a gradient-descent procedure, applied to the 3D analysis of the crab data shown in Fig. 3, are that the three classes of data points 51 to 200 are clustered correctly with only five misclassifications. The first class, data points 1-50, has 31 points forming a new cluster, with most of the rest joining the cluster of the second class. Only 3 points of the first class fall outside the four clusters.

FIG. 3. Values of $V(\mathbf{x})/E$ in the crab problem with three leading PCs for $\sigma = 1/2$, presented as a function of the serial number of the data, using the same symbols as employed previously. One observes low-lying data of all four classes.
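A minimal gradient-descent sketch of this procedure (ours, not the authors' implementation; the step size, iteration count, and the use of a finite-difference gradient are arbitrary choices made for brevity):

```python
import numpy as np

def grad_descent_qc(data, sigma, eta=0.1, steps=200, eps=1e-4):
    """y_i(t+dt) = y_i(t) - eta * grad V(y_i(t)), starting from y_i(0) = x_i.

    Constant shifts (E and -d/2) do not affect the gradient, so we descend on
    the data-dependent part of V in Eq. (3) only.  The gradient is taken by
    central finite differences rather than analytically.
    Returns the (N, d) array of final replica positions y_i.
    """
    def U(x):
        sq = np.sum((data - x) ** 2, axis=1)
        g = np.exp(-sq / (2.0 * sigma ** 2))
        return np.sum(sq * g) / (2.0 * sigma ** 2 * np.sum(g))

    def grad_U(x):
        grad = np.zeros_like(x)
        for k in range(x.size):
            e = np.zeros_like(x)
            e[k] = eps
            grad[k] = (U(x + e) - U(x - e)) / (2.0 * eps)
        return grad

    y = data.astype(float).copy()                  # y_i(0) = x_i
    for _ in range(steps):
        y = np.array([yi - eta * grad_U(yi) for yi in y])
    return y                                       # replicas collapse toward cluster centers
```

Replicas ending up close to one another (within a small fraction of $\sigma$) are then assigned to the same cluster.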

We also ran our method on the iris data set [10], which is a standard benchmark obtainable from the UC Irvine (UCI) repository [11]. The data set contains 150 instances, each composed of four measurements of an iris flower. There are three types of flowers, represented by 50 instances each. Clustering of these data in the space of the first two principal components, using $\sigma = 1/4$, has the amazing result of misclassification of 3 instances only. Quantum clustering can be applied to the raw data in four dimensions, leading to misclassifications of the order of 15 instances, similar to the clustering quality of [4].

Distance-based QC formulation.—Gradient descent calls for the calculation of $V$ both at the original data points and along the trajectories they follow. An alternative approach is to restrict oneself to the original values of $V$, as in the example displayed in Fig. 3, and follow a hybrid algorithm to be described below. Before turning to such an algorithm let us note that, in this case, we evaluate $V$ on a discrete set of points, $V(\mathbf{x}_i) = V_i$. We can then express $V$ in terms of the distance matrix $D_{ij} = |\mathbf{x}_i - \mathbf{x}_j|$ as

$$V_i = E - \frac{d}{2} + \frac{1}{2\sigma^2}\,\frac{\sum_j D_{ij}^2\, e^{-D_{ij}^2/2\sigma^2}}{\sum_j e^{-D_{ij}^2/2\sigma^2}}, \qquad (7)$$

with $E$ chosen appropriately so that $\min V_i = 0$. This type of formulation is of particular importance if the original information is given in terms of distances between data points rather than their locations in space. In this case we have to proceed with distance information only.
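Equation (7) also translates directly into NumPy (a sketch of ours; variable names are arbitrary):

```python
import numpy as np

def qc_potential_from_distances(D, sigma, d):
    """V_i of Eq. (7) from an (N, N) distance matrix D in dimension d.

    E is chosen so that min V_i = 0; returns (V, E).
    """
    D2 = D ** 2
    G = np.exp(-D2 / (2.0 * sigma ** 2))                       # Gaussian weights
    U = -0.5 * d + (D2 * G).sum(axis=1) / (2.0 * sigma ** 2 * G.sum(axis=1))
    E = -U.min()                                                # fixes min V_i = 0
    return U + E, E
```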

By applying QC we can reach results such as those in Fig. 3 without invoking any explicit spatial distribution of the points in question. One may then analyze the results by choosing a cutoff, e.g., $V < 0.2E$, such that a fraction (e.g., one-third) of the data will be included. On this subset we select groups of points whose distances from one another are smaller than, e.g., $2\sigma$, thus defining cores of clusters. Then we continue with higher values of $V$, e.g., $0.2E < V < 0.4E$, allocating points to previous clusters or forming new cores, as sketched below. Since the choice of distance cutoff in cluster allocation is quite arbitrary, this method cannot be guaranteed to work as well as the gradient-descent approach.
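One possible reading of this hybrid allocation, under our own assumptions (shell boundaries and the "join the nearest labeled point" rule are illustrative choices, not prescribed by the text):

```python
import numpy as np

def allocate_by_cores(D, V, E, sigma, shells=(0.2, 0.4, 0.6, 0.8, 1.01)):
    """Assign cluster labels using only the distance matrix D, the potential
    values V of Eq. (7), and the energy E.

    Points are processed shell by shell in V/E, lowest potential first.
    A point joins the cluster of its nearest already-labeled neighbor if that
    neighbor lies within 2*sigma; otherwise it seeds a new core.
    """
    N = len(V)
    labels = -np.ones(N, dtype=int)
    next_label = 0
    for cut in shells:
        for i in np.argsort(V):                        # low-potential points first
            if labels[i] != -1 or V[i] > cut * E:
                continue
            near = (D[i] < 2.0 * sigma) & (labels != -1)
            if near.any():
                j = np.argmin(np.where(near, D[i], np.inf))
                labels[i] = labels[j]                  # join the nearest labeled point
            else:
                labels[i] = next_label                 # seed a new core
                next_label += 1
    return labels
```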

Generalization.—Our method can easily be generalized to allow for different weighting of different points, as in

$$\psi(\mathbf{x}) = \sum_i c_i\, e^{-(\mathbf{x}-\mathbf{x}_i)^2/2\sigma^2} \qquad (8)$$

with $c_i \ge 0$. This is important if we have some prior information or some other means for emphasizing or deemphasizing the influence of data points. An example of the latter is using QC in conjunction with SVC [6]. SVC has the possibility of labeling points as outliers. This is done by applying quadratic maximization to the Lagrangian

$$W = 1 - \sum_{i,j} \beta_i \beta_j\, e^{-(\mathbf{x}_i-\mathbf{x}_j)^2/2\sigma^2} \qquad (9)$$

over the space of all $0 \le \beta_i \le \frac{1}{pN}$ subject to the constraint $\sum_i \beta_i = 1$. The points for which the upper bound of $\beta_i$ is reached are labeled as outliers. Their number is regulated by $p$, being limited by $pN$. Using for the QC analysis a choice of $c_i = \frac{1}{pN} - \beta_i$ will eliminate the outliers of SVC and emphasize the role of the points expected to lie within the clusters.
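A weighted variant of the earlier potential sketch, following Eq. (8) (ours; how the weights $c_i$ are obtained, e.g., from an SVC run as $c_i = 1/(pN) - \beta_i$, is outside this sketch):

```python
import numpy as np

def weighted_potential(x, data, c, sigma):
    """Unshifted V(x) for the weighted wave function of Eq. (8).

    c : (N,) non-negative weights c_i; c = np.ones(N) recovers Eq. (3).
    """
    sq = np.sum((data - x) ** 2, axis=1)
    g = c * np.exp(-sq / (2.0 * sigma ** 2))       # weighted Gaussian terms
    d = data.shape[1]
    return -0.5 * d + np.sum(sq * g) / (2.0 * sigma ** 2 * np.sum(g))
```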

Discussion.—QC constructs a potential function $V(\mathbf{x})$ on the basis of the data points, using one parameter, $\sigma$, that controls the width of the structures that we search for. The advantage of the potential $V$ over the scale-space probability distribution is that the minima of the former are better defined (deep and robust) than the maxima of the latter. However, both of these methods put the emphasis on cluster centers, rather than, e.g., cluster boundaries. Since the equipotentials of $V$ may take arbitrary shapes, the clusters need not be spherical, as they are in the k-means approach. Nonetheless, spherical clusters appear more naturally than, e.g., ring-shaped or toroidal clusters, even if the data would accommodate them. If some global symmetry is to be expected, e.g., global spherical symmetry, it can be incorporated into the original Schrödinger equation defining the potential function.

QC can be applied in high dimensions by limiting the evaluation of the potential, given as an explicit analytic expression of Gaussian terms, to the locations of the data points only. Thus the complexity of evaluating $V_i$ is of order $N^2$, independent of dimensionality.

Our algorithm has one free parameter, the scale $\sigma$. In all examples we confined ourselves to scales that are of order 1, because we have worked within whitened PCA spaces. If our method is applied to a different data space, the range of scales to be searched could be determined by some other prior information.

Since the strength of our algorithm lies in the easy selection of cluster cores, it can be used as a first stage of a hybrid approach employing other techniques after the identification of cluster centers. The fact that we do not have to take care of feeble minima, but consider only robust deep minima, turns the identification of a core into an easy problem. Thus, an approach that derives its rationale from physical intuition in quantum mechanics can lead to interesting results in the field of pattern classification.

We thank B. Reznik for a helpful discussion.

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic Press, San Diego, CA, 1990).
[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (Wiley-Interscience, New York, 2001), 2nd ed.
[4] M. Blatt, S. Wiseman, and E. Domany, Phys. Rev. Lett. 76, 3251 (1996).
[5] S. J. Roberts, Pattern Recognit. 30, 261 (1997).
[6] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, in Proceedings of the Conference on Advances in Neural Information Processing Systems 13, 2000, edited by Todd K. Leen, Thomas G. Dietterich, and Volker Tresp (MIT Press, Cambridge, MA, 2001), p. 367.
[7] B. D. Ripley, Pattern Recognition and Neural Networks (Cambridge University Press, Cambridge, UK, 1996).
[8] I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1986).
[9] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, Cambridge, UK, 1992), 2nd ed.
[10] R. A. Fisher, Ann. Eugenics 7, 179 (1936).
[11] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, 1998.
