Learning the k in k-means
Greg Hamerly, Charles Elkan
{ghamerly,elkan}@cs.ucsd.edu
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, California 92093-0114
Abstract
When clustering a dataset, the right number k of clusters to use is often
not obvious, and choosing k automatically is a hard algorithmic prob-
lem. In this paper we present an improved algorithm for learning k while
clustering. The G-means algorithm is based on a statistical test for the
hypothesis that a subset of data follows a Gaussian distribution. G-means
runs k-means with increasing k in a hierarchical fashion until the test ac-
cepts the hypothesis that the data assigned to each k-means center are
Gaussian. Two key advantages are that the hypothesis test does not limit
the covariance of the data and does not compute a full covariance matrix.
Additionally, G-means only requires one intuitive parameter, the stand-
ard statistical significance level α. We present results from experiments
showing that the algorithm works well, and better than a recent method
based on the BIC penalty for model complexity. In these experiments,
we show that the BIC is ineffective as a scoring function, since it does
not penalize the model's complexity strongly enough.
1 Introduction and related work
Clustering algorithms are useful tools for data mining, compression, probability density es-
timation, and many other important tasks. However, most clustering algorithms require the
user to specify the number of clusters (called k), and it is not always clear what the best
value for k is. Figure 1 shows examples where k has been improperly chosen. Choosing k is
often an ad hoc decision based on prior knowledge, assumptions, and practical experience.
Choosing k is made more difficult when the data has many dimensions, even when clusters
are well-separated.
Center-based clustering algorithms (in particular k-means and Gaussian expectation-
maximization) usually assume that each cluster adheres to a unimodal distribution, such
as Gaussian. With these methods, only one center should be used to model each subset
of data that follows a unimodal distribution. If multiple centers are used to describe data
drawn from one mode, the centers are a needlessly complex description of the data, and in
fact the multiple centers capture the truth about the subset less well than one center.
In this paper we present a simple algorithm called G-means that discovers an appropriate
k using a statistical test for deciding whether to split a k-means center into two centers.
We describe examples and present experimental results that show that the new algorithm
Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means
centers. On the left, there are too few centers; five should be used. On the right, too many
centers are used; one center is sufficient for representing the data. In general, one center
should be used to represent one Gaussian cluster.
is successful. This technique is useful and applicable for many clustering algorithms other
than k-means, but here we consider only the k-means algorithm for simplicity.
Several algorithms have been proposed previously to determine k automatically. Like our
method, most previous methods are wrappers around k-means or some other clustering
algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to
increase or decrease k as the algorithm proceeds.
Pelleg and Moore [14] proposed a regularization framework for learning k, which they call
X-means. The algorithm searches over many values of k and scores each clustering model
using the so-called Bayesian Information Criterion [10]:

BIC(C | X) = l(X | C) - (p/2) \log n

where l(X | C) is the log-likelihood of the dataset X according to model C, p = k(d + 1)
is the number of parameters in the model with dimensionality d and k cluster centers,
and n is the number of points in the dataset. X-means chooses the model with the best BIC
score on the data. Aside from the BIC, other scoring functions are also available.
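For concreteness, the BIC score above can be written as a tiny helper. The sketch below is ours, not X-means code; it assumes the parameter count p = k(d + 1) from the text and takes a log-likelihood value computed elsewhere:

import math

def bic_score(log_likelihood, k, d, n):
    # BIC as defined above: reward fit, subtract (p/2) log n,
    # with p = k * (d + 1) free parameters for k centers in d dimensions.
    p = k * (d + 1)
    return log_likelihood - 0.5 * p * math.log(n)

Under this rule, a model with more centers is preferred only if it improves the log-likelihood by more than half the penalty difference between the two models.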
Bischof et al. [1] use a minimum description length (MDL) framework, where the descrip-
tion length is a measure of how well the data are fit by the model. Their algorithm starts
with a large value for k and removes centers (reduces k) whenever that choice reduces
the description length. Between steps of reducing k, they use the k-means algorithm to
optimize the model fit to the data.
With hierarchical clustering algorithms, other methods may be employed to determine the
best number of clusters. One is to build a merging tree (“dendrogram”) of the data based
on a cluster distance metric, and search for areas of the tree that are stable with respect
to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best
applied with domain-specific knowledge and human intuition.
2 The Gaussian-means (G-means) algorithm
The G-means algorithm starts with a small number of k-means centers, and grows the
number of centers. Each iteration of the algorithm splits into two those centers whose data
appear not to come from a Gaussian distribution. Between each round of splitting, we run
k-means on the entire dataset and all the centers to refine the current solution. We can
initialize with just k = 1, or we can choose some larger value of k if we have some prior
knowledge about the range of k.
G-means repeatedly makes decisions based on a statistical test for the data assigned to each
center. If the data currently assigned to a k-means center appear to be Gaussian, then we
want to represent that data with only one center.
Algorithm 1 G-means(X, α)
1: Let C be the initial set of centers (usually just one center).
2: C ← kmeans(C, X).
3: Let {x_i | class(x_i) = j} be the set of datapoints assigned to center c_j.
4: Use a statistical test to detect if each {x_i | class(x_i) = j} follows a Gaussian distribution
(at confidence level α).
5: If the data look Gaussian, keep c_j. Otherwise replace c_j with two centers.
6: Repeat from step 2 until no more centers are added.
However, if the same data do not appear to be Gaussian, then we want to use multiple
centers to model the data properly. The algorithm will run k-means multiple times (up to
k times when finding k centers), so the time complexity is at most O(k) times that of
k-means.
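To make the loop concrete, here is a rough Python sketch of Algorithm 1. It is not the authors' implementation (theirs is in Matlab); it uses scikit-learn's k-means and a gaussian_split_test helper implementing the test of Section 2.1 (a sketch of that helper appears there), with the significance level folded into the test's critical value:

import numpy as np
from sklearn.cluster import KMeans

def g_means(X, max_k=200):
    # Sketch of Algorithm 1: grow centers until every cluster's data look Gaussian.
    centers = X.mean(axis=0, keepdims=True)            # step 1: start with one center
    while len(centers) <= max_k:                       # max_k is a safety cap (ours)
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)   # step 2
        centers, labels = km.cluster_centers_, km.labels_
        new_centers = []
        for j in range(len(centers)):                  # steps 3-5: test each cluster
            split = gaussian_split_test(X[labels == j], centers[j])
            new_centers.extend([centers[j]] if split is None else split)
        if len(new_centers) == len(centers):           # step 6: stop when nothing splits
            return centers
        centers = np.asarray(new_centers)
    return centers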
The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically
distributed around the center. Less restrictively, the Gaussian expectation-maximization
algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian
distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian
distribution test that we present below is valid for either covariance matrix assumption.
The test also accounts for the number of datapoints n tested by incorporating n in the
calculation of the critical value of the test (see Equation 2). This prevents the G-means
algorithm from making bad decisions about clusters with few datapoints.
2.1 Testing clusters for Gaussian fit
To specify the G-means algorithm fully we need a test to detect whether the data assigned
to a center are sampled from a Gaussian. The alternative hypotheses are

H0: The data around the center are sampled from a Gaussian.
H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to
model its data, and we should not split the cluster into two sub-clusters. If we reject H0
and accept H1, then we want to split the cluster.
The test we use is based on the Anderson-Darling statistic. This one-dimensional test has
been shown empirically to be the most powerful normality test that is based on the empirical
cumulative distribution function (ECDF). Given a list of values X that have been converted
to mean 0 and variance 1, let x_(i) be the i-th ordered value. Let z_i = F(x_(i)), where F is
the N(0, 1) cumulative distribution function. Then the statistic is

A^2(Z) = -(1/n) \sum_{i=1}^{n} (2i - 1) [ \log(z_i) + \log(1 - z_{n+1-i}) ] - n        (1)

Stephens [17] showed that for the case where the mean and variance are estimated from the
data (as in clustering), we must correct the statistic according to

A_*^2(Z) = A^2(Z) (1 + 4/n - 25/n^2)        (2)
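As a sanity check on equations (1) and (2), the corrected statistic can be computed in a few lines. The sketch below is ours; it uses SciPy's standard normal CDF and clips the CDF values to avoid log(0):

import numpy as np
from scipy.stats import norm

def anderson_darling(x):
    # x: one-dimensional values already transformed to mean 0 and variance 1.
    z = np.clip(norm.cdf(np.sort(x)), 1e-12, 1 - 1e-12)      # z_i = F(x_(i))
    n = len(z)
    i = np.arange(1, n + 1)
    a2 = -np.mean((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1]))) - n        # Equation (1)
    return a2 * (1 + 4.0 / n - 25.0 / n ** 2)                                 # Equation (2)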
Given a subset of data X in d dimensions that belongs to center c, the hypothesis test
proceeds as follows (a code sketch of these steps appears after the list):

1. Choose a significance level α for the test.

2. Initialize two centers, called "children" of c. See the text for good ways to do this.

3. Run k-means on these two centers in X. This can be run to completion, or to some
early stopping point if desired. Let c1, c2 be the child centers chosen by k-means.

4. Let v = c1 - c2 be a d-dimensional vector that connects the two centers. This is
the direction that k-means believes to be important for clustering. Then project X
onto v: x'_i = <x_i, v> / ||v||^2. X' is a 1-dimensional representation of the data
projected onto v. Transform X' so that it has mean 0 and variance 1.

5. Let z_i = F(x'_(i)). If A_*^2(Z) is in the range of non-critical values at confidence
level α, then accept H0, keep the original center, and discard {c1, c2}. Otherwise,
reject H0 and keep {c1, c2} in place of the original center.
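Steps 2-5 can be sketched as follows; this is our illustration, not the authors' code. It reuses the anderson_darling helper above and an init_children helper (a principal-component version is sketched at the end of this subsection); the default critical value 1.8692 is the one quoted in Section 2.2 for α = 0.0001, and the minimum cluster size is our own guard:

import numpy as np
from sklearn.cluster import KMeans

def gaussian_split_test(Xj, center, critical=1.8692):
    # Return two child centers if the cluster fails the Gaussian test, else None.
    if len(Xj) < 8:                               # too few points to test (our choice)
        return None
    children = init_children(Xj, center)          # step 2: children at c + m and c - m
    km = KMeans(n_clusters=2, init=children, n_init=1).fit(Xj)        # step 3
    c1, c2 = km.cluster_centers_
    v = c1 - c2                                   # step 4: the direction k-means cares about
    proj = Xj @ v / (v @ v)                       # project the cluster's points onto v
    proj = (proj - proj.mean()) / proj.std()      # transform to mean 0, variance 1
    if anderson_darling(proj) <= critical:        # step 5: accept H0, keep one center
        return None
    return [c1, c2]                               # reject H0, keep the two children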
A primary contribution of this work is simplifying the test for Gaussian fit by projecting
the data to one dimension where the test is simple to apply. The authors of [5] also use
this approach for online dimensionality reduction during clustering. The one-dimensional
representation of the data allows us to consider only the data along the direction that
k-means has found to be important for separating the data. This is related to the problem
of projection pursuit [7], where here k-means searches for a direction in which the data
appears non-Gaussian.
We must choose the significance level of the test, α, which is the desired probability of
making a Type I error (i.e. incorrectly rejecting H0). It is appropriate to use a Bonferroni
adjustment to reduce the chance of making Type I errors over multiple tests. For example, if
we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni
adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the
G-means algorithm makes O(k) statistical tests, so the Bonferroni correction does not need to
be extreme. In our tests, we always use α = 0.0001.
We consider two ways to initialize the two child centers. Both approaches initialize with
c ± m, where c is a center and m is chosen. The first method chooses m as a random
d-dimensional vector such that ||m|| is small compared to the distortion of the data. A
second method finds the main principal component s of the data (having eigenvalue λ),
and chooses m = s \sqrt{2λ/π}. This deterministic method places the two centers in their
expected locations under H1. The principal component calculations require O(nd^2 + d^3)
time and O(d^2) space, but since we only want the main principal component, we can use
fast methods like the power method, which takes time that is at most linear in the ratio of
the two largest eigenvalues [4]. In this paper we use principal-component-based splitting.
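A sketch of this deterministic initialization, using power iteration for the main principal component, might look like the following (ours, not the authors' code; the offset m = s·sqrt(2λ/π) follows the description above):

import numpy as np

def init_children(Xj, center, iters=50, seed=0):
    # Power iteration for the main principal component of the cluster's data.
    Xc = Xj - center
    v = np.random.default_rng(seed).normal(size=Xj.shape[1])
    for _ in range(iters):
        v = Xc.T @ (Xc @ v)
        v /= np.linalg.norm(v)
    lam = v @ (Xc.T @ (Xc @ v)) / len(Xj)         # eigenvalue of the sample covariance
    m = v * np.sqrt(2.0 * lam / np.pi)            # offset m = s * sqrt(2 * lambda / pi)
    return np.vstack([center + m, center - m])    # the two children c + m and c - m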
2.2 An example
Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters
and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is
1.8692 for this confidence level. Starting with one center, after one iteration of G-means,
we have 2 centers and the corrected A_*^2 statistic is 38.103. This is much larger than the critical value,
so we reject H0 and accept this split. On the next iteration, we split each new center and
repeat the statistical test. The A_*^2 values for the two splits are 0.386 and 0.496, both of
which are well below the critical value. Therefore we accept H0 for both tests, and discard
these splits. Thus G-means gives a final answer of k = 2.
2.3 Statistical power
Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is
better for both plots. We run 1000 tests for each plotted data point in each plot. In the left
Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset
with two true clusters and 1000 points. Starting with one center (left plot), G-means splits
into two centers (middle). The test for normality is significant, so G-means rejects H0 and
keeps the split. After splitting each center again (right), the test values are not significant,
so G-means accepts H0 for both tests and does not accept these splits. The middle plot is
the G-means answer. See the text for further details.
[Figure 3 plots: left panel, P(Type I error) vs. number of datapoints; right panel, P(Type II error) vs. number of datapoints; one curve each for G-means and X-means.]
Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For
the AD test we fix the significance level (α = 0.0001), while the BIC's significance level
depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one
true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not
splitting two well-separated true clusters (Type II error). Both plots are functions of n.
Both plots show that the BIC overfits (splits clusters) when n is small.
plot, for each test we generate n datapoints from a single true Gaussian distribution, and
then plot the frequency with which BIC and G-means will choose two centers rather than one
(i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the
data is not strictly spherical, while G-means does not. This is consistent with the tests of
real-world data in the next section. While G-means commits more Type II errors when n is
small, this prevents it from overfitting the data.
The BIC can be considered a likelihood ratio test, but with a significance level that cannot
be fixed. The significance level instead varies depending on n and Δp (the change in the
number of model parameters between two models). As n or Δp decrease, the significance
level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this
effect for varying n. In [11] the authors show that penalty-based methods require problem-
specific tuning and don't generalize as well as other methods, such as cross validation.
3 Experiments
Table 1 shows the results from running G-means and X-means on many large synthetic
datasets. On synthetic datasets with spherically distributed clusters, G-means and X-means
do equally
Table 1: Results for many synthetic datasets. We report distortion relative to the optimum
distortion for the correct clustering (closer to one is better), and time is reported relative to
k-means run with the correct k. For BIC, larger values are better, but it is clear that finding
the correct clustering does not always coincide with finding a larger BIC. Items with a star
are where X-means always chose the largest number of centers we allowed.
dataset          method      k found          distortion (× optimal)   BIC (× 10^4)       time (× k-means)
synthetic 2-d    G-means      9.1 ±  9.9       0.89 ±  0.23             -0.19 ±  2.70       13.2
  k = 5          X-means     18.1 ±  3.2       0.37 ±  0.12              0.70 ±  0.93        2.8
synthetic 2-d    G-means     20.1 ±  0.6       0.99 ±  0.01              0.21 ±  0.18        2.1
  k = 20         X-means     70.5 ± 11.6       9.45 ± 28.02             14.83 ±  3.50        1.2
synthetic 2-d    G-means     80.0 ±  0.2       1.00 ±  0.01              1.84 ±  0.12        2.2
  k = 80         X-means    171.7 ± 23.7      48.49 ± 70.04             40.16 ±  6.59        1.8
synthetic 8-d    G-means      5.0 ±  0.0       1.00 ±  0.00             -0.74 ±  0.16        4.6
  k = 5          X-means    *20.0 ±  0.0       0.47 ±  0.03             -2.28 ±  0.20       11.0
synthetic 8-d    G-means     20.0 ±  0.1       0.99 ±  0.00             -0.18 ±  0.17        2.6
  k = 20         X-means    *80.0 ±  0.0       0.47 ±  0.01             14.36 ±  0.21        4.0
synthetic 8-d    G-means     80.2 ±  0.5       0.99 ±  0.00              1.45 ±  0.20        2.9
  k = 80         X-means    229.2 ± 36.8       0.57 ±  0.06             52.28 ±  9.26        6.5
synthetic 32-d   G-means      5.0 ±  0.0       1.00 ±  0.00             -3.36 ±  0.21        4.4
  k = 5          X-means    *20.0 ±  0.0       0.76 ±  0.00            -27.92 ±  0.22       29.9
synthetic 32-d   G-means     20.0 ±  0.0       1.00 ±  0.00             -2.73 ±  0.22        2.3
  k = 20         X-means    *80.0 ±  0.0       0.76 ±  0.01            -11.13 ±  0.23       21.2
synthetic 32-d   G-means     80.0 ±  0.0       1.00 ±  0.00             -1.10 ±  0.16        2.8
  k = 80         X-means    171.5 ± 10.9       0.84 ±  0.01             11.78 ±  2.74       53.3
Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses
5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to
overfit the data, choosing 20 unevenly distributed clusters.
well at finding the correct k and maximizing the BIC statistic, so we don't show these
results here. Most real-world data is not spherical, however.
The synthetic datasets used here each have 5000 datapoints in 2, 8, or 32 dimensions. The
true values of k are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with
the true center means chosen uniformly at random from the unit hypercube, and choosing
σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation
to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation
matrix. We run G-means starting with one center. We allow X-means to search between 2
and 4k centers (where k here is the true number of clusters).
The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its
results are closer to the true distortions and the correct values of k. The BIC statistic that
X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus
it overestimates the number of true clusters in non-spherical data. This is especially evident
when the number of points per cluster is small, as in datasets with 80 true clusters.
Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each
cluster (column) found by G-means. G-means did not have the labels, yet it found mean-
ingful clusters corresponding with the labels.
Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows
an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while
G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute
centers evenly among clusters; some clusters receive one center, but others receive many.
G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the
kd-tree structures which make X-means fast in low dimensions take time exponential in
d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab;
X-means is written in C.
3.1 Discovering true clusters in labeled data
We tested these algorithms on two real-world datasets for handwritten digit recognition:
the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without
knowledge of the labels and measure how well the clustering captures the true labels. Both
datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784
dimensions (28 × 28 pixels). We use 6000 randomly chosen examples and we reduce the
dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984
examples and 16 dimensions; we did not change the data in any way.
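The random projection step can be sketched in a couple of lines; this is our reading of the preprocessing (a random Gaussian projection matrix, following [3]), not the exact matrix used in the experiments:

import numpy as np

def random_project(X, target_dim=50, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)   # random projection matrix
    return X @ R                                                          # e.g. (n, 784) -> (n, 50)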
We cluster each dataset with G-means and X-means, and measure performance by com-
paring the cluster labels L with the true labels T. We define the partition quality (PQ) as

PQ = [ \sum_{i=1}^{c} \sum_{j=1}^{k} p(i, j)^2 ] / [ \sum_{i=1}^{c} p(i)^2 ]

where c is the true number of classes, and k is the number of clusters found by the
algorithm. This metric is maximized when L induces the same partition of the data as T;
in other words, when all points in each cluster have the same true label, and the estimated
k is the true k. The term p(i, j) is the frequency-based probability that a datapoint will be
labeled i by T and j by L. This quality is normalized by the sum of true probabilities,
squared. This statistic is related to the Rand statistic for comparing partitions [8].
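The PQ metric as defined above can be computed directly from the label pair frequencies; the sketch below is ours:

import numpy as np

def partition_quality(true_labels, cluster_labels):
    # PQ = sum_ij p(i, j)^2 / sum_i p(i)^2, with frequency-based probabilities.
    T = np.asarray(true_labels)
    L = np.asarray(cluster_labels)
    joint = np.array([[np.mean((T == i) & (L == j)) for j in np.unique(L)]
                      for i in np.unique(T)])      # p(i, j)
    marginal = joint.sum(axis=1)                   # p(i)
    return (joint ** 2).sum() / (marginal ** 2).sum()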
For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177.
X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one
point, indicating an overestimation problem with the BIC. X-means receives a PQ score
of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ
score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057.
Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that
G-means succeeds at identifying the true clusters concisely, without aid of the labels. The
confusions between different digits in the NIST dataset (seen in the off-diagonal elements)
are common for other researchers using more sophisticated techniques; see [3].
4 Discussion and conclusions
We have introduced the new G-means algorithm for learning k based on a statistical test
for determining whether datapoints are a random sample from a Gaussian distribution with
arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a
powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around
k-means to discover the number of clusters automatically. The only parameter supplied
to the algorithm is the significance level of the statistical test, which can easily be set in
a standard way. The G-means algorithm takes linear time and space (plus the cost of the
splitting heuristic and test) in the number of datapoints and dimension, since k-means is
itself linear in time and space. Empirically, the G-means algorithm works well at finding
the correct number of clusters and the locations of genuine cluster centers, and we have
shown it works well in moderately high dimensions.
Clustering in high dimensions has been an open problem for many years. Recent research
has shown that it may be preferable to use dimensionality reduction techniques before clus-
tering, and then use a low-dimensional clustering algorithm such as k-means, rather than
clustering in the high dimension directly. In [3] the author shows that using a simple,
inexpensive linear projection preserves many of the properties of data (such as cluster dis-
tances), while making it easier to find the clusters. Thus there is a need for good-quality,
fast clustering algorithms for low-dimensional data. Our work is a step in this direction.
Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are
based on eigenvector computations on distance matrices. These “spectral” clustering al-
gorithms still use k-means as a post-processing step to find the actual segmentation and
they require k to be specified. Thus we expect G-means will be useful in combination with
spectral clustering.
References
[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference
(UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications,
18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings
of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of
the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14,
2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on
Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.