Learning the k in k-means
Greg Hamerly, Charles Elkan
{ghamerly,elkan}@cs.ucsd.edu
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, California 92093-0114
Abstract

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model's complexity.
1 Introduction and related work
Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated.
Center-based clustering algorithms (in particular k-means and Gaussian expectation-maximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center.

In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm
Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster.
is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity.

Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds.
Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]:

    BIC(C | X) = l(X | C) − (p/2) log n

where l(X | C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available.
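For concreteness, the following Python sketch computes such a BIC score for a k-means clustering, under the identical-spherical-Gaussian likelihood commonly assumed with X-means; the function name, the NumPy layout, and the variance estimate with denominator n − k are our own choices, not code from either paper.

    import numpy as np

    def bic_score(X, centers, labels):
        """BIC of a k-means clustering under a shared spherical-Gaussian model
        (a sketch, not the authors' code).  Larger is better."""
        n, d = X.shape
        k = centers.shape[0]
        # Estimate of the shared spherical variance from the within-cluster error.
        sq_err = np.sum((X - centers[labels]) ** 2)
        var = sq_err / (n - k)
        # Log-likelihood under the mixture of identical spherical Gaussians,
        # with mixing weights n_j / n.
        loglik = 0.0
        for j in range(k):
            n_j = np.sum(labels == j)
            if n_j > 0:
                loglik += n_j * np.log(n_j / n)
        loglik += -0.5 * n * d * np.log(2 * np.pi * var) - 0.5 * (n - k)
        # p = k(d + 1) free parameters, as in the text above.
        p = k * (d + 1)
        return loglik - 0.5 * p * np.log(n)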
Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data.

With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree ("dendrogram") of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition.
2 The Gaussian-means (G-means) algorithm
The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k.

G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear
Algorithm 1 G-means(X, α)
1: Let C be the initial set of centers (usually C ← {x̄}).
2: C ← kmeans(C, X).
3: Let {x_i | class(x_i) = j} be the set of datapoints assigned to center c_j.
4: Use a statistical test to detect if each {x_i | class(x_i) = j} follows a Gaussian distribution (at confidence level α).
5: If the data look Gaussian, keep c_j. Otherwise replace c_j with two centers.
6: Repeat from step 2 until no more centers are added.
to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most k times that of k-means.
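To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the outer loop; the k-means refinement is delegated to scikit-learn's KMeans purely for illustration, and split_test stands for the statistical test of Section 2.1 (a sketch of which appears there). None of these names come from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def g_means(X, split_test, alpha=1e-4):
        """Sketch of Algorithm 1.  split_test(points, alpha) must return None
        to keep a center, or a pair of child centers to replace it with."""
        centers = X.mean(axis=0, keepdims=True)              # step 1: start with k = 1
        while True:
            k = len(centers)
            km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)   # step 2
            centers, labels = km.cluster_centers_, km.labels_
            new_centers = []
            for j in range(k):                               # steps 3-5
                children = split_test(X[labels == j], alpha)
                if children is None:
                    new_centers.append(centers[j])           # data look Gaussian: keep
                else:
                    new_centers.extend(children)             # replace with two children
            if len(new_centers) == k:                        # step 6: no center was split
                return np.array(new_centers)
            centers = np.array(new_centers)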
The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.
2.1 Testing clusters for Gaussian fit
To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are

H0: The data around the center are sampled from a Gaussian.
H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.
The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values x_i that have been converted to mean 0 and variance 1, let x_(i) be the ith ordered value. Let z_i = F(x_(i)), where F is the N(0,1) cumulative distribution function. Then the statistic is

    A²(Z) = −(1/n) Σ_{i=1}^{n} (2i − 1) [log(z_i) + log(1 − z_{n+1−i})] − n        (1)

Stephens [17] showed that for the case where the mean and variance are estimated from the data (as in clustering), we must correct the statistic according to

    A²*(Z) = A²(Z) (1 + 4/n − 25/n²)        (2)
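The corrected statistic of Equations (1) and (2) is straightforward to compute; the following Python sketch does so for a one-dimensional sample, using scipy's standard normal CDF for F (the function name is ours).

    import numpy as np
    from scipy.stats import norm

    def anderson_darling_corrected(x):
        """Corrected Anderson-Darling statistic A2*(Z) of Equations (1) and (2),
        for a 1-D sample whose mean and variance are estimated from the data."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        # Standardize to mean 0 and variance 1, then map through F = N(0,1) CDF.
        z = norm.cdf((x - x.mean()) / x.std(ddof=1))
        z = np.clip(z, 1e-12, 1 - 1e-12)        # guard the logarithms below
        i = np.arange(1, n + 1)
        a2 = -np.sum((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1]))) / n - n
        return a2 * (1 + 4.0 / n - 25.0 / n ** 2)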
Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows:

1. Choose a significance level α for the test.
2. Initialize two centers, called "children" of c. See the text for good ways to do this.
3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1, c2 be the child centers chosen by k-means.
4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: x'_i = ⟨x_i, v⟩ / ||v||². X' is a 1-dimensional representation of the data projected onto v. Transform X' so that it has mean 0 and variance 1.
5. Let z_i = F(x'_(i)). If A²*(Z) is in the range of non-critical values at confidence level α, then accept H0, keep the original center, and discard {c1, c2}. Otherwise, reject H0 and keep {c1, c2} in place of the original center.
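Putting the five steps together, one possible sketch of the per-cluster test is shown below. It reuses the anderson_darling_corrected helper sketched after Equation (2), uses scikit-learn's KMeans for the 2-means step, and hard-codes the critical value 1.8692 quoted in Section 2.2 for the paper's choice of α (for other significance levels the critical value must be looked up). The function name and the minimum-size guard are our own additions.

    import numpy as np
    from sklearn.cluster import KMeans

    def gaussian_split_test(points, alpha=1e-4, critical=1.8692):
        """Returns None to keep the original center, or the two child centers
        if the projected data fail the Gaussian test at significance alpha
        (the critical value passed in must correspond to alpha)."""
        if len(points) < 8:                     # too few points to test reliably
            return None
        # Steps 2-3: fit two child centers with 2-means on this cluster's data.
        km = KMeans(n_clusters=2, n_init=10).fit(points)
        c1, c2 = km.cluster_centers_
        # Step 4: project onto v = c1 - c2; anderson_darling_corrected then
        # standardizes the projection to mean 0 and variance 1 itself.
        v = c1 - c2
        projected = points @ v / (v @ v)
        # Step 5: accept H0 (keep one center) if the statistic is non-critical.
        return None if anderson_darling_corrected(projected) <= critical else (c1, c2)

Passing this function as the split_test argument of the g_means sketch in Section 2 reproduces the overall structure of Algorithm 1.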
A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that k-means has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian.

We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes on the order of k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001.
We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s √(2λ/π). This deterministic method places the two centers in their expected locations under H0. The principal component calculations require O(nd²) time and O(d²) space just to form the covariance matrix, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting.
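A minimal sketch of the principal-component-based initialization follows, assuming the cluster's points and current center are given; np.linalg.eigh stands in for the power method discussed above, and the helper name is ours.

    import numpy as np

    def init_children(points, center):
        """Deterministic child initialization c +/- m with m = s * sqrt(2*lambda/pi),
        where s is the cluster's main principal component and lambda its eigenvalue."""
        X = points - points.mean(axis=0)
        cov = X.T @ X / (len(points) - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        lam, s = eigvals[-1], eigvecs[:, -1]         # top eigenpair
        m = s * np.sqrt(2.0 * lam / np.pi)
        return center + m, center - m

In the split-test sketch of Section 2.1, these two children could be supplied as the initial centers of the 2-means run, e.g. KMeans(n_clusters=2, init=np.vstack(init_children(points, center)), n_init=1).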
2.2 An example
Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A²* statistic is 38.103. This is much larger than the critical value, so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A²* values for the two splits are 0.386 and 0.496, both of which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2.
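Under the same assumptions as the sketches above, a toy reproduction of this behaviour might look as follows; the two-cluster data here is our own synthetic example, not the dataset of Figure 2.

    import numpy as np

    # Two well-separated 2-d Gaussian clusters of 500 points each.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=(0, 0), scale=1.0, size=(500, 2)),
                   rng.normal(loc=(6, 6), scale=1.0, size=(500, 2))])
    centers = g_means(X, split_test=gaussian_split_test, alpha=1e-4)
    print(len(centers))   # expected to settle on 2 centers, as in this example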
2.3 Statistical power
Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is
better for both plots. We run 1000 tests for each data point plotted for both plots. In the left
Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details.
Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC's significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small.
plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data.
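The left-plot experiment can be approximated with the sketches above; the following code estimates the Type I error rate of the split test by simulation. For simplicity it draws from a spherical Gaussian rather than the slightly elliptical cluster used for Figure 3, so it mirrors the experiment only in spirit and is our own reconstruction, not the authors' experimental code.

    import numpy as np

    def type_i_error_rate(n, trials=1000, alpha=1e-4, critical=1.8692):
        """Monte Carlo estimate of the Type I error of the split test: draw n
        points from one Gaussian and count how often the test wrongly splits."""
        rng = np.random.default_rng(0)
        errors = 0
        for _ in range(trials):
            points = rng.standard_normal((n, 2))     # one true spherical cluster
            if gaussian_split_test(points, alpha=alpha, critical=critical) is not None:
                errors += 1
        return errors / trials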
The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and Δp (the change in the number of model parameters between two models). As n or Δp decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problem-specific tuning and don't generalize as well as other methods, such as cross validation.
3 Experiments
Table 1 shows the results from running G-means and X-means on many large synthetic datasets. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally
Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed.
dataset          method     k found          distortion (× optimal)   BIC              time (× k-means)
synthetic 2-d    G-means      9.1 ±  9.9       0.89 ±  0.23            -0.19 ±  2.70      13.2
k = 5            X-means     18.1 ±  3.2       0.37 ±  0.12             0.70 ±  0.93       2.8
synthetic 2-d    G-means     20.1 ±  0.6       0.99 ±  0.01             0.21 ±  0.18       2.1
k = 20           X-means     70.5 ± 11.6       9.45 ± 28.02            14.83 ±  3.50       1.2
synthetic 2-d    G-means     80.0 ±  0.2       1.00 ±  0.01             1.84 ±  0.12       2.2
k = 80           X-means    171.7 ± 23.7      48.49 ± 70.04            40.16 ±  6.59       1.8
synthetic 8-d    G-means      5.0 ±  0.0       1.00 ±  0.00            -0.74 ±  0.16       4.6
k = 5            X-means    *20.0 ±  0.0       0.47 ±  0.03            -2.28 ±  0.20      11.0
synthetic 8-d    G-means     20.0 ±  0.1       0.99 ±  0.00            -0.18 ±  0.17       2.6
k = 20           X-means    *80.0 ±  0.0       0.47 ±  0.01            14.36 ±  0.21       4.0
synthetic 8-d    G-means     80.2 ±  0.5       0.99 ±  0.00             1.45 ±  0.20       2.9
k = 80           X-means    229.2 ± 36.8       0.57 ±  0.06            52.28 ±  9.26       6.5
synthetic 32-d   G-means      5.0 ±  0.0       1.00 ±  0.00            -3.36 ±  0.21       4.4
k = 5            X-means    *20.0 ±  0.0       0.76 ±  0.00           -27.92 ±  0.22      29.9
synthetic 32-d   G-means     20.0 ±  0.0       1.00 ±  0.00            -2.73 ±  0.22       2.3
k = 20           X-means    *80.0 ±  0.0       0.76 ±  0.01           -11.13 ±  0.23      21.2
synthetic 32-d   G-means     80.0 ±  0.0       1.00 ±  0.00            -1.10 ±  0.16       2.8
k = 80           X-means    171.5 ± 10.9       0.84 ±  0.01            11.78 ±  2.74      53.3
Figure 4: A 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters.
well at finding the correct k and maximizing the BIC statistic, so we don't show these results here. Most real-world data is not spherical, however.

The synthetic datasets used here each have 5000 datapoints in d = 2, 8, or 32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where k here is the true number of clusters).

The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters.
Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels.
Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many.

G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C.
3.1 Discovering true clusters in labeled data
We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28 × 28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way.
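A random projection of the kind used here can be sketched in a few lines; the Gaussian projection matrix and the 1/√(target_dim) scaling are standard choices in the spirit of [3], not necessarily the exact preprocessing used for these experiments.

    import numpy as np

    def random_projection(X, target_dim=50, seed=0):
        """Reduce dimensionality with a Gaussian random projection; the scaling
        roughly preserves pairwise distances (a sketch, not the authors' code)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        R = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
        return X @ R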
We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels K with the true labels C. We define the partition quality (PQ) as

    PQ = [ Σ_{i=1}^{c} Σ_{j=1}^{k} p(i, j)² ] / [ Σ_{i=1}^{c} p(i)² ]

where c is the true number of classes, and k is the number of clusters found by the algorithm. This metric is maximized when K induces the same partition of the data as C; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The term p(i, j) is the frequency-based probability that a datapoint will be labeled i by C and j by K. This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8].
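The PQ metric is easy to compute from the two label vectors; the sketch below follows the definition above, with the function name and array layout our own.

    import numpy as np

    def partition_quality(true_labels, cluster_labels):
        """Partition quality (PQ): sum of squared joint class/cluster frequencies,
        normalized by the sum of squared true-class frequencies."""
        true_labels = np.asarray(true_labels)
        cluster_labels = np.asarray(cluster_labels)
        classes = np.unique(true_labels)
        clusters = np.unique(cluster_labels)
        # p(i, j): frequency-based joint probability of true class i and cluster j.
        joint = np.array([[np.mean((true_labels == i) & (cluster_labels == j))
                           for j in clusters] for i in classes])
        p_true = np.array([np.mean(true_labels == i) for i in classes])
        return np.sum(joint ** 2) / np.sum(p_true ** 2)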
For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057.

Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques; see [3].
4 Discussion and conclusions
We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions.
Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction.

Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These "spectral" clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering.
References
[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference
(UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications,
18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings
of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of
the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14,
2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on
Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.