Clustering with minimum spanning trees:
How good can it be?
Marek Gagolewski1*, Anna Cena2, Maciej Bartoszuk3
and Łukasz Brzozowski2
1*Deakin University, Data to Intelligence Research Centre, School
of IT, Geelong, VIC 3220, Australia.
2Warsaw University of Technology, Faculty of Mathematics and
Information Science, ul. Koszykowa 75, 00-662 Warsaw, Poland.
3QED Software, ul. Miedziana 3A, 00-814 Warsaw, Poland.
*Corresponding author(s). E-mail(s):
m.gagolewski@deakin.edu.au;
Contributing authors: anna.cena@pw.edu.pl;
maciej.bartoszuk@qed.pl; lukasz.brzozowski.dokt@pw.edu.pl;
Abstract
Minimum spanning trees (MSTs) provide a convenient representation
of datasets in numerous pattern recognition activities. Moreover, they
are relatively fast to compute. In this paper, we quantify the extent to
which they can be meaningful in data clustering tasks. By identifying
the upper bounds for the agreement between the best (oracle) algo-
rithm and the expert labels from a large battery of benchmark data,
we discover that MST methods can overall be very competitive. Next,
instead of proposing yet another algorithm that performs well on a
limited set of examples, we review, study, extend, and generalise existing
state-of-the-art MST-based partitioning schemes, which leads
to a few new and interesting approaches. It turns out that the Genie
method and the information-theoretic approaches often outperform the
non-MST algorithms such as k-means, Gaussian mixtures, spectral
clustering, BIRCH, and classical hierarchical agglomerative procedures.
Keywords: hierarchical clustering, minimum spanning tree, MST, cluster
validity measure, single linkage, Genie algorithm
1 Introduction
Clustering (segmentation) aims to find meaningful partitions of a given
dataset in a purely unsupervised manner. Such partitions are useful in many practical
applications; see, e.g., (Guo, Yang, Li, Xiong, & Ma, 2023; Hwang et al., 2023; Zhao
et al.,2023;Zhou et al.,2023). Up to this date, many clustering approaches
have been proposed (see, e.g., (Wierzchoń & Kłopotek,2018) for an overview)
together with methods to assess their usefulness: internal (Arbelaitz, Gurrutx-
aga, Muguerza, Pérez, & Perona,2013;Gagolewski, Bartoszuk, & Cena,2021;
Halkidi, Batistakis, & Vazirgiannis,2001;Jaskowiak, Costa, & Campello,2022;
Maulik & Bandyopadhyay,2002;Milligan & Cooper,1985;Q. Xu, Zhang,
Liu, & Luo,2020) and external cluster validity measures (Gagolewski,2022a;
Horta & Campello,2015;Rezaei & Fränti,2016;Wagner & Wagner,2006)
on various kinds of benchmark data (Dua & Graff,2021;Fränti & Sieranoja,
2018; Gagolewski, 2022b; Graves & Pedrycz, 2010; Thrun & Ultsch, 2020).
Given a dataset $X = \{x_1, \dots, x_n\}$ with $n$ points in $\mathbb{R}^d$, the space of all
its possible $k$-partitions $\mathcal{X}_k$ is very large. Namely, the number of possible
divisions of $X$ into $k \ge 2$ nonempty, mutually disjoint clusters is equal to the
Stirling number of the second kind, $\left\{ {n \atop k} \right\} = O(k^n)$.
Thus, in practice, clustering algorithms tend to construct a simpler representation
of the search space to make their job easier. For instance, in the
well-known k-means algorithm (Lloyd, 1957 (1982)), $k$ (continuous) cluster
centroids are sought, and a point's belongingness to a subset is determined
by means of the proximity thereto. In hierarchical agglomerative algorithms,
we start with $n$ singletons, and then keep merging pairs of clusters (based on
different criteria, e.g., average or complete linkage; see (Müllner, 2011)) until
we obtain $k$ of them. In divisive schemes, on the other hand, we start with
one cluster consisting of all the points and then try to split it into smaller
and smaller chunks iteratively.
From this perspective, different spanning trees of a given dataset offer
a very attractive representation. In particular, the minimum spanning tree¹
(MST; the shortest dendrite) with respect to the Euclidean metric² minimises
the total edge length, i.e., the sum of the distances between the pairs of points joined by its edges.
More formally, given an undirected weighted graph representing our
dataset, $G = (V, E, W)$ with $V = \{1, \dots, n\}$, $E = \{\{u, v\}: u < v\}$, and
$W(\{u, v\}) = \|x_u - x_v\|$, the minimum spanning tree $T = \mathrm{MST}(G) = (V, E', W')$,
$E' \subseteq E$, $W' = W|_{E'}$, is a connected tree spanning $V$ with $E'$ minimising
$\sum_{\{u,v\} \in E'} W(\{u, v\})$.
Any spanning tree representing a dataset with $n$ points has $n-1$ edges. If
we remove $k-1$ of them, we will obtain $k$ connected components which can
be interpreted as clusters; compare Figure 1. This reduces the search space to
$\binom{n-1}{k-1} = O(n^{k-1})$. While still large, some heuristics (e.g., greedy approaches)
allow for further simplifications.

¹ We will assume in this paper that an MST is always unique. This can be assured by adding,
e.g., a tiny amount of noise to the points' coordinates.
² How to perform the appropriate feature engineering is an independent problem (e.g., selection
of relevant features, normalisation of columns, noise point removal, etc.), which we are not
concerned with in our paper for simplicity of presentation.

Fig. 1 Removing three edges from a spanning tree gives four connected components, which
we can treat as separate clusters
MSTs are fast to compute: in $O(n^2)$ time for general metrics; see the classic
algorithms by Borůvka (1926), Jarník (1930) (which is more widely known
as a method by Prim (1957); see (Olson, 1995) for its parallelised version),
and Kruskal (1956); see (Gower & Ross, 1969; Graham & Hell, 1985; Zhong,
Malinen, Miao, & Fränti, 2015) for some historical notes. In small-dimensional
Euclidean spaces, further speed-ups are possible (e.g., (March, Ram, & Gray,
2010): $\Omega(n \log n)$ for $d = 2$). Approximate MSTs can be computed as well (e.g.,
(Naidan, Boytsov, Malkov, & Novak, 2019; Zhong et al., 2015)).
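To make the above concrete, the following minimal Python sketch computes an exact Euclidean MST with standard SciPy routines (the quadratic distance matrix makes it suitable for small n only; the function name is ours):

import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def euclidean_mst_edges(X):
    # Returns the n-1 MST edges of X as (weight, u, v) triples; O(n^2) time and memory.
    D = distance_matrix(X, X)                 # all pairwise Euclidean distances
    T = minimum_spanning_tree(D).tocoo()      # sparse result with exactly n-1 nonzero entries
    return sorted(zip(T.data, T.row, T.col))  # edges sorted increasingly by weight

if __name__ == "__main__":
    X = np.random.default_rng(123).normal(size=(100, 2))
    edges = euclidean_mst_edges(X)
    print(len(edges), "edges, total length:", sum(w for w, _, _ in edges))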
Applications of MST-based algorithms are plentiful (e.g., gene expression
discussed in (Y. Xu, Olman, & Xu,2002), pattern recognition in images (Yin
& Liu,2009), etc.). Overall, in our case, they allow for detecting well-separated
clusters of arbitrary shapes (e.g., spirals, connected line segments, blobs; see
Figure 2). Unlike in the k-means algorithm (via its connection to Voronoi
diagrams), the clusters do not necessarily have to be convex.
This paper aims to review, unify, and extend a large number of exist-
ing approaches to clustering based on MSTs (that yield a specific number of
clusters, k) and determine which of them works best on an extensive battery
of benchmark data. Furthermore, we quantify how well the particular MST-
based methods perform in general: are they comparable with state-of-the-art
clustering procedures?
This paper is set out as follows. Section 2 reviews existing MST-based
methods and introduces some noteworthy generalisations thereof, in particular:
divisive and agglomerative schemes optimising different cluster validity
measures (with or without additional constraints). In Section 3, we answer the
question of whether MSTs can provide us with a meaningful representation
of the benchmark datasets studied for the purpose of data clustering tasks.
Then, we pinpoint the best-performing algorithms and compare them with
non-MST-based approaches. Section 4 concludes the paper and suggests some
topics for further research.
Fig. 2 Example benchmark datasets: SIPU/pathbased, WUT/olympic, SIPU/spiral, WUT/smile,
SIPU/compound, and WUT/x1 (see Tables 2 and 3 and (Chang & Yeung, 2008;
Fränti & Sieranoja, 2018; Gagolewski, 2022b; Zahn, 1971)). Minimum spanning trees often
lead to a meaningful representation of well-separable clusters of arbitrary shapes
Table 1 Clustering methods studied († denotes an algorithm not based on MSTs)
method
1 Genie_G0.1 (Gagolewski,2021;Gagolewski, Bartoszuk, & Cena,2016)
2 Genie_G0.3
3 Genie_G0.5
4 Genie_G0.7
5 Genie+Ic (k+ 0) (information criterion agglomerative from a partial partition)
(Cena,2018;Gagolewski,2021)
6 Genie+Ic (k+ 5)
7 Genie+Ic (k+ 10)
8 IcA (Gagolewski,2021) (information criterion agglomerative strategy)
9 ITM (information criterion divisive strategy) (Müller, Nowozin, & Lampert,2012)
10 Single
11 HEMST (Grygorash, Zhou, & Jorgensen,2006)
12 CTCEHC (Ma, Lin, Wang, Huang, & He,2021)
13 MST/D_BallHall (Ball & Hall, 1965) (optimising the cluster validity index –
a divisive strategy over MSTs)
14 MST/D_CalinskiHarabasz (Caliński & Harabasz,1974)
15 MST/D_DaviesBouldin (Davies & Bouldin,1979)
16 MST/D_Silhouette (Rousseeuw,1987)
17 MST/D_SilhouetteW (Rousseeuw,1987)
18 MST/D_GDunn_d1_D1 (Bezdek & Pal,1998;Dunn,1974)
19 MST/D_GDunn_d1_D2 (Bezdek & Pal,1998)
20 MST/D_GDunn_d1_D3
21 MST/D_GDunn_d2_D1
22 MST/D_GDunn_d2_D2
23 MST/D_GDunn_d2_D3
24 MST/D_GDunn_d3_D1
25 MST/D_GDunn_d3_D2
26 MST/D_GDunn_d3_D3
27 MST/D_GDunn_d4_D1
28 MST/D_GDunn_d4_D2
29 MST/D_GDunn_d4_D3
30 MST/D_GDunn_d5_D1
31 MST/D_GDunn_d5_D2
32 MST/D_GDunn_d5_D3
33 MST/D_WCNN_25 (Gagolewski et al.,2021)
34 MST/D_DuNN_25_Min_Max (Gagolewski et al.,2021)
35 MST/D_DuNN_25_Mean_Mean
36 MST/D_DuNN_25_Max_Min
37 † Average
38 † Complete
39 † Ward
40 † GaussMix
41 † KMeans
42 † Birch (T=0.01, BF=50) (Zhang, Ramakrishnan, & Livny, 1996)
43 † Spectral (RBF, G=1)
44–95 † Minima of 52 different cluster validity measures (Gagolewski et al., 2021)
96–97 † Other hierarchical methods (centroid, median, weighted/McQuitty linkage)
98–125 † Birch with 23 other parameter settings
126–140 † Spectral with 19 other parameter settings
2 Methods
Table 1 lists all the methods we consider in this study. Let us describe them
in detail.
2.1 Divisive algorithms
Perhaps the most widely known MST-based method is the classic single linkage
scheme (Wrocław Taxonomy, the dendrite method, nearest neighbour clustering).
It was proposed by the Polish mathematicians Florek, Łukaszewicz, Perkal,
Steinhaus, and Zubrzycki in (1951).
That the single linkage clustering can be computed using the following
divisive scheme over MSTs was noted in (Gower & Ross,1969).
Algorithm 1 (Single Linkage Divisively). To obtain the single linkage $k$-partition
of a given dataset $X$ represented by a complete graph $G$ whose
weights correspond to pairwise distances between all point pairs, proceed as
follows:
1. Let $T = \mathrm{MST}(G) = (V, E', W')$;
2. Let $\{\{1, \dots, n\}\}$ be an initial 1-partition consisting of the cluster representing
all the points;
3. For $i = 1, \dots, k-1$ do:
   (a) Split the cluster containing the $u$-th and the $v$-th point (so that they do
   not belong to the same connected component anymore), where $\{u, v\} \in E'$
   is the edge of the MST with the $i$-th greatest weight;
4. Return the current $k$-partition as a result.
In other words, we remove the $k-1$ edges of the greatest lengths³ from $E'$
and study the resulting connected components.
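For illustration, Algorithm 1 can be realised in a few lines, given an MST represented as (weight, u, v) triples like those produced by the sketch in the Introduction (a minimal, unoptimised version; the function name is ours):

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def single_linkage_cut(mst_edges, n, k):
    # Drop the k-1 heaviest MST edges and label the resulting connected components.
    kept = sorted(mst_edges)[: n - k]
    A = coo_matrix((np.ones(len(kept)),
                    ([u for _, u, _ in kept], [v for _, _, v in kept])),
                   shape=(n, n))
    n_components, labels = connected_components(A, directed=False)
    assert n_components == k
    return labels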
Another divisive algorithm over MSTs was studied by Caliński and
Harabasz in (1974). They minimised the total within-cluster sum of squares
(the same objective as in the k-means algorithm); they provided it as an alternative to
the agglomerative (but non-MST) Ward (1963) algorithm and to the exhaustive
divisive procedure by Edwards and Cavalli-Sforza (1965).
More generally, let $F: \mathcal{X}_l \to \mathbb{R}$ be some objective function that we would
like to maximise over the set of possible partitionings of any cardinality $l$
(not just $k$, which we treat as fixed). We will refer to it as a cluster validity
measure.
Moreover, let $\mathcal{C}(V, E'') = (X_1, \dots, X_l) \in \mathcal{X}_l$ be the partition corresponding
to the connected components (with no loss in generality, assuming that there
are $l$ of them) in a subgraph $(V, E'')$ of $(V, E)$.
Algorithm 2 (Maximising $F$ over an MST Divisively). A general divisive
scheme over an MST is a greedy optimisation algorithm that goes as follows:
1. Let $T = \mathrm{MST}(G) = (V, E', W')$;
2. Let $E'' = E'$;
3. For $i = 1, \dots, k-1$ do:
   (a) Find $\{u, v\} \in E''$ which is a solution to: $\max_{\{u,v\}} F(\mathcal{C}(V, E'' \setminus \{\{u, v\}\}))$;
   (b) Remove $\{u, v\}$ from $E''$;
4. Return $\mathcal{C}(V, E'')$ as a result.

³ Which usually leads to outliers being classified as singleton clusters.
Overall, a divisive scheme is slightly more time-intensive than the agglomerative
approach which we mention below (although the partition refinement data
structure can be used). However, it is still significantly more feasible than in the
case where the dataset is represented by a more complicated graph (nearest-neighbour,
complete, etc.).
Thus, in the case of the single linkage scheme, the objective is such
that we simply maximise the sum of weights of the omitted MST edges, and
in the setting of the Caliński and Harabasz (1974) paper, we maximise (note
the minus): $-\mathrm{WCSS}(X_1, \dots, X_l) = -\sum_{i=1}^{l} \sum_{x_j \in X_i} \|x_j - \mu_i\|^2$, where $\mu_i$ is
the centroid (componentwise arithmetic mean) of the $i$-th cluster.
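Below is a compact, unoptimised sketch of Algorithm 2 in Python, with the negated within-cluster sum of squares serving as the example objective F; all names are ours, and the brute-force re-evaluation of F at every step is for clarity only, not efficiency:

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def components(n, edges):
    # Connected-component labels of the graph on {0, ..., n-1} with the given (w, u, v) edges.
    A = coo_matrix((np.ones(len(edges)),
                    ([u for _, u, _ in edges], [v for _, _, v in edges])),
                   shape=(n, n))
    return connected_components(A, directed=False)[1]

def neg_wcss(X, labels):
    # Minus the within-cluster sum of squares (squared distances to the cluster centroids).
    return -sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in np.unique(labels))

def divisive_mst(X, mst_edges, k, F=neg_wcss):
    # Algorithm 2: greedily remove k-1 MST edges, each time maximising F over the result.
    n, edges = len(X), list(mst_edges)
    for _ in range(k - 1):
        best = max(range(len(edges)),
                   key=lambda i: F(X, components(n, edges[:i] + edges[i + 1:])))
        del edges[best]
    return components(n, edges)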
Naturally, other objective functions can be studied. For instance, Müller,
Nowozin, and Lampert in (2012) considered the information-theoretic criterion
based on entropy, which takes into account cluster sizes and average
within-cluster MST edges' weights: $\mathrm{IC}(X_1, \dots, X_l) = -d \sum_{i=1}^{l} \frac{n_i}{n} \log \frac{L_i}{n_i} - \sum_{i=1}^{l} \frac{n_i}{n} \log \frac{n_i}{n}$,
where $L_i$ denotes the sum of the weights of edges in the subtree
of the MST representing the $i$-th cluster and $n_i$ denotes its size. Interestingly,
this estimator can be derived from the Rényi entropy estimated on various
graph representations of data, including MSTs; see, e.g., (Eggels & Crommelin,
2019; Hero III & Michel, 1998; Pál, Póczos, & Szepesvári, 2010). This leads
to an algorithm called ITM⁴.
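A small helper evaluating the information criterion for a given MST and candidate partition might look as follows (a sketch based on the formula as reconstructed above; the function name is ours, and singleton clusters are not handled, as their L_i equals zero):

import numpy as np

def information_criterion(mst_edges, labels, d):
    # IC of a partition; mst_edges are (weight, u, v) triples, labels are cluster identifiers,
    # d is the dimensionality of the data set.
    labels = np.asarray(labels)
    n = len(labels)
    intra = {}   # L_i: total weight of the MST edges joining two points of the i-th cluster
    for w, u, v in mst_edges:
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0.0) + w
    ic = 0.0
    for c in np.unique(labels):
        n_i = (labels == c).sum()
        ic += -d * (n_i / n) * np.log(intra[c] / n_i) - (n_i / n) * np.log(n_i / n)
    return ic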
Other popular internal cluster validity indices can be optimised as well.
Here, we shall consider the most notable measures that we reviewed in our
previous paper (Gagolewski et al.,2021) and implemented in (Gagolewski,
2021) (leading to clustering methods denoted with MST/D_ in Table 1): the
indices by Ball–Hall (1965), Caliński–Harabasz (1974, Eq. (3)) (equivalent
to maximising the above WCSS), Davies–Bouldin (1979, Def. 5), Silhouette,
SilhouetteW (Rousseeuw, 1987), generalisations of the Dunn index (1974) proposed
in (Bezdek & Pal, 1998) (GDunn_dX_DY) and (Gagolewski et al.,
2021) (DuNN_M_X_Y), and the nearest-neighbour count (Gagolewski et al.,
2021) (WCNN_M).
As a byproduct, we will be able to assess the meaningfulness of the cluster
validity measures, just like in (Gagolewski et al., 2021), where we have done
this in the space of all possible clusterings (leading to the conclusion that
many measures are actually invalid).

⁴ Python implementation available at https://github.com/amueller/information-theoretic-mst.
On a side note, as the size of the space of all possible $k$-partitions of an MST
is $O(n^{k-1})$, for small $k$, it is technically possible to find the true maximum of $F$
(note that for $k = 2$ the divisive strategy gives exactly the global maximum).
We leave this topic for further research.
2.2 Agglomerative algorithms
Single linkage was rediscovered by Sneath in (1957), who introduced it as a
general agglomerative scheme. Its resemblance to the Kruskal MST algorithm
(and hence that an MST is sufficient to compute it) was noted in, amongst
others, (Gower & Ross,1969). Thus, we can formulate it also as follows.
Algorithm 3 (Single Linkage Agglomeratively). To obtain the single
linkage $k$-clustering:
1. Let $T = \mathrm{MST}(G) = (V, E', W')$;
2. Let $\{\{1\}, \dots, \{n\}\}$ be an initial $n$-partition consisting of $n$ singletons;
3. For $i = 1, \dots, n-k$ do:
   (a) Merge the two clusters containing the $u$-th and the $v$-th point, where
   $\{u, v\} \in E'$ is the edge of the MST with the $i$-th smallest weight;
4. Return the current $k$-partition as a result.
For a given MST with edges sorted increasingly, the disjoint sets (union-find)
data structure can be used to implement the above so that the total
run-time is only $O(n-k)$.
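A minimal union-find realisation of Algorithm 3 (again with mst_edges as (weight, u, v) triples; the function name is ours):

def single_linkage_union_find(mst_edges, n, k):
    # Merge along the n-k lightest MST edges; return cluster labels 0, ..., k-1.
    parent = list(range(n))

    def find(x):  # root of x's set, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in sorted(mst_edges)[: n - k]:
        parent[find(u)] = find(v)   # union of the two clusters

    relabel = {r: j for j, r in enumerate(sorted({find(i) for i in range(n)}))}
    return [relabel[find(i)] for i in range(n)]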
Given a cluster validity measure F, the above agglomerative approach can
be generalised as below.
Algorithm 4 (Maximising $F$ over an MST Agglomeratively). A general
agglomerative scheme over an MST is a greedy optimisation algorithm that
consists of the following steps:
1. Let $T = \mathrm{MST}(G) = (V, E', W')$;
2. Let $E'' = \emptyset$;
3. For $i = 1, \dots, n-k$ do:
   (a) Find $\{u, v\} \in E' \setminus E''$ which is a solution to: $\max_{\{u,v\}} F(\mathcal{C}(V, E'' \cup \{\{u, v\}\}))$;
   (b) Add $\{u, v\}$ to $E''$;
4. Return $\mathcal{C}(V, E'')$ as a result.
In the single linkage case, $F$ is simply the sum of the weights of the MST edges left
unconsumed (or, equivalently, minus the weight of the edge being added).
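By analogy, an unoptimised sketch of Algorithm 4 reads as follows (reusing the components helper from the divisive sketch above; note that F must be computable for partitions containing singletons):

def agglomerative_mst(X, mst_edges, k, F):
    # Algorithm 4: greedily add n-k MST edges, each time maximising F over the result.
    n, pool, used = len(X), list(mst_edges), []
    for _ in range(n - k):
        best = max(range(len(pool)),
                   key=lambda i: F(X, components(n, used + [pool[i]])))
        used.append(pool.pop(best))
    return components(n, used)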
Unfortunately, many cluster validity measures are not only inherently slow
to compute, but they also might not be well-defined for singleton clusters (and
this is the starting point of the agglomerative algorithm). Due to the already
large number of procedures in our study, we will consider the agglomerative
maximisation of only the aforementioned information criterion, leading to the
algorithm which we denote as IcA in Table 1. Its implementation is available
in (Gagolewski, 2021).
2.3 Variations on the agglomerative scheme
Genie (Gagolewski, Bartoszuk, & Cena, 2016) is an example of a variation on
the agglomerative single linkage theme, where we greedily optimise the total
edge lengths, but under the constraint that if the Gini index of the cluster
sizes⁵ grows above a given threshold $g$, only the smallest clusters can take part
in the merging.
Algorithm 5 (Genie). Given $g \in (0, 1]$:
1. Let $T = \mathrm{MST}(G) = (V, E', W')$;
2. Let $E'' = \emptyset$;
3. For $i = 1, \dots, n-k$ do:
   (a) If the Gini index of the sizes of clusters in $\mathcal{C}(V, E'')$ is below $g$, pick
   $\{u, v\} \in E' \setminus E''$ as the edge with the smallest weight (equivalently, such that
   the sum of weights of edges in $E' \setminus (E'' \cup \{\{u, v\}\})$ is the largest);
   (b) Otherwise, pick $\{u, v\} \in E' \setminus E''$ as the edge with the smallest weight
   provided that the size of the connected component containing $u$ (or $v$) is
   the smallest of them all;
   (c) Add $\{u, v\}$ to $E''$;
4. Return $\mathcal{C}(V, E'')$ as a result.
Here, we will rely on the implementation of Genie included in the genieclust
package for Python (Gagolewski, 2021). Given a precomputed MST, the
procedure runs in $O(n \sqrt{n})$ time.
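For example, assuming genieclust is installed (the data file path below is purely illustrative):

import numpy as np
import genieclust

X = np.loadtxt("dataset.txt")   # an (n, d) data matrix
labels = genieclust.Genie(
    n_clusters=3,               # the requested number of clusters, k
    gini_threshold=0.3          # the threshold g; 0.3 is the recommended default
).fit_predict(X)
print(np.bincount(labels))      # cluster sizes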
The algorithm depends on the threshold parameter $g$. In this study, we
will only compare the results obtained for $g \in \{0.1, 0.3, 0.5, 0.7\}$ (for a comprehensive
treatment of the sensitivity analysis of Genie's parameters, see
(Gagolewski, Cena, & Bartoszuk, 2016)).
In (Gagolewski, Bartoszuk, & Cena, 2016), the use of $g = 0.3$ is recommended.
Cena in (2018) noted that Genie gives very good results, but
sometimes other thresholds might work better than the default one. She thus
proposed an agglomerative scheme optimising the information criterion, which
does not start from a set of $n$ singletons, but from the intersection of the clusters
obtained by multiple runs of Genie.
⁵ Let $(c_1, \dots, c_l)$ be a sequence such that $c_i$ denotes the cardinality of the $i$-th cluster in a given
$l$-partition. The Gini index is given by $G(c_1, \dots, c_l) = \sum_{i=1}^{l} (l - 2i + 1)\, c_{(i)} \big/ \left( (l-1) \sum_{i=1}^{l} c_i \right) \in [0, 1]$,
where $c_{(i)}$ denotes the $i$-th greatest value. It is a measure of inequality of the cluster sizes.
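A direct implementation of the above measure of cluster size inequality (as reconstructed in this footnote; it assumes at least two clusters, and the function name is ours):

import numpy as np

def gini_index(counts):
    # Gini index of cluster sizes: 0 for perfectly balanced sizes, close to 1 when one cluster dominates.
    c = np.sort(np.asarray(counts, dtype=float))[::-1]   # c_(1) >= c_(2) >= ... >= c_(l)
    l = len(c)
    i = np.arange(1, l + 1)
    return float(np.sum((l - 2 * i + 1) * c) / ((l - 1) * np.sum(c)))

# e.g., gini_index([100, 100, 100]) == 0.0 and gini_index([298, 1, 1]) == 0.99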
We have implemented an extended version of this algorithm in the
genieclust (Gagolewski, 2021) package. Namely, what we denote with
Genie+Ic (k+l) in Table 1 is a variation of Algorithm 4 that starts at
$E'' = E' \setminus (E_{0.1} \cup E_{0.3} \cup E_{0.5} \cup E_{0.7})$, where $E_g$ is the final $E''$ from the run of
Algorithm 5 seeking $k+l$ clusters using a given threshold $g$ (i.e., an intersection
of possibly more fine-grained clusterings returned by the Genie algorithm
with different parameters). We shall only consider $l \in \{0, 5, 10\}$, as we observed
that other choices of $g$ and $l$ led to similar results.
2.4 Other methods
Other MST-based methods that we consider in this study⁶ include:
HEMST (Grygorash et al.,2006) deletes edges from the MST to achieve
the best possible edge weights’ standard deviation reduction;
CTCEHC (Ma et al.,2021) constructs a preliminary partition based on
the vertex degrees and then merges clusters based on the geodesic distances
between the cluster centroids.
There are a few other MST-based methods in the literature, but usually
they do not result in a given-in-advance number of clusters, $k$ (which we require
for benchmarking purposes, as described in the next section). For instance,
Zahn in (Zahn, 1971) constructs an MST and deletes "inconsistent" edges
(ones whose weights are significantly larger than the average weight of the nearby
edges), but the number thereof cannot be easily controlled.
We also do not include the methods whose search space is not solely
based on the information from MSTs (e.g., (González-Barrios & Quiroz, 2003;
Karypis, Han, & Kumar, 1999; Mishra & Mohanty, 2019; Zhong, Miao, &
Fränti, 2011; Zhong, Miao, & Wang, 2010)), those which construct the MST based
on transformed distances (Campello, Moulavi, Zimek, & Sander, 2015; Chaudhuri
& Dasgupta, 2010), or those which use an MST for very different purposes, such
as auxiliary density estimation (e.g., (Peter, 2013)) or the refinement thereof (e.g.,
(Wang, Wang, & Wilkes, 2009)).
We also do not include a few methods which we found so poorly described
that we could not implement them ourselves.
⁶ Their Python implementation is available at https://github.com/lukaszbrzozowski/msts.
Table 2 Benchmark datasets studied (part I; see (Gagolewski, 2022b); database
(Gagolewski et al., 2022) v.1.1.0). Exclamation marks denote "difficult" labellings:
!!! means that the maximal obtained AAA (Eq. (1)) was < 0.5, !! means max AAA < 0.8,
and ! means max AAA < 0.95. Asterisks mark cases where the performance of MST-based
methods is subpar (*: maximal AAA for MST relative to the maximal overall AAA was < 0.95).
Also, e.g., 2×3 means that there are three reference label vectors with k = 2.
battery dataset n d ks
1 FCPS atom 800 3 2
2 chainlink 1000 3 2
3 engytime 4096 2 2×2
4 hepta 212 3 7
5 lsun 400 2 3
6 target 770 2 2, 6
7 tetra 400 3 4
8 twodiamonds 800 2 2
9 wingnut 1016 2 2
10 Graves dense 200 2 2
11 fuzzyx 1000 2 2×3, 4, 5
12 line 250 2 2
13 parabolic 1000 2 2!, 4!
14 ring 1000 2 2
15 ring_noisy 1050 2 2
16 ring_outliers 1030 2 2, 5
17 zigzag 250 2 3, 5
18 zigzag_noisy 300 2 3, 5
19 zigzag_outliers 280 2 3, 5
20 Other chameleon_t4_8k 8000 2 6
21 chameleon_t5_8k 8000 2 6
22 chameleon_t8_8k 8000 2 8
23 hdbscan 2309 2 6
24 iris 150 4 3
25 square 1000 2 2
26 SIPU a1 3000 2 20
27 a2 5250 2 35
28 a3 7500 2 50
29 aggregation 788 2 7
30 compound 399 2 4×2, 5×2, 6!
31 d31 3100 2 31
32 flame 240 2 2×2
33 jain 373 2 2
34 pathbased 300 2 3, 4
Table 3 Benchmark datasets studied (part II)
battery dataset n d ks
35 SIPU r15 600 2 8, 9, 15
36 s1 5000 2 15
37 s2 5000 2 15
38 s3 5000 2 15!
39 s4 5000 2 15!!
40 spiral 312 2 3
41 unbalance 6500 2 8
42 UCI ecoli 336 7 8!!
43 ionosphere 351 33 2!!
44 sonar 208 60 2!!!
45 statlog 2310 18 7!!
46 wdbc 569 30 2!
47 wine 178 13 3*!
48 yeast 1484 8 10!!!
49 WUT circles 4000 2 4
50 cross 2000 2 4
51 graph 2500 2 10*!
52 isolation 9000 2 3
53 labirynth 3546 2 6
54 mk1 300 2 3
55 mk2 1000 2 2
56 mk3 600 3 3!
57 mk4 1500 3 3
58 olympic 5000 2 5*
59 smile 1000 2 4, 6
60 stripes 5000 2 2
61 trapped_lovers 5000 3 3
62 twosplashes 400 2 2!
63 windows 2977 2 5
64 x1 120 2 3
65 x3 185 2 3, 4
66 z1 192 2 3*
67 z2 900 2 5
68 z3 1000 2 4
3 Experiments
3.1 Clustering datasets, reference labels, and assessing
the similarity thereto
We test the discussed methods against the benchmark suite for clustering
algorithms introduced in (Gagolewski,2022b). We use version 1.1.0 of the
open-access database (Gagolewski et al.,2022) (which features datasets dis-
cussed in, amongst others, (Bezdek, Keller, Krishnapuram, Kuncheva, & Pal,
1999;Dua & Graff,2021;Fränti & Sieranoja,2018;Fränti & Virmajoki,2006;
Graves & Pedrycz,2010;Jain & Law,2005;Karypis et al.,1999;McInnes,
Healy, & Astels,2017;Rezaei & Fränti,2016;Sieranoja & Fränti,2019;Thrun
& Stier, 2021; Thrun & Ultsch, 2020; Ultsch, 2005)). We have taken into
account all the datasets with n < 10,000 except UCI/glass, WUT/x2, and
Other/iris5, for which some of the 25-near-neighbour graphs' connected components
were too small, causing some of the algorithms to fail (e.g., MST/D_WCNN_25
and MST/D_DuNN_25_Min_Max).
This gives 68 datasets⁷ in total; see Tables 2 and 3.
Each dataset comes with one or more reference label vectors created by
experts. Each of them defines a specific number of clusters, k.
We run each algorithm in a purely unsupervised manner: they are only
given the data matrix $X \in \mathbb{R}^{n \times d}$ and $k$ on input, not the true labels.
To enable a fair comparison (ceteris paribus), no kind of data preprocess-
ing (e.g., standardisation of variables, removal of noise points, etc.) is applied.
However, let us note that the spectral method and Gaussian mixtures can be
thought of as algorithms that have some built-in feature engineering capabili-
ties. In other cases, the methods are asked to rely only on the “raw” Euclidean
distance.
As a measure of clustering quality, we consider the adjusted asymmetric
accuracy (AAA; (Gagolewski, 2022a)) given by:

$\mathrm{AAA}(\mathbf{C}) = \dfrac{ \max_{\sigma} \frac{1}{k} \sum_{i=1}^{k} c_{i, \sigma(i)} / c_{i, \cdot} \; - \; \frac{1}{k} }{ 1 - \frac{1}{k} } = 1 - \min_{\sigma} \frac{1}{k} \sum_{i=1}^{k} \dfrac{ c_{i,1} + \cdots + c_{i,k} - c_{i, \sigma(i)} }{ \frac{k-1}{k} \left( c_{i,1} + \cdots + c_{i,k} \right) }, \qquad (1)$

where $\sigma$ ranges over all permutations of $\{1, \dots, k\}$ and the confusion matrix $\mathbf{C}$
is such that $c_{i,j}$ denotes the number of points in the $i$-th reference cluster that
a given algorithm assigned to the $j$-th cluster. AAA is a measure of the overall
percentage of correctly classified points in each cluster (one minus the average
classification error) that uses the optimal matching of cluster labels between
the partitions (just like PSI (Rezaei & Fränti, 2016), which is additionally
symmetric and hence less interpretable). It is corrected for chance and for
cluster size imbalancedness.

⁷ Note that the website of the clustering-benchmarks project (Gagolewski, 2022b) features an
interactive datasets explorer; see https://clustering-benchmarks.gagolewski.com.
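Equation (1) translates directly into a few lines of Python, with the optimal permutation found by the Hungarian algorithm (a sketch for illustration; the function name is ours and the confusion matrix is assumed to be square, k × k):

import numpy as np
from scipy.optimize import linear_sum_assignment

def adjusted_asymmetric_accuracy(C):
    # C[i, j]: number of points of the i-th reference cluster assigned to the j-th predicted cluster.
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    R = C / C.sum(axis=1, keepdims=True)    # row-normalised "recall" matrix
    rows, cols = linear_sum_assignment(-R)  # permutation sigma maximising the matched entries
    return (R[rows, cols].mean() - 1.0 / k) / (1.0 - 1.0 / k)

# e.g., a perfect clustering up to a relabelling of the clusters:
print(adjusted_asymmetric_accuracy([[0, 50], [30, 0]]))   # 1.0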
The total number of unique reference labels was 89. Let us note that some
label vectors might define the same number of clusters k. Thus, only 83 unique
partitions needed to be generated and in the case of tied ks, the maximal AAA
was considered. This is in line with the recommendation from (Gagolewski,
2022b), where it was noted that there could be many equally valid partitions
and the algorithm should be rewarded for finding any of them (note that unlike
in (Gagolewski et al.,2021), we consider the maximum over datasets and ks,
not just datasets); see also (Dasgupta & Ng,2009;Luxburg,2012) for further
discussion.
Also, following the aforementioned guidelines, if a reference partitioning
marks some points as noise, the actual way they are allocated to particular
clusters by the clustering methods studied is irrelevant (they are omitted when
computing the confusion matrix).
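In code, this simply amounts to discarding the reference noise points before building the confusion matrix (assuming here, as a convention of this sketch, that noise points are marked with label 0 in the reference vector):

import numpy as np

def confusion_matrix_ignoring_noise(y_ref, y_pred, noise_label=0):
    # Confusion matrix with reference noise points (label == noise_label) omitted.
    y_ref, y_pred = np.asarray(y_ref), np.asarray(y_pred)
    keep = y_ref != noise_label
    ref_ids, pred_ids = np.unique(y_ref[keep]), np.unique(y_pred[keep])
    C = np.zeros((len(ref_ids), len(pred_ids)), dtype=int)
    for i, r in enumerate(ref_ids):
        for j, p in enumerate(pred_ids):
            C[i, j] = np.sum(keep & (y_ref == r) & (y_pred == p))
    return C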
3.2 Some benchmark cases are difficult for all the
methods
Overall, 68/83 ≈ 82% of cases can be considered "easy" for at least one of
the methods (maximal AAA ≥ 0.95). In other words, for each of them, there
exists an approach that reproduces the reference partition relatively well.
On the other hand, 6 benchmark cases turned out to be very "difficult" for all
of the methods studied (AAA < 0.80). We marked them with two and three
exclamation marks in Tables 2 and 3.
The said sextet includes most datasets that we sourced from the UCI
repository, which are all high-dimensional, and it is hard to verify if the ref-
erence clusters are meaningful. Originally, these datasets were suggested for
benchmarking classification, not clustering problems.
This might mean that there is something wrong with these reference label
vectors themselves (and not the algorithms tested; e.g., the clusters are overlapping),
or that some further data preprocessing must be applied in order to
reveal the cluster structure (this is, e.g., the case for the WUT/twosplashes
dataset, which normally requires the features to be standardised beforehand;
here, we got a maximal AAA of 0.86).
Therefore, we exclude these 6 datasets from further analysis, as it does not
make sense to compare an algorithm against what is potentially noise.
The topmost box-and-whisker in Figure 3 ("Max All" on the left-hand side)
depicts the distribution of the highest observed cluster validity scores across
all the remaining 77 benchmark cases.
3.3 Are MST-based methods any good?
Recall that the number of possible partitions of an MST with $n-1$ edges into
$k$ subtrees is equal to $\binom{n-1}{k-1} = (n-1)(n-2)\cdots(n-k+1)/(k-1)!$. For all datasets with
$k = 2, 3, 4$, and those with $n \leq 2500$ for $k = 5$, we were able to identify
the true maximum of AAA easily using the brute-force approach (considering
all the possible partitions of the MST). The remaining cases were too time-consuming
to examine exhaustively. Therefore, we applied a tabu-like steepest
ascent search strategy with at least 10 random restarts to find a lower bound
for the maximum (similarly as in (Gagolewski et al., 2021)).

Fig. 3 The distribution of the adjusted asymmetric accuracies across the 77 benchmark
cases (absolute AAA on the left and AAA relative to "Max All" on the right-hand side). "Max
Obs." gives the maximal observed AAA based on the outputs of all the 140 methods, and their
counterparts for MST and non-MST algorithms only are denoted with "Max Obs. MST"
and "Max Obs. Non-MST". "Max MST" gives the theoretically achievable maxima of the
accuracy scores for the MST-based methods. Moreover, "Max All" is the maximum of "Max
MST" and "Max Obs.". Apart from a few "hard" datasets, the MST-based methods are
potentially very competitive, despite their simplicity. They can be improved further by
appropriate feature engineering.
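The brute-force search mentioned above can be sketched as follows (reusing the components, confusion_matrix_ignoring_noise, and adjusted_asymmetric_accuracy helpers defined in the earlier sketches; with $\binom{n-1}{k-1}$ candidate cut sets, this is feasible only for small k):

from itertools import combinations

def max_aaa_over_mst(mst_edges, y_ref, k):
    # Best AAA achievable by removing k-1 edges from the MST (reference labels: 1..k, 0 = noise).
    n, best = len(y_ref), float("-inf")
    for removed in combinations(range(len(mst_edges)), k - 1):
        kept = [e for i, e in enumerate(mst_edges) if i not in removed]
        labels = components(n, kept) + 1            # connected components as clusters 1..k
        C = confusion_matrix_ignoring_noise(y_ref, labels)
        if C.shape == (k, k):                       # skip degenerate cases
            best = max(best, adjusted_asymmetric_accuracy(C))
    return best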
Studying the “Max MST” box-and-whisker on the righthand side of
Figure 3, which denotes these theoretically achievable maxima of AAA (a
hypothetical “oracle” MST-based algorithm), we note that for only 4/77 '5%
datasets, the minimum spanning tree (with respect to the Euclidean distance
between unpreprocessed points) is not a good representation of the feature
space. Namely, the accuracy scores relative to “Max All” is significantly smaller
than 0.95. We marked them with asterisks in Tables 2and 3(WUT/olympic,
WUT/z1, UCI/wine, and WUT/graph).
In terms of absolute AAA for “Max MST” 3/77 '4% and 12/77 '16%
cases gave scores <0.8and <0.95, respectively.
On the other hand, 6 cases turned out difficult for the non-MST methods
(relative “Max Obs. Non-MST” AAA less than 0.95). This includes Graves/-
parabolic, SIPU/pathbased with k= 3 and k= 4, SIPU/Compound for k= 6,
WUT/cross, and Other/chameleon_t8_8k. Still, they can be successfully
tackled with MSTs.
3.4 Which MST-based algorithm then?
The above observation does not mean that we are in possession of an algorithm
that gets the most out of the information conveyed by the minimum spanning
trees, nor that a single strategy is always best.
We should thus inspect which strategies and/or objective functions are
more useful than others.
Figure 4 depicts the adjusted accuracies relative to "Max MST" for each
method, i.e., how well each algorithm compares to the best possible solution.
We note that the agglomerative Genie (Gagolewski, 2021; Gagolewski,
Bartoszuk, & Cena, 2016) algorithm outperforms the other approaches. The
agglomerative and divisive approaches optimising the information criterion
(Genie+Ic, IcA (Cena, 2018), ITM (Müller et al., 2012)) also give a high average
relative AAA, and the ones optimising the new near-neighbour-based criteria
(DuNN_25_Min_Max, WCNN_25, etc.) yield high median relative scores.
As far as other “standalone” algorithms are concerned, HEMST and Single
linkage exhibit inferior performance, and CTCEHC is comparable with the
divisive Caliński–Harabasz criterion optimiser.
Quite strikingly, some well-established internal cluster validity measures
promote clusterings that agree very poorly with the reference labels
(Davies–Bouldin, SilhouetteW, some generalised Dunn indices). This is in line
with our observation in (Gagolewski et al., 2021), where we performed a similar
study over the space of all possible partitionings. This puts their actual
meaningfulness into question: are they really good indicators of clustering
quality?
3.5 How do MST-based methods compare against other clustering approaches?
Figure 5 compares the MST and non-MST approaches in terms of absolute AAA.
As far as the current (large) benchmark battery is concerned, the MST-
based methods outperform the popular “parametric” approaches (Gaussian
Mixtures, K-means) and other algorithms (Birch, Ward, Average, Com-
plete linkage, and spectral clustering with the best-identified parameters)
implemented in the scikit-learn package (Pedregosa et al.,2011) for Python.
We also notice that choosing the wrong objective function to optimise over
an MST can lead to very poor results. This is particularly the case if the
Davies–Bouldin and SilhouetteW indices are considered.
Fig. 4 The distribution of the adjusted asymmetric accuracies for different MST-based
algorithms relative to the “Max MST” AAA score. The agglomerative Genie (Gagolewski,
2021;Gagolewski, Bartoszuk, & Cena,2016) and the information criterion-based methods
(Genie+Ic, IcA (Cena,2018;Gagolewski,2021), ITM (Müller et al.,2012)) outperform
other approaches. Also, the new divisive near-neighbour-based schemes give a high median
performance. We also note that many well-established cluster validity measures provide poor
guidance for the selection of an informative partitioning.
Fig. 5 The distribution of the adjusted asymmetric accuracies for different algorithms. The
MST-based algorithms Genie (Gagolewski,2021;Gagolewski, Bartoszuk, & Cena,2016),
Genie+Ic (Cena,2018;Gagolewski,2021), and ITM (Müller et al.,2012) outperform other
methods. However, we also see that an invalid objective function to be optimised over an
MST can lead to meaningless clusterings.
4 Conclusion
Apart from a few “difficult” label vectors, the minimum spanning tree-
based methods have been shown to be potentially very competitive clustering
approaches. Furthermore, they can be improved by appropriate feature engi-
neering (scaling of data columns, noise point and outlier removal, modifying
the distance matrix, etc.; see, e.g., (Campello et al.,2015;Yin & Liu,2009)).
They are quite simple and easy to compute: once the minimum spanning
tree is determined (which takes up to $O(n^2)$ time, but approximate methods
exist as well; e.g., (Naidan et al., 2019)), we can potentially get a whole
hierarchy of clusters of any cardinality. For instance, our top performer, the
Genie algorithm as implemented in (Gagolewski, 2021), needs $O(n \sqrt{n})$ time to
generate all possible partitions given a prebuilt MST. Unlike, e.g., the well-known
k-means algorithm, which is fast only for small fixed $k$s, this property makes
them suitable for solving extreme clustering tasks (compare (Kobren, Monath,
Krishnamurthy, & McCallum, 2017)).
Just like in our previous contribution (Gagolewski et al.,2021) (where
we tried to find an optimal clustering over the whole space of all possible
partitions), we note that many internal cluster validity indices actually pro-
mote clusterings that agree poorly with the reference ones. This puts their
validity/meaningfulness into question.
Overall, no single best MST-based method probably exists, but there is still
some room for improvement, and thus the development of new algorithms is
encouraged. In particular, the new divisive and agglomerative approaches we
have proposed in this paper perform well on certain dataset types. Therefore,
it might be promising to explore the many possible combinations of parame-
ters/objective functions we have left out due to the obvious space constraints
in this paper.
Future work should involve the testing of clustering methods based on near-
neighbour graphs and more complex MST-inspired data structures (compare
(Fränti, Virmajoki, & Hautamäki,2006;González-Barrios & Quiroz,2003;
Karypis et al.,1999;Zhong et al.,2011,2010)).
It would also be interesting to inspect the stability of the results when
different random subsets of benchmark data are selected or study the problem
of overlapping clusters (e.g., (Campagner, Ciucci, & Denœux,2023)). Also, the
application of the MST-based algorithms could be examined in the problem
of community detection in graphs (e.g., (Gerald, Zaatiti, Hajri, et al.,2023)).
Finally, let us recall that we have only focused on methods that guarantee
to return a fixed-in-advance number of clusters k. In the future, it would be
interesting to allow for the relaxation of this constraint.
Acknowledgements
This research was supported by the Australian Research Council Discovery
Project ARC DP210100227 (MG).
CRediT author statement
MG: Conceptualisation, Methodology, Data Curation, Software, Visualisation,
Investigation, Formal analysis, Writing Original Draft
AC: Methodology, Data Curation, Investigation
MB: Software, Data Curation, Investigation
ŁB: Software, Investigation
Data Availability
All benchmark data are publicly available from https://clustering-benchmarks.gagolewski.com
(Gagolewski, 2022b). In particular, a snapshot of the test
battery (Gagolewski et al., 2022) can be fetched from
https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0.
All computed partitions can be downloaded from
https://github.com/gagolews/clustering-results-v1/.
Conflict of interest
All authors certify that they have no affiliations with or involvement in any
organisation or entity with any financial interest or non-financial interest in
the subject matter or materials discussed in this manuscript.
References
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I. (2013).
An extensive comparative study of cluster validity indices. Pattern
Recognition,46 (1), 243–256. DOI 10.1016/j.patcog.2012.07.021
Ball, G., & Hall, D. (1965). ISODATA: A novel method of data analysis and
pattern classification (Tech. Rep. No. AD699616). Stanford Research
Institute.
Bezdek, J., Keller, J., Krishnapuram, R., Kuncheva, L., Pal, N. (1999). Will
the real Iris data please stand up? IEEE Transactions on Fuzzy Systems,
7(3), 368–369. DOI 10.1109/91.771092
Bezdek, J., & Pal, N. (1998). Some new indexes of cluster validity. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics),
28 (3), 301–315. DOI 10.1109/3477.678624
Borůvka, O. (1926). O jistém problému minimálním. Práce Moravské
Přírodovědecké Společnosti v Brně,3, 37–58.
Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster
analysis. Communications in Statistics,3(1), 1–27. DOI 10.1080/
03610927408827101
Campagner, A., Ciucci, D., Denœux, T. (2023). A general framework for
evaluating and comparing soft clusterings. Information Sciences,623 ,
70–93. DOI 10.1016/j.ins.2022.11.114
Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J. (2015). Hierarchical
density estimates for data clustering, visualization, and outlier detection.
ACM Transactions on Knowledge Discovery from Data,10 (1), 5:1–5:51.
DOI 10.1145/2733381
Cena, A. (2018). Adaptive hierarchical clustering algorithms based on
data aggregation methods (Unpublished doctoral dissertation). Systems
Research Institute, Polish Academy of Sciences. (In Polish)
Chang, H., & Yeung, D. (2008). Robust path-based spectral clustering.
Pattern Recognition,41 (1), 191–203.
Chaudhuri, K., & Dasgupta, S. (2010). Rates of convergence for the cluster
tree. Advances in neural information processing systems (pp. 343–351).
Dasgupta, S., & Ng, V. (2009). Single data, multiple clusterings. Proc. nips
workshop clustering: Science or art? towards principled approaches.
Davies, D.L., & Bouldin, D.W. (1979). A cluster separation measure.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
PAMI–1 (2), 224–227. DOI 10.1109/TPAMI.1979.4766909
Dua, D., & Graff, C. (2021). UCI Machine Learning Repository. Irvine, CA.
(http://archive.ics.uci.edu/ml)
Dunn, J. (1974). A fuzzy relative of the ISODATA process and its use in
detecting compact well-separated clusters. Journal of Cybernetics,3(3),
32–57. DOI 10.1080/01969727308546046
Edwards, A.W.F., & Cavalli-Sforza, L.L. (1965). A method for cluster analysis.
Biometrics,21 (2), 362–375. DOI 10.2307/2528096
Eggels, A., & Crommelin, D. (2019). Quantifying data dependencies with
Rényi mutual information and minimum spanning trees. Entropy,21 (2).
DOI 10.3390/e21020100
Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H., Zubrzycki, S. (1951).
Sur la liaison et la division des points d’un ensemble fini. Colloquium
Mathematicum,2, 282–285.
Fränti, P., & Sieranoja, S. (2018). K-means properties on six clustering
benchmark datasets. Applied Intelligence,48 (12), 4743–4759.
Fränti, P., & Virmajoki, O. (2006). Iterative shrinking method for clustering
problems. Pattern Recognition,39 (5), 761-765.
Fränti, P., Virmajoki, O., Hautamäki, V. (2006). Fast agglomerative cluster-
ing using a k-nearest neighbor graph. IEEE Transactions on Pattern
Analysis and Machine Intelligence,28 (11).
Gagolewski, M. (2021). genieclust: Fast and robust hierarchical clustering.
SoftwareX,15 , 100722. DOI 10.1016/j.softx.2021.100722
Gagolewski, M. (2022a). Adjusted asymmetric accuracy: A well-behaving
external cluster validity measure. arXiv. (under review (preprint)) DOI
10.48550/arXiv.2209.02935
Gagolewski, M. (2022b). A framework for benchmarking clustering algo-
rithms. SoftwareX,20 , 101270. Retrieved from https://clustering-
benchmarks.gagolewski.com DOI 10.1016/j.softx.2022.101270
Gagolewski, M., Bartoszuk, M., Cena, A. (2016). Genie: A new, fast, and
outlier-resistant hierarchical clustering algorithm. Information Sciences,
363 , 8–23.
Gagolewski, M., Bartoszuk, M., Cena, A. (2021). Are cluster validity measures
(in)valid? Information Sciences,581 , 620–636. DOI 10.1016/j.ins.2021
.10.004
Gagolewski, M., Cena, A., Bartoszuk, M. (2016). Hierarchical cluster-
ing via penalty-based aggregation and the Genie approach. V. Torra,
Y. Narukowa, G. Navarro-Arribas, & C. Yanez (Eds.), Modeling deci-
sions for artificial intelligence (lecture notes in artificial intelligence
9880) (pp. 191–202). Springer.
Gagolewski, M., et al. (2022). A benchmark suite for clustering algorithms:
Version 1.1.0. Retrieved from https://github.com/gagolews/clustering-
data-v1/releases/tag/v1.1.0 DOI 10.5281/zenodo.7088171
Gerald, T., Zaatiti, H., Hajri, H., et al. (2023). A hyperbolic approach for
learning communities on graphs. Data Mining and Knowledge Discovery.
DOI 10.1007/s10618-022-00902-8
González-Barrios, J.M., & Quiroz, A.J. (2003). A clustering procedure based
on the comparison between the k nearest neighbors graph and the min-
imal spanning tree. Statistics & Probability Letters,62 , 23–34. DOI
10.1016/S0167-7152(02)00421-2
Gower, J.C., & Ross, G.J.S. (1969). Minimum spanning trees and single
linkage cluster analysis. Journal of the Royal Statistical Society. Series
C (Applied Statistics),18 (1), 54–64.
Graham, R., & Hell, P. (1985). On the history of the minimum spanning tree
problem. Annals of the History of Computing,7(1), 43–57.
Graves, D., & Pedrycz, W. (2010). Kernel-based fuzzy clustering: A
comparative experimental study. Fuzzy Sets and Systems,161 , 522–543.
Grygorash, O., Zhou, Y., Jorgensen, Z. (2006). Minimum spanning tree based
clustering algorithms. Proc. ictai’06 (pp. 1–9).
Guo, X., Yang, Z., Li, C., Xiong, H., Ma, C. (2023). Combining the clas-
sic vulnerability index and affinity propagation clustering algorithm to
assess the intrinsic aquifer vulnerability of coastal aquifers on an inte-
grated scale. Environmental Research,217 , 114877. DOI 10.1016/
j.envres.2022.114877
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001). On clustering validation
techniques. Journal of Intelligent Information Systems, 107–145. DOI
10.1023/A:1012801612483
Hero III, A.O., & Michel, O. (1998). Robust entropy estimation strategies
based on edge weighted random graphs. A. Mohammad-Djafari (Ed.),
Bayesian inference for inverse problems (Vol. 3459, pp. 250–261). SPIE.
DOI 10.1117/12.323804
Horta, D., & Campello, R. (2015). Comparing hard and overlapping
clusterings. Journal of Machine Learning Research,16 (93), 2949–2997.
Hwang, Y.-C., Ahn, H.-Y., Jun, J.E., Jeong, I.-K., Ahn, K.J., Chung, H.Y.
(2023). Subtypes of type 2 diabetes and their association with outcomes
in korean adults - a cluster analysis of community-based prospective
cohort. Metabolism,141 , 155514. DOI 10.1016/j.metabol.2023.155514
Jain, A., & Law, M. (2005). Data clustering: A user’s dilemma. Lecture Notes
in Computer Science,3776 , 1–10.
Jarník, V. (1930). O jistém problému minimálním (z dopisu panu
O. Borůvkovi). Práce Moravské Přírodovědecké Společnosti v Brně,6,
57–63.
Jaskowiak, P., Costa, I., Campello, R. (2022). The area under the ROC
curve as a measure of clustering quality. Data Mining and Knowledge
Discovery,36 , 1219–1245. DOI 10.1007/s10618-022-00829-0
Karypis, G., Han, E., Kumar, V. (1999). CHAMELEON: Hierarchical
clustering using dynamic modeling. Computer ,32 (8), 68–75. DOI
10.1109/2.781637
Kobren, A., Monath, N., Krishnamurthy, A., McCallum, A. (2017). A hierar-
chical algorithm for extreme clustering. Proc. 23rd acm sigkdd’17 (pp.
255–264). DOI 10.1145/3097983.3098079
Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the
traveling salesman problem. Proceedings of the American Mathematical
Society,7, 48–50.
Lloyd, S. (1957 (1982)). Least squares quantization in PCM. IEEE Trans-
actions on Information Theory,28 , 128–137. (Originally a 1957 Bell
Telephone Laboratories Research Report; republished in 1982) DOI
10.1109/TIT.1982.1056489
Ma, Y., Lin, H., Wang, Y., Huang, H., He, X. (2021). A multi-stage hierarchi-
cal clustering algorithm based on centroid of tree and cut edge constraint.
Information Sciences,557 , 194–219. DOI 10.1016/j.ins.2020.12.016
March, W.B., Ram, P., Gray, A.G. (2010). Fast Euclidean minimum spanning
tree: Algorithm, analysis, and applications. Proceedings of the 16th acm
sigkdd international conference on knowledge discovery and data mining
(pp. 603–612). ACM.
Maulik, U., & Bandyopadhyay, S. (2002). Performance evaluation of some
clustering algorithms and validity indices. IEEE Transactions on Pattern
Analysis and Machine Intelligence,24 (12), 1650-1654. DOI 10.1109/
TPAMI.2002.1114856
McInnes, L., Healy, J., Astels, S. (2017). hdbscan: Hierarchical density based
clustering. The Journal of Open Source Software,2(11), 205. DOI
10.21105/joss.00205
Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for
determining the number of clusters in a data set. Psychometrika,50 (2),
159–179.
Mishra, G., & Mohanty, S.K. (2019). A fast hybrid clustering technique based
on local nearest neighbor using minimum spanning tree. Expert Systems
with Applications,132 , 28–43. DOI 10.1016/j.eswa.2019.04.048
Müller, A., Nowozin, S., Lampert, C. (2012). Information theoretic clustering
using minimum spanning trees. Proc. german conference on pattern
recognition. (https://github.com/amueller/information-theoretic-mst)
Müllner, D. (2011). Modern hierarchical, agglomerative clustering algorithms.
ArXiv:1109.2378 [stat.ML]. (http://arxiv.org/abs/1109.2378)
Naidan, B., Boytsov, L., Malkov, Y., Novak, D. (2019).
Non-metric space library (NMSLIB) manual, version 2.0 [Computer software manual].
Retrieved from https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf
Olson, C.F. (1995). Parallel algorithms for hierarchical clustering. Parallel
Computing,21 , 1313–1325.
Pál, D., Póczos, B., Szepesvári, C. (2010). Estimation of rényi entropy
and mutual information based on generalized nearest-neighbor graphs.
Advances in Neural Information Processing Systems,23 .
Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal
of Machine Learning Research,12 , 2825–2830.
Peter, S. (2013). Local density-based hierarchical clustering using mini-
mum spanning tree. Journal of Discrete Mathematical Sciences and
Cryptography,16 . DOI 10.1080/09720529.2013.778471
Prim, R.C. (1957). Shortest connection networks and some generalizations.
Bell System Technical Journal,36 (6), 1389–1401. DOI 10.1002/j.1538
-7305.1957.tb01515.x
Rezaei, M., & Fränti, P. (2016). Set matching measures for external cluster
validity. IEEE Transactions on Knowledge and Data Engineering,28 (8),
2173–2186. DOI 10.1109/TKDE.2016.2551240
Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis. Journal of Computational and Applied
Mathematics,20 , 53–65. DOI 10.1016/0377-0427(87)90125-7
Sieranoja, S., & Fränti, P. (2019). Fast and general density peaks clustering.
Pattern Recognition Letters,128 , 551–558. DOI 10.1016/j.patrec.2019
.10.019
Sneath, P. (1957). The application of computers to taxonomy. Journal of
General Microbiology,17 (1), 201–226. DOI 10.1099/00221287-17-1-201
Thrun, M., & Stier, Q. (2021). Fundamental clustering algorithms suite.
SoftwareX,13 , 100642. DOI 10.1016/j.softx.2020.100642
Thrun, M., & Ultsch, A. (2020). Clustering benchmark datasets exploiting
the fundamental clustering problems. Data in Brief ,30 , 105501. DOI
10.1016/j.dib.2020.105501
Ultsch, A. (2005). Clustering with SOM: U*C. Workshop on self-organizing
maps (pp. 75–82). WSOM 2005.
von Luxburg, U., Williamson, R., Guyon, I. (2012). Clustering: Science or
art? I. Guyon et al. (Eds.), Proc. icml workshop on unsupervised and
transfer learning (Vol. 27, pp. 65–79).
Wagner, S., & Wagner, D. (2006). Comparing clusterings An overview (Tech.
Rep. No. 2006-04). Faculty of Informatics, Universität Karlsruhe (TH).
Wang, X., Wang, X., Wilkes, D.M. (2009). A divide-and-conquer approach
for minimum spanning tree-based clustering. IEEE Transations on
Knowledge and Data Engineering ,21 (7), 945–958.
Ward Jr., J.H. (1963). Hierarchical grouping to optimize an objective function.
Journal of the American Statistical Association,58 (301), 236–244. DOI
10.1080/01621459.1963.10500845
Wierzchoń, S.T., & Kłopotek, M.A. (2018). Modern algorithms of cluster
analysis. Springer.
Xu, Q., Zhang, Q., Liu, J., Luo, B. (2020). Efficient synthetical cluster-
ing validity indexes for hierarchical clustering. Expert Systems with
Applications,151 , 113367. DOI 10.1016/j.eswa.2020.113367
Xu, Y., Olman, V., Xu, D. (2002). Clustering gene expression data using a
graph-theoretic approach: An application of minimum spanning trees.
Bioinformatics,18 (2), 536–545.
Yin, F., & Liu, C.-L. (2009). Handwritten Chinese text line segmentation by
clustering with distance metric learning. Pattern Recognition,42 (12),
3146–3157. DOI 10.1016/j.patcog.2008.12.013
Zahn, C. (1971). Graph-theoretical methods for detecting and describing
gestalt clusters. IEEE Transactions on Computers,C-20 (1), 68–86.
Zhang, T., Ramakrishnan, R., Livny, M. (1996). BIRCH: An efficient data
clustering method for large databases. Proc. acm sigmod international
conference on management of data sigmod ’96 (pp. 103–114).
Zhao, W., Ma, J., Liu, Q., Song, J., Tysklind, M., Liu, C., . . . Wu, F.
(2023). Comparison and application of sofm, fuzzy c-means and k-
means clustering algorithms for natural soil environment regionalization
in china. Environmental Research,216 , 114519. DOI 10.1016/j.envres
.2022.114519
Zhong, C., Malinen, M., Miao, D., Fränti, P. (2015). A fast minimum spanning
tree algorithm based on k-means. Information Sciences,205 , 1–17. DOI
10.1016/j.ins.2014.10.012
Zhong, C., Miao, D., Fränti, P. (2011). Minimum spanning tree based split-
and-merge: A hierarchical clustering method. Information Sciences,181 ,
3397–3410. DOI 10.1016/j.ins.2011.04.013
Zhong, C., Miao, D., Wang, R. (2010). A graph-theoretical clustering method
based on two rounds of minimum spanning trees. Pattern Recognition,
43 (3), 752–766. DOI 10.1016/j.patcog.2009.07.010
Zhou, H., Bai, J., Wang, Y., Ren, J., Yang, X., Jiao, L. (2023). Deep
radio signal clustering with interpretability analysis based on saliency
map. Digital Communications and Networks. DOI 10.1016/j.dcan.2023
.01.010
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at https://clustering-benchmarks.gagolewski.com.
Preprint
Full-text available
There is no, nor will there ever be, single best clustering algorithm, but we would still like to be able to distinguish between methods which work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet, their validity is questionable, because the clusterings they promote can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to the reference, ground truth groupings that are provided by experts. In this paper, we argue that the commonly-used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties, e.g., they do not identify worst-case scenarios correctly or are not easily interpretable. This makes comparing clustering algorithms across many benchmark datasets difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic, scale invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
Article
Full-text available
The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering challenges that any algorithm should be able to handle given real-world data. The FCPS consists of datasets with known a priori classifications that are to be reproduced by the algorithm. The datasets are intentionally created to be visualized in two or three dimensions under the hypothesis that objects can be grouped unambiguously by the human eye. Each dataset represents a certain problem that can be solved by known clustering algorithms with varying success. In the R package “Fundamental Clustering Problems Suite” on CRAN, user-defined sample sizes can be drawn for the FCPS. Additionally, the distances of two high-dimensional datasets called Leukemia and Tetragonula are provided here. This collection is useful for investigating the shortcomings of clustering algorithms and the limitations of dimensionality reduction methods in the case of three-dimensional or higher datasets. This article is a simultaneous co-submission with Swarm Intelligence for Self-Organized Clustering [1].
This paper has two contributions. First, we introduce a basic clustering benchmark. Second, we study the performance of k-means using this benchmark. Specifically, we measure how the performance depends on four factors: (1) overlap of clusters, (2) number of clusters, (3) dimensionality, and (4) unbalance of cluster sizes. The results show that overlap is critical, and that k-means starts to work effectively when the overlap reaches the 4% level.
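A controlled experiment of this kind can be mocked up in a few lines; the sketch below assumes scikit-learn, and treats the standard deviation of synthetic Gaussian blobs as a crude proxy for overlap, which is not how the cited study defines or measures it.

```python
# Illustrative sketch: probing the sensitivity of k-means to cluster overlap.
# Cluster standard deviation serves here as a rough proxy for overlap.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

k, n, d = 8, 2000, 2
for cluster_std in [0.5, 1.0, 2.0, 4.0]:
    X, y = make_blobs(n_samples=n, centers=k, n_features=d,
                      cluster_std=cluster_std, random_state=0)
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"std={cluster_std:>4}: ARI={adjusted_rand_score(y, pred):.3f}")
```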
Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K, a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations to enhance subtree purity and encourage balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of the online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher-quality clustering than the strongest flat clustering competitor in nearly half the time.
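The basic routing-and-splitting step of such incremental tree builders can be sketched as follows; this is a deliberately naive greedy variant for illustration only, without PERCH's tree rotations, purity guarantees, or bounding-box acceleration, and with an exhaustive nearest-leaf search.

```python
# Deliberately naive sketch of online hierarchical clustering by nearest-leaf routing.
# Each arriving point is routed to the closest existing leaf, which is then split into
# two children. The rotations, purity guarantees, and approximations of PERCH are NOT
# implemented; the leaf search here is exhaustive.
import numpy as np

class Node:
    def __init__(self, point=None, left=None, right=None):
        self.point = point                  # data point stored at a leaf
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def nearest_leaf(node, x):
    """Return the leaf whose stored point is closest to x (exhaustive search)."""
    if node.is_leaf():
        return node
    left, right = nearest_leaf(node.left, x), nearest_leaf(node.right, x)
    return left if np.linalg.norm(left.point - x) <= np.linalg.norm(right.point - x) else right

def insert(root, x):
    """Route x to its nearest leaf and split that leaf into an internal node."""
    if root is None:
        return Node(point=x)
    leaf = nearest_leaf(root, x)
    leaf.left, leaf.right = Node(point=leaf.point), Node(point=x)  # leaf becomes internal
    return root

def n_leaves(node):
    return 1 if node.is_leaf() else n_leaves(node.left) + n_leaves(node.right)

rng = np.random.default_rng(0)
root = None
for x in rng.normal(size=(20, 2)):          # points arriving one at a time
    root = insert(root, x)
print(n_leaves(root))                        # 20: one leaf per inserted point
```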
In the northern plains of Laizhou City, groundwater quality suffers dual threats from anthropogenic activities: seawater intrusion caused by the overextraction of fresh groundwater, and the vertical infiltration of agricultural pollutants. Groundwater management therefore requires a comprehensive analysis of both horizontal and vertical pollution in coastal aquifers. In this paper, Intrinsic Aquifer Vulnerability (IAV) was assessed on an integrated scale using two classic IAV models (DRASTIC and GALDIT) separately, based on a GIS database. The hydrogeological parameters of the two models were clustered using the affinity propagation (AP) algorithm, and silhouette coefficients were used to determine the optimal classification. In our application, the objects of the AP algorithm are 3320 units obtained by dividing the whole study area into a 500 m × 500 m grid. Among the four AP-DRASTIC outputs compared, the 4-class partition yielded the best silhouette coefficient (0.406). Cluster 4, which comprises 21% of the area, exhibited a relatively low level of groundwater contamination despite its high vulnerability according to the classic DRASTIC index. In Cluster 3, the second vulnerability level, 53.8% of all water samples were found to be contaminated, indicating a greater level of nitrate pollution. For AP-GALDIT, the 7-class partition reached the highest silhouette coefficient (0.343). Clusters 2, 4, and 5 (34.7% of the study area) were identified as highly vulnerable with respect to the classic GALDIT index, and the chloride concentration in all water samples obtained in these areas was extremely high. Groundwater management should thus be guided by the AP-DRASTIC results with respect to controlling anthropogenic activity and contamination, and by the AP-GALDIT results with respect to limiting groundwater extraction. Overall, this method allows IAV to be evaluated on an integrated scale in other coastal areas, facilitating the development of groundwater management strategies based on a better understanding of the aquifer's essential characteristics.
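The model-selection step used there, clustering with affinity propagation and keeping the partition with the best silhouette coefficient, can be sketched as follows; scikit-learn is assumed, synthetic data stand in for the gridded hydrogeological parameters, and sweeping the `preference` parameter is only one simple way to obtain candidate partitions, so this is an outline of the idea rather than the study's actual pipeline.

```python
# Illustrative sketch: affinity propagation with the partition selected by the
# silhouette coefficient; synthetic data replace the gridded hydrogeological inputs.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=5, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)

best = None
# The `preference` parameter controls how many exemplars (clusters) emerge; scoring
# each resulting partition mimics the "pick the best classification" step.
for pref in [-200, -100, -50, -20, -10]:
    labels = AffinityPropagation(preference=pref, random_state=0).fit_predict(X)
    k = len(np.unique(labels))
    if k < 2:
        continue  # degenerate or non-converged solution: silhouette is undefined
    sil = silhouette_score(X, labels)
    print(f"preference={pref:>5}: {k:>2} clusters, silhouette={sil:.3f}")
    if best is None or sil > best[0]:
        best = (sil, k, pref)

print("selected (silhouette, clusters, preference):", best)
```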
Soil attributes and their environmental drivers exhibit different patterns in different geographical directions, along with distinct regional characteristics, which may have important effects on the migration and transformation of substances such as organic matter and soil elements, as well as on the environmental impacts of pollutants. Therefore, regional soil characteristics should be considered in the process of regionalization for environmental management. However, no comprehensive evaluation or systematic classification of the natural soil environment has been established for China. Here, we established an index system for natural soil environmental regionalization (NSER) by combining literature data obtained through bibliometrics with the analytic hierarchy process (AHP). Based on the index system, we collected spatial distribution data for 14 indexes at the national scale. In addition, three clustering algorithms, self-organizing feature mapping (SOFM), fuzzy c-means (FCM), and k-means (KM), were used to classify and define the natural soil environment. Four cluster validity indexes (CVIs) were used to evaluate the different models: the Davies-Bouldin index (DB), the Silhouette index (Sil), and the Calinski-Harabasz index (CH) for FCM and KM, and the clustering quality index (CQI) for SOFM. Analysis and comparison of the results showed that with 13 clusters, the FCM algorithm achieved the best clustering results (DB = 1.16, Sil = 0.78, CH = 6.77 × 10⁶), allowing the natural soil environment of China to be divided into 12 regions with distinct characteristics. Our study provides a set of comprehensive scientific research methods for regionalization research based on spatial data and offers an important reference for improving soil environmental management according to local conditions in China.
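The validation step, scoring candidate partitions of different sizes with several internal indexes, is straightforward to reproduce in outline; the sketch below assumes scikit-learn, uses only the three indexes available there (the SOFM-specific CQI is omitted), and lets k-means stand in for all three algorithms.

```python
# Illustrative sketch: scoring k-means partitions with several internal cluster
# validity indexes, as in the model comparison described above.
# Lower Davies-Bouldin is better; higher Silhouette and Calinski-Harabasz are better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=1000, centers=6, n_features=5, random_state=1)

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(f"k={k:>2}  DB={davies_bouldin_score(X, labels):6.3f}  "
          f"Sil={silhouette_score(X, labels):5.3f}  "
          f"CH={calinski_harabasz_score(X, labels):10.1f}")
```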
This book provides the reader with a basic understanding of the formal concepts of cluster, clustering, partition, cluster analysis, etc. It explains feature-based, graph-based, and spectral clustering methods and discusses their formal similarities and differences. Understanding the related formal concepts is particularly vital in the era of Big Data; due to the volume and characteristics of the data, it is no longer feasible to rely predominantly on merely viewing the data when facing a clustering problem. Usually, clustering involves choosing similar objects and grouping them together. To facilitate the choice of similarity measures for complex and big data, various measures of object similarity are described, based on quantitative features (such as numerical measurement results) and qualitative features (such as text), as well as combinations of the two, together with graph-based similarity measures for (hyper)linked objects and measures for multilayered graphs. Numerous variants demonstrating how such similarity measures can be exploited when defining clustering cost functions are also presented. In addition, the book provides an overview of approaches to handling large collections of objects in a reasonable time. In particular, it addresses grid-based methods, sampling methods, parallelization via MapReduce, the usage of tree structures, random projections, and various heuristic approaches, especially those used for community detection.