ORIGINAL ARTICLE
A deep embedded clustering technique using dip test and unique neighbourhood set

Md Anisur Rahman¹ · Li-minn Ang² · Yuan Sun¹ · Kah Phooi Seng³
Received: 18 March 2024 / Accepted: 27 September 2024 / Published online: 20 November 2024
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024
Abstract
In recent years, there has been growing interest in deep learning-based clustering. A recently introduced technique called DipDECK has shown effective performance on large, high-dimensional datasets. DipDECK uses Hartigan's dip test, a statistical test for unimodality, to merge small non-viable clusters; notably, it was the first deep learning-based clustering technique to incorporate the dip test. However, DipDECK overestimates the initial number of clusters and then randomly selects the initial seeds used to produce the final clusters for a dataset. In this paper, we therefore present UNSDipDECK, an improved version of DipDECK that requires no user input for datasets with an unknown number of clusters. UNSDipDECK produces high-quality initial seeds and the initial number of clusters through a deterministic process, using the unique closest neighbourhood and unique neighbourhood set approaches to determine high-quality initial seeds for a dataset. In our study, we compared the performance of UNSDipDECK with fifteen baseline clustering techniques, including DipDECK, using the NMI and ARI metrics. The experimental results indicate that UNSDipDECK outperforms the baseline techniques, including DipDECK. Additionally, we demonstrate that the initial seed selection process contributes significantly to UNSDipDECK's ability to produce high-quality clusters.
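For reference, the ARI metric mentioned above can be sketched in plain Python. The function name and the degenerate-case handling are my own choices, not from the paper; production code would normally call a library implementation such as scikit-learn's `adjusted_rand_score`:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings (pure-Python sketch)."""
    n = len(labels_true)
    # Contingency counts: (true label, predicted label) -> count.
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # row marginals
    b = Counter(labels_pred)   # column marginals

    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster per side
        return 1.0
    return (index - expected) / (max_index - expected)

# Identical partitions score 1.0; the label names themselves do not matter.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

ARI is chance-corrected, so a partition no better than random labelling scores near zero (and can go negative), which is why it is a common companion to NMI in clustering comparisons.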
Keywords Unsupervised learning · Deep learning · Dip test · Clustering · Deep clustering · Cluster evaluation · Curse of dimensionality
1 Introduction
Pattern extraction from unlabeled data is an important data
mining task. Unsupervised learning, such as clustering, is
commonly used for pattern extraction from unlabeled data.
Clustering partitions the records of a dataset into groups
where the records in a group are similar, whereas the
records in different groups are dissimilar. However, one of
the limitations of clustering is that very often the number of
actual clusters existing in a dataset is not known. Many
existing techniques have tried to address this problem. Dip-means [1] and X-means [2], both built on k-means, also attempt to estimate the unknown number of clusters.
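For context, the k-means core that methods such as Dip-means and X-means extend can be sketched minimally as follows. The function, the toy data, and all names here are illustrative only, not taken from the paper:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means on 2-D points (illustrative sketch)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # naive init: k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centres[c][0]) ** 2
                                            + (p[1] - centres[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centres, clusters

# Two well-separated toy groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = kmeans(pts, 2)
```

Note that k must be supplied up front; X-means and Dip-means wrap a core like this with statistical tests that split or merge clusters to estimate k automatically.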
PG-means [3] is an expectation maximisation-based clus-
tering technique that can detect clusters in different shapes.
Dip-means, X-means, and PG-means all assume Gaussian-distributed clusters and may therefore not work well on datasets with arbitrarily shaped clusters. DBSCAN [4] is a density-based clustering technique that works well on
✉ Md Anisur Rahman
anisur.rahman@latrobe.edu.au

Li-minn Ang
lang@usc.edu.au

Yuan Sun
yuan.sun@latrobe.edu.au

Kah Phooi Seng
jasmine.seng@xjtlu.edu.cn
1 La Trobe Business School, La Trobe University, Plenty Road, Bundoora, VIC 3086, Australia
2 School of Science, Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD 4556, Australia
3 School of AI & Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China
Neural Computing and Applications (2025) 37:1345–1356
https://doi.org/10.1007/s00521-024-10497-4