ORIGINAL ARTICLE
A deep embedded clustering technique using dip test and unique neighbourhood set

Md Anisur Rahman¹ · Li-minn Ang² · Yuan Sun¹ · Kah Phooi Seng³

Received: 18 March 2024 / Accepted: 27 September 2024 / Published online: 20 November 2024
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024
Abstract
In recent years, there has been a growing interest in deep learning-based clustering. A recently introduced technique called DipDECK has shown effective performance on large and high-dimensional datasets. DipDECK uses Hartigan's dip test, a statistical test of unimodality, to merge small non-viable clusters; notably, it was the first deep learning-based clustering technique to incorporate the dip test. However, DipDECK deliberately overestimates the initial number of clusters and then selects the initial seeds at random before producing the final clusters for a dataset. In this paper, we therefore present a technique called UNSDipDECK, an improved version of DipDECK that requires no user input for datasets with an unknown number of clusters. UNSDipDECK produces high-quality initial seeds and the initial number of clusters through a deterministic process, using the unique closest neighbourhood and unique neighbourhood set approaches to determine high-quality initial seeds for a dataset. In our study, we compared the performance of UNSDipDECK with fifteen baseline clustering techniques, including DipDECK, using the NMI and ARI metrics. The experimental results indicate that UNSDipDECK outperforms the baseline techniques, including DipDECK. Additionally, we demonstrate that the initial seed selection process contributes significantly to UNSDipDECK's ability to produce high-quality clusters.
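To make the role of the dip test concrete, the sketch below shows one way dip-test-driven merging can work: project each pair of clusters onto the line joining their centres and test the 1-D projection for unimodality (DipDECK applies this style of test in its learned embedded space). This is a minimal illustration under stated assumptions, not the authors' implementation; `dip_pvalue` is a stand-in for any dip-test routine (e.g. `lambda x: diptest.diptest(x)[1]` with the third-party `diptest` package), and the 0.9 threshold is illustrative.

```python
import numpy as np
from itertools import combinations

def merge_most_unimodal_pair(X, labels, dip_pvalue, threshold=0.9):
    """Merge the cluster pair whose 1-D projection looks most unimodal.

    dip_pvalue: callable mapping a 1-D sample to a dip-test p-value.
    A high p-value means no evidence of two modes, i.e. the two
    clusters may really be one and are candidates for merging.
    """
    best_pair, best_p = None, -1.0
    for a, b in combinations(np.unique(labels), 2):
        pts = X[(labels == a) | (labels == b)]
        ca = X[labels == a].mean(axis=0)
        cb = X[labels == b].mean(axis=0)
        direction = cb - ca
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue
        proj = pts @ (direction / norm)  # 1-D projection along the centre line
        p = dip_pvalue(proj)
        if p > best_p:
            best_pair, best_p = (a, b), p
    if best_pair is not None and best_p > threshold:
        a, b = best_pair
        labels = labels.copy()
        labels[labels == b] = a          # merge cluster b into cluster a
    return labels, best_pair, best_p
```

In DipDECK this merge decision alternates with retraining the embedding network; the sketch covers only the merge step itself.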
Keywords Unsupervised learning · Deep learning · Dip test · Clustering · Deep clustering · Cluster evaluation · Curse of dimensionality
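The preview does not spell out how the unique closest neighbourhood and unique neighbourhood set are defined, so the following is only a hypothetical sketch of a deterministic seed-selection step in that spirit: pair each point with its nearest neighbour, keep the mutual (unique) closest pairs, and use each pair's midpoint as an initial seed. The function name and the mutual-nearest-neighbour criterion are assumptions for illustration, not the authors' definitions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def deterministic_seeds(X):
    """Hypothetical deterministic seed selection: keep mutual
    nearest-neighbour pairs and seed each with the pair midpoint.
    No random sampling; the result depends only on the data."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbour
    nn = D.argmin(axis=1)                # nearest neighbour of each point
    seeds = []
    for i, j in enumerate(nn):
        if nn[j] == i and i < j:         # mutual pair, counted once
            seeds.append(X[[i, j]].mean(axis=0))
    return np.asarray(seeds)
```

Seeds produced this way would typically still be numerous, which is consistent with the merge-downwards strategy described in the abstract: start from an overestimated cluster count and let the dip test merge non-viable clusters.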
1 Introduction
Pattern extraction from unlabeled data is an important data mining task, and unsupervised learning, such as clustering, is commonly used for it. Clustering partitions the records of a dataset into groups such that records within a group are similar while records in different groups are dissimilar. However, one of the limitations of clustering is that the number of actual clusters in a dataset is very often not known. Many existing techniques have tried to address this problem. Dip-means [1] and X-means [2], which are based on k-means, also try to address the issue of an unknown number of clusters. PG-means [3] is an expectation-maximisation-based clustering technique that can detect clusters of different shapes. Dip-means, X-means, and PG-means all assume a Gaussian distribution, so they may not work well on datasets with arbitrarily shaped clusters. DBSCAN [4] is a density-based clustering technique that works well on datasets with arbitrarily shaped clusters.
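As a concrete illustration of the density-based alternative just mentioned, the minimal scikit-learn example below clusters two interleaved half-moons, a shape that Gaussian-based methods handle poorly; DBSCAN infers the number of clusters from the data at the cost of two density parameters. The eps and min_samples values here are illustrative, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-Gaussian, arbitrarily shaped clusters.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# eps and min_samples set the density threshold; no cluster count is given.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))  # -1 marks noise points
```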
Correspondence: Md Anisur Rahman, anisur.rahman@latrobe.edu.au
Li-minn Ang: lang@usc.edu.au
Yuan Sun: yuan.sun@latrobe.edu.au
Kah Phooi Seng: jasmine.seng@xjtlu.edu.cn

¹ La Trobe Business School, La Trobe University, Plenty Road, Bundoora, VIC 3086, Australia
² School of Science, Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD 4556, Australia
³ School of AI & Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China
Neural Computing and Applications (2025) 37:1345–1356
https://doi.org/10.1007/s00521-024-10497-4