Source publication
The cluster evaluation process is of great importance in the areas of machine learning and data mining. Evaluating the quality of clusters shows how competent a proposed approach or algorithm is. Nevertheless, evaluating the quality of any cluster remains an open issue. Although many cluster validity indices have been proposed, there is a...
Similar publications
The development of UAV (unmanned aerial vehicle) technology provides an ideal data source for extracting information on surface cracks, offering efficient, fast, and easy access to surface damage in mining areas. Understanding how to effectively assess the degree of development of surface cracks is a prerequisite for the reasonable...
Citations
... By constructing a Delaunay triangulation, we obtain correlation information between data points. Using the Validity Index for Arbitrary-Shaped Clusters based on Kernel Density Estimation (VIASCKDE) [19] as an internal index, we optimize connectivity among data points until the relevant conditions are met, allowing the algorithm to perform clustering without any dependence on manual input. Experimental results show that the proposed algorithm outperforms similar algorithms on both synthetic and real-world datasets. ...
Clustering is a fundamental tool in data mining, widely used in fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density; this parameter varies across datasets, requires manual tuning, and affects the algorithm's performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making it dependent on human intervention. To address these issues, we propose an adaptive DPC method that automatically adjusts parameters such as the cutoff distance and the number of clusters based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
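As background for the density-peak machinery this abstract builds on, here is a minimal sketch of classic DPC's two decision-graph quantities: the local density rho under a cutoff distance d_c, and delta, the distance to the nearest denser point. The Delaunay-based adaptive selection of d_c and of the centers described above is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dpc_decision_values(X, d_c):
    """Return (rho, delta), the quantities plotted in DPC's decision graph."""
    D = cdist(X, X)                      # pairwise distances
    rho = (D < d_c).sum(axis=1) - 1      # neighbours within the cutoff (minus self)
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]
        # distance to the nearest point of higher density;
        # the globally densest point gets the maximum distance instead
        delta[i] = D[i, denser].min() if len(denser) else D[i].max()
    return rho, delta

# Points with both high rho and high delta are candidate cluster centres.
```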
... (3) a large number of clusters, for instance the A3 dataset, which has several clusters; (4) overlapping clusters, for instance S2, which has overlapping samples across clusters. The synthetic datasets used in this study have been widely used as benchmarks for density-based clustering techniques to demonstrate the identification of clusters of arbitrary shapes, sizes, and densities [15,20,48–50]. ...
... The real-world benchmark datasets used were IRIS, Heart Disease, Seeds and Wine, obtained from the UCI Machine Learning repository [51]; they cover different domains, such as biology, healthcare and agriculture, ensuring a balanced and extensive evaluation of the clustering method. While the synthetic datasets, all two-dimensional, represent the inherent complexity in cluster structures, the real-world datasets have increased dimensionality and cluster overlap [15,31,48–50,52,53]. The inherent noise and outliers in the real-world datasets are common characteristics reflecting the realistic scenarios the proposed method is designed to handle. ...
The task of finding natural groupings within a dataset by exploiting the proximity of samples is known as clustering, an unsupervised learning approach. Density-based clustering algorithms, which identify arbitrarily shaped clusters using spatial dimensions and neighbourhood aspects, are sensitive to the selection of parameters. For instance, DENsity CLUstEring (DENCLUE), a density-based clustering algorithm, requires a trial-and-error approach to find suitable parameters for optimal clusters. Earlier attempts to automate the parameter estimation of DENCLUE have depended heavily either on the choice of a prior data distribution (which could vary across datasets) or on fixing one parameter (which might not be optimal) and learning the others. This article addresses this challenge by learning the parameters of DENCLUE through the differential evolution optimisation technique without prior data distribution assumptions. Experimental evaluation of the proposed approach demonstrated consistent performance across synthetic and real datasets containing clusters of arbitrary shapes. The clustering performance was evaluated using clustering validation metrics (e.g., Silhouette Score, Davies-Bouldin Index and Adjusted Rand Index) as well as qualitative visual analysis, compared with other density-based clustering algorithms, such as DPC based on weighted local density sequences and nearest neighbour assignments (DPCSA) and variable-KDE-based DENCLUE (VDENCLUE).
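Since DENCLUE has no implementation in common Python libraries, the sketch below illustrates only the optimisation loop this abstract describes, with scikit-learn's DBSCAN standing in as the density-based algorithm and the Silhouette Score as the internal objective. The bounds and objective here are illustrative assumptions, not the authors' setup.

```python
from scipy.optimize import differential_evolution
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def tune_density_clustering(X):
    def neg_silhouette(params):
        eps, min_samples = params[0], int(round(params[1]))
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        k = len(set(labels))
        if not 2 <= k <= len(X) - 1:         # degenerate partition: penalise
            return 1.0
        return -silhouette_score(X, labels)  # minimise the negative index
    # bounds for (eps, min_samples) are illustrative and data-dependent
    result = differential_evolution(neg_silhouette,
                                    bounds=[(0.05, 2.0), (2, 20)],
                                    seed=0, maxiter=30)
    return result.x                          # learned (eps, min_samples)
```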
... For clustering various types of data, numerous clustering algorithms have been developed. Among these, K-means is particularly notable for its widespread adoption and effectiveness in efficiently managing diverse datasets [10,11]. However, a critical challenge in implementing K-means is the selection of the Optimal Number of Clusters (ONC), which significantly influences the quality of the clustering outcomes [12]. ...
Unsupervised learning, particularly K-means clustering, seeks to partition data into clusters with distinct intra-class cohesion and inter-class disparity. However, the arbitrary selection of clusters in K-means introduces challenges, leading to trial and error in determining the Optimal Number of Clusters (ONC). To address this, various methodologies have been devised, among which the Gap Statistic is prominent. The Gap Statistic's reliance on expected values for reference data selection poses limitations, especially in scenarios involving diverse scale, noise, and overlapping data. To tackle these challenges, this study introduces the Enhanced Gap Statistic (EGS), which standardizes reference data using an exponential distribution within the Gap Statistic framework and integrates an adjustment factor for a more dependable estimation of the ONC. Applying EGS to K-means clustering facilitates accurate ONC determination. For comparison purposes, EGS is benchmarked against the traditional Gap Statistic and other established methods used for ONC selection in K-means, evaluating accuracy and efficiency across datasets with varying characteristics. The results demonstrate EGS's superior accuracy and efficiency, affirming its effectiveness in diverse data environments.
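For orientation, the following is a compact sketch of the classical Gap Statistic (Tibshirani et al.) that EGS modifies. It uses the textbook uniform reference distribution and a simplified argmax selection rule; the exponential reference and adjustment factor the abstract proposes are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=10, random_state=0):
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)    # bounding box of the data

    def log_wk(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log(km.inertia_)           # within-cluster dispersion

    gaps = []
    for k in range(1, k_max + 1):
        # expected dispersion under uniform reference data of the same shape
        ref = np.mean([log_wk(rng.uniform(lo, hi, X.shape), k)
                       for _ in range(n_refs)])
        gaps.append(ref - log_wk(X, k))
    # simplified rule: take the k with the largest gap
    return int(np.argmax(gaps)) + 1
```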
... This approach calculates the density and separation values of a cluster based on the distances between data points, independent of parameters such as the cluster center. It successfully handles situations where the distance of a point to its nearest neighbor is more critical than its distance to the cluster center, especially in non-spherical clusters [64]. ...
... DI (Dunn Index) is an internal clustering validation metric used to evaluate the performance of clustering algorithms and determine an accurate clustering structure. This index provides a validity criterion based on geometric calculations of the intrinsic compactness of each cluster and the separation between clusters [64]. DI measures intra-cluster similarity while also taking the inter-cluster distance into account. ...
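A direct (O(n^2)) rendering of the Dunn Index as just described: the minimum inter-cluster distance divided by the maximum cluster diameter, with higher values indicating better-separated, more compact clusters.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    # assumes at least two clusters and at least one non-singleton cluster
    clusters = [X[labels == c] for c in np.unique(labels)]
    # largest intra-cluster distance (diameter) over all clusters
    max_diam = max(cdist(c, c).max() for c in clusters)
    # smallest distance between points of different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam            # higher is better
```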
In machine learning, a large number of features that are irrelevant or only weakly relevant to the outcome can reduce classification success and run-time performance. For this reason, feature selection or reduction methods are widely used. The aim is to eliminate irrelevant features or to transform the features into a smaller set of new features that are relevant to the outcome. However, in some cases feature reduction methods are not sufficient on their own to increase success. In this study, we propose a new hybrid feature projection model to increase the classification performance of classifiers. To this end, the MCMSTClustering algorithm is used in the data preprocessing stage of classification together with various feature projection methods (PCA, LDA, SVD, t-SNE, NCA, Isomap, and PR) to improve sleep disorder diagnosis. To determine the best parameters of the MCMSTClustering algorithm, we used the VIASCKDE Index, Dunn Index, Silhouette Index, Adjusted Rand Index, and Accuracy as cluster quality evaluation methods. To evaluate the performance of the proposed model, we first appended the class labels produced by MCMSTClustering to the dataset as a new feature, applied the selected feature projection methods to decrease the number of features, and then ran the kNN algorithm on the dataset. Finally, we compared the obtained results. To reveal the efficiency of the proposed model, we tested it on a sleep disorder diagnosis dataset and compared it with two baselines: pure kNN, and kNN with the same feature projection methods but without clustering. According to the experimental results, the proposed method with Kernel PCA as the feature projection method was the best model, with a classification accuracy of 0.9627. In addition, MCMSTClustering increases the performance of PCA, Kernel PCA, SVD, t-SNE, and PR, whereas the performance of LDA, NCA, and Isomap remains the same.
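The pipeline shape the abstract describes can be sketched as follows. MCMSTClustering has no reference implementation in common libraries, so scikit-learn's KMeans stands in here purely to illustrate the three stages: append unsupervised cluster labels as a feature, project, then classify with kNN.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def hybrid_projection_knn(X, y, n_clusters=5, n_components=4):
    # 1) append unsupervised cluster labels to the data as an extra feature
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    X_aug = np.column_stack([X, labels])
    # 2) project the augmented features to a lower dimension
    X_proj = PCA(n_components=n_components).fit_transform(X_aug)
    # 3) classify in the projected space and report held-out accuracy
    X_tr, X_te, y_tr, y_te = train_test_split(X_proj, y, random_state=0)
    return KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
```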
... Let $X = [X_1, \dots, X_n]$ be an $n$-dimensional vector following a multivariate Gaussian distribution with $n$-dimensional mean vector $\mu$ and $n \times n$ covariance matrix $\Sigma$. The multivariate kernel density is then given by Equation (3) [28]. As a nonparametric method, kernel density estimation tries to estimate where any newly arriving data point is located relative to the existing data. ...
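The textbook multivariate Gaussian kernel the excerpt refers to can be evaluated directly; the sketch below averages Gaussian kernels centred on the samples, assuming a shared covariance matrix Sigma (the exact form of the paper's Equation (3) is not reproduced here).

```python
import numpy as np

def gaussian_kde_at(x, data, Sigma):
    """Average Gaussian kernel (mean = each sample, covariance = Sigma) at x."""
    n, d = data.shape
    inv, det = np.linalg.inv(Sigma), np.linalg.det(Sigma)
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(det))
    diffs = x - data                              # shape (n, d)
    # quadratic form (x - x_i)^T Sigma^{-1} (x - x_i) for every sample
    expo = -0.5 * np.einsum('ij,jk,ik->i', diffs, inv, diffs)
    return norm * np.exp(expo).mean()
```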
K-means is the best-known clustering algorithm because of its simplicity of use, speed, and efficiency. However, the resulting clusters are influenced by the randomly selected initial centroids, and many techniques have been proposed to address this issue. In this paper, a new version of the k-means clustering algorithm named ImpKmeans for short (An Improved Version of the K-Means Algorithm by Determining Optimum Initial Centroids Based on Multivariate Kernel Density Estimation and Kd-tree) is proposed; it uses kernel density estimation to find the optimum initial centroids. Kernel density estimation is used because it is a nonparametric distribution estimation method that can identify density regions. To understand the efficiency of ImpKmeans, we compared it with some state-of-the-art algorithms. According to the experimental studies, the proposed algorithm was better than the compared versions of k-means: while ImpKmeans was the most successful algorithm in 46 of 60 tests, the second-best algorithm was the best in 34. Moreover, the experimental results indicated that ImpKmeans is fast compared to the selected k-means versions.
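One plausible reading of density-guided seeding, loudly hedged: this is not the paper's exact procedure (which also uses a Kd-tree), only a sketch of the general idea of ranking points by a KDE score and seeding k-means with dense, mutually separated points. The separation heuristic is an assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans

def kde_seeded_kmeans(X, k):
    density = gaussian_kde(X.T)(X.T)             # KDE score of every sample
    order = np.argsort(density)[::-1]            # densest points first
    min_gap = np.ptp(X, axis=0).max() / (2 * k)  # heuristic separation radius
    seeds = []
    for i in order:                              # greedy, well-separated seeds
        if all(np.linalg.norm(X[i] - s) > min_gap for s in seeds):
            seeds.append(X[i])
        if len(seeds) == k:
            break
    for i in order:                              # fallback if separation was too strict
        if len(seeds) == k:
            break
        if not any(np.array_equal(X[i], s) for s in seeds):
            seeds.append(X[i])
    return KMeans(n_clusters=k, init=np.array(seeds), n_init=1).fit(X)
```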
... The cluster validity index (CVI) evaluates the goodness of a clustering algorithm by considering the information in the data themselves [106–115]. CVI is a mathematically justifiable function which can be either maximised or minimised. ...
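The typical use of a CVI as just described is to score candidate partitions and keep the one that extremises the index. A minimal sketch, using scikit-learn's Calinski-Harabasz score (maximised) as the example index:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def best_k_by_cvi(X, k_range=range(2, 11)):
    # score each candidate partition with the index, keep the maximiser
    scores = {k: calinski_harabasz_score(
                     X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)
```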
In real-world scenarios, identifying the optimal number of clusters in a dataset is a difficult task due to insufficient prior knowledge, which has led some researchers to regard sophisticated automatic clustering algorithms as indispensable. Several automatic clustering algorithms assisted by quantum-inspired metaheuristics have been developed in recent years. However, the literature lacks definitive documentation of the state-of-the-art quantum-inspired metaheuristic algorithms for automatically clustering datasets. This article presents a brief overview of the automatic clustering process to establish the importance of making the clustering process automatic. The fundamental concepts of the quantum computing paradigm are also presented to highlight the utility of quantum-inspired algorithms. The article then thoroughly analyses algorithms employed to address the automatic clustering of various datasets, classified according to their main sources of inspiration, with representative works chosen from the existing literature for each class. Thirty-six prominent algorithms were further critically analysed based on their aims, mechanisms, data specifications, merits, and demerits. Comparative results based on performance and optimal computational time are also presented. As such, this article promises to provide a detailed analysis of the state-of-the-art quantum-inspired metaheuristic algorithms while highlighting their merits and demerits.
... Face, Aggregation, Outliers, Thyroid, Crescent Full Moon, and Cure-t1-2000n, the datasets used in the experimental study, are known as imbalanced datasets [55]. ...
... Another problem in the clustering area is handling imbalanced datasets. Face, Aggregation, Outliers, Thyroid, Crescent Full Moon, and Cure-t1-2000n are known as imbalanced datasets [55]. Among the compared algorithms, MCMSTClustering handled imbalanced datasets best. ...
Clustering is a technique for statistical data analysis and is widely used in many areas where class labels are not available. Major problems for clustering algorithms include handling high-dimensional, imbalanced, and/or varying-density datasets, detecting outliers, and defining arbitrarily shaped clusters. In this study, we propose a novel clustering algorithm named MCMSTClustering (Defining Non-Spherical Clusters by using Minimum Spanning Tree over KD-Tree-based Micro-Clusters) to overcome these issues simultaneously. Our algorithm consists of three parts: the first defines micro-clusters using the KD-Tree data structure with range search; the second constructs macro-clusters by applying a minimum spanning tree (MST) to the defined micro-clusters; and the final part regulates the defined clusters to increase the accuracy of the algorithm. To demonstrate its efficiency, we performed experimental studies comparing it with some state-of-the-art algorithms. The findings are presented in detail with tables and graphs. The success of the proposed algorithm was confirmed using various performance evaluation criteria. According to the experimental studies, MCMSTClustering outperformed competitor algorithms in clustering quality within acceptable run-time. Moreover, the results showed that the algorithm can be applied effectively to many different clustering problems in the literature.
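A structural sketch of the first two stages the abstract names, under stated assumptions (the greedy micro-cluster rule, the edge-cut threshold, and the omission of the final regulation stage are all simplifications, not the authors' exact rules): KD-tree range search forms micro-clusters, and an MST over their centroids is cut at long edges to merge them into macro-clusters.

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mcmst_like(X, r, edge_cut):
    tree, owner = KDTree(X), np.full(len(X), -1)
    centroids = []
    for i in range(len(X)):                       # greedy micro-clusters
        if owner[i] == -1:
            members = [j for j in tree.query_ball_point(X[i], r)
                       if owner[j] == -1]
            owner[members] = len(centroids)
            centroids.append(X[members].mean(axis=0))
    # MST over micro-cluster centroids; drop edges longer than edge_cut
    mst = minimum_spanning_tree(squareform(pdist(np.array(centroids)))).toarray()
    mst[mst > edge_cut] = 0
    _, macro = connected_components(mst, directed=False)
    return macro[owner]                           # macro-cluster label per sample
```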
... Such as DBCV [18], Xie-Beni (XB) [27], CDbw [10], S Dbw [9], and RMSSTD [8]. Besides, new cluster validity indices keep emerging, such as the CVNN [16], CVDD [11], DSI [6], SCV [28], AWCD [14] and VIASCKDE [23]. ...
A new model called Clustering with Neural Network and Index (CNNI) is introduced. CNNI uses a neural network to cluster data points. Training of the neural network mimics supervised learning, with an internal clustering evaluation index acting as the loss function. An experiment is conducted to test the feasibility of the new model, and its results are compared with those of other clustering models such as K-means and the Gaussian Mixture Model (GMM). The results show that CNNI can work properly for clustering data; CNNI equipped with MMJ-SC achieves the first parametric (inductive) clustering model that can deal with non-convex (non-flat geometry) data.
... Such as DBCV [15], Xie-Beni (XB) [24], CDbw [8], S Dbw [7], and RMSSTD [6]. Besides, new cluster validity indices keep emerging, such as the CVNN [13], CVDD [10], DSI [5], SCV [25], AWCD [12] and VIASCKDE [21]. ...
A new index for the internal evaluation of clustering is introduced. The index is defined as a mixture of two sub-indices: the first is called the Ambiguous Index, and the second is called the Similarity Index. Calculation of the two sub-indices is based on density estimation for each cluster of a partition of the data. An experiment is conducted to test the performance of the new index against three popular internal clustering evaluation indices (the Calinski-Harabasz index, the Silhouette coefficient, and the Davies-Bouldin index) on a set of 145 datasets. The results show the new index improves on the three popular indices by 59%, 34%, and 74%, respectively.