Article · PDF available

A Survey of Partition based Clustering Algorithms in Data Mining: An Experimental Approach

Authors:
  • Dwaraka Doss Goverdhan Doss Vaishnav College

Abstract

Clustering is one of the most important research areas in the field of data mining. Clustering means creating groups of objects based on their features in such a way that objects belonging to the same group are similar and those belonging to different groups are dissimilar. Clustering is an unsupervised learning technique. Data clustering is the subject of active research in several fields, such as statistics, pattern recognition and machine learning. From a practical perspective, clustering plays an outstanding role in data mining applications in many domains. The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge. Clustering algorithms can be applied in many areas, for instance marketing, biology, libraries, insurance, city planning, earthquake studies and web document classification. Data mining adds to clustering the complications of very large data sets with very many attributes of different types, which imposes unique computational requirements on the relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems; they are the subject of this survey. The survey also explores the behavior of some of the partition-based clustering algorithms and their basic approaches, with experimental results.
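As a concrete illustration of the partition-based approach the survey covers, here is a minimal k-means (Lloyd's algorithm) sketch in plain NumPy. The toy data and parameter defaults are illustrative only, not taken from the paper's experiments.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm): a minimal sketch of the
    partition-based clustering approach surveyed in the paper."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs: k-means recovers the obvious partition.
pts = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
                [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
labels, cents = kmeans(pts, 2)
```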
... Clustering, an unsupervised learning method, involves organizing unlabeled data points into groups based on their similarities, aiming to identify inherent structures or patterns within the data [33]. Hierarchical and non-hierarchical methods are two commonly used clustering methods [34][35][36]. In this study, four hierarchical clustering methods and two non-hierarchical clustering methods are employed to categorize the climate type for the 36 stations. ...
... Non-hierarchical clustering methods include k-means clustering and k-medoids clustering. In the k-means algorithm, given a set of points x1, x2, ..., xn to be classified into k classes C = {c1, c2, ..., ck}, the objective is to minimize the summation of squared errors within the same group, which is defined as follows [36]: E = Σ_{i=1}^{k} Σ_{x ∈ ci} ||x − μi||², where μi is the mean (centroid) of the points in class ci. ...
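The squared-error objective above can be computed directly; a small sketch (the toy points, labels, and centroid are hypothetical):

```python
import numpy as np

def within_cluster_sse(points, labels, centroids):
    """Sum of squared errors minimized by k-means:
    E = sum_{i=1}^{k} sum_{x in c_i} ||x - mu_i||^2."""
    return sum(
        float(((points[labels == i] - mu) ** 2).sum())
        for i, mu in enumerate(centroids)
    )

# Two points at distance 1 from their shared centroid: E = 1 + 1 = 2.
pts = np.array([[0., 0.], [0., 2.]])
sse = within_cluster_sse(pts, np.array([0, 0]), np.array([[0., 1.]]))
```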
Article
Full-text available
Climate classification plays a fundamental role in understanding climatic patterns, particularly in the context of a changing climate. This study utilized hourly meteorological data from 36 major cities in China from 2011 to 2021, including 2 m temperature (T2), relative humidity (RH), and precipitation (PRE). Both original hourly sequences and daily value sequences were used as inputs, applying two non-hierarchical clustering methods (k-means and k-medoids) and four hierarchical clustering methods (Ward, complete, average, and single) for clustering. The classification results were compared using two clustering evaluation indices: the silhouette coefficient and the Calinski–Harabasz index. Additionally, the clustering was compared with the Köppen–Geiger climate classification based on the maximum difference in intra-cluster variables. The results showed that the clustering method outperformed the Köppen–Geiger climate classification, with the k-medoids method achieving the best results. Our research also compared the effectiveness of climate classification using two variables (T2 and PRE) versus three variables, including the addition of hourly RH. Cluster evaluation confirmed that incorporating the original sequence of hourly T2, PRE, and RH yielded the best performance in climate classification. This suggests that considering more meteorological variables and using hourly observation data can significantly improve the accuracy and reliability of climate classification. In addition, by setting the class numbers to two, the clustering methods effectively identified climate boundaries between northern and southern China, aligning with China’s traditional geographical division along the Qinling–Huaihe River line.
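The silhouette coefficient used for evaluation above can be sketched in plain NumPy from its standard definition, s(i) = (b − a) / max(a, b), where a is the mean distance to points in the same cluster and b the smallest mean distance to any other cluster. The toy data here is hypothetical, not the study's station data.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (higher is better,
    range [-1, 1])."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        if same.sum() < 2:  # singleton clusters get s = 0 by convention
            scores.append(0.0)
            continue
        a = d[i, same].sum() / (same.sum() - 1)  # exclude the point itself
        b = min(d[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, far-apart pairs: silhouette close to 1.
pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
s = mean_silhouette(pts, np.array([0, 0, 1, 1]))
```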
... Clustering is a type of unsupervised learning [15]. In an unsupervised learning task, no target values, or "supervisors," are known, and the aim is to learn structure from the inputs themselves. ...
... Clustering using K-Means is common practice [2]. Clustering's ability to help uncover interesting patterns and structures in massive data sets with little to no prior knowledge is a major benefit [15]. ...
Article
Full-text available
Clustering is a type of unsupervised learning [15]. In an unsupervised learning task, no target values, or "supervisors," are known, and the aim is to learn structure from the inputs themselves. Data mining and machine learning would be useless without clustering. Using it to categorize datasets according to their similarities makes it possible to predict user behavior more accurately. The purpose of this research is to compare and contrast three widely used data-clustering methods. Clustering techniques include partitioning, hierarchical, density, grid, and fuzzy clustering. Machine learning, data mining, pattern recognition, image analysis, and bioinformatics are just a few of the many fields where clustering is utilized as an analytical technique. In addition to defining the various algorithms, specialized forms of cluster analysis, and linkage methods, we offer a review of the clustering techniques used in the big data setting.
... The suggested technique and the k-means algorithm demonstrated superior performance in terms of recall, while the AHC and AP clustering methods exhibited slightly lower results. Velmurugan and Santhanam (2011) investigated three clustering techniques applied to a geographic map data set. The k-means algorithm demonstrated favorable performance when applied to small datasets, while the k-medoids algorithm exhibited strong performance when dealing with large datasets. ...
Article
Full-text available
Text clustering consists of grouping objects of similar categories. The initial centroids influence the operation of the system, with the potential to become trapped in local optima. A second issue is the impact of a huge number of features on the determination of optimal initial centroids; this dimensionality problem may be reduced by feature selection. Therefore, Wind Driven Optimization (WDO) was employed for feature selection to remove unimportant words from the text. In addition, the current study integrated WDO as a clustering optimization technique to effectively determine the most suitable initial centroids. The new meta-heuristic was thus employed twice in a multi-objective setting: first as unsupervised feature selection (WDOFS) and second as a clustering algorithm (WDOC). For example, WDOC outperformed Harmony Search and Particle Swarm Optimization with an F-measure of 93.3%; in addition, text clustering performance improved by 0.9% when the suggested clustering was applied to the proposed feature selection. With WDOFS, more than 50 percent of the features were removed compared with the other feature examinations. The best multi-objective result achieved an F-measure of 98.3%.
... Trees 2020). Despite the elevated time complexity, this method has an overall high computing efficiency (Velmurugan and Santhanam 2011). As part of the PAM algorithm, we analyzed several indices to determine the optimal number of clusters, such as the Average Silhouette Method, the Hubert index and the D-index, as well as the total sum of squares within the cluster. ...
Article
Full-text available
Key message: Mediterranean forest stands manifest diverse flammability traits according to their potential ecological successional stage, promoting a gradient from flammable to less flammable ecosystems. From a general consideration of vegetation as ‘fuel’, it has been well proven that plant traits have the potential to promote the forest stand gradient from flammable to less flammable. While the ever-growing literature helps to assess the relationship between plants and their flammability at the species level, at the landscape scale this relationship should be evaluated along with a variety of forest features, such as structural and stand parameters, and from the perspective of successional forest stages. To this end, we clustered several forest stands in Southern Europe (Apulia region, Italy), characterized by oaks, conifers, and arboreal shrub species, according to their flammability traits. We hypothesized that flammability traits change along different horizontal and vertical structural features of forest stands, shifting from high- to low-flammability propensity. The results confirmed that forest stands with greater height and diameter classes are associated with traits with a low-flammability propensity. It is worth highlighting the importance of shrub coverage in differentiating the clusters, denoting its strong influence in increasing fuel load (litter and fuel bed traits). Finally, our findings lead us to assume that high-flammability propensity traits are associated with typical pioneer successional stages, supporting the notion that later successional forest stands are less flammable and, therefore, that flammability decreases along with succession.
... Data mining is the process of finding something meaningful in new correlations, existing patterns and trends by sifting through large data stored in a repository, using pattern recognition technology along with mathematical and statistical techniques (Larose & Larose, 2014). The most widely accepted definition of data mining is turning raw data into useful information (Velmurugan & Santhanam, 2011). According to Deka (2014), clustering is a data mining technique used to obtain a set of objects that have the same characteristics within a sufficiently large data set. ...
Article
Full-text available
Rice (Latin: Oryza sativa) is one of the most important cultivated plants in civilization. This plant is the main commodity for almost all Indonesian people. Indonesia is in third place as the largest rice-producing country in the world. However, based on data from Statistics Indonesia, Indonesia still imported rice as of 2022. The conversion of paddy fields is one of the reasons why Indonesia is still importing rice to this day. Much land that used to be paddy fields has turned into airports, industrial land, housing, and so on. Rice production is one of the important topics to be discussed in order to develop rice production in areas where it is still relatively low. The purpose of this research is to cluster cities/regencies in Indonesia based on rice production data in 2021. In this study, three clustering methods were used, namely Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA) and Fuzzy C-Means (FCM). The three methods were then compared based on their silhouette coefficient values. The best method obtained was FCM, with two clusters and a silhouette value of 0.828. The clustering results from the best method were used as a reference in making cluster maps. Areas where production is still relatively low are expected to increase rice productivity.
Article
Although clustering methods have shown promising performance in various applications, they cannot effectively handle incomplete data. Existing studies often impute missing values first before clustering analysis and conduct these two processes separately. However, inaccurate imputation does not necessarily contribute positively to the subsequent clustering. Intuitively, accurate imputation and clustering can serve and benefit from each other, where clustering-based imputation methods typically utilize cluster signals to impute incomplete data and accurate fillings are expected to bring more valuable data for clustering. Therefore, in this manuscript, rather than considering two tasks independently or conducting them respectively, we study simultaneous clustering and imputing over incomplete data. The immediate benefit is that such a strategy improves both clustering and imputation performance simultaneously, to get a win-win result. Our major technical highlights include (1) the problem formalization and NP-hardness analysis on computing simultaneous clustering and imputing results, (2) exact solutions by transforming the problem as the integer linear programming (ILP) formulation, and (3) efficient approximation algorithms based on the linear programming (LP) relaxation and local neighbors (LN) solution, with approximation guarantees. Experiments on various real-world datasets demonstrate the superiority of our work in clustering and imputing incomplete data.
Conference Paper
This study presents a detailed analysis of coding standards in the context of university-level programming courses. Focusing on the challenges faced by students in understanding and applying these standards, the study utilized the CheckStyle tool for Java programming and Natural Language Processing (NLP) techniques, such as Doc2Vec, for error categorization. The analysis revealed common issues, such as spacing problems and non-compliance with naming conventions, exacerbated by factors like tight schedules and limited grading emphasis. Importantly, this paper serves as an exploratory study utilizing advanced natural language processing methods, shedding light on the complexities students face in adhering to coding standards.
Article
Full-text available
Data mining approaches and technologies are used to extract unknown patterns from large data sets for business and real-time applications. The unlabeled data in a large data set can be grouped in an unsupervised manner using clustering algorithms. Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. The quality of the clustering result and its efficiency in the application domain are determined by the algorithms used. This research work deals with two of the most widely used clustering algorithms, namely centroid-based K-Means and representative-object-based Fuzzy C-Means. These two algorithms are implemented, and their performance is analyzed based on the quality of their clustering results. The behavior of both algorithms depends on the number of data points as well as on the number of clusters. The input data points are generated in two ways: one using a normal distribution and the other a uniform distribution (via the Box-Muller formula). The performance of each algorithm is investigated over different executions of the program on the input data points. The execution time of each algorithm is also analyzed, and the results are compared with one another.
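The normally distributed input points mentioned above can be generated from uniform samples with the Box-Muller transform, z = sqrt(-2 ln u1) · cos(2π u2). A small stdlib-only sketch (sample size and seed are arbitrary, not the paper's settings):

```python
import math
import random

def box_muller(n, mu=0.0, sigma=1.0, seed=42):
    """Generate n normally distributed samples from uniform ones via the
    Box-Muller transform (each pair of uniforms yields two normals)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # shift to (0, 1] so log(u1) is defined
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        out.append(mu + sigma * r * math.cos(2.0 * math.pi * u2))
        if len(out) < n:
            out.append(mu + sigma * r * math.sin(2.0 * math.pi * u2))
    return out

# 10,000 samples should have mean ~0 and variance ~1.
xs = box_muller(10_000)
```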
Article
Full-text available
Clustering algorithms have been utilized in a wide variety of application areas. One of these algorithms is the Fuzzy C-Means (FCM) algorithm. One of the problems with such algorithms is the time needed to converge. In this paper, a Fast Fuzzy C-Means (FFCM) algorithm is proposed, based on experimentation, for improving fuzzy clustering. The algorithm decreases the number of distance calculations by checking the membership value of each point and eliminating those points whose membership value is smaller than a threshold. We applied FFCM to several data sets. The experiments demonstrate the efficiency of the proposed algorithm.
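A sketch of the idea: the standard FCM membership update, plus a pruning mask that drops point-cluster pairs whose membership fell below a threshold, so later iterations can skip those distance calculations. The threshold value `eps` and the toy data are hypothetical, not the paper's choices.

```python
import numpy as np

def fcm_memberships(points, centroids, m=2.0):
    """Standard FCM membership update:
    u[i, j] = 1 / sum_k (d(x_i, c_j) / d(x_i, c_k))^(2 / (m - 1))."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # avoid division by zero at a centroid
    power = 2.0 / (m - 1.0)
    return 1.0 / ((d[:, :, None] / d[:, None, :]) ** power).sum(axis=2)

def prune_mask(u, eps=0.05):
    """FFCM-style pruning sketch: keep only point-cluster pairs whose
    membership is at least eps (a hypothetical threshold)."""
    return u >= eps

# Each point sits on a centroid, so memberships are nearly crisp.
pts = np.array([[0., 0.], [10., 10.]])
u = fcm_memberships(pts, np.array([[0., 0.], [10., 10.]]))
```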
Article
Full-text available
A measure is presented which indicates the similarity of clusters which are assumed to have a data density that is a decreasing function of distance from a vector characteristic of the cluster. The measure can be used to infer the appropriateness of data partitions and can therefore be used to compare the relative appropriateness of various divisions of the data. The measure depends on neither the number of clusters analyzed nor the method of partitioning of the data, and can be used to guide a cluster-seeking algorithm.
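This measure is commonly known as the Davies-Bouldin index: R_ij = (S_i + S_j) / M_ij, where S_i is the mean within-cluster distance to the centroid and M_ij the distance between centroids, averaged over each cluster's worst-case R. A minimal NumPy sketch of that usual formulation (toy data only):

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index: lower values indicate more compact,
    better-separated clusters."""
    ids = sorted(set(labels))
    cents = np.array([points[labels == c].mean(axis=0) for c in ids])
    # S_i: mean distance of each cluster's points to its centroid.
    S = np.array([np.linalg.norm(points[labels == c] - cents[i], axis=1).mean()
                  for i, c in enumerate(ids)])
    k = len(ids)
    worst = [max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return float(np.mean(worst))

# Two tight, far-apart clusters give a very small index.
pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
db = davies_bouldin(pts, np.array([0, 0, 1, 1]))
```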
Article
Full-text available
K-means is a well-known and widely used partitional clustering method. While there have been considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can impact the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if the input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and can provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces the variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases the variation in resultant cluster sizes to greater than 0.3. In other words, in both of these cases, K-means produces clustering results that deviate from the "true" cluster distributions.
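The CV criterion described above is simple to compute over cluster sizes (CV = standard deviation / mean). A minimal sketch, using the population standard deviation since the article excerpt does not specify which variant:

```python
import numpy as np

def cluster_size_cv(labels):
    """Coefficient of variation of the cluster sizes: std / mean.
    Uses the population standard deviation (ddof=0), an assumption here."""
    sizes = np.array([np.sum(labels == c) for c in set(labels)], dtype=float)
    return float(sizes.std() / sizes.mean())

# Perfectly uniform sizes give CV = 0; sizes (5, 15) give CV = 5/10 = 0.5.
cv_uniform = cluster_size_cv(np.array([0] * 10 + [1] * 10))
cv_skewed = cluster_size_cv(np.array([0] * 5 + [1] * 15))
```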
Article
Image thresholding has played an important role in image segmentation. In this paper, we present a novel spatially weighted fuzzy c-means (SWFCM) clustering algorithm for image thresholding. The algorithm is formulated by incorporating spatial neighborhood information into the standard FCM clustering algorithm. Two improved implementations of the k-nearest neighbor (k-NN) algorithm are introduced for calculating the weight in the SWFCM algorithm so as to improve the performance of image thresholding. To speed up the FCM algorithm, the iteration is carried out on the gray-level histogram of the image instead of the whole image data. The performance of the algorithm is compared with those of an existing fuzzy thresholding algorithm and the widely applied between-class variance and entropy methods. Experimental results on synthetic and real images indicate that the proposed approach is effective and efficient. In addition, due to the neighborhood model, the proposed method is more tolerant to noise. Thresholding is an important technique for image segmentation, based on the assumption that objects can be distinguished and extracted from the background by their gray levels. The output of the thresholding operation is a binary image in which gray level 0 (black) indicates the foreground and gray level 255 (white) indicates the background, or vice versa. Many thresholding methods have been developed; a detailed survey can be found in references (1) and (2). In general, threshold selection can be categorized into two classes: local methods and global methods. Global thresholding methods segment an entire image with a single threshold using the gray-level histogram of the image, while local methods partition the given image into a number of sub-images and select a threshold for each sub-image.
Global thresholding techniques are easy to implement and computationally less involved; they are therefore superior to local methods in many real image processing applications. Global thresholding methods select the threshold based on different criteria, such as Otsu's method (3), minimum-error thresholding (4), and the entropic method first proposed by Pun (5) and then modified and extended by Kapur et al. (6). Otsu (3) selects the optimal thresholds by maximizing the between-class variance of gray values. Kittler and Illingworth (4) assume that the gray values of object and background are normally distributed; in their method, the threshold is chosen by a minimum-error-rate scheme for the resultant classes. Kapur et al. (6) proposed a thresholding method that maximizes the entropy of the histogram of gray levels of object and background. Generally, all these conventional
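Among the global criteria listed, Otsu's method is the most compact to sketch: it scans every candidate threshold of the gray-level histogram and keeps the one maximizing the between-class variance w0 · w1 · (mu0 − mu1)². The 8-level toy histogram below is illustrative only.

```python
import numpy as np

def otsu_threshold(hist):
    """Otsu's method: return the threshold t (pixels with level < t form
    class 0) that maximizes the between-class variance of the histogram."""
    hist = np.asarray(hist, dtype=float)
    p = hist / hist.sum()              # normalize to a probability mass
    levels = np.arange(len(p))
    best_t, best_var = 0, -1.0
    for t in range(1, len(p)):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue                   # skip thresholds with an empty class
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# Bimodal toy histogram over 8 gray levels: mass at levels 0-1 and 6-7.
t = otsu_threshold([10, 10, 0, 0, 0, 0, 10, 10])
```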
Article
We propose a hybrid genetic algorithm for k-medoids clustering. A novel heuristic operator is designed and integrated with the genetic algorithm to fine-tune the search. Further, variable-length individuals that encode different numbers of medoids (clusters) are used for evolution, with a modified Davies-Bouldin index as a measure of the fitness of the corresponding partitionings. As a result, the proposed algorithm can efficiently evolve appropriate partitionings while making no a priori assumption about the number of clusters present in the datasets. In the experiments, we show the effectiveness of the proposed algorithm and compare it with other related clustering methods.
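The paper's genetic algorithm evolves medoid sets; as a baseline sketch of the underlying k-medoids partitioning it fine-tunes (not the authors' hybrid GA), here is a plain greedy swap search in NumPy on toy data:

```python
import numpy as np

def kmedoids_cost(d, medoids):
    """Total cost: each point's distance to its nearest medoid."""
    return d[:, medoids].min(axis=1).sum()

def greedy_kmedoids(points, k, seed=0):
    """Greedy swap k-medoids (PAM-style): repeatedly try replacing a medoid
    with a non-medoid and keep any swap that lowers the total cost."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = list(rng.choice(len(points), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                if kmedoids_cost(d, trial) < kmedoids_cost(d, medoids):
                    medoids, improved = trial, True
    labels = d[:, medoids].argmin(axis=1)
    return medoids, labels

# Two toy blobs of three points each; the blob centers become the medoids.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
medoids, labels = greedy_kmedoids(pts, 2)
```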