Article

Density-Weighted Fuzzy c-Means Clustering

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

In this short paper, a unified framework for performing density-weighted fuzzy c-means (FCM) clustering of feature and relational datasets is presented. The proposed approach consists of reducing the original dataset to a smaller one, assigning each selected datum a weight reflecting the number of nearby data, clustering the weighted reduced dataset using a weighted version of the feature or relational data FCM algorithm, and, if desired, extending the reduced data results back to the original dataset. Several methods are given for each of the tasks of data subset selection, weight assignment, and extension of the weighted clustering results. The newly proposed weighted version of the non-Euclidean relational FCM algorithm is proved to produce results identical to those of its feature data analog for a certain type of relational data. Artificial and real data examples are used to demonstrate and contrast various instances of this general approach.
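The four steps named in the abstract (reduce, weight, cluster, extend) can be illustrated for feature data with a minimal Python sketch. It is not the authors' implementation: the function names, the radius-count weighting rule, the farthest-first initialization, and the extension by recomputing memberships from the final centers are all assumptions chosen for illustration, and the relational-data variant is not covered.

```python
import numpy as np

def density_weights(X, R, radius):
    """Weight each reduced datum by the number of original data within `radius` (assumed rule)."""
    d = np.linalg.norm(R[:, None, :] - X[None, :, :], axis=2)  # shape (r, n)
    return (d <= radius).sum(axis=1).astype(float)

def memberships(X, V, m=2.0):
    """Standard FCM memberships of the rows of X to the centers V; each row sums to 1."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)               # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def init_centers(R, w, c):
    """Hypothetical deterministic start: heaviest datum, then farthest-first."""
    idx = [int(np.argmax(w))]
    for _ in range(c - 1):
        d = np.linalg.norm(R[:, None, :] - R[idx][None, :, :], axis=2).min(axis=1)
        idx.append(int(np.argmax(d)))
    return R[idx].copy()

def weighted_fcm(R, w, c, m=2.0, iters=100, tol=1e-6):
    """Weighted FCM on the reduced data R with per-datum weights w."""
    V = init_centers(R, w, c)
    for _ in range(iters):
        U = memberships(R, V, m)
        G = w[:, None] * U ** m              # weights scale each datum's influence
        V_new = (G.T @ R) / G.sum(axis=0)[:, None]
        done = np.linalg.norm(V_new - V) < tol
        V = V_new
        if done:
            break
    return V, memberships(R, V, m)
```

To extend the result to the full dataset under these assumptions, call `memberships(X, V)` with the centers returned by `weighted_fcm`; the paper describes several extension methods, of which this is only the simplest candidate.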

... A density-based wFCM was proposed in [11] for reducing the size of the input dataset. In [12,13] two weighted FCM algorithms are used for image segmentation. ...
... Sunflowers, G band, compression rate 0.25: segmented images (Information 2020, 11, 351). ...
Article
Full-text available
A novel bit-reduced fuzzy clustering method for segmenting high-resolution massive images is proposed. The image is decomposed into blocks and compressed using the fuzzy transform method; adjoint pixels with the same gray level are then binned, and the fuzzy c-means algorithm is applied to the bins to segment the image. The method suits massive images because the compressed image can be stored in memory and the segmentation runtime is reduced. Comparison tests against the fuzzy c-means algorithm are performed on high-resolution images; the results show that, when the compression is not very high, the segmentations are comparable with those obtained by applying fuzzy c-means to the source image, and the runtimes are reduced by about an eighth with respect to those of fuzzy c-means.
... The cluster centers are given by (9), in which the weight $w_j$ sets the degree of influence of the jth object on the cluster centers. A set of wFCM-based clustering methods has been proposed (see, for example, [23][24][25][26]) in order to reduce the number of objects in massive datasets by assigning each object a weight based on the density of nearby data [23], and to encode a pixel's local information in image segmentation [24,25]. In [26] some variations of wFCM are proposed in order to handle very large data. ...
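The equation referred to as (9) is not reproduced in this snippet. For orientation only, the commonly used weighted FCM formulation is sketched below; the notation ($c$ clusters, $n$ objects $x_j$, fuzzifier $m > 1$, memberships $u_{ij}$, centers $v_i$) is an assumption and may differ from that of the citing paper.

```latex
% Weighted FCM objective: each object x_j contributes in proportion to w_j
J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} w_j \, u_{ij}^{m} \, \lVert x_j - v_i \rVert^{2},
\qquad \text{subject to } \sum_{i=1}^{c} u_{ij} = 1 \ \text{for every } j.

% Center update: a mean of the objects weighted by w_j u_{ij}^m
v_i = \frac{\sum_{j=1}^{n} w_j \, u_{ij}^{m} \, x_j}{\sum_{j=1}^{n} w_j \, u_{ij}^{m}}

% Membership update: same form as in unweighted FCM
u_{ij} = \left[ \sum_{k=1}^{c}
  \left( \frac{\lVert x_j - v_i \rVert}{\lVert x_j - v_k \rVert} \right)^{\frac{2}{m-1}}
\right]^{-1}
```

Setting every $w_j = 1$ recovers ordinary FCM, which is why a reduced dataset with density-based weights can stand in for the full one.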
Article
Full-text available
One of the main drawbacks of the well-known Fuzzy C-means clustering algorithm (FCM) is the random initialization of the cluster centers, which can significantly affect the performance of the algorithm, so that an optimal solution is not guaranteed and execution times increase. In this paper we propose a variation of FCM in which the optimal initial cluster centers are obtained by running a weighted FCM algorithm whose weights are assigned by calculating a Shannon fuzzy entropy function. Comparison tests on various classification datasets from the UCI Machine Learning Repository show that our algorithm improved on the performance of FCM in all cases.
... This notion can be considered as a marginal case of clustering fuzzy data [25]. Examples for this setting include clustering of a weighted set, clustering of sampled data, clustering in the presence of multiple classes of datums with different priorities [1], and a measure used in order to speed up the execution through data reduction [89,41,100]. Nock and Nielsen [73] formalize the case in which weights are manipulated in order to move the clustering results towards datums which are harder to include regularly. ...
... Nock and Nielsen [73] formalize the case in which weights are manipulated in order to move the clustering results towards datums which are harder to include regularly. Note that the extension of FCM on weighted sets has been developed under different names, including Density-Weighted FCM (WFCM) [41], Fuzzy Weighted C-means (FWCM) [67], and New Weighted FCM (NW-FCM) [45]. ...
... In non-inclusive mode, however, $c_n$ is not defined if $x_n$ is not an inlier, i.e. if it does not satisfy (41). This strategy has similarities to Conditional Fuzzy C-means (CFCM) [64], but the method developed in this paper does not require the a priori knowledge necessary in CFCM. ...
... The current improvements can be divided into two types: improvements to traditional algorithms from the perspective of data features and from the perspective of algorithm principles. From the perspective of data features, the entropy measure [30], distance measure [25,31], and probabilistic Euclidean distance [26] are extended to obtain the contributions of different features to the sample. For instance, Cherif et al. [27] proposed three new interval type-2 fuzzy similarity measures and combined them with the fuzzy C-means algorithm. ...
Article
Full-text available
The fuzzy c-means (FCM) algorithm is an unsupervised clustering algorithm that effectively expresses complex real-world information by integrating fuzzy parameters. Due to its simplicity and operability, it is widely used in multiple fields such as image segmentation, text categorization, pattern recognition and others. Intuitionistic fuzzy c-means (IFCM) clustering has been shown to perform better than FCM because it captures more of the uncertain information in the dataset. However, the IFCM algorithm has limitations such as the random initialization of cluster centers and the unrestricted influence of all samples on all cluster centers. Therefore, a novel algorithm named equidistance index IFCM (EI-IFCM) is proposed to address the shortcomings of IFCM. Firstly, EI-IFCM can start its learning process from better initial cluster centers, which it organizes based on the contribution of local density information from the data samples. Secondly, a membership degree boundary is assigned to the data samples satisfying the equidistance index, to avoid the unrestricted influence of all samples on all cluster centers in the clustering process. Finally, the performance of the proposed EI-IFCM is numerically validated using UCI datasets which contain data from healthcare, plant, animal, and geography. The experimental results indicate that the proposed algorithm is competitive and suitable for fields such as plant clustering, medical classification, image differentiation and others. They also indicate that, in the mentioned fields, the proposed algorithm surpasses other efficient clustering algorithms in terms of iterations and precision.
... To solve these problems, experts and scholars have proposed some variant FCM algorithms. Hathaway and Hu (2009) designed a density-weighted fuzzy C-means clustering (DWFCM) to improve convergence speed by simplifying a larger data set into a smaller weighted data set. Hung and Yang (2001) refined the initial value of the FCM algorithm and proposed a psFCM algorithm. ...
Preprint
Full-text available
The Fuzzy C-Means (FCM) algorithm is widely used in data mining and machine learning. However, the sensitivity of FCM to the initial value and to noise inevitably degrades the clustering effect. In this paper, a new improved fuzzy clustering algorithm is proposed: robust denoising FCM clustering via L2,1 NMF and local constraint (RFCM-L2,1NMF). Firstly, RFCM-L2,1NMF combines L2,1 NMF, which has noise residual estimation, with FCM, using the robustness and noise constraint terms of L2,1 NMF to attenuate the effect of noise on data clustering. Secondly, RFCM-L2,1NMF uses the low-dimensional representation of L2,1 NMF as the initial value of FCM, which reduces the defects of FCM caused by the initial value to a certain extent and makes the clustering effect more stable. Furthermore, since the low-dimensional representation of L2,1 NMF is the hub connecting L2,1 NMF and FCM, we construct a new local constraint term in this paper to obtain a more accurate low-dimensional representation. Finally, experiments on data sets validate that RFCM-L2,1NMF is superior to other state-of-the-art methods.
... In order to solve these problems, experts and scholars have proposed some variant FCM algorithms. Hathaway and Hu (2009) designed a density-weighted fuzzy C-means clustering (DWFCM) to improve convergence speed by simplifying a larger data set into a smaller weighted data set. Hung and Yang (2001) refined the initial value of the FCM algorithm and proposed the psFCM algorithm. ...
Article
Full-text available
The fuzzy C-means (FCM) algorithm is a classical clustering algorithm which is widely used. However, especially for high-dimensional data sets with complex structures, the large-scale calculation of FCM suffers from decreasing clustering effect. In order to improve the clustering performance, we propose two new modified fuzzy clustering algorithms—modified fuzzy clustering algorithm based on non-negative matrix factorization (MFCM-NMF) and modified fuzzy clustering algorithm based on non-negative matrix factorization with local constraint (MFCM-LCNMF). Since MFCM-NMF combines NMF with modified FCM, the algorithm can use the dimensionality reduction technology of NMF, which greatly improves the computational efficiency. MFCM-LCNMF introduces NMF with local linear constraints into modified FCM, and it has a new objective function and adopts a new algorithm for alternate iteration. In the iterative process, the new membership update formula is utilized for the samples selected by the triangle inequality, which not only reduces the amount of calculation, but also obtains a higher clustering quality. Finally, a number of experiments on many data sets verify that MFCM-NMF and MFCM-LCNMF are more effective compared with other state-of-the-art methods.
... Various accelerated fcm algorithms have been proposed since the 1980s. Nearly all of these algorithms are inexact, as they involve numerical approximations (Cannon et al. 1986; Höppner 2002) or sampling (followed by optional weighting) (Cheng et al. 1998; Pal and Bezdek 2002; Eschrich et al. 2003; Hathaway and Hu 2009; Parker and Hall 2014). Furthermore, most of these algorithms attain only modest (e.g., 2- to 6-fold) speed-ups. ...
Article
Full-text available
Color quantization (cq), the reduction of the number of distinct colors in a given image with minimal distortion, is a common image processing operation with various applications in computer graphics, image processing/analysis, and computer vision. The first cq algorithm, median-cut, was proposed over 40 years ago. Since then, many clustering algorithms have been applied to the cq problem. In this paper, we present a comprehensive overview of the cq algorithms proposed in the literature. We first examine various aspects of cq, including the number of distinguishable colors, cq artifacts, types of cq, applications of cq, data structures, data reduction, color spaces and color difference equations, and color image fidelity assessment. We then provide an overview of image-independent cq algorithms, followed by a detailed survey of image-dependent ones. After presenting a brief discussion of pixel mapping, we conclude our survey with an outline of the open problems in cq.
... Cannon [7] presented an effective realization of the FCM clustering algorithm. A cohesive structure for implementing clustering with density-weighted FCM [8] was developed in order to optimize FCM and deal with its problems. K-Means is a specific clustering method [9] for finding pixel groupings in an image, and it is generally used since it is uncomplicated and very fast. ...
Article
Full-text available
Defect detection in metallic surface images is a challenging task in the image analysis process. Data clustering and optimization techniques have been widely used for image segmentation, and combining the two approaches improves output stability as well as convergence speed. This work develops an automatic, efficient method for the detection and segmentation of coating defects in metal surfaces. Fuzzy c-means (FCM) and the Firefly algorithm (FA) are well-known and popular methods for discovering image information comprising indiscriminate objects, and they solve many complex problems involved in image segmentation. This paper proposes a new technique for coated metal surface defect detection that hybridizes the two methods, FCM with FA (FCM-FA). Experimental results verify the efficiency of the developed FCM-FA in comparison with three existing methods in terms of evaluation parameters of defect detection for scanned high-resolution images. The experimental results show that the combined algorithm has the potential to segment and identify the defective regions of the coated surface.
... Fuzzy C-Means Clustering Based on Dual Expression between Cluster Prototypes and Reconstructed Data (DEFCM) [9] was proposed by introducing a dual expression between cluster prototypes and reconstructed data to improve the convergence and complexity of FCM. Hathaway and Hu [10] designed the density-weighted fuzzy c-means (DWFCM) to improve the convergence by reducing the larger data set to a weighted smaller one. A fuzzy clustering algorithm based on evolutionary programming in [11] accelerated FCM by the global search strategy of evolutionary programming. ...
... Determining the weight associated with each observation can be carried out following different strategies. We cite, among others, the use of neighbourhood density information of each sample (the number of objects that are near to the sample using a distance threshold) or user-defined constants [24]. The optimization of the objective function (1) through an iterative process builds the fuzzy partitions. ...
Article
Full-text available
The rapid growth in virtualization solutions has driven the widespread adoption of cloud computing paradigms among various industries and applications. This has led to a growing need for XaaS solutions and equipment to enable teleworking. To meet this need, cloud operators and datacenters have to overcome several challenges related to continuity, the quality of services provided, data security, and anomaly detection issues. In particular, anomaly detection methods play a critical role in detecting virtual machines’ abnormal behaviours that can potentially violate service level agreements established with users. Unsupervised machine learning techniques are among the most commonly used technologies for implementing anomaly detection systems. This paper introduces a novel clustering approach for analyzing virtual machine behaviour while running workloads in a system based on resource usage details (such as CPU utilization and downtime events). The proposed algorithm is inspired by the intuitive mechanism of flocking birds in nature to form reasonable clusters. Each starling movement’s direction depends on self-information and information provided by other close starlings during the flight. Analogously, after associating a weight with each data sample to guide the formation of meaningful groups, each data element determines its next position in the feature space based on its current position and surroundings. Based on a realistic dataset and clustering validity indices, the experimental evaluation shows that the new weighted fuzzy c-means algorithm provides interesting results and outperforms the corresponding standard algorithm (weighted fuzzy c-means).
... d-FuzzStream poses a limit on the number of f-mcs kept in memory: whenever such limit is exceeded, the least recently updated f-mcs are deleted, thus ensuring both memory compliance and adaptation to evolving settings. The offline step organizes f-mcs in fuzzy macroclusters through a weighted fuzzy c-means (WFCM) procedure [29]. Each f-mc is represented as a virtual point in the original space, located in the f-mc center and weighted by the sum of the membership degrees of the relative examples. ...
Article
In recent years, several clustering algorithms have been proposed with the aim of mining knowledge from streams of data generated at a high speed by a variety of hardware platforms and software applications. Among these algorithms, density-based approaches have proved to be particularly attractive, thanks to their capability of handling outliers and capturing clusters with arbitrary shapes. The streaming setting poses additional challenges that need to be addressed as well: data streams are potentially unbounded and affected by concept drift, i.e. a modification over time in the underlying data generation process. In this paper, we propose Temporal Streaming Fuzzy DBSCAN (TSF-DBSCAN), a novel fuzzy clustering algorithm for streaming data. TSF-DBSCAN is an extension of the well-known DBSCAN algorithm, one of the most popular density-based clustering approaches. Fuzziness is introduced in TSF-DBSCAN to model the uncertainty about the distance threshold that defines the neighborhood of an object. As a consequence, TSF-DBSCAN identifies clusters with fuzzy overlapping borders. A fading model, which makes objects less relevant as they become more remote in time, endows TSF-DBSCAN with the capability of adapting to evolving data streams. The integration of the model in a two-stage approach ensures computational and memory efficiency: during the online stage continuously arriving objects are organized in proper data structures that are later exploited in the offline stage to determine a fine-grained partition. An extensive experimental analysis on synthetic and real world datasets shows that TSF-DBSCAN yields competitive performance when compared to other clustering algorithms recently proposed for streaming data.
... Several further studies have been done on this issue. Density-weighted fuzzy c-means (DWFCM) [11] was designed to improve the convergence speed by decreasing a large data set to a smaller, weighted one. Geometric progressive fuzzy c-means (GOFCM) and minimum sample estimate random fuzzy c-means (MSERFCM) [12] were proposed to accelerate FCM by progressive and random sampling, respectively. ...
Article
Fuzzy c-means (FCM) is one of the most frequently used methods for clustering. However, with the increasing amount of data, FCM suffers from slow convergence and a large amount of calculation because all samples are involved in updating the solutions per iteration without considering the current clustering results. In this research, a new membership scaling FCM (MSFCM) is proposed, based on the observation that the samples, whose nearest cluster center is v, aid the convergence of v, whereas the remaining samples prevent the convergence of v. In the new algorithm, many samples whose nearest cluster centers do not change in the next iteration are chosen by using the triangle inequality. A new scheme for scaling the membership degrees of the chosen samples is suggested to boost the effect of the in-cluster samples and weaken the effect of the out-of-cluster samples in the clustering process. The new scheme not only accelerates the convergence of the algorithm but also maintains the high clustering quality. Many experimental results on synthetic and real-world data sets have verified the effectiveness of the proposed algorithm in improving the speed of the convergence of the fuzzy clustering. In particular, compared with FCM, MSFCM saves at least two thirds of total rounds of iterations without significantly increasing the cost per iteration.
... (6), (7) we define $u_{ij}^{m} = f(x_i \mid v_j)$, which expresses the degree of belonging of the $i$-th data point given the center of the $j$-th cluster. Also, we extend the definition of the weights for each data in [32] to the weights for each cluster. So, we define the prior knowledge $f(v_j)$ as the weight of each cluster. ...
Article
Fuzzy c-means is one of the popular algorithms in clustering, but it has some drawbacks such as sensitivity to outliers. Although many correntropy-based works have been proposed to improve the robustness of FCM, fundamentally a proper error function is required to apply to FCM. In this paper, we present a new perspective on the FCM method based on the expected loss (or risk) to provide different kinds of robustness, such as robustness to outliers, to the volume of clusters, and in noisy environments. First, we propose the Robust FCM method (RCM) by defining a loss function as a least squares problem and exploiting correntropy to make FCM robust to outliers. Furthermore, we utilize half-quadratic (HQ) optimization as the problem-solving method. Second, inspired by the Bayesian perspective, we define a new loss function based on correntropy as a distance metric to present Robust Heterogeneous C-Means (RHCM) by utilizing the direct clustering (DC) method. DC helps RHCM to have robust initialization. Besides, RHCM produces robust cluster centers in noisy environments and is capable of clustering elliptical or spherical shaped data accurately, regardless of the volume of each cluster. The results are shown visually on some synthetic datasets including noisy ones, on the UCI repository, and on a real image dataset that was gathered manually from 500px social media. Also, for evaluation of the clustering results, several validity indices are calculated. Experimental results indicate the superiority of our proposed method over the base FCM, DC, KFCM, two new methods called GPFCM and GEPFCM, and a method called DC-KFCM that we created for comparison purposes.
... In addition, the number of clusters must be pre-specified by users [9], [10]. However, these algorithms cannot cluster non-spherical data sets effectively, because data points are always assigned to the nearest center [11]. ...
Article
Full-text available
Clustering algorithms have a very wide range of applications in data analysis, such as machine learning and data mining. However, data sets often have unbalanced and non-spherical distributions. Clustering by fast search and find of density peaks (DPC) is a density-based clustering algorithm which can identify clusters in non-spherical data. In real applications, this algorithm and its variants are not very effective at dividing unevenly distributed clusters, because they use only one indicator (the distance of neighbor points) to handle inner points and boundary points at the same time. To this end, we introduce a new indicator named asymmetry measure which enhances the ability to find boundary points. We then propose a boundary detection-based density peaks clustering (BDDPC) algorithm that combines the above two indicators, so that different clusters are separated from each other accurately and the clustering effect is improved. The BDDPC algorithm can cluster not only uniformly distributed data but also unevenly distributed data. In real life, the distribution of high-dimensional data sets is always unbalanced, so this algorithm has very important applications. Experimental results with synthetic and real-world data sets illustrate the effectiveness of our algorithm.
... However, for the sake of consistency, our goodness criteria for the outputs are those of PCA and FA. In this respect fuzzy mountain clustering [5,32] and fuzzy c-means clustering methods [2,3,6,8,12,15,16,17,20,25,26] are analogous to PCA and FA, respectively. ...
Chapter
Full-text available
A rapid soft computing method for dimensionality reduction of data sets is presented. Traditional approaches are usually based on factor or principal component analysis. Our method applies fuzzy cluster analysis and approximate reasoning instead, and thus it is also applicable to nonparametric and nonlinear models. Comparisons are drawn between the methods with two empirical data sets.
... Cannon [3] suggested an efficient implementation of the FCM clustering algorithm. To enhance the FCM algorithm and tackle its shortcomings, a unified framework for performing density-weighted FCM clustering is developed [4]. A comparative analysis of FCM and the Entropy-based Fuzzy Clustering (EFC) algorithm is performed in terms of the quality of clusters and their computational time. ...
Article
Full-text available
Classifying data into meaningful groups is one of the fundamental ways of understanding and learning valuable information. High-quality clustering methods are necessary for the valuable and efficient analysis of the increasing data. The Firefly Algorithm (FA) is one of the bio-inspired algorithms and has recently been used to solve clustering problems. In this paper, the Hybrid F-Firefly algorithm is developed by combining Fuzzy C-Means (FCM) with FA to improve the clustering accuracy with a global optimum solution. The Hybrid F-Firefly algorithm incorporates the FCM operator at the end of each iteration of the FA algorithm. The proposed algorithm is designed to utilize the goodness of the existing algorithms and to enhance the original FA algorithm while solving shortcomings of the FCM algorithm such as trapping in local optima and sensitivity to initial seed points. In this research work, the Hybrid F-Firefly algorithm is implemented and experimentally tested for various performance measures on six different benchmark datasets. From the experimental results, it is observed that the Hybrid F-Firefly algorithm significantly improves the intra-cluster distance when compared with existing algorithms such as K-means, FCM and FA.
... Furthermore, the value of K must be pre-specified [15,18,33]. The Global K-means algorithm and its variations were proposed to overcome the disadvantages of K-means [21,32], and other algorithms were developed to remedy the sensitivity of K-means to outliers [4,16,20]. However, because a data point is always assigned to the nearest center, these approaches are not able to detect nonspherical clusters. ...
Article
Clustering by fast search and find of Density Peaks (referred to as DPC) was introduced by Alex Rodríguez and Alessandro Laio. The DPC algorithm is based on the idea that cluster centers are characterized by having a higher density than their neighbors and by being at a relatively large distance from points with higher densities. The power of DPC was demonstrated on several test cases. It can intuitively find the number of clusters and can detect and exclude the outliers automatically, while recognizing the clusters regardless of their shape and the dimensions of the space containing them. However, DPC does have some drawbacks to be addressed before it may be widely applied. First, the local density ρi of point i is affected by the cutoff distance dc, and is computed in different ways depending on the size of datasets, which can influence the clustering, especially for small real-world cases. Second, the assignment strategy for the remaining points, after the density peaks (that is the cluster centers) have been found, can create a “Domino Effect”, whereby once one point is assigned erroneously, then there may be many more points subsequently mis-assigned. This is especially the case in real-world datasets where there could exist several clusters of arbitrary shape overlapping each other. To overcome these deficiencies, a robust clustering algorithm is proposed in this paper. To find the density peaks, this algorithm computes the local density ρi of point i relative to its K-nearest neighbors for any size dataset independent of the cutoff distance dc, and assigns the remaining points to the most probable clusters using two new point assignment strategies. The first strategy assigns non-outliers by undertaking a breadth first search of the K-nearest neighbors of a point starting from cluster centers. The second strategy assigns outliers and the points unassigned by the first assignment procedure using the technique of fuzzy weighted K-nearest neighbors.
The proposed clustering algorithm is benchmarked on publicly available synthetic and real-world datasets which are commonly used for testing the performance of clustering algorithms. The clustering results of the proposed algorithm are compared not only with that of DPC but also with that of several well known clustering algorithms including Affinity Propagation (AP), Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and K-means. The benchmarks used are: clustering accuracy (Acc), Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI). The experimental results demonstrate that our proposed clustering algorithm can find cluster centers, recognize clusters regardless of their shape and dimension of the space in which they are embedded, be unaffected by outliers, and can often outperform DPC, AP, DBSCAN and K-means.
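The two quantities that the original DPC idea relies on, a local density per point and the distance to the nearest point of higher density, can be sketched as follows. This is a minimal illustration of the cutoff-based baseline that the abstract criticises, not the K-nearest-neighbor variant it proposes; the function name `dpc_quantities` and the tie handling are assumptions.

```python
import numpy as np

def dpc_quantities(X, dc):
    """Return (rho, delta) for the rows of X using a hard cutoff distance dc.

    rho[i]   : number of other points closer than dc to point i
    delta[i] : distance from point i to the nearest point with strictly
               higher rho; points of maximal rho get their largest distance
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (D < dc).sum(axis=1) - 1           # subtract 1 to exclude the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]                # strictly denser points only
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta
```

Points with both large `rho` and large `delta` are the candidate cluster centers. Because `rho` is an integer count here, ties are common on small datasets, which is one facet of the sensitivity to dc that the abstract describes.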
... Nock and Nielsen (2006) concluded that the clustering algorithm did not use the weighting factor in allocation of data samples. Hathaway and Hu (2009) tried a reducing factor to make smaller data sets from the original data set. This reduced data set was selected as a datum with the surrounding data. ...
Article
The segmentation and classification of high-resolution satellite images (HRSI) are useful approaches to extract information. In recent times, roads and buildings have been classified for better analysis of urban areas. Apart from these, healthy trees, i.e. those adjacent to roads, and vegetation are also an important factor in HRSI. They reflect the area in an image as land cover. Other important information, shadow, is extracted from satellite images, which indicates the presence of trees and built-up areas such as buildings, flyovers, etc. In this article, a weighted membership-function-based fuzzy c-means with spatial constraints (WMFCSC) approach for automated satellite image classification is proposed. Initially, spatial fuzzy clustering, which includes the information of spatial constraints, is used to classify the satellite images into healthy trees with vegetation, roads, and shadows. The road results of the classified image still contain non-road segments. Therefore, the proposed four intermediate stages (IS) are used to extract the road information, following the road-area results of the WMFCSC approach. The IS framework helps to remove the false road segments which are adjacent to roads and restores road segments broken by the shadow effect. A final step of the hybrid WMFCSC-IS approach is used to extract the road network. The results of classified images confirm the effectiveness of the WMFCSC-IS approach for satellite image classification.
... Richard J. Hathaway proposed a unified framework for performing density-weighted FCM clustering of feature and relational datasets [2]. Chenglong Tang proposed a new kind of data weighted fuzzy c-means clustering approach [3]. ...
Article
Full-text available
Multidimensional data has multiple features. Different features can't be treated equivalently because they have different impacts on clustering results. To solve this problem, an ontology-based clustering algorithm with feature weights (OFW-Clustering) is proposed in this paper to reflect the different importance of different features. Prior knowledge described with ontology is introduced into clustering. An ontology-based domain feature graph is built to calculate feature weights in clustering. One feature is viewed as one ontology semantic node. The feature weight in the ontology tree is calculated according to the feature's overall relevancy. Under the guidance of the prior knowledge, parameter values of optimal clustering results can be obtained through continuous change of α and β. Experiments show that the ontology-based clustering algorithm with feature weights does a better job of capturing domain knowledge and gives a more accurate result.
Article
Network security has continuously been a major focus of research and concern on a global scale. The Intrusion Detection System (IDS), as a crucial defensive measure against network attacks, has undergone multiple iterations and evolutions since its inception to adapt to the ever-changing network environment. Due to the widespread issue of data imbalance in network security datasets, a single machine learning or deep learning model often struggles to effectively handle different types of attacks. In this work, we propose a Multi-Critics Generative Adversarial Networks (GAN) Clustering-Based IDS (MCGC-IDS) model to address the issue of data imbalance. The quality of the generated data is analyzed using correlation heatmaps and PCA plots, which later is used to update the dataset that is utilized for feature extraction with autoencoders. Subsequently, CNN-LSTM models are employed to analyze clusters formed by the Weighted Fuzzy C-means (WFCM) clustering algorithm to achieve enhanced performance for the IDS system. This model is then compared with two existing models. The results indicate that while the GAN-generated data retains the original dataset distribution, it also addresses the issue of imbalance. Moreover, the subsequent multi-layered processing enables the overall model to more effectively handle various types of attacks. Finally, when this model is tested on a similar dataset, the UNSW-NB15, it continues to demonstrate superior performance, indicating its strong generalizability.
Article
Full-text available
There has been increasing interest in pattern classification methods and neuroimaging studies using permutation tests to estimate the statistical significance of a classifier (p-value). Permutation tests usually use the test error as a dataset statistic to estimate the p-value(s) by measuring the dissimilarity between two or more populations. Using the test error as a dataset statistic, however, may camouflage the least recognizable classes, and the resulting p-value will be biased toward better values because of the highly recognizable classes; thus, lower p-values could sometimes be the result of undercoverage. In this study, we investigate this problem and propose the implementation of permutation tests based on a per-class test error as a dataset statistic. We also propose a model that is based on partially scrambling the testing samples when computing the non-permuted statistic in order to judge the p-value's tolerance and to draw conclusions regarding which permutation test procedures are more reliable. For the same purpose, we propose another model that is based on chance-level shifting of the permuted statistic. We tested the set of proposed models on functional magnetic resonance imaging data that we recollected while human subjects responded to visual stimulation paradigms, and our results showed that these models can aid in determining which permutation test procedure is superior. We also found that permutation tests that use a per-class test error as a dataset statistic are more reliable in addressing the null hypothesis that all classes in the problem domain are drawn from the same distribution.
Article
Fuzzy clustering algorithms have been widely used to reveal the possible hidden structure of data. However, with the increasing of data amount, large scale data has brought genuine challenges for fuzzy clustering. Most fuzzy clustering algorithms suffer from the long time-consumption problem since a large amount of distance calculations are involved to update the solution per iteration. To address this problem, we introduce the popular anchor graph technique into fuzzy clustering and propose a scalable fuzzy clustering algorithm referred to as Scalable Fuzzy Clustering with Anchor Graph (SFCAG). The main characteristic of SFCAG is that it addresses the scalability issue plaguing fuzzy clustering from two perspectives: anchor graph construction and membership matrix learning. Specifically, we select a small number of anchors and construct a sparse anchor graph, which is beneficial to reduce the computational complexity. We then formulate a trace ratio model, which is parameter-free, to learn the membership matrix of anchors to speed up the clustering procedure. In addition, the proposed method enjoys linear time complexity with the data size. Extensive experiments performed on both synthetic and real world datasets demonstrate the superiority (both effectiveness and scalability) of the proposed method over some representative large scale clustering methods.
Article
Fuzzy c-means (FCM) is one of the most frequently used methods for clustering, where the fuzziness weighting exponent m is a key hyper-parameter that directly affects the clustering performance. However, FCM requires careful tuning of the fuzziness parameter, which results in significant time costs. In this research, an improved FCM clustering by varying the fuzziness parameter, called vFCM, is proposed to overcome this issue. It is based on the facts that the FCM objective is easy to optimize when m is large, while more local valleys appear as m decreases; hence the optimization problem presents a search process from simple to complex when m varies from a large value to a small value approaching 1. Here, the nature of m is similar to the temperature parameter in deterministic annealing, and moving along a sequence of FCM objectives by a linear method that updates m automatically provides a form of annealing. Extensive experiments on simulated and real-world data sets show that vFCM is not only more robust to initialization but also improves the clustering performance in high dimensions. Furthermore, the clustering results of vFCM have a low fluctuation according to different m, so it does not require careful tuning of the fuzziness parameter. The time that vFCM takes is greatly reduced.
Chapter
In this paper a new approach to interval fuzzy model identification is given. It is based on the evolving Gaussian clustering algorithm, eGauss+. This algorithm clusters the data from the stream into small clusters called granules. These granules are then merged together if they fulfill all necessary criteria. This means that the cluster partitions are learned incrementally online from data streams and after that they are merged together into bigger structures. The proposed approach is not limited to data stream clustering, but can also be used in the case of classical batch clustering methods, especially for big data problems. The idea of the interval fuzzy model is to find a lower and upper bound of the data set and describe these bounds by a fuzzy model. The band or confidence interval should contain the prescribed number of samples. The interval fuzzy model is described and shown on simple examples.
Article
Full-text available
Emotion detection in the natural language text has drawn the attention of several scientific communities as well as commercial/marketing companies: analyzing human feelings expressed in the opinions and feedback of web users helps understand general moods and support market strategies for product advertising and market predictions. This paper proposes a framework for emotion‐based classification from social streams, such as Twitter, according to Plutchik's wheel of emotions. An entropy‐based weighted version of the fuzzy c‐means (FCM) clustering algorithm, called EwFCM, to classify the data collected from streams has been proposed, improved by a fuzzy entropy method for the FCM center cluster initialization. Experimental results show that the proposed framework provides high accuracy in the classification of tweets according to Plutchik's primary emotions; moreover, the framework also allows the detection of secondary emotions, which, as defined by Plutchik, are the combination of the primary emotions. Finally, a comparative analysis with a similar fuzzy clustering‐based approach for emotion classification shows that EwFCM converges more quickly with better performance in terms of accuracy, precision, and runtime. Finally, a straightforward mapping between the computed clusters and the emotion‐based classes allows the assessment of the classification quality, reporting coherent and consistent results.
Chapter
With the increasing amount of data, the calculation of the distance is complicated in fuzzy c-means. In this paper, we propose a new global membership scaling FCM (GMSFCM). The data will be divided into two types at each iteration: the first one is the in-cluster samples, which will not change their clusters in next iteration; the second one is the out-of-cluster samples, which will change their clusters in next iteration; then a new scheme for scaling the membership degrees is suggested to boost the effect of the in-cluster samples and weaken the effect of the out-of-cluster samples. However, the filtering of the in-cluster and the out-of-cluster samples often leads to a high computational complexity per iteration. Thus, we will use triangle inequality to avoid unnecessary distance calculations. The new scheme not only improves the convergence but also keeps the quality for fuzzy clustering.
Article
In this article, we are concerned with the formation of type-2 information granules in a two-stage approach. We present a comprehensive algorithmic framework which gives rise to information granules of a higher type (type-2, to be specific) such that the key structure of the local granular data, their topologies, and their diversities become fully reflected and quantified. In contrast to traditional collaborative clustering where local structures (information granules) are obtained by running algorithms on the local datasets and communicating findings across sites, we propose a way of characterizing granular data (formed) by forming a suite of higher type information granules to reveal an overall structure of a collection of locally available datasets. Information granules built at the lower level on a basis of local sources of data are weighted by the number of data they represent while the information granules formed at the higher level of hierarchy are more abstract and general, thus facilitating a formation of a hierarchical description of data realized at different levels of detail. The construction of information granules is completed by resorting to fuzzy clustering algorithms (more specifically, the well-known Fuzzy C-Means). In the formation of information granules, we follow the fundamental principle of granular computing, viz. , the principle of justifiable granularity. Experimental studies concerning selected publicly available machine-learning datasets are reported.
Article
The fuzzy local information C-means clustering algorithm (FLICM) is an important robust fuzzy clustering segmentation method, which has attracted considerable attention over the years. However, it lacks certain robustness to high noise or severe outliers. To improve the accuracy and robustness of the FLICM algorithm for images corrupted by high noise, a novel fuzzy local information c-means clustering utilizing total Bregman divergence (TFLICM) is proposed in this paper. The total Bregman divergence is modified by the local neighborhood information of sample to further enhance the ability to suppress noise, and then modified total Bregman divergence is introduced into the FLICM to construct a new objective function of robust fuzzy clustering, and the iterative clustering algorithm with high robustness is obtained through optimization theory. The convergence of the TFLICM algorithm is proved by the Zangwill theorem. In addition, the validity of the TFLICM algorithm applied in noise image segmentation is explained by means of sample weighting fuzzy clustering. Meanwhile, the generalized total Bregman divergence unifies the Bregman divergence with the total Bregman divergence and enhances the universality of the TFLICM algorithm applied in segmenting complex medical and remote sensing images. Some experimental results show that the TFLICM algorithm can obtain better segmentation quality and stronger anti-noise robustness than the existing FLICM algorithm.
Chapter
Following Chap. 9, this chapter continues to deal with clustering. We describe many associated topics such as the underutilization problem, robust clustering, hierarchical clustering, and cluster validity. Kernel-based clustering is introduced in Chap. 20.
Article
In this paper, a new dynamic merging concept for evolving clustering is presented. This means that the cluster partitions are incrementally learned on-line from the streams of data. The criterion of merging is based on the comparison between the sum of volumes of two clusters which fulfill the criteria of a minimal number of samples in the cluster and the expected volume of the new generated merged cluster. The newly generated merged cluster is conducted by using the weighted averaging of cluster centers and calculation of the joint covariance matrix from the covariance matrices of the clusters. It has been shown that the proposed new merging concept is very easy to implement, able to work on higher dimensional data sets, can perform all necessary computation in on-line manner, and produce reliable clusters.
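The merging step described above (weighted averaging of centers plus a joint covariance built from the two cluster covariances) can be illustrated with a minimal sketch. It assumes each cluster is summarized by its sample count, mean, and sample covariance (n − 1 denominator), and it uses the standard pooled-covariance identity; the abstract does not state the exact formula, so the function name `merge_clusters` and this particular identity are assumptions made for illustration only.

```python
def merge_clusters(n1, mean1, cov1, n2, mean2, cov2):
    """Merge two clusters given as (count, mean, sample covariance).

    Assumption: covariances use the n - 1 denominator, so the merged
    covariance equals the one computed directly from the pooled samples.
    """
    n = n1 + n2
    dim = len(mean1)
    # Weighted average of the two cluster centers
    mean = [(n1 * a + n2 * b) / n for a, b in zip(mean1, mean2)]
    # Offset between the two centers, needed for the between-cluster term
    diff = [a - b for a, b in zip(mean1, mean2)]
    cov = [[0.0] * dim for _ in range(dim)]
    for r in range(dim):
        for s in range(dim):
            within = (n1 - 1) * cov1[r][s] + (n2 - 1) * cov2[r][s]
            between = (n1 * n2 / n) * diff[r] * diff[s]
            cov[r][s] = (within + between) / (n - 1)
    return n, mean, cov
```

Because only counts, means, and covariances are needed, a merge of this kind can be carried out online without revisiting the raw samples, which is consistent with the streaming setting the abstract describes.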
Chapter
The burden distribution matrix is the key to guaranteeing long-term stable production of the blast furnace. The optimization of the burden distribution matrix aims to form a reasonable burden surface. It can help to achieve the goal of smooth, high-quality and low-consumption blast furnace production. This paper uses the blast furnace condition parameter to measure the burden distribution matrix. These data are characterized as panel data in statistics. The fuzzy c-means algorithm is used for clustering. Finally, evaluation indicators are used to analyze the clustering effect. The work has important reference value for actual blast furnace production.
Article
Image clustering is a key technique for better accomplishing image annotation and searching in large image repositories. Fuzzy c-means and its variations have achieved excellent performance on image clustering because they allow each image to belong to more than one cluster. However, these methods neglect the relations between different image clusters, and hence often suffer from the “cluster one-sidedness” problem, in which redundant centers are learned to characterize the same or similar image clusters. To address this issue, we propose a diverse fuzzy c-means for image clustering by introducing a novel diversity regularization into the traditional fuzzy c-means objective. This diversity regularization guarantees that the learned image cluster centers are different from each other and fill the image data space as much as possible. An efficient optimization algorithm is exploited to address the diverse fuzzy c-means objective, which is proved to converge to local optimal solutions and has a satisfactory time complexity. Experiments on synthetic and six image datasets demonstrate the effectiveness of the proposed method as well as the necessity of the diversity regularization.
Chapter
Full-text available
Conventional competitive learning-based clustering algorithms like C-means and LVQ are plagued by a severe initialization problem [57, 106]. If the initial values of the prototypes are not in the convex hull formed by the input data, clustering may not produce meaningful results.
Chapter
Full-text available
The list of publications includes one thousand bibliographical sources on fuzzy clustering. Some books and edited volumes, papers in proceedings of some conferences and edited volumes, and papers in refereed journals are presented in the bibliography.
Article
Data clustering is the generic process of splitting a set of datums into a number of homogenous sets. Nevertheless, although a clustering process inputs datums as a set of separate mathematical objects, these entities are in fact correlated within a spatial context specific to the problem class in hand. For example, when the data acquisition process yields a 2D matrix of regularly sampled measurements, as it is the case with image sensors which utilize different modalities, adjacent datums are highly correlated. Hence, the clustering process must take into consideration the spatial context of the datums. A review of the literature, however, reveals that a significant majority of the well-established clustering techniques in the literature ignore spatial context. Other approaches, which do consider spatial context, however, either utilize pre- or post-processing operations or engineer into the cost function one or more regularization terms which reward spatial contiguity. We argue that employing cost functions and constraints based on heuristics and intuition is a hazardous approach from an epistemological perspective. This is in addition to the other shortcomings of those approaches. Instead, in this paper, we apply Bayesian inference on the clustering problem and construct a mathematical model for data clustering which is aware of the spatial context of the datums. This model utilizes a robust loss function and is independent of the notion of homogeneity relevant to any particular problem class. We then provide a solution strategy and assess experimental results generated by the proposed method in comparison with the literature and from the perspective of computational complexity and spatial contiguity.
Article
Background/Objectives: In this paper we segment breast and brain Magnetic Resonance Images. Methods/Statistical Analysis: This automated process is implemented by a robust Fuzzy C-Means (FCM). This FCM needs a novel objective function. This is obtained by performing replacement. The source is the original Euclidean distance. Findings: The target is properties of a kernel function on feature space. This transformation uses Tsallis entropy. The effective objective functions are minimized. It results in membership partition matrices and successive prototypes with equation. The initial cluster reduces both the running time and computational complexity. A synthetic image with a benchmark dataset is used to perform initial experimental work. The method is then applied to real breast and brain magnetic resonance images in different regions. Conclusion/Improvements: The silhouette method shows better segmentation than the existing method.
Article
The rock fracture detection by image analysis is significant for fracture measurement and assessment engineering. The paper proposes a novel image segmentation algorithm for the centerline tracing of a rock fracture based on Hessian Matrix at Multi-scales and Steger algorithm. A traditional fracture detection method, which does edge detection first, then makes image binarization, and finally performs noise removal and fracture gap linking, is difficult for images of rough rock surfaces. To overcome the problem, the new algorithm extracts the centerlines directly from a gray level image. It includes three steps: (1) Hessian Matrix and Frangi filter are adopted to enhance the curvilinear structures, then after image binarization, the spurious-fractures and noise are removed by synthesizing the area, circularity and rectangularity; (2) On the binary image, Steger algorithm is used to detect fracture centerline points, then the centerline points or segments are linked according to the gap distance and the angle differences; and (3) Based on the above centerline detection roughly, the centerline points are searched in the original image in a local window along the direction perpendicular to the normal of the centerline, then these points are linked. A number of rock fracture images have been tested, and the testing results show that compared to other traditional algorithms, the proposed algorithm can extract rock fracture centerlines accurately.
Article
Objective: To determine the number of clustering categories of different MR T1WI adaptively using the histogram, and to achieve the adaptive segmentation by fuzzy c-means algorithm (FCM). Methods: Firstly, the smooth histogram envelope was fitted through the wavelet transform, in order to alleviate the impact of noises on finding the extremes of the envelope. Secondly, the number of envelope maxima was found according to the knowledge of calculus, and then the maximums of the envelope were filtered in accordance with the rules given in the paper, thereby the number of peaks of the histogram would be determined. Then MR images were segmented through FCM for which the number of clustering categories was equal to histogram peak number and the centers of clustering categories were the corresponding histogram peaks. Results: The number of clustering categories of multiple abdomen and brain MR image was determined effectively and adaptively with this method. Conclusion: This method can effectively and accurately determine the number of the clustering categories of different MR images, and so achieve the adaptive of FCM. Copyright © 2013 by the Press of Chinese Journal of Medical Imaging Technology.
Article
In this paper, a novel approach to clustering data with arbitrary shapes is proposed. In the partition process, a bisecting fuzzy c-means clustering is proposed to cluster the data set into a large number of sub-clusters. In the clustering process, two clusters are merged only if the distance-connectivity and the number-connectivity are high. The merging process using the dynamic model presented in this paper facilitates discovery of natural and homogeneous clusters. The experiments on some synthetic data sets show the validity of the proposed approach. The data sets contain clusters of different shapes, densities and size. Experimental results show that the proposed approach achieves a significant improvement as compared with other clustering algorithm.
Article
Blast furnace burden surface data derived from multiple radars were processed. Fuzzy C-means and feature weighted fuzzy C-means clustering were applied to identify the burden surface data according to the data information, and a standard burden surface model database was set up. Each target burden surface was matched with the model database by using the method of nearness in fuzzy pattern recognition, and this provides a basis for the next burden surface control. The algorithm was applied to a 2500 m³ blast furnace, and the control effect has been improved. The simulation results show the effectiveness of the proposed method.
Article
The operation of a blast furnace is directly affected by the charging distribution. Productivity can be boosted considerably if a good charging distribution strategy is adopted. At the same time, great economic benefits will be brought about. In our method, a large amount of burden surface data from radars are classified by using fuzzy c-means clustering, and the multiple models set of burden surface is built. When the expected burden surface is given, multiple control strategies are designed based on multiple burden surfaces of the model set, and multiple charge distributions are obtained. In every charging distribution period, the real time burden surface data will be matched with the model set by fuzzy recognition, and the corresponding charge distribution matrices will be selected for charge distribution until the expected burden surface is produced. A feedback mechanism is formed from the observed data of radars, and closed-loop control is realized. The proposed control strategy is applied to a 2500 m³ blast furnace in an Iron and Steel Plant; the control effect has been improved greatly, and energy conservation and consumption reduction are realized.
Article
Unsupervised clustering of a set of datums into homogeneous groups is a primitive operation required in many signal and image processing applications. In fact, different incarnations and hybrids of Fuzzy C-Means (FCM) and Possibilistic C-means (PCM) have been suggested which address additional requirements such as accepting weighted sets and being robust to the presence of outliers. Nevertheless, arriving at a general framework, which is independent of the datum model and the notion of homogeneity of a particular problem class, is a challenge. However, this process has not been followed organically and clustering algorithms are generally based on exogenous objective functions which are heuristically engineered and are believed to lead to the satisfaction of a required behavior. These techniques also commonly depend on regularization coefficients which are to be set “prudently” by the user or through separate processes. In contrast, in this work, we utilize Bayesian inference and derive a robustified objective function for a fuzzy–possibilistic clustering algorithm by assuming a generic datum model and a generic notion of cluster homogeneity. We utilize this model for the purpose of cluster validity assessment as well. We emphasize the epistemological importance of the theoretical basis on which the developed methodology rests. At the end of this paper, we present experimental results to exhibit the utilization of the developed framework in the context of four different problem classes.
Article
Because of the complex structure of rock images, using general image processing methods to segment ore-bearing rock particle images may cause uneven segmentation, under-segmentation and over-segmentation. In order to improve the accuracy of rock image segmentation, an ore image segmentation algorithm based on graph theory was proposed. In this algorithm, the first step was to reduce images by multi-scale analysis; then, the ore image was segmented by using Normalized Cut. Various categories of ore-bearing rocks were segmented by NCut, and in the experiment the segmentation images were compared with the results of traditional image processing methods such as thresholding, watershed, clustering analysis, FCM and other processing methods. The result of the experiment showed that for some specific rocks, the new algorithm is better than the traditional ones. ©, 2015, Editorial Department of Journal of Sichuan University. All right reserved.
Article
This article addresses the problem of incorporating an inclusion structure in the general class of fuzzy c-means algorithms. Conventionally, all the classes of fuzzy clustering algorithms involve a distance structure as the main tool to compute the interaction between the expected class prototypes and all the patterns. However, as the inclusion violates the basic metric assumptions, thereby it cannot be directly substituted for regular distance structure. The approach, advocated in this paper, consists of supporting the distance structure by a semi definite matrix A, which preserves the inclusion constraint globally for each class. Particularly, a graded inclusion index is put forward that takes into account the rational requirements underlying the definition of the inclusion of two Gaussian membership functions. Behaviour and algebraic properties of the proposed methodology are investigated. The proposed approach is then incorporated into the general fuzzy c-mean scheme, where the corresponding optimization problem is solved. Using both synthetic and real datasets, some illustrations are carried out in order to highlight the performances of the constructed algorithm and their evaluations, which are also compared to standard fuzzy c-means algorithm.
Chapter
Full-text available
Microarray generated gene expression data are characterized by their volume and by the intrinsic background noise. The main task of revealing patterns in gene expression data is typically carried out using clustering analysis, with soft clustering leading the more promising candidate methods. In this chapter, Fuzzy C-Means with a variable Focal Point (FCMFP) is exploited as the first stage in gene expression data analysis. FCMFP is inspired by the observation that the visual perception of a group of similar objects is (highly) dependent on the observer position. This metaphor is used to provide a new analysis insight, with different levels of granularity, over a gene expression dataset.
Article
A spatial constraints-based fuzzy clustering technique is introduced in the paper and the target application is classification of high resolution multispectral satellite images. This fuzzy-C-means (FCM) technique enhances the classification results with the help of a weighted membership function (W-mf). Initially, spatial fuzzy clustering (FC) is used to segment the targeted vegetation areas with the surrounding low vegetation areas, which include the information of spatial constraints (SCs). The performance of the FCM image segmentation is subject to appropriate initialization of W-mf and SC. It is able to evolve directly from the initial segmentation by spatial fuzzy clustering. The controlling parameters in fuzziness of the FCM approach, W-mf and SC, help to estimate the segmented road results, then the Stentiford thinning algorithm is used to estimate the road network from the classified results. Such improvements facilitate FCM method manipulation and lead to segmentation that is more robust. The results confirm its effectiveness for satellite image classification, which extracts useful information in suburban and urban areas. The proposed approach, spatial constraint-based fuzzy clustering with a weighted membership function (SCFCWmf), has been used to extract the information of healthy trees with vegetation and shadows showing elevated features in satellite images. The performance values of quality assessment parameters show a good degree of accuracy for segmented roads using the proposed hybrid SCFCWmf-MO (morphological operations) approach which also occluded nonroad parts. (C) 2014 Society of Photo-Optical Instrumentation Engineers (SPIE)
Article
Full-text available
A counterexample to the original incorrect convergence theorem for the fuzzy c-means (FCM) clustering algorithms (see J.C. Bezdek, IEEE Trans. Pattern Anal. and Mach. Intell., vol. PAMI-2, no. 1, pp. 1-8, 1980) is provided. This counterexample establishes the existence of saddle points of the FCM objective function at locations other than the geometric centroid of fuzzy c-partition space. Counterexamples previously discussed by W.T. Tucker (1987) are summarized. The correct theorem is stated without proof: every FCM iterate sequence converges, at least along a subsequence, to either a local minimum or saddle point of the FCM objective function. Although Tucker's counterexamples and the corrected theory appear elsewhere, they are restated as a caution not to further propagate the original incorrect convergence statement.
Article
Full-text available
The relational fuzzy c-means (RFCM) algorithm can be used to cluster a set of n objects described by pair-wise dissimilarity values if (and only if) there exist n points in R^(n−1) whose squared Euclidean distances precisely match the given dissimilarity data. This strong restriction on the dissimilarity data renders RFCM inapplicable to most relational clustering problems. This paper substantially improves RFCM by generalizing it to the case of arbitrary (symmetric) dissimilarity data. The generalization is obtained using a computationally efficient modification of the existing algorithm that is equivalent to applying a “spreading” transformation to the dissimilarity data. While the method given applies specifically to dissimilarity data, a simple transformation can be used to convert similarity relations into dissimilarity data, so the method is applicable to any numerical relational data that are positive, reflexive (or anti-reflexive) and symmetric. Numerical examples illustrate and compare the present approach to problems that can be studied with alternatives such as the linkage algorithms.
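As a rough illustration of the "spreading" idea, the sketch below adds a constant β to every off-diagonal dissimilarity while leaving the zero diagonal untouched, i.e. D_β = D + β(M − I) with M the all-ones matrix. The abstract does not spell out the transformation, so this form, the function name `beta_spread`, and the choice of β are assumptions; how β is selected inside the algorithm is not shown here.

```python
def beta_spread(dissimilarities, beta):
    """Return a copy of a square dissimilarity matrix with beta added
    to every off-diagonal entry (assumed form of the spreading step)."""
    n = len(dissimilarities)
    return [
        [dissimilarities[i][j] + (beta if i != j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 3-object dissimilarity matrix (symmetric, zero diagonal)
D = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
D_spread = beta_spread(D, 2.0)
```

The transformation keeps the matrix symmetric and the diagonal at zero, so the spread data remain valid relational input.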
Article
Full-text available
The hard and fuzzy c-means algorithms are widely used, effective tools for the problem of clustering n objects into (hard or fuzzy) groups of similar individuals when the data are available as object data, consisting of a set of n feature vectors in $\mathbb{R}^p$. However, object data algorithms are not directly applicable when the n objects are implicitly described in terms of relational data, which consist of a set of $n^2$ measurements of relations between each of the pairs of objects. New relational versions of the hard and fuzzy c-means algorithms are presented here for the case when the relational data can reasonably be viewed as some measure of distance. Some convergence properties of the algorithms are given along with a numerical example.
Article
Full-text available
In this paper, we present an efficient implementation of the fuzzy c-means clustering algorithm. The original algorithm alternates between estimating centers of the clusters and the fuzzy membership of the data points. The size of the membership matrix is on the order of the original data set, a prohibitive size if this technique is to be applied to very large data sets with many clusters. Our implementation eliminates the storage of this data structure by combining the two updates into a single update of the cluster centers. This change significantly affects the asymptotic runtime as the new algorithm is linear with respect to the number of clusters, while the original is quadratic. Elimination of the membership matrix also reduces the overhead associated with repeatedly accessing a large data structure. Empirical evidence is presented to quantify the savings achieved by this new method.
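The idea of folding the membership update into the center update can be sketched as follows. This is a minimal illustration under the standard FCM update equations, not the authors' implementation. The function name `fcm_center_update` and the guard `eps` are assumptions introduced here.

```python
def fcm_center_update(points, centers, m=2.0, eps=1e-12):
    """One FCM iteration that never stores the n-by-c membership matrix.

    Memberships for each point are computed on the fly, accumulated into the
    numerator and denominator of the center update, and then discarded.
    """
    c, dim = len(centers), len(points[0])
    num = [[0.0] * dim for _ in range(c)]  # running sums of u^m * x
    den = [0.0] * c                        # running sums of u^m
    for x in points:
        # Squared Euclidean distance to each current center (eps avoids 0 ** negative).
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, v)), eps) for v in centers]
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)  # shared normalizer, computed once per point
        for i in range(c):
            w = (inv[i] / total) ** m  # membership of x in cluster i, raised to m
            den[i] += w
            for j in range(dim):
                num[i][j] += w * x[j]
    return [[num[i][j] / den[i] for j in range(dim)] for i in range(c)]


new_centers = fcm_center_update([[0.0], [1.0], [9.0], [10.0]], [[0.0], [10.0]])
```

Because the normalizer is computed once per point, the work per point grows linearly with the number of clusters, and only c running sums are kept in memory.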
Article
Approximating clusters in very large (VL=unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility—a way to find (approximate) clusters—for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy (c-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy c-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.
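Step (iii), the non-iterative extension, can be sketched for the fuzzy case as a single pass that applies the standard FCM membership formula to the unsampled data, using the centers found on the sample. The sketch assumes that form of extension; the details of geFFCM may differ, and `extend_memberships` is a hypothetical name.

```python
def extend_memberships(points, centers, m=2.0, eps=1e-12):
    """Compute FCM memberships for new points from fixed cluster centers.

    No iteration is needed: each point is processed once and independently,
    so the remaining data can be streamed through in chunks.
    """
    rows = []
    for x in points:
        # Squared Euclidean distance to each center (eps guards exact hits).
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, v)), eps) for v in centers]
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        rows.append([t / total for t in inv])  # each row sums to 1
    return rows


# Centers as they might come out of clustering a sample (made-up values).
sample_centers = [[0.0, 0.0], [4.0, 0.0]]
memberships = extend_memberships([[1.0, 0.0]], sample_centers)
```

For the point (1, 0) with m = 2 the squared distances are 1 and 9, giving memberships of 0.9 and 0.1.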
Article
Different extensions of fuzzy c-means (FCM) clustering have been developed to approximate FCM clustering in very large (unloadable) image (eFFCM) and object vector (geFFCM) data. Both extensions share three phases: (1) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (2) clustering with (literal or exact) FCM; and (3) noniterative extension of the literal clusters to the remainder of the data set. This article presents a comparable method for the remaining case of interest, namely, clustering in VL relational data. We propose and discuss each of the four phases of eNERF, our algorithm for this last case: (1) finding distinguished features that monitor progressive sampling, (2) progressively sampling a square N × N relation matrix $R_N$ until an n × n sample relation $R_n$ passes a statistical test, (3) clustering $R_n$ with literal non-Euclidean relational fuzzy c-means, and (4) extending the clusters in $R_n$ to the remainder of the relational data. The extension phase in this third case is not as straightforward as it was in the image and object data cases, but our numerical examples suggest that eNERF has the same approximation qualities that eFFCM and geFFCM do. © 2006 Wiley Periodicals, Inc. Int J Int Syst 21: 817–841, 2006.
Article
We present a method for sampling feature vectors in large (e.g., 2000 × 5000 × 16 bit) images that finds subsets of pixel locations which represent c "regions" in the image. Samples are accepted by the chi-square (χ²) or divergence hypothesis test. A framework that captures the idea of efficient extension of image processing algorithms from the samples to the rest of the population is given. Computationally expensive (in time and/or space) image operators (e.g., neural networks (NNs) or clustering models) are trained on the sample, and then extended noniteratively to the rest of the population. We illustrate the general method using fuzzy c-means (FCM) clustering to segment Indian satellite images. On average, the new method can achieve about 99% accuracy (relative to running the literal algorithm) using roughly 24% of the image for training. This amounts to an average savings of 76% in CPU time. We also compare our method to its closest relative in the group of schemes used to accelerate FCM: our method averages a speedup of about 4.2, whereas the multistage random sampling approach achieves an average acceleration of 1.63.
Article
In this letter, we give a new, more direct derivation of the convergence properties of the fuzzy c-means (FCM) algorithm, using the equivalence between the original and reduced FCM criterion. From the point of view of the reduced criterion, the FCM algorithm is simply a steepest descent algorithm with variable steplength. We prove that steplength adjustment follows from the majorization principle for steplength. By applying the majorization principle we give a straightforward proof of global convergence. Further convergence properties follow immediately using known results of optimization theory.
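For orientation, the relationship between the two criteria can be written out as below. This assumes that "reduced criterion" means the FCM objective with the memberships eliminated, that m > 1, and that all squared distances d_ik are positive. The notation (J_m, R_m, u_ik, v_i) is the conventional one and may differ from the letter's.

```latex
\begin{align*}
J_m(U, V) &= \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} \, d_{ik},
  \qquad d_{ik} = \lVert x_k - v_i \rVert^{2}, \\
u_{ik} &= \frac{d_{ik}^{-1/(m-1)}}{\sum_{j=1}^{c} d_{jk}^{-1/(m-1)}}
  \quad \text{(optimal memberships for fixed } V\text{)}, \\
R_m(V) &= \min_{U} J_m(U, V)
  = \sum_{k=1}^{n} \Bigl( \sum_{i=1}^{c} d_{ik}^{1/(1-m)} \Bigr)^{1-m}.
\end{align*}
```

Substituting the optimal memberships into J_m gives, for each datum k, the term s_k^{1-m} with s_k = Σ_i d_ik^{-1/(m-1)}, which is the summand of R_m.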
Article
In this paper, we revisit the convergence and optimization properties of fuzzy clustering algorithms, in general, and the fuzzy c-means (FCM) algorithm, in particular. Our investigation includes probabilistic and (a slightly modified implementation of) possibilistic memberships, which will be discussed under a unified view. We give a convergence proof for the axis-parallel variant of the algorithm by Gustafson and Kessel, that can be generalized to other algorithms more easily than in the usual approach. Using reformulated fuzzy clustering algorithms, we apply Banach's classical contraction principle and establish a relationship between saddle points and attractive fixed points. For the special case of FCM we derive a sufficient condition for fixed points to be attractive, allowing identification of them as (local) minima of the objective function (excluding the possibility of a saddle point).
Article
Clustering is a useful approach in image segmentation, data mining, and other pattern recognition problems for which unlabeled data exist. Fuzzy clustering using fuzzy c-means or variants of it can provide a data partition that is both better and more meaningful than hard clustering approaches. The clustering process can be quite slow when there are many objects or patterns to be clustered. This paper discusses the algorithm brFCM, which is able to reduce the number of distinct patterns which must be clustered without adversely affecting the partition quality. The reduction is done by aggregating similar examples and then using a weighted exemplar in the clustering process. The reduction in the amount of clustering data allows a partition of the data to be produced faster. The algorithm is applied to the problem of segmenting 32 magnetic resonance images into different tissue types and the problem of segmenting 172 infrared images into trees, grass and target. Average speed-ups of as much as 59-290 times a traditional implementation of fuzzy c-means were obtained using brFCM, while producing partitions that are equivalent to those produced by fuzzy c-means.
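The aggregation step can be sketched for one-dimensional integer data such as gray levels. The sketch assumes that similar examples are grouped by dropping low-order bits and that each group is represented by its mean, weighted by its count. The names `bin_values` and `weighted_fcm_1d` are hypothetical, and this is not the brFCM code itself.

```python
def bin_values(values, shift=0):
    """Group integer values that agree after dropping `shift` low-order bits.

    Returns one exemplar per bin (the mean of its members) and the bin counts.
    """
    sums, counts = {}, {}
    for v in values:
        key = v >> shift
        sums[key] = sums.get(key, 0) + v
        counts[key] = counts.get(key, 0) + 1
    keys = sorted(sums)
    return [sums[k] / counts[k] for k in keys], [counts[k] for k in keys]


def weighted_fcm_1d(exemplars, weights, centers, m=2.0, iters=50, eps=1e-12):
    """Weighted FCM on scalars: each exemplar counts `weight` times in the center update."""
    centers = list(centers)
    for _ in range(iters):
        num = [0.0] * len(centers)
        den = [0.0] * len(centers)
        for x, w in zip(exemplars, weights):
            d2 = [max((x - v) ** 2, eps) for v in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            for i, t in enumerate(inv):
                um = w * (t / total) ** m  # weight times membership^m
                num[i] += um * x
                den[i] += um
        centers = [a / b for a, b in zip(num, den)]
    return centers


# Made-up gray levels: 140 pixels but only 4 distinct values.
pixels = [10] * 50 + [12] * 30 + [200] * 40 + [202] * 20
exemplars, weights = bin_values(pixels)  # shift=0 merges only identical values
centers = weighted_fcm_1d(exemplars, weights, [0.0, 255.0])
```

With `shift=0` only identical values are merged, so the weighted run follows the same iterates as unweighted FCM on all 140 pixels while touching four exemplars per pass. A larger `shift` trades some precision for fewer exemplars.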