Multifractal-based cluster hierarchy optimisation algorithm.

IJBIDM 01/2008; 3:353-374. DOI: 10.1504/IJBIDM.2008.022734
Source: DBLP

ABSTRACT A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. In a real-life data set, however, the many initial clusters produced by a clustering algorithm are themselves similar to one another to a greater or lesser degree. An analyst who knows nothing about these similarities will have difficulty carrying out further analysis. It is therefore valuable to analyse these similarities and construct a hierarchy over the initial clusters. Traditional clustering methods are ill-suited to this cluster postprocessing problem because they favour spherical clusters, rest on impractical assumptions and require multiple scans of the data set. Based on multifractal theory, we propose the MultiFractal-based Cluster Hierarchy Optimisation (MFCHO) algorithm, which combines cluster similarity with cluster shape and cluster distribution to construct a cluster hierarchy tree from the disjoint initial clusters. The elementary time-space complexity of the MFCHO algorithm is presented. Several comparative experiments on synthetic and real-life data sets demonstrate the performance and effectiveness of MFCHO.

  • Source
    WSEAS Transactions on Information Science and Applications 1(1):73-81.
  • Source
    ABSTRACT: Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
    Information Systems 01/1998; 1.77 Impact Factor
  • Source
    ABSTRACT: Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, or to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm, based on the fractal properties of the data sets. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality and noise, and is capable of recognizing clusters of arbitrary shape. Categories and Subject Descriptors: I.5.3 [Computing Methodologies]: Pattern Recognition - Clustering. General Terms: Fractals.
    Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining; 01/2000
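
The placement rule in the Fractal Clustering abstract above (add each point to the cluster whose fractal dimension changes least) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes 2-D points in the unit square, uses a plain box-counting estimate of the fractal dimension, recomputes that estimate from scratch for every candidate cluster, and starts from clusters that already exist. The names `SCALES`, `box_counting_dimension` and `assign_point` are hypothetical and chosen only for this example.

```python
import math

# Assumed grid resolutions: number of cells per side of the unit square.
SCALES = (2, 4, 8, 16)

def box_counting_dimension(points, scales=SCALES):
    """Estimate the box-counting dimension of 2-D points in [0, 1)^2."""
    xs, ys = [], []
    for k in scales:
        # Set of grid cells (at resolution k) that contain at least one point.
        occupied = {(int(x * k), int(y * k)) for x, y in points}
        xs.append(math.log(k))
        ys.append(math.log(len(occupied)))
    # Least-squares slope of log N(k) against log k.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

def assign_point(point, clusters):
    """Append point to the cluster whose estimated dimension changes least.

    Returns the index of the chosen cluster.
    """
    best, best_change = None, math.inf
    for i, members in enumerate(clusters):
        before = box_counting_dimension(members)
        after = box_counting_dimension(members + [point])
        change = abs(after - before)
        if change < best_change:
            best, best_change = i, change
    clusters[best].append(point)
    return best

# Two illustrative clusters: a horizontal and a vertical line segment.
horizontal = [(i / 1000, 0.1) for i in range(1000)]
vertical = [(0.9, i / 1000) for i in range(1000)]
clusters = [horizontal, vertical]

# A new point lying on the horizontal segment.
chosen = assign_point((0.25, 0.1), clusters)
```

In this toy setting the new point falls into grid cells the horizontal segment already occupies, so that cluster's estimate does not move, whereas adding it to the vertical segment occupies extra cells at every scale. Recomputing the estimate per candidate is done here only for clarity; it should not be read as how FC achieves its single scan of the data.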