ArticlePDF Available

Figures

M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
229
 
http://www.TJMCS.com


A survey of hierarchical clustering algorithms
Marjan Kuchaki Rafsanjani


Zahra Asghari Varzaneh


Nasibeh Emami Chukanlo




Abstract
            
            
         


Keywords
2010 Mathematics Subject Classification: Primary 91C20; Secondary 62D05.
1. Introduction
              
 

            
       


The Journal of
Mathematics and Computer Science
M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
230
             

         

2. Clustering process

                




XCiCj
Fig. 1 
        

       
            
  

3. Clustering algorithms

Sequential algorithms:         
  
            

Hierarchical clustering algorithms: 
M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
231
Agglomerative algorithms (bottom-up, merging): 
m

Divisive algorithms (top-down, splitting):  
          m    
             

Clustering algorithms based on cost function optimization: 
   J     
     m         
    J.
       
 

Other
           
           


Fig. 2. 
4. Hierarchical clustering algorithms
    
             
               
       

 
  
  Fuzzy C.A   
  


M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
232
Fig. 3. 
         

   

             
       
 
    
       


Fig. 4. 
           
   

M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
233
Fig. 5. 
5. Specific algorithms
            

5.l. CURE (Clustering Using REpresentatives)
 
   

    
 

      

Fig. 6. 
   
     
    

5.1.1 Disadvantage of CURE



  
     
 

M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
234
5.2 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
     
  
  

        


  

Fig. 7. 
      
             


 On
           

5.2.1 Advantages of BIRCH



5.2.2 Disadvantages of BIRCH

M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
235


5.3 ROCK (RObust Clustering using linKs)
     




          
              
   
         
  

        

                 
             

Fig. 8. 
5.4 CHAMELEON


            

  
          

Fig. 9. 
M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
236

nOn mm 

5.4.1 Disadvantage of CHAMELEON


5.5 Linkage algorithms

    S Ave 
   
            


           
         
       
            
             
  
        
            
 

5.5.1 Disadvantages of linkage algorithm
S       


      



5.6 Leaders–Subleaders
              
             
M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
237
            
         
     




     
    
             
         
            

        
d
1

Fig. 10. 
5.7 Bisecting k-means
              

                
    
         
     
    
       

M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
238
6. Comparison of algorithms
             


         
       
            
 

              

           

   
           

Table 1.
Algorithms
Hierarchical For
large
data
set
Sensitive to
Outlier/Noise
Model
Time complexity Space complexity
agglomerative divisive Static Dynamic
CURE
Less sensitive to noise
O(n2logn)
O(n)
BIRTH
Handle noise
effectively O(n)
ROCK
- O(n2+mmma+n2logn) O(min{n2,nmmma})
Chameleon
_
O(n(log2n +m))
S
-lin
k
Sensitive to outlier _ O(n2logn) O(n2
)
Ave-lin
k
_
_  
Com-link
Not strongly affected
by outliers  
O(n3) : obvious algorithm
O(n2logn) : priority queues O(n2
)
O(n2
)
O(n)
O(nlog2n) : Euclidean plan O(n)
O(n logn + nlog2(1/ε)
:ε- approximation
O(n)
Leader-
Subleaders
_
O(ndh) :h=2 O((L+SL)d)
BKMS
_
  O(nk)
m: 



M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
239
7. Conclusion
            

  

Acknowledgment. 

References.
    

     
            

   
    


           


           

           

    


           

             
    

  
            

          
             



M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, N. Emami Chukanlo / TJMCS Vol .5 No.3 (2012) 229-240
240


            

 

    

  
        

           




... It has been successfully applied to tasks such as speech recognition, image classification, and predicting stock prices. In general, supervised learning is a powerful tool for solving many types of problems and has the potential to improve the efficiency and accuracy of decision-making in a wide range of industries (Gourisaria et al., 2020). is an algorithm that is designed to identify clusters of high density, and can also identify outliers (points that do not belong to any cluster) (Kuchaki et al., 2012). According to Campello et al. (2015), DBSCAN is a densitybased algorithm that is particularly well-suited for identifying clusters of arbitrary shape. ...
... Unsupervised learning: A model is trained with unlabeled data in this machine learning approach(Adi et al., 2019). The model is then asked to explore the data for patterns and correlations without being instructed on what to look for.Numerous unsupervised learning methods exist, including:• Clustering algorithms: These algorithms are a type of unsupervised machine learning algorithm that divide a dataset into groups, or clusters, of similar items(Kuchaki et al., 2012). These algorithms are used to discover patterns and relationships within data that may not be immediately obvious. ...
... II. Hierarchical clustering: Hierarchical clustering creates a tree-like structure (called a dendrogram) that shows how the data can be divided into clusters(Kuchaki et al., 2012). Hierarchical clustering may be divided into two primary categories: agglomerative, which begins with individual points and groups them into clusters, and divisive, which begins with the complete dataset and divides it into smaller clusters. ...
Thesis
Full-text available
This thesis presents a machine-learning approach to the exploratory data analysis and prediction of unemployment among individuals with mental health challenges. The goal of the research is to understand the factors that contribute to unemployment among this population and to develop models that can accurately predict unemployment among individuals with mental health challenges. The purpose of the study is to address the gap in the existing literature by examining the relationship between mental health issues and unemployment. Specifically, the study aims to explore this relationship through the use of exploratory data analysis and various machine learning models, including logistic regression, random forest, support vector machine (SVM), naive Bayes, decision tree classifier, and k-nearest neighbors. The study will seek to identify patterns and trends in the data related to unemployment among people with mental health issues. By utilizing multiple machine learning models, the study can evaluate which models perform best in predicting unemployment among this population. The study employed a comprehensive data analysis and pre-processing technique to prepare the dataset for modeling. The models were trained and evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. The results were compared, and the best model was selected based on the highest overall performance. The study found that the SVM model performed the best with an accuracy of 91%, followed by Random Forest with an accuracy of 89%. The results of this study provide valuable insights for policymakers and practitioners in the field of mental health and unemployment by demonstrating the effectiveness of machine learning models in predicting unemployment among individuals with mental health challenges. This study also makes a significant contribution to the field by providing a more comprehensive understanding of the relationship between mental health and unemployment through the use of multiple advanced machine learning models. The benefits of this study include providing a more comprehensive understanding of the relationship between mental health and unemployment, identifying an effective machine learning model for predicting unemployment among individuals with mental health challenges, providing valuable insights for policymakers and practitioners in the field of mental health and unemployment, and making a significant contribution to the field of mental health and unemployment research. This study can be used as a reference for future research on this topic, and also, it can be used by the policymakers to make a decision and also help the practitioners on how to approach this issue. Keywords: Unemployment, mental health, exploratory data analysis, machine learning, prediction
... After obtaining sentence embeddings with/without TSDAE training, hierarchical (agglomerative) [64] (H-AC) and density-based clustering based on hierarchical density estimates (HDBSCAN) [65,66] algorithms with cosine similarity metric and Euclidean distance were utilized for semantically clustering the answer sentences previously classified as a misconception. We executed an extensive hyper-parameter search and evaluation for both algorithms. ...
... One of the clustering algorithms is the agglomerative type, and the other is a density-based clustering. After obtaining sentence embeddings with and without TSDAE training, hierarchical agglomerative (H-AC) [64], density-based clustering based on hierarchical density estimates (HDBSCAN) [65], and the cosine similarity metric were used to cluster the answer sentences previously categorized semantically as a misconception. ...
... Hierarchical clustering (H-AC) [64] is a group of methods that differ in figuring how far apart things are. Aside from the usual selection of distance functions, the user also has to determine the linkage criterion. ...
... After obtaining sentence embeddings with/without TSDAE training, hierarchical (agglomerative) [64] (H-AC) and density-based clustering based on hierarchical density estimates (HDBSCAN) [65,66] algorithms with cosine similarity metric and Euclidean distance were utilized for semantically clustering the answer sentences previously classified as a misconception. We executed an extensive hyper-parameter search and evaluation for both algorithms. ...
... One of the clustering algorithms is the agglomerative type, and the other is a density-based clustering. After obtaining sentence embeddings with and without TSDAE training, hierarchical agglomerative (H-AC) [64], density-based clustering based on hierarchical density estimates (HDBSCAN) [65], and the cosine similarity metric were used to cluster the answer sentences previously categorized semantically as a misconception. ...
... Hierarchical clustering (H-AC) [64] is a group of methods that differ in figuring how far apart things are. Aside from the usual selection of distance functions, the user also has to determine the linkage criterion. ...
... After obtaining sentence embeddings with/without TSDAE training, hierarchical (agglomerative) [64] (H-AC) and density-based clustering based on hierarchical density estimates (HDBSCAN) [65,66] algorithms with cosine similarity metric and Euclidean distance were utilized for semantically clustering the answer sentences previously classified as a misconception. We executed an extensive hyper-parameter search and evaluation for both algorithms. ...
... One of the clustering algorithms is the agglomerative type, and the other is a density-based clustering. After obtaining sentence embeddings with and without TSDAE training, hierarchical agglomerative (H-AC) [64], density-based clustering based on hierarchical density estimates (HDBSCAN) [65], and the cosine similarity metric were used to cluster the answer sentences previously categorized semantically as a misconception. ...
... Hierarchical clustering (H-AC) [64] is a group of methods that differ in figuring how far apart things are. Aside from the usual selection of distance functions, the user also has to determine the linkage criterion. ...
Article
Students’ misconceptions of various topics in physics have been investigated by many researchers. The detection of misconceptions is very difficult and takes a long time as a human being. Our aim in the study carried out is to determine the misconceptions of the students regarding the concept of the atom by machine instead of humans. This study proposes two novel methods: the Transformers model and the fastText algorithm, to classify the students’ answers. Since there is currently no Turkish language model for physics-related questions or the physics domain, we trained a transformer model from scratch for this domain using transfer learning and domain adaptation techniques. In the second part of this research, we proposed an unsupervised learning approach to accurately understand and identify the reasons behind the misconceptions. For this purpose, we utilized sentence transformers to obtain vector representation of the sentences with transformer-based denoising autoencoder training. We then used two clustering algorithms: an agglomerative one and a density-based one, to group similar sentences in a high-dimensional vector space. Again, for the first time, the unsupervised transformer-based denoising autoencoder training of the sentence transformers was employed for the Turkish language to provide domain adaptation for sentence transformers. Finally, we compared the human performance (experts’ opinions) and the proposed method results for both the classification and the clustering tasks according to the kappa metric. According to our results, we managed to distinguish misconceptions with a high accuracy of between 0.97 and 1.00 with our proposed methodology.
... Agglomerative clustering starts with n clusters each containing a single object, and it is ensured that all objects are included in a certain number of clusters with a series of merging operations. Important algorithms that use the agglomerative method in hierarchical clustering are singlelink (SLINK), average-link (ALINK), and complete-link (CLINK) (Rafsanjani et al., 2012). In a standard agglomerative hierarchical algorithm (e.g., CLINK), the nxn distance matrix is scanned exhaustively in each of the n-1 iterations to find the lowest cost, which makes the complexity of the algorithm O(n 3 ). ...
Article
Abstract Clustering is a prominent research area, with numerous studies and the development of hundreds of algorithms over the years. However, a fundamental challenge in clustering research is the trade-off between algorithm speed and clustering quality. Existing algorithms tend to prioritize either fast execution with compromised clustering quality or slower performance with superior clustering results. In this study, we propose a novel CDC-2 algorithm, an improved version of the Critical Distance Clustering (CDC) algorithm, to address this challenge. Inspired by the concepts of hybridization in biology and the division of labor in the economic system, we present a new hybridization strategy. Our approach integrates the connectivity and coherence aspects of the K-means and CDC-2 algorithms, respectively, allowing us to combine speed and quality in a single algorithm. This approach is referred to as the CDC++ algorithm, and it is characterized as a hybrid that combines elements from two algorithms, K-means and CDC-2, in order to leverage their strengths while mitigating their weaknesses. Moreover, the structure and mechanism of the CDC++ algorithm led to the introduction of a new concept called “object autoencoder.” Unlike traditional feature reduction methods, this concept focuses on object reduction, representing a significant advancement in clustering techniques. To validate our approach, we conducted experimental studies on thirteen synthetic and five real datasets. Comparative analysis with four well-known algorithms demonstrates that our proposed development and hybridization enable efficient processing of large-scale and high-dimensional datasets without compromising clustering quality.
... Hierarchical clustering [38][39][40]: Similar constraints were clustered into higher-level classes until it was no longer possible to merge them unless shifting to a more generic (or hierarchically higher) class [42]. In this way, two further classes (levels 02 and 03) were added (see Section 3.3). ...
Article
Full-text available
In many countries, depending on climatic conditions and the energy performance of buildings, the built stock is highly energy-consuming and constitutes a main source of greenhouse gas emissions. This is particularly true for Europe, where most of the existing buildings were built before 2001. For this reason, EU policies have focused on the Deep Energy Renovation Process of the residential building stock as the mainstream way for its decarbonization strategy by 2050. Based on a broad investigation of seven EU local retrofitting markets carried out within the H2020 re-MODULEES project, this paper defines a holistic methodology for understanding and facing the complexity of the renovation market and its inner constraints. Thanks to systematic surveys and the activation of stakeholders’ core groups (re-LABs), the main market barriers (cultural, social, technical, processual, and financial) were explored. Through a bottom-up clustering approach and vote analysis, a relevance classification of constraints of each pilot market and a detailed scenario of the most relevant market constraints at the European level were provided. This scalable methodology offers the baseline necessary for shaping more effective, cooperative, and tailored-made policies aimed at overcoming the current limitations to the full deployment of the Deep Energy Renovation Process (DERP) across the European markets.
... Metode clustering dibagi menjadi dua yakni metode hierarki dan non hierarki. Metode hierarki merupakan teknik clustering yang membentuk tingkatan tertentu seperti struktur pohon dan hasilnya disajikan dalam bentuk dendogram, metode ini yakni Single Linkage, Complete Linkage, Average Linkage, Ward, dan Centroid [9]. Sedangkan metode non hierarki prosedurnya dimulai dengan memilih sejumlah nilai cluster awal, kemudian objek pengamatan digabungkan ke dalam cluster-cluster tersebut, metode ini meliputi Sequential Threshold, Parallel Threshold, dan Optimizing Partitioning, salah satu metode Optimizing Partitioning yang sangat terkenal adalah K-Means [10]. ...
Article
Inflasi merupakan suatu kondisi perekonomian yang menujukkan adanya kecenderungan kenaikan tingkat harga umum karena barang dan jasa yang ada di pasaran mempunyai jumlah dan jenis yang sangat beragam, sebagian besar dari harga-harga tersebut selalu meningkat dan mengakibatkan terjadinya inflasi. Perhitungan laju inflasi salah satunya adalah menggunakan Indeks Harga Konsumen (IHK), indeks ini disusun dari harga barang dan jasa yang dikonsumsi oleh masyarakat. Penelitian ini membahas tentang perbandingan pengelompokan kota di Indonesia berdasarkan indikator inflasi metode Ward dan K-Means dengan menggunakan 11 variabel kelompok pengeluaran pada IHK. Evaluasi metode dilakukan dengan membandingkan nilai rasio simpangan baku dari masing-masing metode. Berdasarkan hasil penelitian jumlah cluster yang dihasilkan adalah sebanyak 3 cluster , dengan nilai rasio simpangan baku yang diperoleh menggunakan metode Ward adalah 1,77, sedangkan dengan metode K-Means adalah sebesar 1,43. Dengan demikian hasil cluster yang bisa digunakan sebagai rujukan pada pengelompokan kota di Indonesia berdasarkan indikator inflasi adalah hasil analisis cluster menggunakan metode K-Means .
... In particular, principal component analysis (PCA) [28][29][30][31] was used to reduce the data into a two-dimensional space that is useful for visualizing pathogens, AMR genes, and antimicrobials with detected resistance. Based on the projection of data on the two-dimensional space, hierarchical clustering [32][33][34] was further implemented to study the relationships and similarities shown in those pathogens, genes, or antimicrobials. Finally, the occurrence frequencies of antimicrobial resistance cases were projected onto the U.S. map and then presented in time charts to study the spatiotemporal trends of the AMR status and cross-validate the findings from multivariate statistical analysis. ...
Article
Full-text available
Foodborne pathogens pose substantial health hazards and result in considerable economic losses in the U.S. Fortunately, the National Center for Biotechnology Information Pathogen Detection Isolates Browser (NPDIB) provides valuable access to antimicrobial resistance (AMR) genes and antimicrobial assay data. This study aimed to conduct the first comprehensive investigation of AMR genes in pathogens isolated from U.S. cattle over the past decade, driven by the urgent need to address the dangers of AMR specifically originating in pathogens isolated from U.S. cattle. In this study, around 28,000 pathogen isolate samples were extracted from the NPDIB and then analyzed using multivariate statistical methods, mainly principal component analysis (PCA) and hierarchical clustering (H-clustering). These approaches were necessary due to the high dimensions of the raw data. Specifically, PCA was utilized to reduce the dimensions of the data, converting it to a two-dimensional space, and H-clustering was used to better identify the differences among data points. The findings from this work highlighted Salmonella enterica and Escherichia coli as the predominant pathogens among the isolates, with E. coli being the more concerning pathogen due to its increasing prevalence in recent years. Moreover, tetracycline was observed as the most commonly resistant antimicrobial, with the resistance genes mdsA, mdsB, mdtM, blaEC, and acrF being the most prevalent in pathogen isolates from U.S. cattle. The occurrence of mdtM, blaEC, acrF, and glpT_E448k showed an increase in pathogens isolated from U.S. cattle in recent years. Furthermore, based on the data collected for the locations of AMR cases, Texas, California, and Nebraska were the major areas carrying major AMR genes or antimicrobials with detected resistance. The results from this study provide potential directions for targeted interventions to mitigate pathogens’ antimicrobial resistance in U.S. cattle.
Article
Full-text available
The need for information is gradually shifting from text to images due to the technology’s growth and increase in digital images. It is quite challenging for people to find similar color images. To obtain similarity matching, the color of the image needs to be identified. This paper aims at various clustering techniques to identify the color of the digital image. Though many clustering techniques exist, this paper focuses on Fuzzy c-Means, Mean-Shift, and a hybrid technique that amalgamates the agglomerative hierarchies and k-Means, known as hKmeans to cluster the intensity of the image. Applying evaluation metrics of Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, Homogeneity, Completeness, V-Score, and Peak signal-to-noise ratio it is proven that the results obtained demonstrate the good performance of the proposed technique. Then the color histogram is applied to identify the color and differentiate the color distribution on the original and clustered image.
Article
Contextual understanding is a key aspect for learning a new domain through web search more effectively for making informed decisions. And with advent of machine learning approaches, it becomes even more fast and robust that enable collaboration between machine algorithms and humans. However, human expertise still holds the key for new domain, which has been proposed in this study as a key step in unsupervised learning approach of k-means clustering technique. Domain search term and context terms for the new domain are added to the clustering technique, and the relevance of the resultant groups has been tested. Context setting helps to analyse and understand the content of documents and other sources of information. For a new domain like Algorithmic Government, which does not have many documents on the web, it was found that contextual learning was up to 40% more relevant than the normal learning approach. The qualitative aspect of the clusters was found much better by the experts than quantitative aspect due to availability of lesser number of search documents. It was found that scientific research also supports the groups formed during contextual learning approach. This approach should help government to better understand and respond to the needs and concerns of their citizens by deriving better data insights in quick time, and to make more informed, evidence-based decisions, and sensitive to the needs and values of different communities and stakeholders. And thus, many stakeholders in the new domain can use this approach for exploration, research, policy formulation, strategizing, implementing and testing the various learnt concepts. A total of 15 search engines were used in the experimental settings with thousands of web crawling being done using Carrot ² engine. Text embedding was done using bag-of-word technique and k-means clustering was implemented for producing 25 clusters across the two types of learnings.
Article
Full-text available
In this paper, two clustering algorithms called dynamic hierarchical compact and dynamic hierarchical star are presented. Both methods aim to construct a cluster hierarchy, dealing with dynamic data sets. The first creates disjoint hierarchies of clusters, while the second obtains overlapped hierarchies. The experimental results on several benchmark text collections show that these methods not only are suitable for producing hierarchical clustering solutions in dynamic environments effectively and efficiently, but also offer hierarchies easier to browse than traditional algorithms. Therefore, we advocate its use for tasks that require dynamic clustering, such as information organization, creation of document taxonomies and hierarchical topic detection.
Article
Full-text available
Clustering, in data mining, is useful to discover distribution patterns in the underlying data. Clustering algorithms usually employ a distance metric based (e.g., euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we study clustering algorithms for data with boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.
Article
Clustering is an activity of finding abstractions from data and these abstractions can be used for decision making [1]. In this paper, we select the cluster representatives as prototypes for efficient classification [3]. There are a variety of clustering algorithms reported in the literature. However, clustering algorithms that perform multiple scans of large databases (of size in Tera bytes) residing on the disk demand prohibitive computational times. As a consequence, there is a growing interest in designing clustering algorithms that scan the database only once. Algorithms like BIRCH [2], Leader [5] and Single-pass k-means algorithm [4] belong to this category.
Article
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
Article
In this paper, an efficient hierarchical clustering algorithm, suitable for large data sets is proposed for effective clustering and prototype selection for pattern classification. It is another simple and efficient technique which uses incremental clustering principles to generate a hierarchical structure for finding the subgroups/subclusters within each cluster. As an example, a two level clustering algorithm––‘Leaders–Subleaders’, an extension of the leader algorithm is presented. Classification accuracy (CA) obtained using the representatives generated by the Leaders–Subleaders method is found to be better than that of using leaders as representatives. Even if more number of prototypes are generated, classification time is less as only a part of the hierarchical structure is searched.
Article
The ROCK algorithm is an agglomerative hierarchical clustering algorithm for clustering categorical data [Guha S., Rastogi, R., Shim, K., 1999. ROCK: A robust clustering algorithm for categorical attributes. In: Proc. IEEE Internat. Conf. Data Engineering, Sydney, March 1999]. In this paper we prove that under certain conditions, the final clusters obtained by the algorithm are nothing but the connected components of a certain graph with the input data-points as vertices. We propose a new algorithm QROCK which computes the clusters by determining the connected components of the graph. This leads to a very efficient method of obtaining the clusters giving a drastic reduction of the computing time of the ROCK algorithm. We also justify that it is more practical for specifying the similarity threshold rather than specifying the desired number of clusters a priori. The QROCK algorithm also detects the outliers in this process. We also discuss a new similarity measure for categorical attributes.
Article
Techniques based on agglomerative hierarchical clustering constitute one of the most frequent approaches in unsupervised clustering. Some are based on the single linkage methodology, which has been shown to produce good results with sets of clusters of various sizes and shapes. However, the application of this type of algorithms in a wide variety of fields has posed a number of problems, such as the sensitivity to outliers and fluctuations in the density of data points. Additionally, these algorithms do not usually allow for automatic clustering.In this work we propose a method to improve single linkage hierarchical cluster analysis (HCA), so as to circumvent most of these problems and attain the performance of most sophisticated new approaches. This completely automated method is based on a self-consistent outlier reduction approach, followed by the building-up of a descriptive function. This, in turn, allows to define natural clusters. Finally, the discarded objects may be optionally assigned to these clusters.The validation of the method is carried out by employing widely used data sets available from literature and others for specific purposes created by the authors. Our method is shown to be very efficient in a large variety of situations.
Article
As an important technique for data analysis, clustering has been employed in many applications such as image segmentation, document clustering and vector quantization. Divisive clustering, which is a branch of hierarchical clustering, has been studied and widely used due to its computational efficiency. Generally, which cluster should be split and how to split the selected cluster are two major principles that should be taken into account when a divisive clustering algorithm is used. However, one disadvantage of the divisive clustering is its degraded performance compared to the partitional clustering, thus making it hard to achieve a good trade-off between computational time and clustering performance. To tackle this problem, we propose a novel divisive clustering algorithm by integrating an improved discrete particle swarm optimizer into a divisive clustering framework. Experiments on several synthetic data sets, real-world data sets and two real-world applications (document clustering and vector quantization) show some promising results. Firstly, the proposed algorithm performs better or at least comparable to the other representative clustering algorithms in terms of clustering quality and robustness. Secondly, the proposed algorithm runs much faster than the other competing algorithms on all the benchmark sets. At last, the good time-quality trade-off is still achievable when the size of the problem instance is increased.
Conference Paper
Motivated by applications such as document and image classification in information retrieval, we consider the prob- lem of clustering dynamic point sets in a metric space. We propose a model called incremental clusteringwhich is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in ot her applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze sev- eral natural greedy algorithms and demonstrate that they pe r- form poorly. We propose new deterministic and random- ized incremental clustering algorithms which have a prov- ably good performance. We complement our positive res- ults with lower bounds on the performance of incremental al- gorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.