Figure 3 - uploaded by Marek Gagolewski


# The Gini-indices of the cluster size distributions in the case of the Iris data set: the Genie algorithm; the Gini-index thresholds are set to 0.3, 0.4, 0.5, and 0.6.

Source publication

The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algo...

## Context in source publication

**Context 1**

... modification prevents drastic increases of the chosen inequity measure and forces early merges of small clusters with some other ones. Figure 3 gives the cluster size distribution (compare Figure 2) in case of the proposed algorithm and the Iris data set. Here, we used four different thresholds for the Gini-index, namely, 0.3, 0.4, 0.5, and 0.6. ...
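As a rough illustration of the quantity being thresholded, the sketch below computes a Gini index over a list of cluster sizes. The normalisation by `(n - 1) * total`, which scales the result to the range 0 to 1, is an assumption about the form used in the source publication, and the function name `gini_index` is hypothetical.

```python
def gini_index(sizes):
    """Normalised Gini index of cluster sizes.

    Returns 0.0 when all clusters have equal size and approaches 1.0
    when a single cluster holds almost all points.
    Assumed form: sum_{i<j} |c_i - c_j| / ((n - 1) * sum_i c_i).
    """
    n = len(sizes)
    total = sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    # Sum of absolute size differences over all unordered cluster pairs.
    pair_diffs = sum(
        abs(sizes[i] - sizes[j]) for i in range(n) for j in range(i + 1, n)
    )
    return pair_diffs / ((n - 1) * total)


balanced = gini_index([50, 50, 50])   # three equal clusters of Iris-like size
skewed = gini_index([148, 1, 1])      # one large cluster plus two singletons
```

With thresholds such as 0.3 to 0.6, a partition like the skewed one would count as too unequal, whereas the balanced one would not.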

## Citations

... DBHT combined with TMFG has been shown to have better accuracy than single-linkage and average-linkage HAC [22,28]. Standard HAC methods can be sensitive to small changes in the dataset, and there have been methods proposed to address this issue [1,14]. ...

Filtered graphs provide a powerful tool for data clustering. The triangular maximally filtered graph (TMFG) method, when combined with the directed bubble hierarchy tree (DBHT) method, defines a useful algorithm for hierarchical data clustering. This combined TMFG-DBHT algorithm has been shown to produce clusters with good accuracy for time series data, but the previous state-of-the-art parallel algorithm has limited parallelism. This paper presents an improved parallel algorithm for TMFG-DBHT. Our algorithm increases the amount of parallelism by aggregating the bulk of the work of TMFG construction together to reduce the overheads of parallelism. Furthermore, our TMFG algorithm updates information lazily, which reduces the overall work. We find further speedups by computing all-pairs shortest paths approximately instead of exactly in DBHT. We show experimentally that our algorithm gives a 3.7--10.7x speedup over the previous state-of-the-art TMFG-DBHT implementation, while preserving clustering accuracy.

... Genie_G0.5, Genie_G0.7 (Genie with different Gini index thresholds; Algorithm 3; Gagolewski et al., 2016);
6: IcA (optimising the information criterion, starting from singletons; Algorithm 2);
7-9: Genie+Ic (k + 0), Genie+Ic (k + 5), Genie+Ic (k + 10) (optimising the information criterion, agglomerative from a partial partition; Algorithm 4) ...

... Genie proposed by Gagolewski et al. (2016) is an example variation on the agglomerative single linkage theme, where the total edge lengths are optimised in a greedy manner, but under the constraint that if the Gini index of the cluster sizes grows above a given threshold g, only the smallest clusters take part in the merging. Thanks to this, we can prevent the outliers from being classified as singleton clusters. ...
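The merging rule described above can be sketched on top of a minimum spanning tree. This is a minimal, quadratic-time illustration under stated assumptions, not the authors' implementation: the names `genie` and `gini_index` are hypothetical, Euclidean distance is assumed, and the Gini normalisation is the assumed form `sum |c_i - c_j| / ((n - 1) * sum c_i)`.

```python
import math


def gini_index(sizes):
    """Normalised Gini index of cluster sizes (assumed normalisation)."""
    n, total = len(sizes), sum(sizes)
    if n < 2 or total == 0:
        return 0.0
    diffs = sum(abs(sizes[i] - sizes[j]) for i in range(n) for j in range(i + 1, n))
    return diffs / ((n - 1) * total)


def genie(points, n_clusters, gini_threshold=0.3):
    """Greedy single-linkage merging with a Gini-index constraint."""
    n = len(points)
    # Prim's algorithm: build the Euclidean minimum spanning tree.
    in_tree = [False] * n
    best = [(math.inf, -1)] * n          # (distance to tree, tree neighbour)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    edges.sort()                          # shortest MST edges first

    parent, size = list(range(n)), [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = n
    while clusters > n_clusters:
        sizes = [size[r] for r in {find(i) for i in range(n)}]
        idx = 0                           # default: shortest remaining MST edge
        if gini_index(sizes) > gini_threshold:
            smallest = min(sizes)
            # Too unequal: take the shortest edge touching a smallest cluster.
            idx = next(
                k for k, (_, a, b) in enumerate(edges)
                if size[find(a)] == smallest or size[find(b)] == smallest
            )
        _, a, b = edges.pop(idx)
        ra, rb = find(a), find(b)
        parent[ra] = rb
        size[rb] += size[ra]
        clusters -= 1

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = genie(blobs, n_clusters=2)
```

Because every unused MST edge joins two different clusters, each iteration performs exactly one merge. Recomputing the Gini index from scratch at each step keeps the sketch short, at the cost of efficiency.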

Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.

... • Use the robust hierarchical clustering proposed by Gagolewski et al. (2016), to get τ 1, and run our algorithm from there; • Randomly choose K centers from the data and take $\Sigma_k = I_d$ and $\pi_k = 1/K$ for all k. ...

... Our interest for the method proposed by Gagolewski et al. (2016) is that it is both deterministic (i.e. does not change from one run of the algorithm to another) and robust. ...

Grouping observations into homogeneous groups is a recurrent task in statistical data analysis. We consider Gaussian Mixture Models, which are the most famous parametric model-based clustering method. We propose a new robust approach for model-based clustering, which consists in a modification of the EM algorithm (more specifically, the M-step) by replacing the estimates of the mean and the variance by robust versions based on the median and the median covariation matrix. All the proposed methods are available in the R package RGMM accessible on CRAN.

... As a result, several clustering algorithms adopted this approach to improve computational efficiency. For instance, Gagolewski (2021) effectively applied this structural relationship-based technique to boost the efficiency of Genie (Gagolewski et al. 2016), an enhancement of hierarchical clustering with the single linkage. Similarly, Hahsler et al. (2019) employed this structural relationship-based technique to enhance the efficiency of DBSCAN. ...

This paper introduces the randomized self-updating process (rSUP) algorithm for clustering large-scale data. rSUP is an extension of the self-updating process (SUP) algorithm, which has shown effectiveness in clustering data with characteristics such as noise, varying cluster shapes and sizes, and numerous clusters. However, SUP’s reliance on pairwise dissimilarities between data points makes it computationally inefficient for large-scale data. To address this challenge, rSUP performs location updates within randomly generated data subsets at each iteration. The Law of Large Numbers guarantees that the clustering results of rSUP converge to those of the original SUP as the partition size grows. This paper demonstrates the effectiveness and computational efficiency of rSUP in large-scale data clustering through simulations and real datasets.

... Each dataset comes with at least one reference partition, which we related, using all the external cluster validity measures studied herein, to the outputs of the following 12 algorithms: Genie (with Gini index thresholds g of 0.1, 0.3, and 0.5; see [9,10]), ITM [27], classical agglomerative hierarchical clustering algorithms (linkages: Single, Average, Complete, Ward), as well as some methods implemented in scikit-learn for Python: K-Means, expectation-maximisation (EM) for Gaussian mixtures (n_init=100), Birch (threshold=0.01, branching_factor=50), and Spectral (affinity=Laplacian, gamma=5). ...

There is no, nor will there ever be, single best clustering algorithm, but we would still like to be able to distinguish between methods which work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet, their validity is questionable, because the clusterings they promote can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to the reference, ground truth groupings that are provided by experts. In this paper, we argue that the commonly-used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties, e.g., they do not identify worst-case scenarios correctly or are not easily interpretable. This makes comparing clustering algorithms across many benchmark datasets difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic, scale invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
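To make the notion of an external measure concrete, the following sketch computes one of the classical scores named above, the adjusted Rand index, from two label vectors using the standard pair-counting formula. It illustrates the classical score only, not the new measure proposed in the abstract; the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(reference, predicted):
    """Adjusted Rand index between a reference and a predicted partition."""
    n = len(reference)
    if n < 2:
        return 1.0  # convention chosen here for degenerate input
    # Contingency table cells and its row/column sums.
    cells = Counter(zip(reference, predicted))
    rows = Counter(reference)
    cols = Counter(predicted)
    # Numbers of point pairs placed together in both / in either partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical in structure
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # relabelled copy
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # maximally mixed
```

The score is invariant to how clusters are labelled, so the relabelled copy scores 1, while the crossed partition scores below 0, which hints at the interpretability concerns raised in the abstract.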

... The dendrogram begins with every data point as a separate cluster and eventually combines related groups according to a similarity or distance value. The algorithm assesses the closeness of clusters and picks the clusters to combine at each stage [30]. Depending on the data and the problem domain, the similarity and distance metrics employed in hierarchical clustering might differ. ...

Data mining is an analytical approach that contributes to achieving a solution to many problems by extracting previously unknown, fascinating, nontrivial, and potentially valuable information from massive datasets. Clustering in data mining is used for splitting or segmenting data items/points into meaningful groups and clusters by grouping the items that are near to each other based on certain statistics. This paper covers various elements of clustering, such as algorithmic methodologies, applications, clustering assessment measurement, and researcher-proposed enhancements with their impact on data mining, to give a thorough grasp of clustering algorithms, their applications, and the advances achieved in the existing literature. This study includes a literature search for papers published between 1995 and 2023, including conference and journal publications. The study begins by outlining fundamental clustering techniques along with algorithm improvements and emphasizing their advantages and limitations in comparison to other clustering algorithms. It investigates the evaluation measures for clustering algorithms with an emphasis on metrics used to gauge clustering quality, such as the F-measure and the Rand Index. This study includes a variety of clustering-related topics, such as algorithmic approaches, practical applications, metrics for clustering evaluation, and researcher-proposed improvements. It addresses numerous methodologies offered to increase the convergence speed, resilience, and accuracy of clustering, such as initialization procedures, distance measures, and optimization strategies. The work concludes by emphasizing clustering as an active research area driven by the need to identify significant patterns and structures in data, enhance knowledge acquisition, and improve decision making across different domains. This study aims to contribute to the broader knowledge base of data mining practitioners and researchers, facilitating informed decision making and fostering advancements in the field through a thorough analysis of algorithmic enhancements, clustering assessment metrics, and optimization strategies.

... Specifically, in this paper, we develop a framework for applying hierarchical clustering algorithms (e.g., [26,36,44]) to the task of community detection. While perhaps not the newest by itself, the hierarchical approach turns out to be robust and easily applicable on graphs, because it relies solely on similarity metrics between the pairs of nodes. ...

... In each case, we will determine the best linkage function. This will also provide us with a good opportunity to consider a new graph extension of the recently proposed Genie algorithm [26], whose performance on many benchmark datasets in the Euclidean spaces turned out to be above par [24]. ...

... One way to bypass the complexity problem is to add additional requirements. For example, the Genie algorithm [24,26] uses a single-linkage criterion but simultaneously monitors the partition's Gini Index value, i.e., a measure of inequality of cluster sizes. Only the cluster of the smallest cardinality can be merged when the algorithm reaches a pre-set threshold of the Gini Index value. ...

Community detection is a critical challenge in the analysis of real-world graphs and complex networks, including social, transportation, citation, cybersecurity networks, and food webs. Motivated by many similarities between community detection and clustering in Euclidean spaces, we propose three algorithm frameworks to apply hierarchical clustering methods for community detection in graphs. We show that using our methods, it is possible to apply various linkage-based (single-, complete-, and average-linkage, Ward, Genie) clustering algorithms to find communities based on vertex similarity matrices, eigenvector matrices thereof, and Euclidean vector representations of nodes. We conduct a comprehensive analysis of choices for each framework, including state-of-the-art graph representation learning algorithms, such as Deep Neural Graph Representation, and a vertex proximity matrix known to yield high-quality results in machine learning -- Positive Pointwise Mutual Information. Overall, we test over a hundred combinations of framework components and show that some -- including Wasserman-Faust and PPMI proximity, DNGR representation -- can compete with algorithms such as state-of-the-art Leiden and Louvain and easily outperform other known community detection algorithms. Notably, our algorithms remain hierarchical and allow the user to specify any number of clusters a priori.

... From the practical side, their usefulness has been thoroughly evaluated in [4], where also some further tweaks were proposed to increase the quality of the generated results, e.g., by including in Genie correction for cluster size inequality [7]. ...

... It is worth noting that, in line with the condition that need only be decreasing from 2 onwards, sequences such as (1, 2, 1, 1, 0, 0, … , 0) and (1, 2, 2, 1, 0, … , 0) can be verified as satisfying the sufficient condition. One can also see that (0, 1, 0, 0, … , 0) will satisfy (7). It can either be viewed separately from the framework that fixes 1 = 1 or as a limiting case ( 2 → ∞). ...

... In [4], the practical usefulness of OWA-based clustering was evaluated thoroughly on numerous benchmark datasets from the suite described in [6]. It was noted that adding the Genie correction for cluster size inequality [7] leads to high-quality partitions, especially based on linkages that rely on a few closest point pairs (e.g., the single linkages and fuzzified/smoothened minimum). These papers provide many examples of practically useful OWA weight generators. ...

Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.

... And Großwendt A, Röglin H have briefly described their improved complete linkage method in the hierarchical clustering algorithm in 2017 [44]. In 2016, Gagolewski M, Bartoszuk M and Cena A improved the average linkage method in Hierarchical clustering algorithm [45]. In 1981, Srivastava R K, Leone R P and Shocker A D discussed how to use Hierarchical clustering to make analysis on market products [46]. ...

The stock is one of the most important instruments of finance. However, the tendency of stock always has a high level of irregularity. In stock market, the stock price moving is considered as a time series problem. Clustering method on stock data is one of the machine learning methods and it is one of the most important analysis methods of technical analysis. The aim of this project is to find an efficient unsupervised learning way to analysis the stock market data to make classification of the patterns on different stock price moving data and get useful information for investment decisions by implementing different clustering algorithms. For this aim, the research objective of this project is to compare several of clustering methods like K-means algorithm, EM algorithm, Canopy algorithm, specify the best number of clusters for each clustering method by several evaluation indexes, show the result of each clustering method and make evaluation on the results of these clustering methods on stock market data of standard S&P 500 stock marketing data. In addition, Weka 3 and Matlab are used to implement the clustering methods and evaluation program. Data visualization shows clearly that those public companies in the same cluster have similar stock price moving pattern. The experiment shows the result that K-means algorithm and EM algorithm perform effectively in stock price moving and Canopy algorithm can be used before K-means algorithm to improve the efficiency.