Figure 3
The Gini-indices of the cluster size distributions in the case of the Iris data set: the Genie algorithm; the Gini-index thresholds are set to 0.3, 0.4, 0.5, and 0.6.

Source publication
Article
Full-text available
The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algo...

Context in source publication

Context 1
... modification prevents drastic increases in the chosen inequity measure and forces early merges of small clusters with other ones. Figure 3 gives the cluster size distribution (compare Figure 2) in the case of the proposed algorithm and the Iris data set. Here, we used four different thresholds for the Gini-index, namely, 0.3, 0.4, 0.5, and 0.6. ...
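A simplified reproduction of this kind of experiment can be sketched in Python; the snippet below assumes the genieclust package (the implementation of Genie accompanying the paper), scikit-learn's copy of the Iris data, a fixed partition into three clusters, and the usual sample Gini index of the cluster sizes:

    import numpy as np
    import genieclust                          # implementation of the Genie algorithm
    from sklearn.datasets import load_iris

    def gini_index(counts):
        # Gini index of a vector of cluster sizes, normalised to [0, 1]:
        # 0 means perfectly balanced sizes, values near 1 mean one cluster dominates.
        x = np.sort(np.asarray(counts, dtype=float))
        n = len(x)
        if n <= 1:
            return 0.0
        return float(np.sum((2 * np.arange(1, n + 1) - n - 1) * x) / ((n - 1) * np.sum(x)))

    X = load_iris().data
    for g in (0.3, 0.4, 0.5, 0.6):
        labels = genieclust.Genie(n_clusters=3, gini_threshold=g).fit_predict(X)
        sizes = np.bincount(labels)
        print(f"gini_threshold={g}: cluster sizes={sizes.tolist()}, "
              f"Gini index={gini_index(sizes):.3f}")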

Similar publications

Preprint
Full-text available
The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algo...

Citations

... Specifically, in this paper, we develop a framework for applying hierarchical clustering algorithms (e.g., [26,36,44]) to the task of community detection. While perhaps not the newest approach by itself, hierarchical clustering turns out to be robust and easily applicable to graphs, because it relies solely on similarity metrics between pairs of nodes. ...
... In each case, we will determine the best linkage function. This will also provide us with a good opportunity to consider a new graph extension of the recently proposed Genie algorithm [26], whose performance on many benchmark datasets in Euclidean spaces turned out to be above par [24]. ...
... One way to bypass the complexity problem is to add additional constraints. For example, the Genie algorithm [24,26] uses the single-linkage criterion but simultaneously monitors the partition's Gini index, i.e., a measure of the inequality of cluster sizes. Once the Gini index exceeds a pre-set threshold, only a cluster of the smallest cardinality can take part in a merge. ...
Preprint
Full-text available
Community detection is a critical challenge in the analysis of real-world graphs and complex networks, including social, transportation, citation, cybersecurity networks, and food webs. Motivated by many similarities between community detection and clustering in Euclidean spaces, we propose three algorithmic frameworks for applying hierarchical clustering methods to community detection in graphs. We show that, using our methods, it is possible to apply various linkage-based (single-, complete-, average-linkage, Ward, Genie) clustering algorithms to find communities based on vertex similarity matrices, eigenvector matrices thereof, and Euclidean vector representations of nodes. We provide a comprehensive analysis of the choices for each framework, including state-of-the-art graph representation learning algorithms, such as Deep Neural Graph Representation, and a vertex proximity matrix known to yield high-quality results in machine learning -- Positive Pointwise Mutual Information. Overall, we test over a hundred combinations of framework components and show that some -- including Wasserman-Faust and PPMI proximity and the DNGR representation -- can compete with state-of-the-art algorithms such as Leiden and Louvain and easily outperform other known community detection algorithms. Notably, our algorithms remain hierarchical and allow the user to specify any number of clusters a priori.
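One of the excerpts above spells out the core rule of Genie: proceed as in single linkage, but once the Gini index of the cluster sizes reaches a given threshold, only allow merges that involve a cluster of the smallest cardinality. Below is a deliberately naive sketch of that rule, intended only for small datasets; the algorithm described in the cited paper is built on minimum spanning trees and is far more efficient:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def gini_index(counts):
        # Gini index of the cluster sizes (0 = balanced, close to 1 = one dominating cluster)
        x = np.sort(np.asarray(counts, dtype=float))
        n = len(x)
        if n <= 1:
            return 0.0
        return float(np.sum((2 * np.arange(1, n + 1) - n - 1) * x) / ((n - 1) * np.sum(x)))

    def genie_like(X, n_clusters, gini_threshold):
        # Naive agglomeration: single-linkage merges, except that whenever the Gini
        # index of the cluster sizes exceeds the threshold, only merges involving a
        # cluster of the smallest cardinality are considered.
        D = squareform(pdist(X))
        clusters = [{i} for i in range(len(X))]
        while len(clusters) > n_clusters:
            sizes = [len(c) for c in clusters]
            if gini_index(sizes) >= gini_threshold:
                smallest = min(sizes)
                allowed = [j for j, s in enumerate(sizes) if s == smallest]
            else:
                allowed = list(range(len(clusters)))
            best = None
            for a in allowed:
                for b in range(len(clusters)):
                    if b == a:
                        continue
                    # single-linkage distance between clusters a and b
                    d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a] |= clusters[b]
            del clusters[b]
        labels = np.empty(len(X), dtype=int)
        for k, members in enumerate(clusters):
            for i in members:
                labels[i] = k
        return labels

    from sklearn.datasets import load_iris
    print(np.bincount(genie_like(load_iris().data, n_clusters=3, gini_threshold=0.3)))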
... From the practical side, their usefulness has been thoroughly evaluated in [4], where some further tweaks were also proposed to increase the quality of the generated results, e.g., by including the Genie correction for cluster size inequality [7]. ...
... It is worth noting that, in line with the condition that the weight sequence need only be decreasing from its second element onwards, sequences such as (1, 2, 1, 1, 0, 0, ..., 0) and (1, 2, 2, 1, 0, ..., 0) can be verified as satisfying the sufficient condition. One can also see that (0, 1, 0, 0, ..., 0) satisfies (7). It can either be viewed separately from the framework that fixes the first weight to 1 or as a limiting case (the second weight tending to infinity). ...
... In [4], the practical usefulness of OWA-based clustering was evaluated thoroughly on numerous benchmark datasets from the suite described in [6]. It was noted that adding the Genie correction for cluster size inequality [7] leads to high-quality partitions, especially for linkages that rely on a few closest point pairs (e.g., the single linkage and the fuzzified/smoothened minimum). ...
Preprint
Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.
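To make the above concrete: an OWA-based linkage sorts all pairwise distances between two clusters and aggregates them with a fixed weight vector, so that the single, complete, and average linkages, as well as "mean of the few nearest pairs" variants, are special cases. A minimal sketch (weights are normalised internally here for simplicity; the weight-generating sequences discussed in the excerpt above are not reproduced):

    import numpy as np

    def owa_intercluster_distance(A, B, weights):
        # Sort all pairwise distances between the two clusters in increasing order
        # and aggregate them with the given OWA weight vector; shorter weight
        # vectors are padded with zeros and the result is normalised by the weight sum.
        d = np.sort([float(np.linalg.norm(a - b)) for a in A for b in B])
        w = np.asarray(weights, dtype=float)[:len(d)]
        w = np.concatenate([w, np.zeros(len(d) - len(w))])
        return float(np.sum(w * d) / np.sum(w))

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 2))
    B = rng.normal(size=(7, 2)) + 3.0

    single   = owa_intercluster_distance(A, B, [1.0])                        # nearest pair only
    nearest3 = owa_intercluster_distance(A, B, [1.0, 1.0, 1.0])              # mean of the 3 nearest pairs
    average  = owa_intercluster_distance(A, B, np.ones(5 * 7))               # classical average linkage
    complete = owa_intercluster_distance(A, B, np.r_[np.zeros(5 * 7 - 1), 1.0])  # farthest pair only
    print(single, nearest3, average, complete)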
... In 2017, Großwendt A and Röglin H briefly described their improved complete linkage method for the hierarchical clustering algorithm [44]. In 2016, Gagolewski M, Bartoszuk M and Cena A proposed an improved linkage criterion for the hierarchical clustering algorithm [45]. In 1981, Srivastava R K, Leone R P and Shocker A D discussed how to use hierarchical clustering to analyse market products [46]. ...
Article
Full-text available
Stocks are among the most important financial instruments, yet stock price movements are highly irregular. In the stock market, price movement is naturally treated as a time series problem. Clustering is an unsupervised machine learning method and one of the most important tools of technical analysis. The aim of this project is to find an efficient unsupervised way to analyse stock market data, to classify the patterns in different stock price movements, and to extract useful information for investment decisions by implementing different clustering algorithms. To this end, the research objective is to compare several clustering methods, such as the K-means, EM, and Canopy algorithms, to determine the best number of clusters for each method using several evaluation indices, and to present and evaluate the results of these clustering methods on the standard S&P 500 stock market data. Weka 3 and Matlab are used to implement the clustering methods and the evaluation program. Data visualisation clearly shows that public companies in the same cluster have similar stock price movement patterns. The experiments show that the K-means and EM algorithms perform effectively on stock price movements and that the Canopy algorithm can be used before K-means to improve efficiency.
... 'Genie_G1.0'] We thus have access to data on the Genie [8,10] algorithm with different gini_threshold parameter settings (a threshold of 1.0 gives the single linkage method). ...
Preprint
Full-text available
The evaluation of clustering algorithms can be performed by running them on a variety of benchmark problems and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark batteries referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at https://clustering-benchmarks.gagolewski.com.
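As a usage note, the framework ships with a Python package (imported as clustbench and documented at the URL above). The call below follows that documentation, but the battery and dataset names, the data URL, and the attribute names should all be treated as illustrative rather than authoritative:

    # Rough sketch of loading one benchmark problem via the framework's Python API.
    import clustbench

    data_url = "https://github.com/gagolews/clustering-data-v1/raw/v1.1.0"
    b = clustbench.load_dataset("wut", "smile", url=data_url)   # battery, dataset
    X = b.data          # points to be clustered
    y = b.labels[0]     # the first of possibly several reference partitions
    print(X.shape, b.n_clusters)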
... This whole method is implemented in R, and all the studied algorithms are available in the R package Kmedians (https://cran.r-project.org/package=Kmedians). In what follows, the centers initialization is generated by the robust hierarchical clustering algorithm from the genieclust package (Gagolewski et al., 2016). ...
Preprint
Full-text available
Clustering is a common unsupervised machine learning technique for grouping data points based upon similar features. We focus here on unsupervised clustering of contaminated data, i.e., the case where the K-medians algorithm should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to consider the choice of the optimal number of clusters as the minimization of a penalized criterion. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared in a simulation study with other popular techniques. All studied algorithms are available in the R package Kmedians on CRAN.
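The context excerpt above mentions seeding K-medians with centers obtained from the robust Genie algorithm. The cited work does this in R (packages Kmedians and genieclust); the sketch below merely illustrates one way of realising such an initialisation in Python, using genieclust and componentwise medians of the Genie groups as initial centers, which is an assumption rather than the exact procedure of the cited work:

    import numpy as np
    import genieclust
    from sklearn.datasets import load_iris

    # Cluster with Genie first, then take the componentwise median of each group
    # as an initial center for a subsequent K-medians run.
    X = load_iris().data
    k = 3
    labels = genieclust.Genie(n_clusters=k, gini_threshold=0.3).fit_predict(X)
    init_centers = np.vstack([np.median(X[labels == j], axis=0) for j in range(k)])
    print(init_centers)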
... This whole method is implemented in R, and all the studied algorithms are available in the R package Kmedians (https://cran.r-project.org/package=Kmedians). In what follows, the centers initialization is generated by the robust hierarchical clustering algorithm from the genieclust package (Gagolewski et al., 2016). ...
Preprint
Full-text available
Clustering is a common unsupervised machine learning technique for grouping data points based upon similar features. We focus here on unsupervised clustering of contaminated data, i.e., the case where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to consider the choice of the optimal number of clusters as the minimization of a risk function via penalization. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared in a simulation study with other popular techniques. All studied algorithms are available in the R package Kmedians on CRAN.
... Each dataset comes with at least one reference partition, which we related, using all the external cluster validity measures studied herein, to the outputs of the following 12 algorithms: Genie (with Gini index thresholds g of 0.1, 0.3, and 0.5; see [9,10]), ITM [27], classical agglomerative hierarchical clustering algorithms (linkages: Single, Average, Complete, Ward), as well as some methods implemented in scikit-learn for Python: K-Means, expectation-maximisation (EM) for Gaussian mixtures (n_init=100), Birch (threshold=0.01, branching_factor=50), and Spectral (affinity=Laplacian, gamma=5). ...
Preprint
Full-text available
There is no, nor will there ever be, a single best clustering algorithm, but we would still like to be able to pinpoint those which are well-performing on certain task types and filter out the systematically disappointing ones. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet, their validity is questionable because the clusterings they promote can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to the reference, ground truth groupings that are provided by experts. The commonly-used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, might not possess all the desirable properties, e.g., they do not identify pathological edge cases correctly. Furthermore, they are not nicely interpretable: it is hard to say what a score of 0.8 really means. Its behaviour might also vary as the number of true clusters changes. This makes comparing clustering algorithms across many benchmark datasets difficult. To remedy this, we propose and analyse a new measure: an asymmetric version of the optimal set-matching accuracy. It is corrected for chance and the imbalancedness of cluster sizes.
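The excerpt preceding this abstract lists several scikit-learn baselines together with their settings, and the abstract itself concerns external validity measures. A rough sketch of wiring the two together is given below; parameter names follow scikit-learn, and since the "affinity=Laplacian" setting quoted above has no direct scikit-learn counterpart, an RBF affinity with gamma=5 is used here as a stand-in:

    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans, Birch, SpectralClustering, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import adjusted_rand_score

    X, y_true = load_iris(return_X_y=True)
    k = 3

    candidates = {
        "KMeans": KMeans(n_clusters=k, n_init=10),
        "EM (Gaussian mixture)": GaussianMixture(n_components=k, n_init=100,
                                                 covariance_type="full"),
        "Birch": Birch(n_clusters=k, threshold=0.01, branching_factor=50),
        "Spectral (RBF, gamma=5)": SpectralClustering(n_clusters=k, affinity="rbf", gamma=5),
        "Ward": AgglomerativeClustering(n_clusters=k, linkage="ward"),
    }

    for name, model in candidates.items():
        y_pred = model.fit_predict(X)
        # external validity: compare the obtained partition to the reference one
        print(f"{name:25s} ARI = {adjusted_rand_score(y_true, y_pred):.3f}")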
... The algorithm Genie is a multi-objective clustering algorithm, as effective and simple as any other distance-based hierarchical clustering algorithm (Gagolewski et al., 2016). It only requires a measure of similarity between a pair of observations. ...
... The whole data analysis process was carried out in R-Studio version 1.4.1103 (RStudio Team, 2021) by using packages NbClust (Charrad et al., 2014), clusterCrit (Desgraupes, 2016), optCluster (Sekula et al., 2017), cluster (Maechler et al., 2021), genieclust (Gagolewski et al., 2016), and MESS (Wickham et al., 2021). ...
Article
In this study, in order to present an integrated approach to clustering analysis based on multi-objective decision making, all 27 internal validity criteria were evaluated simultaneously with the MULTIMOORA method so as to determine the best clustering algorithm among 11 different clustering algorithms. First, the optimal number of clusters for two datasets was determined and, based on this number, the best clustering algorithm. The focus then shifted to determining the relationship between the identified country clusters and the human development classes. As a result of the analyses, the countries affected by the COVID-19 pandemic were clustered with the CLARA and SOM algorithms according to their proximities computed via the Euclidean distance. For both datasets, three clusters were identified as the optimal number. The incidence rate was found to be a more dominant factor than the case-fatality rate in the actual difference between the clusters. Another notable finding is that, while countries with high economic power and high human development were expected to be less affected by the pandemic before vaccination, the countries with high human development levels were in fact strongly affected by the pandemic in terms of every variable.
... The Genie (Gagolewski, Bartoszuk and Cena 2016) algorithm is an alternative to the more classical single-linkage hierarchical clustering. The algorithm aims to offset the disadvantages of the single linkage scheme, that is, sensitivity to outliers, the creation of very skewed dendrograms, and consequently not reflecting the actual underlying data structure unless there are well-separated clusters. ...
... For the Kmeans algorithm, we chose the memory-efficient implementation found in (Emerson and Kane 2020), optimized for large-scale applications. For the Genie algorithm, the corresponding R package "genie" (Gagolewski, Bartoszuk and Cena 2016) was used. For the Ncutdc algorithm, the implementation found in the authors' "PPCI" package (Hofmeyr and Pavlidis, PPCI: an R Package for Cluster Identification using Projection Pursuit, 2019) was used. ...
... • Genie [23] (with different thresholds), ...
... Genie_G0.5, Genie_G0.7 - the robust hierarchical clustering algorithm Genie that we have proposed in [23], with different thresholds for the Gini index of the inequity in cluster sizes; 10) ITM - greedy divisive minimiser of an information theoretic criterion over minimum spanning trees [46]; 11) GaussMix - expectation-maximisation (EM) for Gaussian mixtures with 100 restarts and each cluster having its own covariance matrix; 12) KMeans - Lloyd-like k-means algorithm with 10 restarts (note that this is a heuristic to optimise the Caliński-Harabasz index/within-cluster sum of squares; see [50]). ...
Preprint
Full-text available
Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regards to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.
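To make the abstract's point concrete, the sketch below treats three classical internal validity indices available in scikit-learn (Calinski-Harabasz, Silhouette, Davies-Bouldin) as objective functions for choosing the number of clusters; the OWA/near-neighbour Dunn variant proposed in the paper is not reproduced here:

    from sklearn.datasets import load_iris
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import (calinski_harabasz_score, silhouette_score,
                                 davies_bouldin_score)

    X = load_iris().data
    for k in range(2, 8):
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        print(f"k={k}: CH={calinski_harabasz_score(X, labels):8.1f}  "
              f"Silhouette={silhouette_score(X, labels):.3f}  "
              f"DB={davies_bouldin_score(X, labels):.3f}")
    # Treating an index as an objective function means picking the k (and partition)
    # with the best score: higher is better for CH and Silhouette, lower for
    # Davies-Bouldin; as the paper argues, this need not match expert-provided labels.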