Rousseeuw, P.J.: Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 20, 53-65 (1987)

Abstract

A new graphical display is proposed for partitioning techniques. Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation. This silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters. The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration. The average silhouette width provides an evaluation of clustering validity, and might be used to select an ‘appropriate’ number of clusters.
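The comparison of tightness and separation described in the abstract can be sketched in a few lines. The block below is a minimal illustration, not the paper's own code: it assumes Euclidean distance, the commonly cited definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the convention that an object alone in its cluster gets s(i) = 0. The function names `silhouette_widths` and `average_silhouette_width` are hypothetical.

```python
from math import dist


def silhouette_widths(points, labels):
    """Return the silhouette width s(i) of every point (Euclidean distance assumed)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumed convention: a singleton cluster gets width 0.
            widths.append(0.0)
            continue
        # a(i): tightness, mean distance to the other members of the own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): separation, smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette_width(points, labels):
    """Mean of s(i) over all points; larger values suggest a better partition."""
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


# Two well-separated pairs: the natural labelling scores higher than a mixed one.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(average_silhouette_width(pts, [0, 0, 1, 1]))
print(average_silhouette_width(pts, [0, 1, 0, 1]))
```

To choose a number of clusters in this spirit, one would compute the average width for each candidate partition and prefer the one with the largest value; widths near 1 indicate objects well inside their cluster, and negative widths indicate objects that sit closer to a neighbouring cluster.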
... After encoding the images in latent space, the t-SNE method [37] is applied to reduce the dimensionality to two components. The images projected onto the plane formed by the two components are clustered using K-means [38,39], and the number of clusters is determined using the silhouette [40] and elbow rule [41] procedures. ...
Article
Tuberculosis (TB) is the deadliest disease from a single infectious agent, ranking above malaria, HIV/AIDS and COVID-19. The World Health Organization (WHO) states that treating both active tuberculosis (ATB) and tuberculosis infection (TBI) can reduce tuberculosis mortality to fewer than one death per million by 2050. In this context, the WHO recommends the use of computer-aided detection (CAD) for TB screening as part of a worldwide elimination plan. Recently, CAD systems have been built from deep learning models trained on medical images as tools for classification, segmentation and synthetic image generation. Medical images are scarcer than natural images, which is one of the primary obstacles to producing more accurate models. Therefore, we propose a framework with three generative adversarial networks (GANs) (i.e., Wasserstein GAN, Pix2Pix GAN, Cycle-GAN) as a synthetic data generation strategy to enlarge TB-related data availability while introducing diversity into the training process of a CAD classifier model. We emphasize that, among the synthetic chest radiographs (CXR) related to TB, we have created synthetic images from data collected in TBI studies, a novelty to our knowledge.
... (ii) cluster assignments of the clustered observations: The adjusted Rand index (ARI) (Hubert and Arabie 1985) is computed between the true simulated cluster assignments of the data-generating process and the cluster partition resulting from the analysis. In addition, the internal validation index average silhouette width (ASW) relates the average within-cluster distance for the assigned cluster to that of the best alternative cluster (Rousseeuw 1987; Aschenbruck and Szepannek 2020). (iii) computation time: Although in most cases computation time is not a primary decision factor for or against a mathematical method, smart algorithm initialization in particular can make a big difference. ...
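As an illustration of the external index mentioned in the excerpt above, the following minimal sketch computes the adjusted Rand index from the contingency table of two labelings, using the usual pair-counting form. It is an assumption-laden sketch rather than the cited authors' implementation: the function name `adjusted_rand_index` is hypothetical, and returning 1.0 when both partitions are trivial is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        raise ValueError("need at least two objects")

    # Pairs of objects placed together in both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each partition separately (row and column margins).
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2        # largest attainable agreement
    if max_index == expected:
        # Assumed convention: both partitions are trivial (one cluster or all singletons).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Label names do not matter, only which objects are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
```

Unlike the average silhouette width, which needs only the data and the partition, this index requires a reference labeling, which is why it suits simulation studies where the true assignments are known.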
Article
One of the most popular partitioning cluster algorithms is k-means, which is only applicable to numerical data. An extension to mixed-type data containing numerical and categorical variables is the k-prototypes algorithm. Due to its iterative structure, the algorithm may converge only to a local minimum rather than a global minimum. Therefore, just like the solution of the original k-means, the resulting cluster partition depends on the initialization. In general, there are two ways of improving on the random initialization of the algorithm: one possibility is to determine concrete initial cluster centers, and the other strategy is to repeat the algorithm with different randomly chosen initial centers. In this work, algorithm initializations of both kinds are analyzed and evaluated comparatively in a benchmark study. To this end, selected initialization strategies of the k-means algorithm are adapted for application to mixed-type data. For the simulation study, several data sets are artificially generated and cluster partitions are determined by using the competing initialization strategies. It is shown that an improvement of the cluster algorithm’s target criterion can be achieved as well as the ability to identify appropriate groups, even with manageable time expenditure.
Article
Neuroblastoma is a common pediatric cancer that affects thousands of infants worldwide, especially children under five years of age. Although recovery for patients with neuroblastoma is possible in 80% of cases, only 40% of those with high-risk stage four neuroblastoma survive. Electronic health records of patients with this disease contain valuable data that biomedical informatics researchers can analyze using computational intelligence and statistical software. Unsupervised machine learning methods, in particular, can identify clinically significant subgroups of patients, which can lead to new therapies or medical treatments for future patients belonging to the same subgroups. However, access to these datasets is often restricted, making it difficult to obtain them for independent research projects. In this study, we retrieved three open datasets containing data from patients diagnosed with neuroblastoma: the Genoa dataset and the Shanghai dataset from the Neuroblastoma Electronic Health Records Open Data Repository, and a dataset from the renowned TARGET-NBL program. We analyzed these datasets using several clustering techniques and measured the results with the DBCV (Density-Based Clustering Validation) index. Among these algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was the only one that produced meaningful results. We scrutinized the two clusters of patients’ profiles identified by DBSCAN in the three datasets and recognized several relevant clinical variables that clearly partitioned the patients into two clusters that have clinical meaning in the neuroblastoma literature. Our results can have a significant impact on health informatics, because any computational analyst wishing to cluster small datasets of patients with a rare disease can choose to use DBSCAN and DBCV rather than more common methods such as k-Means and the Silhouette coefficient.
Article
Unbalanced longitudinal data appears commonly in practice, for example in cases where measurements are collected at different time points for different subjects and can therefore be sparse and/or irregularly sampled. Treating such data as functional enables smooth curve estimation and better handling of missing or irregularly spaced observations. Therefore, a Gaussian copula kernel mixture model (CKMM), based on functional data analysis, is proposed for clustering unbalanced multivariate longitudinal data. In this model, subject-specific warping matrices are included to account for irregularly spaced observations. A regularized functional eigen-decomposition is employed to estimate the copula correlation parameters, ensuring the smoothing procedure is integrated into clustering. Additionally, a functional gradient descent algorithm is implemented as an alternative to kernel density estimation to reduce computational complexity. An expectation-maximization-like algorithm is proposed to estimate marginal distributions, copula parameters, eigenfunctions, and eigenvalues in the CKMM. The performance of the CKMM is demonstrated through a simulation study and a data application. The proposed model exhibits superior performance compared to k-means with dynamic time warping, the growth mixture model, and functional high-dimensional data clustering.
Article
Fused deposition modeling (FDM) is an additive manufacturing (AM) technology recognized for its ability to easily and efficiently prototype complex geometries. However, its sensitivity to factors such as material properties, temperature, and printing speed can influence the quality and mechanical characteristics of the printed parts. This study aimed to advance real-time anomaly detection in FDM by leveraging multimodal time-series data from sensors. Our approach introduces and evaluates novel hybrid architectures combining clustering algorithms with classification machine learning algorithms to monitor FDM processes through acoustic and vibration data. By using multiple data sources, such as inertial measurement unit (IMU) and acoustic signals, both independently and in combination, we investigate clustering–classifier pairings, revealing that some combinations, such as support vector machine (SVM) with agglomerative clustering and random forest (RF) with Gaussian mixture model (GMM) labels, yield higher classification accuracies. This study extends beyond previous approaches by analyzing clustering efficacy within feature spaces of varying dimensionalities. The results indicate that models with moderate dimensions (e.g., 20 components) enhance classification accuracy and better capture anomaly nuances. Furthermore, the paper benchmarks the compatibility of clustering techniques (K-means, agglomerative, GMM) with classifiers such as SVM, RF, and decision tree (DT), providing insights into optimal clustering–classifier combinations for 3D printing anomaly detection. The findings underscore that SVM and RF perform exceptionally well in multimodal datasets with hierarchical clustering, especially in cases requiring high classification precision. These insights contribute to advancing FDM process monitoring, laying the groundwork for robust multimodal anomaly detection systems capable of accurately assessing complex and dynamic manufacturing conditions.
Article
This chapter discusses multivariate display and presents a few methods that are either exemplary of a class of others or are sufficiently useful and unique to warrant inclusion. Polygon plots as well as other sorts of iconic displays based upon nonmetaphorical icons are useful for the qualitative conveyance of information and are made more useful if the icons are displayed in a position that is meaningful. Trees are the only display methods that do a good job at depicting the covariance structure among the variables. A powerful use of a graphic display is to present information in a nonlinear way. Thus, the exploded diagrams of automobile transmissions show clearly which pieces go where and indicate which orders of assembly are possible and which are not. The complex charts of population, often found in statistical atlases, provide many stories of immigration trends.
Article
An icicle plot is a method for presenting a hierarchical clustering. Compared with other methods of presentation, it is far easier in an icicle plot to read off which objects belong to which clusters, and which objects join or drop out from a cluster as we move up and down the levels of the hierarchy, though these benefits only appear when enough objects are being clustered. Icicle plots are described, and their benefits are illustrated using a clustering of 48 objects.
Article
A quantitative taxonomic method is described. The method is based on the calculation of distance functions in multidimensional space, and provides objective discrimination and characterization of taxa.