
A Dendrite Method for Cluster Analysis

Abstract

A method for identifying clusters of points in a multidimensional Euclidean space is described and its application to taxonomy considered. It reconciles, in a sense, two different approaches to the investigation of the spatial relationships between the points, viz., the agglomerative and the divisive methods. A graph, the shortest dendrite of Florek et al. (1951a), is constructed on a nearest-neighbour basis and then divided into clusters by applying the criterion of minimum within-cluster sum of squares. This procedure ensures an effective reduction of the number of possible splits. The method may be applied to a dichotomous division, but is perfectly suitable also for a global division into any number of clusters. An informal indicator of the "best number" of clusters is suggested. It is a "variance ratio criterion" giving some insight into the structure of the points. The method is illustrated by three examples, one of which is original. The results obtained by the dendrite method are compared with those obtained by using the agglomerative method of Ward (1963) and the divisive method of Edwards and Cavalli-Sforza (1965).
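The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the shortest dendrite is built with Prim's algorithm, it handles only the dichotomous case (one edge removed), and it uses the variance ratio criterion in the form in which it is commonly stated, VRC = [BGSS / (k - 1)] / [WGSS / (n - k)], where BGSS and WGSS are the between-group and within-group sums of squares. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def wgss(groups):
    """Pooled within-group sum of squares over a list of groups."""
    total = 0.0
    for members in groups:
        c = centroid(members)
        total += sum(sq_dist(p, c) for p in members)
    return total

def shortest_dendrite(points):
    """Minimum spanning tree edges (i, j), grown by Prim's algorithm (an assumption)."""
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: sq_dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

def components(n, edges):
    """Label the connected components of a forest on n nodes."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start], stack = current, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def best_dichotomy(points):
    """Remove each dendrite edge in turn; keep the split with the smallest WGSS."""
    n, edges = len(points), shortest_dendrite(points)
    best = None
    for cut in edges:
        labels = components(n, [e for e in edges if e != cut])
        groups = [[p for p, l in zip(points, labels) if l == g] for g in (0, 1)]
        score = wgss(groups)
        if best is None or score < best[0]:
            best = (score, labels)
    return best[1]

def variance_ratio_criterion(points, labels):
    """VRC = [BGSS / (k - 1)] / [WGSS / (n - k)]; requires 2 <= k < n and WGSS > 0."""
    n = len(points)
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    k = len(groups)
    overall = centroid(points)
    bgss = sum(len(g) * sq_dist(centroid(g), overall) for g in groups.values())
    return (bgss / (k - 1)) / (wgss(list(groups.values())) / (n - k))

# Illustrative data only: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = best_dichotomy(points)
vrc = variance_ratio_criterion(points, labels)
```

Restricting candidate splits to the n - 1 dendrite edges, rather than all possible partitions, is what gives the reduction in the number of splits that the abstract mentions. A division into k clusters would remove k - 1 edges instead of one.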
... A higher score implies better performance. It is also known as the Variance Ratio Criterion [11]. It is calculated using the equation given below, of the clusters by a human expert or user. ...
... In contrast, clusters are evaluated based on similarity or dissimilarity measures where ground truths are unknown. Three of the most popular cluster evaluation metrics, the Silhouette Coefficient [10], the Calinski-Harabasz Index [11], and the Davies-Bouldin Index [12], are considered in this research. The analysis and comparison based on both datasets are as follows: According to the values of the cluster evaluation metrics mentioned in III the performance orders are as follows: ...
Conference Paper
Customer segmentation is a process that divides customers into groups based on common characteristics. The customer segmentation problem belongs to the domain of unsupervised learning, more specifically clustering. The effectiveness of customer segmentation distinctly depends on the chosen clustering algorithm. Moreover, the efficacy of a clustering algorithm is highly dependent on the dataset, type of data, utilised subspaces, and complexity, etc. However, different e-commerce or internet-based businesses collect and utilise their customer data differently and even the slightest difference in data might require a different clustering algorithm for effective customer segmentation. In this paper, we propose a system which consists of two modules, an unsupervised module and a supervised module. The unsupervised module will utilise unlabelled customer data and apply different categories of unsupervised clustering algorithms to find the most suitable algorithm for a given dataset. We use the acquired results to convert the unlabelled customer data into labelled data. After training a classification model using the labelled data, the supervised module can identify the groups of new customers using the trained model without further clustering. This system will work as a customer segmentation and identification system which will help businesses take data-driven decisions more efficiently.
... The studied mammal assemblage includes species with different mobility abilities and movement behaviours; therefore, for the analyses I used nine functional groups (Fig. 2) defined by Meza-Joya et al. (2020). These groups were delimited based on species-specific traits (trophic category, life-history characteristics, social structure, body size, and environmental sensitivity; for details see Meza-Joya et al. 2020) by applying Gower distances (Gower 1966) and the Calinski criterion (Calinski and Harabasz 1974). I also selected twelve environmental variables representing landscape features (Table 1) al. 2020), regardless of the reasons for movement or the potential response of species or individuals towards fragmentation or matrix effects. ...
Article
Mitigation planning for road projects in Colombia has been largely based on actions aimed at reducing wildlife roadkills. Nonetheless, the efficiency of these actions is compromised because of the absence of robust empirical studies supporting their implementation. In this work, I used the Road Permeability Index (RPI) in conjunction with expert knowledge information to estimate the strength of the barrier effect imposed by an under-construction road (Yuma road, Santander department, Colombia) on nine functional groups of medium and large-sized mammals. The influence of 12 landscape variables on the permeability of each functional group was assessed at 30 locations along the road. The RPI was calculated for each functional group, and for the whole studied mammal assemblage, at each location. The relative influence of each variable on overall permeability was also estimated. I found that functional groups including terrestrial and semiarboreal species present higher contribution values to overall road permeability, indicating that they represent priority targets for mitigation actions. The RPI identified six highly permeable locations for animal movement, where higher roadkill rates are expected, which are key for implementing mitigation strategies aimed at reducing wildlife road mortality. Forest cover had the strongest influence on road permeability and is therefore crucial for landscape connectivity. Overall, the results of this work show that RPI constitutes a reliable and easily adaptable alternative for identifying priority species, or faunal groups, and locations for road mitigation planning.
... Because FlowSOM requires knowledge about the final number of clusters, it was set to 24 (as the number of cell types identified by experts). The results were then compared in terms of the Calinski-Harabasz Index [20], Davies-Bouldin Index [21], and the number of clusters found (ClusterX and PARC). ...
Article
Cell subtype identification from mass cytometry data presents a persisting challenge, particularly when dealing with millions of cells. Current solutions are consistently under development; however, their accuracy and sensitivity remain limited, particularly in rare cell-type detection due to frequent downsampling. Additionally, they often lack the capability to analyze large data sets. To overcome these limitations, a new method was suggested to define an extended feature space. When combined with the robust clustering algorithm for big data, it results in more efficient cell clustering. Each marker’s intensity distribution is presented as a mixture of normal distributions (Gaussian Mixture Model, GMM), and the expanded space is created by spanning over all obtained GMM components. The projection of the initial flow cytometry marker domain into the expanded space employs GMM-based membership functions. An evaluation conducted on three established cellular identification algorithms (FlowSOM, ClusterX, and PARC) utilizing the most substantial publicly available annotated dataset by Samusik et al. demonstrated the superior performance of the suggested approach in comparison to the standard. Although our approach identified 20 cell clusters instead of the expected 24, their intra-cluster homogeneity and inter-cluster differences were superior to the 24-cluster FlowSOM-based solution.
Article
Background Psoriasis is a chronic immune-mediated inflammatory systemic disease with skin manifestations characterized by erythematous, scaly, itchy and/or painful plaques resulting from hyperproliferation of keratinocytes. Certolizumab pegol [CZP], a PEGylated antigen binding fragment of a humanized monoclonal antibody against TNF-alpha, is approved for the treatment of moderate-to-severe plaque psoriasis. Patients with psoriasis present clinical and molecular variability, affecting response to treatment. Herein, we utilized an in silico approach to model the effects of CZP in a virtual population (vPop) with moderate-to-severe psoriasis. Our proof-of-concept study aims to assess the performance of our model in generating a vPop and defining CZP response variability based on patient profiles. Methods We built a quantitative systems pharmacology (QSP) model of a clinical trial-like vPop with moderate-to-severe psoriasis treated with two dosing schemes of CZP (200 mg and 400 mg, both every two weeks for 16 weeks, starting with a loading dose of CZP 400 mg at weeks 0, 2, and 4). We applied different modelling approaches: (i) an algorithm to generate vPop according to reference population values and comorbidity frequencies in real-world populations; (ii) physiologically based pharmacokinetic (PBPK) models of CZP dosing schemes in each virtual patient; and (iii) systems biology-based models of the mechanism of action (MoA) of the drug. Results The combination of our different modelling approaches yielded a vPop distribution and a PBPK model that aligned with existing literature. Our systems biology and QSP models reproduced known biological and clinical activity, presenting outcomes correlating with clinical efficacy measures. We identified distinct clusters of virtual patients based on their psoriasis-related protein predicted activity when treated with CZP, which could help unravel differences in drug efficacy in diverse subpopulations. 
Moreover, our models revealed clusters of MoA solutions irrespective of the dosing regimen employed. Conclusion Our study provided patient specific QSP models that reproduced clinical and molecular efficacy features, supporting the use of computational methods as modelling strategy to explore drug response variability. This might shed light on the differences in drug efficacy in diverse subpopulations, especially useful in complex diseases such as psoriasis, through the generation of mechanistically based hypotheses.
Chapter
Global industrialization and excessive consumption of fossil fuels have led to an increase in greenhouse gas emissions and, as a result, rising global temperatures and environmental problems. Growing challenges are setting the world community on the path to reducing carbon emissions as much as possible. Adopted in 2015, the Paris Agreement placed an obligation on the signatory countries to change their development trajectory in order to limit global warming. Responding to this need, the aim of this research is to explore the possibility of applying artificial intelligence techniques in economic decisions to model and analyze the decarbonisation capabilities of nations effectively and efficiently. We proposed and validated 11 indicators to define clusters among 39 countries with similar decarbonisation capabilities over a ten-year period. As a result, eight distinct clusters were obtained. The cluster analysis was conducted dynamically and identified a leader cluster and other clusters, the countries of which should follow the leaders by changing their indicators’ values in order to improve their decarbonisation positions in the map. These changes will be associated with transformations from carbon-intensive to zero-carbon economies. We believe that clustering countries by their decarbonisation capabilities has implications for designing zero-carbon policies towards shifting an economy’s sectors to renewable energy consumption and/or supply.
Article
We present a framework for extracting spatiotemporal trip typologies using noisy mobile ticketing boarding data sampled from passengers in a bus network. Our case study was the Pioneer Valley Transit Authority in Massachusetts. We first used a greedy approach to infer bus boarding stops. Next, we calculated the multi-dimensional dissimilarity of passenger activation time series using the AWarp alignment algorithm for sparse time series. We then employed hierarchical clustering to discover the spatiotemporal patterns, resulting in four distinct trip pattern typologies. We analyzed the typologies, based on trip length and duration, seasonality and other temporal distributions, spatial distributions, and faretype. Three typologies were linked to regular commuters, distinguished by boarding time or transfer tendency. The fourth typology was primarily associated with leisure or other activities. Our typology method provides valuable passenger behavioral insights and can facilitate demand estimation by planners. Further, we demonstrate a potential for decision-making support for other regional transit authorities with limited passenger data availability.
Article
Spectrogram zeros, originated by the destructive interference between the components of a signal in the time-frequency plane, have proven to be a relevant feature to describe the time-varying frequency structure of a signal. In this work, we first introduce a classification of the spectrogram zeros in three classes that depend on the nature of the components that interfere to produce them. Then, we describe an algorithm to classify these points in an unsupervised way, based on the analysis of the stability of their location with respect to additive noise. Finally, potential uses of the classification of zeros of the spectrogram for signal detection and denoising are investigated, and compared with other methods on both synthetic and real-world signals.
Article
A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
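The pairwise-union procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the published flowchart. It assumes that the objective is the error sum of squares, so that the "best" union is the one with the smallest increase n_a n_b / (n_a + n_b) times the squared distance between the two group centroids. The function names are hypothetical.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def merge_cost(a, b):
    """Increase in the error sum of squares caused by uniting groups a and b."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward_hierarchy(points):
    """Repeatedly unite the cheapest pair; return [(merged_group, loss), ...]."""
    groups = [[p] for p in points]
    history = []
    while len(groups) > 1:
        # Examine every pair of current groups and pick the cheapest union.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: merge_cost(groups[ij[0]], groups[ij[1]]),
        )
        loss = merge_cost(groups[i], groups[j])
        merged = groups[i] + groups[j]
        history.append((merged, loss))
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [merged]
    return history

# Illustrative one-dimensional data only.
history = ward_hierarchy([(0.0,), (1.0,), (10.0,)])
```

Each entry of `history` records one stage of the hierarchy together with the loss associated with that stage, and the losses add up to the total sum of squares about the overall mean.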
Article
Each individual of a multivariate sample may be represented by a point in a multidimensional Euclidean space. Cluster analysis attempts to group these points into disjoint sets which it is hoped will correspond to marked features of the sample. Different methods of cluster analysis of the same sample may assume different geometrical distributions of the points or may employ different clustering criteria or may differ in both respects. Three superficially different methods of cluster analysis are examined. It is shown that the clustering criteria of all these methods, and several new ones derived from or suggested by these methods, can be interpreted in terms of the distances between the centroids of the clusters; the geometrical point distribution is found in most instances. The methods are compared, suggestions made for their improvement, and some of their properties are established.
A method for cluster analysis
  • A W F Edwards
  • L L Cavalli-Sforza
Edwards, A.W.F. and Cavalli-Sforza, L.L. (1965). A method for cluster analysis. Biometrics 21, 362-75.
Sur la liaison et la division des points d'un ensemble fini
  • K Florek
  • J Lukaszewicz
  • J Perkal
  • H Steinhaus
  • S Zubrzycki
Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, H. and Zubrzycki, S. (1951a). Sur la liaison et la division des points d'un ensemble fini. Colloquium Mathematicum 2, 282-5.
Matematyka dla Przyrodników i Rolników. Państwowe Wydawnictwo Naukowe
  • J Perkal
Perkal, J. (1965). Matematyka dla Przyrodników i Rolników. Państwowe Wydawnictwo Naukowe, Warszawa.
The use and interpretation of principal component analysis in applied research. Sankhyā A 26
  • C R Rao
Rao, C.R. (1964). The use and interpretation of principal component analysis in applied research. Sankhyā A 26, 329-53.
Advanced Statistical Methods in Biometric Research
  • C R Rao
Rao, C.R. (1952). Advanced Statistical Methods in Biometric Research, John Wiley and Sons, Inc., New York.
Minimum spanning tree (Algorithm AS 13). Appl. Statist.
  • G J S Ross
Ross, G.J.S. (1969a). Minimum spanning tree (Algorithm AS 13). Appl. Statist. 18, 103-4.