
A Dendrite Method for Cluster Analysis

Abstract

A method for identifying clusters of points in a multidimensional Euclidean space is described and its application to taxonomy considered. It reconciles, in a sense, two different approaches to the investigation of the spatial relationships between the points, viz., the agglomerative and the divisive methods. A graph, the shortest dendrite of Florek et al. (1951a), is constructed on a nearest neighbour basis and then divided into clusters by applying the criterion of minimum within-cluster sum of squares. This procedure ensures an effective reduction of the number of possible splits. The method may be applied to a dichotomous division, but is perfectly suitable also for a global division into any number of clusters. An informal indicator of the "best number" of clusters is suggested. It is a "variance ratio criterion" giving some insight into the structure of the points. The method is illustrated by three examples, one of which is original. The results obtained by the dendrite method are compared with those obtained by using the agglomerative method of Ward (1963) and the divisive method of Edwards and Cavalli-Sforza (1965).
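The procedure outlined in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`shortest_dendrite`, `dendrite_split`, `variance_ratio`) are hypothetical, the shortest dendrite is built as a minimum spanning tree with Prim's algorithm, and the split is found by exhaustively trying every removal of k - 1 dendrite edges. The variance ratio criterion is taken in its usual form, VRC = [BGSS / (k - 1)] / [WGSS / (n - k)], where BGSS and WGSS are the between-group and within-group sums of squares.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def wgss(clusters):
    """Pooled within-cluster sum of squared distances to the centroids."""
    return sum(dist(p, centroid(cl)) ** 2 for cl in clusters for p in cl)


def variance_ratio(clusters):
    """VRC = [BGSS / (k - 1)] / [WGSS / (n - k)]; needs 2 <= k < n, WGSS > 0."""
    pts = [p for cl in clusters for p in cl]
    n, k = len(pts), len(clusters)
    g = centroid(pts)
    bgss = sum(len(cl) * dist(centroid(cl), g) ** 2 for cl in clusters)
    return (bgss / (k - 1)) / (wgss(clusters) / (n - k))


def shortest_dendrite(points):
    """Edges (i, j) of a minimum spanning tree, grown with Prim's algorithm."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        i, j = min(
            ((i, j) for i in in_tree for j in range(len(points)) if j not in in_tree),
            key=lambda e: dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def components(n, edges):
    """Connected components (sorted index lists) of a forest on n nodes."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps


def dendrite_split(points, k):
    """Remove k - 1 dendrite edges so that the resulting WGSS is smallest.

    Assumption: exhaustive search over all C(n - 1, k - 1) edge subsets,
    which is only practical for small n.
    """
    edges = shortest_dendrite(points)
    best_score, best_comps = None, None
    for removed in combinations(range(len(edges)), k - 1):
        kept = [e for idx, e in enumerate(edges) if idx not in removed]
        comps = components(len(points), kept)
        score = wgss([[points[i] for i in c] for c in comps])
        if best_score is None or score < best_score:
            best_score, best_comps = score, comps
    return best_comps
```

Restricting candidate splits to edges of the dendrite is what reduces the search: n points joined by n - 1 edges admit only C(n - 1, k - 1) divisions into k clusters, far fewer than the number of unrestricted partitions. Comparing `variance_ratio` across several values of k then gives the informal indication of the "best number" of clusters.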
... The subset and the K value that provide the highest SIL score are selected to cluster the molecules; other validation metrics are also computed: the Dunn index [30], the Davies-Bouldin (DB) index [31], and the Caliński-Harabasz (CH) index [32]. Among all of the CVIs calculated, the SIL score was chosen as the main selection criterion to determine the optimal K because it has been demonstrated to be among the best CVIs when evaluated across a series of clustering algorithms [33]. Table 1 displays the range of values that can be assumed by each CVI, along with their interpretation. ...
Article
The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
Article
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a widely used algorithm for exploratory clustering applications. Despite the DBSCAN algorithm being considered an unsupervised pattern recognition method, it has two parameters that must be tuned prior to the clustering process in order to reduce uncertainties, the minimum number of points in a clustering segmentation MinPts, and the radii around selected points from a specific dataset Eps. This article presents the performance of a clustering hybrid algorithm for automatically grouping datasets into a two-dimensional space using the well-known algorithm DBSCAN. Here, the function nearest neighbor and a genetic algorithm were used for the automation of parameters MinPts and Eps. Furthermore, the Factor Analysis (FA) method was defined for pre-processing through a dimensionality reduction of high-dimensional datasets with dimensions greater than two. Finally, the performance of the clustering algorithm called FA+GA-DBSCAN was evaluated using artificial datasets. In addition, the precision and Entropy of the clustering hybrid algorithm were measured, which showed there was less probability of error in clustering the most condensed datasets.
Article
Pudu deer (Pudu puda) is endemic to the temperate rainforests of Chile. Genetic studies at different geographic scales for this species are required to better determine the genetic divergence within and among populations and their demography across the distribution range. These data can provide unique insights into the species or population status for conservation plans and decision-makers. We analyzed the mtDNA control region (CR) and cytochrome b (Cyt b) sequences of pudu deer in five provinces of southern Chile located at different latitudinal locations (Cautín, Valdivia, Osorno, Llanquihue and Chiloé Island) and three geographic areas within the studied provinces, representative of different longitudinal sites (Andes range, Central Valley and Coastal Range), to understand their genetic divergence and demography. The haplotype (H) and nucleotide (Π) diversities of CR and Cyt b ranged from 0.64286 to 0.98333 and from 0.00575 to 0.01022, respectively. CR diversity was significantly different among provinces, with Valdivia showing higher values than Llanquihue and Chiloé Island (H = 0.98333 vs. 0.64286–0.92727, P < 0.05). Cyt b variation also showed significant differences among provinces, particularly between Cautín and Llanquihue (H = 1.000 vs. 0.222, P < 0.05). Genetic structuring among provinces was relatively high, as indicated by the F_ST index (F_ST = 0.41905). Clustering analysis indicated the presence of a distinctive cluster for Chiloé Island individuals. Fu's Fs and Tajima's D based on CR revealed significant, negative deviations from equilibrium for Chiloé Island (D = -1.65898), Valdivia (Fs = -7.75335) and Llanquihue (Fs = -3.93267), suggesting population expansion in these provinces. Analysis at the longitudinal range showed significant differences among areas based on Π (P < 0.05), with the Andes range and Central Valley showing higher diversity than the Coastal Range. Neither population structuring (F_ST = 0.01360, P > 0.05) nor distinctive clusters in the longitudinal range were observed. Fu's Fs and Tajima's D were negative and significant for the Coastal Range based on CR (Fs = -6.64752, P < 0.001) and Cyt b (D = -1.74110, P < 0.05), suggesting the existence of population expansion. Our results suggest that pudu deer in the analyzed provinces is a genetically structured species, which could be associated with reduced panmixia among populations. The genetic divergence pattern and the population expansion recorded are likely to be associated with past processes of recolonization after Pleistocene glaciation events.
Article
Software-Defined Networking (SDN) is an emerging network architecture that offers flexible network management. Although the decoupling of the control plane and data plane provides network programmability for SDN, it also makes SDN vulnerable to several attacks. The saturation attack is one of these. It is a concealed attack that has a highly negative impact by overwhelming the SDN controller. Once the SDN controller crashes, the network cannot work. Currently, the cusp catastrophe theory has already been used for detecting saturation attacks against the SDN controller. When using the cusp catastrophe theory to detect saturation attacks in SDN, most instances will be identified as unstable instances. The additional detection of unstable instances is achieved using the distance between the current state and the previous state, leading to low detection accuracy. To overcome these issues, in this work, we propose LICENSE, a saturation attack detection mechanism designed based on confusable instance analysis. More specifically, a Condition Transferring Mechanism (CTM) method is designed to first classify the input instances into two kinds: the unconfusable instance, which clearly belongs to the attack or benign class, and the confusable instance, which is not easy to distinguish. Then a Network State Base Cusp model is proposed to further divide the confusable instances into stable and unstable instances. Finally, a method called Unstable Instance Detection (UID) is proposed for identifying unstable instances. The evaluation results demonstrate that LICENSE can reduce the number of unstable instances and improve the detection accuracy of unstable instances, thus achieving a higher overall detection performance. In conclusion, LICENSE can effectively detect saturation attacks in SDN.
Article
This review explores the avenues for the application of meta-heuristics in sports. The necessity of sophisticated algorithms to investigate different NP hard problems encountered in sports analytics was established in the recent past. Meta-heuristics have been applied as a promising approach to such problems. We identified team selection, optimal lineups, sports equipment optimization, scheduling and ranking, performance analysis, predictions in sports, and player tracking as seven major categories where meta-heuristics were implemented in research in sports. Some of our findings include (a) genetic algorithm and particle swarm optimization have been extensively used in the literature, (b) meta-heuristics have been widely applied in the sports of cricket and soccer, (c) the limitations and challenges of using meta-heuristics in sports. Through awareness and discussion on implementation of meta-heuristics, sports analytics research can be rich in the future.
Article
Recently, the two concepts that have been often discussed in the literature on taxonomy are the cluster ensemble and stability. An interesting proposal regarding the combination of these two concepts was presented by Șenbabaoğlu, Michailidis, and Li, who proposed as a measure of stability a proportion of ambiguously clustered pairs (PAC) for selecting the optimal number of groups in the cluster ensemble. This proposal appeared in the field of genetic research, but as the authors themselves write, the method can be successfully used also in other research areas. The aim of this paper is to compare the results of indicating the number of clusters (k parameter) using the aggregated approach in taxonomy and the above-mentioned measure of stability and classical indices (e.g. Caliński–Harabasz, Dunn, Davies–Bouldin).
Article
This paper proposes an automatic, machine learning methodology for precision agriculture, aiming at learning management zones that allow a more efficient and sustainable use of fertiliser. In particular, the methodology consists of clustering remote sensing data and estimating the impact of decision-making based on the extracted knowledge. A case study is developed on experimental data coming from winter wheat (Triticum aestivum) crops receiving site-specific fertilisation. A first approximation to the data allows measuring the effects of the fertilisation treatments on the yield and quality of the crops. After verifying the significance of such effects, clustering analysis is applied on sensor readings on vegetation and soil electric conductivity in order to automatically learn the best configuration of zones for differentiated treatment. The complete methodology for identifying management zones from vegetation and soil sensing is validated for two experimental sites in Denmark, estimating its potential impact for decision-making on site-specific N fertilisation.
Article
Meta-learning frameworks have been proposed to generalize machine learning models for domain adaptation without sufficient label data in computer vision. However, text classification with meta-learning is less investigated. In this paper, we propose SumFS to find global top-ranked sentences by extractive summary and improve the local vocabulary category features. The SumFS consists of three modules: (1) an unsupervised text summarizer that removes redundant information; (2) a weighting generator that associates feature words with attention scores to weight the lexical representations of words; (3) a regular meta-learning framework that trains with limited labeled data using a ridge regression classifier. In addition, a marine news dataset was established with limited label data. The performance of the algorithm was tested on THUCnews, Fudan, and marine news datasets. Experiments show that the SumFS can maintain or even improve accuracy while reducing input features. Moreover, the training time of each epoch is reduced by more than 50%.
Article
Vessel big data play a significant role in understanding vessel behaviors and thus facilitating the prosperity of waterway transportation. However, research on vessel trajectory recognition in a broad range of narrow channels is still lacking, especially using VITS data. The major objective of this paper is to conduct vessel trajectory analysis based on the novel VITS data and examine its availability in inland waterway vessel transportation. An alternate aim is to develop a more comprehensive framework to extract the vessel trajectory of multiple narrow waterways. This paper utilized vessel trajectory information of multiple narrow channels belonging to the Yangtze River captured by VITS. Four compression algorithms were applied. Additionally, the performances of three clustering approaches were evaluated. Speed distribution analysis was also implemented. The results indicated that the slide window (SW) algorithm outperforms its counterparts. Relative to DBSCAN, K-means and hierarchical clustering analysis (HCA) tend to be more capable of balanced classification. This paper is the first to utilize VITS data in vessel trajectory feature extraction and can potentially provide useful insight for vessel trajectory extraction in multiple narrow channels.
Article
The cluster evaluation process is of great importance in areas of machine learning and data mining. Evaluating the clustering quality of clusters shows how much any proposed approach or algorithm is competent. Nevertheless, evaluating the quality of any cluster is still an issue. Although many cluster validity indices have been proposed, there is a need for new approaches that can measure the clustering quality more accurately because most of the existing approaches measure the cluster quality correctly when the shape of the cluster is spherical. However, very few clusters in the real world are spherical. Therefore, a new Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index) to overcome the mentioned issue was proposed in the study. In the VIASCKDE Index, we used separation and compactness of each data to support arbitrary-shaped clusters and utilized the kernel density estimation (KDE) to give more weight to the denser areas in the clusters to support cluster compactness. To evaluate the performance of our approach, we compared it to the state-of-the-art cluster validity indices. Experimental results have demonstrated that the VIASCKDE Index outperforms the compared indices.
Article
A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
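The stepwise reduction from n sets to one can be illustrated with a short Python sketch. It rests on assumptions that the abstract leaves open: the objective function is taken to be the error sum of squares (ESS) of one-dimensional values, "best" is read as the union with the smallest ESS increase, and the names `ess` and `hierarchical_grouping` are hypothetical.

```python
from itertools import combinations


def ess(group):
    """Error sum of squares of a group of 1-D values about their mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def merge_loss(a, b):
    """Increase in ESS caused by replacing groups a and b with their union."""
    return ess(a + b) - ess(a) - ess(b)


def hierarchical_grouping(values):
    """Merge groups pairwise until one remains.

    Returns a list of (merged_group, loss) tuples, one per stage, so the
    loss associated with each step of the hierarchy can be inspected.
    """
    groups = [[v] for v in values]  # start from n singleton sets
    history = []
    while len(groups) > 1:
        # Examine all possible pairs and keep the union with the smallest loss.
        a, b = min(
            combinations(range(len(groups)), 2),
            key=lambda ab: merge_loss(groups[ab[0]], groups[ab[1]]),
        )
        merged = groups[a] + groups[b]
        loss = merge_loss(groups[a], groups[b])
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
        history.append((merged, loss))
    return history
```

Calling `hierarchical_grouping([1.0, 2.0, 6.0, 7.0, 20.0])` returns four stages, and the recorded losses grow as increasingly dissimilar groups are forced together, which is the quantitative estimate of loss per stage that the abstract mentions.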
Article
Each individual of a multivariate sample may be represented by a point in a multidimensional Euclidean space. Cluster analysis attempts to group these points into disjoint sets which it is hoped will correspond to marked features of the sample. Different methods of cluster analysis of the same sample may assume different geometrical distributions of the points or may employ different clustering criteria or may differ in both respects. Three superficially different methods of cluster analysis are examined. It is shown that the clustering criteria of all these methods, and several new ones derived from or suggested by these methods, can be interpreted in terms of the distances between the centroids of the clusters; the geometrical point distribution is found in most instances. The methods are compared, suggestions made for their improvement, and some of their properties are established.
Edwards, A.W.F. and Cavalli-Sforza, L.L. (1965). A method for cluster analysis. Biometrics 3, 362-75.
Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus, H. and Zubrzycki, S. (1951a). Sur la liaison et la division des points d'un ensemble fini. Colloquium Mathematicum 2, 232-5.
Perkal, J. (1965). Matematyka dla Przyrodników i Rolników. Państwowe Wydawnictwo Naukowe, Warszawa.
Rao, C.R. (1964). The use and interpretation of principal component analysis in applied research. Sankhyā A 26, 329-53.
Rao, C.R. (1952). Advanced Statistical Methods in Biometric Research. John Wiley and Sons, Inc., New York.
Ross, G.J.S. (1969a). Minimum spanning tree (Algorithm AS 13). Appl. Statist. 18, 103-4.