
Abstract and Figures

We present the R package clustrd which implements a class of methods that combine dimension reduction and clustering of continuous or categorical data. In particular, for continuous data, the package contains implementations of factorial K-means and reduced K-means; both methods combine principal component analysis with K-means clustering. For categorical data, the package provides MCA K-means, i-FCB and cluster correspondence analysis, which combine multiple correspondence analysis with K-means. Two examples on real datasets are provided to illustrate the usage of the main functions.
... The method is to find a reasonable allocation of observations to the groups which are similar concerning observed variables (van de Velden et al., 2017). It conducts correspondence analysis for cross-tabulation of the cluster membership and the variable categories (Markos et al., 2019). The method can maximize the cluster variance by optimizing the scaling values for rows and columns. ...
... Categories with differing distributions over the clusters are optimized as well. An R package called 'clustrd' was adopted to perform the analysis (Markos et al., 2019). The method can be described as follows. ...
... Within Fig. 2, each bar plot displays one of the first thirty attributes with the highest standard residuals in each cluster, containing attributes associated with the perception of WFH decisions after the pandemic (WHAC). Markos et al. (2019) stated the standard residual calculation method in their original publication, demonstrating how the residual, whether positive or negative, reflects how far an attribute deviates from ...
Article
The working standard of shared office spaces has evolved in recent years. Due to the ongoing COVID-19 pandemic, many companies have instituted work from home (WFH) policies in accordance with public health guidelines in order to increase social distancing and decrease the spread of COVID-19. As the pandemic and WFH-related policies have continued for more than a year, there has been a rise in people becoming accustomed to the remote environments; however, others are more enthusiastic about returning to in-person work environments, reflecting the desire to restore pre-pandemic environments. As working from home is related to transportation issues such as changing commuting patterns and decreased congestion, motorized trips, and emission, there is a need to explore the extent of public attitudes on this important issue. This study used unique open-source survey data that provides substantial information on this topic. Using an advanced categorical data analysis method known as cluster correspondence analysis, this study identified several key findings. Not having prior WFH experiences, being eager to interact with colleagues, difficulties with adapting to virtual meeting technologies, and challenges with self-discipline while WFH were strongly associated with individuals who refused to continuously WFH at all after the pandemic. Individuals holding a strong view against the seriousness of the COVID-19 pandemic were also largely associated with never choosing WFH during and after the pandemic. For individuals with some prior WFH experiences, the transition to WFH every day in response to the outbreak was much easier, compared to those without prior experiences. Moreover, being forced to WFH during the COVID-19 pandemic positively influences the choice of WFH after the pandemic. The findings of this study will be beneficial to help policymakers and sustainable city planners understand public opinions about WFH.
... Performing dimension reduction and (spatial) clustering as two sequential analytical steps is not ideal for two important reasons. First, these tandem methods optimize distinct loss functions for dimension reduction and (spatial) clustering separately, and the two loss functions may not be consistent with each other when aiming to achieve optimal (spatial) cluster allocation [39]. PCA aims to retain as much variance as possible in as few PCs as possible, whereas spatial clustering aims to either minimize within-cluster variances or maximize between-cluster variances. ...
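The mismatch between the two loss functions can be written out explicitly. The block below is a sketch in assumed notation (the symbols X, A, Z, G and Y are not taken from the quoted passage): it sets the two tandem criteria next to the single joint criteria of reduced K-means and factorial K-means, the two continuous-data methods named in the clustrd abstract.

```latex
% Assumed notation: X is the n x p centered data matrix, A a p x d loading
% matrix with A^T A = I_d, Z an n x K binary membership matrix, and G the
% K x d matrix of cluster centroids in the reduced space.

% Tandem step 1 (PCA): retain as much variance as possible in d components.
\max_{A}\ \operatorname{tr}\!\left(A^{\top} X^{\top} X A\right)
\quad \text{s.t.}\quad A^{\top} A = I_d

% Tandem step 2 (K-means on the fixed scores Y = XA): minimize the
% within-cluster sum of squares.
\min_{Z,\,G}\ \lVert Y - Z G \rVert_F^{2}

% Joint methods optimize one criterion over A, Z and G together,
% again subject to A^T A = I_d.
\min_{A,\,Z,\,G}\ \lVert X - Z G A^{\top} \rVert_F^{2}
\quad \text{(reduced K-means)}

\min_{A,\,Z,\,G}\ \lVert X A - Z G \rVert_F^{2}
\quad \text{(factorial K-means)}
```

In the tandem case A is fixed before Z and G are estimated, so nothing guarantees that the directions of largest variance are also the directions that separate the clusters.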
... We conducted comprehensive simulations and real data analysis by comparing the dimension reduction and clustering performance of DR-SC with those of existing methods. In detail, we considered the following eight dimension-reduction methods to compare the dimension-reduction performance: (1) PCA implemented in the R package stats; (2) WPCA [48] implemented in the R package DR.SC; (3) factorial k-means (FKM) [39] implemented in the R package clustrd; (4) tSNE; (5) UMAP, in which tSNE and UMAP were implemented in the R package scater; (6) ZIFA implemented in the Python module ZIFA; (7) ZINB-WaVE implemented in the R package zinbwave; and (8) scVI implemented in the Python module scvi. As the last three methods, ZIFA, ZINB-WaVE, and scVI, can be applied to only raw count data, we compared their performance with that of DR-SC in Simulation 2 with the count matrix for expression levels and real datasets. ...
... We considered the following 10 clustering methods when comparing clustering performances. (1) BayesSpace [49] implemented in the R package BayesSpace; (2) Giotto [33] implemented in the R package Giotto; (3) SC-MEB [27] implemented in the R package SC.MEB; (4) SpaGCN [25] implemented in the Python module SpaGCN; (5) Louvain [50] implemented in the R package igraph; (6) Leiden [51] implemented in the R package leiden; (7) GMM implemented in the R package mclust; (8) k-means implemented in the R package stats; (9) FKM [39] implemented in the R package clustrd; and (10) subspace clustering based on arbitrarily oriented projected cluster generation (ORCLUS) [52] implemented in the R package ...
Article
Dimension reduction and (spatial) clustering is usually performed sequentially; however, the low-dimensional embeddings estimated in the dimension-reduction step may not be relevant to the class labels inferred in the clustering step. We therefore developed a computation method, Dimension-Reduction Spatial-Clustering (DR-SC), that can simultaneously perform dimension reduction and (spatial) clustering within a unified framework. Joint analysis by DR-SC produces accurate (spatial) clustering results and ensures the effective extraction of biologically informative low-dimensional features. DR-SC is applicable to spatial clustering in spatial transcriptomics that characterizes the spatial organization of the tissue by segregating it into multiple tissue structures. Here, DR-SC relies on a latent hidden Markov random field model to encourage the spatial smoothness of the detected spatial cluster boundaries. Underlying DR-SC is an efficient expectation-maximization algorithm based on an iterative conditional mode. As such, DR-SC is scalable to large sample sizes and can optimize the spatial smoothness parameter in a data-driven manner. With comprehensive simulations and real data applications, we show that DR-SC outperforms existing clustering and spatial clustering methods: it extracts more biologically relevant features than conventional dimension reduction methods, improves clustering performance, and offers improved trajectory inference and visualization for downstream trajectory inference analyses.
... Performing dimension reduction and (spatial) clustering in two sequential analytic steps is not ideal for two important reasons. First, these tandem methods optimize distinct loss functions for dimension reduction and (spatial) clustering separately, and the two loss functions may not be consistent with each other for achieving optimal (spatial) cluster allocation [38]. PCA aims to retain as much variance as possible in as few PCs as possible, whereas spatial clustering aims to either minimize within-cluster variances or maximize between-cluster variances. ...
... Among them, SpaGCN software took the log-normalized expression matrix as the input, used its internally embedded PCA algorithm to obtain PCs, and could only be applied to perform spatial clustering. The second group was joint analysis, including PSC [51] and FKM [38]. By setting the smoothing parameter to zero, BayesSpace, SC-MEB and Giotto could be applied to cluster non-spatial data. ...
... By setting the smoothing parameter to zero, BayesSpace, SC-MEB and Giotto could be applied to cluster non-spatial data. On the other hand, to evaluate the estimation accuracy of low-dimensional embeddings, we compared DR-SC with eight dimension reduction methods in all simulation settings, including PCA, weighted PCA (WPCA) [19], FKM [38], tSNE [20], UMAP [21], ZIFA [27], ZINB-WaVE [28], and scVI [29]. ...
Preprint
Dimension reduction and (spatial) clustering are two key steps for the analysis of both single-cell RNA-sequencing (scRNA-seq) and spatial transcriptomics data collected from different platforms. Most existing methods perform dimension reduction and (spatial) clustering sequentially, treating them as two consecutive stages in tandem analysis. However, the low-dimensional embeddings estimated in the dimension reduction step may not necessarily be relevant to the class labels inferred in the clustering step and thus may impair the performance of the clustering and other downstream analysis. Here, we develop a computation method, DR-SC, to perform both dimension reduction and (spatial) clustering jointly in a unified framework. Joint analysis in DR-SC ensures accurate (spatial) clustering results and effective extraction of biologically informative low-dimensional features. Importantly, DR-SC is not only applicable for cell type clustering in scRNA-seq studies but also applicable for spatial clustering in spatial transcriptomics that characterizes the spatial organization of the tissue by segregating it into multiple tissue structures. For spatial transcriptomics analysis, DR-SC relies on an underlying latent hidden Markov random field model to encourage the spatial smoothness of the detected spatial cluster boundaries. We also develop an efficient expectation-maximization algorithm based on an iterative conditional mode. DR-SC is not only scalable to large sample sizes, but is also capable of optimizing the spatial smoothness parameter in a data-driven manner. Comprehensive simulations show that DR-SC outperforms existing clustering methods such as Seurat and spatial clustering methods such as BayesSpace and SpaGCN and extracts more biologically relevant features compared to the conventional dimension reduction methods such as PCA and scVI. Using 16 benchmark scRNA-seq datasets, we demonstrate that the low-dimensional embeddings and class labels estimated from DR-SC lead to improved trajectory inference. In addition, analyzing three published scRNA-seq and spatial transcriptomics datasets from three platforms, we show that DR-SC can improve both the spatial and non-spatial clustering performance, resolve a low-dimensional representation with improved visualization, and facilitate downstream analysis such as trajectory inference.
... An R package of Cluster CA, known as ''clustrd,'' was published in 2019 (29). The vignette of this package provided a step-by-step example to demonstrate the coding framework. ...
... The standardized residual measures how far an attribute's frequency in a cluster deviates from the frequency expected if the attribute were independent of the cluster. A positive (negative) residual indicates the attribute has an above (below) average frequency within the cluster (29). The bars with negative standard residuals will, therefore, not be discussed in the later sections. ...
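As a rough illustration of the residuals described in this passage, the sketch below computes Pearson-type residuals, (observed - expected) / sqrt(expected), for a cluster-by-attribute count table. The function name and the toy counts are hypothetical, and the assumption that this formula matches the one reported by clustrd should be checked against the package documentation.

```python
from math import sqrt


def standardized_residuals(table):
    """Pearson residuals for a cluster-by-attribute table of counts.

    Expected counts are computed under independence of cluster and
    attribute: expected[i][j] = row_total[i] * col_total[j] / n.
    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((observed - expected) / sqrt(expected))
        residuals.append(res_row)
    return residuals


# Hypothetical counts: rows are clusters, columns are attributes.
counts = [[30, 10],
          [10, 30]]
res = standardized_residuals(counts)
# res[0][0] is positive (attribute over-represented in cluster 0),
# res[0][1] is negative (attribute under-represented in cluster 0).
```

A bar plot of the largest positive entries in each row would correspond to the kind of per-cluster display the passage refers to.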
Article
Studying near-crashes can help safety researchers understand the nature of crashes from different perspectives. Conventional crash data sets lack information about what occurred directly before the crash event. This study used a near-crash data set extracted from a naturalistic driving study that includes features describing the vehicles, drivers, and information on other vehicles involved before and during the near-crash incidents. This data set provides us with a unique perspective to understand the patterns of near-crashes. This study applied the cluster correspondence analysis (cluster CA) algorithm to explore the patterns and the magnitude of each feature’s dominance within and between the clusters through dimension reduction. The analysis identifies six clusters with four types of near-crashes: near-crash with adjacent vehicles; near-crash with the following or leading vehicles; near-crash with turning vehicles; and near-crash with objects on the roadway. The results also show that the first two types of near-crash are the most common. The patterns for these two most common types of near-crash are different with or without the engagement of secondary tasks. The findings of this study provide a fresh perspective to understand different types of crash and associated patterns. Furthermore, these findings could help transportation agencies or vehicle designers develop a more effective countermeasure to mitigate the risk of collision.
... The CCA method was applied to FSI and MI crashes separately. The analysis was run using the 'clustrd' package (38) in R software (39). The selection of the optimal number of clusters is crucial in performing the CCA. ...
Conference Paper
Rainy weather significantly affects traffic safety, especially in the state of Louisiana. Aiming to identify the patterns of collective association of attributes in rainfall-involved crashes statewide, crashes that occurred during rainy weather and resulted in two injury groups – fatal and severe injury (FSI) and moderate injury (MI) – were extracted from the databases acquired from the Louisiana Department of Transportation and Development. A total of 3,381 crashes were extracted, among which FSI crashes were 502 (14.85%) and MI crashes were 2,879 (85.15%). The ‘Cluster Correspondence Analysis’ method, which combines cluster analysis and correspondence analysis, was applied to generate clusters by partitioning individual attributes based on their profiles over the categorical variables identified through dimension reduction of the dataset. In addition to the biplots illustrating the association of all attributes in the clusters, the top 20 standardized residuals indicating the stronger associations are presented in bar plots. Four optimum clusters from FSI and MI crashes reveal that the associations of roadway, crash environment and driver condition characteristics identified in the clusters are highly distinguishable across roadway functional classes. Specifically, varieties of attributes linked to speed limit, lighting condition, alignment, area type, manner of collision, restraint usage, and alcohol/drug can have associative impacts on these two injury severities. The identified associations of crash attributes across various functional class roadways could provide valuable implications for countermeasure development prioritized for the prevention of more severe injuries.
... Following a tandem approach (Markos et al., 2019), a CA was then conducted applying the k-means algorithm first proposed by MacQueen (1967) and using as cluster variates the two PCs retained. The k-means algorithm is a partitional clustering algorithm where first the number of clusters is specified in advance, then each cluster is associated with a centroid, and then each unit is assigned to the cluster with the closest centroid. ...
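The k-means steps outlined in this passage (fix the number of clusters, associate each cluster with a centroid, assign each unit to the closest centroid) can be sketched in a few lines. This is a minimal, dependency-free illustration of Lloyd's algorithm applied to hypothetical two-dimensional component scores; it is not the code used in the cited study, and the starting centroids are passed in explicitly to keep the run deterministic.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update.

    `centroids` holds the starting centroids; its length fixes the
    number of clusters in advance.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2
                                  for x, c in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical scores on two retained principal components.
scores = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(scores, centroids=[scores[0], scores[3]])
```

In a tandem analysis the `scores` would come from a separately fitted PCA, which is exactly the sequential design that joint methods such as reduced K-means try to avoid.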
Article
Government agendas are increasingly focused on environmental issues, pressuring companies to reduce the environmental impact of their activities, and innovation plays a key role in this. Substantial evidence from academics and practitioners has revealed that technological innovation can reduce negative environmental impact, but the role of marketing innovation in driving environmental benefits has been overlooked by academic research. This study aims to shed light on the environmental contribution driven by marketing innovation by examining the different roles of four types of marketing innovation (product, promotion, placement and price) in achieving environmental benefits, through an empirical analysis of the latest available Community Innovation Survey in Germany and Portugal related to the period 2012–2014. In our models we consider a multinomial outcome, namely four clusters of companies with a different combination of environmental benefits, constructed using a Cluster Analysis conducted on Principal Component Analysis main factors. Then, the determinants of the cluster membership are analyzed through a Multinomial Logistic Regression model. Results show that the introduction of a marketing innovation yields environmental benefits both within the enterprise (internal) and during the consumption or use of services and products (external). When the four types of marketing innovation are analyzed separately, further results emerge: only two types of innovation, in pricing and placement, were found to be significantly related to both internal and external environmental benefits. Companies are challenged to carefully evaluate the types of marketing innovation that should be introduced to positively impact the environment.
... The CCA method was applied to FSI and MI crashes separately. The analysis was run using the 'clustrd' package (38) in R software (39). The selection of the optimal number of clusters is crucial in performing the CCA. ...
Article
Rainy weather significantly affects traffic safety, especially in the state of Louisiana. With the aim of identifying the patterns of collective association of attributes in rainfall-involved crashes statewide, crashes that occurred during rainy weather and resulted in two injury groups, fatal and severe injury (FSI) and moderate injury (MI), were extracted from the databases acquired from the Louisiana Department of Transportation and Development. A total of 3,381 crashes were extracted, among which FSI crashes were 502 (14.85%) and MI crashes were 2,879 (85.15%). This study applied the Cluster Correspondence Analysis (CCA) method, which combines cluster analysis and correspondence analysis, to generate clusters by partitioning individual attributes based on their profiles over the categorical variables identified through dimension reduction of the dataset. In addition to the biplots illustrating the association of all attributes in the clusters, the top 20 standardized residuals indicating the stronger associations are presented in bar plots. Four optimum clusters from FSI and MI crashes reveal that the associations of roadway, crash environment, and driver condition characteristics identified in the clusters are highly distinguishable across roadway functional classes. Specifically, varieties of attributes linked to speed limit, lighting condition, alignment, area type, manner of collision, restraint usage, and alcohol/drug can have associative impacts on these two injury severities. The identified associations of crash attributes across various functional class roadways could provide valuable implications for countermeasure development prioritized for the prevention of fatalities and injuries.
... This study used the R package clustrd (Markos, D'Enza, and Velden 2019) to perform the analysis. The Calinski-Harabasz measure (also known as the variance ratio criterion) provides the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters. ...
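For reference, the Calinski-Harabasz index in its usual form divides the between-cluster sum of squares by (k - 1) and the within-cluster sum of squares by (n - k) before taking the ratio. The sketch below implements that standard definition in plain Python; the function name and toy data are hypothetical, and the code is not taken from clustrd or from the cited study.

```python
def calinski_harabasz(points, labels):
    """Variance ratio criterion: [B / (k - 1)] / [W / (n - k)].

    B is the between-cluster sum of squares (cluster size times the
    squared distance from cluster centroid to overall mean) and W is
    the within-cluster sum of squares. Requires 2 <= k < n and W > 0.
    """
    n = len(points)
    dim = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dim)]
        between += len(members) * sum((m - o) ** 2
                                      for m, o in zip(centroid, overall))
        within += sum(sum((x - m) ** 2 for x, m in zip(p, centroid))
                      for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical one-dimensional data with two obvious groups.
data = [(0.0,), (2.0,), (10.0,), (12.0,)]
ch_good = calinski_harabasz(data, [0, 0, 1, 1])  # well-separated split
ch_bad = calinski_harabasz(data, [0, 1, 0, 1])   # split that mixes groups
```

Higher values indicate more compact, better separated clusters, so when the index is used to choose the number of clusters, the candidate solution with the largest value is typically preferred.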
Article
Moped and seated motor scooter (50 ccs or less) riders have a relatively high risk of becoming crash casualties. Motorcycle riders are obliged to wear a helmet but moped and light-moped riders are not. Comparison between 2015 and 2019 fatal crash data indicates that fatal moped crashes have increased by 76%, whereas fatal motorcycle crashes have decreased by 2%. This study collected moped and seated motor scooter-related fatal crash data for five years (2015-2019) from the Fatality Analysis Reporting System (FARS) to perform the analysis. Using an innovative categorical data analysis method known as cluster correspondence analysis (CCA), this study identified some critical clusters with a group of co-occurring variable categories. The findings indicated that the grouping of environment and road-related factors significantly influences the manner of moped and motor scooter-related fatal collisions. Angle crashes occurred at wide intersections in poor lighting conditions, whereas front-to-rear collisions happened at high-speed non-junctions of two-way streets in the dark without street lighting. The clusters also revealed risk-taking maneuvers of moped riders with fatal consequences, i.e., disregarding traffic rules at intersections, lane changing when other vehicles are traveling at high speed, and speeding while negotiating curves. The contextual understanding of fatal crash patterns could guide authorities in developing data-driven interventions and countermeasures aiming to minimize moped collisions and related fatalities. The findings of this study can provide a better understanding of the patterns of contributing factors in moped and seated motor scooter fatal crashes.
Article
In the context of global change, a better understanding of the dynamics of wood degradation, and how they relate to tree attributes and climatic conditions, is necessary to improve broad‐scale assessments of the contributions of deadwood to various ecological processes, and ultimately, for the development of adaptive post‐disturbance management strategies. The objective of this meta‐analysis was to review the effects of tree attributes and local climatic conditions on the time‐since‐death of coarse woody debris ranging in decomposition states. Results from our meta‐analysis showed that projected warming will likely accelerate wood decomposition and significantly decrease the residence time in decay stages. By promoting such a decrease in residence time, further climate warming is very likely to alter the dynamics of deadwood, which in turn may affect saproxylic biodiversity by decreasing the temporal availability of specific habitats. Moreover, while coarse woody debris has been recognized as a key resource for bioenergy at the global scale, the acceleration of decay‐stages transition dynamics indicates that the temporal window during which dead trees are available as feedstock for value‐added products will shrink. Consequently, future planning and implementation of salvage harvesting will need to occur within a short period following disturbance, especially in warmer regions dominated by hardwood species. Another important contribution of this work was the development of a harmonized classification system that relies on the correspondence between the visual criteria used to characterize deadwood decomposition stages in locally‐developed systems in the literature. This system could be used in future investigations to facilitate direct comparisons between studies. Our literature survey also highlights that most of the information on wood decay dynamics comes from temperate and boreal forests, whereas data from subtropical, equatorial, and subarctic forests is scarce. Such data is urgently needed to allow broader‐scale conclusions on global wood degradation dynamics.
Preprint
Clustering mixed-type data, that is, observation-by-variable data that consist of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing six distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations based on a full factorial design is presented, examining the effect of a variety of factors on cluster recovery. The amount of cluster overlap had the largest effect on cluster recovery in most of the tested scenarios. Modha-Spangler K-Means, K-Prototypes and a sequential Factor Analysis and K-Means clustering typically performed better than other methods. The study can be a useful reference for practitioners in the choice of the most appropriate method.
Chapter
There exist several methods for clustering high-dimensional data. One popular approach is to use a two-step procedure. In the first step, a dimension reduction technique is used to reduce the dimensionality of the data. In the second step, cluster analysis is applied to the data in the reduced space. This method may be referred to as the tandem approach. An important drawback of this method is that the dimension reduction may distort or hide the cluster structure. As an alternative, various authors have proposed joint dimension reduction and clustering approaches. In this paper we review some of these existing joint dimension reduction and clustering methods for categorical data in a unified framework that facilitates comparison.
Book
"Correspondence Analysis: Theory, Practice and New Strategies" examines the key issues of correspondence analysis, and discusses the new advances that have been made over the last 20 years. The main focus of this book is to provide a comprehensive discussion of some of the key technical and practical aspects of correspondence analysis, and to demonstrate how they may be put to use. Particular attention is given to the history and mathematical links of the developments made. These links include not just those major contributions made by researchers in Europe (which is where much of the attention surrounding correspondence analysis has focused) but also the important contributions made by researchers in other parts of the world. Key features include: • A comprehensive international perspective on the key developments of correspondence analysis • Discussion of correspondence analysis for nominal and ordinal categorical data. • Discussion of correspondence analysis of contingency tables with varying association structures (symmetric and non-symmetric relationship between two or more categorical variables). • Extensive treatment of many of the members of the correspondence analysis family for two-way, three-way and multiple contingency tables. "Correspondence Analysis" offers a comprehensive and detailed overview of this topic which will be of value to academics, postgraduate students and researchers wanting a better understanding of correspondence analysis. Readers interested in the historical development, internationalisation and diverse applicability of correspondence analysis will also find much to enjoy in this book.
Article
The new R package FactoClass, which combines factorial methods and cluster analysis, is presented. The package is implemented to perform a multivariate exploration of a data table according to Lebart et al. (1995). We use some ade4 functions (Chessel et al. 2004) to perform the factorial analysis of the data and some stats functions in R to perform the cluster methods. Some new functions are programmed for specific tasks, and some existing ones are modified. We describe the implementation of FactoClass in the Windows environment and illustrate its use with an example.
Article
Clustering (partitioning) and simultaneous dimension reduction of objects and variables of a two-way two-mode data matrix is proposed here. The methodology is based on a general model that includes K-means clustering, factorial K-means, projection pursuit clustering (also known as reduced K-means), principal component analysis and intermediate cases of object clustering and variable reduction. Since we often have sets consisting of both qualitative and quantitative variables, the general model is now extended to deal with the general relevant case of mixed variables, analogous to variants of PCA handling qualitative (nominal and ordinal) variables in addition to quantitative variables. The model, called clustering and dimension reduction (CDR), is fully discussed in all the special cases cited above. For least-squares estimation of the model, an efficient coordinate descent algorithm is presented. Finally, a simulation study and two analyses on real data illustrate the features of CDR and study the performance of the proposed algorithm.
Article
A method is proposed that combines dimension reduction and cluster analysis for categorical data by simultaneously assigning individuals to clusters and optimal scaling values to categories in such a way that a single between variance maximization objective is achieved. In a unified framework, a brief review of alternative methods is provided and we show that the proposed method is equivalent to GROUPALS applied to categorical data. Performance of the methods is appraised by means of a simulation study. The results of the joint dimension reduction and clustering methods are compared with the so-called tandem approach, a sequential analysis of dimension reduction followed by cluster analysis. The tandem approach is conjectured to perform worse when variables are added that are unrelated to the cluster structure. Our simulation study confirms this conjecture. Moreover, the results of the simulation study indicate that the proposed method also consistently outperforms alternative joint dimension reduction and clustering methods.
Article
Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population, which exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret by providing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification that assumes observations and/or variable categories to belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means simultaneously to classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.