Article

Abstract and Figures

We present the R package clustrd which implements a class of methods that combine dimension reduction and clustering of continuous or categorical data. In particular, for continuous data, the package contains implementations of factorial K-means and reduced K-means; both methods combine principal component analysis with K-means clustering. For categorical data, the package provides MCA K-means, i-FCB and cluster correspondence analysis, which combine multiple correspondence analysis with K-means. Two examples on real datasets are provided to illustrate the usage of the main functions.
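For orientation, the two continuous-data methods named above are usually written as constrained least-squares problems. The block below is a sketch in notation assumed here, not quoted from the abstract: $X$ is the centred $n \times p$ data matrix, $Z$ an $n \times K$ binary cluster membership matrix, $G$ a $K \times d$ matrix of cluster centroids in the reduced space, and $B$ a $p \times d$ column-orthonormal loading matrix.

```latex
% Reduced K-means: approximate the data by centroids mapped back to the full space
\min_{Z,\,G,\,B}\; \lVert X - Z G B^{\top} \rVert_F^{2}
\quad \text{subject to } B^{\top} B = I_d

% Factorial K-means: approximate the projected data by centroids in the reduced space
\min_{Z,\,G,\,B}\; \lVert X B - Z G \rVert_F^{2}
\quad \text{subject to } B^{\top} B = I_d
```

In both cases the cluster allocation $Z$ and the loadings $B$ are estimated jointly, which is what distinguishes these methods from running principal component analysis first and K-means afterwards.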
... Spatial transcriptomics analysis utilizes several spatial clustering methods, such as the graph convolutional network (GCN)-based approach SpaGCN [5], Giotto [6], BayesSpace [7], and SC-MEB [1]. The joint approach estimates the low-dimensional representations and the latent cell clustering membership labels simultaneously, such as FKM [8] and ORCLUS [9] for non-spatial clustering, and DR.SC [10] for spatial clustering. However, all abovementioned ST analysis methods use the data generated from the current ST studies alone, while not taking advantage of data generated from other existing studies relevant to the concerning disease type, including scRNA-seq data and ST data. ...
Preprint
Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial contexts and abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology such as the relatively low resolution and comparatively insufficient sequencing depth make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, to adaptively leverage the cell-labeled information from external sources in inferring cell-level heterogeneity of target spatial transcriptomics data. Results: Applications in several real studies as well as a number of simulation settings show that our approach significantly improves on existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, among all the studied methods, only TransST is able to separate the adipose tissues from the connective tissues. Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.
... This doubly robust approach merges elements of hierarchical and partitioning methods, increasing accuracy and precision of cluster assignment compared to using either method alone [53]. Furthermore, we quantitatively and visually evaluated cluster metrics using the Dunn Index [54], Connectivity [55], and Silhouette Index [56] for multiple different specifications of the number of clusters, calculated using the clValid package in R [57]. Based on converging evidence from the training set, the presumed number of selected clusters was tested in the (independent) validation set to assess the internal reliability and validity of the clusters. ...
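As an illustration of one of the internal validity measures mentioned in this excerpt, the following is a minimal Python sketch of the Dunn index in its classical form: the smallest distance between points in different clusters divided by the largest within-cluster diameter. It is not the clValid implementation; the function name `dunn_index` and the toy data are assumptions made for this example.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest within-cluster diameter."""
    pairs = list(combinations(range(len(points)), 2))
    # Distances between points that sit in different clusters
    between = [dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    # Distances between points that share a cluster
    within = [dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    if not between or not within or max(within) == 0:
        raise ValueError("need at least two clusters and one cluster with distinct points")
    return min(between) / max(within)


# Toy data (hypothetical): two tight pairs of points that lie far apart
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])  # clusters match the two pairs
bad = dunn_index(points, [0, 1, 0, 1])   # clusters cut across the pairs
```

Higher values indicate compact, well-separated clusters, so the partition that respects the two pairs scores higher than the one that cuts across them.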
Preprint
Accumulating evidence of heterogeneous long-term outcomes after traumatic brain injury (TBI) has challenged longstanding approaches to TBI outcome classification that are largely based on global functioning. A lack of studies with clinical and biomarker data from individuals living with chronic (>1 year post-injury) TBI has precluded refinement of long-term outcome classification ontology. Multimodal data in well-characterized TBI cohorts is required to understand the clinical phenotypes and biological underpinnings of persistent symptoms in the chronic phase of TBI. The present cross-sectional study leveraged data from 281 participants with chronic complicated mild-to-severe TBI in the Late Effects of Traumatic Brain Injury (LETBI) Study. Our primary objective was to develop and validate clinical phenotypes using data from 41 TBI measures spanning a comprehensive cognitive battery, motor testing, and assessments of mood, health, and functioning. We performed a 70/30% split of training (n=195) and validation (n=86) datasets and performed principal components analysis to reduce the dimensionality of data. We used Hierarchical Cluster Analysis on Principal Components with k-means consolidation to identify clusters, or phenotypes, with shared clinical features. Our secondary objective was to investigate differences in brain volume in seven cortical networks across clinical phenotypes in the subset of 168 participants with brain MRI data. We performed multivariable linear regression models adjusted for age, age-squared, sex, scanner, injury chronicity, injury severity, and training/validation set. In the training/validation sets, we observed four phenotypes: 1) mixed cognitive and mood/behavioral deficits (11.8%; 15.1% in the training and validation set, respectively); 2) predominant cognitive deficits (20.5%; 23.3%); 3) predominant mood/behavioral deficits (27.7%; 22.1%); and 4) few deficits across domains (40%; 39.5%). 
The predominant cognitive deficit phenotype had lower cortical volumes in executive control, dorsal attention, limbic, default mode, and visual networks, relative to the phenotype with few deficits. The predominant mood/behavioral deficit phenotype had lower volumes in dorsal attention, limbic, and visual networks, compared to the phenotype with few deficits. Contrary to expectation, we did not detect differences in network-specific volumes between the phenotypes with mixed deficits versus few deficits. We identified four clinical phenotypes and their neuroanatomic correlates in a well-characterized cohort of individuals with chronic TBI. TBI phenotypes defined by symptom clusters, as opposed to global functioning, could inform clinical trial stratification and treatment selection. Individuals with predominant cognitive and mood/behavioral deficits had reduced cortical volumes in specific cortical networks, providing insights into sensitive, though not specific, candidate imaging biomarkers of clinical symptom phenotypes after chronic TBI and potential targets for intervention.
... For reduced k-means, we use the clustrd package in R [Markos et al., 2019]. The tuning parameters η₁, γ and ρ in the proposed method are selected by modified cross-validation based on the idea of clustering stability [Wang et al., 2016, Sun et al., 2013], which is based on the kappa coefficient [Cohen, 1960]. ...
Preprint
We propose a new method based on sparse optimal discriminant clustering (SODC) that adds a penalty term, based on convex clustering, to the scoring matrix. This penalty term is expected to improve the accuracy of cluster identification by pulling points from the same cluster closer together and pushing points from different clusters further apart. Moreover, we develop a novel algorithm that derives the update formula for this scoring matrix using a majorizing function. This resolves the difficulty of satisfying the constraint while retaining the clustering structure in the scoring matrix. We assess the performance of the proposed method through numerical simulations and an application to real data.
... The CCA was used to determine the optimal number of clusters for the motorcyclist crash dataset. The "clustrd" R software package (Markos et al., 2019) used both the Multiple Correspondence Analysis K-means approach and the average silhouette width (ASW) criterion to assess the quality of the clustering solution, segment the data into clusters and choose the best number of clusters. A starting clustering range of 2 to 10 with dimensions between 1 and 9 was specified. ...
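The average silhouette width criterion mentioned in this excerpt can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the clustrd code: the function name `average_silhouette_width`, the one-dimensional toy data and the two candidate partitions are all hypothetical.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a is the mean distance to the other members of the point's own cluster,
    b the smallest mean distance to the members of any other cluster.
    Points in singleton clusters contribute s(i) = 0 by convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster contributes 0
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (hypothetical): two well-separated pairs on a line
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
asw = {k: average_silhouette_width(points, labs) for k, labs in candidates.items()}
best_k = max(asw, key=asw.get)  # number of clusters with the highest ASW
```

In practice the candidate partitions would come from fitting the clustering method once per number of clusters (and per number of dimensions); the solution with the highest ASW is retained.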
... The methodology applied follows the joint dimension reduction and clustering approach, specifically through the use of multiple correspondence analysis (MCA) combined with the K-means algorithm. This method was selected for its ability to handle categorical data and its effectiveness in identifying cluster structures in complex datasets (Markos et al., 2019). ...
Article
This work analyses employment generation by associations in the popular and solidarity sector in Zone 6 of Ecuador. Using multivariate analysis techniques such as Multiple Component Analysis and clustering, key patterns and characteristics were identified that distinguish the associations according to their capacity to generate full employment. The results show that certain clusters of associations, characterised by their social security affiliation, legal benefits and contract types, have a greater influence on employment generation. In addition, the relationship between each cluster and the type of association is highlighted. The main result is that agricultural associations predominate in the first cluster; food, cleaning and maintenance, and textile associations in the second cluster; and agricultural, food, and textile associations in the third cluster, which allows a better understanding of the role these organisations play in the local economy. It can be stated that, although associativity has had a positive impact on employment generation, the evidence shows that the research hypothesis on the creation of full employment is not fully met, since this depends much more on other factors such as investment levels, the benefits characteristic of formality, and the implementation of public policies to strengthen less developed associations. This study offers a solid basis for future research and for the design of public policies aimed at strengthening the popular and solidarity sector.
Article
The quality of medicinal plants is closely related to the ecological factors of their growing environment, as their efficacy is reflected in the content of key medicinal components, which in turn indicates the quality of the plants. This study measured the daily variations in major constituents, including lobetyolin, polysaccharides, and total flavonoids, in Codonopsis pilosula (Franch.) Nannf., known as Lu Tangshen in the Changzhi and Jincheng regions of Shanxi Province, China, throughout its growth cycle. Additionally, the study explored the effects of 11 ecological factors (both climatic and soil variables) on the primary medicinal components of C. pilosula. Through block experiments and comparisons between future data predictions and actual measurements, the reliability of the model and the consistency of block experimental data were ultimately confirmed. Principal component analysis (PCA), stepwise multiple linear regression analysis, and nonlinear polynomial modeling were employed to investigate the relationships between ecological factors and quality-related constituents (polysaccharides, total flavonoids, and lobetyolin). The results showed that linear models effectively explained daily temperature (DT) with an adjusted R² exceeding 0.8, but due to the inherently nonlinear nature of the data, linear models are fundamentally inadequate for accurately capturing the underlying relationships. Therefore, their fit for total flavonoids and lobetyolin was suboptimal. The introduction of nonlinear polynomial models (second-, fourth-, and fifth-order) significantly improved the model fit, indicating the existence of complex nonlinear relationships between ecological factors and medicinal components. For polysaccharides, the fourth-order model demonstrated the best performance, while fifth-order models were required to adequately describe the relationships for total flavonoids and lobetyolin.
Based on the best models, the optimal ranges for key ecological factors were identified: polysaccharides were best influenced by atmospheric pressure (AP) between 9.1 and 9.3 kPa, air relative humidity (ARH) between 30% and 60%, 40 cm soil mean annual temperature (40cmMAT) between 27.5 °C and 28.5 °C, soil pH between 9.68 and 9.72, and soil nitrogen (N) content between 7 and 9 mg/kg. For total flavonoids, narrow optimal ranges were observed for temperature, humidity, and pH (MAT between 10 °C and 15 °C, 40cmMAT between 27.5 °C and 28.5 °C, and pH between 9.68 and 9.72). Lobetyolin showed optimal conditions at AP of 9.1 to 9.3 kPa, 40cmMAT of 28.0 °C to 28.5 °C, ARH of 65% to 75%, pH near 9.70, and days after planting (DAP) between 10 and 50. The adoption of higher-order polynomial models clarified critical nonlinear inflection points and optimal ecological ranges, providing a refined reference for enhancing the content of medicinal components. These findings offer valuable insights for precision cultivation strategies aimed at improving the quality of C. pilosula.
Article
This article focuses on the boards of Italian top-ranked sociology journals by conceiving of them as an affiliation network formed by scholars participating in the same journals’ editorial boards as members. We consider the space formed by editorial memberships in these journals as a social space and analyse it by means of network-analytic and factorial techniques, providing a field-theoretic representation of such space. Through Cluster Correspondence Analysis, we look at how the different clusters (network positions) relate to each other and to what extent journals’ patterns of board participation differ through clusters. Our findings show how this social space is structured along dimensions that oppose journals and scholars by reason of their mutual relationships (interlocking editorships), thus highlighting the relative positions of journals with similar or distinct patterns of board membership and the positions of scholars linked by common participation in various editorial boards. We interpret this participation as a form of ‘position-taking’ in Bourdieu’s sense, reflecting the adjustment of boards’ composition to meet the recent demands of the journals’ accreditation system.
Article
This work explores the intersection of Multiple Criteria Decision Aid (MCDA) and clustering techniques, revealing unexploited potential and novel perspectives arising from their integration, challenging their conventional separation. It serves as a compass, guiding researchers through a bibliometric exploration and a conceptual taxonomy consolidating existing knowledge. Employing a two‐fold methodology, we first sketch the field's contours through a bibliometric lens, uncovering its intellectual structure, thematic landscape, and social dynamics. Then, using science mapping techniques like co‐word analysis, historiography, and collaboration network analysis, we examine patterns, revealing an interconnected mosaic of concepts. Our findings unveil a natural grouping into three categories: (1) mixed‐yet‐not‐integrated approaches, which explore sequential applications (clustering followed by MCDA or vice versa) where one method precedes and informs the other; (2) ‘relational/ordered clustering’, which leverages criteria dependency to refine structures; and (3) the use of MCDA to improve clustering mechanics through similarity metrics, domain knowledge incorporation, and robustness. We conclude by proposing a taxonomy along three axes: Units of Analysis, Instrumentalisation, and Objective. The key takeaway emphasises the collaborative potential of MCDA, envisioning a landscape where the integration of MCDA and clustering not only enhances existing methodologies but also spawns innovative paradigms, fostering a symbiotic relationship that transcends conventional boundaries.
Chapter
There exist several methods for clustering high-dimensional data. One popular approach is to use a two-step procedure. In the first step, a dimension reduction technique is used to reduce the dimensionality of the data. In the second step, cluster analysis is applied to the data in the reduced space. This method may be referred to as the tandem approach. An important drawback of this method is that the dimension reduction may distort or hide the cluster structure. As an alternative, various authors have proposed joint dimension reduction and clustering approaches. In this paper we review some of these existing joint dimension reduction and clustering methods for categorical data in a unified framework that facilitates comparison.
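The two-step tandem procedure described above can be sketched as follows. This is a deliberately small, standard-library-only Python illustration for continuous data, not any package's implementation: it keeps a single principal component (found by power iteration) and then runs Lloyd's K-means on the resulting scores. The function names, the quantile-based initialisation and the toy data are assumptions made for the example.

```python
def first_principal_scores(rows, iterations=200):
    """Step 1: project centred rows onto the leading principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector and that the data are not constant.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centred]


def kmeans_1d(values, k, iterations=100):
    """Step 2: Lloyd's algorithm on the scores; centres start at spaced quantiles."""
    ordered = sorted(values)
    centres = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: (x - centres[c]) ** 2) for x in values]
        new_centres = []
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            new_centres.append(sum(members) / len(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


# Toy data (hypothetical): two groups separated along the first variable
data = [(0.0, 0.1), (0.2, -0.1), (0.1, 0.0), (5.0, 0.1), (5.2, -0.1), (5.1, 0.0)]
scores = first_principal_scores(data)
labels = kmeans_1d(scores, k=2)
```

Here the direction of largest variance happens to coincide with the direction that separates the groups, so the tandem approach works. The drawback noted above arises when those two directions differ: step 1 never sees the cluster structure, so it can discard exactly the dimension that step 2 would need.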
Book
"Correspondence Analysis: Theory, Practice and New Strategies" examines the key issues of correspondence analysis, and discusses the new advances that have been made over the last 20 years. The main focus of this book is to provide a comprehensive discussion of some of the key technical and practical aspects of correspondence analysis, and to demonstrate how they may be put to use. Particular attention is given to the history and mathematical links of the developments made. These links include not just those major contributions made by researchers in Europe (which is where much of the attention surrounding correspondence analysis has focused) but also the important contributions made by researchers in other parts of the world. Key features include: • A comprehensive international perspective on the key developments of correspondence analysis • Discussion of correspondence analysis for nominal and ordinal categorical data. • Discussion of correspondence analysis of contingency tables with varying association structures (symmetric and non-symmetric relationship between two or more categorical variables). • Extensive treatment of many of the members of the correspondence analysis family for two-way, three-way and multiple contingency tables. "Correspondence Analysis" offers a comprehensive and detailed overview of this topic which will be of value to academics, postgraduate students and researchers wanting a better understanding of correspondence analysis. Readers interested in the historical development, internationalisation and diverse applicability of correspondence analysis will also find much to enjoy in this book.
Article
The new R package FactoClass, which combines factorial methods and cluster analysis, is presented. This package is implemented in order to perform a multivariate exploration of a data table according to Lebart et al. (1995). We use some ade4 functions (Chessel et al. 2004) to perform the factorial analysis of the data and some stats functions in R to perform cluster methods. Some new functions are programmed to perform specific tasks, and some existing ones are modified. We describe the implementation of FactoClass in the Windows environment and illustrate its use with an example.
Article
Clustering (partitioning) and simultaneous dimension reduction of objects and variables of a two-way two-mode data matrix is proposed here. The methodology is based on a general model that includes K-means clustering, factorial K-means, projection pursuit clustering (also known as reduced K-means), principal component analysis and intermediate cases of object clustering and variable reduction. Since we often have sets consisting of both qualitative and quantitative variables, the general model is now extended to deal with the general relevant case of mixed variables, analogous to variants of PCA handling qualitative (nominal and ordinal) variables in addition to quantitative variables. The model, called clustering and dimension reduction (CDR), is fully discussed in all the special cases cited above. For least-squares estimation of the model, an efficient coordinate descent algorithm is presented. Finally, a simulation study and two analyses on real data illustrate the features of CDR and study the performance of the proposed algorithm.
Article
A method is proposed that combines dimension reduction and cluster analysis for categorical data by simultaneously assigning individuals to clusters and optimal scaling values to categories in such a way that a single between variance maximization objective is achieved. In a unified framework, a brief review of alternative methods is provided and we show that the proposed method is equivalent to GROUPALS applied to categorical data. Performance of the methods is appraised by means of a simulation study. The results of the joint dimension reduction and clustering methods are compared with the so-called tandem approach, a sequential analysis of dimension reduction followed by cluster analysis. The tandem approach is conjectured to perform worse when variables are added that are unrelated to the cluster structure. Our simulation study confirms this conjecture. Moreover, the results of the simulation study indicate that the proposed method also consistently outperforms alternative joint dimension reduction and clustering methods.
Article
Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population, which exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret by providing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification that assumes observations and/or variable categories to belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means simultaneously to classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.