Preprint

Hypergraph Representations of scRNA-seq Data for Improved Clustering with Random Walks


Abstract

Analysis of single-cell RNA sequencing (scRNA-seq) data is often conducted through network projections such as coexpression networks, primarily because network analysis tools for downstream tasks are widely available. However, this approach has several limitations: loss of higher-order information, inefficient data representation caused by converting a sparse dataset to a fully connected network, and overestimation of coexpression due to zero-inflation. To address these limitations, we propose conceptualizing scRNA-seq expression data as hypergraphs, which are generalized graphs in which a hyperedge can connect more than two vertices. In the context of scRNA-seq data, the hypergraph nodes represent cells and the hyperedges represent genes. Each hyperedge connects all cells in which its corresponding gene is actively expressed and records the expression of that gene across those cells. This hypergraph conceptualization enables us to explore multi-way relationships beyond the pairwise interactions in coexpression networks without loss of information. We propose two novel clustering methods: (1) the Dual-Importance Preference Hypergraph Walk (DIPHW) and (2) the Coexpression and Memory-Integrated Dual-Importance Preference Hypergraph Walk (CoMem-DIPHW). Both outperform established methods on simulated and real scRNA-seq datasets, and the improvement is especially pronounced when data modularity is weak. Furthermore, CoMem-DIPHW combines the gene coexpression network, the cell coexpression network, and the cell-gene expression hypergraph, all derived from the single-cell abundance count data, for embedding computation. This approach accounts for both the local information from single-cell gene expression and the global information from the pairwise similarities in the two coexpression networks.
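To make the hypergraph representation concrete, the following minimal Python sketch builds gene hyperedges from a small cells-by-genes count matrix and derives a cell-to-cell transition matrix for a simple two-step walk (cell to gene, then gene to cell). This is a generic expression-weighted hypergraph walk chosen for illustration, not the DIPHW or CoMem-DIPHW method, whose transition rules are not specified in the abstract. The function names `build_hyperedges` and `walk_transition_matrix` and the toy counts are assumptions introduced here.

```python
import numpy as np

def build_hyperedges(counts):
    """Map each gene (column) to the indices of the cells that express it.

    counts: cells x genes array of nonnegative expression counts.
    Each returned entry is one hyperedge: the set of cells with count > 0.
    """
    counts = np.asarray(counts, dtype=float)
    return {g: np.flatnonzero(counts[:, g] > 0) for g in range(counts.shape[1])}

def walk_transition_matrix(counts):
    """Cell-to-cell transition matrix of a simple two-step hypergraph walk.

    Assumed (illustrative) rule: from a cell, pick a gene hyperedge with
    probability proportional to the gene's expression in that cell; then,
    within that hyperedge, pick a cell with probability proportional to
    the gene's expression in each member cell.
    """
    counts = np.asarray(counts, dtype=float)
    cell_tot = counts.sum(axis=1, keepdims=True)   # total counts per cell
    gene_tot = counts.sum(axis=0, keepdims=True)   # total counts per gene
    # Row-normalise for the cell -> gene step (zero rows stay zero).
    p_cell_gene = np.divide(counts, cell_tot,
                            out=np.zeros_like(counts), where=cell_tot > 0)
    # Column-normalise, then transpose, for the gene -> cell step.
    p_gene_cell = np.divide(counts, gene_tot,
                            out=np.zeros_like(counts), where=gene_tot > 0).T
    return p_cell_gene @ p_gene_cell

# Toy data (hypothetical): 4 cells x 3 genes, mostly zeros as in scRNA-seq.
counts = np.array([[3, 0, 1],
                   [1, 0, 0],
                   [0, 2, 0],
                   [0, 4, 1]])
hyperedges = build_hyperedges(counts)
P = walk_transition_matrix(counts)
```

In this toy example the hyperedge for gene 0 contains cells 0 and 1, and only the nonzero entries of the count matrix are used, so the sparsity of the data is preserved rather than being expanded into a dense cell-cell similarity network.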