Figure - available from: Genome Biology
Accuracy evaluation of cell-embedding spaces. a 2D visualization of the inferred cell-embedding spaces of a canonical VAE, siVAE (γ = 0) (no regularization term), siVAE (γ = 0.05) (default regularization weight), and LDVAE. Each point represents a cell and is colored according to cell type. b Barplot indicating the balanced accuracy of a k-NN (k = 80) classifier predicting the cell type labels of single cells based on their inferred position in the cell-embedding space inferred by various methods trained on the fetal liver atlas dataset. Higher accuracies are interpreted as more accurate inferred cell-embedding spaces. c 2D UMAP visualization of the original NeurDiff dataset, without batch correction. Top row shows annotation based on cell type, and the bottom row shows annotation based on the batch. d Same as c, except visualization is a tSNE visualization of the siVAE-inferred cell-embedding space where siVAE corrects for batch within the model
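The evaluation in panel b can be illustrated with a minimal sketch: classify held-out cells by majority vote among their k nearest neighbors in the embedding, then report balanced accuracy (the mean of per-class recall). The caption does not state how cells were split for training and testing, so the random hold-out below is an assumption, as are the function names and the toy two-cluster data. The toy run uses k = 5 rather than the caption's k = 80 because it contains only 80 points.

```python
import math
import random
from collections import Counter, defaultdict


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def balanced_accuracy(true_labels, predicted_labels):
    """Mean per-class recall, so rare cell types weigh as much as common ones."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def score_embedding(points, labels, k, test_fraction=0.25, seed=0):
    """Hold out a random subset of cells (an assumed protocol), classify, score."""
    rng = random.Random(seed)
    idx = list(range(len(points)))
    rng.shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    train_x = [points[i] for i in train]
    train_y = [labels[i] for i in train]
    preds = [knn_predict(train_x, train_y, points[i], k) for i in test]
    return balanced_accuracy([labels[i] for i in test], preds)


# Hypothetical toy embedding: two well-separated 2D clusters of 40 cells each.
rng = random.Random(1)
toy_points, toy_labels = [], []
for label, center in [("A", (0.0, 0.0)), ("B", (20.0, 20.0))]:
    for _ in range(40):
        toy_points.append((center[0] + rng.gauss(0, 1),
                           center[1] + rng.gauss(0, 1)))
        toy_labels.append(label)

score = score_embedding(toy_points, toy_labels, k=5)
print(f"balanced accuracy: {score:.2f}")
```

Balanced accuracy is used instead of plain accuracy because cell types in an atlas are usually imbalanced: a classifier that always predicts the most common type would score well on plain accuracy but poorly on mean per-class recall.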
Source publication
Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream an...
Citations
... Still, there are some approaches that incorporate interpretability already as part of the model design. For example, the siVAE approach simultaneously infers a cell and gene embedding space via two encoder-decoder frameworks and uses an additional regularization term in the loss function, where embeddings of genes indicate their contribution to distinct dimensions of the cell embedding space 36. However, the dimensions in the cell embedding may still be entangled, and the contribution of variables to the dimensions of the cell embedding is not constrained to be sparse. ...
Dimensionality reduction greatly facilitates the exploration of cellular heterogeneity in single-cell RNA sequencing data. While most of such approaches are data-driven, it can be useful to incorporate biologically plausible assumptions about the underlying structure or the experimental design. We propose the boosting autoencoder (BAE) approach, which combines the advantages of unsupervised deep learning for dimensionality reduction and boosting for formalizing assumptions. Specifically, our approach selects small sets of genes that explain latent dimensions. As illustrative applications, we explore the diversity of neural cell identities and temporal patterns of embryonic development.
... This makes VAEs particularly useful for clustering, dimensionality reduction, and transcription factor perturbation analysis [18]. Examples of VAE models for single-cell data include scGen [19], VEGA [20], siVAE [21], scVAE [22], scDHA [23], scVI [24], manatee [25], and ScInfoVAE [26]. ...
As multi-omics sequencing technologies advance, the need for simulation tools capable of generating realistic and diverse (bulk and single-cell) multi-omics datasets for method testing and benchmarking becomes increasingly important. We present MOSim, an R package that simulates both bulk (via mosim function) and single-cell (via sc_mosim function) multi-omics data. The mosim function generates bulk transcriptomics data (RNA-seq) and additional regulatory omics layers (ATAC-seq, miRNA-seq, ChIP-seq, Methyl-seq, and transcription factors), while sc_mosim simulates single-cell transcriptomics data (scRNA-seq) with scATAC-seq and transcription factors as regulatory layers. The tool supports various experimental designs, including simulation of gene co-expression patterns, biological replicates, and differential expression between conditions. MOSim enables users to generate quantification matrices for each simulated omics data type, capturing the heterogeneity and complexity of bulk and single-cell multi-omics datasets. Furthermore, MOSim provides differentially abundant features within each omics layer and elucidates the active regulatory relationships between regulatory omics and gene expression data at both bulk and single-cell levels. By leveraging MOSim, researchers will be able to generate realistic and customizable bulk and single-cell multi-omics datasets to benchmark and validate analytical methods specifically designed for the integrative analysis of diverse regulatory omics data.
... To bridge the complementary strengths of VAEs and LaMs, we propose sciLaMA (single cell interpretable Language Model Adapter), a novel representation learning framework that extends the siVAE Choi et al. [2023] architecture to integrate precomputed static gene embeddings from pretrained multimodal LaMs with scRNA-seq tabular data. By combining the representation power of VAEs with the adaptable and knowledge-rich embeddings of LaMs, our approach projects static gene information into context-aware representations by aligning each dimension of gene and cell latent space within the unified paired-VAE framework (section 3). ...
... However, these approaches primarily focus on cell representations without inferring gene representations, and pipelines leveraging other tools are needed for gene-centric analyses such as marker identification and gene network discovery. To address this limitation, siVAE Choi et al. [2023] introduced a unified framework for learning both cell and gene representations, enabling direct gene-centric analyses using the gene representations and therefore eliminating the need for explicit gene module inference via external tools. However, siVAE gene representation learning involves training an encoder whose number of input nodes scales with the number of cells, thus limiting its applications to large datasets. ...
... sciLaMA is based on a paired encoder-decoder design, inspired by siVAE Choi et al. [2023], an interpretable deep generative model that jointly learns sample (cell) and feature (gene) embeddings using a paired VAE design. siVAE only uses scRNA-seq data to learn both sets of embeddings, whereas sciLaMA uses external data to inform gene embeddings. ...
Abstract
Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models like Variational Auto-Encoder (VAE) and its variants struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) face limitations in computational cost and applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification, while maintaining computational efficiency. sciLaMA offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery.
... There exist several generative methods to learn interpretable latent spaces that decompose the input single-cell expression profiles into relevant sources of variation. These methods can be directly trained to capture a specific source of variation [29][30][31][32][33][34][35] or post-hoc-interpreted after training [36][37][38][39][40]. Furthermore, there exist several methods to learn a latent space such that shifts within the latent space represent specific perturbation effects on an unobserved cell or cell type [4][5][6][7][8][9][10][11][12][13][14]. ...
While single-cell experiments provide deep cellular resolution within a single sample, some single-cell experiments are inherently more challenging than bulk experiments due to dissociation difficulties, cost, or limited tissue availability. This creates a situation where we have deep cellular profiles of one sample or condition, and bulk profiles across multiple samples and conditions. To bridge this gap, we propose BuDDI (BUlk Deconvolution with Domain Invariance). BuDDI utilizes domain adaptation techniques to effectively integrate available corpora of case-control bulk and reference scRNA-seq observations to infer cell-type-specific perturbation effects. BuDDI achieves this by learning independent latent spaces within a single variational autoencoder (VAE) encompassing at least four sources of variability: 1) cell type proportion, 2) perturbation effect, 3) structured experimental variability, and 4) remaining variability. Since each latent space is encouraged to be independent, we simulate perturbation responses by independently composing each latent space to simulate cell-type-specific perturbation responses. We evaluated BuDDI’s performance on simulated and real data with experimental designs of increasing complexity. We first validated that BuDDI could learn domain invariant latent spaces on data with matched samples across each source of variability. Then we validated that BuDDI could accurately predict cell-type-specific perturbation response when no single-cell perturbed profiles were used during training; instead, only bulk samples had both perturbed and non-perturbed observations. Finally, we validated BuDDI on predicting sex-specific differences, an experimental design where it is not possible to have matched samples. In each experiment, BuDDI outperformed all other comparative methods and baselines. 
As more reference atlases are completed, BuDDI provides a path to combine these resources with bulk-profiled treatment or disease signatures to study perturbations, sex differences, or other factors at single-cell resolution.
... We adapt ten baselines for comparisons and assess the embeddings for preservation of co-expression, fidelity of visualization, coherence of downstream analyses and ability to capture signal localization. We compare against approaches from graph signal processing and representation learning for benchmarking tasks [25][26][27][28][29][30][31] (Extended Data Fig. 2 and Methods). ...
In single-cell sequencing analysis, several computational methods have been developed to map the cellular state space, but little has been done to map or create embeddings of the gene space. Here we formulate the gene embedding problem, design tasks with simulated single-cell data to evaluate representations, and establish ten relevant baselines. We then present a graph signal processing approach, called gene signal pattern analysis (GSPA), that learns rich gene representations from single-cell data using a dictionary of diffusion wavelets on the cell–cell graph. GSPA enables characterization of genes based on their patterning and localization on the cellular manifold. We motivate and demonstrate the efficacy of GSPA as a framework for diverse biological tasks, such as capturing gene co-expression modules, condition-specific enrichment and perturbation-specific gene–gene interactions. Then we showcase the broad utility of gene representations derived from GSPA, including for cell–cell communication (GSPA-LR), spatial transcriptomics (GSPA-multimodal) and patient response (GSPA-Pt) analysis.
... This capability makes generative models particularly valuable for addressing issues of sparse or limited multiomics datasets [65,66]. Currently, the most prominent generative models in this field include Generative Adversarial Networks (GANs) [67][68][69][70][71][72], Variational Autoencoders (VAEs) [73][74][75][76][77][78][79][80][81], and diffusion models [82][83][84][85][86][87][88]. For example, Ghebrehiwet et al. demonstrated the use of GANs in generating synthetic electronic health record (EHR) data, highlighting their role in producing synthetic patient data to improve diagnostic accuracy while protecting privacy in precision medicine [89]. ...
... These requirements surpass the capabilities of traditional single-omics approaches [95]. Generative models have become indispensable in single-cell multi-omics research, finding applications in multimodal data integration, cross-modal generation, and dynamic process modeling [68,75,76,78,96]. For instance, by integrating multimodal data and generating cross-modal samples, researchers can simulate cellular perturbation responses under various conditions, providing critical insights into dynamic changes in cell states [78]. ...
... Additionally, MichiGAN leverages decoupled designs and semantic manipulation of latent representations to generate high-quality single-cell data, shedding light on the cooperative mechanisms of gene regulatory networks [68]. In dynamic process modeling, siVAE employs interpretable latent representations to elucidate the dynamic roles of key gene modules in cell differentiation trajectories and disease phenotypes [76]. Overall, the application of generative models in single-cell multi-omics is largely task-specific. ...
With the rapid development of high-throughput sequencing platforms, an increasing number of omics technologies, such as genomics, metabolomics, and transcriptomics, are being applied to disease genetics research. However, biological data often exhibit high dimensionality and significant noise, making it challenging to effectively distinguish disease subtypes using a single-omics approach. To address these challenges and better capture the interactions among DNA, RNA, and proteins described by the central dogma, numerous studies have leveraged artificial intelligence to develop multi-omics models for disease research. These AI-driven models have improved the accuracy of disease prediction and facilitated the identification of genetic loci associated with diseases, thus advancing precision medicine. This paper reviews the mathematical definitions of multi-omics, strategies for integrating multi-omics data, applications of artificial intelligence and deep learning in multi-omics, the establishment of foundational models, and breakthroughs in multi-omics technologies, drawing insights from over 130 related articles. It aims to provide practical guidance for computational biologists to better understand and effectively utilize AI-based multi-omics machine learning algorithms in the context of central dogma.
... We reasoned that shallow sequencing depth would also effectively reduce the number of samples available to train the encoders of scPair or other cell state inference methods. The encoders, being a form of dimensionality reduction in the case of scPair and other cell state inference methods, intuitively use the covariance structure in both the input features and non-linear transformations of them in order to reduce the dimensionality of each data modality [77][78][79]. However, robust covariance estimation can require many more samples (cells) compared to the number of input features, which can be a challenge since data modalities such as ATAC-seq can measure millions of features [3,13,60], and even RNA-seq can have tens of thousands of features [29]. ...
Multimodal single-cell assays profile multiple sets of features in the same cells and are widely used for identifying and mapping cell states between chromatin and mRNA and linking regulatory elements to target genes. However, the high dimensionality of input features and shallow sequencing depth compared to unimodal assays pose challenges in data analysis. Here we present scPair, a multimodal single-cell data framework that overcomes these challenges by employing an implicit feature selection approach. scPair uses dual encoder-decoder structures trained on paired data to align cell states across modalities and predict features from one modality to another. We demonstrate that scPair outperforms existing methods in accuracy and execution time, and facilitates downstream tasks such as trajectory inference. We further show scPair can augment smaller multimodal datasets with larger unimodal atlases to increase statistical power to identify groups of transcription factors active during different stages of neural differentiation.
... The challenge in isolating the lineage signals that drive different cell fates is that these signals are often overshadowed by more prominent signals, such as cell type variations, which could be orthogonal to cell fate in complex biological systems. Many current methods in single-cell analysis use variational autoencoder (VAE) architectures to model single-cell data [16,17,18,19], which excel at reconstructing gene expression. However, since these methods are unsupervised, controlling what the deep learning models learn is difficult. ...
Single-cell lineage tracing technology has advanced the investigation of progenitor cells' development using static, inheritable barcodes. It can determine temporal dynamics in progenitor-progeny relationships through single-cell RNA-sequencing (scRNA-seq) data. However, studying fate commitment from scRNA-seq can be difficult since the gene expression profiles are confounded with information about many cell processes beyond fate commitment. This paper demonstrates a novel framework to specifically isolate lineage signals driving cell fate, allowing us to learn the gene pathways that differentiate different lineages based on their eventual fates.
Our novel approach, LCL (Lineage-aware Contrastive Learning), is a contrastive-learning deep learning model for analyzing lineage-tracing scRNA-seq data. Using two lineage-tracing datasets, one about reprogramming embryonic fibroblasts and the other about hematopoietic progenitor cells, we demonstrate that LCL can produce low-dimensional representations that effectively isolate fate-determining signals from other key biological signals. We evaluate the quality of LCL embeddings and demonstrate that they perform well in out-of-sample evaluation, both in terms of predicting the lineage and cell type compositions at a future time point. LCL also enables us to identify differential genes stably expressed within a lineage and visualize the fate-determining landscape using self-organizing maps based on the results from LCL. Lastly, we demonstrate the consistency of our approach across datasets of varying complexity using a series of pseudo-real datasets. In conclusion, our results demonstrate that LCL allows researchers to explore fate commitment in single-cell lineage-tracing data and uncover lineage-specific gene pathways.
... The proportion of cellular programs that can be effectively captured by linear versus non-linear approaches remains uncertain for any given dataset. Enhancing the interpretability of deep learning-based models, such as VAEs, while maintaining their ability to capture complex biological phenomena is an active research area 50,51. ...
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable Residual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.