
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data


Abstract and Figures

Here we present biVI, which combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. We demonstrate on simulated and experimental single-cell RNA sequencing data that biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.
BiVI fits single-cell data from mouse primary motor cortex (Allen sample B08) and suggests the biophysical basis for expression differences. a,b, Observed, scVI and biVI reconstructed distributions of Foxp2 (a), a marker gene for L6 CT cells, and Rorb (b), a marker gene for L5 IT cells, restricted to respective cell type. Kullback–Leibler divergence (KLD) and Hellinger distance (HD) between empirical (observed) and predicted distributions are indicated. c,d, Cell-specific parameters inferred for Foxp2 (c) and Rorb (d) demonstrate identifiable differences in means and parameters in the marked cell types (1,333 L6 CT cells (blue) and 2,395 L5 IT cells (purple) of 6,398 total illustrated cells). e, Cell subclasses show different modulation patterns, especially pronounced in non-neuronal cells (fractions of 2,000 highly variable genes that exhibit differences in each parameter (top); number of cells in each subclass (bottom)). f, biVI allows the identification of cells that exhibit differences in burst size or relative degradation rate, without detectable differences in mature mean expression. Cell types are as defined and labeled in ref. ²⁴: GABAergic neurons that are marked by the genes Lamp5 (Lamp5), Sncg (Sncg), Vip (Vip), Sst (Sst), Pvalb (Pvalb), layer 2/3 intratelencephalic neurons (L2/3 IT), layer 5 intratelencephalic neurons (L5 IT), layer 5/6 near-projecting neurons (L5/6 NP), layer 6 corticothalamic neurons (L6 CT), layer 6 intratelencephalic neurons (L6 IT), layer 6b neurons (L6b), astrocytes (Astro), oligodendrocyte precursor cells (OPC), oligodendrocytes (Oligo), macrophages (Macrophage) and endothelial cells (Endo). g, Histograms of biVI parameters and scVI mature means for genes that exhibit modulation in biVI parameters (degradation rate in L5 IT cells for Trem2 (top); burst size in L6 CT cells for Ndnf (bottom)) but no identifiable mature mean modulation.
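The caption reports Kullback–Leibler divergence and Hellinger distance between observed and predicted count distributions. A minimal sketch of how two such scores could be computed for discrete distributions on a shared support is given here. The function names, the flooring constant `eps` and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def empirical_pmf(counts, support_size):
    """Normalized histogram of non-negative integer counts on 0..support_size-1.

    Counts at or above support_size are not represented in the output,
    so choose support_size larger than the maximum observed count.
    """
    tally = Counter(counts)
    n = len(counts)
    return [tally.get(k, 0) / n for k in range(support_size)]


def kl_divergence(p, q, eps=1e-12):
    """KLD(p || q) in nats; q is floored at eps (an assumption) to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, bounded in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s / 2.0)


if __name__ == "__main__":
    # Toy example: observed counts for one gene versus a hypothetical model prediction.
    observed = empirical_pmf([0, 0, 1, 1, 1, 2, 3, 0], support_size=4)
    predicted = [0.4, 0.3, 0.2, 0.1]
    print(kl_divergence(observed, predicted), hellinger_distance(observed, predicted))
```

Lower values of either score indicate a predicted distribution closer to the observed one. The Hellinger distance is symmetric and bounded, whereas the KLD is neither.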
Nature Methods | Volume 21 | August 2024 | 1466–1469
Brief Communication
https://doi.org/10.1038/s41592-024-02365-9
Maria Carilli¹,⁷, Gennady Gorin²,⁶,⁷, Yongin Choi³,⁴, Tara Chari¹ & Lior Pachter¹,⁵
Advances in experimental methods for single-cell RNA sequencing (scRNA-seq) allow for the simultaneous quantification of multiple cellular species, such as nascent and mature transcriptomes¹,², surface³–⁵ and nuclear⁶ proteomes, and chromatin accessibility⁷,⁸. While these datasets enable insight into cell type and state in development and disease, joint analyses of distinct modalities remain challenging. We show that principled biophysical ‘integration’ of multimodal datasets can be achieved through parameterization of interpretable mechanistic models⁹, scalable to thousands of genes across tens of thousands of cells¹⁰.
Recent approaches to integrate and reduce the dimensionality of multimodal single-cell genomics data have leveraged advances in machine learning¹¹–¹³. For example, the popular tool scVI is a variational autoencoder that uses neural networks to encode scRNA-seq counts to a low-dimensional representation. This is decoded by another neural network to cell- and gene-specific parameters for conditional likelihood distributions of observed counts¹⁴. These distributions are chosen post hoc to be consistent with the discrete, over-dispersed nature of scRNA-seq counts, but can be derived from biophysical models (Methods). Extensions of scVI for protein¹¹ and chromatin measurements¹⁵ jointly encode data modalities to a single latent space, then employ two decoding networks to produce parameters for independent conditional likelihoods specific to each datatype. Nascent and mature transcripts, available by realigning existing scRNA-seq reads¹,², could be similarly treated (Fig. 1a); however, using independent conditional likelihoods for bimodal measurements derived from the same gene ignores their inherent causality and has no biophysical basis; the generative model is a ‘black box’ representation to summarize data.
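To make the 'independent conditional likelihoods' treatment concrete, the sketch below evaluates a per-gene log-likelihood in which nascent and mature counts are scored by two separate negative binomial distributions. The negative binomial is a common choice for discrete, over-dispersed counts and is assumed here for illustration. The mean and inverse-dispersion parameterization and the function names are likewise assumptions rather than the scVI API.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu and
    inverse dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def independent_log_likelihood(nascent, mature, mu_n, mu_m, theta_n, theta_m):
    """Sum of two unrelated NB terms: the joint factorizes, so nothing couples
    the nascent and mature counts of the same gene."""
    return nb_log_pmf(nascent, mu_n, theta_n) + nb_log_pmf(mature, mu_m, theta_m)


if __name__ == "__main__":
    # Hypothetical decoder outputs for one gene in one cell.
    print(independent_log_likelihood(nascent=3, mature=7,
                                     mu_n=2.5, mu_m=6.0,
                                     theta_n=1.5, theta_m=1.5))
```

Because the log-likelihood is a plain sum, the four parameters are free to vary independently. This is the lack of mechanistic coupling that the text criticizes.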
Nevertheless, good causal model candidates are available (Fig. 1b and Supplementary Figs. 1 and 2). For example, Fig. 1b illustrates the extensively validated¹⁶–¹⁸ bursty model of transcription. While the joint steady-state distribution induced by the bursty model is analytically intractable¹⁹, we have previously shown that it can be approximated by a set of basis functions with neural network-learned weights²⁰.
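Although the joint steady-state distribution has no closed form, the bursty model is straightforward to simulate. The following is a minimal Gillespie-style sketch under commonly used assumptions: bursts arrive at rate `k`, each burst adds a geometrically distributed number of nascent molecules with mean `b`, nascent molecules are spliced at rate `beta`, and mature molecules are degraded at rate `gamma`. The symbols and the function name are assumptions chosen for illustration and may not match the notation of Fig. 1b.

```python
import math
import random


def simulate_bursty(k, b, beta, gamma, t_end, seed=0):
    """Simulate nascent (n) and mature (m) counts up to time t_end.

    Returns the final (n, m) and the time-averaged means of n and m.
    """
    rng = random.Random(seed)
    p = 1.0 / (1.0 + b)  # geometric burst sizes on {0, 1, 2, ...} with mean b
    t, n, m = 0.0, 0, 0
    area_n = area_m = 0.0
    while True:
        rates = (k, beta * n, gamma * m)  # burst, splicing, degradation
        total = sum(rates)
        dt = rng.expovariate(total)
        if t + dt > t_end:
            # Accumulate the final partial interval and stop.
            area_n += n * (t_end - t)
            area_m += m * (t_end - t)
            break
        area_n += n * dt
        area_m += m * dt
        t += dt
        u = rng.random() * total
        # The extra conditions guard against floating-point edge cases.
        if u < rates[0] or (n == 0 and m == 0):
            n += int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
        elif u < rates[0] + rates[1] or m == 0:
            n -= 1
            m += 1
        else:
            m -= 1
    return n, m, area_n / t_end, area_m / t_end


if __name__ == "__main__":
    print(simulate_bursty(k=2.0, b=5.0, beta=1.0, gamma=0.5, t_end=2000.0, seed=1))
```

Under these assumptions the long-run means are `k * b / beta` for nascent and `k * b / gamma` for mature counts. This gives a quick sanity check on a simulated trajectory.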
We introduce biVI, a strategy that adapts scVI to work with well-characterized stochastic models of transcription. We propose models, formalized by chemical master equations (CMEs), for RNA lifecycles, then use the bivariate, CME-derived distribution as the conditional data likelihood distribution for nascent and mature counts (Fig. 1c). The inferred conditional likelihood parameters thus have biophysical interpretations as part of a mechanistic model of transcription, moving beyond associational analyses to fit biophysical values that parameterize causal relationships based on known transcriptional dynamics¹⁹. The likelihood distributions cannot be obtained solely from the data distributions but require some ‘knowledge of the data-generating process’²¹.
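For orientation, one standard way to write a CME for a bursty model with nascent count n and mature count m is sketched below. The notation is an assumption for illustration: burst frequency k, mean burst size b with geometric burst-size probabilities g_j, splicing rate β and degradation rate γ, with P taken to be zero whenever an argument is negative. The exact parameterization used in the paper's Methods may differ.

```latex
\frac{\mathrm{d}P(n,m,t)}{\mathrm{d}t}
  = k\left[\sum_{j=0}^{n} g_j\,P(n-j,m,t) - P(n,m,t)\right]
  + \beta\left[(n+1)\,P(n+1,m-1,t) - n\,P(n,m,t)\right]
  + \gamma\left[(m+1)\,P(n,m+1,t) - m\,P(n,m,t)\right],
\qquad
g_j = \frac{1}{1+b}\left(\frac{b}{1+b}\right)^{j}.
```

The steady-state solution of such an equation, evaluated at the observed (n, m), is the kind of bivariate distribution that can serve as a conditional likelihood in place of two independent terms.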
Received: 2 May 2023
Accepted: 27 June 2024
Published online: 25 July 2024
¹Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA. ²Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA. ³Department of Biomedical Engineering, University of California, Davis, Davis, CA, USA. ⁴Genome Center, University of California, Davis, Davis, CA, USA. ⁵Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. ⁶Present address: Fauna Bio, Emeryville, CA, USA. ⁷These authors contributed equally: Maria Carilli, Gennady Gorin.
e-mail: lpachter@caltech.edu
... Given recent works combining machine learning with CME-based models 59,60 , integration of meK-means with machine learning could enable simultaneous gene selection or K optimization, as well as single-cell and single-gene parameter resolution. A machine-learning-based framework could also improve runtime and scalability of meK-means. ...
... The runtime remains between 5-10 min per dataset for 100 to 100,000 cells ( Supplementary Fig. 9); however, the analytical solution to the CME model requires storage of an array of size Ω (where Ω is a finite subdomain determined by maximum molecular counts observed for a gene) and a time complexity of O(ΩlogΩ) 61 . Machine learning extensions would enable parameter inference for models without exact solutions 59,60 , and greater capacity to test or reject transcription models 62 . ...
... Article https://doi.org/10.1038/s43588-024-00689-2 com/resources/datasets/10-k-pbm-cs-from-a-healthy-donorv-3-chemistry-3-standard-3-0-0), the mouse germ cell dataset 28 , and developing mouse brain data 1 (for runtime metrics), counts were already processed in Carilli et al. 59 and Gorin and Pachter 38 , Mayère et al. 28 , and La Manno et al. 1 respectively. For all other datasets, counts were generated with kallisto | bustools 0.26.0 67,68 from the FASTQ files. ...
Article
Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and assumptions ignoring inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means) which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells with nascent and mature mRNA measurements, utilizing the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which induce the observed molecule counts, and provides an alternative definition for ‘clusters’ through the governing parameters of cellular processes.
... There are also substantial routes to strengthening our mathematical formalism: we have not yet considered cell-cell interactions or cell cycle effects -of which will be the subject of future work. Other groups have also proposed mechanisms to combine biophysical modelling with deep learning frameworks (Carilli et al., 2024), suggesting we are not the only group thinking in this manner. ...
Preprint
Despite substantial efforts, deep learning has not yet delivered a transformative impact on elucidating regulatory biology, particularly in the realm of predicting gene expression profiles. Here, we argue that genuine "foundation models" of regulatory biology will remain out of reach unless guided by frameworks that integrate mechanistic insight with principled experimental design. We present one such ground-up, semi-mechanistic framework that unifies perturbation-based experimental designs across both in vitro and in vivo CRISPR screens, accounting for differentiating and non-differentiating cellular systems. By revealing previously unrecognised assumptions in published machine learning methods, our approach clarifies links with popular techniques such as variational autoencoders and structural causal models. In practice, this framework suggests a modified loss function that we demonstrate can improve predictive performance, and further suggests an error analysis that informs batching strategies. Ultimately, since cellular regulation emerges from innumerable interactions amongst largely uncharted molecular components, we contend that systems-level understanding cannot be achieved through structural biology alone. Instead, we argue that real progress will require a first-principles perspective on how experiments capture biological phenomena, how data are generated, and how these processes can be reflected in more faithful modelling architectures.
... A variety of mechanisms underpin this variability, including the stochastic binding and unbinding of transcription factors (TFs) 1 and RNA polymerase (RNAP) 2 , gene dosage effects, and partitioning of macromolecules upon cell division 3,4 , among many others 5,6 . Conversely, measurements of gene expression variability can be used to shed light on these cellular processes [7][8][9] , although attributing observed variability to specific mechanisms can be challenging. Among the biological implications of gene expression variability, one of the most prominent is "bet-hedging," the idea that subpopulations of cells exhibiting altered gene expression states may be better prepared to survive rapid and unpredictable shifts in environmental conditions 10,11 . ...
Article
The rate at which transcription factors (TFs) bind their cognate sites has long been assumed to be limited by diffusion, and thus independent of binding site sequence. Here, we systematically test this assumption using cell-to-cell variability in gene expression as a window into the in vivo association and dissociation kinetics of the model transcription factor LacI. Using a stochastic model of the relationship between gene expression variability and binding kinetics, we performed single-cell gene expression measurements to infer association and dissociation rates for a set of 35 different LacI binding sites. We found that both association and dissociation rates differed significantly between binding sites, and moreover observed a clear anticorrelation between these rates across varying binding site strengths. These results contradict the long-standing hypothesis that TF binding site strength is primarily dictated by the dissociation rate, but may confer the evolutionary advantage that TFs do not get stuck in near-operator sequences while searching.
... To gain greater resolution on RNA metabolism, we turned to a recently developed computational tool, biVI, that can infer sizes of transcriptional bursts, rates of RNA degradation, and rates of RNA splicing. 60 By integrating biophysical modeling, biVI uses transcript counts to infer rates of RNA metabolism from the distributions of nascent and mature RNA transcripts. Using biVI, we find that compared to MEFs and iMNs, hyperproliferative cells are estimated to have reduced burst sizes, rates of degradation, and rates of splicing, suggesting a global reduction in RNA metabolism (Figure 5g-i). ...
Preprint
While transcription factors (TFs) provide essential cues for directing and redirecting cell fate, TFs alone are insufficient to drive cells to adopt alternative fates. Rather, transcription factors rely on receptive cell states to induce novel identities. Cell state emerges from and is shaped by cellular history and the activity of diverse processes. Here, we define the cellular and molecular properties of a highly receptive state amenable to transcription factor-mediated direct conversion from fibroblasts to induced motor neurons. Using a well-defined model of direct conversion to a post-mitotic fate, we identify the highly proliferative, receptive state that transiently emerges during conversion. Through examining chromatin accessibility, histone marks, and nuclear features, we find that cells reprogram from a state characterized by global reductions in nuclear size and transcriptional activity. Supported by globally increased levels of H3K27me3, cells enter a quiescent-like state of reduced RNA metabolism and elevated expression of REST and p27, markers of quiescent neural stem cells. From this transient state, cells convert to neurons at high rates. Inhibition of Ezh2, the catalytic subunit of PRC2 that deposits H3K27me3, abolishes conversion. Our work offers a roadmap to identify global changes in cellular processes that define cells with different conversion potentials that may generalize to other cell-fate transitions. Highlights Proliferation drives cells to a compact nuclear state that is receptive to TF-mediated conversion. Increased receptivity to TFs corresponds to reduced nuclear volumes. Reprogrammable cells display global, genome-wide increases in H3K27me3. High levels of H3K27me3 support cells’ transits through a state of altered RNA metabolism. Inhibition of Ezh2 increases nuclear size, reduces the expression of the quiescence marker p27. Acute inhibition of Ezh2 abolishes motor neuron conversion. 
One Sentence Summary: Cells transit through a quiescent-like state characterized by global reductions in nuclear size and transcriptional activity to convert to neurons at high rates.
Article
There is a growing interest in generating bimodal, single-cell RNA sequencing (RNA-seq) data for studying biological pathways. These data are predominantly utilized in understanding phenotypic trajectories using RNA velocities; however, the shape information encoded in the two-dimensional resolution of such data is not yet exploited. In this paper, we present an elliptical parametrization of two-dimensional RNA-seq data, from which we derived statistics that reveal four different modalities. These modalities can be interpreted as manifestations of the changes in the rates of splicing, transcription or degradation. We performed our analysis on a cell cycle and a colorectal cancer dataset. In both datasets, we found genes that are not picked up by differential gene expression analysis (DGEA), and are consequently unnoticed, yet visibly delineate phenotypes. This indicates that, in addition to DGEA, searching for genes that exhibit the discovered modalities could aid recovering genes that set phenotypes apart. For communities studying biomarkers and cellular phenotyping, the modalities present in bimodal RNA-seq data broaden the search space of genes, and furthermore, allow for incorporating cellular RNA processing into regulatory analyses.
Article
In single-cell and single-nucleus RNA sequencing (RNA-seq), the coexistence of nascent (unprocessed) and mature (processed) messenger RNA (mRNA) poses challenges in accurate read mapping and the interpretation of count matrices. The traditional transcriptome reference, defining the “region of interest” in bulk RNA-seq, restricts its focus to mature mRNA transcripts. This restriction leads to two problems: reads originating outside of the “region of interest” are prone to mismapping within this region, and additionally, such external reads cannot be matched to specific transcript targets. Expanding the “region of interest” to encompass both nascent and mature mRNA transcript targets provides a more comprehensive framework for RNA-seq analysis. Here, we introduce the concept of distinguishing flanking k-mers (DFKs) to improve mapping of sequencing reads. We have developed an algorithm to identify DFKs, which serve as a sophisticated “background filter”, enhancing the accuracy of mRNA quantification. This dual strategy of an expanded region of interest coupled with the use of DFKs enhances the precision in quantifying both mature and nascent mRNA molecules, as well as in delineating reads of ambiguous status.
Article
The term 'RNA-seq' refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, single cells or single nuclei. The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data. Execution of this protocol requires basic familiarity with a command line environment. With this protocol, quantification of a moderately sized RNA-seq dataset can be completed within minutes.
Preprint
The mammalian nucleus is compartmentalized by diverse subnuclear structures. These subnuclear structures, marked by nuclear bodies and histone modifications, are often cell-type specific and affect gene regulation and 3D genome organization 1–3 . Understanding nuclear organization requires identifying the molecular constituents of subnuclear structures and mapping their associations with specific genomic loci in individual cells, within complex tissues. Here, we introduce two-layer DNA seqFISH+, which allows simultaneous mapping of 100,049 genomic loci, together with nascent transcriptome for 17,856 genes and a diverse set of immunofluorescently labeled subnuclear structures all in single cells in cell lines and adult mouse cerebellum. Using these multi-omics datasets, we showed that repressive chromatin compartments are more variable by cell type than active compartments. We also discovered a single exception to this rule: an RNA polymerase II (RNAPII)-enriched compartment was associated with long, cell-type specific genes (> 200kb), in a manner distinct from nuclear speckles. Further, our analysis revealed that cell-type specific facultative and constitutive heterochromatin compartments marked by H3K27me3 and H4K20me3 are enriched at specific genes and gene clusters, respectively, and shape radial chromosomal positioning and inter-chromosomal interactions in neurons and glial cells. Together, our results provide a single-cell high-resolution multi-omics view of subnuclear compartments, associated genomic loci, and their impacts on gene regulation, directly within complex tissues.
Article
Single-cell RNA sequencing data can be modeled using Markov chains to yield genome-wide insights into transcriptional physics. However, quantitative inference with such data requires careful assessment of noise sources. We find that long pre-mRNA transcripts are over-represented in sequencing data. To explain this trend, we propose a length-based model of capture bias, which may produce false positive observations. We solve this model, and use it to find concordant parameter trends, as well as systematic, mechanistically interpretable technical and biological differences in paired datasets.
Article
Single-cell multimodal sequencing technologies are developed to simultaneously profile different modalities of data in the same cell. It provides a unique opportunity to jointly analyze multimodal data at the single-cell level for the identification of distinct cell types. A correct clustering result is essential for the downstream complex biological functional studies. However, combining different data sources for clustering analysis of single-cell multimodal data remains a statistical and computational challenge. Here, we develop a novel multimodal deep learning method, scMDC, for single-cell multi-omics data clustering analysis. scMDC is an end-to-end deep model that explicitly characterizes different data sources and jointly learns latent features of deep embedding for clustering analysis. Extensive simulation and real-data experiments reveal that scMDC outperforms existing single-cell single-modal and multimodal clustering methods on different single-cell multimodal datasets. The linear scalability of running time makes scMDC a promising method for analyzing large multimodal datasets.
Article
Single-cell ATAC sequencing (scATAC-seq) is a powerful and increasingly popular technique to explore the regulatory landscape of heterogeneous cellular populations. However, the high noise levels, degree of sparsity, and scale of the generated data make its analysis challenging. Here, we present PeakVI, a probabilistic framework that leverages deep neural networks to analyze scATAC-seq data. PeakVI fits an informative latent space that preserves biological heterogeneity while correcting batch effects and accounting for technical effects, such as library size and region-specific biases. In addition, PeakVI provides a technique for identifying differential accessibility at a single-region resolution, which can be used for cell-type annotation as well as identification of key cis-regulatory elements. We use public datasets to demonstrate that PeakVI is scalable, stable, robust to low-quality data, and outperforms current analysis methods on a range of critical analysis tasks. PeakVI is publicly available and implemented in the scvi-tools framework.
Article
Single-cell RNA-seq and single-cell ATAC-seq technologies are used extensively to create cell type atlases for a wide range of organisms, tissues, and disease processes. To increase the scale of these atlases, lower the cost, and pave the way for more specialized multi-ome assays, custom droplet microfluidics may provide solutions complementary to commercial setups. We developed HyDrop, a flexible and open-source droplet microfluidic platform encompassing three protocols. The first protocol involves creating dissolvable hydrogel beads with custom oligos that can be released in the droplets. In the second protocol, we demonstrate the use of these beads for HyDrop-ATAC, a low-cost non-commercial scATAC-seq protocol in droplets. After validating HyDrop-ATAC, we applied it to flash-frozen mouse cortex and generated 7,996 high-quality single-cell chromatin accessibility profiles in a single run. In the third protocol, we adapt both the reaction chemistry and the capture sequence of the barcoded hydrogel bead to capture mRNA, and demonstrate a significant improvement in throughput and sensitivity compared to previous open-source droplet-based scRNA-seq assays (Drop-seq and inDrop). Similarly, we applied HyDrop-RNA to flash-frozen mouse cortex and generated 9,508 single-cell transcriptomes closely matching reference single-cell gene expression data. Finally, we leveraged HyDrop-RNA's high capture rate to analyse a small population of FAC-sorted neurons from the Drosophila brain, confirming the protocol's applicability to low-input samples and small cells. HyDrop is currently capable of generating single-cell data in high throughput and at a reduced cost compared to commercial methods, and we envision that HyDrop can be further developed to be compatible with novel (multi-) omics protocols.
Article
Single-cell transcriptomics can provide quantitative molecular signatures for large, unbiased samples of the diverse cell types in the brain1–3. With the proliferation of multi-omics datasets, a major challenge is to validate and integrate results into a biological understanding of cell-type organization. Here we generated transcriptomes and epigenomes from more than 500,000 individual cells in the mouse primary motor cortex, a structure that has an evolutionarily conserved role in locomotion. We developed computational and statistical methods to integrate multimodal data and quantitatively validate cell-type reproducibility. The resulting reference atlas—containing over 56 neuronal cell types that are highly replicable across analysis methods, sequencing technologies and modalities—is a comprehensive molecular and genomic account of the diverse neuronal and non-neuronal cell types in the mouse primary motor cortex. The atlas includes a population of excitatory neurons that resemble pyramidal cells in layer 4 in other cortical regions⁴. We further discovered thousands of concordant marker genes and gene regulatory elements for these cell types. Our results highlight the complex molecular regulation of cell types in the brain and will directly enable the design of reagents to target specific cell types in the mouse primary motor cortex for functional analysis.
Article
Identifying gene-regulatory targets of nuclear proteins in tissues is a challenge. Here we describe intranuclear cellular indexing of transcriptomes and epitopes (inCITE-seq), a scalable method that measures multiplexed intranuclear protein levels and the transcriptome in parallel across thousands of nuclei, enabling joint analysis of transcription factor (TF) levels and gene expression in vivo. We apply inCITE-seq to characterize cell state-related changes upon pharmacological induction of neuronal activity in the mouse brain. Modeling gene expression as a linear combination of quantitative protein levels revealed genome-wide associations of each TF and recovered known gene targets. TF-associated genes were coexpressed as distinct modules that each reflected positive or negative TF levels, showing that our approach can disentangle relative putative contributions of TFs to gene expression and add interpretability to inferred gene networks. inCITE-seq can illuminate how combinations of nuclear proteins shape gene expression in native tissue contexts, with direct applications to solid or frozen tissues and clinical specimens.
Article
To what extent do cell-to-cell differences in transcription rate affect RNA copy number distributions, and what can this variation tell us about biological processes underlying transcription? We argue that successfully answering such questions requires quantitative models that are both interpretable (describing concrete biophysical phenomena) and tractable (amenable to mathematical analysis); in particular, such models enable the identification of experiments which best discriminate between competing hypotheses. As a proof of principle, we introduce a simple but flexible class of models involving a stochastic transcription rate (governed by a stochastic differential equation) coupled to a discrete stochastic RNA transcription and splicing process, and compare and contrast two biologically plausible hypotheses about observed transcription rate variation. One hypothesis assumes transcription rate variation is due to DNA experiencing mechanical strain and relaxation, while the other assumes that variation is due to fluctuations in the number of an abundant regulator. Through a thorough mathematical analysis, we show that these two models are challenging to distinguish: properties like first- and second-order moments, autocorrelations, and several limiting distributions are shared. However, our analysis also points to the experiments which best discriminate between them. Our work illustrates the importance of theory-guided data collection in general, and multimodal single-molecule data in particular for distinguishing between competing hypotheses. We use this theoretical case study to introduce and motivate a general framework for constructing and solving such nontrivial continuous-discrete models.