Nature Methods | Volume 21 | August 2024 | 1466–1469
Brief Communication
https://doi.org/10.1038/s41592-024-02365-9
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data
Maria Carilli 1,7, Gennady Gorin 2,6,7, Yongin Choi 3,4, Tara Chari 1 &
Lior Pachter 1,5
Here we present biVI, which combines the variational autoencoder
framework of scVI with biophysical models describing the transcription
and splicing kinetics of RNA molecules. We demonstrate on simulated
and experimental single-cell RNA sequencing data that biVI retains
the variational autoencoder’s ability to capture cell type structure in a
low-dimensional space while further enabling genome-wide exploration of
the biophysical mechanisms, such as system burst sizes and degradation
rates, that underlie observations.
Advances in experimental methods for single-cell RNA sequencing
(scRNA-seq) allow for the simultaneous quantification of multiple cellular
species, such as nascent and mature transcriptomes1,2, surface3–5 and
nuclear6 proteomes, and chromatin accessibility7,8. While these datasets
enable insight into cell type and state in development and disease, joint
analyses of distinct modalities remain challenging. We show that principled
biophysical ‘integration’ of multimodal datasets can be achieved through
parameterization of interpretable mechanistic models9, scalable to thousands
of genes across tens of thousands of cells10.
Recent approaches to integrate and reduce the dimensionality of multimodal
single-cell genomics data have leveraged advances in machine learning11–13.
For example, the popular tool scVI is a variational autoencoder that uses
neural networks to encode scRNA-seq counts to a low-dimensional
representation. This representation is decoded by another neural network to
cell- and gene-specific parameters for conditional likelihood distributions
of observed counts14. These distributions are chosen post hoc to be
consistent with the discrete, over-dispersed nature of scRNA-seq counts, but
can be derived from biophysical models (Methods). Extensions of scVI for
protein11 and chromatin measurements15 jointly encode data modalities to a
single latent space, then employ two decoding networks to produce parameters
for independent conditional likelihoods specific to each datatype. Nascent
and mature transcripts, available by realigning existing scRNA-seq
reads1,2, could be similarly treated (Fig. 1a). However, using independent
conditional likelihoods for bimodal measurements derived from the same gene
ignores their inherent causality and has no biophysical basis: the
generative model is a ‘black box’ representation to summarize data.
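To make the independent-likelihood treatment concrete, the following is a minimal sketch rather than the scVI implementation. It assumes negative binomial likelihoods in a mean / inverse-dispersion parameterization; the text above only requires the distributions to be discrete and over-dispersed. The function names, the `params` layout and the toy numbers are hypothetical stand-ins for decoder output. Nascent and mature counts of the same gene are scored by two unrelated distributions whose log-probabilities are simply added.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def independent_log_likelihood(nascent, mature, params):
    """Sum per-gene log-likelihoods, treating the two modalities independently.

    `params` is a hypothetical stand-in for decoder output: one
    (mu_nascent, theta_nascent, mu_mature, theta_mature) tuple per gene.
    """
    total = 0.0
    for n, m, (mu_n, th_n, mu_m, th_m) in zip(nascent, mature, params):
        # No term couples n and m: this is the independence assumption.
        total += nb_log_pmf(n, mu_n, th_n) + nb_log_pmf(m, mu_m, th_m)
    return total


# Toy example: one cell, two genes, made-up counts and parameters.
nascent_counts = [3, 0]
mature_counts = [12, 1]
decoder_output = [(2.5, 1.0, 10.0, 2.0), (0.4, 0.8, 1.5, 1.2)]
print(independent_log_likelihood(nascent_counts, mature_counts, decoder_output))
```

Because the per-gene log-likelihood contains no term that couples the nascent and mature counts, nothing in this sketch reflects that mature transcripts are produced from nascent ones.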
Nevertheless, good causal model candidates are available (Fig. 1b and
Supplementary Figs. 1 and 2). For example, Fig. 1b illustrates the
extensively validated16–18 bursty model of transcription. While the joint
steady-state distribution induced by the bursty model is analytically
intractable19, we have previously shown that it can be approximated by a set
of basis functions with neural network-learned weights20.
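Although the joint distribution has no closed form, samples from a bursty model are easy to generate. The sketch below is a plain Gillespie simulation, not the basis-function approximation referenced above. It assumes a common formulation of the model: bursts arrive at rate `k`, each burst adds a geometrically distributed number of nascent transcripts with mean `b`, nascent transcripts are spliced into mature ones at rate `beta`, and mature transcripts are degraded at rate `gamma`. These symbols, the function names and the parameter values are illustrative assumptions.

```python
import math
import random


def sample_burst_size(rng, b):
    """Draw a geometric burst size on {0, 1, 2, ...} with mean b."""
    q = b / (1.0 + b)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(q))


def simulate_cell(rng, k, b, beta, gamma, t_end):
    """Gillespie simulation of one cell; returns (nascent, mature) at t_end."""
    t, n, m = 0.0, 0, 0
    while True:
        # Propensities of bursting, splicing and degradation.
        a_burst, a_splice, a_degrade = k, beta * n, gamma * m
        a_total = a_burst + a_splice + a_degrade
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            return n, m
        u = rng.random() * a_total  # pick a reaction proportionally to its rate
        if u < a_burst:
            n += sample_burst_size(rng, b)
        elif u < a_burst + a_splice:
            n -= 1
            m += 1
        elif m > 0:
            m -= 1


# Hypothetical parameter values chosen only for illustration.
rng = random.Random(0)
cells = [simulate_cell(rng, k=1.0, b=5.0, beta=1.0, gamma=0.5, t_end=20.0)
         for _ in range(1000)]
mean_nascent = sum(n for n, _ in cells) / len(cells)
mean_mature = sum(m for _, m in cells) / len(cells)
# At steady state the model means are k*b/beta (nascent) and k*b/gamma (mature).
print(mean_nascent, mean_mature)
```

In this formulation every mature transcript was first a nascent one, so the two counts of a gene are coupled by construction.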
We introduce biVI, a strategy that adapts scVI to work with
well-characterized stochastic models of transcription. We propose models,
formalized by chemical master equations (CMEs), for RNA lifecycles, then use
the bivariate, CME-derived distribution as the conditional data likelihood
distribution for nascent and mature counts (Fig. 1c). The inferred
conditional likelihood parameters thus have biophysical interpretations as
part of a mechanistic model of transcription, moving beyond associational
analyses to fit biophysical values that parameterize causal relationships
based on known transcriptional dynamics19. The likelihood distributions
cannot be obtained solely from the data distributions but require some
‘knowledge of the data-generating process’21.
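As an illustration of what such a CME looks like, one common way to write the bursty model is given below. The notation is an assumption made here for illustration and is not taken from this section: P(n, m, t) is the probability of observing n nascent and m mature transcripts at time t, k is the burst frequency, b the mean of a geometric burst-size distribution, β the splicing rate and γ the degradation rate. Terms with a negative argument are taken to be zero.

```latex
\[
\frac{dP(n,m,t)}{dt} =
  k\left[\sum_{j=0}^{n} p_j\,P(n-j,m,t) - P(n,m,t)\right]
  + \beta\left[(n+1)\,P(n+1,m-1,t) - n\,P(n,m,t)\right]
  + \gamma\left[(m+1)\,P(n,m+1,t) - m\,P(n,m,t)\right],
\qquad
p_j = \frac{1}{1+b}\left(\frac{b}{1+b}\right)^{j}.
\]
```

The three bracketed terms correspond to bursty production of nascent transcripts, conversion of nascent to mature transcripts, and degradation of mature transcripts. The steady-state solution of an equation of this kind is what would serve as the bivariate conditional likelihood.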
Received: 2 May 2023
Accepted: 27 June 2024
Published online: 25 July 2024
1Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA. 2Division of Chemistry and Chemical Engineering,
California Institute of Technology, Pasadena, CA, USA. 3Department of Biomedical Engineering, University of California, Davis, Davis, CA, USA. 4Genome
Center, University of California, Davis, Davis, CA, USA. 5Department of Computing and Mathematical Sciences, California Institute of Technology,
Pasadena, CA, USA. 6Present address: Fauna Bio, Emeryville, CA, USA. 7These authors contributed equally: Maria Carilli, Gennady Gorin.
e-mail: lpachter@caltech.edu