Yongin Choi’s research while affiliated with University of California, Davis and other places


Publications (7)


sciLaMA: A Single-Cell Representation Learning Framework to Leverage Prior Knowledge from Large Language Models
  • Preprint

February 2025 · 11 Reads

Hongru Hu · Shuwen Zhang · Yongin Choi · [...] · Gerald Quon

Abstract: Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models like Variational Auto-Encoder (VAE) and its variants struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) face limitations in computational cost and applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification, while maintaining computational efficiency. sciLaMA offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery.


BiVI reinterprets and extends scVI to infer biophysical parameters
a, scVI can take in concatenated nascent (N) and mature (M) RNA count matrices, encode each cell with neural networks (NN) to a low-dimensional space z and learn per-cell parameters and per-gene parameters for independent nascent and mature count distributions, which are by default negative binomial distributions (PNB). This is not motivated by a biophysical model. b, The telegraph model of transcription: a gene locus has the on rate k, the off rate koff and the RNA polymerase binding rate kRNAP. Nascent RNA molecules are produced in geometrically distributed bursts with mean b = kRNAP/koff, which are spliced and degraded at rates β and γ, respectively. The model’s steady-state distribution can be approximated by a pretrained neural network $\mathcal{F}$ and a set of basis functions (Methods). c, biVI takes in nascent and mature count matrices, produces a low-dimensional representation for each cell and outputs per-cell and per-gene parameters for a mechanistically motivated joint distribution of nascent and mature counts.
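The burst relation in panel b can be made concrete with a little arithmetic. The sketch below assumes the bursty limit of the telegraph model, in which k acts as the burst frequency, so transcripts are produced at an average rate of k·b. The function name and the example rate values are hypothetical and are not taken from the paper.

```python
def bursty_steady_state(k, k_off, k_rnap, beta, gamma):
    """Steady-state summary statistics for a bursty transcription model.

    Assumption: bursts arrive at frequency k and have geometric sizes with
    mean b = k_rnap / k_off; nascent RNA is spliced at rate beta and mature
    RNA is degraded at rate gamma.
    """
    if min(k, k_off, k_rnap, beta, gamma) <= 0:
        raise ValueError("all rates must be positive")
    b = k_rnap / k_off          # mean burst size
    production = k * b          # average transcripts produced per unit time
    return {
        "burst_size": b,
        "nascent_mean": production / beta,   # production balanced by splicing
        "mature_mean": production / gamma,   # production balanced by degradation
        "nascent_fano": 1.0 + b,             # variance / mean of nascent counts
    }


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
stats = bursty_steady_state(k=0.5, k_off=2.0, k_rnap=20.0, beta=1.0, gamma=0.25)
print(stats)
```

With these illustrative rates, a slower degradation rate γ raises the mature mean without changing the burst size, which is the kind of distinction between parameters that the caption describes.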
BiVI fits single-cell data from mouse primary motor cortex (Allen sample B08) and suggests the biophysical basis for expression differences
a,b, Observed, scVI and biVI reconstructed distributions of Foxp2 (a), a marker gene for L6 CT cells, and Rorb (b), a marker gene for L5 IT cells, restricted to respective cell type. Kullback–Leibler divergence (KLD) and Hellinger distance (HD) between empirical (observed) and predicted distributions are indicated. c,d, Cell-specific parameters inferred for Foxp2 (c) and Rorb (d) demonstrate identifiable differences in means and parameters in the marked cell types (1,333 L6 CT cells (blue) and 2,395 L5 IT cells (purple) of 6,398 total illustrated cells). e, Cell subclasses show different modulation patterns, especially pronounced in non-neuronal cells (fractions of 2,000 highly variable genes that exhibit differences in each parameter (top); number of cells in each subclass (bottom)). f, biVI allows the identification of cells that exhibit differences in burst size or relative degradation rate, without detectable differences in mature mean expression. Cell types are as defined and labeled in ref. ²⁴: GABAergic neurons that are marked by the genes Lamp5 (Lamp5), Sncg (Sncg), Vip (Vip), Sst (Sst), Pvalb (Pvalb), layer 2/3 intratelencephalic neurons (L2/3 IT), layer 5 intratelencephalic neurons (L5 IT), layer 5/6 near-projecting neurons (L5/6 NP), layer 6 corticothalamic neurons (L6 CT), layer 6 intratelencephalic neurons (L6 IT), layer 6b neurons (L6b), astrocytes (Astro), oligodendrocyte precursor cells (OPC), oligodendrocytes (Oligo), macrophages (Macrophage) and endothelial cells (Endo). g, Histograms of biVI parameters and scVI mature means for genes that exhibit modulation in biVI parameters (degradation rate in L5 IT cells for Trem2 (top); burst size in L6 CT cells for Ndnf (bottom)) but no identifiable mature mean modulation.
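The two goodness-of-fit scores named in panels a and b can be illustrated on a pair of discrete count histograms. This is a minimal sketch using the standard definitions of both quantities. The histograms are made up, and the choice of natural logarithms and of a small `eps` floor are assumptions, not details confirmed by the paper.

```python
import math


def normalize(counts):
    """Turn a histogram of non-negative counts into probabilities."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero predicted probability."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def hellinger_distance(p, q):
    """Hellinger distance between two distributions, bounded by 0 and 1."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)


observed = normalize([40, 30, 20, 10])   # hypothetical empirical histogram
predicted = normalize([35, 30, 25, 10])  # hypothetical model reconstruction
print(kl_divergence(observed, predicted), hellinger_distance(observed, predicted))
```

Both scores are zero when the reconstruction matches the observed histogram exactly, so lower values indicate a closer fit.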
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

July 2024 · 48 Reads · 16 Citations

Nature Methods

Here we present biVI, which combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. We demonstrate on simulated and experimental single-cell RNA sequencing data that biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.


Deciphering the History of ERK Activity from Fixed-Cell Immunofluorescence Measurements

February 2024 · 35 Reads

The Ras/ERK pathway drives cell proliferation and other oncogenic behaviors, and quantifying its activity in situ is of high interest in cancer diagnosis and therapy. Pathway activation is often assayed by measuring phosphorylated ERK. However, this form of measurement overlooks dynamic aspects of signaling that can only be observed over time. In this study, we combine a live, single-cell ERK biosensor approach with multiplexed immunofluorescence staining of downstream target proteins to ask how well immunostaining captures the dynamic history of ERK activity. Combining linear regression, machine learning, and differential equation models, we develop an interpretive framework for immunostains, in which Fra-1 and pRb levels imply long-term activation of ERK signaling, while Egr-1 and c-Myc indicate recent activation. We show that this framework can distinguish different classes of ERK dynamics within a heterogeneous population, providing a tool for annotating ERK dynamics within fixed tissues.
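One way to see why different targets could report different time windows of ERK activity is a first-order model, dP/dt = d·(E(t) − P), in which a target P relaxes toward the ERK input E at its turnover rate d. The sketch below is an illustrative assumption and not the authors' model. The function name, time step, ERK traces, and turnover rates are all hypothetical.

```python
import math


def target_level(erk_trace, dt, decay_rate, initial=0.0):
    """Integrate dP/dt = decay_rate * (E(t) - P) for piecewise-constant E.

    Each step uses the exact solution over an interval of length dt, so the
    result does not depend on a numerical step-size approximation.
    """
    level = initial
    keep = math.exp(-decay_rate * dt)  # fraction of the gap to E left after dt
    for erk in erk_trace:
        level = erk + (level - erk) * keep
    return level


dt = 0.1                                   # hours per step (hypothetical)
sustained = [1.0] * 80                     # ERK on for all 8 h
recent_pulse = [0.0] * 70 + [1.0] * 10     # ERK on only for the final hour

slow, fast = 0.2, 3.0                      # turnover rates in 1/h (hypothetical)

slow_sustained = target_level(sustained, dt, slow)
slow_recent = target_level(recent_pulse, dt, slow)
fast_sustained = target_level(sustained, dt, fast)
fast_recent = target_level(recent_pulse, dt, fast)
print(slow_sustained, slow_recent, fast_sustained, fast_recent)
```

In this toy setting the slow-turnover target is high only after sustained ERK activity, while the fast-turnover target is high in both cases and so mainly reflects the most recent hour.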


siVAE: interpretable deep generative models for single-cell transcriptomes

February 2023 · 150 Reads · 26 Citations

Genome Biology

Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.


Mechanistic modeling with a variational autoencoder for multimodal single-cell RNA sequencing data

January 2023 · 76 Reads · 2 Citations

We motivate and present biVI, which combines the variational autoencoder framework of scVI with biophysically motivated, bivariate models for nascent and mature RNA distributions. In simulated benchmarking, biVI accurately recapitulates key properties of interest, including cell type structure, parameter values, and copy number distributions. In biological datasets, biVI provides a route for the identification of the biophysical mechanisms underlying differential expression. The analytical approach outlines a generalizable strategy for representing multimodal datasets generated by single-cell RNA sequencing.


Interpretable deep generative models for genomics

September 2021 · 67 Reads · 4 Citations

Deep neural networks implementing generative models for dimensionality reduction have been extensively used for the visualization and analysis of genomic data. One of their key limitations is lack of interpretability: it is challenging to quantitatively identify which input features are used to construct the embedding dimensions, thus preventing insight into why cells are organized in a particular data visualization, for example. Here we present a scalable, interpretable variational autoencoder (siVAE) that is interpretable by design: it learns feature embeddings that guide the interpretation of the cell embeddings in a manner analogous to factor loadings of factor analysis. siVAE is as powerful and nearly as fast to train as the standard VAE but achieves full interpretability of the embedding dimensions. We exploit a number of connections between dimensionality reduction and gene network inference to identify gene neighborhoods and gene hubs, without the explicit need for gene network inference. Finally, we observe a systematic difference in the gene neighborhoods identified by dimensionality reduction methods and gene network inference algorithms in general, suggesting they provide complementary information about the underlying structure of the gene co-expression network.


Figure 5. AREG Drives Stochastic ERK Signaling to Induce Heterogeneous ETG Expression. (A and B) Single-cell and mean traces of ERKTR C/N in S1m cells exposed to the indicated EGFR ligand and concentrations for a duration of >18 h. Vertical red lines indicate the time of ligand or vehicle addition. The bottom plot shows the mean ERKTR C/N in bold, with IQR shaded. Above the mean trace, 5 representative single-cell measurements of ERKTR C/N are shown. n > 500 cells per condition. (C) Bar graphs depict the percent of cells responding to the indicated dose of EGFR ligand within 30 min of treatment. n > 500 cells per condition. (D) Mean ERKTR C/N traces for S1 cells receiving the indicated EGF or AREG treatments. Percentages represent the temporal variability score for each condition. n > 200 cells per condition. (E) Immunofluorescence imaging of AREG expression in T4-2 cells, AREG (red) and nuclei (blue). (F and G) Co-staining of Fra-1 and Egr1 in S1m cells at the indicated timepoints and conditions. Percentages indicate the coefficient of variation for Fra-1 or Egr1, respectively. Numbers inset on "merge" images indicate the R² value for Fra-1 and Egr1. n > 1,000 cells per condition. Scale bar, 20 μm.
Systems-Level Properties of EGFR-RAS-ERK Signaling Amplify Local Signals to Generate Dynamic Gene Expression Heterogeneity

July 2020 · 98 Reads · 52 Citations

Cell Systems

Intratumoral heterogeneity is associated with aggressive tumor behavior, therapy resistance, and poor patient outcomes. Such heterogeneity is thought to be dynamic, shifting over periods of minutes to hours in response to signaling inputs from the tumor microenvironment. However, models of this process have been inferred from indirect or post-hoc measurements of cell state, leaving the temporal details of signaling-driven heterogeneity undefined. Here, we developed a live-cell model system in which microenvironment-driven signaling dynamics can be directly observed and linked to variation in gene expression. Our analysis reveals that paracrine signaling between two cell types is sufficient to drive continual diversification of gene expression programs. This diversification emerges from systems-level properties of the EGFR-RAS-ERK signaling cascade, including intracellular amplification of amphiregulin-mediated paracrine signals and differential kinetic filtering by target genes including Fra-1, c-Myc, and Egr1. Our data enable more precise modeling of paracrine-driven transcriptional variation as a generator of gene expression heterogeneity. A record of this paper’s transparent peer review process is included in the Supplemental Information.

Citations (5)


... There are also substantial routes to strengthening our mathematical formalism: we have not yet considered cell-cell interactions or cell cycle effects -of which will be the subject of future work. Other groups have also proposed mechanisms to combine biophysical modelling with deep learning frameworks (Carilli et al., 2024), suggesting we are not the only group thinking in this manner. ...

Reference:

No Foundations without Foundations -- Why semi-mechanistic models are essential for regulatory biology
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

Nature Methods

... Still, there are some approaches that incorporate interpretability already as part of the model design. For example, the siVAE approach simultaneously infers a cell and gene embedding space via two encoder-decoder frameworks and uses an additional regularization term in the loss function, where embeddings of genes indicate their contribution to distinct dimensions of the cell embedding space 36. However, the dimensions in the cell embedding may still be entangled, and the contribution of variables to the dimensions of the cell embedding is not constrained to be sparse. ...

siVAE: interpretable deep generative models for single-cell transcriptomes

Genome Biology

... 77 Our results suggest greater attention should be paid to measuring and controlling for global changes in transcription, proliferation, and other central metabolic processes. 77,85 Cells retain memory through diverse biomolecules such as RNA, proteins and posttranslational modifications, including those propagated in chromatin. Rapid proliferation accelerates the dilution of stable biomolecules, biasing cell fate. ...

Mechanistic modeling with a variational autoencoder for multimodal single-cell RNA sequencing data

... Studies that use single-cell sequencing generally have different research questions, and are focused on clustering and integration using AE architectures (e.g. [67][68][69][70]). Recent popular publications on these tasks use foundation models, which are promising as they can perform several tasks such as cell-type annotation and batch correction. ...

Interpretable deep generative models for genomics
  • Citing Preprint
  • September 2021

... Intratumoral heterogeneity is associated with aggressive tumor behavior, therapy resistance, and unfavorable patient outcomes 52 cancer treatment. Therefore, we identified the differences in sensitivity of various cell types to commonly used targeted therapeutic agents for breast cancer treatment (Fig. 12A-F). ...

Systems-Level Properties of EGFR-RAS-ERK Signaling Amplify Local Signals to Generate Dynamic Gene Expression Heterogeneity

Cell Systems