Romain Lopez’s research while affiliated with Genentech and other places


Publications (35)


Figure 1: The Open Problems in Single-cell Analysis living benchmarking platform. A) Overview and timeline of published benchmarks of single-cell batch integration. Four publications have benchmarked 19 methods using 18 metrics. Light grey and black squares indicate whether one or two
Figure 2: Task overview, setup and results. A) Overview of the seven tasks currently included in the Open Problems platform. Batch integration and cell-cell communication (CCC) consist of three and two subtasks respectively, making up the current total of 10 tasks. B) Schematic diagram of the CCC task.
Defining and benchmarking open problems in single-cell analysis
  • Preprint
  • File available

April 2024 · 298 Reads · 9 Citations

Malte Luecken · [...]

With the growing number of single-cell analysis tools, benchmarks are increasingly important to guide analysis and method development. However, a lack of standardisation and extensibility in current benchmarks limits their usability, longevity, and relevance to the community. We present Open Problems, a living, extensible, community-guided benchmarking platform including 10 current single-cell tasks that we envision will raise standards for the selection, evaluation, and development of methods in single-cell analysis.


Fig. 1. Differential expression model for deep generative models. (A) lvm-DE takes annotated data (from clustering, metadata, or transfer learning), a latent variable model, and a target FDR level as inputs and returns LFC estimates as well as calibrated DE predictions. (B) lvm-DE works as follows. 1) A preliminary step consists in fitting the latent variable model of choice of the data from the collection of available scRNA-seq data. 2) lvm-DE uses existing cell states annotations to approximate the distributions of c conditioned on the cell states. 3) These distributions help determine the normalized expression level distributions of the compared populations. 4) The associated LFC distribution helps to determine posterior DE probabilities that correspond to the model in which the LFC is higher than a given threshold. 5) To tag DE genes of interpretable interest, we estimate the maximum number of genes for which the posterior expected FDR is below the desired FDR level.
Fig. 2. SymSim results. (A) Dataset presentation. Top: SymSim is a simulation framework modeling biological and technical effects to provide realistic simulations. Bottom: We consider a two-cell-type DE analysis scenario. We subsample population A to compare the different algorithms for rare cell-type detection. For this subpanel and all the experiments, we refit the models for each scenario such that all of the algorithms use the same number of observations from A and B for model fitting and DE. (B) LFC point estimation error when comparing two populations of A = B = 200 cells. For Bayesian techniques, we summarize the posterior LFC distribution by its median. For this figure and in the remainder of the article, boxplots represent medians (line), interquartile range (box), and distribution range (whiskers) estimates. (C) TPR (dots) and FDR (crosses) changes for an increasing number of external cells for the different latent variable models. (D) FDR and TPR of decisions for the detection of DE genes when comparing varying A ∈ {25, 50, 100, 150} cells to B = 500 cells (for C = 2,000) (We refit each model for each configuration such that all algorithms use the same data). Squares, circles, and diamonds correspond to decisions controlling FDR at targets 0.05, 0.1, and 0.2, respectively. For scVI's original DE procedure, we reject the null when Bayes factors are greater than three in absolute value.
Fig. 3. PBMCs results. (A) UMAP from scVI's embedding. (B) Negative controls (among B cells), corresponding to LFC range study for the different methods. For this experiment, the lvm-DE outlier removal procedure was not employed. (C) Positive controls. Distribution of Pearson correlation between the reference LFC (bulk-RNA) and estimated LFC for pairwise comparisons of B cells, mDC, pDC, and monocytes. Each point in these graphs corresponds to one of the six possible cell-type comparisons. For lvm-DE, we use the custom LFC median estimator. Individual scatter plots can be found in the annex. (D) Distribution of Spearman correlations between the reference P values (bulk-RNA) and estimated significance scores for pairwise comparisons of B cells, mDC, pDC, and monocytes. GLMs and lvm-DE, respectively, used P values and posterior DE probabilities as significance scores. Stars represent significant differences with all the GLMs at various significance levels (*, **, and *** denote, respectively, significance levels < 0.05, 0.01, 0.005), under a two-sample F test for the negative control and a one-sided two-sample t-test for the positive control experiments.
Fig. 4. Batch harmonization on PbmcBench. Pooled information from several batches improves the match of the prediction with bulk for scPhere and scVI-lvm. Left: Pearson correlation of the predicted and the reference LFCs (from bulk-RNA) for two cell-type pairs. Right: Spearman correlation of the predicted significance scores (posterior DE probabilities for scPhere and scVI-lvm, P values for other algorithms) and the reference P values (from bulk-RNA) for two cell-type pairs. In both graphs, points correspond to a given training on a subset of PbmcBench containing a varying number of batches (color). As GLMs struggled to scale to large sample sizes, these algorithms used a maximum of 500 cells per dataset.
Fig. 5. SARS-CoV2 dataset results. (A) Dataset presentation. UMAP from scVI's embeddings colored by cell type (Left) and batch (Right). Counts were obtained from six healthy donors (H1 to H6) and seven SARS-CoV-2-infected patients (C1 to C7). (B) Negative controls (among DC cells), corresponding to the study of the range of the LFC parameter (LFC) for the different methods. (C) Positive controls for inter-cell-type analysis. Left: Distribution of Pearson correlations between the reference and estimated LFC for pairwise comparisons of B cells, mDC, pDC, and monocytes. Right: Distribution of Spearman correlations between the reference P values and estimated significance scores for pairwise comparisons. Each point in these graphs corresponds to one of the six possible cell-type comparisons. (D) Positive controls for within-cell-type analysis. Distribution for different cell types of Pearson correlations between the reference and estimated LFC. The reference corresponds to cell-type-specific cytokine signature genes' LFC independently computed on microarray, but that unfortunately did not contain significance assessments. The considered cell types are dendritic cells, NK, neutrophils, gd T, B, CD4T, and CD8T cells.
An empirical Bayes method for differential expression analysis of single cells with deep generative models

May 2023 · 98 Reads · 29 Citations

Proceedings of the National Academy of Sciences

Detecting differentially expressed genes is important for characterizing subpopulations of cells. In scRNA-seq data, however, nuisance variation due to technical factors like sequencing depth and RNA capture efficiency obscures the underlying biological signal. Deep generative models have been extensively applied to scRNA-seq data, with a special focus on embedding cells into a low-dimensional latent space and correcting for batch effects. However, little attention has been paid to the problem of utilizing the uncertainty from the deep generative model for differential expression (DE). Furthermore, the existing approaches do not allow for controlling for effect size or the false discovery rate (FDR). Here, we present lvm-DE, a generic Bayesian approach for performing DE predictions from a fitted deep generative model, while controlling the FDR. We apply the lvm-DE framework to scVI and scPhere, two deep generative models. The resulting approaches outperform state-of-the-art methods at estimating the log fold change in gene expression levels as well as detecting differentially expressed genes between subpopulations of cells.
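The FDR-control step described above (tagging the largest set of genes whose posterior expected FDR stays below a target) can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_de_genes` and the example probabilities are hypothetical, and the sketch assumes each gene already has a posterior DE probability from some fitted model.

```python
def select_de_genes(de_probs, target_fdr=0.05):
    """Return indices of genes tagged as DE under a posterior expected FDR target.

    de_probs[g] is assumed to be the posterior probability that gene g is DE.
    """
    # Rank genes from most to least likely to be DE.
    order = sorted(range(len(de_probs)), key=lambda g: de_probs[g], reverse=True)
    best_k = 0
    expected_false = 0.0
    for k, g in enumerate(order, start=1):
        # Each selected gene adds (1 - p) expected false discoveries.
        expected_false += 1.0 - de_probs[g]
        # Keep the largest k whose expected FDR is within the target.
        if expected_false / k <= target_fdr:
            best_k = k
    return sorted(order[:best_k])


# Hypothetical posterior DE probabilities for five genes.
probs = [0.99, 0.20, 0.97, 0.90, 0.55]
print(select_de_genes(probs, target_fdr=0.05))
```

Because genes are added in decreasing order of DE probability, the expected FDR of the selected set can only grow with each addition, so the last size that satisfies the target is also the largest one.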



Figure 5: SARS-CoV2 dataset results. A. Dataset presentation. UMAP from scVI's embeddings colored by cell-type (left) and batch (right). Counts were obtained from 6 healthy donors (H1 to H6) and 7 SARS-CoV-2 infected patients (C1 to C7). B. Negative controls (among DC cells), corresponding to the study of the range of the LFC parameter (LFC) for the different methods. C. Positive controls for inter cell-type analysis. Left: Distribution of Pearson correlations between the reference and estimated LFC for pairwise comparisons of B cells, mDC, pDC and monocytes. Right: Distribution of Spearman correlations between the reference P values and estimated significance scores for pairwise comparisons. Each point in these graphs corresponds to one of the six possible cell-type comparisons. D. Positive controls for within cell-type analysis. Distribution for different cell-types of Pearson correlations between the reference and estimated LFC. The reference corresponds to cell-type-specific cytokine signature genes' LFC independently computed on microarray, but that unfortunately did not contain significance assessments. The considered cell-types are dendritic cells, NK, neutrophils, gd T, B, CD4T, and CD8T cells.
An Empirical Bayes Method for Differential Expression Analysis of Single Cells with Deep Generative Models

May 2022 · 50 Reads · 4 Citations

Detecting differentially expressed genes is important for characterizing subpopulations of cells. In scRNA-seq data, however, nuisance variation due to technical factors like sequencing depth and RNA capture efficiency obscures the underlying biological signal. Deep generative models have been extensively applied to scRNA-seq data, with a special focus on embedding cells into a low-dimensional latent space and correcting for batch effects. However, little attention has been given to the problem of utilizing the uncertainty from the deep generative model for differential expression. Furthermore, the existing approaches do not allow controlling for the effect size or the false discovery rate. Here, we present lvm-DE, a generic Bayesian approach for performing differential expression using a fitted deep generative model, while controlling the false discovery rate. We apply the lvm-DE framework to scVI and scPhere, two deep generative models. The resulting approaches outperform the state-of-the-art methods at estimating the log fold change in gene expression levels, as well as detecting differentially expressed genes between subpopulations of cells.


Schematic representation of the ST analysis pipeline with DestVI
a, A ST analysis workflow relies on two data modalities, producing unpaired transcriptomic measurements, each in the form of count matrices. The ST data measures the gene expression $y_s$ in a given spot $s$ and its location $\lambda_s$. However, each spot may contain multiple cells. The scRNA-seq data measure the gene expression $x_n$ in a cell $n$, but the spatial information is lost because of tissue dissociation. After annotation, we may associate each cell with a cell type $c_n$. These matrices are the input to DestVI, composed of two LVMs: the scLVM and the stLVM. DestVI outputs a joint representation of the single-cell data and the spatial data by estimating the proportion of every cell type in every spot and projecting the expression of each spot onto cell-type-specific latent spaces. These inferred values may be used for performing downstream analysis, such as cell-type-specific DE and comparative analyses of conditions. b, Schematic of the scLVM. RNA counts and cell type information from the scRNA-seq data are jointly transformed by an encoder neural network into the parameters of the approximate posterior of $\gamma_n$, a low-dimensional representation of cell-type-specific cell state. Next, a decoder neural network maps samples from the approximate posterior of $\gamma_n$ along with the cell type information $c_n$ to the parameters of a count distribution for every gene. The superscript notation $f^g$ denotes the $g$-th entry $\rho_n^g$ of the vector $\rho_n$. c, Schematic of the stLVM. RNA counts from the ST data are transformed by an encoder neural network into the parameters of the cell-type-specific embeddings $\gamma_s^c$.
Free parameters $\beta_s^c$ encode the abundance of cell type $c$ in spot $s$ and may be normalized into CTP $\pi_s^c$ (Methods). The decoder from the scLVM model maps cell-type-specific embeddings $\gamma_s^c$ to estimates of cell-type-specific gene expression. These values are summed across all cell types, weighted by the abundance parameters $\beta_s^c$, to obtain the parameter $r_{sg}$ approximating the gene expression of the spot. After training, the decoder may be used to perform cell-type-specific imputation of gene expression across all spots.
Evaluating the performance of DestVI on simulations
a, Schematic view of the semi-simulation framework. For each cell type of an scRNA-seq dataset, we learned a continuous model of gene expression. We sampled spatially relevant random vectors on a grid to encode the proportion of every cell type in every spot $\pi_s^c$ as well as the cell-type-specific embeddings $\gamma_s^c$. Then, we feed those parameters into the learned continuous model to generate ST data (Methods). b, c, Visualization of the single-cell data and the cell state labels used for comparison to competing methods (UMAP embeddings of the single-cell data; 32,000 cells). b, Cells are colored by cell type. c, Cells are colored by the sub-cell types, obtained via hierarchical clustering (five clusters). d–f, Comparison of DestVI to competing algorithms, possibly applied to different clustering resolutions. Performance is not reported for cases that did not terminate by 3 hours (SPOTLight with eight sub-clusters; Methods). d, Spearman correlation of estimated CTP compared to ground truth for all methods. e, Spearman correlation of estimated cell-type-specific gene expression compared to ground truth, for combinations of spot and cell type for which the proportion is >0.4 for the parent cluster (not applicable to algorithms run at the coarsest level, as they do not provide cell type proportions at any sub-cell-type level). f, Scatter plot of both metrics that shows the tradeoff reached by all methods. Colors in this panel are in concordance with the ones from e and f.
g, h, Follow-up stress tests for DestVI. g, Accuracy of imputation, measured by Spearman correlation as a function of the cell type proportion in a given spot. h, Head-to-head comparison of estimated cell type proportion against ground truth across all spots and cell types (8,000 combinations of spot and cell type). i, j, Ablation studies for the amortization scheme used by DestVI. ‘None’ stands for vanilla MAP inference. ‘Latent’ and ‘Proportion’ refer to only the inference of the latent variables and only the cell type abundance being amortized with a neural network, respectively. ‘Both’ refers to fully amortized MAP inference. i, Spearman correlation of estimated CTP compared to ground truth. j, Spearman correlation of estimated cell-type-specific gene expression compared to ground truth. UMAP, uniform manifold approximation and projection.
Application of DestVI to the murine lymph nodes
a, Schematics of the experimental pipeline. We processed murine lymph nodes with ST (10x Visium) and scRNA-seq (10x Chromium) after 48-hour stimulation by MS compared to PBS control (two sections from each condition). b, ST data (1,092 spots; only three sections passed the quality check) (Supplementary Methods). Sample MS-1 and samples PBS/MS-2 were processed on different capture areas of the same Visium gene expression slide. c, UMAP projection of the scRNA-seq data (14,989 cells). d, Spatial autocorrelation of the CTP. e, Spatial distribution of CTP for B cells, CD8 T cells, monocytes and NK cells, as inferred by DestVI. f, Embedding of the monocytes (circles; 128 single cells) alongside the monocyte-abundant spots (crosses; 79 spots). Single cells are colored by expression of IFN-II genes identified by Hotspot (Fcgr1, Cxcl9 and Cxcl10; Supplementary Figs. 12–14). g, Imputation of monocyte-specific expression of the IFN-II marker genes for the monocyte-abundant spots of the spatial data (log-scale). h, Monocyte-specific DE analysis between MS and PBS lymph nodes (2,000 genes, 79 spots, total 10,980 samples from the generative model). Red dots designate genes with statistical significance, according to our DE procedure (two-sided Kolmogorov–Smirnov test, adjusted for multiple testing using the Benjamini–Hochberg procedure; Methods). i, Immunofluorescence imaging from an MS lymph node, staining for CD11b, CD64 and Ly6C in the IFA. Scale bar, 50 μm. j, Embedding of the B cells (circles, 8,359 single cells) alongside the B-cell-abundant spots (crosses, 579 spots). Single cells are colored by expression of the IFN-I genes identified by Hotspot (Ifit3, Ifit3b, Stat1, Ifit1, Usp18 and Isg15; Supplementary Figs. 17 and 18). k, Imputation of B-cell-specific expression of the IFN-I gene module on the spatial data (log-scale), reported on B-cell-abundant spots. l, B-cell-specific DE analysis between MS and PBS lymph nodes (2,000 genes, 579 spots, 6,160 samples). 
Red dots designate genes with statistical significance, according to our DE procedure (two-sided Kolmogorov–Smirnov test, adjusted for multiple testing using the Benjamini–Hochberg procedure; Methods). m, Immunofluorescence imaging from an MS lymph node, staining for IFIT3, B220 and Ly6C in B cell follicle near the inflammatory IFA. Scale bar, 50 μm. UMAP, uniform manifold approximation and projection. cDC2, type 2 conventional dendritic cell; GD, gamma delta; pDC, plasmacytoid dendritic cell.
Application of DestVI to a MCA205 tumor sample
a, Schematics of the experimental pipeline. We performed ST (10x Visium) and scRNA-seq (single-cell MARS sequencing protocol) on MCA205 tumor that contains heterogeneous immune cell populations 14 days after intracutaneous transplantation into the wild-type mouse (two sections). b, Visualization of the ST data for two MCA205 tumor sections, after quality control (4,027 spots). Scale bar, 1,000 μm. The two sections were processed on the different capture areas of the same Visium gene expression slide. c, UMAP projection of the scRNA-seq data (8,051 cells), embedded by scVI and manually annotated. d, Spatial autocorrelation of the CTP for every cell type, computed using Hotspot. e, Spatial distribution of CTP for DCs, monocytes and macrophages (Mon-Mac), CD8 T cells and NK cells. f, Immunofluorescence imaging from neighboring tumor sections, using antibodies for MHCII⁺ cells showing for DCs (Section-3, +20 μm from Section-2), F4/80⁺MHCII⁻ cells showing for Mon-Mac (Section-3, +20 μm from Section-2), TCRb⁺ cells showing for CD8 T cells (Section-5, +60 μm from Section-2) and NK1.1⁺ cells showing for NK cells (Section-4, +30 μm from Section-2). All scale bars denote 500 μm. Red solid lines indicate the section boundary. Right side is the MCA205 tumor marginal boundary. The cells positive for staining marker are segmented and annotated using QuPath and showing yellow color here with changed brightness and contrast (Supplementary Methods). UMAP, uniform manifold approximation and projection.
DestVI identifies a hypoxic population of macrophages in the tumor core
a, Visualization of the hypoxia gene expression module on the Mon-Mac cells from the scRNA-seq data (4,400 cells), on the embedding from scVI (identified using Hotspot; Supplementary Figs. 28 and 29). b, Imputation of gene expression for this module on the spatial dataset (log-scale), reported only on spots with high abundance of Mon-Mac (3,906 spots across the two sections). Imputation for other modules is shown in Supplementary Fig. 30. c, H&E-stained histology of Section-1 (left), with overlapping Mreg-identified regions from DestVI showing red polygons (as identified in Supplementary Fig. 32). Blue arrows show the location of cells from the necrotic core. H&E-stained histology showing a magnification of the necrotic core of the yellow frame in Section-1 (right). Scale bar, 55 μm. d, Mon-Mac cell-specific DE analysis between the Mreg-enriched areas and the rest of the tumor section (2,886 genes; 379 spots for the Mreg-enriched area and 361 randomly sampled spots from the rest of the tumor; total of 2,220 samples from the generative model). Red dots designate genes with statistical significance, according to our DE procedure (two-sided Kolmogorov–Smirnov test, adjusted for multiple testing using the Benjamini–Hochberg procedure; Methods). e, Representative image of the multiplexed immunofluorescence staining. Left, Hypoxic areas as identified by the Hypoxyprobe (HYPO) in a whole MCA205 tumor section. Two yellow frames show the hypoxic areas with necrotic cores. Scale bar, 500 μm. Middle, Magnification of a necrotic core with F4/80, Arg1, GPNMB, HYPO and DAPI staining. Scale bar, 50 μm. Right, Annotation of different macrophages surrounding the necrotic core. Different colors shown in the legend bar show different staining combinations. Red spindle shows the extent of hypoxia. Blue arrow shows the location of cells from the necrotic core. Scale bar, 50 μm.
DestVI identifies continuums of cell types in spatial transcriptomics data

April 2022 · 581 Reads · 151 Citations

Nature Biotechnology

Most spatial transcriptomics technologies are limited by their resolution, with spot sizes larger than that of a single cell. Although joint analysis with single-cell RNA sequencing can alleviate this problem, current methods are limited to assessing discrete cell types, revealing the proportion of cell types inside each spot. To identify continuous variation of the transcriptome within cells of the same type, we developed Deconvolution of Spatial Transcriptomics profiles using Variational Inference (DestVI). Using simulations, we demonstrate that DestVI outperforms existing methods for estimating gene expression for every cell type inside every spot. Applied to a study of infected lymph nodes and of a mouse tumor model, DestVI provides high-resolution, accurate spatial characterization of the cellular organization of these tissues and identifies cell-type-specific changes in gene expression between different tissue regions or between conditions. DestVI is available as part of the open-source software package scvi-tools (https://scvi-tools.org).
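The figure caption for this entry describes each spot's expression as a sum over cell types of cell-type-specific expression, weighted by per-spot abundances that may be normalized into cell type proportions. A minimal sketch of that arithmetic follows. The function names, the plain sum-to-one normalization, and the numbers are assumptions for illustration only; in DestVI the cell-type-specific expression comes from a trained decoder, which is replaced here by a fixed table.

```python
def normalize_abundances(beta_s):
    """Turn non-negative abundances for one spot into proportions summing to 1.

    Assumption: simple division by the total; the paper's Methods may differ.
    """
    total = sum(beta_s)
    return [b / total for b in beta_s]


def spot_expression(beta_s, expr_by_type):
    """Abundance-weighted sum of cell-type-specific expression for one spot.

    expr_by_type[c][g] stands in for the decoder output for cell type c, gene g.
    """
    n_types = len(beta_s)
    n_genes = len(expr_by_type[0])
    return [
        sum(beta_s[c] * expr_by_type[c][g] for c in range(n_types))
        for g in range(n_genes)
    ]


# Hypothetical spot with two cell types and two genes.
beta = [3.0, 1.0]
expr = [[0.5, 0.5],   # cell type 0
        [1.0, 0.0]]   # cell type 1
print(normalize_abundances(beta))
print(spot_expression(beta, expr))
```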



Figure 2: Relationship between uncertainty and error in the Gaussian Process Poisson Log-normal simulation experiments with the TreeVAE model. (a) Variance of posterior predictive density on latent space for each internal node compared to the depth (Pearson's r = 0.6761). (b) Error and uncertainty of each prediction (Pearson's r = −0.6765).
Figure 3: Behavior of TreeVAE internal node predictions. (a) Variance of posterior predictive density on latent space for each internal node. Uncertainty is negatively correlated with depth (Pearson's r = −0.60). (b) Predicted expression of CEACAM5 for each internal node. Color gradient at the leaves indicates observed gene expression.
Results on the Gaussian process factor analysis simulations (averaged across ten different simulations).
Reconstructing unobserved cellular states from paired single-cell lineage tracing and transcriptomics data

May 2021 · 105 Reads · 4 Citations

Novel experimental assays now simultaneously measure lineage relationships and transcriptomic states from single cells, thanks to CRISPR/Cas9-based genome engineering. These multimodal measurements allow researchers not only to build comprehensive phylogenetic models relating all cells but also infer transcriptomic determinants of consequential subclonal behavior. The gene expression data, however, is limited to cells that are currently present (“leaves” of the phylogeny). As a consequence, researchers cannot form hypotheses about unobserved, or “ancestral”, states that gave rise to the observed population. To address this, we introduce TreeVAE: a probabilistic framework for estimating ancestral transcriptional states. TreeVAE uses a variational autoencoder (VAE) to model the observed transcriptomic data while accounting for the phylogenetic relationships between cells. Using simulations, we demonstrate that TreeVAE outperforms benchmarks in reconstructing ancestral states on several metrics. TreeVAE also provides a measure of uncertainty, which we demonstrate to correlate well with its prediction accuracy. This estimate therefore potentially provides a data-driven way to estimate how far back in the ancestor chain predictions could be made. Finally, using real data from lung cancer metastasis, we show that accounting for phylogenetic relationship between cells improves goodness of fit. Together, TreeVAE provides a principled framework for reconstructing unobserved cellular states from single cell lineage tracing data.


Learning from eXtreme Bandit Feedback

May 2021 · 17 Reads · 17 Citations

Proceedings of the AAAI Conference on Artificial Intelligence

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure, named Policy Optimization for eXtreme Models (POXM), for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
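To make the idea of restricting importance sampling to the top-p actions of the logging policy concrete, here is a toy sketch. It is a simplified reading of the abstract, not the paper's estimator: the function names, the log format, and the rule of giving zero weight to logged actions outside the top-p set are all assumptions for illustration.

```python
def top_p_actions(logging_probs, p):
    """Indices of the p actions with the highest logging-policy probability."""
    ranked = sorted(range(len(logging_probs)),
                    key=lambda a: logging_probs[a], reverse=True)
    return set(ranked[:p])


def selective_is(logs, p):
    """Average importance-weighted reward, counting only logged actions
    that fall in the logging policy's top-p set (others contribute zero)."""
    total = 0.0
    for entry in logs:
        a = entry["action"]
        if a in top_p_actions(entry["logging_probs"], p):
            weight = entry["target_probs"][a] / entry["logging_probs"][a]
            total += weight * entry["reward"]
    return total / len(logs)


# Hypothetical logged data: four rounds over four actions.
logging = [0.5, 0.25, 0.125, 0.125]
target = [0.25, 0.5, 0.125, 0.125]
logs = [
    {"logging_probs": logging, "target_probs": target, "action": 1, "reward": 1.0},
    {"logging_probs": logging, "target_probs": target, "action": 0, "reward": 1.0},
    {"logging_probs": logging, "target_probs": target, "action": 3, "reward": 1.0},
    {"logging_probs": logging, "target_probs": target, "action": 1, "reward": 0.0},
]
print(selective_is(logs, p=4))  # all actions kept: ordinary importance sampling
print(selective_is(logs, p=2))  # only the two most likely logged actions kept
```

With p equal to the number of actions, the sketch reduces to ordinary importance sampling. Shrinking p drops the terms from rarely logged actions, which in ordinary importance sampling can carry very large weights, and accepts some bias in exchange.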


Multi-resolution deconvolution of spatial transcriptomics data reveals continuous patterns of inflammation

May 2021 · 158 Reads · 22 Citations

The function of mammalian cells is largely influenced by their tissue microenvironment. Advances in spatial transcriptomics open the way for studying these important determinants of cellular function by enabling a transcriptome-wide evaluation of gene expression in situ. A critical limitation of the current technologies, however, is that their resolution is limited to niches (spots) of sizes well beyond that of a single cell, thus providing measurements for cell aggregates which may mask critical interactions between neighboring cells of different types. While joint analysis with single-cell RNA-sequencing (scRNA-seq) can be leveraged to alleviate this problem, current analyses are limited to a discrete view of cell type proportion inside every spot. This limitation becomes critical in the common case where, even within a cell type, there is a continuum of cell states that cannot be clearly demarcated but reflects important differences in the way cells function and interact with their surroundings. To address this, we developed Deconvolution of Spatial Transcriptomics profiles using Variational Inference (DestVI), a probabilistic method for multi-resolution analysis for spatial transcriptomics that explicitly models continuous variation within cell types. Using simulations, we demonstrate that DestVI is capable of providing higher resolution compared to the existing methods and that it can estimate gene expression by every cell type inside every spot. We then introduce an automated pipeline that uses DestVI for analysis of single tissue slices and comparison between tissues. We apply this pipeline to study the immune crosstalk within lymph nodes to infection and explore the spatial organization of a mouse tumor model. 
In both cases, we demonstrate that DestVI can provide a high resolution and accurate spatial characterization of the cellular organization of these tissues, and that it is capable of identifying important cell-type-specific changes in gene expression between different tissue regions or between conditions. DestVI is available as an open-source software package in the scvi-tools codebase (https://scvi-tools.org).


Figure 1: User perspective of scvi-tools. a, Overview of single-cell omics analysis pipeline with scvi-tools. Datasets may contain multiple layers of omic information, along with metadata at the cell- and feature-levels. Quality control (QC) and preprocessing are done with popular packages like Scanpy, Seurat, and Scater. Subsequently, datasets can be analyzed with scvi-tools, which contains implementations of probabilistic models that offer a range of capabilities for several omics. Finally, results are further investigated or visualized, typically through the basis of a nearest neighbors graph, and with methods like Scanpy and VISION. b, (left) The functionality of models implemented in scvi-tools covers core single-cell analysis tasks. Each model has a simple and consistent user interface. (right) A code snippet applying scVI to a dataset read from an h5ad file, and then performing dimensionality reduction and differential expression.
Figure 2: Removal of unwanted variation in the analysis of Drosophila wing disc development. a, Graphical model representation of a latent variable model in scvi-tools that conditions on nuisance covariates. b, Graphical representation of covariates injected into each layer of decoder neural networks. c, Code snippet to register AnnData and train scVI with continuous covariates. The covariates are identified with keys stored in the AnnData.obs cell-level data frame. d, UMAP [57] embedding of scVI latent space with only batch covariates (scVI) and scVI latent space with batch and continuous covariates (scVI-cc). UMAP plot is colored by batch, PCNA (cell cycle gene), lncRNA:roX1 (cell sex gene), and vg (gene marking spatial compartment within the wing disc). e, Geary's C of canonical marker genes of interest per model. f, Geary's C of the conditioned-on cell cycle and cell sex genes per model. Box plots were computed on n=31 genes for (e) and n=55 genes for (f) and indicate the median (center lines), interquartile range (hinges), and whiskers at 1.5× interquartile range. Gene lists can be found in Supplementary Table 2.
Figure 3: Sequential integration of CITE-seq PBMC samples with totalVI and the scArches method. a, Code-based overview of using scArches with the implementation of totalVI in scvi-tools. scArches was implemented globally through the ArchesMixin class. First, the reference model is trained on reference data, and then the scArches architectural surgery is performed when load_query_data is called on the query data. Finally, the (now) query model is trained with the query data and downstream analysis is performed. b, c, UMAP embedding of the totalVI reference and query latent spaces colored by (b) the reference labels and predicted query labels and (c) the dataset of origin. d, Row-normalized confusion matrix of scArches predicted query labels (rows) and study-derived cell annotations (columns). e, Dot plot of log library size normalized RNA expression across cell type markers for predicted T cell subsets. f, g, Frequency of (f) MAIT cells and (g) CD4 CTLs for each donor in the query dataset across healthy controls and donors with moderate and severe COVID. Horizontal line denotes median. h, Row-normalized confusion matrix of scArches predicted query labels (rows) and default totalVI predicted labels (columns).
Figure 5: Reimplementation of Stereoscope in scvi-tools. a, Overview of the Stereoscope method. Stereoscope takes as input a spatial transcriptomics dataset, as well as a single-cell RNA sequencing dataset, and outputs the proportion of cell types in every spot. b, Short description of the steps required to reimplement Stereoscope into the codebase. For each of the two models of Stereoscope, we created a module class as well as a model class. c, Average cyclomatic code complexity and total number of source code lines for the scvi-tools implementation and the original implementation. d, e, Description of implementation of the ScSignatureModule, the module class for the single-cell model of the Stereoscope method. f, Example of user code interacting with Scanpy. g, Output example on the hippocampus spatial 10x Visium dataset.
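Panels e and f of Figure 2 report Geary's C, a spatial autocorrelation statistic. The sketch below computes the standard statistic, C = (N − 1) Σ w_ij (x_i − x_j)² / (2 W Σ (x_i − x̄)²), for one gene over a neighbor graph. The captions do not say how the graph or weights were built, so the binary chain graph and the function name `gearys_c` are assumptions made for illustration.

```python
def gearys_c(values, weights):
    """Geary's C for one variable on a weighted graph.

    values:  list of N numbers (e.g. one gene's expression per cell)
    weights: N x N matrix; weights[i][j] > 0 when cells i and j are neighbors
    Values below 1 indicate positive autocorrelation (neighbors are similar),
    values above 1 indicate negative autocorrelation.
    """
    n = len(values)
    mean = sum(values) / n
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of squared differences between neighboring cells.
    numerator = sum(
        weights[i][j] * (values[i] - values[j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    # Total squared deviation from the mean.
    denominator = sum((v - mean) ** 2 for v in values)
    return (n - 1) * numerator / (2.0 * total_weight * denominator)


# Assumed toy graph: four cells in a chain, 0-1-2-3, with binary weights.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

c_smooth = gearys_c([1.0, 2.0, 3.0, 4.0], chain)       # varies smoothly along the graph
c_alternating = gearys_c([1.0, 4.0, 1.0, 4.0], chain)  # neighbors always differ
print(c_smooth, c_alternating)
```

On this toy graph the smoothly varying signal scores below 1 and the alternating signal scores above 1, matching the usual reading of the statistic.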
scvi-tools: a library for deep probabilistic analysis of single-cell omics data

April 2021 · 1,598 Reads · 36 Citations

Probabilistic models have provided the underpinnings for state-of-the-art performance in many single-cell omics data analysis tasks, including dimensionality reduction, clustering, differential expression, annotation, removal of unwanted variation, and integration across modalities. Many of the models being deployed are amenable to scalable stochastic inference techniques, and accordingly they are able to process single-cell datasets of realistic and growing sizes. However, the community-wide adoption of probabilistic approaches is hindered by a fractured software ecosystem resulting in an array of packages with distinct, and often complex interfaces. To address this issue, we developed scvi-tools (https://scvi-tools.org), a Python package that implements a variety of leading probabilistic methods. These methods, which cover many fundamental analysis tasks, are accessible through a standardized, easy-to-use interface with direct links to Scanpy, Seurat, and Bioconductor workflows. By standardizing the implementations, we were able to develop and reuse novel functionalities across different models, such as support for complex study designs through nonlinear removal of unwanted variation due to multiple covariates and reference-query integration via scArches. The extensible software building blocks that underlie scvi-tools also enable a developer environment in which new probabilistic models for single cell omics can be efficiently developed, benchmarked, and deployed. We demonstrate this through a code-efficient reimplementation of Stereoscope for deconvolution of spatial transcriptomics profiles. By catering to both the end user and developer audiences, we expect scvi-tools to become an essential software dependency and serve to formulate a community standard for probabilistic modeling of single cell omics.
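As a rough illustration of the standardized interface described above, the following sketch loads an h5ad file, trains scVI, and then runs dimensionality reduction and differential expression. It assumes a recent scvi-tools release (the setup call differed in earlier versions) and that `adata.X` holds raw counts. The wrapper `run_scvi` and the column names `"batch"` and `"cell_type"` are hypothetical and would need to match the dataset at hand.

```python
def run_scvi(h5ad_path, batch_key="batch", groupby="cell_type"):
    """Train scVI on an h5ad file; return the annotated data and a DE table.

    Hypothetical wrapper for illustration. `batch_key` and `groupby` must
    name existing columns of `adata.obs`.
    """
    # Imported lazily so the sketch can be read without the packages installed.
    import anndata
    import scvi

    adata = anndata.read_h5ad(h5ad_path)

    # Register the dataset with the model class, declaring the batch covariate.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)

    model = scvi.model.SCVI(adata)
    model.train()

    # Dimensionality reduction: one low-dimensional vector per cell.
    adata.obsm["X_scVI"] = model.get_latent_representation()

    # Differential expression between the groups defined by `groupby`.
    de_table = model.differential_expression(groupby=groupby)
    return adata, de_table
```

The embedding stored in `adata.obsm["X_scVI"]` can then be passed to downstream tools such as Scanpy for neighbor graphs and visualization.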


Citations (30)


... {funkyheatmap} has proven its utility in benchmarking studies within single-cell omics (Li et al., 2023; Luecken et al., 2022, 2024; Saelens et al., 2019; Sang-Aram et al., 2024; Yan & Sun, 2023) and its applications extend to diverse fields where visualisation of mixed data types is needed. Figure 1 showcases the functionality of {funkyheatmap}, namely: ...

Reference:

funkyheatmap: Visualising data frames with mixed data types
Defining and benchmarking open problems in single-cell analysis

... To identify the gene expression programs that may drive region-specific function of endothelial cells, we leveraged the generative capacity of scVIVA for differential expression (DE) testing (endothelial cells in invasive vs. stromal regions; C1 and C2 respectively in Figure 4D-left). Our DE procedure builds on lvm-DE, a general strategy we previously developed for comparative analysis with latent variable models (Boyeau et al. (2023); see Methods). Despite the re-segmentation we find persistent issues of molecule misassignment due to remaining errors in segmentation, which may lead to spurious DE results if we analyze the data as-is. ...

An empirical Bayes method for differential expression analysis of single cells with deep generative models

Proceedings of the National Academy of Sciences

... Seurat, in its current version v5, offers integrative multimodal data analysis with provision for transcriptomics, chromatin accessibility, proteomics, etc. [68]. At the same time, with the growing adoption of single-cell genomics, Fabian Theis's research group proposed the adoption of a common standard of data structure format for genomics data, an efficient data storage format for interoperability, and a registry for packages based on a common standard through scVerse [29]. ...

The scverse project provides a computational ecosystem for single-cell omics data analysis

Nature Biotechnology

... There are extensions of the bandit setting to very large action spaces encountered in recommender systems [39], to settings in which the optimization aspect is not as crucial as the best-arm identification capabilities within a strict budget [9], and situations with budgeted interactions with a batch of concurrent Markov processes, which will be described in Section 2.3.3. ...

Learning from eXtreme Bandit Feedback
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... The main factor distinguishing these methods is the assumed underlying distribution of the expression counts (the main being the Poisson and the Negative Binomial, as well as the zero-inflated versions of those). Finally, a recently proposed approach, lvm-DE, uses a Bayesian framework leveraging posterior distributions estimated from deep generative models, which is suggested to be suitable for the complex, non-linear experimental designs that are particularly relevant for performing DE analysis between groups in extensive cohort studies with complex metadata [27]. ...

An Empirical Bayes Method for Differential Expression Analysis of Single Cells with Deep Generative Models

... These models often overlook complex gene expression variation and are sensitive to incomplete or imbalanced cell type representation in the reference. Probabilistic approaches such as DestVI [18], Cell2location [19], and Redeconve [20] incorporate latent variable modeling to capture uncertainty and improve robustness, but their inference procedures can be computationally demanding and may not scale efficiently to large tissue sections. Spatially informed methods like CARD [16] and Tangram [21] introduce spatial priors or alignment strategies, yet they often depend on rigid spatial assumptions that may not generalize well across tissue types. ...

DestVI identifies continuums of cell types in spatial transcriptomics data

Nature Biotechnology

... The model was trained in a manner aligned with scvi-tools 40 , with minibatches of 256 randomly sampled cells from all batches, along with their cell type annotations. Islander was trained using cross-entropy loss with mixup 16 augmentations (default setting). ...

A Python library for probabilistic analysis of single-cell omics data
  • Citing Article
  • February 2022

Nature Biotechnology

... However, CLTs cannot be directly observed under most conditions. Instead, they are typically reconstructed from mutation data and then used in downstream analyses, including the inference of cell differentiation maps [8,22,31] and gene expression dynamics [36]. In the context of cancer, CLTs are useful for identifying subclones within a tumor [39], testing for adaptive responses of subclones to therapeutics [11], and inferring migration histories for metastatic tumors [7]. ...

Reconstructing unobserved cellular states from paired single-cell lineage tracing and transcriptomics data

... 73 In ST data analysis, the distribution of cell types in each spot can be inferred using scRNA-seq data. A variety of cell-type deconvolution methods have been designed for analyses of spatial data, such as SPOTlight, 74 spatialDWLS, 75 DSTG, 76 DestVI, 77 or STRIDE 78 and RCTD. 79 SPOTlight deconvolutes the information of spatial transcriptomics with scRNAseq data using seeded NMF regression, 74 and DSTG performs cell-type deconvolution by illustrating convolutional neural networks. ...

Multi-resolution deconvolution of spatial transcriptomics data reveals continuous patterns of inflammation
  • Citing Preprint
  • May 2021

... Second, single cell transcript count is modeled as a Gamma-Poisson distribution to account for overdispersion, similar to other probabilistic models for scRNA-seq data [44,45]. Importantly, to link the two components and quantify the influence of gene dosage on gene expression in each cluster, we define transcriptional rate as a latent parameter (c). ...

scvi-tools: a library for deep probabilistic analysis of single-cell omics data