Lior Pachter’s research while affiliated with California Institute of Technology and other places


Publications (576)


Figure 1: Building a de Bruijn graph with shades. One transcript sequence (parent color) and two variant contigs (shades, shown as circles) are represented. In the end, 11 k-mers (k=3) are present, all with the same parent color, two with one shade of that color, and three with another shade of that color.
Figure 2: Partitioning a colored compacted de Bruijn graph. (A) The structure of the graph. The boxes shown are the nodes of the graph and represent colored contigs. (B) Monochromatic contigs that are extracted.
Figure 3: Relationship between different mouse strains. (A) Jaccard similarity index of k-mers between different mouse strains as determined by klue: the cardinality of the set intersection, X, represents the number of k-mers shared between two mouse strains while the cardinality of the set union, Y, represents all the number of k-mers present in either or both of two mouse strains; the Jaccard index is the ratio of X to Y. (B) The phylogenetic relationship between the 8 mouse strains.
Figure 4: klue for mouse strain demultiplexing. Overview of the workflow used for strain demultiplexing with klue and kallisto.
Figure 5: Results of klue+kallisto demultiplexing of the PWK/PhJ and A/J mouse cells. Accuracy could be determined because the PWK/PhJ cells and A/J mouse cells were placed into separate wells at the initial step of the split-pool barcoding, therefore the well barcode could serve as a ground truth label for each cell.
Pseudoassembly of k-mers
  • Preprint
  • File available

May 2025

Delaney K Sullivan · Mayuko Boffelli · Lior Pachter

We introduce a pseudoassembly approach to identifying variation in sets of genomic sequences via colored de Bruijn graphs. Our pseudoassembly method is implemented in a program called klue that assembles k-mers into sequences compatible with a variant-aware extension of pseudoalignment. We show that this approach can be used to identify cell-type specific de novo variants from single-cell RNA-seq in a mouse melanoma model.
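The Jaccard index described in the Figure 3 caption (shared k-mers divided by all k-mers present in either set) can be illustrated with a minimal sketch. This is not klue's implementation: the sequences are invented toy strings, the helper names `kmers` and `jaccard` are hypothetical, and reverse complements and graph coloring are ignored.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy sequences, invented for illustration only (they differ at one base).
strain_a = "ACGTACGGA"
strain_b = "ACGTACCGA"

ka, kb = kmers(strain_a, 3), kmers(strain_b, 3)
shared = ka & kb          # k-mers present in both sequences
private_a = ka - kb       # k-mers that distinguish strain_a
print(sorted(shared), sorted(private_a), jaccard(ka, kb))
```

The k-mers private to one sequence are the ones that could, in principle, distinguish it from the other, which is the intuition behind using strain-specific sequences for demultiplexing.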


Figure 2: Results for homogeneous RNA solution. a) Schematic of the experiment and model predictions. b) Distribution of normalized covariance between gene pairs with mean expression greater than 0.1, shown separately for ERCC, mature mRNA, and nascent mRNA counts. c) Overdispersion-mean relationship for genes with mean expression greater than 0.1, for ERCC, mature mRNA, and nascent mRNA counts, respectively. d) Cumulative distribution function of c_tech. Gray dots represent the empirical CDF of estimated c_tech using selected Poisson mature mRNA counts. The blue line shows a Gamma distribution fitted by matching the first two moments (mean and variance), and the red line shows a Gaussian distribution with the same mean and variance. e) Cumulative distribution function of normalized covariance. Gray dots represent the empirical CDF of normalized covariance for mature mRNA counts shown in b). Using the estimated c_tech values and the mean expression levels of the selected genes, 1000 bootstrap samples were generated. The purple line indicates the median empirical CDF across bootstrap replicates, and the light purple band represents the 95% confidence interval.
Figure 4: Extrinsic noise in species-mixing experiments. a) Schematic of the species-mixing experiment. b) Overdispersion-mean relationships for human and mouse genes in both human and mouse cells in the 20k Chromium X dataset. c) Biological and technical extrinsic noise in three species-mixing experiments.
Figure 5: Extrinsic noise in single cell datasets. (a) Distribution of normalized covariance within genes across different mean expression ranges for the K562 10x Flex dataset. (b) Overdispersion-mean relationship for the K562 10x Flex dataset. (c) Distribution of normalized covariance for the mESC inDrop dataset. (d) Distribution of normalized covariance for the mESC 10x 3' v3 dataset. (e) Overdispersion-mean relationship for the mESC 10x 3' v3 dataset. (f) Cell size along cell cycle. Cell sizes are estimated using Poisson genes from panel (e). Cell cycle progression is denoted by cell cycle theta, as reported by Riba et al. (2022). (g) Distribution of normalized covariance between gene pairs with mean expression greater than 0.1 for the PBMC dataset. (h) Distribution of normalized covariance between selected Poisson genes for the PBMC dataset. (i) Overdispersion-mean relationship for the PBMC dataset. (j) Sum of total counts against sum of Poisson counts, colored by cell types.
Extrinsic biological stochasticity and technical noise normalization of single-cell RNA sequencing data

May 2025 · 6 Reads

The technical noise introduced during single-cell RNA sequencing (scRNA-seq) has led to the use of size factor normalization as a first step prior to data analysis. However, this scaling approach inherently affects extrinsic (between cell) variability of gene expression, which stems from both biological and technical factors. We propose a general extrinsic noise model to provide a theoretical basis for size factor normalization, thus providing a framework for estimating both biological and technical components of extrinsic noise. We highlight the relationship between normalized gene expression covariance, extrinsic noise, and overdispersion, showing that extrinsic noise explains the baseline overdispersion commonly observed in scRNA-seq data. We validated the technical model by testing the relationship on data from RNA solutions. Interestingly, our model accurately describes mature mRNA counts but not nascent mRNA counts, suggesting the need for an alternative technical model for data derived from nascent transcripts. Using single-cell RNA-seq data, we characterize both biological and technical extrinsic noise and cell size factors estimated using Poisson-like genes. Overall, our model helps clarify common misconceptions and provides insight into the role of extrinsic noise in scRNA-seq data.
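The stated link between normalized covariance, extrinsic noise, and overdispersion can be checked numerically with a generic mixed-Poisson sketch. This is an illustration under stated assumptions, not the authors' model or estimator: counts are taken to be Poisson with mean c·μ_g, where c is a per-cell factor with mean 1 and variance s (Gamma-distributed here purely for convenience). Under those assumptions Var(X_g) = μ_g + s·μ_g² and Cov(X_g, X_h) = s·μ_g·μ_h, so both (Var − mean)/mean² and the covariance divided by the product of means should be close to s.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200_000
s = 0.25                      # assumed extrinsic noise, Var(c) with E[c] = 1
mu = np.array([2.0, 5.0])     # mean expression of two hypothetical genes

# Per-cell scaling factor with mean 1 and variance s (the Gamma choice is an assumption).
c = rng.gamma(shape=1.0 / s, scale=s, size=n_cells)
# Counts are conditionally independent Poisson draws that share the same factor c.
x = rng.poisson(c[:, None] * mu)

mean = x.mean(axis=0)
var = x.var(axis=0)
overdispersion = (var - mean) / mean**2
norm_cov = np.cov(x[:, 0], x[:, 1])[0, 1] / (mean[0] * mean[1])
print(overdispersion, norm_cov)
```

Both quantities estimate the same s regardless of the genes' mean expression, which is one way to read the claim that extrinsic noise sets a baseline overdispersion.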


Systematic cell-type resolved transcriptomes of 8 tissues in 8 lab and wild-derived mouse strains captures global and local expression variation

April 2025 · 27 Reads

Elisabeth Rebboah · Ryan Weber · Elnaz Abdollahzadeh · [...]

Mapping the impact of genomic variation on gene expression facilitates an understanding of the molecular basis of complex phenotypic traits and disease predisposition. Mouse models provide a controlled and reproducible framework for capturing the breadth of genomic variation observed in different genotypes across a wide variety of tissues. As part of the IGVF consortium's effort to catalog the effects of genetic variation, we uniformly characterized the transcriptomes of eight tissues from each mouse founder strain used to derive the Collaborative Cross strains, comprising five classical laboratory inbred strains and three wild-derived inbred strains. We sequenced samples from four male and four female replicates per tissue using single-nucleus RNA-seq to generate an "8-cube" dataset of 5.2 million nuclei across 106 cell types and cell states. As expected, the overall extent of transcriptome variation correlates positively with genetic divergence across the strains with the greatest differential between PWK/PhJ and CAST/EiJ. At the individual tissue level, heart and brain are relatively more similar across strains compared with gonads, adrenal, skeletal muscle, kidney, and liver. Further analyses revealed substantial strain variation, often concentrated in a few cell types as well as cell-state signatures that especially reflect strain-associated immune and metabolic trait differences. The founder 8-cube dataset provides rich transcriptome variation signatures to help explain strain-specific phenotypic traits and disease states, as illustrated by examples in tissue-resident immune cells, muscle degeneration, kidney sex differences, and the hypothalamic-pituitary-adrenal axis. This data further provides a systematic foundation for the analysis of these tissues in the founder strains as well as the Collaborative Cross.


Resilience of A Learned Motor Behavior After Chronic Disruption of Inhibitory Circuits

April 2025 · 10 Reads

Maintaining motor behaviors throughout life is crucial for an individual’s survival and reproductive success. The neuronal mechanisms that preserve behavior are poorly understood. To address this question, we focused on the zebra finch, a bird that produces a highly stereotypical song after learning it as a juvenile. Using cell-specific viral vectors, we chronically silenced inhibitory neurons in the pre-motor song nucleus called the high vocal center (HVC), which caused drastic song degradation. However, after producing severely degraded vocalizations for around 2 months, the song rapidly improved, and animals could sing songs that highly resembled the original. In adult birds, single-cell RNA sequencing of HVC revealed that silencing interneurons elevated markers for microglia and increased expression of the Major Histocompatibility Complex I (MHC I), mirroring changes observed in juveniles during song learning. Interestingly, adults could restore their songs despite lesioning the lateral magnocellular nucleus of the anterior neostriatum (LMAN), a brain nucleus crucial for juvenile song learning. This suggests that while molecular mechanisms may overlap, adults utilize different neuronal mechanisms for song recovery. Chronic and acute electrophysiological recordings within HVC and its downstream target, the robust nucleus of the archistriatum (RA), revealed that neuronal activity in the circuit was permanently altered, with higher spontaneous firing in RA and lower in HVC compared to control even after the song had fully recovered. Together, our findings show that a complex learned behavior can recover despite extended periods of perturbed behavior and permanently altered neuronal dynamics. These results show that loss of inhibitory tone can be compensated for by recovery mechanisms that are partly local to the perturbed nucleus and do not require circuits necessary for learning.


Detection of viral sequences at single-cell resolution identifies novel viruses associated with host gene expression changes

Nature Biotechnology

The increasing use of high-throughput sequencing methods in research, agriculture and healthcare provides an opportunity for the cost-effective surveillance of viral diversity and investigation of virus–disease correlation. However, existing methods for identifying viruses in sequencing data rely on and are limited to reference genomes or cannot retain single-cell resolution through cell barcode tracking. We introduce a method that accurately and rapidly detects viral sequences in bulk and single-cell transcriptomics data based on the highly conserved RdRP protein, enabling the detection of over 100,000 RNA virus species. The analysis of viral presence and host gene expression in parallel at single-cell resolution allows for the characterization of host viromes and the identification of viral tropism and host responses. We apply our method to peripheral blood mononuclear cell data from rhesus macaques with Ebola virus disease and describe previously unknown putative viruses. Moreover, we are able to accurately predict viral presence in individual cells based on macaque gene expression.


Differential Analysis Reveals Isoform Switching Following Pneumococcal Vaccination

March 2025 · 3 Reads

Advances in RNA-sequencing (RNA-seq) technology have enabled scalable and accessible transcriptomics studies. Longitudinal RNA sequencing studies have been used to track gene expression over time, revealing biological pathways and expression patterns. Traditional approaches for such studies rely on pairwise comparisons or linear regression models, but these methods face challenges when dealing with many time points. Spline regression offers a robust alternative by efficiently capturing temporal patterns. In this study, we apply spline regression to analyze longitudinal RNA-seq data and demonstrate its advantages in isoform-level differential expression analysis. Our findings highlight the importance of isoform switching, which can be overlooked in gene-level analyses.
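As a rough illustration of the spline regression idea mentioned above, the following sketch fits a cubic regression spline to a simulated time course by ordinary least squares. It is not the pipeline used in the study: the data are simulated, the knot positions are arbitrary, and `cubic_spline_basis` is a hypothetical helper that builds a simple truncated-power basis.

```python
import numpy as np


def cubic_spline_basis(t, knots):
    """Truncated-power cubic spline basis: 1, t, t^2, t^3, and (t - k)_+^3 per knot."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.clip(t - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)


# Hypothetical time course: 8 time points with 3 replicates each (log-scale abundance).
rng = np.random.default_rng(1)
days = np.repeat(np.arange(8.0), 3)
truth = np.sin(days / 2.0)
y = truth + rng.normal(scale=0.1, size=days.size)

# One smooth curve is fitted across all time points instead of many pairwise contrasts.
X = cubic_spline_basis(days, knots=[2.0, 5.0])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
rmse = np.sqrt(np.mean((fitted - truth) ** 2))
print(rmse)
```

In a differential analysis, one would typically compare this fit against a model without the time-dependent columns for each gene or isoform; that testing step is omitted here.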


Geospatially informed representation of spatial genomics data with SpatialFeatureExperiment

February 2025 · 11 Reads

SpatialFeatureExperiment is a Bioconductor package that leverages the versatility of Simple Features for spatial data analysis and SpatialExperiment for single-cell -omics to provide an expansive and convenient S4 class for working with spatial -omics data. SpatialFeatureExperiment can be used to store and analyze a variety of spatial -omics data types, including data from the Visium, Xenium, MERFISH, SeqFish, and Slide-seq platforms, bringing spatial operations to the SingleCellExperiment ecosystem.


Chronocell overview
The input of Chronocell comprises three components: 1) the trajectory structure, which outlines the states each lineage traverses as paths on a directed graph; 2) the sampling assumption, which defines the prior distribution of latent variables, namely lineages and process time, with a default uniform distribution over both; and 3) the scRNA-seq data, consisting of unspliced and spliced count matrices. The Chronocell model consists of an expression model with piecewise-constant transcription rates and a Bernoulli measurement model. Each state s is associated with a transcription rate αs for each gene, as well as an exit time τk denoting the switching time to the next state, where k is the index for the time segment. The EM algorithm is used for inference, with each iteration alternating between E-steps and M-steps. The results of Chronocell primarily include the estimated parameters and posterior distributions over latent variables for each cell.
Demonstration of inference on simulation
a The ground truth trajectory structure. Cells jump to the next state (2) from starting state (1) at τ0 = 0, and then bifurcate into two lineages with different ending states (3 and 4) at τ1. The process ends at τ2 = 1. Out of 200 total genes, 100 genes are non-variable with the same distributions along time. b The falsely assumed structure, which does not encode that the first two states should be merged into one. All genes are assumed to vary along time. c The ELBO scores of 100 random initializations (blue dots) compared to those of warm start (red line). The x axis is the Pearson’s correlation between the mean process time of each random initialization and the true time. d ELBO scores over fitting iterations of both warm start (red line) and the best random initialization (blue line), with the ELBO calculated with true parameters (gray line) as reference. e Heatmaps of inferred posterior distributions. The x axis shows time grids, and the y axis shows cells aligned by their true times with true transcription states on the left. The intensity of color indicates the weights of posterior distributions of cells on the grids. Heatmaps of cells from τ0 to τ1 use a gray color palette. Heatmaps of cells from τ1 to τ2 use a purple color palette. Heatmaps of cells from τ2 to τ3 of the first lineage use blue, and those of the second lineage use red. RMSE stands for root mean square error. f The averaged posterior distributions across cells (dark blue) and true empirical distribution (gray) of process time. g Inferred parameter values compared to true values. For non-variable genes, only βγ are identifiable and compared. Error is mean normalized error as described in the text, and the mean is computed across genes. h The confusion matrix for gene selection. i α values of selected genes over states. j Two models and the distribution of the chosen one by (train) ELBO, AIC, BIC and test ELBO, calculated on 20 samples each with a different set of parameters.
Inference results for Forebrain data
a Schematics of Forebrain data and PCA plot of cells colored by cell type annotations. b The AIC scores of the trajectory and cluster models. The x axis is the mean process time correlations of 100 random initializations (blue dots). The AIC scores of random initializations are compared to those of warm start (red line) as well as 3 Poisson mixture model (yellow line). AP stands for average precision. c Posterior distributions of process time of the trajectory model with warm start. Cells were ordered on the y axis by their inferred mean process time and the left bar displays their cell types using the colors in a. The histogram below shows the posterior distribution averaged over cells. The entropy of the average posterior distribution was calculated using its weights on the 100 discretized time grids.
Inference results for Erythroid data
a Schematics of Erythroid data and PCA plot of cells colored by cell type annotations. b The fitted trajectory structure and inferred mean process time from random initialization indicated in blue on the same PCA plot as in a. c Posterior distributions of process time. Cells were ordered on the y axis by their inferred mean process time and the left bar displays their cell types using the colors in a. The histogram below shows the posterior distribution averaged over cells. The entropy of the average posterior distribution was calculated using its weights on the 100 discretized time grids. d Averaged posterior distribution across cells from different experimental time points. n is the number of cells. e α values of 24 selected genes over states. f Phase plots of top 5 DE genes of 24 selected genes. The x axis is the raw unspliced counts and the y axis is the raw spliced counts. The blue curve is the fitted mean of product Poisson distributions of unspliced and spliced counts over process time, and its darkness corresponds to the value of process time.
Inference results for cell cycle data
a Schematics of Cell cycle data and scatter plot of the Geminin-GFP and Cdt1-RFP of RPE1 cells colored by cell type annotations. b The fitted trajectory structure and inferred mean process time from random initialization indicated in blue on the same scatter plot as in a. c Posterior distributions of process time. Cells were ordered on the y axis by their inferred mean process time and the left bar displays their cell types using the colors in a. The histogram below shows the posterior distribution averaged over cells. The entropy of the average posterior distribution was calculated using its weights on the 100 discretized time grids. d α values of 84 selected genes over states. e Comparison of γ estimates for 84 selected genes with estimates derived from metabolic RNA labeling data. CCC stands for concordance correlation coefficient. n is the number of genes for which estimates are available in each respective paper.
Trajectory inference from single-cell genomics data with a process time model

January 2025 · 39 Reads · 3 Citations

Single-cell transcriptomics experiments provide gene expression snapshots of heterogeneous cell populations across cell states. These snapshots have been used to infer trajectories and dynamic information even without intensive, time-series data by ordering cells according to gene expression similarity. However, while single-cell snapshots sometimes offer valuable insights into dynamic processes, current methods for ordering cells are limited by descriptive notions of “pseudotime” that lack intrinsic physical meaning. Instead of pseudotime, we propose inference of “process time” via a principled modeling approach to formulating trajectories and inferring latent variables corresponding to timing of cells subject to a biophysical process. Our implementation of this approach, called Chronocell, provides a biophysical formulation of trajectories built on cell state transitions. The Chronocell model is identifiable, making parameter inference meaningful. Furthermore, Chronocell can interpolate between trajectory inference, when cell states lie on a continuum, and clustering, when cells cluster into discrete states. By using a variety of datasets ranging from cluster-like to continuous, we show that Chronocell enables us to assess the suitability of datasets and reveals distinct cellular distributions along process time that are consistent with biological process times. We also compare our parameter estimates of degradation rates to those derived from metabolic labeling datasets, thereby showcasing the biophysical utility of Chronocell. Nevertheless, based on performance characterization on simulations, we find that process time inference can be challenging, highlighting the importance of dataset quality and careful model assessment.
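The overview caption describes an expression model with piecewise-constant transcription rates that switch at given times. A minimal sketch of what such a model implies for a mean expression level is given below. It is an assumption-laden illustration, not Chronocell's code: it tracks only the mean unspliced level under the standard linear kinetics du/dt = α_s − β·u, treats β as a constant removal (splicing) rate, and uses the hypothetical function name `mean_unspliced`. Within each segment the solution is u(t) = u(t0)·e^(−β(t−t0)) + (α_s/β)·(1 − e^(−β(t−t0))).

```python
import numpy as np


def mean_unspliced(t, alphas, switch_times, beta, u0=0.0):
    """Mean unspliced level at time t for du/dt = alpha_s - beta * u.

    alphas[i] is the transcription rate on [switch_times[i], switch_times[i + 1]);
    the last rate applies from switch_times[-1] onward.
    """
    u, start = u0, switch_times[0]
    for i, alpha in enumerate(alphas):
        end = switch_times[i + 1] if i + 1 < len(switch_times) else np.inf
        dt = min(t, end) - start          # time spent in this segment before t
        if dt <= 0:
            break
        decay = np.exp(-beta * dt)
        # Closed-form solution of the linear ODE within one constant-rate segment.
        u = u * decay + (alpha / beta) * (1.0 - decay)
        start = end
    return u


# Hypothetical parameters: the rate switches from 1.0 to 5.0 at process time 0.5.
alphas, switch_times, beta = [1.0, 5.0], [0.0, 0.5], 2.0
for t in (0.0, 0.25, 0.5, 1.0):
    print(t, mean_unspliced(t, alphas, switch_times, beta))
```

Because each segment starts from the value reached at the end of the previous one, the mean is continuous across switching times, and it relaxes toward α_s/β within each state.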



Citations (63)


... To overcome this issue, one can first use methods like DeepCycle [87] or VeloCycle [88] to assign a cell age θ (which varies between 0 at the beginning of the cell-cycle and 1 at cell division) to each cell and then fit an age-dependent model of gene expression to the age-resolved scRNA-seq data [6]. Alternative ways to possibly circumvent the issues we have reported using steady-state models is to instead fit models that account for different cell states due to differentiation [89] or models that use a variety of single-cell data (transcriptomic, proteomic and epigenomic) [90]. These approaches are computationally challenging and are actively under investigation. ...

Reference:

From Noise to Models to Numbers: Evaluating Negative Binomial Models and Parameter Estimations in Single-Cell RNA-seq
Trajectory inference from single-cell genomics data with a process time model

... We expanded the RNA-seq preprocessing tool kallisto 30,31 to allow translated alignment of nucleotide sequences to an amino acid reference and validated its use in combination with the PalmDB amino acid reference for the detection of virus-like sequences in single-cell and bulk RNA-seq data. PalmDB is a database of 296,623 unique RdRP-containing amino acid sequences, representing 146,973 virus species 10 . ...

kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq
  • Citing Article
  • October 2024

Nature Protocols

... We also investigated the reliability of the burst frequency and burst size estimated from the two parameters of the best-fitting NB distribution in the parameter space region where this model is selected as the optimal one. These are commonly estimated by interpreting the NB distribution as that arising from a simpler mechanistic model than the telegraph model, namely the bursty gene expression model (with reaction scheme Eq. (15)), which was first studied in [67] and is now commonly used as the basis for more sophisticated stochastic models of gene expression [70][71][72][73][74][75] including those used to fit scRNA-seq data [6,52,[76][77][78][79]. This model can be derived from the telegraph model under the assumption that the gene inactivation rate is much larger than the gene activation rates [49], i.e. the transcriptional bursting regime. ...

Biophysically interpretable inference of cell types from multimodal sequencing data

Nature Computational Science

... In phases 1 and 2, once the sample size exceeds 1,000, we will map trans-molQTL and explore how they interact with cis-molQTL to affect molecular and complex phenotypes. In addition, we will systematically explore context-specific effects of both rare and common variants and use available cell type-specific expression and epigenetic data to resolve the cell type in which these variants function and potentially interact 44,45 . Although it is computationally intensive, we prefer linear mixed models to simple linear models for mapping molQTL, particularly trans-molQTL, in farmed animal populations with complex familial relatedness 46 . ...

Deciphering the impact of genomic variation on function

Nature

... There are also substantial routes to strengthening our mathematical formalism: we have not yet considered cell-cell interactions or cell cycle effects -of which will be the subject of future work. Other groups have also proposed mechanisms to combine biophysical modelling with deep learning frameworks (Carilli et al., 2024), suggesting we are not the only group thinking in this manner. ...

Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

Nature Methods

... For example, Oarfish uses coverage-based probabilistic modelling to adjust fragment assignment probabilities, addressing uneven coverage in long-read data sets 105 . Similarly, lr-kallisto adapts the pseudo-alignment strategy to accommodate longer reads and higher error rates, focusing on reducing the complexity of transcript compatibility graphs for speed and resource efficiency 34 . LIQA introduces a bias correction model to handle truncated reads and 3′-end biases by fitting them into a Kaplan-Meier estimator, enhancing isoform quantification accuracy 35 . ...

Long-read sequencing transcriptome quantification with lr-kallisto

... Next-generation CAR T cells designed to withstand TGF-β-rich environments or secrete immunostimulatory cytokines offer a cutting-edge solution to prostate cancer's hostile microenvironment, although robust evidence of clinical efficacy remains limited to early stage trials [121]. Similarly, CRISPR-based gene editing could enable the precise removal of inhibitory receptors or the incorporation of enhanced effector functions into T or natural killer cells. ...

PSCA-CAR T cell therapy in metastatic castration-resistant prostate cancer: a phase 1 trial

Nature Medicine

... For instance, consider a two-dimensional smFISH image discretized into N × N pixels. There is a temptation to use any of the zoo of techniques that have enjoyed success for non-spatial models, including generating functions [44], finite-state projections [16,45], or neural networks [46,47]. However, one would seemingly need to perform these sometimes already costly or complex calculations for all pixels in the image, each with a distribution for counts. ...

Spectral neural approximations for models of transcriptional dynamics
  • Citing Article
  • May 2024

Biophysical Journal

... However, running Leiden within Seurat resulted in drawbacks including higher memory usage, longer calculation time and random crashes in docker containers [13] . Scanpy resolves these issues, and unlike Seurat, Scanpy improves visualization quality by using consistent KNN and SNN graphs for both clustering and uniform manifold approximation and projection (UMAP) [13] [14] . ...

The impact of package selection and versioning on single-cell RNA-seq analysis

... p13 retrieved together with the used gene models from Ensembl 108, extended with the Rhapsody sample tags for Homo sapiens. Cell barcodes were retrieved from published data [43,44]. The targeted gene set consisted of the Becton Dickinson Rhapsody Onco-BC Targeted Panel (https://scomix. ...

A machine-readable specification for genomics assays
  • Citing Article
  • April 2024

Bioinformatics