
scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Authors: Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan & Bo Wang

Abstract and Figures

Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
Cell type-annotation results using scGPT a, UMAP of gene expression of cells from the human pancreas dataset, colored by the cell types annotated in the original study (left) and by the cell types predicted by the fine-tuned scGPT (right). PP, pancreatic polypeptide cell; PSC, pancreatic stellate cell. b, The confusion matrix between predicted and annotated cell types in the human pancreas dataset. c, Heatmap of 512-dimensional cell embeddings from scGPT in the human pancreas dataset. d, UMAP visualization of the myeloid dataset, colored by cancer types. scGPT was fine-tuned on the reference partition (left) and evaluated on the query partition (right). These two data partitions contain distinct cancer types. cDC2, type 2 (CD1A⁺CD172A⁺) conventional dendritic cell; ESCA, esophageal carcinoma; LYM, lymphoma; MYE, myeloma; OV-FTC, ovarian or follicular thyroid carcinoma; PAAD, pancreatic adenocarcinoma; THCA, thyroid carcinoma; UCEC, uterine corpus endometrial carcinoma. e, On the query partition, UMAP is colored by the cell types annotated in the original study (left) and by scGPT-predicted cell types. f,h, Confusion matrices between predicted cell types and actual annotations for the MS and myeloid datasets, respectively. g,i, Heatmaps showing 512-dimensional cell embeddings in scGPT for cells in the MS and myeloid datasets, respectively. j, Evaluation of the cell-annotation performance of scGPT through n = 5 random train–validation splits on the myeloid, MS and human pancreas datasets. Performance metrics from the test sets are presented as mean values ± s.e.m.
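The annotation workflow summarized in these panels follows the standard fine-tuning recipe described in the paper: a classification head is attached on top of the pretrained cell embedding, trained on the reference partition and then applied to the query partition. The sketch below illustrates that recipe in generic PyTorch; it is not the official scGPT API, and the class name, data-loader fields and hyperparameters are placeholders, with the encoder argument standing in for a pretrained backbone.

```python
# Hedged sketch: fine-tune a pretrained cell encoder for cell-type annotation.
# Illustrative only; the scGPT package exposes its own fine-tuning pipeline.
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, emb_dim: int = 512, n_types: int = 10):
        super().__init__()
        self.encoder = encoder                      # pretrained transformer backbone
        self.head = nn.Linear(emb_dim, n_types)     # task-specific classification head

    def forward(self, gene_tokens, values):
        cell_emb = self.encoder(gene_tokens, values)   # (batch, 512) cell embedding
        return self.head(cell_emb)                     # logits over cell types

def finetune(model, loader, epochs=5, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for gene_tokens, values, labels in loader:     # reference partition batches
            opt.zero_grad()
            loss = loss_fn(model(gene_tokens, values), labels)
            loss.backward()
            opt.step()
    return model
```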
… 
Prediction results for perturbation response and reverse perturbation a, Comparison between scGPT and other perturbation prediction methods. Pearson correlation between predicted and actual gene expression changes is reported. The metric is computed for all genes and the top differentially expressed (DE) genes, respectively. b, Two example perturbations in the Adamson test dataset, distribution of predicted (pred; n = 300 cells) and actual gene expression change (n = 405 cells for the perturbation of gene KCTD16 and n = 618 cells for the perturbation of gene DAD1) of the top 20 differentially expressed genes. The box denotes the interquartile range of expression change. The median is marked by the central line within each box. Whiskers extend to 1.5 times the interquartile range. Horizontal dashed lines represent the null baseline of gene expression changes. c, Illustration diagram for predicting unseen perturbation responses using scGPT. d, UMAP of predicted gene expression profiles of the perturbation conditions. The UMAP plot is colored by Leiden clusters and labeled with the dominant gene of each cluster. e, Expression patterns of two selected perturbed genes (KLF1 and CNN1) over UMAP of perturbation conditions. f, Visualization of possible perturbation combinations over the perturbation combination space of 20 genes. The grid is colored by experiment type (train, valid, test, unseen). All predicted perturbations are highlighted by square boxes, and the actual source perturbation is marked with a cross. g, Top 1–8 accuracy by scGPT for correct and relevant predictions among the seven test cases, benchmarked against GEARS and the naive baseline of the top two differential genes. The relevant predictions (blue) indicate that at least one of the perturbed genes in the perturbation combination is found in the predictions. The hit rates of scGPT are represented by the bars, those of GEARS are shown by the lines, and differential genes (for top 1 predictions only) are represented by square markers.
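Panel a reports the Pearson correlation between predicted and observed post-perturbation expression changes, computed over all genes and over the top differentially expressed genes. A small sketch of that metric, with illustrative variable names, is below.

```python
# Sketch of the evaluation metric in panel a: Pearson correlation between
# predicted and observed expression change, over all genes and top-DE genes.
import numpy as np
from scipy.stats import pearsonr

def perturbation_corr(pred_post, true_post, ctrl_mean, top_de_idx=None):
    """pred_post, true_post: mean post-perturbation expression per gene;
    ctrl_mean: mean control expression per gene;
    top_de_idx: indices of top differentially expressed genes (optional)."""
    pred_delta = pred_post - ctrl_mean      # predicted expression change
    true_delta = true_post - ctrl_mean      # observed expression change
    r_all, _ = pearsonr(pred_delta, true_delta)
    if top_de_idx is None:
        return r_all, None
    r_de, _ = pearsonr(pred_delta[top_de_idx], true_delta[top_de_idx])
    return r_all, r_de
```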
… 
Nature Methods | Voume 21 | August 2024 | 1470–1480 1470
https://doi.org/10.1038/s41592-024-02201-0
scGPT: toward building a foundation model for single-cell multi-omics using generative AI
Haotian Cui 1,2,3,8, Chloe Wang1,2,3,8, Hassaan Maan 1,3,4, Kuan Pang 2,3,
Fengning Luo2,3, Nan Duan 5 & Bo Wang 1,2,3,4,6,7
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
Single-cell RNA sequencing (scRNA-seq), by enabling intricate characterization of distinct cell types and advancing our understanding of disease pathogenesis, paves the way for cellular heterogeneity exploration, lineage tracking, pathogenic mechanism elucidation and, ultimately, personalized therapeutic strategies1–4. The broad-scale application of scRNA-seq has led to comprehensive data atlases such as the Human Cell Atlas, which now encompasses tens of millions of cells5–7. Recent advancements in sequencing technology promote the diversity of data modalities and extend our understanding beyond genomics to epigenetics, transcriptomics and proteomics, thus providing multi-modal insights8,9. These breakthroughs have also raised new research questions such as reference mapping, perturbation prediction and multi-omic integration10–14. It is critical to develop, in parallel, methodologies capable of effectively harnessing, enhancing and adapting to the rapid expansion of sequencing data.
One promising approach to address this challenge is the generative pretraining of foundation models15,16. Foundation models, often built upon the self-attention transformer architecture17 for its effectiveness in learning expressive data representations, are a class of deep learning models that are pretrained on large-scale, diverse datasets and can be readily adapted for a variety of downstream tasks. Such models have recently achieved unprecedented success across various fields, exemplified by DALL-E 2 and GPT-4 in computer vision and natural language generation (NLG)18–20 and recently Enformer21 for biological applications. More interestingly, these generative pretrained models consistently outperform task-specific models trained from scratch22,23. This indicates a task-agnostic understanding of knowledge in these domains, inspiring us to explore the adoption of this approach for single-cell omic research. However, current machine-learning-based methods in single-cell research are rather scattered, with specific models dedicated to distinct analysis tasks24–26. As a result, the datasets used in each study are often limited in breadth and scale7. To confront this limitation, there is a need for a foundation model that is pretrained on large-scale data and can comprehend the complex interactions between genes across diverse tissues.
Received: 12 July 2023
Accepted: 30 January 2024
Published online: 26 February 2024
1Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada. 2Department of Computer Science, University of Toronto, Toronto,
Ontario, Canada. 3Vector Institute, Toronto, Ontario, Canada. 4Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada.
5Microsoft Research, Redmond, WA, USA. 6Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada. 7AI Hub,
University Health Network, Toronto, Ontario, Canada. 8These authors contributed equally: Haotian Cui, Chloe Wang. e-mail: bowang@vectorinstitute.ai
... To address these challenges, we introduce scGPT-spatial, a continual pretrained model specifically designed for the domain of spatial transcriptomics. Building on the existing scRNA-seq foundation model scGPT [13], scGPT-spatial inherits its established domain knowledge and is continually pretrained on a large-scale spatial transcriptomic corpus. For this training of scGPT-spatial, we carefully curated a spatial transcriptomic dataset, SpatialHuman30M, consisting of 30 million human cells and spots from four sequencing protocols: Visium, Visium HD [14], MERFISH, and Xenium [15]. ...
... scGPT-spatial extends the pretrained scRNA-seq foundation model, scGPT [13], to spatial omics via continual pretraining (Figure 1a). Pretraining large-scale transformers on scRNA-seq data has demonstrated their effectiveness in capturing intricate cell states and gene network dynamics through attention mechanisms [18]. ...
... scGPT-spatial was benchmarked against a zero-shot baseline, PCA, and a popular scRNA-seq integration method, Seurat v4 [19]. We report AvgBIO and AvgBATCH metrics for biological conservation and batch mixing performance respectively, in line with scGPT's integration evaluation [13] (Methods 4.5.2). In Figure 2A, scGPT-spatial readily integrates major cell types in Visium and Xenium slides from the Developing Fetal Lung dataset [20] when projecting shared genes. ...
Preprint
Spatial transcriptomics has emerged as a pivotal technology for profiling gene expression of cells within their spatial context. The rapid growth of publicly available spatial data presents an opportunity to further our understanding of microenvironments that drive cell fate decisions and disease progression. However, existing foundation models, largely pretrained on single-cell RNA sequencing (scRNA-seq) data, fail to resolve the spatial relationships among samples or capture the unique distributions from various sequencing protocols. We introduce scGPT-spatial, a specialized foundation model for spatial transcriptomics continually pretrained on our previously published scGPT scRNA-seq foundation model. We also curate SpatialHuman30M, a comprehensive spatial transcriptomics dataset comprising 30 million spatial transcriptomic profiles, encompassing both imaging- and sequencing-based protocols. To facilitate integration, scGPT-spatial introduces a novel MoE (Mixture of Experts) decoder that adaptively routes samples for protocol-aware decoding of gene expression profiles. Moreover, scGPT-spatial employs a spatially-aware sampling strategy and a novel neighborhood-based training objective to better capture spatial co-localization patterns among cell states within tissue. Empirical evaluations demonstrate that scGPT-spatial robustly integrates spatial data in multi-slide and multi-modal settings, and effectively supports cell-type deconvolution and contextualized missing gene expression imputation, outperforming many existing methods. The scGPT-spatial codebase is publicly available at https://github.com/bowang-lab/scGPT-spatial.
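The MoE decoder mentioned in this abstract routes each profile to protocol-specific experts before decoding gene expression. The following is a minimal, generic mixture-of-experts readout in PyTorch, intended only to illustrate the idea; it is not the scGPT-spatial implementation, and the expert count and soft-gating scheme are placeholders.

```python
# Minimal mixture-of-experts decoder sketch (illustrative, not the
# scGPT-spatial implementation): a gating network produces soft weights
# over per-protocol experts, and the output is the weighted expert mean.
import torch
import torch.nn as nn

class MoEDecoder(nn.Module):
    def __init__(self, emb_dim=512, n_genes=2000, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(emb_dim, n_genes) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(emb_dim, n_experts)

    def forward(self, cell_emb):                                    # (batch, emb_dim)
        weights = torch.softmax(self.gate(cell_emb), dim=-1)        # (batch, n_experts)
        outs = torch.stack([e(cell_emb) for e in self.experts], 1)  # (batch, E, n_genes)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)            # (batch, n_genes)
```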
... Recent advances in foundation models have revolutionized single-cell analysis by leveraging large-scale pre-training on extensive datasets. Models such as Geneformer (Theodoris et al., 2023), scGPT (Cui et al., 2024a), scBERT (Yang et al., 2022), and scFoundation (Hao et al., 2024a) utilize the self-supervised learning strategy akin to Masked Language Modeling (MLM) employed in BERT (Kenton & Toutanova, 2019). In particular, these models conceptualize a single cell as "a sentence of genes", wherein certain gene expressions are randomly masked, and the model is trained to predict the masked expressions based on the expressions of the remaining genes, thereby capturing gene-to-gene correlations. ...
... Geneformer (Theodoris et al., 2023), scGPT (Cui et al., 2024a), scBERT (Yang et al., 2022), and scFoundation (Hao et al., 2024a) are foundation models pre-trained on extensive datasets comprising millions of scRNA-seq profiles. These models exhibit promising performance in a variety of tasks, including cell type annotation, batch integration, perturbation modeling, and gene network inference. ...
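The masked-expression objective described in these excerpts, treating a cell as a "sentence of genes" and reconstructing hidden expression values from the remaining genes, can be sketched as below; the mask ratio, placeholder value and squared-error loss are illustrative assumptions rather than any specific model's recipe.

```python
# Sketch of the masked gene-expression objective described above:
# hide a random subset of expression values and reconstruct them
# from the unmasked genes (analogous to masked language modeling).
import torch

def masked_expression_loss(model, gene_ids, values, mask_ratio=0.15):
    mask = torch.rand_like(values) < mask_ratio        # genes to hide
    masked_values = values.masked_fill(mask, 0.0)      # placeholder for hidden values
    pred = model(gene_ids, masked_values)              # predict expression for all genes
    # mean-squared error only on the masked positions
    return ((pred - values)[mask] ** 2).mean()
```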
Preprint
Foundation models exhibit strong capabilities for downstream tasks by learning generalized representations through self-supervised pre-training on large datasets. While several foundation models have been developed for single-cell RNA-seq (scRNA-seq) data, there is still a lack of models specifically tailored for single-cell ATAC-seq (scATAC-seq), which measures epigenetic information in individual cells. The principal challenge in developing such a model lies in the vast number of scATAC peaks and the significant sparsity of the data, which complicates the formulation of peak-to-peak correlations. To address this challenge, we introduce EpiFoundation, a foundation model for learning cell representations from the high-dimensional and sparse space of peaks. EpiFoundation relies on an innovative cross-modality pre-training procedure with two key technical innovations. First, EpiFoundation exclusively processes the non-zero peak set, thereby enhancing the density of cell-specific information within the input data. Second, EpiFoundation utilizes dense gene expression information to supervise the pretraining process, aligning peak-to-gene correlations. EpiFoundation can handle various types of downstream tasks, including cell-type annotation, batch correction, and gene expression prediction. To train and validate EpiFoundation, we curated MiniAtlas, a dataset of 100,000+ single cells with paired scRNA-seq and scATAC-seq data, along with diverse test sets spanning various tissues and cell types for robust evaluation. EpiFoundation demonstrates state-of-the-art performance across multiple tissues and diverse downstream tasks.
... This concept has been successfully extended to single-cell genomics through single-cell foundation models (scFMs) [3], which leverage transformer-based architectures to capture complex gene-gene relationships and cell-level variations in a self-supervised manner. Recent advances in human scFMs, such as Geneformer [4] and scGPT [5], have demonstrated remarkable success in various tasks including cell-type annotation, batch effect correction, and perturbation response prediction. ...
... To process input data, Lemur employs a specialized preprocessing pipeline that focuses on expressed genes, that is, those with non-zero expression values (Figure 1B). For each gene, the expression level undergoes rank-based binning to create discrete categories, providing a robust representation that minimizes technical variations [5]. These gene-expression pairs are then independently embedded and aggregated, with a prepended <cell> token that serves to capture the cell's overall expression profile. ...
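The rank-based binning step described above, which discretizes each cell's non-zero expression values to dampen technical variation, might look roughly like the following; the bin count and the convention of reserving bin 0 for unexpressed genes are assumptions for illustration.

```python
# Sketch of rank-based value binning as described above: non-zero
# expression values in a cell are ranked and mapped onto a fixed number
# of discrete bins (details differ across implementations).
import numpy as np

def rank_bin(expr, n_bins=51):
    """expr: 1D array of one cell's expression; returns an integer bin per gene,
    with 0 reserved for unexpressed genes."""
    binned = np.zeros_like(expr, dtype=int)
    nz = np.nonzero(expr)[0]
    if nz.size == 0:
        return binned
    ranks = expr[nz].argsort().argsort()               # rank of each non-zero value
    # map ranks uniformly onto bins 1..n_bins-1
    binned[nz] = 1 + (ranks * (n_bins - 1) // nz.size).astype(int)
    return binned
```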
Preprint
Single-cell genomics has revolutionized our understanding of cellular heterogeneity, but automating its analysis remains an open challenge. Cell-type annotation represents a critical bottleneck, particularly as datasets grow in size and complexity. While foundation models have shown promise in addressing this challenge, existing approaches require extensive fine-tuning for effective cell-type annotation. Here, we present Lemur, a single-cell foundation model specifically designed for Drosophila melanogaster. Lemur achieves fine-tuning-free cell-type annotation through comprehensive pre-training on an integrated whole-organism atlas with a unified cell-type annotation schema. To leverage this unified schema, we developed a dedicated hierarchical cell-type decoder architecture. This approach enables Lemur to generate consistent cell-type predictions across multiple levels of granularity without requiring additional training on new datasets. The model demonstrates strong performance across diverse tissue types, experimental conditions, and sequencing technologies. It also achieves batch-effect correction without explicitly training for this task. This automated analysis capability positions Lemur as an effective tool for the fly research community. Beyond its immediate applications, Lemur establishes a framework for accelerating biological discovery. It enables rapid iteration between computational predictions and experimental validation in the highly controlled Drosophila system, with potential implications for translational research in human biology, particularly in aging and neurodegenerative disease studies.
... To obtain a pre-trained LLM with extensive single-cell prior knowledge, we employ the state-of-the-art (SOTA) single-cell analysis foundation model, scGPT18, as the teacher model. This Transformer-based model has been extensively pre-trained on over 33 million cells, capturing representation patterns of various human cell types, including pancreatic and blood cells. ...
... In this study, we use scGPT18, a SOTA foundation model for single-cell analysis, as the teacher model. We aim to transfer the representation patterns learned by scGPT, which was pre-trained on 33M single-cell data, into the lightweight scKAN model. ...
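The distillation setup described in these excerpts transfers the frozen teacher's cell representations into a lightweight student. A generic sketch of one such training step is below; it is not the scKAN code, and the MSE objective and matching embedding dimensions are assumptions.

```python
# Generic knowledge-distillation sketch: match a lightweight student's cell
# embeddings to those of a frozen teacher (illustrative, not the scKAN code).
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, alpha=1.0):
    with torch.no_grad():
        t_emb = teacher(batch)                 # frozen teacher cell embeddings
    s_emb = student(batch)                     # student embeddings (same dimension assumed)
    loss = alpha * F.mse_loss(s_emb, t_emb)    # match representations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```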
Preprint
Single-cell analysis has revolutionized our understanding of cellular heterogeneity, yet current approaches face challenges in efficiency and interpretability. In this study, we present scKAN, a framework that leverages Kolmogorov-Arnold Networks for interpretable single-cell analysis through three key innovations: efficient knowledge transfer from large language models through a lightweight distillation strategy; systematic identification of cell-type-specific functional gene sets through KAN's learned activation curves; and precise marker gene discovery enabled by KAN's importance scores with potential for drug repurposing applications. The model achieves superior performance on cell-type annotation with a 6.63% improvement in macro F1 score compared to state-of-the-art methods. Furthermore, scKAN's learned activation curves and importance scores provide interpretable insights into cell-type-specific gene patterns, facilitating both gene set identification and marker gene discovery. We demonstrate the practical utility of scKAN through a case study on pancreatic ductal adenocarcinoma, where it successfully identified novel therapeutic targets and potential drug candidates, including Doconexent as a promising repurposing candidate. Molecular dynamics simulations further validated the stability of the predicted drug-target complexes. Our approach offers a comprehensive framework for bridging single-cell analysis with drug discovery, accelerating the translation of single-cell insights into therapeutic applications.
... The advancements of large language models (LLMs) [11], [12], [13], [14], [15], [16], [17] have transformed multiple scientific domains through their unprecedented capabilities in understanding complex patterns and relationships in text data [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. Building on their powerful language-processing capabilities, biology-specific models have been developed to tackle diverse tasks: ProtBERT [28] and ESM3 [29] focus on protein sequence analysis, while BioGPT [19] and BioMedGPT [30] have shown promise in extracting biological knowledge from scientific literature. ...
Preprint
Identification of protein-protein interactions (PPIs) helps derive cellular mechanistic understanding, particularly in the context of complex conditions such as neurodegenerative disorders, metabolic syndromes, and cancer. Large Language Models (LLMs) have demonstrated remarkable potential in predicting protein structures and interactions via automated mining of vast biomedical literature; yet their inherent uncertainty remains a key challenge for deriving reproducible findings, critical for biomedical applications. In this study, we present an uncertainty-aware adaptation of LLMs for PPI analysis, leveraging fine-tuned LLaMA-3 and BioMedGPT models. To enhance prediction reliability, we integrate LoRA ensembles and Bayesian LoRA models for uncertainty quantification (UQ), ensuring confidence-calibrated insights into protein behavior. Our approach achieves competitive performance in PPI identification across diverse disease contexts while addressing model uncertainty, thereby enhancing trustworthiness and reproducibility in computational biology. These findings underscore the potential of uncertainty-aware LLM adaptation for advancing precision medicine and biomedical research.
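The abstract above combines LoRA adapters with ensembling for uncertainty quantification. As a rough illustration (not the study's code), a LoRA layer adds a trainable low-rank update to a frozen linear layer, and the spread of predictions across an ensemble of such adapters gives a simple uncertainty estimate; the rank, scaling and sigmoid output head below are assumptions.

```python
# Minimal LoRA adapter sketch plus ensemble-variance uncertainty
# (illustrative; not the fine-tuning code used in the cited study).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # frozen base projection plus trainable low-rank update
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def ensemble_uncertainty(models, x):
    """Mean and variance of sigmoid outputs across a LoRA ensemble."""
    probs = torch.stack([torch.sigmoid(m(x)) for m in models], dim=0)
    return probs.mean(dim=0), probs.var(dim=0)
```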
... As a result, most CpG sites remain unmeasured in typical datasets (Figure 1 (a)). In contrast, large-scale gene expression profiling has enabled foundation models like Geneformer (Theodoris et al., 2023), scGPT (Cui et al., 2024a), scBERT (Yang et al., 2022), and scFoundation (Hao et al., 2024a). Given the link between gene expression and DNAm, leveraging this relationship offers a promising solution for genome-wide DNAm prediction and analysis. ...
Preprint
DNA methylation (DNAm), an epigenetic modification, regulates gene expression, influences phenotypes, and encodes inheritable information, making it critical for disease diagnosis, treatment, and prevention. While the human genome contains approximately 28 million CpG sites where DNAm can be measured, only 1% to 3% of these sites are typically available in most datasets due to complex experimental protocols and high costs, hindering insights from DNAm data. Leveraging the relationship between gene expression and DNAm offers promise for computational inference, but existing statistical, machine learning, and masking-based generative Transformers face critical limitations: they cannot infer DNAm at unmeasured CpGs or in new samples effectively. To overcome these challenges, we introduce MethylProphet, a gene-guided, context-aware Transformer model designed for DNAm inference. MethylProphet employs a Bottleneck MLP for efficient gene profile compression and a specialized DNA sequence tokenizer, integrating global gene expression patterns with local CpG context through a Transformer encoder architecture. Trained on whole-genome bisulfite sequencing data from ENCODE (1.6B training CpG-sample pairs; 322B tokens), MethylProphet demonstrates strong performance in hold-out evaluations, effectively inferring DNAm for unmeasured CpGs and new samples. In addition, its application to 10842 pairs of gene expression and DNAm samples at TCGA chromosome 1 (450M training CpG-sample pairs; 91B tokens) highlights its potential to facilitate pan-cancer DNAm landscape inference, offering a powerful tool for advancing epigenetic research and precision medicine. All codes, data, protocols, and models are publicly available via https://github.com/xk-huang/methylprophet/.
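The Bottleneck MLP mentioned above compresses a full gene-expression profile into a small embedding that the Transformer encoder can consume. A minimal sketch with placeholder dimensions is below; it is illustrative only, not the MethylProphet implementation.

```python
# Sketch of a bottleneck MLP that compresses a gene-expression profile
# into a small embedding (illustrative; dimensions are placeholders).
import torch.nn as nn

class BottleneckMLP(nn.Module):
    def __init__(self, n_genes=20000, bottleneck=256, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, bottleneck),   # compress the full expression vector
            nn.ReLU(),
            nn.Linear(bottleneck, out_dim),   # project to the Transformer width
        )

    def forward(self, expr):
        return self.net(expr)
```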
... Finally, the robustness of the proposed Reverse-Gene-Finder technology in different settings can be examined by identifying causal biomarkers across different datasets and model architectures. Future work can investigate other types of biomedical foundation models, e.g., GPT-like models (Cui et al. 2024), and data modalities, e.g., single-cell imaging data (Yang et al. 2021) and multi-omics data (Efremova and Teichmann 2020) and develop a multi-modal approach for AD biomarker identification beyond genetics. ...
Preprint
Alzheimer's Disease (AD) affects over 55 million people globally, yet the key genetic contributors remain poorly understood. Leveraging recent advancements in genomic foundation models, we present the innovative Reverse-Gene-Finder technology, a ground-breaking neuron-to-gene-token backtracking approach in a neural network architecture to elucidate the novel causal genetic biomarkers driving AD onset. Reverse-Gene-Finder comprises three key innovations. Firstly, we exploit the observation that genes with the highest probability of causing AD, defined as the most causal genes (MCGs), must have the highest probability of activating those neurons with the highest probability of causing AD, defined as the most causal neurons (MCNs). Secondly, we utilize a gene token representation at the input layer to allow each gene (known or novel to AD) to be represented as a discrete and unique entity in the input space. Lastly, in contrast to the existing neural network architectures, which track neuron activations from the input layer to the output layer in a feed-forward manner, we develop an innovative backtracking method to track backwards from the MCNs to the input layer, identifying the Most Causal Tokens (MCTs) and the corresponding MCGs. Reverse-Gene-Finder is highly interpretable, generalizable, and adaptable, providing a promising avenue for application in other disease scenarios.
Preprint
Genome foundation models hold transformative potential for precision medicine, drug discovery, and understanding complex biological systems. However, existing models are often inefficient, constrained by suboptimal tokenization and architectural design, and biased toward reference genomes, limiting their representation of low-abundance, uncultured microbes in the rare biosphere. To address these challenges, we developed GenomeOcean, a 4-billion-parameter generative genome foundation model trained on over 600 Gbp of high-quality contigs derived from 220 TB of metagenomic datasets collected from diverse habitats across Earth's ecosystems. A key innovation of GenomeOcean is training directly on large-scale co-assemblies of metagenomic samples, enabling enhanced representation of rare microbial species and improving generalizability beyond genome-centric approaches. We implemented a byte-pair encoding (BPE) tokenization strategy, marking its first use for genome sequence generation, alongside architectural optimizations, achieving up to 150x faster sequence generation while maintaining high biological fidelity. GenomeOcean excels in representing microbial species and generating protein-coding genes constrained by evolutionary principles. Additionally, its fine-tuned model demonstrated the ability to discover novel biosynthetic gene clusters (BGCs) in natural genomes and perform zero-shot synthesis of biochemically plausible, complete BGCs. GenomeOcean sets a new benchmark for metagenomic research, natural product discovery, and synthetic biology, offering a robust foundation for advancing these fields.
Article
Understanding cellular responses to genetic perturbation is central to numerous biomedical applications, from identifying genetic interactions involved in cancer to developing methods for regenerative medicine. However, the combinatorial explosion in the number of possible multigene perturbations severely limits experimental interrogation. Here, we present graph-enhanced gene activation and repression simulator (GEARS), a method that integrates deep learning with a knowledge graph of gene–gene relationships to predict transcriptional responses to both single and multigene perturbations using single-cell RNA-sequencing data from perturbational screens. GEARS is able to predict outcomes of perturbing combinations consisting of genes that were never experimentally perturbed. GEARS exhibited 40% higher precision than existing approaches in predicting four distinct genetic interaction subtypes in a combinatorial perturbation screen and identified the strongest interactions twice as well as prior approaches. Overall, GEARS can predict phenotypically distinct effects of multigene perturbations and thus guide the design of perturbational experiments.
Article
Deciphering individual cell phenotypes from cell-specific transcriptional processes requires high dimensional single cell RNA sequencing. However, current dimensionality reduction methods aggregate sparse gene information across cells, without directly measuring the relationships that exist between genes. By performing dimensionality reduction with respect to gene co-expression, low-dimensional features can model these gene-specific relationships and leverage shared signal to overcome sparsity. We describe GeneVector, a scalable framework for dimensionality reduction implemented as a vector space model using mutual information between gene expression. Unlike other methods, including principal component analysis and variational autoencoders, GeneVector uses latent space arithmetic in a lower dimensional gene embedding to identify transcriptional programs and classify cell types. In this work, we show in four single cell RNA-seq datasets that GeneVector was able to capture phenotype-specific pathways, perform batch effect correction, interactively annotate cell types, and identify pathway variation with treatment over time.
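GeneVector's vector space is built from mutual information between gene expression profiles. One simple way to estimate that statistic, shown below for a single gene pair, is to discretize expression into quantile bins and use scikit-learn's mutual_info_score; the bin count is an assumption and this is not the GeneVector implementation.

```python
# Sketch: estimate mutual information between two genes' expression vectors
# by discretizing into quantile bins (illustrative of the co-expression
# statistic described above, not the GeneVector implementation).
import numpy as np
from sklearn.metrics import mutual_info_score

def gene_pair_mi(x, y, n_bins=10):
    """x, y: expression of two genes across the same cells."""
    edges_x = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    edges_y = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    x_binned = np.digitize(x, edges_x)
    y_binned = np.digitize(y, edges_y)
    return mutual_info_score(x_binned, y_binned)
```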
Article
Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding1,2 and computer vision³ by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
Preprint
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
Article
Single cell data integration methods aim to integrate cells across data batches and modalities, and data integration tasks can be categorized into horizontal, vertical, diagonal, and mosaic integration, where mosaic integration is the most general and challenging case with few methods developed. We propose scMoMaT, a method that is able to integrate single cell multi-omics data under the mosaic integration scenario using matrix tri-factorization. During integration, scMoMaT is also able to uncover the cluster specific bio-markers across modalities. These multi-modal bio-markers are used to interpret and annotate the clusters to cell types. Moreover, scMoMaT can integrate cell batches with unequal cell type compositions. Applying scMoMaT to multiple real and simulated datasets demonstrated these features of scMoMaT and showed that scMoMaT has superior performance compared to existing methods. Specifically, we show that integrated cell embedding combined with learned bio-markers leads to cell type annotations of higher quality or resolution compared to their original annotations.
Article
Consistent annotation transfer from reference dataset to query dataset is fundamental to the development and reproducibility of single-cell research. Compared with traditional annotation methods, deep learning based methods are faster and more automated. A series of useful single cell analysis tools based on autoencoder architecture have been developed but these struggle to strike a balance between depth and interpretability. Here, we present TOSICA, a multi-head self-attention deep learning model based on Transformer that enables interpretable cell type annotation using biologically understandable entities, such as pathways or regulons. We show that TOSICA achieves fast and accurate one-stop annotation and batch-insensitive integration while providing biologically interpretable insights for understanding cellular behavior during development and disease progressions. We demonstrate TOSICA’s advantages by applying it to scRNA-seq data of tumor-infiltrating immune cells, and CD14+ monocytes in COVID-19 to reveal rare cell types, heterogeneity and dynamic trajectories associated with disease progression and severity. Developing computational tools for interpretable cell type annotation in scRNA-seq data remains challenging. Here the authors propose a Transformer-based model for interpretable annotation transfer using biologically understandable entities, and demonstrate its performance on large or atlas datasets.
Article
Motivation: Gene set enrichment analysis (GSEA) is a commonly used algorithm for characterizing gene expression changes. However, the currently available tools used to perform GSEA have a limited ability to analyze large datasets, which is particularly problematic for the analysis of single-cell data. To overcome this limitation, we developed a GSEA package in Python (GSEApy), which could efficiently analyze large single-cell datasets. Results: We present a package (GSEApy) that performs GSEA in either the command line or Python environment. GSEApy uses a Rust implementation to enable it to calculate the same enrichment statistic as GSEA for a collection of pathways. The Rust implementation of GSEApy is 3-fold faster than the Numpy version of GSEApy (v0.10.8) and uses >4-fold less memory. GSEApy also provides an interface between Python and Enrichr web services, as well as for BioMart. The Enrichr API enables GSEApy to perform over-representation analysis for an input gene list. Furthermore, GSEApy consists of several tools, each designed to facilitate a particular type of enrichment analysis. Availability and implementation: The new GSEApy with Rust extension is deposited in PyPI: https://pypi.org/project/gseapy/. The GSEApy source code is freely available at https://github.com/zqfang/GSEApy. Also, the documentation website is available at https://gseapy.rtfd.io/. Supplementary information is available online.
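A minimal usage sketch of the two entry points described in this abstract (pre-ranked GSEA and Enrichr over-representation analysis) is below. The file path, gene list and gene-set library names are placeholders, and argument defaults may differ across gseapy versions.

```python
# Minimal gseapy usage sketch for the two analyses described above
# (pre-ranked GSEA and Enrichr over-representation); inputs are placeholders.
import gseapy as gp

# Pre-ranked GSEA on a two-column ranking of genes (gene, score).
pre_res = gp.prerank(
    rnk="ranked_genes.rnk",             # placeholder path to a ranked gene list
    gene_sets="MSigDB_Hallmark_2020",
    outdir=None,                        # keep results in memory
    seed=0,
)
print(pre_res.res2d.head())

# Over-representation analysis through the Enrichr web service.
enr = gp.enrichr(
    gene_list=["TP53", "EGFR", "MYC"],  # placeholder gene list
    gene_sets="KEGG_2021_Human",
    outdir=None,
)
print(enr.results.head())
```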
Article
The human brain directs complex behaviors, ranging from fine motor skills to abstract intelligence, but the diversity of cell types that support these skills has not been fully described. In this work, we used single-nucleus RNA sequencing to systematically survey cells across the entire adult human brain. We sampled more than three million nuclei from approximately 100 dissections across the forebrain, midbrain, and hindbrain in three postmortem donors. Our analysis identified 461 clusters and 3313 subclusters organized largely according to developmental origins and revealing high diversity in midbrain and hindbrain neurons. Astrocytes and oligodendrocyte-lineage cells also exhibited regional diversity at multiple scales. The transcriptomic census of the entire human brain presented in this work provides a resource for understanding the molecular diversity of the human brain in health and disease.
Article
Recent advances in multiplexed single-cell transcriptomics experiments facilitate the high-throughput study of drug and genetic perturbations. However, an exhaustive exploration of the combinatorial perturbation space is experimentally unfeasible. Therefore, computational methods are needed to predict, interpret, and prioritize perturbations. Here, we present the compositional perturbation autoencoder (CPA), which combines the interpretability of linear models with the flexibility of deep-learning approaches for single-cell response modeling. CPA learns to in silico predict transcriptional perturbation response at the single-cell level for unseen dosages, cell types, time points, and species. Using newly generated single-cell drug combination data, we validate that CPA can predict unseen drug combinations while outperforming baseline models. Additionally, the architecture's modularity enables incorporating the chemical representation of the drugs, allowing the prediction of cellular response to completely unseen drugs. Furthermore, CPA is also applicable to genetic combinatorial screens. We demonstrate this by imputing in silico 5,329 missing combinations (97.6% of all possibilities) in a single-cell Perturb-seq experiment with diverse genetic interactions. We envision CPA will facilitate efficient experimental design and hypothesis generation by enabling in silico response prediction at the single-cell level and thus accelerate therapeutic applications using single-cell technologies.
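CPA's compositional structure, a basal cell state plus dose-scaled perturbation and covariate embeddings decoded back to expression, can be sketched as follows; this is an illustrative rendering of the idea rather than the released CPA code, and the latent size and linear encoder/decoder are assumptions.

```python
# Sketch of CPA-style compositional latent arithmetic: basal state plus
# dose-scaled perturbation embedding plus covariate embedding
# (illustrative only, not the released CPA implementation).
import torch.nn as nn

class CompositionalLatent(nn.Module):
    def __init__(self, n_genes, latent=128, n_perts=50, n_covs=10):
        super().__init__()
        self.encoder = nn.Linear(n_genes, latent)       # basal state encoder
        self.pert_emb = nn.Embedding(n_perts, latent)   # perturbation embeddings
        self.cov_emb = nn.Embedding(n_covs, latent)     # covariate (e.g. cell type)
        self.decoder = nn.Linear(latent, n_genes)

    def forward(self, expr, pert_idx, dose, cov_idx):
        z_basal = self.encoder(expr)
        z = z_basal + dose.unsqueeze(-1) * self.pert_emb(pert_idx) + self.cov_emb(cov_idx)
        return self.decoder(z)                           # predicted expression response
```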
Article
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI) models is likely to usher in newfound capabilities in medicine. We propose a new paradigm for medical AI, which we refer to as generalist medical AI (GMAI). GMAI models will be capable of carrying out a diverse set of tasks using very little or no task-specific labelled data. Built through self-supervision on large, diverse datasets, GMAI will flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, graphs or medical text. Models will in turn produce expressive outputs such as free-text explanations, spoken recommendations or image annotations that demonstrate advanced medical reasoning abilities. Here we identify a set of high-impact potential applications for GMAI and lay out specific technical capabilities and training datasets necessary to enable them. We expect that GMAI-enabled applications will challenge current strategies for regulating and validating AI devices for medicine and will shift practices associated with the collection of large medical datasets.