Nature Methods | Volume 21 | August 2024 | 1470–1480
nature methods
Article
https://doi.org/10.1038/s41592-024-02201-0
scGPT: toward building a foundation model
for single-cell multi-omics using generative AI
Haotian Cui 1,2,3,8, Chloe Wang1,2,3,8, Hassaan Maan 1,3,4, Kuan Pang 2,3,
Fengning Luo2,3, Nan Duan 5 & Bo Wang 1,2,3,4,6,7
Generative pretrained models have achieved remarkable success in
various domains such as language and computer vision. Specifically, the
combination of large-scale diverse datasets and pretrained transformers
has emerged as a promising approach for developing foundation models.
Drawing parallels between language and cellular biology (in which texts
comprise words; similarly, cells are defined by genes), our study probes
the applicability of foundation models to advance cellular biology and
genetic research. Using burgeoning single-cell sequencing data, we have
constructed a foundation model for single-cell biology, scGPT, based on
a generative pretrained transformer across a repository of over 33 million
cells. Our findings illustrate that scGPT effectively distills critical biological
insights concerning genes and cells. Through further adaptation of transfer
learning, scGPT can be optimized to achieve superior performance across
diverse downstream applications. This includes tasks such as cell type
annotation, multi-batch integration, multi-omic integration, perturbation
response prediction and gene network inference.
Single-cell RNA sequencing (scRNA-seq), by enabling intricate characterization of distinct cell types and advancing our understanding of disease pathogenesis, paves the way for cellular heterogeneity exploration, lineage tracking, pathogenic mechanism elucidation and, ultimately, personalized therapeutic strategies1–4. The broad-scale application of scRNA-seq has led to comprehensive data atlases such as the Human Cell Atlas, which now encompasses tens of millions of cells5–7. Recent advancements in sequencing technology promote the diversity of data modalities and extend our understanding beyond genomics to epigenetics, transcriptomics and proteomics, thus providing multi-modal insights8,9. These breakthroughs have also raised new research questions such as reference mapping, perturbation prediction and multi-omic integration10–14. It is therefore critical to develop, in parallel, methodologies capable of effectively harnessing, enhancing and adapting to the rapid expansion of sequencing data.
One promising approach to address this challenge is the generative pretraining of foundation models15,16. Foundation models, often built upon the self-attention transformer architecture17 for its effectiveness in learning expressive data representations, are a class of deep learning models that are pretrained on large-scale, diverse datasets and can be readily adapted for a variety of downstream tasks. Such models have recently achieved unprecedented success across various fields, exemplified by DALL-E 2 and GPT-4 in computer vision and natural language generation (NLG)18–20 and, more recently, Enformer21 for biological applications. More interestingly, these generative pretrained models consistently outperform task-specific models trained from scratch22,23. This indicates a task-agnostic understanding of knowledge in these domains, inspiring us to explore the adoption of this paradigm for single-cell omic research. However, current machine-learning-based methods in single-cell research are rather scattered, with specific models dedicated to distinct analysis tasks24–26. As a result, the datasets used in each study are often limited in breadth and scale7. To confront this limitation, there is a need for a foundation model that is pretrained on large-scale data and can comprehend the complex interactions between genes across diverse tissues.
Received: 12 July 2023
Accepted: 30 January 2024
Published online: 26 February 2024
1Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada. 2Department of Computer Science, University of Toronto, Toronto,
Ontario, Canada. 3Vector Institute, Toronto, Ontario, Canada. 4Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada.
5Microsoft Research, Redmond, WA, USA. 6Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada. 7AI Hub,
University Health Network, Toronto, Ontario, Canada. 8These authors contributed equally: Haotian Cui, Chloe Wang. e-mail: bowang@vectorinstitute.ai