Le Song’s research while affiliated with Mohamed bin Zayed University of Artificial Intelligence and other places


Publications (216)


[Figures from the preprint: (1) biology as a complex multiscale network; (4) a tokenizer for complex biological data built on a vector-quantized variational autoencoder, so that tokenized biological data interoperates with other discrete information and operations in foundation models; (6) an FM for the central dogma that leverages markup information to unify DNA, RNA, and protein data of differing completeness; (9) an FM for the cell that leverages embeddings from a DNA-sequence FM and an interactome FM; (12) foundation models for protein sequences and structures in an AIDO for generative design and engineering of molecules]


Toward AI-Driven Digital Organism: Multiscale Foundation Models for Predicting, Simulating and Programming Biology at All Levels
  • Preprint
  • File available

December 2024 · 6 Reads · Le Song · Eran Segal · Eric Xing

We present an approach to using AI to model and simulate biology and life. Why is this important? Because at the core of medicine, pharmacy, public health, longevity, agriculture and food security, environmental protection, and clean energy, it is biology at work. Biology in the physical world is too complex to manipulate, and always expensive and risky to tamper with. In this perspective, we lay out an engineering-viable approach to this challenge: constructing an AI-Driven Digital Organism (AIDO), a system of integrated multiscale foundation models, in a modular, connectable, and holistic fashion that reflects biological scales, connectedness, and complexities. An AIDO opens up a safe, affordable, and high-throughput alternative platform for predicting, simulating, and programming biology at all levels, from molecules to cells to individuals. We envision that an AIDO is poised to trigger a new wave of better-guided wet-lab experimentation and better-informed first-principles reasoning, which can eventually help us better decode and improve life.


Balancing Locality and Reconstruction in Protein Structure Tokenizer

December 2024 · 25 Reads · Barthelemy Meynard-Piganeau · Jiayou Zhang · James Gong · [...] · Eric P. Xing

The structure of a protein is crucial to its biological function. With the expansion of available protein structures, such as those in the AlphaFold Protein Structure Database (AFDB), there is an increasing need for efficient methods to index, search, and generate these structures. Additionally, there is growing interest in integrating structural information with models from other modalities, such as protein sequence language models. We present a novel VQ-VAE-based protein structure tokenizer, AIDO.StructureTokenizer (AIDO.St), which is a pretrained module for protein structures in an AI-driven Digital Organism. AIDO.StructureTokenizer is a 300M-parameter model consisting of an equivariant encoder that discretizes input structures into tokens and an invariant decoder that reconstructs the inputs from these tokens. In addition to evaluating structure reconstruction ability, we also compared our model to Foldseek, ProToken, and ESM3 in terms of protein structure retrieval ability. Through our experiments, we discovered an intriguing trade-off between the encoder's locality and retrieval ability on the one hand and the decoder's reconstruction ability on the other. Our results also demonstrate that a better balance between retrieval and reconstruction enables better alignment between the structure tokens and a protein sequence language model, resulting in better structure prediction accuracy. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
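
At the core of a VQ-VAE tokenizer like the one this abstract describes is a nearest-neighbor codebook lookup that turns continuous encoder outputs into discrete tokens. The sketch below shows that quantization step in generic PyTorch with a straight-through gradient; the codebook size, embedding width, and loss formulation are illustrative assumptions, not details of AIDO.StructureTokenizer.

import torch
import torch.nn.functional as F

def quantize(z, codebook):
    # z: (batch, length, dim) continuous encoder outputs
    # codebook: (K, dim) learnable embedding table
    dists = torch.cdist(z, codebook.unsqueeze(0))   # (batch, length, K)
    tokens = dists.argmin(dim=-1)                   # discrete structure tokens
    z_q = codebook[tokens]                          # nearest codebook vectors
    commit_loss = F.mse_loss(z, z_q.detach())       # pulls encoder outputs toward codes
    z_q = z + (z_q - z).detach()                    # straight-through gradient to encoder
    return z_q, tokens, commit_loss

# Toy usage: 2 structures of length 5, a 256-entry codebook of 16-dim codes.
z = torch.randn(2, 5, 16)
codebook = torch.randn(256, 16)
z_q, tokens, loss = quantize(z, codebook)

The encoder and decoder sit on either side of this bottleneck; only the integer token indices need to be stored, indexed, or handed to a sequence model.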


[Figures and tables: benchmarking summary for AIDO.DNA, normalized to AIDO.DNA performance; a catalog of deep learning methods for sequence representation and prediction; Genome Understanding Evaluation benchmarks [7], reported as Matthews correlation coefficient (MCC) unless otherwise stated (DB = DNABERT; NT = Nucleotide Transformer; † uses self pre-training before supervised finetuning [35]); zero-shot variant effect prediction using pretrained model embeddings; promoter-activity prediction with decreasing training data volume, leveraging AIDO.DNA for transfer learning under data scarcity]
Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale

December 2024 · 23 Reads

Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, and current models are unable to represent the wide array of functions encoded by DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism. AIDO.DNA is a seven-billion-parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a short context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
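
Encoder-only genomic language models of this kind are typically pretrained with a BERT-style masked-token objective at single-nucleotide resolution. The following is a minimal sketch of that setup; the 15% masking rate and the exact vocabulary are assumptions rather than AIDO.DNA's published recipe.

import torch

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def mask_nucleotides(ids, mask_rate=0.15, mask_id=VOCAB["[MASK]"]):
    # Randomly hide positions; the encoder is trained to recover them.
    ids = ids.clone()
    mask = torch.rand(ids.shape) < mask_rate
    labels = torch.where(mask, ids, torch.full_like(ids, -100))  # -100: ignored by CE loss
    ids[mask] = mask_id
    return ids, labels  # feed ids to the encoder, compute cross-entropy on labels

seq = "ACGTACGTTA"
ids = torch.tensor([[VOCAB[c] for c in seq]])
masked_ids, labels = mask_nucleotides(ids)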


[Figure 1: schematic of the MSA retriever, AIDO.RAGPLM, and AIDO.RAGFold: (A) how the models in this work relate; (B) the MSA retriever generates MSA sequences that serve as training data for AIDO.RAGPLM; (C) AIDO.RAGPLM functions as the feature extractor for end-to-end protein structure prediction. Table: unsupervised contact prediction]
Retrieval Augmented Protein Language Models for Protein Structure Prediction

December 2024 · 45 Reads

The advent of advanced artificial intelligence technology has significantly accelerated progress in protein structure prediction. AlphaFold2, a pioneering method in this field, has set a new benchmark for prediction accuracy by leveraging the Evoformer module to automatically extract co-evolutionary information from multiple sequence alignments (MSA). However, the efficacy of structure prediction methods like AlphaFold2 is heavily dependent on the depth and quality of the MSA. To address this limitation, we propose a novel approach termed Protein Language Model with Retrieved AuGmented MSA (RAGPLM). This approach integrates pre-trained protein language models with retrieved MSA, allowing for the incorporation of co-evolutionary information in structure prediction while compensating for insufficient MSA information through large-scale pre-training. Our method surpasses single-sequence protein language models in perplexity, contact prediction, and fitness prediction. We utilized RAGPLM as the feature extractor for protein structure prediction, resulting in the development of RAGFold. When sufficient MSA is available, RAGFold achieves TM-scores comparable to AlphaFold2 and operates up to eight times faster. In scenarios where the MSA is insufficient, our method significantly outperforms AlphaFold2 (ΔTM-score = 0.379, 0.116, and 0.059 for 0, 5, and 10 input MSA sequences, respectively). Additionally, we developed an MSA retriever for MSA search over the UniClust30 database using hierarchical ID generation; it is 45 to 90 times faster than traditional methods and is used to expand the MSA training set for RAGPLM by 32%. Our findings suggest that RAGPLM provides an efficient and accurate solution for protein structure prediction.
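
The retrieve-then-condition pattern in this abstract has two stages: look up homologous sequences for a query, then feed the query plus homologs to the language model as a stand-in for a deep MSA. The toy sketch below ranks database sequences by k-mer overlap purely to illustrate the shape of the retrieval stage; the paper's retriever instead uses hierarchical ID generation over UniClust30 and is far more capable.

def kmers(seq, k=3):
    # Set of all overlapping k-mers in a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def retrieve_msa(query, database, top_k=5, k=3):
    # Rank candidate sequences by shared k-mers with the query.
    q = kmers(query, k)
    ranked = sorted(database, key=lambda s: -len(q & kmers(s, k)))
    return ranked[:top_k]  # homolog candidates to feed the PLM alongside the query

database = ["MKTAYIAKQR", "MKTVYIAKQR", "GGGGGGGGGG", "MKTAYLAKQL"]
print(retrieve_msa("MKTAYIAKQR", database, top_k=2))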


[Table: data statistics of the ProteinGym DMS benchmark]
Mixture of Experts Enable Efficient and Effective Protein Understanding and Design

December 2024 · 21 Reads

Proteins play a fundamental role in life. Understanding the language of proteins offers significant potential for gaining mechanistic insights into biological systems and introduces new avenues for treating diseases, enhancing agriculture, and safeguarding the environment. While large protein language models (PLMs) like ESM2-15B and xTrimoPGLM-100B have achieved remarkable performance in diverse protein understanding and design tasks, these models, being dense transformer models, pose challenges due to their computational inefficiency during training and deployment. In this work, we introduce AIDO.Protein, a pretrained module for protein representation in an AI-driven Digital Organism. AIDO.Protein is also the first mixture-of-experts (MoE) model in the protein domain, scaling model size to 16 billion parameters. Leveraging a sparse MoE architecture with 8 experts within each transformer block and selectively activating 2 experts for each input token, our model is significantly more efficient in training and inference. Through pre-training on 1.2 trillion amino acids collected from UniRef90 and ColabFoldDB, our model achieves state-of-the-art results across most tasks in the xTrimoPGLM benchmark. Furthermore, on over 280 ProteinGym Deep Mutational Scanning (DMS) assays, our model achieves nearly 99% of the overall performance of the best MSA-based model and significantly outperforms the previously reported state-of-the-art models that do not utilize MSA. We also adapted this model for structure-conditioned protein sequence generation tasks and achieved a new state of the art in this domain. These results indicate that AIDO.Protein can serve as a strong foundation model for protein understanding and design. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
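
The efficiency claim rests on sparse routing: each token's representation is sent to only 2 of the 8 experts in a block, so most expert parameters stay idle per token. Below is a minimal top-2 MoE layer in PyTorch illustrating the mechanism; the dimensions, router design, and absence of a load-balancing loss are simplifications, not AIDO.Protein's implementation.

import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        gates = self.router(x).softmax(dim=-1)          # (n_tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)   # keep the 2 best experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over the top-2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # only 2 of 8 experts run per token
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

tokens = torch.randn(10, 64)   # 10 token representations
print(Top2MoE()(tokens).shape) # torch.Size([10, 64])

With 8 experts and top-2 routing, each token pays roughly a quarter of the expert FLOPs of a dense layer holding the same total parameters, which is the source of the training and inference savings the abstract describes.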


[Figures and tables: UMAP visualization of gene expression foundation model embeddings, colored by cell type; representative single-cell foundation models and their properties; AIDO.Cell model architecture parameters (layers, hidden size, heads, intermediate hidden size); comparison of model performance on the Zheng68K and Segerstolpe datasets; MSE on differentially expressed genes in the Norman et al. dataset]
Scaling Dense Representations for Single Cell with Transcriptome-Scale Context

December 2024 · 25 Reads

Developing a unified model of cellular systems is a canonical challenge in biology. Recently, a wealth of public single-cell RNA sequencing data as well as rapid scaling of self-supervised learning methods have provided new avenues to address this longstanding challenge. However, while rapid parameter scaling has been essential to the success of large language models in text and images, similar scaling has not been attempted with Transformer architectures for cellular modeling. To produce accurate, transferable, and biologically meaningful representations of cellular systems, we develop AIDO.Cell, a pretrained module for representing gene expression and cellular systems in an AI-driven Digital Organism. AIDO.Cell contains a series of 3M, 10M, 100M, and 650M parameter encoder-only dense Transformer models pre-trained on 50 million human cells from diverse tissues using a read-depth-aware masked gene expression pretraining objective. Unlike previous models, AIDO.Cell is capable of handling the entire human transcriptome as input without truncation or sampling tricks, thus learning accurate and general representations of the human cell's entire transcriptional context. This longer-context pretraining was enabled by FlashAttention-2, mixed precision, and large-scale distributed systems training. AIDO.Cell (100M) achieves state-of-the-art results in tasks such as zero-shot clustering, cell-type classification, and perturbation modeling. Our findings reveal interesting loss scaling behaviors as we increase AIDO.Cell's parameters from 3M to 650M, providing insights for future directions in single-cell modeling. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
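
A sketch of the masked gene-expression objective described above: hide a fraction of a cell's expression values and train the encoder to reconstruct them from the remaining transcriptome-wide context. Treating read depth as a per-cell scalar covariate is a paraphrase for illustration; the published read-depth-aware formulation may differ.

import torch

def mask_expression(expr, mask_rate=0.15):
    # expr: (n_cells, n_genes) normalized expression over the full transcriptome
    mask = torch.rand(expr.shape) < mask_rate
    corrupted = expr.masked_fill(mask, 0.0)       # hide the selected genes
    read_depth = expr.sum(dim=-1, keepdim=True)   # per-cell depth covariate (assumption)
    return corrupted, mask, read_depth            # train the model to predict expr[mask]

expr = torch.rand(4, 19264)  # 4 cells; gene count is an illustrative transcriptome scale
corrupted, mask, depth = mask_expression(expr)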


A Large-Scale Foundation Model for RNA Function and Structure Prediction

November 2024 · 48 Reads

Originally marginalized as an intermediate in the information flow from DNA to protein, RNA has become the star of modern biology, holding the key to precision therapeutics, genetic engineering, evolutionary origins, and our understanding of fundamental cellular processes. Yet RNA is as mysterious as it is prolific, serving as an information store, a messenger, and a catalyst, spanning many undercharacterized functional and structural classes. Deciphering the language of RNA is important not only for a mechanistic understanding of its biological functions but also for accelerating drug design. Toward this goal, we introduce AIDO.RNA, a pre-trained module for RNA in an AI-driven Digital Organism. AIDO.RNA contains 1.6 billion parameters, trained on 42 million non-coding RNA (ncRNA) sequences at single-nucleotide resolution, and it achieves state-of-the-art performance on a comprehensive set of tasks, including structure prediction, genetic regulation, molecular function across species, and RNA sequence design. After domain adaptation, AIDO.RNA learns to model essential parts of protein translation that protein language models, which have received widespread attention in recent years, do not. More broadly, AIDO.RNA hints at the generality of biological sequence modeling and the ability to leverage the central dogma to improve many biomolecular representations. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.


Progress and opportunities of foundation models in bioinformatics

October 2024 · 18 Reads · 6 Citations · Briefings in Bioinformatics

Bioinformatics has undergone a paradigm shift in artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination not only serves as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.


Multi-Modal Large Language Model Enables Protein Function Prediction

September 2024 · 32 Reads

Predicting the functions of proteins can greatly accelerate biological discovery and applications, where deep learning methods have recently shown great potential. However, these methods predominantly predict protein functions as discrete categories, which fails to capture the nuanced and complex nature of protein functions. Furthermore, existing methods require the development of separate models for each prediction task, a process that can be both resource-heavy and time-consuming. Here, we present ProteinChat, a versatile, multi-modal large language model that takes a protein's amino acid sequence as input and generates comprehensive narratives describing its function. ProteinChat is trained using over 1,500,000 (protein, prompt, answer) triplets curated from the Swiss-Prot dataset, covering diverse functions. This novel model can universally predict a wide range of protein functions, all within a single, unified framework. Furthermore, ProteinChat supports interactive dialogues with human users, allowing for iterative refinement of predictions and deeper exploration of protein functions. Our experimental results, evaluated through both human expert assessment and automated metrics, demonstrate that ProteinChat outperforms general-purpose LLMs such as GPT-4 by over ten-fold. In addition, ProteinChat exceeds or matches the performance of task-specific prediction models.
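
The supervision format is worth making concrete: each training example pairs a raw amino-acid sequence with a natural-language instruction and a curated free-text answer. A hypothetical triplet might look like the following; the field names and content are illustrative placeholders, not actual Swiss-Prot-derived records.

# Hypothetical (protein, prompt, answer) triplet in the format the abstract describes.
triplet = {
    "protein": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",            # amino-acid sequence input
    "prompt": "Describe the molecular function of this protein.",
    "answer": "A free-text functional narrative curated from Swiss-Prot ...",  # target text
}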



Citations (42)


... Unlike traditional k-mer methods or manually crafted features, these models leverage self-attention mechanisms to account for long-range relationships, enabling the extraction of nuanced patterns from genomic and RNA sequences. DNABERT, DNABERT-2, and BioBERT, as examples of such transformer architectures, have demonstrated remarkable capability in encoding sequence information into meaningful numerical representations [56][57][58][59][60][61][62]. These embeddings not only preserve essential sequence context but also enhance the downstream predictive performance of various machine learning classifiers. ...

Reference:

Advanced machine learning framework for enhancing breast cancer diagnostics through transcriptomic profiling
Progress and opportunities of foundation models in bioinformatics
  • Citing Article
  • October 2024

Briefings in Bioinformatics
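
The pipeline this citing snippet describes is straightforward to sketch: run sequences through a pretrained transformer encoder, pool the hidden states into fixed-size embeddings, and fit an ordinary classifier on top. The checkpoint name below is a placeholder rather than a specific DNABERT release, and the toy data is invented for illustration.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

CKPT = "some/dna-bert-checkpoint"  # placeholder identifier, not a real release
tokenizer = AutoTokenizer.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT)

def embed(seqs):
    # Encode sequences and mean-pool token states into one vector per sequence.
    batch = tokenizer(seqs, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, length, dim)
    return hidden.mean(dim=1).numpy()                # (batch, dim)

# Invented toy task: promoter (1) vs. non-promoter (0) sequences.
seqs = ["TATAAAGGCC", "GCGCGCGCGC", "TATAATGCCA", "CCGGCCGGCC"]
labels = [1, 0, 1, 0]
clf = LogisticRegression().fit(embed(seqs), labels)  # downstream classifier head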

... It proposes a learning objective that guides the graph encoder to utilize environment-irrelevant invariant subgraphs for stable prediction. MILI [28] adopts a dual-head graph neural network with a shared subgraph generator to identify privileged substructures, and further leverages environment and task head to assign environment labels and learn invariant molecule representations. Similarly, GIL [25] infers latent environments via clustering algorithms and devises a subgraph generator to generate invariant subgraphs. ...

Advancing Molecule Invariant Representation via Privileged Substructure Identification
  • Citing Conference Paper
  • August 2024

... Predicting perturbation responses. Since experimental costs scale with the number of experimental contexts (cell lines) and perturbations, a number of works have been proposed to infer the post-intervention distribution of cells. Their goal is to generalize to unseen perturbations (Roohani et al., 2023; Bai et al., 2024; Märtens et al., 2024) or unseen contexts (Bunne et al., 2023; Lotfollahi et al., 2019). This paper focuses on the former setting, as we aim to optimize, not replace, experiments. ...

AttentionPert: accurately modeling multiplexed genetic perturbations with multi-scale effects
  • Citing Article
  • June 2024

Bioinformatics

... Scaling laws in natural language processing (NLP) suggest that larger models are more compute-efficient for modest-sized datasets (Kaplan et al., 2020). These principles also hold in biological applications (Cheng et al., 2024), with increasing ESM2 model size leading to lower language modeling loss and better performance in structure prediction (Lin et al., 2023). To investigate whether these trends extend to the downstream task of functional effect prediction, we compared the performance of different ESM2 model sizes (8M to 3B) on AAV, GB1, and GFP DMS datasets, finding that Kaplan et al. (2020) scaling laws do not extend to downstream functional effect fine-tuning performance but are consistent with pre-training metrics (e.g., validation perplexity, CASP14 performance). ...

Training Compute-Optimal Protein Language Models
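
For context, the Kaplan et al. (2020) parameter-count scaling law the snippet above refers to has the power-law form (a standard reconstruction of the cited result, not a finding reported on this page):

    L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}

where L is the pre-training loss, N is the number of non-embedding parameters, and N_c and \alpha_N are empirically fitted constants. The snippet's observation is that this law tracks pre-training metrics such as validation perplexity but does not predict downstream functional-effect fine-tuning performance.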

... scRNA-seq is an effective approach to explore gene transcription and expression on the cellular resolution [1], allowing for in-depth molecular studies using various tissues [2][3][4]. Accurate identification of cell types [5][6][7][8][9] is critical in scRNA-seq data analysis to uncover cellular heterogeneity within and across tissues, developmental stages, and organisms [10]. While manual annotation can be time-consuming, laborintensive, and highly dependent on marker genes [11,12], computa-tional identification of cell types can also be challenging given the high dimensionality, sparsity, and dropout events in scRNA-seq data [13]. ...

Large-scale foundation model on single-cell transcriptomics

Nature Methods

... Features are then extracted from sequence, multiple sequence alignments (MSA), and monomeric structures and fed into our in-house inter-chain residue distance predictor, DPIC, to generate the inter-chain residue distance map. In parallel, inter-chain residue distance information is extracted from the AFM prediction model and combined with the predicted inter-chain residue distances to guide complex structure assembly through a population-based multi-objective optimization algorithm [20]. Finally, the generated models are scored and ranked through our recently developed model quality assessment method, DeepUMQA-X, to output the final complex structure. ...

Protein Multiple Conformation Prediction Using Multi-Objective Evolution Algorithm
  • Citing Article
  • January 2024

Interdisciplinary Sciences Computational Life Sciences

... Alignment is carried out to determine homologous regions between genomes that may indicate functional, structural, or evolutionary relationships [80]. Alignment tools, such as BLAST [81] for pairwise alignment and MAUVE [82] for multiple alignment [...] [84]. ...

A method for multiple-sequence-alignment-free protein structure prediction using a protein language model

Nature Machine Intelligence

... SciREX (Jain et al., 2020) and CitationIE (Viswanathan et al., 2021) aim to extract information from the entire paper, not only tables. Kostić et al. (2021) and Zhuang et al. (2022) perform entity extraction and relation extraction from both the text and tables. In these works, the entire document is converted into a feature. ...

ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select
  • Citing Conference Paper
  • January 2022

... The rapid advancement of large language models (LLMs) has sparked a revolution in artificial intelligence. Recently, several large-scale protein language models (PLMs) have been introduced, including the ESM model family 22,23 , ProtTrans 24 , ProGen 25 , and xTrimoPGLM 26 . These models are trained on extensive protein data sets containing tens of millions of sequences, allowing them to capture evolutionary patterns and sequence features in vector space 24,26,27 . ...

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

... 13,14 Several neural network-based models, including variational autoencoder-based and (more recently) transformer-based architectures, have also been proposed. [15][16][17][18][19][20][21][22][23][24][25] Though these methods are scalable, the non-linearity of the decomposition produces a latent space that is difficult to interpret, often leading to overfitting. 26 Indeed, recent benchmarking studies found that current transformer-based foundation models for scRNA-seq underperform compared to simpler models 27,28 (consistent with further benchmarking reported herein). ...

Large Scale Foundation Model on Single-cell Transcriptomics
  • Citing Preprint
  • May 2023