
Large-scale chemical language representations capture molecular structure and properties


Abstract and Figures

Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. It performs competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties. Large language models have recently emerged with extraordinary capabilities, and these methods can be applied to model other kinds of sequence, such as string representations of molecules. Ross and colleagues have created a transformer-based model, trained on a large dataset of molecules, which provides good results on property prediction tasks.
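The abstract names two architectural ingredients, rotary positional embeddings and linear attention. The following minimal PyTorch sketch illustrates how the two can be combined in an encoder layer; it is not the authors' released code, and the non-interleaved rotation layout, the elu(x) + 1 feature map and all tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def rotary_embed(x, base=10000.0):
    # x: (batch, seq, heads, head_dim); head_dim must be even.
    # Rotates feature pairs by a position-dependent angle (non-interleaved layout).
    b, n, h, d = x.shape
    pos = torch.arange(n, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = torch.einsum("n,f->nf", pos, inv_freq)              # (seq, d/2)
    sin = angles.sin()[None, :, None, :]
    cos = angles.cos()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def linear_attention(q, k, v):
    # Non-causal linear attention with the elu(x) + 1 feature map:
    # cost is linear in sequence length because keys/values are summarized once.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnhd,bnhe->bhde", k, v)                   # global key-value summary
    z = 1.0 / (torch.einsum("bnhd,bhd->bnh", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnhd,bhde,bnh->bnhe", q, kv, z)

q = k = v = torch.randn(2, 16, 8, 64)                            # toy batch of SMILES token states
out = linear_attention(rotary_embed(q), rotary_embed(k), v)
print(out.shape)                                                 # torch.Size([2, 16, 8, 64])
```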
Nature Machine Inteigence | Voume 4 | December 2022 | 1256–1264 1256
nature machine intelligence
Article
https://doi.org/10.1038/s42256-022-00580-7
Large-scale chemical language representations capture molecular structure and properties
Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh & Payel Das
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. It performs competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
Machine learning (ML) has emerged as an appealing, computationally efficient approach for predicting molecular properties, with implications in drug discovery and material engineering. ML models for molecules can be trained directly on predefined chemical descriptors, such as unsupervised molecular fingerprints [1], or hand-derived derivatives of geometric features such as a Coulomb matrix [2]. However, more recent ML models have focused on automatically learning the features either from the natural graphs that encode the connectivity information or from the line annotations of molecular structures, such as the popular SMILES [3] (simplified molecular-input line-entry system) representation. SMILES defines a character string representation of a molecule by performing a depth-first preorder spanning tree traversal of the molecular graph, generating symbols for each atom, bond, tree-traversal decision and broken cycle. Therefore, the resulting character string corresponds to a flattening of a spanning tree of the molecular graph. Learning on SMILES has been widely adopted for ...
Received: 18 April 2022
Accepted: 3 November 2022
Published online: 21 December 2022
IBM Research, Yorktown Heights, New York, NY, USA. e-mail: rossja@us.ibm.com; daspa@us.ibm.com
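The introduction describes SMILES as a flattened spanning tree of the molecular graph, with symbols for atoms, bonds, branching decisions and broken cycles. A short example using the open-source RDKit toolkit (not part of the paper) makes the string-graph correspondence concrete:

```python
# Illustrative only: parse a SMILES string back into its molecular graph.
# Ring closures appear as matched digits, branches as parentheses.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"        # aspirin
mol = Chem.MolFromSmiles(smiles)         # SMILES string -> molecular graph
print(mol.GetNumAtoms())                 # 13 heavy atoms
print(Chem.MolToSmiles(mol))             # canonical SMILES for the same graph
```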
... 3.) MoLFormer: Thirdly, we explore whether using chemical foundation models in place of the MPNN as a property predictor leads to improved generalization performance [53]. In particular, we benchmark the MoLFormer foundation model, which was pretrained on a diverse set of 1.1 billion molecules [53]. 4.) MPNN3 (active learning): Finally, we compare these approaches to our active learning approach trained on three iterations of active learning (MPNN3), whereby we iteratively retrain the MPNN property predictors on the DFT-calculated properties of the generated molecules. ...
Preprint
Full-text available
Although generative models hold promise for discovering molecules with optimized desired properties, they often fail to suggest synthesizable molecules that improve upon the known molecules seen in training. We find that a key limitation is not in the molecule generation process itself, but in the poor generalization capabilities of molecular property predictors. We tackle this challenge by creating an active-learning, closed-loop molecule generation pipeline, whereby molecular generative models are iteratively refined on feedback from quantum chemical simulations to improve generalization to new chemical space. Compared against other generative model approaches, only our active learning approach generates molecules with properties that extrapolate beyond the training data (reaching up to 0.44 standard deviations beyond the training data range) and out-of-distribution molecule classification accuracy is improved by 79%. By conditioning molecular generation on thermodynamic stability data from the active-learning loop, the proportion of stable molecules generated is 3.5x higher than the next-best model.
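A hedged sketch of the closed-loop, active-learning idea this preprint describes; `generator`, `predictor`, `run_dft` and the selection heuristic below are hypothetical placeholders, not the authors' pipeline:

```python
def active_learning_loop(generator, predictor, run_dft, seed_data, n_rounds=3, batch=100):
    """Iteratively retrain a property predictor on DFT labels of generated molecules."""
    data = list(seed_data)                          # (molecule, property) pairs
    for _ in range(n_rounds):
        predictor.fit(data)                         # retrain the property predictor
        candidates = generator.sample(10 * batch)   # propose new molecules
        # keep the candidates the predictor ranks highest (or is most uncertain about)
        chosen = sorted(candidates, key=predictor.predict, reverse=True)[:batch]
        labels = [run_dft(m) for m in chosen]       # quantum-chemical ground truth
        data += list(zip(chosen, labels))           # feed the DFT labels back in
    return predictor, data
```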
... Existing molecule SSL methods can be mainly classified into two categories: predictive and contrastive. Predictive learning [18][19][20][26][27][28] aims to predict structural components given contexts at different levels, which mainly focuses on intra-data relationships. These methods often follow the conventional pipeline of reconstructing the molecular information from masked inputs. ...
Article
Full-text available
Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, the data scarcity, combined with the highly non-linear causal relationships between physicochemical and biological properties and conventional molecular featurization schemes, complicates the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that might be advantageous for downstream tasks. Yet, existing molecular SSL methods largely overlook chemical knowledge, including molecular structure similarity, scaffold composition, and the context-dependent aspects of molecular properties when operating over the chemical space. They also struggle to learn the subtle variations in structure-activity relationship. This paper introduces a multi-channel pre-training framework that learns robust and generalizable chemical knowledge. It leverages the structural hierarchy within the molecule, embeds them through distinct pre-training tasks across channels, and aggregates channel information in a task-specific manner during fine-tuning. Our approach demonstrates competitive performance across various molecular property benchmarks and offers strong advantages in particularly challenging yet ubiquitous scenarios like activity cliffs.
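The "predictive" self-supervision mentioned in the excerpt above typically reconstructs masked parts of the input. A minimal, illustrative masking routine for a character-tokenized SMILES string (not taken from the cited work):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15):
    """Randomly hide a fraction of tokens; a model is trained to recover them."""
    masked, targets = [], []
    for t in tokens:
        if random.random() < p:
            masked.append(mask_token)   # hide the token from the model
            targets.append(t)           # ground truth to reconstruct
        else:
            masked.append(t)
            targets.append(None)        # no loss at unmasked positions
    return masked, targets

print(mask_tokens(list("CC(=O)Oc1ccccc1")))
```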
... We employed two transformer-based language models to derive sequence feature vectors for toxin molecules and protein sequences. For the toxin molecules, we utilized ChemBERTa [60], a large chemical language model for embedding and representing molecule structural properties and relationships. ChemBERTa is trained in a self-supervised manner on SMILES strings corresponding to a large collection of chemical molecules in the public database PubChem [32]. ...
Preprint
Air pollution, particularly airborne particulate matter (PM), poses a significant threat to public health globally. It is crucial to comprehend the association between PM-associated toxic components and their cellular targets in humans to understand the mechanisms by which air pollution impacts health and to establish causal relationships between air pollution and public health consequences. Although many studies have explored the impact of PM on human health, the understanding of the association between toxins and their targets remains limited. Leveraging cutting-edge deep learning technologies, we developed tipFormer (toxin-protein interaction prediction based on transformer), a novel deep-learning tool for identifying toxic components capable of penetrating human cells and instigating pathogenic biological activities and signaling cascades. tipFormer incorporates dual pre-trained language models to encode protein sequences and chemicals. It employs a convolutional encoder to assimilate the sequential attributes of proteins and chemicals, and then introduces a learning module with a cross-attention mechanism to decode and elucidate the multifaceted interactions pivotal to the hotspots where proteins and chemicals bind. Experimental results show that tipFormer effectively captures interactions between proteins and toxic components. This approach offers significant value to air quality and toxicology researchers by allowing high-throughput identification and prioritization of hazards. It supports more targeted laboratory studies and field measurements, ultimately enhancing our understanding of how air pollution impacts human health.
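The cross-attention module described for tipFormer can be pictured as chemical token embeddings attending over protein residue embeddings. The dimensions below are illustrative and this is not the authors' implementation:

```python
import torch

d_model = 128
cross_attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

protein = torch.randn(1, 300, d_model)     # encoded protein sequence (residue embeddings)
chemical = torch.randn(1, 60, d_model)     # encoded toxin (SMILES token embeddings)

# Queries come from the chemical, keys/values from the protein, so each chemical
# token is weighted over the protein positions it may interact with.
out, attn_weights = cross_attn(query=chemical, key=protein, value=protein)
print(out.shape, attn_weights.shape)       # torch.Size([1, 60, 128]) torch.Size([1, 60, 300])
```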
Article
Full-text available
Artificial intelligence (AI) has transformed infectious disease control, enhancing rapid diagnosis and antibiotic discovery. While conventional tests delay diagnosis, AI-driven methods like machine learning and deep learning assist in pathogen detection, resistance prediction, and drug discovery. These tools improve antibiotic stewardship and identify effective compounds such as antimicrobial peptides and small molecules. This review explores AI applications in diagnostics, therapy, and drug discovery, emphasizing both strengths and areas needing improvement.
Conference Paper
Full-text available
In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness into the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated by the above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. Codes and models are released at https://github.com/guolinke/TUPE.
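A compact sketch of the untied attention score described above: the token-token and position-position terms are computed with separate projections and summed. The 1/sqrt(2d) rescaling follows the paper's description; the random matrices stand in for learned parameters:

```python
import math
import torch

n, d = 12, 64
x = torch.randn(n, d)                            # word (token) embeddings
p = torch.randn(n, d)                            # absolute positional embeddings
Wq, Wk = torch.randn(d, d), torch.randn(d, d)    # projections for token content
Uq, Uk = torch.randn(d, d), torch.randn(d, d)    # separate projections for positions

content = (x @ Wq) @ (x @ Wk).T                  # word contextual correlation
position = (p @ Uq) @ (p @ Uk).T                 # positional correlation
scores = (content + position) / math.sqrt(2 * d) # untied sum, jointly rescaled
attn = torch.softmax(scores, dim=-1)
print(attn.shape)                                # torch.Size([12, 12])
```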
Article
Full-text available
Molecular machine learning bears promise for efficient molecular property prediction and drug discovery. However, labelled molecule data can be expensive and time consuming to acquire. Due to the limited labelled data, it is a great challenge for supervised-learning machine learning models to generalize to the giant chemical space. Here we present MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks), a self-supervised learning framework that leverages large unlabelled data (~10 million unique molecules). In MolCLR pre-training, we build molecule graphs and develop graph-neural-network encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of graph-neural-network encoders on various molecular property benchmarks including both classification and regression tasks. Benefiting from pre-training on the large unlabelled database, MolCLR even achieves state of the art on several challenging benchmarks after fine-tuning. In addition, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities. Molecular representations are hard to design due to the large size of the chemical space, the amount of potentially important information in a molecular structure and the relatively low number of annotated molecules. Still, the quality of these representations is vital for computational models trying to predict molecular properties. Wang et al. present a contrastive learning approach to provide differentiable representations from unlabelled data.
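A simplified NT-Xent-style objective of the kind MolCLR's contrastive estimator uses: two augmented views of the same molecule are pulled together while other molecules in the batch are pushed apart. This is an illustrative reduction, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two augmentations of the same molecules
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature        # cosine similarities between the two views
    targets = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)   # e.g. graph-encoder outputs
print(nt_xent(z1, z2))
```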
Article
Full-text available
Effective molecular representation learning is of great importance to facilitate molecular property prediction. Recent advances for molecular representation learning have shown great promise in applying graph neural networks to model molecules. Moreover, a few recent studies design self-supervised learning methods for molecular representation to address insufficient labelled molecules; however, these self-supervised frameworks treat the molecules as topological graphs without fully utilizing the molecular geometry information. The molecular geometry, also known as the three-dimensional spatial structure of a molecule, is critical for determining molecular properties. To this end, we propose a novel geometry-enhanced molecular representation learning method (GEM). The proposed GEM has a specially designed geometry-based graph neural network architecture as well as several dedicated geometry-level self-supervised learning strategies to learn the molecular geometry knowledge. We compare GEM with various state-of-the-art baselines on different benchmarks and show that it can considerably outperform them all, demonstrating the superiority of the proposed method. Molecules are often represented as topological graphs while their true three-dimensional geometry contains a lot of valuable information. Xiaomin Fang and colleagues present a self-supervised molecule representation method that uses this geometric data in graph neural networks to predict a range of molecular properties.
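One way to picture GEM's geometry-level self-supervision is to use quantities computed from 3D coordinates, such as bond lengths, as prediction targets. The toy coordinates, bond list and prediction head below are placeholders, not the authors' pipeline:

```python
import torch
import torch.nn.functional as F

coords = torch.randn(6, 3)                                      # 3D positions of 6 atoms
bonds = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])  # bonded atom pairs

bond_lengths = (coords[bonds[:, 0]] - coords[bonds[:, 1]]).norm(dim=1)  # geometric targets
pred = torch.randn(len(bonds), requires_grad=True)              # stand-in for a GNN bond-length head
loss = F.mse_loss(pred, bond_lengths)                           # geometry-aware pre-training signal
loss.backward()
print(float(loss))
```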
Article
Full-text available
Significance Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
Article
Full-text available
The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally, matter engineering, generally denoted as inverse design, relied heavily on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard string-based molecular representation, SMILES, shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than in a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal workings of the generative models.
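The representation described above is available as the open-source `selfies` Python package (pip install selfies); the round trip below is an illustration, and the exact output string depends on the package version:

```python
import selfies as sf

smiles = "C1=CC=CC=C1O"        # phenol (kekulized SMILES)
s = sf.encoder(smiles)          # SMILES -> SELFIES
print(s)                        # a sequence of bracketed SELFIES symbols
print(sf.decoder(s))            # back to a valid SMILES string
```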
Preprint
Full-text available
Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at https://github.com/salesforce/provis.
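The kind of attention analysis described above can be summarized as checking whether strongly attended residue pairs are also close in 3D space. The sketch below uses random placeholders for the attention map and coordinates, just to show the computation:

```python
import numpy as np

def attention_contact_precision(attn, coords, thresh=8.0, top_k=50):
    """Fraction of the top-k attended residue pairs lying within `thresh` angstroms."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(attn.shape[0], k=1)         # unique residue pairs (i < j)
    order = np.argsort(attn[iu])[::-1][:top_k]       # most strongly attended pairs
    return float(np.mean(dist[iu][order] < thresh))

attn = np.random.rand(64, 64); attn = (attn + attn.T) / 2    # placeholder attention map
coords = np.random.rand(64, 3) * 30.0                        # placeholder C-alpha coordinates
print(attention_contact_precision(attn, coords))
```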
Article
Full-text available
Drug metabolism is determined by the biochemical and physiological properties of the drug molecule. To improve the performance of a drug property prediction model, it is important to extract complex molecular dynamics from limited data. Recent machine learning or deep learning based models have employed the atom- and bond-type information, as well as the structural information, to predict drug properties. However, many of these methods can be used only for graph representations. Message passing neural networks (MPNNs) [1] are a framework used to learn both local and global features from irregularly formed data, and are invariant to permutations. This network performs an iterative message passing (MP) operation on each object and its neighbors, and obtains the final output from all messages regardless of their order. In this study, we applied the MP-based attention network [2], originally developed for text learning, to perform chemical classification tasks. Before training, we tokenized the characters and obtained embeddings of each molecular sequence. We conducted various experiments to maximize the predictivity of the model. We trained and evaluated our model using various chemical classification benchmark tasks. Our results are comparable to, or outperform, previous state-of-the-art and baseline models. To the best of our knowledge, this is the first attempt to learn chemical strings using an MP-based algorithm. We will extend our work to more complex tasks, such as regression or generation tasks, in the future.
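The pre-processing this study describes, tokenizing a chemical string character by character and looking up learnable embeddings before message passing, can be sketched as follows; the vocabulary and sizes are illustrative:

```python
import torch

smiles = "CC(=O)Oc1ccccc1C(=O)O"
vocab = {ch: i for i, ch in enumerate(sorted(set(smiles)))}   # character-level vocabulary
ids = torch.tensor([vocab[ch] for ch in smiles])              # tokenized sequence

embed = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)
tokens = embed(ids)          # (seq_len, 32) inputs for the MP/attention network
print(tokens.shape)
```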
Article
Full-text available
Predicting molecular properties (e.g., atomization energy) is an essential issue in quantum chemistry, which could accelerate research areas such as drug design and substance discovery. Traditional studies based on density functional theory (DFT) in physics have proved time-consuming for predicting large numbers of molecules. Recently, machine learning methods, which incorporate much rule-based information, have also shown potential for this problem. However, the complex inherent quantum interactions of molecules are still largely underexplored by existing solutions. In this paper, we propose a generalizable and transferable Multilevel Graph Convolutional neural Network (MGCN) for molecular property prediction. Specifically, we represent each molecule as a graph to preserve its internal structure. Moreover, the well-designed hierarchical graph neural network directly extracts features from the conformation and spatial information, followed by the multilevel interactions. As a consequence, the multilevel overall representations can be utilized to make the prediction. Extensive experiments on datasets of both equilibrium and off-equilibrium molecules demonstrate the effectiveness of our model. Furthermore, the detailed results also prove that MGCN is generalizable and transferable for the prediction.
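A single toy graph-convolution step over a molecular graph illustrates the message-passing idea underlying MGCN; the real model stacks multilevel, geometry-aware interaction layers, which are omitted here:

```python
import torch

def gcn_step(h, adj, weight):
    # h: (atoms, feat); adj: (atoms, atoms) adjacency with self-loops; weight: (feat, out)
    deg = adj.sum(dim=1, keepdim=True)
    return torch.relu(((adj @ h) / deg) @ weight)    # mean-aggregate neighbours, then transform

adj = torch.eye(5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:        # a 5-atom chain
    adj[i, j] = adj[j, i] = 1.0
h = torch.randn(5, 16)                                # initial atom features
print(gcn_step(h, adj, torch.randn(16, 32)).shape)    # torch.Size([5, 32])
```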
Article
An international security conference explored how artificial intelligence (AI) technologies for drug discovery could be misused for de novo design of biochemical weapons. A thought experiment evolved into a computational proof.
Chapter
In the paper, we present a 'pre-train' + 'post-train' + 'fine-tune' three-stage paradigm, a supplementary framework to the standard 'pre-train' + 'fine-tune' language model approach. Furthermore, based on the three-stage paradigm, we present a language model named PPBERT. Compared with the original BERT architecture, which is based on the standard two-stage paradigm, we do not fine-tune the pre-trained model directly, but rather post-train it on a domain- or task-related dataset first, which helps to better incorporate task-awareness and domain-awareness knowledge into the pre-trained model and reduces bias from the training dataset. Extensive experimental results indicate that the proposed model improves the performance of the baselines on 24 NLP tasks, including eight GLUE benchmarks, eight SuperGLUE benchmarks and six extractive question answering benchmarks. More remarkably, our proposed model is a flexible and pluggable approach, where the post-training stage can be plugged into other PLMs that are based on BERT. Extensive ablations further validate its effectiveness and state-of-the-art (SOTA) performance. The open source code, pre-trained models and post-trained models are available publicly.
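A hedged sketch of the three-stage idea (pre-train, then post-train on a domain corpus with the same masked-LM objective, then fine-tune) using the Hugging Face transformers and datasets packages; the model name and the two-sentence toy corpus are placeholders, and this is not the authors' released code:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")   # stage 1: reuse generic pre-training

# Stage 2: post-train with the masked-LM objective on a (toy) domain/task-related corpus.
corpus = Dataset.from_dict({"text": ["domain sentence one.", "domain sentence two."]})
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True),
                       batched=True, remove_columns=["text"])
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="post_trained", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()
model.save_pretrained("post_trained")   # stage 3: fine-tune these weights on the downstream task as usual
```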