nature machine intelligence
Article
https://doi.org/10.1038/s42256-022-00580-7
Large-scale chemical language
representations capture molecular structure
and properties
Jerret Ross , Brian Belgodere, Vijil Chenthamarakshan , Inkit Padhi,
Youssef Mroueh & Payel Das
Received: 18 April 2022 | Accepted: 3 November 2022 | Published online: 21 December 2022
IBM Research, Yorktown Heights, New York, NY, USA. e-mail: rossja@us.ibm.com; daspa@us.ibm.com
Models based on machine learning can enable accurate and fast molecular
property predictions, which is of interest in drug discovery and material
design. Various supervised machine learning models have demonstrated
promising performance, but the vast chemical space and the limited
availability of property labels make supervised learning challenging.
Recently, unsupervised transformer-based language models pretrained
on a large unlabelled corpus have produced state-of-the-art results in
many downstream natural language processing tasks. Inspired by this
development, we present molecular embeddings obtained by training
an efficient transformer encoder model, MoLFormer, which uses rotary
positional embeddings. This model employs a linear attention mechanism,
coupled with highly distributed training, on SMILES sequences of 1.1
billion unlabelled molecules from the PubChem and ZINC datasets. We
show that the learned molecular representation outperforms existing
baselines, including supervised and self-supervised graph neural networks
and language models, on several downstream tasks from ten benchmark
datasets, and performs competitively on two others. Further analyses,
specifically through the lens of attention, demonstrate that MoLFormer
trained on chemical SMILES indeed learns the spatial relationships between
atoms within a molecule. These results provide encouraging evidence that
large-scale molecular language models can capture sufficient chemical and
structural information to predict various distinct molecular properties,
including quantum-chemical properties.
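The abstract names two architectural ingredients: rotary positional embeddings and a linear attention mechanism. As a rough illustration only, not a reproduction of the authors' implementation, the NumPy sketch below applies rotary embeddings to query and key vectors and then computes linear attention with the elu(x) + 1 feature map of Katharopoulos et al. (2020); the function names, array shapes and 'rotate-half' channel pairing are assumptions made for this example.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotate channel pairs of a (seq_len, dim) array by position-dependent
    angles (rotary positional embeddings), so that dot products between
    rotated queries and keys depend on their relative positions.
    Uses the 'rotate-half' channel pairing; dim must be even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)               # per-pair frequency
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def linear_attention(q, k, v):
    """O(N) attention: softmax is replaced by the positive feature map
    phi(t) = elu(t) + 1, so keys and values are summarised once instead
    of forming an N x N attention matrix."""
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))  # elu(t) + 1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                 # (dim, dim_v): fixed-size key-value summary
    z = q @ k.sum(axis=0)        # (seq_len,): per-query normalisation
    return (q @ kv) / z[:, None]

# Toy usage: random features standing in for embedded SMILES tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
out = linear_attention(rotary_embed(q), rotary_embed(k), v)
print(out.shape)  # (16, 8)
```

The design point is that the key-value summary in linear attention has a fixed size, so cost grows linearly rather than quadratically with sequence length; combined with highly distributed training, this is what makes pretraining on more than a billion SMILES sequences tractable.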
Machine learning (ML) has emerged as an appealing, computationally efficient approach for predicting molecular properties, with implications in drug discovery and material engineering. ML models for molecules can be trained directly on predefined chemical descriptors, such as unsupervised molecular fingerprints^1, or hand-derived derivatives of geometric features such as a Coulomb matrix^2. However, more recent ML models have focused on automatically learning the features either from the natural graphs that encode the connectivity information or from the line annotations of molecular structures, such as the popular SMILES^3 (simplified molecular-input line-entry system) representation. SMILES defines a character string representation of a molecule by performing a depth-first preorder spanning tree traversal of the molecular graph, generating symbols for each atom, bond, tree-traversal decision and broken cycle. Therefore, the resulting character string corresponds to a flattening of a spanning tree of the molecular graph. Learning on SMILES has been widely adopted for
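To make the traversal just described concrete, the sketch below parses a SMILES string, prints a canonical depth-first flattening of its graph (the digit 1 marks the ring bond broken to obtain a spanning tree), walks the molecular graph explicitly, and computes a Morgan fingerprint as an instance of the predefined descriptors mentioned above. It assumes the open-source RDKit toolkit is installed; the molecule and fingerprint parameters are illustrative choices, not taken from the paper.

```python
# Minimal sketch assuming RDKit (https://www.rdkit.org) is installed.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("C1CCCCC1O")  # cyclohexanol, written with a ring closure

# Canonical SMILES: RDKit's own depth-first flattening of the molecular graph.
print(Chem.MolToSmiles(mol))           # e.g. 'OC1CCCCC1'

# The same molecule as an explicit graph: each atom and its neighbours.
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(),
          [n.GetIdx() for n in atom.GetNeighbors()])

# A predefined descriptor of the kind cited above: a 2,048-bit Morgan
# (circular) fingerprint with radius 2.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits(), "bits set")
```

Because a traversal can start at any atom and visit neighbours in any order, one molecular graph admits many valid SMILES strings; canonicalization collapses these to a single form, which matters when assembling and deduplicating pretraining corpora on the scale of PubChem and ZINC.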