Xiao-Chen Zhang is currently a PhD student in State Key Laboratory of High-Performance Computing, School of Computer Science, National University of
Defense Technology, China. His research focuses on the development of cheminformatics tools.
Cheng-kun Wu is currently an associate professor in State Key Laboratory of High-Performance Computing, School of Computer Science, National
University of Defense Technology. His research focuses on systems biology, high-performance computing, pattern recognition, machine learning and
data mining.
Zhi-Jiang Yang was born in Hunan, China. He is currently a graduate student at Xiangya School of Pharmaceutical Sciences, Central South University. His
research focuses on leveraging artificial intelligence for drug discovery.
Zhen-Xing Wu was born in Hunan, China. He is currently a PhD student in the College of Pharmaceutical Sciences, Zhejiang University, under the
supervision of Prof. Hou. His interests mainly lie in the area of computer-aided drug design.
Jia-Cai Yi was born in Guangdong, China. He is currently a graduate student in State Key Laboratory of High-Performance Computing, School of Computer
Science, National University of Defense Technology. His research focuses on leveraging artificial intelligence for drug discovery.
Chang-Yu Hsieh is currently a senior researcher at Tencent Quantum Laboratory since 2018. He received his PhD degree in Physics from the University of
Ottawa in 2012 and worked as a postdoctoral researcher at the University of Toronto (2012–2013) and the Massachusetts Institute of Technology (2013–2016).
Before joining Tencent, he worked as a senior researcher at the Singapore-MIT Alliance for Science and Technology (2017–2018). His research
interests span across quantum information science, nonequilibrium statistical physics, theoretical chemistry and machine learning for chemistry.
Ting-Jun Hou is currently a professor at the College of Pharmaceutical Sciences, Zhejiang University, China. His research interests include molecular
simulation, drug design, machine learning, chemoinformatics and bioinformatics. Further information about Ting-Jun Hou can be found at the website
of his group: http://cadd.zju.edu.cn.
Dong-Sheng Cao is currently a professor in the Xiangya School of Pharmaceutical Sciences, Central South University, China. His research interests include
chemoinformatics, bioinformatics, drug design, chemo- and geoinformatics, web server and database, machine learning. Further information about Dong-
Sheng Cao can be found at the website of his group: http://www.scbdd.com.
Submitted: 27 January 2021; Received (in revised form): 11 March 2021
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, 00(00), 2021, 1–14
doi: 10.1093/bib/bbab152
Problem Solving Protocol
MG-BERT: leveraging unsupervised
atomic representation learning for molecular
property prediction
Xiao-Chen Zhang, Cheng-Kun Wu, Zhi-Jiang Yang, Zhen-Xing Wu,
Jia-Cai Yi, Chang-Yu Hsieh, Ting-Jun Hou and Dong-Sheng Cao
Corresponding author: Dong-Sheng Cao, Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410003, PR China.
Tel.: +86-731-89824761; E-mail: oriental-cds@163.com; Ting-Jun Hou, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058,
Zhejiang, PR China. Tel.: +86-571-88208412; E-mail: tingjunhou@zju.edu.cn
The first two authors contributed equally to this paper.
Abstract
Motivation: Accurate and efficient prediction of molecular properties is one of the fundamental issues in drug design and
discovery pipelines. Traditional feature engineering-based approaches require extensive expertise in the feature design and
selection process. With the development of artificial intelligence (AI) technologies, data-driven methods exhibit
unparalleled advantages over the feature engineering-based methods in various domains. Nevertheless, when applied to
molecular property prediction, AI models usually suffer from the scarcity of labeled data and show poor generalization
ability. Results: In this study, we proposed molecular graph BERT (MG-BERT), which integrates the local message passing
mechanism of graph neural networks (GNNs) into the powerful BERT model to facilitate learning from molecular graphs.
Furthermore, an effective self-supervised learning strategy named masked atoms prediction was proposed to pretrain the
MG-BERT model on a large amount of unlabeled data to mine context information in molecules. We found the MG-BERT
model can generate context-sensitive atomic representations after pretraining and transfer the learned knowledge to the
prediction of a variety of molecular properties. The experimental results show that the pretrained MG-BERT model with a
little extra fine-tuning can consistently outperform the state-of-the-art methods on all 11 ADMET datasets. Moreover, the
MG-BERT model leverages attention mechanisms to focus on atomic features essential to the target property, providing
excellent interpretability for the trained model. The MG-BERT model does not require any hand-crafted feature as input and
is more reliable due to its excellent interpretability, providing a novel framework to develop state-of-the-art models for a
wide range of drug discovery tasks.
Graphical Abstract
Key words: molecular property prediction; molecular graph BERT; atomic representation; deep learning; self-supervised
learning
Introduction
Drug discovery is a risky, lengthy and resource-intensive process
that usually takes around 10–15 years and billions of dollars [1].
To improve the efficiency of drug discovery, considerable efforts
have been put into the development of computational tools and
bioinformatics approaches [2,3]. Among these methods, compu-
tational models for accurate prediction of molecular properties
have a more significant and immediate impact on the drug
discovery process since they can alleviate the excessive depen-
dence on time-consuming and labor-intensive experiments and
substantially reduce expenditure and time costs [4]. In this con-
text, high-precision molecular property prediction models have
become indispensable tools in many stages of the drug discovery
process, covering hit identification, lead optimization, ADMET
(Absorption, Distribution, Metabolism, Excretion and Toxicity)
properties evaluation, etc. [5].
Expressive molecular representations are essential for
molecular property prediction. Traditional methods heavily rely
on feature engineering, in which experts handcraft a set of rules
to encode relevant structural information or physicochemical
properties of molecules into fixed-length vectors [6]. Molecular
fingerprints and molecular descriptors are two canonical
categories of molecular features. Molecular fingerprints focus
on recording information about molecular substructures
[7]. For instance, extended connectivity fingerprints (ECFPs)
assign an initial feature to each nonhydrogen atom and
iteratively combines the features of neighboring atoms until
a specific diameter is reached [8,9]. Fingerprints are usu-
ally not optimized for particular prediction tasks due to
sparse encoding problems. Alternatively, descriptors consist
of a collection of physiochemical properties and structural
information selected by experts based on their professional
insight and feature engineering practice [10,11]. Molecular
descriptors can reduce irrelevant features and improve per-
formance to some extent. However, the design process is
tedious, time-consuming and error-prone. Hence, molecular
fingerprints and descriptors both suffer from low scalability and
versatility.
Recently, deep learning (DL) methods have made significant
breakthroughs in many fields, such as computer vision [12,13],
natural language processing (NLP) [14,15], playing go game [16],
etc. The fundamental principle behind DL is: designing a suitable
deep neural network (DNN) and training it on a large amount
of raw data to learn representations automatically, rather than
relying on human-crafted features.
The successful applications of DL in various domains
inspired its application in molecular property prediction. Many
studies of molecular property prediction have tried to apply
DNNs directly to low-level molecular representations, such
as the sequential SMILES (Simplified molecular-input line-
entry specification) strings or molecular graphs [17–22]. The
SMILES string describes compositions and chemical structures
of molecules by a line of ASCII strings. As a kind of text, some
suitable text processing algorithms, such as CNN, LSTM and
Transformer [9,17,23], can be directly applied to build the
prediction models. However, these algorithms need to learn to
parse out useful features of molecules from the complex syntax
of SMILES, which greatly increases the difficulty of learning and
generalization. Notably, unsupervised models such as
auto-encoders have been applied to SMILES strings to learn useful
representations from a large amount of unlabeled data [24–26].
This type of model generally comprises two neural networks: an
encoder and a decoder. The encoder network converts the input
sequence (a variable-length SMILES sequence with discrete
values) into a fixed-size continuous vector (latent represen-
tation). The decoder network takes the latent representation as
input and aims to convert it back to the input sequence. These
models can be trained to embed the discrete molecules into a
continuous vector space through training on a large amount of
unlabeled data. The latent representation can be adopted for the
downstream prediction tasks. However, the SMILES recovery-
based molecular representation may not quite be optimal for
general prediction tasks and cannot be further optimized.
Emerging GNNs can directly learn from graph data, which could
be a great advantage in molecular property prediction. Specif-
ically, GNNs first represent molecules as two-dimensional (2D)
graphs according to the connection relations [27–29] or three-
dimensional (3D) graphs according to the atom distance matrix
[18,26,30]. Then atoms are embedded into vectors according to
their atomic characteristics, such as atom type, valence electron
number, bond number, etc. After that, the vector for each atom
is iteratively updated by aggregating information of surrounding
atoms. Finally, a graph-level vector is generated from vectors
of all atoms through a specific readout mechanism and sent to
the fully connected neural network for prediction. If needed, the
information from atomic bonds can also be incorporated [22,31].
However, limited by overfitting and oversmoothing problems,
current GNNs are usually too shallow (generally 2–3 layers),
which weakens their ability to extract deep-level patterns [32].
The common challenges faced by the DL models in molecular
property prediction are the scarcity of labeled data. It is well-
known that DL models usually require a large amount of labeled
data to achieve high effectiveness and generalization [33]. For
example, in image classification tasks, people usually collect
millions of images to train their DL models [34]. Unfortunately,
it is unrealistic to obtain so much molecular property data,
especially ADMET endpoint data, which often requires a large
number of time-consuming, laborious and costly experiments
[35]. This dilemma often makes DL models overfit, greatly
hurting their generalization ability.
The scarcity of labeled data has motivated the development
of self-supervised or semisupervised learning methods [15,36]
in other fields. In the NLP domain, the recently proposed BERT
model can utilize a large number of unlabeled texts for pre-
training and dramatically improve the performance of vari-
ous downstream tasks. The success of the BERT model can
be attributed to the masked tokens prediction in which the
model learns to predict the masked or contaminated words
according to other visible words in the same sentence. In this
process, the model is driven to mine the context information
in this sentence. This kind of context information can bene-
fit the downstream task and greatly improve their prediction
performance. Inspired by the BERT model, the SMILES-BERT
model was proposed to directly apply the BERT model to SMILES
strings [37]. However, the SMILES-BERT model suffers from a lack
of interpretability due to the existence of auxiliary characters
in SMILES strings. Additionally, the complex syntax of SMILES
strings also increases the difficulty for model learning.
To address these issues, we proposed a novel molecular
graph BERT (MG-BERT) model by integrating the local message
passing mechanism of GNNs into the powerful BERT model.
The proposed MG-BERT model can overcome the oversmoothing
problem faced by common GNNs and provide enough capacity to
extract deep-level features for the generation of molecular rep-
resentations. We further proposed the masked atoms prediction
pretraining as an effective strategy to mine the context informa-
tion in molecules automatically. Experimental results illustrate
that MG-BERT can generate context-sensitive atomic represen-
tations after pretraining and greatly boost the performance of
molecular property prediction tasks on 11 practical tasks, in
which MG-BERT can consistently outperform previous state-of-
the-art models. Additionally, MG-BERT can learn to focus on
atoms and substructures related to the target properties by atten-
tion mechanism, which provides valuable clues to analyze and
optimize molecules.
Materials and methods
Dataset collection
The training process of the proposed MG-BERT model consists of
two stages: pretraining and fine-tuning. In the pretraining stage,
we took advantage of a large number of unlabeled molecules to
mine context information in molecules. Herein, 1.7 million com-
pounds were randomly selected from the ChEMBL [38] database
as the pretraining data. To evaluate the pretrained model, we
randomly held out 10% of these compounds for pretraining evaluation,
leaving approximately 1.53 million compounds for pretraining. In the fine-tuning stage,
the pretrained model was further trained for specific molecu-
lar property prediction. Sixteen datasets (eight for regression
and eight for classification) covering critical ADMET endpoints
and various common molecular properties were collected from
the ADMETlab [35] and MoleculeNet [39] to train and evaluate
MG-BERT. Detailed information of these 16 datasets is listed in
Table 1. All molecules in these datasets are stored as SMILES
strings. The datasets were split into the training, vali-
dation and test datasets by a ratio of 8:1:1. It is worth noting
that SMILES strings have a wide span of length ranging from
several characters to over 100 characters. Therefore, a stratified
sampling by SMILES length was used to make dataset splitting
more uniform.
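As an illustration, the following Python sketch shows one way such a length-stratified 8:1:1 split could be implemented with pandas and scikit-learn; the bin edges, column name and two-step splitting scheme are assumptions made for illustration, not details of the released code.

```python
# Illustrative sketch of an 8:1:1 split stratified by binned SMILES length.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df: pd.DataFrame, seed: int = 0):
    """Split a dataframe with a 'smiles' column into train/valid/test (8:1:1),
    stratifying on binned SMILES length so each subset covers short and long molecules."""
    edges = [20, 40, 60, 80, 100]                       # coarse, assumed bin edges
    bins = np.digitize(df["smiles"].str.len(), edges)
    train, rest = train_test_split(df, test_size=0.2, stratify=bins, random_state=seed)
    rest_bins = np.digitize(rest["smiles"].str.len(), edges)
    valid, test = train_test_split(rest, test_size=0.5, stratify=rest_bins, random_state=seed)
    return train, valid, test
```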
Model architecture
The original BERT model consists of three components: an
embedding layer, several Transformer encoder layers [14],
Table 1. The detailed information of the 16 datasets used in this study
Type Dataset Category Number Positive Negative
Regression Caco2 Absorption 979
Regression logD Physicochemical property 10 354
Regression logS Physicochemical property 5045
Regression PPB Distribution 1480
Regression tox Toxicity 7295
Regression ESOL Physicochemical property 1128
Regression Freesolv Physicochemical property 642
Regression Lipo Physicochemical property 4200
Classification Ames Toxicity 6719 3631 3088
Classification BBB Distribution 1855 1437 418
Classification FDAMDD Toxicity 795 437 358
Classification H_HT Toxicity 2170 1434 736
Classification Pgp_inh Absorption 2125 1240 885
Classification Pgp_sub Absorption 1210 616 594
Classification BACE Biophysics 1513 691 822
Classification BBBP Physiology 2039 1560 479
and a task-related output layer. In the embedding layer, the
input word token is embedded into continuous vector space
through an embedding matrix. As the Transformer model
cannot automatically learn positional information, a predefined
positional encoding vector needs to be added to each embedding
vector in the embedding layer. In the Transformer encoder
layer, every word token exchanges information with each other
through a global attention mechanism. The embedding and
Transformer layers are shared during the pretraining and fine-
tuning stages. The last layer is generally a fully connected neural
network, which further processes the Transformer layer’s output
and performs specific classification or regression tasks. The last
layers for the pretraining and fine-tuning stages are not shared
and are called the pretraining head and the prediction head,
respectively. We provide an introduction of the MG-BERT model
in the supplementary part. More details about the BERT model
are described in the original literature of BERT [15].
Unlike the original BERT model for unstructured NLP, MG-
BERT makes a few modifications according to the characteristics
of molecular graphs. In the embedding layer, word tokens are
replaced by atom type tokens. As atoms in a molecule are
not related sequentially, there is no need to assign positional
information. In natural language sentences, one word may be
related to any other word, so global attention is needed. However,
in a molecule, the atom is primarily associated with its neigh-
boring atoms linked by bonds. To effectively realize this kind of
inductive bias [40], we modified the global attention in BERT to
local attention based on chemical bonds, which only allows atoms
to exchange information through chemical bonds. This kind of
local message passing mechanism makes MG-BERT a new vari-
ant of GNN. Notably, MG-BERT can overcome the oversmoothing
problem due to the residual connection mechanism in BERT and has
enough capacity to extract deep-level patterns in molecu-
lar graphs. As depicted in Figure 1, we use the adjacency matrix of
each molecule to control the information exchange between atoms.
To obtain the graph-level representation and facilitate the
subsequent prediction tasks in the fine-tuning stage, we added
a supernode connecting to all atoms for each molecule. On
the one hand, this supernode can exchange information with
other nodes, which can well solve long-distance dependence to
some extent. On the other hand, this supernode output can be
regarded as the ultimate molecular representation and used to
solve the downstream classification or regression tasks.
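To make the bond-based local attention concrete, the following NumPy sketch shows how a visible matrix derived from the adjacency matrix (with the supernode prepended) could be used to mask the attention scores; the function names and the choice of index 0 for [GLOBAL] are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of bond-based local attention: a "visible" matrix built from the
# molecular adjacency matrix plus a supernode restricts which tokens may
# exchange information before the softmax is applied.
import numpy as np

def build_visible_matrix(adjacency: np.ndarray) -> np.ndarray:
    """adjacency: (N, N) 0/1 bond-connectivity matrix over atom tokens (no supernode yet)."""
    n = adjacency.shape[0]
    visible = np.zeros((n + 1, n + 1), dtype=np.float32)
    visible[1:, 1:] = adjacency          # atom-atom visibility follows chemical bonds
    visible[0, :] = visible[:, 0] = 1.0  # the supernode ([GLOBAL]) sees, and is seen by, every atom
    np.fill_diagonal(visible, 1.0)       # every token attends to itself
    return visible

def masked_attention(scores: np.ndarray, visible: np.ndarray) -> np.ndarray:
    """scores: (N+1, N+1) raw attention logits for one head."""
    scores = np.where(visible > 0, scores, -1e9)                # block non-bonded pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)        # row-wise softmax
```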
Pretraining strategy
BERT leveraged two learning tasks to pretrain the model, includ-
ing the masked language model (MLM) task and the next sen-
tence prediction (NSP) task. The MLM task is a fill-in-the-blank
task, where a model uses context words surrounding a mask
token to predict what the masked word should be. The NSP task
is to determine if two sentences are consecutive. As molecules
lack an analog of consecutive sentences, we only used the
masked atom prediction task to pretrain our model.
Our proposed pretraining strategy is very similar to BERT.
Firstly, 15% of the atoms in a molecule will be randomly selected,
and at least one atom will be selected for the molecules with only
a few atoms. For each selected atom, there is an 80% probability
of being replaced with [MASK] tokens, a 10% probability of being
randomly replaced with other atoms, and a 10% probability of
keeping unchanged. The original molecule is used as the ground
truth to train the model and the loss is only calculated at the
masked atoms.
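The masking procedure above can be sketched as follows; the token ids, helper names and return format are assumptions made for illustration only.

```python
# Sketch of masked-atom selection: 15% of atoms per molecule, of which
# 80% become [MASK], 10% a random atom token and 10% stay unchanged.
import random

MASK_ID = 15  # assumed id of the [MASK] token in the vocabulary

def mask_atoms(token_ids, vocab_size=16, mask_rate=0.15, rng=random):
    """Return (corrupted_tokens, labels); labels are -1 where no loss is computed."""
    n_atoms = len(token_ids)
    n_select = max(1, int(round(n_atoms * mask_rate)))  # at least one atom per molecule
    selected = rng.sample(range(n_atoms), n_select)
    corrupted = list(token_ids)
    labels = [-1] * n_atoms
    for i in selected:
        labels[i] = token_ids[i]                  # loss is computed only at selected positions
        p = rng.random()
        if p < 0.8:
            corrupted[i] = MASK_ID                # replace with [MASK]
        elif p < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # replace with a random atom token
        # else: keep the original token unchanged
    return corrupted, labels
```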
Input representations
To represent and manipulate atoms in molecular graphs, we
need to add all atom types to our dictionary. However, the
number of atom types appearing in the molecules is very limited.
After statistical analysis, 13 frequently encountered atom types
were included in the dictionary, and the other rarely encountered
atom types are uniformly denoted by [UNK]. To get the graph-
level representation, we added a supernode to each molecular
graph. This supernode is denoted by [GLOBAL]. Besides, the
[MASK] token is needed for representing the masked atoms in
the pretraining stage. Thus, our dictionary includes the following
tokens: [H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se], [UNK],
[MASK], [GLOBAL].
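A minimal sketch of mapping a molecule's atoms to this dictionary with RDKit is given below, assuming hydrogens are made explicit and the supernode token is simply prepended; the particular id assignment is our own choice.

```python
# Sketch of atom tokenization with the dictionary above; rare elements fall
# back to [UNK], and a [GLOBAL] supernode token is prepended to each molecule.
from rdkit import Chem

VOCAB = ["[GLOBAL]", "[H]", "[C]", "[N]", "[O]", "[F]", "[S]", "[Cl]", "[P]",
         "[Br]", "[B]", "[I]", "[Si]", "[Se]", "[UNK]", "[MASK]"]
TOKEN_TO_ID = {t: i for i, t in enumerate(VOCAB)}

def tokenize(smiles: str):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))   # hydrogens are kept explicitly
    tokens = ["[GLOBAL]"]                          # supernode comes first
    tokens += [f"[{a.GetSymbol()}]" for a in mol.GetAtoms()]
    return [TOKEN_TO_ID.get(t, TOKEN_TO_ID["[UNK]"]) for t in tokens]
```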
Model training and evaluation
Pretraining stage
Each molecule was converted into a 2D undirected graph in
the pretraining stage according to the constituent atoms and
their connection relationship by RDKit [41]. Then a supernode
connecting all the nodes was added to every molecular graph.
After that, certain atoms were randomly selected for masking
Figure 1. The overall pretraining and fine-tuning procedures of MG-BERT. The MG-BERT model uses bond-based local attention, in which every token can only exchange
information with the tokens linked by chemical bonds. Apart from the output layers, the same architectures are used in both pretraining and fine-tuning. During the
fine-tuning stage, the supernode denoted by [GLOBAL] was used to extract the global information to perform related prediction tasks. The visible matrix is used to
modulate the attention matrix in the Transformer layer to control the information exchange.
according to the pretraining strategy. Finally, molecular graphs
were sent to the MG-BERT model to predict the types of masked
atoms. For some molecules with only a few atoms, we ensured
that at least one atom was selected for masking. The model was
trained via the standard batch gradient descent algorithm with
an Adam optimizer [42]. The learning rate was set to 1e-4, and
the batch size was set to 256. The model was pretrained for 10
epochs.
To evaluate the pretraining performance, the pretraining
masking strategy was used to mask the molecules from the test
set, and then the recovery rate was calculated as the evaluation
metric.
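For reference, the recovery-rate metric described above can be computed as in the following sketch, where logits are the model's per-atom predictions and labels mark the masked positions (-1 for unmasked atoms, as in the masking sketch earlier); these conventions are assumptions for illustration.

```python
# Sketch of the recovery-rate metric: the fraction of masked atoms whose
# type is predicted correctly on held-out molecules.
import numpy as np

def recovery_rate(logits: np.ndarray, labels: np.ndarray) -> float:
    """logits: (num_atoms, vocab_size); labels: (num_atoms,), -1 where no mask was applied."""
    masked = labels >= 0
    if masked.sum() == 0:
        return float("nan")
    predictions = logits.argmax(axis=-1)
    return float((predictions[masked] == labels[masked]).mean())
```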
Fine-tuning stage
After the pretraining, the pretraining head was removed. A two-
layer task-related fully connected neural network was added
to the output of the Transformer encoder layer corresponding
to the supernode. The dropout strategy [43] was adopted to
minimize overfitting. It should be noted that the dropout rate
has a great impact on the final prediction and needs to be
optimized according to specific tasks. According to our empirical
results, the recommended range for the dropout rate is [0.0,
0.5]. Adam optimizer was used as the fine-tuning optimizer and
a limited hyperparameter sweep was conducted for each task,
with the batch sizes selected from {16, 32, 64} and the learning
rates selected from {1e-5, 5e-5, 1e-4} [44].
The regression models were evaluated by the coefficient of
determination (R2), and the classification models were
evaluated by the area under the receiver operating characteristic curve
(ROC-AUC). We used early stopping to avoid overfitting and set the
maximum number of epochs to 100. To reduce random errors, every dataset
was trained 10 times with random dataset splitting, and the
calculated average and the standard deviation were reported as
the final performance.
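The evaluation protocol can be summarized by the following sketch, which repeats splitting and training 10 times and reports the mean and standard deviation of R2 or ROC-AUC; split_fn and train_and_predict are placeholders for the dataset splitter and the fine-tuning run, not functions from the released code.

```python
# Sketch of the repeated-split evaluation protocol described above.
import numpy as np
from sklearn.metrics import r2_score, roc_auc_score

def evaluate(dataset, split_fn, train_and_predict, task="classification", repeats=10):
    """Repeat random splitting and fine-tuning, then report mean and std of the metric."""
    scores = []
    for seed in range(repeats):
        train, valid, test = split_fn(dataset, seed=seed)
        y_true, y_pred = train_and_predict(train, valid, test)   # predicted scores on the test set
        metric = roc_auc_score if task == "classification" else r2_score
        scores.append(metric(y_true, y_pred))
    return float(np.mean(scores)), float(np.std(scores))
```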
Results and discussion
Choice of MG-BERT model structure
To determine the best structure of the MG-BERT model for
molecular property prediction tasks, we designed and compared
three model structures. The specific parameters are listed in
Table 2. The pretraining recovery accuracy and the averaged
fine-tuning performance were used as the evaluation metrics.
As listed in Table 2, the small MG-BERT model is inferior to the
other two because of too few layers. Compared with the medium
MG-BERT model, the large MG-BERT model performs better on
the pretraining recovery task but slightly worse on
molecular property prediction tasks. This phenomenon may
be caused by the fact that the large MG-BERT model has an
overfitting risk due to too many model parameters. The structure
of the medium MG-BERT model was finally adopted since it can
achieve the best performance on molecular property prediction.
Pretraining is indeed effective
To verify the effectiveness of pretraining, we compared the
performance of the pretrained and non-pretrained MG-BERT
models on molecular property prediction under the same hyper-
parameter settings. According to the comparison results listed
Table 2. Parameters and performance of the three MG-BERT structures
Name Layers Heads Embedding size FFN size Recovery accuracy Performance
MG-BERT (small) 3 2 128 256 0.9527 0.8082
MG-BERT (medium) 6 4 256 512 0.9831 0.8283
MG-BERT (large) 12 8 576 1152 0.9835 0.8253
Table 3. Performance comparison (R2; ROC-AUC, in %) of the MG-BERT models with different settings
Type Dataset MG-BERT (without pretraining) MG-BERT (without hydrogens) MG-BERT
Regression Caco2 64.79 ±3.57 70.78 ±2.31 74.68 ±3.89
Regression logD 84.78 ±1.09 82.88 ±0.40 87.46 ±0.80
Regression logS 83.84 ±0.73 85.39 ±1.36 87.66 ±0.42
Regression PPB 62.07 ±4.91 59.47 ±5.12 65.94 ±2.86
Regression tox 58.33 ±1.58 60.63 ±1.50 63.68 ±1.53
Regression ESOL 79.01 ±4.21 80.16 ±3.84 84.74 ±4.09
Regression FreeSolv 76.82 ±3.58 78.19 ±4.14 84.63 ±4.68
Regression Lipo 67.57 ±3.19 70.82 ±3.07 76.50 ±2.64
Classification Ames 86.49 ±1.00 87.45 ±0.95 89.33 ±0.83
Classification BBB 91.06 ±1.07 94.76 ±1.64 95.41 ±1.13
Classification FDAMDD 80.76 ±3.73 87.01 ±1.76 88.23 ±3.43
Classification H_HT 67.54 ±2.98 69.62 ±3.54 72.87 ±3.49
Classification Pgp_inh 88.42 ±0.64 90.18 ±0.42 92.44 ±1.14
Classification Pgp_sub 83.16 ±2.31 89.34 ±0.79 91.57 ±2.26
Classification BACE 84.35 ±3.04 86.59 ±2.17 88.68 ±2.51
Classification BBBP 86.42 ±1.93 88.93 ±2.04 92.08 ±1.89
in Table 3, the pretrained MG-BERT model can outperform the
non-pretrained MG-BERT model by more than 2% on all datasets,
clearly demonstrating the effectiveness of the pretraining strat-
egy and the excellent generalization ability of the pretrained
model. For some small datasets such as Caco2 and FDAMDD, the
prediction performance is improved by more than 7%, suggesting
that the pretraining strategy can improve the prediction perfor-
mance more effectively for small datasets. These results indicate
that the MG-BERT model can indeed learn useful knowledge
and transfer the learned knowledge to the downstream tasks by
providing a nontrivial neural network initialization.
Influence of hydrogen atoms on pretraining accuracy
and prediction tasks
Hydrogen atoms are usually ignored in most reported molecular
property prediction models. In this study, a controlled experi-
ment was conducted to explore whether hydrogen atoms are
necessary for our MG-BERT model. The hydrogen-free model
based on molecular graphs with all hydrogen atoms removed
was developed under the same hyperparameter settings as the
MG-BERT model.
As illustrated in Figure 2, the pretraining accuracy of the MG-
BERT model with hydrogens can reach 98.31%, whereas that of
the hydrogen-free model can only reach 92.25%. The fine-tuning
results listed in Table 3 show that the performance of the MG-
BERT model with hydrogens is much better than that of the
hydrogen-free model. Especially on some regression tasks, the
model with hydrogens can outperform the hydrogen-free model
by more than 4%.
The logic behind this is that MG-BERT only utilizes the
composition and connection information of molecules. Under
this setting, hydrogen atoms can help determine the numbers of
the chemical bonds for atoms of other types. In the masked atom
recovery task, the numbers of bonds are critical to determining
the types of the masked atoms. Therefore, the hydrogen-free
MG-BERT model shows a significant decrease in masked atoms
recovery rate. Furthermore, the absence of hydrogen atoms will
also affect the context-information mining process in the pre-
training stage, thus weakening the generalization ability of the
pretrained model. In addition, if hydrogen atoms are removed,
some molecules can become indistinguishable. As shown in
Figure 3, benzene and cyclohexane can be converted into the
same graph if hydrogen atoms are removed. However, if hydro-
gen atoms are kept, they will be converted into two different
graphs. In this way, the absence of hydrogen atoms has a great
impact on the performance of the fine-tuned model.
Comparison with other machine learning methods
Based on different molecular representations, we selected some
state-of-the-art models as the baselines to comprehensively
evaluate our proposed MG-BERT model. The first is the XGBoost
[45] model based on ECFP4 fingerprints (ECFP4-XGBoost). This
combination is a classic paradigm for molecular property pre-
diction tasks. The 2nd and 3rd are two of the most representa-
tive and widely used GNNs: graph attention network (GAT) [27]
and graph convolutional network (GCN) [28]. The 4th is based
on the continuous-and-data-driven descriptor (CDDD) [19], and
it consists of a fixed RNN based encoder that has been pre-
trained on a large number of unlabeled SMILES strings and a
fully connected neural network. We also included the SMILES-
BERT model, which directly utilized the original BERT model for
SMILES strings.
The prediction results are shown in Table 4 and Figure 4.
In addition to ROC-AUC and R2, we also utilized accuracy and root
mean squared error (RMSE) for evaluation, which are shown in
Table S1. The performance of the ECFP4-XGBoost model shows
Figure 2. The pretraining accuracy of MG-BERT models versus the training steps.
Figure 3. The influence of hydrogen atoms in converting molecules to graphs. (A) If hydrogens are removed, benzene and cyclohexane will be converted into the same
graph. (B) If hydrogens are kept, benzene and cyclohexane will be converted into two different graphs.
a high variation across different datasets. This is likely because
ECFP4 is a fixed-length molecular representation, so the
information it encodes may or may not be relevant to a specific
task. GNN models, including GAT and GCN, perform well when
the labeled data are sufficient. However, when the labeled data
are scarce, their performance becomes much worse, even worse
Table 4. The performance comparison (R2; ROC-AUC, in %) of the proposed model and state-of-the-art models
Type Dataset ECFP4-XGBoost GAT GCN CDDD SMILES-BERT MG-BERT
Regression Caco2 61.41 ±4.61 69.16 ±4.40 67.15 ±3.67 73.42 ±1.89 72.39 ±2.85 74.68 ±3.89
Regression logD 70.84 ±0.82 84.62 ±2.29 86.22 ±0.60 85.85 ±2.03 86.31 ±1.70 87.46 ±0.80
Regression logS 73.73 ±2.32 84.06 ±0.73 83.47 ±1.18 84.01 ±2.11 85.20 ±1.31 87.66 ±0.42
Regression PPB 55.11 ±4.35 59.96 ±4.52 57.34 ±4.74 54.12 ±4.48 62.37 ±3.89 65.94 ±2.86
Regression tox 59.02 ±1.42 59.79 ±1.79 56.71 ±1.99 57.91 ±4.19 60.49 ±2.77 63.68 ±1.53
Regression ESOL 72.80 ±2.93 76.63 ±1.77 79.18 ±2.88 80.21 ±1.72 82.41 ±3.74 84.74 ±4.09
Regression FreeSolv 73.70 ±5.03 73.17 ±4.86 77.81 ±4.92 79.15 ±3.12 81.47 ±4.73 84.63 ±4.68
Regression Lipo 54.81 ±1.64 67.56 ±3.67 70.29 ±3.16 69.24 ±2.72 72.44 ±2.57 76.50 ±2.64
Classification Ames 87.21 ±0.91 86.38 ±1.46 87.04 ±1.55 86.82 ±0.64 87.69 ±0.98 89.33 ±0.83
Classification BBB 94.62 ±0.82 93.03 ±3.04 92.67 ±1.44 94.44 ±1.67 94.02 ±1.79 95.41 ±1.13
Classification FDAMDD 88.14 ±1.83 85.27 ±1.39 85.12 ±3.89 86.55 ±2.54 87.94 ±2.77 88.23 ±3.43
Classification H_HT 71.32 ±2.41 67.48 ±2.69 69.09 ±2.54 70.81 ±3.32 71.07 ±3.81 72.87 ±3.49
Classification Pgp_inh 91.53 ±1.52 90.88 ±2.77 90.25 ±1.99 91.54 ±1.16 91.24 ±1.42 92.44 ±1.14
Classification Pgp_sub 88.30 ±2.60 90.73 ±2.25 90.42 ±2.79 89.88 ±2.76 91.32 ±2.35 91.57 ±2.26
Classification BACE 87.14 ±2.14 85.58 ±3.38 86.98 ±1.39 86.58 ±3.84 87.64 ±2.94 88.68 ±2.51
Classification BBBP 89.16 ±1.17 90.33 ±3.02 90.74 ±1.05 91.12 ±1.52 91.32 ±1.83 92.08 ±1.89
Figure 4. The performance comparison of MG-BERT and state-of-the-art models for (A) the regression tasks and (B) the classification tasks.
Figure 5. The t-SNE plot of the atomic embedding vectors colored by atomic types. The hydrogen atoms are ignored before t-SNE for better visualization.
Table 5. Symbols and defined categories of atoms
Symbols Categories
C–N Carbon atoms with four single bonds, in which one bond is connected to a nitrogen atom.
C–O Carbon atoms with four single bonds, in which no bond is connected to a nitrogen atom and one bond is connected to an oxygen atom.
C–C Carbon atoms with four single bonds, in which no bond is connected to a nitrogen or oxygen atom.
C=C Carbon atoms with a double bond connected to another carbon atom.
C=O Carbon atoms with a double bond connected to an oxygen atom.
Aromatic C Carbon atoms in a benzene ring.
O= Oxygen atoms with a double bond.
–OH Oxygen atoms with a single bond connected to a hydrogen atom.
–O– Oxygen atoms with two single bonds, in which no bond is connected to a hydrogen atom.
N= Nitrogen atoms with a double bond.
–NH2 Nitrogen atoms with three single bonds, in which two bonds are connected to hydrogen atoms.
–NH Nitrogen atoms with three single bonds, in which one bond is connected to a hydrogen atom.
N Nitrogen atoms with three single bonds, in which no bond is connected to a hydrogen atom.
Others Atoms not defined above
than the model based on the molecular fingerprints. The CDDD
model shows certain competitiveness. However, the molecular
representations of the CDDD model were obtained through the
SMILES encoding and decoding tasks, which cannot be further
optimized for specific tasks. In contrast, the SMILES-BERT model
and our MG-BERT model can also learn rich context-sensitive
information in the pretraining stage and can be further opti-
mized for specific tasks. The SMILES-BERT model is slightly behind
our MG-BERT model. This may be caused by the fact that SMILES
strings are much more complex to learn from than molecular
graphs, which means the SMILES-BERT model has to parse
out the molecular information hidden in the complex syntax of
SMILES strings, whereas the MG-BERT model can directly learn
from molecular graphs, which are a natural representation of
molecules. The proposed MG-BERT model can consistently out-
perform the other methods. The overall improvement is 28.1%
Figure 6. The t-SNE plot of the atomic embedding vectors colored by atom categories. In the main graph, the positions of the atoms belonging to the displayed molecule
are marked. In the enlarged graph, the atoms and their corresponding positions are marked by the same color. In the enlargement, all colored atoms are in a benzene
ring and linked to a carbonyl group.
(7.02% on classification tasks and 21.28% on regression tasks).
Notably, for the PPB dataset, the improvement of the MG-BERT
model is more than 6%. The improvements of our model relative
to baselines are statistically significant (95% confidence interval,
CI) according to the paired t-test (P ≤ 0.001). These results con-
vincingly highlight MG-BERT’s potential to become a good choice
for molecular property prediction tasks in drug design.
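As a sketch, a paired t-test of this kind could be run on per-split scores with SciPy; the score arrays below are placeholders, not the reported results.

```python
# Sketch of a paired t-test on per-split scores of MG-BERT versus a baseline
# evaluated on the same random splits (placeholder numbers).
from scipy.stats import ttest_rel

mgbert_scores = [0.89, 0.90, 0.88, 0.91, 0.89, 0.90, 0.88, 0.89, 0.90, 0.91]
baseline_scores = [0.87, 0.88, 0.86, 0.88, 0.87, 0.88, 0.86, 0.87, 0.88, 0.89]

t_stat, p_value = ttest_rel(mgbert_scores, baseline_scores)
print(f"t = {t_stat:.3f}, P = {p_value:.4f}")  # a small P indicates a significant improvement
```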
Analysis of the atomic representations
from the pretrained MG-BERT model by t-SNE
To analyze what the MG-BERT model learned in the pretraining
stage, we visualized the atomic representations generated by the
pretrained model and tried to find some interesting patterns.
Specifically, 1000 molecules (including approximately 22 000
atoms) were randomly selected from the fine-tuning dataset and
fed into the pretrained model without masking, and the output
of the Transformer encoder layer was gathered. In this way, a
256-dimensional vector was obtained for each atom, and about
22 000 vectors were obtained in total. The classical dimensional-
ity reduction method t-SNE [46,47] was used to visualize these
high-dimensional vectors. As shown in Figure 5, atoms of dif-
ferent types can be easily distinguished, demonstrating that the
generated representations contain information of atomic types.
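The visualization pipeline described here amounts to collecting the per-atom encoder outputs and projecting them with t-SNE, for example as in the following sketch; the arrays are random placeholders standing in for the embeddings gathered from the pretrained model.

```python
# Sketch of the t-SNE visualization of per-atom Transformer outputs.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

atom_embeddings = np.random.rand(22000, 256)        # (num_atoms, hidden_size), placeholder
atom_types = np.random.randint(0, 13, size=22000)   # integer-coded atom types, placeholder

coords = TSNE(n_components=2, random_state=0).fit_transform(atom_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=atom_types, s=2, cmap="tab20")
plt.title("t-SNE of pretrained MG-BERT atomic representations")
plt.show()
```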
Further observation shows that the atoms of the same
type can be divided into several different groups. It seems
Figure 7. The attention weight visualization of the supernodes for some molecules in the (A) logD prediction task and (B) Ames prediction task. A darker color
denotes a larger attention weight. The attention weights for hydrogens are transferred to their neighbors for convenience.
that the generated atomic representations contain richer
information than atomic types. This finding inspires us to
define some atomic categories according to atoms’ surrounding
environments and visualize the atoms by their categories. The
categories were mainly defined by the possible varieties of
the 1st-order neighborhood (Table 5). As shown in Figure 6,
atoms are clustered together according to their categories.
These results indicate that the generated atomic representation
for each atom contains information on its 1st-order neigh-
borhood. Notably, the carbon atoms in the benzene ring are
distinguishable from those in other environments, indicating
that the pretrained model can capture high-order neighborhood
information. To make further analysis, we randomly selected
a complex molecule and marked the positions of its atoms in
Figure 6. It can be found that atoms in the benzene ring tend to be
close to each other if their branching environments are similar.
We further made a local enlargement to figure out whether
the environments of atoms in a small region are consistent.
In the enlarged graph, we randomly marked some atoms and
displayed their corresponding molecules. It can be found that
all these atoms are in a benzene ring and linked to a carbonyl
group. These results can prove that our pretrained model can
indeed capture high-order neighborhood information to some
extent.
The above results fully prove that the generated atomic rep-
resentations can successfully capture the 1st-order or even high-
order neighborhood information. In this way, the learned atomic
representations can be regarded as the representations of molec-
ular substructures, which can be very beneficial for downstream
tasks.
Analysis of MG-BERT’s attention
Understanding the relationship between molecular structures
and molecular properties is quite beneficial for analyzing and
optimizing molecules, and MG-BERT provides a natural way to
reveal these relationships. MG-BERT leverages attention mech-
anisms to aggregate information from all atomic representa-
tions to form molecular representations. In this way, the atten-
tion weights represent the contribution of each atomic rep-
resentation in the final molecular representation and can be
regarded as a relevant measurement to the target property. To
note, each atomic representation also aggregates information
from its neighbors, so the attention weights are not just for
individual atoms but are also shared by their neighbors to some extent.
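The following sketch illustrates how the supernode's attention over atoms could be extracted and rendered on a molecule; the attention tensor layout, the averaging over heads and the use of RDKit's similarity-map drawing are our own assumptions rather than the authors' exact procedure.

```python
# Sketch of extracting the supernode's attention over atom tokens and
# rendering it on a molecule with RDKit's similarity-map utility.
import numpy as np
from rdkit import Chem
from rdkit.Chem.Draw import SimilarityMaps

def supernode_attention(attention: np.ndarray) -> np.ndarray:
    """attention: (heads, N+1, N+1) weights from the last Transformer layer, index 0 = [GLOBAL]."""
    weights = attention.mean(axis=0)[0, 1:]   # supernode row, averaged over heads, supernode column dropped
    return weights / weights.sum()

def draw_attention(smiles: str, weights: np.ndarray):
    # Draw heavy atoms only; in the paper, weights of hydrogens are transferred to their neighbors.
    mol = Chem.MolFromSmiles(smiles)
    heavy = weights[: mol.GetNumAtoms()]      # assumes hydrogens were appended after the heavy atoms
    return SimilarityMaps.GetSimilarityMapFromWeights(mol, [float(w) for w in heavy])
```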
To find out whether our model can reasonably allocate the
attention weights, we randomly selected some molecules and
visualized their attention weights according to specific tasks.
The results for several molecules in the logD and Ames predic-
tion tasks are shown in Figure 7. The property of logD is related
to molecular lipophilicity. According to Figure 7A, we can con-
clude that more attention is distributed to polar groups, which
play an important role in determining molecular lipophilicity.
The Ames task is to determine whether a molecule belongs
to mutagens or not. Figure 7B shows that the attention is pre-
dominantly distributed to the acyl chloride, nitrosamide and
azide groups, which have been demonstrated to be mutagenic
structural alerts [48]. These results demonstrate that MG-BERT
can reasonably allocate attention weight according to specific
tasks, which is of great significance for medicinal chemists to
explore the relationship between substructures and molecular
properties.
Conclusion
In this study, we presented a novel semisupervised learning
approach called MG-BERT to alleviate the data scarcity problem
in molecular property prediction. The proposed MG-BERT model
modifies the original BERT model according to the characteristic
of molecular graphs. MG-BERT takes advantage of large amounts
of unlabeled molecular data through the masked atom recovery
task to mine the context information in molecular graphs for
effective atomic and molecular representation learning. After
the pretraining, MG-BERT could easily be fine-tuned on small
labeled datasets and achieved very competitive prediction per-
formance. In experiments, the MG-BERT model consistently out-
performed the state-of-the-art models on 11 representative ADMET
tasks, fully demonstrating the effectiveness of our proposed
method. Furthermore, we visualized the atom representations
from the pretrained model and found that the generated atomic
representations can fully capture the 1st-order neighborhood
information and even high-order neighborhood information
to some extent. Through this, we effectively explained why
pretraining is beneficial for downstream tasks. Although DL
models have strong learning and predictive capabilities, their
interpretability is generally so poor that they are called black-
box models. MG-BERT provides a natural way to measure the
relevance of atoms or substructures to the target
property by attention mechanisms. These features have estab-
lished MG-BERT as an effective and interpretable computational
tool in solving challenges of molecular property prediction and
molecular optimization.
Abbreviation
BERT, bidirectional encoder representations from Trans-
formers; AI, artificial intelligence; ASCII, American Standard
Code for Information Interchange; GNN, graph neural
network; ADMET, absorption, distribution, metabolism,
excretion and toxicity; ECFP, extended connectivity fin-
gerprints; DL, deep learning; DNN, deep neural network;
SMILES, simplified molecular input line entry specification;
CNN, convolutional neural network; LSTM, long short-term
memory.
Key Points
MG-BERT integrates the local message passing mecha-
nism of GNNs into the powerful BERT. As a new variant
of GNNs, MG-BERT can overcome the oversmoothing
problem and has enough capacity to extract deep-
level patterns in molecular graphs.
MG-BERT can take advantage of a large number of
unlabeled molecules through the masked atom recov-
ery task to mine the context information in molecular
graphs and transfer the learned knowledge to benefit
molecular property prediction.
MG-BERT can outperform the state-of-the-art models
on molecular property prediction without any hand-
crafted features and provides interpretability by rea-
sonably allocating attention weights to atoms or sub-
structures according to their relevance to the target
property.
Availability
All datasets and codes used in this study are available at
GitHub: https://github.com/zhang-xuan1314/Molecular-gra
ph-BERT.
Supplementary data
Supplementary data are available online at Briefings in Bioin-
formatics.
Authors’ contributions
XCZ and DSC developed the algorithms; XCZ wrote the codes
and drafted the manuscript; ZJY and JCY helped prepared
the datasets and figures. DSC, CKW, CYH, TJH and ZXW
helped check and improve the manuscript. All authors read
and approved the final manuscript.
Funding
Changsha Municipal Natural Science Foundation [kq2014144];
Changsha Science and Technology Bureau project [kq2001034];
National Key Research & Development project by the Min-
istry of Science and Technology of China (2018YFB1003203);
State Key Laboratory of High-Performance Computing
(No. 201901-11); National Science Foundation of China
(U1811462).
Conflict of interest
The authors declare that they have no conflict of interest.
References
1. Zhou S-F, Zhong W-Z. Drug design and discovery: principles
and applications. Molecules 2017;22(2):279.
2. Marshall GRJ. Computer-aided drug design. Annu Rev Phar-
macol 1987;27:193–213.
3. Veselovsky A, Ivanov A. Strategy of computer-aided drug
design. Current Drug Targets-Infectious Disorders 2003;3:33–40.
4. Song CM, Lim SJ, Tong JC. Recent advances in computer-
aided drug design. Brief Bioinform 2009;10:579–91.
5. Inza I, Calvo B, Armañanzas R, et al. Machine learning:
an indispensable tool in bioinformatics. Methods Mol Biol
2010;593:25–48.
6. Phillips J, Gibson W, Yam J, et al. Survey of the QSAR
and in vitro approaches for developing non-animal meth-
ods to supersede the in vivo LD50 test. Food Chem Toxicol
1990;28:375–94.
7. Livingstone DJ. The characterization of chemical structures
using molecular properties: a survey. J Chem Inf Comput Sci
2000;40(2):195–209.
8. Rogers D, Hahn M. Extended-connectivity fingerprints. J
Chem Inf Model 2010;50:742–54.
9. Chen J-H, Tseng YJ. Different molecular enumeration inf lu-
ences in deep learning: an example using aqueous solubility.
Brief Bioinform 2020:bbaa092.
10. Consonni V, Todeschini R. Molecular descriptors. In: T.
Puzyn, J. Leszczynski, and M.T.D. Cronin (Eds.), Recent
advances in QSAR studies: methods and applications. New York:
Springer, 2010, 20–102.
11. Todeschini R, Consonni V. Handbook of Molecular Descriptors.
Weinheim: Wiley-VCH, 2002.
12. Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the incep-
tion architecture for computer vision. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. Pis-
cataway, NJ: IEEE, 2016, 2818–26.
13. He K, Zhang X, Ren S, et al. Identity mappings in deep
residual networks. In: European Conference on Computer Vision.
Cham, Switzerland: Springer, 2016, 630–45.
14. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you
need. arXiv preprint, arXiv:1706.03762, 2017.
15. Devlin J, Chang M-W, Lee K, et al. Bert: pre-training of
deep bidirectional transformers for language understand-
ing. arXiv preprint, arXiv:1810.04805, 2018.
16. Silver D, Schrittwieser J, Simonyan K, et al. Master-
ing the game of go without human knowledge. Nature
2017;550:354–9.
17. Bjerrum EJ. SMILES enumeration as data augmentation
for neural network modeling of molecules. arXiv preprint,
arXiv:1703.07076, 2017.
18. Gilmer J, Schoenholz SS, Riley PF, et al. Neural message pass-
ing for quantum chemistry. In: Proceedings of the 34th Inter-
national Conference on Machine Learning, Sydney, NSW, Aus-
tralia, 2017. p. 1263–1272. Proceedings of Machine Learning
Research, Cambridge, MA, USA.
19. Winter R, Montanari F, Noé F, et al. Learning continuous and
data-driven molecular descriptors by translating equivalent
chemical representations. Chem Sci 2019;10:1692–701.
20. Feinberg EN, Sur D, Wu Z, et al. Potential net for molecular
property prediction. ACS Central Science 2018;4:1520–30.
21. Gomes J, Ramsundar B, Feinberg EN, et al. Atomic con-
volutional networks for predicting protein-ligand binding
affinity. arXiv preprint, arXiv:1703.10603, 2017.
22. Kearnes S, McCloskey K, Berndl M, et al. Molecular graph
convolutions: moving beyond fingerprints. J Comput Aided
Mol Des 2016;30:595–608.
23. Karpov P, Godin G, Tetko IV. Transformer-CNN: Swiss
knife for QSAR modeling and interpretation. J Cheminform
2020;12(1):17.
24. Xu Z, Wang S, Zhu F, et al. Seq2seq fingerprint: An unsu-
pervised deep molecular embedding for drug discovery. In:
Proceedings of the 8th ACM International Conference on Bioinfor-
matics, Computational Biology, and Health Informatics. New York,
USA: Association for Computing Machinery, 2017, p. 285–94.
25. Kadurin A, Nikolenko S, Khrabrov K, et al. druGAN: an
advanced generative adversarial autoencoder model for de
novo generation of new molecules with desired molecular
properties in silico. Mol Pharm 2017;14:3098–104.
26. Feinberg EN, Joshi E, Pande VS, et al. Improvement in ADMET
prediction with multitask deep featurization. J Med Chem
2020;63:8835–48.
27. Veličković P, Cucurull G, Casanova A, et al. Graph attention
networks. In: International Conference on Learning Represen-
tations, Vancouver, BC, Canada, 2018. OpenReview.net. Inter-
national Conference on Representation Learning, La Jolla,
CA, USA.
28. Kipf TN, Welling M. Semi-supervised classification with
graph convolutional networks. In: International Conference
on Learning Representations, Toulon, France, 2017. Open-
Review.net. International Conference on Representation
Learning, La Jolla, CA, USA.
29. Xiong Z, Wang D, Liu X, et al. Pushing the bound-
aries of molecular representation for drug discovery with
the graph attention mechanism. J Med Chem 2019;63(16):
8749–60.
30. Gao P, Zhang J, Sun Y, et al. Accurate predictions of aqueous
solubility of drug molecules via the multilevel graph convo-
lutional network (MGCN) and SchNet architectures. Journal
of Machine Learning Research 2020;22:23766–72.
31. Shang C, Liu Q, Chen K-S, et al. Edge attention-based multi-
relational graph convolutional networks. arXiv e-prints,
arXiv:1802.04944, 2018.
32. Li G, Müller M, Qian G, et al. Deepgcns: making gcns go as
deep as cnns. arXiv preprint, arXiv:1910.06849, 2019.
33. Zhang Q, Yang LT, Chen Z, et al. A survey on deep learning
for big data. Inform Fusion 2018;42:146–57.
34. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classifica-
tion with deep convolutional neural networks. Commun ACM
2017;60:84–90.
35. Dong J, Wang N-N, Yao Z-J, et al. ADMETlab: a platform for
systematic ADMET evaluation based on a comprehensively
collected ADMET database. J Cheminform 2018;10:29.
36. Chen T, Kornblith S, Norouzi M, et al. A simple framework
for contrastive learning of visual representations. In: Proceed-
ings of the 37th International Conference on Machine Learning,
Virtual Event. Cambridge, MA, USA: Proceedings of Machine
Learning Research, 2020, p. 1597–1607.
37. Wang S, Guo Y, Wang Y, et al. SMILES-BERT: large scale
unsupervised pre-training for molecular property predic-
tion. In: Proceedings of the 10th ACM International Conference
on Bioinformatics, Computational Biology and Health Informatics.
New York, USA: Association for Computing Machinery, 2019,
p. 429–36.
38. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale
bioactivity database for drug discovery. Nucleic Acids Res
2012;40:D1100–7.
39. Wu Z, Ramsundar B, Feinberg EN, et al. MoleculeNet:
a benchmark for molecular machine learning. Chem Sci
2018;9:513–30.
40. Battaglia PW, Hamrick JB, Bapst V, et al. Relational inductive
biases, deep learning, and graph networks. arXiv preprint,
arXiv:1806.01261, 2018.
41. Landrum G. RDKit: Open-Source Cheminformatics Software.
http://www.rdkit.org (accessed Aug 20, 2020).
42. Kingma DP, Ba J. Adam: a method for stochastic opti-
mization. In: International Conference on Learning Represen-
tations, San Diego, CA, USA, 2015. OpenReview.net. Inter-
national Conference on Representation Learning, La Jolla,
CA, USA.
43. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple
way to prevent neural networks from overfitting. The Journal
of Machine Learning Research 2014;15:1929–58.
44. Liu Y, Ott M, Goyal N, et al. Roberta: A robustly optimized bert
pretraining approach. In: International Conference on Learning
Representations, Virtual Conference, 2020. OpenReview.net.
International Conference on Representation Learning, La
Jolla, CA, USA.
45. Chen T, Guestrin C. Xgboost: A scalable tree boosting system.
In: Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. New York, USA:
Association for Computing Machinery, 2016, p. 785–94.
46. Wattenberg M, Viégas F, Johnson I. How to use t-SNE effec-
tively. Distill 2016;1(10):e2.
47. Van der Maaten L, Hinton G. Visualizing data using t-SNE.
Journal of Machine Learning Research 2008;9(11):2579–605.
48. Plošnik A, Vračko M, Sollner Dolenc M. Mutagenic and car-
cinogenic structural alerts and their mechanisms of action.
Arh Hig Rada Toksikol 2016;67:169–82.
... Self-supervised learning first run pretraining process on largescale unlabeled dataset to derive latent representations, and then applied to downstream tasks through transfer learning [9,10,21] to obtain better performance and robustness [22,23]. For molecular property prediction task, a few self-supervised methods have been proposed for molecular representation learning [24,25,26,27,28,29]. These methods fall roughly into two categories: generation-based methods and contrastive learning-based methods. ...
... The generative methods learned molecular features by establishing specific pretext tasks that encourage the encoder to extract high-order structural information. For example, MG-BERT learned to predict masked atoms [24] by integrating the local message passing mechanism of graph neural networks (GNNs) into the powerful BERT model [25] to enhance representation learning from molecular graphs. MolGPT [26] trained a transformer-decoder model on the next-token prediction task using masked self-attention to generate novel molecules. ...
Preprint
Full-text available
Motivation: Accurate and efficient prediction of molecular properties is one of the fundamental problems in drug research and development. Recent advancements in representation learning have been shown to greatly improve the performance of molecular property prediction. However, due to limited labeled data, supervised learning-based molecular representation algorithms can only search a limited chemical space and suffer from poor generalizability. Results: In this work, we proposed a self-supervised learning method, ATMOL, for molecular representation learning and property prediction. We developed a novel molecular graph augmentation strategy, referred to as attention-wise graph masking, to generate challenging positive samples for contrastive learning. We adopted the graph attention network (GAT) as the molecular graph encoder, and leveraged the learned attention weights as masking guidance to generate molecular augmentation graphs. By minimizing the contrastive loss between the original graph and the augmented graph, our model can capture important molecular structure and higher-order semantic information. Extensive experiments showed that our attention-wise graph mask contrastive learning exhibited state-of-the-art performance in several downstream molecular property prediction tasks. We also verified that our model pretrained on a larger scale of unlabeled data improved the generalization of the learned molecular representation. Moreover, visualization of the attention heatmaps showed meaningful patterns indicative of atoms and atomic groups important to specific molecular properties.
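As a rough illustration of the attention-wise masking idea described in this abstract (a hedged sketch under our own assumptions; the function name, the zero-vector masking and the 20% ratio are hypothetical and not taken from ATMOL), atoms can be ranked by their learned attention weights and the most-attended ones masked to build the augmented view:

import numpy as np

def attention_wise_mask(atom_features, attention_weights, mask_ratio=0.2):
    """Mask the most-attended atoms to create a 'hard' augmented view.

    atom_features: (n_atoms, n_feat) array of atom feature vectors.
    attention_weights: (n_atoms,) importance scores from a graph attention encoder.
    Returns a copy of atom_features with the top-k attended atoms zeroed out.
    """
    n_atoms = atom_features.shape[0]
    k = max(1, int(mask_ratio * n_atoms))
    top_idx = np.argsort(attention_weights)[-k:]   # indices of the most-attended atoms
    augmented = atom_features.copy()
    augmented[top_idx] = 0.0                       # zero vector as a [MASK] surrogate
    return augmented

# toy usage: a molecule with 5 atoms, 8-dimensional features, random attention scores
feats = np.random.rand(5, 8)
attn = np.random.rand(5)
aug = attention_wise_mask(feats, attn)

The original and the augmented graph would then be encoded and pulled together by a contrastive loss, as the abstract describes.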
... More concretely, models could learn general-domain chemical knowledge (e.g. an understanding of the correct valency of chemical bonds) during the pre-training process. This knowledge is then transferred during fine-tuning to overcome data scarcity for specific tasks, such as quantitative structure-activity relationship (QSAR) modeling [16][17][18][19][20], virtual screening (VS) [21], reaction prediction [13], de novo design [15], molecular optimization [22] and drug-drug interaction prediction [23]. ...
Preprint
Full-text available
Transformer models have been developed in molecular science with excellent performance in applications including quantitative structure-activity relationship (QSAR) modeling and virtual screening (VS). Compared with other types of models, however, they are large, which imposes high hardware requirements to keep both training and inference times acceptable. In this work, cross-layer parameter sharing (CLPS) and knowledge distillation (KD) are used to reduce the sizes of transformers in molecular science. Both methods not only achieve QSAR predictive performance competitive with the original BERT model but are also more parameter-efficient. Furthermore, by integrating CLPS and KD into a two-state chemical network, we introduce a new deep lite chemical transformer model, DeLiCaTe. DeLiCaTe captures general-domain as well as task-specific knowledge, which leads to a 4x faster rate of both training and inference due to 10- and 3-fold reductions in the number of parameters and layers, respectively. Meanwhile, it achieves comparable performance in QSAR and VS modeling. Moreover, we anticipate that the model compression strategy provides a pathway to the creation of effective generative transformer models for organic drug and material design.
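Cross-layer parameter sharing can be sketched in a few lines: a single transformer encoder layer is reused at every depth, so the parameter count stays roughly that of one layer. This is only an illustration of the general CLPS idea (the class name, dimensions and depth are our own assumptions, not DeLiCaTe's actual architecture):

import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Transformer encoder that reuses one layer's weights at every depth (CLPS)."""
    def __init__(self, d_model=256, n_heads=8, depth=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, n_heads)
        self.depth = depth

    def forward(self, x):                 # x: (seq_len, batch, d_model)
        for _ in range(self.depth):       # the same weights are applied `depth` times
            x = self.shared_layer(x)
        return x

tokens = torch.randn(32, 4, 256)          # e.g. 32 SMILES tokens, batch of 4
out = SharedLayerEncoder()(tokens)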
... Indeed, recently, pretrained self-supervised learning models such as BERT and GPT-3 have been proven to be effective at learning language grammars [11] for text sentence generation [13,14] and have achieved superior performance for many downstream tasks such as reading comprehension and question-answering. These language models have been further transferred to the domain of proteins [15,16,17] and organic molecules [18,19,20,21,22]. In 2019, Alley et al. ...
Preprint
Full-text available
Self-supervised neural language models have recently achieved unprecedented success, from natural language processing to learning the languages of biological sequences and organic molecules. These models have demonstrated superior performance in generation, structure classification, and functional prediction for proteins and molecules with learned representations. However, most of the masking-based pre-trained language models are not designed for generative design, and their black-box nature makes it difficult to interpret their design logic. Here we propose BLMM Crystal Transformer, a neural network based probabilistic generative model for generative and tinkering design of inorganic materials. Our model is built on the blank filling language model for text generation and has demonstrated unique advantages in learning the "materials grammars" together with high-quality generation, interpretability, and data efficiency. It can generate chemically valid materials compositions with as high as 89.7% charge neutrality and 84.8% balanced electronegativity, which are more than 4 and 8 times higher compared to a pseudo-random sampling baseline. The probabilistic generation process of BLMM allows it to recommend tinkering operations based on learned materials chemistry and makes it useful for materials doping. Combined with the TCSP crystal structure prediction algorithm, we have applied our model to discover a set of new materials as validated using DFT calculations. Our work thus brings unsupervised transformer language model-based generative artificial intelligence to inorganic materials. A user-friendly web app has been developed for computational materials doping and can be accessed freely at www.materialsatlas.org/blmtinker.
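The charge-neutrality check mentioned in this abstract can be approximated with a small oxidation-state table: a generated composition is accepted if some combination of common oxidation states sums to zero. The sketch below is a simplified illustration with a deliberately tiny, hypothetical table, not the authors' validation code:

from itertools import product

# A tiny, illustrative table of common oxidation states (not exhaustive).
COMMON_OX_STATES = {"Na": [1], "Cl": [-1], "Fe": [2, 3], "O": [-2], "Ti": [4], "Li": [1]}

def is_charge_neutral(composition):
    """composition: dict element -> atom count, e.g. {"Fe": 2, "O": 3}."""
    elements = list(composition)
    choices = [COMMON_OX_STATES[e] for e in elements]
    for states in product(*choices):              # try every oxidation-state assignment
        total = sum(s * composition[e] for s, e in zip(states, elements))
        if total == 0:
            return True
    return False

print(is_charge_neutral({"Fe": 2, "O": 3}))       # True  (two Fe3+ balance three O2-)
print(is_charge_neutral({"Na": 1, "O": 1}))       # False with this simple table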
... Computational approaches can be divided into different groups, namely drug-centric, target-based, knowledge-based, signature-based and pathway/network-based methods [14]. Machine learning-based prediction models and frameworks have been established in drug discovery using the structures and molecular properties of compounds [15][16][17]. Network analysis-based workflows were also developed to predict potential drug-target interactions, which provided clues for drug repositioning [18][19][20]. Successful examples include the indications of Gleevec (Abl tyrosine kinase inhibitor) for poxvirus infection [21], U0126 (MEK kinase inhibitor) for influenza virus infection [22], FTI-277 (farnesyltransferase inhibitor) for hepatitis delta virus (HDV) infection [23] and the combination of decitabine and gemcitabine (two antimetabolites) for human immunodeficiency virus (HIV) infection [24]. ...
Article
Full-text available
Inhibition of host protein functions using established drugs produces a promising antiviral effect with excellent safety profiles, decreased incidence of resistant variants and favorable balance of costs and risks. Genomic methods have produced a large number of robust host factors, providing candidates for identification of antiviral drug targets. However, there is a lack of global perspectives and systematic prioritization of known virus-targeted host proteins (VTHPs) and drug targets. There is also a need for host-directed repositioned antivirals. Here, we integrated 6140 VTHPs and grouped viral infection modes from a new perspective of enriched pathways of VTHPs. Clarifying the superiority of nonessential membrane and hub VTHPs as potential ideal targets for repositioned antivirals, we proposed 543 candidate VTHPs. We then presented a large-scale drug–virus network (DVN) based on matching these VTHPs and drug targets. We predicted possible indications for 703 approved drugs against 35 viruses and explored their potential as broad-spectrum antivirals. In vitro and in vivo tests validated the efficacy of bosutinib, maraviroc and dextromethorphan against human herpesvirus 1 (HHV-1), hepatitis B virus (HBV) and influenza A virus (IAV). Their drug synergy with clinically used antivirals was evaluated and confirmed. The results proved that low-dose dextromethorphan is better than high-dose in both single and combined treatments. This study provides a comprehensive landscape and optimization strategy for druggable VTHPs, constructing an innovative and potent pipeline to discover novel antiviral host proteins and repositioned drugs, which may facilitate their delivery to clinical application in translational medicine to combat fatal and spreading viral infections.
... The local head, on the other hand, considers the topological structure of the molecule. The receptive field of the individual token is restricted to its one-hop neighborhood, which is similar to [36]. In addition, we perform element-wise multiplication between the key vector and the edge feature to incorporate the bond information into the calculation. ...
Preprint
Full-text available
Retrosynthesis prediction is one of the fundamental challenges in organic synthesis. The task is to predict the reactants given a core product. With the advancement of machine learning, computer-aided synthesis planning has gained increasing interest. Numerous methods have been proposed to solve this problem with different levels of dependency on additional chemical knowledge. In this paper, we propose Retroformer, a novel Transformer-based architecture for retrosynthesis prediction that does not rely on any cheminformatics tools for molecule editing. Via the proposed local attention head, the model can jointly encode the molecular sequence and graph, and efficiently exchange information between the local reactive region and the global reaction context. Retroformer reaches new state-of-the-art accuracy for end-to-end template-free retrosynthesis, and improves over many strong baselines in molecule and reaction validity. In addition, its generative procedure is highly interpretable and controllable. Overall, Retroformer pushes the limits of the reaction reasoning ability of deep generative models.
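A minimal sketch of the local attention head described in the excerpt above: attention is restricted to the one-hop neighborhood through an adjacency mask, and the key vectors are modulated element-wise by the bond (edge) features. Shapes, names and the toy chain molecule are our own assumptions, not the Retroformer implementation:

import numpy as np

def local_attention(q, k, v, adj, edge_feat):
    """q, k, v: (n, d) token projections; adj: (n, n) 0/1 adjacency with self-loops;
    edge_feat: (n, n, d) bond features. Returns the locally attended values, shape (n, d)."""
    n, d = q.shape
    k_mod = k[None, :, :] * edge_feat                 # inject bond info into the keys
    scores = np.einsum("id,ijd->ij", q, k_mod) / np.sqrt(d)
    scores = np.where(adj > 0, scores, -1e9)          # restrict to the one-hop receptive field
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

n, d = 4, 8                                           # toy molecule with 4 atoms
q = k = v = np.random.rand(n, d)
adj = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # a simple chain
out = local_attention(q, k, v, adj, np.ones((n, n, d)))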
Article
Motivation Accurate and efficient prediction of the molecular property is one of the fundamental problems in drug research and development. Recent advancements in representation learning have been shown to greatly improve the performance of molecular property prediction. However, due to limited labeled data, supervised learning-based molecular representation algorithms can only search limited chemical space and suffer from poor generalizability. Results In this work, we proposed a self-supervised learning method, ATMOL, for molecular representation learning and properties prediction. We developed a novel molecular graph augmentation strategy, referred to as attention-wise graph masking, to generate challenging positive samples for contrastive learning. We adopted the graph attention network as the molecular graph encoder, and leveraged the learned attention weights as masking guidance to generate molecular augmentation graphs. By minimization of the contrastive loss between original graph and augmented graph, our model can capture important molecular structure and higher order semantic information. Extensive experiments showed that our attention-wise graph mask contrastive learning exhibited state-of-the-art performance in a couple of downstream molecular property prediction tasks. We also verified that our model pretrained on larger scale of unlabeled data improved the generalization of learned molecular representation. Moreover, visualization of the attention heatmaps showed meaningful patterns indicative of atoms and atomic groups important to specific molecular property.
Article
To make Knowledge Graphs (KGs) easier to manipulate, knowledge graph embedding (KGE) has been proposed and is widely used. However, the relations between entities are usually incomplete due to the performance limitations of knowledge extraction methods, which also leads to the sparsity of KGs and makes it difficult for KGE methods to obtain reliable representations. Related research has not paid much attention to this challenge in the biomedical field and has not sufficiently integrated domain knowledge into KGE methods. To alleviate this problem, we try to incorporate the molecular structure information of each entity into KGE. Specifically, we adopt two strategies to obtain the vector representations of the entities: text-structure-based and graph-structure-based. Then, we concatenate the two as the input of the KGE models. To validate our model, we construct a KCCR knowledge graph and validate the model’s superiority in entity prediction, relation prediction, and drug-drug interaction prediction tasks. To the best of our knowledge, this is the first time that molecular structure information has been integrated into KGE methods. It is worth noting that researchers can try to improve this KGE-based work by fusing other feature annotations such as Gene Ontology and protein structure.
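To illustrate the general idea of feeding structure-aware entity vectors into a KGE model, the sketch below concatenates a text-derived and a graph-derived embedding and scores a triple with a TransE-style distance. This is a generic illustration under our own assumptions (dimensions, projection matrix and scoring function are hypothetical), not the authors' model:

import numpy as np

def entity_vector(text_emb, graph_emb, W):
    """Concatenate text-structure and graph-structure embeddings, then project to KG space."""
    return W @ np.concatenate([text_emb, graph_emb])

def transe_score(head, relation, tail):
    """Lower is better: ||h + r - t|| as in TransE."""
    return np.linalg.norm(head + relation - tail)

d_text, d_graph, d_kg = 16, 16, 8
W = np.random.randn(d_kg, d_text + d_graph) * 0.1     # shared projection matrix
h = entity_vector(np.random.rand(d_text), np.random.rand(d_graph), W)
t = entity_vector(np.random.rand(d_text), np.random.rand(d_graph), W)
r = np.random.randn(d_kg) * 0.1
print(transe_score(h, r, t))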
Article
Full-text available
Accurate estimation of the synthetic accessibility of small molecules is needed in many phases of drug discovery. Several expert-crafted scoring methods and descriptor-based quantitative structure-activity relationship (QSAR) models have been developed for synthetic accessibility assessment, but their practical applications in drug discovery are still quite limited because of relatively low prediction accuracy and poor model interpretability. In this study, we proposed a data-driven interpretable prediction framework called GASA (Graph Attention-based assessment of Synthetic Accessibility) to evaluate the synthetic accessibility of small molecules by classifying compounds as easy- (ES) or hard-to-synthesize (HS). GASA is a graph neural network (GNN) architecture that performs self-feature deduction by applying an attention mechanism to automatically capture the most important structural features related to synthetic accessibility. Sampling around the hypothetical classification boundary was used to improve the ability of GASA to distinguish structurally similar molecules. GASA was extensively evaluated and compared with two descriptor-based machine learning methods (random forest, RF; eXtreme gradient boosting, XGBoost) and four existing scores (SYBA: SYnthetic Bayesian Accessibility; SCScore: Synthetic Complexity score; RAscore: Retrosynthetic Accessibility score; SAscore: Synthetic Accessibility score). Our analysis demonstrates that GASA achieved remarkable performance in distinguishing similar molecules compared with other methods and had a broader applicability domain. In addition, we show how GASA learns the important features that affect molecular synthetic accessibility by assigning attention weights to different atoms. An online prediction service for GASA is offered at http://cadd.zju.edu.cn/gasa/.
Article
Prediction of the interactions between small molecules and their targets play important roles in various applications of drug development, such as lead discovery, drug repurposing and elucidation of potential drug side effects. Therefore, a variety of machine learning-based models have been developed to predict these interactions. In this study, a model called auxiliary multi-task graph isomorphism network with uncertainty weighting (AMGU) was developed to predict the inhibitory activities of small molecules against 204 different kinases based on the multi-task Graph Isomorphism Network (MT-GIN) with the auxiliary learning and uncertainty weighting strategy. The calculation results illustrate that the AMGU model outperformed the descriptor-based models and state-of-the-art graph neural networks (GNN) models on the internal test set. Furthermore, it also exhibited much better performance on two external test sets, suggesting that the AMGU model has enhanced generalizability due to its great transfer learning capacity. Then, a naïve model-agnostic interpretable method for GNN called edges masking was devised to explain the underlying predictive mechanisms, and the consistency of the interpretability results for 5 typical epidermal growth factor receptor (EGFR) inhibitors with their structure‒activity relationships could be observed. Finally, a free online web server called KIP was developed to predict the kinome-wide polypharmacology effects of small molecules (http://cadd.zju.edu.cn/kip).
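The uncertainty weighting strategy mentioned in this abstract is commonly implemented following the homoscedastic-uncertainty formulation of Kendall et al., where each task loss is scaled by a learnable log-variance. The sketch below shows that generic formulation only; the exact loss used in AMGU may differ, and all names are our own:

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine per-task losses L_i as sum_i exp(-s_i) * L_i + s_i, with learnable s_i = log(sigma_i^2)."""
    def __init__(self, n_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))   # one learnable s_i per task

    def forward(self, task_losses):                          # task_losses: (n_tasks,) tensor
        return torch.sum(torch.exp(-self.log_vars) * task_losses + self.log_vars)

# toy usage with three kinase tasks
losses = torch.tensor([0.8, 1.2, 0.5])
criterion = UncertaintyWeightedLoss(n_tasks=3)
total = criterion(losses)

Tasks with high estimated uncertainty are automatically down-weighted, which is the effect the multi-task setting described above relies on.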
Article
Full-text available
We present SMILES-embeddings derived from the internal encoder state of a Transformer [1] model trained to canonize SMILES as a Seq2Seq problem. Using a CharNN [2] architecture upon the embeddings results in higher quality interpretable QSAR/QSPR models on diverse benchmark datasets including regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, and thus the prognosis is based on an internal consensus. That both the augmentation and transfer learning are based on embeddings allows the method to provide good results for small datasets. We discuss the reasons for such effectiveness and draft future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available on https://github.com/bigchem/transformer-cnn. The repository also has a standalone program for QSAR prognosis which calculates individual atoms contributions, thus interpreting the model’s result. OCHEM [3] environment (https://ochem.eu) hosts the on-line implementation of the method proposed.
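The SMILES augmentation used for training and inference in this kind of pipeline can be reproduced with RDKit's randomized SMILES output. The snippet below is our own illustration of the idea, not the authors' code (the function name and the number of variants are arbitrary):

from rdkit import Chem

def augment_smiles(smiles, n_variants=5):
    """Return several randomized (non-canonical) SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n_variants)]

print(augment_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin, five random atom orderings

Averaging predictions over such variants gives the internal consensus mentioned in the abstract.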
Article
Full-text available
Hunting for chemicals with favourable pharmacological, toxicological and pharmacokinetic properties remains a formidable challenge for drug discovery. Deep learning provides us with powerful tools to build predictive models that are appropriate for the rising amounts of data, but the gap between what these neural networks learn and what human beings can comprehend is growing. Moreover, this gap may induce distrust and restrict deep learning applications in practice. Here, we introduce a new graph neural network architecture called Attentive FP for molecular representation that uses a graph attention mechanism to learn from relevant drug discovery datasets. We demonstrate that Attentive FP achieves state-of-the-art predictive performances on a variety of datasets and that what it learns is interpretable. The feature visualization for Attentive FP suggests that it automatically learns non-local intramolecular interactions from specified tasks, which can help us gain chemical insights directly from data beyond human perception.
Article
Full-text available
There has been a recent surge of interest in using machine learning across chemical space in order to predict properties of molecules or design molecules and materials with desired properties. Most of this work relies on defining clever feature representations, in which the chemical graph structure is encoded in a uniform way such that predictions across chemical space can be made. In this work, we propose to exploit the powerful ability of deep neural networks to learn a feature representation from low-level encodings of a huge corpus of chemical structures. Our model borrows ideas from neural machine translation: it translates between two semantically equivalent but syntactically different representations of molecular structures, compressing the meaningful information both representations have in common in a low-dimensional representation vector. Once the model is trained, this representation can be extracted for any new molecule and utilized as descriptor. In fair benchmarks with respect to various human-engineered molecular fingerprints and graph-convolution models, our method shows competitive performance in modelling quantitative structure-activity relationships in all analysed datasets. Additionally, we show that our descriptor significantly outperforms all baseline molecular fingerprints in two ligand-based virtual screening tasks. Overall, our descriptors show the most consistent performances in all experiments. The continuity of the descriptor space and the existence of the decoder that permits to deduce a chemical structure from an embedding vector allows for exploration of the space and opens up new opportunities for compound optimization and idea generation.
Article
Full-text available
The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. The key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning - instead of feature engineering - deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning models for predicting molecular properties pertinent to drug discovery. To this end, we present the PotentialNet family of graph convolutions. These models are specifically designed for and achieve state-of-the-art performance for protein-ligand binding affinity. We further validate these deep neural networks by setting new standards of performance in several ligand-based tasks. In parallel, we introduce a new metric, the Regression Enrichment Factor EFχ(R), to measure the early enrichment of computational models for chemical data. Finally, we introduce a cross-validation strategy based on structural homology clustering that can more accurately measure model generalizability, which crucially distinguishes the aims of machine learning for drug discovery from standard machine learning tasks.
Article
Convolutional Neural Networks have been very successful at solving a variety of computer vision tasks such as object classification and detection, semantic segmentation, and activity understanding, to name just a few. One key enabling factor for their great performance has been the ability to train very deep networks. Despite their huge success in many tasks, CNNs do not work well with non-Euclidean data, which is prevalent in many real-world applications. Graph Convolutional Networks offer an alternative that allows for non-Euclidean data input to a neural network. While GCNs already achieve encouraging results, they are currently limited to architectures with a relatively small number of layers, primarily due to vanishing gradients during training. This work transfers concepts such as residual/dense connections and dilated convolutions from CNNs to GCNs in order to successfully train very deep GCNs. We show the benefit of using deep GCNs experimentally across various datasets and tasks. Specifically, we achieve promising performance in part segmentation and semantic segmentation on point clouds and in node classification of protein functions across biological protein-protein interaction graphs. We believe that the insights in this work will open avenues for future research on GCNs and their application to further tasks not explored in this paper.
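The residual connection that makes very deep GCNs trainable amounts to adding the layer input back onto the propagated features. A minimal numpy sketch of one such layer, assuming symmetric normalization and equal input/output widths (simplified; not the DeepGCN codebase):

import numpy as np

def res_gcn_layer(H, A, W):
    """One residual GCN layer: H_out = H + ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return H + np.maximum(A_norm @ H @ W, 0.0)        # residual skip + ReLU

n_nodes, d = 6, 16
H = np.random.rand(n_nodes, d)
A = (np.random.rand(n_nodes, n_nodes) > 0.7).astype(float)
A = np.triu(A, 1); A = A + A.T                        # symmetric adjacency, no self-loops yet
H_out = res_gcn_layer(H, A, np.random.randn(d, d) * 0.1)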
Article
Deep learning based methods have been widely applied to predict various kinds of molecular properties in the pharmaceutical industry with increasingly more success. In this study, we propose two novel models for aqueous solubility predictions, based on the Multilevel Graph Convolutional Network (MGCN) and SchNet architectures, respectively. The advantage of the MGCN lies in the fact that it could extract the graph features of the target molecules directly from the (3D) structural information; therefore, it doesn't need to rely on a lot of intra-molecular descriptors to learn the features, which are of significance for accurate predictions of the molecular properties. The SchNet performs well in modelling the interatomic interactions inside a molecule, and such a deep learning architecture is also capable of extracting structural information and further predicting the related properties. The actual accuracy of these two novel approaches was systematically benchmarked with four different independent datasets. We found that both the MGCN and SchNet models performed well for aqueous solubility predictions. In the future, we believe such promising predictive models will be applicable to enhancing the efficiency of the screening, crystallization and delivery of drug molecules, essentially as a useful tool to promote the development of molecular pharmaceutics.
Article
Aqueous solubility is the key property driving many chemical and biological phenomena and impacts experimental and computational attempts to assess those phenomena. Accurate prediction of solubility is essential and challenging, even with modern computational algorithms. Fingerprint-based, feature-based and molecular graph-based representations have all been used with different deep learning methods for aqueous solubility prediction. It has been clearly demonstrated that different molecular representations impact the model prediction and explainability. In this work, we reviewed different representations and also focused on using graph and line notations for modeling. In general, one canonical chemical structure is used to represent one molecule when computing its properties. We carefully examined the commonly used simplified molecular-input line-entry system (SMILES) notation representing a single molecule and proposed to use the full enumerations in SMILES to achieve better accuracy. A convolutional neural network (CNN) was used. The full enumeration of SMILES can improve the representation of a molecule and describe the molecule from all possible angles. This CNN model can be very robust when dealing with large datasets since no additional explicit chemistry knowledge is necessary to predict the solubility. Also, traditionally it is hard to use a neural network to explain the contribution of chemical substructures to a single property. We demonstrated the use of attention in the decoding network to detect the part of a molecule that is relevant to solubility, which can be used to explain the contribution from the CNN.
Article
The absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties of drug candidates are important for their efficacy and safety as therapeutics. Predicting ADMET properties has therefore been of great interest to the computational chemistry and medicinal chemistry communities in recent decades. Traditional cheminformatics approaches, using learners such as random forests and deep neural networks, leverage fingerprint feature representations of molecules. Here, we learn the features most relevant to each chemical task at hand by representing each molecule explicitly as a graph. By applying graph convolutions to this explicit molecular representation, we achieve, to our knowledge, unprecedented accuracy in prediction of ADMET properties. By challenging our methodology with rigorous cross-validation procedures and prospective analyses, we show that deep featurization better enables molecular predictors not only to interpolate but also to extrapolate to new regions of chemical space.
Conference Paper
With the rapid progress of AI in both academia and industry, Deep Learning has been widely introduced into various areas of drug discovery to accelerate its pace and cut R&D costs. Among all the problems in drug discovery, molecular property prediction has been one of the most important. Unlike general Deep Learning applications, the scale of labeled data is limited in molecular property prediction. To better solve this problem, Deep Learning methods have started focusing on how to utilize tremendous amounts of unlabeled data to improve prediction performance on small-scale labeled data. In this paper, we propose a semi-supervised model named SMILES-BERT, which consists of attention mechanism based Transformer layers. Large-scale unlabeled data have been used to pre-train the model through a Masked SMILES Recovery task. The pre-trained model can then easily be generalized to different molecular property prediction tasks via fine-tuning. In the experiments, the proposed SMILES-BERT outperforms the state-of-the-art methods on all three datasets, showing the effectiveness of our unsupervised pre-training and the great generalization capability of the pre-trained model.
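The Masked SMILES Recovery pretext task can be sketched by hiding a fraction of the tokenized SMILES characters and asking the model to recover them. This is a simplified character-level illustration; SMILES-BERT's own tokenizer and masking scheme may differ, and the 15% ratio and [MASK] token are assumptions:

import random

MASK = "[MASK]"

def mask_smiles(smiles, mask_ratio=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    random.seed(seed)
    tokens = list(smiles)                                   # naive character-level tokens
    labels = [None] * len(tokens)
    n_mask = max(1, int(mask_ratio * len(tokens)))
    for i in random.sample(range(len(tokens)), n_mask):
        labels[i] = tokens[i]
        tokens[i] = MASK
    return tokens, labels

tokens, labels = mask_smiles("CCO")                          # ethanol
print(tokens, labels)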
Article
Deep learning, as one of the most remarkable machine learning techniques at present, has achieved great success in many applications such as image analysis, speech recognition and text understanding. It uses supervised and unsupervised strategies to learn multi-level representations and features in hierarchical architectures for the tasks of classification and pattern recognition. Recent developments in sensor networks and communication technologies have enabled the collection of big data. Although big data provides great opportunities for a broad range of areas including e-commerce, industrial control and smart healthcare, it poses many challenging issues for data mining and information processing due to its characteristics of large volume, large variety, large velocity and large veracity. In the past few years, deep learning has played an important role in big data analytic solutions. In this paper, we review the emerging research on deep learning models for big data feature learning. Furthermore, we point out the remaining challenges of big data deep learning and discuss future topics.