DrugGen: Advancing Drug Discovery with Large Language Models and Reinforcement Learning Feedback

Mahsa Sheikholeslami1,2, Navid Mazrouei1, Yousof Gheisari1,3, Afshin Fasihi2, Matin Irajpour1,4*, and Ali Motahharynia1,5*

1Regenerative Medicine Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
2Department of Medicinal Chemistry, School of Pharmacy, Isfahan University of Medical Sciences, Isfahan, Iran
3Department of Genetics and Molecular Biology, Isfahan University of Medical Sciences, Isfahan, Iran
4Isfahan Cardiovascular Research Center, Cardiovascular Research Institute, Isfahan University of Medical Sciences, Isfahan, Iran
5Isfahan Neuroscience Research Center, Isfahan University of Medical Sciences, Isfahan, Iran

*Corresponding authors:
Matin Irajpour (2012irajpour@gmail.com; ORCID: 0000-0001-8504-9652)
Ali Motahharynia (alimotahharynia@gmail.com; ORCID: 0000-0002-1140-3257; Tel: +98 313 668 7087)
November 22, 2024
Abstract
Traditional drug design faces significant challenges due to inherent chemical and biological complexities, often resulting in high failure rates in clinical trials. Deep learning advancements, particularly generative models, offer potential solutions to these challenges. One promising algorithm is DrugGPT, a transformer-based model that generates small molecules for input protein sequences. Although promising, it generates both chemically valid and invalid structures and does not incorporate the features of approved drugs, resulting in time-consuming and inefficient drug discovery. To address these issues, we introduce DrugGen, an enhanced model based on the DrugGPT structure. DrugGen is fine-tuned on approved drug-target interactions and optimized with proximal policy optimization. By giving reward feedback from protein-ligand binding affinity prediction using pre-trained transformers (PLAPT) and a customized invalid structure assessor, DrugGen significantly improves performance. Evaluation across multiple targets demonstrated that DrugGen achieves 100% valid structure generation compared to 95.5% with DrugGPT and produced molecules with higher predicted binding affinities (7.22 [6.30-8.07]) compared to DrugGPT (5.81 [4.97-6.63]) while maintaining diversity and novelty. Docking simulations further validate its ability to generate molecules targeting binding sites effectively. For example, in the case of fatty acid-binding protein 5 (FABP5), DrugGen generated molecules with superior docking scores (FABP5/11, -9.537 and FABP5/5, -8.399) compared to the reference molecule (Palmitic acid, -6.177). Beyond lead compound generation, DrugGen also shows potential for drug repositioning and creating novel pharmacophores for existing targets. By producing high-quality small molecules, DrugGen provides a high-performance medium for advancing pharmaceutical research and drug discovery.
Keywords: Drug design; Drug repurposing; Large language model; Reinforcement learning;
Molecular docking
1. Introduction
Traditional drug design often falls short in handling the vast chemical and biological space involved in ligand-receptor interactions [1, 2]. Usually, a major proportion of sug-
gested drug candidates fail in clinical trials [3], making drug discovery a time-consuming and
costly process. Recent advances in deep learning (DL), particularly in generative models, of-
fer promising solutions for these obstacles [4,5]. Deep learning models have been extensively
used in molecular design [6,7], pharmacokinetics [8–11], pharmacodynamics predictions [12],
and toxicity assessments [10]. These models improve the efficiency and accuracy of various
tasks in drug development, contributing to different stages of drug discovery and optimiza-
tion projects [13, 14]. However, due to the scarcity of available datasets, the complexity of drug-target interactions, and the difficulty of manipulating complex chemical structures, generative DL models still fall short of proposing optimal answers to drug design problems [15]. Nevertheless, with the advancement of transformer-based architectures
in large language models (LLMs), new horizons have opened up in various biological con-
texts. ProGen, a model developed to design new proteins with desired functionality, and PLAPT (protein-ligand binding affinity prediction using pre-trained transformers), a model for predicting protein-ligand binding affinities, are successful examples of the application of LLMs in bioinformatics [16, 17]. DrugGPT, an LLM based on the generative pre-trained transformer (GPT) architecture [18], is another example that has shown potential in generating novel drug-like molecules that interact with biological targets [19].
DrugGPT leverages the transformer architecture to comprehend structural properties and
structure-activity relationships. Receiving the amino acid sequence of a given target pro-
tein, this model generates simplified molecular input line entry system (SMILES) [20] strings
of interacting small molecules. By learning from large datasets of known drugs and their
targets, DrugGPT can propose new compounds with desired properties by employing au-
toregressive algorithms for a stable and effective training process [21], thus accelerating the
lead discovery phase in drug development. However, the effectiveness of generative models
in drug discovery relies heavily on the quality and relevance of the training data [5]. Models
trained on comprehensive and accurately curated datasets are more likely to produce viable
drug candidates [22]. Additionally, fine-tuning these models can enhance their performance
for predictive applications [23].
In this study, we developed “DrugGen”, an LLM based on the DrugGPT architecture, fine-tuned using a curated dataset of approved drug-target pairs and further enhanced using a policy optimization method. Through this approach, DrugGen is optimized to generate drug candidates with improved properties. Furthermore, we evaluated the model’s
performance using custom metrics—validity, diversity, and novelty—to comprehensively as-
sess the quality and properties of the generated compounds. Our results indicated that
DrugGen generates more chemically sound and valid molecules in comparison with DrugGPT while maintaining the diversity and novelty of generated structures. Notably, DrugGen excels
in generating molecules with higher predicted binding affinities, increasing the likelihood
of strong interactions with biological targets. Docking simulations further demonstrated
the model’s capability to accurately target binding sites and suggest new pharmacophores.
These findings highlight DrugGen’s promising potential to advance pharmaceutical research.
Moreover, we proposed evaluation metrics that can serve as objective and practical measures
for comparing future models.
2. Results
In order to develop an algorithm to generate drug-like structures, we gathered a curated
dataset of approved drug-target pairs. We began by selecting a pre-trained model and then
enhanced its performance through a two-step process. First, we employed supervised fine-
tuning (SFT) on a dataset of approved sequence-SMILES pairs to fine-tune the model. Next,
we utilized a reinforcement learning algorithm—proximal policy optimization (PPO)—along
with a customized reward system to further optimize its performance. The final model was
named DrugGen. The schematic design of the study is illustrated in Fig. 1.
2.1. DrugGen is effectively fine-tuned on a dataset of approved drug-target pairs
Supervised fine-tuning using the SFT trainer exhibited a steady decrease in training and val-
idation loss over the epochs, indicating effective learning (Fig. 2A and Supplementary file 1).
After three epochs of training, the loss of both the training and validation datasets reached
a plateau. Therefore, checkpoint number three was selected for the second phase. In the
second phase, the model was further optimized using PPO based on the customized reward
system. Over 20 epochs of optimization, the model generated 30 unique small molecules for
each target in each epoch, ultimately reaching a plateau in the reward diagram (Fig. 2B and
Supplementary file 2).
2.2. DrugGen generates valid, diverse, and novel small molecules
Eight proteins were selected for model assessment: two targets with a high probability of association with diabetic kidney disease (DKD) from the DisGeNet database, angiotensin-converting enzyme (ACE) and peroxisome proliferator-activated receptor gamma (PPARG), and six proteins without known approved drugs, i.e., galactose mutarotase (GALM), putative fatty acid-binding protein 5-like protein 3 (FB5L3), short-wave-sensitive opsin 1 (OPSB), nicotinamide phosphoribosyltransferase (NAMPT), phosphoglycerate kinase 2 (PGK2), and fatty acid-binding protein 5 (FABP5), which were identified as having a high probability of being targeted by approved small molecules through our newly developed druggability scoring algorithm, DrugTar [24]. For each target, 500 molecules were generated.

Fig. 1. Schematic representation of model development and evaluation. The top section illustrates the dataset creation and the training of DrugGen through supervised fine-tuning (SFT) and proximal policy optimization (PPO) using a customized reward function. The bottom section outlines the assessment process, based on validity, diversity, novelty, and binding affinity for both DrugGen and DrugGPT, along with docking simulations for DrugGen.
The validity of generated molecules was 95.45% and 99.90% for DrugGPT and DrugGen, respectively (Chi-squared, P < 10^-38, Supplementary file 3). These molecules had an average diversity of 84.54% [74.24-90.48] for DrugGPT and 60.32% [38.89-92.80] for DrugGen (U = 358245213849, P = 0, Fig. 3A and Supplementary file 4), indicating that DrugGen generates more similar molecules than DrugGPT. Nevertheless, these results suggest that DrugGen still generates a wide range of structurally diverse drug candidates rather than producing similar or redundant molecules.
To assess the novelty of generated molecules, 100 unique small molecules were generated for each target. The validity scores for DrugGPT and DrugGen were in agreement with previous results (95.5% and 100%, respectively; Chi-squared, P < 10^-8, Supplementary file 5). After removing invalid structures, the novelty scores for DrugGPT and DrugGen were 66.84% [55.28-73.57] and 41.88% [24-59.66], respectively (Mann–Whitney, U = 475980, P < 10^-80, Fig. 3B and Supplementary file 5), indicating that fewer novel molecules were generated by DrugGen. These values indicate a good balance between diversity and novelty for DrugGen.
Fig. 2. Training process of DrugGen. (A) Learning curve of the model during supervised
fine-tuning (SFT) and (B) reward trend during proximal policy optimization (PPO).
2.3. DrugGen generates small molecules with high affinity for their
targets
We used two different measures to assess the binding affinity of the generated molecules
to their respective targets: PLAPT, an LLM for predicting binding affinity, and molecular
docking.
PLAPT: The same set of small molecules generated in the novelty assessment (100 unique small molecules for each target) was used for assessing the quality of generated structures. Except for FABP5, DrugGen consistently produced small molecules with significantly higher binding affinities compared to DrugGPT (7.22 [6.30-8.07] vs. 5.81 [4.97-6.63], U = 137934, P < 10^-85; Fig. 3C, Table 1, and Supplementary file 5). This finding underscores DrugGen's superior capability to generate high-quality structures.
Table 1: Statistical analysis of binding affinities of DrugGPT vs. DrugGen.

Targets   DrugGPT            DrugGen            U statistic   P
ACE       5.71 [5.10-6.71]   8.43 [6.65-9.06]   1475          < 10^-16*
PPARG     6.32 [5.75-6.74]   7.39 [6.61-7.95]   2208          < 10^-10*
GALM      6.12 [4.96-6.73]   6.92 [6.04-7.73]   2767          < 10^-6*
FB5L3     5.35 [4.73-6.38]   6.94 [6.36-7.48]   1723          < 10^-14*
OPSB      5.43 [4.26-6.14]   7.62 [6.84-8.07]   842           < 10^-22*
NAMPT     5.84 [5.16-6.75]   7.00 [6.19-7.70]   2616          < 10^-7*
PGK2      5.23 [4.62-6.22]   7.34 [6.36-8.52]   1212          < 10^-18*
FABP5     6.30 [5.35-7.18]   6.69 [5.80-7.60]   4382          1

All data are presented as median [Q1-Q3]. Data are compared using the Mann–Whitney U test with corrections for multiple comparisons using the Bonferroni method. * P < 0.05
Fig. 3. Comparison of molecular diversity, novelty, and binding affinity across different
targets. (A) Comparison of molecular diversity distribution is shown as the frequency of the "1 - Tanimoto similarity" in percent for each target. (B) Scatter plots comparing novelty, with the "1 - Tanimoto similarity" in percent plotted against the molecular index. (C) Violin
plots depicting the distribution of binding affinity for each target.
Molecular docking: Docking simulations were performed on the targets that had reliable
protein data bank files and could be successfully re-docked, i.e., FABP5, NAMPT, and
ACE. GALM protein was included to emphasize the model’s capability to create molecules
for unexplored targets with no reference molecules. The results showed that the generated
molecules included agents with high binding affinities for the binding site of their respective
targets (Table 2 and Supplementary file 6). Except for ACE, which has multiple proven binding sites and for which the docked molecules bound to locations different from the reference molecule, all other docked molecules were positioned in the same binding site as the reference in their best-docked poses (Fig. 4).
Table 2: Extra precision (XP) docking scores of generated ligands.
Small molecules XP GScore
NAMPT40 -8.381
Daporinad -8.300
NAMPT23 -8.187
Lisinopril -19.489
ACE17 (Enalapril) -15.538
ACE14 (Captopril) -9.677
ACE28 -8.964
ACE29 -6.405
GALM13 -8.905
GALM2 -7.061
GALM7 -6.913
FABP5/11 -9.537
FABP5/5 -8.399
Palmitic acid -6.177
Furthermore, the model has generated molecules with better docking scores than the refer-
ence for FABP5 (-9.537 and -8.399 vs. -6.177) and NAMPT (-8.381 vs. -8.300). Notably,
for NAMPT, the model suggested a novel pharmacophore that occupies the same active site
as the reference molecule (Fig. 5). ID cards of generated small molecules with their related
SMILES are presented in Supplementary file 7.
3. Discussion
In this study, we developed DrugGen, a large language model designed to generate small
molecules based on the desired targets as input. DrugGen is based on a previously developed
model known as DrugGPT, achieving improvements by supervised fine-tuning on approved
drugs and reinforcement learning. These improvements aim to facilitate the generation of
novel small molecules with stronger binding affinities and a higher probability of approval in
future clinical trials. The results indicate that DrugGen can produce high-affinity molecules
with robust docking scores, highlighting its potential to accelerate the drug discovery process.
Fig. 4. Visualization of ligand binding in active sites across selected targets. (A) FABP5/11 (dark pink) and Palmitic acid (light pink, reference) in the FABP5 active site. (B) GALM13 (light green) and GALM2 (lime) associated with GALM. (C) NAMPT40 (teal) and Daporinad (blue, reference) in the NAMPT active site. (D) ACE29 (red), ACE28 (yellow), ACE17 (peach), ACE14 (orange), and Lisinopril (copper, reference) associated with ACE.

DrugGen is primarily based on DrugGPT, which utilizes a GPT-2 architecture trained
on datasets comprising SMILES and SMILES-protein sequence pairs for generation of small
molecules. Although DrugGPT shows promise, it became evident that the creation of high-
quality small molecules requires more than merely ensuring ligand-target interactions. These molecules must also exhibit essential properties, including favorable chemical characteristics (such as stability and the absence of cytotoxic substructures), pharmacokinetic profiles (acceptable ADME properties: absorption, distribution, metabolism, and excretion), and pharmacodynamic attributes (efficacy and potency) [25–27]. Hence, based on the hypothesis that approved drugs possess intrinsic properties that contributed to their approval [28], DrugGen
was fine-tuned on approved sets of small molecules. This fine-tuning was enhanced through
binding affinity feedback from another LLM, PLAPT, resulting in improved quality of gen-
erated molecules. Our findings demonstrate that DrugGen produces small molecules with
significantly better chemical validity and binding affinity compared to DrugGPT while main-
taining chemical diversity.
To assess the capability of DrugGen in generating high-quality molecules, we selected eight
targets. The inclusion of six targets without known approved small molecules demonstrates
DrugGen's potential to introduce novel candidates for previously untargeted or unexplored
therapeutic areas.

Fig. 5. Comparison of pharmacophores of NAMPT inhibitors in the same active site. Daporinad (left) and NAMPT40 (right) with fundamentally different pharmacophores, both placed in the active site of NAMPT.

Among the assessed targets, generated molecules showed enhanced valid-
ity and stronger binding affinities compared to those produced by DrugGPT. This consis-
tency suggests that DrugGen's reinforcement learning process effectively enhances its ability
to generate potent drug candidates. Moreover, docking simulations further confirmed DrugGen's capability to generate high-quality small molecules. The comparison of docking
scores between generated and reference molecules, NAMPT40 vs. Daporinad and FABP5/11
and FABP5/5 vs. Palmitic acid, shows that DrugGen can design molecules with predicted
interactions stronger than the known drugs. This observation highlights DrugGen’s capabil-
ity to innovate beyond the existing drug design approaches. Furthermore, the diversity of
generated molecules, reflected in the wide range of docking scores, emphasizes the model’s
flexibility in producing varied chemical structures. Additionally, in the case of NAMPT,
the model generated one structure with a strong docking score possessing a pharmacophore very different from that of the reference molecule, meaning that its core structure was dissimilar to the reference. This structure occupied the same binding site as the reference molecule, representing a potentially new pharmacophore for this target. In addition to
these improvements, in the process of reinforcement learning, penalties were applied for gen-
erating repetitive structures, resulting in a diverse and valid set of molecules whilst retaining
the possibility of regenerating approved drugs in the case of drug repurposing [29]. Thus,
DrugGen demonstrates applicability in both de novo drug design and repurposing efforts.
Despite these achievements, our study has some limitations that should be considered in
future research. Variability in binding affinity results across assessed targets was observed.
For instance, FABP5’s performance improvement was less pronounced compared with others.
This might suggest that with certain target classes or protein sequences, unique challenges
emerge for our model, requiring additional fine-tuning or alternative strategies for further
optimization. In addition, DrugGen cannot target a specific binding site, as can be seen in
the case of ACE, which has multiple binding sites [30]. Ligand prediction using the DrugGen
model led to molecules with fairly strong ligand binding to different binding sites; however,
this may not be desirable in some cases. The existing reward function relies on an affinity-
predictor deep learning model that has inherent accuracy and specificity limitations due to
the limitations of the databases and input representation, which could be addressed in future
works. Our model is primarily focused on predicting novel cores and structures for targets with limited bioactive molecules; thus, it does not generate fully optimized structures. These predicted structures should undergo further structural optimization to better fit the active site of the target. Future improvements will involve incorporating active
site interactions into the reward system to enhance structural accuracy. Finally, the reliance
on in silico validation, while useful, needs to be complemented with experimental validation
to confirm the practical efficacy and safety of the generated molecules.
In conclusion, DrugGen represents a powerful tool for early-stage drug discovery, with the
potential to significantly accelerate the process of identifying novel lead compounds. With
further refinement and integration with experimental validation, DrugGen could become an
integral part of future drug discovery pipelines, contributing to the development of new
therapeutics across a wide range of diseases.
4. Materials and Methods
4.1. Dataset Preparation
A dataset of small molecules, each approved by at least one regulatory body, was collected to
enhance the safety and relevance of the generated molecules. First, 1710 small molecules from
the DrugBank database (version: 2023-01-04) were retrieved [31], 117 of which were labeled
as withdrawn. After initial assessments of withdrawn drugs by a physician (Ali Motahhary-
nia) and a pharmacist (Mahsa Sheikholeslami), consensus was reached to omit 50 entries
due to safety concerns. Consequently, 1660 approved small molecules and their respective
targets were selected. From the total of 2116 approved drug targets retrieved from the DrugBank database, 27 were not present in the UniProt database [32]. After further assessment,
these 27 proteins were replaced manually with equivalently reviewed UniProt IDs, identical
protein names, or by basic local alignment search tools (BLAST) [33]. The protein with
UniProt ID “Q5JXX5” was deleted from the UniProt database and therefore, omitted from
the collected dataset as well. Finally, 1660 small molecules and 2093 related protein targets
were selected. Available SMILES strings (1634) were retrieved from DrugBank, ChEMBL [34], and
ZINC20 databases [35]. Protein sequences were retrieved from the UniProt database.
4.2. Data Preprocessing
Similar to the structure used by DrugGPT, each small molecule and its target sequence were merged into a single string consisting of the protein sequence and SMILES in the following format: “<|startoftext|> + <P> + target protein sequence + <L> + SMILES + <|endoftext|>”. To ensure the compatibility of this input format with the original model, the
resulting strings were tokenized using the trained DrugGPT’s byte-pair encoding (BPE) to-
kenizer (53083 tokens). The strings were padded to the maximum length of 768, and longer
strings were truncated. The “<|startoftext|>”, “<|endoftext|>”, and “<PAD>” were de-
fined as special tokens.
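For concreteness, a minimal preprocessing sketch is shown below. The tokenizer ID ("liyuesen/druggpt"), the exact concatenation of the prompt components, and the toy protein/SMILES pair are assumptions for illustration; the authors' pipeline may differ in detail.

```python
# Hedged sketch of the preprocessing in section 4.2. The tokenizer ID and the
# exact concatenation of the prompt components are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("liyuesen/druggpt")  # DrugGPT BPE tokenizer (assumed ID)
tokenizer.add_special_tokens(
    {"bos_token": "<|startoftext|>", "eos_token": "<|endoftext|>", "pad_token": "<PAD>"}
)

def build_example(protein_sequence: str, smiles: str) -> dict:
    """Merge one approved drug-target pair into the sequence-SMILES string format."""
    text = f"<|startoftext|><P>{protein_sequence}<L>{smiles}<|endoftext|>"
    # Pad to the fixed context length of 768 tokens and truncate longer strings.
    return tokenizer(text, padding="max_length", truncation=True, max_length=768)

example = build_example("MKTAYIAKQR", "CC(=O)OC1=CC=CC=C1C(=O)O")  # toy protein/SMILES pair
print(len(example["input_ids"]))  # 768
```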
4.3. DrugGen Development Overview
Using the supervised fine-tuning (SFT) trainer module from the transformer reinforcement
learning (TRL) library (version: 0.9.4) [36], the original DrugGPT model was fine-tuned on
our dataset. Afterward, reinforcement learning was applied to further improve the model.
For this purpose, a Tesla V100 GPU with 32 GB of VRAM, 64 GB of RAM, and a 4-core
CPU were utilized for both phases, i.e., SFT and reinforcement learning using a PPO trainer.
4.3.1. Supervised Fine-tuning
The training dataset consisted of 9398 strings. The base model was trained using the SFT
trainer class for five epochs with the following configuration: Learning rate: 5e-4, batch
size: 8, warmup steps (linear warmup strategy): 100, and eval steps: 50. AdamW optimizer
with a learning rate of 5e-4 and epsilon value of 1e-8 was used for optimizing the model
parameters. The model's performance on the training and validation sets (ratio of 8:2) was
evaluated using the cross-entropy loss function during the training phase.
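A minimal sketch of this SFT setup with TRL's SFTTrainer is given below, assuming trl 0.9.x. The model identifier, the toy dataset, and the output path are placeholders rather than the authors' exact artifacts, and the tokenizer is the one prepared in the preprocessing sketch above.

```python
# Hedged sketch of the SFT phase (section 4.3.1) with trl==0.9.x; identifiers and
# the toy dataset are placeholders, not the authors' exact configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("liyuesen/druggpt")  # assumed base model ID

# In the study, 9398 sequence-SMILES strings were split 8:2 into train/validation.
texts = ["<|startoftext|><P>MKTAYIAKQR<L>CCO<|endoftext|>"] * 10  # placeholder strings
splits = Dataset.from_dict({"text": texts}).train_test_split(test_size=0.2)

training_args = TrainingArguments(
    output_dir="druggen-sft",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-4,            # AdamW is the default optimizer
    adam_epsilon=1e-8,
    warmup_steps=100,              # linear warmup strategy
    evaluation_strategy="steps",
    eval_steps=50,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    dataset_text_field="text",
    max_seq_length=768,
    tokenizer=tokenizer,           # tokenizer from the preprocessing sketch
)
trainer.train()  # optimizes the standard cross-entropy language-modeling loss
```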
4.3.2. Proximal Policy Optimization
Hugging Face’s PPO Trainer, which is based on OpenAI’s original method for “Summarize from Feedback” [37], was used in this study. PPO is a reinforcement learning algorithm that
improves the policy by taking small steps during optimization, avoiding overly large updates
that could lead to instability. The key formula used in PPO is:
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right] \quad (1)$$

In this equation, $L^{CLIP}(\theta)$ represents the clipped objective function that PPO aims to optimize during training. The expectation $\mathbb{E}_t$ denotes the average over time steps $t$, capturing the overall performance of the policy. The term $r_t(\theta)$ is the probability ratio of taking action $a_t$ under the new policy compared to the old policy, defined as $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$. The advantage estimate $A_t$ quantifies the relative value of the action taken in relation to the expected value of the policy. The clipping function, $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$, restricts the ratio to a defined range, preventing large updates to the policy that could destabilize training. This formulation allows PPO to balance exploration and stability, enabling effective policy updates while minimizing the risk of performance degradation. There are three main phases in training a
model with PPO. First, the language model generates a response based on an input query
in a phase called the rollout phase. In our study, the queries were protein sequences, and
the generated responses were SMILES strings. Then in the evaluation phase, the generated
molecules were assessed with a custom model that predicts binding affinity. Finally, the
log probabilities of the tokens in the generated SMILES sequences were calculated based on
the query/response pairs. This step is also known as the optimization phase. Additionally,
to maintain the generated responses within a reasonable range from the reference language
model, a reward signal was introduced in the form of the Kullback-Leibler (KL) divergence
between the two outputs. This additional signal ensures that the new responses do not de-
viate too far from the original model's outputs. Thus, PPO was applied to train the active
language model.
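To make the clipped objective in equation (1) concrete, a generic PyTorch implementation is sketched below; TRL's PPOTrainer computes this quantity internally, so the snippet is purely illustrative and not the authors' code.

```python
# Illustrative implementation of the clipped PPO surrogate objective (equation 1),
# negated so it can be minimized by a standard optimizer.
import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs_new - logprobs_old)                 # r_t(theta)
    unclipped = ratio * advantages                                 # r_t * A_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))              # -L^CLIP

# Toy example: two generated tokens with their advantages.
loss = ppo_clipped_loss(torch.tensor([-1.0, -0.5]),
                        torch.tensor([-1.2, -0.4]),
                        torch.tensor([0.8, -0.3]))
print(loss)
```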
In our study, the rollout section had the following generation parameters: “do_sample”:
True, “top_k”: 9, “top_p”: 0.9, “max_length”: 1024, and “num_return_sequences”: 10.
In each epoch, generation was continued until 30 unique small molecules were generated for
each target. Keeping the initial model’s structure in mind, the dataset was filtered based on
the length of each protein sequence. After creating the prompts according to the specified
format, i.e., “<|startoftext|> + <P> + target protein sequence + <L>”, prompts with a
tensor size greater than 768 were omitted, resulting in 2053 proteins (98.09% of the initial
dataset).
The PPO trainer configuration included: “mini_batch_size”: 8, “batch_size”: 240, and
“learning_rate”: 1.41e-5. Score scaling and normalization were handled with the PPO
trainer’s built-in functions.
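Putting these pieces together, a condensed sketch of the PPO loop with TRL 0.9.x is shown below. The checkpoint path, the placeholder prompts, the simplified sampling (the study additionally used num_return_sequences of 10 and continued until 30 unique molecules per target), and the reward_fn (see the reward sketch in section 4.3.3 below) are assumptions made for illustration; the tokenizer is the one from the preprocessing sketch.

```python
# Hedged sketch of the PPO phase (section 4.3.2) with trl==0.9.x. The checkpoint
# path, placeholder prompts, and reward_fn are illustrative assumptions.
import torch
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

config = PPOConfig(
    mini_batch_size=8,
    batch_size=240,
    learning_rate=1.41e-5,
    use_score_scaling=True,   # built-in score scaling (assumed flags)
    use_score_norm=True,      # built-in score normalization
)

model = AutoModelForCausalLMWithValueHead.from_pretrained("druggen-sft")      # SFT checkpoint (placeholder path)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("druggen-sft")  # frozen reference for the KL signal
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)                 # tokenizer from the preprocessing sketch

# Sampling parameters reported in the study; num_return_sequences=10 and the
# "30 unique molecules per target" stopping rule are omitted here for brevity.
generation_kwargs = {"do_sample": True, "top_k": 9, "top_p": 0.9,
                     "max_length": 1024, "pad_token_id": tokenizer.pad_token_id}

# One prompt per query in a PPO batch; real prompts are "<|startoftext|><P>sequence<L>".
prompts = ["<|startoftext|><P>MKTAYIAKQR<L>"] * config.batch_size

for epoch in range(20):
    query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]
    # Rollout: generate SMILES continuations for each protein prompt.
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    smiles = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # Evaluation: score every generated molecule with the customized reward function.
    rewards = [torch.tensor(reward_fn(seq, mol)) for seq, mol in zip(prompts, smiles)]
    # Optimization: PPO step with the KL penalty against the reference model.
    ppo_trainer.step(query_tensors, list(response_tensors), rewards)
```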
4.3.3. Reward Function
PLAPT: PLAPT, a cutting-edge model designed to predict binding affinities with remarkable accuracy, was used as the reward function. PLAPT leverages transfer learning from
pre-trained transformers, ProtBERT and ChemBERTa, to process one-dimensional protein
and ligand sequences, utilizing a branching neural network architecture for the integration of
features and estimation of binding affinities. The superior performance of PLAPT has been
validated across multiple datasets, where it achieved state-of-the-art results [16]. The affini-
ties of the generated structures with their respective targets were evaluated using PLAPT’s
neg_log10_affinity_M output.
Customized invalid structure assessor: We developed a customized algorithm using the RDKit library (version: 2023.9.5) [38] to assess invalid structures, performing specific checks to identify potential issues such as atom count, valence errors, and parsing errors. Invalid structures, including those with fewer than two atoms, incorrect valence states, or parsing failures, were flagged and penalized accordingly. To promote the generation of valid molecules, a reward value of 0 was assigned to any invalid SMILES structure. Together, these components provide a rigorous scoring system for model development.
To further shift the model toward generating novel molecules, a multiplicative penalty was applied to the reward score when a generated SMILES string matched a molecule already present in the approved SMILES dataset. Specifically, the reward was multiplied by 0.7 for such occurrences, retaining a balance between generating new structures and repurposing approved drugs.
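A minimal sketch of this reward logic is given below. The PLAPT call is written against a hypothetical predict_affinity wrapper returning neg_log10_affinity_M, so the exact interface should be checked against the PLAPT repository; the approved-SMILES set is a placeholder.

```python
# Hedged sketch of the customized reward (section 4.3.3). The PLAPT wrapper and
# its predict_affinity interface are assumptions; consult the PLAPT repository.
from rdkit import Chem

APPROVED_SMILES = set()  # canonical SMILES of the approved-drug dataset (placeholder)

def is_valid(smiles: str) -> bool:
    """Validity checks: parsing/valence errors (sanitization) and a minimum atom count."""
    mol = Chem.MolFromSmiles(smiles)       # returns None on parsing or valence errors
    return mol is not None and mol.GetNumAtoms() >= 2

def reward_fn(protein_sequence: str, smiles: str, plapt=None) -> float:
    if not is_valid(smiles):
        return 0.0                         # invalid structures receive a reward of 0
    # Predicted binding affinity from PLAPT (hypothetical wrapper call).
    affinity = plapt.predict_affinity([protein_sequence], [smiles])[0]["neg_log10_affinity_M"]
    canonical = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    if canonical in APPROVED_SMILES:
        affinity *= 0.7                    # multiplicative penalty for regenerating approved drugs
    return float(affinity)
```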
4.3.4. DrugGen Assessment
To evaluate the performance of DrugGen, several metrics were employed to measure its
efficacy in generating viable and high-affinity drug candidates. For this purpose, eight targets were selected: two DKD targets with the highest scores in the DisGeNet database (version 3.12.1) [39], i.e., “ACE” and “PPARG”, and six targets without any known approved small molecules. The selection of these six targets was based on our recent study “DrugTar Improves Druggability Prediction by Integrating Large Language Models and Gene Ontologies” [24], from which 6 of the 10 proteins most likely to become future targets were selected: “GALM”, “FB5L3”, “OPSB”, “NAMPT”, “PGK2”, and “FABP5”. The generative quality of DrugGPT and DrugGen in terms of
validity, diversity, novelty, and binding affinity was assessed. Additionally, we performed in
silico validation of the molecules generated by DrugGen using a rigorous docking method.
Validity Assessment The validity of the generated molecules was evaluated using the
previously mentioned customized invalid structure assessor. The percentage of valid structures relative to the total number generated was reported as the model’s capability to construct valid structures.
Diversity Assessment To assess the diversity of the generated molecules, 500 ligands
were generated for each target by DrugGPT and DrugGen. The diversity of the generated
molecules was quantitatively assessed using the Tanimoto similarity index [40]. The diversity
evaluation process involved the following steps: First, each generated molecule was converted
to its corresponding molecular fingerprint using Morgan fingerprints (size = 2048 bits, radius
= 2) [41]. For each molecule, pairwise Tanimoto similarities were calculated between all
possible pairs of fingerprints, and the average value was calculated. Thus, the diversity
of the generated set was determined as the “1 - average of Tanimoto similarity” within a
generated batch. The distribution of diversity for each target was plotted. The invalid
structures were not involved in diversity assessments. Statistical analyses were performed
using the Mann–Whitney U test.
Novelty Assessment For each target, a set of 100 unique molecules was generated by
DrugGPT and DrugGen. The novelty of the generated molecules was evaluated by com-
paring them to a dataset of approved drugs. After converting the molecules into Morgan
fingerprints, the similarity of each generated molecule to the approved drugs was calculated
using Tanimoto similarity index, retaining only the maximum similarity value. The novelty
was reported as the “1 - max_Tanimoto similarity”. The invalid structures were not included
in the novelty assessments. Statistical analyses were performed using the Mann–Whitney U test.
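For concreteness, a small sketch of both metrics is shown below (Morgan fingerprints with radius 2 and 2048 bits, Tanimoto similarity via RDKit); the input SMILES lists are placeholders.

```python
# Hedged sketch of the diversity and novelty metrics (Morgan fingerprints,
# radius 2, 2048 bits; Tanimoto similarity). The SMILES lists are placeholders.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) if mol else None

def diversity(generated: list[str]) -> float:
    """1 - average pairwise Tanimoto similarity within a generated batch, in percent."""
    fps = [fp for fp in map(fingerprint, generated) if fp is not None]  # invalid structures excluded
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return 100.0 * (1.0 - sum(sims) / len(sims)) if sims else 0.0

def novelty(generated: list[str], approved: list[str]) -> list[float]:
    """Per molecule: 1 - maximum Tanimoto similarity to any approved drug, in percent."""
    approved_fps = [fp for fp in map(fingerprint, approved) if fp is not None]
    scores = []
    for fp in map(fingerprint, generated):
        if fp is None:
            continue                                                    # invalid structures excluded
        max_sim = max(DataStructs.TanimotoSimilarity(fp, a) for a in approved_fps)
        scores.append(100.0 * (1.0 - max_sim))
    return scores

print(diversity(["CCO", "CCN", "c1ccccc1O"]))
print(novelty(["CCO", "c1ccccc1O"], ["CC(=O)OC1=CC=CC=C1C(=O)O"]))
```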
PLAPT Binding Affinity Assessment The same set of molecules generated during the
novelty assessment was used to evaluate the binding affinities of the compounds produced by
DrugGPT and DrugGen. The invalid structures were involved in the binding affinity assess-
ments. Statistical analysis was conducted using the Mann–Whitney U test, and corrections
for multiple comparisons were applied using the Bonferroni method.
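As an illustration of this comparison, a minimal sketch using SciPy is given below; the affinity lists are placeholders, and the factor of eight corresponds to the eight assessed targets.

```python
# Hedged sketch of the per-target statistical comparison: Mann-Whitney U test with
# a Bonferroni correction over the eight assessed targets. Affinity lists are placeholders.
from scipy.stats import mannwhitneyu

def compare(druggpt_affinities, druggen_affinities, n_comparisons=8):
    u_stat, p = mannwhitneyu(druggpt_affinities, druggen_affinities, alternative="two-sided")
    return u_stat, min(1.0, p * n_comparisons)   # Bonferroni-corrected P value

u, p_corrected = compare([5.7, 5.1, 6.7, 5.4], [8.4, 6.6, 9.1, 7.9])
print(u, p_corrected)
```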
Molecular Docking Molecular docking was conducted for selected targets with available
protein data bank (PDB) structures, specifically ACE, NAMPT, GALM, and FABP5. A set
of 100 newly generated molecules, following duplicate removal, were docked into the crystal
structures of ACE (PDB ID: 1o86), NAMPT (PDB ID: 2gvj), GALM (PDB ID: 1snz), and
FABP5 (PDB ID: 1b56). Overall, blind docking [42] was employed for all 122 generated
molecules and their references to thoroughly search the entire protein surface for the most
favorable active site (Supplementary file 6 and Supplementary file 7). The reference ligands
used were Lisinopril for ACE and Palmitic acid for FABP5, both of which were bound in
the active site. For NAMPT, Daporinad, a molecule currently in phase 2 clinical trials,
served as the highest available reference. In the case of GALM, no reference ligand was
found. The retrieved PDB files were prepared using the protein preparation wizard [43]
available in the Schrödinger suite, ensuring the addition of missing hydrogens, assignment of
appropriate charge states at physiological pH, and reconstruction of incomplete side chains
and rings. LigPrep [44] with the OPLS4 force field [45] was employed to generate all possible
stereoisomers and ionization states at pH 7.4±0.5. The prepared structures were used for
docking.
Docking simulations were performed using the GLIDE program [46]. Ligands
were docked using the extra precision (XP) protocol. Ligands were allowed full flexibility
during the docking process, while the protein was held rigid. The information of the grid
boxes is summarized in Table 3.
Table 3: Gridbox generation properties for performing blind docking.
Target          lig x range   lig y range   lig z range   x cent    x range   y cent    y range   z cent    z range
2GVJ - NAMPT    40            40            40            14.616    76        -7.569    76        14.046    76
1O86 - ACE      40            40            40            40.657    76        37.169    76        43.527    76
1SNZ - GALM     40            40            40            -10.433   58        5.656     58        50.197    58
1B56 - FABP5    30            30            30            49.969    52        22.227    52        32.492    52
The GLIDE XP scoring function was used to evaluate docking poses. Negative values of
the GLIDE score (XP GScore) were reported for readability. The robustness of the docking
procedures was validated by redocking the reference ligands into their respective binding
sites. The computed root-mean-squared deviation (RMSD) values were 0.7233Å, 0.2961Å,
and 2.0119Å for ACE, NAMPT, and FABP5, respectively, confirming the reliability of the
docking protocol.
Data availability
All data generated or analyzed during this study are included in the manuscript and sup-
porting files. The sequence-SMILES dataset of approved drug-target pairs used in this
study is publicly available at “alimotahharynia/approved_drug_target” from Hugging Face
(https://huggingface.co/datasets/alimotahharynia/approved_drug_target).
Code availability
The checkpoints, the code for generating small molecules, and the customized validity assessor are publicly available at https://huggingface.co/alimotahharynia/DrugGen and
https://github.com/mahsasheikh/DrugGen.
Acknowledgment
We sincerely thank Dr. Mehdi Rahmani for his invaluable assistance with technical and
software issues related to training our model on the cluster servers.
Funding
No funding was received for this study or its publication.
Competing interest
The authors declare no competing interests.
Authors contribution
Conceptualization: M.S, Y.G, M.I, A.M. Dataset preparation: M.S, A.M. Model develop-
ment: M.S, N.M, M.I, A.M. Statistical analysis: M.S, N.M, A.M. In silico validation: M.S,
A.F. Data interpretation: All authors. Drafting original manuscript: M.S, N.M. Revising the
manuscript: Y.G, A.F, M.I, A.M. All the authors have read and approved the final version
for publication and agreed to be responsible for the integrity of the study.
References
[1] Bai, Long and Wu, Yan and Li, Guangfeng and Zhang, Wencai and Zhang, Hao and
Su, Jiacan, "AI-enabled organoids: construction, analysis, and application," Bioactive
Materials, vol. 31, pp. 525–548, 2024.
[2] Coley, Connor W., "Defining and Exploring Chemical Spaces," Trends in Chemistry,
vol. 3, no. 2, pp. 133–145, 2021. DOI: 10.1016/j.trechm.2020.11.004.
[3] Sun, Duxin and Gao, Wei and Hu, Hongxiang and Zhou, Simon, "Why 90% of clinical
drug development fails and how to improve it?," Acta Pharmaceutica Sinica B, vol. 12,
no. 7, pp. 3049–3062, 2022. DOI: 10.1016/j.apsb.2022.02.002.
[4] Tong, Xiaochu and Liu, Xiaohong and Tan, Xiaoqin and Li, Xutong and Jiang, Ji-
axin and Xiong, Zhaoping and Xu, Tingyang and Jiang, Hualiang and Qiao, Nan and
Zheng, Mingyue, "Generative Models for de Novo Drug Design," Journal of Medicinal
Chemistry, vol. 64, no. 19, pp. 14011–14027, 2021. DOI: 10.1021/acs.jmedchem.1c00927.
[5] Zeng, Xiangxiang and Wang, Fei and Luo, Yuan and gu Kang, Seung and Tang, Jian
and Lightstone, Felice C. and Fang, Evandro F. and Cornell, Wendy and Nussinov, Ruth
and Cheng, Feixiong, "Deep generative molecular design reshapes drug discovery," Cell
Reports Medicine, vol. 3, no. 12, 2022. DOI: 10.1016/j.xcrm.2022.100794.
[6] Meyers, Joshua and Fabian, Benedek and Brown, Nathan, "De novo molecular design
and generative models," Drug Discovery Today, vol. 26, no. 11, pp. 2707–2715, 2021.
DOI: https://doi.org/10.1016/j.drudis.2021.05.019.
[7] Méndez-Lucio, Oscar and Baillif, Benoit and Clevert, Djork Arné and Rouquié, David
and Wichard, Joerg, "De novo generation of hit-like molecules from gene expression
signatures using artificial intelligence," Nature Communications, vol. 11, no. 1, pp. 10,
2020. DOI: 10.1038/s41467-019-13807-w.
[8] Janssen, Alexander and Smalbil, Louk and Bennis, Frank C. and Cnossen, Marjon H.
and Mathôt, Ron A.A., "A Generative and Causal Pharmacokinetic Model for Factor
VIII in Hemophilia A: A Machine Learning Framework for Continuous Model Refine-
ment," Clinical Pharmacology and Therapeutics, vol. 115, no. 4, pp. 881–889, 2024. DOI:
10.1002/cpt.3203.
[9] Ota, Ryosaku and Yamashita, Fumiyoshi, "Application of machine learning techniques
to the analysis and prediction of drug pharmacokinetics," Journal of Controlled Release,
vol. 352, pp. 961–969, 2022. DOI: 10.1016/j.jconrel.2022.11.014.
[10] Horne, Robert I. and Wilson-Godber, Jared and González Díaz, Alicia and Brotza-
kis, Z. Faidon and Seal, Srijit and Gregory, Rebecca C. and Possenti, Andrea and
Chia, Sean and Vendruscolo, Michele, "Using Generative Modeling to Endow with Po-
tency Initially Inert Compounds with Good Bioavailability and Low Toxicity," Jour-
nal of Chemical Information and Modeling, vol. 64, no. 3, pp. 590–596, 2024. DOI:
10.1021/acs.jcim.3c01777.
[11] Ghayoor, Ali and Kohan, Hamed Gilzad, "Revolutionizing pharmacokinetics: the dawn
of AI-powered analysis," Journal of Pharmacy & Pharmaceutical Sciences, vol. 27, pp.
12671, 2024.
[12] Menke, Janosch and Koch, Oliver, "Using Domain-Specific Fingerprints Generated
through Neural Networks to Enhance Ligand-Based Virtual Screening," Journal
of Chemical Information and Modeling, vol. 61, no. 2, pp. 664–675, 2021. DOI:
10.1021/acs.jcim.0c01208.
[13] Qureshi, Rizwan and Irfan, Muhammad and Gondal, Taimoor Muzaffar and Khan, She-
heryar and Wu, Jia and Hadi, Muhammad Usman and Heymach, John and Le, Xiuning
and Yan, Hong and Alam, Tanvir, "AI in drug discovery and its clinical relevance,"
Heliyon, vol. 9, no. 7, 2023. DOI: 10.1016/j.heliyon.2023.e17575.
[14] Zhuang, Dylan and Ibrahim, Ali K., "Deep learning for drug discovery: A study of identi-
fying high efficacy drug compounds using a cascade transfer learning approach," Applied
Sciences (Switzerland), vol. 11, no. 17, pp. 7772, 2021. DOI: 10.3390/app11177772.
[15] Gangwal, Amit and Lavecchia, Antonio, "Unleashing the power of generative AI
in drug discovery," Drug Discovery Today, vol. 29, no. 6, pp. 103992, 2024. DOI:
10.1016/j.drudis.2024.103992.
[16] Rose, Tyler and Monti, Nicolò and Anand, Navvye and Shen, Tianyu, "PLAPT:
Protein-Ligand Binding Affinity Prediction Using Pretrained Transformers," bioRxiv,
pp. 2024.02.08.575577, 2024.
[17] Madani, Ali and McCann, Bryan and Naik, Nikhil and Keskar, Nitish Shirish and
Anand, Namrata and Eguchi, Raphael R. and Huang, Po-Ssu and Socher, Richard, "Pro-
Gen: Language Modeling for Protein Generation," arXiv preprint arXiv:2004.03497,
2020.
[18] Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Ka-
plan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and
Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and
Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and
Ziegler, Daniel M. and Wu, Jeffrey and Winter, Clemens and Hesse, Christopher and
Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin
and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and
Sutskever, Ilya and Amodei, Dario, "Language models are few-shot learners," Advances
in Neural Information Processing Systems, vol. 2020-December, pp. 1877–1901, 2020.
[19] Li, Yuesen and Gao, Chengyi and Song, Xin and Wang, Xiangyu and Xu, Yun-
gang and Han, Suxia, "DrugGPT: A GPT-based Strategy for Designing Potential
Ligands Targeting Specific Proteins," bioRxiv, pp. 2023.06.29.543848, 2023. DOI:
10.1101/2023.06.29.543848.
[20] Weininger, David, "SMILES, a Chemical Language and Information System: 1: Intro-
duction to Methodology and Encoding Rules," Journal of Chemical Information and
Computer Sciences, vol. 28, no. 1, pp. 31–36, 1988. DOI: 10.1021/ci00057a005.
[21] Michailidis, George and D’Alché-Buc, Florence, "Autoregressive models for gene reg-
ulatory network inference: Sparsity, stability and causality issues," Mathematical Bio-
sciences, vol. 246, no. 2, pp. 326–334, 2013. DOI: 10.1016/j.mbs.2013.10.003.
[22] Kim, Tae Kyung and Yi, Paul H. and Hager, Gregory D. and Lin, Cheng Ting,
"Refining dataset curation methods for deep learning-based automated tuberculosis
screening," Journal of Thoracic Disease, vol. 12, no. 9, pp. 5078–5085, 2020. DOI:
10.21037/jtd.2019.08.34.
[23] Stokes, Jonathan M. and Yang, Kevin and Swanson, Kyle and Jin, Wengong and
Cubillos-Ruiz, Andres and Donghia, Nina M. and MacNair, Craig R. and French,
Shawn and Carfrae, Lindsey A. and Bloom-Ackerman, Zohar and Tran, Victoria M.
and Chiappino-Pepe, Anush and Badran, Ahmed H. and Andrews, Ian W. and Chory,
Emma J. and Church, George M. and Brown, Eric D. and Jaakkola, Tommi S. and
Barzilay, Regina and Collins, James J., "A Deep Learning Approach to Antibiotic Dis-
covery," Cell, vol. 180, no. 4, pp. 688–702.e13, 2020. DOI: 10.1016/j.cell.2020.01.021.
[24] Borhani, Niloofar and Izadi, Iman and Motahharynia, Ali and Sheikholeslami, Mahsa
and Gheisari, Yousof, "DrugTar Improves Druggability Prediction by Integrating Large
Language Models and Gene Ontologies," bioRxiv, pp. 2024.09.21.614218, 2024. DOI:
10.1101/2024.09.21.614218.
[25] Roskoski, Robert, "Properties of FDA-approved small molecule protein kinase in-
hibitors: A 2024 update," Pharmacological Research, vol. 200, pp. 107059, 2024. DOI:
10.1016/j.phrs.2024.107059.
[26] Loftsson, Thorsteinn, "Physicochemical Properties and Pharmacokinetics," pp. 85–104,
2015. DOI: 10.1016/b978-0-12-801411-0.00003-2.
[27] Di, Li and Kerns, Edward H., "Chapter 1 - Introduction," in Drug-Like Properties (Sec-
ond Edition), 2nd ed., L. Di and E. H. Kerns, Eds. Boston: Academic Press, 2016, pp.
1–3. DOI: https://doi.org/10.1016/B978-0-12-801076-1.00001-0. Available at:
https://www.sciencedirect.com/science/article/pii/B9780128010761000010.
[28] Li, Bowen and Wang, Zhen and Liu, Ziqi and Tao, Yanxin and Sha, Chulin and He, Min
and Li, Xiaolin, "DrugMetric: quantitative drug-likeness scoring based on chemical space
distance," Briefings in Bioinformatics, vol. 25, no. 4, 2024. DOI: 10.1093/bib/bbae321.
[29] Kulkarni, V. S. and Alagarsamy, V. and Solomon, V. R. and Jose, P. A. and Mu-
rugesan, S., "Drug Repurposing: An Effective Tool in Modern Drug Discovery,"
Russian Journal of Bioorganic Chemistry, vol. 49, no. 2, pp. 157–166, 2023. DOI:
10.1134/S1068162023020139.
[30] Cozier, Gyles E. and Lubbe, Lizelle and Sturrock, Edward D. and Acharya, K.
Ravi, "Angiotensin-converting enzyme open for business: structural insights into the
subdomain dynamics," FEBS Journal, vol. 288, no. 7, pp. 2238–2256, 2021. DOI:
10.1111/febs.15601.
[31] Wishart, David S. and Knox, Craig and Guo, An Chi and Shrivastava, Savita and
Hassanali, Murtaza and Stothard, Paul and Chang, Zhan and Woolsey, Jennifer,
"DrugBank: a comprehensive resource for in silico drug discovery and exploration.,"
Nucleic acids research, vol. 34, no. Database issue, pp. D668–D672, 2006. DOI:
10.1093/nar/gkj067.
[32] The UniProt Consortium, "UniProt: the Universal Protein Knowledgebase in 2023," Nucleic Acids Research, vol. 51, no. D1, pp. D523–D531, 2023.
[33] Altschul, Stephen F. and Gish, Warren and Miller, Webb and Myers, Eugene W. and
Lipman, David J., "Basic local alignment search tool," Journal of Molecular Biology,
vol. 215, no. 3, pp. 403–410, 1990. DOI: 10.1016/S0022-2836(05)80360-2.
[34] Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Black-
shaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and
Mendez Lopez, David and Mosquera, Juan F and Others, "The ChEMBL Database
in 2023: A drug discovery platform spanning multiple bioactivity data types and time
periods," Nucleic Acids Research, vol. 52, no. D1, pp. gkad1004, 2023.
[35] Irwin, John J. and Tang, Khanh G. and Young, Jennifer and Dandarchuluun, Chinzorig
and Wong, Benjamin R. and Khurelbaatar, Munkhzul and Moroz, Yurii S. and Mayfield,
John and Sayle, Roger A., "ZINC20 - A Free Ultralarge-Scale Chemical Database for
Ligand Discovery," Journal of Chemical Information and Modeling, vol. 60, no. 12, pp.
6065–6073, 2020. DOI: 10.1021/acs.jcim.0c00675.
[36] Transformer Reinforcement Learning, [cited 2024]. Available at:
https://huggingface.co/docs/trl/en/index.
[37] Summarize from Feedback. Available at: https://github.com/openai/summarize-from-feedback.
[38] RDKit, [cited 2024]. Available at: https://www.rdkit.org/.
[39] Piñero, Janet and Bravo, Álex and Queralt-Rosinach, Núria and Gutiérrez-Sacristán,
Alba and Deu-Pons, Jordi and Centeno, Emilio and García-García, Javier and Sanz,
Ferran and Furlong, Laura I., "DisGeNET: A comprehensive platform integrating infor-
mation on human disease-associated genes and variants," Nucleic Acids Research, vol.
45, no. D1, pp. D833–D839, 2017. DOI: 10.1093/nar/gkw943.
[40] Bajusz, Dávid and Rácz, Anita and Héberger, Károly, "Why is Tanimoto index an
appropriate choice for fingerprint-based similarity calculations?," Journal of Chemin-
formatics, vol. 7, no. 1, pp. 1–13, 2015. DOI: 10.1186/s13321-015-0069-3.
[41] Morgan, H. L., "The Generation of a Unique Machine Description for Chemical Struc-
tures—A Technique Developed at Chemical Abstracts Service," Journal of Chemical
Documentation, vol. 5, no. 2, pp. 107–113, 1965. DOI: 10.1021/c160017a018.
[42] Hassan, Nafisa M. and Alhossary, Amr A. and Mu, Yuguang and Kwoh, Chee Keong,
"Protein-Ligand Blind Docking Using QuickVina-W with Inter-Process Spatio-Temporal
Integration," Scientific Reports, vol. 7, no. 1, pp. 15451, 2017. DOI: 10.1038/s41598-017-
15571-7.
[43] Madhavi Sastry, G. and Adzhigirey, Matvey and Day, Tyler and Annabhimoju, Ramakr-
ishna and Sherman, Woody, "Protein and ligand preparation: Parameters, protocols,
and influence on virtual screening enrichments," Journal of Computer-Aided Molecular
Design, vol. 27, no. 3, pp. 221–234, 2013. DOI: 10.1007/s10822-013-9644-8.
[44] Schrödinger Release 2024–2: LigPrep, 2024, Schrödinger, LLC: New York, NY.
[45] Lu, Chao and Wu, Chuanjie and Ghoreishi, Delaram and Chen, Wei and Wang, Lingle
and Damm, Wolfgang and Ross, Gregory A. and Dahlgren, Markus K. and Russell,
Ellery and Von Bargen, Christopher D. and Abel, Robert and Friesner, Richard A. and
Harder, Edward D., "OPLS4: Improving force field accuracy on challenging regimes
of chemical space," Journal of Chemical Theory and Computation, vol. 17, no. 7, pp.
4291–4300, 2021. DOI: 10.1021/acs.jctc.1c00302.
[46] Friesner, Richard A. and Banks, Jay L. and Murphy, Robert B. and Halgren, Thomas
A. and Klicic, Jasna J. and Mainz, Daniel T. and Repasky, Matthew P. and Knoll, Eric
H. and Shelley, Mee and Perry, Jason K. and Shaw, David E. and Francis, Perry and
Shenkin, Peter S., "Glide: A New Approach for Rapid, Accurate Docking and Scoring.
1. Method and Assessment of Docking Accuracy," Journal of Medicinal Chemistry, vol.
47, no. 7, pp. 1739–1749, 2004. DOI: 10.1021/jm0306430.
Article
In rare diseases, such as hemophilia A, the development of accurate population pharmacokinetic (PK) models is often hindered by the limited availability of data. Most PK models are specific to a single recombinant factor VIII (rFVIII) concentrate or measurement assay, and are generally unsuited for answering counterfactual (“what‐if”) queries. Ideally, data from multiple hemophilia treatment centers are combined but this is generally difficult as patient data are kept private. In this work, we utilize causal inference techniques to produce a hybrid machine learning (ML) PK model that corrects for differences between rFVIII concentrates and measurement assays. Next, we augment this model with a generative model that can simulate realistic virtual patients as well as impute missing data. This model can be shared instead of actual patient data, resolving privacy issues. The hybrid ML‐PK model was trained on chromogenic assay data of lonoctocog alfa and predictive performance was then evaluated on an external data set of patients who received octocog alfa with FVIII levels measured using the one‐stage assay. The model presented higher accuracy compared with three previous PK models developed on data similar to the external data set (root mean squared error = 14.6 IU/dL vs. mean of 17.7 IU/dL). Finally, we show that the generative model can be used to accurately impute missing data (< 18% error). In conclusion, the proposed approach introduces interesting new possibilities for model development. In the context of rare disease, the introduction of generative models facilitates sharing of synthetic data, enabling the iterative improvement of population PK models.
Article
In the early stages of drug development, large chemical libraries are typically screened to identify compounds of promising potency against the chosen targets. Often, however, the resulting hit compounds tend to have poor drug metabolism and pharmacokinetics (DMPK), with negative developability features that may be difficult to eliminate. Therefore, starting the drug discovery process with a “null library”, compounds that have highly desirable DMPK properties but no potency against the chosen targets, could be advantageous. Here, we explore the opportunities offered by machine learning to realize this strategy in the case of the inhibition of α-synuclein aggregation, a process associated with Parkinson’s disease. We apply MolDQN, a generative machine learning method, to build an inhibitory activity against α-synuclein aggregation into an initial inactive compound with good DMPK properties. Our results illustrate how generative modeling can be used to endow initially inert compounds with desirable developability properties.