ArticlePDF Available

Boosting the predictive performance with aqueous solubility dataset curation

Authors:

Abstract and Figures

Intrinsic solubility is a critical property in pharmaceutical industry that impacts in-vivo bioavailability of small molecule drugs. However, solubility prediction with Artificial Intelligence(AI) are facing insufficient data, poor data quality, and no unified measurements for AI and physics-based approaches. We collect 7 aqueous solubility datasets, and present a dataset curation workflow. Evaluating the curated data with two expanded deep learning methods, improved RMSE scores on all curated thermodynamic datasets are observed. We also compare expanded Chemprop enhanced with curated data and state-of-art physics-based approach using pearson and spearman correlation coefficients. A similar performance on pearson with 0.930 and spearman with 0.947 from expanded Chemprop is achieved. A steadily improved pearson and spearman values with increasing data points are also illustrated. Besides that, the computation advantage of AI models enables quick evaluation of a large set of molecules during the hit identification or lead optimization stages, which helps further decision making within the time cycle at drug discovery stage.
This content is subject to copyright. Terms and conditions apply.
1
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
Boosting the predictive
performance with aqueous
solubility dataset curation
Jintao Meng
Chen
 ✉ WahibYangZheng
Wei ✉ Feng ✉ & Wei iu
Intrinsic solubility is a critical property in pharmaceutical industry that impacts in-vivo bioavailability




thermodynamic datasets are observed. We also compare expanded Chemprop enhanced with curated

Chemprop
is achieved. A steadily improved pearson and spearman values with increasing data points are also



Introduction
Aqueous solubility is one of the critical factors dening the bio-availability of orally administrated drugs.
Reportedly, over 75% of oral drug development candidates have a low solubility based on the Bio-pharmaceutics
Classication System (BCS)01,2. To tackle this challenge, researchers are focusing on drug solubility improve-
ments with both physics-based Quantum Mechanics-Quantitative Structure Property Relationships
(QM-QSPR) approaches36 and data-driven articial intelligence (AI) methods711.
e development of QM-QSPR approaches provides a large number of computational methods for aque-
ous solubility prediction starting from a molecular structure36. e majority of these methods try to explore
fundamental physics-based rules with a sublimation thermodynamic cycle solubility approach2,12 on crystalline
drug-like molecules. is approach is an interplay between crystal packing and molecular hydration free energy
contributions1215. With this approach, a crystal packing contribution to the drug solubility typically requires a
sublimation energy estimation from crystal lattice calculations1214, molecular dynamics simulations16, or QSPR
statistical models15,17. e free energy of solvation may be estimated by a variety of approaches, including QSPR
models, monte carlo simulations, and QM-based methods18. Recently, a study of guiding lead optimization2 was
proposed. It explicitly describes the solid-state contribution, and the superior performance of the QM-based
thermodynamic cycle approach is demonstrated in the optimization of two pharmaceutical series. e main
limitations of the physics-based QM-QSPR approaches are the large compute requirements and long run time.
For example, guiding lead optimization2 relies on crystal structure prediction calculations19, which may require
several days on a powerful cloud infrastructure consisting of millions of CPU cores.
Early AI-based approaches for solubility prediction involve the application of logistic regression7, random
forests8 and convolutional neural networks9 to expert-engineered descriptors10,11 or molecular ngerprints
such as the Dragon descriptors or Morgan (ECFP) ngerprints2022. eir predictive accuracy or equivalent
root mean square errors (RMSE) is limited to 0.7–1.0 log. Recent research eorts are focused on graph learn-
ing2326 of the underlying topology of molecule structure using SMILES strings27. Such models extract their
1Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, 518000, China. 2National Institute of Advanced
Industrial Science and Technology, Tokyo, Japan. 3RIKEN Center for Computational Science, Hyogo, Japan. 4XtalPi.
Inc, Shenzhen, 518000, China. 5National Supercomputer Center in Shenzhen, Shenzhen, 518000, China. 6Tencent AI
Lab, Shenzhen, 518000, China. e-mail: chin.hou@aist.go.jp; yj.wei@siat.ac.cn; fengsz@nsccsz.cn


Content courtesy of Springer Nature, terms of use apply. Rights reserved
2
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
own expert features directly from atoms and edges, and embed them with graph convolutional networks. An
experiment in MoleculeNet23 on solubility prediction of ESOL dataset ranks Message Passing Neural Network
(MPNN) as the best graph learning model with a predictive accuracy of 0.58 among other graph models, such
as WEAVE28 and GraphConv26. Chemprop24 which embeds molecule-level features and extends the Message
Passing Neural Network (MPNN) with Directed MPNN, further improves the predictive accuracy on ESOL to
0.56. AttentiveFP29 is the rst work applying attention mechanism with a graph neural network and reports the
lowest accuracy of 0.503 on ESOL dataset. ese deep learning based approaches are trying to model complex
physicochemical properties with a QSPR statistical approach, however their exibility and capacity of capturing
those complex relationships are still bounded by the availability of high quality data3032.
e measurement and dataset diversity gap between the AI-based and QM-QSPR approaches are two crit-
ical issues hindering the research on combining these two approaches. For AI-based approaches in particular,
dierent papers evaluate their work on dierent datasets, using dierent workows, or even with dierent meas-
urements. In most cases, this becomes the rst obstacle preventing readers from objectively distinguishing the
viability of the proposed AI-approaches. More importantly, to the authors knowledge, no previous work con-
ducted any comparison to evaluate both AI-based and QM-QSPR approaches under the some measurements
with an open available dataset. is situation also inhibits any quantitative analysis from exploring the advan-
tages and disadvantages of these two approaches, and the possibilities of combining them to achieve additional
progress.
In term of data curation methodologies, previously Eriksson’s work published in 200333, takes prepossess-
ing techniques (scaling and centering), data correction, and transformations to improve the regression model’s
performance on Quantitative Structure-Activity Relationship (QSAR). ere are three dierent points between
our work and Erikssons work. Firstly, Our work is focused specially on solubility instead of QSAR. e data
correction using signal correction actually cannot work on our dataset, as there are no relationship between
the solubility value and undesired variation arising from light-scattering eects, baseline dri, nonlinearities,
and so forth. Secondly, our work is exploring data curation methodology for nonlinear deep learning model
using graph neural networks, whereas Erikssons work33 is targeting on linear regression model. last, our work
on is focused on data curation methodology itself. Eriksson’s work needs the prepossessing techniques (scaling
and centering) and transformations steps to avoid large inuence on the model and dominating over the other
measurements from unbalanced data composition. However, these problems has been resolved in our work by
using scaold data partition. Our work is the only work focused on inter-dataset redundancy and intra-dataset
redundancy, it is a novel technique not yet presented by any previous work.
To conclude the above discussion, solubility prediction with AI-based methods still face the following three
challenges:
1. e volume of training data in previous works, such as the ESOL dataset, is limited. Training and evalua-
tion on these small datasets do not necessarily oer good performance for our problems. ese datasets are
also insucient for sophisticated models attempting to learn massive physical-chemical rules and converge
to a stable state.
2. Data curation methods or tools for low-quality aqueous solubility data are still lacking. Directly training on
data with poor quality may aect the predictive accuracy.
3. None of the previous studies pose a comparison of the predictive accuracy between leading deep learning
and state-of-the-art QM-QSPR approaches. Analyzing and determining the advantages and disadvantages
of deep learning methods in comparison with the QM-QSPR approaches is also critical but dicult to
achieve.
To resolve the above issues and rene the research problem of solubility prediction for AI, our contributions
are threefold:
1. e rst large-scale dataset for AI research on aqueous solubility is collected. is dataset contains seven
aqueous solubility datasets including both thermodynamic and kinetic data. e number of records in
these datasets ranges between a few thousand to several hundreds of thousands.
2. is work is the rst to improve the aqueous solubility predictive accuracy with a data curation method.
We present a data curation workow of ltering, evaluating and clustering. is workow adds solubility
quality to each record and curates records sharing similar solubility among dierent datasets. We also
expand two leading deep learning methods, i.e., Chemprop24 and AttentiveFP29, to support data quality
during the training and evaluation process. Using these expansions of the Chemprop and AttentiveFP deep
learning methods, improved predictive accuracy is observed on all thermodynamic datasets.
3. is work is also the rst to compare deep learning and QM-QSPR approaches using the pearson and
spearman’s rank-order correlation coecients by predicting four pharmaceutical series of 48 molecules.
Abramov’s guiding lead optimization and weighted Chemprop are selected as the representatives for both.
By predicting the rst two pharmaceutical series of 31 molecules, Abramov’s approach demonstrates a
pearson correlation coecient r2 of 0.905 and spearman’s rank-order correlation coecient Rs of 0.967.
Weighted Chemprop (expanded to support the high data quality) is trained on the curated dataset yielding
improvement in its r2 and Rs values. It increases steadily with the increase in training data volume and fur-
ther achieves comparable performance on r2 with 0.930 and Rs with 0.947. In comparison with Abramov’s
approach, which requires a large compute resources, predicting the thousands of target compounds with
deep learning approach takes only seconds on a common desktop computer.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
3
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
e rest of this paper is organized as follows. e collection and description of seven datasets, together with
our data curation workow, are illustrated in Section methods. Section results compares the deep learning and
QM-QSPR approaches and then discusses the benets of data curation. Section discussion explains the innova-
tions and contributions this work make towards molecule property prediction.
Methods
Datasets. We collected molecules labeled with aqueous solubility from publicly available databases or data-
sets provided by previous papers, resulting in the 7 datasets shown in Table1. Among them, the rst three data-
sets were evaluated by previous papers11,23,24,34,35 but are limited in number of samples or records, while the last
four datasets have larger number of samples with poor data quality. We also include both thermodynamic and
kinetic datasets; the rst six are the thermodynamic datasets, while the last is the kinetic set36.
Table1 demonstrates the statistical information of each dataset. Every dataset is processed separately to have
the same standardized form. e data extraction process and standardization methods applied for each dataset
are described below.
• AQUA. is dataset was taken from the work of Huuskonen34 and Tetko11, with 1311 records on 1307 mol-
ecules downloaded from the ALOGPS homepage at http://146.107.217.178/lab/alogps/logs.txt. e exper-
imental aqueous solubility value is measured between 20–25 °C and obtained partly from the AQUASOL
database of the University of Arizona and SCR’s PHYSPROP database.
• PHYS. is dataset is a curated PHYSPROP database consisting of a collection of datasets in SDF format. An
automated KNIME workow37 is used to curate and correct errors in the structure and identity of chemicals
using the publicly available PHYSPROP datasets. Here, we extract 2024 molecules with a water solubility
(WS) endpoint. e quality of each record is measured with stars from 1 to 5; thus, the data quality property
of “STAR_FLAG” is reserved, and nally 2010 records is reserved.
• ESOL. e original ESOL dataset, containing 1144 records, was rst used by35, and then its veried ver-
sion was evaluated in23,24. We downloaded its veried version with 1128 records from Chemprop’s repository
https://github.com/Chemprop/Chemprop as our ESOL dataset to keep it consistent with previous works23,24.
• OCHEM. is dataset is taken from the OCHEM database of WS at https://ochem.eu/. We reserve 6525
rows from 36,450 records by selecting molecules with the dataset type “Training” to reserve molecules with
experimental solubility values.
• AQSOL. is dataset38 combines 9 datasets, including the AQUA and ESOL datasets. A preprocessing step
is used to lter this dataset by merging repetitive molecules, with 9982 records remaining. According to the
number of occurrences in the 9 original datasets, a new property called “group” is added to this dataset by
using a classication strategy that can group this dataset into 5 groups. We keep “group” in this dataset for
further assignment of the weights to identify the data quality of each record.
• CHEMBL. is dataset is extracted from CHEMBLs activity database, which includes 15,996,368 records at
https://www.ebi.ac.uk/chembl. We lter this dataset with the assay type “physicochemical” and then select
40,520 records with the standard type “Solubility” or “solubility” as our dataset. Several dierent units are
used for the aqueous solubility measurement, such as nM, ug/mL, and ug.mL-1. All the units are converted
to standard “LogS” units. We nd 4,543 records are kinetic solubility data and 17 records are using oil as the
solvent. us we further removed all these records to clean our CHEMBL dataset from kinetic solubility data.
Finally 30,099 valid records are reserved. In addition, the column “Comment” describing the temperature and
pH of the experiment is kept for later weight assignment on the data quality of each record.
• KINECT. is dataset is taken from the OCHEM database of WS based on the Kinect technology at https://
ochem.eu/. 164,273 records described in SDF format are extracted and collected into this dataset. In addi-
tion, the columns of properties “SMILES”, “LogS value”, “pH value” and “Temperature” are also extracted and
reserved for quality weight assignment.
Dataset
No. of Records in
Weights Additional Columns
of Org DatasetOrg Cln Cure
AQUA 1311 1311 1354 1.0
PHYS 2010 2001 2001 1.0 star_ag
ESOL 1128 1116 1157 1.0
OCHEM 6525 4218 3766 0.85
AQSOL 9982 8701 9061 0.4 group
CHEMBL 30099 30099 28675 0.8 comment
KINECT 164273 82057 81935 — temperature, pH value
Tab le 1. Statistical information of the number of records in the 7 collected datasets. “Org” is the original
dataset, “Cln” denotes the dataset aer Data Filtering, “Cure” is the dataset aer Data Curation using the
clustering algorithm across multiple datasets, “Weights” denotes the assigned weights for each dataset to
identify the dataset quality, and “Additional Columns of Org Dataset” includes special properties reserved by
some of the datasets.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
4
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
Data curation. Due to the various experiment environments, workows and non-unique identications, the
records in the aqueous solubility datasets are repetitive, erroneous or even contradictory to each other37,38. Note
that molecules with the same SMILES may be dierent tautomers39, and thus have dierent solubility value. As
SMILES can not distinguish tautomers thus we just keep them into dierent records but with dierent solubility
value. We merge the two records only when the dierence of these two value is less than 0.5.
e development of reliable data-driven deep learning models, however, may be hindered by uncertainties
and disagreements in these repetitive records, which are obtained from many disparate data sources. Training
data with systematic errors from dierent experimental methodologies potentially limit the predictive accuracy
of deep learning models. To improve the predictive accuracy of deep learning methods and achieve a better
generalization ability from low-quality and confusing data, a curation method delivering high-quality data,
balanced on substructure classes and sucient in terms of the data volume, is vitally important.
We present a data curation workow of ltering, evaluating and clustering for the above 7 datasets as illus-
trated in Fig.1. e workow tries to improve the dataset quality by data ltering, a quality evaluation and then
cross-dataset correction among dierent datasets with a clustering algorithm. Finally, an evaluation with two
leading deep learning methods, i.e., Chemprop and AttentiveFP, demonstrates the benets of this workow in
predictive accuracy improvement based on the RMSE over all thermodynamic datasets.
Data filtering. To resolve the standardization of the molecule expressions, uncertainties from various
experiment environments, and weight bias from repetitive data, the data ltering strategy is proposed with
the following three steps: SMILES standardization, experiment environmental control, and repetitive record
normalization.
• SMILES standardization First, each molecule has only one unique SMILES expression in dierent databases.
MolVS (described at https://molvs.readthedocs.io/en/latest/) is used to standardize all chemical structures
and maintain one unique standard SMILES for each molecule. Any molecule that fails to pass our standardi-
zation procedure is removed from the dataset.
• Experiment environmental control Second, we target the aqueous solubility prediction of small molecules
in drug design. us, the experiment environment of molecules with temperatures of 25 ± 5 °C and pH val-
ues of 7 ± 1 are highly valued; any records beyond our scope are ranked low or even removed. Any molecule
used for drug design should be poison-free. For this reason, molecules with heavy metals such as “U, Ge, Pr,
La, Dy, Ti, Zr, Rh, Lu, Mo, Sm, Sb, Nd, Gd, Cd, Ce, In, Pt, Sb, As, Ir, Ba, B, Hg, Se, Sn, Ti, Fe, Si, Al, Bi, Pb, Pd,
Ag, Au, Cu, Pt, Co, Ni, Ru, Mg, Zn, Mn, Cr, Ca, K, Li” are ltered from all datasets. “SF5,SF6” are also cleaned,
as they are rarely used in drug design.
• Repetitive record normalization ird, some datasets contain repetitive molecules with equal or dierent
solubility values. According to the frequency of occurrence, repetitive record normalization is carried out to
assign weights to each molecule, with a total weighted value of 1.0, to prevent those molecules with repetitive
values from gaining larger parameter update weights during the model training process.
e number of data records before and aer our data ltering is presented in Table1. For each cleaned data-
set, the available information in terms of the name, description, and column type are presented in Table2. In
the end, 1311, 2001, 1116, 4218, 8701, 30,099, and 82,057 records are in cleaned AQUA, PHYS, ESOL, OCHEM,
AQSOL, CHEMBL, and KINECT datasets, respectively.
Quality evaluation. Quality evaluation is performed to analyze, evaluate and assign each dataset with an appro-
priate weight to identify its quality. We rst analyze the molecule redundancy among dierent datasets with
identical or dierent solubility values. en, we expand Chemprop and AttentiveFP to support the data quality
weights and refer to them as weighted Chemprop and weighted AttentiveFP. weighted Chemprop is used to
Fig. 1 e data curation workow of ltering, evaluating, and clustering on the 7 collected datasets.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
5
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
evaluate each dataset’s predictive accuracy (measured in RMSE) to identify the dataset quality. Finally, each
dataset is assigned a weight indicating its data quality.
e existence of data redundancy in repetitive records generate bias in the model training process and evalu-
ation metric. Several data redundancies can be found both within and among the datasets. ese data redundan-
cies can be classied into two classes: those in which a given molecule was found in two records with identical
solubility values and those in which a given molecule was found in two records with dierent solubility values.
Here, we dene solubility values with a 0.01 LogS unit dierence between two records as identical. Notably, these
redundancies can be found in two records from a single dataset or from two dierent datasets. e former case
is normalized rst by repetitive record normalization, as discussed in the previous subsection; thus, there is no
molecules sharing the same value occur twice in a single dataset.
With the above denitions, two redundancy matrices are collected, as presented in Fig.2, where the percent-
ages of repetitive molecules with the same and dierent solubility values are presented in the upper and lower
tables, respectively. e rows or columns of these two tables represent the corresponding datasets. e percent-
age of repetitive molecules with the same solubility value between two datasets i and j is represented as Aij, and
that with dierent solubility values is represented as Bij. For example, AESOL,PHYS = 43.01 indicates that 43.01%
of the records (one molecule can have multiple records) in the ESOL dataset can be found in the PHYS dataset
with the same solubility value. As another example, BCEHMBL,CHEMBL = 25.13 reveals that 25.13% of the records
in the CHEMBL dataset can be found sharing the same molecule but with dierent solubility values in the same
dataset. Note that the two redundancy matrices in Fig.2 are not symmetric for dierent dataset sizes. e sum
of Aij and Bij for corresponding datasets i and j can be beyond 100%, as given a record from dataset i, a molecule
in dataset j can have multiple records and thus can share both the same and dierent solubility values with the
same molecule in other datasets.
A preliminary analysis of the records in and between datasets reveals the potential value of data curation.
In the upper table of Fig.2, approximately half of the records in AQUA, PHYS, and ESOL share the same sol-
ubility values. Approximately 77–98% and 66% of the records in AQUA, PHYS and ESOL are contained in
OCHEM and AQSOL, respectively. In the lower table, 9–30% of the records share dierent solubility values
among AQUA, PHYS and ESOL. More than 24% and 30% of the records in these three datasets share dierent
solubility values. CHEMBL has its own speciality. Both tables conrm that CHEMBL contains few records from
other datasets, and the lower table conrms that one-quarter of the records in CHEMBL have diverse solubility
values. Our intuition on data curation is to make use of the above record redundancies. In practice, a record for
a given molecule with the same solubility value in more than one dataset can help us to improve the condence
regarding its data quality. Likewise, a record with dierent values among datasets can decrease the condence
with regard to its data quality. is is the fundamental dierence between our work and a previous work38 as
a result of selecting those records with multiple occurrences. us, the percentages of both inter-dataset and
intra-dataset record redundancies will determine the eectiveness of our data curation method.
To analyze the quality of each dataset, one of the leading graph learning methods named as Chemprop is
selected to evaluate all 7 datasets, with the predictive accuracy used as a reference. Both random and scaold
splitting are used in this evaluation. Here random splitting randomly splits samples into training, validation, and
test three subsets. Scaold40 splitting splits the samples based on their two-dimensional structural frameworks
Column Name Description Type
Smiles SMILES representation of compound String
LogS Experimental aqueous solubility value
(LogS) String
Weight weighted quality score in [0, 1] Float
Tab le 2. List of information for all cleaned and curated datasets in terms of the name, description, and type of
each column.
Fig. 2 Redundancy matrices showing the percentage of repetitive molecules between two datasets. e upper
table Aij summarizes the percentages of molecules with the same solubility values, and the lower table Bij
describes the percentages of molecules with dierent solubility values.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
6
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
as implemented in RDKit. Scaold splitting is a useful way of organizing the structural data to group the atoms
of each drug molecule into ring, linker, framework, and side chain atoms. Considering that random splitting of
molecular data isn’t always best for evaluating machine learning methods. Scaold splitting is also applied in
our evaluation. For the original datasets, we train each dataset with Chemprop using both random and scaold
data partition ratios of [0.8, 0.1, 0.1] for training, testing, and evaluation. Moreover, we ensemble 5 models to
improve the model accuracy and record the average RMSE value and its condence intervals by running each
ensembled model 8 times. e RMSE value of the original dataset is recorded and collected in the third column
of Table3. Multiple dierent solubility values for a given molecule among the datasets are normalized on weights
according to the statistical distribution of the molecule determined by the previously discussed data ltering
process. However, Chemprop does not support weighted quality scores for records in a cleaned dataset. us, we
expand the training and evaluation codes of Chemprop to support training over weighted records and rename
it weighted Chemprop. As a result, a record with higher weighted quality has a contributes to a larger extent in
the parameter update, whereas records with lower weights have a smaller eect. Note that when dataset contains
no weights, weighted Chemprop treats each record equally and acts the same as Chemprop. Trained on these 7
cleaned datasets, the corresponding prediction accuracy measured with the RMSE is collected in the fourth
column of Table3.
Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative
data, detailed denition of RMSE is presented at https://en.wikipedia.org/wiki/Root-mean-square_deviation.
Assume that there are n records in the test subsets, formally the RMSE of this test subsets is dened as follows:
=
=
RMSE yy
n
()
i
nii
0
12
Here,
yyy,, ,n01 1
are predicted values. y0, y1, , yn1 are observed values. n is the number of records in the
test subsets. As all the clean and cure dataset contains quality weights, we must update RMSE for both evaluation
and test subset to use the weighted records during our training process. us we updated the evaluation metric
with weighted records, the denition of our weighted RMSE is described as below. Assume that there are n
records in the test subsets, formally the weighted RMSE of this test subsets is dened as follows:
RMSE wyy
n
()
i
niii
0
12
=∗−
=
Here,
yyy,, ,n01 1
are predicted values. y0, y1, , yn1 are observed values. w0, w1, , wn1 are quality
weights of each records. n is the number of records in the test subsets. In the original dataset, there are no quality
weights, we treat each record with unit weights by default to calculate its weighted RMSE. As you can see, the
weighted RMSE will be same as RMSE when using unit weights for original dataset. us the weighted RMSE is
a comparable metric across the original, clean and cure datasets, and in this paper we use RMSE for simple to
denote “weighted RMSE” for curated datasets. e original Chemprop is used on “Org” dataset, and the weighted
Chemprop is applied on both “Cln” and “Cure” datasets.
According to Table3, the six thermodynamic datasets can be split into two groups. e rst group includes
AQUA, PHYS, ESOL and OCHEM, and the second group includes AQSOL and CHEMBL. e datasets in the
Split Type Dataset
RMSE & Condence Intervals
Org Cln Cure
Random
AQUA 0.573 ± 0.037 0.583 ± 0.057 0.536 ± 0.042
PHYSP 0.550 ± 0.026 0.600 ± 0.032 0.515 ± 0.018
ESOL 0.596 ± 0.075 0.619 ± 0.044 0.512 ± 0.047
OCHEM 0.548 ± 0.024 0.639 ± 0.044 0.522 ± 0.017
AQSOL 1.023 ± 0.035 0.820 ± 0.036 0.518 ± 0.022
CHEMBL 0.917 ± 0.017 0.811 ± 0.016 0.499 ± 0.011
KINECT 0.401 ± 0.003 0.431 ± 0.003 0.432 ± 0.003
Scaold
AQUA 0.850 ± 0.086 0.849 ± 0.075 0.697 ± 0.043
PHYS 0.833 ± 0.058 0.813 ± 0.115 0.691 ± 0.092
ESOL 0.854 ± 0.097 0.808 ± 0.090 0.711 ± 0.073
OCHEM 0.847 ± 0.067 0.808 ± 0.075 0.695 ± 0.061
AQSOL 1.073 ± 0.062 0.968 ± 0.045 0.596 ± 0.033
CHEMBL 1.040 ± 0.038 0.900 ± 0.049 0.555 ± 0.031
KINECT 0.433 ± 0.015 0.461 ± 0.008 0.460 ± 0.008
Tab le 3. e collected RMSE and condence intervals of Chemprop or weighted Chemprop trained on the 7
datasets. e data partition strategies include both random and scaold strategies. Five models are ensembled
to improve the model accuracy. We average the RMSE by running each model 8 times and then calculate
the corresponding condence interval. e original Chemprop is used on “Org” dataset, and the weighted
Chemprop is applied on both “Cln” and “Cure” datasets.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
7
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
rst group have smaller populations and relatively lower RMSE values; we denote the datasets in this group as
high-quality datasets. e second group has massive records and higher RMSE values in both the original and
cleaned datasets; thus, the two datasets are regarded as low-quality datasets. Due to the change in the evaluation
metric with the weighted records in the high-quality datasets and KINECT dataset, the predictive accuracy of
each clean dataset has a 10% increase in the RMSE using a random partition compared with the original data-
set. At the same time, as we take “group” and “comment” as references to carry out weight assignment for each
record in the low-quality datasets, weighted Chemprop learns over the quality weights aer repetitive record
normalization and then benets from a slightly decrease in predictive accuracy (lower is better).
With the above analysis, we can initialize and assign a quality weight for each dataset. e assigned quality
weight is used for data curation in the following section. e assigned weights are distributed in [0, 1], with a
value close to 1 indicating high data quality. e assigned weights for these six thermodynamic datasets are
listed in the h column of Table1. e KINECT dataset is the only kinetic-based dataset; thus, no weighted
quality is set. e weights in Table1 are presented as an example to show a relative ranking in terms of the data
quality among the dierent datasets, and the specic weight for each dataset can still be adjusted. Searching for
and evaluating a better weight assignment require extremely large compute power, e.g., one round of evaluation
generating all the data in Table3 costs approximately two weeks using 1200 compute nodes (38,200 cores and
4800 GPU accelerators) in the National Supercomputer Center in Shenzhen. erefore, we estimate the weights
in Table1 from our rst intuition and then calculate the corresponding predictive accuracy results in Table3.
Data clustering. is work is the rst to curate data using inter-dataset redundancy and intra-dataset redun-
dancy. ree curation guidelines are followed to take advantage of these datasets with potential redundancy:
A dataset with a higher quality weight can be used to curate a dataset with a lower weight. e nal quality
weight of a record from a dataset can be calculated by multiplying the weight of the record itself by the assigned
weight of the dataset. Records with similar solubility values for a given molecule can be merged by averaging
their solubility values over their weights.
First, a curation schedule following guideline is designed, as demonstrated in Fig.3. Previously, we
divided the six thermodynamic datasets into two groups: a high- and low-quality group. As illustrated in Fig.3,
one can curate a dataset with other datasets in the same group with higher or equal weights, which is denoted
as inter-group curation. A dataset in the high-quality group can be used to curate a dataset in the low-quality
group, which we refer to as intra-group curation. No other operations are allowed.
Second, a record clustering and curation workow is adapted to implement guideline . Given a set of n
cleaned datasets D[i], each records is initialized with our workow aims to curate D[n 1] with datasets D[0],
, D[n 2]. Our curation workow contains three steps: (1) We merge all input datasets D[i] and reserve all
the records with the same compound contained by dataset D[n 1] as a new dataset T. (2) For each molecule
with multiple solubility values, a partial clustering algorithm, illustrated in Algorithm 1, is adopted to merge
these records. en, we update the solubility values and weights with the equation listed by line 5 and line 6 in
algorithm 1 for each molecule in T. (3) We accumulate the total weights for each molecule and truncate the max-
imum total weights with a given threshold. en, the weights for each record are normalized in T. By adjusting
threshold, those molecules occurring in multiple datasets and thus accumulating high total weights larger than
threshold become highly valued, and those molecules with total weights less than threshold become devalued.
ird, the partial clustering algorithm mentioned above is designed, as presented in Algorithm 1, to cure
the records following guideline . In each while loop, the two closest solubility values for a given molecule are
selected and merged if their dierence is less than a given parameter d. e two records are merged by averaging
their solubility values over their weights, and their two weights are summed as the new quality weight. If the dif-
ference between the two closest values is larger than d, the while loop ends and the merged records are updated
as the new record. For the parameter d, we recommend using 0.5 as suggested in11.
Fig. 3 Data curation schedule for the 6 thermodynamic datasets. e datasets are divided into 2 groups: high
quality and low quality groups. Two curation operations, i.e., inter-group curation and intra-group curation, are
illustrated. e feasible curation operations for each dataset are denoted by the lines. For example, AQUA can be
curated with the AQUA, PHYS, and ESOL datasets, and AQSOL can be curated with all dataset in high quality
group, and CHEMBL.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
8
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
Algorithm 1 Partial Clustering Algorithm.
e above workow is developed and open-sourced in our repository
https://github.com/Mengjintao/Chemprop. e seven curated datasets are collected by applying this work-
ow, and then weighted Chemprop is trained on these datasets. For the best ensembled models, the solubility
prediction accuracy values measured in terms of the RMSE are summarized in Table3. Here the lowest RMSE
value is recorded for KINECT, being as low as 0.432 (with a condence interval of 0.003). ESOL is a widely used
benchmark in previous research, and its RMSE score decreases from 0.596 (0.56 reported by Chemprop with
Bayesian optimization24) to 0.512, i.e., a 0.084 LogS unit decline aer data curation. On other datasets with a
random data partition, the RMSE values of weighted Chemprop benet from a dramatic decline of 0.037, 0.035,
0.026, 0.505, and 0.418, respectively, on the curated AQUA, PHYS, OCHEM, AQSOL, and CHEMBL datasets.
With scaold data partition, the RMSE values decreasing by 0.153, 0.142, 0.152, 0.477, and 0.485, respectively.
e model trained on the curated KINECT dataset, however, records an increase in the RMSE value under both
random and scaold data partition, as the KINECT dataset is the only set of Kinect solubility data; hence, no
other dataset can be used to curate this dataset. Moreover, the limited inter-dataset redundancy demonstrated
in Fig.2 on the KINECT dataset also restricts our curation benets. Even with the above limitations, KINECT
dataset still contributes the lowest RMSE score among all datasets with both Random and scaold data partition.
In addition to Chemprop, we include another recently developed deep learning method named AttentiveFP
in our evaluation. AttentiveFP follows a traditional graph learning mechanism and allows non-local eects at
the intra-molecular level by applying a graph attention mechanism with multiple GRU layers. We also expand
the code of AttentiveFP to support data quality weights during training and evaluation. e GitHub repository
of weighted AttentiveFP is https://github.com/Mengjintao/AttentiveFP. An evaluation workow similar to that
of Chemprop is used, ensembling multiple AttentiveFP models in several folds. e RMSE values and condence
intervals of AttentiveFP on all 7 datasets are collected in Table4. A similar trend with a decreasing RMSE value
is illustrated in Table4. For example, AttentiveFP trained on the curated AQUA, PHYS, ESOL, OCHEM, and
AQSOL datasets achieves 0.067, 0.095, 0.03, 0.043, and 0.242 unit log decreases in the RMSE compared with the
original dataset using a scaold data partition.
All the evaluations demonstrated above in Tables3, 4 employ hyperparameter optimization with a grid
search approach. e grid search approach randomly selects 108 parameter combinations on ve key parame-
ters, and the lowest RMSE value is recorded. A larger search space may decrease the RMSE value further but will
not change the trend demonstrated in Tables3, 4; thus, we keep the same number of parameter combinations
in our search space during the entire work and do not enlarge the search space to reduce the training time and
computing resources.

e disparate statistical measurement and high quality datasets are the main obstacles to making an objective
comparison between deep learning and QM-QSPR approaches, in terms of solubility prediction. To conduct
a comparison, a dataset of 48 molecules is gathered from several previous works2,41,42. is dataset includes
four pharmaceutical series of 48 molecules, and none are contained in the 7 collected datasets. pearson and
spearman’s rank-order correlation coecients are used to evaluate the performances of the deep learning and
QM-QSPR approaches.
e correlation coecients of the predicted and observed values are the main concern for lead optimiza-
tion in compound design. e thermodynamic cycle solubility approach is a fundamental theory used in the
QM-QSPR approaches. In this approach, the log scale of the aqueous solubility value is linearly related to the
sublimation and hydration free energies. QM-QSPR approaches mainly focus on searching for extremely accu-
rate methods to calculate the sublimation and hydration free energies using a physics-based simulation at the
cost of enormous supercomputing power or quantum computation. us, instead of predicting the absolute
solubility values, the main goal of the QM-QSPR approaches is to evaluate the correlation coecient of the
solubility value with its two energy factors and then apply it in lead optimization. Two measurements are recom-
mended by one state-of-the-art work2 to evaluate the correlation coecient: the square of the pearson correla-
tion coecient r2 and spearman’s rank-order correlation coecient RS. e equation for the pearson correlation
coecient r is
Content courtesy of Springer Nature, terms of use apply. Rights reserved
9
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
=−−
.
=
=
=
r
xxyy
xx yy
()()
() ()
(1)
i
n
ii
i
nii
ni
0
1
0
120
12
Here, x is the vector of the predicted value, y is the vector of the true value, and
x
and
y
are the average values
of x and y, respectively. When r2 equal to 1, this indicates a perfect linear correlation between the observed and
predicted solubility values. spearmans rank-order correlation coecient RS can be calculated as
=−
=
Rdnn16()/( 1),
(2)
S
i
n
i
0
1
22
Where di is the dierence between the ranks of the measured and predicted solubilities of molecule i. Here, RS
equal to 1 indicates a perfect ranking of the predicted solubility values.
We compare the deep learning and QM-QSPR approaches on r2 and RS with the evaluation dataset of 48 mol-
ecules. e ensembled models resulting in the best RMSE value in Tables3, 4 are used to predict the evaluation
dataset with weighted Chemprop (Chemprop expanded to support data quality). In this evaluation dataset, 12
molecules of Benzoylphenylurea (BPU) derivatives and 19 molecules of Benzodiazepin (BDZ) derivatives com-
prise the rst 31 molecules2. Seven molecules with selective Cyclin-Dependent Kinase 12 (CDK) inhibitors42 and
10 molecules of Pyrazole and Cyanopyrrole Analogs (PCAs) comprise the last 17 molecules41. We collect the sta-
tistical results of r2 and RS on these 48 molecules for weighted Chemprop and plot them in Figs.4, 5, respectively.
Note that the r2 values of the QM-QSPR approach proposed by2 are 0.79, 0.83, and 0.905 on the BPU, BDZ, and
BPU&BDZ datasets, respectively. ey also report the Rs score on the BPU, BDZ, and BPU&BDZ datasets to be
0.87, 0.90 and 0.967 respectively. Currently, no statistical results have been given on PCAs and CDK inhibitors
by any of the QM-QSPR approaches.
In Fig.4, the r2 curves of weighted Chemprop on BPU, BDZ, and BPU&BDZ increase steadily to 0.90, 0.62,
and 0.93, respectively. e curve of CDK for weighted Chemprop increases to 0.48 on the CHEMBL dataset.
For PCAs, the curves for both Chemprop and weighted Chemprop show no clear correlations due to small data
size. e jitter curves of Chemprop in most cases with lower r2 values reveal that low-quality data in the training
Split Type Dataset
RMSE & Condence Intervals
Org Cln Cure
Random
AQUA 0.616 ± 0.027 0.639 ± 0.014 0.579 ± 0.020
PHYS 0.649 ± 0.019 0.643 ± 0.013 0.551 ± 0.024
ESOL 0.642 ± 0.017 0.641 ± 0.025 0.594 ± 0.022
OCHEM 0.6018 ± 0.012 0.651 ± 0.020 0.6016 ± 0.010
AQSOL 0.826 ± 0.027 0.760 ± 0.012 0.593 ± 0.004
Scaold
AQUA 0.743 ± 0.038 0.747 ± 0.031 0.676 ± 0.038
PHYS 0.782 ± 0.037 0.789 ± 0.037 0.687 ± 0.038
ESOL 0.761 ± 0.048 0.801 ± 0.043 0.731 ± 0.073
OCHEM 0.746 ± 0.011 0.779 ± 0.019 0.703 ± 0.016
AQSOL 0.872 ± 0.017 0.842 ± 0.019 0.630 ± 0.008
Tab le 4. e collected RMSE and condence intervals of AttentiveFPwhen trained on the 7 datasets29. e
data partition strategies include both random and scaold partitioning, and the partition ratio is [0.8, 0.1, 0.1]
for training, testing, and evaluation. In this experiment, 5 models are ensembled 8 times to average the RMSE
values and calculate the corresponding condence interval. Because AttentiveFPis time consuming on a very
large dataset, the CHEMBL and KINECT datasets are not recorded, as their training times are longer than 150
hours. e original AttentiveFP is used on “Org” dataset, and e weighted AttentiveFP is applied on both “Cln”
and “Cure” datasets.
Fig. 4 Comparison of r2 values for ensembled models with the best RMSE scores in Table3 for Chemprop (le
gure) or weighted Chemprop (right gure) when predicting 48 molecules.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
10
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
datasets aect the model performance. Specically, data curation poses a negative eect on r2 for some special
datasets, for example, the CHEMBL dataset. is outcome may indicate that the actual data quality of this data-
set should be higher than the value we set, and thus, the data may be polluted by other datasets, resulting in poor
performance. When comparing Chemprop with original dataset and weighted Chemprop with curated dataset,
the results in the le side of Fig.4 shows no clear trends or gradation on both increasing training dataset size in
x axis or increasing prediction dataset size on BPZ, BDZ, and BPZ&BDZ dataset in y axis. In the right side of
Fig.4 however we can conrm two trends from both x and y axis in our evaluation. Firstly on the x direction, the
r2 value increases steadily when the data size of the training dataset increasing from AQUA with one thousand
compounds to KINECT of hundred of thousands. Secondly on the y direction, the r2 value of BPZ&BDZ dataset
with 31 compounds is larger than BPZ and BDZ in most cases on 7 datasets. What’s more, the r2 value of BPZ
with 19 compounds is larger than that of BDZ with 12 compounds. us there is a clear gradation on increasing
prediction dataset size on our curated dataset.
In Fig.5, the Rs curves of BPU, BDZ, and BPU&BDZ converge to 0.59, 0.89, and 0.947, respectively, with
increasing data size when using weighted Chemprop. e Rs value of PCAs and CDK increase to 0.58 and 0.63
on the CHEMBL dataset and decrease to 0.4 and 0.18 on the Kinect dataset, respectively. One can see that the
Kinect dataset yields a negative performance on Rs when predicting the PCA and CDK values for both weighted
Chemprop and Chemprop. e unstable r2 and Rs values around 0 for CDK conrm that the graph learning model
of Chemprop fails to track the physicochemical features of PCAs in terms of solubility. From both Figs.4, 5,
weighted Chemprop demonstrates a clear prediction performance gradation on the BPU, BDZ, CDK and PCA
molecules, whereas Chemprop with the original dataset does not.
e above comparison conrms that r2 and Rs values for CDK and PCA are noisy, these two datasets with 7
and 10 elements respectively, are too small to deliver a good comparison. However when given enough number
of compounds, both the r2 and Rs value of BPU & BDZ datasets are high and above 0.9. As both r2 and Rs are used
to evaluate the correlation coecients of the predicted and observed values, it is not the absolute value of solubil-
ity value. We guess that intrinsic solubility and kinetic solubility can have dierent absolute solubility value but
can still share the same tread in its correlation coecients. us we didn’t distinguish between thermodynamic
solubility and kinetic solubility in our training datasets (AQUA, PHYS, ESOL, OCHEM, AQSOL, CHEMBL,
KINECT) and the test dataset (BPU, BDZ, CDK, and PCA). Note that, it is still recommended to avoid mixing
kinetic and thermodynamic in one training dataset or test dataset. Larger dataset will be better for us to do this
evaluation, but currently no other open data is available.
In terms of running time, predicting these 48 molecules, for example, with weighted Chemprop requires
approximately 1.34 seconds in total or 0.028 seconds for each molecules on average with a single desktop com-
puter as listed in Table5. For QM-QSPR approaches such as the QM-based methods2, the calculation relies on
a cloud infrastructure of millions of CPU cores; however, no running time can be recorded as their method is
commercial and not publicly available. us, the availability of open-source methods and dramatically lower
usage of computing resources are additional advantages of applying deep learning models.
Fig. 5 Comparison of Rs values on ensembled models with the best RMSE scores in Table3 for Chemprop (le
gure) or weighted Chemprop (right gure) when predicting 48 molecules.
Desktop Time Usage (in seconds)
CPU GPU Evaluation (48) ESOL (1128) AQSOL (9982)
E3-1225 v6 1.28 8.11 86.56
E3-1225 v6 Quadro P400 1.34 8.49 86.07
Platinum 8180 0.70 9.98 107.93
Platinum 8180 GTX 1050Ti 0.61 8.27 86.28
Platinum 8180 Tesla T4 0.62 8.42 91.20
Tab le 5. Statistical time-usage (averaged over 100 rounds) of predicting compounds in evaluation, ESOL, and
AQSOL datasets with weighted Chemprop on three computers. e number of molecules containing in these
datasets are 48, 1311, 9982 respectively. e time-usage is measured in seconds. e eciency of the prediction
workload is about 4% on Tesla T4, 6% on GTX 1050Ti, and 9% on Quadro P400, thus the running time has
limited relation with GPU cards for unsaturated workload.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
11
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
To conclude, with seven collected large-scale aqueous solubility dataset and the proposed data curation
methodology, seven high quality curated datasets with quality weights are generated. Deep learning methods
including both Chemprop and AttentiveFP shares a dramatically increase on predictive accuracy measured in
RMSE, which has been demonstrated in method section in details. More importantly, using these ensembled
models with best RMSE, deep learning methods benet from curated datasets, with a steady improvement in r2
and Rs when increasing training data volume. Deep learning methods also demonstrate a superior performance
on r2 and comparable performance on Rs when predicting BPU and BDZ derivatives for leading compound
optimization compared with the QM-QSPR approaches, such as2. A clear prediction performance rank demon-
strating the capacity of deep learning methods on four series compounds is also illustrated by curated datasets.
For example, deep learning methods do not function well on PCA and CDK derivatives, while the QM-QSPR
approaches have not demonstrated their capacity. A clear advantage of deep learning approach is its running
time, when predicting thousands of target compounds it takes only seconds on a common desktop computer
whereas physics-based approach requires a large compute resources and takes a longrunning time.
Discussion
Previously, both AI and drug design experts are focused on molecular property prediction. However they are
interested in totally dierent issues, as we illustrated in Table6. Enormous high quality data and high predictive
accuracy on their own measurement standard are the main concern for the AI experts. Drug design experts
are more interested in real world eects of the method itself. For example, how is the correlation coecients
in compound lead optimization, what’s the generalization ability on dierent series of in-house compounds,
what’s the required computing resource and its running time on making prediction, and nally is it available or
open-sourced for free application. is work is trying to bridge such gap with one of its sub-problem, aqueous
solubility prediction.
Currently, the QM-QSPR approaches are the dominant techniques for aqueous solubility prediction in drug
design. Several research works have demonstrated their improvement with AI techniques. However, with these
continuous improvements in predictive accuracy achieved with AI, conservative drug design experts remain
concerned about the real ability of deep learning in comparison with that of QM-QSPR approaches on their
in-house datasets. is work contributes to resolve the concerned issues from both deep learning and drug
design side. From the deep learning side, we increased the data volume of aqueous solubility datasets from
thousand to hundreds of thousands of molecules, rened the data quality of the datasets with a data curation
method, and nally improved the solubility predictive accuracy dramatically under the traditional measurement
of RMSE. In terms of drug design side, this work is a milestone bridge that constructs a mechanism to com-
pare QM-QSPR and deep learning approaches with state-of-the-art solubility evaluation datasets on correlation
coecients. Fortunately, the graph learning method of expanded Chemprop trained on a curated dataset has
demonstrated a steady performance on correlation coecients of r2 and Rs comparable to that of the QM-QSPR
approaches, while using orders of magnitudes less compute resources and being available for public evaluation.
e comparison also conrms that the generalization ability of deep learning approach is good on BPU andBDZ
derivatives but still limited on PCA and CDK derivatives which demands further research eort on both sides.
This work also reveals a turning point in molecular property prediction where the deep learning and
QM-QSPR approaches should be jointly co-developed. For example, topology-based graph learning and crystal-
3D-structure-based deep learning may integrate both topology and crystal 3D features in solubility prediction
with a promising accuracy improvement. One can also expand this work to other molecular properties to better
understand natural phenomena with the help of both QM-QSPR and deep learning methods.
Usage Notes
Reproducibility of the curation algorithm, training workow and performance evaluation can be verifed by execut-
ing the scripts described in the README of our project SolCuration at https://github.com/Mengjintao/SolCuration.
e code has been developed and tested using Python 3.7 on Linux operating system and is available under the BSD
3-Clause License. All the datasets are also provided in this repository for further research eort on this problem.
Data availability
e original, clean and curated dataset for the 7 selected data sources presented in this paper are publicly available
on GitHub at https://github.com/Mengjintao/SolCuration and can be cited by43.
Code availability
Python and C++ codes used to perform data curation, training workow, and performance evaluation shown in
this manuscript are publicly available on GitHub at https://github.com/Mengjintao/SolCuration or one can cite
our code by43.
AI experts Drug design experts
Data volume Correlation coecients in compound lead optimization
Data quality Generalization ability on dierent series of compounds
Measurement standard Computing resource and its running time
Predictive accuracy Open source availability
Tab le 6. Dierence of the issues concerned by AI experts and Drug design experts.
Content courtesy of Springer Nature, terms of use apply. Rights reserved
12
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
Received: 5 October 2020; Accepted: 25 January 2022;
Published: xx xx xxxx

1. Lipp, . e innovator pipeline: bioavailability challenges and advanced oral drug delivery opportunities. Am Pharm ev 16, 10–12
(2013).
2. Abramov, Y. A., Sun, G., Zeng, Q., Zeng, Q. & Yang, M. Guiding lead optimization for solubility improvement with physics-based
modeling. Molecular Pharmaceutics (2020).
3. Wang, J. & Hou, T. ecent advances on aqueous solubility prediction. Combinatorial chemistry & high throughput screening 14,
328–338 (2011).
4. Salahinejad, M., Le, T. C. & Winler, D. A. Aqueous solubility prediction: do crystal lattice interactions help? Molecular
pharmaceutics 10, 2757–2766 (2013).
5. Jorgensen, W. L. & Duy, E. M. Prediction of drug solubility from structure. Advanced drug delivery reviews 54, 355–366 (2002).
6. Hossain, S., abedev, A., Parrow, A., Bergström, C. & Larsson, P. Molecular simulation as a computational pharmaceutics tool to
predict drug solubility, solubilization processes and partitioning. European Journal of Pharmaceutics and Biopharmaceutics (2019).
7. Teto, I. V., Villa, A. E. & Livingstone, D. J. Neural networ studies. 2. variable selection. Journal of chemical information and
computer sciences 36, 794–803 (1996).
8. Palmer, D. S., O’Boyle, N. M., Glen, . C. & Mitchell, J. B. andom forest models to predict aqueous solubility. Journal of chemical
information and modeling 47, 150–158 (2007).
9. Duvenaud, D. . et al. Convolutional networs on graphs for learning molecular ngerprints. In Advances in neural information
processing systems, 2224–2232 (2015).
10. ier, L. B., et al. Molecular connectivity in structure-activity analysis (esearch Studies, 1986).
11. Teto, I. V., Tanchu, V. Y., asheva, T. N. & Villa, A. E. Estimation of aqueous solubility of chemical compounds using e-state
indices. Journal of chemical information and computer sciences 41, 1488–1493 (2001).
12. Palmer, D. S. et al. Predicting intrinsic aqueous solubility by a thermodynamic cycle. Molecular Pharmaceutics 5, 266–279 (2008).
13. Palmer, D. S., McDonagh, J. L., Mitchell, J. B., van Mouri, T. & Fedorov, M. V. First-principles calculation of the intrinsic aqueous
solubility of crystalline druglie molecules. Journal of chemical theory and computation 8, 3322–3337 (2012).
14. Buchholz, H. . et al. ermochemistry of racemic and enantiopure organic crystals for predicting enantiomer separation. Crystal
Growth & Design 17, 4676–4686 (2017).
15. Docherty, ., Pencheva, . & Abramov, Y. A. Low solubility in drug development: de-convoluting the relative importance of
solvation and crystal pacing. Journal of Pharmacy and Pharmacology 67, 847–856 (2015).
16. Par, J. et al. Absolute organic crystal thermodynamics: growth of the asymmetric unit into a crystal via alchemy. Journal of chemical
theory and computation 10, 2781–2791 (2014).
17. Perlovich, G. L. & aevsy, O. A. Sublimation of molecular crystals: prediction of sublimation functions on the basis of hybot
physicochemical descriptors and structural clusterization. Crystal growth & design 10, 2707–2712 (2010).
18. Syner, ., McDonagh, J., Groom, C., Van Mouri, T. & Mitchell, J. A review of methods for the calculation of solution free energies
and the modelling of systems in solution. Physical Chemistry Chemical Physics 17, 6174–6191 (2015).
19. Zhang, P. et al. Harnessing cloud architecture for crystal structure prediction calculations. Crystal Growth & Design 18, 6891–6900
(2018).
20. Morgan, H. L. e generation of a unique machine description for chemical structures-a technique developed at chemical abstracts
service. Journal of Chemical Documentation 5, 107–113 (1965).
21. ogers, D. & Hahn, M. Extended-connectivity ngerprints. Journal of chemical information and modeling 50, 742–754 (2010).
22. Glen, . C. et al. Circular ngerprints: exible molecular descriptors with applications from physical chemistry to adme. IDrugs 9,
199 (2006).
23. Wu, Z. et al. Moleculenet: a benchmar for molecular machine learning. Chemical science 9, 513–530 (2018).
24. Yang, . et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling
59, 3370–3388 (2019).
25. Feinberg, E. N. et al. Potentialnet for molecular property prediction. ACS central science 4, 1520–1530 (2018).
26. earnes, S., McClosey, ., Berndl, M., Pande, V. & iley, P. Molecular graph convolutions: moving beyond ngerprints. Journal of
computer-aided molecular design 30, 595–608 (2016).
27. Weininger, D. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of
chemical information and computer sciences 28, 31–36 (1988).
28. Gilmer, J., Schoenholz, S. S., iley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint
arXiv:1704.01212 (2017).
29. Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal
of medicinal chemistry (2019).
3 0. Avdeef, A. Suggested improvements for measurement of equilibrium solubility-ph of ionizable drugs. ADMET and DMP 3, 84–109
(2015).
31. Bergström, C. A. & Larsson, P. Computational prediction of drug solubility in water-based systems: qualitative and quantitative
approaches used in the current drug discovery and development setting. International journal of pharmaceutics 540, 185–193 (2018).
32. Wenloc, M. C., Austin, . P., Potter, T. & Barton, P. A highly automated assay for determining the aqueous equilibrium solubility of
drug discovery compounds. JALA: Journal of the Association for Laboratory Automation 16, 276–284 (2011).
33. Eri sson, L. et al. Methods for reliability and uncertainty assessment and for applicability evaluations of classification-and
regression-based qsars. Environmental health perspectives 111, 1361–1375 (2003).
34. Huusonen, J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. Journal of
Chemical Information and Computer Sciences 40, 773–777 (2000).
35. Delaney, J. S. Esol: estimating aqueous solubility directly from molecular structure. Journal of chemical information and computer
sciences 44, 1000–1005 (2004).
36. Saal, C. & Petereit, A. C. Optimizing solubility: inetic versus thermodynamic solubility temptations and riss. European journal of
pharmaceutical sciences 47, 589–595 (2012).
37. Mansouri, ., Grule, C., ichard, A., Judson, . & Williams, A. An automated curation procedure for addressing chemical errors
and inconsistencies in public datasets used in qsar modelling. SA and QSA in Environmental esearch 27, 911–937 (2016).
38. Sorun, M. C., hetan, A. & Er, S. Aqsoldb, a curated reference set of aqueous solubility and 2d descriptors for a diverse set of
compounds. Scientic data 6, 1–8 (2019).
39. Zalesa, B. et al. Synthesis of zwitterionic compounds: Fully saturated pyrimidinylium and 1, 3-diazepinylium derivatives via the
novel rearrangement of 3-oxobutanoic acid thioanilide derivatives. e Journal of organic chemistry 67, 4526–4529 (2002).
40. Bemis, G. W. & Murco, M. A. The properties of nown drugs. 1. molecular framewors. Journal of medicinal chemistry 39,
2887–2893 (1996).
41. awahata, W. et al. Design and synthesis of novel amino-triazine analogues as selective bruton’s tyrosine inase inhibitors for
treatment of rheumatoid arthritis. Journal of medicinal chemistry 61, 8917–8933 (2018).
Content courtesy of Springer Nature, terms of use apply. Rights reserved
13
SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
www.nature.com/scientificdata/
42. Ito, M. et al. Discovery of 3-benzyl-1-(trans-4-((5-cyanopyridin-2-yl) amino) cyclohexyl)-1-arylurea derivatives as novel and
selective cyclin-dependent inase 12 (cd12) inhibitors. Journal of medicinal chemistry 61, 7710–7728 (2018).
43. Meng, J. Solcuration. gshare https://doi.org/10.6084/m9.gshare.14766909 (2021).

is work was partly supported by the National Key Research and Development Program of China under Grant
No. 2018YFB0204403, Strategic Priority CAS Project XDB38050100, National Science Foundation of China
under grant No. U1813203, the Shenzhen Basic Research Fund under grant No. RCYX2020071411473419,
KQTD20200820113106007 and JSGG20190220164202211, CAS Key Lab under grant No. 2011DP173015.
is work was partly supported by JST, PRESTO under grant No. JPMJPR20MA, JSPS KAKENHI under grant
No. JP21K17750, and AIST Emerging Research under grant No. AAZ2029701B, Japan. We would like to thank
Dr. Kamel Mansouri from Integrated Laboratory Systems, Inc for providing curated PHYSPROP datasets. We
also want to thank the editors and reviewers for their professional comments which have greatly improved this
manuscript.
Competing interests
e authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to P.C., Y.W. or S.F.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional aliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Cre-
ative Commons license, and indicate if changes were made. e images or other third party material in this
article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons license and your intended use is not per-
mitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© e Author(s) 2022
Content courtesy of Springer Nature, terms of use apply. Rights reserved
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com
... Since the applicability domain of models critically depends on data [8], the studies reporting new large datasets with solubility data, such as AqSolDB [9], are of considerable interest to the research community. That is why the recent article by Meng et al. [10] which reported on the collection of large sets of solubility values, while also mentioning a significant drop in the RMSE due to the reported data curation, attracted our attention. One of the important methodological approaches reported in the article was the use of a hyperparameter optimization procedure which required a lot of computational power. ...
... The "original" sets collected data retrieved by the authors from the respective data sources, as reported in Meng et al. [10]. Some of the original sets contained duplicates. ...
... , which could not be processed by graph-based neural networks due to the fact that there are no bonds between heavy atoms set for those objects to apply graph convolution or/and these atom/molecule/compound types are not supported. The authors removed duplicates and metals Table 1 Analysis and cleaning of clean data ("Cln") reported in the study by Meng et al. [10] a 192 metals and non-organics were removed from the "Org" AQSOL set during the cleaning and standardization procedure to create "clean" and "curated" sets as described below, and the 192 metal-containing compounds that remained in the "Org" set were excluded as reported in Table 1. ...
Article
Full-text available
Hyperparameter optimization is very frequently employed in machine learning. However, an optimization of a large space of parameters could result in overfitting of models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of smiles called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons by using only a tiny fraction of time as compared to other methods. Last but not least we stressed the importance of comparing calculation results using exactly the same statistical measures. Scientific Contribution We showed that models with pre-optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performances but four orders faster. Transformer CNN provided significantly higher accuracy compared to other investigated methods.
... 7 Since the applicability domain of models critically depends on data, 8 the studies reporting new large datasets with solubility data, such as AqSolDB, 9 are of considerable interest to the research community. That is why the recent article by Meng et al., 10 which reported on the collection of large sets of solubility values, while also mentioning a significant drop in the RMSE due to the reported data curation, attracted our attention. One of the important methodological approaches reported in the article was the use of a hyperparameter optimization procedure which required a lot of computational power. ...
... The "original" sets collected data retrieved by the authors from the respective data sources, as reported in Meng et al. 10 Some of the original sets contained duplicates. For example, On-line CHEmical database and Modelling environment (OCHEM) 11 has a policy of collecting data as it is published in the original articles. ...
... The cleaning procedure (described in Meng et al. 10 ) included SMILES standardization using MolVS followed by the removal of duplicates (only when the difference between values for the same molecules in two records was less than 0.01 log unit), removing records that followed non-standard experimental protocols (temperature 25 ± 5°C, pH 7 ± 1) as well as the removal of compounds containing metals. ...
Preprint
Full-text available
Hyperparameter optimization is very frequently employed in machine learning. However, an optimization of a large space of parameters could result in overfitting of models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of smiles called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons by using only a tiny fraction of time as compared to other methods. Last but not least we stressed the importance of comparing calculation results using exactly the same statistical measures.
... 39 (ii) AqSol data set 40 which combines nine data sets from various source. (iii) AQUA data set obtained from Meng et al., 41 containing research data by Huuskonen 42 and Tetko et al., 43 with experimental aqueous solubility values measured between 20 to 25°C, sourced partly from the AQUASOL database of the University of Arizona and SCR's PHYSPROP database. (iv) PHYS data set 41 obtained from Meng et al., 41 containing molecules with water solubility end points extracted from the PHYSPROP database. ...
... (iii) AQUA data set obtained from Meng et al., 41 containing research data by Huuskonen 42 and Tetko et al., 43 with experimental aqueous solubility values measured between 20 to 25°C, sourced partly from the AQUASOL database of the University of Arizona and SCR's PHYSPROP database. (iv) PHYS data set 41 obtained from Meng et al., 41 containing molecules with water solubility end points extracted from the PHYSPROP database. (v) Some other curated solubility data. ...
Article
The solubility of chemical substances in water is a critical parameter in pharmaceutical development, environmental chemistry, agrochemistry, and other fields; however, accurately predicting it remains a challenge. This study aims to evaluate and compare the effectiveness of some of the most popular machine learning modeling methods and molecular featurization techniques in predicting aqueous solubility. Although these methods were not implemented in a competitive environment, some of their performance surpassed previous benchmarks, offering gradual but significant improvements. Our results show that methods based on graph convolution and graph attention mechanisms demonstrated exceptional predictive abilities with high-quality data sets, albeit with a sensitivity to data noise and errors. In contrast, models leveraging molecular descriptors not only provided better interpretability but also showed more resilience when dealing with inherent noise and errors in data. Our analysis of over 4000 molecular descriptors used in various models identified that approximately 800 of these descriptors make a significant contribution to solubility prediction. These insights offer guidance and direction for future developments in solubility prediction.
... Graph Convolutional Networks (GCNs) have gained popularity recently for processing graph-structured data and demonstrated successes in various applications such as social network analysis [18], biology [26,36,55], and chemistry [39,43]. Large-scale graphs, which contain more than hundreds of million nodes and edges, exist widely in realworld applications [40]. ...
Preprint
Full-text available
Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to inefficient memory access patterns and high communication overhead. This paper presents general and efficient aggregation operators designed for irregular memory access patterns. Additionally, we propose a pre-post-aggregation approach and a quantization with label propagation method to reduce communication costs. Combining these techniques, we develop an efficient and scalable distributed GCN training framework, \emph{SuperGCN}, for CPU-powered supercomputers. Experimental results on multiple large graph datasets show that our method achieves a speedup of up to 6×\times compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs, without sacrificing model convergence and accuracy. Our framework achieves performance on CPU-powered supercomputers comparable to that of GPU-powered supercomputers, with a fraction of the cost and power budget.
... These environmental considerations span from exposure to sunlight irradiation, the presence of natural organic matter, to the presence of mineral particles in the vicinity. Collectively, these diverse sources and multifaceted factors contribute to a comprehensive understanding of nanotoxicity, underpinning the ongoing efforts to decipher and mitigate the potential hazards posed by nanomaterials [41,42,[44][45][46][47][48]. ...
... present B3DB, which includes 1,058 compounds containing log BB values and 7,807 compounds with classification labels for the blood-brain barrier as one of the distribution properties. Meng et al. 12 collected seven aqueous solubility datasets and presented a dataset curation workflow to establish solubility datasets as one of the physicochemical properties. ...
Article
Full-text available
Accurately predicting ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties early in drug development is essential for selecting compounds with optimal pharmacokinetics and minimal toxicity. Existing ADMET-related benchmark sets are limited in utility due to their small dataset sizes and the lack of representation of compounds used in drug discovery projects. These shortcomings hinder their application in model building for drug discovery. To address this issue, we propose a multi-agent data mining system based on Large Language Models that effectively identifies experimental conditions within 14,401 bioassays. This approach facilitates merging entries from different sources, culminating in the creation of PharmaBench. Additionally, we have developed a data processing workflow to integrate data from various sources, resulting in 156,618 raw entries. Through this workflow, we constructed PharmaBench, a comprehensive benchmark set for ADMET properties, which comprises eleven ADMET datasets and 52,482 entries. This benchmark set is designed to serve as an open-source dataset for the development of AI models relevant to drug discovery projects.
... In a recent study, Meng et al. 44 introduced a curation work-ow to rene seven well-established aqueous solubility datasets. This process focused on removing redundant and conicting records, particularly those exhibiting variations in solubility across datasets. ...
Article
Full-text available
The effectiveness of drug treatments depends significantly on the water solubility of compounds, influencing bioavailability and therapeutic outcomes. A reliable predictive solubility tool enables drug developers to swiftly identify drugs with low solubility and implement proactive solubility enhancement techniques. The current research proposes three predictive models based on four solubility datasets (ESOL, AQUA, PHYS, OCHEM), encompassing 3942 unique molecules. Three different molecular representations were obtained, including electrostatic potential (ESP) maps, molecular graph, and tabular features (extracted from ESP maps and tabular Mordred descriptors). We conducted 3942 DFT calculations to acquire ESP maps and extract features from them. Subsequently, we applied two deep learning models, EdgeConv and Graph Convolutional Network (GCN), to the point cloud (ESP) and graph modalities of molecules. In addition, we utilized a random forest-based feature selection on tabular features, followed by mapping with XGBoost. A t-SNE analysis visualized chemical space across datasets and unique molecules, providing valuable insights for model evaluation. The proposed machine learning (ML)-based models, trained on 80% of each dataset and evaluated on the remaining 20%, showcased superior performance, particularly with XGBoost utilizing the extracted and selected tabular features. This yielded average test data Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²) values of 0.458, 0.613, and 0.918, respectively. Furthermore, an ensemble of the three models showed improvement in error metrics across all datasets, consistently outperforming each individual model. This Ensemble model was also tested on the Solubility Challenge 2019, achieving an RMSE of 0.865 and outperforming 37 models with an average RMSE of 1.62. Transferability analysis of our work further indicated robust performance across different datasets. Additionally, SHAP explainability for the feature-based XGBoost model provided transparency in solubility predictions, enhancing the interpretability of the results.
... In contract to the existing approaches that are tested on a very limited number of datasets, we collected 29 diverse datasets to demonstrate the usefulness of our approach. [23], Wassvik, Duffy, Dearden, Huuskonen [9], D5, Jain, Goodman, Wang [8], Boobier, Aqsol, ESOL [14], Bergstrom [24], Grigorev [4], Lovric [6], David [3], Daniel [25], Tang [12], Phys, Ochem, Aqua [26], Training set [15], Cui [13], Charles N. Lowe [5], and Ademola [27]. ...
Preprint
Full-text available
Aqueous solubility (AS) is a key physiochemical property that plays a crucial role in drug discovery and material design. We report a novel unified approach to predict and infer chemical compounds with the desired AS based on simple deterministic graph-theoretic descriptors, multiple linear regression (MLR) and mixed integer linear programming (MILP). Selected descriptors based on a forward stepwise procedure enabled the simplest regression model, MLR, to achieve significantly good prediction accuracy compared to the existing approaches, achieving the accuracy in the range [0.7191, 0.9377] for 29 diverse datasets. By simulating these descriptors and learning models as MILPs, we inferred mathematically exact and optimal compounds with the desired AS, prescribed structures, and up to 50 non-hydrogen atoms in a reasonable time range [6, 1204] seconds. These findings indicate a strong correlation between the simple graph-theoretic descriptors and the AS of compounds, potentially leading to a deeper understanding of their AS without relying on widely used complicated chemical descriptors and complex machine learning models that are computationally expensive, and therefore difficult to use for inference. An implementation of the proposed approach is available at https://github.com/ku-dml/mol-infer/tree/master/AqSol.
... The integration of Artificial Intelligence (AI) in organic chemistry is gaining significant prominence, primarily due to its potential to improve the pace and efficiency of drug discovery, materials design, and chemical processes [1,2]. AI techniques can be harnessed to generate novel molecules with desired properties [3][4][5], such as improved solubility [6,7] or toxicity [8], leveraging existing molecular data. This capability has the profound potential to expedite materials design and discovery [9]. ...
Article
Full-text available
Inferring missing molecules in chemical equations is an important task in chemistry and drug discovery. In fact, the completion of chemical equations with necessary reagents is important for improving existing datasets by detecting missing compounds, making them compatible with deep learning models that require complete information about reactants, products, and reagents in a chemical equation for increased performance. Here, we present a deep learning model to predict missing molecules using a multi-task approach, which can ultimately be viewed as a generalization of the forward reaction prediction and retrosynthesis models, since both can be expressed in terms of incomplete chemical equations. We illustrate that a single trained model, based on the transformer architecture and acting on reaction SMILES strings, can address the prediction of products (forward), precursors (retro) or any other molecule in arbitrary positions such as solvents, catalysts or reagents (completion). Our aim is to assess whether a unified model trained simultaneously on different tasks can effectively leverage diverse knowledge from various prediction tasks within the chemical domain, compared to models trained individually on each application. The multi-task models demonstrate top-1 performance of 72.4%, 16.1%, and 30.5% for the forward, retro, and completion tasks, respectively. For the same model we computed round-trip accuracy of 83.4%. The completion task exhibiting improvements due to the multi-task approach.
... The details of these datasets, including the list of chemical activities and the Pearson correlation between activities, can be found in Appendix D. We will benchmark QComp on the largest ADMET-750k dataset, which is accumulated from consistent industrial drug discovery practices. Similar benchmarking procedure is performed for the small public dataset, which is compiled from various public sources [13,[31][32][33][34][35][36][37][38][39][40], for reproducibility of the QComp approach (see Appendix B.2). ...
Preprint
Full-text available
In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.
Article
Full-text available
Water is a ubiquitous solvent in chemistry and life. It is therefore no surprise that the aqueous solubility of compounds has a key role in various domains, including but not limited to drug discovery, paint, coating, and battery materials design. Measurement and prediction of aqueous solubility is a complex and prevailing challenge in chemistry. For the latter, different data-driven prediction models have recently been developed to augment the physics-based modeling approaches. To construct accurate data-driven estimation models, it is essential that the underlying experimental calibration data used by these models is of high fidelity and quality. Existing solubility datasets show variance in the chemical space of compounds covered, measurement methods, experimental conditions, but also in the non-standard representations, size, and accessibility of data. To address this problem, we generated a new database of compounds, AqSolDB, by merging a total of nine different aqueous solubility datasets, curating the merged data, standardizing and validating the compound representation formats, marking with reliability labels, and providing 2D descriptors of compounds as a Supplementary Resource. Machine-accessible metadata file describing the reported data (ISA-Tab format)
Article
Full-text available
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.
Article
Full-text available
The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. The key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning - instead of feature engineering - deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning models for predicting molecular properties pertinent to drug discovery. To this end, we present the PotentialNet family of graph convolutions. These models are specifically designed for and achieve state-of-the-art performance for protein-ligand binding affinity. We further validate these deep neural networks by setting new standards of performance in several ligand-based tasks. In parallel, we introduce a new metric, the Regression Enrichment Factor EFχ(R), to measure the early enrichment of computational models for chemical data. Finally, we introduce a cross-validation strategy based on structural homology clustering that can more accurately measure model generalizability, which crucially distinguishes the aims of machine learning for drug discovery from standard machine learning tasks.
Article
Full-text available
In this review we will discuss recent advances in computational prediction of solubility in water-based solvents. Our focus is set on recent advances in predictions of biorelevant solubility in media mimicking the human intestinal fluids and on new methods to predict the thermodynamic cycle rather than prediction of solubility in pure water through quantitative structure property relationships (QSPR). While the literature is rich in QSPR models for both solubility and melting point, a physicochemical property strongly linked to the solubility, recent advances in the modelling of these properties make use of theory and computational simulations to better predict these properties or processes involved therein (e.g. solid state crystal lattice packing, dissociation of molecules from the lattice and solvation). This review serves to provide an update on these new approaches and how they can be used to more accurately predict solubility, and also importantly, inform us on molecular interactions and processes occurring during drug dissolution and solubilisation.
Article
Although there is a number of computational approaches available for the aqueous solubility prediction, majority of those models rely on the existence of a training set of thermodynamic solubility measurements, or/and fail to accurately account for the lattice packing contribution to the solubility. The main focus of this study is the validation of the application of a physics-based aqueous solubility approach, which does not rely on any prior knowledge and explicitly describes the solid-state contribution, in order to guide the improvement of poor solubility during the lead optimization. A superior performance of a quantum mechanical (QM)-based thermodynamic cycle approach relative to a molecular mechanical (MM)-based one in application to the optimization of two pharmaceutical series was demonstrated. The QM-based model also provided insights into the source of poor solubility of the lead compounds, allowing the selection of the optimal strategies for chemical modification and formulation. It is concluded that the application of that approach to guide solubility improvement at the late discovery and/or early development stages of the drug design proves to be highly attractive.
Article
Hunting for chemicals with favourable pharmacological, toxicological and pharmacokinetic properties remains a formidable challenge for drug discovery. Deep learning provides us with powerful tools to build predictive models that are appropriate for the rising amounts of data, but the gap between what these neural networks learn and what human beings can comprehend is growing. Moreover, this gap may induce distrust and restrict deep learning applications in practice. Here, we introduce a new graph neural network architecture called Attentive FP for molecular representation that uses a graph attention mechanism to learn from relevant drug discovery datasets. We demonstrate that Attentive FP achieves state-of-the-art predictive performances on a variety of datasets and that what it learns is interpretable. The feature visualization for Attentive FP suggests that it automatically learns non-local intramolecular interactions from specified tasks, which can help us gain chemical insights directly from data beyond human perception.
Article
In this review we will discuss how computational methods, and in particular classical molecular dynamics simulations, can be used to calculate solubility of pharmaceutically relevant molecules and systems. To the extent possible, we focus on the non-technical details of these calculations, and try to show also the added value of a more thorough and detailed understanding of the solubilization process obtained by using computational simulations. Although the main focus is on classical molecular dynamics simulations, we also provide the reader with some insights into other computational techniques, such as the COSMO-method, and also discuss Flory-Huggins theory and solubility parameters. We hope that this review will serve as a valuable starting point for any pharmaceutical researcher, who has not yet fully explored the possibilities offered by computational approaches to solubility calculations.
Article
Accurate and rapid crystal structure predictions have the potential to transform the development of new materials, particularly in fields with highly complex molecular structures (such as in drug development). In this work we present a novel cloud-computing CSP platform with the capability of scheduling hundreds of thousands CPU cores and integrating cutting-edge computational chemistry algorithms. This new cloud-computing based CSP platform has been applied to three crystalline drug substances of increasing complexity. The lattice energies of the experimental crystal structures are all within 3.0 kJ/mol of the lowest energy predicted structures. Based on the results of this work, the algorithm improvement and the mass computational power of cloud computing can reduce the whole CSP process to just 1 - 3 weeks for Z' = 1 systems and less than 5 weeks for significantly more complex systems. Furthermore, it is possible to simultaneously perform calculations for multiple molecules if desired. As a result of these improvements, CSP calculations can potentially be applied in conjunction with state-of-the-art experimental screening experiments to reduce the risk of finding new solid forms after product launch provided that a sufficient number of stoichiometries and space-groups are explored.
Article
Bruton's tyrosine kinase (BTK) is a promising drug target for the treatment of multiple diseases such as B cell malignances, asthma, and rheumatoid arthritis. A series of novel aminotriazines were identified as highly selective inhibitors of BTK by a scaffold hopping approach. Subsequent SAR studies of this series using two conformationally different BTK proteins, an activated form of BTK and an unactivated form of BTK, led to the discovery of a highly selective BTK inhibitor 4b. With significant efficacies in vivo models and a good ADME and safety profile, 4b was advanced into preclinical studies.
Article
Cyclin-dependent kinase 12 (CDK12) plays a key role in the coordination of transcription with elongation and mRNA processing. CDK12 mutations found in tumors and CDK12 inhibition sensitize cancer cells to DNA-damaging reagents and DNA-repair inhibitors. This suggests that CDK12 inhibitors are potential therapeutics for cancer that may cause synthetic lethality. Here, we report the discovery of 3-benzyl-1-(trans-4-((5-cyanopyridin-2-yl)amino)cyclohexyl)-1-arylurea derivatives as novel and selective CDK12 inhibitors. Structure-activity relationship studies of a HTS hit, structure-based drug design, and conformation-oriented design using the Cambridge Structural Database afforded the optimized compound 2, which exhibited not only potent CDK12 (and CDK13) inhibitory activity and excellent selectivity, but also good physicochemical properties. Furthermore, 2 inhibited the phosphorylation of Ser2 in the C-terminal domain of RNA polymerase II, and induced growth inhibition in SK-BR-3 cells. Therefore, 2 represents an excellent chemical probe for functional studies of CDK12 and could be a promising lead compound for drug discovery.