SCIENTIFIC DATA | (2022) 9:71 | https://doi.org/10.1038/s41597-022-01154-3
www.nature.com/scientificdata
Boosting the predictive performance with aqueous solubility dataset curation
Jintao Meng, Peng Chen✉, Mohamed Wahib, Mingjun Yang, Liangzhen Zheng, Yanjie Wei✉, Shengzhong Feng✉ & Wei Liu
Intrinsic solubility is a critical property in the pharmaceutical industry that impacts in-vivo bioavailability. Improved performance on thermodynamic datasets is observed with an expanded Chemprop enhanced with the curated data, and steadily improved Pearson and Spearman values with increasing data points are also reported.
Introduction
Aqueous solubility is one of the critical factors defining the bio-availability of orally administered drugs.
Reportedly, over 75% of oral drug development candidates have a low solubility based on the Bio-pharmaceutics
Classification System (BCS)1,2. To tackle this challenge, researchers are focusing on drug solubility improvements with both physics-based Quantum Mechanics-Quantitative Structure Property Relationships
(QM-QSPR) approaches3–6 and data-driven artificial intelligence (AI) methods7–11.
The development of QM-QSPR approaches provides a large number of computational methods for aqueous solubility prediction starting from a molecular structure3–6. The majority of these methods try to explore
fundamental physics-based rules with a sublimation thermodynamic cycle solubility approach2,12 on crystalline
drug-like molecules. This approach is an interplay between crystal packing and molecular hydration free energy
contributions12–15. With this approach, a crystal packing contribution to the drug solubility typically requires a
sublimation energy estimation from crystal lattice calculations12–14, molecular dynamics simulations16, or QSPR
statistical models15,17. The free energy of solvation may be estimated by a variety of approaches, including QSPR
models, Monte Carlo simulations, and QM-based methods18. Recently, a study of guiding lead optimization2 was
proposed. It explicitly describes the solid-state contribution, and the superior performance of the QM-based
thermodynamic cycle approach is demonstrated in the optimization of two pharmaceutical series. The main
limitations of the physics-based QM-QSPR approaches are the large compute requirements and long run time.
For example, guiding lead optimization2 relies on crystal structure prediction calculations19, which may require
several days on a powerful cloud infrastructure consisting of millions of CPU cores.
Early AI-based approaches for solubility prediction involve the application of logistic regression7, random
forests8 and convolutional neural networks9 to expert-engineered descriptors10,11 or molecular fingerprints
such as the Dragon descriptors or Morgan (ECFP) fingerprints20–22. Their predictive accuracy, measured as the
root mean square error (RMSE), is limited to 0.7–1.0 log units. Recent research efforts are focused on graph
learning23–26 of the underlying topology of molecule structures using SMILES strings27. Such models extract their
1Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, 518000, China. 2National Institute of Advanced
Industrial Science and Technology, Tokyo, Japan. 3RIKEN Center for Computational Science, Hyogo, Japan. 4XtalPi
Inc., Shenzhen, 518000, China. 5National Supercomputer Center in Shenzhen, Shenzhen, 518000, China. 6Tencent AI
Lab, Shenzhen, 518000, China. ✉e-mail: chin.hou@aist.go.jp; yj.wei@siat.ac.cn; fengsz@nsccsz.cn
Content courtesy of Springer Nature, terms of use apply. Rights reserved
own expert features directly from atoms and edges, and embed them with graph convolutional networks. An
experiment in MoleculeNet23 on solubility prediction for the ESOL dataset ranks the Message Passing Neural
Network (MPNN) as the best graph learning model, with a predictive accuracy of 0.58, among other graph models
such as WEAVE28 and GraphConv26. Chemprop24, which embeds molecule-level features and extends the MPNN
with a Directed MPNN, further improves the predictive accuracy on ESOL to 0.56. AttentiveFP29 is the first work
applying an attention mechanism with a graph neural network and reports the lowest error of 0.503 on the ESOL
dataset. These deep learning based approaches try to model complex physicochemical properties with a QSPR
statistical approach; however, their flexibility and capacity for capturing those complex relationships are still
bounded by the availability of high-quality data30–32.
The measurement and dataset diversity gaps between the AI-based and QM-QSPR approaches are two critical issues hindering research on combining these two approaches. For AI-based approaches in particular,
different papers evaluate their work on different datasets, using different workflows, or even with different measurements. In most cases, this becomes the first obstacle preventing readers from objectively distinguishing the
viability of the proposed AI approaches. More importantly, to the authors' knowledge, no previous work conducted any comparison to evaluate both AI-based and QM-QSPR approaches under the same measurements
with an openly available dataset. This situation also inhibits any quantitative analysis from exploring the advantages and disadvantages of these two approaches, and the possibilities of combining them to achieve additional
progress.
In terms of data curation methodologies, Eriksson's work published in 200333 applies preprocessing
techniques (scaling and centering), data correction, and transformations to improve a regression model's
performance on Quantitative Structure-Activity Relationships (QSAR). There are three differences between
our work and Eriksson's work. First, our work focuses specifically on solubility instead of QSAR. Data
correction using signal correction cannot work on our dataset, as there is no relationship between
the solubility value and undesired variation arising from light-scattering effects, baseline drift, nonlinearities,
and so forth. Second, our work explores a data curation methodology for nonlinear deep learning models
using graph neural networks, whereas Eriksson's work33 targets linear regression models. Last, our work
focuses on the data curation methodology itself. Eriksson's work needs the preprocessing (scaling
and centering) and transformation steps to prevent an unbalanced data composition from exerting a large
influence on the model and dominating the other measurements. These problems are resolved in our work by
using scaffold data partitioning. Our work is the only work focused on inter-dataset redundancy and intra-dataset
redundancy, a technique not yet presented by any previous work.
To conclude the above discussion, solubility prediction with AI-based methods still faces the following three
challenges:
1. The volume of training data in previous works, such as the ESOL dataset, is limited. Training and evaluation on these small datasets do not necessarily offer good performance for our problems. These datasets are
also insufficient for sophisticated models attempting to learn massive physical-chemical rules and converge
to a stable state.
2. Data curation methods or tools for low-quality aqueous solubility data are still lacking. Directly training on
data with poor quality may affect the predictive accuracy.
3. None of the previous studies pose a comparison of the predictive accuracy between leading deep learning
and state-of-the-art QM-QSPR approaches. Analyzing and determining the advantages and disadvantages
of deep learning methods in comparison with the QM-QSPR approaches is also critical but difficult to
achieve.
To resolve the above issues and refine the research problem of solubility prediction for AI, our contributions
are threefold:
1. The first large-scale dataset for AI research on aqueous solubility is collected. This dataset contains seven
aqueous solubility datasets including both thermodynamic and kinetic data. The number of records in
these datasets ranges from a few thousand to several hundreds of thousands.
2. This work is the first to improve the aqueous solubility predictive accuracy with a data curation method.
We present a data curation workflow of filtering, evaluating and clustering. This workflow adds a solubility
quality score to each record and curates records sharing similar solubility among different datasets. We also
expand two leading deep learning methods, i.e., Chemprop24 and AttentiveFP29, to support data quality
during the training and evaluation process. Using these expansions of the Chemprop and AttentiveFP deep
learning methods, improved predictive accuracy is observed on all thermodynamic datasets.
3. This work is also the first to compare deep learning and QM-QSPR approaches using the Pearson and
Spearman's rank-order correlation coefficients by predicting four pharmaceutical series of 48 molecules.
Abramov's guiding lead optimization and weighted Chemprop are selected as the representatives of the two
approaches. In predicting the first two pharmaceutical series of 31 molecules, Abramov's approach demonstrates a
Pearson correlation coefficient r2 of 0.905 and a Spearman's rank-order correlation coefficient Rs of 0.967.
Weighted Chemprop (expanded to support data quality) is trained on the curated dataset, yielding
improvements in its r2 and Rs values. Its performance increases steadily with the training data volume and
achieves a comparable r2 of 0.930 and Rs of 0.947. In comparison with Abramov's approach, which requires
large compute resources, predicting thousands of target compounds with the deep learning approach takes
only seconds on a common desktop computer.
The rest of this paper is organized as follows. The collection and description of the seven datasets, together with
our data curation workflow, are illustrated in the Methods section. The Results section compares the deep learning and
QM-QSPR approaches and then discusses the benefits of data curation. The Discussion section explains the innovations and contributions this work makes towards molecule property prediction.
Methods
Datasets. We collected molecules labeled with aqueous solubility from publicly available databases or datasets provided by previous papers, resulting in the 7 datasets shown in Table 1. Among them, the first three datasets were evaluated by previous papers11,23,24,34,35 but are limited in the number of samples or records, while the last
four datasets have larger numbers of samples with poor data quality. We also include both thermodynamic and
kinetic datasets; the first six are thermodynamic datasets, while the last is the kinetic set36.
Table 1 presents the statistical information of each dataset. Every dataset is processed separately into
the same standardized form. The data extraction process and standardization methods applied for each dataset
are described below.
• AQUA. This dataset was taken from the work of Huuskonen34 and Tetko11, with 1311 records on 1307 molecules downloaded from the ALOGPS homepage at http://146.107.217.178/lab/alogps/logs.txt. The experimental aqueous solubility value is measured between 20–25 °C and obtained partly from the AQUASOL
database of the University of Arizona and SRC's PHYSPROP database.
• PHYS. This dataset is a curated PHYSPROP database consisting of a collection of datasets in SDF format. An
automated KNIME workflow37 is used to curate and correct errors in the structure and identity of chemicals
using the publicly available PHYSPROP datasets. Here, we extract 2024 molecules with a water solubility
(WS) endpoint. The quality of each record is measured with stars from 1 to 5; thus, the data quality property
"STAR_FLAG" is reserved, and finally 2010 records are retained.
• ESOL. The original ESOL dataset, containing 1144 records, was first used by35, and its verified version was later evaluated in23,24. We downloaded the verified version with 1128 records from Chemprop's repository
https://github.com/Chemprop/Chemprop as our ESOL dataset to keep it consistent with previous works23,24.
• OCHEM. This dataset is taken from the OCHEM database of WS at https://ochem.eu/. We reserve 6525
rows from 36,450 records by selecting molecules with the dataset type "Training" to retain molecules with
experimental solubility values.
• AQSOL. This dataset38 combines 9 datasets, including the AQUA and ESOL datasets. A preprocessing step
filters this dataset by merging repetitive molecules, with 9982 records remaining. According to the
number of occurrences in the 9 original datasets, a new property called "group" is added to this dataset by
using a classification strategy that partitions the dataset into 5 groups. We keep "group" in this dataset for
the later assignment of weights to identify the data quality of each record.
• CHEMBL. This dataset is extracted from CHEMBL's activity database, which includes 15,996,368 records, at
https://www.ebi.ac.uk/chembl. We filter this database with the assay type "physicochemical" and then select
40,520 records with the standard type "Solubility" or "solubility" as our dataset. Several different units are
used for the aqueous solubility measurement, such as nM, ug/mL, and ug.mL-1. All the units are converted
to standard "LogS" units. We find that 4,543 records are kinetic solubility data and 17 records use oil as the
solvent. We therefore removed all these records to clean our CHEMBL dataset of kinetic solubility data.
Finally, 30,099 valid records are retained. In addition, the column "Comment", describing the temperature and
pH of the experiment, is kept for later weight assignment on the data quality of each record.
• KINECT. This dataset contains kinetic solubility records taken from the OCHEM database of WS at https://
ochem.eu/. 164,273 records described in SDF format are extracted and collected into this dataset. In addition, the property columns "SMILES", "LogS value", "pH value" and "Temperature" are also extracted and
reserved for quality weight assignment.
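The unit conversion described for the CHEMBL dataset can be sketched as follows. This is an illustrative Python sketch, not the authors' conversion code; the function name to_logs and its interface are hypothetical, and the molecular weight needed for mass-based units would in practice come from the structure (e.g., via RDKit).

```python
import math

def to_logs(value, unit, mol_weight=None):
    """Convert a solubility measurement to LogS = log10(mol/L).

    `mol_weight` (g/mol) is needed for mass-based units; in a real pipeline
    it would be computed from the structure (e.g., RDKit Descriptors.MolWt).
    """
    if unit == "nM":                      # nanomolar -> mol/L
        molar = value * 1e-9
    elif unit in ("ug/mL", "ug.mL-1"):    # micrograms per millilitre -> g/L -> mol/L
        if mol_weight is None:
            raise ValueError("mass-based unit needs a molecular weight")
        molar = (value * 1e-3) / mol_weight
    elif unit == "LogS":                  # already in target units
        return value
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return math.log10(molar)

# 1 mM expressed as nM is 1e6 nM, i.e., LogS of about -3
print(to_logs(1e6, "nM"))
```

A compound measured at its own molecular weight in ug/mL (e.g., 180.16 ug/mL for a 180.16 g/mol molecule) is likewise 1 mM, which serves as a quick consistency check.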
                 No. of Records
Dataset     Org      Cln      Cure     Weights   Additional Columns of Org Dataset
AQUA        1311     1311     1354     1.0
PHYS        2010     2001     2001     1.0       star_flag
ESOL        1128     1116     1157     1.0
OCHEM       6525     4218     3766     0.85
AQSOL       9982     8701     9061     0.4       group
CHEMBL      30099    30099    28675    0.8       comment
KINECT      164273   82057    81935    —         temperature, pH value
Table 1. Statistical information on the number of records in the 7 collected datasets. "Org" is the original
dataset, "Cln" denotes the dataset after data filtering, "Cure" is the dataset after data curation using the
clustering algorithm across multiple datasets, "Weights" denotes the weight assigned to each dataset to
identify the dataset quality, and "Additional Columns of Org Dataset" lists special properties reserved by
some of the datasets.
Data curation. Due to the various experiment environments, workflows and non-unique identifications, the
records in the aqueous solubility datasets are repetitive, erroneous or even contradictory to each other37,38. Note
that molecules with the same SMILES may be different tautomers39 and thus have different solubility values. As
SMILES cannot distinguish tautomers, we keep them as separate records with different solubility values. We merge two records only when the difference between their two values is less than 0.5.
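The merge rule above can be sketched in Python. This is an illustrative sketch rather than the paper's implementation; merge_records is a hypothetical helper, and the merged value here is a plain average (the weighted averaging used in the curation workflow appears later in the paper).

```python
def merge_records(records, tol=0.5):
    """records: list of (smiles, logS). Merge a new record into an existing
    one only when they share a SMILES and their values differ by less than
    `tol` LogS units; otherwise keep both records, since one SMILES may
    cover several tautomers with genuinely different solubilities."""
    merged = []
    for smiles, logs in records:
        for idx, (s, v) in enumerate(merged):
            if s == smiles and abs(v - logs) < tol:
                merged[idx] = (s, (v + logs) / 2.0)  # average the two values
                break
        else:
            merged.append((smiles, logs))
    return merged

recs = [("CCO", -0.30), ("CCO", -0.20), ("CCO", 1.10)]
print(merge_records(recs))  # [('CCO', -0.25), ('CCO', 1.1)]
```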
The development of reliable data-driven deep learning models, however, may be hindered by uncertainties
and disagreements in these repetitive records, which are obtained from many disparate data sources. Training
data with systematic errors from different experimental methodologies potentially limit the predictive accuracy
of deep learning models. To improve the predictive accuracy of deep learning methods and achieve a better
generalization ability from low-quality and confusing data, a curation method delivering high-quality data,
balanced over substructure classes and sufficient in terms of data volume, is vitally important.
We present a data curation workflow of filtering, evaluating and clustering for the above 7 datasets, as illustrated in Fig. 1. The workflow improves dataset quality through data filtering, a quality evaluation, and then
cross-dataset correction among different datasets with a clustering algorithm. Finally, an evaluation with two
leading deep learning methods, i.e., Chemprop and AttentiveFP, demonstrates the benefits of this workflow in
improving the predictive accuracy, based on the RMSE, over all thermodynamic datasets.
Data filtering. To resolve the standardization of molecule expressions, uncertainties from various
experiment environments, and weight bias from repetitive data, the data filtering strategy comprises
the following three steps: SMILES standardization, experiment environmental control, and repetitive record
normalization.
• SMILES standardization. First, each molecule must have only one unique SMILES expression across different databases.
MolVS (described at https://molvs.readthedocs.io/en/latest/) is used to standardize all chemical structures
and maintain one unique standard SMILES for each molecule. Any molecule that fails to pass our standardization procedure is removed from the dataset.
• Experiment environmental control. Second, we target the aqueous solubility prediction of small molecules
in drug design. Thus, the experiment environments of molecules with temperatures of 25 ± 5 °C and pH values of 7 ± 1 are highly valued; any records beyond this scope are ranked low or even removed. Any molecule
used for drug design should be poison-free. For this reason, molecules with heavy metals such as "U, Ge, Pr,
La, Dy, Ti, Zr, Rh, Lu, Mo, Sm, Sb, Nd, Gd, Cd, Ce, In, Pt, Sb, As, Ir, Ba, B, Hg, Se, Sn, Ti, Fe, Si, Al, Bi, Pb, Pd,
Ag, Au, Cu, Pt, Co, Ni, Ru, Mg, Zn, Mn, Cr, Ca, K, Li" are filtered from all datasets. "SF5, SF6" groups are also cleaned,
as they are rarely used in drug design.
• Repetitive record normalization. Third, some datasets contain repetitive molecules with equal or different
solubility values. According to the frequency of occurrence, repetitive record normalization assigns
weights to the records of each molecule, with a total weighted value of 1.0, to prevent molecules with repetitive
values from gaining larger parameter update weights during the model training process.
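The three filtering steps can be sketched as follows. This is an illustrative Python sketch rather than the authors' code: keep_record and normalize_weights are hypothetical helpers, the metal list is a small subset of the paper's full list, and the bracket-atom substring check is a crude stand-in for a proper structure parse (e.g., with RDKit or MolVS).

```python
from collections import Counter

# Illustrative subset of the filtered metal symbols; the paper's list is longer.
HEAVY_METALS = {"Hg", "Pb", "Cd", "As", "Pt", "Au", "Ag", "Zn"}

def keep_record(smiles, temp_c, ph):
    """Experiment environmental control: keep records measured at 25 +/- 5 C
    and pH 7 +/- 1, and drop molecules containing a filtered metal. The
    bracket-atom substring test is a rough sketch, not a real SMILES parse."""
    if not (20.0 <= temp_c <= 30.0 and 6.0 <= ph <= 8.0):
        return False
    return not any(f"[{m}" in smiles for m in HEAVY_METALS)

def normalize_weights(records):
    """Repetitive record normalization: each molecule's records share a total
    weight of 1.0, split evenly by frequency of occurrence."""
    counts = Counter(s for s, _ in records)
    return [(s, v, 1.0 / counts[s]) for s, v in records]

print(keep_record("CC(=O)O", 25.0, 7.0))   # True
print(keep_record("C[Hg]C", 25.0, 7.0))    # False
print(normalize_weights([("CCO", -0.3), ("CCO", 0.2), ("C", 1.46)]))
```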
The numbers of data records before and after our data filtering are presented in Table 1. For each cleaned dataset, the available information in terms of the name, description, and column type is presented in Table 2. In
the end, there are 1311, 2001, 1116, 4218, 8701, 30,099, and 82,057 records in the cleaned AQUA, PHYS, ESOL, OCHEM,
AQSOL, CHEMBL, and KINECT datasets, respectively.
Quality evaluation. Quality evaluation is performed to analyze, evaluate and assign each dataset an appropriate weight identifying its quality. We first analyze the molecule redundancy among different datasets with
identical or different solubility values. Then, we expand Chemprop and AttentiveFP to support the data quality
weights and refer to them as weighted Chemprop and weighted AttentiveFP. Weighted Chemprop is used to
evaluate each dataset's predictive accuracy (measured in RMSE) to identify the dataset quality. Finally, each
dataset is assigned a weight indicating its data quality.
Fig. 1 The data curation workflow of filtering, evaluating, and clustering on the 7 collected datasets.
The existence of data redundancy in repetitive records generates bias in the model training process and evaluation metric. Several data redundancies can be found both within and among the datasets. These data redundancies can be classified into two classes: those in which a given molecule is found in two records with identical
solubility values and those in which a given molecule is found in two records with different solubility values.
Here, we define solubility values within a 0.01 LogS unit difference between two records as identical. Notably, these
redundancies can be found in two records from a single dataset or from two different datasets. The former case
is normalized first by repetitive record normalization, as discussed in the previous subsection; thus, no
molecule sharing the same value occurs twice in a single dataset.
With the above definitions, two redundancy matrices are collected, as presented in Fig. 2, where the percentages of repetitive molecules with the same and different solubility values are presented in the upper and lower
tables, respectively. The rows and columns of these two tables represent the corresponding datasets. The percentage of repetitive molecules with the same solubility value between two datasets i and j is represented as Aij, and
that with different solubility values is represented as Bij. For example, AESOL,PHYS = 43.01 indicates that 43.01%
of the records (one molecule can have multiple records) in the ESOL dataset can be found in the PHYS dataset
with the same solubility value. As another example, BCHEMBL,CHEMBL = 25.13 reveals that 25.13% of the records
in the CHEMBL dataset share the same molecule but have different solubility values within the same
dataset. Note that the two redundancy matrices in Fig. 2 are not symmetric because of the different dataset sizes. The sum
of Aij and Bij for corresponding datasets i and j can be beyond 100%, as, given a record from dataset i, a molecule
in dataset j can have multiple records and thus can share both the same and different solubility values with the
same molecule in other datasets.
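The two redundancy matrices can be computed as sketched below. This is an illustrative Python sketch with a hypothetical redundancy helper; as in the paper, a record can count toward both Aij and Bij, so the two percentages may sum to more than 100%.

```python
from collections import defaultdict

def redundancy(ds_i, ds_j, tol=0.01):
    """Return (A_ij, B_ij): the percentage of records in ds_i whose molecule
    also appears in ds_j with an identical solubility value (within tol LogS
    units) and with a different value. ds_*: lists of (smiles, logS)."""
    by_mol = defaultdict(list)
    for s, v in ds_j:
        by_mol[s].append(v)
    same = diff = 0
    for s, v in ds_i:
        vals = by_mol.get(s, [])
        if any(abs(v - u) <= tol for u in vals):
            same += 1
        # the same record may also have a non-matching duplicate in ds_j,
        # which is why A_ij + B_ij can exceed 100%
        if any(abs(v - u) > tol for u in vals):
            diff += 1
    n = len(ds_i)
    return 100.0 * same / n, 100.0 * diff / n

ds_i = [("A", -1.0), ("B", 0.0)]
ds_j = [("A", -1.0), ("A", 2.0)]
print(redundancy(ds_i, ds_j))  # (50.0, 50.0)
```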
A preliminary analysis of the records in and between datasets reveals the potential value of data curation.
In the upper table of Fig. 2, approximately half of the records in AQUA, PHYS, and ESOL share the same solubility values. Approximately 77–98% and 66% of the records in AQUA, PHYS and ESOL are contained in
OCHEM and AQSOL, respectively. In the lower table, 9–30% of the records share different solubility values
among AQUA, PHYS and ESOL. More than 24% and 30% of the records in these three datasets share different
solubility values. CHEMBL has its own speciality. Both tables confirm that CHEMBL contains few records from
other datasets, and the lower table confirms that one-quarter of the records in CHEMBL have diverse solubility
values. Our intuition for data curation is to make use of the above record redundancies. In practice, a record for
a given molecule with the same solubility value in more than one dataset can improve our confidence
in its data quality. Likewise, a record with different values among datasets can decrease the confidence
in its data quality. This is the fundamental difference between our work and a previous work38, which simply
selects those records with multiple occurrences. Thus, the percentages of both inter-dataset and
intra-dataset record redundancies determine the effectiveness of our data curation method.
To analyze the quality of each dataset, one of the leading graph learning methods, Chemprop, is
selected to evaluate all 7 datasets, with the predictive accuracy used as a reference. Both random and scaffold
splitting are used in this evaluation. Random splitting randomly partitions the samples into three subsets: training, validation, and
test. Scaffold40 splitting partitions the samples based on their two-dimensional structural frameworks
Column Name   Description                                    Type
Smiles        SMILES representation of compound              String
LogS          Experimental aqueous solubility value (LogS)   String
Weight        Weighted quality score in [0, 1]               Float
Table 2. List of information for all cleaned and curated datasets in terms of the name, description, and type of
each column.
Fig. 2 Redundancy matrices showing the percentage of repetitive molecules between two datasets. The upper
table Aij summarizes the percentages of molecules with the same solubility values, and the lower table Bij
describes the percentages of molecules with different solubility values.
as implemented in RDKit. Scaffold splitting is a useful way of organizing structural data by grouping the atoms
of each drug molecule into ring, linker, framework, and side-chain atoms. Because random splitting of
molecular data is not always best for evaluating machine learning methods, scaffold splitting is also applied in
our evaluation. For the original datasets, we train Chemprop on each dataset using both random and scaffold
data partitions with ratios of [0.8, 0.1, 0.1] for training, validation, and testing. Moreover, we ensemble 5 models to
improve the model accuracy and record the average RMSE value and its confidence interval by running each
ensembled model 8 times. The RMSE value of the original dataset is recorded and collected in the third column
of Table 3. Multiple different solubility values for a given molecule among the datasets are normalized on weights
according to the statistical distribution of the molecule determined by the previously discussed data filtering
process. However, Chemprop does not support weighted quality scores for records in a cleaned dataset. Thus, we
extend the training and evaluation code of Chemprop to support training over weighted records and rename
it weighted Chemprop. As a result, a record with a higher quality weight contributes more to
the parameter update, whereas records with lower weights have a smaller effect. Note that when a dataset contains
no weights, weighted Chemprop treats each record equally and behaves identically to Chemprop. Trained on these 7
cleaned datasets, the corresponding predictive accuracy measured with the RMSE is collected in the fourth
column of Table 3.
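The [0.8, 0.1, 0.1] random partition described above can be sketched as follows; scaffold splitting would instead group molecules by their two-dimensional frameworks (e.g., Murcko scaffolds in RDKit), which this sketch omits. random_split is a hypothetical helper, not Chemprop's own splitter.

```python
import random

def random_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and partition records into train/validation/test subsets
    according to the given ratios."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test_set = random_split(list(range(100)))
print(len(train), len(val), len(test_set))  # 80 10 10
```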
Root mean square error (RMSE) is a standard way to measure the error of a model in predicting quantitative
data; a detailed definition is presented at https://en.wikipedia.org/wiki/Root-mean-square_deviation.
Assuming there are n records in the test subset, the RMSE over this subset is defined as follows:

RMSE = \sqrt{ \frac{1}{n} \sum_{i=0}^{n-1} (\hat{y}_i - y_i)^2 }

Here, \hat{y}_0, \hat{y}_1, …, \hat{y}_{n-1} are the predicted values, y_0, y_1, …, y_{n-1} are the observed values, and n is the number of records in the
test subset. As the clean and cured datasets contain quality weights, we must update the RMSE for both the evaluation
and test subsets to use the weighted records during our training process. The weighted RMSE is defined as follows:

weighted RMSE = \sqrt{ \frac{1}{n} \sum_{i=0}^{n-1} w_i (\hat{y}_i - y_i)^2 }

Here, w_0, w_1, …, w_{n-1} are the quality weights of the records. The original datasets contain no quality
weights, so we treat each record with unit weight by default when calculating the weighted RMSE; with unit weights, the
weighted RMSE is the same as the ordinary RMSE. Thus the weighted RMSE is
a comparable metric across the original, clean and cured datasets, and in this paper we simply write RMSE to
denote the weighted RMSE on curated datasets. The original Chemprop is used on the "Org" datasets, and the weighted
Chemprop is applied on both the "Cln" and "Cure" datasets.
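The weighted RMSE above can be written directly in Python. weighted_rmse is a hypothetical helper; with unit weights it reduces to the ordinary RMSE, which is what makes the metric comparable across the original, clean, and cured datasets.

```python
import math

def weighted_rmse(y_pred, y_true, weights=None):
    """Weighted RMSE: sqrt((1/n) * sum_i w_i * (yhat_i - y_i)^2).
    With `weights=None` every record gets unit weight, reducing to RMSE."""
    n = len(y_true)
    if weights is None:
        weights = [1.0] * n
    return math.sqrt(sum(w * (yp - yt) ** 2
                         for w, yp, yt in zip(weights, y_pred, y_true)) / n)

print(weighted_rmse([1.0, 2.0], [0.0, 0.0]))              # sqrt((1 + 4) / 2)
print(weighted_rmse([1.0, 2.0], [0.0, 0.0], [1.0, 0.0]))  # sqrt(1 / 2)
```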
According to Table 3, the six thermodynamic datasets can be split into two groups. The first group includes
AQUA, PHYS, ESOL and OCHEM, and the second group includes AQSOL and CHEMBL. The datasets in the
                       RMSE & Confidence Intervals
Split Type  Dataset    Org             Cln             Cure
Random      AQUA       0.573 ± 0.037   0.583 ± 0.057   0.536 ± 0.042
            PHYS       0.550 ± 0.026   0.600 ± 0.032   0.515 ± 0.018
            ESOL       0.596 ± 0.075   0.619 ± 0.044   0.512 ± 0.047
            OCHEM      0.548 ± 0.024   0.639 ± 0.044   0.522 ± 0.017
            AQSOL      1.023 ± 0.035   0.820 ± 0.036   0.518 ± 0.022
            CHEMBL     0.917 ± 0.017   0.811 ± 0.016   0.499 ± 0.011
            KINECT     0.401 ± 0.003   0.431 ± 0.003   0.432 ± 0.003
Scaffold    AQUA       0.850 ± 0.086   0.849 ± 0.075   0.697 ± 0.043
            PHYS       0.833 ± 0.058   0.813 ± 0.115   0.691 ± 0.092
            ESOL       0.854 ± 0.097   0.808 ± 0.090   0.711 ± 0.073
            OCHEM      0.847 ± 0.067   0.808 ± 0.075   0.695 ± 0.061
            AQSOL      1.073 ± 0.062   0.968 ± 0.045   0.596 ± 0.033
            CHEMBL     1.040 ± 0.038   0.900 ± 0.049   0.555 ± 0.031
            KINECT     0.433 ± 0.015   0.461 ± 0.008   0.460 ± 0.008
Table 3. The collected RMSE and confidence intervals of Chemprop or weighted Chemprop trained on the 7
datasets. The data partition strategies include both random and scaffold strategies. Five models are ensembled
to improve the model accuracy. We average the RMSE by running each model 8 times and then calculate
the corresponding confidence interval. The original Chemprop is used on the "Org" datasets, and the weighted
Chemprop is applied on both the "Cln" and "Cure" datasets.
first group have smaller populations and relatively lower RMSE values; we denote the datasets in this group as
high-quality datasets. The second group has massive records and higher RMSE values on both the original and
cleaned datasets; thus, these two datasets are regarded as low-quality datasets. Due to the change in the evaluation
metric with the weighted records in the high-quality datasets and the KINECT dataset, the predictive accuracy of
each clean dataset shows a roughly 10% increase in the RMSE under a random partition compared with the original dataset. At the same time, as we take "group" and "comment" as references to carry out the weight assignment for each
record in the low-quality datasets, weighted Chemprop learns over the quality weights after repetitive record
normalization and then benefits from a slight decrease in the RMSE (lower is better).
With the above analysis, we can initialize and assign a quality weight to each dataset. The assigned quality
weight is used for data curation in the following section. The assigned weights are distributed in [0, 1], with a
value close to 1 indicating high data quality. The assigned weights for the six thermodynamic datasets are
listed in the fifth column of Table 1. The KINECT dataset is the only kinetic-based dataset; thus, no quality
weight is set. The weights in Table 1 are presented as an example to show a relative ranking in terms of the data
quality among the different datasets, and the specific weight for each dataset can still be adjusted. Searching for
and evaluating a better weight assignment requires extremely large compute power, e.g., one round of evaluation
generating all the data in Table 3 costs approximately two weeks using 1200 compute nodes (38,200 cores and
4800 GPU accelerators) at the National Supercomputer Center in Shenzhen. Therefore, we estimate the weights
in Table 1 from our first intuition and then calculate the corresponding predictive accuracy results in Table 3.
Data clustering. This work is the first to curate data using both inter-dataset and intra-dataset redundancy.
Three curation guidelines are followed to take advantage of these datasets with potential redundancy: ❶
A dataset with a higher quality weight can be used to curate a dataset with a lower weight. ❷ The final quality
weight of a record is calculated by multiplying the weight of the record itself by the assigned
weight of its dataset. ❸ Records with similar solubility values for a given molecule can be merged by averaging
their solubility values over their weights.
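As a minimal sketch of guidelines ❷ and ❸ (the function names are ours, not taken from the released code):

```python
def final_weight(record_weight, dataset_weight):
    # Guideline 2: the final quality weight of a record is the record's
    # own weight times the assigned weight of its dataset.
    return record_weight * dataset_weight

def weighted_merge(solubilities, weights):
    # Guideline 3: records with similar solubility values for one molecule
    # are merged by averaging the values over their weights; the merged
    # record carries the summed weight.
    total = sum(weights)
    merged = sum(s * w for s, w in zip(solubilities, weights)) / total
    return merged, total
```

For example, merging two records with LogS values 1.0 and 2.0 and weights 1.0 and 3.0 yields a merged value of 1.75 with weight 4.0.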
First, a curation schedule following guideline ❶ is designed, as demonstrated in Fig. 3. Previously, we
divided the six thermodynamic datasets into two groups: a high-quality and a low-quality group. As illustrated in Fig. 3,
a dataset can be curated with other datasets in the same group that have higher or equal weights, which is denoted
as intra-group curation. A dataset in the high-quality group can also be used to curate a dataset in the low-quality
group, which we refer to as inter-group curation. No other operations are allowed.
Second, a record clustering and curation workflow is adopted to implement guideline ❷. Given a set of n
cleaned datasets D[i], each record is first assigned its quality weight; our workflow aims to curate D[n − 1] with datasets D[0],
…, D[n − 2]. The curation workflow contains three steps: (1) We merge all input datasets D[i] and retain all
records whose compound is contained in dataset D[n − 1] as a new dataset T. (2) For each molecule
with multiple solubility values, a partial clustering algorithm, illustrated in Algorithm 1, is adopted to merge
these records. Then, we update the solubility values and weights with the equations listed in lines 5 and 6 of
Algorithm 1 for each molecule in T. (3) We accumulate the total weight of each molecule and truncate total
weights at a given threshold. Then, the weights of each record in T are normalized. By adjusting the
threshold, molecules occurring in multiple datasets and thus accumulating total weights larger than the
threshold become highly valued, and molecules with total weights less than the threshold become devalued.
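One plausible reading of step (3), sketched with names of our own choosing (the released code may implement the truncation differently):

```python
from collections import defaultdict

def truncate_and_normalize(records, threshold):
    """records: list of (molecule_id, weight) pairs in the merged set T.
    Accumulate the total weight per molecule, cap totals at `threshold`,
    then rescale each record so its molecule's total equals the capped
    value: frequently observed molecules stay highly valued, while
    molecules exceeding the cap are truncated."""
    totals = defaultdict(float)
    for mol, w in records:
        totals[mol] += w
    capped = {m: min(t, threshold) for m, t in totals.items()}
    return [(m, w * capped[m] / totals[m]) for m, w in records]
```

With a threshold of 2.0, a molecule observed twice with weights 1.0 and 3.0 is rescaled to weights 0.5 and 1.5 (total 2.0), while a molecule seen once with weight 0.5 is left unchanged.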
Third, the partial clustering algorithm mentioned above is designed, as presented in Algorithm 1, to curate
the records following guideline ❸. In each iteration of the while loop, the two closest solubility values for a given molecule are
selected and merged if their difference is less than a given parameter d. The two records are merged by averaging
their solubility values over their weights, and their two weights are summed to form the new quality weight. If the
difference between the two closest values is larger than d, the while loop ends and the merged records are kept
as the new records. For the parameter d, we recommend 0.5, as suggested in ref. 11.
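The loop described above can be sketched as follows; this is our reconstruction of the merging rule in Algorithm 1, not the released implementation:

```python
def partial_cluster(values, weights, d=0.5):
    """Repeatedly merge the two closest solubility values for a single
    molecule while their difference is at most d (d = 0.5 is the
    recommended setting). A merge averages the two values over their
    weights and sums the weights into the new record."""
    values, weights = list(values), list(weights)
    while len(values) > 1:
        # find the closest pair among the values sorted ascending
        order = sorted(range(len(values)), key=lambda i: values[i])
        p = min(range(len(order) - 1),
                key=lambda q: values[order[q + 1]] - values[order[q]])
        i, j = order[p], order[p + 1]
        if values[j] - values[i] > d:
            break  # no pair is close enough; stop merging
        w = weights[i] + weights[j]
        v = (values[i] * weights[i] + values[j] * weights[j]) / w
        for idx in sorted((i, j), reverse=True):
            del values[idx], weights[idx]
        values.append(v)
        weights.append(w)
    return values, weights
```

For instance, the values [1.0, 1.2, 5.0] with unit weights collapse to two records: 1.0 and 1.2 merge into 1.1 with weight 2.0, while 5.0 survives untouched because its gap to 1.1 exceeds d.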
Fig. 3 Data curation schedule for the 6 thermodynamic datasets. The datasets are divided into 2 groups: a high-quality
and a low-quality group. Two curation operations, i.e., inter-group curation and intra-group curation, are
illustrated. The feasible curation operations for each dataset are denoted by the lines. For example, AQUA can be
curated with the AQUA, PHYS, and ESOL datasets, and AQSOL can be curated with all datasets in the high-quality
group, and with CHEMBL.
Algorithm 1 Partial Clustering Algorithm.
The above workflow is developed and open-sourced in our repository at
https://github.com/Mengjintao/Chemprop. The seven curated datasets are collected by applying this
workflow, and weighted Chemprop is then trained on these datasets. For the best ensembled models, the solubility
prediction accuracy values measured in terms of the RMSE are summarized in Table 3. The lowest RMSE
value is recorded for KINECT, being as low as 0.432 (with a confidence interval of 0.003). ESOL is a widely used
benchmark in previous research, and its RMSE score decreases from 0.596 (0.56 reported by Chemprop with
Bayesian optimization, ref. 24) to 0.512, i.e., a 0.084 LogS unit decline after data curation. On the other datasets with a
random data partition, the RMSE values of weighted Chemprop benefit from declines of 0.037, 0.035,
0.026, 0.505, and 0.418, respectively, on the curated AQUA, PHYS, OCHEM, AQSOL, and CHEMBL datasets.
With a scaffold data partition, the RMSE values decrease by 0.153, 0.142, 0.152, 0.477, and 0.485, respectively.
The model trained on the curated KINECT dataset, however, records an increase in the RMSE under both
random and scaffold data partitions, as the KINECT dataset is the only set of kinetic solubility data; hence, no
other dataset can be used to curate it. Moreover, the limited inter-dataset redundancy of the KINECT dataset
demonstrated in Fig. 2 also restricts our curation benefits. Even with these limitations, the KINECT
dataset still yields the lowest RMSE score among all datasets under both random and scaffold data partitions.
In addition to Chemprop, we include another recently developed deep learning method, AttentiveFP,
in our evaluation. AttentiveFP follows a traditional graph learning mechanism and captures non-local effects at
the intra-molecular level by applying a graph attention mechanism with multiple GRU layers. We also extend
the code of AttentiveFP to support data quality weights during training and evaluation. The GitHub repository
of weighted AttentiveFP is https://github.com/Mengjintao/AttentiveFP. An evaluation workflow similar to that
of Chemprop is used, ensembling multiple AttentiveFP models over several folds. The RMSE values and confidence
intervals of AttentiveFP on all 7 datasets are collected in Table 4, which shows a similar trend of decreasing
RMSE values. For example, AttentiveFP trained on the curated AQUA, PHYS, ESOL, OCHEM, and
AQSOL datasets achieves 0.067, 0.095, 0.03, 0.043, and 0.242 LogS unit decreases in the RMSE compared with the
original datasets using a scaffold data partition.
All the evaluations presented in Tables 3 and 4 employ hyperparameter optimization with a grid
search approach that randomly samples 108 parameter combinations over five key parameters;
the lowest RMSE value is recorded. A larger search space might decrease the RMSE further but would
not change the trends demonstrated in Tables 3 and 4; thus, we keep the same number of parameter combinations
in our search space throughout this work and do not enlarge it, to reduce the training time and
computing resources.
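The sampling scheme can be sketched as follows; the parameter grid shown is illustrative only, not the actual five-parameter search space used in the paper:

```python
import itertools
import random

def sample_grid(grid, k=108, seed=0):
    # Enumerate the full Cartesian product of the parameter grids,
    # then keep k randomly chosen, distinct combinations.
    keys = list(grid)
    combos = list(itertools.product(*grid.values()))
    random.Random(seed).shuffle(combos)
    return [dict(zip(keys, c)) for c in combos[:k]]

# Illustrative grid over five hypothetical hyperparameters (324 combos).
grid = {
    "depth": [3, 4, 5, 6],
    "hidden_size": [300, 600, 1200],
    "dropout": [0.0, 0.1, 0.2],
    "ffn_layers": [1, 2, 3],
    "lr": [1e-4, 5e-4, 1e-3],
}
samples = sample_grid(grid, k=108)
```

Each trained configuration is then evaluated and the one with the lowest RMSE is kept.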
Disparate statistical measurements and the scarcity of high-quality datasets are the main obstacles to an objective
comparison between deep learning and QM-QSPR approaches in terms of solubility prediction. To conduct
such a comparison, a dataset of 48 molecules is gathered from several previous works (refs. 2, 41, 42). This dataset
comprises four pharmaceutical series, and none of the 48 molecules is contained in the 7 collected datasets. Pearson and
Spearman rank-order correlation coefficients are used to evaluate the performance of the deep learning and
QM-QSPR approaches.
The correlation coefficients between the predicted and observed values are the main concern for lead optimization
in compound design. The thermodynamic cycle solubility approach is a fundamental theory used in the
QM-QSPR approaches. In this approach, the log-scale aqueous solubility value is linearly related to the
sublimation and hydration free energies. QM-QSPR approaches mainly focus on searching for extremely
accurate methods to calculate the sublimation and hydration free energies using physics-based simulation, at the
cost of enormous supercomputing power or quantum computation. Thus, instead of predicting absolute
solubility values, the main goal of the QM-QSPR approaches is to evaluate the correlation of the
solubility value with its two energy factors and then apply it in lead optimization. Two measurements are
recommended by one state-of-the-art work (ref. 2) to evaluate this correlation: the square of the Pearson
correlation coefficient, r2, and Spearman's rank-order correlation coefficient, RS. The Pearson correlation
coefficient r is given by
$$r = \frac{\sum_{i=0}^{n-1}(x_i - \bar{x})(y_i - \bar{y})}{\left(\sum_{i=0}^{n-1}(x_i - \bar{x})^2\right)^{1/2}\left(\sum_{i=0}^{n-1}(y_i - \bar{y})^2\right)^{1/2}}. \qquad (1)$$
Here, x is the vector of predicted values, y is the vector of true values, and $\bar{x}$ and $\bar{y}$ are the averages
of x and y, respectively. An r2 equal to 1 indicates a perfect linear correlation between the observed and
predicted solubility values. Spearman's rank-order correlation coefficient RS can be calculated as
$$R_S = 1 - \frac{6\sum_{i=0}^{n-1} d_i^2}{n(n^2 - 1)}, \qquad (2)$$
where $d_i$ is the difference between the ranks of the measured and predicted solubilities of molecule i. An RS
equal to 1 indicates a perfect ranking of the predicted solubility values.
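Both coefficients are straightforward to compute directly from Eqs. (1) and (2); a minimal sketch follows (the tie-free ranking in spearman_rs is a simplification of our own):

```python
import math

def pearson_r2(x, y):
    # Squared Pearson correlation (Eq. 1) between predictions x
    # and observations y.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return (num / den) ** 2

def spearman_rs(x, y):
    # Rank-order correlation via the rank-difference formula (Eq. 2);
    # assumes no tied values in either vector.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A perfectly linear prediction gives r2 = 1, and a perfectly ordered one gives RS = 1; in production code, library routines that handle rank ties should be preferred.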
We compare the deep learning and QM-QSPR approaches on r2 and RS with the evaluation dataset of 48
molecules. The ensembled models yielding the best RMSE values in Tables 3 and 4 are used to predict this evaluation
dataset with weighted Chemprop (Chemprop expanded to support data quality weights). In this evaluation dataset, 12
Benzoylphenylurea (BPU) derivatives and 19 Benzodiazepine (BDZ) derivatives comprise the first 31 molecules (ref. 2).
Seven selective Cyclin-Dependent Kinase 12 (CDK) inhibitors (ref. 42) and 10 Pyrazole and Cyanopyrrole
Analogs (PCAs) (ref. 41) comprise the last 17 molecules. We collect the statistical results of r2 and RS on these 48 molecules
for weighted Chemprop and plot them in Figs. 4 and 5, respectively. Note that the r2 values of the QM-QSPR
approach proposed in ref. 2 are 0.79, 0.83, and 0.905 on the BPU, BDZ, and BPU&BDZ datasets, respectively;
the corresponding RS scores are 0.87, 0.90 and 0.967. Currently, no statistical results have been reported on PCAs
or CDK inhibitors by any of the QM-QSPR approaches.
In Fig. 4, the r2 curves of weighted Chemprop on BPU, BDZ, and BPU&BDZ increase steadily to 0.90, 0.62,
and 0.93, respectively. The CDK curve for weighted Chemprop increases to 0.48 on the CHEMBL dataset.
For PCAs, the curves of both Chemprop and weighted Chemprop show no clear correlation due to the small
data size. The jittery curves of Chemprop, which mostly have lower r2 values, reveal that low-quality data in the training
Split Type   Dataset   RMSE & Confidence Intervals
                       Org              Cln              Cure
Random       AQUA      0.616 ± 0.027    0.639 ± 0.014    0.579 ± 0.020
Random       PHYS      0.649 ± 0.019    0.643 ± 0.013    0.551 ± 0.024
Random       ESOL      0.642 ± 0.017    0.641 ± 0.025    0.594 ± 0.022
Random       OCHEM     0.6018 ± 0.012   0.651 ± 0.020    0.6016 ± 0.010
Random       AQSOL     0.826 ± 0.027    0.760 ± 0.012    0.593 ± 0.004
Scaffold     AQUA      0.743 ± 0.038    0.747 ± 0.031    0.676 ± 0.038
Scaffold     PHYS      0.782 ± 0.037    0.789 ± 0.037    0.687 ± 0.038
Scaffold     ESOL      0.761 ± 0.048    0.801 ± 0.043    0.731 ± 0.073
Scaffold     OCHEM     0.746 ± 0.011    0.779 ± 0.019    0.703 ± 0.016
Scaffold     AQSOL     0.872 ± 0.017    0.842 ± 0.019    0.630 ± 0.008
Table 4. The collected RMSE values and confidence intervals of AttentiveFP (ref. 29) when trained on the 7 datasets. The
data partition strategies include both random and scaffold partitioning, and the partition ratio is [0.8, 0.1, 0.1]
for training, testing, and evaluation. In this experiment, 5 models are ensembled 8 times to average the RMSE
values and calculate the corresponding confidence intervals. Because AttentiveFP is time-consuming on very
large datasets, the CHEMBL and KINECT datasets are not recorded, as their training times exceed 150
hours. The original AttentiveFP is used on the "Org" datasets, and the weighted AttentiveFP is applied on both the "Cln"
and "Cure" datasets.
Fig. 4 Comparison of r2 values for the ensembled models with the best RMSE scores in Table 3 for Chemprop (left
figure) and weighted Chemprop (right figure) when predicting the 48 molecules.
datasets affect the model performance. Specifically, data curation has a negative effect on r2 for some
datasets, for example, the CHEMBL dataset. This outcome may indicate that the actual data quality of this dataset
is higher than the value we set, and thus the data may be polluted by other datasets, resulting in poor
performance. When comparing Chemprop on the original datasets with weighted Chemprop on the curated datasets,
the results in the left panel of Fig. 4 show no clear trend or gradation with either increasing training dataset size
(x axis) or increasing prediction dataset size over the BPU, BDZ, and BPU&BDZ sets (y axis). In the right panel of
Fig. 4, however, two trends can be confirmed. First, along the x direction, the
r2 value increases steadily as the training dataset grows from AQUA, with about one thousand
compounds, to KINECT, with hundreds of thousands. Second, along the y direction, the r2 value on the BPU&BDZ set
of 31 compounds is larger than on BPU or BDZ alone in most cases across the 7 datasets. Moreover, the r2 value of BPU
with 19 compounds is larger than that of BDZ with 12 compounds. Thus, there is a clear gradation with increasing
prediction dataset size on our curated datasets.
In Fig. 5, the RS curves of BPU, BDZ, and BPU&BDZ converge to 0.59, 0.89, and 0.947, respectively, with
increasing data size when using weighted Chemprop. The RS values of PCAs and CDK increase to 0.58 and 0.63
on the CHEMBL dataset and decrease to 0.4 and −0.18 on the KINECT dataset, respectively. One can see that the
KINECT dataset yields a negative performance on RS when predicting the PCA and CDK values for both weighted
Chemprop and Chemprop. The unstable r2 and RS values around 0 for CDK confirm that the graph learning model
of Chemprop fails to track the physicochemical features of these molecules in terms of solubility. From Figs. 4 and 5,
weighted Chemprop demonstrates a clear prediction performance gradation on the BPU, BDZ, CDK and PCA
molecules, whereas Chemprop with the original datasets does not.
The above comparison confirms that the r2 and RS values for CDK and PCA are noisy; these two sets, with 7
and 10 elements respectively, are too small to support a sound comparison. However, given enough
compounds, both the r2 and RS values on the BPU & BDZ set are high, above 0.9. Since r2 and RS are used
to evaluate the correlation between predicted and observed values rather than the absolute solubility values,
we conjecture that intrinsic solubility and kinetic solubility can have different absolute values but
can still share the same trend in their correlation coefficients. Thus, we did not distinguish between thermodynamic
and kinetic solubility in our training datasets (AQUA, PHYS, ESOL, OCHEM, AQSOL, CHEMBL,
KINECT) or the test dataset (BPU, BDZ, CDK, and PCA). Note that it is still recommended to avoid mixing
kinetic and thermodynamic solubility data in one training or test dataset. A larger dataset would allow a better
evaluation, but currently no other open data are available.
In terms of running time, predicting these 48 molecules with weighted Chemprop requires
approximately 1.34 seconds in total, or 0.028 seconds per molecule on average, on a single desktop computer,
as listed in Table 5. QM-QSPR approaches such as the QM-based methods of ref. 2 rely on
a cloud infrastructure of millions of CPU cores; however, no running time can be recorded, as their method is
commercial and not publicly available. Thus, the availability of open-source methods and the dramatically lower
usage of computing resources are additional advantages of applying deep learning models.
Fig. 5 Comparison of RS values for the ensembled models with the best RMSE scores in Table 3 for Chemprop (left
figure) and weighted Chemprop (right figure) when predicting the 48 molecules.
Desktop                        Time Usage (in seconds)
CPU             GPU            Evaluation (48)   ESOL (1128)   AQSOL (9982)
E3-1225 v6      —              1.28              8.11          86.56
E3-1225 v6      Quadro P400    1.34              8.49          86.07
Platinum 8180   —              0.70              9.98          107.93
Platinum 8180   GTX 1050Ti     0.61              8.27          86.28
Platinum 8180   Tesla T4       0.62              8.42          91.20
Table 5. Statistical time usage (averaged over 100 rounds) of predicting the compounds in the evaluation, ESOL, and
AQSOL datasets with weighted Chemprop on three computers. The numbers of molecules contained in these
datasets are 48, 1311, and 9982, respectively. The time usage is measured in seconds. The efficiency of the prediction
workload is about 4% on a Tesla T4, 6% on a GTX 1050Ti, and 9% on a Quadro P400; thus, for such unsaturated
workloads, the running time depends only weakly on the GPU card.
To conclude, from seven collected large-scale aqueous solubility datasets and the proposed data curation
methodology, seven high-quality curated datasets with quality weights are generated. Deep learning methods,
including both Chemprop and AttentiveFP, show a dramatic increase in predictive accuracy measured in
RMSE, as demonstrated in detail in the Methods section. More importantly, using the ensembled
models with the best RMSE, deep learning methods benefit from the curated datasets, with a steady improvement in r2
and RS as the training data volume increases. Deep learning methods also demonstrate superior performance
on r2 and comparable performance on RS when predicting BPU and BDZ derivatives for lead compound
optimization, compared with QM-QSPR approaches such as ref. 2. A clear prediction performance ranking demonstrating
the capacity of deep learning methods on the four compound series is also obtained with the curated datasets.
For example, deep learning methods do not function well on PCA and CDK derivatives, where the QM-QSPR
approaches have not yet demonstrated their capacity either. A clear advantage of the deep learning approach is its running
time: predicting thousands of target compounds takes only seconds on a common desktop computer,
whereas physics-based approaches require large compute resources and a long running time.
Discussion
Previously, both AI and drug design experts have focused on molecular property prediction. However, they are
interested in quite different issues, as illustrated in Table 6. Abundant high-quality data and high predictive
accuracy under their own measurement standards are the main concerns of AI experts. Drug design experts
are more interested in the real-world effects of the method itself: how good are the correlation coefficients
in compound lead optimization, what is the generalization ability on different series of in-house compounds,
what computing resources and running time are required to make predictions, and, finally, is the method freely
available or open-sourced. This work attempts to bridge this gap for one such sub-problem, aqueous
solubility prediction.
Currently, the QM-QSPR approaches are the dominant techniques for aqueous solubility prediction in drug
design. Several research works have demonstrated improvements with AI techniques. However, despite these
continuous improvements in predictive accuracy achieved with AI, conservative drug design experts remain
concerned about the real ability of deep learning compared with QM-QSPR approaches on their
in-house datasets. This work contributes to resolving these concerns from both the deep learning and drug
design sides. From the deep learning side, we increased the data volume of aqueous solubility datasets from
thousands to hundreds of thousands of molecules, refined the data quality of the datasets with a data curation
method, and thereby improved the solubility predictive accuracy dramatically under the traditional RMSE
measurement. From the drug design side, this work builds a bridge by constructing a mechanism to compare
QM-QSPR and deep learning approaches on correlation coefficients using state-of-the-art solubility evaluation
datasets. The graph learning method of expanded Chemprop trained on a curated dataset demonstrates
steady performance on the correlation coefficients r2 and RS, comparable to that of the QM-QSPR
approaches, while using orders of magnitude less compute and being available for public evaluation.
The comparison also confirms that the generalization ability of the deep learning approach is good on BPU and BDZ
derivatives but still limited on PCA and CDK derivatives, which demands further research effort on both sides.
This work also reveals a turning point in molecular property prediction at which the deep learning and
QM-QSPR approaches should be jointly co-developed. For example, topology-based graph learning and crystal-
3D-structure-based deep learning may integrate both topology and crystal 3D features in solubility prediction
with a promising accuracy improvement. One can also extend this work to other molecular properties to better
understand natural phenomena with the help of both QM-QSPR and deep learning methods.
Usage Notes
Reproducibility of the curation algorithm, training workflow, and performance evaluation can be verified by executing
the scripts described in the README of our project SolCuration at https://github.com/Mengjintao/SolCuration.
The code has been developed and tested using Python 3.7 on the Linux operating system and is available under the BSD
3-Clause License. All the datasets are also provided in this repository for further research on this problem.
Data availability
The original, cleaned, and curated datasets for the 7 selected data sources presented in this paper are publicly available
on GitHub at https://github.com/Mengjintao/SolCuration and can be cited via ref. 43.
Code availability
The Python and C++ code used to perform the data curation, training workflow, and performance evaluation shown in
this manuscript is publicly available on GitHub at https://github.com/Mengjintao/SolCuration and can be cited
via ref. 43.
AI experts              Drug design experts
Data volume             Correlation coefficients in compound lead optimization
Data quality            Generalization ability on different series of compounds
Measurement standard    Computing resources and running time
Predictive accuracy     Open-source availability
Table 6. Differences between the issues of concern to AI experts and drug design experts.
Received: 5 October 2020; Accepted: 25 January 2022;
Published: xx xx xxxx
References
1. Lipp, R. The innovator pipeline: bioavailability challenges and advanced oral drug delivery opportunities. Am Pharm Rev 16, 10–12 (2013).
2. Abramov, Y. A., Sun, G., Zeng, Q., Zeng, Q. & Yang, M. Guiding lead optimization for solubility improvement with physics-based modeling. Molecular Pharmaceutics (2020).
3. Wang, J. & Hou, T. Recent advances on aqueous solubility prediction. Combinatorial Chemistry & High Throughput Screening 14, 328–338 (2011).
4. Salahinejad, M., Le, T. C. & Winkler, D. A. Aqueous solubility prediction: do crystal lattice interactions help? Molecular Pharmaceutics 10, 2757–2766 (2013).
5. Jorgensen, W. L. & Duffy, E. M. Prediction of drug solubility from structure. Advanced Drug Delivery Reviews 54, 355–366 (2002).
6. Hossain, S., Kabedev, A., Parrow, A., Bergström, C. & Larsson, P. Molecular simulation as a computational pharmaceutics tool to predict drug solubility, solubilization processes and partitioning. European Journal of Pharmaceutics and Biopharmaceutics (2019).
7. Tetko, I. V., Villa, A. E. & Livingstone, D. J. Neural network studies. 2. Variable selection. Journal of Chemical Information and Computer Sciences 36, 794–803 (1996).
8. Palmer, D. S., O'Boyle, N. M., Glen, R. C. & Mitchell, J. B. Random forest models to predict aqueous solubility. Journal of Chemical Information and Modeling 47, 150–158 (2007).
9. Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2224–2232 (2015).
10. Kier, L. B. et al. Molecular Connectivity in Structure-Activity Analysis (Research Studies, 1986).
11. Tetko, I. V., Tanchuk, V. Y., Kasheva, T. N. & Villa, A. E. Estimation of aqueous solubility of chemical compounds using E-state indices. Journal of Chemical Information and Computer Sciences 41, 1488–1493 (2001).
12. Palmer, D. S. et al. Predicting intrinsic aqueous solubility by a thermodynamic cycle. Molecular Pharmaceutics 5, 266–279 (2008).
13. Palmer, D. S., McDonagh, J. L., Mitchell, J. B., van Mourik, T. & Fedorov, M. V. First-principles calculation of the intrinsic aqueous solubility of crystalline druglike molecules. Journal of Chemical Theory and Computation 8, 3322–3337 (2012).
14. Buchholz, H. K. et al. Thermochemistry of racemic and enantiopure organic crystals for predicting enantiomer separation. Crystal Growth & Design 17, 4676–4686 (2017).
15. Docherty, R., Pencheva, K. & Abramov, Y. A. Low solubility in drug development: de-convoluting the relative importance of solvation and crystal packing. Journal of Pharmacy and Pharmacology 67, 847–856 (2015).
16. Park, J. et al. Absolute organic crystal thermodynamics: growth of the asymmetric unit into a crystal via alchemy. Journal of Chemical Theory and Computation 10, 2781–2791 (2014).
17. Perlovich, G. L. & Raevsky, O. A. Sublimation of molecular crystals: prediction of sublimation functions on the basis of HYBOT physicochemical descriptors and structural clusterization. Crystal Growth & Design 10, 2707–2712 (2010).
18. Skyner, R., McDonagh, J., Groom, C., van Mourik, T. & Mitchell, J. A review of methods for the calculation of solution free energies and the modelling of systems in solution. Physical Chemistry Chemical Physics 17, 6174–6191 (2015).
19. Zhang, P. et al. Harnessing cloud architecture for crystal structure prediction calculations. Crystal Growth & Design 18, 6891–6900 (2018).
20. Morgan, H. L. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation 5, 107–113 (1965).
21. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 50, 742–754 (2010).
22. Glen, R. C. et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs 9, 199 (2006).
23. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chemical Science 9, 513–530 (2018).
24. Yang, K. et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling 59, 3370–3388 (2019).
25. Feinberg, E. N. et al. PotentialNet for molecular property prediction. ACS Central Science 4, 1520–1530 (2018).
26. Kearnes, S., McCloskey, K., Berndl, M., Pande, V. & Riley, P. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design 30, 595–608 (2016).
27. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 28, 31–36 (1988).
28. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212 (2017).
29. Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry (2019).
30. Avdeef, A. Suggested improvements for measurement of equilibrium solubility-pH of ionizable drugs. ADMET and DMPK 3, 84–109 (2015).
31. Bergström, C. A. & Larsson, P. Computational prediction of drug solubility in water-based systems: qualitative and quantitative approaches used in the current drug discovery and development setting. International Journal of Pharmaceutics 540, 185–193 (2018).
32. Wenlock, M. C., Austin, R. P., Potter, T. & Barton, P. A highly automated assay for determining the aqueous equilibrium solubility of drug discovery compounds. JALA: Journal of the Association for Laboratory Automation 16, 276–284 (2011).
33. Eriksson, L. et al. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environmental Health Perspectives 111, 1361–1375 (2003).
34. Huuskonen, J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. Journal of Chemical Information and Computer Sciences 40, 773–777 (2000).
35. Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences 44, 1000–1005 (2004).
36. Saal, C. & Petereit, A. C. Optimizing solubility: kinetic versus thermodynamic solubility temptations and risks. European Journal of Pharmaceutical Sciences 47, 589–595 (2012).
37. Mansouri, K., Grulke, C., Richard, A., Judson, R. & Williams, A. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR and QSAR in Environmental Research 27, 911–937 (2016).
38. Sorkun, M. C., Khetan, A. & Er, S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific Data 6, 1–8 (2019).
39. Zaleska, B. et al. Synthesis of zwitterionic compounds: fully saturated pyrimidinylium and 1,3-diazepinylium derivatives via the novel rearrangement of 3-oxobutanoic acid thioanilide derivatives. The Journal of Organic Chemistry 67, 4526–4529 (2002).
40. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry 39, 2887–2893 (1996).
41. Kawahata, W. et al. Design and synthesis of novel amino-triazine analogues as selective Bruton's tyrosine kinase inhibitors for treatment of rheumatoid arthritis. Journal of Medicinal Chemistry 61, 8917–8933 (2018).
42. Ito, M. et al. Discovery of 3-benzyl-1-(trans-4-((5-cyanopyridin-2-yl)amino)cyclohexyl)-1-arylurea derivatives as novel and selective cyclin-dependent kinase 12 (CDK12) inhibitors. Journal of Medicinal Chemistry 61, 7710–7728 (2018).
43. Meng, J. SolCuration. figshare https://doi.org/10.6084/m9.figshare.14766909 (2021).
Acknowledgements
This work was partly supported by the National Key Research and Development Program of China under Grant
No. 2018YFB0204403, Strategic Priority CAS Project XDB38050100, the National Science Foundation of China
under grant No. U1813203, the Shenzhen Basic Research Fund under grant Nos. RCYX2020071411473419,
KQTD20200820113106007 and JSGG20190220164202211, and the CAS Key Lab under grant No. 2011DP173015.
This work was also partly supported by JST PRESTO under grant No. JPMJPR20MA, JSPS KAKENHI under grant
No. JP21K17750, and AIST Emerging Research under grant No. AAZ2029701B, Japan. We would like to thank
Dr. Kamel Mansouri from Integrated Laboratory Systems, Inc. for providing the curated PHYSPROP datasets. We
also thank the editors and reviewers for their professional comments, which have greatly improved this
manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to P.C., Y.W. or S.F.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2022