Master Thesis
Statistical modelling of the
biodegradability of organic compounds
Leuphana Professional School
Julian M. Kleber, Matr.-Nr.: 3041059
Sustainable Chemistry
Stettiner Str. 57, 13357 Berlin
julian.m.kleber@gmail.com
Supervisors:
Prof. Dr. Marco Reich
Ann-Kathrin Amsel
August 7, 2022
Acknowledgement
I want to say thank you to everyone supporting me and my work up to now. Thank you, Gerhard, for introducing me to your group and providing the computational resources needed to complete this work. Thank you, Lisa and Myriam, for having my back during the work-intensive phases of my education. Thank you, Klaus, for your valuable feedback and for giving me room to grow.
I thank my family for supporting me. I thank my parents, Jutta and Arno, as well as my grandmother Margarete for their everlasting support. Thank you, Joshua, for trying out even the craziest ideas from lean drug development with me. Thank you, Nina, for always holding on to me. May you never be in danger of a disease again. Thank you, Anke, thank you, Ulrich, for all the family gatherings. They were always wonderful and strengthening.
I want to express my sincerest gratitude to my supporters who provided valuable sparring, coaching, time, emotions, and resources.
Contents
List of Figures
Acronyms
Abstract
1 Introduction
1.1 Biodegradation and sustainability
1.2 Regulation
1.3 Water scarcity
1.4 Closed bottle test
1.5 Machine learning and artificial intelligence
1.6 Computer-aided drug design
1.6.1 Virtual screening
1.6.2 Molecular dynamics
1.6.3 Docking
1.6.4 Quantitative structure activity relationship
1.6.5 Representation Learning
1.7 Biodegradability and computer aided drug design (CADD)
1.8 Modelling of biodegradability
1.9 Aims and scope
2 Methods
2.1 Software Engineering
2.2 Data preparation
2.3 Machine Learning Algorithms
2.4 Logistic Regression
2.4.1 K-means algorithm
2.4.2 K-nearest neighbor clustering
2.4.3 Support vector machine
2.4.4 Spectral clustering
2.4.5 Random forest algorithm
2.4.6 Naive Bayes algorithm
2.4.7 Summary of machine learning algorithms
3 Results
3.1 Established molecular descriptors
3.2 Morgan fingerprints
3.3 Screening of descriptors
3.4 Comparison of the applicability domain
3.5 Comparison of the descriptiveness
3.6 Explainability
4 Discussion
4.1 Feature spaces
4.2 Data impurities
4.2.1 Explainability
5 Conclusion
5.1 Summary
5.2 Project
5.3 Compliance and applicability
5.4 Outlook
6 Declaration of independent writing
List of Figures
1.3.1 Water scarcity map for the whole column for Germany based on modelling by Marx et al.[18], obtained from their website[19]. Lighter colors indicate water abundance, whereas darker colors indicate water scarcity.
1.5.1 Typical pipeline of a data science project.
1.5.2 The typical lean business cycle[23].
1.6.1 General workflow of the molecular dynamics (MD) algorithm.
1.6.2 General outline of the docking procedure.
2.1.1 SOLID software design pattern for modelling readily biodegradable (RB).
2.2.1 Polar plots visualizing the enhanced Lipinski's rule of five[53] for the dataset of biodegradability from Lunghini et al.[62]. The plots demonstrate the influence of the CADD descriptors on RB. The red dashed line indicates the dataset average and the two black hexagons indicate the borders of the sweet spot according to Lipinski.
2.4.1 Logistic regression with different solvers on the IRIS dataset[76].
2.4.2 Different methods of the k-means algorithm on the IRIS dataset[76].
2.4.3 Different methods of the k-nearest neighbors algorithm (KNN) on the IRIS dataset.
2.4.4 Different kernels for a support vector machine (SVM) on the IRIS dataset[81].
2.4.5 The plot shows different SVM models with different polynomial kernels to compare the non-linear properties of these kernels.
2.4.6 Spectral clustering on the IRIS dataset with three different classifiers.
2.4.7 Schematic representation of a decision tree[85].
2.4.8 Schematic representation of the random forest algorithm[85].
2.4.9 Classification performance of the random forest classifier (RFC) with different loss functions.
2.4.10 Comparison of different naive Bayes (NB) algorithms on the IRIS dataset[76].
2.4.11 Receiver operating characteristic (ROC) and area under the curve (AUC) curves for three different non-linear classifiers with standard parameters on the IRIS dataset[76].
3.1.1 Extraction of the feature importance applying the molecular descriptors used by Ruiz-Moreno et al.[53] using the random forest classifier implemented in chembee.
3.1.2 Decision boundary of the SVM fitting the biodegradability dataset with the features logP and the inertial shape factor using chembee.
3.1.3 Decision boundary of the RFC fitting the biodegradability dataset with the features logP and the inertial shape factor using chembee.
3.1.4 Decision boundary of the KNN fitting the biodegradability dataset with the features logP and the inertial shape factor using chembee.
3.1.5 Decision boundary of the multilayer perceptron (MLP) fitting the biodegradability dataset with the features logP and the inertial shape factor using chembee.
3.1.6 Comparison of RFC, KNN, and SVM on the metrics accuracy, precision, recall, specificity, and f-score on the CADD descriptors[62].
3.1.7 ROC-AUC curves for different classifiers trained over five new splits of the dataset using the hyperparameters obtained from the cross-validated grid search. The gray area indicates one standard deviation.
3.2.1 Comparison of RFC, KNN, and SVM on the metrics accuracy, precision, recall, specificity, and f-score using Morgan fingerprints as the feature space.
3.2.2 ROC-AUC curves for different classifiers trained over five new splits of the dataset using the hyperparameters obtained from the cross-validated grid search. The gray area indicates one standard deviation.
3.3.1 Extraction of the feature importance, applying the molecular descriptors of mordred on RDKit mols, using the random forest classifier implemented in chembee.
3.3.2 Comparison of RFC, KNN, and SVM on the metrics accuracy, precision, recall, specificity, and f-score using filtered mordred descriptors.
3.3.3 ROC curves for different classifiers trained over five new splits of the dataset using only the filtered features and the hyperparameters obtained from the cross-validated grid search. The gray area indicates one standard deviation.
3.3.4 Decision boundary of the classifiers RFC, KNN, and SVM tuned on filtered mordred descriptors fitted on logP and inertial shape factor.
3.3.5 Screen for data skewness using a stratified 120-fold cross-validation on the optimized RFC.
3.4.1 Comparison of the applicability domain of each feature set introduced by different workflows introducing different features to the biodegradation dataset by Lunghini et al.[62] For each confidence interval, the cumulative counts are shown.
3.5.1 Comparison of each feature space (extended Lipinski descriptors[53], Morgan fingerprints, and screened descriptors) for the complete biodegradation dataset of Lunghini et al.[62]. For all cases, an RFC was fitted with the cross-validated hyperparameters for each distinct feature set. The top row shows all false predictions for six (left) and 100 iterations (right). The lower row shows the unique compounds that are wrongly predicted.
3.5.2 False predictions by compound for the feature space consisting of the enhanced Lipinski fingerprints by Ruiz-Moreno et al.[53] for 100 iterations on the whole dataset.
3.5.3 False predictions by compound for the feature space consisting of Morgan fingerprints for 100 iterations on the whole dataset.
3.5.4 False predictions by compound for the feature space consisting of the screened descriptors for 100 iterations on the whole dataset.
3.5.5 Compounds that show wrong predictions in the established CADD[53] descriptors and the screened mordred descriptors.
3.5.6 False predictions for each new chembee run either missing compound 2578 (OPERA) or compound 2579 (Ministry of International Trade and Industry (MITI)/National Institute of Technology and Evaluation (NITE)).
3.5.7 False predictions by compound for the feature space consisting of the screened descriptors for 100 iterations on the whole dataset missing entry 2578 (OPERA).
3.5.8 False predictions by compound for the feature space consisting of the screened descriptors for 100 iterations on the whole dataset missing entry 2579 (MITI/NITE).
3.5.9 Example decision tree for one estimator of the trained RFC ensemble, underlining the complexity of the modelling task concerned with RB classification.
3.5.10 Possible impurities in the dataset of Lunghini et al.[62], found by comparing the false predictions of the dataset without compound 2578 or 2579, respectively, multiple times. However, the extraction of likely structures was done manually and could not be done automatically in a reliable way. Each compound is labeled with its index after the removal of 2578 or 2579, its label regarding RB, CAS number when present, and source.
3.5.11 Part two of the data impurities discovered using chembee and acheeve. The first part is shown in Figure 3.5.10.
5.1.1 Rapid prototyping pipeline derived from the study of ready biodegradability using the hand-curated dataset of Lunghini et al.[62].
Acronyms
Mathematical Symbols
V: Potential energy
ε: Lennard-Jones parameter describing the depth of the potential (dispersion energy)
σ: Lennard-Jones parameter describing the particle size
r: Distance
q: Electric charge
J: Loss function
µ: Centroid position vector
x_i: Data point position vector
Abbreviations
GI Gastrointestinal
UN United Nations
SDG Sustainable development goals
POP Persistent organic pollutants
RB Readily biodegradable
NRB Non-readily biodegradable
DOC Dissolved organic carbon
CBT Closed bottle test
QSAR Quantitative structure activity relationship
SAR Structure activity relationship
SVM Support vector machine
PLSDA Partial least squares discriminant analysis
KNN K-nearest neighbors algorithm
LR Linear regression
MR Multiple linear regression
NB Naive Bayes
RF Random forest
MLP Multilayer perceptron
ML Machine learning
AI Artificial intelligence
MVP Minimum viable product
CADD Computer-aided drug design
LBDD Ligand-based drug design
HCV Hepatitis C virus
SBDD Structure-based drug design
MD Molecular dynamics
ADME Absorption, distribution, metabolism, and excretion
ROCS Rapid overlay of compound structures
OECD Organisation for Economic Co-operation and Development
ISO International Organization for Standardization
MITI Ministry of International Trade and Industry
NITE National Institute of Technology and Evaluation
EU European Union
US-EPA United States Environmental Protection Agency
REACH Registration, Evaluation, Authorisation and Restriction of Chemicals
ECHA European Chemicals Agency
SIEF Substance Information Exchange Forum
IBL Instance-based learning
RBF Radial basis function
RFC Random forest classifier
t-SNE t-distributed stochastic neighbor embedding
TPSA Topological polar surface area
NBA Naive Bayes algorithm
ROC Receiver operating characteristic
AUC Area under the curve
FAME Fatty acid methyl ester
Abstract
Modelling of biodegradability is crucial for designing safe chemicals and directly addresses 8 of the 17 sustainable development goals (SDG)[1]. As of today, the modelling is mostly done via closed and proprietary software with intransparent development practices. The quality of the underlying data is questionable, as there is no central or open scientific body that regulates and checks the data regularly. The lack of such a body thus limits the ability of scientists to derive new models. Especially for biodegradability, the underlying mechanisms are not yet fully understood.
This work explores how methods of CADD could help in understanding why a compound is RB or non-readily biodegradable (NRB). The applied methods include statistical modelling with state-of-the-art machine learning algorithms, CADD descriptors, and finally the development and application of the open-source and free software chembee[2] and acheeve[3]. The software development was done solely with open and free software as well. The models derived with chembee are compliant with EU regulations and explainable. Data impurities were identified using the workflows implemented in chembee and verified against the scientific literature and the European Chemicals Agency (ECHA) database.
1 Introduction
Throughout history, humanity has used available substances and modified them to achieve specific functions useful to it. Chemistry is still a basic natural science today and the lifeblood of any modern economy. Applications of chemistry include pharmaceuticals, agriculture, protection against severe environments (e.g., through textiles), and protection against black-swan events (energy storage such as batteries).
However, chemistry also has very destructive potential, ranging from warfare to pollution. Thus, with rising confidence about the consequences of chemicals for the environment, the pressure to invent more benign chemicals is rising significantly.
1.1 Biodegradation and sustainability
The problem of biodegradation manifests itself already in the human body and is subject to the disciplines of pharmacology and toxicology. Thus, from the abstract perspective of biological interaction, CADD and biodegradability are related.
Mostly, specificity, i.e., that only the relevant biological targets are hit while other targets are spared, seems to be the concept connecting CADD and biodegradability. The need for specificity in chemotherapy was first formulated by Paul Ehrlich[4] and is known to medicinal chemists as the pharmacophore.
According to IUPAC, "a pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response"[5].
According to IUPAC, biodegradation is defined as "[d]egradation caused by enzymatic process resulting from the action of cells"[6]. Thus, the definitions for the action of a chemical in the human body and for biodegradation are very similar. On the contrary, the accumulation of non-degradable chemicals introduces several threats to ecosystems (as it would in the human body). In the model of planetary boundaries[7,8], non-biodegradable pollutants manifest different problems.
The planetary boundaries measure different factors important to the global ecosystem and try to quantify, where possible, their resilience and out-of-safe-zone limits. Speaking of chemicals, it may be stated that persistent organic pollutants (POP) affect all planetary boundaries.
Therefore, it is significantly important for humanity to be able to control the impact of chemicals on the planetary boundaries, as according to Persson et al.[9], another planetary boundary recently left its safe-operating zone. The recently overstretched boundary is called Novel Entities[9] and is crucially tied to POP and other pollutants introduced by the (chemical) industry.
The United Nations (UN) published the Agenda 2030 with the formulation of 17 SDG to reach for a sustainable world. A sustainable world goes beyond good practices in industry and environmental protection. It also includes social factors such as justice. Biodegradability is critical to at least eight of the 17 SDGs. The eight SDGs that are directly impacted by POPs are:
Good health and well-being (Goal 3)
Clean water and sanitation (Goal 6)
Industry, innovation and infrastructure (Goal 9)
Sustainable cities and communities (Goal 11)
Responsible production and consumption (Goal 12)
Climate action (Goal 13)
Life below water (Goal 14)
Life on land (Goal 15)
Understanding RB through good modelling is thus crucial to achieving the Agenda 2030[1].
1.2 Regulation
Biodegradability is part of the greater risk assessment frameworks of different authorities and government bodies. Authorities and government bodies concerned with risk assessment are the Organisation for Economic Co-operation and Development (OECD), the International Organization for Standardization (ISO), MITI, NITE, the European Union (EU), and the United States Environmental Protection Agency (US-EPA).
The most restrictive legal framework to date is called Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) and was developed by the EU. REACH is thus the gold standard for compliance for industrial chemicals. Other frameworks relevant for the modelling of biodegradability are the Persistent Organic Pollutants Regulation and the Biocidal Products Regulation.
REACH came into force on 1 June 2007. Since then, all substances that are manufactured in or imported into the EU in quantities above 1 tonne per year have to be registered under REACH. The REACH regulation does not apply to polymers and non-isolated intermediates. During the registration, the applicant must collect, collate, and submit data on the hazardous properties to the European Chemicals Agency (ECHA). As of June 2022, the REACH database lists 26,493 registered substances[10].
To make the registration process more feasible, applicants are required by the EU to form groups. These groups are called Substance Information Exchange Forums (SIEF) and divide the burden of assessing a chemical. If an applicant submits a registration, that applicant automatically joins a SIEF.
Moreover, in 2011, the EU took six substances that are highly dangerous in terms of CMR toxicity off the market[11]. Most of these substances were found to be NRB, too. Moreover, the EU set up a new consortium to investigate forever chemicals (NRB substances) circulating in the environment[12].
1.3 Water scarcity
According to Johan Rockström et al., fresh water availability is directly linked to the resilience of an ecosystem[7], but is a local rather than a global problem[7,13]. Measuring water scarcity can be done with various methods of different sensitivity[14].
Yet, estimates say that, depending on the method, up to 4 billion people could be affected by water scarcity[15]. A large share of the affected people would live in India and China[15].
In Germany, fresh water scarcity is a problem, too. In recent years, from 2018 on, Germany experienced extreme droughts. Research about water scarcity has developed the mHM model[16]. The drought monitor gives out a daily drought map for Germany based on 2500 weather stations, using a GIS system based on the mHM model[16,17].
Figure 1.3.1: Water scarcity map for the whole column for Germany based on modelling by Marx et al.[18], obtained from their website[19]. Lighter colors indicate water abundance, whereas darker colors indicate water scarcity.
Overall, it is no secret that water is a finite resource and that especially developing countries pose a risk to sustainable global water management. According to A. Boretti and L. Rosa, water scarcity problems should be addressed mostly politically, but technologies for water purification would help by decreasing water stress[20].
1.4 Closed bottle test
The closed bottle test is part of the most recent OECD guideline for testing the aerobic RB of a substance. The identifier for the guideline is 301. The OECD guideline describes six different methods[21]. The methods include:
301A: DOC Die-Away
301B: CO2 Evolution (Modified Sturm Test)
301C: MITI (I)
301D: Closed Bottle
301E: Modified OECD Screening
301F: Manometric Respirometry
The testing procedure measures the degradation via parameters such as dissolved organic carbon (DOC), CO2 production, and oxygen uptake. Measurements should be taken at sufficient intervals or be continuous. Chemical analysis of the degradation process and the degradation intermediates is optional, but mandatory in the MITI method (method 301C).
Normally, the tests' duration is 28 days. However, if the test shows a plateau, the duration of the closed bottle test (CBT) might be shortened. If no plateau is reached, the substance is considered NRB.
1.5 Machine learning and artificial intelligence
This section aims to give more insight into artificial intelligence and machine learning. The algorithms applied in this study are explained further in section 2.3. The process of extracting data and deriving insights from it is essentially the business of data science.
Data science is interdisciplinary and mainly concerned with deriving insights from unstructured data using statistics, computer science, as well as problem formulation, reframing, and reduction skills[22]. Data science is essential for any lean business in the 21st century. A lean business follows the typical lean business cycle[23], shown in Figure 1.5.2.
The cycle manifests itself in the standard procedure of a data science project (compare
Figure 1.5.1). In a lean business, a minimum viable product (MVP) is built on an idea and
the business metrics on that MVP are measured.
Based on the generated data, insights are learned and new ideas generated. The cycle
starts again. The lean cycle shown in Figure 1.5.2 is itself just an improved manifestation
of the Plan-Do-Check-Act cycle advocated by the Toyota company[24].
A data science project is thus never done and continuously improves different parts of the data science workflow shown in Figure 1.5.1 for the desired prediction, based on new data and technologies. Still, in contrast to classical science, for data science the application or business context provides learnings, too.
Therefore, real data science only happens within the process of applying the model to solve real-world problems (preferably in a live setting). Continuous improvement and questioning of the derived models is essential to the lean methodology[23] in business and therefore makes data science a lean method.
Figure 1.5.1: Typical pipeline of a data science project.
Figure 1.5.2: The typical lean business cycle[23].
Mostly since 2020, it was shown that, in an industrial and combined academic context of drug design, the lean philosophy and methodology produces new ways and theories not existent before. It is therefore natural to link data science and drug design through a lean mindset. The lean methodology thus links drug design and the modelling of biodegradability[25]. As will be explained in the subsequent section, the relationship between biodegradability and CADD might present itself via technology, too.
1.6 Computer-aided drug design
CADD aims to decrease the costs and time needed to get a drug to the market. Moreover, the quality of a developed drug is likely to be increased as well. With rising computational power, novel algorithms, and processors, the field of CADD is significantly accelerated by the technological advancements of the 21st century.
There are two branches of CADD for small molecules. Firstly, ligand-based drug design (LBDD) focuses on methods and knowledge derived from known active compounds against a given target. For example, knowledge could be derived from the structures of known antivirals acting on the hepatitis C virus (HCV) protease NS3/4A. From a set of ligands with known activity against a given target, structure activity relationship (SAR) or quantitative structure activity relationship (QSAR) models can be developed. This search is often done via fingerprints that focus on the molecular structure.
On the contrary, structure-based drug design (SBDD) focuses on the underlying bio-macromolecular target and does not require knowledge of active ligands. Common techniques in CADD are virtual screening, pharmacophore mapping, docking, MD, and absorption, distribution, metabolism, and excretion (ADME) predictions.
1.6.1 Virtual screening
The aim of virtual screening is to identify small molecules that will bind to a given target (enzyme, receptor, or other biomolecule) with high activity and thus high probability[26]. Virtual screening can either be ligand- or structure-based. Moreover, there are even hybrid methods. Ligand-based approaches can include pharmacophore-based, shape-based, and field-based methods.
The pharmacophore concept is especially important to virtual screening methods. Virtual screening on a pharmacophore introduces variety and diversity into the scaffolds of the filtered compounds. The effect was first described by Schneider et al. and is known as "scaffold hopping"[27].
Scaffold hopping demonstrates that most biologically active compounds manifest their activity mostly through attached functional groups rather than their core scaffold. Doing scaffold "hopping, leaping, or crawling" computationally may lead to entirely new structures and is a powerful tool in CADD[28]. On the contrary, shape-based methods are applied primarily with the rapid overlay of compound structures (ROCS) algorithm in virtual screening campaigns[29-31] but pose some problems in optimization procedures[32].
Field-based methods offer an improved way of doing shape-based methods by introducing physical descriptors to the method. Field-based methods improve the molecular weight ratio according to Lipinski's rule of five in the filtered compounds[33]. Moreover, field-based methods improve the structural diversity of the filtered compounds[33].
1.6.2 Molecular dynamics
Molecular dynamics is a simulation technique using Hamiltonian dynamics and numerical integration to simulate the time evolution of a system with several approximated force fields. It aims to be an approximation to quantum mechanical methods. The classical MD simulation relies on a force field. The force field is generally expressed as a potential $V_\mathrm{total}$ (however, in the algorithm the force has to be calculated):

$$V_\mathrm{total} = V_\mathrm{tot,covalent} + V_\mathrm{tot,noncovalent} \tag{1.6.1}$$

The covalent interaction is defined on a basic level with the bonded potential, the angle potential, and the torsion potential:

$$V_\mathrm{bonded} = \sum_\mathrm{bonds} k_r (r - r_\mathrm{eq})^2 + \sum_\mathrm{angles} k_\theta (\theta - \theta_\mathrm{eq})^2 + \sum_\mathrm{dihedrals} k_\phi \left(1 + \cos[n\phi - \gamma]\right) + \sum_\mathrm{improper} k_\omega (\omega - \omega_\mathrm{eq})^2 \tag{1.6.2}$$

with each $k$ representing the respective force constant, $r$ representing the distance, and $\theta$, $\phi$, $\gamma$, $\omega$ denoting angles. The non-covalent interactions are defined by the Lennard-Jones potential and the Coulomb potential. The Lennard-Jones potential for a pair of atoms $i, j$ is

$$V_\mathrm{LJ} = 4\varepsilon \left[ \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right] \tag{1.6.3}$$

and the Coulomb potential is

$$V_\mathrm{Coulomb} = \frac{1}{4\pi\varepsilon_0} \frac{q_i q_j}{r_{ij}} \tag{1.6.4}$$

Thus, the overall non-covalent interaction for a system of $N$ particles is given by

$$V_\mathrm{tot,noncovalent} = \sum_j^N \sum_i^N V_\mathrm{LJ} + \sum_j^N \sum_i^N V_\mathrm{Coulomb} \tag{1.6.5}$$

$$V_\mathrm{tot,noncovalent} = 4\varepsilon \sum_j^N \sum_i^N \left[ \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right] + \frac{1}{4\pi\varepsilon_0} \sum_j^N \sum_i^N \frac{q_i q_j}{r_{ij}} \tag{1.6.6}$$
Figure 1.6.1: General workflow of the MD algorithm.
The MD algorithm is, at a high level, a simple procedure and is outlined in Figure 1.6.1. By employing potentials to describe interactions directly rather than indirectly, as in QSAR (compare below), MD abstracts the biological activity away from Lewis structures to physical interactions.
On a given trajectory, a dynamical pharmacophore may be generated[34]. Dynamical pharmacophores offer novel insights into protein-ligand interactions. With deep-learning methods, the eigenstates[35] of the system can be extracted. The extracted eigenstates can then clearly differentiate the binding modes of different ligands. Also, dynophores can be used in virtual screening campaigns[36].
1.6.3 Docking
There are two different approaches to docking. Firstly, there is a simulation approach that simulates separated host-guest molecules, and then there is docking via shape complementarity. Host and guest molecules are abstracted into different features that can make up a shape complementarity. Shape complementarity methods can either use a rigid or a flexible picture.
Shape complementarity methods are most suitable for pharmacophores, as both use geometric descriptors[37]. Efficient algorithms, e.g., genetic algorithms[38], make the shape complementarity approach very efficient in screening large virtual libraries. Methods incorporating electrostatic interactions between host and guest might use precomputed force fields for the host molecule[39].
The simulation approach uses methods applying pairwise interaction calculations. Mostly, the Poisson formulation of MD is used, such that Fourier methods are a prominent technique to calculate simulated docking[40]. The simulation approach incorporates ligand flexibility easily. By linking the molecular shape rather than the molecular structure to the biological activity, docking is a closer-to-reality method than just using the 2-D structure such as the SMILES string, which only represents the Lewis structure. The general mechanics of the docking procedure are shown in Figure 1.6.2.
1.6.4 Quantitative structure activity relationship
QSAR uses chemical and physical information to explain a chemical or biological property of a molecule with a mathematical (statistical) model[41]. QSAR aims to solve several tasks. The key tasks of QSAR modelling include:
1. Activity prediction for arbitrary endpoints
Figure 1.6.2: General outline of the docking procedure.
2. Reduction of experimental workload
3. Virtual screening
4. Mechanism elucidation
5. Classification of data
6. Optimization of leads
7. Refinement of synthetic targets[41]
The QSAR study is thus essentially similar to the data science workflow (Figure 1.5.1). Often, quantitative modelling is referred to as QSAR and qualitative modelling as SAR. In this work, both are used interchangeably, as the underlying problem is a classification task. Moreover, QSAR is embedded in the legal framework around the ECHA and OECD. The ECHA regards QSAR as a vitally important method in the risk assessment of chemicals[42]. For the ECHA, QSAR is vitally important for the implementation of REACH.
According to several domain experts, QSAR models used for the risk assessment of chemicals should satisfy the following criteria:
1. A defined endpoint
2. An unambiguous algorithm
3. A defined domain of applicability
4. Appropriate measures of goodness-of-fit
5. Robustness and predictivity
6. A mechanistic interpretation, if possible[43]
The two most prominent domains for QSAR models in the legal market include health endpoints and ecological endpoints. Ecological endpoints may include:
1. Acute aquatic toxicity
2. Chronic aquatic toxicity
3. Estrogen receptor binding
4. Biodegradation
5. Sediment sorption
6. Bioaccumulation[44]
1.6.5 Representation Learning
There are numerous eorts in learning proper representations of molecules. Classical repre-
sentations include the SMILES, InChI (improved SMILES by IUPAC), Morgan fingerprints
(algorithmic encoding), 3-D coordinates such as .mol, and the classical IUPAC name. How-
ever, all of these representations usually have to be encoded (except Morgan fingerprints)
or enriched with other data to be useful in QSAR or SAR modelling.
There have been numerous attempts in generating high-performant fingerprints, with one of
the most promising ones is combining the concept of SMILES with Morgan fingerprints[45].
The resulting fingerprint carries more information while claiming to be suitable for both,
small and large molecules.
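As a concrete illustration of fingerprint-based representations, the following sketch computes Morgan fingerprints with RDKit and compares two molecules with the Tanimoto index discussed further below. The SMILES strings are arbitrary examples, not compounds from the dataset.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Sketch: encode two molecules as Morgan bit-vector fingerprints and
# compare them with the Tanimoto (Jaccard) index.
mol_a = Chem.MolFromSmiles("CCO")    # ethanol (example)
mol_b = Chem.MolFromSmiles("CCCO")   # 1-propanol (example)
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))
```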
Moreover, the development of the self-attention mechanism in bidirectional autoencoders[46] showed that neural networks can learn the semantics of arbitrary string input[47]. Mostly, it was demonstrated that the BERT algorithm can structure information[tenney2018what].
The famous SMILES representation encodes a molecule as a string and would thus provide a link between chemistry and highly successful large-scale natural language processing algorithms. It was demonstrated that molecular descriptors can be learned with data science approaches[48]. Thus, the self-attention mechanism offers a way to do QSAR with deep learning models from the natural language processing realm[49].
Another attempt at finding the hidden representation of a molecule is to learn a representation using machine learning (ML) or artificial intelligence (AI), either from a classical representation or from an enriched representation.
When using graph neural networks, usually the enriched 3-D coordinates of the molecules are encoded into a hidden representation. Stojanovic et al. used the hidden representations of molecules encoded by graph neural networks with attention mechanisms for similarity search[50].
The approach of using a hidden representation for similarity search is essentially similar to the well-known Tanimoto index (also called the Jaccard index[51]). Similarity search is an important method for CADD, especially in virtual screening campaigns[52].
Mostly, measuring and utilizing chemical similarity is also a key task for (Q)SAR modelling. Another famous approach to chemical similarity is to encode fingerprints with the t-distributed stochastic neighbor embedding (t-SNE). The t-SNE is mostly used for visualization but, as the name suggests, could also be used as an embedding (hidden representation)[53].
1.7 Biodegradability and CADD
The elimination of environmental wastes and that of biomedical residues left after a treatment has healed the human body from a given disease are comparable[6]. Even more, a pharmacophore is conceptually similar to a toxicophore[4].
According to Ehrlich, for a substance to induce a biological response, it has to manifest a tropism. For example, a neurotropic substance must act on the brain[4]. Likewise, the manifestation of a biological response always happens through the interaction with or between biomolecules.
Thus, it is hypothesized that methods of CADD can significantly improve the modelling of biodegradability. The hypothesis is supported by the fact that chemists have already been able to derive corresponding rules of thumb. For example, long alkyl chains are detrimental to biodegradation.
Furthermore, the hypothesis is already supported by docking experiments[54]. As elaborated before, it can be shown via docking scoring and evaluation that there is a relationship between ligand structure and biodegradability based on biomolecule interaction. Studies on the halogenation patterns of polychlorinated compounds indicate a structural influence of biomolecules[55].
The toxicity studies by Kümmerer et al. suggest an interaction of antibiotics with biomolecules of the corresponding microorganisms involved in biodegradation[56]. Systematic studies on hydratases[57] also show the significance of the biomolecule host's interaction with the guest molecule of interest.
Therefore, the ligand-based approach can be pursued to obtain a more accurate description of biodegradation in models. If the hypothesis turns out to be false, no better model can be obtained. In modelling via SAR and artificial intelligence, this approach is also evident, as large influences of the structural properties of molecules become apparent[58-63].
Moreover, medicinal chemists have developed several tools to guide the drug development process. Yet, up to now, these methods were rarely applied, in an effort comparable to a drug-design campaign, to the problem of designing for good biodegradation. However, the need to design safe chemicals is required by the 4th principle of green chemistry ("designing safer chemicals")[64] and thus by all perspectives of green and sustainable practices and philosophies related to chemistry.
1.8 Modelling of biodegradability
This section gives an overview of the literature concerned with modelling the biodegradability of chemical compounds.
Previously, machine learning approaches, using Bayesian techniques, filtered out important
structural fragments for designing more RB molecules. On the contrary, several structural
fragments of NRB molecules were identified using the same Bayesian techniques[65].
The most important structural fragments of RB molecules are shown in Scheme 1.1 and the most important structural fragments of NRB molecules are shown in Scheme 1.2[65].
Another work, modelling RB classification, focused more on the selection of molecular descriptors, but did not explore the fitting capabilities of all investigated descriptors for fear of overfitting[61].
Scheme 1.1: Important structural fragments of RB molecules[65]
Scheme 1.2: Important structural fragments of NRB molecules[65]
The authors wanted to achieve higher accuracy by introducing a consensus or ensemble method. The authors claim that fundamental principles of QSAR demand the reduction of the problem with respect to input features (compare section 1.6.4).
Moreover, the authors state that the methods SVM, partial least squares discriminant
analysis (PLSDA), and KNN perform better than the random forest (RF) classifier.
Descriptors that showed high performance in the trained models were chemical descriptors related to single and double bonds, cycles, halogens, and nitrogen-related descriptors.
Furthermore, the authors included graph-based descriptions derived from adjacency matrices. Various graph-based metrics were calculated and then later related to the branching of the molecular structure[61]. Still, the molecular descriptors are very structure-related and do not focus much on the physical properties of a molecule.
In a study by D. Ballabio et al., a dataset of 416 chemical compounds was used to assess
the performance of RB predictions of eight QSAR models. The study applied ensemble
methods with Bayesian-consensus models according to Dempster-Shafer theory. The study
found that majority voting improved the QSAR predictions of the model[66].
Moreover, it has been shown that balanced decision trees show decent performance on the UCL biodegradation dataset[60]. The trees show decent ROC and AUC curves, and the overall classification is based on ensembles[60].
The study of Cheng et al. was performed on the MITI dataset. In summary, the study evaluated descriptor-based against fingerprint-based models. According to the authors, the fingerprint-based models outperform the descriptor-based models. The applicability domain was determined on MACCS keys with a length of 166 and the Euclidean distance[67].
In another study by Lombardo et al., a SAR model based on the Python software SARpy[68] was developed. The authors approached the modelling via molecular fragments in a dualistic manner, combining expert knowledge and statistical modelling to develop a machine learning model[69].
The approach chosen by Lombardo et al. already hints that describing the molecules only by their structure and expert knowledge may not be sufficient to model RB with ML and AI. In addition, it was shown that regression modelling of the toxicity of ionic liquids is possible with classical ML approaches[70].
1.9 Aims and scope
Firstly, in the above sections, it was shown that the modelling of biodegradation and other similar endpoints is essential for a sustainable chemical industry and for achieving the Agenda 2030. The impact of biodegradability modelling manifests itself by providing safer chemicals and water purification technologies.
It was shown that water scarcity is a local rather than a global problem[7]. Open-source
platform models to roll out derived technology from biodegradability modelling are thus
the best option to address water scarcity.
Next, emphasis was put on QSAR modelling, and it was demonstrated that QSAR is related to the lean business methodology via data science and thus related to overall modern CADD. By linking both CADD and biodegradation modelling to each other via the business perspective, chemical and physical similarities might be discoverable.
A good QSAR model may then replace the need for more expensive simulations such as docking or MD. Moreover, a good classifier might give rise to a fast and cheap virtual screening method combining field-based, shape-based, and pharmacophore-based technologies into one algorithm.
Therefore, good QSAR models would make software building upon these models more feasible. It is expected that a hidden representation is learned. The features of the hidden representation contribute to the explainability of the algorithm and make the technology developed in this work more compliant.
It was discussed that the current academic literature uses predominantly proprietary software to derive its models[62], which is at least bad scientific practice. Moreover, some authors claimed that the descriptiveness of the datasets is not sufficient for classical models[Lombardo2013] and might get worse if the dataset size is increased[62]. Thus, this work aims to improve the overall reliability of scientific QSAR models for chemical use-cases.
However, from a data-science perspective, the dataset would in such a case have bad features. Thus, the current state of modelling in sustainability can be improved by following the lean cycle. The hypothesis of bad features is supported by the recent application of hidden representation learning to numerous chemical classification tasks (compare section 1.6.5).
Therefore, methods of CADD may offer new ways to achieve a better understanding of biodegradability. The hypothesis that the modelling of biodegradability can be improved with CADD methodology shall be evaluated with the following steps:
1. Calculate a wide array of different descriptors
2. Filter and evaluate the descriptors
3. Evaluate against fingerprints and validated CADD descriptors
4. Evaluate false positives and false negatives
5. Evaluate the explainability of the classification results
2 Methods
2.1 Software Engineering
To comply with the requirements for QSAR models, it was attempted to follow a SOLID design. The SOLID principles have been derived by practitioners since the 1980s and comprise five principles for designing good code. Mostly, SOLID should help to avoid STUPID (singleton, tight coupling, untestability, premature optimization, indescriptive naming, and duplication) code[71]. The five main SOLID principles are:
Single Responsibility Principle
Open/Closed Principle
Liskov Substitution Principle
Interface Segregation Principle
Dependency Inversion Principle
However, the SOLID design is not always the best design pathway and is sometimes under heavy criticism by practitioners. Especially, SOLID does not suit the microservice world[72].
Still, to be able to apply the method easily to new endpoints without having to do everything from scratch, the following pattern was developed during the early prototyping phases:
The pattern shown in Figure 2.1.1 places the dataset at the center of the workflow. The decision came essentially from applying domain-driven design[73,74] to the business problem of generating value from a dataset via a QSAR problem.
The dataset in the shown software pattern is essentially an entity in the world of domain-driven design. Abstracting the dataset as an entity is counterintuitive and unusual, but could thus also lead to more value.
Figure 2.1.1: SOLID software design pattern for modelling RB.
Through the design of putting the dataset into a container, chembee and the implemented methods align with the pillars of object-oriented programming. The pillars of object-oriented programming are encapsulation, abstraction, inheritance, and polymorphism. All the pillars of object-oriented programming are addressed within one chembee object.
The domain-driven design manifests itself through the object interaction tree and makes heavy use of the Dependency Inversion Principle. However, the shown pattern only shows parts of a potential domain model.
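The following hypothetical sketch illustrates the dataset-centric pattern with dependency inversion. The class names are illustrative and do not mirror the real chembee API.

```python
# Hypothetical sketch of the pattern in Figure 2.1.1: the dataset is an
# entity at the center, and the classifier is injected (Dependency
# Inversion) rather than hard-coded. Names do not mirror chembee.
from dataclasses import dataclass

@dataclass
class Dataset:
    """Entity holding the feature matrix X and labels y of one endpoint."""
    X: list
    y: list

class Workflow:
    def __init__(self, dataset: Dataset, classifier):
        self.dataset = dataset        # the central entity
        self.classifier = classifier  # injected dependency (any sklearn-like model)

    def fit(self):
        return self.classifier.fit(self.dataset.X, self.dataset.y)
```

Because any object with a `fit(X, y)` interface can be injected, new endpoints or algorithms can be explored without changing the workflow code, which is the point of the pattern.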
2.2 Data preparation
Collected data from the recent publication of F. Lunghini et al.[62] was processed further using the software RDKit[75]. The authors provided data for 3192 molecules, labelled as either RB or NRB.
For the modelling part, several features were needed to be able to train machine learning models on the dataset. Firstly, the CADD descriptors described by Ruiz-Moreno et al.[53] were calculated. Secondly, all available molecular descriptors from the mordred software package were calculated. The resulting 3192 molecules were then saved for further use in the .sdf file format.
(a) Whole dataset (b) Biodegradable (c) Non-biodegradable
Figure 2.2.1: Polar plots visualizing the enhanced Lipinski's rule of five[53] for the dataset of biodegradability from Lunghini et al.[62]. The plots demonstrate the influence of the CADD descriptors on RB. The red dashed line indicates the dataset average and the two black hexagons indicate the borders of the sweet spot according to Lipinski.
The distribution of molecules by class is distorted towards Lipinski's rule of five (Figure 2.2.1). It can be seen that a large fraction of the dataset is within the sweet spot of Lipinski's rules. Moreover, it can be seen that biodegradability is influenced by those descriptors.
For the whole dataset and biodegradable compounds alike, the topological polar surface area (TPSA) seems less relevant than in CADD and is slightly below the sweet spot. Furthermore, hydrogen bond donors are seemingly less relevant than in CADD. For non-biodegradable compounds, the dataset is clearly distorted for logP, but still on average compliant with the rule.
As the next step, the features are filtered using the best-in-class RFC and selected based on the standard deviation of their importance. Selecting by the standard deviation of the importance should ensure the features are relevant for certain molecules even if they might be irrelevant to other molecules. In this way, a better fit could be achieved.
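A minimal sketch of this selection step is shown below, using IRIS as a stand-in for the biodegradability feature matrix; the number of retained features is an arbitrary placeholder.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Sketch: rank features by the standard deviation of their importance
# across the trees of a fitted RFC, then keep the most variable ones.
# IRIS stands in for the real descriptor matrix of the thesis.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
selected = np.argsort(std)[::-1][:2]  # placeholder: keep the top-2 features
print(selected)
```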
2.3 Machine Learning Algorithms
The following sections give an overview of the applied modelling techniques and compare their baseline performance on the common IRIS dataset[76].
2.4 Logistic Regression
Linear and logistic regression are very popular methods in data science. Open-source tools in Python and R, as well as proprietary tools such as Excel, make the use of logistic and linear regression frictionless. Both methods are fast and have been verified by the community many times.
Logistic and linear regression differ fundamentally in the mathematics of their statistical model. Linear regression models a relation between continuous variables. On the contrary, logistic regression models the relation between a continuous variable and a categorical variable.
The categorical variable in the logistic model is typically a boolean variable. The boolean variable can have different manifestations (0 or 1, true or false, etc.). From its statistical background, the logistic model is a natural choice for modelling the RB of organic molecules.
Figure 2.4.1: Logistic regression with different solvers on the IRIS dataset[76].
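A baseline sketch corresponding to Figure 2.4.1 is shown below; the solver choice is one of those compared in the figure.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch: logistic regression baseline on IRIS with one of the solvers
# compared in Figure 2.4.1.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```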
2.4.1 K-means algorithm
The k-means algorithm is a kernel method. The k-means algorithm is famous for clustering data because the algorithm is fast in finding the centroids of clusters. However, the algorithm prefers clusters of equal size. Still, the k-means algorithm is similar to the expectation-maximization algorithm from hidden Markov modelling. The k-means algorithm is the most basic instance-based learning (IBL) method[77]. The algorithm optimizes the cost function

$$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \tag{2.4.1}$$

The cost function measures the Euclidean distance between the jth data point $x_j$ in the ith of the k clusters $S_i$ and the centroid $\mu_i$ of the cluster $S_i$. Due to this mathematical constraint, the problem is always constrained to linear decision boundaries. A non-linear decision boundary is not achievable.
Figure 2.4.2: Different methods of the k-means algorithm on the IRIS dataset[76].
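A minimal sketch of Eq. (2.4.1) in practice: scikit-learn's KMeans minimizes the within-cluster sum of squared distances, which it reports as the inertia.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Sketch: fit k-means on IRIS; inertia_ is the optimized cost J of
# Eq. (2.4.1), cluster_centers_ are the centroids mu_i.
X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)
```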
2.4.2 K-nearest neighbor clustering
The KNN algorithm was invented in 1967 by Cover and Hart[78]. The KNN algorithm is an IBL method. Essentially, instance-based learners do not learn parameters of a given model but, on the contrary, learn probabilities by comparing the new example to memorized training examples with a similarity function[77]. Mostly, the memory storage has polynomial complexity, but the calculation is linear in complexity[77]. The benchmarking with different search algorithms for the cluster calculation is shown in Figure 2.4.3.
Figure 2.4.3: Different methods of the KNN algorithm on the IRIS dataset.
The KNN algorithm proved to be capable of highly non-linear fits by catching some edge cases. However, the fit seems to be less accurate compared to the random forest algorithm (Figure 2.4.9).
2.4.3 Support vector machine
The SVM is a binary non-probabilistic classification algorithm based on geometric arguments. The invention of the algorithm started in the early 1960s, and it was developed for production use in the 1990s[79]. The idea of classifying data this way goes back to the late 1930s[80]. The SVM algorithm can be used for classification and regression tasks.
The algorithm does the classification based on a hyperplane that separates the problem by maximizing the distance of the clusters to the hyperplane. Because the SVM is using a hyperplane, it is without modification only applicable to linearly separable problems. In reality, however, only a minority of problems are linearly separable. To overcome the requirement of linearly separable training data, the SVM can be modified to use the kernel trick.
The kernel trick finds a mapping that projects the data into a higher-dimensional space and finds a hyperplane in that higher-dimensional space. The trick in the kernel method lies in computing an implicit feature space. By computing an implicit feature space via the kernel functions, the computational cost stays low.
Methods based on kernel functions are lazy methods because they use implicit learning. Therefore, their complexity of prediction is O(n). However, the problem is then memory bound. Most of the kernel-based methods therefore use only a subfraction of the training data.
The most common kernels for SVMs are the linear kernel, the polynomial kernel, and the kernel with radial basis functions. Different kernels are compared in Figure 2.4.4. For the subsequent example, a polynomial of 3rd order was chosen.
Figure 2.4.4: Different kernels for an SVM on the IRIS dataset[81].
The radial basis function (RBF) kernel and the polynomial kernel suggest non-linear fits. To see whether the non-linear properties of the SVM can be extended, polynomial kernels of degree 1, 3, and 6 were fitted to the IRIS dataset[76] (Figure 2.4.5).
Figure 2.4.5: The plot shows different SVM models with different polynomial kernels to compare the non-linear properties of these kernels.
The comparison in Figure 2.4.5 shows that kernels using polynomial functions of higher order are better capable of non-linear fitting than kernels using polynomials of lower order. However, the benefits of using higher-order polynomials in the kernel did not outweigh the computational cost for the methods implemented in scikit-learn[76].
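A minimal sketch of the polynomial-degree comparison is given below; degrees 1, 3, and 6 mirror the models in Figure 2.4.5.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Sketch: compare polynomial kernels of increasing degree; higher
# degrees allow more non-linear boundaries at higher computational cost.
X, y = load_iris(return_X_y=True)
for degree in (1, 3, 6):
    svm = SVC(kernel="poly", degree=degree)
    print(degree, cross_val_score(svm, X, y, cv=5).mean())
```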
2.4.4 Spectral clustering
Spectral clustering makes use of the eigenvalues (spectrum) of the similarity matrix of the input for the clustering. The spectral clustering algorithm performs dimensionality reduction before finally clustering the data points. Spectral clustering needs the similarity matrix and is thus a kernel method.
In the field of image analysis, spectral clustering is sometimes called object categorization. Spectral clustering is an unsupervised technique that needs some method of clustering against the similarity functions. The performance of different spectral clustering algorithms is shown in Figure 2.4.6.
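A minimal sketch of spectral clustering as a kernel method: the affinity (similarity) matrix is built with an RBF kernel before the spectral embedding and the final clustering step.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_iris

# Sketch: spectral clustering on IRIS with an RBF-kernel affinity
# matrix, i.e., the similarity matrix mentioned in the text.
X, _ = load_iris(return_X_y=True)
sc = SpectralClustering(n_clusters=3, affinity="rbf", random_state=0)
labels = sc.fit_predict(X)
print(labels[:10])
```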
2.4.5 Random forest algorithm
The random forest algorithm is a decision-tree-based algorithm (Figure 2.4.7) that allows for regression and classification tasks. Mostly, the random forest algorithm is an ensemble method using decision trees (Figure 2.4.8).
Figure 2.4.6: Spectral clustering on the IRIS dataset with three different classifiers.
The most important hyperparameters for the
random forest algorithm are the strength of each classifier and the correlation between the
individual classifiers[82].
Breiman et al. state that the random forest algorithm converges according to the "Strong Law of Large Numbers"[82]. Mostly, the authors state that the random forest algorithm is not prone to overfitting[82]. The exploration of random forest classifiers is a difficult problem. There are several methods of tackling the exploration problem.
The exploration may be done via bagging[83] or boosting[84]. Later, Breiman et al. identified AdaBoost as the most successful method[82]. The AdaBoost method is still the most popular method today.
The RFC can find highly non-linear decision boundaries that do not obey any mathematical
restrictions derived from geometric arguments. When the RFC is trained on the IRIS dataset
contained in the sklearn API[76], the classification borders indicate a superior fit compared
to the other algorithms (Figure 2.4.9). However, the algorithm can thus be prone to overfitting.
The performance of the RFC with different loss functions is shown in Figure 2.4.9.
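The comparison of loss functions can be sketched with scikit-learn as follows, assuming scikit-learn ≥ 1.1 (where the log_loss criterion is available); the hyperparameters are illustrative:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# The three split criteria compared in Figure 2.4.9: Gini coefficient,
# Shannon entropy, and log-loss.
for criterion in ("gini", "entropy", "log_loss"):
    clf = RandomForestClassifier(
        n_estimators=100,      # number of trees in the ensemble
        criterion=criterion,   # impurity measure for each split
        random_state=0,
    )
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{criterion}: mean CV accuracy = {scores.mean():.3f}")
```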
Figure 2.4.7: Schematic representation of a decision tree[85].
Figure 2.4.8: Schematic representation of the random forest algorithm[85].
Figure 2.4.9: Classification performance of the RFC with different loss functions.
2.4.6 Naive Bayes algorithm
The Naive Bayes algorithm (NBA) is very popular for the classification of
biomolecules (compare 1.8). NBA classifiers are probabilistic methods that apply
naive independence assumptions to the input features. Often, NB classifiers are combined
with kernel density estimation to achieve good accuracies. To also have a baseline model,
the NB was evaluated against the IRIS dataset[76].
Figure 2.4.10: Comparison of different NB algorithms on the IRIS dataset[76].
The NBA classifier using a Gaussian kernel behaves similarly to the SVM with RBF kernel;
the relation between the two is mathematically immediately evident. Kernels different from
the Gaussian kernel significantly underperform on the dataset compared to all other methods
investigated. The prior authors seem to prefer the NBA method; therefore, the results of
the new approach should be very interesting.
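A minimal baseline sketch comparing two NB variants on the IRIS dataset; GaussianNB and BernoulliNB are illustrative choices, as the exact variants behind Figure 2.4.10 are not restated here:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X, y = datasets.load_iris(return_X_y=True)

# GaussianNB places a Gaussian distribution over each feature, which makes
# its decision boundaries resemble those of an SVM with RBF kernel;
# BernoulliNB is a poor match for continuous features and underperforms.
for clf in (GaussianNB(), BernoulliNB()):
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{type(clf).__name__}: mean CV accuracy = {scores.mean():.3f}")
```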
2.4.7 Summary of machine learning algorithms
For several algorithms, the baseline behavior on a classical benchmarking dataset (IRIS)[76]
was studied. The aim of the baseline study was to put the results on the RB binary
classification task into the context of the baseline behavior of the algorithms KNN, SVM,
RFC, and NBA.
The NBA yields results similar to the SVM with RBF kernel. The k-means algorithm could
not leave its linear fitting regime at all. The non-linear fitting capabilities of the SVM can
be improved with higher-order polynomial kernels.
The RFC yielded good results with all three different loss functions (Gini coefficient,
Shannon entropy, Shannon log-loss). For the non-linear classifiers RFC, KNN, and MLP,
the ROC-AUC curves are shown in Figure 2.4.11.
Figure 2.4.11: ROC and AUC curves for three different non-linear classifiers with standard
parameters on the IRIS dataset[76]. Panels: (a) RFC, (b) KNN, (c) MLP.
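Such ROC-AUC curves can be generated with scikit-learn's RocCurveDisplay (available from scikit-learn ≥ 1.0). The sketch below restricts IRIS to two classes, since ROC curves are defined for binary problems, and uses the KNN classifier as an example; the figure itself may use a different class reduction:

```python
from sklearn import datasets
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# IRIS restricted to two classes for a binary ROC curve.
X, y = datasets.load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = KNeighborsClassifier().fit(X_train, y_train)
# Plots the ROC curve on the held-out data and reports the AUC.
RocCurveDisplay.from_estimator(clf, X_test, y_test)
```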
3 Results
3.1 Established molecular descriptors
Many authors prefer a variety of descriptors, and each author seems to prefer their own
set[62,86,87]. Still, unarguably the most important descriptors for CADD are the
Lipinski descriptors.
In the reference study, a train-test split of 70:30 was used. However, as the feature sets are
expected to be more descriptive and the dataset is small[62], a train-test split of 80:20 was
chosen in this study. The split was also verified to yield results similar to a 70:30 split.
First, to verify the hypothesis, the dataset was evaluated against the descriptors used by
Ruiz-Moreno et al. in their drug design process[53], which include the Lipinski rule of five.
The descriptors thus include the classical Lipinski descriptors as well as descriptors related
to molecular shape, e.g., the radius of gyration. First, the feature importance was evaluated
with a random forest classifier using chembee, following the Scikit-learn workflow[76]. The
result of the feature importance investigation with chembee is shown in Figure 3.1.1.
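The extraction can be sketched with plain scikit-learn, which chembee wraps in a comparable workflow; X, y, and feature_names are placeholders for the prepared descriptor matrix, the RB labels, and the descriptor names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: descriptor matrix (n_compounds x n_descriptors), y: RB/NRB labels,
# feature_names: descriptor names, e.g. ["logP", "InertialShapeFactor", ...].
def rank_features(X, y, feature_names):
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(X, y)
    # Mean decrease in impurity, averaged over all trees.
    importances = forest.feature_importances_
    # Standard deviation of the impurity decrease across the trees.
    std = np.std(
        [tree.feature_importances_ for tree in forest.estimators_], axis=0
    )
    order = np.argsort(importances)[::-1]
    for i in order:
        print(f"{feature_names[i]}: {importances[i]:.4f} +/- {std[i]:.4f}")
    return importances, std
```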
The two most important features are the logP value and the inertial shape factor. The
logP value is a physicochemical descriptor related to the spatial and physical properties of
the molecule, whereas the inertial shape factor[88] is a 3D molecular descriptor related only
to the 3D shape of a given molecule.
To get a visual explanation of the fitting capabilities of these two features, different
classifiers were screened with the benchmark module of chembee. The visual inspection
reveals the three most promising classifiers to be the SVM, the RFC, and the KNN classifier
with radial basis functions.
The decision boundary for the SVM is shown in Figure 3.1.2, the decision boundary for
the RFC in Figure 3.1.3, and the boundary for the KNN in Figure 3.1.4.
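The boundary plots can be reproduced with a generic meshgrid sketch; clf, X, and y are placeholders for any fitted two-feature classifier and the corresponding data, not chembee's exact plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

# clf: a fitted two-feature classifier (SVM, RFC, or KNN); X is the
# n x 2 matrix of [logP, inertial shape factor]; y are the RB labels.
def plot_decision_boundary(clf, X, y, xlabel="logP",
                           ylabel="inertial shape factor"):
    # Evaluate the classifier on a dense grid spanning the feature range.
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Colour the regions by predicted class and overlay the data points.
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k", s=20)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
```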
Figure 3.1.1: Extraction of the feature importance applying the molecular descriptors used by
Ruiz-Moreno et al.[53] using the random forest classifier implemented in chembee.
Figure 3.1.2: Decision boundary of the SVM fitting the biodegradability dataset with the
features logP and the inertial shape factor using chembee.
Figure 3.1.3: Decision boundary of the RFC fitting the biodegradability dataset with the
features logP and the inertial shape factor using chembee.
Figure 3.1.4: Decision boundary of the KNN fitting the biodegradability dataset with the
features logP and the inertial shape factor using chembee.
Most interestingly, the classical feed-forward neural network implemented in Scikit-learn,
whose non-linear fitting capabilities are shown in Figure 3.1.5, seemed to underperform
the KNN and RFC classifiers.
Figure 3.1.5: Decision boundary of the MLP fitting the biodegradability dataset with the
features logP and the inertial shape factor using chembee.
As shown in section 2.3, the other algorithms are incapable of fitting highly non-linear
decision boundaries on the toy datasets and were thus also not able to obtain a non-linear
fit on the investigated RB dataset.
However, the most important features obtained from the feature extraction with an RFC,
as implemented in chembee, seem to differentiate the points in the dataset quite well. Yet,
the biological properties of a given molecule will only be explainable with a feature space
of higher dimension.
To obtain an even better fit with the suitable algorithms, a grid search for hyperparameter
optimization was conducted with the cross-validation action implemented in the chembee
package.
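A minimal sketch of such a cross-validated grid search with plain scikit-learn, assuming an SVM and a hypothetical parameter grid (the actual grids used by chembee's action are not reproduced here); X_train and y_train stand for the 80 % training portion of the RB dataset:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Hypothetical parameter grid; illustrative values only.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
    "kernel": ["rbf"],
}

search = GridSearchCV(
    SVC(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
search.fit(X_train, y_train)  # exhaustive search with stratified CV
print(search.best_params_, search.best_score_)
```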
The results of the evaluation of the RFC, KNN, and SVM classifiers are shown in Fig-
ure 3.1.6. The subsequent analysis of the stratified ROC and AUC curves is shown in
Figure 3.1.7.
Figure 3.1.6: Comparison of RFC, KNN, and SVM on the metrics accuracy, precision, recall,
specificity, and f-score on the CADD descriptors[62].
Figure 3.1.7: ROC-AUC curves for different classifiers trained over five new splits of the
dataset using the hyperparameters obtained from the cross-validated grid search. The gray
area indicates one standard deviation. Panels: (a) RFC, (b) KNN, (c) SVM.
3.2 Morgan fingerprints
Morgan fingerprints are a classical input for algorithms classifying biological properties of
chemicals. Molecular descriptors directly related to the chemical structure remain very
popular[62,86,87].
To assess the impact of the pure structural information of a chemical compound on its
biodegradability, Morgan fingerprints of length 2048 bits were chosen as input.
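Fingerprints of this length can be generated with RDKit; the sketch below assumes a Morgan radius of 2 (the radius is an assumption, not stated above), and smiles_list is a placeholder for the dataset's SMILES strings:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# smiles_list: the SMILES strings of the RB dataset compounds.
def morgan_features(smiles_list, radius=2, n_bits=2048):
    fingerprints = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip unparsable entries
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # bit vector -> numpy
        fingerprints.append(arr)
    return np.vstack(fingerprints)  # shape: (n_compounds, n_bits)
```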
The metrics accuracy, f-score, precision, and recall were evaluated with a five-fold stratified
cross-validation on hyperparameters tuned by the cross-validated grid search implemented
in chembee.
The evaluated algorithms included the RFC, the KNN, and an SVM with RBF kernel. The
result of the 5-fold cross-validation evaluated against accuracy, f-score, precision, and recall
is shown in Figure 3.2.1.
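A minimal sketch of this stratified five-fold evaluation with plain scikit-learn; clf is a placeholder for a classifier carrying the tuned hyperparameters, and X and y for the fingerprint matrix and the RB labels:

```python
from sklearn.model_selection import StratifiedKFold, cross_validate

# clf: classifier with the hyperparameters from the grid search;
# X: 2048-bit Morgan fingerprint matrix, y: RB labels.
scoring = ["accuracy", "f1", "precision", "recall"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(clf, X, y, cv=cv, scoring=scoring)
for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```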
Figure 3.2.1: Comparison of RFC, KNN, and SVM on the metrics accuracy, precision, recall,
specificity, and f-score using Morgan fingerprints as the feature space.
The comparison of the ROC and AUC curves shown in Figure 3.2.2 reveals that the
non-linear classifiers RFC and KNN underperform the SVM.
3.3 Screening of descriptors
There are more than 1800 molecular descriptors implemented in the mordred package[89].
Other reasons for choosing mordred, apart from the vast space of descriptors, are that it is
fast and open-source.
Figure 3.2.2: ROC-AUC curves for different classifiers trained over five new splits of the
dataset using the hyperparameters obtained from the cross-validated grid search. The gray
area indicates one standard deviation. Panels: (a) RFC, (b) KNN, (c) SVM.
Thus, the performance and reliability of the derived models are increased when using
mordred instead of other software such as PaDel[89,90] or ChemAxon[62].
The screening of the feature importance is shown in Figure 3.3.1. The plot clearly shows
that there are more important and less important descriptors, whereas some do not show
any importance at all.
It is notable that, in contrast to the feature extraction shown in Figure 3.1.1, the feature
extraction done on the mordred descriptors[89] shows importance predominantly through the
standard deviation of the decrease in impurity. It can be concluded that features with a high
standard deviation of the decrease in impurity have good discrimination properties.
Thus, all features with a standard deviation ≥ 0.01 were selected for further studies. The
next step comprised a cross-validated grid search with the KNN, RFC, and SVM over the
same hyperparameter space as used in section 3.1. Again, the cross-validation action
implemented in chembee was used to perform the grid search.
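The selection step can be sketched as follows, assuming the standard deviation of the impurity decrease is taken across the trees of a fitted random forest; forest, X, and descriptor_names are placeholders, and chembee's exact implementation may differ:

```python
import numpy as np

# forest: a RandomForestClassifier fitted on the mordred descriptor
# matrix X; descriptor_names: the corresponding mordred descriptor names.
# Standard deviation of the impurity decrease across the trees.
std = np.std(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)

# Keep every descriptor whose standard deviation is >= 0.01, as described
# above.
mask = std >= 0.01
X_selected = X[:, mask]
selected_names = [n for n, keep in zip(descriptor_names, mask) if keep]
print(f"{mask.sum()} of {len(mask)} descriptors retained")
```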
Afterwards, the performance of the classifiers with the hyperparameters obtained from the
grid search was evaluated using the evaluation action from the chembee package. The results
of the evaluation are shown in Figure 3.3.3.