Random forest classification for predicting lifespan-extending chemical compounds


Abstract

Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of the worm species Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. Feature selection was achieved using variance- and mutual information-based methods. The best performing classifier, built using molecular descriptors, achieved an area under the curve (AUC) score of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 most important features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database consisting of 1,738 small molecules. The chemical compounds of the screening database with a predictive probability of ≥ 0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (i) flavonoids, (ii) fatty acids and conjugates, and (iii) organooxygen compounds.
Soa Kapsiani
University of Surrey Faculty of Engineering and Physical Sciences
Brendan James Howlin
Department of Chemistry, FEPS, University of Surrey, Guildford, Surrey, GU2 7XH, UK
Research article
Keywords: ageing, anti-ageing drugs, lifespan extension, DrugAge, C. elegans, machine learning, random forest, molecular descriptors, molecular fingerprints, QSAR
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Pharmacological interventions for longevity extension
Ageing is a major health, social and financial challenge, characterised by the deterioration of the physiological processes of an organism [1, 2]. Ageing is a predominant risk factor for many conditions including various types of cancers, cardiovascular and neurodegenerative diseases [3, 4]. Interventions targeting the cellular and molecular processes of ageing can help delay and prevent age-related diseases.
Several pharmaceutical and non-pharmaceutical interventions have been identified to extend the lifespan of a variety of model organisms [2]. Caloric restriction can slow down ageing and protect against age-related diseases by regulating signalling pathways such as mTOR, the mammalian target of rapamycin [5]. However, the intensity of long-term dietary restrictions makes them difficult to maintain [4]. Pharmaceutical interventions are considered the most practical interventions for combating human ageing, as they are easier to maintain than dietary restrictions and are free of the ethical concerns associated with genetic interventions [4].
Caenorhabditis elegans in ageing research
The worm species Caenorhabditis elegans (C. elegans) is one of the most studied model organisms in longevity research and has significantly contributed to our fundamental understanding of organismal ageing [6]. C. elegans has a short lifespan of approximately 3 weeks, which makes it well suited for longevity studies in contrast to long-lived mammals [6, 7]. Besides its short lifespan, C. elegans displays several observable and quantifiable changes with ageing in its anatomical and functional features. Thus, the ageing process can be easily monitored [6, 7].
Ageing affects several important tissue systems of C. elegans, including the cuticle (skin), hypodermis, muscles, and reproductive and nervous systems [8]. With ageing, the cuticle becomes progressively thicker and more wrinkled. The muscle tissues of C. elegans deteriorate, resulting in a decline in locomotion [8]. Studies have shown that the decline in locomotion predicted the time of death of individual worms more closely than chronological age [8]. “Chronological age” does not always correlate perfectly with “biological age”, which is the organism's physical state [9]. “Biological age” is influenced by the genetic background of the organism as well as environmental factors [9].
The nervous system of C. elegans displays more subtle changes with increasing age compared to other tissue systems [7]. These include synaptic deterioration, a decline in learning ability and a reduced regeneration capacity of motor neurons [7]. Reproduction of C. elegans ceases with age and the structure of its reproductive system deteriorates [7].
Although C. elegans has a much simpler physiology than humans, it possesses several of the key organ systems present in more complex organisms, such as digestive, nervous and reproductive systems [10]. Many of the mechanisms and genes that extend the lifespan of C. elegans are evolutionarily conserved across organisms, from yeast to humans [11]. Therefore, potential lifespan-extending drugs can first be tested on worms and then assessed on mammals.
Overview of key ageing studies
Several ageing studies have identified interventions that extend the lifespan of model organisms ranging from nematodes and fruit flies to rodents. These interventions include dietary restrictions, genetic modifications and pharmaceutical interventions. Lee et al. (2006) presented the first evidence that long-term dietary deprivation can improve longevity in a multicellular species, C. elegans [12]. Harrison et al. (2009) showed that rapamycin, an inhibitor of the mTOR pathway, extended the lifespan of both female and male mice [13]. In the same year, Selman et al. (2009) reported that genetic deletion of S6 protein kinase 1 increased the lifespan of mice and protected against age-related diseases [14].
A 2014 study developed a pharmacological network to identify pharmacological classes related to the ageing of C. elegans [15]. The network showed that resistance to oxidative stress and lifespan extension clustered in a few pharmacological classes, most of them related to intercellular signalling [15]. Additionally, Putin et al. (2016) developed a deep neural network that predicted human chronological age from a basic blood test [16]. The study identified the five most important blood markers for determining chronological age in humans: albumin, glucose, alkaline phosphatase, urea and erythrocytes [16]. Mamoshina et al. (2018) developed a deep learning-based haematological ageing clock using blood samples from Canadian, South Korean and Eastern European populations, with millions of subjects [17]. The study demonstrated that population-specific ageing clocks were more accurate in predicting chronological age and quantifying biological age than generic ageing clocks [17].
Barardo et al. (2017) built a random forest model to predict whether a compound would increase the lifespan of C. elegans based on the data of the DrugAge database [1, 4]. The features used to build the random forest model were molecular descriptors and gene ontology terms. Feature selection was performed using the random forest feature importance measure. The best performing model, with an AUC score of 0.80, was applied to predict the class of the compounds in the DGIdb database.
Purpose of the work
In this study, the random forest algorithm was applied to predict whether a compound will increase the lifespan of C. elegans. This was achieved by building five predictive models, each using different descriptor types, based on the data of the DrugAge database published by Barardo et al. (2017) [4]. The features of the models were molecular fingerprints and/or molecular descriptors calculated from the structure of the compounds in the DrugAge database. The filter-based feature selection method, mutual information, was employed to select the most relevant features. To the best of our knowledge, this is the first application of molecular fingerprints to build a machine learning model based on the entries of the DrugAge database. The best performing model was applied to predict the class of the compounds in an external database consisting of 1,738 small molecules.
Random forest models
A random forest is a supervised machine learning algorithm that is widely applied for classification tasks. This method was selected as it is robust to overfitting in high-dimensional datasets with a small number of entries, making it suitable for the data used in this study [4].
The choice of chemical descriptors can significantly impact the quality and predictions of QSAR models. Descriptors represent the chemical information of molecules in a numerical, computer-interpretable form that is suitable for model development [18, 19]. In this study, 2D and 3D molecular descriptors were calculated using the Molecular Operating Environment (MOE™) software [20]. 2D descriptors are calculated from the 2D structure of a molecule and provide information related to its structural, topological and physicochemical properties [21]. On the other hand, 3D descriptors are generated from the 3D structure of a chemical compound and include electronic parameters (e.g. dipole moment), quantum-chemical descriptors (e.g. HOMO and LUMO energies), and surface:volume descriptors [19, 22, 23].
Molecular ngerprints are a digital representation of a molecule’s structure using binary vectors, where 1
corresponds to a particular feature being present and 0 that it is absent. Several different categories of
molecular ngerprints exist, each reecting different aspects of a molecule [24]. Herein, extended-
connectivity ngerprints (ECFP) of 1,024- and 2,048-bit lengths and RDKit topological ngerprints of
2,048-bit length were generated using the RDKit Python environment [25]. Lastly, the combination of
molecular descriptors with ECFPs was tested.
Results And Discussion
Visualization of chemical space
This study involved high-dimensional datasets containing hundreds of molecular fingerprints and descriptors. Principal component analysis (PCA) was applied to reduce the chemical space to two dimensions. The chemical space representations for the ECFP, RDKit fingerprints, molecular descriptors and the combination of ECFP with molecular descriptors produced using PCA are shown in Fig. 1.
In chemical space visualisation, structural analogues are positioned nearer to each other than to unrelated compounds [26]. This allows dimensionality reduction techniques, such as PCA, to reveal neighbourhoods of similarly structured molecules [26]. Thus, some degree of clustering was expected among the active compounds.
Among the single descriptor types, Fig. 1(a-d), the highest degree of clustering between active molecules was observed in the chemical space visualisation of the molecular descriptors. One explanation is that the chemical fingerprints used in this study were hashed fingerprints. Hashed fingerprints often involve loss of information due to bit collisions; thus, the distances between the fingerprints may not perfectly correlate with the similarity of the compounds [27]. Interestingly, the chemical space visualisation of the combined feature type, Fig. 1e, is almost identical to that of the molecular descriptors shown in Fig. 1d. This indicates that the molecular descriptors have a stronger expressive power than the ECFPs of 1,024-bit length for the chemical space analysis of the DrugAge database.
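The effect of bit collisions can be illustrated with a toy folding scheme. The sketch below is not the ECFP or RDKit algorithm; it assumes a generic hash (SHA-256) and hypothetical substructure identifiers, and only shows why a short bit vector forces distinct features onto the same bit.

```python
import hashlib

def fold_to_bits(identifiers, n_bits):
    """Hash each identifier to one position of an n_bits-long bit vector.

    Returns the bit vector and the number of identifiers that landed on a
    position that was already set (bit collisions).
    """
    bits = [0] * n_bits
    collisions = 0
    for ident in sorted(set(identifiers)):
        digest = hashlib.sha256(ident.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        if bits[index]:
            collisions += 1  # a different identifier already set this bit
        bits[index] = 1
    return bits, collisions

# Hypothetical substructure identifiers (placeholders, not real atom environments).
identifiers = [f"substructure_{i}" for i in range(200)]

bits_64, collisions_64 = fold_to_bits(identifiers, 64)
bits_2048, collisions_2048 = fold_to_bits(identifiers, 2048)
print(collisions_64, collisions_2048)
```

With 200 identifiers and only 64 bits, at least 136 collisions are unavoidable. A longer vector typically reduces collisions but does not eliminate them.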
Feature selection
Feature selection was employed to select the most relevant features for predicting the activity of a molecule in the database. This was performed only on the training set, which contained 80% of the compounds in the dataset. Feature selection was achieved by applying variance- and mutual information-based pre-selection methods. This reduced the number of features used by each model, making the calculations computationally less expensive. The median AUC scores and standard deviations of 10-fold cross-validation obtained by random forest classification for each feature combination can be found in Supplementary Table 1, Additional File 1. For each descriptor type, the feature combination with the highest AUC score in 10-fold cross-validation was selected for classifying the compounds in the test set. In cases where two feature combinations achieved the same AUC score, the combination with the smallest standard deviation was used.
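A minimal sketch of the two filter steps is given below, assuming discrete (e.g. binary fingerprint) features and a simple plug-in estimate of mutual information. The function names, the toy columns and the thresholds are illustrative assumptions and not the implementation used in the study.

```python
import math
from collections import Counter

def variance(column):
    """Population variance of one feature column."""
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)

def mutual_information(feature, labels):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

def select_features(columns, labels, min_variance=0.0, top_k=2):
    """Drop low-variance columns, then keep the top_k by mutual information."""
    kept = {name: col for name, col in columns.items() if variance(col) > min_variance}
    ranked = sorted(kept, key=lambda name: mutual_information(kept[name], labels), reverse=True)
    return ranked[:top_k]

# Toy training data: 1 = active, 0 = inactive (hypothetical values).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "bit_a": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly informative
    "bit_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
    "bit_c": [0, 0, 0, 0, 0, 0, 0, 0],  # constant, removed by the variance filter
    "bit_d": [1, 1, 1, 0, 0, 0, 0, 0],  # partly informative
}
selected = select_features(columns, labels)
print(selected)
```

As in the text, such a filter would be fitted on the training set only, so that no information from the test set leaks into the choice of features.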
Model Selection
The test set contained the 20% of the data not used in training the models. The performances of the random forest classifiers on 10-fold cross-validation and on classifying the compounds in the test set are shown in Table 1.
Table 1
Model performances
| Model | Number of selected features | Cross-validation (AUC ± stdev) | Test set (AUC) |
| --- | --- | --- | --- |
| ECFP_1024 | 55 | 0.794 ± 0.048 | 0.793 |
| ECFP_2048 | 504 | 0.789 ± 0.042 | 0.776 |
| RDKit5 | 654 | 0.836 ± 0.053 | 0.777 |
| MD | 69 | 0.823 ± 0.041 | 0.815 |
| ECFP_1024_MD | 33 | 0.828 ± 0.040 | 0.806 |
As illustrated in Fig.2, the predictive performances of the random forest models did not signicantly drop
for classifying the compounds in the test set and were compatible with the spread of the AUC scores
from cross-validation. This indicated that overtting was minimised.
The receiver operating characteristic (ROC) curve is the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at varying classification thresholds. The ROC curves, displayed in Fig. 3, compare the performances of the descriptor types for classifying the samples of the test set. Analysis of the ROC curves indicates that the five random forest models performed better than a random prediction.
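The relationship between the ROC curve and the AUC score can be sketched in a few lines. The example below assumes binary labels (1 = active) and hypothetical scores; it computes the AUC as the probability that a randomly chosen active compound is scored above a randomly chosen inactive one, which equals the area under the (FPR, TPR) curve.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the observed scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical predicted probabilities for six test compounds.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))
print(roc_points(labels, scores))
```

A classifier that assigns the same score to every compound obtains an AUC of 0.5 under this definition, which is the random-prediction baseline against which the five models are compared.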
The best performing model, selected by its ability to correctly classify the compounds in the test set, was used for predicting the class of the compounds in the screening dataset. In general, the random forest models with a smaller number of selected features, such as ECFP_1024, MD and ECFP_1024_MD, performed better on the test set. The classifier built using only molecular descriptors, the MD model, had the greatest ability to correctly predict the class of the compounds in the test set. Combining MD with ECFP_1024, the single-descriptor model with the second-highest predictive ability, did not result in higher performance. The ECFP_1024 features may have provided additional information that was not useful to the random forest classifier, making the predictions more difficult. Therefore, the MD model, which had an AUC score of 0.815 for classifying the compounds in the test set, was selected for further analysis.
Confusion matrix
The confusion matrix of the MD random forest model for predicting the class of the molecules in the test set is shown in Fig. 4. The classification accuracy of the model was 0.853 and the AUC score was 0.815. The calculation of the Positive Predictive Value (PPV), Eq. 1, and Negative Predictive Value (NPV), Eq. 2, is shown below:
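In terms of confusion matrix counts, the standard definitions of the two quantities can be written as follows, where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives (symbols assumed here):

```latex
\begin{align}
\mathrm{PPV} &= \frac{TP}{TP + FP} \tag{1} \\
\mathrm{NPV} &= \frac{TN}{TN + FN} \tag{2}
\end{align}
```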
In binary classication, the PPV and NPV are the percentage of positive and negative values, respectively,
that are correctly classied. Herein, the PPV and NPV indicate that the random forest model performed
better on correctly classifying inactive compounds than active ones. The data used in this study was
imbalanced as approximately 79% of the samples were negative entries. Thus, a random prediction that a
compound is inactive had a much higher initial probability of being correct. To handle the imbalanced
data, the “class_weight” argument of the random forest algorithm was set to “balanced”, which penalises
misclassication of the minority class [29]. This improved the performance of the model, as the PPV for
classifying the compounds of the test set increased from 61.1% (value without balancing the class
weights) to 65.6% (score achieved after balancing the class weights).
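The effect of balanced class weights can be sketched as follows. The example assumes the commonly documented heuristic in which each class weight equals n_samples / (n_classes × class count), and uses a hypothetical set of 100 compounds with the 79:21 class ratio described above; it is an illustration, not the study's code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical training labels: 79 inactive (0) and 21 active (1) compounds.
labels = [0] * 79 + [1] * 21
weights = balanced_class_weights(labels)
print(weights)
```

Under this weighting, each active compound counts roughly 3.8 times as much as an inactive one, so both classes contribute the same total weight during training.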
Feature importance
In this experiment, feature relevance was measured using the “Gini importance” of the random forest algorithm. The selected model, MD, was composed of 69 molecular descriptors calculated by the MOE™ software [30]. The table containing the full feature ranking can be found in Additional File 2. The analysis focused on the 30 features with the highest Gini importance (Table 2), which included both 2D and 3D molecular descriptors.
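Gini importance is built from the decrease in Gini impurity produced by each split. The sketch below assumes binary labels and a single hypothetical split; in a forest, the decreases are summed over every node that splits on a feature, averaged over trees and normalised so that the importances sum to 1. The raw importance values in the example are invented for illustration.

```python
def gini_impurity(labels):
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(labels) / n
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of one split, scaled by the node's share of all samples."""
    n = len(parent)
    children = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
    return (n / n_total) * (gini_impurity(parent) - children)

def normalise(raw_importances):
    """Rescale summed impurity decreases so that they add up to 1."""
    total = sum(raw_importances.values())
    return {name: value / total for name, value in raw_importances.items()}

# One hypothetical split of eight compounds (1 = active) on some descriptor.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)

# Hypothetical summed decreases for three descriptors.
importances = normalise({"a_nN": 0.30, "TPSA": 0.15, "zagreb": 0.05})
print(importances)
```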
Table 2
Top ranking MD descriptors.
| Importance | Feature | Description |
| --- | --- | --- |
| 0.062 | a_nN | Number of nitrogen atoms |
| 0.029 | PEOE_VSA+2 | Total positive van der Waals surface area of atoms with a partial charge in the range of 0.10 to 0.15 |
| 0.026 | vsurf_D8 | Hydrophobic volume |
| 0.024 | h_pKa | The pKa of the reaction that removes a proton |
| 0.023 | SMR_VSA6 | Sum of van der Waals surface areas such that the molar refractivity contribution is in the range of 0.485 to 0.560 |
| 0.023 | rsynth | A value in [0,1] indicating the synthetic reasonableness, or feasibility, of the chemical structure. A value of 0 means it is unlikely that the molecule can be synthesized while a value of 1 means that it is likely that the molecule can be synthesized. The value reflects the fraction of heavy atoms in the molecule that can be traced back to starting materials fragments resulting from retrosynthetic disconnection rules. |
| 0.022 | PEOE_VSA-4 | Total positive van der Waals surface area of atoms with a partial charge in the range of -0.25 to -0.20 |
| 0.021 | PEOE_VSA+4 | Total positive van der Waals surface area of atoms with a partial charge in the range of 0.20 to 0.25 |
| 0.021 | PEOE_VSA-6 | Total positive van der Waals surface area of atoms with a partial charge that is less than -0.30 |
| 0.021 | PEOE_VSA_PPOS | Total positive van der Waals surface area of atoms with a partial charge that is greater than 0.20 |
| 0.020 | chi0_C | Carbon connectivity index (order 0) |
| 0.020 | Q_VSA_PNEG | Total negative polar van der Waals surface area of atoms with a partial charge that is less than -0.20 |
| 0.020 | PEOE_VSA_POL | Total polar van der Waals surface area of atoms of which the absolute value of their partial charge is greater than 0.20 |
| 0.020 | chi0v_C | Carbon valence connectivity index (order 0) |
| 0.019 | SMR_VSA3 | Sum of van der Waals surface areas such that the molar refractivity contribution is in the range of 0.35 to 0.39 |
| 0.019 | Q_VSA_PPOS | Total positive van der Waals surface area of atoms with a partial charge that is greater than 0.20 |
| 0.018 | b_single | Number of single bonds |
| 0.018 | a_count | Number of atoms |
| 0.018 | SlogP_VSA3 | Sum of van der Waals surface areas such that the logP(o/w) is in the range of 0.0 to 0.1 |
| 0.018 | PEOE_VSA_PNEG | Total negative polar van der Waals surface area of atoms with a partial charge that is less than -0.20 |
| 0.017 | TPSA | Topological polar surface area |
| 0.017 | zagreb | Zagreb index |
| 0.017 | weinerPol | Wiener polarity number |
| 0.017 | opr_brigid | The number of rigid bonds |
| 0.017 | Kier3 | Third kappa shape index |
| 0.016 | PEOE_VSA-1 | Total positive van der Waals surface area of atoms with a partial charge in the range of -0.10 to -0.05 |
| 0.016 | chi0 | Atomic connectivity index (order 0) |
| 0.016 | Kier2 | Second kappa shape index |
| 0.016 | SlogP_VSA2 | Sum of van der Waals surface areas such that the logP(o/w) is in the range of -0.2 to 0.0 |
| 0.015 | a_nH | Number of hydrogen atoms |
Top 30 features ranked by Gini importance for the MD random forest model. The descriptions of the features were taken from the MOE™ software documentation [30].
The highest-ranking features were broadly separated into the following categories: (i) atom and bond counts, (ii) topological descriptors and (iii) partial charge descriptors.
Atom and bond counts are simple descriptors that do not provide any information on molecular geometry or atom connectivity. The highest-ranking atom and bond count descriptors were a_nN, b_single, a_count, opr_brigid and a_nH. While very simplistic, the atom and bond counts outperformed other, more complex molecular descriptors. This is because atom and bond counts can partially capture the overall properties of a compound, such as size, hydrogen bonding and polarity, which often impact the activity of a drug [31]. The number of nitrogen atoms, a_nN, was the top-ranking feature of the MD random forest model, with a Gini importance score of 0.062. This is consistent with the results of Barardo et al. (2017), where a_nN was also ranked highest for predicting the class of the compounds in the DrugAge database [4]. Nitrogen atoms could have affected the physicochemical properties of the drugs as well as the interactions and binding of the molecules with target residues.
The highest-ranking topological descriptors included chi0_C, chi0v_C, zagreb, weinerPol, Kier3, chi0 and Kier2. Topological descriptors take into account atom connectivity. The descriptors are computed from molecular graphs, where atoms are represented by vertices and bonds by edges [32]. These descriptors can provide information on the degree of branching of the structure as well as molecular size and shape [32]. Although topological descriptors are extensively used in predictive modelling, they are usually hard to interpret [33]. Topological descriptors may have provided information on how well a molecule fits in the binding site and, along with atom counts, on the interactions with the binding residues.
The top-ranking partial charge descriptors were PEOE_VSA+2, PEOE_VSA-4, PEOE_VSA+4, PEOE_VSA-6, PEOE_VSA_PPOS, Q_VSA_PNEG, PEOE_VSA_POL, Q_VSA_PPOS, PEOE_VSA_PNEG and PEOE_VSA-1. The “PEOE_” prefix denotes descriptors calculated using the partial equalization of orbital electronegativity (PEOE) algorithm for quantification of partial charges in the system [34, 35]. On the other hand, descriptors prefixed with “Q_” were calculated using the Amber10:EHT force field [30]. In a ligand-receptor system, partial charges can play a key role in the binding properties of the molecule as well as molecular
Predicting potential lifespan-extending compounds
The MD random forest model was applied to predict the class of compounds in an external database consisting of 1,738 small molecules obtained from the DrugBank database [36]. The top-ranking compounds, with a predictive probability of ≥ 0.80 for increasing the lifespan of C. elegans, are shown in Table 3. The full ranking of the molecules in the screening database can be found in Additional File 2.
The compounds were broadly separated into the following categories: (i) flavonoids, (ii) fatty acids and conjugates, and (iii) organooxygen compounds. The compound classification was taken from the category “Class” in the chemical taxonomy section of the DrugBank database (provided by ClassyFire) or assigned manually if not available [37].
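The screening step can be sketched as a ranking of compounds by predicted probability followed by a cut-off at 0.80. The example below is a simplification that assumes the forest probability is the fraction of trees voting "active" (common implementations instead average per-tree class probabilities); the compound names and votes are hypothetical placeholders.

```python
def forest_probability(tree_votes):
    """Fraction of trees voting for the positive class (hard-vote simplification)."""
    return sum(tree_votes) / len(tree_votes)

def top_hits(probabilities, threshold=0.80):
    """Compounds at or above the threshold, sorted by decreasing probability."""
    hits = [(name, p) for name, p in probabilities.items() if p >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical votes from a ten-tree forest for three placeholder compounds.
votes = {
    "compound_A": [1] * 8 + [0] * 2,
    "compound_B": [1] * 9 + [0] * 1,
    "compound_C": [1] * 5 + [0] * 5,
}
probabilities = {name: forest_probability(v) for name, v in votes.items()}
hits = top_hits(probabilities)
print(hits)
```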
Table 3
Top-hit compounds from external database.
| Compound name | Predictive probability |
| --- | --- |
| Diosmin | 0.96 |
| Gamolenic acid | 0.95 |
| Rutin | 0.95 |
| Hesperidin | 0.94 |
| Lactose | 0.89 |
| 6''-O-Malonyldaidzin | 0.84 |
| Fidaxomicin | 0.84 |
| Sucrose | 0.83 |
| Lactulose | 0.83 |
| Sodium aurothiomalate | 0.82 |
| Aloin | 0.81 |
| Rifapentine | 0.81 |
| Plecanatide | 0.80 |
| Calcifediol | 0.80 |
| Chlortetracycline | 0.80 |
Chemical compounds from the screening database with a predictive probability of 0.80 or above for increasing the lifespan of C. elegans.
Flavonoids are a group of secondary metabolites in plants that are common polyphenols in the human diet [38]. Major nutritional sources include tea, soy, fruits, vegetables, wine and nuts [38, 39]. Flavonoids are separated into subclasses based on their chemical structure, including flavones, flavonols, flavanones and isoflavones [38]. Isoflavones differ from other flavonoids by having ring B attached to the C-3 position of ring C, rather than the C-2 position, as shown in Fig. 5 [38].
Flavonoids have been associated with health benefits for age-related conditions such as metabolic diseases, cancer, inflammation and cognitive decline [38, 39]. Possible mechanisms of action include antioxidant activity, scavenging of radicals, central nervous system effects, alteration of intestinal transport, sequestration and processing of fatty acids, PPAR activation and an increase in insulin sensitivity.
Diosmin was the top-hit molecule in the screening database, with a predictive probability of 0.96. Diosmin is a flavone glycoside that is either extracted from plants such as Rutaceae or obtained synthetically [40]. It has anti-inflammatory, free radical scavenging and anti-mutagenic properties and has been used medically to treat pain and bleeding of haemorrhoids, chronic venous disease and lymphedema [41]. Nevertheless, diosmin has poor aqueous solubility, which is a challenge for oral administration [42]. A 2017 study found that a combination of diosmin with essential oils showed skin antioxidant, anti-ageing and sun-blocking effects in mice [42]. The underlying mechanisms for diosmin’s anti-ageing and photo-protective effects include enhancing lymphatic drainage, ameliorating capillary microcirculation inflammation and preventing leukocyte activation, trapping and migration [42, 43].
Other avonoids that ranked high for increasing the lifespan of
C. elegans
were rutin and hesperidin with
a predictive probability of 0.95 and 0.94, respectively. Rutin (or quercetin-3-rutinoside), is a avonol
glycoside that is abundant in many plants such as passionower, apple, tea, buckwheat seeds and citrus
fruits [44, 45]. It possesses a range of biological properties including antioxidant, anticancer,
neuroprotective, cardio-protective and skin-regenerative activities [44, 45]. Rutin had a high structural
similarity to other avonoids in the DrugAge database and particularly with quercetin 3-O-β-d-
glucopyranoside-(41)-β-d-glucopyranoside (Q3M). The Tanimoto coecient between the RDKit
ngerprints of Q3M and rutin was 0.99. The similarity map between the two compounds is shown in
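For reference, the Tanimoto coefficient between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents each fingerprint as a set of on-bit indices; the bit positions are hypothetical and are not the fingerprints of rutin or Q3M.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical on-bit positions for two molecules.
fp_first = {1, 5, 9, 12}
fp_second = {1, 5, 9, 20}
print(tanimoto(fp_first, fp_second))
```

A value of 1 means the two fingerprints set exactly the same bits, which, for hashed fingerprints, does not by itself guarantee identical structures.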
Q3M is a avonoid abundant in onion peel that was found to extend the lifespan of
C. elegans
[47]. In the
same study, although rutin was found to improve the tolerance of
C. elegans
to oxidative stress, which is
desirable for longevity , it did not affect the worm's lifespan [47]. Davalli
et al.
(2016) also reported that
rutin did not improve the longevity of
C. elegans
[48]. On the other hand, Chattopadhyay
et al.
showed the rutin promoted longevity in a species of y,
D. melanogaster
Hesperidin has shown reactive oxygen species (ROS) inhibition and anti-ageing effects in the yeast Saccharomyces cerevisiae [49]. Fernández-Bedmar et al. (2011) found that hesperidin extracted from orange juice had a positive influence on the lifespan of D. melanogaster [50]. Wang et al. showed that orange extracts, in which hesperidin was the predominant phenolic compound, increased the mean lifespan of C. elegans [51]. In the same study, orange extracts were also found to promote longevity by enhancing motility and reducing the accumulation of age pigment and ROS levels [51].
Soy isoavones include genistein, glycitein, and daidzein. Genistein, a compound of the DrugAge, has
been found to prolong the lifespan of
C. elegans
and increase its tolerance to oxidative stress [52].
et al.
(2005) found that
C. elegans
fed with soy isoavone glycitein had an improved
resistance towards oxidative stress [53]. However, in comparison to control worms, the lifespan of
fed with glycitein was not signicantly affected [53]. The effect of daidzein on the lifespan of
in the presence of pathogenic bacteria was investigated by Fischer
et al.
(2012) [54]. The study
Page 13/26
found that daidzein had an estrogenic effect that which extended the worm’s lifespan in presence of
pathogenic bacteria and heat [54]. Herein, we applied the MD random forest model to predict the effect of
6''-O-malonyldaidzin on the lifespan of
C. elegans.
6''-O-Malonyldaidzin is an o-glycoside derivative of
daidzein found in food products such as soybean, miso, soy milk and soy yoghurt [55]. Its predicted
probability for extending the lifespan of the worm was 0.84.
Fatty acids and conjugates
Lipid metabolism has an essential role in many biological processes of an organism. Lipids are used as energy storage in the form of triglycerides and can therefore aid survival under severe conditions [56]. Additionally, lipids have a key role in intercellular and intracellular signalling as well as organelle homeostasis [57]. Research on both invertebrates and mammals suggests that alterations in lipid levels and composition are associated with ageing and longevity [56, 57].
A recent review by Johnson and Stolzing (2019) on lipid metabolism and its role in ageing, lifespan extension and age-related conditions summarised key lipid-related interventions that promote longevity in C. elegans [58]. Some of the studies presented in that review are reported here. A 2013 study of the fasting response showed that supplementing C. elegans with the ω-6 polyunsaturated fatty acids (PUFAs) arachidonic acid and di-homo-γ-linoleic acid increased the worm’s starvation resistance and prolonged its lifespan by stimulating autophagy [59]. Similarly, Qi et al. (2017) found that treating C. elegans with the ω-3 PUFA α-linolenic acid extended the worm's lifespan in a dose-dependent manner [60]. The study indicated that the ω-3 fatty acid underwent oxidation to generate a group of molecules known as oxylipins. The findings suggested that the increase in the worm’s lifespan could be a result of the combined effects of α-linolenic acid and its oxylipin metabolites [60]. Sugawara et al. (2013) found that a low dose of fish oils, which contained the PUFAs eicosapentaenoic acid and docosahexaenoic acid, significantly increased the lifespan of C. elegans [61]. The authors proposed that a low dose of fish oils induces moderate oxidative stress that extends the lifespan of the organism. In contrast, large amounts of fish oils had a diminishing effect on the worm's lifespan [61].
Gamolenic acid, or γ-linolenic acid (GLA), was the second top-hit molecule of the screening database, with a predictive probability of 0.95. GLA is an ω-6 PUFA composed of an 18-carbon chain with three double bonds at the 6th, 9th and 12th positions [62]. Rich sources of GLA include evening primrose oil (EPO), black currant oil and borage oil [63]. In mammals, GLA is synthesized from dietary linoleic acid via the action of the enzyme Δ-6 desaturase [62, 63]. GLA is a precursor for other essential fatty acids such as arachidonic acid [62, 63]. Conditions such as hypertension and diabetes, as well as stress and various aspects of ageing, reduce the capacity of Δ-6 desaturase to convert linoleic acid to GLA [64]. This may lead to a deficiency of long-chain fatty acid derivatives and metabolites of GLA. GLA has been used as a constituent of anti-ageing supplements and has been shown to possess various therapeutic effects in humans, including improvement of age-related anomalies [62].
Sodium aurothiomalate, with a lifespan increase probability of 0.82, is a thia short-chain fatty acid used
for the treatment of rheumatoid arthritis and has potential antineoplastic activities [37, 65]. In preclinical
models, sodium aurothiomalate inhibited protein kinase C iota (PKCι) signalling, which is overexpressed
in non-small cell lung, ovarian and pancreatic cancers [65]. The chemical structure of sodium
aurothiomalate is shown in Fig. 7.
Organooxygen compounds
Lactose, with a lifespan increase probability of 0.89, is a disaccharide found in milk and other dairy
products. In the human intestine, lactose is hydrolysed to glucose and galactose by the enzyme lactase.
Out of the compounds in the DrugAge database, lactose had the highest structural similarity to
trehalose. Trehalose has been found to increase the mean lifespan of C. elegans by over 30% without
showing any side effects [66]. The Tanimoto coefficient between the RDKit fingerprint representations of
trehalose and lactose was 0.85. The similarity map generated using ECFP fingerprints is shown in Fig. 8.
Even though lactose has a high Tanimoto similarity to trehalose, Xing et al. (2019) found that lactose
treatment shortened the lifespan of C. elegans [67].
Sucrose, with a lifespan increase probability of 0.83, is a disaccharide composed of glucose and fructose
[68]. It is the main form in which carbohydrates are transported in fruits and vegetables [68]. Other sugars
such as trehalose, galactose and fructose have been found to extend the lifespan of C. elegans [66, 69,
70]. However, Zheng et al. (2017) found that treating C. elegans with sucrose had no significant effect on
the organism’s mean lifespan [70]. In rats, sucrose has been found to shorten the mean lifespan and
elevate blood pressure [71]. Rovenko et al. (2015) showed that in D. melanogaster, high sucrose
consumption decelerated pupation, increased pupa mortality and promoted obesity [72].
Lactulose, with a lifespan increase probability of 0.83, is a synthetic disaccharide composed of the
monosaccharides fructose and galactose [72]. Lactulose has been shown to be an effective treatment for
chronic constipation in elderly patients and to improve cognitive function in patients with hepatic
encephalopathy [72, 73].
Other classes of compounds
Other compounds with a predictive probability ≥ 0.80 for increasing the lifespan of C. elegans included
aloin, a constituent of aloe vera, with a predictive probability of 0.81, as well as the antibiotics
fidaxomicin (predictive probability = 0.84), rifapentine (predictive probability = 0.81) and
chlortetracycline (predictive probability = 0.80).
Aloe vera is a well-known plant used in medicine, cosmetics and beverages. It possesses a wide range of
biological properties including anti-inflammatory, anticancer, laxative and antioxidant activities, as well
as promoting the healing of dermal injuries [74, 75]. Additionally, aloe vera has been associated with
improving disorders such as diabetes, microbial diseases, and cardiovascular and liver problems [75]. Its
biological activities have been attributed to the plethora of phytochemicals present in the aloe vera sap
and gel. Various studies have demonstrated that the anthraquinones and glycosides present in the sap
have a key role in its anticancer, anti-inflammatory and laxative effects, tyrosinase inhibition, free
radical and proliferative activities [48]. Chandrashekara et al. (2011) found that aloe vera
supplementation extended the lifespan of D. melanogaster larvae [76]. This effect was attributed to the
plethora of chemicals present in aloe vera, including proteins, lipids, amino acids and small molecules.
The authors proposed that the aloe vera extract had a similar effect on the fly’s lifespan to resveratrol,
including neuroprotection and stimulation of regrowth or repair of nerve fibres [76].
Aloin is a bioactive compound found in various Aloe species. It is composed of two diastereoisomers,
aloin A (barbaloin) and aloin B (isobarbaloin), which have similar chemical properties [55]. Aloin is an
anthraquinone glycoside, that is, an anthraquinone bearing a sugar molecule. Aloin has been used
medically as a stimulant laxative, alleviating constipation by triggering bowel movements [55]. In this
study, the MD random forest model was applied to predict the effect of aloin A on the lifespan of
C. elegans, giving a predictive probability of 0.81. Aloin has been found to possess anti-inflammatory,
antiproliferative and anticancer activities, as well as to protect dermal fibroblasts against oxidative
stress damage [77–80]. Experimental testing would be required to further investigate the effect of aloin A
on the lifespan of C. elegans.
Rifapentine is a macrolactam antibiotic approved for the treatment of tuberculosis [81]. Macrolactams
are a small class of compounds which consist of cyclic amides having unsaturation or heteroatoms
replacing one or more carbon atoms in the ring [37]. Macrolactams such as rifampicin and rifamycin
have been found to increase the lifespan of C. elegans [82].
Advanced glycation end (AGE) products are formed by the non-enzymatic reaction of sugars, such as
glucose, with proteins, lipids or nucleic acids [82]. AGE products have been implicated in ageing and
age-related diseases such as diabetes, atherosclerosis and neurodegenerative disorders [82]. Golegaonkar
et al. showed that rifampicin reduced AGE products and extended the mean lifespan of C. elegans by 60% [82].
The effect of two other macrolactams, rifamycin SV and rifaximin, on the worm’s lifespan was also
investigated. Rifamycin SV was found to exhibit similar activity to rifampicin, while rifaximin lacked
anti-glycating activity and did not extend the lifespan of C. elegans. The authors suggested that the
anti-glycation properties of rifampicin and rifamycin could be attributed to the presence of a
para-dihydroxyl moiety, which was not present in rifaximin [82]. As shown in Fig. 9, this functional group
is also present in rifapentine. Experimental testing would be required to investigate whether rifapentine
possesses similar properties to rifampicin and rifamycin.
Evaluation of the chemical similarity principle
Several of the compounds identified by the random forest model had already been experimentally
evaluated for increasing the lifespan of C. elegans and other model organisms. In particular, the RDKit
fingerprint of rutin has a Tanimoto similarity of 0.99 to that of Q3M, an active compound. However,
experimental studies found that, although it is structurally similar to active compounds, rutin does not
extend the lifespan of C. elegans [47, 48]. Additionally, the Tanimoto coefficient between the RDKit
fingerprint representations of lactose and trehalose, an active compound, is 0.85. Nevertheless, in vivo
studies showed that treatment with lactose reduced the lifespan of C. elegans [67]. In these cases, the
chemical similarity principle, which states that chemically similar compounds tend to have similar
bioactivities, does not appear to hold. An explanation presented by Martin et al. (2002) is that protein
structures are complex and flexible systems [83]. Thus, structurally similar chemicals may bind in
different orientations to the active site, interact with a different conformation of the protein or even bind
to completely different proteins [83].
Pharmaceutical interventions that modulate ageing-related genes and pathways are considered the most
effective approach for combating human ageing and age-related diseases. Widely used strategies for
identifying active compounds include screening existing drugs with potential anti-ageing activities.
In this study, the random forest algorithm was applied to analyse the DrugAge database and predict
whether a compound would increase the lifespan of C. elegans. Five different random forest models were
built using molecular fingerprints and/or molecular descriptors as features. Feature selection and
dimensionality reduction were performed using variance and mutual information-based pre-selection
methods. The best performing classifier, the MD model, used molecular descriptors and achieved an AUC
score of 0.815 for classifying the compounds in the test set. Combining molecular descriptors with
ECFPs did not further improve the model’s performance. The features of the MD model were ranked using
the Gini importance measure of the random forest algorithm. Among the 30 most important features were
molecular descriptors related to atom and bond counts, topological and partial charge properties.
The highest performing model was applied to predict the class of the compounds in the screening
database, which consisted of 1,738 small molecules from DrugBank. The compounds with a predictive
probability of ≥ 0.80 for increasing the lifespan of C. elegans were broadly separated into (i) flavonoids,
(ii) fatty acids and conjugates, and (iii) organooxygen compounds. This study also highlighted several
molecules, such as orange extracts, rutin, lactose and sucrose, that have been experimentally evaluated
in C. elegans but were not entries of the predictive database. Future work would include in vivo testing of
promising compounds such as γ-linolenic acid, aloin and rifapentine to investigate their effect on the
lifespan of C. elegans.
Dataset for predicting lifespan-extending compounds
The dataset published in the study by Barardo et al. (2017) contains positive entries, which are
compounds that “increase the lifespan of C. elegans”, and negative entries, compounds that “do not
increase the lifespan of C. elegans” [4]. In particular, the dataset contains 1,392 compounds, of which 229
are positive and 1,163 are negative entries [4]. The positive entries of this dataset were obtained from
the DrugAge database of ageing-related drugs (Build 2, release date: 01/09/2016), available on the Human
Ageing Genomic Resources website [1, 84]. DrugAge provides information on drugs, compounds and
supplements with anti-ageing properties that have been found to extend the lifespan of model organisms
[1]. The species include worms, mice and flies, with the majority of data representing C. elegans [4]. Data
have been obtained from studies performed under standard conditions and contain information relevant to
ageing, such as average/median lifespan, maximum lifespan, strain, dosage and gender where available
[1]. The negative entries of the database used in the study of Barardo et al. (2017) were obtained from the
At the time of writing, the latest version of the DrugAge database, Build 3 (release date: 19/07/2019),
corrects small errors and adds hundreds of new entries. Herein, the positive entries in the database used in
Barardo et al. (2017) were replaced with the data from the newest version of DrugAge, Build 3. The same
negative entries as Barardo et al. (2017) were used [4]. The modified database contained a total of 1,558
compounds with 395 positive entries and 1,163 negative ones. In this study, the term “DrugAge database”
refers to the modified dataset with a total of 1,558 compounds.
Representation of chemical compounds
The chemical structures of the DrugAge dataset were converted into canonical SMILES strings using the
Python package PubChemPy [85]. The SMILES strings were standardised by the Standardiser tool
developed by Francis Atkinson in 2014 [86]. Standardisation removed inorganic compounds, salt/solvent
components and metal species as well as neutralised the compounds by adding or removing hydrogen
atoms [86]. Stereoisomers, even though they may have different biological activities, were treated as
duplicates as they had identical SMILES strings. For two or more stereoisomers in the same class, only one
was kept. For duplicates in different classes, both were removed [87]. After standardisation and duplicate
removal, the number of molecules in the DrugAge database was reduced to a total of 1,430 compounds with 304
positive and 1,126 negative entries. The predictive database used in this study can be found in Additional
File 2.
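The duplicate-handling rule described above can be illustrated with a minimal sketch. It assumes each record is a pair of a standardised SMILES string and a class label (1 for positive, 0 for negative); the function name and the toy records are hypothetical and are not taken from the study's code.

```python
from collections import defaultdict

def remove_duplicates(records):
    """Collapse records that share a SMILES string.

    records: list of (smiles, label) pairs, label 1 = positive, 0 = negative.
    Duplicates within one class are reduced to a single entry; structures
    that appear in both classes are removed entirely.
    """
    labels_by_smiles = defaultdict(set)
    for smiles, label in records:
        labels_by_smiles[smiles].add(label)
    return [
        (smiles, next(iter(labels)))
        for smiles, labels in labels_by_smiles.items()
        if len(labels) == 1
    ]

# Toy records (placeholder structures, not entries of the DrugAge dataset).
records = [
    ("CCO", 1), ("CCO", 1),          # same structure, same class: keep one
    ("CC(=O)O", 1), ("CC(=O)O", 0),  # same structure, conflicting classes: drop both
    ("c1ccccc1", 0),
]
curated = remove_duplicates(records)
```

Here `curated` keeps one copy of the first structure and the unique third structure, while the conflicting pair is discarded.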
Molecular descriptor generation
The standardised SMILES strings were converted into mol files in the RDKit environment and opened in
the MOE™ software [25, 30]. The chemical structures were energy minimised in the Energy Minimize
General mode of MOE™ using the Amber10:EHT force field [30]. A total of 354 descriptors were calculated,
including all 2D descriptors as well as the internal coordinate-dependent (i3D) and external
coordinate-dependent (x3D) 3D descriptors. Due to software limitations, a few 3D descriptors ('AM1_E',
'AM1_Eele', 'AM1_HF', 'AM1_HOMO', 'AM1_IP', 'AM1_LUMO',
'MNDO_E', 'MNDO_Eele', 'MNDO_HF', 'MNDO_HOMO', 'MNDO_IP', 'MNDO_LUMO', 'PM3_E', 'PM3_Eele',
'PM3_HF', 'PM3_HOMO', 'PM3_IP', 'PM3_LUMO') could not be calculated for ten chemical structures. The
missing values were replaced with the average value of the remaining chemical structures for the given
descriptor.
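Replacing missing descriptor values with the average over the remaining structures can be sketched as below. This is only an illustration under the assumption that a failed calculation is stored as `None`; the descriptor values are invented placeholders.

```python
def impute_column_mean(values):
    """Replace None entries with the mean of the entries that are available."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Hypothetical values of one descriptor for five structures;
# None marks a structure for which the calculation failed.
descriptor = [-9.5, None, -8.5, -10.0, None]
filled = impute_column_mean(descriptor)
```

Each descriptor column is treated independently, so the imputed value for one descriptor never depends on the values of another.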
Molecular ngerprint generation
Page 18/26
Molecular ngerprints were generated in the Python RDKit environment from the standardised SMILES
strings [25]. ECFP of 1,024-bits and 2,048-bits length were calculated with an atomic radius of 2. These
were represented as “ECFP_1024” and “ECFP_2048”, respectively. In addition to the ECFPs, RDKit
topological ngerprints were generated with a maximum path length of 5 bonds and denoted as
Five random forest models were build using ve different feature types and trained with the data of the
DrugAge database. The feature types explored in this study, ECFP_1024, ECFP_2048, RDKit5, MD and
ECFP_1024_MD, are described in Table4. The ECFP_1024_MD feature was a combined descriptor type
consisting of ECFPs of 1,024 bit-length and molecular descriptors.
Table 4
Description of feature types explored in this study.
Database name | Feature description | Number of features
ECFP_1024 | ECFP of 1,024-bit length generated in the Python RDKit environment | 1,024
ECFP_2048 | ECFP of 2,048-bit length generated in the Python RDKit environment | 2,048
RDKit5 | RDKit topological fingerprints with a maximum path length of 5 bonds generated in the Python RDKit environment | 2,048
MD | 2D and 3D molecular descriptors calculated in MOE™ | 354
ECFP_1024_MD | Combination of “ECFP_1024” and “MD” descriptors | 1,378
Feature selection
Feature selection was performed for each of the descriptor types shown in Table 4 and implemented in
the scikit-learn Python library [88]. Features with low variance were removed first, creating three
sub-databases: var_100, var_95 and var_90. These removed features with the same value in all entries,
features with more than 95% constant values and features with more than 90% constant values,
respectively [89].
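The three low-variance filters can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the study: it assumes that "constant values" refers to the share of entries taken by the most frequent value of a feature, and the feature names and data are hypothetical.

```python
from collections import Counter

def dominant_share(column):
    """Fraction of entries taken by the most frequent value of a feature."""
    return Counter(column).most_common(1)[0][1] / len(column)

def drop_near_constant(features, cutoff):
    """Keep features whose most frequent value covers at most `cutoff` of the entries.

    Fully constant features are always dropped, so cutoff=1.0 corresponds
    to var_100, cutoff=0.95 to var_95 and cutoff=0.90 to var_90.
    """
    kept = {}
    for name, column in features.items():
        share = dominant_share(column)
        if share < 1.0 and share <= cutoff:
            kept[name] = column
    return kept

# Hypothetical binary features over 100 compounds.
features = {
    "bit_a": [0] * 100,            # constant in all entries
    "bit_b": [0] * 97 + [1] * 3,   # 97% constant
    "bit_c": [0] * 92 + [1] * 8,   # 92% constant
    "bit_d": [0] * 60 + [1] * 40,  # informative
}
var_100 = drop_near_constant(features, 1.00)
var_95 = drop_near_constant(features, 0.95)
var_90 = drop_near_constant(features, 0.90)
```

With these toy features, each stricter cutoff removes one more feature, which mirrors how the three sub-databases become progressively smaller.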
For each of the sub-databases, Adjusted Mutual Information (AMI) was applied using the
“adjusted_mutual_info_score” function of scikit-learn to order the features based on their AMI score [88].
The following settings were tested: using 5%, 10%, 25%, 50%, 75% and 100% of the features with the
highest AMI score [89]. For example, if var_100 for MD contained 349 features, the database with 5% of
the features would consist only of the 17 highest-ranking features. This process is outlined in
Supplementary Fig. 1, Additional File 1.
10-fold Cross-validation
Cross-validation was performed in the scikit-learn Python library using the “cross_val_score” function [88].
The predictive database was randomly split into an 80% training set and a 20% test set. The 10-fold
cross-validation was performed only on the training set. The performance of the models was evaluated using
the AUC measure. Cross-validation was repeated 10 times, yielding 10 AUC scores. The predictive
accuracy reported was the median AUC value of the 10 measurements obtained by cross-validation. The
median, rather than the average, AUC score was calculated as the former is more robust to outliers [4].
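The AUC measure and the effect of summarising repeated scores by the median can be illustrated with a small standard-library sketch. The pairwise formulation of the AUC used here is a textbook definition rather than the scikit-learn implementation, and all scores are hypothetical.

```python
from statistics import mean, median

def auc(labels, scores):
    """AUC as the probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities for one validation fold.
fold_auc = auc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.3, 0.1])

# Hypothetical AUC scores from ten repeats, one of which is an outlier.
repeat_aucs = [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.79, 0.55]
summary = {"median": median(repeat_aucs), "mean": mean(repeat_aucs)}
```

In this toy case the single low score pulls the mean down while leaving the median unchanged, which is the robustness property the text refers to.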
Random forest settings
The random forest classifiers were built in the scikit-learn Python module [88]. To handle the unbalanced
data used in this study, the random forest parameter “class_weight” was set to “balanced”. The remaining
parameters of the random forest classifier were left at their default settings. The models were run with
100 estimators (number of trees in the forest) and the maximum number of features considered at each
tree node was the square root of the total number of features. The AUC scores were calculated with
the “roc_auc_score” metric of scikit-learn using the “predict_proba” method [88].
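To make the two settings above concrete, the sketch below reproduces the arithmetic behind them. It assumes the "balanced" heuristic documented by scikit-learn, in which each class is weighted by n_samples / (n_classes × class count), and it uses the class counts of the full curated dataset purely for illustration, since the models were fitted on the training split only. Truncating the square root to an integer is likewise an assumption.

```python
import math

def balanced_class_weights(class_counts):
    """Class weights computed as n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Class counts of the curated DrugAge dataset (illustration only).
weights = balanced_class_weights({"positive": 304, "negative": 1126})

# Features examined at each split when the limit is the square root
# of the feature count, here for the 354 molecular descriptors.
features_per_split = int(math.sqrt(354))
```

The minority (positive) class receives the larger weight, so that both classes contribute equally in total to the training criterion.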
Chemical space implementation
The 2D representations of the chemical space were generated by applying the PCA algorithm in the
scikit-learn library [88]. Visualisation of molecular descriptors required feature scaling as the
descriptors had different ranges. Scale differences can negatively impact the performance of the PCA
model, as it incorrectly treats features with larger ranges as more important than others. After scaling,
each molecular descriptor had a mean of zero and a standard deviation of one [88].
Feature scaling was not required for the molecular fingerprints as they consisted only of binary values.
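The scaling step can be sketched as a per-descriptor standardisation. This is a minimal illustration using the population standard deviation, with hypothetical descriptor columns; it is not the code used to produce the chemical space plots.

```python
from statistics import fmean, pstdev

def standardise(column):
    """Scale one descriptor column to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical descriptor columns on very different scales.
columns = {
    "weight": [180.2, 342.3, 610.5, 94.1],
    "logp": [1.2, -3.1, 0.4, 2.0],
}
scaled = {name: standardise(col) for name, col in columns.items()}
```

After this step both columns vary over comparable ranges, so neither dominates the principal components simply because of its units.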
Screening database
The best performing model was applied to predict the class of the compounds in an external database,
where the effect of the compounds on the lifespan of C. elegans was mostly unknown. The external
database consisted of small molecules obtained from the External Drug Links database of DrugBank
(version 5.1.5, released on 2020-01-03) [36]. The External Drug Links database contained a list of drugs
and links to other databases, such as PubChem and UniProt, providing information on these compounds
[36, 90, 91].
Generation of SMILES strings, standardisation and descriptor calculation were performed using the same
methods as for the training (DrugAge) database, described in the above sections. Some of the entries of
the DrugBank database were substances composed of more than one molecule, such as vegetable oils.
These entries were either removed from the database or replaced by one of their main active
ingredients. For example, “borage oil” was replaced with “gamolenic acid”. In the case of “soy
isoflavones”, the major soy isoflavones (genistein, glycitein and daidzein) had already been
experimentally evaluated on the lifespan of C. elegans. Therefore, the entry was replaced with
“6''-O-malonyldaidzin”, a derivative of daidzein with unknown activity. Stereoisomers were treated as
duplicates and only one of them was kept. Substances and stereoisomers present in both the DrugBank and
DrugAge databases were removed from the screening database. The resulting database consisted of a
total of 1,738 small molecules.
Tanimoto coecient and similarity maps
Page 20/26
The Tanimoto coecients and similarity maps were computed in the Python RDKit environment [25]. The
Tanimoto similarity is calculated between a reference molecule, which is known to be active, and a
compound of interest with unknown activity.
Herein, the reference molecules were the positive entries of the DrugAge database. The compound with
unknown activity was a selected entry of the screening database which achieved a predictive probability
of  0.80 for increasing the lifespan of
C. elegans
. The Tanimoto coecient between the compound of
interest with each of the reference molecules was calculated. The highest score achieved as well as the
reference molecule used to obtain that score was reported. The Tanimoto coecients were computed
using the RDKit ngerprint representations of the compounds. Similarity maps were generated using
ECFP ngerprint representations.
Availability of data and materials
All software and datasets can be obtained by application to the authors at
Competing interests
The authors declare that there are no conflicts of interest.
Funding
This research was carried out as a final year project by SK; no funding was available or used.
Authors’ contributions
BJH designed and supervised the study. SK performed data curation, built the predictive models and
wrote the manuscript. BJH aided the interpretation of the findings and reviewed the manuscript providing
Acknowledgements
We are grateful to the members of the Department of Chemistry at the University of Surrey for their
support throughout the study. We also acknowledge Konstantinos Kallidromitis for reading the
manuscript and discussing the implementation and results obtained from the predictive models.
Authors' information
BJH (PhD) is a Senior Lecturer in Computational Chemistry at the University of Surrey, Department of
Chemistry. SK (BSc) was previously an undergraduate student at the University of Surrey and is currently a
graduate student at Imperial College London.
References
1. Barardo D, Thornton D, Thoppil H, et al (2017) The DrugAge database of aging-related drugs. Aging
Cell 16:594–597.
2. Qian M, Liu B (2019) Advances in pharmacological interventions of aging in mice. Transl Med Aging
3. Blagosklonny M V. (2018) Disease or not, aging is easily treatable. Aging (Albany NY) 10:3067–
4. Barardo DG, Newby D, Thornton D, et al (2017) Machine learning for predicting lifespan-extending
chemical compounds. Aging (Albany NY) 9:1721–1737.
5. Longo VD, Antebi A, Bartke A, et al (2015) Interventions to slow aging in humans: Are we ready?
Aging Cell 14:497–510.
6. Mack HID, Heimbucher T, Murphy CT (2018) The nematode Caenorhabditis elegans as a model for
aging research. Drug Discov Today Dis Model 27:3–13.
7. Son HG, Altintas O, Kim EJE, et al (2019) Age-dependent changes and biomarkers of aging in
Caenorhabditis elegans. Aging Cell 18:1–11.
8. Herndon LA, Wolkow CA, Driscoll M, Hall DH (2017) Effects of Ageing on the Basic Biology and
Anatomy of C. elegans. In: Olsen A, Gill MS (eds) Ageing: Lessons from C. elegans. Springer
International Publishing, Cham, pp 9–39
9. Das UN (2011) Molecular Basis of Health and Disease. In: Molecular Basis of Health and Disease,
1st ed. Springer, Dordrecht, pp 491–512
10. Apfeld J, Alper S (2018) What Can We Learn About Human Disease from the Nematode C. elegans?
Methods Mol Biol 1706:53–75.
11. Curran SP, Ruvkun G (2007) Lifespan regulation by evolutionarily conserved genes essential for
viability. PLoS Genet 3:0479–0487.
12. Lee GD, Wilson MA, Zhu M, et al (2006) Dietary deprivation extends lifespan in Caenorhabditis
elegans. Aging Cell 5:515–524.
13. Harrison DE, Strong R, Sharp ZD, et al (2009) Rapamycin fed late in life extends lifespan in
genetically heterogeneous mice. Nature 460:392–395.
14. Selman C, Tullet JMA, Wieser D, et al (2009) Ribosomal Protein S6 Kinase 1 Signaling Regulates
Mammalian Life Span. Science (80- ) 326:140–144.
15. Ye X, Linton JM, Schork NJ, et al (2014) A pharmacological network for lifespan extension in
Caenorhabditis elegans. Aging Cell 13:206–215.
16. Putin E, Mamoshina P, Aliper A, et al (2016) Deep biomarkers of human aging: Application of deep
neural networks to biomarker development. Aging (Albany NY) 8:1021–1033.
17. Mamoshina P, Kochetov K, Putin E, et al (2018) Population Specific Biomarkers of Human Aging: A
Big Data Study Using South Korean, Canadian, and Eastern European Patient Populations. J
Gerontol A Biol Sci Med Sci 73:1482–1490.
18. Winter R, Montanari F, Noé F, Clevert D-A (2019) Learning continuous and data-driven molecular
descriptors by translating equivalent chemical representations. Chem Sci 10:1692–1701.
19. Hong H, Xie Q, Ge W, et al (2008) Mold2, molecular descriptors from 2D structures for
chemoinformatics and toxicoinformatics. J Chem Inf Model 48:1337–1344.
20. Rinnie, Gaba V, Rani K, et al (2019) QSAR study on 4-alkynyldihydrocinnamic acid analogs as free
fatty acid receptor 1 agonists and antidiabetic agents: Rationales to improve activity. Arab J Chem
21. Roy K, Kar S, Das RN (2015) Chapter 2 - Chemical Information and Descriptors. In: Understanding the
Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press,
Boston, pp 47–80
22. Lo YC, Rensi SE, Torng W, Altman RB (2018) Machine learning in chemoinformatics and drug
discovery. Drug Discov Today 23:1538–1546.
23. Perkins R, Fang H, Tong W, Welsh WJ (2003) Quantitative structure-activity relationship methods:
Perspectives on drug discovery and toxicology. Environ Toxicol Chem 22:1666–1679.
24. Cereto-massagué A, José M, Valls C, et al (2015) Molecular fingerprint similarity search in virtual
screening. Methods 71:58–63.
25. RDKit: Open-source cheminformatics. Accessed April 2020.
26. Naveja JJ, Medina-Franco JL (2019) Finding Constellations in Chemical Space Through Core
Analysis. Front Chem 7:510.
27. Hirohara M, Saito Y, Koda Y, et al (2018) Convolutional neural network based on SMILES
representation of compounds for detecting chemical motif. BMC Bioinformatics 19:526.
28. Sonego P, Kocsor A, Pongor S (2008) ROC analysis: applications to the classification of biological
sequences and 3D structures. Brief Bioinform 9:198–209.
29. Chen C, Breiman L (2004) Using Random Forest to Learn Imbalanced Data. Univ California, Berkeley
30. Chemical Computing Group Inc (2019) Molecular Operating Environment (2019.01) Montreal,
31. Bender A, Glen RC (2005) A discussion of measures of enrichment in virtual screening: Comparing
the information content of descriptors with increasing levels of sophistication. J Chem Inf Model
32. Gozalbes R, Doucet JP DF (2002) Application of Topological Descriptors in QSAR and Drug Design:
History and New Trends. Infect Disord Drug Targets 2:93–102.
33. Guha R, Willighagen E (2012) A survey of quantitative descriptions of molecular structure. Curr Top
Med Chem 12:1946–1956.
34. Gasteiger J, Marsili M (1980) Iterative partial equalization of orbital electronegativity—a rapid access
to atomic charges. Tetrahedron 36:3219–3228.
35. Kleinoeder T (2005) Prediction of Properties of Organic Compounds - Emperical Methods and
Management of Property Data. PhD Thesis, University of Erlangen-Nuernberg.
36. Wishart DS, Feunang YD, Guo AC, et al (2018) DrugBank 5.0: a major update to the DrugBank
database for 2018. Nucleic Acids Res 46:D1074–D1082.
37. Djoumbou Feunang Y, Eisner R, Knox C, et al (2016) ClassyFire: automated chemical classification
with a comprehensive, computable taxonomy. J Cheminform 8:61.
38. Prasain JK, Carlson SH, Wyss JM (2010) Flavonoids and age-related disease: risk, benefits and
critical windows. Maturitas 66:163–171.
39. Ayaz M, Sadiq A, Junaid M, et al (2019) Flavonoids as Prospective Neuroprotectants and Their
Therapeutic Propensity in Aging Associated Neurological Disorders. Front Aging Neurosci 11:155.
40. Ramelet AA (2011) Venoactive Drugs. In: Goldman MP, Guex JJ, Weiss RA (eds) Sclerotherapy:
Treatment of Varicose and Telangiectatic Leg Veins, 5th ed. W.B. Saunders, Edinburgh, pp 369–377
41. Mangoni AA (2012) Drugs acting on the cerebral and peripheral circulations. In: Aronson JK (ed) A
worldwide yearly survey of new data in adverse drug reactions and interactions. Elsevier, pp 311–316
42. Kamel R, Abbas H, Fayez A (2017) Diosmin/essential oil combination for dermal photo-protection
using a lipoid colloidal carrier. J Photochem Photobiol B Biol 170:49–57.
43. Bergan JJ, Schmid-Schönbein GW, Takase S (2001) Therapeutic approach to chronic venous
insuciency and its complications: place of Daon 500 mg. Angiology 52 Suppl 1:S43-7.
44. Ganeshpurkar A, Saluja AK (2017) The Pharmacological Potential of Rutin. Saudi Pharm J 25:149–
45. Chattopadhyay D, Chitnis A, Talekar A, et al (2017) Hormetic ecacy of rutin to promote longevity in
Drosophila melanogaster. Biogerontology 18:397–411.
46. Riniker S, Landrum GA (2013) Similarity maps - a visualization strategy for molecular fingerprints
and machine-learning methods. J Cheminform 5:43.
47. Xue YL, Ahiko T, Miyakawa T, et al (2011) Isolation and Caenorhabditis elegans lifespan assay of
flavonoids from onion. J Agric Food Chem 59:5927–5934.
48. Davalli P, Mitic T, Caporali A, et al (2016) ROS, Cell Senescence, and Novel Molecular Mechanisms in
Aging and Age-Related Diseases. Oxid Med Cell Longev 2016:3565127.
49. Sun K, Xiang L, Ishihara S, et al (2012) Anti-Aging Effects of Hesperidin on Saccharomyces
cerevisiae via Inhibition of Reactive Oxygen Species and UTH1 Gene Expression. Biosci Biotechnol
Biochem 76:640–645.
50. Fernández-Bedmar Z, Anter J, de La Cruz-Ares S, et al (2011) Role of Citrus Juices and Distinctive
Components in the Modulation of Degenerative Processes: Genotoxicity, Antigenotoxicity,
Cytotoxicity, and Longevity in Drosophila. J Toxicol Environ Heal Part A 74:1052–1066.
51. Wang J, Deng N, Wang H, et al (2020) Effects of orange extracts on longevity, healthspan, and stress
resistance in Caenorhabditis elegans. Molecules 25:1–17.
52. Lee EB, Ahn D, Kim BJ, et al (2015) Genistein from Vigna angularis Extends Lifespan in
Caenorhabditis elegans. Biomol Ther (Seoul) 23:77–83.
53. Gutierrez-Zepeda A, Santell R, Wu Z, et al (2005) Soy isoflavone glycitein protects against beta
amyloid-induced toxicity and oxidative stress in transgenic Caenorhabditis elegans. BMC Neurosci
54. Fischer M, Regitz C, Kahl M, et al (2012) Phytoestrogens genistein and daidzein affect immunity in
the nematode Caenorhabditis elegans via alterations of vitellogenin expression. Mol Nutr Food Res
55. Wishart DS, Feunang YD, Marcu A, et al (2018) HMDB 4.0: the human metabolome database for
2018. Nucleic Acids Res 46:D608–D617.
56. Papsdorf K, Brunet A (2019) Linking Lipid Metabolism to Chromatin Regulation in Aging. Trends Cell
Biol 29:97–116.
57. Han S, Schroeder EA, Silva-García CG, et al (2017) Mono-unsaturated fatty acids link H3K4me3
modiers to C. elegans lifespan. Nature 544:185–190.
58. Johnson AA, Stolzing A (2019) The role of lipid metabolism in aging, lifespan regulation, and age-
related disease. Aging Cell 18:e13048.
59. O’Rourke EJ, Kuballa P, Xavier R, Ruvkun G (2013) ω-6 Polyunsaturated fatty acids extend life span
through the activation of autophagy. Genes Dev 27:429–440.
60. Qi W, Gutierrez GE, Gao X, et al (2017) The ω-3 fatty acid α-linolenic acid extends Caenorhabditis
elegans lifespan via NHR-49/PPARα and oxidation to oxylipins. Aging Cell 16:1125–1135.
61. Sugawara S, Honma T, Ito J, et al (2013) Fish oil changes the lifespan of Caenorhabditis elegans via
lipid peroxidation. J Clin Biochem Nutr 52:139–145.
62. Khan SA, Haider A, Mahmood W, et al (2017) Gamma-linolenic acid ameliorated glycation-induced
memory impairment in rats. Pharm Biol 55:1817–1823.
63. Knauf VC, Shewmaker C, Flider F, et al (2011) Safflower with Elevated Gamma-Linolenic Acid. US
Patent 2011/0129428A1, Jun. 2, 2011.
64. Rezapour-Firouzi S (2017) Chapter 24 - Herbal Oil Supplement With Hot-Nature Diet for Multiple
Sclerosis. In: Watson RR, Killgore WDSBT-N and L in NAD (eds). Academic Press, pp 229–245
65. De Giorgio R, Ruggeri E, Stanghellini V, et al (2015) Chronic constipation in the elderly: a primer for
the gastroenterologist. BMC Gastroenterol 15:130.
66. Honda Y, Tanaka M, Honda S (2010) Trehalose extends longevity in the nematode Caenorhabditis
elegans. Aging Cell 9:558–569.
67. Xing S, Zhang L, Lin H, et al (2019) Lactose induced redox-dependent senescence and activated Nrf2
pathway. Int J Clin Exp Pathol 12:2034–2045
68. Yahia EM, Carrillo-López A, Bello-Perez LA (2019) Carbohydrates. In: Yahia EM (ed) Postharvest
Physiology and Biochemistry of Fruits and Vegetables. Woodhead Publishing, pp 175–205
69. Edwards C, Caneld J, Copes N, et al (2015) Mechanisms of amino acid-mediated lifespan extension
in Caenorhabditis elegans. BMC Genet 16:8.
70. Zheng J, Gao C, Wang M, et al (2017) Lower Doses of Fructose Extend Lifespan in Caenorhabditis
elegans. J Diet Suppl 14:264–277.
71. Preuss HG, el Zein M, Areas JL, et al (1991) Effects of excess sucrose ingestion on the life span of
hypertensive rats (SHR). Geriatr Nephrol Urol 1:13–20.
72. Rovenko BM, Kubrak OI, Gospodaryov D V, et al (2015) High sucrose consumption promotes obesity
whereas its low consumption induces oxidative stress in Drosophila melanogaster. J Insect Physiol
73. Yang N, Liu H, Jiang Y, et al (2015) Lactulose enhances neuroplasticity to improve cognitive function
in early hepatic encephalopathy. Neural Regen Res 10:1457–1462.
74. Hekmatpou D, Mehrabi F, Rahzani K, Aminiyan A (2019) The Effect of Aloe Vera Clinical Trials on
Prevention and Healing of Skin Wound: A Systematic Review. Iran J Med Sci 44:1–9
75. Baruah A, Bordoloi M, Deka Baruah HP (2016) Aloe vera: A multipurpose industrial crop. Ind Crops
Prod 94:951–963.
76. Chandrashekara KT, Shakarad MN (2011) Aloe vera or Resveratrol Supplementation in Larval Diet
Delays Adult Aging in the Fruit Fly, Drosophila melanogaster. Journals Gerontol Ser A 66A:965–971.
77. Nićiforović A, Adžić M, Zarić B, Radojčić MB (2007) Adjuvant antiproliferative and cytotoxic effect of
aloin in irradiated HeLaS3 cells. Russ J Phys Chem A 81:1463–1466.
78. Park M-Y, Kwon H-J, Sung M-K (2011) Dietary aloin, aloesin, or aloe-gel exerts anti-inflammatory
activity in a rat colitis model. Life Sci 88:486–492.
79. Kumar S, Matharasi DP, Gopi S, et al (2010) Synthesis of cytotoxic and antioxidant Schiff’s base
analogs of aloin. J Asian Nat Prod Res 12:360–370.
80. Liu F-W, Liu F-C, Wang Y-R, et al (2015) Aloin Protects Skin Fibroblasts from Heat Stress-Induced
Oxidative Stress Damage by Regulating the Oxidative Defense System. PLoS One 10:e0143528
81. Munsiff SS, Kambili C, Ahuja SD (2006) Rifapentine for the Treatment of Pulmonary Tuberculosis.
Clin Infect Dis 43:1468–1475.
82. Golegaonkar S, Tabrez SS, Pandit A, et al (2015) Rifampicin reduces advanced glycation end
products and activates DAF-16 to increase lifespan in Caenorhabditis elegans. Aging Cell 14:463–
83. Martin YC, Kofron JL, Traphagen LM (2002) Do structurally similar molecules have similar biological
activity? J Med Chem 45:4350–4358.
84. Tacutu R, Craig T, Budovsky A, et al (2013) Human Ageing Genomic Resources: integrated databases
and tools for the biology and genetics of ageing. Nucleic Acids Res 41:D1027–D1033.
85. PubChemPy. Accessed April 2020.
86. Atkinson F L (2014) Standardiser.atkinson/standardiser. Accessed on April
87. Kotsampasakou E, Ecker GF (2017) Predicting Drug-Induced Cholestasis with the Help of Hepatic
Transporters-An in Silico Modeling Approach. J Chem Inf Model 57:608–615.
88. Pedregosa F, Varoquaux G, Gramfort A, et al (2011) Scikit-learn. J Mach Learn Res 12:2825–2830.
89. Fehér NK (2018) Exploring Predicted Drug Metabolism in in silico Toxicity Prediction. Dissertation,
University of Cambridge.
90. Kim S, Chen J, Cheng T, et al (2018) PubChem 2019 update: improved access to chemical data.
Nucleic Acids Res 47:D1102–D1109.
91. The UniProt Consortium (2018) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–