
Predicting Drug Interaction With Adenosine Receptors Using Machine Learning and SMOTE Techniques

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2946314, IEEE Access
ABDELRAHMAN SAAD1, YASSER M.K. OMAR2, and FAHIMA A. MAGHRABY3
1Arab Academy for Science, Technology and Maritime Transport (AASTMT), Cairo, Egypt
2Arab Academy for Science, Technology and Maritime Transport (AASTMT), Cairo, Egypt
3Arab Academy for Science, Technology and Maritime Transport (AASTMT), Cairo, Egypt
Corresponding author: Abdelrahman I. Saad (e-mail: abdelrahman.saad@aast.edu).
ABSTRACT Cancer is one of the leading causes of death worldwide. Adenosine, a molecule found in all human cells, couples with G proteins to form adenosine receptors, which are an important target for cancer therapy. Adenosine stops the growth of malignant tumor cells such as lymphoma, melanoma and prostate carcinoma. It is activated by interacting with drugs, which stops tumor cells from spreading and helps cure the disease. This research aims to predict drugs and potential drug candidates that interact with adenosine receptors. We built a machine learning model using three classification techniques, Random Forest (RF), Decision Tree (DT) and Support Vector Machine (SVM), and chose the best technique after comparing the results. Unlike previous studies, we integrated drug side effects with the drug fingerprint as features to train our model to classify drugs as interacting or non-interacting with adenosine receptors. We ranked the drugs that interact with adenosine receptors by their side effects to find the preferred drug (the one with the fewest side effects) among several candidates, which helps in drug design. Most existing datasets contain drugs, targets and the interactions between them but neglect drug side effects, so we formed a new dataset that includes them. The new dataset is composed of 400 drugs, 794 targets and 3990 drug side effects. Since the dataset was imbalanced, we applied the Synthetic Minority Oversampling Technique (SMOTE). In our experiments, RF achieved the best classification performance with an accuracy of 75.09%.
INDEX TERMS Adenosine, Classifier, Drug, DTI, Drug Fingerprint, Receptor, Side Effect, Target
I. INTRODUCTION
CANCER occurs in the form of malignant tumors that spread abnormal cells throughout the body. There are several types of cancer affecting body organs, such as leukemia, which forms in the blood-forming tissue of the bone marrow; myeloma and lymphoma, which attack the immune system and weaken it; and carcinoma, which affects the skin or the tissues of body organs such as the prostate. There are also other types of cancer, such as sarcoma (which affects connective tissues, e.g., cartilage) and brain and spinal cord cancers. In this study, we focus on blood, skin and immune system cancer types [1], [2].
According to the National Cancer Institute, an estimated 1,735,350 new cases will be diagnosed in the USA and 609,640 people will die as a result of the disease [3]. There are several ways to treat cancer, such as radiotherapy and chemotherapy. Radiotherapy uses high-energy radiation to control or kill malignant cells, while chemotherapy (hormonal therapy) uses chemical drugs to treat the damaged cells, which is our concern in this study. Studies have shown high levels of adenosine in cancer tissues, and chemical drugs are a good option for acting on these molecules [4]. Adenosine has a strong effect on the growth of tumor cells, which makes it relevant to the medical field of drug discovery (drug repositioning) [5].
Drugs cure diseases, such as cancer in our study, by interacting with a target (adenosine). Drugs are designed and tested before they are used; this process is called drug discovery. Discovering drugs and bringing them into use requires considerable time and cost [6]. Machine learning facilitates predicting drug-target interactions and enhances the drug discovery process, in addition to finding new applications for existing drugs [7].
Algorithms and machine learning models help in predicting drug-target interactions at lower cost and in less time than molecular docking, which simulates targets in 3D form but cannot simulate all targets, because it requires special input features that do not exist for every target [8]. Computational models used to predict drug-target interactions are classified into supervised and semi-supervised machine learning. In supervised machine learning, the input and output data are known for classification. In drug-target interaction (DTI) prediction, known interacting drug-target pairs are labeled positive, while non-interacting pairs are labeled negative. Classification models use these labels in training. In semi-supervised machine learning, only some of the data is labeled while the majority is unlabeled. The unlabeled data can reduce the accuracy of the classifier, which leads to poor results.
The effects of adenosine appear when it interacts with G proteins; as a result, the adenosine receptors A3, A1 and A2a are formed [9]. Gi and Gq proteins interact with adenosine to form A3, pertussis toxin-sensitive G proteins (Gi0, Gi1, Gi2 and G3) form A1, and A2a results from interaction with Gs and Golf proteins [10]. A3 receptors are found in tumor cells such as HL60 and K562 leukemia, A1 receptors are found in human melanoma A375 cell lines, and A2a receptors are found in various cells such as Jurkat lymphoma. Each of these receptor types has a significant role in treating cancer. The receptors are activated by drugs, which in turn fight cancer.
Drug side effects have a great influence on the drug design process. According to DrugBank, there are 10562 drugs in total [11], but only 3254 are approved, that is, eligible for use by patients because their side effects are acceptable. In 2016, Edgar D. Coelho et al. [12] proposed that integrating drug side effects with other features would enhance drug-target interaction prediction.
Previous studies focused on matching drugs and targets in terms of interaction and neglected the application of these matches to the medical field. They also generated drug descriptors from the compound's chemical structure and neglected an important feature, drug side effects. In this study, we target a specific medical application (adenosine receptors in cancer) and include drug side effects as a feature. In addition, we ranked the predicted drugs by their number of side effects, which will help pharmacists and doctors in drug design.
Based on the literature survey, most existing drug-target datasets are imbalanced, as the non-interacting drugs outnumber the interacting ones, so we applied the SMOTE technique to balance our dataset.
The rest of the paper is organized as follows. Section II reviews previous studies of DTI and drug side effects. Section III discusses the dataset, the drug features, the machine learning classification models used and the evaluation of model performance. Section IV illustrates the proposed framework. Section V states the experiments and results. Section VI discusses the experimental results. Finally, Section VII concludes the paper and suggests future approaches.
II. LITERATURE SURVEY
In 2008, Monica Campillos et al. [13] predicted targets (proteins) using side-effect similarity. They showed that drugs and targets are related through drug side effects, as two unrelated drugs may have similar side effects because they interact with the same target. This relation helped in predicting new targets for old drugs. Their dataset was collected from the Matador [14], DrugBank [15] and Ki DB [16] public databases and contained 746 drugs, 4857 drug-target relations and a side-effect network formed of drug-drug relations. They developed a side-effect similarity measure using weighting schemes and classified drug side effects using the Unified Medical Language System (UMLS). By constructing an ontology network, they concluded that there was an inversely proportional relationship between the recurrence of a drug side effect and two drugs sharing the same target (protein). Finally, they predicted 2903 drug-target interacting pairs with a probability of 25%.
In 2016, Edgar D. Coelho et al. [12] used two machine learning classification techniques, SVM and RF, to predict DTI. The first model predicted drugs with reference to the target's type, while the second model predicted drugs without referring to it. Their dataset was collected from DrugBank [15] and the study of Yamanishi et al. [17] and consisted of 927 drugs, 1370 targets and 5127 drug interactions. The SVM model with reference to the target's type (protein) performed well in terms of AUC (Area Under the Curve). As future work, they suggested enhancing DTI prediction by using network centrality metrics and by expanding the proteomic space covered.
In 2016, Diego Galeano et al. [18] presented chemical similarity prediction, built on the theory that drugs with similar chemical structures help in predicting targets close to them. Their dataset was collected from the Biogrid [19] and DrugBank [15] databases and contained 9336 drugs and 4612 targets. They built two networks. In the first network, each node represented a drug, and the similarity between drugs was calculated using the Tanimoto coefficient. The second network, called the interactome, represented the interactions between proteins in order to detect the relationships between them. Finally, they measured the similarity between the two networks to predict similar targets. The similarity between the two networks reached 85% in terms of AUC, and they suggested that integrating side-effect similarity would improve it.
In 2017, Arvind Sinha et al. [20] used Decision Tree, Support Vector Machine, Random Forest and Naïve Bayes as classification techniques to inspect Leishmania donovani membrane proteins. The aim of their study was to predict whether a protein could be used as a drug target or a vaccine. They trained the four classification techniques on 28 proteins [21], evaluated each technique and used the best one. Finally, they used another 37 proteins and decided on the role of each protein (drug target or vaccine). The best result was obtained by Naïve Bayes with an accuracy of 76.17%.
In 2017, Ming Hao et al. [22] used Dual-Network Integrated Logistic Matrix Factorization (DNILMF) to predict DTI. They proposed that similar drugs and targets could help in predicting nearby drugs and targets. They formed a new dataset that contained 829 drugs, 733 targets and 3688 interactions. They used kernel construction techniques to build drug and target profiles, calculated the profile matrices using kernel techniques and then diffused similar classes. They predicted DTI using DNILMF, which performed better than Neighborhood Regularized Logistic Matrix Factorization (NRLMF), and they noted that a genetic algorithm could enhance their proposed model.
In 2017, Ming Wen et al. [23] predicted new drug-target interactions without considering the type of target by using a deep learning methodology. Their dataset, extracted from the DrugBank [15] database, consisted of 1412 drugs, 520 targets and 2146240 drug-target interaction pairs. They used Extended Connectivity Fingerprints (ECFPs) to generate drug descriptors and Protein Sequence Compositions (PSCs) to generate target descriptors, and implemented a neural network called a Deep Belief Network (DBN). They tested their model on an external dataset from DrugBank [24] containing 4383 drugs, 2528 targets and 7352 drug-target interaction pairs. They compared their model with the Random Forest (RF), Decision Tree (DT) and Bernoulli Naïve Bayes (BNB) classifiers. The accuracies of DBN, BNB, DT and RF were 85%, 72%, 76% and 83%, respectively.
In 2018, Hafez Manoochehri et al. [25] used Deep Matrix Factorization (DMF) to predict drug-target interactions. They discussed two approaches. The first approach was to build a predictive model by identifying non-interacting (negative) drug-target pairs in the unlabeled data and then using both positive and negative pairs to build the model. The second approach was to predict using Ranking on Top methods, which rank the positive interacting pairs higher than the non-interacting negative ones. Their model had two steps: they used the K-Nearest Neighbor (KNN) classification technique to extract negative samples from the data, and then they used DMF, a deep learning approach, to generate latent vectors. They used the gold-standard benchmark dataset constructed by Yamanishi et al. [26], which has four target classes: Ion Channels (IC), Enzymes, G-Protein-Coupled Receptors (GPCR) and Nuclear Receptors (NR). These contain 204, 445, 95 and 54 drugs and 210, 664, 223 and 26 targets in IC, Enzymes, GPCR and NR, respectively. They evaluated their model using the Area Under the Precision-Recall curve (AUPR), the Area Under the Curve (AUC) and 10-fold cross-validation. Finally, they compared their model with Neural Matrix Factorization (NeuMF) and DMF with random sampling. Their proposed model (DMF+KNN) gave the highest results, with an average accuracy of 73.65% using the Hit Ratio metric.
In 2019, Abdelrahman Saad et al. [27] used the KNN, RF and DT machine learning classification techniques to predict DTI. They formed their dataset from the DrugCentral and SIDER version 4.1 public databases. They built two matrices, a drug-side effect matrix and a drug-target matrix, which were used in training and testing. They ran three experiments to study the effect of the drug features. The first experiment used drug fingerprints to classify drugs and identify their interactions with the corresponding targets; KNN achieved an accuracy of 95.6%. The second experiment used drug side effects to classify drugs and identify their relation to targets; KNN achieved an accuracy of 91.28%. Finally, the third experiment classified drugs using both drug side effects and drug fingerprints; KNN achieved an accuracy of 97.63%. They concluded that using drug fingerprints together with drug side effects enhanced the accuracy of the classifiers. TABLE 1 summarizes the previous related work.
In this study, we extend that work to find a medical application for it, as the previous study was a general case study. We therefore worked on a specific type of target (adenosine receptors) and studied the effect of drugs on these targets in terms of interaction. We used drug side effects to rank drugs, to help in the drug design process and in choosing the best drug alternatives for the patient. We also used techniques that are new to our experiments, namely SMOTE to balance the dataset and SVM for classification.
III. MATERIALS AND METHODS
A. DATASETS
We used the dataset from the study conducted by Abdelrahman Saad et al. [27]. It combines two datasets. The first is a drug dataset extracted from DrugCentral that contains 2736 drugs, 1938 targets and 14521 interactions [28]. The second is a drug side effect dataset extracted from SIDER version 4.1 [29] that contains 1430 drugs, 5868 side effects and 139756 drug-side effect pairs.
The new dataset was compiled by joining and merging the two datasets on the compound ID. The resulting dataset contains the 400 drugs that the two datasets have in common, each with multiple targets and drug side effects, as shown in TABLE 2 [27]. After that, we focused on our area of interest: the adenosine receptors, the drugs that interact with them and the corresponding side effects.
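The join on compound ID described above can be illustrated with a minimal sketch. The record layout, the compound IDs and the function name `join_on_compound_id` are assumptions made for illustration only; they are not the data or the tooling used in the study.

```python
# Hypothetical toy records keyed by compound ID (made-up values).
drug_targets = {
    "C001": {"Adenosine A1", "Adenosine A3"},
    "C002": {"Adenosine A2a"},
    "C003": {"Some other target"},
}
drug_side_effects = {
    "C001": {"dizziness", "skin rash"},
    "C003": {"nausea"},
    "C004": {"headache"},
}


def join_on_compound_id(targets_by_id, effects_by_id):
    """Inner join: keep only compounds that appear in both datasets."""
    common_ids = sorted(targets_by_id.keys() & effects_by_id.keys())
    return {
        cid: {"targets": targets_by_id[cid], "side_effects": effects_by_id[cid]}
        for cid in common_ids
    }


merged = join_on_compound_id(drug_targets, drug_side_effects)
print(list(merged))  # compound IDs present in both datasets
```

Only compounds with both target and side-effect records survive the join, which is why the merged dataset is much smaller than either source.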
We used SMOTE to generate new instances for the minority classes (24 interacting drugs for A3 and 11 each for A1 and A2a). SMOTE takes instances of the target class and their nearby instances in feature space and creates new samples by combining the features of the target instances with the features of their neighbors, as shown in TABLE 3.

TABLE 1. Summary of related work

| Paper authors | Objectives | Approach | Dataset | Accuracy |
|---|---|---|---|---|
| Monica Campillos et al. [13], 2008 | Predicted targets (proteins) using side effect similarity | 1. Similarity networks | 1. Matador; 2. DrugBank; 3. Ki DB | Predicted drug-target interacting pairs with a probability of 25% |
| Edgar D. Coelho et al. [12], 2016 | Predicted drug-target interactions without considering the type of target | 1. Random Forests; 2. Support Vector Machine; 3. Logistic regression | 1. Yamanishi's dataset; 2. DrugBank | 1. Average SVM: 92.75%; 2. Average RF: 92.25% |
| Diego Galeano et al. [18], 2016 | Predicted targets based on drug chemical structure | 1. Tanimoto similarity | 1. Biogrid; 2. DrugBank | The AUC (Area Under the Curve) reached 85% in similarity |
| Arvind Sinha et al. [20], 2017 | Predicted the usability of the protein as a drug target or a vaccine | 1. SVM; 2. DT; 3. RF; 4. Naïve Bayes | Research by Kumar et al., 2015 titled: "Proteomic analyses of membrane enriched proteins.." | 1. SVM: 63%; 2. RF: 73%; 3. DT: 56.33%; 4. Naïve Bayes: 76.17% |
| Ming Hao et al. [22], 2017 | Predicted interactions between drug and target using DNILMF | 1. DNILMF; 2. NRLMF | Compiled a drug-target interaction dataset using compound ID | 1. Average DNILMF: 97.57%; 2. Average NRLMF: 96.9% |
| Ming Wen et al. [23], 2017 | Predicted new DTI without considering the type of targets | 1. Deep learning | 1. Yamanishi's dataset; 2. DrugBank | 1. DBN: 85%; 2. BNB: 72%; 3. DT: 76%; 4. RF: 83% |
| Hafez Manoochehri et al. [25], 2018 | Predicted drug-target interactions using DMF | 1. KNN; 2. DMF | 1. Yamanishi's dataset | 1. Average DMF+KNN: 73.65%; 2. Average DMF with random sampling: 71.95%; 3. NeuMF: 72.47% |
| Abdelrahman Saad et al. [27], 2019 | Predicted drug-target interactions using machine learning classification techniques | 1. RF; 2. DT; 3. KNN | 1. DrugCentral; 2. SIDER 4.1 | 1. Using drug fingerprint: 93.57% (DT), 93.84% (RF) and 95.16% (KNN); 2. Using drug side effect: 89.97% (DT), 90.23% (RF) and 91.28% (KNN); 3. Using drug fingerprint and drug side effect: 96.89% (DT), 96.97% (RF) and 97.63% (KNN) |
| Our research | Predicted drugs interacting with adenosine receptors using machine learning and ranked the predicted drugs based on drug side effects | 1. SVM; 2. DT; 3. RF; 4. SMOTE | 1. DrugCentral; 2. SIDER 4.1 | 1. Adenosine A3: 70.53% (SVM), 70.26% (DT) and 73.68% (RF); 2. Adenosine A1: 61.90% (SVM), 66.48% (DT) and 66.30% (RF); 3. Adenosine A2a: 69.78% (SVM), 74.36% (DT) and 75.09% (RF) |
TABLE 2. Summary of datasets

| Dataset | Drugs | Targets | Side effects |
|---|---|---|---|
| Dataset 1 | 2376 | 1938 | - |
| Dataset 2 | 1430 | - | 5868 |
| New dataset | 400 | 794 | 3990 |
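TABLE 3. Number of drugs before and after SMOTE

| No. of drugs | Before SMOTE: A3 | Before SMOTE: A1 | Before SMOTE: A2a | After SMOTE: A3 | After SMOTE: A1 | After SMOTE: A2a |
|---|---|---|---|---|---|---|
| Interacting | 24 | 11 | 11 | 205 | 278 | 278 |
| Non-interacting | 377 | 390 | 390 | 377 | 390 | 390 |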
B. DRUG FINGERPRINT
Drugs are represented by a feature vector called the drug fingerprint. The drug fingerprint is obtained by simulating the molecules that form the drug. The simulation is based on molecular information such as the atom numbers and the bonds between the atoms. This information is then used to generate encoded fingerprints (binary bits) that later serve as strong features. Drug fingerprints are used in classification and drug similarity techniques [30], which help in predicting new potential drugs for existing targets and vice versa, as shown in Figure 1. In part (A), each drug interacts with one corresponding target, as in the pairs [(D1,T1),(D2,T2),(D3,T3),(D4,T4),(D5,T5)]; i.e., drug 1 interacts only with target 1. This is valuable input for drug-target interaction prediction, but it is not sufficient for predicting and discovering hidden interactions between drugs and targets. In part (B), a drug is no longer limited to one target: with the help of drug fingerprint similarity, each drug can interact with multiple targets, as in the pairs [(D1,(T1,T2)),(D2,(T1,T2,T3)),(D3,(T3,T4,T5)),(D4,(T2,T5)),(D5,(T4,T5))]; i.e., drug 1 interacts with both target 1 and target 2. For instance, two drugs sharing the same drug fingerprint could interact with the same targets, which helps in finding new applications for existing drugs.
FIGURE 1. Prediction scenario.
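As an illustration of fingerprint similarity, the sketch below computes the Tanimoto coefficient (the measure named in the literature survey) between binary fingerprints. Using Tanimoto here is an assumption, since this section does not state which similarity measure is applied; the bit vectors are made up and far shorter than real fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints:
    (bits set in both) / (bits set in at least one)."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention chosen for this sketch: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


# Hypothetical 8-bit fingerprints for three drugs.
d1 = [1, 0, 1, 1, 0, 0, 1, 0]
d2 = [1, 0, 1, 0, 0, 0, 1, 0]
d3 = [0, 1, 0, 0, 1, 1, 0, 0]

print(tanimoto(d1, d2))  # 3 shared bits out of 4 set in either: 0.75
print(tanimoto(d1, d3))  # no shared bits: 0.0
```

Under the scenario of part (B), a high score between D1 and D2 would suggest that targets known for one drug are candidates for the other.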
C. DRUG SIDE EFFECT
Drugs have side effects that cause unpleasant symptoms in patients, e.g., skin rash and dizziness. Side effects strongly impact drug discovery, as they limit the use of a drug and decrease its value. Drug side effects vary from one person to another depending on the reaction between the chemical substances in the drug and the targeted cells in the human body. It has been reported that the severity of side effects is the second leading cause of drug manufacturing failure and the fourth leading cause of death in the USA [31], [32].
D. SUPPORT VECTOR MACHINE (SVM)
Support Vector Machine (SVM) is one of the most widely used machine learning classification methods. It can be summarized as follows: inputs, represented by input vectors, are non-linearly mapped to a high-dimensional feature space. A decision surface and a quadratic formula are constructed to separate those input features while ensuring high generalization ability of the learning machine. It is considered a robust and powerful method in data analysis and pattern recognition [33]. SVM was proposed by Vapnik and Chervonenkis in the 1990s. There are two types of patterns, linear and nonlinear. The basic idea of SVM is to construct a decision plane (hyperplane) that separates sets of objects belonging to different classes [34]. Given a data set $(x_i, y_i)$ for $i = 1, \dots, N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, a classifier $f(x)$ is trained as in equation (1):

$$f(x_i) \begin{cases} \geq 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases} \tag{1}$$

In binary classification, a sample is correctly classified when $y_i f(x_i) > 0$. A linear classifier has the form of equation (2):

$$f(x) = w^{\top} x + b \tag{2}$$

where $w$ is the weight vector and $b$ is the bias (the SVM parameters). For better classification performance, the margin is maximized using equation (3):

$$f(x) = \sum_{i} \alpha_i y_i x_i^{\top} x + b \tag{3}$$

where the $x_i$ are the support vectors, defined as the points whose $\alpha_i$ (the weight of the point) is not zero.
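Equations (1) to (3) can be checked numerically with a small sketch. It covers only the linear case, not the sigmoid kernel used later in the classification phase, and the support vectors, weights and function names are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_decision(w, b, x):
    """Equation (2): f(x) = w^T x + b."""
    return dot(w, x) + b


def dual_decision(alphas, labels, support_vectors, b, x):
    """Equation (3): f(x) = sum_i alpha_i * y_i * x_i^T x + b."""
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b


def predict(score):
    """Equation (1): class +1 when f(x) >= 0, otherwise -1."""
    return 1 if score >= 0 else -1


# Hypothetical toy values (not taken from the paper's data).
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

# The weight vector implied by the support vectors: w = sum_i alpha_i * y_i * x_i.
w = [sum(a * y * sv[j] for a, y, sv in zip(alphas, labels, support_vectors))
     for j in range(2)]

x_new = [2.0, 0.5]
print(primal_decision(w, b, x_new))                               # 1.5
print(dual_decision(alphas, labels, support_vectors, b, x_new))   # 1.5
print(predict(primal_decision(w, b, x_new)))                      # 1
```

Both forms give the same score because equation (3) is equation (2) with $w$ written as a weighted sum of the support vectors.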
E. DECISION TREE (DT)
Decision Tree (DT) learning is one of the most widely used methods for inductive inference. It is a classification method that approximates a discrete-valued target function. Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned [35]. A DT is built by initially selecting a subset of instances from a training set. The algorithm uses this subset to construct a DT. The remaining training instances test the accuracy of the constructed tree. If the DT classifies the instances correctly, the procedure terminates. If an instance is incorrectly classified, it is added to the selected subset of training instances and a new tree is constructed. This process continues until a tree that correctly classifies all non-selected instances is created or the DT is built from the entire training set. A statistical property called Information Gain is used to choose attributes. Information Gain measures how well a given attribute separates the training examples into the targeted classes. The attribute with the highest Information Gain (the most useful for classification) is selected. To define Information Gain, we first define a concept from information theory called entropy. Entropy measures the amount of information in an attribute using equation (4):

$$\mathrm{Entropy}(S) = -\sum_{c \in C} p_c \log_2 p_c \tag{4}$$

where $S$ is a collection with outcomes $c \in C$ and $p_c$ is the probability that an instance of $S$ belongs to outcome $c$. Information Gain measures how well an attribute sorts the data, as in equation (5):

$$\mathrm{InformationGain}(S, F) = \mathrm{Entropy}(S) - \sum_{f \in F} \frac{|S_f|}{|S|} \mathrm{Entropy}(S_f) \tag{5}$$

where $S_f$ is the subset of $S$ in which feature $F$ has value $f$, and $|S_f|$ and $|S|$ are the numbers of elements in $S_f$ and $S$.
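A short sketch of equations (4) and (5) on toy data may make the two quantities concrete. The labels and feature columns are invented for illustration (1 = interacting, 0 = non-interacting), and the function names are not from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Equation (4): -sum over classes of p_c * log2(p_c)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Equation (5): entropy of S minus the size-weighted entropy of each S_f."""
    total = len(labels)
    subsets = {}
    for f, y in zip(feature_values, labels):
        subsets.setdefault(f, []).append(y)  # S_f: labels where the feature equals f
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four drugs, two candidate binary features.
labels = [1, 1, 0, 0]
informative_bit = [1, 1, 0, 0]    # separates the classes perfectly
uninformative_bit = [1, 0, 1, 0]  # tells us nothing about the class

print(entropy(labels))                              # 1.0 bit for a 50/50 split
print(information_gain(informative_bit, labels))    # 1.0
print(information_gain(uninformative_bit, labels))  # 0.0
```

A tree builder would choose `informative_bit` for the split because it has the higher Information Gain.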
F. RANDOM FOREST (RF)
Random Forest (RF) was proposed by Breiman [36]. It is a collection of bagged decision trees based on the idea of ensemble learning, in which several machine learning algorithms are combined to form a larger, more general one [37]. Several trees are built using bootstrap aggregating, with each tree built from a random subset of the data [38]. The trees are built using a splitting criterion such as Gini [39]. Each tree classifies the given features and votes for a class, and the forest selects the class with the most votes across all trees.
The RF algorithm can be implemented as follows:
Step 1: Select attributes Y from total attributes X where Y<X.
Step 2: Calculate node N from random attributes Y by building
a split.
Step 3: Calculate the next node O using the best split.
Step 4: Repeat the previous steps until only one single node is
reached.
Step 5: Build N trees by repeating step 1 to step 4.
Step 6: Prediction data P is obtained from the N trained trees
using classification voting.
Step 7: Build the final model based on the highest voted pre-
dicted attributes.
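The steps above can be sketched in a few lines. To stay short, this sketch makes two simplifying assumptions that are not in the paper: features are binary, and every tree is a one-split stump chosen by Gini impurity rather than a fully grown tree. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def majority(labels, default=0):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(rows, labels, n_candidate_features, rng):
    """Steps 1-3: pick a random subset of features, keep the best Gini split."""
    candidates = rng.sample(range(len(rows[0])), n_candidate_features)
    fallback = majority(labels)
    best = None
    for j in candidates:
        left = [y for x, y in zip(rows, labels) if x[j] == 0]
        right = [y for x, y in zip(rows, labels) if x[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, j, majority(left, fallback), majority(right, fallback))
    _, feature, label_if_0, label_if_1 = best
    return feature, label_if_0, label_if_1


def fit_forest(rows, labels, n_trees=25, n_candidate_features=2, seed=0):
    """Step 5: build each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                n_candidate_features, rng))
    return forest


def predict(forest, x):
    """Steps 6-7: every tree votes and the most voted class wins."""
    votes = [l1 if x[f] == 1 else l0 for f, l0, l1 in forest]
    return majority(votes)


# Hypothetical toy data: feature 0 determines the class, feature 1 is noise.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]
forest = fit_forest(rows, labels, n_trees=25, n_candidate_features=2, seed=0)
print(predict(forest, [1, 0]))
```

In practice a library implementation with fully grown trees would be used; the sketch only shows where bootstrapping, random feature selection and voting enter the procedure.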
G. SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE (SMOTE)
During our first experiment in this study, we found that the accuracy and the specificity of the classifiers were high while the sensitivity (true positive rate) was very low. The reason was that the dataset was imbalanced: the number of drugs interacting with adenosine receptors is small compared with the number of non-interacting drugs. This problem affected our classification performance. We therefore used an oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE), proposed by Chawla et al. [40] and used in many fields such as bioinformatics [41]. SMOTE made the numbers of interacting and non-interacting drugs nearly equal, which balanced our dataset. To balance the dataset, SMOTE uses the following equation:

$$D_{syn} = D_i + (D_{Knn} - D_i) \times r \tag{6}$$

where $D_{syn}$ is the synthetic sample, $D_i$ is a minority sample, $D_{Knn}$ is one of the k-nearest neighbors of $D_i$ among the minority samples and $r$ is a random number between 0 and 1.
The SMOTE algorithm can be implemented as follows:
Step 1: Determine both $D_i$ (feature vector) and $D_{Knn}$ (k-nearest neighbor from minority samples).
Step 2: Output the difference between the feature vector and the k-nearest neighbor from minority samples.
Step 3: Multiply the output by $r$ (a random number between 0 and 1).
Step 4: Add the output to the feature vector $D_i$ to select a new point on the line segment between the feature vectors.
Step 5: Repeat steps 1 to 4 to identify new feature vectors.
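A minimal sketch of these steps, following equation (6), is given below. It is an illustrative reimplementation with the standard library, not the implementation used in the study; the function names, the value of `k` and the toy minority samples are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic samples on segments between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Step 1: pick a minority sample D_i and one of its k nearest neighbours.
        i = rng.randrange(len(minority))
        d_i = minority[i]
        others = [s for j, s in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda s: squared_distance(d_i, s))[:k]
        d_knn = rng.choice(neighbours)
        # Steps 2-4: D_syn = D_i + (D_Knn - D_i) * r, with r in [0, 1).
        r = rng.random()
        synthetic.append([a + (b - a) * r for a, b in zip(d_i, d_knn)])
    return synthetic


# Hypothetical minority-class feature vectors (two features each).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority, n_synthetic=5, k=2, seed=42)
print(len(new_samples))  # 5
```

Each synthetic sample lies on the line segment between two existing minority samples, so it stays inside the region the minority class already occupies.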
H. PERFORMANCE MEASURE
The proposed framework was assessed using Accuracy, Sen-
sitivity, Specificity, Postive Predicted Value (PPV) and Neg-
ative Predicted Value (NPV) as shown below:
Accuracy =T P +T N
T P +T N +F P +F N (7)
Sensitivity =T P
T P +F N (8)
Specif icity =T N
T N +F P (9)
PPV =T P
T P +F P (10)
N P V =T N
T N +F N (11)
In this study, TP (true positive) is an interacting drug-adenosine
receptor pair correctly predicted as interacting, TN (true negative)
is a non-interacting pair correctly predicted as non-interacting,
FN (false negative) is an interacting pair incorrectly predicted as
non-interacting, and FP (false positive) is a non-interacting pair
incorrectly predicted as interacting. Here, positive means there is
an interaction between the drug and the receptor, while negative
means there is no interaction between them.
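As a worked example of Equations (7) to (11), the sketch below computes the five measures from confusion-matrix counts. The function name `classification_metrics` and the counts are assumptions for illustration: the paper does not report raw counts, and the values used here were chosen only because they yield the same percentages as the SVM column of TABLE 10.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the five measures of Equations (7)-(11) as percentages."""
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
    }


# Hypothetical counts (not taken from the paper): 190 positives, 190 negatives.
metrics = classification_metrics(tp=146, tn=122, fp=68, fn=44)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```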
IV. PROPOSED MODEL
The aim of our framework is to predict drug-target interactions
by applying machine-learning techniques and to select the best
classifier among them. We generated a fingerprint for each drug
from the drug features stored in its Structure Data File (SDF)
using the ChemmineR library in R. We addressed the class imbalance
in the dataset using the SMOTE technique, as shown in Figure 2.
Afterward, we trained the machine learning classifiers on the drug
fingerprint similarity and the known drug-target interactions so
that the model can predict new interactions. Finally, we ranked the
newly predicted drugs by the drug side effect feature to eliminate
drugs with dangerous side effects.
A. PREPROCESSING PHASE
In the preprocessing phase, we combined both datasets and
generated a new dataset, as mentioned earlier in TABLE 2.
We then assigned each drug, target and drug side effect a label
(represented by a random number) and encoded each label.
Lastly, we built two matrices. The first matrix represents
drug-target pairs, where a drug (D1) interacts with one or more
targets (Tk), and integrates the drug fingerprint, as shown in
Figure 3. The second matrix represents drug-side effect pairs,
where a drug (D1) has several side effects (Sm), as shown in
Figure 4.
We then extracted the part of interest, namely the drugs
interacting with adenosine receptors (A3, A1 and A2a) and their
corresponding side effects, to form a new matrix, and applied the
SMOTE technique to balance the dataset.
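One way to picture the pair matrices described above is as binary matrices built from a list of (row, column) pairs. The sketch below is only an illustration under stated assumptions: the function name `build_pair_matrix` is hypothetical, labels are encoded by sorted position rather than by the random numbers used in this study, the drug and target identifiers are placeholders, and the fingerprint columns are omitted.

```python
def build_pair_matrix(pairs):
    """Build a binary matrix with one row per drug and one column per partner.

    `pairs` is a list of (drug, partner) tuples, where the partner is a target
    (drug-target matrix) or a side effect (drug-side effect matrix).
    """
    rows = sorted({drug for drug, _ in pairs})
    cols = sorted({partner for _, partner in pairs})
    row_index = {name: i for i, name in enumerate(rows)}   # label encoding (assumed)
    col_index = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for drug, partner in pairs:
        matrix[row_index[drug]][col_index[partner]] = 1    # 1 marks a known pair
    return rows, cols, matrix


# Placeholder identifiers: D1 interacts with T1 and T2, D2 only with T2.
drugs, targets, drug_target = build_pair_matrix(
    [("D1", "T1"), ("D1", "T2"), ("D2", "T2")]
)
```

The same helper applies unchanged to drug-side effect pairs, since both matrices share the drug axis.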
B. CLASSIFICATION PHASE
In the classification phase, we trained our model using SVM, DT
and RF, with a sigmoid kernel for SVM and the Gini criterion for
the tree-based classifiers. The data for the adenosine receptors
A3, A1 and A2a served as input to the classifiers.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2946314, IEEE Access
C. VALIDATION AND TESTING PHASE
We split the data into 70% training and 30% testing sets. We then
applied 10-fold cross-validation (k = 10) to the training set: in
each round, we trained the models on 9 folds and tested them on the
remaining fold, and we averaged the 10 resulting accuracies to
obtain a more reliable estimate of model performance. The final
step was to compare the results of the different classifiers and
choose the best one.
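The validation scheme described above can be sketched as follows. This is a simplified illustration, not the code used in this study: the helper names, the fixed seed and the interleaved fold assignment are assumptions, and `fit` and `accuracy` stand for whatever training and scoring routines a given classifier provides.

```python
import random


def train_test_split(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(items, k=10):
    """Yield (train, validation) pairs; each item is validated exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def cross_val_accuracy(items, fit, accuracy, k=10):
    """Average the validation accuracy over the k folds."""
    scores = [accuracy(fit(train), val) for train, val in k_fold(items, k)]
    return sum(scores) / len(scores)


samples = list(range(100))                      # stand-in for labelled drug records
train_set, test_set = train_test_split(samples)
fold_sizes = [len(val) for _, val in k_fold(train_set)]
```

The held-out 30% is not touched during cross-validation, so it remains available for the final comparison of the classifiers.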
FIGURE 2. Proposed drug-adenosine receptors interaction framework.
V. EXPERIMENT RESULTS
Drug discovery goes through many phases before a drug can be
approved for patients to treat a disease. Prediction of these
drugs must be highly accurate, because predicting the wrong drugs
can harm the patient by causing side effects that could lead to
death. Machine-learning (classification) techniques were used to
predict whether there is an interaction between drugs and adenosine
receptors. Before carrying out the experiment on the whole data,
we held out 5% of the real data to check that SMOTE synthesized
the data correctly. Three different experiments were then conducted
on three different types of adenosine receptors (A3, A1 and A2a),
and the results of the three classifiers were compared after
using the SMOTE technique.
FIGURE 3. Drug-target matrix.
FIGURE 4. Drug-side effect matrix.
A. EXPERIMENT ON PART OF THE DATA
Before carrying out the main experiment (whole data), we extracted
5% of the data to validate the use of the SMOTE technique on the
whole data. With SMOTE applied, the highest accuracy was obtained
by RF, with an average accuracy of 70% and an average sensitivity
of 76%, while the lowest accuracy was obtained by SVM, with an
average accuracy of 60% and an average sensitivity of 59%, as shown
in TABLE 4, TABLE 5, TABLE 6, TABLE 7, TABLE 8 and TABLE 9.
TABLE 4. Adenosine A3 receptor before using SMOTE on part of the data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 50 | 33 | 66 |
| Sensitivity | 50 | 25 | 50 |
| Specificity | 50 | 50 | 75 |
| PPV | 66 | 50 | 50 |
| NPV | 33 | 25 | 75 |
TABLE 5. Adenosine A3 receptor using SMOTE on part of the data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 60 | 50 | 70 |
| Sensitivity | 50 | 50 | 80 |
| Specificity | 75 | 50 | 60 |
| PPV | 75 | 60 | 66 |
| NPV | 50 | 40 | 75 |
TABLE 6. Adenosine A1 receptor before using SMOTE on part of the data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 33 | 50 | 50 |
| Sensitivity | 50 | 100 | 100 |
| Specificity | 25 | 25 | 0 |
| PPV | 25 | 40 | 25 |
| NPV | 50 | 100 | 100 |
TABLE 7. Adenosine A1 receptor using SMOTE on part of the data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 70 | 70 | 70 |
| Sensitivity | 71 | 66 | 75 |
| Specificity | 66 | 75 | 66 |
| PPV | 83 | 80 | 60 |
| NPV | 50 | 60 | 80 |
TABLE 8. Adenosine A2a receptor before using SMOTE on part of the data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 33 | 50 | 66 |
| Sensitivity | 0 | 66 | 75 |
| Specificity | 33 | 33 | 50 |
| PPV | 0 | 50 | 75 |
| NPV | 100 | 50 | 50 |
TABLE 9. Adenosine A2a receptor using SMOTE on part of the data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 50 | 60 | 70 |
| Sensitivity | 57 | 66 | 75 |
| Specificity | 33 | 50 | 50 |
| PPV | 66 | 66 | 85 |
| NPV | 25 | 50 | 33 |
B. EXPERIMENT ON THE WHOLE DATA
In this section, we carried out the experiment on the whole data
and report the results after using SMOTE for the adenosine
receptors (A3, A1 and A2a).
1) ADENOSINE A3 RECEPTOR USING SMOTE ON THE WHOLE DATA
After applying the SMOTE technique on the whole data, the
dataset is balanced with 205 interacting drugs and 377
non-interacting drugs with adenosine receptors. SVM, DT and RF
achieved accuracies of 70.53%, 70.26% and 73.68%, sensitivities
of 76.84%, 71.58% and 76.84%, and specificities of 64.21%, 68.95%
and 70.53%, respectively. The PPV (Positive Predictive Value) was
68.22%, 69.74% and 72.28% and the NPV (Negative Predictive Value)
was 73.49%, 70.81% and 75.28% for SVM, DT and RF respectively,
as shown in TABLE 10.
TABLE 10. Adenosine A3 receptor using SMOTE on the whole data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 70.53 | 70.26 | 73.68 |
| Sensitivity | 76.84 | 71.58 | 76.84 |
| Specificity | 64.21 | 68.95 | 70.53 |
| PPV | 68.22 | 69.74 | 72.28 |
| NPV | 73.49 | 70.81 | 75.28 |
2) ADENOSINE A1 RECEPTOR USING SMOTE ON THE WHOLE DATA
After applying the SMOTE technique on the whole data, the
dataset is balanced with 278 interacting drugs and 390
non-interacting drugs with adenosine receptors. SVM, DT and RF
achieved accuracies of 61.90%, 66.48% and 66.30%, sensitivities
of 56.41%, 60.07% and 59.71%, and specificities of 67.40%, 72.89%
and 72.89%, respectively. The PPV (Positive Predictive Value) was
63.37%, 68.91% and 68.78% and the NPV (Negative Predictive Value)
was 60.73%, 64.61% and 64.40% for SVM, DT and RF respectively,
as shown in TABLE 11. Compared with the A3 receptor experiment,
the accuracy of SVM, DT and RF decreased by 8.63, 3.78 and 7.38
percentage points respectively, while specificity increased
slightly, by 3.19, 3.94 and 2.36 percentage points.
TABLE 11. Adenosine A1 receptor using SMOTE on the whole data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 61.90 | 66.48 | 66.30 |
| Sensitivity | 56.41 | 60.07 | 59.71 |
| Specificity | 67.40 | 72.89 | 72.89 |
| PPV | 63.37 | 68.91 | 68.78 |
| NPV | 60.73 | 64.61 | 64.40 |
3) ADENOSINE A2A RECEPTOR USING SMOTE ON THE WHOLE DATA
After applying the SMOTE technique on the whole data, the
dataset is balanced with 278 interacting drugs and 390
non-interacting drugs with adenosine receptors. SVM, DT and RF
achieved accuracies of 69.78%, 74.36% and 75.09%, sensitivities
of 75.82%, 77.29% and 79.49%, and specificities of 63.74%, 71.43%
and 70.70%, respectively. The PPV (Positive Predictive Value) was
67.65%, 73.01% and 73.06% and the NPV (Negative Predictive Value)
was 72.50%, 75.88% and 77.51% for SVM, DT and RF respectively,
as shown in TABLE 12.
TABLE 12. Adenosine A2a receptor using SMOTE on the whole data

| Metric (%) | SVM | DT | RF |
|---|---|---|---|
| Accuracy | 69.78 | 74.36 | 75.09 |
| Sensitivity | 75.82 | 77.29 | 79.49 |
| Specificity | 63.74 | 71.43 | 70.70 |
| PPV | 67.65 | 73.01 | 73.06 |
| NPV | 72.50 | 75.88 | 77.51 |
C. RANKING THE INTERACTING DRUGS WITH ADENOSINE TARGETS
Drugs intended to cure patients of certain diseases can instead
cause side effects that may lead to death. Therefore, we chose
5 random interacting drugs for each adenosine receptor and ranked
them from fewest to most side effects. For the adenosine A3
receptor, the interacting drugs were Adenosine, Amiodarone,
Atenolol, Baclofen and Caffeine, which have side effects such as
agitation, high blood pressure and bronchospasm. Cladribine,
Clotrimazole, Gabapentin, Lovastatin and Mefloquine interacted
with the adenosine A1 receptor, while Miconazole, Nifedipine,
Raloxifene, Sildenafil and Tamoxifen interacted with the adenosine
A2a receptor, as shown in TABLE 13, TABLE 14 and TABLE 15.
TABLE 13. Adenosine A3 ranked drugs.

| Drug | Side effects |
|---|---|
| Adenosine | 87 |
| Baclofen | 100 |
| Atenolol | 112 |
| Caffeine | 133 |
| Amiodarone | 241 |
TABLE 14. Adenosine A1 ranked drugs.

| Drug | Side effects |
|---|---|
| Lovastatin | 34 |
| Clotrimazole | 239 |
| Gabapentin | 265 |
| Mefloquine | 403 |
| Cladribine | 557 |
TABLE 15. Adenosine A2a ranked drugs.

| Drug | Side effects |
|---|---|
| Tamoxifen | 7 |
| Raloxifene | 29 |
| Miconazole | 106 |
| Nifedipine | 131 |
| Sildenafil | 172 |
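The ranking in TABLE 13 amounts to sorting the interacting drugs by their number of side effects, which can be sketched as below. The function name `rank_by_side_effects` is an assumption for the example; the counts are copied from TABLE 13, and every side effect is given equal weight.

```python
def rank_by_side_effects(side_effect_counts):
    """Return (drug, count) pairs ordered from fewest to most side effects."""
    return sorted(side_effect_counts.items(), key=lambda item: item[1])


# Side-effect counts for the A3 receptor drugs, as listed in TABLE 13.
a3_counts = {
    "Adenosine": 87,
    "Amiodarone": 241,
    "Atenolol": 112,
    "Baclofen": 100,
    "Caffeine": 133,
}
a3_ranked = rank_by_side_effects(a3_counts)
preferred_drug = a3_ranked[0][0]  # drug with the fewest side effects
```

Replacing the raw count in the sort key with a severity-weighted score would change only the `key` function.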
VI. RESULTS DISCUSSION
The experiments showed that RF and DT achieved the highest
accuracies in classifying drugs interacting with the adenosine
A2a receptor, 75.09% and 74.36% respectively. Misclassification
affected all three classifiers across the three adenosine
receptors because the instances of drugs interacting with these
receptors are far fewer than the non-interacting ones. We
therefore used the SMOTE technique in our experiments to create
synthetic data and address the imbalanced dataset problem.
RF had the highest accuracy among the three classifiers across
the three target receptors, with an average accuracy of 71.69%,
as well as the highest average sensitivity (72.01%) and the
highest average PPV (71.37%). SVM scored the lowest average
specificity (65.11%) and the lowest average NPV (68.90%), as
shown in Figure 5, Figure 6 and Figure 7.
FIGURE 5. Adenosine A3 receptor using SMOTE on the whole data
FIGURE 6. Adenosine A1 receptor using SMOTE on the whole data
FIGURE 7. Adenosine A2a receptor using SMOTE on the whole data
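The averages quoted in this discussion can be re-derived from TABLE 10, TABLE 11 and TABLE 12 with a short calculation, sketched below. The dictionary layout and names are assumptions made for the example; the values are copied from the three tables in the order A3, A1, A2a. When rounded to two decimals, the SVM averages for specificity and NPV come out as 65.12 and 68.91, so the 65.11 and 68.90 given in the text appear to be truncated rather than rounded.

```python
# Percent values from TABLE 10 (A3), TABLE 11 (A1) and TABLE 12 (A2a), in that order.
results = {
    "SVM": {
        "accuracy": [70.53, 61.90, 69.78],
        "sensitivity": [76.84, 56.41, 75.82],
        "specificity": [64.21, 67.40, 63.74],
        "ppv": [68.22, 63.37, 67.65],
        "npv": [73.49, 60.73, 72.50],
    },
    "DT": {
        "accuracy": [70.26, 66.48, 74.36],
        "sensitivity": [71.58, 60.07, 77.29],
        "specificity": [68.95, 72.89, 71.43],
        "ppv": [69.74, 68.91, 73.01],
        "npv": [70.81, 64.61, 75.88],
    },
    "RF": {
        "accuracy": [73.68, 66.30, 75.09],
        "sensitivity": [76.84, 59.71, 79.49],
        "specificity": [70.53, 72.89, 70.70],
        "ppv": [72.28, 68.78, 73.06],
        "npv": [75.28, 64.40, 77.51],
    },
}

# Mean of each metric over the three receptors, rounded to two decimals.
averages = {
    clf: {metric: round(sum(vals) / len(vals), 2) for metric, vals in metrics.items()}
    for clf, metrics in results.items()
}

# Classifier with the highest average accuracy across the three receptors.
best_by_accuracy = max(averages, key=lambda clf: averages[clf]["accuracy"])
```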
VII. CONCLUSION
Cancer is considered one of the most dangerous diseases
affecting humans. Costly laboratory experiments and research
are carried out to find a cure for cancer. Improving the drug
discovery process depends heavily on analyzing and processing
drug features to develop new drugs that will interact with
targets in the human body to cure diseases. We proposed
a machine learning model to help predict drugs that interact
with targets based on drug fingerprints. In this study,
we focused on a specific kind of target, the adenosine
receptors. We faced an imbalanced dataset, which distorted
the classification performance and accuracy. We used SMOTE
to address this problem and evaluated three different
classifiers across three different adenosine targets. RF
achieved the best classification performance, with an accuracy
of 75.09%. Finally, we ranked the predicted drugs interacting
with adenosine receptors by their drug side effects. Among the
drugs interacting with the adenosine A3 receptor, Adenosine had
the fewest side effects (87); among those interacting with the
adenosine A1 receptor, Lovastatin had the fewest (34); and among
those interacting with the adenosine A2a receptor, Tamoxifen had
the fewest (7). In the future, we will apply other classification
techniques to improve prediction accuracy and increase the number
of drug and target instances. We will also consider weighting
drug side effects based on medical expertise to determine the
severity of the side effects of a predicted drug.
REFERENCES
[1] Types of Cancer 2018, Cancer Research, London, UK, 2018.
[2] Skin Cancer 2018, American Cancer Society, USA, 2016.
[3] Cancer Statistics, National Cancer Institute, USA, 2018.
[4] S. Gessi, S. Merighi, P. A. Borea, S. Cohen, and P. Fishman, “Adenosine
Receptors and Current Opportunities to Treat Cancer,” The Adenosine
Receptors, pp. 543–555, 2018.
[5] P. Fishman, S. Bar-Yehuda, F. Barer, L. Madi, A. S. Multani, and S.
Pathak, “The A3 Adenosine Receptor as a New Target for Cancer Therapy
and Chemoprotection,” Experimental Cell Research, vol. 269, no. 2, pp.
230–236, Oct. 2001.
[6] Zhengwei Li, Pengyong Han, Zhu-Hong You, Xiao Li, Yusen Zhang,
Haiquan Yu, Ru Nie and Xing Chen, “In silico prediction of drug-target
interaction networks based on drug chemical structure and protein
sequences,” Scientific Reports, vol. 7, no. 1, Sep. 2017.
[7] Y. Yamanishi, M. Kotera, M. Kanehisa and S. Goto, “Drug-target
interaction prediction from chemical, genomic and pharmacological data in an
integrated framework,” Bioinformatics, vol. 26, no. 12, pp. i246–i254,
Jun. 2010.
[8] Xing Chen, Chenggang Yan, Xiaotian Zhang, Xu Zhang, Feng Dai, Jian
Yin and Yongdong, “Drug–target interaction prediction: databases, web
servers and computational models,” Briefings in Bioinformatics, vol. 17,
no. 4, pp. 696–712, Aug. 2015.
[9] P. Fishman, S. Bar-Yehuda, M. Synowitz, J. D. Powell, K. N. Klotz, S.
Gessi, and P. A. Borea, “Adenosine Receptors and Cancer,” Handbook of
Experimental Pharmacology, pp. 399–441, 2009.
[10] D. Allard, M. Turcotte, and J. Stagg, “Targeting A2 adenosine receptors in
cancer,” Immunology and Cell Biology, vol. 95, no. 4, pp. 333–339, Feb.
2017.
[11] Drug Statistics, Drug Bank, Canada, USA, 2018.
[12] Edgar Coelho, José Oliveira and Joel Arrais, “Ensemble-Based Method-
ology for the Prediction of Drug-Target Interactions,” IEEE 29th Interna-
tional Symposium on Computer-Based Medical Systems (CBMS), pp.36-
41, Jun. 2016.
[13] M. Campillos, M. Kuhn, A. Gavin, L. Jensen, and P. Bork, “Drug Target
Identification Using Side-Effect Similarity,” Science, vol. 321, no. 5886,
pp. 263–266, Jul. 2008.
[14] S. Gunther, M. Kuhn, M. Dunkel, M. Campillos, C. Senger, E. Petsalaki, J.
Ahmed, E. G. Urdiales, A. Gewiess, L. J. Jensen, R. Schneider, R. Skoblo,
R. B. Russell, P. E. Bourne, P. Bork and R. Preissner, “SuperTarget and
Matador: resources for exploring drug-target relationships,” Nucleic Acids
Research, vol. 36, no. Database, pp. D919–D922, Dec. 2007.
[15] D. S. Wishart, “DrugBank: a comprehensive resource for in silico drug
discovery and exploration,” Nucleic Acids Research, vol. 34, no. 90001,
pp. D668–D672, Jan. 2006.
[16] B. L. Roth, E. Lopez, S. Patel, and W. K. Kroeze, “The Multiplicity of
Serotonin Receptors: Uselessly Diverse Molecules or an Embarrassment
of Riches?,” The Neuroscientist, vol. 6, no. 4, pp. 252–262, Aug. 2000.
[17] Y. Yamanishi, M. Kotera, M. Kanehisa and S. Goto, “Drug-target
interaction prediction from chemical, genomic and pharmacological data in an
integrated framework,” Bioinformatics, vol. 26, no. 12, pp. i246–i254,
Jun. 2010.
[18] Diego Galeano and Alberto Paccanaro, “Drug targets prediction us-
ing chemical similarity,” XLII Latin American Computing Conference
(CLEI), pp. 1-7, Oct. 2016.
[19] C. Stark, “Biogrid: a general repository for interaction datasets,” Nucleic
acids research, vol. 34, no. suppl 1, pp. D535–D539, Jan. 2006.
[20] Arvind Sinha, Pradeep Singh, Anand Prakash, Dharm Pal, Anuradha
Dube, and Awanish Kumar, “Putative Drug and Vaccine Target
Identification in Leishmania donovani Membrane Proteins Using Naïve Bayes
Probabilistic Classifier,” IEEE/ACM, Jan. 2017.
[21] A. Kumar, P. Misra, B. Sisodia, A. Shasany, S. Sundar, and A. Dube,
“Proteomic analyses of membrane enriched proteins of Leishmania donovani
Indian clinical isolate by mass spectrometry,” Parasitol. Int., vol. 64, no. 4,
pp. 36–42, Aug. 2015.
[22] Ming Hao, Stephen Bryant and Yanli Wang, “Predicting drug-target
interactions by dual-network integrated logistic matrix factorization,” Scientific
Reports, vol. 7, no. 1, Jan. 2017.
[23] M. Wen, Z. Zhang, S. Niu, H. Sha, R. Yang, Y. Yun, and H. Lu,
“Deep-Learning-Based Drug–Target Interaction Prediction,” Journal of Proteome
Research, vol. 16, no. 4, pp. 1401–1409, Mar. 2017.
[24] David S. Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita
Shrivastava, Dan Tzur, Bijaya Gautam and Murtaza Hassanali, “DrugBank:
a knowledgebase for drugs, drug actions and drug targets,” Nucleic Acids
Research, vol. 36, pp. D901–D906, Nov. 2007.
[25] H. E. Manoochehri and M. Nourani, “Predicting Drug-Target Interaction
Using Deep Matrix Factorization,” 2018 IEEE Biomedical Circuits and
Systems Conference (BioCAS), Oct. 2018.
[26] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa,
“Prediction of drug-target interaction networks from the integration of
chemical and genomic spaces,” Bioinformatics, vol. 24, no. 13, pp.
i232–i240, Jun. 2008.
[27] A. Saad, F. A. Maghraby, and Y. M. Omar, “Predicting Drug Target
Interaction by Integrating Drug Fingerprint and Drug Side Effect Us-
ing Machine Learning,” Handbook of Experimental Pharmacology, pp.
281–290, Mar. 2019.
[28] O. Ursu, J. Holmes, C. G. Bologa, J. J. Yang, S. L. Mathias, V. Stathias,
D.-T. Nguyen, S. Schürer, and T. Oprea, “DrugCentral 2018: an update,”
Nucleic Acids Research, vol. 47, no. D1, pp. D963–D970, Oct. 2018.
[29] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork, “The SIDER database
of drugs and side effects,” Nucleic Acids Research, vol. 44, no. D1, pp.
D1075–D1079, Oct. 2015.
[30] Dong-Sheng Cao, Qian-Nan Hu, Qing-Song Xu, Yan-Ning Yang,
Jian-Chao Zhao, Hong-Mei Lu, Liang-Xiao Zhang and Yi-Zeng Liang, “In
silico classification of human maximum recommended daily dose based on
modified random forest and substructure fingerprint,” Analytica Chimica
Acta, vol. 692, no. 1–2, pp. 50–56, Apr. 2011.
[31] K. M. Giacomini, R. M. Krauss, D. M. Roden, M. Eichelbaum, M. R.
Hayden, and Y. Nakamura, “When good drugs go bad,” Nature, vol. 446,
no. 7139, pp. 975–977, Apr. 2007.
[32] E. Lounkine, M. J. Keiser, S. Whitebread, D. Mikhailov, J. Hamon, J. L.
Jenkins, P. Lavan, E. Weber, A. K. Doak, S. Côté, B. K. Shoichet, and L.
Urban, “Large-scale prediction and testing of drug activity on side-effect
targets,” Nature, vol. 486, no. 7403, pp. 361–367, Jun. 2012.
[33] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,”
Computers and Electrical Engineering, vol. 40, no. 1, pp. 16–28, Jan. 2014.
[34] Ashis Pradhan, “SUPPORT VECTOR MACHINE - A Survey,” International
Journal of Emerging Technology and Advanced Engineering, vol. 2,
no. 8, Aug. 2012.
[35] Giuseppe Bombara, Cristian-Ioan Vasile, Francisco Penedo, Hirotoshi
Yasuoka and Calin Belta, “A Decision Tree Approach to Data Classification
using Signal Temporal Logic,” Proceedings of the 19th International
Conference on Hybrid Systems: Computation and Control - HSCC, 2016.
[36] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp.
5–32, Oct. 2001.
[37] Sherif Fayz, Mohamed Rizka, and Fahima Maghraby, “Cervical Cancer
Diagnosis using Random Forest Classifier with SMOTE and Feature
Reduction Techniques,” IEEE Access, pp. 1–1, 2018.
[38] Yuanli Wu, Hong Wang and Fei Wu, “Automatic classification of
pulmonary tuberculosis and sarcoidosis based on random forest,” 10th
International Congress on Image and Signal Processing, BioMedical
Engineering and Informatics (CISP-BMEI), Oct. 2017.
[39] P. P. Graczyk, “Gini coefficient: a new way to express selectivity
of kinase inhibitors against a family of kinases,” Journal of Medicinal
Chemistry, vol. 50, no. 23, pp. 5773–5779, Oct. 2007.
[40] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer,
“SMOTE: synthetic minority over-sampling technique,” Journal
of Artificial Intelligence Research, vol. 16, no. 23, pp. 321–357, Jun. 2002.
[41] T. Deepa and M. Punithavalli, “An E-SMOTE technique for feature
selection in High-Dimensional Imbalanced Dataset,” 2011 3rd International
Conference on Electronics Computer Technology, Apr. 2011.
ABDELRAHMAN SAAD was born in Jeddah, Saudi Arabia, in 1992.
He received a bachelor’s degree in Information Systems from the
Arab Academy for Science, Technology & Maritime Transport
(AASTMT), Cairo, Egypt, in 2015. From 2016 to 2017, he was a
soldier in the Egyptian Army. Since 2018, he has been a Graduate
Teaching Assistant. He is currently pursuing a master’s degree in
Information Systems with the Arab Academy for Science, Technology
& Maritime Transport (AASTMT), Cairo, Egypt. His main research
interests are bioinformatics, machine learning, and big data.
YASSER M.K. OMAR received a Ph.D. degree in Biomedical
Engineering from Cairo University, Cairo, Egypt. He has been an
Assistant Professor in the Department of Computer Science,
Faculty of Computing and Information Technology, Arab Academy
for Science, Technology & Maritime Transport (AASTMT). His
research interests are bioinformatics, medical imaging, data
visualization, machine learning, and computing algorithms.
FAHIMA A. MAGHRABY received the B.S., M.S., and Ph.D. degrees
from Ain Shams University, Cairo, Egypt, in 2003, 2008, and 2014,
respectively, all in Computer Science. From 2004 to 2014, she was
a Lecturer Assistant with the Institute of Computer Science,
Shorouk Academy, Cairo, Egypt. Since 2014, she has been a Lecturer
with the Faculty of Computing and Information Technology, Arab
Academy for Science, Technology & Maritime Transport (AASTMT),
Cairo, Egypt. Her research interests include bioinformatics, image
processing, artificial intelligence, and cloud computing.
... To overcome challenges of imbalanced data (He and Garcia, 2009;Fernandez et al. 2017;Leevy et al. 2018), one popular method to rebalance training data is Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002). SMOTE has shown good results in improving the capability of machine learning models in imbalanced applications such as drug trials (Saad et al 2019) and driving risk classification , and therefore it has been implemented in this case. SMOTE searches k-nearest minority neighbours of each minority instance, selecting one of the neighbours as a reference point and generating a new value by multiplying the difference with a random value between 0 and 1 (r). ...
Thesis
Shipping is an essential component of the global economy, but every year accidents result in significant loss of life and environmental pollution. Navigating vessels might collide with one another, run aground or capsize amongst a multitude of challenges to operating at sea. As the number and sizes of vessels have increased, novel or autonomous technologies are adopted and new environments such as the Arctic are exploited, these risks are likely to increase. Coastal states, ports and developers have a responsibility to assess these risks, and where the risk is intolerably high, implement mitigation measures to reduce them. To support this, significant research has developed a field of maritime risk analysis, attempting to employ rigorous scientific study to quantifying the risk of maritime accidents. Such methods are diverse, yet have received criticism for their lack of methodological rigour, narrow scope and one-dimensional rather than spatial-temporal approach to risk. More broadly, there is a recognition that by combining different datasets together, novel techniques might lead to more robust and practicable risk analysis tools. This thesis contributes to this purpose. It argues that by integrating massive and heterogenous datasets related to vessel navigation, machine learning algorithms can be used to predict the relative likelihood of accident occurrence. Whilst such an approach has been adopted in other disciplines this remains relatively unexplored in maritime risk assessment. To achieve this, four aspects are investigated. Firstly, to enable fast and efficient integration of different spatial datasets, the Discrete Global Grid System has been trialled as the underlying spatial data structure in combination with the development of a scalable maritime data processing pipeline. 
Such an approach is shown to have numerous advantageous qualities, particular relevant to large scale spatial analysis, that addresses some of the limitations of the Modifiable Areal Unit Problem. Secondly, a national scale risk model was constructed for the United States using machine learning methods, providing high-resolution and reliable risk assessment. This supports both strategic planning of waterways and real-time monitoring of vessel transits. Thirdly, to overcome the infrequency of accidents, near-miss modelling was undertaken, however, the results were shown to only have partial utility. Finally, a comparison is made of various conventional and machine methodologies, identifying that whilst the latter are often more complex, they address some failings in conventional methods. The results demonstrate the potential of these methods as a novel form of maritime risk analysis, supporting decision makers and contributing to improving the safety of vessels and the protection of the marine environment.
... While there have been several previous attempts to use machine learning for ARs (Saad et al., 2019;Wang et al., 2021), few have performed external validation. One recent study used deep learning combined with pharmacophore and docking approaches to identify novel A 1 /A 2A antagonists (Wang et al., 2021). ...
Article
Full-text available
Adenosine (ADO) is an extracellular signaling molecule generated locally under conditions that produce ischemia, hypoxia, or inflammation. It is involved in modulating a range of physiological functions throughout the brain and periphery through the membrane-bound G protein-coupled receptors, called adenosine receptors (ARs) A1AR, A2AAR, A2BAR, and A3AR. These are therefore important targets for neurological, cardiovascular, inflammatory, and autoimmune diseases and are the subject of drug development directed toward the cyclic adenosine monophosphate and other signaling pathways. Initially using public data for A1AR agonists we generated and validated a Bayesian machine learning model (Receiver Operator Characteristic of 0.87) that we used to identify molecules for testing. Three selected molecules, crisaborole, febuxostat and paroxetine, showed initial activity in vitro using the HEK293 A1AR Nomad cell line. However, radioligand binding, β-arrestin assay and calcium influx assay did not confirm this A1AR activity. Nevertheless, several other AR activities were identified. Febuxostat and paroxetine both inhibited orthosteric radioligand binding in the µM range for A2AAR and A3AR. In HEK293 cells expressing the human A2AAR, stimulation of cAMP was observed for crisaborole (EC50 2.8 µM) and paroxetine (EC50 14 µM), but not for febuxostat. Crisaborole also increased cAMP accumulation in A2BAR-expressing HEK293 cells, but it was weaker than at the A2AAR. At the human A3AR, paroxetine did not show any agonist activity at 100 µM, although it displayed binding with a Ki value of 14.5 µM, suggesting antagonist activity. We have now identified novel modulators of A2AAR, A2BAR and A3AR subtypes that are clinically used for other therapeutic indications, and which are structurally distinct from previously reported tool compounds or drugs.
... where D new is the synthetic sample, D i are minority samples, D knn a sample of k-nearest neighbour from minority samples and rand is a random number between 0 and 1 [49]. ...
Article
Full-text available
Electronic Health Records (EHRs) hold symptoms of many diverse diseases and it is imperative to build models to recognise these problems early and classify the diseases appropriately. This classification task could be presented as a single or multi-label problem. Thus, this study presents Psychotic Disorder Diseases (PDD) dataset with five labels: bipolar disorder, vascular dementia, attention-deficit/hyperactivity disorder (ADHD), insomnia, and schizophrenia as a multi-label classification problem. The study also investigates the use of deep neural network and machine learning techniques such as multilayer perceptron (MLP), support vector machine (SVM), random forest (RF) and Decision tree (DT), for identifying hidden patterns in patients’ data. The study furthermore investigates the symptoms associated with certain types of psychotic diseases and addresses class imbalance from a multi-label classification perspective. The performances of these models were assessed and compared based on an accuracy metric. The result obtained revealed that deep neural network gave a superior performance of 75.17% with class imbalance accuracy, while the MLP model accuracy is 58.44%. Conversely, the best performance in the machine learning techniques was exhibited by the random forest model, using the dataset without class imbalance and its result, compared with deep learning techniques, is 64.1% and 55.87%, respectively. It was also observed that patient’s age is the most contributing feature to the performance of the model while divorce is the least. Likewise, the study reveals that there is a high tendency for a patient with bipolar disorder to have insomnia; these diseases are strongly correlated with an R-value of 0.98. 
Our concluding remark shows that applying the deep and machine learning model to PDD dataset not only offers improved clinical classification of the diseases but also provides a framework for augmenting clinical decision systems by eliminating the class imbalance and unravelling the attributes that influence PDD in patients..
... SVM is a binary classifier based on the idea of hyperplanes separation of objects. Hyperplanes act as a boundary to distinguish data points to be assigned to different classes [12]. ...
... If there is a class imbalance, the accuracy of the classifier decreases. To overcome this problem, we have used oversampling method called Synthetic Minority Oversampling technique (SMOTE) to balance our dataset [20]. Too much oversampling results in overfitting problem, so that we have not applied SMOTE to the test set. ...
Article
Vitamin D Deficiency (VDD) is one of the most significant global health problems, and there is a strong demand for predicting its severity using non-invasive methods. The primary data containing serum Vitamin D levels were collected from a total of 3044 college students between 18 and 21 years of age. Independent parameters such as age, sex, weight, height, body mass index (BMI), waist circumference, body fat, bone mass, exercise, sunlight exposure, and milk consumption were used for prediction of VDD. The study aims to compare and evaluate different machine learning models in predicting the severity of VDD. The objectives of our approach are to apply various powerful machine learning algorithms and evaluate the results with different performance measures such as Precision, Recall, F1-measure, Accuracy, and area under the receiver operating characteristic (ROC) curve. McNemar’s test, a statistical test, was conducted to validate the empirical results. The final objective is to identify the best machine learning classifier for predicting the severity of VDD. Popular and powerful machine learning classifiers, namely K-Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), AdaBoost (AB), Bagging Classifier (BC), ExtraTrees (ET), Stochastic Gradient Descent (SGD), Gradient Boosting (GB), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP), were implemented to predict the severity of VDD. The final experimental results showed that the Random Forest classifier achieves the best accuracy of 96% and performs well on both the training and testing Vitamin D datasets. The McNemar’s test results support the finding that the RF classifier outperforms the other classifiers.
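For readers unfamiliar with McNemar’s test as a way of comparing two classifiers on the same test set, the sketch below shows the standard chi-square form of the statistic, computed from the two discordant counts (cases where exactly one classifier is correct). It is only an illustration: the function name `mcnemar`, its arguments, and the optional continuity correction are assumptions made here, and the abstract does not state which variant the authors used.

```python
import math


def mcnemar(y_true, pred_a, pred_b, continuity=False):
    """McNemar's chi-square test for two classifiers on the same samples.

    Returns (statistic, p_value). Only the discordant pairs matter:
      b = samples classifier A gets right and classifier B gets wrong
      c = samples classifier A gets wrong and classifier B gets right
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    diff = abs(b - c) - (1.0 if continuity else 0.0)
    statistic = diff ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom:
    # P(X > s) = erfc(sqrt(s / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

A small p-value suggests that the two classifiers' error rates differ by more than chance would explain; with few discordant pairs an exact binomial version of the test is usually preferred over this chi-square approximation.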
Article
Extreme weather events can result in loss of life, environmental pollution and major damage to vessels caught in their path. Many methods to characterise this risk have been proposed, however, they typically utilise deterministic thresholds of wind and wave limits which might not accurately reflect risk. To address this limitation, we investigate the potential of machine learning algorithms to quantify the relative likelihood of an incident during the US Atlantic hurricane season. By training an algorithm on vessel traffic, weather and historical casualty data, accident candidates can be identified from historic vessel tracks. Amongst the various methods tested, Support Vector Machines showed good performance with Recall at 95% and Accuracy reaching 92%. Finally, we implement the developed model using a case study of Hurricane Matthew (October 2016). Our method contributes to enhancements in maritime safety by enabling machine intelligent risk-aware ship routing and monitoring of vessel transits by Coastguard agencies.
Article
Analysis of drug–target interactions (DTIs) is of great importance in developing new drug candidates for known protein targets or discovering new targets for old drugs. However, the experimental approaches for identifying DTIs are expensive, laborious and challenging. In this study, we report a novel computational method for predicting DTIs using the highly discriminative information of drug-target interactions and our newly developed discriminative vector machine (DVM) classifier. More specifically, each target protein sequence is transformed as the position-specific scoring matrix (PSSM), in which the evolutionary information is retained; then the local binary pattern (LBP) operator is used to calculate the LBP histogram descriptor. For a drug molecule, a novel fingerprint representation is utilized to describe its chemical structure information representing existence of certain functional groups or fragments. When applying the proposed method to the four datasets (Enzyme, GPCR, Ion Channel and Nuclear Receptor) for predicting DTIs, we obtained good average accuracies of 93.16%, 89.37%, 91.73% and 92.22%, respectively. Furthermore, we compared the performance of the proposed model with that of the state-of-the-art SVM model and other previous methods. The achieved results demonstrate that our method is effective and robust and can be taken as a useful tool for predicting DTIs.
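The feature-extraction step mentioned in this abstract (an LBP histogram computed over a PSSM) can be illustrated with the basic 8-neighbour local binary pattern. The abstract does not specify which LBP variant, neighbourhood, or normalisation the authors used, so the sketch below is an assumption-laden illustration: the function name `lbp_histogram`, the `>=` comparison, the bit order, and the normalisation to relative frequencies are all choices made here.

```python
def lbp_histogram(matrix):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes.

    `matrix` is a list of equal-length rows of numbers, e.g. a PSSM with one
    row per residue and 20 columns. Border cells are skipped because they
    lack a full 3x3 neighbourhood.
    """
    rows, cols = len(matrix), len(matrix[0])
    # Neighbours visited clockwise, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = matrix[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as large as the center.
                if matrix[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256
```

Because the histogram has a fixed length of 256 regardless of the protein's length, it can be concatenated with a fixed-length drug fingerprint to form one feature vector per drug-target pair.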
Article
DrugCentral is a drug information resource (http://drugcentral.org) open to the public since 2016 and previously described in the 2017 Nucleic Acids Research Database issue. Since the 2016 release, 103 new approved drugs were updated. The following new data sources have been included: Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS), FDA Orange Book information, L1000 gene perturbation profile distance/similarity matrices and estimated protonation constants. New and existing entries have been updated with the latest information from scientific literature, drug labels and external databases. The web interface has been updated to display and query new data. The full database dump and data files are available for download from the DrugCentral website.
Article
Cervical cancer is the fourth most common malignant disease in women worldwide. In most cases cervical cancer symptoms are not noticeable in the early stages. Many factors increase the risk of developing cervical cancer, such as Human Papilloma Virus (HPV), Sexually Transmitted Diseases (STD) and smoking. Identifying those factors and building a classification model to determine whether cases are cervical cancer or not is a challenging research problem. This study aims to use cervical cancer risk factors to build a classification model using the Random Forest (RF) classification technique with the Synthetic Minority Oversampling Technique (SMOTE) and two feature reduction techniques, Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA). Most medical datasets are imbalanced because the number of patients is much smaller than the number of non-patients. Because the dataset used is imbalanced, SMOTE is applied to address this problem. The dataset consists of 32 risk factors and 4 target variables: Hinselmann, Schiller, Cytology and Biopsy. After comparing the results, we find that combining the random forest classification technique with SMOTE improves the classification performance.
Chapter
Adenosine is an endogenous modulator exerting its physiological effects by activating four adenosine receptors: A1, A2A, A2B, and A3. This nucleoside increases in the hypoxia that characterizes solid tumors, thus affecting vasculature, immunoescaping, and cancer growth. This chapter offers an updated overview of the current opportunities to treat tumors coming from the adenosinergic field. Several years of research have led to the conclusion that the A2A and A3 subtypes are the most promising for drug development. As for A3 receptors, following the efficacy of their agonists in numerous animal models of cancer, the lead compound, Namodenoson, has entered clinical trials for hepatocellular carcinoma. Phase I results proved its optimal safety profile and efficacy, so that phase II studies are in progress. Specifically, the A2A receptor is responsible for immunosuppressive effects, reducing antitumor immunity and promoting immunoescaping of cancer. Therefore, A2A receptor antagonists have been proposed to fight cancer by enhancing immunotherapy, supported also by their safety already demonstrated in clinical trials for Parkinson’s disease. Overall, from these positive results, it may be expected that A3 agonists and A2A antagonists may become future anticancer drugs with the ability to save and improve human health also for diseases with very limited treatment options.
Article
Identifying interactions between known drugs and targets is a major challenge in drug repositioning. In silico prediction of drug-target interactions (DTIs) can speed up the experimental work, which is expensive and time-consuming, by providing the most potent DTIs. In silico prediction of DTIs can also provide insights about potential drug-drug interactions and promote the exploration of drug side effects. Traditionally, the performance of DTI prediction depends heavily on the descriptors used to represent the drugs and the target proteins. In this paper, to accurately predict new DTIs between approved drugs and targets without separating the targets into different classes, we developed a deep learning-based algorithmic framework named DeepDTIs. It first abstracts representations from raw input descriptors using unsupervised pre-training, then applies known label pairs of interaction to build a classification model. Compared with other methods, DeepDTIs matches or outperforms other state-of-the-art methods. DeepDTIs can further be used to predict whether a new drug targets some existing targets or whether a new target interacts with some existing drugs.
Article
Tumor cells use various ways to evade antitumor immune responses. Adenosine, a potent immunosuppressive metabolite, is often found elevated in the extracellular tumor microenvironment. Therefore, targeting adenosine-generating enzymes (CD39 and CD73) or adenosine receptors has emerged as a novel means to stimulate antitumor immunity. In particular, the A2 (A2a and A2b) adenosine receptors exhibit similar immunosuppressive and pro-angiogenic functions, yet have distinct biological roles in cancer. In this review, we describe the common and distinct biological consequences of A2a and A2b adenosine receptor signaling in cancer. We discuss recent pre-clinical studies and summarize the different mechanisms-of-action of adenosine-targeting drugs. We also review the rationale for combining inhibitors of the adenosine pathway with other anticancer therapies such as immune checkpoint inhibitors, tumor vaccines, chemotherapy and adoptive T cell therapy.
Article
Predicting the role of a protein is one of the most challenging problems. Few approaches are available for predicting the role of an unknown protein as a drug target or vaccine candidate. We propose here a Naïve Bayes probabilistic classifier, a promising method for reliable predictions. This method is tested on the proteins identified in our mass spectrometry based membrane proteomics study of the Leishmania donovani parasite, which causes a fatal disease (Visceral Leishmaniasis) in humans all around the world. Most of the vaccine/drug targets belonging to membrane proteins are represented as key players in the pathogenesis of Leishmania infection. Analyses of our previous results, using the Naïve Bayes probabilistic classifier, indicate that this method predicts the role of unknown/hypothetical proteins (as drug target/vaccine candidate) with significantly higher precision. We have employed this method in order to provide probabilistic predictions of unknown/hypothetical proteins as targets. This study reports the unknown/hypothetical proteins of the Leishmania membrane fraction as potential drug targets and vaccine candidates, which is vital information for this parasite. Future molecular studies and characterization of these potent targets may produce a recombinant therapeutic/prophylactic tool against Visceral Leishmaniasis. These unknown/hypothetical proteins may open a vast research field to be exploited for novel treatment strategies.